Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-05-13 Thread Alex Lyakas
Hi Liu,

On Wed, Mar 22, 2017 at 1:40 AM, Liu Bo <bo.li@oracle.com> wrote:
> On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
>> We have a commit_root_sem, which is a read-write semaphore that protects the
>> commit roots.
>> But it is also used to protect the list of caching block groups.
>>
>> As a result, while doing "slow" caching, the following issue is seen:
>>
>> Some of the caching threads are scanning the extent tree with
>> commit_root_sem acquired in shared mode, with stack like:
>> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
[] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
>> [] read_tree_block+0x40/0x70 [btrfs]
>> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
>> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
>> [] caching_thread+0x1b9/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>>
>> IO requests that want to allocate space are waiting in cache_block_group()
>> to acquire the commit_root_sem in exclusive mode. But they only want to add
>> the caching control structure to the list of caching block-groups:
>> [] schedule+0x29/0x70
>> [] rwsem_down_write_failed+0x145/0x320
>> [] call_rwsem_down_write_failed+0x13/0x20
>> [] cache_block_group+0x25b/0x450 [btrfs]
>> [] find_free_extent+0xd16/0xdb0 [btrfs]
>> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
>>
>> Other caching threads want to continue their scanning, and for that they
>> are waiting to acquire commit_root_sem in shared mode. But since there are
>> IO threads that want the exclusive lock, the caching threads are unable
>> to continue the scanning, because (I presume) rw_semaphore guarantees some
>> fairness:
>> [] schedule+0x29/0x70
>> [] rwsem_down_read_failed+0xc5/0x120
>> [] call_rwsem_down_read_failed+0x14/0x30
>> [] caching_thread+0x1a1/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>> [] process_one_work+0x146/0x410
>>
>> This causes slowness of the IO, especially when there are many block groups
>> that need to be scanned for free space. In some cases it takes minutes
>> until a single IO thread is able to allocate free space.
>>
>> I don't see a deadlock here, because the caching threads that were able to
>> acquire the commit_root_sem will call rwsem_is_contended() and should give
>> up the semaphore, so that IO threads are able to acquire it in exclusive mode.
>>
>> However, introducing a separate mutex that protects only the list of caching
>> block groups makes things move forward much faster.
>>
>
> The problem did exist and the patch looks good to me.
>
>> This patch is based on kernel 3.18.
>> Unfortunately, I am not able to submit a patch based on one of the latest
>> kernels, because here btrfs is part of a larger system, and upgrading the
>> kernel is a significant effort.
>> Hence marking the patch as RFC.
>> Hopefully, this patch still has some value to the community.
>>
>> Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 42d11e7..74feacb 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>> struct list_head trans_list;
>> struct list_head dead_roots;
>> struct list_head caching_block_groups;
>> +/* protects the above list */
>> +struct mutex caching_block_groups_mutex;
>>
>> spinlock_t delayed_iput_lock;
>> struct list_head delayed_iputs;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 5177954..130ec58 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
>> INIT_LIST_HEAD(&fs_info->delayed_iputs);
>> INIT_LIST_HEAD(&fs_info->delalloc_roots);
>> INIT_LIST_HEAD(&fs_info->caching_block_groups);
>> +mutex_init(&fs_info->caching_block_groups_mutex);
>> spin_lock_init(&fs_info->delalloc_root_lock);
>> spin_lock_init(&fs_info->trans_lock);
>> spin_lock_init(&fs_info->fs_roots_radix_lock);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a067065..906fb08 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>> return 0;
>> }
>

Re: include linux kernel headers for btrfs filesystem

2017-03-20 Thread Alex Lyakas
Ilan,

On Mon, Mar 20, 2017 at 10:33 AM, Ilan Schwarts  wrote:
> I need to cast struct inode to struct btrfs_inode.
> In order to do it, I looked at the implementation of btrfs_getattr.
>
> the code is simple:
> struct btrfs_inode *btrfsInode;
> btrfsInode = BTRFS_I(inode);
>
> in order to compile i must add the headers on top of the function:
> #include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/ctree.h"
> #include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/btrfs_inode.h"
>
> What is the problem?
> I must manually download and include ctree.h and btrfs_inode.h; they
> are not provided in the kernel-headers package.
> On every platform I compile my driver, I have specific VM for the
> distro/kernel version, so on every VM I usually download package
> kernel-headers and everything compiles perfectly.
>
> btrfs was introduced around kernel 3.0.
> Shouldn't the btrfs headers be there? Do they exist in another
> package, maybe fs-headers or something like that?
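
(For context: BTRFS_I() is just a container_of() helper; a minimal sketch of
what fs/btrfs/btrfs_inode.h declares in that era's sources:)

static inline struct btrfs_inode *BTRFS_I(struct inode *inode)
{
	/* struct btrfs_inode embeds the VFS inode as its vfs_inode member */
	return container_of(inode, struct btrfs_inode, vfs_inode);
}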

Try using the simple Makefile[1] below to compile btrfs as a loadable
module. You need to have the kernel-headers package installed.
You can place the Makefile anywhere you want, and compile via:
# make -f <path-to-the-makefile>

Thanks,
Alex.


[1]
obj-m += btrfs.o

# or substitute with hard-coded kernel version
KVERSION = $(shell uname -r)

# path to the btrfs sources inside your kernel source tree
SRC_DIR=<path-to-kernel-source>/fs/btrfs
BTRFS_KO=btrfs.ko

# or specify any other output directory
OUT_DIR=/lib/modules/$(KVERSION)/kernel/fs/btrfs

all: $(OUT_DIR)/$(BTRFS_KO)

$(OUT_DIR)/$(BTRFS_KO): $(SRC_DIR)/$(BTRFS_KO)
	cp $(SRC_DIR)/$(BTRFS_KO) $(OUT_DIR)/

$(SRC_DIR)/$(BTRFS_KO): $(SRC_DIR)/*.c $(SRC_DIR)/*.h
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(SRC_DIR) modules

clean:
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(SRC_DIR) clean
	test -f $(OUT_DIR)/$(BTRFS_KO) && rm $(OUT_DIR)/$(BTRFS_KO) || true




[PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list

2017-03-19 Thread Alex Lyakas
We have a commit_root_sem, which is a read-write semaphore that protects the 
commit roots.

But it is also used to protect the list of caching block groups.

As a result, while doing "slow" caching, the following issue is seen:

Some of the caching threads are scanning the extent tree with
commit_root_sem acquired in shared mode, with stack like:
[] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
[] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
[] read_tree_block+0x40/0x70 [btrfs]
[] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
[] btrfs_search_slot+0x3c6/0xb10 [btrfs]
[] caching_thread+0x1b9/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]

IO requests that want to allocate space are waiting in cache_block_group()
to acquire the commit_root_sem in exclusive mode. But they only want to add
the caching control structure to the list of caching block-groups:
[] schedule+0x29/0x70
[] rwsem_down_write_failed+0x145/0x320
[] call_rwsem_down_write_failed+0x13/0x20
[] cache_block_group+0x25b/0x450 [btrfs]
[] find_free_extent+0xd16/0xdb0 [btrfs]
[] btrfs_reserve_extent+0xaf/0x160 [btrfs]

Other caching threads want to continue their scanning, and for that they
are waiting to acquire commit_root_sem in shared mode. But since there are
IO threads that want the exclusive lock, the caching threads are unable
to continue the scanning, because (I presume) rw_semaphore guarantees some 
fairness:

[] schedule+0x29/0x70
[] rwsem_down_read_failed+0xc5/0x120
[] call_rwsem_down_read_failed+0x14/0x30
[] caching_thread+0x1a1/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]
[] process_one_work+0x146/0x410

This causes slowness of the IO, especially when there are many block groups
that need to be scanned for free space. In some cases it takes minutes
until a single IO thread is able to allocate free space.

I don't see a deadlock here, because the caching threads that were able to
acquire the commit_root_sem will call rwsem_is_contended() and should give
up the semaphore, so that IO threads are able to acquire it in exclusive
mode (see the sketch below).
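
For reference, the back-off that relies on rwsem_is_contended() looks
roughly like this in the 3.18-era caching_thread() (a simplified sketch,
not the verbatim code):

	if (need_resched() ||
	    rwsem_is_contended(&fs_info->commit_root_sem)) {
		caching_ctl->progress = last;
		btrfs_release_path(path);
		up_read(&fs_info->commit_root_sem);	/* let writers in */
		mutex_unlock(&caching_ctl->mutex);
		cond_resched();
		goto again;	/* re-acquire the locks and continue the scan */
	}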

However, introducing a separate mutex that protects only the list of caching
block groups makes things move forward much faster.

This patch is based on kernel 3.18.
Unfortunately, I am not able to submit a patch based on one of the latest
kernels, because here btrfs is part of a larger system, and upgrading the
kernel is a significant effort.

Hence marking the patch as RFC.
Hopefully, this patch still has some value to the community.

Signed-off-by: Alex Lyakas <a...@zadarastorage.com>

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 42d11e7..74feacb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
struct list_head trans_list;
struct list_head dead_roots;
struct list_head caching_block_groups;
+/* protects the above list */
+struct mutex caching_block_groups_mutex;

spinlock_t delayed_iput_lock;
struct list_head delayed_iputs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5177954..130ec58 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
INIT_LIST_HEAD(&fs_info->delayed_iputs);
INIT_LIST_HEAD(&fs_info->delalloc_roots);
INIT_LIST_HEAD(&fs_info->caching_block_groups);
+mutex_init(&fs_info->caching_block_groups_mutex);
spin_lock_init(&fs_info->delalloc_root_lock);
spin_lock_init(&fs_info->trans_lock);
spin_lock_init(&fs_info->fs_roots_radix_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a067065..906fb08 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,

return 0;
}

-down_write(&fs_info->commit_root_sem);
+mutex_lock(&fs_info->caching_block_groups_mutex);
atomic_inc(&caching_ctl->count);
list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
-up_write(&fs_info->commit_root_sem);
+mutex_unlock(&fs_info->caching_block_groups_mutex);

btrfs_get_block_group(cache);

@@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,

down_write(&fs_info->commit_root_sem);

+mutex_lock(&fs_info->caching_block_groups_mutex);
list_for_each_entry_safe(caching_ctl, next,
 &fs_info->caching_block_groups, list) {
cache = caching_ctl->block_group;
@@ -5704,6 +5705,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,

cache->last_byte_to_unpin = caching_ctl->progress;
}
}
+mutex_unlock(&fs_info->caching_block_groups_mutex);

if (fs_info->pinned_extents == &fs_info->freed_extents[0])
fs_info->pinned_extents = &fs_info->freed_extents[1];
@@ -8849,14 +8851,14 @@ int btrfs_free_block_gro

Re: [PATCH] Btrfs: deal with unexpected return value in flush_space

2016-10-01 Thread Alex Lyakas
David, Holger,

Thank you for picking up that old patch of mine.

Alex.


On Fri, Jul 29, 2016 at 8:53 PM, Liu Bo  wrote:
> On Fri, Jul 29, 2016 at 07:01:50PM +0200, David Sterba wrote:
>> On Thu, Jul 28, 2016 at 11:49:14AM -0700, Liu Bo wrote:
>> > > For reviewers - this came up before here:
>> > > https://patchwork.kernel.org/patch/7778651/
>
> David, this patch made a mistake in the commit log.
>
>> > >
>> > > Same fix basically.
>> >
>> > Aha, I've given it my Reviewed-by.
>> >
>> > Taking either one works for me, I can make the clarifying comment into a
>> > separate patch if we need to.
>>
>> I'll pick the first patch and please send the separate comment update.
>> Thanks.
>
> Sure.
>
> Thanks,
>
> -liubo


[RFC - PATCH] btrfs: do not ignore errors from primary superblock

2016-05-17 Thread Alex Lyakas

RFC: This patch is not for merging, but only for review and discussion.

When mounting, we consider only the primary superblock on each device.
But when writing the superblocks, we might silently ignore errors
from the primary superblock, if we succeeded to write secondary
superblocks. In such case, the primary superblock was not updated
properly, and if we crash at this point, later mount will use
an out-of-date superblock.

This patch changes the behavior to NOT IGNORE any errors on the primary
superblock, and to IGNORE any errors on secondary superblocks. This way,
we always insist on having an up-to-date primary superblock.

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4e47849..0ae9f7c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3357,11 +3357,13 @@ static int write_dev_supers(struct btrfs_device *device,

bh = __find_get_block(device->bdev, bytenr / 4096,
  BTRFS_SUPER_INFO_SIZE);
if (!bh) {
-errors++;
+/* we only care about primary superblock errors */
+if (i == 0)
+errors++;
continue;
}
wait_on_buffer(bh);
-if (!buffer_uptodate(bh))
+if (!buffer_uptodate(bh) && i == 0)
errors++;

/* drop our reference */
@@ -3388,9 +3390,10 @@ static int write_dev_supers(struct btrfs_device *device,

  BTRFS_SUPER_INFO_SIZE);
if (!bh) {
btrfs_err(device->dev_root->fs_info,
-"couldn't get super buffer head for bytenr %llu",
-bytenr);
-errors++;
+"couldn't get super buffer head for bytenr %llu (sb 
copy %d)",

+bytenr, i);
+if (i == 0)
+errors++;
continue;
}

@@ -3413,10 +3416,10 @@ static int write_dev_supers(struct btrfs_device *device,

ret = btrfsic_submit_bh(WRITE_FUA, bh);
else
ret = btrfsic_submit_bh(WRITE_SYNC, bh);
-if (ret)
+if (ret && i == 0)
errors++;
}
-return errors < i ? 0 : -1;
+return errors ? -1 : 0;
}

/*


P.S.: when reviewing the code of write_dev_supers(), I also noticed that
when wait==0 and we hit an error in one __getblk(), then the caller
(write_all_supers) will not properly wait for submitted buffer-heads to
complete, and we won't do the additional "brelse(bh);" which the wait==0
case does. Is this a problem?





Re: [PATCH v4 6/9] Btrfs: implement the free space B-tree

2016-04-22 Thread Alex Lyakas
Hi Omar, Chris,

I have reviewed the free-space-tree code. It is a very nice feature.

However, I have a basic understanding question.

Let's say we are running a delayed ref which inserts a new EXTENT_ITEM
into the extent tree, e.g., we are in alloc_reserved_file_extent. At
this point we call remove_from_free_space_tree(), which updates the
free-space-tree about the allocated space. But this requires to COW
the free-space-tree itself. So we allocate a new tree block for the
free-space tree, and insert a new delayed ref, which will update the
extent tree about the new tree block allocation. We also insert a
delayed ref to free the previous copy of the free-space-tree block.

At some point we run these new delayed refs, so we insert/remove
EXTENT_ITEMs from the extent tree, and this in turn requires us to
update the free-space-tree again. So we need again to COW
free-space-tree blocks, generating more delayed refs.

At which point does this recursion stop?

Do we assume that at some point all needed free-space tree blocks have
been COW'ed already, and we do not COW a tree block more than once per
transaction (unless it was written to disk due to memory pressure)?
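
(For reference, the "COW at most once per transaction" logic lives in
should_cow_block(); roughly, from the 4.x-era fs/btrfs/ctree.c, simplified:)

static noinline int should_cow_block(struct btrfs_trans_handle *trans,
				     struct btrfs_root *root,
				     struct extent_buffer *buf)
{
	/*
	 * A block already COWed in this transaction (same generation) and
	 * not yet written back can be modified in place, which is what
	 * bounds the recursion described above.
	 */
	if (btrfs_header_generation(buf) == trans->transid &&
	    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
	    !(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID &&
	      btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
		return 0;
	return 1;
}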

Thanks!
Alex.


On Tue, Dec 29, 2015 at 11:19 PM, Chris Mason  wrote:
> On Tue, Sep 29, 2015 at 08:50:35PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval 
>>
>> The free space cache has turned out to be a scalability bottleneck on
>> large, busy filesystems. When the cache for a lot of block groups needs
>> to be written out, we can get extremely long commit times; if this
>> happens in the critical section, things are especially bad because we
>> block new transactions from happening.
>>
>> The main problem with the free space cache is that it has to be written
>> out in its entirety and is managed in an ad hoc fashion. Using a B-tree
>> to store free space fixes this: updates can be done as needed and we get
>> all of the benefits of using a B-tree: checksumming, RAID handling,
>> well-understood behavior.
>>
>> With the free space tree, we get commit times that are about the same as
>> the no cache case with load times slower than the free space cache case
>> but still much faster than the no cache case. Free space is represented
>> with extents until it becomes more space-efficient to use bitmaps,
>> giving us similar space overhead to the free space cache.
>>
>> The operations on the free space tree are: adding and removing free
>> space, handling the creation and deletion of block groups, and loading
>> the free space for a block group. We can also create the free space tree
>> by walking the extent tree and clear the free space tree.
>>
>> Signed-off-by: Omar Sandoval 
>
>> +int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
>> +{
>> + struct btrfs_trans_handle *trans;
>> + struct btrfs_root *tree_root = fs_info->tree_root;
>> + struct btrfs_root *free_space_root;
>> + struct btrfs_block_group_cache *block_group;
>> + struct rb_node *node;
>> + int ret;
>> +
>> + trans = btrfs_start_transaction(tree_root, 0);
>> + if (IS_ERR(trans))
>> + return PTR_ERR(trans);
>> +
>> + free_space_root = btrfs_create_tree(trans, fs_info,
>> + BTRFS_FREE_SPACE_TREE_OBJECTID);
>> + if (IS_ERR(free_space_root)) {
>> + ret = PTR_ERR(free_space_root);
>> + goto abort;
>> + }
>> + fs_info->free_space_root = free_space_root;
>> +
>> + node = rb_first(&fs_info->block_group_cache_tree);
>> + while (node) {
>> + block_group = rb_entry(node, struct btrfs_block_group_cache,
>> +cache_node);
>> + ret = populate_free_space_tree(trans, fs_info, block_group);
>> + if (ret)
>> + goto abort;
>> + node = rb_next(node);
>> + }
>> +
>> + btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
>> +
>> + ret = btrfs_commit_transaction(trans, tree_root);
>> + if (ret)
>> + return ret;
>> +
>> + return 0;
>> +
>> +abort:
>> + btrfs_abort_transaction(trans, tree_root, ret);
>> + btrfs_end_transaction(trans, tree_root);
>> + return ret;
>> +}
>> +
>
> Hi Omar,
>
> The only problem I've hit testing this stuff is where we create the tree
> on existing filesystems.  There are a few different problems here:
>
> 1) The populate code happens after resuming balance operations.  The
> balancing code could be changing these block groups while we scan them.
> I fixed this by moving the scan up earlier.
>
> 2) Delayed references may be run, which will also change the extent tree
> as we're scanning it.
>
> 3) We might need to commit the transaction to reclaim space.
>
> For now I'm ignoring #3 and adding a flag in fs_info that will make us
> skip delayed references.  This really isn't a good long term solution,
> we need to be able to do this on a per block group basis 

Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-04-03 Thread Alex Lyakas
Hello Qu, Wang,

On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>
>> Greetings Qu Wenruo,
>>
>> I have reviewed the dedup patchset found in the github account you
>> mentioned. I have several questions. Please note that by all means I
>> am not criticizing your design or code. I just want to make sure that
>> my understanding of the code is proper.
>
>
> It's OK to criticize the design or code, and that's how review works.
>
>>
>> 1) You mentioned in several emails that at some point byte-to-byte
>> comparison is to be performed. However, I do not see this in the code.
>> It seems that generic_search() only looks for the hash value match. If
>> there is a match, it goes ahead and adds a delayed ref.
>
>
> I mentioned byte-to-byte comparison as, "not to be implemented in any time
> soon".
>
> Considering the lack of facility to read out extent contents without any
> inode structure, it's not going to be done in any time soon.
>
>>
>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>> mutex and proceed with the normal COW. What happens if there are
>> several IO streams to different files writing an identical block, but
>> we don't have such block in our dedup DB? Then all
>> btrfs_dedupe_search() calls will not find a match, so all streams will
>> allocate space for their block (which are all identical). At some
>> point, they will call insert_reserved_file_extent() and will call
>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>> will insert the dedup hash entries into the DB, and all other streams
>> will find that such hash entry already exists. So the end result is
>> that we have the hash entry in the DB, but still we have multiple
>> copies of the same block allocated, due to timing issues. Is this
>> correct?
>
>
> That's right, and that's also unavoidable for the hash initializing stage.
>
>>
>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>> generic_search() wants to add a delayed ref to an existing extent,
>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>> DB. The race is resolved as follows:
>> - generic_search attempts to lock the delayed ref head
>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>> right now. So we can add a delayed ref. Later, when delayed ref head
>> will be run, it will figure out what needs to be done (free the extent
>> or not)
>> - if we fail to lock, then there is a delayed ref processing for this
>> bytenr. We drop all locks and redo the search from the top. If
>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>> not find it, and proceed with normal COW.
>> Is my understanding correct?
>
>
> Yes that's correct.

Reviewing the code again, it seems that I still lack understanding.
What is special about the dedup code adding a delayed data ref versus
other places doing that? In other places, we do not insist on locking
the delayed ref head, but in dedup we do. For example,
__btrfs_drop_extents calls btrfs_inc_extent_ref, without locking the
ref head. I know that one of your purposes was to draw attention to
delayed ref processing, so you have succeeded.

Thanks,
Alex.




>
>>
>> I also have a few nitpicks on the code; I will reply to the relevant patches.
>
>
> Feel free to comment.
>
> Thanks,
> Qu
>
>>
>> Thanks for doing this work,
>> Alex.
>>
>>
>>
>> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>>
>>> This updated version of inband de-duplication has the following features:
>>> 1) ONE unified dedup framework.
>>> Most of its code is hidden quietly in dedup.c and export the minimal
>>> interfaces for its caller.
>>> Reviewer and further developer would benefit from the unified
>>> framework.
>>>
>>> 2) TWO different back-end with different trade-off
>>> One is the improved version of previous Fujitsu in-memory only dedup.
>>> The other one is enhanced dedup implementation from Liu Bo.
>>> Changed its tree structure to handle bytenr -> hash search for
>>> deleting hash, without the hideous data backref hack.
>>>
>>> 3) Support compression with dedupe
>>> Now dedupe can work with compression.
>

Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-03-30 Thread Alex Lyakas
Thanks for your comments, Qu.

Alex.


On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>
>> Greetings Qu Wenruo,
>>
>> I have reviewed the dedup patchset found in the github account you
>> mentioned. I have several questions. Please note that by all means I
>> am not criticizing your design or code. I just want to make sure that
>> my understanding of the code is proper.
>
>
> It's OK to criticize the design or code, and that's how review works.
>
>>
>> 1) You mentioned in several emails that at some point byte-to-byte
>> comparison is to be performed. However, I do not see this in the code.
>> It seems that generic_search() only looks for the hash value match. If
>> there is a match, it goes ahead and adds a delayed ref.
>
>
> I mentioned byte-to-byte comparison as, "not to be implemented in any time
> soon".
>
> Considering the lack of facility to read out extent contents without any
> inode structure, it's not going to be done in any time soon.
>
>>
>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>> mutex and proceed with the normal COW. What happens if there are
>> several IO streams to different files writing an identical block, but
>> we don't have such block in our dedup DB? Then all
>> btrfs_dedupe_search() calls will not find a match, so all streams will
>> allocate space for their block (which are all identical). At some
>> point, they will call insert_reserved_file_extent() and will call
>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>> will insert the dedup hash entries into the DB, and all other streams
>> will find that such hash entry already exists. So the end result is
>> that we have the hash entry in the DB, but still we have multiple
>> copies of the same block allocated, due to timing issues. Is this
>> correct?
>
>
> That's right, and that's also unavoidable for the hash initializing stage.
>
>>
>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>> generic_search() wants to add a delayed ref to an existing extent,
>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>> DB. The race is resolved as follows:
>> - generic_search attempts to lock the delayed ref head
>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>> right now. So we can add a delayed ref. Later, when delayed ref head
>> will be run, it will figure out what needs to be done (free the extent
>> or not)
>> - if we fail to lock, then there is a delayed ref processing for this
>> bytenr. We drop all locks and redo the search from the top. If
>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>> not find it, and proceed with normal COW.
>> Is my understanding correct?
>
>
> Yes that's correct.
>
>>
>> I also have a few nitpicks on the code; I will reply to the relevant patches.
>
>
> Feel free to comment.
>
> Thanks,
> Qu
>
>>
>> Thanks for doing this work,
>> Alex.
>>
>>
>>
>> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>>
>>> This updated version of inband de-duplication has the following features:
>>> 1) ONE unified dedup framework.
>>> Most of its code is hidden quietly in dedup.c and export the minimal
>>> interfaces for its caller.
>>> Reviewer and further developer would benefit from the unified
>>> framework.
>>>
>>> 2) TWO different back-end with different trade-off
>>> One is the improved version of previous Fujitsu in-memory only dedup.
>>> The other one is enhanced dedup implementation from Liu Bo.
>>> Changed its tree structure to handle bytenr -> hash search for
>>> deleting hash, without the hideous data backref hack.
>>>
>>> 3) Support compression with dedupe
>>> Now dedupe can work with compression.
>>> Means that, a dedupe miss case can be compressed, and dedupe hit case
>>> can also reuse compressed file extents.
>>>
>>> 4) Ioctl interface with persist dedup status
>>> Advised by David, now we use ioctl to enable/disable dedup.
>>>
>>> And we now have dedup status, recorded in the first item of dedup
>>> tree.
>>> Just like quota, once 

Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework

2016-03-29 Thread Alex Lyakas
Greetings Qu Wenruo,

I have reviewed the dedup patchset found in the github account you
mentioned. I have several questions. Please note that by all means I
am not criticizing your design or code. I just want to make sure that
my understanding of the code is proper.

1) You mentioned in several emails that at some point byte-to-byte
comparison is to be performed. However, I do not see this in the code.
It seems that generic_search() only looks for the hash value match. If
there is a match, it goes ahead and adds a delayed ref.

2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
mutex and proceed with the normal COW. What happens if there are
several IO streams to different files writing an identical block, but
we don't have such block in our dedup DB? Then all
btrfs_dedupe_search() calls will not find a match, so all streams will
allocate space for their block (which are all identical). At some
point, they will call insert_reserved_file_extent() and will call
btrfs_dedupe_add(). Since there is a global mutex, the first stream
will insert the dedup hash entries into the DB, and all other streams
will find that such hash entry already exists. So the end result is
that we have the hash entry in the DB, but still we have multiple
copies of the same block allocated, due to timing issues. Is this
correct?

3) generic_search() competes with __btrfs_free_extent(). Meaning that
generic_search() wants to add a delayed ref to an existing extent,
whereas __btrfs_free_extent() wants to delete an entry from the dedup
DB. The race is resolved as follows:
- generic_search attempts to lock the delayed ref head
- if it succeeds to lock, then __btrfs_free_extent() is not running
right now. So we can add a delayed ref. Later, when delayed ref head
will be run, it will figure out what needs to be done (free the extent
or not)
- if we fail to lock, then there is a delayed ref processing for this
bytenr. We drop all locks and redo the search from the top. If
__btrfs_free_extent() has deleted the dedup hash meanwhile, we will
not find it, and proceed with normal COW.
Is my understanding correct?
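
In C-style pseudocode, the retry pattern described in (3) would be something
like the following (hypothetical names throughout; the patchset's actual
code may differ):

again:
	hash = dedupe_hash_lookup(dedupe_info, block_hash);	/* hypothetical */
	if (!hash)
		return 0;	/* miss: fall back to normal COW */
	head = btrfs_find_delayed_ref_head(trans, hash->bytenr);
	if (head && !mutex_trylock(&head->mutex)) {
		/*
		 * __btrfs_free_extent() may be processing this bytenr:
		 * drop all locks and restart the search from the top.
		 */
		release_all_locks();	/* hypothetical */
		cond_resched();
		goto again;
	}
	/* safe now: add a delayed data ref pointing at hash->bytenr */
	ret = btrfs_inc_extent_ref(...);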

I also have a few nitpicks on the code; I will reply to the relevant patches.

Thanks for doing this work,
Alex.



On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo  wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160322
>
> This updated version of inband de-duplication has the following features:
> 1) ONE unified dedup framework.
>Most of its code is hidden quietly in dedup.c and export the minimal
>interfaces for its caller.
>Reviewer and further developer would benefit from the unified
>framework.
>
> 2) TWO different back-end with different trade-off
>One is the improved version of previous Fujitsu in-memory only dedup.
>The other one is enhanced dedup implementation from Liu Bo.
>Changed its tree structure to handle bytenr -> hash search for
>deleting hash, without the hideous data backref hack.
>
> 3) Support compression with dedupe
>Now dedupe can work with compression.
>Means that, a dedupe miss case can be compressed, and dedupe hit case
>can also reuse compressed file extents.
>
> 4) Ioctl interface with persist dedup status
>Advised by David, now we use ioctl to enable/disable dedup.
>
>And we now have dedup status, recorded in the first item of dedup
>tree.
>Just like quota, once enabled, no extra ioctl is needed for next
>mount.
>
> 5) Ability to disable dedup for given dirs/files
>It works just like the compression prop method, by adding a new
>xattr.
>
> TODO:
> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>CPU may even be a bottleneck other than IO.
>But for faster hash, it will definitely cause conflicts, so we need
>extent comparison before we introduce new dedup algorithm.
>
> 2) Misc end-user related helpers
>Like handy and easy to implement dedup rate report.
>And method to query in-memory hash size for those "non-exist" users who
>want to use 'dedup enable -l' option but didn't ever know how much
>RAM they have.
>
> Changelog:
> v2:
>   Totally reworked to handle multiple backends
> v3:
>   Fix a stupid but deadly on-disk backend bug
>   Add handle for multiple hash on same bytenr corner case to fix abort
>   trans error
>   Increase dedup rate by enhancing delayed ref handler for both backend.
>   Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>   Increase dedup block size up limit to 8M.
> v4:
>   Add dedup prop for disabling dedup for given files/dirs.
>   Merge inmem_search() and ondisk_search() into generic_search() to save
>   some code
>   Fix another delayed_ref related bug.
>   Use the same mutex for both inmem and ondisk backend.
>   Move dedup_add() back to btrfs_finish_ordered_io() to 

Re: [RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk

2016-03-10 Thread Alex Lyakas
Hello Filipe,

I have sent two patches addressing this issue.

When testing, I discovered that log tree blocks can sometimes carry
chunk tree UUID which is all zeros! Does this make sense? You can take
a look at a small debug-tree output demonstrating such phenomenon at
https://drive.google.com/file/d/0B9rmyUifdvMLbHBuSWU5dlVKNWc. Due to
this I did not include the chunk tree UUID check. I am hoping very much
that the fs UUID is always valid for all tree blocks))

Thanks,
Alex.



On Mon, Feb 22, 2016 at 12:28 PM, Filipe Manana <fdman...@kernel.org> wrote:
> On Mon, Feb 22, 2016 at 9:46 AM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> Thank you, Filipe, for your review.
>>
>> On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdman...@kernel.org> wrote:
>>> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <a...@zadarastorage.com> wrote:
>>>> csum_dirty_buffer was issuing a warning in case the extent buffer
>>>> did not look alright, but was still returning success.
>>>> Let's return error in this case, and also add two additional sanity
>>>> checks on the extent buffer header.
>>>>
>>>> We had btrfs metadata corruption, and after looking at the logs we saw
>>>> that WARN_ON(found_start != start) has been triggered. We are still
>>>> investigating
>>>
>>> There's a warning for WARN_ON(found_start != start || !PageUptodate(page))
>>>
>>> Are you sure it triggered only because of found_start != start and not
>>> because of !PageUptodate(page) (or both)?
>> The problem initially happened on kernel 3.8.13.  In this kernel, the
>> code looks like this:
>>  found_start = btrfs_header_bytenr(eb);
>>  if (found_start != start) {
>>  WARN_ON(1);
>>  return 0;
>>  }
>>  if (!PageUptodate(page)) {
>>  WARN_ON(1);
>>  return 0;
>>  }
>> (You can see it on
>> http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420)
>> The WARN_ON that we hit was on the found_start comparison.
>
> Ok, I see now that one of those useless cleanup patches merged both
> conditions into a single if some time ago.
>
>>
>>>
>>>> which component trashed the cache page which belonged to btrfs. But btrfs
>>>> only issued a warning, and as a result, the corrupted metadata block
>>>> went to disk.
>>>>
>>>> I think we should return an error in such case that the extent buffer
>>>> doesn't look alright.
>>>
>>> I think so too.
>>>
>>>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
>>>> will,
>>>> but it is better than to have a silent metadata corruption on disk.
>>>>
>>>> Note: this patch has been properly tested on 3.18 kernel only.
>>>>
>>>> Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
>>>> ---
>>>> fs/btrfs/disk-io.c | 14 ++++++++++++--
>>>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index 4545e2e..701e706 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>>>> {
>>>> u64 start = page_offset(page);
>>>> u64 found_start;
>>>> struct extent_buffer *eb;
>>>>
>>>> eb = (struct extent_buffer *)page->private;
>>>> if (page != eb->pages[0])
>>>> return 0;
>>>> found_start = btrfs_header_bytenr(eb);
>>>> if (WARN_ON(found_start != start || !PageUptodate(page)))
>>>> -return 0;
>>>> -csum_tree_block(fs_info, eb, 0);
>>>> +return -EUCLEAN;
>>>> +#ifdef CONFIG_BTRFS_ASSERT
>>>
>>> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do 
>>> assertions.
>>> I would remove this #ifdef ... #endif or do the memcmp calls inside 
>>> ASSERT().
>> Agreed.
>>
>>>
>>>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>>> +(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
>>>> +return -EUCLEAN;
>>>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>>>> +  

[PATCH 1/2] btrfs: csum_tree_block: return proper errno value

2016-03-10 Thread Alex Lyakas
Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
---
 fs/btrfs/disk-io.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e..4420ab2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -296,52 +296,52 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
unsigned long map_len;
int err;
u32 crc = ~(u32)0;
unsigned long inline_result;
 
len = buf->len - offset;
while (len > 0) {
err = map_private_extent_buffer(buf, offset, 32,
&kaddr, &map_start, &map_len);
if (err)
-   return 1;
+   return err;
cur_len = min(len, map_len - (offset - map_start));
crc = btrfs_csum_data(kaddr + offset - map_start,
  crc, cur_len);
len -= cur_len;
offset += cur_len;
}
if (csum_size > sizeof(inline_result)) {
result = kzalloc(csum_size, GFP_NOFS);
if (!result)
-   return 1;
+   return -ENOMEM;
} else {
result = (char *)&inline_result;
}
 
btrfs_csum_final(crc, result);
 
if (verify) {
if (memcmp_extent_buffer(buf, result, 0, csum_size)) {
u32 val;
u32 found = 0;
memcpy(&found, result, csum_size);
 
read_extent_buffer(buf, &val, 0, csum_size);
btrfs_warn_rl(fs_info,
"%s checksum verify failed on %llu wanted %X 
found %X "
"level %d",
fs_info->sb->s_id, buf->start,
val, found, btrfs_header_level(buf));
if (result != (char *)&inline_result)
kfree(result);
-   return 1;
+   return -EUCLEAN;
}
} else {
write_extent_buffer(buf, result, 0, csum_size);
}
if (result != (char *)&inline_result)
kfree(result);
return 0;
 }
 
 /*
@@ -509,22 +509,21 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
u64 start = page_offset(page);
u64 found_start;
struct extent_buffer *eb;
 
eb = (struct extent_buffer *)page->private;
if (page != eb->pages[0])
return 0;
found_start = btrfs_header_bytenr(eb);
if (WARN_ON(found_start != start || !PageUptodate(page)))
return 0;
-   csum_tree_block(fs_info, eb, 0);
-   return 0;
+   return csum_tree_block(fs_info, eb, 0);
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb)
 {
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u8 fsid[BTRFS_UUID_SIZE];
int ret = 1;
 
read_extent_buffer(eb, fsid, btrfs_header_fsid(), BTRFS_FSID_SIZE);
@@ -653,24 +652,22 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
btrfs_err(root->fs_info, "bad tree block level %d",
   (int)btrfs_header_level(eb));
ret = -EIO;
goto err;
}
 
btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb),
   eb, found_level);
 
ret = csum_tree_block(root->fs_info, eb, 1);
-   if (ret) {
-   ret = -EIO;
+   if (ret)
goto err;
-   }
 
/*
 * If this is a leaf block and it is corrupt, set the corrupt bit so
 * that we don't try and read the other copies of this block, just
 * return -EIO.
 */
if (found_level == 0 && check_leaf(root, eb)) {
set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
ret = -EIO;
}
-- 
1.9.1



[PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk

2016-03-10 Thread Alex Lyakas
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return an error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
will, but it is better than to have a silent metadata corruption on disk.

Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
---
 fs/btrfs/disk-io.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4420ab2..cf85714 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -506,23 +506,34 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
u64 start = page_offset(page);
u64 found_start;
struct extent_buffer *eb;
 
eb = (struct extent_buffer *)page->private;
if (page != eb->pages[0])
return 0;
+
found_start = btrfs_header_bytenr(eb);
-   if (WARN_ON(found_start != start || !PageUptodate(page)))
-   return 0;
+   /*
+* Please do not consolidate these warnings into a single if.
+* It is useful to know what went wrong.
+*/
+   if (WARN_ON(found_start != start))
+   return -EUCLEAN;
+   if (WARN_ON(!PageUptodate(page)))
+   return -EUCLEAN;
+
+   ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+   btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
+
return csum_tree_block(fs_info, eb, 0);
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb)
 {
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u8 fsid[BTRFS_UUID_SIZE];
int ret = 1;
 
-- 
1.9.1



Re: [RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk

2016-02-22 Thread Alex Lyakas
Thank you, Filipe, for your review.

On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdman...@kernel.org> wrote:
> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> csum_dirty_buffer was issuing a warning in case the extent buffer
>> did not look alright, but was still returning success.
>> Let's return error in this case, and also add two additional sanity
>> checks on the extent buffer header.
>>
>> We had btrfs metadata corruption, and after looking at the logs we saw
>> that WARN_ON(found_start != start) has been triggered. We are still
>> investigating
>
> There's a warning for WARN_ON(found_start != start || !PageUptodate(page))
>
> Are you sure it triggered only because of found_start != start and not
> because of !PageUptodate(page) (or both)?
The problem initially happened on kernel 3.8.13.  In this kernel, the
code looks like this:
 found_start = btrfs_header_bytenr(eb);
 if (found_start != start) {
 WARN_ON(1);
 return 0;
 }
 if (!PageUptodate(page)) {
 WARN_ON(1);
 return 0;
 }
(You can see it on
http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420)
The WARN_ON that we hit was on the found_start comparison.

>
>> which component trashed the cache page which belonged to btrfs. But btrfs
>> only issued a warning, and as a result, the corrupted metadata block went to
>> disk.
>>
>> I think we should return an error in such case that the extent buffer
>> doesn't look alright.
>
> I think so too.
>
>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
>> will,
>> but it is better than to have a silent metadata corruption on disk.
>>
>> Note: this patch has been properly tested on 3.18 kernel only.
>>
>> Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
>> ---
>> fs/btrfs/disk-io.c | 14 ++++++++++++--
>> 1 file changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 4545e2e..701e706 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
>> {
>> u64 start = page_offset(page);
>> u64 found_start;
>> struct extent_buffer *eb;
>>
>> eb = (struct extent_buffer *)page->private;
>> if (page != eb->pages[0])
>> return 0;
>> found_start = btrfs_header_bytenr(eb);
>> if (WARN_ON(found_start != start || !PageUptodate(page)))
>> -return 0;
>> -csum_tree_block(fs_info, eb, 0);
>> +return -EUCLEAN;
>> +#ifdef CONFIG_BTRFS_ASSERT
>
> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do 
> assertions.
> I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT().
Agreed.

>
>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>> +(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
>> +return -EUCLEAN;
>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
>> +(unsigned long)btrfs_header_chunk_tree_uuid(eb),
>> +BTRFS_FSID_SIZE)))
>
> This second comparison doesn't seem correct. Second argument to
> memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which
> shouldn't be the same as the fsid (take a look at utils.c:make_btrfs()
> in the tools, both uuids are generated by different calls to
> uuid_generate()) - did you make your tests only before adding this
> comparison?. Also you should use BTRFS_UUID_SIZE instead of
> BTRFS_FSID_SIZE (even if both have the same value).
Obviously, you are right. In the 3.18-based code that I fixed locally
here, the fix looks like this:

if (found_start != start) {
	ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) != page->index<<PAGE_CACHE_SHIFT(%llu)",
		    root->fs_info->sb->s_id, found_start, start);
	return -EUCLEAN;
}
if (!PageUptodate(page)) {
	ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu) !PageUptodate",
		    root->fs_info->sb->s_id, start, (u64)page->index);
	return -EUCLEAN;
}
if (memcmp_extent_buffer(eb, root->fs_info->fsid,
			 (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) {
	u8 hdr_fsid[BTRFS_FSID_SIZE] = {0};

	read_extent_buffer(eb, hdr_fsid, (unsigned long)btrfs_header_fsid(),
			   BTRFS_FSID_SIZE);
	ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"] != fs_info->fsid["PRIX128"]", root

[RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk

2016-02-21 Thread Alex Lyakas

csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return an error in this case, and also add two additional sanity
checks on the extent buffer header.

We had btrfs metadata corruption, and after looking at the logs we saw
that WARN_ON(found_start != start) has been triggered. We are still
investigating which component trashed the cache page which belonged to
btrfs. But btrfs only issued a warning, and as a result, the corrupted
metadata block went to disk.


I think we should return an error in such a case where the extent buffer
doesn't look alright.
The caller up the chain may BUG_ON on this, for example flush_epd_write_bio
will, but it is better than to have a silent metadata corruption on disk.

Note: this patch has been properly tested on 3.18 kernel only.

Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
---
fs/btrfs/disk-io.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e..701e706 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)

{
u64 start = page_offset(page);
u64 found_start;
struct extent_buffer *eb;

eb = (struct extent_buffer *)page->private;
if (page != eb->pages[0])
return 0;
found_start = btrfs_header_bytenr(eb);
if (WARN_ON(found_start != start || !PageUptodate(page)))
-return 0;
-csum_tree_block(fs_info, eb, 0);
+return -EUCLEAN;
+#ifdef CONFIG_BTRFS_ASSERT
+if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
+return -EUCLEAN;
+if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+(unsigned long)btrfs_header_chunk_tree_uuid(eb),
+BTRFS_FSID_SIZE)))
+return -EUCLEAN;
+#endif
+if (csum_tree_block(fs_info, eb, 0))
+return -EUCLEAN;
return 0;
}

static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb)
{
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
u8 fsid[BTRFS_UUID_SIZE];
int ret = 1;

--
1.9.1



Re: [PATCH V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state

2015-12-13 Thread Alex Lyakas
[Resending in plain text, apologies.]

Hi Chandan, Josef, Chris,

I am not sure I understand the fix to the problem.

It may happen that when updating the device tree, we need to allocate a new
chunk via do_chunk_alloc (while we are holding the device tree root node
locked). This is a legitimate thing for find_free_extent() to do. And the
do_chunk_alloc() call may lead to a call to
btrfs_create_pending_block_groups(), which will try to update the device
tree. This may happen due to the direct call to
btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
perhaps via the __btrfs_end_transaction() that find_free_extent() does after
it has completed chunk allocation (although in this case it will use the
transaction that already exists in current->journal_info).
So the deadlock still may happen?
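
To spell out the suspected recursion as a call chain (a sketch of the
scenario above, not a verified trace):

	/*
	 * btrfs_search_slot(device tree)          <- device tree node locked
	 *   btrfs_cow_block()
	 *     find_free_extent()
	 *       do_chunk_alloc()
	 *         btrfs_create_pending_block_groups()
	 *           ... inserts/updates items in the device tree,
	 *           which needs the very lock we are holding
	 */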

Thanks,
 Alex.

>
>
> On Mon, Nov 2, 2015 at 6:52 PM, Chris Mason  wrote:
>>
>> On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
>> > When executing generic/001 in a loop on a ppc64 machine (with both
>> > sectorsize
>> > and nodesize set to 64k), the following call trace is observed,
>>
>> Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.
>>
>> -chris


Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references

2015-12-13 Thread Alex Lyakas
Hi Filipe Manana,

My understanding of selecting delayed refs to run or merging them is
far from complete. Can you please explain what will happen in the
following scenario:

1) Ref1 is created, as you explain
2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
up with an EXTENT_ITEM and an inline extent back ref
3) Ref2 and Ref3 are added
4) Somebody calls __btrfs_run_delayed_refs()

At this point, we cannot merge Ref2 and Ref3, because they might be
referencing tree blocks of completely different trees, thus
comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
run, because we prefer BTRFS_ADD_DELAYED_REF over
BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
now, because we already have Ref1 in the extent tree.

So something should prevent us from running Ref3 before running Ref2.
We should run Ref2 first, which should get rid of the EXTENT_ITEM and
the inline backref, and then run Ref3 to create a new backref with a
proper owner. What is that something?

Can you please point me at what I am missing?

Also, can such a scenario happen in the 3.18 kernel, which still has an
rbtree per ref-head? Looking at the code, I don't see anything
preventing that from happening.
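
For reference, the preference for ADD refs that I am referring to is
implemented in select_delayed_ref(); roughly, from the 4.3-era
fs/btrfs/extent-tree.c:

static struct btrfs_delayed_ref_node *
select_delayed_ref(struct btrfs_delayed_ref_head *head)
{
	struct btrfs_delayed_ref_node *ref;

	if (list_empty(&head->ref_list))
		return NULL;
	/*
	 * Pick an ADD ref first, so that a ref count never transiently
	 * drops to zero (deleting the extent item) while adds are queued.
	 */
	list_for_each_entry(ref, &head->ref_list, list) {
		if (ref->action == BTRFS_ADD_DELAYED_REF)
			return ref;
	}
	return list_first_entry(&head->ref_list,
				struct btrfs_delayed_ref_node, list);
}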

Thanks,
Alex.


On Sun, Oct 25, 2015 at 8:51 PM,   wrote:
> From: Filipe Manana 
>
> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
> references implementation in order to fix certain problems with qgroups.
> However that rework introduced one more regression that leads to the
> following trace when running delayed references for metadata:
>
> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
> [35908.065201] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc 
> loop fuse parport_pc psmouse i2
> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW 
>   4.3.0-rc5-btrfs-next-17+ #1
> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 
> 88010c4c8000
> [35908.065201] RIP: 0010:[]  [] 
> insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201] RSP: 0018:88010c4cbb08  EFLAGS: 00010293
> [35908.065201] RAX:  RBX: 88008a661000 RCX: 
> 
> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: 
> 
> [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: 
> 88010c4cb9f8
> [35908.065201] R10:  R11: 002c R12: 
> 
> [35908.065201] R13: 88020a74c578 R14:  R15: 
> 
> [35908.065201] FS:  () GS:88023edc() 
> knlGS:
> [35908.065201] CS:  0010 DS:  ES:  CR0: 8005003b
> [35908.065201] CR2: 015e8708 CR3: 000102185000 CR4: 
> 06e0
> [35908.065201] Stack:
> [35908.065201]  88010c4cbb18 0f37 88020a74c578 
> 88015a408000
> [35908.065201]  880154a44000  0005 
> 88010c4cbbd8
> [35908.065201]  a0492b9a 0005  
> 
> [35908.065201] Call Trace:
> [35908.065201]  [] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs]
> [35908.065201]  [] ? __btrfs_run_delayed_refs+0x4d4/0xd33 
> [btrfs]
> [35908.065201]  [] __btrfs_run_delayed_refs+0xafa/0xd33 
> [btrfs]
> [35908.065201]  [] ? join_transaction.isra.10+0x25/0x41f 
> [btrfs]
> [35908.065201]  [] ? join_transaction.isra.10+0xa8/0x41f 
> [btrfs]
> [35908.065201]  [] btrfs_run_delayed_refs+0x75/0x1dd [btrfs]
> [35908.065201]  [] delayed_ref_async_start+0x3c/0x7b [btrfs]
> [35908.065201]  [] normal_work_helper+0x14c/0x32a [btrfs]
> [35908.065201]  [] btrfs_extent_refs_helper+0x12/0x14 
> [btrfs]
> [35908.065201]  [] process_one_work+0x24a/0x4ac
> [35908.065201]  [] worker_thread+0x206/0x2c2
> [35908.065201]  [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [] ? rescuer_thread+0x2cb/0x2cb
> [35908.065201]  [] kthread+0xef/0xf7
> [35908.065201]  [] ? kthread_parkme+0x24/0x24
> [35908.065201]  [] ret_from_fork+0x3f/0x70
> [35908.065201]  [] ? kthread_parkme+0x24/0x24
> [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d 
> 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> 
> 0b 4c 8b 45 30 8b 4d 28 45 31
> [35908.065201] RIP  [] 
> insert_inline_extent_backref+0x52/0xb1 [btrfs]
> [35908.065201]  RSP 
> [35908.310885] ---[ end trace fe4299baf0666457 ]---
>
> This happens because the new delayed references code no longer merges
> delayed references that have different sequence values. 

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Hi Filipe Manana,

Can't the call to btrfs_create_pending_block_groups() cause a
deadlock, like in
http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
call updates the device tree, and we may be calling do_chunk_alloc()
from find_free_extent() when holding a lock on the device tree root
(because we want to COW a block of the device tree).

My understanding from Josef's chunk allocator rework
(http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
when allocating a new chunk we do not immediately update the
device/chunk tree. We keep the new chunk in "pending_chunks" and in
"new_bgs" on a transaction handle, and we actually update the
chunk/device tree only when we are done with a particular transaction
handle. This way we avoid that sort of deadlock.
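
For reference, the deferral I am describing happens when the transaction
handle is ended; roughly, from the 4.x-era fs/btrfs/transaction.c
(simplified sketch):

static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,
				   struct btrfs_root *root, int throttle)
{
	/* ... */

	/* pending block groups are flushed to the chunk/device trees only
	 * now, when the transaction handle is being ended */
	if (!list_empty(&trans->new_bgs))
		btrfs_create_pending_block_groups(trans, root);

	/* ... */
}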

But this patch breaks this rule, as it may make us update the
device/chunk tree in the context of chunk allocation, which is the
scenario that the rework was meant to avoid.

Can you please point me at what I am missing?

Thanks,
Alex.


On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval  wrote:
> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>> finishing block group creation"), introduced in 4.2-rc1, the following
>> test was failing due to exhaustion of the system array in the superblock:
>>
>>   #!/bin/bash
>>
>>   truncate -s 100T big.img
>>   mkfs.btrfs big.img
>>   mount -o loop big.img /mnt/loop
>>
>>   num=5
>>   sz=10T
>>   for ((i = 0; i < $num; i++)); do
>>   echo fallocate $i $sz
>>   fallocate -l $sz /mnt/loop/testfile$i
>>   done
>>   btrfs filesystem sync /mnt/loop
>>
>>   for ((i = 0; i < $num; i++)); do
>> echo rm $i
>> rm /mnt/loop/testfile$i
>> btrfs filesystem sync /mnt/loop
>>   done
>>   umount /mnt/loop
>>
>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>> allocation of system block groups. This happened because the test creates
>> a large number of data block groups per transaction and when committing
>> the transaction we start the writeout of the block group caches for all
>> the new (dirty) block groups, which results in pre-allocating space
>> for each block group's free space cache using the same transaction handle.
>> That in turn often leads to creation of more block groups, and all get
>> attached to the new_bgs list of the same transaction handle to the point
>> of getting a list with over 1500 elements, and creation of new block groups
>> leads to the need of reserving space in the chunk block reserve and often
>> creating a new system block group too.
>>
>> So that made us quickly exhaust the chunk block reserve/system space info,
>> because as of the commit mentioned before, we do reserve space for each
>> new block group in the chunk block reserve, unlike before where we would
>> not and would at most allocate one new system block group and therefore
>> would only ensure that there was enough space in the system space info to
>> allocate 1 new block group even if we ended up allocating thousands of
>> new block groups using the same transaction handle. That worked most of
>> the time because the computed required space at check_system_chunk() is
>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>> that all nodes/leafs in a path will be COWed and split) and since the
>> updates to the chunk tree all happen at btrfs_create_pending_block_groups
>> it is unlikely that a path needs to be COWed more than once (unless
>> writepages() for the btree inode is called by mm in between) and that
>> compensated for the need of creating any new nodes/leafs in the chunk
>> tree.
>>
>> So fix this by ensuring we don't accumulate a too large list of new block
>> groups in a transaction's handles new_bgs list, inserting/updating the
>> chunk tree for all accumulated new block groups and releasing the unused
>> space from the chunk block reserve whenever the list becomes sufficiently
>> large. This is a generic solution even though the problem currently can
>> only happen when starting the writeout of the free space caches for all
>> dirty block groups (btrfs_start_dirty_block_groups()).
>>
>> Reported-by: Omar Sandoval 
>> Signed-off-by: Filipe Manana 
>
> Thanks a lot for taking a look.
>
> Tested-by: Omar Sandoval 
>
>> ---
>>  fs/btrfs/extent-tree.c | 18 ++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 171312d..07204bf 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4227,6 +4227,24 @@ out:
>>   space_info->chunk_alloc = 0;
>>   spin_unlock(&space_info->lock);
>>   mutex_unlock(&fs_info->chunk_mutex);
>> + /*
>> +  * When we allocate a new chunk we reserve space in the chunk block
>> +  * reserve 
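(The quoted hunk is cut off above. Roughly, the fix makes do_chunk_alloc() flush the accumulated new block groups once the chunk block reserve usage crosses a threshold. The sketch below conveys the idea only; the threshold and exact calls are illustrative, not the upstream hunk.)

    #define PENDING_BG_FLUSH_THRESH (2ULL * 1024 * 1024)    /* illustrative */

    static void maybe_flush_pending_bgs(struct btrfs_trans_handle *trans,
                                        struct btrfs_root *root)
    {
            /*
             * Don't let new_bgs grow into thousands of entries: once
             * enough space has been reserved for pending block groups,
             * insert them into the chunk/device trees now and release
             * the unused reservation.
             */
            if (trans->chunk_bytes_reserved >= PENDING_BG_FLUSH_THRESH) {
                    btrfs_create_pending_block_groups(trans, root);
                    btrfs_trans_release_chunk_metadata(trans);
            }
    }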

Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references

2015-12-13 Thread Alex Lyakas
Hi Filipe,

Thank you for the explanation.

On Sun, Dec 13, 2015 at 5:43 PM, Filipe Manana <fdman...@kernel.org> wrote:
> On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> Hi Filipe Manana,
>>
>> My understanding of selecting delayed refs to run or merging them is
>> far from complete. Can you please explain what will happen in the
>> following scenario:
>>
>> 1) Ref1 is created, as you explain
>> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end
>> up with an EXTENT_ITEM and an inline extent back ref
>> 3) Ref2 and Ref3 are added
>> 4) Somebody calls __btrfs_run_delayed_refs()
>>
>> At this point, we cannot merge Ref2 and Ref3, because they might be
>> referencing tree blocks of completely different trees, thus
>> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be
>> run, because we prefer BTRFS_ADD_DELAYED_REF over
>> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON
>> now, because we already have Ref1 in the extent tree.
>
> No, that won't happen. If the ref (Ref3) is for a different tree, than
> it has a different inline extent from Ref1
> (lookup_inline_extent_backref returns -ENOENT and not 0).
Understood. So in this case, we will first add an inline ref for Ref3,
and later drop the Ref1 inline ref via update_inline_extent_backref()
by truncating the EXTENT_ITEM. All in the same transaction.


>
> If they are all for the same tree it means Ref3 is not merged with
> Ref2 because they have different seq numbers and a seq value exist in
> fs_info->tree_mod_seq_list, and we skip Ref3 through
> btrfs_check_delayed_seq() until such seq number goes away from
> tree_mod_seq_list.
Ok, so we won't process this ref-head at all, until the "seq problem"
disappears.

> If no seq number exists in tree_mod_seq_list then
> we merge it (Ref3) through btrfs_merge_delayed_refs(), called when
> running delayed refs, with Ref2 (which removes both refs since one is
> "-1" and the other "+1").
So in this case we don't care that the inline ref we have in the
EXTENT_ITEM was actually inserted on behalf of Ref1. Because it's for
the same EXTENT_ITEM and for the same root. So Ref3 and Ref1 are fully
equivalent. Interesting.

Thanks!
Alex.

>
> Iow, after this regression fix, no behaviour changed from releases before 4.2.
>
>>
>> So something should prevent us from running Ref3 before running Ref2.
>> We should run Ref2 first, which should get rid of the EXTENT_ITEM and
>> the inline backref, and then run Ref3 to create a new backref with a
>> proper owner. What is that something?
>>
>> Can you please point me at what I am missing?
>>
>> Also, can such scenario happen in 3.18 kernel, which still has an
>> rbtree per ref-head? Looking at the code, I don't see anything
>> preventing that from happening.
>>
>> Thanks,
>> Alex.
>>
>>
>> On Sun, Oct 25, 2015 at 8:51 PM,  <fdman...@kernel.org> wrote:
>>> From: Filipe Manana <fdman...@suse.com>
>>>
>>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed
>>> references implementation in order to fix certain problems with qgroups.
>>> However that rework introduced one more regression that leads to the
>>> following trace when running delayed references for metadata:
>>>
>>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832!
>>> [35908.065201] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor 
>>> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache 
>>> sunrpc loop fuse parport_pc psmouse i2
>>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW   
>>> 4.3.0-rc5-btrfs-next-17+ #1
>>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014
>>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs]
>>> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: 
>>> 88010c4c8000
>>> [35908.065201] RIP: 0010:[]  [] 
>>> insert_inline_extent_backref+0x52/0xb1 [btrfs]
>>> [35908.065201] RSP: 0018:88010c4cbb08  EFLAGS: 00010293
>>> [35908.065201] RAX:  RBX: 88008a661000 RCX: 
>>> 
>>> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: 
>>> 
>>> [35908.065201] RBP: 88010c4cbb40 R08: 1000

Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock

2015-12-13 Thread Alex Lyakas
Thank you, Filipe. Now it is clearer.
Fortunately, in my kernel 3.18 I do not have do_chunk_alloc() calling
btrfs_create_pending_block_groups(), so I cannot hit this deadlock.
But I can still hit the issue that this call is meant to fix.

Thanks,
Alex.


On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana <fdman...@kernel.org> wrote:
> On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> Hi Filipe Manana,
>>
>> Can't the call to btrfs_create_pending_block_groups() cause a
>> deadlock, like in
>> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
>> call updates the device tree, and we may be calling do_chunk_alloc()
>> from find_free_extent() when holding a lock on the device tree root
>> (because we want to COW a block of the device tree).
>>
>> My understanding from Josef's chunk allocator rework
>> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
>> when allocating a new chunk we do not immediately update the
>> device/chunk tree. We keep the new chunk in "pending_chunks" and in
>> "new_bgs" on a transaction handle, and we actually update the
>> chunk/device tree only when we are done with a particular transaction
>> handle. This way we avoid that sort of deadlocks.
>>
>> But this patch breaks this rule, as it may make us update the
>> device/chunk tree in the context of chunk allocation, which is the
>> scenario that the rework was meant to avoid.
>>
>> Can you please point me at what I am missing?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff
>
>>
>> Thanks,
>> Alex.
>>
>>
>> On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval <osan...@fb.com> wrote:
>>> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote:
>>>> From: Filipe Manana <fdman...@suse.com>
>>>>
>>>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when
>>>> finishing block group creation"), introduced in 4.2-rc1, the following
>>>> test was failing due to exhaustion of the system array in the superblock:
>>>>
>>>>   #!/bin/bash
>>>>
>>>>   truncate -s 100T big.img
>>>>   mkfs.btrfs big.img
>>>>   mount -o loop big.img /mnt/loop
>>>>
>>>>   num=5
>>>>   sz=10T
>>>>   for ((i = 0; i < $num; i++)); do
>>>>   echo fallocate $i $sz
>>>>   fallocate -l $sz /mnt/loop/testfile$i
>>>>   done
>>>>   btrfs filesystem sync /mnt/loop
>>>>
>>>>   for ((i = 0; i < $num; i++)); do
>>>> echo rm $i
>>>> rm /mnt/loop/testfile$i
>>>> btrfs filesystem sync /mnt/loop
>>>>   done
>>>>   umount /mnt/loop
>>>>
>>>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive
>>>> allocation of system block groups. This happened because the test creates
>>>> a large number of data block groups per transaction and when committing
>>>> the transaction we start the writeout of the block group caches for all
>>>> the new (dirty) block groups, which results in pre-allocating space
>>>> for each block group's free space cache using the same transaction handle.
>>>> That in turn often leads to creation of more block groups, and all get
>>>> attached to the new_bgs list of the same transaction handle to the point
>>>> of getting a list with over 1500 elements, and creation of new block groups
>>>> leads to the need of reserving space in the chunk block reserve and often
>>>> creating a new system block group too.
>>>>
>>>> So that made us quickly exhaust the chunk block reserve/system space info,
>>>> because as of the commit mentioned before, we do reserve space for each
>>>> new block group in the chunk block reserve, unlike before where we would
>>>> not and would at most allocate one new system block group and therefore
>>>> would only ensure that there was enough space in the system space info to
>>>> allocate 1 new block group even if we ended up allocating thousands of
>>>> new block groups using the same transaction handle. That worked most of
>>>> the time because the computed required space at check_system_chunk() is
>>>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and
>>>> that all nodes/leafs in a path will be C

Re: [RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly

2015-12-06 Thread Alex Lyakas
Hi Liu,
I was studying how block reservation works, and making some
modifications in reserve_metadata_bytes to understand better what it
does. Then I suddenly saw this problem. I guess it depends on which
value of the "flush" parameter is passed to reserve_metadata_bytes.

Alex.


On Thu, Dec 3, 2015 at 8:14 PM, Liu Bo <bo.li@oracle.com> wrote:
> On Thu, Dec 03, 2015 at 06:51:03PM +0200, Alex Lyakas wrote:
>> do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
>> But flush_space will not convert this to 0, and will also return 1.
>> As a result, reserve_metadata_bytes will think that flush_space failed,
>> and may potentially return this value "1" to the caller (depending on how
>> reserve_metadata_bytes was called). The caller will also treat this as an
>> error.
>> For example, btrfs_block_rsv_refill does:
>>
>> int ret = -ENOSPC;
>> ...
>> ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
>> if (!ret) {
>> block_rsv_add_bytes(block_rsv, num_bytes, 0);
>> return 0;
>> }
>>
>> return ret;
>>
>> So it will return -ENOSPC.
>
> It will return 1 instead of -ENOSPC.
>
> The patch looks good. I noticed this before, but I didn't manage to trigger an
> error for this; did you catch an error like that?
>
> Thanks,
>
> -liubo
>
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 4b89680..1ba3f0d 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
>>  btrfs_get_alloc_profile(root, 0),
>>  CHUNK_ALLOC_NO_FORCE);
>> btrfs_end_transaction(trans, root);
>> -   if (ret == -ENOSPC)
>> +   if (ret > 0 || ret == -ENOSPC)
>> ret = 0;
>> break;
>> case COMMIT_TRANS:


Re: [RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly

2015-12-06 Thread Alex Lyakas
do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depending on how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
return 0;
}

return ret;

So it will return -ENOSPC.

Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
Reviewed-by: Josef Bacik <jba...@fb.com>

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b89680..1ba3f0d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
 btrfs_get_alloc_profile(root, 0),
 CHUNK_ALLOC_NO_FORCE);
btrfs_end_transaction(trans, root);
-   if (ret == -ENOSPC)
+   if (ret > 0 || ret == -ENOSPC)
ret = 0;
break;
case COMMIT_TRANS:

On Sun, Dec 6, 2015 at 12:19 PM, Alex Lyakas <a...@zadarastorage.com> wrote:
> Hi Liu,
> I was studying how block reservation works, and making some
> modifications in reserve_metadata_bytes to understand better what it
> does. Then I suddenly saw this problem. I guess it depends on which
> value of the "flush" parameter is passed to reserve_metadata_bytes.
>
> Alex.
>
>
> On Thu, Dec 3, 2015 at 8:14 PM, Liu Bo <bo.li@oracle.com> wrote:
>> On Thu, Dec 03, 2015 at 06:51:03PM +0200, Alex Lyakas wrote:
>>> do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
>>> But flush_space will not convert this to 0, and will also return 1.
>>> As a result, reserve_metadata_bytes will think that flush_space failed,
>>> and may potentially return this value "1" to the caller (depending on how
>>> reserve_metadata_bytes was called). The caller will also treat this as an
>>> error.
>>> For example, btrfs_block_rsv_refill does:
>>>
>>> int ret = -ENOSPC;
>>> ...
>>> ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
>>> if (!ret) {
>>> block_rsv_add_bytes(block_rsv, num_bytes, 0);
>>> return 0;
>>> }
>>>
>>> return ret;
>>>
>>> So it will return -ENOSPC.
>>
>> It will return 1 instead of -ENOSPC.
>>
>> The patch looks good. I noticed this before, but I didn't manage to trigger
>> an error for this; did you catch an error like that?
>>
>> Thanks,
>>
>> -liubo
>>
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index 4b89680..1ba3f0d 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
>>>  btrfs_get_alloc_profile(root, 0),
>>>  CHUNK_ALLOC_NO_FORCE);
>>> btrfs_end_transaction(trans, root);
>>> -   if (ret == -ENOSPC)
>>> +   if (ret > 0 || ret == -ENOSPC)
>>> ret = 0;
>>> break;
>>> case COMMIT_TRANS:


[RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly

2015-12-03 Thread Alex Lyakas
do_chunk_alloc returns 1 when it succeeds in allocating a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depending on how
reserve_metadata_bytes was called). The caller will also treat this as an error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
block_rsv_add_bytes(block_rsv, num_bytes, 0);
return 0;
}

return ret;

So it will return -ENOSPC.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b89680..1ba3f0d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
 btrfs_get_alloc_profile(root, 0),
 CHUNK_ALLOC_NO_FORCE);
btrfs_end_transaction(trans, root);
-   if (ret == -ENOSPC)
+   if (ret > 0 || ret == -ENOSPC)
ret = 0;
break;
case COMMIT_TRANS:


Re: [PATCH] btrfs: clear bio reference after submit_one_bio()

2015-11-07 Thread Alex Lyakas
Hi Holger,
I think it will cause an invalid paging request, just like in the case
that Naohiro has fixed.
I am not running the "latest and greatest" btrfs in my system, and it
is not easy to set it up; that's why I cannot submit patches based on
the latest code, and can only review and comment on patches.

Alex.


On Thu, Nov 5, 2015 at 3:08 PM, Holger Hoffstätte
<holger.hoffstae...@googlemail.com> wrote:
> On 10/11/15 20:09, Alex Lyakas wrote:
>> Hi Naota,
>>
>> What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we
>> return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And
>> if *bio_ret was non-NULL upon entry into submit_extent_page, then we
>> had submitted this bio before getting to btrfs_bio_alloc(). So should
>> btrfs_bio_alloc() failure be handled in the same way?
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 3915c94..cd443bc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct
>> extent_io_tree *tree,
>>
>> bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
>> GFP_NOFS | __GFP_HIGH);
>> -   if (!bio)
>> +   if (!bio) {
>> +   if (bio_ret)
>> +   *bio_ret = NULL;
>> return -ENOMEM;
>> +   }
>>
>> bio_add_page(bio, page, page_size, offset);
>> bio->bi_end_io = end_io_func;
>>
>
> Did you get any feedback on this? It seems it could cause data loss or
> corruption on allocation failures, no?
>
> -h
>


Re: [PATCH] btrfs: clear bio reference after submit_one_bio()

2015-10-11 Thread Alex Lyakas
Hi Naota,

What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we
return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And
if *bio_ret was non-NULL upon entry into submit_extent_page, then we
had submitted this bio before getting to btrfs_bio_alloc(). So should
btrfs_bio_alloc() failure be handled in the same way?

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 3915c94..cd443bc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct
extent_io_tree *tree,

bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
GFP_NOFS | __GFP_HIGH);
-   if (!bio)
+   if (!bio) {
+   if (bio_ret)
+   *bio_ret = NULL;
return -ENOMEM;
+   }

bio_add_page(bio, page, page_size, offset);
bio->bi_end_io = end_io_func;


Thanks,
Alex.

On Wed, Jan 7, 2015 at 12:46 AM, Satoru Takeuchi
 wrote:
> Hi Naota,
>
> On 2015/01/06 1:01, Naohiro Aota wrote:
>> After submit_one_bio(), `bio' can go away. However submit_extent_page()
>> leaves `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM).
>> It will cause an invalid paging request when submit_extent_page() is called
>> the next time.
>>
>> I reproduced ENOMEM case with the following script (need
>> CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS).
>
> I confirmed that this problem reproduce with 3.19-rc3 and
> not reproduce with 3.19-rc3 with your patch.
>
> Tested-by: Satoru Takeuchi 
>
> Thank you for reporting this problem with the reproducer
> and fixing it too.
>
>   NOTE:
>   I used v3.19-rc3's tools/testing/fault-injection/failcmd.sh
>   for the following "./failcmd.sh".
>
>   >./failcmd.sh -p $percent -t $times -i $interval \
>   >--ignore-gfp-highmem=N --ignore-gfp-wait=N 
> --min-order=0 \
>   >-- \
>   >cat $directory/file > /dev/null
>
> * 3.19-rc1 + your patch
>
> ===
> # ./run
> 512+0 records in
> 512+0 records out
> #
> ===
>
> * 3.19-rc3
>
> ===
> # ./run
> 512+0 records in
> 512+0 records out
> [  188.433726] run (776): drop_caches: 1
> [  188.682372] FAULT_INJECTION: forcing a failure.
> name fail_page_alloc, interval 100, probability 111000, space 0, times 3
> [  188.689986] CPU: 0 PID: 954 Comm: cat Not tainted 3.19.0-rc3-ktest #1
> [  188.693834] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> Bochs 01/01/2011
> [  188.698466]  0064 88007b343618 816e5563 
> 88007fc0fc78
> [  188.702730]  81c655c0 88007b343638 813851b5 
> 0010
> [  188.707043]  0002 88007b343768 81188126 
> 88007b3435a8
> [  188.711283] Call Trace:
> [  188.712620]  [] dump_stack+0x45/0x57
> [  188.715330]  [] should_fail+0x135/0x140
> [  188.718218]  [] __alloc_pages_nodemask+0xd6/0xb30
> [  188.721567]  [] ? blk_rq_map_sg+0x35/0x170
> [  188.724558]  [] ? virtio_queue_rq+0x145/0x2b0 
> [virtio_blk]
> [  188.728191]  [] ? 
> btrfs_submit_compressed_read+0xcf/0x4d0 [btrfs]
> [  188.732079]  [] ? kmem_cache_alloc+0x1cb/0x230
> [  188.735153]  [] ? mempool_alloc_slab+0x15/0x20
> [  188.738188]  [] alloc_pages_current+0x9a/0x120
> [  188.741153]  [] btrfs_submit_compressed_read+0x1a9/0x4d0 
> [btrfs]
> [  188.744835]  [] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs]
> [  188.748225]  [] ? lookup_extent_mapping+0x13/0x20 [btrfs]
> [  188.751547]  [] ? btrfs_get_extent+0x98/0xad0 [btrfs]
> [  188.754656]  [] submit_one_bio+0x67/0xa0 [btrfs]
> [  188.757554]  [] submit_extent_page.isra.35+0xd7/0x1c0 
> [btrfs]
> [  188.760981]  [] __do_readpage+0x31d/0x7b0 [btrfs]
> [  188.763920]  [] ? btrfs_create_repair_bio+0x110/0x110 
> [btrfs]
> [  188.767382]  [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
> [  188.770671]  [] ? btrfs_lookup_ordered_range+0x13d/0x180 
> [btrfs]
> [  188.774366]  [] 
> __extent_readpages.constprop.42+0x2ba/0x2d0 [btrfs]
> [  188.778031]  [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
> [  188.781241]  [] extent_readpages+0x169/0x1b0 [btrfs]
> [  188.784322]  [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs]
> [  188.789014]  [] btrfs_readpages+0x1f/0x30 [btrfs]
> [  188.792028]  [] __do_page_cache_readahead+0x18c/0x1f0
> [  188.795078]  [] ondemand_readahead+0xdf/0x260
> [  188.797702]  [] ? btrfs_congested_fn+0x5f/0xa0 [btrfs]
> [  188.800718]  [] page_cache_async_readahead+0x71/0xa0
> [  188.803650]  [] generic_file_read_iter+0x40f/0x5e0
> [  188.806480]  [] new_sync_read+0x7e/0xb0
> [  188.808832]  [] __vfs_read+0x18/0x50
> [  188.811068]  [] vfs_read+0x8a/0x140
> [  188.813298]  [] SyS_read+0x46/0xb0

Re: [PATCH] Btrfs: check pending chunks when shrinking fs to avoid corruption

2015-09-30 Thread Alex Lyakas
Hi Filipe,

Looking at the code of this patch, I see that if we discover a pending
chunk, we unlock the chunk mutex, commit the transaction (which
completes the allocation of all pending chunks and inserts relevant
items into the device tree and chunk tree), and retry the search.

However, after we unlock the chunk mutex, somebody could have
attempted a new chunk allocation, which would have resulted in a new
pending chunk. On the other hand, we have done:

btrfs_device_set_total_bytes(device, new_size);

so this line should prevent anybody from allocating beyond the new size.
In that case, we are sure that on the second pass there will be no
pending chunks beyond the new size, so we can shrink to new_size
safely. Is my understanding correct?

Thanks,
Alex.



On Tue, Jun 2, 2015 at 3:43 PM,   wrote:
> From: Filipe Manana 
>
> When we shrink the usable size of a device (its total_bytes), we go over
> all the device extent items in the device tree and attempt to relocate
> the chunk of any device extent that goes beyond the new usable size for
> the device. We do that after setting the new usable size (total_bytes) in
> the device object, so that all new allocations (and reallocations) don't
> use areas of the device that go beyond the new (shorter) size. However we
> were not considering that before setting the new size in the device,
> pending chunks might have been created that use device extents that go
> beyond the new size, and those device extents are not yet in the device
> tree after we search the device tree - they are still attached to the
> list of new block group for some ongoing transaction handle, and they are
> only added to the device tree when the transaction handle is ended (via
> btrfs_create_pending_block_groups()).
>
> So check for pending chunks with device extents that go beyond the new
> size and if any exists, commit the current transaction and repeat the
> search in the device tree.
>
> Not doing this it would mean we would return success to user space while
> still having extents that go beyond the new size, and later user space
> could override those locations on the device while the fs still references
> them, causing all sorts of corruption and unexpected events.
>
> Signed-off-by: Filipe Manana 
> ---
>  fs/btrfs/volumes.c | 49 -
>  1 file changed, 40 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index dbea12e..09e89a6 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3984,6 +3984,7 @@ int btrfs_shrink_device(struct btrfs_device *device, 
> u64 new_size)
> int slot;
> int failed = 0;
> bool retried = false;
> +   bool checked_pending_chunks = false;
> struct extent_buffer *l;
> struct btrfs_key key;
> struct btrfs_super_block *super_copy = root->fs_info->super_copy;
> @@ -4064,15 +4065,6 @@ again:
> goto again;
> } else if (failed && retried) {
> ret = -ENOSPC;
> -   lock_chunks(root);
> -
> -   btrfs_device_set_total_bytes(device, old_size);
> -   if (device->writeable)
> -   device->fs_devices->total_rw_bytes += diff;
> -   spin_lock(&root->fs_info->free_chunk_lock);
> -   root->fs_info->free_chunk_space += diff;
> -   spin_unlock(&root->fs_info->free_chunk_lock);
> -   unlock_chunks(root);
> goto done;
> }
>
> @@ -4084,6 +4076,35 @@ again:
> }
>
> lock_chunks(root);
> +
> +   /*
> +* We checked in the above loop all device extents that were already 
> in
> +* the device tree. However before we have updated the device's
> +* total_bytes to the new size, we might have had chunk allocations 
> that
> +* have not complete yet (new block groups attached to transaction
> +* handles), and therefore their device extents were not yet in the
> +* device tree and we missed them in the loop above. So if we have any
> +* pending chunk using a device extent that overlaps the device range
> +* that we can not use anymore, commit the current transaction and
> +* repeat the search on the device tree - this way we guarantee we 
> will
> +* not have chunks using device extents that end beyond 'new_size'.
> +*/
> +   if (!checked_pending_chunks) {
> +   u64 start = new_size;
> +   u64 len = old_size - new_size;
> +
> +   if (contains_pending_extent(trans, device, &start, len)) {
> +   unlock_chunks(root);
> +   checked_pending_chunks = true;
> +   failed = 0;
> +   retried = false;
> +   ret = btrfs_commit_transaction(trans, root);
> +   if (ret)
> +  

Re: [PATCH v5 04/18] btrfs: Add threshold workqueue based on kernel workqueue

2015-08-19 Thread Alex Lyakas
Hi Qu,


On Fri, Feb 28, 2014 at 4:46 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
 The original btrfs_workers has thresholding functions to dynamically
 create or destroy kthreads.

 Though there is no such function in kernel workqueue because the worker
 is not created manually, we can still use workqueue_set_max_active
 to simulate the behavior, mainly to achieve better HDD performance by
 setting a high threshold on submit_workers.
 (Sadly, no resource can be saved)

 So in this patch, extra workqueue pending counters are introduced to
 dynamically change the max active of each btrfs_workqueue_struct, hoping
 to restore the behavior of the original thresholding function.

 Also, workqueue_set_max_active use a mutex to protect workqueue_struct,
 which is not meant to be called too frequently, so a new interval
 mechanism is applied, that will only call workqueue_set_max_active after
 a count of work is queued. Hoping to balance both the random and
 sequence performance on HDD.

 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 Tested-by: David Sterba dste...@suse.cz
 ---
 Changelog:
 v2-v3:
   - Add thresholding mechanism to simulate the old thresholding mechanism.
   - Will not enable thresholding when thresh is set to small value.
 v3-v4:
   None
 v4-v5:
   None
 ---
  fs/btrfs/async-thread.c | 107 
 
  fs/btrfs/async-thread.h |   3 +-
  2 files changed, 101 insertions(+), 9 deletions(-)

 diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
 index 193c849..977bce2 100644
 --- a/fs/btrfs/async-thread.c
 +++ b/fs/btrfs/async-thread.c
 @@ -30,6 +30,9 @@
  #define WORK_ORDER_DONE_BIT 2
  #define WORK_HIGH_PRIO_BIT 3

 +#define NO_THRESHOLD (-1)
 +#define DFT_THRESHOLD (32)
 +
  /*
   * container for the kthread task pointer and the list of pending work
   * One of these is allocated per thread.
 @@ -737,6 +740,14 @@ struct __btrfs_workqueue_struct {

 /* Spinlock for ordered_list */
 spinlock_t list_lock;
 +
 +   /* Thresholding related variants */
 +   atomic_t pending;
 +   int max_active;
 +   int current_max;
 +   int thresh;
 +   unsigned int count;
 +   spinlock_t thres_lock;
  };

  struct btrfs_workqueue_struct {
 @@ -745,19 +756,34 @@ struct btrfs_workqueue_struct {
  };

  static inline struct __btrfs_workqueue_struct
 -*__btrfs_alloc_workqueue(char *name, int flags, int max_active)
 +*__btrfs_alloc_workqueue(char *name, int flags, int max_active, int thresh)
  {
 struct __btrfs_workqueue_struct *ret = kzalloc(sizeof(*ret), 
 GFP_NOFS);

 if (unlikely(!ret))
 return NULL;

 +   ret->max_active = max_active;
 +   atomic_set(&ret->pending, 0);
 +   if (thresh == 0)
 +   thresh = DFT_THRESHOLD;
 +   /* For low threshold, disabling threshold is a better choice */
 +   if (thresh < DFT_THRESHOLD) {
 +   ret->current_max = max_active;
 +   ret->thresh = NO_THRESHOLD;
 +   } else {
 +   ret->current_max = 1;
 +   ret->thresh = thresh;
 +   }
 +
 if (flags & WQ_HIGHPRI)
 ret->normal_wq = alloc_workqueue("%s-%s-high", flags,
 -max_active, "btrfs", name);
 +ret->max_active,
 +"btrfs", name);
 else
 ret->normal_wq = alloc_workqueue("%s-%s", flags,
 -max_active, "btrfs", name);
 +ret->max_active, "btrfs",
 +name);
Shouldn't we use ret->current_max instead of ret->max_active (in both calls)?
According to the rest of the code, "max_active" is the absolute
maximum beyond which the normal_wq cannot go (you use clamp_value to
ensure that). And "current_max" is the current value of "max_active"
of the normal_wq. But here, you set the normal_wq to max_active
immediately. Is this intentional?
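In other words, the suggested change (sketched here against the quoted patch, not a tested diff) would start the underlying workqueue at the thresholded value:

    if (flags & WQ_HIGHPRI)
            ret->normal_wq = alloc_workqueue("%s-%s-high", flags,
                                             ret->current_max,
                                             "btrfs", name);
    else
            ret->normal_wq = alloc_workqueue("%s-%s", flags,
                                             ret->current_max,
                                             "btrfs", name);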


 if (unlikely(!ret->normal_wq)) {
 kfree(ret);
 return NULL;
 @@ -765,6 +791,7 @@ static inline struct __btrfs_workqueue_struct

 INIT_LIST_HEAD(&ret->ordered_list);
 spin_lock_init(&ret->list_lock);
 +   spin_lock_init(&ret->thres_lock);
 return ret;
  }

 @@ -773,7 +800,8 @@ __btrfs_destroy_workqueue(struct __btrfs_workqueue_struct 
 *wq);

  struct btrfs_workqueue_struct *btrfs_alloc_workqueue(char *name,
  int flags,
 -int max_active)
 +int max_active,
 +int thresh)
  {
 struct btrfs_workqueue_struct *ret = kzalloc(sizeof(*ret), GFP_NOFS);

 @@ -781,14 +809,15 @@ struct btrfs_workqueue_struct 
 *btrfs_alloc_workqueue(char 

Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-21 Thread Alex Lyakas
 the commit thread needs to
compete on tree-block
locks with them (and they hold the locks because they also read tree
blocks from disk as it seems)

So my question is: shouldn't we be much more aggressive in
__btrfs_end_transaction, running delayed refs several times and
checking trans->delayed_ref_updates after each run, returning only when
this number is zero or small enough?
This way, when we trigger a commit, it will not have a lot of delayed
refs to run; it will get to the critical section very quickly, hopefully
pass it very quickly (reach TRANS_STATE_UNBLOCKED), and then we can
open a new transaction while the previous one is doing
btrfs_write_and_wait_transaction.
That's what I wanted to ask.
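A sketch of the kind of loop being proposed for __btrfs_end_transaction() (the cutoff value is arbitrary and the names follow the 3.18-era code):

    static void drain_delayed_refs(struct btrfs_trans_handle *trans,
                                   struct btrfs_root *root)
    {
            const unsigned long cutoff = 64;    /* "small enough" backlog */

            while (trans->delayed_ref_updates > cutoff) {
                    unsigned long count = trans->delayed_ref_updates;

                    /* running refs may queue more; re-check after each pass */
                    trans->delayed_ref_updates = 0;
                    btrfs_run_delayed_refs(trans, root, count);
            }
    }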

Thanks!
Alex.


[1] In my case, btrfs metadata is ~10GB and the machine has 8GB of
RAM. Due to this we need to read a lot of ebs from disk, as they are
not in the page cache. We also need to keep in mind that every COW of an eb
requires a new slot in the page cache, because we index by the "bytenr"
that we receive from the free-space cache, which is a logical
coordinate by which EXTENT_ITEMs are sorted in the extent tree.



On Mon, Jul 13, 2015 at 7:02 PM, Chris Mason c...@fb.com wrote:
 On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
 Filipe,
 Thanks for the explanation. Those reasons were not so obvious for me.

 Would it make sense not to COW the block in case-1, if we are mounted
 with notreelog? Or, perhaps, to check that the block does not belong
 to a log tree?


 Hi Alex,

 The crc rules are the most important, we have to make sure the block
 isn't changed while it is in flight.  Also, think about something like
 this:

 transaction writes block A, puts a pointer to it in the btree, generation Y

 hard disk properly completes the IO

 transaction rewrites block A, same generation Y

 hard disk drops the IO on the floor and never does it

 Later on, we try to read block A again.  We find it has the correct crc
 and the correct generation number, but the contents are actually wrong.

 The second case is more difficult. One problem is that
 BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
 due to memory pressure (this is what I see happening), we complete the
 writeback, release the extent buffer, and pages are evicted from the
 page cache of btree_inode. After some time we read the block again
 (because we want to modify it in the same transaction), but its header
 is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
 this point it should be safe to avoid COW, we will re-COW.

 Would it make sense to have some runtime-only mechanism to lock-out
 the write-back for an eb? I.e., if we know that eb is not under
 writeback, and writeback is locked out from starting, we can redirty
 the block without COW. Then we allow the writeback to start when it
 wants to.

 In one of my test runs, btrfs had 6.4GB of metadata (before
 raid-induced overhead), but during a particular transaction total of
 10GB of metadata (again, before raid-induced overhead) was written to
 disk. (This is the total of all ebs having
 header->generation == curr_transid, not only during commit of the
 transaction). This particular run was with notreelog.

 Machine had 8GB of RAM. Linux allows the btree_inode to grow its
 page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
 But even though the used amount of metadata is less than that, this
 re-COW'ing of already-COW'ed blocks seems to cause page-cache
 thrashing...

 Interesting.  We've addressed this in the past with changes to the
 writepage(s) callback for the btree, basically skipping memory pressure
 related writeback if there isn't that much dirty.  There is a lot of
 room to improve those decisions, like preferring to write leaves over
 nodes, especially full leaves that are not likely to change again.

 -chris
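For reference, the mechanism Chris mentions looks roughly like the following in the btree writepages callback (a sketch from memory of the mainline code; treat the details as approximate):

    static int btree_writepages(struct address_space *mapping,
                                struct writeback_control *wbc)
    {
            if (wbc->sync_mode == WB_SYNC_NONE) {
                    struct btrfs_fs_info *fs_info;

                    if (wbc->for_kupdate)
                            return 0;
                    fs_info = btrfs_sb(mapping->host->i_sb);
                    /* skip memory-pressure writeback while little is dirty */
                    if (percpu_counter_compare(&fs_info->dirty_metadata_bytes,
                                               BTRFS_DIRTY_METADATA_THRESH) < 0)
                            return 0;
            }
            return btree_write_cache_pages(mapping, wbc);
    }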


Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-13 Thread Alex Lyakas
Filipe,
Thanks for the explanation. Those reasons were not so obvious for me.

Would it make sense not to COW the block in case-1, if we are mounted
with notreelog? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and pages are evicted from the
page cache of btree_inode. After some time we read the block again
(because we want to modify it in the same transaction), but its header
is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock out
the write-back for an eb? I.e., if we know that the eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW. Then we allow the writeback to start when it
wants to.
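A sketch of what such a lock-out could look like (entirely hypothetical; none of these names exist in btrfs):

    struct eb_writeback_gate {
            struct rw_semaphore sem;  /* readers = redirtiers, writer = writeback */
    };

    /* redirty path: returns true if the eb may be modified in place */
    static bool eb_try_block_writeback(struct eb_writeback_gate *gate,
                                       bool under_writeback)
    {
            down_read(&gate->sem);
            if (under_writeback) {
                    /* writeback already started: too late, must COW */
                    up_read(&gate->sem);
                    return false;
            }
            return true;    /* redirty, then call eb_unblock_writeback() */
    }

    static void eb_unblock_writeback(struct eb_writeback_gate *gate)
    {
            up_read(&gate->sem);
    }

    /* writeback path: excluded while anyone is redirtying in place */
    static void eb_start_writeback(struct eb_writeback_gate *gate)
    {
            down_write(&gate->sem);
            /* set BTRFS_HEADER_FLAG_WRITTEN and submit the IO here */
            up_write(&gate->sem);
    }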

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (This is the total of all ebs having
header->generation == curr_transid, not only during commit of the
transaction). This particular run was with notreelog.

Machine had 8GB of RAM. Linux allows the btree_inode to grow its
page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
But even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.


On Mon, Jul 13, 2015 at 11:27 AM, Filipe David Manana
fdman...@gmail.com wrote:
 On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas a...@zadarastorage.com wrote:
 Greetings,
 Looking at the code of should_cow_block(), I see:

 if (btrfs_header_generation(buf) == trans->transid &&
    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
 ...
 So if the extent buffer has been written to disk, and now is changed again
 in the same transaction, we insist on COW'ing it. Can anybody explain why
 COW is needed in this case? The transaction has not committed yet, so what
 is the danger of rewriting to the same location on disk? My understanding
 was that a tree block needs to be COW'ed at most once in the same
 transaction. But I see that this is not the case.

 That logic is there, as far as I can see, for at least 2 obvious reasons:

 1) fsync/log trees. All extent buffers (tree blocks) of a log tree
 have the same transaction id/generation, and you can have multiple
 fsyncs (log transaction commits) per transaction so you need to ensure
 consistency. If we skipped the COWing in the example below, you would
 get an inconsistent log tree at log replay time when the fs is
 mounted:

 transaction N start

fsync inode A start
creates tree block X
flush X to disk
write a new superblock
fsync inode A end

fsync inode B start
skip COW of X because its generation == current transaction id and
 modify it in place
flush X to disk

 == crash ===

write a new superblock
fsync inode B end

 transaction N commit

 2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
 written to disk but instead when we trigger writeback for it. So while
 the writeback is ongoing we want to make sure the block's content
 isn't concurrently modified (we don't keep the eb write locked to
 allow concurrent reads during the writeback).

 All tree blocks that don't belong to a log tree are normally written
 only when at the end of a transaction commit. But often, due to memory
 pressure for e.g., the VM can call the writepages() callback of the
 btree inode to force dirty tree blocks to be written to disk before
 the transaction commit.


 I am asking because I am doing some profiling of btrfs metadata work under
 heavy loads, and I see that sometimes btrfs COW's almost twice as many tree
 blocks as the total metadata size.

 Thanks,
 Alex.




 --
 Filipe David Manana,

 Reasonable men adapt themselves to the world.
  Unreasonable men adapt the world to themselves.
  That's why all progress depends on unreasonable men.


question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN

2015-07-12 Thread Alex Lyakas

Greetings,
Looking at the code of should_cow_block(), I see:

if (btrfs_header_generation(buf) == trans->transid &&
   !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
...
So if the extent buffer has been written to disk, and now is changed again 
in the same transaction, we insist on COW'ing it. Can anybody explain why 
COW is needed in this case? The transaction has not committed yet, so what 
is the danger of rewriting to the same location on disk? My understanding 
was that a tree block needs to be COW'ed at most once in the same 
transaction. But I see that this is not the case.


I am asking because I am doing some profiling of btrfs metadata work under 
heavy loads, and I see that sometimes btrfs COW's almost twice as many tree 
blocks as the total metadata size.


Thanks,
Alex.



Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible

2014-12-24 Thread Alex Lyakas
Hi Qu,

On Wed, Dec 24, 2014 at 3:09 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

 -------- Original Message --------
 Subject: Re: [PATCH] btrfs-progs: rebuild missing block group during chunk
 recovery if possible
 From: Alex Lyakas alex.bt...@zadarastorage.com
 To: Qu Wenruo quwen...@cn.fujitsu.com
 Date: 2014年12月24日 00:49

 Hi Qu,

 On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo quwen...@cn.fujitsu.com
 wrote:

 [snipped]
 +
 +static int __insert_block_group(struct btrfs_trans_handle *trans,
 +   struct chunk_record *chunk_rec,
 +   struct btrfs_root *extent_root,
 +   u64 used)
 +{
 +   struct btrfs_block_group_item bg_item;
 +   struct btrfs_key key;
 +   int ret = 0;
 +
 +   btrfs_set_block_group_used(&bg_item, used);
 +   btrfs_set_block_group_chunk_objectid(&bg_item, used);

 This looks like a bug. Instead of "used", I think it should be
 BTRFS_FIRST_CHUNK_TREE_OBJECTID.

 Oh, my mistake, BTRFS_FIRST_CHUNK_TREE_OBJECTID is right.
 Thanks for pointing out this.


 [snipped]
 --
 2.1.2

 Couple of questions:
 # In remove_chunk_extent_item, should we also consider "rebuild"
 chunks now? It can happen that a rebuild chunk is a SYSTEM chunk.
 Should we try to handle it as well?

 Not quite sure about the meaning of "rebuild" here.
 The chunk-recovery has the rebuild_chunk_tree() function to rebuild the
 whole chunk tree with
 the good/repaired chunks we found.

 # Same question for rebuild_sys_array. Should we also consider
 rebuild chunks?

 The chunk-recovery has rebuild_sys_array() to handle SYSTEM chunk too.

I meant that with this patch you have added rebuild_chunks list:
struct list_head good_chunks;
struct list_head bad_chunks;
struct list_head rebuild_chunks;   <--- you added this
struct list_head unrepaired_chunks;


These are chunks that have no block-group record, but we are confident
that we can rebuild the block-group records for these chunks by
scanning all EXTENT_ITEMs in the block-group range and calculating the
"used" value for the block-group. If we fail, we just set
used == block-group size. My question is: should we now consider these
rebuild_chunks the same as good_chunks? I.e., should we also consider
these chunks in the following functions:
- remove_chunk_extent_item: probably no, because we need the
EXTENT_ITEMs to recalculate the used value
- rebuild_sys_array: if it happens that a rebuild_chunk is also a
SYSTEM chunk, should we add it to the sys_chunk_array too? (In
addition to good_chunks).

Thanks,
Alex.


 Thanks,
 Qu


 Thanks,
 Alex.





Re: How btrfs-find-root knows that the block is actually a root?

2014-12-23 Thread Alex Lyakas
Hi Qu,

On Tue, Dec 23, 2014 at 7:27 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

 -------- Original Message --------
 Subject: How btrfs-find-root knows that the block is actually a root?
 From: Alex Lyakas alex.bt...@zadarastorage.com
 To: linux-btrfs linux-btrfs@vger.kernel.org
 Date: 2014年12月22日 22:57

 Greetings,

 I am looking at the code of search_iobuf() in
 btrfs-find-root.c (3.17.3). I see that we probe nodesize blocks one by
 one, and for each block we check:
 - its owner is what we are looking for
 - its header->bytenr is what we are looking at currently
 - its level is not too small
 - it has valid checksum
 - it has the desired generation

 If all those conditions are true, we declare this block as a root and
 end the program.

 How do we actually know that it's a root and not a leaf or an
 intermediate node? What if we are searching for a root of the root
 tree, which has one node and two leafs (all have the same highest
 transid), and one of the leafs has logical lower than the actual
 root, i.e., it comes first in our scan. Then we will declare this leaf
 as a root, won't we? Or somehow the root always has the lowest
 logical?

 You can refer to this patch:
 https://patchwork.kernel.org/patch/5285521/
I see that this has not been applied to any of David's branches. Do
you have a repo to look at this code in its entirety?


 Your questions are mostly right.
 The best method would be to search through all the metadata, and only the
 highest-level header for
 a given generation may be the root for that generation.

 But that method still has some problems.
 1) Overwritten old node/leaf
 As btrfs metadata COW happens, an old node/leaf may be overwritten and become
 incomplete,
 so the above method won't always work as expected.

 2) Corrupted fs
 That will make everything not work as expected.
 But sadly, when someone needs to use btrfs-find-root, there is a high
 possibility the fs is already corrupted.

 3) Slow speed
 It needs to scan over all the sectors of the metadata chunks, which may vary
 from megabytes to terabytes,
 making a complete scan impractical.
 So the current find-root uses a trade-off: if it finds a header at the position
 the superblock points to, and the generation
 matches, it just considers it the desired root and exits.
I think this is a bit optimistic. What if the root tree has several
leaves having the same generation as the root? Then we might declare a
leaf as a root and exit. But further recovery based on that output
will get us into trouble.



 Also, I am confused by this line:
 level = h_level;
 This means that if we encounter a block that seems good, we will
 skip all other blocks that have lower level. Is this intended?

 This is intended, for the case where the user already knows the root's level, so it will
 skip any header whose level is below it.
But this line is performed before the generation check. Let's say that
the user did not specify any level (so search_level == 0). Then assume we
encounter a block which has a lower generation than what we need, but a
higher level. At this point, we do
level = h_level;
and we will skip any blocks lower than this level from now on. What if
the root tree got shrunk (due to subvolume deletion, for example),
and the "good" root has a lower level? We will skip it then, and will
not find the root.
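Sketched as code, the concern is only about the order of the two checks; letting only blocks of the wanted generation raise the level bar would avoid it (illustrative, not the actual search_iobuf() code):

    /* decide whether a probed header can still be the root we want */
    static bool probe_block(u8 h_level, u64 h_gen, u8 *level, u64 want_gen)
    {
            if (h_gen != want_gen)      /* check the generation first ... */
                    return false;
            if (h_level < *level)       /* ... then apply the level filter */
                    return false;
            *level = h_level;           /* raise the bar only on a real match */
            return true;
    }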

Thanks for your comments,
Alex.



 Thanks,
 Qu


 Thanks,
 Alex.


Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible

2014-12-23 Thread Alex Lyakas
Hi Qu,

On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
 Before the patch, a chunk will be considered bad if the corresponding
 block group is missing, even if the only uncertain data is the 'used'
 member of the block group.

 This patch will try to recalculate the 'used' value of the block group
 and rebuild it.
 So even if only the chunk item and dev extent item are found, the chunk can be
 recovered.
 Although if the extent tree is damaged and the needed extent item can't be
 read, the block group's 'used' value will be the block group length, to
 prevent any later write/block reserve damaging the block group.
 In that case, we will prompt the user and recommend they use
 '--init-extent-tree' to rebuild the extent tree if possible.

 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
  btrfsck.h   |   3 +-
  chunk-recover.c | 242 
 +---
  cmds-check.c|  29 ---
  3 files changed, 234 insertions(+), 40 deletions(-)

 diff --git a/btrfsck.h b/btrfsck.h
 index 356c767..7a50648 100644
 --- a/btrfsck.h
 +++ b/btrfsck.h
 @@ -179,5 +179,6 @@ btrfs_new_device_extent_record(struct extent_buffer *leaf,
  int check_chunks(struct cache_tree *chunk_cache,
  struct block_group_tree *block_group_cache,
  struct device_extent_tree *dev_extent_cache,
 -struct list_head *good, struct list_head *bad, int silent);
 +struct list_head *good, struct list_head *bad,
 +struct list_head *rebuild, int silent);
  #endif
 diff --git a/chunk-recover.c b/chunk-recover.c
 index 6f43066..dbf98b5 100644
 --- a/chunk-recover.c
 +++ b/chunk-recover.c
 @@ -61,6 +61,7 @@ struct recover_control {

 struct list_head good_chunks;
 struct list_head bad_chunks;
 +   struct list_head rebuild_chunks;
 struct list_head unrepaired_chunks;
 pthread_mutex_t rc_lock;
  };
 @@ -203,6 +204,7 @@ static void init_recover_control(struct recover_control 
 *rc, int verbose,

 INIT_LIST_HEAD(&rc->good_chunks);
 INIT_LIST_HEAD(&rc->bad_chunks);
 +   INIT_LIST_HEAD(&rc->rebuild_chunks);
 INIT_LIST_HEAD(&rc->unrepaired_chunks);

 rc->verbose = verbose;
 @@ -529,22 +531,32 @@ static void print_check_result(struct recover_control 
 *rc)
 return;

 printf("CHECK RESULT:\n");
 -   printf("Healthy Chunks:\n");
 +   printf("Recoverable Chunks:\n");
 list_for_each_entry(chunk, &rc->good_chunks, list) {
 print_chunk_info(chunk, "  ");
 good++;
 total++;
 }
 -   printf("Bad Chunks:\n");
 +   list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
 +   print_chunk_info(chunk, "  ");
 +   good++;
 +   total++;
 +   }
 +   list_for_each_entry(chunk, &rc->unrepaired_chunks, list) {
 +   print_chunk_info(chunk, "  ");
 +   good++;
 +   total++;
 +   }
 +   printf("Unrecoverable Chunks:\n");
 list_for_each_entry(chunk, &rc->bad_chunks, list) {
 print_chunk_info(chunk, "  ");
 bad++;
 total++;
 }
 printf("\n");
 -   printf("Total Chunks:\t%d\n", total);
 -   printf("  Heathy:\t%d\n", good);
 -   printf("  Bad:\t%d\n", bad);
 +   printf("Total Chunks:\t\t%d\n", total);
 +   printf("  Recoverable:\t\t%d\n", good);
 +   printf("  Unrecoverable:\t%d\n", bad);

 printf("\n");
 printf("Orphan Block Groups:\n");
 @@ -555,6 +567,7 @@ static void print_check_result(struct recover_control *rc)
 printf("Orphan Device Extents:\n");
 list_for_each_entry(devext, &rc->devext.no_chunk_orphans, chunk_list)
 print_device_extent_info(devext, "  ");
 +   printf("\n");
  }

  static int check_chunk_by_metadata(struct recover_control *rc,
 @@ -938,6 +951,11 @@ static int build_device_maps_by_chunk_records(struct 
 recover_control *rc,
 if (ret)
 return ret;
 }
 +   list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
 +   ret = build_device_map_by_chunk_record(root, chunk);
 +   if (ret)
 +   return ret;
 +   }
 return ret;
  }

 @@ -1168,12 +1186,31 @@ static int __rebuild_device_items(struct 
 btrfs_trans_handle *trans,
 return ret;
  }

 +static int __insert_chunk_item(struct btrfs_trans_handle *trans,
 +   struct chunk_record *chunk_rec,
 +   struct btrfs_root *chunk_root)
 +{
 +   struct btrfs_key key;
 +   struct btrfs_chunk *chunk = NULL;
 +   int ret = 0;
 +
 +   chunk = create_chunk_item(chunk_rec);
 +   if (!chunk)
 +   return -ENOMEM;
 +   key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
 +   key.type = BTRFS_CHUNK_ITEM_KEY;
 +   key.offset = chunk_rec->offset;
 +
 +   ret = 

How btrfs-find-root knows that the block is actually a root?

2014-12-22 Thread Alex Lyakas
Greetings,

I am looking at the code of search_iobuf() in
btrfs-find-root.c (3.17.3). I see that we probe nodesize blocks one by
one, and for each block we check:
- its owner is what we are looking for
- its header->bytenr is what we are looking at currently
- its level is not too small
- it has valid checksum
- it has the desired generation

If all those conditions are true, we declare this block as a root and
end the program.

How do we actually know that it's a root and not a leaf or an
intermediate node? What if we are searching for a root of the root
tree, which has one node and two leafs (all have the same highest
transid), and one of the leafs has a lower logical than the actual
root, i.e., it comes first in our scan? Then we will declare this leaf
as a root, won't we? Or does the root somehow always have the lowest
logical?

Also, I am confused by this line:
level = h_level;
This means that if we encounter a block that seems good, we will
skip all other blocks that have lower level. Is this intended?

Thanks,
Alex.


Re: [PATCH] Btrfs: update commit root on snapshot creation after orphan cleanup

2014-08-02 Thread Alex Lyakas
Hi Filipe,
Thank you for the explanation.
I understand that without your patch we return to user-space after
deleting the orphans, but leaving the transaction open. So user-space
sees the snapshot and can start send. With your patch, we return to
user-space after the orphan cleanup has been committed, unless we crash
in the middle, as you pointed out.

I will also look at the new patch.

Thanks!
Alex.




On Thu, Jul 31, 2014 at 3:41 PM, Filipe David Manana fdman...@gmail.com wrote:
 On Mon, Jul 28, 2014 at 6:31 PM, Filipe David Manana fdman...@gmail.com 
 wrote:
 On Sat, Jul 19, 2014 at 8:11 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hi Filipe,
 It's quite possible I don't fully understand the issue. It seems that
 we are creating a read-only snapshot, commit a transaction, and then
 go and modify the snapshot once again, by deleting all the
 ORPHAN_ITEMs we have in its file tree (btrfs_orphan_cleanup).
 Shouldn't all this be part of snapshot creation, so that after we
 commit, we have a clean file tree with no orphans there? (not sure if
 this makes sense though).

 With your patch we do this additional commit after the cleanup. But
 nothing prevents send from starting before this additional commit,
 correct? And it would still see the orphans through the commit root.
 You say that it is not a problem, but I am not sure why (probably I am
 missing something here). So for me it looks like your patch closes a
 race window significantly (at the cost of an additional commit), but
 does not close it fully.

 Hi Alex,

 That's right, after the transaction commit finishes, the snapshot will
 be visible and accessible to user space - so someone may start a send
 before the orphan cleanup starts. It was ok only for the serialized
 case (create snapshot, wait for ioctl to return, call send ioctl).

 Actually no. If after the 1st transaction commit (the one that creates
 the snapshot and makes it visible to user space) and before the orphan
 cleanup is called another task attempts to use the snapshot for a send
 operation, it will block when doing the snapshot dentry lookup -
 because both tasks acquire the parent inode's mutex (implicitly
 through the vfs and explicitly via the snapshot/subvol ioctl entry
 point).

 Nevertheless, it's better to move the commit root switch part to the
 dentry lookup function (as the new patch does), since after the first
 transaction commit and before the second one commits, a reboot might
 happen, and after that we would get into the same issue until the
 first transaction commit happens after the reboot. I'll update the new
 patch's comment.

 thanks



 But most important: perhaps send should look for ORPHAN_ITEMs and
 treat those inodes as deleted?

 There are other cases were orphans can exist, like for file truncates
 for example, where ignoring the inode wouldn't be very correct.
 Tried that approach initially, but it's actually more complex to
 implement and adds some additional overhead (tree searches - and the
 orphan items are normally too far from the inode items, due to a very
 high objectid (-5ULL)).

 I've reworked this with a different approach and CC'ed you
 (https://patchwork.kernel.org/patch/4635471/).

 thanks


 Thanks,
 Alex.



 On Tue, Jun 3, 2014 at 2:41 PM, Filipe David Borba Manana
 fdman...@gmail.com wrote:
 On snapshot creation (either writable or read-only), we do orphan cleanup
 against the root of the snapshot. If the cleanup did remove any orphans,
 then the current root node will be different from the commit root node
 until the next transaction commit happens.

 A send operation always uses the commit root of a snapshot - this means
 it will see the orphans if it starts computing the send stream before the
 next transaction commit happens (triggered e.g. by a timer or sync()),
 which is when the commit root gets assigned a reference to current root,
 where the orphans are not visible anymore. The consequence of send seeing
 the orphans is explained below.

 For example:

 mkfs.btrfs -f /dev/sdd
 mount -o commit=999 /dev/sdd /mnt

 # open a file with O_TMPFILE and leave it open
 # write some data to the file
 btrfs subvolume snapshot -r /mnt /mnt/snap1

 btrfs send /mnt/snap1 -f /tmp/send.data

 The send operation will fail with the following error:

 ERROR: send ioctl failed with -116: Stale file handle

 What happens here is that our snapshot has an orphan inode still visible
 through the commit root, that corresponds to the tmpfile. However send
 will attempt to call inode.c:btrfs_iget(), with the goal of reading the
 file's data, which will return -ESTALE because it will use the current
 root (and not the commit root) of the snapshot.

 Of course, there are other cases where we can get orphans, but this
 example using a tmpfile makes it much easier to reproduce the issue.
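
 As an aside, the O_TMPFILE step above has no plain shell equivalent; a
 minimal C sketch of it, assuming a 3.11+ kernel with O_TMPFILE support
 (an illustration only, not part of the original report), could be:

 /* Create an unlinked (orphan) inode on the mounted fs and keep it
  * open while the snapshot and send commands above are run. */
 #define _GNU_SOURCE
 #include <fcntl.h>
 #include <stdio.h>
 #include <unistd.h>

 int main(void)
 {
 	int fd = open("/mnt", O_TMPFILE | O_RDWR, 0600);

 	if (fd < 0) {
 		perror("open(O_TMPFILE)");
 		return 1;
 	}
 	/* write some data so the orphan inode has extents */
 	if (write(fd, "hello", 5) != 5)
 		perror("write");
 	pause();	/* keep the fd open; run snapshot + send now */
 	close(fd);
 	return 0;
 }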

 Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
 the commit root is different from the current root, just commit

Re: [PATCH] Btrfs: update commit root on snapshot creation after orphan cleanup

2014-07-19 Thread Alex Lyakas
Hi Filipe,
It's quite possible I don't fully understand the issue. It seems that
we are creating a read-only snapshot, commit a transaction, and then
go and modify the snapshot once again, by deleting all the
ORPHAN_ITEMs we have in its file tree (btrfs_orphan_cleanup).
Shouldn't all this be part of snapshot creation, so that after we
commit, we have a clean file tree with no orphans there? (not sure if
this makes sense though).

With your patch we do this additional commit after the cleanup. But
nothing prevents send from starting before this additional commit,
correct? And it would still see the orphans through the commit root.
You say that it is not a problem, but I am not sure why (probably I am
missing something here). So for me it looks like your patch closes a
race window significantly (at the cost of an additional commit), but
does not close it fully.

But most important: perhaps send should look for ORPHAN_ITEMs and
treat those inodes as deleted?

Thanks,
Alex.



On Tue, Jun 3, 2014 at 2:41 PM, Filipe David Borba Manana
fdman...@gmail.com wrote:
 On snapshot creation (either writable or read-only), we do orphan cleanup
 against the root of the snapshot. If the cleanup did remove any orphans,
 then the current root node will be different from the commit root node
 until the next transaction commit happens.

 A send operation always uses the commit root of a snapshot - this means
 it will see the orphans if it starts computing the send stream before the
 next transaction commit happens (triggered e.g. by a timer or sync()),
 which is when the commit root gets assigned a reference to current root,
 where the orphans are not visible anymore. The consequence of send seeing
 the orphans is explained below.

 For example:

 mkfs.btrfs -f /dev/sdd
 mount -o commit=999 /dev/sdd /mnt

 # open a file with O_TMPFILE and leave it open
 # write some data to the file
 btrfs subvolume snapshot -r /mnt /mnt/snap1

 btrfs send /mnt/snap1 -f /tmp/send.data

 The send operation will fail with the following error:

 ERROR: send ioctl failed with -116: Stale file handle

 What happens here is that our snapshot has an orphan inode still visible
 through the commit root, that corresponds to the tmpfile. However send
 will attempt to call inode.c:btrfs_iget(), with the goal of reading the
 file's data, which will return -ESTALE because it will use the current
 root (and not the commit root) of the snapshot.

 Of course, there are other cases where we can get orphans, but this
 example using a tmpfile makes it much easier to reproduce the issue.

 Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if
 the commit root is different from the current root, just commit the
 transaction associated with the snapshot's root (if it exists), so that
 a send will not see any orphans that don't exist anymore. This also
 guarantees a send will always see the same content regardless of whether
 a transaction commit happened already before the send was requested and
 after the orphan cleanup (meaning the commit root and current roots are
 the same) or it hasn't happened yet (commit and current roots are
 different).

 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---
  fs/btrfs/ioctl.c | 29 +
  1 file changed, 29 insertions(+)

 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 95194a9..6680ad9 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
@@ -712,6 +712,35 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
 if (ret)
 goto fail;

 +   /*
 +* If orphan cleanup did remove any orphans, it means the tree was
 +* modified and therefore the commit root is not the same as the
 +* current root anymore. This is a problem, because send uses the
 +* commit root and therefore can see inode items that don't exist
 +* in the current root anymore, and for example make calls to
 +* btrfs_iget, which will do tree lookups based on the current root
 +* and not on the commit root. Those lookups will fail, returning a
 +* -ESTALE error, and making send fail with that error. So make sure
 +* a send does not see any orphans we have just removed, and that it
 +* will see the same inodes regardless of whether a transaction
 +* commit happened before it started (meaning that the commit root
 +* will be the same as the current root) or not.
 +*/
+	if (readonly && pending_snapshot->snap->node !=
+	    pending_snapshot->snap->commit_root) {
+		trans = btrfs_join_transaction(pending_snapshot->snap);
+		if (IS_ERR(trans) && PTR_ERR(trans) != -ENOENT) {
+			ret = PTR_ERR(trans);
+			goto fail;
+		}
+		if (!IS_ERR(trans)) {
+			ret = btrfs_commit_transaction(trans,
+

Re: Snapshot aware defrag and qgroups thoughts

2014-06-19 Thread Alex Lyakas
Hi Josef,
thanks for the detailed description of how the extent tree works!
When I was digging through that in the past, I made some slides to
remember all the call chains. Maybe somebody finds that useful to
accompany your notes.
https://drive.google.com/file/d/0ByBy89zr3kJNNmM5OG5wXzQ3LUE/edit?usp=sharing

Thanks,
Alex.


On Mon, Apr 21, 2014 at 5:55 PM, Josef Bacik jba...@fb.com wrote:
 We have a big problem, but it involves a lot of moving parts, so I'm going
 to explain all of the parts, and then the problem, and then what I am doing
 to fix the problem.  I want you guys to check my work to make sure I'm not
 missing something, so when I come back from paternity leave in a few weeks
 I can just sit down and finish this work out.

 = Extent refs ===

 This is basically how extent refs work.  You have either

 key.objectid = bytenr;
 key.type = BTRFS_EXTENT_ITEM_KEY;
 key.offset = length;

 or you have

 key.objectid = bytenr;
 key.type = BTRFS_METADATA_ITEM_KEY;
 key.offset = level of the metadata block;

 in the case of skinny metadata.  Then you have the extent item, which
 describes the number of refs and such, followed by 1 or more inline refs.
 All you need to know for this problem is how I'm going to describe them.
 What I call a normal ref or a full ref is a reference that has the actual
 root information in the ref.  What I call a shared ref is one where we only
 know the tree block that owns the particular ref.  So how does this work in
 practice?

 1) Normal allocation - metadata:  We allocate a tree block as we add new
 items to a tree.  We know that this root owns this tree block so we create
 a normal ref with the root objectid in the extent ref.  We also set the
 owner of the block itself to our objectid.  This is important to keep in
 mind.

 2) Normal allocation - data: We allocate some data for a given fs tree and
 we add an extent ref with the root objectid of the tree we are in, the
 inode number, and the logical offset into the inode for this inode.

 3) Splitting a data extent: We write to the middle of an existing extent.
 We will split this extent into two BTRFS_EXTENT_DATA_KEY items and then
 increase the ref count of the original extent by 1.  This means we look up
 the extent ref for root->objectid, inode number and the _original_ inode
 offset.  We don't create another extent ref; this is important to keep in
 mind.

 = btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref =

 But Josef, didn't you say there were shared refs?  Why yes I did, but I
 need to explain it in the context of the people who actually do the dirty
 work.  We'll start with the easy case:

 1) btrfs_copy_root - where snapshots start:  When we make a snapshot we
 call this function, which allocates a completely new block with a new root
 objectid and then memcpy's the original root we are snapshotting.  Then we
 call btrfs_inc_ref on our new buffer, which will walk all items in that
 buffer and add a new normal ref to each of those blocks for our new root.
 This is only at the level below the new root, nothing below that point.

 2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots: These guys are
 responsible for dealing with the particular action we want to take on our
 given buffer.  So if we are free'ing our buffer, we need to drop any refs
 it has to the blocks it points to.  For level > 0 this means modifying refs
 for all of the tree blocks it points to.  For level == 0 this means
 modifying refs for any data extents the leaf may point to.

 3) update_ref_for_cow - this is where the magic happens:  This has a few
 different modes of operation, but every operation means we check to see if
 the block is shared, i.e., we see if we have been snapshotted and, if we
 have been, whether this block has changed since we snapshotted.  If it is
 shared then we look up the extent refs and the flags.  If not then we carry
 on.  From here we have a few options.

 3a) Not shared: Don't do anything, we can do our normal cow operations and
 carry
 on.

 3b) Shared and cowing from the owning root: This is where
 btrfs_header_owner() is important.  If we owned this block and it is shared
 then we know that any of the upper levels won't have a normal ref to
 anything underneath this block, so we need to add a shared ref for anything
 this block points to.  So the first thing we do is btrfs_inc_ref(), but we
 set the full backref flag.  This means that when we add refs for everything
 this block points to, we don't use a root objectid, we use the bytenr of
 this block.  Then we set BTRFS_BLOCK_FLAG_FULL_BACKREF in the extent flags
 for this given block.

 3c) Shared and cowing from not the owning root: So if we are cowing down
 from the snapshot, we need to make sure that any block we own completely
 ourselves has normal refs for any blocks it points to.  So we cow down and
 hit a shared block that we aren't the owner of, so we do btrfs_inc_ref()
 for our block and
 

Re: safe/necessary to balance system chunks?

2014-06-19 Thread Alex Lyakas
On Fri, Apr 25, 2014 at 10:14 PM, Hugo Mills h...@carfax.org.uk wrote:
 On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
 On 2014-04-25 13:24, Chris Murphy wrote:
 
  On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote:
 
 
  Hi list,
 
  I've got a 3-device RAID1 btrfs filesystem that started out life as 
  single-device.
 
  btrfs fi df:
 
  Data, RAID1: total=1.31TiB, used=1.07TiB
  System, RAID1: total=32.00MiB, used=224.00KiB
  System, DUP: total=32.00MiB, used=32.00KiB
  System, single: total=4.00MiB, used=0.00
  Metadata, RAID1: total=66.00GiB, used=2.97GiB
 
  This still lists some system chunks as DUP, and not as RAID1.  Does this 
  mean that if one device were to fail, some system chunks would be 
  unrecoverable?  How bad would that be?
 
  Since it's system type, it might mean the whole volume is toast if the 
  drive containing those 32KB dies. I'm not sure what kind of information is 
  in system chunk type, but I'd expect it's important enough that if 
  unavailable that mounting the file system may be difficult or impossible. 
  Perhaps btrfs restore would still work?
 
  Anyway, it's probably a high penalty for losing only 32KB of data.  I 
  think this could use some testing to try and reproduce conversions where 
  some amount of system or metadata type chunks are stuck in DUP. This 
  has come up before on the list but I'm not sure how it's happening, as 
  I've never encountered it.
 
 As far as I understand it, the system chunks are THE root chunk tree for
 the entire system, that is to say, it's the tree of tree roots that is
 pointed to by the superblock. (I would love to know if this
 understanding is wrong).  Thus losing that data almost always means
 losing the whole filesystem.

From a conversation I had with cmason a while ago, the System
 chunks contain the chunk tree. They're special because *everything* in
 the filesystem -- including the locations of all the trees, including
 the chunk tree and the roots tree -- is positioned in terms of the
 internal virtual address space. Therefore, when starting up the FS,
 you can read the superblock (which is at a known position on each
 device), which tells you the virtual address of the other trees... and
 you still need to find out where that really is.

The superblock has (I think) a list of physical block addresses at
 the end of it (sys_chunk_array), which allows you to find the blocks
 for the chunk tree and work out this mapping, which allows you to find
 everything else. I'm not 100% certain of the actual format of that
 array -- it's declared as u8 [2048], so I'm guessing there's a load of
 casting to something useful going on in the code somewhere.
The format is just a list of pairs:
struct btrfs_disk_key,  struct btrfs_chunk
struct btrfs_disk_key,  struct btrfs_chunk
...

For each SYSTEM block-group (btrfs_chunk), we need one entry in the
sys_chunk_array. During mkfs the first SYSTEM block group is created;
for me it's 4MB. Only if the whole chunk tree grows over 4MB do we
need to create an additional SYSTEM block group, and then we need to
have a second entry in the sys_chunk_array. And so on.
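
To make the pair layout concrete, here is a sketch of walking that
array. The struct definitions are simplified approximations of the
on-disk format (ctree.h is authoritative); note that the real
btrfs_chunk embeds its first stripe, so the size of each pair depends
on num_stripes:

#include <stdint.h>

struct disk_key { uint64_t objectid; uint8_t type; uint64_t offset; } __attribute__((packed));
struct stripe { uint64_t devid; uint64_t offset; uint8_t dev_uuid[16]; } __attribute__((packed));
struct chunk {
	uint64_t length;		/* chunk size in bytes */
	uint64_t owner;
	uint64_t stripe_len;
	uint64_t type;			/* SYSTEM/DATA/METADATA + raid bits */
	uint32_t io_align;
	uint32_t io_width;
	uint32_t sector_size;
	uint16_t num_stripes;
	uint16_t sub_stripes;
	struct stripe first_stripe;	/* stripes 2..n follow in-line */
} __attribute__((packed));

static void walk_sys_chunk_array(const uint8_t *array, uint32_t size)
{
	uint32_t off = 0;

	while (off + sizeof(struct disk_key) + sizeof(struct chunk) <= size) {
		const struct disk_key *key = (const void *)(array + off);
		const struct chunk *c =
			(const void *)(array + off + sizeof(*key));

		/* key->offset is the chunk's logical start address; the
		 * stripes describe where it physically lives */
		off += sizeof(*key) + sizeof(*c) +
		       (c->num_stripes - 1) * sizeof(struct stripe);
		(void)key;
	}
}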

Alex.



Hugo.

 --
 === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
   PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
 --- Is it still called an affair if I'm sleeping with my wife ---
 behind her lover's back?


Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv

2014-06-18 Thread Alex Lyakas
Hi Filipe,
I finally got to debug this deeper. As it turns out, this happens only
if both nospace_cache and clear_cache are specified. You need to
unmount and mount again to cause this. After mounting, due to
clear_cache, all the block-groups are marked as BTRFS_DC_CLEAR, and
then cache_save_setup() is called on them (this function is called
only in case of BTRFS_DC_CLEAR). So cache_save_setup() goes ahead and
creates the free-space inode. But then it realizes that it was mounted
with nospace_cache, so it does not put any content in the inode. But
the inode itself gets created. The patch that fixes this for me:


alex@ubuntu-alex:/mnt/work/alex/linux-stable/source$ git diff -U10
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d170412..06f876e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2941,20 +2941,26 @@ again:
 		goto out;
 	}

 	if (IS_ERR(inode)) {
 		BUG_ON(retries);
 		retries++;

 		if (block_group->ro)
 			goto out_free;

+		/* with nospace_cache avoid creating the free-space inode */
+		if (!btrfs_test_opt(root, SPACE_CACHE)) {
+			dcs = BTRFS_DC_WRITTEN;
+			goto out_free;
+		}
+
 		ret = create_free_space_inode(root, trans, block_group, path);
 		if (ret)
 			goto out_free;
 		goto again;
 	}

 	/* We've already setup this transaction, go ahead and exit */
 	if (block_group->cache_generation == trans->transid &&
 	    i_size_read(inode)) {
 		dcs = BTRFS_DC_SETUP;



Thanks,
Alex.


On Wed, Nov 6, 2013 at 3:19 PM, Filipe David Manana fdman...@gmail.com wrote:
 On Mon, Nov 4, 2013 at 12:16 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hi Filipe,
 any luck with this patch?:)

 Hey Alex,

 I haven't digged further, but I remember I couldn't reproduce your
 issue (with latest btrfs-next of that day) of getting the free space
 inodes created even when mount option nospace_cache is given.

 What kernel were you using?


 Alex.

 On Wed, Oct 23, 2013 at 5:26 PM, Filipe David Manana fdman...@gmail.com 
 wrote:
 On Wed, Oct 23, 2013 at 3:14 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hello,

 On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana fdman...@gmail.com 
 wrote:
 On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hi Filipe,


 On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana
 fdman...@gmail.com wrote:

 This issue is simple to reproduce and observe if kmemleak is enabled.
 Two simple ways to reproduce it:

 ** 1

 $ mkfs.btrfs -f /dev/loop0
 $ mount /dev/loop0 /mnt/btrfs
 $ btrfs balance start /mnt/btrfs
 $ umount /mnt/btrfs

 So here it seems that the leak can only happen in case the block-group
 has a free-space inode. This is what the orphan item is added for.
 Yes, here kmemleak reports.
 But: if the space_cache option is disabled (and nospace_cache enabled),
 it seems that btrfs still creates the FREE_SPACE inodes, although they
 are empty, because in cache_save_setup():

 inode = lookup_free_space_inode(root, block_group, path);
 if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
 	ret = PTR_ERR(inode);
 	btrfs_release_path(path);
 	goto out;
 }

 if (IS_ERR(inode)) {
 ...
 ret = create_free_space_inode(root, trans, block_group, path);

 and only later it actually sets BTRFS_DC_WRITTEN if space_cache option
 is disabled. Amazing!
 Although this is a different issue, do you know perhaps why these
 empty inodes are needed?

 Don't know if they are needed. But you have a point, it seems odd to
 create the free space cache inode if mount option nospace_cache was
 supplied. Thanks Alex. Testing the following patch:

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index c43ee8a..eb1b7da 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3162,6 +3162,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 int retries = 0;
 int ret = 0;

 +   if (!btrfs_test_opt(root, SPACE_CACHE))
 +   return 0;
 +
 /*
  * If this block group is smaller than 100 megs don't bother caching the
  * block group.



 Thanks!
 Alex.




 ** 2

 $ mkfs.btrfs -f /dev/loop0
 $ mount /dev/loop0 /mnt/btrfs
 $ touch /mnt/btrfs/foobar
 $ rm -f /mnt/btrfs/foobar
 $ umount /mnt/btrfs


 I tried the second repro script on kernel 3.8.13, and kmemleak does
 not report a leak (even if I force the kmemleak scan). I did not try
 the balance-repro script, though. Am I missing something?

 Maybe it's not an issue on 3.8.13 and older releases.
 This was on btrfs-next from August 19.

 thanks for testing


 Thanks,
 Alex.




 After a while, kmemleak reports the leak:

 $ cat /sys/kernel/debug/kmemleak
 unreferenced object

Re: snapshot send with parent question

2014-05-31 Thread Alex Lyakas
Michael,
btrfs-send doesn't really know or care how you managed to get from
"a" to "c". It is able to compare any two RO subvolumes (not necessarily
related by snapshot operations), and produce a stream of commands that
transforms the content of "a" into the content of "c".

Send assumes that on the receive side you have a snapshot identical to
"a". The receive side locates the "a" snapshot (by its received_UUID
field) and creates a RW snapshot out of it. This snapshot will be
identical to "c" after applying the stream of commands. The receive
side then applies the stream of commands (in strict order), and at the
end marks the RW snapshot as RO. At this point, this snapshot should be
identical to "c".

The stream of commands will most probably not be identical to the
operations you performed to get from "a" to "c". But it will transform
the content of "a" into the content of "c" (possible bugs aside),
which is what's important.
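
In rough pseudocode, the receive-side sequence is (all helper names
here are invented; this is a sketch, not the btrfs-receive source):

/* Sketch of the receive flow described above */
parent = find_subvol_by_received_uuid(stream->parent_uuid);
snap = create_rw_snapshot(parent);	/* starts out identical to "a" */
while (read_next_command(stream, &cmd) > 0)
	apply_command(snap, &cmd);	/* applied in strict order */
mark_subvol_read_only(snap);		/* now identical to "c" */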

Of course, if "a" and "c" are related via snapshot operations, then
btrfs-send will be much more efficient, in the sense that it is able
to skip entire btrfs subtrees (look at btrfs_compare_trees), thus
avoiding many of the additional comparisons that some other tool like
rsync would have done.

Thanks,
Alex.



On Sun, Apr 20, 2014 at 1:00 AM, Michael Welsh Duggan m...@md5i.com wrote:
 Assume the following scenario:
 There exists a read-only snapshot called a.
 A read-write snapshot called b is created from a, and is then modified.
 A read-only snapshot of b is created, called c.
 A btrfs send is done for c, with a marked as its parent.

 Will the send data only contain the differences between a and c?  My
 experiments seem to indicate no, but I have no confidence that I am not
 doing something else correctly.

 Also, when a btrfs receive gets a stream containing the differences
 between a (parent) and c, does it only look at the relative pathname
 differences between a and c in order to determine the matching parent on
 the receiving side?

 --
 Michael Welsh Duggan
 (m...@md5i.com)



Re: [PATCH] Btrfs: correctly determine if blocks are shared in btrfs_compare_trees

2014-04-05 Thread Alex Lyakas
Hi Filipe,
Can you please explain more about the scenario you are worried about?

Let's say we have two FS trees (subvolumes) subv1 and subv2, subv2
being a RO snapshot of subv1. And they have a shared subtree at
logical==X. Now we change subv1, so its subtree is COW'ed and some
other logical address (Y) is being allocated for subtree root. But X
still cannot be reused as long as subv2 exists. That's the essence of
the extent tree providing refcount for each tree/data block in the FS,
no?

Now finally we delete subv2 and block X is freed. So it can be
reallocated as a root of another subtree. And then it might be
snapshotted again and shared as before.
So where do you see a problem?

If we have two FS-tree subtrees starting at the same logical=X, how
can they be different? This means we allocated logical=X again, while
it was still in use, which is very very bad.

Am I missing something here?

Thanks,
Alex.

P.S.: by logical I (and hopefully you) mean the extent-tree level
addresses, i.e., if we have a tree block with logical=X, then we also
have an EXTENT_ITEM with key (X, EXTENT_ITEM, nodesize/leafsize).


On Fri, Feb 21, 2014 at 12:15 AM, Filipe David Borba Manana
fdman...@gmail.com wrote:
 Just comparing the pointers (logical disk addresses) of the btree nodes is
 not completely bullet proof, we have to check if their generation numbers
 match too.

 It is guaranteed that a COW operation will result in a block with a different
 logical disk address than the original block's address, but over time we can
 reuse that former logical disk address.

 For example, creating a 2Gb filesystem on a loop device, and having a script
 running in a loop always updating the access timestamp of a file, resulted in
 the same logical disk address being reused for the same fs btree block in 
 about
 only 4 minutes.

 This could make us skip entire subtrees when doing an incremental send (which
 is currently the only user of btrfs_compare_trees). However the odds of 
 getting
 2 blocks at the same tree level, with the same logical disk address, equal 
 first
 slot keys and different generations, should hopefully be very low.

 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---
  fs/btrfs/ctree.c |   11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

 diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
 index cbd3a7d..88d1b1e 100644
 --- a/fs/btrfs/ctree.c
 +++ b/fs/btrfs/ctree.c
@@ -5376,6 +5376,8 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 	int advance_right;
 	u64 left_blockptr;
 	u64 right_blockptr;
+	u64 left_gen;
+	u64 right_gen;
 	u64 left_start_ctransid;
 	u64 right_start_ctransid;
 	u64 ctransid;
@@ -5640,7 +5642,14 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 			right_blockptr = btrfs_node_blockptr(
 					right_path->nodes[right_level],
 					right_path->slots[right_level]);
-			if (left_blockptr == right_blockptr) {
+			left_gen = btrfs_node_ptr_generation(
+					left_path->nodes[left_level],
+					left_path->slots[left_level]);
+			right_gen = btrfs_node_ptr_generation(
+					right_path->nodes[right_level],
+					right_path->slots[right_level]);
+			if (left_blockptr == right_blockptr &&
+			    left_gen == right_gen) {
 				/*
 				 * As we're on a shared block, don't
 				 * allow to go deeper.
 --
 1.7.9.5



Re: [PATCH] Btrfs: attach delayed ref updates to delayed ref heads

2014-03-30 Thread Alex Lyakas
Hi Josef,
I have a question about update_existing_head_ref() logic. The question
is not specific to the rework that you have done.

You have a code like this:
if (ref->must_insert_reserved) {
	/* if the extent was freed and then
	 * reallocated before the delayed ref
	 * entries were processed, we can end up
	 * with an existing head ref without
	 * the must_insert_reserved flag set.
	 * Set it again here
	 */
	existing_ref->must_insert_reserved = ref->must_insert_reserved;
	/*
	 * update the num_bytes so we make sure the accounting
	 * is done correctly
	 */
	existing->num_bytes = update->num_bytes;
}

How can it happen that you have a delayed_ref head for a particular
bytenr, and then somebody wants to add a ref head for the same bytenr
with must_insert_reserved=true? How could he have possibly allocated
the same bytenr from the free-space cache?
I know that when extent is freed by __btrfs_free_extent(), it calls
update_block_groups(), which pins down the extent. So this extent will
be dropped into free-space-cache only on transaction commit, when all
delayed refs have been processed already.

The only close case that I see is in btrfs_free_tree_block(), where it
adds a BTRFS_DROP_DELAYED_REF, and then if check_ref_cleanup()==0 and
BTRFS_HEADER_FLAG_WRITTEN is not set, it drops the extent directly
into free-space cache. However, check_ref_cleanup() would have deleted
the ref head, so we would not have found an existing ref head.

Can you please give a clue on this?

Thanks!
Alex.



On Thu, Jan 23, 2014 at 5:28 PM, Josef Bacik jba...@fb.com wrote:
 Currently we have two rb-trees, one for delayed ref heads and one for all of 
 the
 delayed refs, including the delayed ref heads.  When we process the delayed 
 refs
 we have to hold onto the delayed ref lock for all of the selecting and merging
 and such, which results in quite a bit of lock contention.  This was solved by
 having a waitqueue and only one flusher at a time, however this hurts if we 
 get
 a lot of delayed refs queued up.

 So instead just have an rb tree for the delayed ref heads, and then attach the
 delayed ref updates to an rb tree that is per delayed ref head.  Then we only
 need to take the delayed ref lock when adding new delayed refs and when
 selecting a delayed ref head to process, all the rest of the time we deal 
 with a
 per delayed ref head lock which will be much less contentious.

 The locking rules for this get a little more complicated since we have to lock
 up to 3 things to properly process delayed refs, but I will address that 
 problem
 later.  For now this passes all of xfstests and my overnight stress tests.
 Thanks,

 Signed-off-by: Josef Bacik jba...@fb.com
 ---
 fs/btrfs/backref.c     |  23 ++--
 fs/btrfs/delayed-ref.c | 223 +-
 fs/btrfs/delayed-ref.h |  23 ++--
 fs/btrfs/disk-io.c     |  79 ++--
 fs/btrfs/extent-tree.c | 317 -
 fs/btrfs/transaction.c |   7 +-
 6 files changed, 267 insertions(+), 405 deletions(-)

 diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
 index 835b6c9..34a8952 100644
 --- a/fs/btrfs/backref.c
 +++ b/fs/btrfs/backref.c
@@ -538,14 +538,13 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
 	if (extent_op && extent_op->update_key)
 		btrfs_disk_key_to_cpu(&op_key, &extent_op->key);

-	while ((n = rb_prev(n))) {
+	spin_lock(&head->lock);
+	n = rb_first(&head->ref_root);
+	while (n) {
 		struct btrfs_delayed_ref_node *node;
 		node = rb_entry(n, struct btrfs_delayed_ref_node,
 				rb_node);
-		if (node->bytenr != head->node.bytenr)
-			break;
-		WARN_ON(node->is_head);
-
+		n = rb_next(n);
 		if (node->seq > seq)
 			continue;

@@ -612,10 +611,10 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq,
 			WARN_ON(1);
 		}
 		if (ret)
-			return ret;
+			break;
 	}
-
-	return 0;
+	spin_unlock(&head->lock);
+	return ret;
 }

 /*
@@ -882,15 +881,15 @@ again:
 			btrfs_put_delayed_ref(&head->node);
 			goto again;
 		}
+		spin_unlock(&delayed_refs->lock);
 		ret = __add_delayed_refs(head, time_seq,
 					 &prefs_delayed);
 		mutex_unlock(&head->mutex);
-		if (ret) {
-			spin_unlock(&delayed_refs->lock);
+		if (ret)
 			goto out;
-		}
+

Re: [PATCH] Btrfs: fix memory leak in btrfs_create_tree()

2014-03-27 Thread Alex Lyakas
Hi Tsutomu Itoh,

On Thu, Mar 21, 2013 at 6:32 AM, Tsutomu Itoh t-i...@jp.fujitsu.com wrote:
 We should free leaf and root before returning from the error
 handling code.

 Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
 ---
  fs/btrfs/disk-io.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d84651..b1b5baa 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1291,6 +1291,7 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 			      0, objectid, NULL, 0, 0, 0);
 	if (IS_ERR(leaf)) {
 		ret = PTR_ERR(leaf);
+		leaf = NULL;
 		goto fail;
 	}

@@ -1334,11 +1335,16 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,

 	btrfs_tree_unlock(leaf);

+	return root;
+
 fail:
-	if (ret)
-		return ERR_PTR(ret);
+	if (leaf) {
+		btrfs_tree_unlock(leaf);
+		free_extent_buffer(leaf);
I believe this is not enough. A few lines above, another reference on
the root is taken by:
root->commit_root = btrfs_root_node(root);
So I believe the proper fix would be:
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d9698fd..260af79 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1354,10 +1354,10 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 	return root;

 fail:
-	if (leaf) {
+	if (leaf)
 		btrfs_tree_unlock(leaf);
-	free_extent_buffer(leaf);
-	}
+	free_extent_buffer(root->node);
+	free_extent_buffer(root->commit_root);
 	kfree(root);

 	return ERR_PTR(ret);



Thanks,
Alex.



 +   }
 +   kfree(root);

 -   return root;
 +   return ERR_PTR(ret);
  }

  static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-08 Thread Alex Lyakas
Thanks, Miao,
so the problem is that cow_file_range() joins transaction, allocates
space through btrfs_reserve_extent(), then detaches from transaction.
And then btrfs_finish_ordered_io() joins transaction again, adds a
delayed ref and detaches from transaction. So if between these two,
the transaction commits and we crash, then yes, the allocation is
lost.

Alex.


On Tue, Mar 4, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On  sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote:
 Hi Miao,

 On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816

 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
 Can you please clarify more about the problem?
 So I understand that we allocate something from the free space cache.
 So after that, the free space cache does not account for this extent
 anymore. On the other hand the extent tree also does not account for
 it (yet). We need to add a delayed reference, which will be run at
 transaction commit and update the extent tree. But free-space cache is
 also written out during transaction commit. So how the issue happens?
 Can you perhaps post a flow of events?

 Task1                               Task2
 btrfs_writepages()
   alloc data space
   (the allocated space is
    removed from the free
    space cache)
   submit_bio()
                                     btrfs_commit_transaction()
                                       write out space cache
                                       ...
                                       commit transaction completed
                                     system crash
   (end_io() wasn't executed)

 The system crashed before end_io() was executed, so no file references
 the allocated space, and the extent tree does not account for it either.
 That space was lost.

 Thanks
 Miao

 Thanks.
 Alex.



 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.

 Only data block groups had this problem, so the above change is just
 for data space allocation.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h| 15 ++-
  fs/btrfs/disk-io.c  |  2 +-
  fs/btrfs/extent-tree.c  | 24 
  fs/btrfs/free-space-cache.c | 42 ++
  fs/btrfs/inode.c| 42 +++---
  5 files changed, 108 insertions(+), 17 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 1667c9a..f58e1f7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
 /* free space cache stuff */
 struct btrfs_free_space_ctl *free_space_ctl;

+	/*
+	 * It is used to record the extents that are allocated for
+	 * the data, but don't update its metadata.
+	 */
 +   struct extent_io_tree pinned_extents;
 +
 /* block group cache stuff */
 struct rb_node cache_node;

 @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
  */
 struct list_head space_info;

+	/*
+	 * It is just used for the delayed data space allocation
+	 * because only the data space allocation can be done while
+	 * we write out the free space cache.
+	 */
 +   struct rw_semaphore data_rwsem;
 +
 struct btrfs_space_info *data_sinfo;

 struct reloc_control *reloc_ctl;
@@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 			   struct btrfs_key *ins);
 int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
-			 struct btrfs_key *ins, int is_data);
+			 struct btrfs_key *ins, int is_data, bool need_pin);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		  struct extent_buffer *buf, int full_backref, int for_cow);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 diff --git a/fs/btrfs/disk

Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed

2014-03-01 Thread Alex Lyakas
Hi Miao,

On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 When we mounted the filesystem after the crash, we got the following
 message:
   BTRFS error (device xxx): block group 4315938816 has wrong amount of free 
 space
   BTRFS error (device xxx): failed to load free space cache for block group 
 4315938816

 It is because we didn't update the metadata of the allocated space until
 the file data was written into the disk. During this time, there was no
 information about the allocated spaces in either the extent tree nor the
 free space cache. when we wrote out the free space cache at this time, those
 spaces were lost.
Can you please clarify more about the problem?
So I understand that we allocate something from the free space cache.
So after that, the free space cache does not account for this extent
anymore. On the other hand the extent tree also does not account for
it (yet). We need to add a delayed reference, which will be run at
transaction commit and update the extent tree. But free-space cache is
also written out during transaction commit. So how the issue happens?
Can you perhaps post a flow of events?

Thanks.
Alex.



 In ordered to fix this problem, I use a state tree for every block group
 to record those allocated spaces. We record the information when they are
 allocated, and clean up the information after the metadata update. Besides
 that, we also introduce a read-write semaphore to avoid the race between
 the allocation and the free space cache write out.

 Only data block groups had this problem, so the above change is just
 for data space allocation.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h| 15 ++-
  fs/btrfs/disk-io.c  |  2 +-
  fs/btrfs/extent-tree.c  | 24 
  fs/btrfs/free-space-cache.c | 42 ++
  fs/btrfs/inode.c| 42 +++---
  5 files changed, 108 insertions(+), 17 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 1667c9a..f58e1f7 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
 /* free space cache stuff */
 struct btrfs_free_space_ctl *free_space_ctl;

+	/*
+	 * It is used to record the extents that are allocated for
+	 * the data, but don't update its metadata.
+	 */
 +   struct extent_io_tree pinned_extents;
 +
 /* block group cache stuff */
 struct rb_node cache_node;

 @@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
  */
 struct list_head space_info;

+	/*
+	 * It is just used for the delayed data space allocation
+	 * because only the data space allocation can be done while
+	 * we write out the free space cache.
+	 */
 +   struct rw_semaphore data_rwsem;
 +
 struct btrfs_space_info *data_sinfo;

 struct reloc_control *reloc_ctl;
@@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 			   struct btrfs_key *ins);
 int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
-			 struct btrfs_key *ins, int is_data);
+			 struct btrfs_key *ins, int is_data, bool need_pin);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		  struct extent_buffer *buf, int full_backref, int for_cow);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8072cfa..426b558 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb,
 	fs_info->pinned_extents = fs_info->freed_extents[0];
 	fs_info->do_barriers = 1;

-
 	mutex_init(&fs_info->ordered_operations_mutex);
 	mutex_init(&fs_info->ordered_extent_flush_mutex);
 	mutex_init(&fs_info->tree_log_mutex);
@@ -2287,6 +2286,7 @@ int open_ctree(struct super_block *sb,
 	init_rwsem(&fs_info->extent_commit_sem);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
+	init_rwsem(&fs_info->data_rwsem);
 	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
 	fs_info->dev_replace.lock_owner = 0;
 	atomic_set(&fs_info->dev_replace.nesting_level, 0);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3664cfb..7b07876 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -6173,7 +6173,7 @@ enum btrfs_loop_type {
 static noinline int find_free_extent(struct btrfs_root *orig_root,
 				     u64 num_bytes, u64 empty_size,
 				     u64 hint_byte, struct btrfs_key *ins,
-

Re: [PATCH] Btrfs: fix a deadlock on chunk mutex

2014-02-18 Thread Alex Lyakas
Hello Josef,

On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik jba...@fusionio.com wrote:
 On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
 An user reported that he has hit an annoying deadlock while playing with
 ceph based on btrfs.

 Current updating device tree requires space from METADATA chunk,
 so we -may- need to do a recursive chunk allocation when adding/updating
 dev extent, that is where the deadlock comes from.

 If we use SYSTEM metadata to update device tree, we can avoid the recursive
 stuff.


 This is going to cause us to allocate much more system chunks than we used to
 which could land us in trouble.  Instead let's just keep us from re-entering 
 if
 we're already allocating a chunk.  We do the chunk allocation when we don't 
 have
 enough space for a cluster, but we'll likely have plenty of space to make an
 allocation.  Can you give this patch a try Jim and see if it fixes your 
 problem?
 Thanks,

 Josef


diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e152809..59df5e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	int wait_for_alloc = 0;
 	int ret = 0;

+	/* Don't re-enter if we're already allocating a chunk */
+	if (trans->allocating_chunk)
+		return -ENOSPC;
+
 	space_info = __find_space_info(extent_root->fs_info, flags);
 	if (!space_info) {
 		ret = update_space_info(extent_root->fs_info, flags,
@@ -3606,6 +3610,8 @@ again:
 		goto again;
 	}

+	trans->allocating_chunk = true;
+
 	/*
 	 * If we have mixed data/metadata chunks we want to make sure we keep
 	 * allocating mixed chunks instead of individual chunks.
@@ -3632,6 +3638,7 @@ again:
 	check_system_chunk(trans, extent_root, flags);

 	ret = btrfs_alloc_chunk(trans, extent_root, flags);
+	trans->allocating_chunk = false;
 	if (ret < 0 && ret != -ENOSPC)
 		goto out;

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e6509b9..47ad8be 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -388,6 +388,7 @@ again:
 	h->qgroup_reserved = qgroup_reserved;
 	h->delayed_ref_elem.seq = 0;
 	h->type = type;
+	h->allocating_chunk = false;
 	INIT_LIST_HEAD(&h->qgroup_ref_list);
 	INIT_LIST_HEAD(&h->new_bgs);

diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0e8aa1e..69700f7 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -68,6 +68,7 @@ struct btrfs_trans_handle {
 	struct btrfs_block_rsv *orig_rsv;
 	short aborted;
 	short adding_csums;
+	bool allocating_chunk;
 	enum btrfs_trans_type type;
 	/*
 	 * this root is only needed to validate that the root passed to

I hit this problem in the following scenario:
- a data chunk allocation is triggered, and locks the chunk_mutex
- the same thread now also wants to allocate a metadata chunk, so it
recursively calls do_chunk_alloc, but cannot lock the chunk_mutex =>
deadlock
- btrfs has only one metadata chunk, the one that was initially
allocated by mkfs; it has:
total_bytes=8388608
bytes_used=8130560
bytes_pinned=77824
bytes_reserved=180224
so bytes_used + bytes_pinned + bytes_reserved == total_bytes
(8130560 + 77824 + 180224 = 8388608), i.e., the chunk is completely full

Your patch would have returned ENOSPC and avoid the deadlock, but
there would be a failure to allocate a tree block for metadata. So the
transaction would have probably aborted.

How such situation should be handled?

Idea1:
- lock chunk mutex,
- if we are allocating a data chunk, check whether the metadata space
is below some threshold. If yes, go and allocate a metadata chunk
first and then only a data chunk.

Idea2:
- check if we are the same thread that already locked the chunk mutex.
If yes, allow recursive call but don't attempt to lock/unlock the
chunk_mutex this time

Or some other way?
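
For Idea2, a sketch of what such a recursion-aware wrapper could look
like (the owner/depth fields are invented, and recursive locking is
generally frowned upon in the kernel, so this only makes the idea
concrete; it is not a proposed patch):

/* Hypothetical: track the chunk_mutex owner so the same task may
 * re-enter without deadlocking. */
static void chunk_mutex_lock(struct btrfs_fs_info *fs_info)
{
	if (fs_info->chunk_mutex_owner == current) {
		fs_info->chunk_mutex_depth++;	/* recursive entry */
		return;
	}
	mutex_lock(&fs_info->chunk_mutex);
	fs_info->chunk_mutex_owner = current;
	fs_info->chunk_mutex_depth = 1;
}

static void chunk_mutex_unlock(struct btrfs_fs_info *fs_info)
{
	if (--fs_info->chunk_mutex_depth > 0)
		return;		/* still held by an outer caller */
	fs_info->chunk_mutex_owner = NULL;
	mutex_unlock(&fs_info->chunk_mutex);
}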

Thanks!
Alex.








Re: [PATCH] Btrfs: fix a deadlock on chunk mutex

2014-02-18 Thread Alex Lyakas
Hi Josef,
is this the commit to look at:
6df9a95e63395f595d0d1eb5d561dd6c91c40270 Btrfs: make the chunk
allocator completely tree lockless

or some other commits are also relevant?

Alex.


On Tue, Feb 18, 2014 at 6:06 PM, Josef Bacik jba...@fb.com wrote:



 On 02/18/2014 10:47 AM, Alex Lyakas wrote:
 Hello Josef,

 On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik jba...@fusionio.com
 wrote:
 On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:
 An user reported that he has hit an annoying deadlock while
 playing with ceph based on btrfs.

 Current updating device tree requires space from METADATA
 chunk, so we -may- need to do a recursive chunk allocation when
 adding/updating dev extent, that is where the deadlock comes
 from.

 If we use SYSTEM metadata to update device tree, we can avoid
 the recursive stuff.


 This is going to cause us to allocate much more system chunks
 than we used to which could land us in trouble.  Instead let's
 just keep us from re-entering if we're already allocating a
 chunk.  We do the chunk allocation when we don't have enough
 space for a cluster, but we'll likely have plenty of space to
 make an allocation.  Can you give this patch a try Jim and see if
 it fixes your problem? Thanks,

 Josef


 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index e152809..59df5e7 100644 --- a/fs/btrfs/extent-tree.c +++
 b/fs/btrfs/extent-tree.c @@ -3564,6 +3564,10 @@ static int
 do_chunk_alloc(struct btrfs_trans_handle *trans, int
 wait_for_alloc = 0; int ret = 0;

 +   /* Don't re-enter if we're already allocating a chunk */
 +   if (trans-allocating_chunk) +   return
 -ENOSPC; + space_info = __find_space_info(extent_root-fs_info,
 flags); if (!space_info) { ret =
 update_space_info(extent_root-fs_info, flags, @@ -3606,6 +3610,8
 @@ again: goto again; }

 +   trans-allocating_chunk = true; + /* * If we have mixed
 data/metadata chunks we want to make sure we keep * allocating
 mixed chunks instead of individual chunks. @@ -3632,6 +3638,7 @@
 again: check_system_chunk(trans, extent_root, flags);

 ret = btrfs_alloc_chunk(trans, extent_root, flags); +
 trans-allocating_chunk = false; if (ret  0  ret != -ENOSPC)
 goto out;

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index e6509b9..47ad8be 100644 --- a/fs/btrfs/transaction.c +++
 b/fs/btrfs/transaction.c @@ -388,6 +388,7 @@ again:
 h-qgroup_reserved = qgroup_reserved; h-delayed_ref_elem.seq =
 0; h-type = type; +   h-allocating_chunk = false;
 INIT_LIST_HEAD(h-qgroup_ref_list);
 INIT_LIST_HEAD(h-new_bgs);

 diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
 index 0e8aa1e..69700f7 100644 --- a/fs/btrfs/transaction.h +++
 b/fs/btrfs/transaction.h @@ -68,6 +68,7 @@ struct
 btrfs_trans_handle { struct btrfs_block_rsv *orig_rsv; short
 aborted; short adding_csums; +   bool allocating_chunk; enum
 btrfs_trans_type type; /* * this root is only needed to validate
 that the root passed to

 I hit this problem in a following scenario: - a data chunk
 allocation is triggered, and locks chunk_mutex - the same thread
 now also wants to allocate a metadata chunk, so it recursively
 calls do_chunk_alloc, but cannot lock the chunk_mutex = deadlock -
 btrfs has only one metadata chunk, the one that was initially
 allocated by mkfs, it has: total_bytes=8388608 bytes_used=8130560
 bytes_pinned=77824 bytes_reserved=180224 so bytes_used +
 bytes_pinned + bytes_reserved == total_bytes

 Your patch would have returned ENOSPC and avoid the deadlock, but
 there would be a failure to allocate a tree block for metadata. So
 the transaction would have probably aborted.

 How such situation should be handled?

 Idea1: - lock chunk mutex, - if we are allocating a data chunk,
 check whether the metadata space is below some threshold. If yes,
 go and allocate a metadata chunk first and then only a data chunk.

 Idea2: - check if we are the same thread that already locked the
 chunk mutex. If yes, allow recursive call but don't attempt to
 lock/unlock the chunk_mutex this time

 Or some other way?


 I fixed this with the delayed chunk allocation stuff which doesn't
 actually do the block group creation stuff until we end the
 transaction, so we can allocate metadata chunks without any issue.
 Thanks,

 Josef
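
 As I understand the rework Josef refers to, i.e. commit 6df9a95e
 mentioned above, the chunk allocator only queues the new block group
 on the transaction handle, roughly:

 ret = btrfs_alloc_chunk(trans, extent_root, flags);
 /* btrfs_make_block_group() links the new group into trans->new_bgs
  * instead of inserting items while chunk_mutex is held */

 /* ...later, when the handle is ended: */
 btrfs_end_transaction(trans, root);
 /* -> btrfs_create_pending_block_groups(), which inserts the block
  *    group items and device extents outside the chunk_mutex */

 so the metadata allocations for the block group items no longer happen
 under the chunk_mutex.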

Re: [PATCH] btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl

2014-02-15 Thread Alex Lyakas
Hello Hugo,

Is this issue specific to the receive ioctl?

Because what you are describing applies to any user-kernel ioctl-based
interface when the user-space is compiled as 32-bit while the kernel
has been compiled as 64-bit. For that purpose, I believe, there
exists the compat_ioctl callback. Its implementation should do
thunking, i.e., treat the user-space structure as if it were
compiled for 32-bit, and unpack it properly.

Today, however, btrfs supplies the same callback both for
unlocked_ioctl and compat_ioctl. So we should see the same problem
with all ioctls, if I am not missing anything.

Thanks,
Alex.



On Mon, Jan 6, 2014 at 10:50 AM, Hugo Mills h...@carfax.org.uk wrote:
 On Sun, Jan 05, 2014 at 06:26:11PM +, Hugo Mills wrote:
 On Sun, Jan 05, 2014 at 05:55:27PM +0000, Hugo Mills wrote:
  The structure for BTRFS_SET_RECEIVED_SUBVOL packs differently on 32-bit
  and 64-bit systems. This means that it is impossible to use btrfs
  receive on a system with a 64-bit kernel and 32-bit userspace, because
  the structure size (and hence the ioctl number) is different.
 
  This patch adds a compatibility structure and ioctl to deal with the
  above case.

Oops, forgot to mention -- this has been compile tested, but not
 actually run yet. The machine in question is several miles away and is
 a production machine (it's my work desktop, and I can't afford much
 downtime on it).

... And it doesn't even compile properly, now I come to build a
 .deb. I'm still interested in comments about the general approach, but
 the specific patch is a load of balls.

Hugo.

Hugo.

  Signed-off-by: Hugo Mills h...@carfax.org.uk
  ---
   fs/btrfs/ioctl.c | 95 
  +++-
   1 file changed, 87 insertions(+), 8 deletions(-)
 
  diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
  index 21da576..e186439 100644
  --- a/fs/btrfs/ioctl.c
  +++ b/fs/btrfs/ioctl.c
  @@ -57,6 +57,32 @@
   #include send.h
   #include dev-replace.h
 
  +#ifdef CONFIG_64BIT
  +/* If we have a 32-bit userspace and 64-bit kernel, then the UAPI
  + * structures are incorrect, as the timespec structure from userspace
  + * is 4 bytes too small. We define these alternatives here to teach
  + * the kernel about the 32-bit struct packing.
  + */
  +struct btrfs_ioctl_timespec {
  +   __u64 sec;
  +   __u32 nsec;
  +} ((__packed__));
  +
  +struct btrfs_ioctl_received_subvol_args {
  +   charuuid[BTRFS_UUID_SIZE];  /* in */
  +   __u64   stransid;   /* in */
  +   __u64   rtransid;   /* out */
  +   struct btrfs_ioctl_timespec stime; /* in */
  +   struct btrfs_ioctl_timespec rtime; /* out */
  +   __u64   flags;  /* in */
  +   __u64   reserved[16];   /* in */
  +} __attribute__ ((__packed__));
  +#endif
  +
  +#define BTRFS_IOC_SET_RECEIVED_SUBVOL_32 _IOWR(BTRFS_IOCTL_MAGIC, 37, \
  +   struct btrfs_ioctl_received_subvol_args_32)
  +
  +
   static int btrfs_clone(struct inode *src, struct inode *inode,
 u64 off, u64 olen, u64 olen_aligned, u64 destoff);
 
  @@ -4313,10 +4339,69 @@ static long btrfs_ioctl_quota_rescan_wait(struct 
  file *file, void __user *arg)
  return btrfs_qgroup_wait_for_completion(root->fs_info);
   }
 
  +#ifdef CONFIG_64BIT
  +static long btrfs_ioctl_set_received_subvol_32(struct file *file,
  +   void __user *arg)
  +{
  +   struct btrfs_ioctl_received_subvol_args_32 *args32 = NULL;
  +   struct btrfs_ioctl_received_subvol_args *args64 = NULL;
  +   int ret = 0;
  +
  +   args32 = memdup_user(arg, sizeof(*args32));
  +   if (IS_ERR(args32)) {
  +   ret = PTR_ERR(args32);
  +   args32 = NULL;
  +   goto out;
  +   }
  +
  +   args64 = kmalloc(sizeof(*args64), GFP_KERNEL);
  +   if (!args64) {
  +   ret = -ENOMEM;
  +   goto out;
  +   }
  +
  +   memcpy(args64->uuid, args32->uuid, BTRFS_UUID_SIZE);
  +   args64->stransid = args32->stransid;
  +   args64->rtransid = args32->rtransid;
  +   args64->stime.sec = args32->stime.sec;
  +   args64->stime.nsec = args32->stime.nsec;
  +   args64->rtime.sec = args32->rtime.sec;
  +   args64->rtime.nsec = args32->rtime.nsec;
  +   args64->flags = args32->flags;
  +
  +   ret = _btrfs_ioctl_set_received_subvol(file, args64);
  +
  +out:
  +   kfree(args32);
  +   kfree(args64);
  +   return ret;
  +}
  +#endif
  +
   static long btrfs_ioctl_set_received_subvol(struct file *file,
  void __user *arg)
   {
  struct btrfs_ioctl_received_subvol_args *sa = NULL;
  +   int ret = 0;
  +
  +   sa = memdup_user(arg, sizeof(*sa));
  +   if (IS_ERR(sa)) {
  +   ret = PTR_ERR(sa);
  +   sa = NULL;
  +   goto out;
  +   }
  +
  +   ret = _btrfs_ioctl_set_received_subvol(file, sa);
  +
  +out:
  +   kfree(sa);
  +   return ret;
  +}
  +
  +static long _btrfs_ioctl_set_received_subvol(struct 

Re: [PATCH] Btrfs: return ENOSPC when target space is full

2014-01-19 Thread Alex Lyakas
Hi Filipe,
I think in the context of do_chunk_alloc(), 0 doesn't mean success:
0 means allocation was not attempted, while 1 means allocation was
attempted and succeeded. -ENOSPC means allocation was attempted but
failed. Any other errno deserves a transaction abort.
Anyway, the callers are OK with 0 and -ENOSPC, and re-search for a
free extent in these cases.
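
For illustration, a sketch of that caller-side contract (simplified,
not the exact find_free_extent() code):

ret = do_chunk_alloc(trans, root, flags, CHUNK_ALLOC_FORCE);
if (ret < 0 && ret != -ENOSPC) {
	btrfs_abort_transaction(trans, root, ret);	/* real failure */
	goto out;
}
/*
 * ret == 0 (not attempted), 1 (attempted and succeeded) and
 * -ENOSPC (space full) all lead back to the free-extent search.
 */
goto search;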

Alex.


On Mon, Aug 5, 2013 at 5:25 PM, Filipe David Borba Manana
fdman...@gmail.com wrote:
 In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success)
 when the target space was full and when chunk allocation is needed.
 However, later on in that same function we return ENOSPC if
 btrfs_alloc_chunk() fails (and chunk allocation was needed) and
 set the space's full flag.

 This was inconsistent, as -ENOSPC should be returned if the space
 is full and a chunk allocation needs to be performed. If the space is
 full but no chunk allocation is needed, just return 0 (success).

 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---
  fs/btrfs/extent-tree.c |6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index e868c35..ef89a66 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3829,8 +3829,12 @@ again:
 if (force < space_info->force_alloc)
 force = space_info->force_alloc;
 if (space_info->full) {
 +   if (should_alloc_chunk(extent_root, space_info, force))
 +   ret = -ENOSPC;
 +   else
 +   ret = 0;
 spin_unlock(&space_info->lock);
 -   return 0;
 +   return ret;
 }

 if (!should_alloc_chunk(extent_root, space_info, force)) {
 --
 1.7.9.5



Re: [PATCH v2] Btrfs-progs: avoid using btrfs internal subvolume path to send

2014-01-11 Thread Alex Lyakas
Hi Miguel,

On Sat, Nov 30, 2013 at 1:43 PM, Miguel Negrão
miguel.negrao-li...@friendlyvirus.org wrote:
 Em 29-11-2013 16:37, Wang Shilong escreveu:
 From: Wang Shilong wangsl.f...@cn.fujitsu.com

 Steps to reproduce:
   # mkfs.btrfs -f /dev/sda
   # mount /dev/sda /mnt
   # btrfs subvolume create /mnt/foo
   # umount /mnt
   # mount -o subvol=foo /dev/sda /mnt
   # btrfs sub snapshot -r /mnt /mnt/snap
   # btrfs send /mnt/snap > /dev/null

 We will fail to send '/mnt/snap'. This is because btrfs send tries to
 open '/mnt/snap' by the btrfs internal subvolume path 'foo/snap' rather
 than the relative path based on the mount point; this returns 'no
 such file or directory', which is not right. Fix it.

 I was going to write to the list to report exactly this issue. In my
 case, this happens when the default subvolume has been changed from id 5
 to some other id. I get the error 'no such file or directory'. Currently
 my workaround is to mount the root subvolume with -o subvolid=5 and then
 do the send.

 Also, I'd like to ask, are there plans to make the send and receive
 commands resumeable somehow (or perhaps it is already, but couldn't see
 how) ?
I have proposed a patch to address the resumability of send-receive
some time ago in this thread:
http://www.spinics.net/lists/linux-btrfs/msg18180.html

However, this changes the current user-kernel protocol used by send,
and is overall a big change, which is not easy to integrate.

Alex.



 best,
 Miguel Negrão





Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quota_enable is 0

2013-12-03 Thread Alex Lyakas
Hi Liu, Jan,

What happens to the struct qgroup_update entries sitting on
trans->qgroup_ref_list in case the transaction aborts? It seems that
they are not freed.

For example, if we are in btrfs_commit_transaction() and:
- call create_pending_snapshots()
- call btrfs_run_delayed_items() and this fails
So we go to cleanup_transaction() without calling
btrfs_delayed_refs_qgroup_accounting(), which would have been called
by btrfs_run_delayed_refs().

I receive kmemleak warnings about these thingies not being freed,
although on an older kernel. However, looking at Josef's tree, this
still seems to be the case.
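
For illustration, a sketch of the kind of cleanup that seems to be
missing (hypothetical helper; cleanup_transaction() would call it):

static void btrfs_free_qgroup_updates(struct btrfs_trans_handle *trans)
{
	struct qgroup_update *u, *tmp;

	/* drop the records that qgroup accounting never consumed */
	list_for_each_entry_safe(u, tmp, &trans->qgroup_ref_list, list) {
		list_del(&u->list);
		kfree(u);
	}
}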

Thanks,
Alex.


On Mon, Oct 14, 2013 at 7:59 AM, Liu Bo bo.li@oracle.com wrote:
 It's unnecessary to do qgroups accounting without enabling quota.

 Signed-off-by: Liu Bo bo.li@oracle.com
 ---
  fs/btrfs/ctree.c   |2 +-
  fs/btrfs/delayed-ref.c |   18 ++
  fs/btrfs/qgroup.c  |3 +++
  3 files changed, 18 insertions(+), 5 deletions(-)

 diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
 index 61b5bcd..fb89235 100644
 --- a/fs/btrfs/ctree.c
 +++ b/fs/btrfs/ctree.c
 @@ -407,7 +407,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,

 tree_mod_log_write_lock(fs_info);
 spin_lock(&fs_info->tree_mod_seq_lock);
 -   if (!elem->seq) {
 +   if (elem && !elem->seq) {
 elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
 list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
 }
 diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
 index 9e1a1c9..3ec3d08 100644
 --- a/fs/btrfs/delayed-ref.c
 +++ b/fs/btrfs/delayed-ref.c
 @@ -691,8 +691,13 @@ static noinline void add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 ref->is_head = 0;
 ref->in_tree = 1;

 -   if (need_ref_seq(for_cow, ref_root))
 -   seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
 +   if (need_ref_seq(for_cow, ref_root)) {
 +   struct seq_list *elem = NULL;
 +
 +   if (fs_info->quota_enabled)
 +   elem = &trans->delayed_ref_elem;
 +   seq = btrfs_get_tree_mod_seq(fs_info, elem);
 +   }
 ref->seq = seq;

 full_ref = btrfs_delayed_node_to_tree_ref(ref);
 @@ -750,8 +755,13 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 ref->is_head = 0;
 ref->in_tree = 1;

 -   if (need_ref_seq(for_cow, ref_root))
 -   seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
 +   if (need_ref_seq(for_cow, ref_root)) {
 +   struct seq_list *elem = NULL;
 +
 +   if (fs_info->quota_enabled)
 +   elem = &trans->delayed_ref_elem;
 +   seq = btrfs_get_tree_mod_seq(fs_info, elem);
 +   }
 ref->seq = seq;

 full_ref = btrfs_delayed_node_to_data_ref(ref);
 diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
 index 4e6ef49..1cb58f9 100644
 --- a/fs/btrfs/qgroup.c
 +++ b/fs/btrfs/qgroup.c
 @@ -1188,6 +1188,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle 
 *trans,
  {
 struct qgroup_update *u;

 +   if (!trans->root->fs_info->quota_enabled)
 +   return 0;
 +
 BUG_ON(!trans->delayed_ref_elem.seq);
 u = kmalloc(sizeof(*u), GFP_NOFS);
 if (!u)
 --
 1.7.7



Re: [PATCH v2] btrfs-progs: calculate disk space that a subvol could free

2013-11-28 Thread Alex Lyakas
Hello Anand,
I have sent a similar comment to your email thread in
http://www.spinics.net/lists/linux-btrfs/msg27547.html

I believe this approach of calculating freeable space is incorrect.
Try this:
- create a fresh btrfs
- create a regular file
- write some data into it in such a way that it has, say, 4000
EXTENT_DATA items, so that the file tree and extent tree get deep enough
- run btrfs-debug-tree and verify that all EXTENT_ITEMs of this file
(in the extent tree) have refcnt=1
- create a snapshot of the subvolume
- run btrfs-debug-tree again

You will see that most (in my case - all) of EXTENT_ITEMs still have
refcnt=1. Reason for this is as I mentioned in
http://www.spinics.net/lists/linux-btrfs/msg27547.html

But if you delete the subvolume, no space will be freed, because all
these extents are also shared by the snapshot. Yet it seems that
your tool will report that all of the subvolume's space is freeable (based
on refcnt=1).

Can you pls try that experiment and comment on it? Perhaps I am
missing something here?

Thanks!
Alex.



On Thu, Oct 10, 2013 at 6:33 AM, Wang Shilong
wangsl.f...@cn.fujitsu.com wrote:
 On 10/10/2013 11:35 AM, Anand Jain wrote:


   If 'btrfs_file_extent_item' could contain the ref count it would
   solve the current problem quite easily.  (The problem is that it
   takes n * n searches to know the data extents with their refs for a
   given subvol.)

  Just considering btrfs_file_extent_item is not enough, because
  we should consider metadata (as I have said before).


   But what are the challenge(s) to having a ref count in
   btrfs_file_extent_item? Any thoughts?

 Haven't thought a better idea yet.



 Thanks, Anand


Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv

2013-11-04 Thread Alex Lyakas
Hi Filipe,
any luck with this patch?:)

Alex.

On Wed, Oct 23, 2013 at 5:26 PM, Filipe David Manana fdman...@gmail.com wrote:
 On Wed, Oct 23, 2013 at 3:14 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hello,

 On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana fdman...@gmail.com 
 wrote:
 On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hi Filipe,


 On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana
 fdman...@gmail.com wrote:

 This issue is simple to reproduce and observe if kmemleak is enabled.
 Two simple ways to reproduce it:

 ** 1

 $ mkfs.btrfs -f /dev/loop0
 $ mount /dev/loop0 /mnt/btrfs
 $ btrfs balance start /mnt/btrfs
 $ umount /mnt/btrfs

 So here it seems that the leak can only happen in case the block-group
 has a free-space inode. This is what the orphan item is added for.
 Yes, here kmemleak reports.
 But: if the space_cache option is disabled (and nospace_cache enabled), it
 seems that btrfs still creates the FREE_SPACE inodes, although they
 are empty, because in cache_save_setup:
 
 inode = lookup_free_space_inode(root, block_group, path);
 if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
 ret = PTR_ERR(inode);
 btrfs_release_path(path);
 goto out;
 }

 if (IS_ERR(inode)) {
 ...
 ret = create_free_space_inode(root, trans, block_group, path);

 and only later it actually sets BTRFS_DC_WRITTEN if space_cache option
 is disabled. Amazing!
 Although this is a different issue, do you know perhaps why these
 empty inodes are needed?

 Don't know if they are needed. But you have a point, it seems odd to
 create the free space cache inode if mount option nospace_cache was
 supplied. Thanks Alex. Testing the following patch:

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index c43ee8a..eb1b7da 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -3162,6 +3162,9 @@ static int cache_save_setup(struct
 btrfs_block_group_cache *block_group,
 int retries = 0;
 int ret = 0;

 +   if (!btrfs_test_opt(root, SPACE_CACHE))
 +   return 0;
 +
 /*
  * If this block group is smaller than 100 megs don't bother caching 
 the
  * block group.



 Thanks!
 Alex.




 ** 2

 $ mkfs.btrfs -f /dev/loop0
 $ mount /dev/loop0 /mnt/btrfs
 $ touch /mnt/btrfs/foobar
 $ rm -f /mnt/btrfs/foobar
 $ umount /mnt/btrfs


 I tried the second repro script on kernel 3.8.13, and kmemleak does
 not report a leak (even if I force the kmemleak scan). I did not try
 the balance-repro script, though. Am I missing something?

 Maybe it's not an issue on 3.8.13 and older releases.
 This was on btrfs-next from August 19.

 thanks for testing


 Thanks,
 Alex.




 After a while, kmemleak reports the leak:

 $ cat /sys/kernel/debug/kmemleak
 unreferenced object 0x880402b13e00 (size 128):
   comm btrfs, pid 19621, jiffies 4341648183 (age 70057.844s)
   hex dump (first 32 bytes):
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de  .N..
   backtrace:
 [817275a6] kmemleak_alloc+0x26/0x50
 [8117832b] kmem_cache_alloc_trace+0xeb/0x1d0
 [a04db499] btrfs_alloc_block_rsv+0x39/0x70 [btrfs]
 [a04f8bad] btrfs_orphan_add+0x13d/0x1b0 [btrfs]
 [a04e2b13] btrfs_remove_block_group+0x143/0x500 [btrfs]
 [a0518158] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs]
 [a051bc27] btrfs_balance+0x8f7/0xe90 [btrfs]
 [a05240a0] btrfs_ioctl_balance+0x250/0x550 [btrfs]
 [a05269ca] btrfs_ioctl+0xdfa/0x25f0 [btrfs]
 [8119c936] do_vfs_ioctl+0x96/0x570
 [8119cea1] SyS_ioctl+0x91/0xb0
 [81750242] system_call_fastpath+0x16/0x1b
 [] 0x

 This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb
 (Btrfs: separate out tests into their own directory).

 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---

 V2: removed atomic_t member in struct btrfs_block_rsv, as suggested by
 Josef Bacik, and use instead the condition reserved == 0 to decide
 when to free the block.
 V3: simplified patch, just kfree() (and not btrfs_free_block_rsv) the
 root's orphan_block_rsv when free'ing the root. Thanks Josef for
 the suggestion.
 V4: use btrfs_free_block_rsv() instead of kfree(). The error I was getting
 in xfstests when using btrfs_free_block_rsv() was unrelated, Josef 
 just
 pointed it to me (separate issue).
 V5: move the free call below the iput() call, so that btrfs_evict_node()
 can process the orphan_block_rsv first to do some needed cleanup 
 before
 we free it.
 V6: free the root's orphan_block_rsv in close_ctree() too. After a balance
 the orphan_block_rsv of the tree of tree roots was being leaked, 
 because
 free_fs_root() is only called for filesystem

Re: [patch 3/7] btrfs: Add per-super attributes to sysfs

2013-10-26 Thread Alex Lyakas
Hi Jeff,

On Tue, Sep 10, 2013 at 7:24 AM, Jeff Mahoney je...@suse.com wrote:
 This patch adds per-super attributes to sysfs.

 It doesn't publish any attributes yet, but does the proper lifetime
 handling as well as the basic infrastructure to add new attributes.

 Signed-off-by: Jeff Mahoney je...@suse.com
 ---
  fs/btrfs/ctree.h |2 +
  fs/btrfs/super.c |   13 +++-
  fs/btrfs/sysfs.c |   58 
 +++
  fs/btrfs/sysfs.h |   19 ++
  4 files changed, 91 insertions(+), 1 deletion(-)

 --- a/fs/btrfs/ctree.h  2013-09-10 00:09:12.990087784 -0400
 +++ b/fs/btrfs/ctree.h  2013-09-10 00:09:35.521794520 -0400
 @@ -3694,6 +3694,8 @@ int btrfs_defrag_leaves(struct btrfs_tra
  /* sysfs.c */
  int btrfs_init_sysfs(void);
  void btrfs_exit_sysfs(void);
 +int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info);
 +void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info);

  /* xattr.c */
  ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
 --- a/fs/btrfs/super.c  2013-09-10 00:09:12.994087730 -0400
 +++ b/fs/btrfs/super.c  2013-09-10 00:09:35.525794464 -0400
 @@ -301,6 +301,8 @@ void __btrfs_panic(struct btrfs_fs_info

  static void btrfs_put_super(struct super_block *sb)
  {
 +   btrfs_sysfs_remove_one(btrfs_sb(sb));
 +
 (void)close_ctree(btrfs_sb(sb)->tree_root);
 /* FIXME: need to fix VFS to return error? */
 /* AV: return it _where_?  -put_super() can be triggered by any 
 number
 @@ -1143,8 +1145,17 @@ static struct dentry *btrfs_mount(struct
 }

 root = !error ? get_default_root(s, subvol_objectid) : ERR_PTR(error);
 -   if (IS_ERR(root))
 +   if (IS_ERR(root)) {
 deactivate_locked_super(s);
 +   return root;
 +   }
 +
 +   error = btrfs_sysfs_add_one(fs_info);
 +   if (error) {
 +   dput(root);
 +   deactivate_locked_super(s);
 +   return ERR_PTR(error);
 +   }

 return root;

 --- a/fs/btrfs/sysfs.c  2013-09-10 00:09:13.002087628 -0400
 +++ b/fs/btrfs/sysfs.c  2013-09-10 00:09:49.501616538 -0400
 @@ -61,6 +61,64 @@ static struct attribute *btrfs_supp_feat
 NULL
  };

 +static struct attribute *btrfs_attrs[] = {
 +   NULL,
 +};
 +
 +static void btrfs_fs_info_release(struct kobject *kobj)
 +{
 +   struct btrfs_fs_info *fs_info;
 +   fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj);
  +   complete(&fs_info->kobj_unregister);
 +}
 +
 +static ssize_t btrfs_attr_show(struct kobject *kobj,
 +  struct attribute *attr, char *buf)
 +{
 +   struct btrfs_attr *a = container_of(attr, struct btrfs_attr, attr);
 +   struct btrfs_fs_info *fs_info;
 +   fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj);
 +
  +   return a->show ? a->show(a, fs_info, buf) : 0;
 +}
 +
 +static ssize_t btrfs_attr_store(struct kobject *kobj,
 +   struct attribute *attr,
 +   const char *buf, size_t len)
 +{
 +   struct btrfs_attr *a = container_of(attr, struct btrfs_attr, attr);
 +   struct btrfs_fs_info *fs_info;
 +   fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj);
 +
  +   return a->store ? a->store(a, fs_info, buf, len) : 0;
 +}
 +
 +static const struct sysfs_ops btrfs_attr_ops = {
 +   .show = btrfs_attr_show,
 +   .store = btrfs_attr_store,
 +};
 +
 +static struct kobj_type btrfs_ktype = {
 +   .default_attrs  = btrfs_attrs,
 +   .sysfs_ops  = btrfs_attr_ops,
 +   .release= btrfs_fs_info_release,
 +};
 +
 +int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info)
 +{
  +   init_completion(&fs_info->kobj_unregister);
  +   fs_info->super_kobj.kset = btrfs_kset;
  +   return kobject_init_and_add(&fs_info->super_kobj, &btrfs_ktype, NULL,
  +   "%pU", fs_info->fsid);
 +}
 +
 +void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info)
 +{
  +   kobject_del(&fs_info->super_kobj);
Is there a reason for this explicit call? The last kobject_put will do
this automatically, no?

  +   kobject_put(&fs_info->super_kobj);
  +   wait_for_completion(&fs_info->kobj_unregister);
 +}
 +
  static void btrfs_supp_feat_release(struct kobject *kobj)
  {
  complete(&btrfs_feat->f_kobj_unregister);
 --- a/fs/btrfs/sysfs.h  2013-09-10 00:09:13.002087628 -0400
 +++ b/fs/btrfs/sysfs.h  2013-09-10 00:09:35.525794464 -0400
 @@ -8,6 +8,24 @@ enum btrfs_feature_set {
 FEAT_MAX
  };

 +struct btrfs_attr {
 +   struct attribute attr;
 +   ssize_t (*show)(struct btrfs_attr *, struct btrfs_fs_info *, char *);
 +   ssize_t (*store)(struct btrfs_attr *, struct btrfs_fs_info *,
 +const char *, size_t);
 +};
 +
 +#define __INIT_BTRFS_ATTR(_name, _mode, _show, _store) \
 +{  \
 +   

Re: [PATCH] btrfs-progs: calculate disk space that a subvol could free upon delete

2013-10-26 Thread Alex Lyakas
Hi Anand,

1) so let's say I have a subvolume and a snapshot of this subvolume.
So in this case, I will see Sole space = 0 for both of them,
correct? Because all extents (except inline ones) are shared.

2) How is this in terms of responsiveness? On a huge subvolume, we
need to iterate all the EXTENT_DATAs and then look up their
EXTENT_ITEMs.

3) So it's kind of a poor man's replacement for quota groups, correct?

I like that it's so easy to check for exclusive data, though:)

Alex.


On Fri, Sep 13, 2013 at 6:44 PM, Wang Shilong wangshilong1...@gmail.com wrote:

 Hello Anand,

 (This patch is for review and comments only)

  This patch provides a way to know how much space can be
  relinquished if/when a subvol/snapshot is deleted.  With
  this, a sysadmin can make better judgments in managing the
  filesystem when the fs is near full.


  I think this is really *helpful* since users cannot really know how much
  space (exclusive) is in a subvolume.

 Thanks,
 Wang

 as shown below the parameter 'sole space' indicates the size
 which is freed when subvol is deleted. (any other better
 term for this?, pls suggest).
 -
 btrfs su show /btrfs/sv1
 /btrfs/sv1
   Name:   sv1
   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
   Parent uuid:-
   Creation time:  2013-09-13 18:17:32
   Object ID:  257
   Generation (Gen):   18
   Gen at creation:17
   Parent: 5
   Top Level:  5
   Flags:  -
   Sole space: 1.56MiB 
   Snapshot(s):

 btrfs su snap /btrfs/sv1 /btrfs/ss2
 Create a snapshot of '/btrfs/sv1' in '/btrfs/ss2'

 btrfs su show /btrfs/sv1
 /btrfs/sv1
   Name:   sv1
   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
   Parent uuid:-
   Creation time:  2013-09-13 18:17:32
   Object ID:  257
   Generation (Gen):   19
   Gen at creation:17
   Parent: 5
   Top Level:  5
   Flags:  -
   Sole space: 0.00  -
   Snapshot(s):
   ss2
 -

 Signed-off-by: Anand Jain anand.j...@oracle.com
 ---
 cmds-subvolume.c |   5 ++
 utils.c  | 154 
 +++
 utils.h  |   1 +
 3 files changed, 160 insertions(+)

 diff --git a/cmds-subvolume.c b/cmds-subvolume.c
 index de246ab..2b02d66 100644
 --- a/cmds-subvolume.c
 +++ b/cmds-subvolume.c
 @@ -809,6 +809,7 @@ static int cmd_subvol_show(int argc, char **argv)
   int fd = -1, mntfd = -1;
   int ret = 1;
   DIR *dirstream1 = NULL, *dirstream2 = NULL;
 + u64 freeable_bytes;

   if (check_argc_exact(argc, 2))
   usage(cmd_subvol_show_usage);
 @@ -878,6 +879,8 @@ static int cmd_subvol_show(int argc, char **argv)
   goto out;
   }

 + freeable_bytes = get_subvol_freeable_bytes(fd);
 +
   ret = 0;
   /* print the info */
    printf("%s\n", fullpath);
  @@ -915,6 +918,8 @@ static int cmd_subvol_show(int argc, char **argv)
    else
    printf("\tFlags: \t\t\t-\n");
 
  + printf("\tSole space: \t\t%s\n",
  + pretty_size(freeable_bytes));
    /* print the snapshots of the given subvol if any */
    printf("\tSnapshot(s):\n");
   filter_set = btrfs_list_alloc_filter_set();
 diff --git a/utils.c b/utils.c
 index 22c3310..f01d580 100644
 --- a/utils.c
 +++ b/utils.c
 @@ -2019,3 +2019,157 @@ int is_dev_excl_op_free(int fd)
   ret = ioctl(fd, BTRFS_IOC_CHECK_DEV_EXCL_OPS, NULL);
    return ret < 0 ? ret : -errno;
 }
 +
 +/* gets the ref count for given extent
 + * 0 = didn't find the item
 + * n = number of references
 +*/
 +u64 get_extent_refcnt(int fd, u64 disk_blk)
 +{
 + int ret = 0, i, e;
 + struct btrfs_ioctl_search_args args;
  + struct btrfs_ioctl_search_key *sk = &args.key;
  + struct btrfs_ioctl_search_header sh;
  + unsigned long off = 0;
  +
  + memset(&args, 0, sizeof(args));
  +
  + sk->tree_id = BTRFS_EXTENT_TREE_OBJECTID;
  +
  + sk->min_type = BTRFS_EXTENT_ITEM_KEY;
  + sk->max_type = BTRFS_EXTENT_ITEM_KEY;
  +
  + sk->min_objectid = disk_blk;
  + sk->max_objectid = disk_blk;
  +
  + sk->max_offset = (u64)-1;
  + sk->max_transid = (u64)-1;
  +
  + while (1) {
  + sk->nr_items = 4096;
  +
  + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
  + e = errno;
  + if (ret < 0) {
  + fprintf(stderr, "ERROR: search failed - %s\n",
  + strerror(e));
  + return 0;
  + }
  + if (sk->nr_items == 0)
  + break;
  +
  + off = 0;
  + for (i = 0; i < sk->nr_items; i++) {
  + struct btrfs_extent_item *ei;
  + u64 ref;
 

Re: [PATCH] btrfs-progs: calculate disk space that a subvol could free upon delete

2013-10-26 Thread Alex Lyakas
Thinking about this more, I believe this way of checking for exclusive
data doesn't work. When a snapshot is created, btrfs doesn't go and
explicitly increment the refcount on *all* relevant EXTENT_ITEMs in the
extent tree; that way, creating a snapshot would take forever for large
subvolumes. Instead, it only does that for EXTENT_ITEMs that are
pointed to by EXTENT_DATAs in the root node of the snapshotted file
tree. For the rest of the nodes/leaves of a file tree, implicit
tree-block references are added (not sure if "implicit" is the right
term) for the top tree blocks only. This is accomplished by the
_btrfs_mod_ref() code, called from btrfs_copy_root() during the
snapshot creation flow. The snapshot deletion code is the one that is
smart enough to properly unshare shared tree blocks with such implicit
references.

What do you think?

Alex.


On Sat, Oct 26, 2013 at 10:49 PM, Alex Lyakas
alex.bt...@zadarastorage.com wrote:
 Hi Anand,

 1) so let's say I have a subvolume and a snapshot of this subvolume.
 So in this case, I will see Sole space = 0 for both of them,
 correct? Because all extents (except inline ones) are shared.

 2) How is this in terms of responsiveness? On a huge subvolume, we
 need to iterate all the EXTENT_DATAs and then lookup their
 EXTENT_ITEMs.

 3) So it's kind of poor man's replacement for quota groups, correct?

 I like that it's so easy to check for exclusive data, though:)

 Alex.


 On Fri, Sep 13, 2013 at 6:44 PM, Wang Shilong wangshilong1...@gmail.com 
 wrote:

 Hello Anand,

 (This patch is for review and comments only)

  This patch provides a way to know how much space can be
  relinquished if/when a subvol/snapshot is deleted.  With
  this, a sysadmin can make better judgments in managing the
  filesystem when the fs is near full.


  I think this is really *helpful* since users cannot really know how much
  space (exclusive) is in a subvolume.

 Thanks,
 Wang

 as shown below the parameter 'sole space' indicates the size
 which is freed when subvol is deleted. (any other better
 term for this?, pls suggest).
 -
 btrfs su show /btrfs/sv1
 /btrfs/sv1
   Name:   sv1
   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
   Parent uuid:-
   Creation time:  2013-09-13 18:17:32
   Object ID:  257
   Generation (Gen):   18
   Gen at creation:17
   Parent: 5
   Top Level:  5
   Flags:  -
   Sole space: 1.56MiB 
   Snapshot(s):

 btrfs su snap /btrfs/sv1 /btrfs/ss2
 Create a snapshot of '/btrfs/sv1' in '/btrfs/ss2'

 btrfs su show /btrfs/sv1
 /btrfs/sv1
   Name:   sv1
   uuid:   b078ba48-d4a5-2f49-ac03-9bd1d56cc768
   Parent uuid:-
   Creation time:  2013-09-13 18:17:32
   Object ID:  257
   Generation (Gen):   19
   Gen at creation:17
   Parent: 5
   Top Level:  5
   Flags:  -
   Sole space: 0.00  -
   Snapshot(s):
   ss2
 -

 Signed-off-by: Anand Jain anand.j...@oracle.com
 ---
 cmds-subvolume.c |   5 ++
 utils.c  | 154 
 +++
 utils.h  |   1 +
 3 files changed, 160 insertions(+)

 diff --git a/cmds-subvolume.c b/cmds-subvolume.c
 index de246ab..2b02d66 100644
 --- a/cmds-subvolume.c
 +++ b/cmds-subvolume.c
 @@ -809,6 +809,7 @@ static int cmd_subvol_show(int argc, char **argv)
   int fd = -1, mntfd = -1;
   int ret = 1;
   DIR *dirstream1 = NULL, *dirstream2 = NULL;
 + u64 freeable_bytes;

   if (check_argc_exact(argc, 2))
   usage(cmd_subvol_show_usage);
 @@ -878,6 +879,8 @@ static int cmd_subvol_show(int argc, char **argv)
   goto out;
   }

 + freeable_bytes = get_subvol_freeable_bytes(fd);
 +
   ret = 0;
   /* print the info */
    printf("%s\n", fullpath);
  @@ -915,6 +918,8 @@ static int cmd_subvol_show(int argc, char **argv)
    else
    printf("\tFlags: \t\t\t-\n");
 
  + printf("\tSole space: \t\t%s\n",
  + pretty_size(freeable_bytes));
    /* print the snapshots of the given subvol if any */
    printf("\tSnapshot(s):\n");
   filter_set = btrfs_list_alloc_filter_set();
 diff --git a/utils.c b/utils.c
 index 22c3310..f01d580 100644
 --- a/utils.c
 +++ b/utils.c
 @@ -2019,3 +2019,157 @@ int is_dev_excl_op_free(int fd)
   ret = ioctl(fd, BTRFS_IOC_CHECK_DEV_EXCL_OPS, NULL);
   return ret  0 ? ret : -errno;
 }
 +
 +/* gets the ref count for given extent
 + * 0 = didn't find the item
 + * n = number of references
 +*/
 +u64 get_extent_refcnt(int fd, u64 disk_blk)
 +{
 + int ret = 0, i, e;
 + struct btrfs_ioctl_search_args args;
  + struct btrfs_ioctl_search_key *sk = &args.key;
 + struct btrfs_ioctl_search_header sh

Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv

2013-10-23 Thread Alex Lyakas
Hello,

On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana fdman...@gmail.com wrote:
 On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hi Filipe,


 On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana
 fdman...@gmail.com wrote:

 This issue is simple to reproduce and observe if kmemleak is enabled.
 Two simple ways to reproduce it:

 ** 1

 $ mkfs.btrfs -f /dev/loop0
 $ mount /dev/loop0 /mnt/btrfs
 $ btrfs balance start /mnt/btrfs
 $ umount /mnt/btrfs

So here it seems that the leak can only happen in case the block-group
has a free-space inode. This is what the orphan item is added for.
Yes, here kmemleak reports.
But: if the space_cache option is disabled (and nospace_cache enabled), it
seems that btrfs still creates the FREE_SPACE inodes, although they
are empty, because in cache_save_setup:

inode = lookup_free_space_inode(root, block_group, path);
if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
ret = PTR_ERR(inode);
btrfs_release_path(path);
goto out;
}

if (IS_ERR(inode)) {
...
ret = create_free_space_inode(root, trans, block_group, path);

and only later it actually sets BTRFS_DC_WRITTEN if space_cache option
is disabled. Amazing!
Although this is a different issue, do you know perhaps why these
empty inodes are needed?

Thanks!
Alex.




 ** 2

 $ mkfs.btrfs -f /dev/loop0
 $ mount /dev/loop0 /mnt/btrfs
 $ touch /mnt/btrfs/foobar
 $ rm -f /mnt/btrfs/foobar
 $ umount /mnt/btrfs


 I tried the second repro script on kernel 3.8.13, and kmemleak does
 not report a leak (even if I force the kmemleak scan). I did not try
 the balance-repro script, though. Am I missing something?

 Maybe it's not an issue on 3.8.13 and older releases.
 This was on btrfs-next from August 19.

 thanks for testing


 Thanks,
 Alex.




 After a while, kmemleak reports the leak:

 $ cat /sys/kernel/debug/kmemleak
 unreferenced object 0x880402b13e00 (size 128):
   comm btrfs, pid 19621, jiffies 4341648183 (age 70057.844s)
   hex dump (first 32 bytes):
 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de  .N..
   backtrace:
 [817275a6] kmemleak_alloc+0x26/0x50
 [8117832b] kmem_cache_alloc_trace+0xeb/0x1d0
 [a04db499] btrfs_alloc_block_rsv+0x39/0x70 [btrfs]
 [a04f8bad] btrfs_orphan_add+0x13d/0x1b0 [btrfs]
 [a04e2b13] btrfs_remove_block_group+0x143/0x500 [btrfs]
 [a0518158] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs]
 [a051bc27] btrfs_balance+0x8f7/0xe90 [btrfs]
 [a05240a0] btrfs_ioctl_balance+0x250/0x550 [btrfs]
 [a05269ca] btrfs_ioctl+0xdfa/0x25f0 [btrfs]
 [8119c936] do_vfs_ioctl+0x96/0x570
 [8119cea1] SyS_ioctl+0x91/0xb0
 [81750242] system_call_fastpath+0x16/0x1b
 [] 0x

 This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb
 (Btrfs: separate out tests into their own directory).

 Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
 ---

 V2: removed atomic_t member in struct btrfs_block_rsv, as suggested by
 Josef Bacik, and use instead the condition reserved == 0 to decide
 when to free the block.
 V3: simplified patch, just kfree() (and not btrfs_free_block_rsv) the
 root's orphan_block_rsv when free'ing the root. Thanks Josef for
 the suggestion.
 V4: use btrfs_free_block_rsv() instead of kfree(). The error I was getting
 in xfstests when using btrfs_free_block_rsv() was unrelated, Josef just
 pointed it to me (separate issue).
 V5: move the free call below the iput() call, so that btrfs_evict_node()
 can process the orphan_block_rsv first to do some needed cleanup before
 we free it.
 V6: free the root's orphan_block_rsv in close_ctree() too. After a balance
 the orphan_block_rsv of the tree of tree roots was being leaked, because
 free_fs_root() is only called for filesystem trees.

  fs/btrfs/disk-io.c |5 +
  1 file changed, 5 insertions(+)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 3b12c26..5d17163 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -3430,6 +3430,8 @@ static void free_fs_root(struct btrfs_root *root)
  {
 iput(root->cache_inode);
 WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
 +   btrfs_free_block_rsv(root, root->orphan_block_rsv);
 +   root->orphan_block_rsv = NULL;
 if (root->anon_dev)
 free_anon_bdev(root->anon_dev);
 free_extent_buffer(root->node);
 @@ -3582,6 +3584,9 @@ int close_ctree(struct btrfs_root *root)

 btrfs_free_stripe_hash_table(fs_info);

 +   btrfs_free_block_rsv(root, root->orphan_block_rsv);
 +   root->orphan_block_rsv = NULL;
 +
 return 0;
  }

 --
 1.7.9.5


Re: [PATCH] btrfs: commit transaction after deleting a subvolume

2013-10-20 Thread Alex Lyakas
 Thank you for addressing this, David.

On Sat, Aug 31, 2013 at 1:25 AM, David Sterba dste...@suse.cz wrote:
 Alex pointed out the consequences after a transaction is not committed
 when a subvolume is deleted, so in case of a crash before an actual
 commit happens will let the subvolume reappear.

 Original post:
 http://www.spinics.net/lists/linux-btrfs/msg22088.html

 Josef's objections:
 http://www.spinics.net/lists/linux-btrfs/msg22256.html

 While there's no need to do a full commit for regular files, a subvolume
 may get a different treatment.

 http://www.spinics.net/lists/linux-btrfs/msg23087.html:

 That a subvol/snapshot may appear after a crash if the transaction commit
 did not happen does not feel so good. We know that the subvol is only
 scheduled for deletion and needs to be processed by cleaner.

 From that point I'd rather see the commit to happen to avoid any
 unexpected surprises.  A subvolume that re-appears still holds the data
 references and consumes space although the user does not assume that.

 Automated snapshotting and deleting needs some guarantees about the
 behaviour and what to do after a crash. So now it has to process the
 backlog of previously deleted snapshots and verify that they're not
 there, compared to deleted - will never appear, can forget about it.
 

 There is a performance penalty incurred by the change, but deleting a
 subvolume is not a frequent operation and the tradeoff seems justified
 by getting the guarantee stated above.

 CC: Alex Lyakas alex.bt...@zadarastorage.com
 CC: Josef Bacik jba...@fusionio.com
 Signed-off-by: David Sterba dste...@suse.cz
 ---
  fs/btrfs/ioctl.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index e407f75..4394632 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -2268,7 +2268,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
  out_end_trans:
 trans->block_rsv = NULL;
 trans->bytes_reserved = 0;
 -   ret = btrfs_end_transaction(trans, root);
 +   ret = btrfs_commit_transaction(trans, root);
 if (ret && !err)
 err = ret;
 inode->i_flags |= S_DEAD;
 --
 1.7.9



Re: [PATCH 2/2] Btrfs: stop caching thread if extent_commit_sem is contended

2013-10-17 Thread Alex Lyakas
Thanks for addressing this issue, Josef!

On Thu, Sep 26, 2013 at 4:26 PM, Josef Bacik jba...@fusionio.com wrote:
 We can starve out the transaction commit with a bunch of caching threads all
 running at the same time.  This is because we will only drop the
 extent_commit_sem if we need_resched(), which isn't likely to happen since we
 will be reading a lot from the disk so have already schedule()'ed plenty.
 Alex observed that he could starve out a transaction commit for up to a
 minute with 32 caching threads all running at once.  This will allow us to
 drop the extent_commit_sem to allow the transaction commit to swap the
 commit_root out and then all the cachers will start back up. Here is an
 explanation provided by Ingo:

 So, just to fill in what happens in this loop:

 mutex_unlock(&caching_ctl->mutex);
 cond_resched();
 goto again;

 where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem
 again:

 again:
 mutex_lock(&caching_ctl->mutex);
 /* need to make sure the commit_root doesn't disappear */
 down_read(&fs_info->extent_commit_sem);

 So, if I'm reading the code correctly, there can be a fair amount of
 concurrency here: there may be multiple 'caching kthreads' per filesystem
 active, while there's one fs_info->extent_commit_sem per filesystem
 AFAICS.

 So, what happens if there are a lot of CPUs all busy holding the
 ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all
 rush to try to release the fs_info->extent_commit_sem, and they'd block in
 the down_read() because there's a writer waiting.

 So there's a guarantee of forward progress. This should answer akpm's
 concern I think.

 Thanks,

 Acked-by: Ingo Molnar mi...@kernel.org
 Signed-off-by: Josef Bacik jba...@fusionio.com
 ---
  fs/btrfs/extent-tree.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index cfb3cf7..cc074c34 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -442,7 +442,8 @@ next:
 if (ret)
 break;

 -   if (need_resched()) {
 +   if (need_resched() ||
 +       rwsem_is_contended(&fs_info->extent_commit_sem)) {
 caching_ctl->progress = last;
 btrfs_release_path(path);
 up_read(&fs_info->extent_commit_sem);
 --
 1.8.3.1



[PATCH] Notify caching_thread()s to give up on extent_commit_sem when needed.

2013-08-29 Thread Alex Lyakas
caching_thread()s do all their work under read access to extent_commit_sem.
They give up on this read access only when need_resched() tells them, or
when they exit. As a result, somebody that wants WRITE access to this sem
might wait for a long time. This is especially problematic in
cache_block_group(), which can be called on critical paths like
find_free_extent() and in the commit path via commit_cowonly_roots().

This patch is an RFC, that attempts to fix this problem, by notifying the
caching threads to give up on extent_commit_sem.

On a system with a lot of metadata (~20Gb total metadata, ~10Gb extent tree),
with increased number of caching_threads, commits were very slow,
stuck in commit_cowonly_roots, due to this issue.
With this patch, commits no longer get stuck in commit_cowonly_roots.

This patch is not intended to be applied; it is just a request to comment on
whether you agree this problem happens, and whether the fix goes in the right
direction.

Signed-off-by: Alex Lyakas alex.bt...@zadarastorage.com
---
 fs/btrfs/ctree.h   |7 +++
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/extent-tree.c |9 +
 fs/btrfs/transaction.c |2 +-
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c90be01..b602611 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1427,6 +1427,13 @@ struct btrfs_fs_info {
 struct mutex ordered_extent_flush_mutex;

 struct rw_semaphore extent_commit_sem;
+/* notifies the readers to give up on the sem ASAP */
+atomic_t extent_commit_sem_give_up_read;
+#define BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info)  \
+do { atomic_inc(&(fs_info)->extent_commit_sem_give_up_read); \
+ down_write(&(fs_info)->extent_commit_sem);  \
+ atomic_dec(&(fs_info)->extent_commit_sem_give_up_read); \
+} while (0)

 struct rw_semaphore cleanup_work_sem;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 69e9afb..b88e688 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2291,6 +2291,7 @@ int open_ctree(struct super_block *sb,
 mutex_init(&fs_info->cleaner_mutex);
 mutex_init(&fs_info->volume_mutex);
 init_rwsem(&fs_info->extent_commit_sem);
+atomic_set(&fs_info->extent_commit_sem_give_up_read, 0);
 init_rwsem(&fs_info->cleanup_work_sem);
 init_rwsem(&fs_info->subvol_sem);
 sema_init(&fs_info->uuid_tree_rescan_sem, 1);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 95c6539..28fee78 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -442,7 +442,8 @@ next:
 if (ret)
 break;

-if (need_resched()) {
+if (need_resched() ||
+    atomic_read(&fs_info->extent_commit_sem_give_up_read) > 0) {
 caching_ctl->progress = last;
 btrfs_release_path(path);
 up_read(&fs_info->extent_commit_sem);
@@ -632,7 +633,7 @@ static int cache_block_group(struct
btrfs_block_group_cache *cache,
 return 0;
 }

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);
 atomic_inc(&caching_ctl->count);
 list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
 up_write(&fs_info->extent_commit_sem);
@@ -5462,7 +5463,7 @@ void btrfs_prepare_extent_commit(struct
btrfs_trans_handle *trans,
 struct btrfs_block_group_cache *cache;
 struct btrfs_space_info *space_info;

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);

 list_for_each_entry_safe(caching_ctl, next,
  &fs_info->caching_block_groups, list) {
@@ -8219,7 +8220,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 struct btrfs_caching_control *caching_ctl;
 struct rb_node *n;

-down_write(&info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(info);
 while (!list_empty(&info->caching_block_groups)) {
 caching_ctl = list_entry(info->caching_block_groups.next,
  struct btrfs_caching_control, list);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index cac4a3f..976d20a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -969,7 +969,7 @@ static noinline int commit_cowonly_roots(struct
btrfs_trans_handle *trans,
 return ret;
 }

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);
 switch_commit_root(fs_info->extent_root);
 up_write(&fs_info->extent_commit_sem);

-- 
1.7.9.5


Re: [PATCH] Notify caching_thread()s to give up on extent_commit_sem when needed.

2013-08-29 Thread Alex Lyakas
On Thu, Aug 29, 2013 at 10:55 PM, Josef Bacik jba...@fusionio.com wrote:
 On Thu, Aug 29, 2013 at 10:09:29PM +0300, Alex Lyakas wrote:
 Hi Josef,

 On Thu, Aug 29, 2013 at 5:38 PM, Josef Bacik jba...@fusionio.com wrote:
  On Thu, Aug 29, 2013 at 01:31:05PM +0300, Alex Lyakas wrote:
  caching_thread()s do all their work under read access to 
  extent_commit_sem.
  They give up on this read access only when need_resched() tells them, or
  when they exit. As a result, somebody that wants a WRITE access to this 
  sem,
  might wait for a long time. Especially this is problematic in
  cache_block_group(),
  which can be called on critical paths like find_free_extent() and in 
  commit
  path via commit_cowonly_roots().
 
  This patch is an RFC, that attempts to fix this problem, by notifying the
  caching threads to give up on extent_commit_sem.
 
  On a system with a lot of metadata (~20Gb total metadata, ~10Gb extent 
  tree),
  with increased number of caching_threads, commits were very slow,
  stuck in commit_cowonly_roots, due to this issue.
  With this patch, commits no longer get stuck in commit_cowonly_roots.
 
 
  But what kind of effect do you see on overall performance/runtime?  
  Honestly I'd
  expect we'd spend more of our time waiting for the caching kthread to fill 
  in
  free space so we can make allocations than waiting on this lock 
  contention.  I'd
  like to see real numbers here to see what kind of effect this patch has on 
  your
  workload.  (I don't doubt it makes a difference, I'm just curious to see 
  how big
  of a difference it makes.)

 Primarily for me it affects the commit thread right after mounting,
 when it spends time in the critical part of the commit, in which
 trans_no_join is set, i.e., it is not possible to start a new
 transaction. So all the new writers that want a transaction are
 delayed at this point.

 Here are some numbers (and some more logs are in the attached file).

 Filesystem has a good amount of metadata (btrfs-progs modified
 slightly to print exact byte values):
 root@dc:/home/zadara# btrfs fi df /btrfs/pool-0002/
 Data: total=846116945920(788.01GB), used=842106667008(784.27GB)
 System: total=4194304(4.00MB), used=94208(92.00KB)
 Metadata: total=31146901504(29.01GB), used=25248698368(23.51GB)

 original code, 2 caching workers, try 1
 Aug 29 13:41:22 dc kernel: [28381.203745] [17617][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_STARTED:439] FS[dm-119] txn[6627] COMMIT
 extwr:0 wr:1
 Aug 29 13:41:25 dc kernel: [28384.624838] [17617][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:519] FS[dm-119] txn[6627] COMMIT took
 3421 ms committers=1 open=0ms blocked=3188ms
 Aug 29 13:41:25 dc kernel: [28384.624846] [17617][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:524] FS[dm-119] txn[6627] roo:0 rdr1:0
 cbg:0 rdr2:0
 Aug 29 13:41:25 dc kernel: [28384.624850] [17617][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:529] FS[dm-119] txn[6627] wc:0 wpc:0
 wew:0 fps:0
 Aug 29 13:41:25 dc kernel: [28384.624854] [17617][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:534] -FS[dm-119] txn[6627] ww:0 cs:0
 rdi:0 rdr3:0
 Aug 29 13:41:25 dc kernel: [28384.624858] [17617][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:538] -FS[dm-119] txn[6627] cfr:0
 ccr:2088 pec:1099
 Aug 29 13:41:25 dc kernel: [28384.624862] [17617][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:541] FS[dm-119] txn[6627] wrw:230 wrs:1

 I have a breakdown of commit times here, to identify bottlenecks of
 the commit. Times are in ms.
 Names of phases are:

 roo - btrfs_run_ordered_operations
 rdr1 - btrfs_run_delayed_refs (call 1)
 cbg - btrfs_create_pending_block_groups
 rdr2 - btrfs_run_delayed_refs (call 2)
 wc - wait_for_commit (if was needed)
 wpc - wair for previous commit (if was needed)
 wew - wait for external writers to detach
 fps - flush_all_pending_stuffs
 ww - wait for all the other writers to detach
 cs - create_pending_snapshots
 rdi - btrfs_run_delayed_items
 rdr3 - btrfs_run_delayed_refs (call 3)
 cfr - commit_fs_roots
 ccr - commit_cowonly_roots
 pec - btrfs_prepare_extent_commit
 wrw - btrfs_write_and_wait_transaction
 wrs - write_ctree_super

 Two lines marked as - are the critical part of the commit.


 original code, 2 caching workers, try 2
 Aug 29 13:43:30 dc kernel: [28508.683625] [22490][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_STARTED:439] FS[dm-119] txn[6630] COMMIT
 extwr:0 wr:1
 Aug 29 13:43:31 dc kernel: [28510.569269] [22490][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:519] FS[dm-119] txn[6630] COMMIT took
 1885 ms committers=1 open=0ms blocked=1550ms
 Aug 29 13:43:31 dc kernel: [28510.569276] [22490][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:524] FS[dm-119] txn[6630] roo:0 rdr1:0
 cbg:0 rdr2:0
 Aug 29 13:43:31 dc kernel: [28510.569281] [22490][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:529] FS[dm-119] txn[6630] wc:0 wpc:0
 wew:0 fps:0
 Aug 29 13:43:31 dc kernel: [28510.569285] [22490][tx]btrfs
 [ZBTRFS_TXN_COMMIT_PHASE_DONE:534] -FS[dm-119] txn[6630] ww:0 cs:0
 rdi:0 rdr3:0
 Aug 29 13:43:31 dc kernel: [28510.569288] [22490][tx]btrfs

Re: btrfs:async-thread: atomic_start_pending=1 is set, but it's too late

2013-08-29 Thread Alex Lyakas
Thanks, Chris, Josef, for confirming!

On Thu, Aug 29, 2013 at 11:08 PM, Chris Mason clma...@fusionio.com wrote:
 Quoting Josef Bacik (2013-08-29 16:03:06)
 On Mon, Aug 26, 2013 at 05:16:42PM +0300, Alex Lyakas wrote:
  Greetings all,
   I see the following issue with spawning new threads for btrfs_workers
  that have atomic_worker_start set:
 
  # I have BTRFS that has 24Gb of total metadata, out of which extent
  tree takes 11Gb; space_cache option is not used.
  # After mouting, cache_block_group() triggers ~250 work items to
  cache-in the needed block groups.
  # At this point, fs_info-caching_workers has one thread, which is
  considered idle.
  # Work items start to add to this thread's pending list, until this
  thread becomes considered busy.
  # Now workers-atomic_worker_start is set, but
  check_pending_worker_creates() has not run yet (it is called only from
  worker_loop), so the same single thread is picked as fallback.
 
  The problem is that this thread is still running the caching_thread
  function, scanning for EXTENT_ITEMs of the first block-group. This
   takes 3-4 seconds for a 1Gb block group.
 
  # Once caching_thread() exits, check_pending_worker_creates() is
  called, and creates the second thread, but it's too late, because all
  the 250 work items are already sitting in the first thread's pending
  list. So the  second thread doesn't help at all.
 
  As a result, all block-group caching is performed by the same thread,
  which, due to one-by-one scanning of EXTENT_ITEMs, takes forever for
  this BTRFS.
 
  How this can be fixed?
  - can perhaps check_pending_worker_creates() be called out of
  worker_loop, e.g., by find_worker()? Instead of just setting
  workers-atomic_start_pending?
  - maybe for fs_info-caching_workers we don't have to create new
  workers asynchronously, so we can pass NULL for async_helper in
  btrfs_init_workers()? (probably we have to, just checking)

 So I looked at this, and I'm pretty sure we have an async_helper just 
 because of
 copy+paste.  Hey I want a new async group, let me copy this other one and
 change the name!  So yes let's just pass NULL here.  In fact the only cases
 that we should be using an async helper is for super critical areas, so I'm
 pretty sure _most_ of the cases that specify an async helper don't need to.
 Chris is this correct, or am I missing something?  Thanks,

 No, I think we can just turn off the async start here.

 -chris



Re: [PATCH] Btrfs: stop all workers before cleaning up roots

2013-08-01 Thread Alex Lyakas
Hi Josef,

On Thu, May 30, 2013 at 11:58 PM, Josef Bacik jba...@fusionio.com wrote:
 Dave reported a panic because the extent_root->commit_root was NULL in the
 caching kthread.  That is because we just unset it in free_root_pointers,
 which is not the correct thing to do; we have to either wait for the caching
 kthread to complete or hold the extent_commit_sem lock so we know the thread
 has exited.  This patch makes the kthreads all stop first and then we do our
 cleanup.  This should fix the race.  Thanks,

 Reported-by: David Sterba dste...@suse.cz
 Signed-off-by: Josef Bacik jba...@fusionio.com
 ---
  fs/btrfs/disk-io.c |6 +++---
  1 files changed, 3 insertions(+), 3 deletions(-)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 2b53afd..77cb566 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -3547,13 +3547,13 @@ int close_ctree(struct btrfs_root *root)

 btrfs_free_block_groups(fs_info);

do you think it would be safer to stop all workers first and make sure
they are stopped, then do btrfs_free_block_groups()? I see, for
example, that btrfs_free_block_groups() checks:
if (block_group->cached == BTRFS_CACHE_STARTED)
which could perhaps be racy with other people spawning caching threads.

So maybe better to stop all threads (including cleaner and committer)
and then free everything?
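
For illustration, roughly the teardown order I have in mind (a sketch,
not a patch):

btrfs_stop_all_workers(fs_info);	/* no caching thread can run now */
btrfs_free_block_groups(fs_info);	/* safe: nobody is caching */
del_fs_roots(fs_info);
free_root_pointers(fs_info, 1);
iput(fs_info->btree_inode);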


 -   free_root_pointers(fs_info, 1);
 +   btrfs_stop_all_workers(fs_info);

 del_fs_roots(fs_info);

  -   iput(fs_info->btree_inode);
 +   free_root_pointers(fs_info, 1);

 -   btrfs_stop_all_workers(fs_info);
  +   iput(fs_info->btree_inode);

  #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 if (btrfs_test_opt(root, CHECK_INTEGRITY))
 --
 1.7.7.6


Alex.


Re: [PATCH] Btrfs: update drop progress before stopping snapshot dropping

2013-07-30 Thread Alex Lyakas
Thanks for posting that patch, Josef.

On Mon, Jul 15, 2013 at 6:59 PM, Josef Bacik jba...@fusionio.com wrote:

 Alex pointed out a problem and fix that exists in the "drop one snapshot
 at a time" patch.  If we decide we need to exit for whatever reason
 (umount, for example) we will just exit the snapshot dropping without
 updating the drop progress.  So the next time we go to resume we will
 BUG_ON() because we can't find the extent we left off at, because we never
 updated it.  This patch fixes the problem.

 Cc: sta...@vger.kernel.org
 Reported-by: Alex Lyakas alex.bt...@zadarastorage.com
 Signed-off-by: Josef Bacik jba...@fusionio.com
 ---
  fs/btrfs/extent-tree.c |   14 --
  1 files changed, 8 insertions(+), 6 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index bc00b24..8c204e1 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -7584,11 +7584,6 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);

 while (1) {
 -   if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
 -   pr_debug("btrfs: drop snapshot early exit\n");
 -   err = -EAGAIN;
 -   goto out_end_trans;
 -   }

 ret = walk_down_tree(trans, root, path, wc);
 if (ret < 0) {
 @@ -7616,7 +7611,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 }

 BUG_ON(wc->level == 0);
 -   if (btrfs_should_end_transaction(trans, tree_root)) {
 +   if (btrfs_should_end_transaction(trans, tree_root) ||
 +   (!for_reloc && btrfs_need_cleaner_sleep(root))) {
 ret = btrfs_update_root(trans, tree_root,
 &root->root_key,
 root_item);
 @@ -7627,6 +7623,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 }

 btrfs_end_transaction_throttle(trans, tree_root);
 +   if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
 +   pr_debug("btrfs: drop snapshot early exit\n");
 +   err = -EAGAIN;
 +   goto out_free;
 +   }
 +
 trans = btrfs_start_transaction(tree_root, 0);
 if (IS_ERR(trans)) {
 err = PTR_ERR(trans);
 --
 1.7.7.6



Re: [PATCH] Btrfs: fix all callers of read_tree_block

2013-07-30 Thread Alex Lyakas
Hi Josef,

On Tue, Apr 23, 2013 at 9:20 PM, Josef Bacik jba...@fusionio.com wrote:
 We kept leaking extent buffers when mounting a broken file system and it turns
 out it's because not everybody uses read_tree_block properly.  You need to 
 check
 and make sure the extent_buffer is uptodate before you use it.  This patch 
 fixes
 everybody who calls read_tree_block directly to make sure they check that it 
 is
 uptodate and free it and return an error if it is not.  With this we no longer
 leak EB's when things go horribly wrong.  Thanks,

 Signed-off-by: Josef Bacik jba...@fusionio.com
 ---
  fs/btrfs/backref.c |   10 --
  fs/btrfs/ctree.c   |   21 -
  fs/btrfs/disk-io.c |   19 +--
  fs/btrfs/extent-tree.c |4 +++-
  fs/btrfs/relocation.c  |   18 +++---
  5 files changed, 59 insertions(+), 13 deletions(-)

 diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
 index 23e927b..04b5b30 100644
 --- a/fs/btrfs/backref.c
 +++ b/fs/btrfs/backref.c
 @@ -423,7 +423,10 @@ static int __add_missing_keys(struct btrfs_fs_info *fs_info,
 BUG_ON(!ref->wanted_disk_byte);
 eb = read_tree_block(fs_info->tree_root, ref->wanted_disk_byte,
  fs_info->tree_root->leafsize, 0);
 -   BUG_ON(!eb);
 +   if (!eb || !extent_buffer_uptodate(eb)) {
 +   free_extent_buffer(eb);
 +   return -EIO;
 +   }
 btrfs_tree_read_lock(eb);
 if (btrfs_header_level(eb) == 0)
 btrfs_item_key_to_cpu(eb, &ref->key_for_search, 0);
 @@ -913,7 +916,10 @@ again:
 info_level);
 eb = read_tree_block(fs_info->extent_root, ref->parent, bsz, 0);
 -   BUG_ON(!eb);
 +   if (!eb || !extent_buffer_uptodate(eb)) {
 +   free_extent_buffer(eb);
 +   return -EIO;
 +   }
 ret = find_extent_in_eb(eb, bytenr, *extent_item_pos, &eie);
 ref->inode_list = eie;
 diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
 index 566d99b..2bc3440 100644
 --- a/fs/btrfs/ctree.c
 +++ b/fs/btrfs/ctree.c
 @@ -1281,7 +1281,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq)
 free_extent_buffer(eb_root);
 blocksize = btrfs_level_size(root, old_root->level);
 old = read_tree_block(root, logical, blocksize, 0);
 -   if (!old) {
 +   if (!old || !extent_buffer_uptodate(old)) {
 +   free_extent_buffer(old);
 pr_warn("btrfs: failed to read tree block %llu from get_old_root\n",
 logical);
 WARN_ON(1);
 @@ -1526,8 +1527,10 @@ int btrfs_realloc_node(struct btrfs_trans_handle 
 *trans,
 if (!cur) {
 cur = read_tree_block(root, blocknr,
  blocksize, gen);
 -   if (!cur)
 +   if (!cur || !extent_buffer_uptodate(cur)) {
 +   free_extent_buffer(cur);
 return -EIO;
 +   }
 } else if (!uptodate) {
 err = btrfs_read_buffer(cur, gen);
 if (err) {
 @@ -1692,6 +1695,8 @@ static noinline struct extent_buffer 
 *read_node_slot(struct btrfs_root *root,
struct extent_buffer *parent, int slot)
  {
 int level = btrfs_header_level(parent);
 +   struct extent_buffer *eb;
 +
 if (slot < 0)
 return NULL;
 if (slot >= btrfs_header_nritems(parent))
 @@ -1699,9 +1704,15 @@ static noinline struct extent_buffer 
 *read_node_slot(struct btrfs_root *root,

 BUG_ON(level == 0);

 -   return read_tree_block(root, btrfs_node_blockptr(parent, slot),
 -  btrfs_level_size(root, level - 1),
 -  btrfs_node_ptr_generation(parent, slot));
 +   eb = read_tree_block(root, btrfs_node_blockptr(parent, slot),
 +btrfs_level_size(root, level - 1),
 +btrfs_node_ptr_generation(parent, slot));
 +   if (eb && !extent_buffer_uptodate(eb)) {
 +   free_extent_buffer(eb);
 +   eb = NULL;
 +   }
 +
 +   return eb;
  }

  /*
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index fb0e5c2..4605cc7 100644
 --- 

Re: [PATCH] Btrfs: fix lock leak when resuming snapshot deletion

2013-07-16 Thread Alex Lyakas
On Mon, Jul 15, 2013 at 7:43 PM, Josef Bacik jba...@fusionio.com wrote:
 We aren't setting path->locks[level] when we resume a snapshot deletion which
 means we won't unlock the buffer when we free the path.  This causes deadlocks
 if we happen to re-allocate the block before we've evicted the extent buffer
 from cache.  Thanks,

 Reported-by: Alex Lyakas alex.bt...@zadarastorage.com
 Signed-off-by: Josef Bacik jba...@fusionio.com
 ---
  fs/btrfs/extent-tree.c |2 ++
  1 files changed, 2 insertions(+), 0 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 8c204e1..997a5dd 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -7555,6 +7555,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 while (1) {
 btrfs_tree_lock(path->nodes[level]);
 btrfs_set_lock_blocking(path->nodes[level]);
 +   path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;

 ret = btrfs_lookup_extent_info(trans, root,
 path->nodes[level]->start,
 @@ -7570,6 +7571,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 break;

 btrfs_tree_unlock(path->nodes[level]);
 +   path->locks[level] = 0;
 WARN_ON(wc->refs[level] != 1);
 level--;
 }
 --
 1.7.7.6

 --

Tested-by: Liran Strugano li...@zadarastorage.com



Re: [PATCH v3] btrfs: clean snapshots one by one

2013-07-14 Thread Alex Lyakas
Hi,

On Thu, Jul 4, 2013 at 10:52 PM, Alex Lyakas
alex.bt...@zadarastorage.com wrote:
 Hi David,

 On Thu, Jul 4, 2013 at 8:03 PM, David Sterba dste...@suse.cz wrote:
 On Thu, Jul 04, 2013 at 06:29:23PM +0300, Alex Lyakas wrote:
  @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
  wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
 
  while (1) {
  +   if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
  +   pr_debug("btrfs: drop snapshot early exit\n");
  +   err = -EAGAIN;
  +   goto out_end_trans;
  +   }
 Here you exit the loop, but the drop_progress in the root item is
 incorrect. When the system is remounted, and snapshot deletion
 resumes, it seems that it tries to resume from the EXTENT_ITEM that
 does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
 simply does not find the needed extent.
 So then I hit panic in walk_down_tree():
 BUG: wc->refs[level - 1] == 0

 I fixed it like follows:
 There is a place where btrfs_drop_snapshot() checks if it needs to
 detach from transaction and re-attach. So I moved the exit point there
 and the code is like this:

   if (btrfs_should_end_transaction(trans, tree_root) ||
       (!for_reloc && btrfs_need_cleaner_sleep(root))) {
           ret = btrfs_update_root(trans, tree_root,
                                   &root->root_key,
                                   root_item);
           if (ret) {
                   btrfs_abort_transaction(trans, tree_root, ret);
                   err = ret;
                   goto out_end_trans;
           }

           btrfs_end_transaction_throttle(trans, tree_root);
           if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
                   err = -EAGAIN;
                   goto out_free;
           }
           trans = btrfs_start_transaction(tree_root, 0);
 ...

 With this fix, I do not hit the panic, and snapshot deletion proceeds
 and completes alright after mount.

  Do you agree with my analysis, or am I missing something? It seems that
  Josef's btrfs-next still has this issue (as does Chris's for-linus).

 Sound analysis and I agree with the fix. The clean-by-one patch has been
 merged into 3.10 so we need a stable fix for that.
 Thanks for confirming, David!

 From more testing, I have two more notes:

 # After applying the fix, whenever snapshot deletion is resumed after
 mount and successfully completes, and I then unmount again and rmmod
 btrfs, Linux complains about losing a few struct extent_buffer objects
 during kmem_cache_destroy().
 So somewhere on that path:
 if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
 ...
 } else {
 === HERE

 and later we perhaps somehow overwrite the contents of struct
 btrfs_path that is used in the whole function. Because at the end of
 the function we always do btrfs_free_path(), which inside does
 btrfs_release_path(). I was not able to determine where the leak
 happens; do you have any hint? No other activity happens in the system
 except the resumed snap deletion, and this problem only happens when
 resuming.

I found where the memory leak happens. When we abort snapshot deletion
in the middle, this btrfs_root is basically left hanging in
the air. It is already off the dead_roots list, so when del_fs_roots()
is called during unmount, it will not free this root and its
root->node (which is the one that triggers the memory leak warning on
kmem_cache_destroy) and perhaps other stuff too. The issue still
exists in btrfs-next.

Simplest fix I came up with was:

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d275681..52a2c54 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7468,6 +7468,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
int err = 0;
int ret;
int level;
+   bool root_freed = false;

path = btrfs_alloc_path();
if (!path) {
@@ -7641,6 +7642,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 free_extent_buffer(root->commit_root);
btrfs_put_fs_root(root);
}
+   root_freed = true;
+
 out_end_trans:
btrfs_end_transaction_throttle(trans, tree_root);
 out_free:
@@ -7649,6 +7652,18 @@ out_free:
 out:
if (err)
 btrfs_std_error(root->fs_info, err);
+
+   /*
+* If the root was not freed for any reason, this means that the FS had
+* a problem and will probably be unmounted soon.
+* But we need to put the root back into the 'dead_roots' list,
+* so that it will be properly freed during unmount.
+*/
+   if (!root_freed) {
+   WARN_ON(err == 0);
+   btrfs_add_dead_root(root);
+   }
+
return err;

Re: [PATCH v3] btrfs: clean snapshots one by one

2013-07-04 Thread Alex Lyakas
Hi David,
I believe this patch has the following problem:

On Tue, Mar 12, 2013 at 5:13 PM, David Sterba dste...@suse.cz wrote:
 Each time pick one dead root from the list and let the caller know if
 it's needed to continue. This should improve responsiveness during
 umount and balance which at some point waits for cleaning all currently
 queued dead roots.

 A new dead root is added to the end of the list, so the snapshots
 disappear in the order of deletion.

 The snapshot cleaning work is now done only from the cleaner thread and the
 others wake it if needed.

 Signed-off-by: David Sterba dste...@suse.cz
 ---

 v1,v2:
 * http://thread.gmane.org/gmane.comp.file-systems.btrfs/23212

 v2-v3:
 * remove run_again from btrfs_clean_one_deleted_snapshot and return 1
   unconditionally

  fs/btrfs/disk-io.c |   10 ++--
  fs/btrfs/extent-tree.c |8 ++
  fs/btrfs/relocation.c  |3 --
  fs/btrfs/transaction.c |   56 +++
  fs/btrfs/transaction.h |2 +-
  5 files changed, 53 insertions(+), 26 deletions(-)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 988b860..4de2351 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -1690,15 +1690,19 @@ static int cleaner_kthread(void *arg)
 struct btrfs_root *root = arg;

 do {
 +   int again = 0;
 +
 if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 +   down_read_trylock(&root->fs_info->sb->s_umount) &&
 mutex_trylock(&root->fs_info->cleaner_mutex)) {
 btrfs_run_delayed_iputs(root);
 -   btrfs_clean_old_snapshots(root);
 +   again = btrfs_clean_one_deleted_snapshot(root);
 mutex_unlock(&root->fs_info->cleaner_mutex);
 btrfs_run_defrag_inodes(root->fs_info);
 +   up_read(&root->fs_info->sb->s_umount);
 }

 -   if (!try_to_freeze()) {
 +   if (!try_to_freeze() && !again) {
 set_current_state(TASK_INTERRUPTIBLE);
 if (!kthread_should_stop())
 schedule();
 @@ -3403,8 +3407,8 @@ int btrfs_commit_super(struct btrfs_root *root)

 mutex_lock(&root->fs_info->cleaner_mutex);
 btrfs_run_delayed_iputs(root);
 -   btrfs_clean_old_snapshots(root);
 mutex_unlock(&root->fs_info->cleaner_mutex);
 +   wake_up_process(root->fs_info->cleaner_kthread);

 /* wait until ongoing cleanup work done */
 down_write(&root->fs_info->cleanup_work_sem);
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 742b7a7..a08d0fe 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -7263,6 +7263,8 @@ static noinline int walk_up_tree(struct 
 btrfs_trans_handle *trans,
   * reference count by one. if update_ref is true, this function
   * also make sure backrefs for the shared block and all lower level
   * blocks are properly updated.
 + *
 + * If called with for_reloc == 0, may exit early with -EAGAIN
   */
  int btrfs_drop_snapshot(struct btrfs_root *root,
  struct btrfs_block_rsv *block_rsv, int update_ref,
 @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);

 while (1) {
 +   if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
 +   pr_debug("btrfs: drop snapshot early exit\n");
 +   err = -EAGAIN;
 +   goto out_end_trans;
 +   }
Here you exit the loop, but the drop_progress in the root item is
incorrect. When the system is remounted, and snapshot deletion
resumes, it seems that it tries to resume from the EXTENT_ITEM that
does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
simply does not find the needed extent.
So then I hit panic in walk_down_tree():
BUG: wc->refs[level - 1] == 0

I fixed it like follows:
There is a place where btrfs_drop_snapshot() checks if it needs to
detach from transaction and re-attach. So I moved the exit point there
and the code is like this:

if (btrfs_should_end_transaction(trans, tree_root) ||
    (!for_reloc && btrfs_need_cleaner_sleep(root))) {
        ret = btrfs_update_root(trans, tree_root,
                                &root->root_key,
                                root_item);
        if (ret) {
                btrfs_abort_transaction(trans, tree_root, ret);
                err = ret;
                goto out_end_trans;
        }

        btrfs_end_transaction_throttle(trans, tree_root);
        if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
                err = -EAGAIN;
   

Re: [PATCH v3] btrfs: clean snapshots one by one

2013-07-04 Thread Alex Lyakas
Hi David,

On Thu, Jul 4, 2013 at 8:03 PM, David Sterba dste...@suse.cz wrote:
 On Thu, Jul 04, 2013 at 06:29:23PM +0300, Alex Lyakas wrote:
  @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
  wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
 
  while (1) {
  +   if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
  +   pr_debug("btrfs: drop snapshot early exit\n");
  +   err = -EAGAIN;
  +   goto out_end_trans;
  +   }
 Here you exit the loop, but the drop_progress in the root item is
 incorrect. When the system is remounted, and snapshot deletion
 resumes, it seems that it tries to resume from the EXTENT_ITEM that
 does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
 simply does not find the needed extent.
 So then I hit panic in walk_down_tree():
 BUG: wc->refs[level - 1] == 0

 I fixed it like follows:
 There is a place where btrfs_drop_snapshot() checks if it needs to
 detach from transaction and re-attach. So I moved the exit point there
 and the code is like this:

   if (btrfs_should_end_transaction(trans, tree_root) ||
       (!for_reloc && btrfs_need_cleaner_sleep(root))) {
           ret = btrfs_update_root(trans, tree_root,
                                   &root->root_key,
                                   root_item);
           if (ret) {
                   btrfs_abort_transaction(trans, tree_root, ret);
                   err = ret;
                   goto out_end_trans;
           }

           btrfs_end_transaction_throttle(trans, tree_root);
           if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
                   err = -EAGAIN;
                   goto out_free;
           }
           trans = btrfs_start_transaction(tree_root, 0);
 ...

 With this fix, I do not hit the panic, and snapshot deletion proceeds
 and completes alright after mount.

  Do you agree with my analysis, or am I missing something? It seems that
  Josef's btrfs-next still has this issue (as does Chris's for-linus).

 Sound analysis and I agree with the fix. The clean-by-one patch has been
 merged into 3.10 so we need a stable fix for that.
Thanks for confirming, David!

From more testing, I have two more notes:

# After applying the fix, whenever snapshot deletion is resumed after
mount and successfully completes, and I then unmount again and rmmod
btrfs, Linux complains about losing a few struct extent_buffer objects
during kmem_cache_destroy().
So somewhere on that path:
if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
...
} else {
=== HERE

and later we perhaps somehow overwrite the contents of struct
btrfs_path that is used in the whole function. Because at the end of
the function we always do btrfs_free_path(), which inside does
btrfs_release_path(). I was not able to determine where the leak
happens; do you have any hint? No other activity happens in the system
except the resumed snap deletion, and this problem only happens when
resuming.

# This is for Josef: after I unmount the fs with an ongoing snap deletion
(after applying my fix) and run the latest btrfsck, it complains a
lot about problems in the extent tree :( But after I mount again, snap
deletion resumes and then completes; then I unmount, and btrfsck is happy
again. So probably it does not account for orphan roots properly?

David, will you provide a fixed patch, if possible?

Thanks!
Alex.


 thanks,
 david


Re: question about transaction-abort-related commits

2013-07-02 Thread Alex Lyakas
On Sun, Jun 30, 2013 at 2:36 PM, Josef Bacik jba...@fusionio.com wrote:
 On Sun, Jun 30, 2013 at 11:29:14AM +0300, Alex Lyakas wrote:
 Hi Josef,

 On Wed, Jun 26, 2013 at 5:16 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
  Hi Josef,
  Can you please help me with another question.
 
  I am looking at your patch:
  Btrfs: fix chunk allocation error handling
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0448748849ef7c593be40e2c1404f7974bd3aac6
 
  Here you changed the order of btrfs_make_block_group() vs
  btrfs_alloc_dev_extent(), because we could have allocated from the
  in-memory block group, before we have inserted the dev extent into a
  tree. However, with this fix, I hit the deadlock[1] of
  btrfs_alloc_dev_extent() that also wants to allocate a chunk and
  recursively calls do_chunk_alloc, but then is stuck on chunk_mutex.
 
  Was this patch:
  Btrfs: don't re-enter when allocating a chunk
  https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c6b305a89b1903d63652691ad5eb9f05aa0326b8
  introduced to fix this deadlock?

 With these two patches (Btrfs: fix chunk allocation error handling
 and Btrfs: don't re-enter when allocating a chunk), I am hitting
 ENOSPC during metadata chunk allocation.

 Upon entry into do_chunk_alloc, I have only one METADATA block-group
 as follows:
 total_bytes=8388608
 bytes_used=7938048
 bytes_pinned=446464
 bytes_reserved=4096
 bytes_readonly=0
 bytes_may_use=3362816

 As we see, bytes_used + bytes_pinned + bytes_reserved == total_bytes (7938048 + 446464 + 4096 = 8388608)

 What happens next is that within __btrfs_alloc_chunk():
 - find_free_dev_extent() finds a free extent (metadata policy is SINGLE)
 - btrfs_alloc_dev_extent() fails with ENOSPC

 (btrfs_make_block_group() is called after btrfs_alloc_dev_extent()
 with these patches).

 What should be done in such a situation, when there is not enough
 METADATA to allocate a device extent item, but we still don't allow
 allocating from the newly-created METADATA block group?


 So I had a third patch that you are likely missing that makes sure we try and
 allocate chunks sooner specifically for this case

 96f1bb57771f71bf1d55d5031a1cf47908494330

 and then Miao made it better I think with this

 3c76cd84e0c0d3ceb094a1020f8c55c2417e18d3

 Thanks,

 Josef

Thank you Josef, I didn't realize that.

Alex.


Re: question about transaction-abort-related commits

2013-06-30 Thread Alex Lyakas
Hi Josef,

On Wed, Jun 26, 2013 at 5:16 PM, Alex Lyakas
alex.bt...@zadarastorage.com wrote:
 Hi Josef,
 Can you please help me with another question.

 I am looking at your patch:
 Btrfs: fix chunk allocation error handling
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0448748849ef7c593be40e2c1404f7974bd3aac6

 Here you changed the order of btrfs_make_block_group() vs
 btrfs_alloc_dev_extent(), because we could have allocated from the
 in-memory block group, before we have inserted the dev extent into a
 tree. However, with this fix, I hit the deadlock[1] of
 btrfs_alloc_dev_extent() that also wants to allocate a chunk and
 recursively calls do_chunk_alloc, but then is stuck on chunk_mutex.

 Was this patch:
 Btrfs: don't re-enter when allocating a chunk
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c6b305a89b1903d63652691ad5eb9f05aa0326b8
 introduced to fix this deadlock?

With these two patches (Btrfs: fix chunk allocation error handling
and Btrfs: don't re-enter when allocating a chunk), I am hitting
ENOSPC during metadata chunk allocation.

Upon entry into do_chunk_alloc, I have only one METADATA block-group
as follows:
total_bytes=8388608
bytes_used=7938048
bytes_pinned=446464
bytes_reserved=4096
bytes_readonly=0
bytes_may_use=3362816

As we see, bytes_used + bytes_pinned + bytes_reserved == total_bytes (7938048 + 446464 + 4096 = 8388608)

What happens next is that within __btrfs_alloc_chunk():
- find_free_dev_extent() finds a free extent (metadata policy is SINGLE)
- btrfs_alloc_dev_extent() fails with ENOSPC

(btrfs_make_block_group() is called after btrfs_alloc_dev_extent()
with these patches).

What should be done in such a situation, when there is not enough
METADATA to allocate a device extent item, but we still don't allow
allocating from the newly-created METADATA block group?

Thanks,
Alex.





 Thanks,
 Alex.

 [1]
 [a044e57d] do_chunk_alloc+0x8d/0x510 [btrfs]
 [a04532ad] find_free_extent+0x9cd/0xb90 [btrfs]
 [a0453510] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
 [a0453bc9] btrfs_alloc_free_block+0xf9/0x570 [btrfs]
 [a043d9e6] __btrfs_cow_block+0x126/0x500 [btrfs]
 [a043dfba] btrfs_cow_block+0x17a/0x230 [btrfs]
 [a04425b1] btrfs_search_slot+0x381/0x820 [btrfs]
 [a044463c] btrfs_insert_empty_items+0x7c/0x120 [btrfs]
 [a048f31b] btrfs_alloc_dev_extent+0x9b/0x1c0 [btrfs]
 [a048f9ca] __btrfs_alloc_chunk+0x58a/0x850 [btrfs]
 [a049239f] btrfs_alloc_chunk+0xbf/0x160 [btrfs]
 [a044e81b] do_chunk_alloc+0x32b/0x510 [btrfs]
 [a04532ad] find_free_extent+0x9cd/0xb90 [btrfs]
 [a0453510] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
 [a0453bc9] btrfs_alloc_free_block+0xf9/0x570 [btrfs]
 [a043d9e6] __btrfs_cow_block+0x126/0x500 [btrfs]
 [a043dfba] btrfs_cow_block+0x17a/0x230 [btrfs]
 [a0441613] push_leaf_right+0x133/0x1a0 [btrfs]
 [a0441d51] split_leaf+0x5e1/0x770 [btrfs]
 [a04429b5] btrfs_search_slot+0x785/0x820 [btrfs]
 [a0449c0e] lookup_inline_extent_backref+0x8e/0x5b0 [btrfs]
 [a044a193] insert_inline_extent_backref+0x63/0x130 [btrfs]
 [a044abaf] __btrfs_inc_extent_ref+0x9f/0x240 [btrfs]
 [a0451841] run_clustered_refs+0x971/0xd00 [btrfs]
 [a0455db0] btrfs_run_delayed_refs+0xd0/0x330 [btrfs]
 [a0467a87] __btrfs_end_transaction+0xf7/0x440 [btrfs]
 [a0467e20] btrfs_end_transaction+0x10/0x20 [btrfs]




 On Mon, Jun 24, 2013 at 9:56 PM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:

 Thanks for commenting Josef. I hope your head will get better:)
 Actually, while re-looking at the code, I see that there are a couple of
 "goto cleanup;" statements before we ensure that all the writers have
 detached from the committing transaction. So Liu's code is still needed,
 it looks like.

 Thanks,
 Alex.

 On Mon, Jun 24, 2013 at 7:24 PM, Josef Bacik jba...@fusionio.com wrote:
  On Sun, Jun 23, 2013 at 09:52:14PM +0300, Alex Lyakas wrote:
  Hello Josef, Liu,
  I am reviewing commits in the mainline tree:
 
  e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't
  committing just end the transaction if we error out
  (call end_transaction() instead of goto cleanup_transaction) - Josef
 
  f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after
  aborting a transaction
  (wait until all writers detach, before setting running_transaction to
  NULL) - Liu
 
  66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on
  transaction waiting list
  (check if transaction was already removed from the transactions list) -
  Liu
 
  Josef, according to your fix, if the committer encounters a problem
  early, it just doesn't commit. Instead it aborts the transaction
  (possibly setting FS to read-only) and detaches from the transaction.
  So if this was the only committer (e.g., the transaction_kthread),
  then transaction commit will not happen at all. Is this what you
  intended? So

Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-06-26 Thread Alex Lyakas
Hi Miao,

On Mon, Jun 17, 2013 at 4:51 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Sun, 16 Jun 2013 13:38:42 +0300, Alex Lyakas wrote:
 Hi Miao,

 On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
 I reviewed the code starting from:
 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
 the transaction commit
 until
 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()

 It looks very good. Let me check if I understand the fix correctly:
 # When transaction starts to commit, we want to wait only for external
 writers (those that did ATTACH/START/USERSPACE)
 # We guarantee at this point that no new external writers will hop on
 the committing transaction, by setting -blocked state, so we only
 wait for existing extwriters to detach from transaction

 I have a doubt about this point with your new code. Example:
 Task1 - external writer
 Task2 - transaction kthread

 Task1                                      Task2
 |start_transaction(TRANS_START)            |
 |-wait_current_trans(blocked=0, no wait)   |
 |-join_transaction()                       |
 |--lock(trans_lock)                        |
 |--can_join_transaction() YES              |
 |                                          |-btrfs_commit_transaction()
 |                                          |--blocked=1
 |                                          |--in_commit=1
 |                                          |--wait_event(extwriter == 0)
 |                                          |--btrfs_flush_all_pending_stuffs()
 |--extwriter_counter_inc()                 |
 |--unlock(trans_lock)                      |
 |                                          |--lock(trans_lock)
 |                                          |--trans_no_join=1

 Basically, the blocked/in_commit check is not synchronized with
 joining a transaction. After checking blocked, the external writer
 may proceed and join the transaction. Right before joining, it calls
 can_join_transaction(). This function checks the in_commit flag under
 fs_info->trans_lock, but btrfs_commit_transaction() sets this flag not
 under trans_lock but under commit_lock, so checking this flag is not
 synchronized.

 Or maybe I am wrong, because btrfs_commit_transaction() locks and
 unlocks trans_lock to check for the previous transaction, so by accident
 there is no problem, and the above scenario cannot happen?

 Your analysis at the last section is right, so the right process is:

 Task1                                      Task2
 |start_transaction(TRANS_START)            |
 |-wait_current_trans(blocked=0, no wait)   |
 |-join_transaction()                       |
 |--lock(trans_lock)                        |
 |--can_join_transaction() YES              |
 |                                          |-btrfs_commit_transaction()
 |                                          |--blocked=1
 |                                          |--in_commit=1
 |--extwriter_counter_inc()                 |
 |--unlock(trans_lock)                      |
 |                                          |--lock(trans_lock)
 |                                          |--...
 |                                          |--unlock(trans_lock)
 |                                          |--...
 |                                          |--wait_event(extwriter == 0)
 |                                          |--btrfs_flush_all_pending_stuffs()

 The problem you worried about cannot happen.

 Anyway, it is not good that the blocked/in_commit check is not synchronized
 with joining a transaction, so I modified the relevant code in this patchset.


The four patches that we applied related to the extwriters issue work very
well. They definitely solve the non-deterministic behavior while
waiting for the writers to detach. Thanks for addressing this issue.
One note is that the new behavior is perhaps less friendly to the
transaction-join flow. With your fix, the committer unconditionally
sets trans_no_join and waits for the old writers to detach. At this
point, new joins will block. Previously, the committer waited for a
convenient spot in the join pattern, when all writers had detached
(although it was non-deterministic when that would happen). So
perhaps some compromise is possible - like waiting up to 10 seconds for
all writers to detach, and if they do not, going ahead and setting
trans_no_join anyway.
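
For illustration, a minimal sketch of such a compromise (the grace
period and the helper name are made up here; the fields are the ones
quoted elsewhere in this thread, and the real commit path is more
involved):

#define COMMIT_GRACE_JIFFIES (10 * HZ)  /* illustrative 10s grace period */

static void commit_wait_for_writers(struct btrfs_fs_info *fs_info,
                                    struct btrfs_transaction *cur_trans)
{
        /* Give the current writers up to 10s to detach on their own. */
        wait_event_timeout(cur_trans->writer_wait,
                           atomic_read(&cur_trans->num_writers) == 1,
                           COMMIT_GRACE_JIFFIES);

        /* Then block new joiners unconditionally and wait out the rest. */
        spin_lock(&fs_info->trans_lock);
        fs_info->trans_no_join = 1;
        spin_unlock(&fs_info->trans_lock);

        wait_event(cur_trans->writer_wait,
                   atomic_read(&cur_trans->num_writers) == 1);
}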

Thanks for your help!
Alex.


 Miao


Re: question about transaction-abort-related commits

2013-06-24 Thread Alex Lyakas
Thanks for commenting Josef. I hope your head will get better:)
Actually, while re-looking at the code, I see that there are a couple of
"goto cleanup;" statements before we ensure that all the writers have
detached from the committing transaction. So Liu's code is still needed,
it looks like.

Thanks,
Alex.

On Mon, Jun 24, 2013 at 7:24 PM, Josef Bacik jba...@fusionio.com wrote:
 On Sun, Jun 23, 2013 at 09:52:14PM +0300, Alex Lyakas wrote:
 Hello Josef, Liu,
 I am reviewing commits in the mainline tree:

 e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't
 committing just end the transaction if we error out
 (call end_transaction() instead of goto cleanup_transaction) - Josef

 f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after
 aborting a transaction
 (wait until all writers detach, before setting running_transaction to
 NULL) - Liu

 66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on
 transaction waiting list
 (check if transaction was already removed from the transactions list) - Liu

 Josef, according to your fix, if the committer encounters a problem
 early, it just doesn't commit. Instead it aborts the transaction
 (possibly setting FS to read-only) and detaches from the transaction.
 So if this was the only committer (e.g., the transaction_kthread),
 then transaction commit will not happen at all. Is this what you
 intended? So then the user will notice that FS went read-only, and she
 will unmount the FS, and transaction will be cleaned up in
 btrfs_error_commit_super()=>btrfs_cleanup_transaction(), and not in
 cleanup_transaction() via btrfs_commit_transaction(). Is my
 understanding correct?

 Liu, it looks like after having Josef's fix, the above two commits of
 yours are not strictly needed, right? Because Josef's fix ensures that
 only the real committer will call cleanup_transaction(), so at this
 point there is only one writer attached to the transaction, which is
 the committer itself (your fixes don't hurt, though). Is that
 correct?


 I've looked at the patches and I'm going to say yes with the caveat that I
 stopped thinking about it when my head started hurting :).  Thanks,

 Josef


question about transaction-abort-related commits

2013-06-23 Thread Alex Lyakas
Hello Josef, Liu,
I am reviewing commits in the mainline tree:

e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't
committing just end the transaction if we error out
(call end_transaction() instead of goto cleanup_transaction) - Josef

f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after
aborting a transaction
(wait until all writers detach, before setting running_transaction to
NULL) - Liu

66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on
transaction waiting list
(check if transaction was already removed from the transactions list) - Liu

Josef, according to your fix, if the committer encounters a problem
early, it just doesn't commit. Instead it aborts the transaction
(possibly setting FS to read-only) and detaches from the transaction.
So if this was the only committer (e.g., the transaction_kthread),
then transaction commit will not happen at all. Is this what you
intended? So then the user will notice that FS went read-only, and she
will unmount the FS, and transaction will be cleaned up in
btrfs_error_commit_super()=>btrfs_cleanup_transaction(), and not in
cleanup_transaction() via btrfs_commit_transaction(). Is my
understanding correct?

Liu, it looks like after having Josef's fix, the above two commits of
yours are not strictly needed, right? Because Josef's fix ensures that
only the real committer will call cleanup_transaction(), so at this
point there is only one writer attached to the transaction, which is
the committer itself (your fixes don't hurt, though). Is that
correct?

Thanks for helping,
Alex.


Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-06-16 Thread Alex Lyakas
Hi Miao,

On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
 I reviewed the code starting from:
 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
 the transaction commit
 until
 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()

 It looks very good. Let me check if I understand the fix correctly:
 # When transaction starts to commit, we want to wait only for external
 writers (those that did ATTACH/START/USERSPACE)
 # We guarantee at this point that no new external writers will hop on
 the committing transaction, by setting -blocked state, so we only
 wait for existing extwriters to detach from transaction

I have a doubt about this point with your new code. Example:
Task1 - external writer
Task2 - transaction kthread

Task1                                      Task2
|start_transaction(TRANS_START)            |
|-wait_current_trans(blocked=0, no wait)   |
|-join_transaction()                       |
|--lock(trans_lock)                        |
|--can_join_transaction() YES              |
|                                          |-btrfs_commit_transaction()
|                                          |--blocked=1
|                                          |--in_commit=1
|                                          |--wait_event(extwriter == 0)
|                                          |--btrfs_flush_all_pending_stuffs()
|--extwriter_counter_inc()                 |
|--unlock(trans_lock)                      |
|                                          |--lock(trans_lock)
|                                          |--trans_no_join=1

Basically, the blocked/in_commit check is not synchronized with
joining a transaction. After checking blocked, the external writer
may proceed and join the transaction. Right before joining, it calls
can_join_transaction(). This function checks the in_commit flag under
fs_info->trans_lock, but btrfs_commit_transaction() sets this flag not
under trans_lock but under commit_lock, so checking this flag is not
synchronized.

Or maybe I am wrong, because btrfs_commit_transaction() locks and
unlocks trans_lock to check for the previous transaction, so by accident
there is no problem, and the above scenario cannot happen?


 # We do not care at this point for TRANS_JOIN etc, we let them hop on
 if they want
 # When all external writers have detached, we flush their delalloc and
 then we prevent all the others to join (TRANS_JOIN etc)

 # Previously, we had the do-while loop that intended to do the same,
 but it used num_writers, which counts both external writers and also
 TRANS_JOIN. So the loop was racy because new joins prevented it from
 completing.

 Is my understanding correct?

 Yes, you are right.

 I have some questions:
 # Why was the do-while loop needed? Can we just delete the do-while
 loop as it was before, call flush_all_pending_stuffs(), then set
 trans_no_join and wait for all writers to detach? Is there some
 correctness problem here?
 Or do we need to wait for external writers to detach before calling
 flush_all_pending_stuffs() one last time?

 The external writers will introduce pending works; we need to flush them
 after the writers detach, otherwise we will forget to deal with them in the
 current transaction, just like in the following case:

 Task1                          Task2
 start_transaction              |
 |                              commit_transaction
 |                              --flush_all_pending_stuffs
 add pending works              |
 end_transaction                |
 |                              --...


 # Why is TRANS_ATTACH considered an external writer?

 - in most cases, it is done by the user's operations.
 - if in_commit is set, we shouldn't start it, or a deadlock will happen;
   it is the same as TRANS_START/TRANS_USERSPACE.

 # Can I apply this fix to the 3.8.x kernel (manually, of course)? Or are
 some additional things needed that are missing in this kernel?

 Yes, you can rebase it against 3.8.x kernel freely.

 Thanks
 Miao

Thanks,
Alex.


Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-06-12 Thread Alex Lyakas
Hi Miao,

On Thu, May 9, 2013 at 10:57 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 Hi, Alex

 Could you try the following patchset?

   git://github.com/miaoxie/linux-btrfs.git trans-commit-improve

 I think it can avoid the problem you said below.

 Note: this patchset is against chris's for-linus branch.

I reviewed the code starting from:
69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
the transaction commit
until
2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()

It looks very good. Let me check if I understand the fix correctly:
# When transaction starts to commit, we want to wait only for external
writers (those that did ATTACH/START/USERSPACE)
# We guarantee at this point that no new external writers will hop on
the committing transaction, by setting -blocked state, so we only
wait for existing extwriters to detach from transaction
# We do not care at this point for TRANS_JOIN etc, we let them hop on
if they want
# When all external writers have detached, we flush their delalloc and
then we prevent all the others to join (TRANS_JOIN etc)

# Previously, we had the do-while loop that intended to do the same,
but it used num_writers, which counts both external writers and also
TRANS_JOIN. So the loop was racy because new joins prevented it from
completing.
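
(In code form, my understanding of the new scheme is roughly the
following sketch; the helper names are taken from the commits above,
and this is not a verbatim excerpt:)

/* join_transaction(): only TRANS_START/TRANS_ATTACH/TRANS_USERSPACE
 * count as external writers; TRANS_JOIN does not. */
if (type == TRANS_START || type == TRANS_ATTACH || type == TRANS_USERSPACE)
        extwriter_counter_inc(cur_trans, type);

/* btrfs_commit_transaction(): wait only for the external writers,
 * flush the work they queued, and only then block TRANS_JOIN too. */
wait_event(cur_trans->writer_wait, extwriter_counter_read(cur_trans) == 0);
btrfs_flush_all_pending_stuffs(trans, root);
spin_lock(&root->fs_info->trans_lock);
root->fs_info->trans_no_join = 1;
spin_unlock(&root->fs_info->trans_lock);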

Is my understanding correct?

I have some questions:
# Why was the do-while loop needed? Can we just delete the do-while
loop as it was before, call flush_all_pending_stuffs(), then set
trans_no_join and wait for all writers to detach? Is there some
correctness problem here?
Or do we need to wait for external writers to detach before calling
flush_all_pending_stuffs() one last time?

# Why is TRANS_ATTACH considered an external writer?

# Can I apply this fix to the 3.8.x kernel (manually, of course)? Or are
some additional things needed that are missing in this kernel?

Thanks,
Alex.







 Thanks
 Miao

 On Wed, 10 Apr 2013 21:45:43 +0300, Alex Lyakas wrote:
 Hi Miao,
 I attempted to fix the issue by not joining a transaction that has
 trans-in_commit set. I did something similar to what
 wait_current_trans() does, but I did:

 smp_rmb();
  if (cur_trans && cur_trans->in_commit) {
  ...
  wait_event(root->fs_info->transaction_wait, !cur_trans->blocked);
 ...

 I also had to change the order of setting in_commit and blocked in
 btrfs_commit_transaction:
    trans->transaction->blocked = 1;
    trans->transaction->in_commit = 1;
    smp_wmb();
  to make sure that if in_commit is set, then blocked cannot be 0
  because btrfs_commit_transaction hasn't set it to 1 yet.

 However, with this fix I observe two issues:
  # With large trees and heavy commits, join_transaction() is sometimes
  delayed by 1-3 seconds. This delays the host IO too much.
  # With this fix, I think too many transactions happen. Basically, with
  this fix, once transaction->in_commit is set, I insist on opening a
  new transaction and not joining the current one. It has some bad
  influence on the host response-time pattern, but I cannot tell exactly
  why that is.

 Did you have other fix in mind?

  Without the fix, I sometimes observe commits that take around 80
  seconds, of which around 50 seconds are spent in the do-while loop
  of btrfs_commit_transaction.

 Thanks,
 Alex.



 On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
 alex.bt...@zadarastorage.com wrote:
 Hi Miao,

 On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
 Hi Miao,
 I am seeing another issue. Your fix prevents from TRANS_START to get
 in the way of a committing transaction. But it does not prevent from
 TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
 following loop:

 do {
 // attempt to do some useful stuff and/or sleep
  } while (atomic_read(&cur_trans->num_writers) > 1 ||
 (should_grow && cur_trans->num_joined != joined));

 What I see is basically that new writers join the transaction, while
 btrfs_commit_transaction() does this loop. I see
 cur_trans->num_writers decreasing, but then it increases, then
 decreases etc. This can go for several seconds during heavy IO load.
 There is nothing to prevent new TRANS_JOIN writers coming and joining
 a transaction over and over, thus delaying transaction commit. The IO
 path uses TRANS_JOIN; for example run_delalloc_nocow() does that.

 Do you observe such behavior? Do you believe it's problematic?

 I know this behavior, there is no problem with it, the latter code
 will prevent from TRANS_JOIN.

  1672 spin_lock(&root->fs_info->trans_lock);
  1673 root->fs_info->trans_no_join = 1;
  1674 spin_unlock(&root->fs_info->trans_lock);
  1675 wait_event(cur_trans->writer_wait,
  1676            atomic_read(&cur_trans->num_writers) == 1);

 Yes, this code prevents anybody from joining, but before
 btrfs_commit_transaction() gets to this code, it may spend sometimes
 10 seconds (in my tests) in the do-while loop, while

wait_block_group_cache_progress() waits forever in case of drive failure

2013-06-04 Thread Alex Lyakas
Greetings all,
when testing drive failures, I occasionally hit the following hang:

# Block group is being cached-in by caching_thread()
# caching_thread() experiences an error, e.g., in btrfs_search_slot,
because of drive failure:
ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
if (ret < 0)
goto err;

# caching thread exits:
err:
btrfs_free_path(path);
up_read(&fs_info->extent_commit_sem);

free_excluded_extents(extent_root, block_group);

mutex_unlock(&caching_ctl->mutex);
out:
wake_up(&caching_ctl->wait);

put_caching_control(caching_ctl);
btrfs_put_block_group(block_group);

However, wait_block_group_cache_progress() is still stuck in a stack like this:
[816ec509] schedule+0x29/0x70
[a044bd42] wait_block_group_cache_progress+0xe2/0x110 [btrfs]
[8107fc10] ? add_wait_queue+0x60/0x60
[8107fc10] ? add_wait_queue+0x60/0x60
[a04568d6] find_free_extent+0x306/0xb90 [btrfs]
[a04462ee] ? btrfs_search_slot+0x2fe/0x820 [btrfs]
[a0457200] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
...
because of:
wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
           (cache->free_space_ctl->free_space >= num_bytes));

But cache->cached never becomes BTRFS_CACHE_FINISHED, and
cache->free_space_ctl->free_space will also not grow enough, so the
wait never finishes.
At this point, the system totally hangs.

The same problem can happen with wait_block_group_cache_done().

I am thinking: can we add an additional condition, like:
wait_event(caching_ctl->wait,
           test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
           block_group_cache_done(cache) ||
           (cache->free_space_ctl->free_space >= num_bytes));

So that when the transaction aborts, the FS is marked as bad, and then all
these waits will complete, so that the user can unmount?

Or some other way to fix this problem?
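
Another conceivable direction (purely a sketch; the BTRFS_CACHE_ERROR
state below is hypothetical and does not exist in this kernel) would be
for caching_thread() to publish its failure before exiting, and for
block_group_cache_done() to treat that state as done, so every waiter
wakes up and can check for the error:

err:
        spin_lock(&block_group->lock);
        block_group->cached = BTRFS_CACHE_ERROR;  /* hypothetical state */
        spin_unlock(&block_group->lock);
        ...
out:
        wake_up(&caching_ctl->wait);  /* waiters re-check and bail out */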

Thanks,
Alex.

P.S: should I open a bugzilla for this?


[no subject]

2013-05-28 Thread Alex Lyakas
Hello all,
I have the following unresponsive btrfs:

btrfs_end_transaction() is called and is stuck in btrfs_tree_lock():

May 27 16:13:55 vc kernel: [ 7130.421159] kworker/u:85D
 0 19859  2 0x
May 27 16:13:55 vc kernel: [ 7130.421159]  880095335568
0046 00010093cb38 880083b11b48
May 27 16:13:55 vc kernel: [ 7130.421159]  880095335fd8
880095335fd8 880095335fd8 00013f40
May 27 16:13:55 vc kernel: [ 7130.421159]  8800a1fddd00
88008b1fc5c0 880095335578 880090f736d8
May 27 16:13:55 vc kernel: [ 7130.421159] Call Trace:
May 27 16:13:55 vc kernel: [ 7130.421159]  [816eb399]
schedule+0x29/0x70
May 27 16:13:55 vc kernel: [ 7130.421159]  [a03665ad]
btrfs_tree_lock+0xcd/0x250 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [8107fcc0] ?
add_wait_queue+0x60/0x60
May 27 16:13:55 vc kernel: [ 7130.421159]  [a031d558]
btrfs_init_new_buffer+0x68/0x140 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a031d70d]
btrfs_alloc_free_block+0xdd/0x460 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [8113ff9b] ?
__set_page_dirty_nobuffers+0x1b/0x20
May 27 16:13:55 vc kernel: [ 7130.421159]  [a0327b2e] ?
btree_set_page_dirty+0xe/0x10 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a0307756]
__btrfs_cow_block+0x126/0x4f0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a0307cc3]
btrfs_cow_block+0x123/0x1d0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a030c281]
btrfs_search_slot+0x381/0x820 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a03138ce]
lookup_inline_extent_backref+0x8e/0x5b0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a032b6e9] ?
btrfs_mark_buffer_dirty+0x99/0xf0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a031301e] ?
setup_inline_extent_backref+0x18e/0x290 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a0313e53]
insert_inline_extent_backref+0x63/0x130 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a030677a] ?
btrfs_alloc_path+0x1a/0x20 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a031486f]
__btrfs_inc_extent_ref+0x9f/0x240 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a0377aa9] ?
btrfs_merge_delayed_refs+0x289/0x300 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a031b3a1]
run_clustered_refs+0x971/0xd00 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a030714d] ?
btrfs_put_tree_mod_seq+0x10d/0x150 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a031f7f0]
btrfs_run_delayed_refs+0xd0/0x320 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a0330bf7]
__btrfs_end_transaction+0xf7/0x410 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.421159]  [a0330f60]
btrfs_end_transaction+0x10/0x20 [btrfs]

As a result, the transaction cannot commit; it waits for all writers to
detach in the do-while loop.

May 27 16:13:55 vc kernel: [ 7130.419009] btrfs-transacti D
 0 15150  2 0x
May 27 16:13:55 vc kernel: [ 7130.419012]  88009f86bce8
0046 032d032d 
May 27 16:13:55 vc kernel: [ 7130.419016]  88009f86bfd8
88009f86bfd8 88009f86bfd8 00013f40
May 27 16:13:55 vc kernel: [ 7130.419020]  8800af1e9740
8800a03f8000 0090 88009693cb00
May 27 16:13:55 vc kernel: [ 7130.419023] Call Trace:
May 27 16:13:55 vc kernel: [ 7130.419027]  [816eb399]
schedule+0x29/0x70
May 27 16:13:55 vc kernel: [ 7130.419031]  [816e9b1d]
schedule_timeout+0x1ed/0x250
May 27 16:13:55 vc kernel: [ 7130.419055]  [a03497a3] ?
btrfs_run_ordered_operations+0x2b3/0x2e0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419060]  [81045cd9] ?
default_spin_lock_flags+0x9/0x10
May 27 16:13:55 vc kernel: [ 7130.419081]  [a0330388]
btrfs_commit_transaction+0x3b8/0xae0 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419085]  [8107fcc0] ?
add_wait_queue+0x60/0x60
May 27 16:13:55 vc kernel: [ 7130.419104]  [a0328525]
transaction_kthread+0x1b5/0x230 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419124]  [a0328370] ?
btree_invalidatepage+0x80/0x80 [btrfs]
May 27 16:13:55 vc kernel: [ 7130.419128]  [8107f0d0]
kthread+0xc0/0xd0
May 27 16:13:55 vc kernel: [ 7130.419132]  [8107f010] ?
flush_kthread_worker+0xb0/0xb0
May 27 16:13:55 vc kernel: [ 7130.419136]  [816f506c]
ret_from_fork+0x7c/0xb0
May 27 16:13:55 vc kernel: [ 7130.419140]  [8107f010] ?
flush_kthread_worker+0xb0/0xb0

There is an additional thread stuck in btrfs_tree_lock(); I am not sure
how it is related. Perhaps there's some deadlock between the two?

May 27 16:13:55 vc kernel: [ 7130.421159] flush-btrfs-2   D
0001 0 18816  2 0x
May 27 16:13:55 vc kernel: [ 7130.421159]  88008b553948
0046 880017991050 
May 27 16:13:55 vc kernel: [ 7130.421159]  

Re: [PATCH] Btrfs: clear received_uuid field for new writable snapshots

2013-05-22 Thread Alex Lyakas
Hi Stephan,
I fully understand the first part of your fix, and I believe it's
quite critical. Indeed, a writable snapshot should have no evidence
that it has an ancestor that was once received.

Can you please confirm that I understand the second part of your fix.
In btrfs-progs, the following code in tree_search() would have
prevented us from mistakenly selecting such a snapshot as a parent for
receive:
if (type == subvol_search_by_received_uuid) {
        entry = rb_entry(n, struct subvol_info,
                         rb_received_node);
        comp = memcmp(entry->received_uuid, uuid,
                      BTRFS_UUID_SIZE);
        if (!comp) {
                if (entry->stransid < stransid)
                        comp = -1;
                else if (entry->stransid > stransid)
                        comp = 1;
                else
                        comp = 0;
        }
The code checks not only the received_uuid (which would have wrongly
matched what we need), but also the stransid (which was the ctransid
on the send side), which would have been zero, so it wouldn't match.

Now, after your fix, checking the stransid field becomes unnecessary, correct?
Because if we have a valid received_uuid, this means that either we
are the received snapshot, or our whole chain of ancestors is
read-only, and eventually there was an ancestor that was received.
So we have valid data and can be used as a parent. Is it still needed
after your fix to check the stransid field? (It doesn't hurt to check it.)
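
(In other words, if that reading is right, the parent-selection check
could in principle reduce to the memcmp alone; a sketch, not the actual
btrfs-progs code:)

/* With received_uuid reliably cleared on writable snapshots, a matching
 * non-zero received_uuid alone identifies a usable parent; the stransid
 * comparison becomes a redundant (but harmless) tie-break. */
if (type == subvol_search_by_received_uuid)
        comp = memcmp(entry->received_uuid, uuid, BTRFS_UUID_SIZE);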

Clearing/not clearing the rtransid - does it bring any value?
rtransid is the local transid of when we completed the receive
process for this snap. Is there any interesting usage of this value?

Thanks,
Alex.


On Wed, Apr 17, 2013 at 12:11 PM, Stefan Behrens
sbehr...@giantdisaster.de wrote:

 For created snapshots, the full root_item is copied from the source
 root and afterwards selectively modified. The current code forgets
 to clear the field received_uuid. The only problem is that it is
 confusing when you look at it with 'btrfs subv list', since for
 writable snapshots, the contents of the snapshot can be completely
 unrelated to the previously received snapshot.
 The receiver ignores such snapshots anyway because he also checks
 the field stransid in the root_item and that value used to be reset
 to zero for all created snapshots.

 This commit changes two things:
 - clear the received_uuid field for new writable snapshots.
 - don't clear the send/receive related information like the stransid
   for read-only snapshots (which makes them useable as a parent for
   the automatic selection of parents in the receive code).

 Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
 ---
  fs/btrfs/transaction.c | 12 
  1 file changed, 8 insertions(+), 4 deletions(-)

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index ffac232..94cbd10 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -1170,13 +1170,17 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
 memcpy(new_root_item->uuid, new_uuid.b, BTRFS_UUID_SIZE);
 memcpy(new_root_item->parent_uuid, root->root_item.uuid,
 BTRFS_UUID_SIZE);
 +   if (!(root_flags & BTRFS_ROOT_SUBVOL_RDONLY)) {
 +   memset(new_root_item->received_uuid, 0,
 +  sizeof(new_root_item->received_uuid));
 +   memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
 +   memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
 +   btrfs_set_root_stransid(new_root_item, 0);
 +   btrfs_set_root_rtransid(new_root_item, 0);
 +   }
 new_root_item->otime.sec = cpu_to_le64(cur_time.tv_sec);
 new_root_item->otime.nsec = cpu_to_le32(cur_time.tv_nsec);
 btrfs_set_root_otransid(new_root_item, trans->transid);
 -   memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
 -   memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
 -   btrfs_set_root_stransid(new_root_item, 0);
 -   btrfs_set_root_rtransid(new_root_item, 0);

 old = btrfs_lock_root_node(root);
 ret = btrfs_cow_block(trans, root, old, NULL, 0, old);
 --
 1.8.2.1



Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-04-10 Thread Alex Lyakas
Hi Miao,
I attempted to fix the issue by not joining a transaction that has
trans-in_commit set. I did something similar to what
wait_current_trans() does, but I did:

smp_rmb();
if (cur_trans && cur_trans->in_commit) {
...
wait_event(root->fs_info->transaction_wait, !cur_trans->blocked);
...

I also had to change the order of setting in_commit and blocked in
btrfs_commit_transaction:
trans->transaction->blocked = 1;
trans->transaction->in_commit = 1;
smp_wmb();
to make sure that if in_commit is set, then blocked cannot be 0
because btrfs_commit_transaction hasn't set it to 1 yet.

However, with this fix I observe two issues:
# With large trees and heavy commits, join_transaction() is sometimes
delayed by 1-3 seconds. This delays the host IO too much.
# With this fix, I think too many transactions happen. Basically, with
this fix, once transaction->in_commit is set, I insist on opening a
new transaction and not joining the current one. It has some bad
influence on the host response-time pattern, but I cannot tell exactly
why that is.

Did you have other fix in mind?

Without the fix, I sometimes observe commits that take around 80
seconds, of which around 50 seconds are spent in the do-while loop
of btrfs_commit_transaction.

Thanks,
Alex.



On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
alex.bt...@zadarastorage.com wrote:
 Hi Miao,

 On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
 Hi Miao,
 I am seeing another issue. Your fix prevents from TRANS_START to get
 in the way of a committing transaction. But it does not prevent from
 TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
 following loop:

 do {
 // attempt to do some useful stuff and/or sleep
 } while (atomic_read(&cur_trans->num_writers) > 1 ||
(should_grow && cur_trans->num_joined != joined));

 What I see is basically that new writers join the transaction, while
 btrfs_commit_transaction() does this loop. I see
 cur_trans->num_writers decreasing, but then it increases, then
 decreases etc. This can go for several seconds during heavy IO load.
 There is nothing to prevent new TRANS_JOIN writers coming and joining
 a transaction over and over, thus delaying transaction commit. The IO
 path uses TRANS_JOIN; for example run_delalloc_nocow() does that.

 Do you observe such behavior? Do you believe it's problematic?

 I know this behavior, there is no problem with it, the latter code
 will prevent from TRANS_JOIN.

 1672 spin_lock(&root->fs_info->trans_lock);
 1673 root->fs_info->trans_no_join = 1;
 1674 spin_unlock(&root->fs_info->trans_lock);
 1675 wait_event(cur_trans->writer_wait,
 1676            atomic_read(&cur_trans->num_writers) == 1);

 Yes, this code prevents anybody from joining, but before
 btrfs_commit_transaction() gets to this code, it may spend sometimes
 10 seconds (in my tests) in the do-while loop, while new writers come
 and go. Basically, it is not deterministic when the do-while loop will
 exit, it depends on the IO pattern.

 And if we block TRANS_JOIN at the place you point out, a deadlock
 will happen, because we need to deal with the ordered operations, which
 use TRANS_JOIN here.

 (I am dealing with the problem you said above by adding a new type of
 TRANS_* now)

 Thanks.
 Alex.



 Thanks
 Miao

 Thanks,
 Alex.


 On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
 Hi Miao,
 can you please explain your solution a bit more.

 On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 Now btrfs_commit_transaction() does this

 ret = btrfs_run_ordered_operations(root, 0)

 which async flushes all inodes on the ordered operations list, it 
 introduced
 a deadlock that transaction-start task, transaction-commit task and the 
 flush
 workers waited for each other.
 (See the following URL to get the detail
  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)

 As we know, if ->in_commit is set, it means someone is committing the
 current transaction, we should not try to join it if we are not JOIN
 or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
 the above problem. In this way, there is another benefit: there is no new
 transaction handle to block the transaction which is on the way of 
 commit,
 once we set ->in_commit.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/transaction.c |   17 ++++++++++++++++-
  1 files changed, 16 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index bc2f2d1..71b7e2e 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
 	root->commit_root = btrfs_root_node(root);
  }

 +static inline int can_join_transaction(struct btrfs_transaction *trans

Re: Backup Options

2013-04-09 Thread Alex Lyakas
Hi David,
maybe my old patch
http://www.spinics.net/lists/linux-btrfs/msg19739.html
can help with this issue?

Thanks,
Alex.


On Wed, Apr 3, 2013 at 8:23 PM, David Sterba dste...@suse.cz wrote:
 On Wed, Apr 03, 2013 at 04:33:22AM +0200, Harald Glatt wrote:
 However what I actually did was:
 # cd /mnt/restore
 # nc -l -p  | btrfs receive .

 After noticing this difference I had to try it again as described in
 my mail and - oh wonder - it works now!! Giving 'btrfs receive' a dot
 as a parameter seems to fail in this case. Is this expected behavior
 or a bug?

 Bug. Relative paths do not work on the receive side.

 david


Re: btrfs stuck on

2013-04-02 Thread Alex Lyakas
Hi David,

On Fri, Mar 29, 2013 at 8:12 PM, David Sterba dste...@suse.cz wrote:
 On Thu, Mar 21, 2013 at 11:56:37AM -0700, Ask Bjørn Hansen wrote:
 A few weeks ago I replaced a ZFS backup system with one backed by
 btrfs. A script loops over a bunch of hosts rsyncing them to each
 their own subvolume.  After each rsync I snapshot the host-specific
 subvolume.

 The disk is an iscsi disk that in my benchmarks performs roughly
 like a local raid with 2-3 SATA disks.

 It worked fine for about a week (~150 snapshots from ~20 sub volumes)
 before it suddenly exploded in disk io wait. Doing anything (in
 particular changes) on the file system is just insanely slow, rsync
 basically can't complete (an rsync that should take 10-20 minutes
 takes 24 hours; I have a directory of 60k files I tried deleting and
 it's deleting one file every few minutes, that sort of thing).

 I'm seeing similar problem after a test that produces tons of snapshots
 and snap deletions at the same time. Accessing the directory (eg. via
 ls) containing the snapshots blocks for a long time.

 The contention point is a mutex of the directory entry, used for lookups
 on the 'ls' side, and the snapshot deletion process holds the mutex as
 well with obvious consequences. The contention is multiplied by the
 number of snapshots waiting to be deleted and eagerly grabbing the
 mutex, making other waiters starve.

Can you please clarify which mutex you mean? Do you mean
dir->i_mutex, taken by btrfs_ioctl_snap_destroy()? If so, that
mutex is held only while adding a snapshot to the to-be-deleted list,
and not during the snapshot deletion itself. Otherwise, I don't see
btrfs_drop_snapshot() taking any mutex, for example.
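
For reference, this is my reading of the ~3.8 snap_destroy flow (a
simplified sketch from memory, not the verbatim source):

	/* btrfs_ioctl_snap_destroy(), roughly: */
	mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
	/* ... permission checks, unlink the subvolume dentry ... */
	btrfs_add_dead_root(dest);	/* only *schedules* deletion */
	mutex_unlock(&dir->i_mutex);
	/* the expensive btrfs_drop_snapshot() runs later from the
	 * cleaner, without dir->i_mutex held */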


 You've observed this as deletion progressing very slowly and rsync
 blocked. That's really annoying and I'm working towards fixing it.

 I am using 3.8.2-206.fc18.x86_64 (Fedora 18). I tried rebooting, it
 doesn't make a difference. As soon as I boot [btrfs-cleaner] and
 [btrfs-transacti] gets really busy.

 I wonder if it's because I deleted a few snapshots at some point?

 Yes. The progress or performance impact depends on amount of data shared
 among the snapshots and used / free space fragmentation.

 david

Thanks,
Alex.


Re: [PATCH v2] Btrfs: fix locking on ROOT_REPLACE operations in tree mod log

2013-04-02 Thread Alex Lyakas
Hi Jan,
I have manually applied this patch, and also your previous patch, on
top of kernel 3.8.2, but unfortunately I am still hitting the issue :(
I will check whether I can be more helpful in debugging this issue
than just reporting it :(

Thanks for your help,
Alex.



On Wed, Mar 20, 2013 at 3:49 PM, Jan Schmidt list.bt...@jan-o-sch.net wrote:
 To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
 required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
 Therefore, those operations must be enclosed by tree_mod_log_write_lock()
 and tree_mod_log_write_unlock() calls.

 Those calls are private to the tree_mod_log_* functions, which means that
 removal of the elements of an old root node must be logged from
 tree_mod_log_insert_root. This partly reverts and corrects commit ba1bfbd5
 (Btrfs: fix a tree mod logging issue for root replacement operations).

 This fixes the brand-new version of xfstest 276 as of commit cfe73f71.

 Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
 ---
 Has probably been Reported-by: Alex Lyakas alex.bt...@zadarastorage.com
 (unconfirmed).

 Changes for v2:
 - use the correct base (current cmason/for-linus)

  fs/btrfs/ctree.c |   30 --
  1 files changed, 20 insertions(+), 10 deletions(-)

 diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
 index ecd25a1..ca9d8f1 100644
 --- a/fs/btrfs/ctree.c
 +++ b/fs/btrfs/ctree.c
 @@ -651,6 +651,8 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
 	if (tree_mod_dont_log(fs_info, NULL))
 		return 0;
 
 +	__tree_mod_log_free_eb(fs_info, old_root);
 +
 	ret = tree_mod_alloc(fs_info, flags, &tm);
 	if (ret < 0)
 		goto out;
 @@ -736,7 +738,7 @@ tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq)
  static noinline void
  tree_mod_log_eb_copy(struct btrfs_fs_info *fs_info, struct extent_buffer *dst,
 		     struct extent_buffer *src, unsigned long dst_offset,
 -		     unsigned long src_offset, int nr_items)
 +		     unsigned long src_offset, int nr_items, int log_removal)
  {
 	int ret;
 	int i;
 @@ -750,10 +752,12 @@ tree_mod_log_eb_copy(struct btrfs_fs_info *fs_info, struct extent_buffer *dst,
 	}
 
 	for (i = 0; i < nr_items; i++) {
 -		ret = tree_mod_log_insert_key_locked(fs_info, src,
 -						     i + src_offset,
 -						     MOD_LOG_KEY_REMOVE);
 -		BUG_ON(ret < 0);
 +		if (log_removal) {
 +			ret = tree_mod_log_insert_key_locked(fs_info, src,
 +							i + src_offset,
 +							MOD_LOG_KEY_REMOVE);
 +			BUG_ON(ret < 0);
 +		}
 		ret = tree_mod_log_insert_key_locked(fs_info, dst,
 						     i + dst_offset,
 						     MOD_LOG_KEY_ADD);
 @@ -927,7 +931,6 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
 			ret = btrfs_dec_ref(trans, root, buf, 1, 1);
 			BUG_ON(ret); /* -ENOMEM */
 		}
 -		tree_mod_log_free_eb(root->fs_info, buf);
 		clean_tree_block(trans, root, buf);
 		*last_ref = 1;
 	}
 @@ -1046,6 +1049,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
 		btrfs_set_node_ptr_generation(parent, parent_slot,
 					      trans->transid);
 		btrfs_mark_buffer_dirty(parent);
 +		tree_mod_log_free_eb(root->fs_info, buf);
 		btrfs_free_tree_block(trans, root, buf, parent_start,
 				      last_ref);
 	}
 @@ -1750,7 +1754,6 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
 		goto enospc;
 	}
 
 -	tree_mod_log_free_eb(root->fs_info, root->node);
 	tree_mod_log_set_root_pointer(root, child);
 	rcu_assign_pointer(root->node, child);
 
 @@ -2995,7 +2998,7 @@ static int push_node_left(struct btrfs_trans_handle *trans,
 	push_items = min(src_nritems - 8, push_items);
 
 	tree_mod_log_eb_copy(root->fs_info, dst, src, dst_nritems, 0,
 -			     push_items);
 +			     push_items, 1);
 	copy_extent_buffer(dst, src,
 			   btrfs_node_key_ptr_offset(dst_nritems),
 			   btrfs_node_key_ptr_offset(0),
 @@ -3066,7 +3069,7 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
 			  sizeof(struct btrfs_key_ptr));
 
 	tree_mod_log_eb_copy(root->fs_info

Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-03-24 Thread Alex Lyakas
Hi Miao,
I am seeing another issue. Your fix prevents from TRANS_START to get
in the way of a committing transaction. But it does not prevent from
TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
following loop:

do {
// attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
         (should_grow && cur_trans->num_joined != joined));

What I see is basically that new writers join the transaction, while
btrfs_commit_transaction() does this loop. I see
cur_trans-num_writers decreasing, but then it increases, then
decreases etc. This can go for several seconds during heavy IO load.
There is nothing to prevent new TRANS_JOIN writers coming and joining
a transaction over and over, thus delaying transaction commit. The IO
path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
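
For illustration, a minimal sketch (my paraphrase, not the actual 3.8
code; cow_one_delalloc_range() is a hypothetical helper standing in
for the run_delalloc_nocow() internals) of why TRANS_JOIN writers keep
arriving: the writeback path joins the running transaction for every
delalloc range it processes, and join_transaction() only refuses
newcomers once trans_no_join is set, which happens *after* the
do-while loop above has finished:

	static int cow_one_delalloc_range(struct btrfs_root *root)
	{
		struct btrfs_trans_handle *trans;

		trans = btrfs_join_transaction(root); /* bumps num_writers */
		if (IS_ERR(trans))
			return PTR_ERR(trans);

		/* ... allocate extents, set up ordered extents ... */

		return btrfs_end_transaction(trans, root); /* drops num_writers */
	}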

Do you observe such behavior? Do you believe it's problematic?

Thanks,
Alex.


On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
 Hi Miao,
 can you please explain your solution a bit more.

 On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 Now btrfs_commit_transaction() does this

 ret = btrfs_run_ordered_operations(root, 0)

 which async-flushes all inodes on the ordered operations list; this
 introduced a deadlock in which the transaction-start task, the
 transaction-commit task and the flush workers waited for each other.
 (See the following URL to get the detail
  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)

 As we know, if ->in_commit is set, it means someone is committing the
 current transaction; we should not try to join it unless we are JOIN
 or JOIN_NOLOCK, and waiting is the best choice in that case. In this
 way we can avoid the above problem. There is another benefit: once we
 set ->in_commit, no new transaction handle can block the transaction
 which is on its way to commit.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/transaction.c |   17 -
  1 files changed, 16 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index bc2f2d1..71b7e2e 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
 	root->commit_root = btrfs_root_node(root);
  }
 
 +static inline int can_join_transaction(struct btrfs_transaction *trans,
 +				       int type)
 +{
 +	return !(trans->in_commit &&
 +		 type != TRANS_JOIN &&
 +		 type != TRANS_JOIN_NOLOCK);
 +}
 +
  /*
   * either allocate a new transaction or hop into the existing one
   */
 @@ -86,6 +94,10 @@ loop:
 		spin_unlock(&fs_info->trans_lock);
 		return cur_trans->aborted;
 	}
 +	if (!can_join_transaction(cur_trans, type)) {
 +		spin_unlock(&fs_info->trans_lock);
 +		return -EBUSY;
 +	}
 	atomic_inc(&cur_trans->use_count);
 	atomic_inc(&cur_trans->num_writers);
 	cur_trans->num_joined++;
 @@ -360,8 +372,11 @@ again:
 
 	do {
 		ret = join_transaction(root, type);
 -		if (ret == -EBUSY)
 +		if (ret == -EBUSY) {
 			wait_current_trans(root);
 +			if (unlikely(type == TRANS_ATTACH))
 +				ret = -ENOENT;
 +		}

 So I understand that instead of incrementing num_writers and joining
 the current transaction, you do not join, and instead wait for the
 current transaction to unblock.

 More specifically, TRANS_START, TRANS_USERSPACE and TRANS_ATTACH cannot
 join and just wait for the current transaction to unblock if ->in_commit
 is set.

 Which task in Josef's example
 http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
 task 1, task 2 or task 3 is the one that will not join the
 transaction, but instead wait?

 Task1 will not join the transaction; in this way, the async inode flush
 won't run, and then task3 won't do anything.

 Before applying the patch:
 Start/Attach_Trans_Task          Commit_Task                 Flush_Worker
 (Task1)                          (Task2)                     (Task3) <-- the name in Josef's example
 btrfs_start_transaction()
  |-may_wait_transaction()
  |  (return 0)
  |                               btrfs_commit_transaction()
  |                                |-set ->in_commit and
  |                                |  blocked to 1
  |                                |-wait writers to be 1
  |                                |  (writers is 1)
  |-join_transaction()             |
  |  (writers is 2)                |
  |-btrfs_commit_transaction()     |
  |                                |-set

Re: [PATCH v2] btrfs: clean snapshots one by one

2013-03-16 Thread Alex Lyakas
Hi David,

On Thu, Mar 7, 2013 at 1:55 PM, David Sterba dste...@suse.cz wrote:
 On Wed, Mar 06, 2013 at 10:12:11PM -0500, Chris Mason wrote:
   Also, I want to ask, hope this is not inappropriate. Do you also agree
   with Josef that it's ok for BTRFS_IOC_SNAP_DESTROY not to commit the
   transaction, but just to detach from it? Had we committed, we would
   have ensured that the ORPHAN_ITEM is in the root tree, thus preventing
   the subvol from re-appearing after a crash.
   It seems a little inconsistent with snap creation, where not only is
   the transaction committed, but a delalloc flush is performed to ensure
   that all data is on disk before creating the snap.
 
  That's another question, can you please point me to the thread where
  this was discussed?
http://www.spinics.net/lists/linux-btrfs/msg22256.html



 That's a really old one.  The original snapshot code expected people to
 run sync first, but that's not very user friendly.  The idea is that if
 you write a file and then take a snapshot, that file should be in the
 snapshot.

 The snapshot behaviour sounds ok to me.

 That a subvol/snapshot may appear after a crash if the transaction
 commit did not happen does not feel so good. We know that the subvol is
 only scheduled for deletion and needs to be processed by the cleaner.

 From that point I'd rather see the commit to happen to avoid any
 unexpected surprises.  A subvolume that re-appears still holds the data
 references and consumes space although the user does not assume that.

 Automated snapshotting and deleting needs some guarantees about the
 behaviour and what to do after a crash. So now it has to process the
 backlog of previously deleted snapshots and verify that they're not
 there, as opposed to "deleted means it will never reappear, we can
 forget about it".

Exactly. Currently, user space has no idea when the deletion will
start, or when it is completed (it has to track the ROOT_ITEM, drop
progress, ORPHAN_ITEM etc.). That's why I was thinking that at least
committing a transaction on snap_destroy could ensure that the deletion
will not be reverted.
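
A minimal sketch of what I mean, assuming the ~3.8 layout of
btrfs_ioctl_snap_destroy() (function names from memory; the only
change being suggested is the commit at the end instead of merely
ending the transaction):

	ret = btrfs_unlink_subvol(trans, root, dir,
				  dest->root_key.objectid,
				  dentry->d_name.name,
				  dentry->d_name.len);
	BUG_ON(ret);
	/* ... orphan item insertion, dead-root bookkeeping ... */

	/*
	 * Hypothetical change: commit here so the orphan item hits disk
	 * and the deletion survives a crash, mirroring the flush+commit
	 * that snapshot *creation* already does.
	 */
	ret = btrfs_commit_transaction(trans, root);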

Thanks,
Alex.


Re: [PATCH v3] btrfs: clean snapshots one by one

2013-03-16 Thread Alex Lyakas
On Tue, Mar 12, 2013 at 5:13 PM, David Sterba dste...@suse.cz wrote:
 Each time pick one dead root from the list and let the caller know if
 it's needed to continue. This should improve responsiveness during
 umount and balance which at some point waits for cleaning all currently
 queued dead roots.

 A new dead root is added to the end of the list, so the snapshots
 disappear in the order of deletion.

 The snapshot cleaning work is now done only from the cleaner thread and the
 others wake it if needed.

 Signed-off-by: David Sterba dste...@suse.cz
 ---

 v1,v2:
 * http://thread.gmane.org/gmane.comp.file-systems.btrfs/23212

 v2-v3:
 * remove run_again from btrfs_clean_one_deleted_snapshot and return 1
   unconditionally

  fs/btrfs/disk-io.c |   10 ++--
  fs/btrfs/extent-tree.c |8 ++
  fs/btrfs/relocation.c  |3 --
  fs/btrfs/transaction.c |   56 +++
  fs/btrfs/transaction.h |2 +-
  5 files changed, 53 insertions(+), 26 deletions(-)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 988b860..4de2351 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -1690,15 +1690,19 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 
 	do {
 +		int again = 0;
 +
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 +		    down_read_trylock(&root->fs_info->sb->s_umount) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
 -			btrfs_clean_old_snapshots(root);
 +			again = btrfs_clean_one_deleted_snapshot(root);
 			mutex_unlock(&root->fs_info->cleaner_mutex);
 			btrfs_run_defrag_inodes(root->fs_info);
 +			up_read(&root->fs_info->sb->s_umount);
 		}
 
 -		if (!try_to_freeze()) {
 +		if (!try_to_freeze() && !again) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (!kthread_should_stop())
 				schedule();
 @@ -3403,8 +3407,8 @@ int btrfs_commit_super(struct btrfs_root *root)
 
 	mutex_lock(&root->fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(root);
 -	btrfs_clean_old_snapshots(root);
 	mutex_unlock(&root->fs_info->cleaner_mutex);
 +	wake_up_process(root->fs_info->cleaner_kthread);
 
 	/* wait until ongoing cleanup work done */
 	down_write(&root->fs_info->cleanup_work_sem);
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 742b7a7..a08d0fe 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -7263,6 +7263,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
   * reference count by one. if update_ref is true, this function
   * also make sure backrefs for the shared block and all lower level
   * blocks are properly updated.
 + *
 + * If called with for_reloc == 0, may exit early with -EAGAIN
   */
  int btrfs_drop_snapshot(struct btrfs_root *root,
  			struct btrfs_block_rsv *block_rsv, int update_ref,
 @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
 
 	while (1) {
 +		if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
 +			pr_debug("btrfs: drop snapshot early exit\n");
 +			err = -EAGAIN;
 +			goto out_end_trans;
 +		}
 +
 		ret = walk_down_tree(trans, root, path, wc);
 		if (ret < 0) {
 			err = ret;
 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index 8445000..50deb9ed 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -4148,10 +4148,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
 
 	while (1) {
 		mutex_lock(&fs_info->cleaner_mutex);
 -
 -		btrfs_clean_old_snapshots(fs_info->tree_root);
 		ret = relocate_block_group(rc);
 -
 		mutex_unlock(&fs_info->cleaner_mutex);
 		if (ret < 0) {
 			err = ret;
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index a0467eb..a2781c3 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -950,7 +950,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
  int btrfs_add_dead_root(struct btrfs_root *root)
  {
 	spin_lock(&root->fs_info->trans_lock);
 -	list_add(&root->root_list, &root->fs_info->dead_roots);
 +	list_add_tail(&root->root_list, &root->fs_info->dead_roots);
 	spin_unlock(&root->fs_info->trans_lock);
 	return 0;
  }
 @@ -1876,31 +1876,49 @@ cleanup_transaction:
  }
 
  /*
 - * interface function to delete all the snapshots we have scheduled for deletion
 + * return < 0 if error
 + * 0 if 

Re: [PATCH] Btrfs: fix backref walking race with tree deletions

2013-03-12 Thread Alex Lyakas
 leaf 4214784, which is the
leaf of subvolume 257.

The tree-dump I showed you is taken after the test failed, and at this
point if I try btrfs send, everything is resolved alright:

btrfs [find_extent_clone] Search [rt=257 ino=277 off=0 len=8192]
[found extent=4386816 extent_item_pos=0]
btrfs [iterate_extent_inodes] resolving for extent 4386816 pos=0
btrfs [iterate_extent_inodes] extent 4386816 pos=0 found 2 leafs
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 262 references
leaf 4431872
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 261 references
leaf 4431872
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 257 references
leaf 4214784

Can you advise on how to debug this further?

Thanks,
Alex.



On Thu, Feb 21, 2013 at 5:35 PM, Jan Schmidt list.bt...@jan-o-sch.net wrote:
 When a subvolume is removed, we remove the root item from the root tree,
 while the tree blocks and backrefs remain for a while. When backref walking
 comes across one of those orphan tree blocks, it can find a backref for a
 no longer existing root. This is all good, we only must tolerate
 __resolve_indirect_ref returning an error and continue with the good refs
 found.

 Reported-by: Alex Lyakas alex.bt...@zadarastorage.com
 Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
 ---
  fs/btrfs/backref.c |5 +
  1 files changed, 1 insertions(+), 4 deletions(-)

 diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
 index 04edf69..bd605c8 100644
 --- a/fs/btrfs/backref.c
 +++ b/fs/btrfs/backref.c
 @@ -352,11 +352,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
 		err = __resolve_indirect_ref(fs_info, search_commit_root,
 					     time_seq, ref, parents,
 					     extent_item_pos);
 -		if (err) {
 -			if (ret == 0)
 -				ret = err;
 +		if (err)
 			continue;
 -		}

 /* we put the first parent into the ref at hand */
 		ULIST_ITER_INIT(&uiter);
 --
 1.7.1



Re: same EXTENT_ITEM appears twice in the extent tree

2013-03-09 Thread Alex Lyakas
So, no advice on how this could have happened?
Ok, maybe it won't happen again...

On Sun, Mar 3, 2013 at 5:44 PM, Alex Lyakas
alex.bt...@zadarastorage.com wrote:
 Hi Chris,

 On Sun, Mar 3, 2013 at 5:28 PM, Chris Mason chris.ma...@fusionio.com wrote:
 On Sun, Mar 03, 2013 at 06:40:50AM -0700, Alex Lyakas wrote:
 Greetings all,
 I have an extent tree that looks like follows:

   item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
   extent refs 1 gen 164 flags 1
   item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
   extent refs 1 gen 165 flags 1
   extent data backref root 257 objectid 257 offset 17446191104 
 count 1
   item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 
 itemsize 4
   shared data backref count 1

 Have you been experimenting on this FS with snapshot deletion patches?

 No, I haven't applied any patches on top of the commit I mentioned. (I
 presume you mean David's patch for one-by-one deletion). Since
 created, this FS has only seen straight IO with parallel snapshot
 creation and deletion. However, the kernel was crashing pretty
 frequently during this test, so I presume log replay was taking place.

 Any particular thing I can look for in the debug-tree output, except
 searching for more double-allocations?

 Thanks,
 Alex.



 -chris


same EXTENT_ITEM appears twice in the extent tree

2013-03-03 Thread Alex Lyakas
Greetings all,
I have an extent tree that looks like follows:

item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
extent refs 1 gen 164 flags 1
item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
extent refs 1 gen 165 flags 1
extent data backref root 257 objectid 257 offset 17446191104 
count 1
item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 
itemsize 4
shared data backref count 1

As can be seen, same EXTENT_ITEM appears twice. This was undetected,
until __btrfs_free_extent was called, after cleaner deleted one of the
snapshots. Then it lead to assert:
if (found_extent) {
	BUG_ON(is_data && refs_to_drop !=
	       extent_data_ref_count(root, path, iref));
	if (iref) {
		BUG_ON(path->slots[0] != extent_slot);
	} else {
		BUG_ON(path->slots[0] != extent_slot + 1);  /* CRASH */
		path->slots[0] = extent_slot;
		num_to_del = 2;
	}

As for the usage of this bad extent, there are multiple snapshots
sharing the 98304-length extent, but only one that uses the 16384
extent:
file tree key (257 ROOT_ITEM 0)
item 19 key (257 EXTENT_DATA 17446191104) itemoff 2935 itemsize 53
extent data disk byte 27059916800 nr 98304
extent data offset 0 nr 98304 ram 98304
extent compression 0
...
file tree key (350 ROOT_ITEM 164)
item 21 key (257 EXTENT_DATA 17446191104) itemoff 2829 itemsize 53
extent data disk byte 27059916800 nr 16384
extent data offset 0 nr 16384 ram 16384
extent compression 0

...
file tree key (352 ROOT_ITEM 167)
item 19 key (257 EXTENT_DATA 17446191104) itemoff 2935 itemsize 53
extent data disk byte 27059916800 nr 98304
extent data offset 0 nr 98304 ram 98304
extent compression 0

Kernel is for-linus, top commit:

commit 1eafa6c73791e4f312324ddad9cbcaf6a1b6052b
Author: Miao Xie mi...@cn.fujitsu.com
Date:   Tue Jan 22 10:49:00 2013 +

Btrfs: fix repeated delalloc work allocation

I believe I might have more extents like this, because btrfs-debug-tree warns:
warning, bad space info total_bytes 26851934208 used 26852773888
warning, bad space info total_bytes 27925676032 used 27926892544

Mount options: nodatasum,nodatacow,noatime,nospace_cache. Metadata
profile is DUP, data profile is single.

Can anybody advise on how this could have happened? I can provide the
whole debug-tree, btrfs-image or any additional info.

Thanks,
Alex.


Re: same EXTENT_ITEM appears twice in the extent tree

2013-03-03 Thread Alex Lyakas
Hi Chris,

On Sun, Mar 3, 2013 at 5:28 PM, Chris Mason chris.ma...@fusionio.com wrote:
 On Sun, Mar 03, 2013 at 06:40:50AM -0700, Alex Lyakas wrote:
 Greetings all,
 I have an extent tree that looks like follows:

   item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
   extent refs 1 gen 164 flags 1
   item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
   extent refs 1 gen 165 flags 1
   extent data backref root 257 objectid 257 offset 17446191104 
 count 1
   item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 
 itemsize 4
   shared data backref count 1

 Have you been experimenting on this FS with snapshot deletion patches?

No, I haven't applied any patches on top of the commit I mentioned. (I
presume you mean David's patch for one-by-one deletion). Since
created, this FS has only seen straight IO with parallel snapshot
creation and deletion. However, the kernel was crashing pretty
frequently during this test, so I presume log replay was taking place.

Any particular thing I can look for in the debug-tree output, except
searching for more double-allocations?

Thanks,
Alex.



 -chris


Re: basic questions regarding COW in Btrfs

2013-03-02 Thread Alex Lyakas
Hi Josef,
I hope it's ok to piggy back on this thread for the following question:

I see that in btrfs_cross_ref_exist()=check_committed_ref() path,
there is the following check:

if (btrfs_extent_generation(leaf, ei) <=
    btrfs_root_last_snapshot(&root->root_item))
	goto out;

So this basically means that after we have taken a snapshot of a
subvolume, all of the subvolume's extents must be COW'ed, even if we
delete the snapshot a minute later.
I wonder, why is that so?
Is this because file extents can be shared indirectly, i.e., when we
create a snapshot we only COW the root and mark only the root's
*immediate* children as shared in the extent tree?
Can the new backref walking code be used here to check more
accurately whether the extent is shared by anybody else?
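
For example, something along these lines (a rough sketch using the
backref-walking API as I remember it around 3.8; the exact
btrfs_find_all_roots() signature may differ between kernel versions):

	/* returns >0 if more than one root still references the extent */
	static int extent_shared_by_other_roots(struct btrfs_fs_info *fs_info,
						u64 bytenr)
	{
		struct ulist *roots = NULL;
		int shared;
		int ret;

		ret = btrfs_find_all_roots(NULL, fs_info, bytenr, 0, &roots);
		if (ret)
			return ret;

		shared = roots->nnodes > 1;	/* more than our own root */
		ulist_free(roots);
		return shared;
	}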

Thanks,
Alex.



On Mon, Feb 25, 2013 at 9:00 PM, Aastha Mehta aasth...@gmail.com wrote:
 Ah okay, I now see how it works. Thanks a lot for your response.

 Regards,
 Aastha.


 On 25 February 2013 18:27, Josef Bacik jba...@fusionio.com wrote:
 On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
 Thanks again Josef.

 I understood that cow_file_range is called for a regular file. Just to
 clarify, in cow_file_range is cow done at the time of reserving
 extents in the extent btree for the io to be done in this delalloc? I
 see the following comment above find_free_extent() which is called
 while trying to reserve extents:

 /*
  * walks the btree of allocated extents and find a hole of a given size.
  * The key ins is changed to record the hole:
  * ins->objectid == block start
  * ins->flags = BTRFS_EXTENT_ITEM_KEY
  * ins->offset == number of blocks
  * Any available blocks before search_start are skipped.
  */

 This seems to be the only place where a cow might be done, because a
 key is being inserted into an extent which modifies it.


 The key isn't inserted at this time, it's just returned with those
 values for us to do as we please.  There is no update of the btree
 until insert_reserved_extent/btrfs_mark_extent_written in
 btrfs_finish_ordered_io.
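
In outline, the flow being described is roughly (function names from
the ~3.8 sources, simplified; these are not exact call signatures):

	/*
	 * cow_file_range(inode, ...)
	 *   btrfs_reserve_extent(...)     -- find_free_extent() picks a
	 *                                    hole; nothing is inserted in
	 *                                    the btree yet
	 *   btrfs_add_ordered_extent(...) -- records the pending allocation
	 *
	 * later, when the data IO completes:
	 *
	 * btrfs_finish_ordered_io(...)
	 *   insert_reserved_extent(...) / btrfs_mark_extent_written(...)
	 *                                 -- only now are the btrees updated
	 */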
 Thanks,

 Josef


Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-03-02 Thread Alex Lyakas
Hi Miao,
thanks for the great ASCII graphics and detailed explanation!

Alex.


On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
 Hi Miao,
 can you please explain your solution a bit more.

 On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 Now btrfs_commit_transaction() does this

 ret = btrfs_run_ordered_operations(root, 0)

 which async-flushes all inodes on the ordered operations list; this
 introduced a deadlock in which the transaction-start task, the
 transaction-commit task and the flush workers waited for each other.
 (See the following URL to get the detail
  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)

 As we know, if ->in_commit is set, it means someone is committing the
 current transaction; we should not try to join it unless we are JOIN
 or JOIN_NOLOCK, and waiting is the best choice in that case. In this
 way we can avoid the above problem. There is another benefit: once we
 set ->in_commit, no new transaction handle can block the transaction
 which is on its way to commit.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/transaction.c |   17 -
  1 files changed, 16 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index bc2f2d1..71b7e2e 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
 	root->commit_root = btrfs_root_node(root);
  }
 
 +static inline int can_join_transaction(struct btrfs_transaction *trans,
 +				       int type)
 +{
 +	return !(trans->in_commit &&
 +		 type != TRANS_JOIN &&
 +		 type != TRANS_JOIN_NOLOCK);
 +}
 +
  /*
   * either allocate a new transaction or hop into the existing one
   */
 @@ -86,6 +94,10 @@ loop:
 		spin_unlock(&fs_info->trans_lock);
 		return cur_trans->aborted;
 	}
 +	if (!can_join_transaction(cur_trans, type)) {
 +		spin_unlock(&fs_info->trans_lock);
 +		return -EBUSY;
 +	}
 	atomic_inc(&cur_trans->use_count);
 	atomic_inc(&cur_trans->num_writers);
 	cur_trans->num_joined++;
 @@ -360,8 +372,11 @@ again:
 
 	do {
 		ret = join_transaction(root, type);
 -		if (ret == -EBUSY)
 +		if (ret == -EBUSY) {
 			wait_current_trans(root);
 +			if (unlikely(type == TRANS_ATTACH))
 +				ret = -ENOENT;
 +		}

 So I understand that instead of incrementing num_writers and joining
 the current transaction, you do not join, and instead wait for the
 current transaction to unblock.

 More specifically, TRANS_START, TRANS_USERSPACE and TRANS_ATTACH cannot
 join and just wait for the current transaction to unblock if ->in_commit
 is set.

 Which task in Josef's example
 http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
 task 1, task 2 or task 3 is the one that will not join the
 transaction, but instead wait?

 Task1 will not join the transaction; in this way, the async inode flush
 won't run, and then task3 won't do anything.

 Before applying the patch:
 Start/Attach_Trans_Task          Commit_Task                 Flush_Worker
 (Task1)                          (Task2)                     (Task3) <-- the name in Josef's example
 btrfs_start_transaction()
  |-may_wait_transaction()
  |  (return 0)
  |                               btrfs_commit_transaction()
  |                                |-set ->in_commit and
  |                                |  blocked to 1
  |                                |-wait writers to be 1
  |                                |  (writers is 1)
  |-join_transaction()             |
  |  (writers is 2)                |
  |-btrfs_commit_transaction()     |
  |                                |-set trans_no_join to 1
  |                                |  (close join transaction)
  |-btrfs_run_ordered_operations   |
  |  (Those ordered operations     |
  |   are added when releasing     |
  |   file)                        |
  |-async inode flush()            |
  |-wait_flush_complete()          |
  |                                                           work_loop()
  |                                                            |-run_work()
  |                                                             |-btrfs_join_transaction()
  |                                                               |-wait_current_trans()
  |                                |-wait writers to be 1
 
 These three tasks waited for each other.

 After applying

Re: [PATCH v2] btrfs: clean snapshots one by one

2013-03-02 Thread Alex Lyakas
Hi David,

On Fri, Mar 1, 2013 at 6:17 PM, David Sterba dste...@suse.cz wrote:
 Each time pick one dead root from the list and let the caller know if
 it's needed to continue. This should improve responsiveness during
 umount and balance which at some point wait for cleaning all currently
 queued dead roots.

 A new dead root is added to the end of the list, so the snapshots
 disappear in the order of deletion.

 Process snapshot cleaning is now done only from the cleaner thread and
 the others wake it if needed.

 Signed-off-by: David Sterba dste...@suse.cz
 ---

 v1-v2:
 - added s_umount trylock in cleaner thread
 - added exit into drop_snapshot if fs is going down

 patch based on cmason/integration

  fs/btrfs/disk-io.c |   10 ++--
  fs/btrfs/extent-tree.c |8 ++
  fs/btrfs/relocation.c  |3 --
  fs/btrfs/transaction.c |   57 
 
  fs/btrfs/transaction.h |2 +-
  5 files changed, 54 insertions(+), 26 deletions(-)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index eb7c143..cc85fc7 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -1652,15 +1652,19 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 
 	do {
 +		int again = 0;
 +
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 +		    down_read_trylock(&root->fs_info->sb->s_umount) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
 -			btrfs_clean_old_snapshots(root);
 +			again = btrfs_clean_one_deleted_snapshot(root);
 			mutex_unlock(&root->fs_info->cleaner_mutex);
 			btrfs_run_defrag_inodes(root->fs_info);
 +			up_read(&root->fs_info->sb->s_umount);
 		}
 
 -		if (!try_to_freeze()) {
 +		if (!try_to_freeze() && !again) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (!kthread_should_stop())
 				schedule();
 @@ -3338,8 +3342,8 @@ int btrfs_commit_super(struct btrfs_root *root)
 
 	mutex_lock(&root->fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(root);
 -	btrfs_clean_old_snapshots(root);
 	mutex_unlock(&root->fs_info->cleaner_mutex);
 +	wake_up_process(root->fs_info->cleaner_kthread);
 
 	/* wait until ongoing cleanup work done */
 	down_write(&root->fs_info->cleanup_work_sem);
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index d2b3a5e..0119ae7 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -7078,6 +7078,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
   * reference count by one. if update_ref is true, this function
   * also make sure backrefs for the shared block and all lower level
   * blocks are properly updated.
 + *
 + * If called with for_reloc == 0, may exit early with -EAGAIN
   */
  int btrfs_drop_snapshot(struct btrfs_root *root,
  			struct btrfs_block_rsv *block_rsv, int update_ref,
 @@ -7179,6 +7181,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
 
 	while (1) {
 +		if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
 +			pr_debug("btrfs: drop snapshot early exit\n");
 +			err = -EAGAIN;
 +			goto out_end_trans;
 +		}
 +
 		ret = walk_down_tree(trans, root, path, wc);
 		if (ret < 0) {
 			err = ret;
 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index ba5a321..ab6a718 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -4060,10 +4060,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
 
 	while (1) {
 		mutex_lock(&fs_info->cleaner_mutex);
 -
 -		btrfs_clean_old_snapshots(fs_info->tree_root);
 		ret = relocate_block_group(rc);
 -
 		mutex_unlock(&fs_info->cleaner_mutex);
 		if (ret < 0) {
 			err = ret;
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index a83d486..6b233c15 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -950,7 +950,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
  int btrfs_add_dead_root(struct btrfs_root *root)
  {
 	spin_lock(&root->fs_info->trans_lock);
 -	list_add(&root->root_list, &root->fs_info->dead_roots);
 +	list_add_tail(&root->root_list, &root->fs_info->dead_roots);
 	spin_unlock(&root->fs_info->trans_lock);
 	return 0;
  }
 @@ -1858,31 +1858,50 @@ cleanup_transaction:
  }
 
  /*
 - * interface function to delete all the snapshots we have scheduled for deletion
 + * return < 0 if error
 + * 0 if there are no more 

Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit

2013-02-24 Thread Alex Lyakas
Hi Miao,
can you please explain your solution a bit more.

On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 Now btrfs_commit_transaction() does this

 ret = btrfs_run_ordered_operations(root, 0)

 which async-flushes all inodes on the ordered operations list; this
 introduced a deadlock in which the transaction-start task, the
 transaction-commit task and the flush workers waited for each other.
 (See the following URL to get the detail
  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)

 As we know, if ->in_commit is set, it means someone is committing the
 current transaction; we should not try to join it unless we are JOIN
 or JOIN_NOLOCK, and waiting is the best choice in that case. In this
 way we can avoid the above problem. There is another benefit: once we
 set ->in_commit, no new transaction handle can block the transaction
 which is on its way to commit.

 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/transaction.c |   17 -
  1 files changed, 16 insertions(+), 1 deletions(-)

 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index bc2f2d1..71b7e2e 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
 	root->commit_root = btrfs_root_node(root);
  }
 
 +static inline int can_join_transaction(struct btrfs_transaction *trans,
 +				       int type)
 +{
 +	return !(trans->in_commit &&
 +		 type != TRANS_JOIN &&
 +		 type != TRANS_JOIN_NOLOCK);
 +}
 +
  /*
   * either allocate a new transaction or hop into the existing one
   */
 @@ -86,6 +94,10 @@ loop:
 		spin_unlock(&fs_info->trans_lock);
 		return cur_trans->aborted;
 	}
 +	if (!can_join_transaction(cur_trans, type)) {
 +		spin_unlock(&fs_info->trans_lock);
 +		return -EBUSY;
 +	}
 	atomic_inc(&cur_trans->use_count);
 	atomic_inc(&cur_trans->num_writers);
 	cur_trans->num_joined++;
 @@ -360,8 +372,11 @@ again:
 
 	do {
 		ret = join_transaction(root, type);
 -		if (ret == -EBUSY)
 +		if (ret == -EBUSY) {
 			wait_current_trans(root);
 +			if (unlikely(type == TRANS_ATTACH))
 +				ret = -ENOENT;
 +		}

So I understand that instead of incrementing num_writers and joining
the current transaction, you do not join, and instead wait for the
current transaction to unblock.

Which task in Josef's example
http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
task 1, task 2 or task 3 is the one that will not join the
transaction, but instead wait?

Also, I think I don't fully understand Josef's example. What is
preventing the async flushing from completing?
Is task 3 waiting because trans_no_join is set?
Is task 3 the one that actually does the delalloc flush?

Thanks,
Alex.






 } while (ret == -EBUSY);

 	if (ret < 0) {
 --
 1.6.5.2


Re: LAST CALL FOR BTRFS-NEXT

2013-02-20 Thread Alex Lyakas
Hi Josef,
can you please consider including these two patches from Jan 28:
https://patchwork.kernel.org/patch/2057051/
https://patchwork.kernel.org/patch/2057071/

I realize they carry a "V2" label although the cover letter said "V3";
that was my bad. However, they both apply to what you have now in
btrfs-next.

Thanks,
Alex.


On Wed, Feb 20, 2013 at 5:12 PM, David Sterba dste...@suse.cz wrote:
 Please add this patch to next queue

 btrfs: limit fallocate extent reservation to 256MB
 https://patchwork.kernel.org/patch/1752311/

 Tested-by: David Sterba dste...@suse.cz

 thanks,
 david


Re: [RFC][PATCH] btrfs: clean snapshots one by one

2013-02-17 Thread Alex Lyakas
Hi David,
thank you for addressing this issue.

On Mon, Feb 11, 2013 at 6:11 PM, David Sterba dste...@suse.cz wrote:
 Each time pick one dead root from the list and let the caller know if
 it's needed to continue. This should improve responsiveness during
 umount and balance which at some point wait for cleaning all currently
 queued dead roots.

 A new dead root is added to the end of the list, so the snapshots
 disappear in the order of deletion.

 Process snapshot cleaning is now done only from the cleaner thread and
 the others wake it if needed.
This is great.



 Signed-off-by: David Sterba dste...@suse.cz
 ---

 * btrfs_clean_old_snapshots is removed from the reloc loop, I don't know if 
 this
   is safe wrt reloc's assumptions

 * btrfs_run_delayed_iputs is left in place in super_commit, may get removed as
   well because transaction commit calls it in the end

 * the responsiveness can be improved further if btrfs_drop_snapshot check
   fs_closing, but this needs changes to error handling in the main reloc loop

  fs/btrfs/disk-io.c |8 --
  fs/btrfs/relocation.c  |3 --
  fs/btrfs/transaction.c |   57 
 
  fs/btrfs/transaction.h |2 +-
  4 files changed, 44 insertions(+), 26 deletions(-)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 51bff86..6a02336 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -1635,15 +1635,17 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 
 	do {
 +		int again = 0;
 +
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
 -			btrfs_clean_old_snapshots(root);
 +			again = btrfs_clean_one_deleted_snapshot(root);
 			mutex_unlock(&root->fs_info->cleaner_mutex);
 			btrfs_run_defrag_inodes(root->fs_info);
 		}
 
 -		if (!try_to_freeze()) {
 +		if (!try_to_freeze() && !again) {
 			set_current_state(TASK_INTERRUPTIBLE);
 			if (!kthread_should_stop())
 				schedule();
 @@ -3301,8 +3303,8 @@ int btrfs_commit_super(struct btrfs_root *root)
 
 	mutex_lock(&root->fs_info->cleaner_mutex);
 	btrfs_run_delayed_iputs(root);
 -	btrfs_clean_old_snapshots(root);
 	mutex_unlock(&root->fs_info->cleaner_mutex);
 +	wake_up_process(root->fs_info->cleaner_kthread);
I am probably missing something, but if the cleaner wakes up here,
won't it attempt cleaning the next snap? Because I don't see the
cleaner checking anywhere that we are unmounting. Or at this point
dead_roots is supposed to be empty?



 	/* wait until ongoing cleanup work done */
 	down_write(&root->fs_info->cleanup_work_sem);
 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index ba5a321..ab6a718 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -4060,10 +4060,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
 
 	while (1) {
 		mutex_lock(&fs_info->cleaner_mutex);
 -
 -		btrfs_clean_old_snapshots(fs_info->tree_root);
 		ret = relocate_block_group(rc);
 -
 		mutex_unlock(&fs_info->cleaner_mutex);
 		if (ret < 0) {
 			err = ret;
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index 361fb7d..f1e3606 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -895,7 +895,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
  int btrfs_add_dead_root(struct btrfs_root *root)
  {
 	spin_lock(&root->fs_info->trans_lock);
 -	list_add(&root->root_list, &root->fs_info->dead_roots);
 +	list_add_tail(&root->root_list, &root->fs_info->dead_roots);
 	spin_unlock(&root->fs_info->trans_lock);
 	return 0;
  }
 @@ -1783,31 +1783,50 @@ cleanup_transaction:
  }
 
  /*
 - * interface function to delete all the snapshots we have scheduled for deletion
 + * return < 0 if error
 + * 0 if there are no more dead_roots at the time of call
 + * 1 there are more to be processed, call me again
 + *
 + * The return value indicates there are certainly more snapshots to delete, but
 + * if there comes a new one during processing, it may return 0. We don't mind,
 + * because btrfs_commit_super will poke cleaner thread and it will process it a
 + * few seconds later.
   */
 -int btrfs_clean_old_snapshots(struct btrfs_root *root)
 +int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root)
  {
 -	LIST_HEAD(list);
 +	int ret;
 +	int run_again = 1;
 	struct btrfs_fs_info *fs_info = root->fs_info;
 
 +	if (root->fs_info->sb->s_flags & MS_RDONLY) {
 +		pr_debug("G btrfs: cleaner called for RO fs!\n");
 +

Re: Deleted subvolume reappears and other cleaner issues

2013-02-10 Thread Alex Lyakas
Thanks for your comments, Josef.

Another thing that confuses me is that there are some cases in which
btrfs_drop_snapshot() has a failure but still returns 0, for example
if btrfs_del_root() fails. (For cases when btrfs_drop_snapshot()
returns non-zero there is a BUG_ON.)
So in this case, for me __btrfs_abort_transaction() sees that
trans->blocks_used == 0, so it doesn't call __btrfs_std_error(), which
would otherwise force the filesystem to become RO. After that,
btrfs_drop_snapshot() successfully completes and, basically, nobody
will retry the subvol deletion. In addition, in this case, after a
couple of seconds the machine completely freezes for me. I have not
yet managed to determine why.

Thanks,
Alex.


On Wed, Feb 6, 2013 at 5:14 PM, Josef Bacik jba...@fusionio.com wrote:
 On Thu, Jan 31, 2013 at 06:03:06AM -0700, Alex Lyakas wrote:
 Hi,
 I want to check if any of the below issues are worth/should be  fixed:

 # btrfs_ioctl_snap_destroy() does not commit a transaction. As a
 result, user can ask to delete a subvol, he receives ok back. Even
 if user does btrfs sub list,
 he will not see the deleted subvol (even though the transaction was
 not committed yet). But if a crash happens, ORPHAN_ITEM will not
 re-appear after crash.
 So after crash, the subvolume still exists perfectly fine (happened
 couple of times here).

 Same thing happens to normal unlinks, I don't see a reason to have different
 rules for subvols.


 # btrfs_drop_snapshot() does not commit a transaction after
 btrfs_del_orphan_item(). So if the subvol deletion completed in one go
 (did not have to detach and re-attach to transaction, thus committing
 the ORPHAN_ITEM and drop_progress/level), then after crash ORPHAN_ITEM
 will not be in the tree, and subvolume still exists.


 Again same thing happens with normal files.

  # btrfs_drop_snapshot() checks btrfs_should_end_transaction(), and
  then does btrfs_end_transaction_throttle() and
  btrfs_start_transaction(). However, it looks like it can rejoin the
  same transaction if the transaction was not blocked yet. Minor issue,
  perhaps?

  No, if we didn't block then it's ok and we wait longer; we only
  throttle to give the transaction stuff a chance to commit, so if the
  join logic decides it's ok to go on then we're good.


 # umount may get delayed because of pending-for-deletion subvolumes:
 btrfs_commit_super() locks the cleaner_mutex, so it will wait for the
 cleaner to complete.
 On the other hand, cleaner will not give up until it completes
 processing all its splice. If currently cleaner is not running, then
 btrfs_commit_super()
 calls btrfs_clean_old_snapshots() directly. So does it make sense:
 - btrfs_commit_super() will not call btrfs_clean_old_snapshots()
 - close_ctree() calls kthread_stop(cleaner_kthread) early, and cleaner
 thread periodically checks if it needs to exit

 I don't quite follow this, but sure?  Thanks,

 Josef


Re: [PATCH 2/2] Btrfs: fix memory leak of pending_snapshot-inherit

2013-02-07 Thread Alex Lyakas
Arne, Miao,
I also agree that it is better to move this responsibility to
create_pending_snapshot().

Alex.


On Thu, Feb 7, 2013 at 10:43 AM, Arne Jansen sensi...@gmx.net wrote:
 On 02/07/13 07:02, Miao Xie wrote:
 The argument inherit of btrfs_ioctl_snap_create_transid() was assigned
 to NULL while we created the snapshots, so we didn't free it even
 though we called kfree() in the caller.
 
 But since we are sure the snapshot creation is done after the function -
 btrfs_ioctl_snap_create_transid() - completes, it is safe not to assign
 the pointer inherit to NULL, and to just free it in the caller of
 btrfs_ioctl_snap_create_transid(). In this way, the code becomes more
 readable.

 NAK. The snapshot creation is triggered from btrfs_commit_transaction,
 I don't want to implicitly rely on commit_transaction being called for
 each snapshot created. I'm not even sure the async path really commits
 the transaction.
 The responsibility for the creation is passed to the pending_snapshot
 data structure, and so should the responsibility for the inherit struct.

 -Arne


 Reported-by: Alex Lyakas alex.bt...@zadarastorage.com
 Cc: Arne Jansen sensi...@gmx.net
 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/ioctl.c | 18 +++---
  1 file changed, 7 insertions(+), 11 deletions(-)

 diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
 index 02d3035..40f2fbf 100644
 --- a/fs/btrfs/ioctl.c
 +++ b/fs/btrfs/ioctl.c
 @@ -367,7 +367,7 @@ static noinline int create_subvol(struct btrfs_root 
 *root,
 struct dentry *dentry,
 char *name, int namelen,
 u64 *async_transid,
 -   struct btrfs_qgroup_inherit **inherit)
 +   struct btrfs_qgroup_inherit *inherit)
  {
   struct btrfs_trans_handle *trans;
   struct btrfs_key key;
 @@ -401,8 +401,7 @@ static noinline int create_subvol(struct btrfs_root *root,
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
 
 -	ret = btrfs_qgroup_inherit(trans, root->fs_info, 0, objectid,
 -				   inherit ? *inherit : NULL);
 +	ret = btrfs_qgroup_inherit(trans, root->fs_info, 0, objectid, inherit);
 	if (ret)
 		goto fail;

 @@ -530,7 +529,7 @@ fail:

  static int create_snapshot(struct btrfs_root *root, struct dentry *dentry,
  char *name, int namelen, u64 *async_transid,
 -bool readonly, struct btrfs_qgroup_inherit 
 **inherit)
 +bool readonly, struct btrfs_qgroup_inherit *inherit)
  {
   struct inode *inode;
   struct btrfs_pending_snapshot *pending_snapshot;
 @@ -549,10 +548,7 @@ static int create_snapshot(struct btrfs_root *root, struct dentry *dentry,
 	pending_snapshot->dentry = dentry;
 	pending_snapshot->root = root;
 	pending_snapshot->readonly = readonly;
 -	if (inherit) {
 -		pending_snapshot->inherit = *inherit;
 -		*inherit = NULL;	/* take responsibility to free it */
 -	}
 +	pending_snapshot->inherit = inherit;
 
 	trans = btrfs_start_transaction(root->fs_info->extent_root, 6);
   if (IS_ERR(trans)) {
 @@ -692,7 +688,7 @@ static noinline int btrfs_mksubvol(struct path *parent,
  char *name, int namelen,
  struct btrfs_root *snap_src,
  u64 *async_transid, bool readonly,
 -struct btrfs_qgroup_inherit **inherit)
 +struct btrfs_qgroup_inherit *inherit)
  {
 	struct inode *dir = parent->dentry->d_inode;
   struct dentry *dentry;
 @@ -1454,7 +1450,7 @@ out:
  static noinline int btrfs_ioctl_snap_create_transid(struct file *file,
   char *name, unsigned long fd, int subvol,
   u64 *transid, bool readonly,
 - struct btrfs_qgroup_inherit **inherit)
 + struct btrfs_qgroup_inherit *inherit)
  {
   int namelen;
   int ret = 0;
 @@ -1563,7 +1559,7 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file,
 
 	ret = btrfs_ioctl_snap_create_transid(file, vol_args->name,
 					      vol_args->fd, subvol, ptr,
 -					      readonly, &inherit);
 +					      readonly, inherit);
 
 	if (ret == 0 && ptr &&
 	    copy_to_user(arg +




Re: Deleted subvolume reappears and other cleaner issues

2013-02-06 Thread Alex Lyakas
Can anyone please comment on below?

On Thu, Jan 31, 2013 at 3:03 PM, Alex Lyakas
alex.bt...@zadarastorage.com wrote:
 Hi,
 I want to check if any of the below issues are worth/should be  fixed:

 # btrfs_ioctl_snap_destroy() does not commit a transaction. As a
 result, user can ask to delete a subvol, he receives ok back. Even
 if user does btrfs sub list,
 he will not see the deleted subvol (even though the transaction was
 not committed yet). But if a crash happens, ORPHAN_ITEM will not
 re-appear after crash.
 So after crash, the subvolume still exists perfectly fine (happened
 couple of times here).

 # btrfs_drop_snapshot() does not commit a transaction after
 btrfs_del_orphan_item(). So if the subvol deletion completed in one go
 (did not have to detach and re-attach to transaction, thus committing
 the ORPHAN_ITEM and drop_progress/level), then after crash ORPHAN_ITEM
 will not be in the tree, and subvolume still exists.

 # btrfs_drop_snapshot() checks btrfs_should_end_transaction(), and
 then does btrfs_end_transaction_throttle() and
 btrfs_start_transaction(). However, it looks like it can rejoin the
 same transaction if the transaction was not blocked yet. Minor issue,
 perhaps?

 # umount may get delayed because of pending-for-deletion subvolumes:
 btrfs_commit_super() locks the cleaner_mutex, so it will wait for the
 cleaner to complete.
 On the other hand, cleaner will not give up until it completes
 processing all its splice. If currently cleaner is not running, then
 btrfs_commit_super()
 calls btrfs_clean_old_snapshots() directly. So does it make sense:
 - btrfs_commit_super() will not call btrfs_clean_old_snapshots()
 - close_ctree() calls kthread_stop(cleaner_kthread) early, and cleaner
 thread periodically checks if it needs to exit

 Thanks,
 Alex.


Leaking btrfs_qgroup_inherit on snapshot creation?

2013-02-06 Thread Alex Lyakas
Hi Jan, Arne,
I see this code in create_snapshot:

if (inherit) {
	pending_snapshot->inherit = *inherit;
	*inherit = NULL;	/* take responsibility to free it */
}

So, first thing, I think it should be:
	if (*inherit)
because in btrfs_ioctl_snap_create_v2() we have:
	struct btrfs_qgroup_inherit *inherit = NULL;
	...
	btrfs_ioctl_snap_create_transid(..., &inherit)

so the inherit pointer passed in is the address of a local variable
and is very unlikely to be NULL.

Second, I don't see anybody freeing pending_snapshot->inherit. I guess
it should be freed after calling btrfs_qgroup_inherit() and also in
btrfs_destroy_pending_snapshots().
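
Something along these lines, perhaps (a hypothetical sketch of the
fix, not a tested patch; free_pending_inherit() is a made-up helper):

	static void free_pending_inherit(struct btrfs_pending_snapshot *pending)
	{
		kfree(pending->inherit);	/* kfree(NULL) is a no-op */
		pending->inherit = NULL;
	}

	/* call it right after btrfs_qgroup_inherit() consumes the struct
	 * in create_pending_snapshot(), and for each pending snapshot
	 * torn down in btrfs_destroy_pending_snapshots() */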

Thanks,
Alex.


Deleted subvolume reappears and other cleaner issues

2013-01-31 Thread Alex Lyakas
Hi,
I want to check if any of the below issues are worth/should be fixed:

# btrfs_ioctl_snap_destroy() does not commit a transaction. As a
result, a user can ask to delete a subvol and receive "ok" back. If
the user then does "btrfs sub list", he will not see the deleted
subvol (even though the transaction has not been committed yet). But
if a crash happens, the ORPHAN_ITEM will not re-appear after the
crash, so after the crash the subvolume still exists perfectly fine
(this has happened a couple of times here).

# btrfs_drop_snapshot() does not commit a transaction after
btrfs_del_orphan_item(). So if the subvol deletion completed in one go
(did not have to detach and re-attach to transaction, thus committing
the ORPHAN_ITEM and drop_progress/level), then after crash ORPHAN_ITEM
will not be in the tree, and subvolume still exists.

# btrfs_drop_snapshot() checks btrfs_should_end_transaction(), and
then does btrfs_end_transaction_throttle() and
btrfs_start_transaction(). However, it looks like it can rejoin the
same transaction if the transaction was not blocked yet. Minor issue,
perhaps?

# umount may get delayed because of pending-for-deletion subvolumes:
btrfs_commit_super() locks the cleaner_mutex, so it will wait for the
cleaner to complete.
On the other hand, the cleaner will not give up until it finishes
processing its entire splice. If the cleaner is not currently running,
then btrfs_commit_super() calls btrfs_clean_old_snapshots() directly.
So does it make sense that:
- btrfs_commit_super() does not call btrfs_clean_old_snapshots()
- close_ctree() calls kthread_stop(cleaner_kthread) early, and the
cleaner thread periodically checks whether it needs to exit (see the
sketch below)
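
A hedged sketch of that second point, assuming the 3.8-era cleaner
structure (btrfs_clean_one_deleted_snapshot() here is a hypothetical
one-snapshot-at-a-time helper, so the thread gets a chance to re-check
between snapshots):

	static int cleaner_kthread(void *arg)
	{
		struct btrfs_root *root = arg;

		while (!kthread_should_stop()) {
			if (mutex_trylock(&root->fs_info->cleaner_mutex)) {
				btrfs_run_delayed_iputs(root);
				/* one snapshot per pass, so we can
				 * re-check kthread_should_stop() */
				btrfs_clean_one_deleted_snapshot(root);
				mutex_unlock(&root->fs_info->cleaner_mutex);
			}
			set_current_state(TASK_INTERRUPTIBLE);
			if (!kthread_should_stop())
				schedule();
			__set_current_state(TASK_RUNNING);
		}
		return 0;
	}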

Thanks,
Alex.

