Re: [PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
Hi Liu,

On Wed, Mar 22, 2017 at 1:40 AM, Liu Bo <bo.li@oracle.com> wrote:
> On Sun, Mar 19, 2017 at 07:18:59PM +0200, Alex Lyakas wrote:
>> We have a commit_root_sem, which is a read-write semaphore that protects the
>> commit roots.
>> But it is also used to protect the list of caching block groups.
>>
>> As a result, while doing "slow" caching, the following issue is seen:
>>
>> Some of the caching threads are scanning the extent tree with
>> commit_root_sem acquired in shared mode, with stack like:
>> [] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
>> [] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
>> [] read_tree_block+0x40/0x70 [btrfs]
>> [] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
>> [] btrfs_search_slot+0x3c6/0xb10 [btrfs]
>> [] caching_thread+0x1b9/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>>
>> IO requests that want to allocate space are waiting in cache_block_group()
>> to acquire the commit_root_sem in exclusive mode. But they only want to add
>> the caching control structure to the list of caching block-groups:
>> [] schedule+0x29/0x70
>> [] rwsem_down_write_failed+0x145/0x320
>> [] call_rwsem_down_write_failed+0x13/0x20
>> [] cache_block_group+0x25b/0x450 [btrfs]
>> [] find_free_extent+0xd16/0xdb0 [btrfs]
>> [] btrfs_reserve_extent+0xaf/0x160 [btrfs]
>>
>> Other caching threads want to continue their scanning, and for that they
>> are waiting to acquire commit_root_sem in shared mode. But since there are
>> IO threads that want the exclusive lock, the caching threads are unable
>> to continue the scanning, because (I presume) rw_semaphore guarantees some
>> fairness:
>> [] schedule+0x29/0x70
>> [] rwsem_down_read_failed+0xc5/0x120
>> [] call_rwsem_down_read_failed+0x14/0x30
>> [] caching_thread+0x1a1/0x820 [btrfs]
>> [] normal_work_helper+0xc6/0x340 [btrfs]
>> [] btrfs_cache_helper+0x12/0x20 [btrfs]
>> [] process_one_work+0x146/0x410
>>
>> This causes slowness of the IO, especially when there are many block groups
>> that need to be scanned for free space. In some cases it takes minutes
>> until a single IO thread is able to allocate free space.
>>
>> I don't see a deadlock here, because the caching threads that were able to
>> acquire the commit_root_sem will call rwsem_is_contended() and should give
>> up the semaphore, so that IO threads are able to acquire it in exclusive
>> mode.
>>
>> However, introducing a separate mutex that protects only the list of caching
>> block groups makes things move forward much faster.
>>
>
> The problem did exist and the patch looks good to me.
>
>> This patch is based on kernel 3.18.
>> Unfortunately, I am not able to submit a patch based on one of the latest
>> kernels, because here btrfs is part of the larger system, and upgrading
>> the kernel is a significant effort.
>> Hence marking the patch as RFC.
>> Hopefully, this patch still has some value to the community.
>>
>> Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
>>
>> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
>> index 42d11e7..74feacb 100644
>> --- a/fs/btrfs/ctree.h
>> +++ b/fs/btrfs/ctree.h
>> @@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
>>  	struct list_head trans_list;
>>  	struct list_head dead_roots;
>>  	struct list_head caching_block_groups;
>> +	/* protects the above list */
>> +	struct mutex caching_block_groups_mutex;
>>
>>  	spinlock_t delayed_iput_lock;
>>  	struct list_head delayed_iputs;
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 5177954..130ec58 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
>>  	INIT_LIST_HEAD(&fs_info->delayed_iputs);
>>  	INIT_LIST_HEAD(&fs_info->delalloc_roots);
>>  	INIT_LIST_HEAD(&fs_info->caching_block_groups);
>> +	mutex_init(&fs_info->caching_block_groups_mutex);
>>  	spin_lock_init(&fs_info->delalloc_root_lock);
>>  	spin_lock_init(&fs_info->trans_lock);
>>  	spin_lock_init(&fs_info->fs_roots_radix_lock);
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index a067065..906fb08 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
>>  		return 0;
>>  	}
>
Re: include linux kernel headers for btrfs filesystem
Ilan,

On Mon, Mar 20, 2017 at 10:33 AM, Ilan Schwarts wrote:
> I need to cast struct inode to struct btrfs_inode.
> in order to do it, i looked at implementation of btrfs_getattr.
>
> the code is simple:
> struct btrfs_inode *btrfsInode;
> btrfsInode = BTRFS_I(inode);
>
> in order to compile i must add the headers on top of the function:
> #include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/ctree.h"
> #include "/data/kernel/linux-4.1.21-x86_64/fs/btrfs/btrfs_inode.h"
>
> What is the problem ?
> I must manually download and include ctree.h and btrfs_inode.h, they
> are not provided in the kernel-headers package.
> On every platform I compile my driver, I have specific VM for the
> distro/kernel version, so on every VM I usually download package
> kernel-headers and everything compiles perfectly.
>
> btrfs was introduced in kernel 3.0 above.
> Arent the btrfs headers should be there ? do they exist in another
> package ? maybe fs-headers or something like that ?

Try using the simple Makefile below [1] to compile the btrfs loadable module. You need to have the kernel-headers package installed.

You can place the Makefile anywhere you want, and compile via:
# make -f

Thanks,
Alex.

[1]
obj-m += btrfs.o

# or substitute with hard-coded kernel version
KVERSION = $(shell uname -r)

SRC_DIR=/fs/btrfs
BTRFS_KO=btrfs.ko

# or specify any other output directory
OUT_DIR=/lib/modules/$(KVERSION)/kernel/fs/btrfs

all: $(OUT_DIR)/$(BTRFS_KO)

$(OUT_DIR)/$(BTRFS_KO): $(SRC_DIR)/$(BTRFS_KO)
	cp $(SRC_DIR)/$(BTRFS_KO) $(OUT_DIR)/

$(SRC_DIR)/$(BTRFS_KO): $(SRC_DIR)/*.c $(SRC_DIR)/*.h
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(SRC_DIR) modules

clean:
	$(MAKE) -C /lib/modules/$(KVERSION)/build M=$(SRC_DIR) clean
	test -f $(OUT_DIR)/$(BTRFS_KO) && rm $(OUT_DIR)/$(BTRFS_KO)|| true

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RFC] btrfs: introduce a separate mutex for caching_block_groups list
We have a commit_root_sem, which is a read-write semaphore that protects the commit roots.
But it is also used to protect the list of caching block groups.

As a result, while doing "slow" caching, the following issue is seen:

Some of the caching threads are scanning the extent tree with commit_root_sem acquired in shared mode, with stack like:
[] read_extent_buffer_pages+0x2d2/0x300 [btrfs]
[] btree_read_extent_buffer_pages.constprop.50+0xb7/0x1e0 [btrfs]
[] read_tree_block+0x40/0x70 [btrfs]
[] read_block_for_search.isra.33+0x12c/0x370 [btrfs]
[] btrfs_search_slot+0x3c6/0xb10 [btrfs]
[] caching_thread+0x1b9/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]

IO requests that want to allocate space are waiting in cache_block_group() to acquire the commit_root_sem in exclusive mode. But they only want to add the caching control structure to the list of caching block-groups:
[] schedule+0x29/0x70
[] rwsem_down_write_failed+0x145/0x320
[] call_rwsem_down_write_failed+0x13/0x20
[] cache_block_group+0x25b/0x450 [btrfs]
[] find_free_extent+0xd16/0xdb0 [btrfs]
[] btrfs_reserve_extent+0xaf/0x160 [btrfs]

Other caching threads want to continue their scanning, and for that they are waiting to acquire commit_root_sem in shared mode. But since there are IO threads that want the exclusive lock, the caching threads are unable to continue the scanning, because (I presume) rw_semaphore guarantees some fairness:
[] schedule+0x29/0x70
[] rwsem_down_read_failed+0xc5/0x120
[] call_rwsem_down_read_failed+0x14/0x30
[] caching_thread+0x1a1/0x820 [btrfs]
[] normal_work_helper+0xc6/0x340 [btrfs]
[] btrfs_cache_helper+0x12/0x20 [btrfs]
[] process_one_work+0x146/0x410

This causes slowness of the IO, especially when there are many block groups that need to be scanned for free space. In some cases it takes minutes until a single IO thread is able to allocate free space.

I don't see a deadlock here, because the caching threads that were able to acquire the commit_root_sem will call rwsem_is_contended() and should give up the semaphore, so that IO threads are able to acquire it in exclusive mode.

However, introducing a separate mutex that protects only the list of caching block groups makes things move forward much faster.

This patch is based on kernel 3.18.
Unfortunately, I am not able to submit a patch based on one of the latest kernels, because here btrfs is part of the larger system, and upgrading the kernel is a significant effort.
Hence marking the patch as RFC.
Hopefully, this patch still has some value to the community.

Signed-off-by: Alex Lyakas <a...@zadarastorage.com>

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 42d11e7..74feacb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1490,6 +1490,8 @@ struct btrfs_fs_info {
 	struct list_head trans_list;
 	struct list_head dead_roots;
 	struct list_head caching_block_groups;
+	/* protects the above list */
+	struct mutex caching_block_groups_mutex;

 	spinlock_t delayed_iput_lock;
 	struct list_head delayed_iputs;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5177954..130ec58 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2229,6 +2229,7 @@ int open_ctree(struct super_block *sb,
 	INIT_LIST_HEAD(&fs_info->delayed_iputs);
 	INIT_LIST_HEAD(&fs_info->delalloc_roots);
 	INIT_LIST_HEAD(&fs_info->caching_block_groups);
+	mutex_init(&fs_info->caching_block_groups_mutex);
 	spin_lock_init(&fs_info->delalloc_root_lock);
 	spin_lock_init(&fs_info->trans_lock);
 	spin_lock_init(&fs_info->fs_roots_radix_lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a067065..906fb08 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -637,10 +637,10 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 		return 0;
 	}

-	down_write(&fs_info->commit_root_sem);
+	mutex_lock(&fs_info->caching_block_groups_mutex);
 	atomic_inc(&caching_ctl->count);
 	list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
-	up_write(&fs_info->commit_root_sem);
+	mutex_unlock(&fs_info->caching_block_groups_mutex);

 	btrfs_get_block_group(cache);

@@ -5693,6 +5693,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,

 	down_write(&fs_info->commit_root_sem);

+	mutex_lock(&fs_info->caching_block_groups_mutex);
 	list_for_each_entry_safe(caching_ctl, next,
 				 &fs_info->caching_block_groups, list) {
 		cache = caching_ctl->block_group;
@@ -5704,6 +5705,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
 			cache->last_byte_to_unpin = caching_ctl->progress;
 		}
 	}
+	mutex_unlock(&fs_info->caching_block_groups_mutex);

 	if (fs_info->pinned_extents == &fs_info->freed_extents[0])
 		fs_info->pinned_extents = &fs_info->freed_extents[1];
@@ -8849,14 +8851,14 @@ int btrfs_free_block_gro
Re: [PATCH] Btrfs: deal with unexpected return value in flush_space
David, Holger,

Thank you for picking up that old patch of mine.

Alex.

On Fri, Jul 29, 2016 at 8:53 PM, Liu Bo wrote:
> On Fri, Jul 29, 2016 at 07:01:50PM +0200, David Sterba wrote:
>> On Thu, Jul 28, 2016 at 11:49:14AM -0700, Liu Bo wrote:
>> > > For reviewers - this came up before here:
>> > > https://patchwork.kernel.org/patch/7778651/
>
> David, this patch made a mistake in commit log.
>
>> > >
>> > > Same fix basically.
>> >
>> > Aha, I've given it my Reviewed-by.
>> >
>> > Taking either one works for me, I can make the clarifying comment into a
>> > separate patch if we need to.
>>
>> I'll pick the first patch and please send the separate comment update.
>> Thanks.
>
> Sure.
>
> Thanks,
>
> -liubo
[RFC - PATCH] btrfs: do not ignore errors from primary superblock
RFC: This patch is not for merging, but only for review and discussion.

When mounting, we consider only the primary superblock on each device. But when writing the superblocks, we might silently ignore errors from the primary superblock if we succeeded in writing the secondary superblocks. In such a case, the primary superblock was not updated properly, and if we crash at this point, a later mount will use an out-of-date superblock.

This patch changes the behavior to NOT IGNORE any errors on the primary superblock, and to IGNORE any errors on secondary superblocks. This way, we always insist on having an up-to-date primary superblock.

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4e47849..0ae9f7c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3357,11 +3357,13 @@ static int write_dev_supers(struct btrfs_device *device,
 			bh = __find_get_block(device->bdev, bytenr / 4096,
 					      BTRFS_SUPER_INFO_SIZE);
 			if (!bh) {
-				errors++;
+				/* we only care about primary superblock errors */
+				if (i == 0)
+					errors++;
 				continue;
 			}
 			wait_on_buffer(bh);
-			if (!buffer_uptodate(bh))
+			if (!buffer_uptodate(bh) && i == 0)
 				errors++;

 			/* drop our reference */
@@ -3388,9 +3390,10 @@ static int write_dev_supers(struct btrfs_device *device,
 					      BTRFS_SUPER_INFO_SIZE);
 			if (!bh) {
 				btrfs_err(device->dev_root->fs_info,
-				    "couldn't get super buffer head for bytenr %llu",
-				    bytenr);
-				errors++;
+				    "couldn't get super buffer head for bytenr %llu (sb copy %d)",
+				    bytenr, i);
+				if (i == 0)
+					errors++;
 				continue;
 			}

@@ -3413,10 +3416,10 @@ static int write_dev_supers(struct btrfs_device *device,
 			ret = btrfsic_submit_bh(WRITE_FUA, bh);
 		else
 			ret = btrfsic_submit_bh(WRITE_SYNC, bh);
-		if (ret)
+		if (ret && i == 0)
 			errors++;
 	}
-	return errors < i ? 0 : -1;
+	return errors ? -1 : 0;
 }

 /*

P.S.: when reviewing the code of write_dev_supers(), I also noticed that when wait==0 and we hit an error in one __getblk(), then the caller (write_all_supers) will not properly wait for submitted buffer-heads to complete, and we won't do the additional "brelse(bh);", which wait==0 case does. Is this a problem?
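The change in the return policy of write_dev_supers() can be illustrated outside the kernel. What follows is only a minimal user-space sketch, not the kernel code: old_policy(), new_policy() and the failed[] array are hypothetical names introduced for illustration, and index 0 stands for the primary superblock copy.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * failed[i] is true when writing superblock copy i failed;
 * copy 0 is the primary superblock. All names are illustrative.
 */

/* Old behaviour: report failure only if every attempted copy failed. */
int old_policy(const bool failed[], int copies)
{
	int errors = 0;

	assert(copies > 0);
	for (int i = 0; i < copies; i++)
		if (failed[i])
			errors++;
	return errors < copies ? 0 : -1;
}

/* Proposed behaviour: only an error on the primary copy counts. */
int new_policy(const bool failed[], int copies)
{
	int errors = 0;

	assert(copies > 0);
	for (int i = 0; i < copies; i++)
		if (failed[i] && i == 0)
			errors++;
	return errors ? -1 : 0;
}
```

With three copies where only the primary fails, the old policy would report success while the new one reports an error; with only the mirrors failing, both report success.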
Re: [PATCH v4 6/9] Btrfs: implement the free space B-tree
Hi Omar, Chris,

I have reviewed the free-space-tree code. It is a very nice feature.

However, I have a basic understanding question.

Let's say we are running a delayed ref which inserts a new EXTENT_ITEM into the extent tree, e.g., we are in alloc_reserved_file_extent. At this point we call remove_from_free_space_tree(), which updates the free-space-tree about the allocated space. But this requires COWing the free-space-tree itself. So we allocate a new tree block for the free-space tree, and insert a new delayed ref, which will update the extent tree about the new tree block allocation. We also insert a delayed ref to free the previous copy of the free-space-tree block.

At some point we run these new delayed refs, so we insert/remove EXTENT_ITEMs from the extent tree, and this in turn requires us to update the free-space-tree again. So we again need to COW free-space-tree blocks, generating more delayed refs.

At which point does this recursion stop?

Do we assume that at some point all needed free-space tree blocks have been COW'ed already, and we do not COW a tree block more than once per transaction (unless it was written to disk due to memory pressure)?

Thanks!
Alex.

On Tue, Dec 29, 2015 at 11:19 PM, Chris Mason wrote:
> On Tue, Sep 29, 2015 at 08:50:35PM -0700, Omar Sandoval wrote:
>> From: Omar Sandoval
>>
>> The free space cache has turned out to be a scalability bottleneck on
>> large, busy filesystems. When the cache for a lot of block groups needs
>> to be written out, we can get extremely long commit times; if this
>> happens in the critical section, things are especially bad because we
>> block new transactions from happening.
>>
>> The main problem with the free space cache is that it has to be written
>> out in its entirety and is managed in an ad hoc fashion. Using a B-tree
>> to store free space fixes this: updates can be done as needed and we get
>> all of the benefits of using a B-tree: checksumming, RAID handling,
>> well-understood behavior.
>>
>> With the free space tree, we get commit times that are about the same as
>> the no cache case with load times slower than the free space cache case
>> but still much faster than the no cache case. Free space is represented
>> with extents until it becomes more space-efficient to use bitmaps,
>> giving us similar space overhead to the free space cache.
>>
>> The operations on the free space tree are: adding and removing free
>> space, handling the creation and deletion of block groups, and loading
>> the free space for a block group. We can also create the free space tree
>> by walking the extent tree and clear the free space tree.
>>
>> Signed-off-by: Omar Sandoval
>
>> +int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
>> +{
>> +	struct btrfs_trans_handle *trans;
>> +	struct btrfs_root *tree_root = fs_info->tree_root;
>> +	struct btrfs_root *free_space_root;
>> +	struct btrfs_block_group_cache *block_group;
>> +	struct rb_node *node;
>> +	int ret;
>> +
>> +	trans = btrfs_start_transaction(tree_root, 0);
>> +	if (IS_ERR(trans))
>> +		return PTR_ERR(trans);
>> +
>> +	free_space_root = btrfs_create_tree(trans, fs_info,
>> +					    BTRFS_FREE_SPACE_TREE_OBJECTID);
>> +	if (IS_ERR(free_space_root)) {
>> +		ret = PTR_ERR(free_space_root);
>> +		goto abort;
>> +	}
>> +	fs_info->free_space_root = free_space_root;
>> +
>> +	node = rb_first(&fs_info->block_group_cache_tree);
>> +	while (node) {
>> +		block_group = rb_entry(node, struct btrfs_block_group_cache,
>> +				       cache_node);
>> +		ret = populate_free_space_tree(trans, fs_info, block_group);
>> +		if (ret)
>> +			goto abort;
>> +		node = rb_next(node);
>> +	}
>> +
>> +	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
>> +
>> +	ret = btrfs_commit_transaction(trans, tree_root);
>> +	if (ret)
>> +		return ret;
>> +
>> +	return 0;
>> +
>> +abort:
>> +	btrfs_abort_transaction(trans, tree_root, ret);
>> +	btrfs_end_transaction(trans, tree_root);
>> +	return ret;
>> +}
>> +
>
> Hi Omar,
>
> The only problem I've hit testing this stuff is where we create the tree
> on existing filesystems. There are a few different problems here:
>
> 1) The populate code happens after resuming balance operations. The
> balancing code could be changing these block groups while we scan them.
> I fixed this by moving the scan up earlier.
>
> 2) Delayed references may be run, which will also change the extent tree
> as we're scanning it.
>
> 3) We might need to commit the transaction to reclaim space.
>
> For now I'm ignoring #3 and adding a flag in fs_info that will make us
> skip delayed references. This really isn't a good long term solution,
> we need to be able to do this on a per block group basis
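The termination argument suggested in the question at the top of this message can be sketched as a toy user-space model. It assumes exactly what the question assumes: a tree block already COWed in the current transaction, and not yet written out, is modified in place. The names toy_block, toy_cow_if_needed() and toy_run_updates() are hypothetical and are not btrfs structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_block {
	uint64_t generation;	/* transaction that last COWed this block */
	bool written;		/* flushed to disk since that COW? */
};

/*
 * Returns true if a new copy had to be made. A block COWed earlier in
 * the same transaction and still only in memory is reused in place.
 */
bool toy_cow_if_needed(struct toy_block *b, uint64_t transid)
{
	assert(b);
	if (b->generation == transid && !b->written)
		return false;
	b->generation = transid;
	b->written = false;
	return true;
}

/*
 * Each COW queues one more update against the same block (the
 * "recursion" from the question). Returns how many COWs happen
 * before the queue of pending updates drains.
 */
int toy_run_updates(struct toy_block *b, uint64_t transid, int pending)
{
	int cows = 0;

	while (pending > 0) {
		pending--;
		if (toy_cow_if_needed(b, transid)) {
			cows++;
			pending++;	/* the COW itself generates an update */
		}
	}
	return cows;
}
```

In this model a block from an older transaction is copied once, after which every further update in the same transaction lands in place, so the queue drains. A block flagged as written is copied once more, matching the memory-pressure caveat in the question.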
Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
Hello Qu, Wang,

On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>
>> Greetings Qu Wenruo,
>>
>> I have reviewed the dedup patchset found in the github account you
>> mentioned. I have several questions. Please note that by all means I
>> am not criticizing your design or code. I just want to make sure that
>> my understanding of the code is proper.
>
>
> It's OK to criticize the design or code, and that's how review works.
>
>>
>> 1) You mentioned in several emails that at some point byte-to-byte
>> comparison is to be performed. However, I do not see this in the code.
>> It seems that generic_search() only looks for the hash value match. If
>> there is a match, it goes ahead and adds a delayed ref.
>
>
> I mentioned byte-to-byte comparison as, "not to be implemented in any time
> soon".
>
> Considering the lack of facility to read out extent contents without any
> inode structure, it's not going to be done in any time soon.
>
>>
>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>> mutex and proceed with the normal COW. What happens if there are
>> several IO streams to different files writing an identical block, but
>> we don't have such block in our dedup DB? Then all
>> btrfs_dedupe_search() calls will not find a match, so all streams will
>> allocate space for their block (which are all identical). At some
>> point, they will call insert_reserved_file_extent() and will call
>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>> will insert the dedup hash entries into the DB, and all other streams
>> will find that such hash entry already exists. So the end result is
>> that we have the hash entry in the DB, but still we have multiple
>> copies of the same block allocated, due to timing issues. Is this
>> correct?
>
>
> That's right, and that's also unavoidable for the hash initializing stage.
>
>>
>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>> generic_search() wants to add a delayed ref to an existing extent,
>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>> DB. The race is resolved as follows:
>> - generic_search attempts to lock the delayed ref head
>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>> right now. So we can add a delayed ref. Later, when delayed ref head
>> will be run, it will figure out what needs to be done (free the extent
>> or not)
>> - if we fail to lock, then there is a delayed ref processing for this
>> bytenr. We drop all locks and redo the search from the top. If
>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>> not find it, and proceed with normal COW.
>> Is my understanding correct?
>
>
> Yes that's correct.

Reviewing the code again, it seems that I still lack understanding. What is special about the dedup code adding a delayed data ref versus other places doing that? In other places, we do not insist on locking the delayed ref head, but in dedup we do. For example, __btrfs_drop_extents calls btrfs_inc_extent_ref, without locking the ref head.

I know that one of your purposes was to draw attention to delayed ref processing, so you have succeeded.

Thanks,
Alex.

>
>>
>> I have also few nitpicks on the code, will reply to relevant patches.
>
>
> Feel free to comment.
>
> Thanks,
> Qu
>
>>
>> Thanks for doing this work,
>> Alex.
>>
>>
>>
>> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>>
>>> This updated version of inband de-duplication has the following features:
>>> 1) ONE unified dedup framework.
>>>    Most of its code is hidden quietly in dedup.c and export the minimal
>>>    interfaces for its caller.
>>>    Reviewer and further developer would benefit from the unified
>>>    framework.
>>>
>>> 2) TWO different back-end with different trade-off
>>>    One is the improved version of previous Fujitsu in-memory only dedup.
>>>    The other one is enhanced dedup implementation from Liu Bo.
>>>    Changed its tree structure to handle bytenr -> hash search for
>>>    deleting hash, without the hideous data backref hack.
>>>
>>> 3) Support compression with dedupe
>>>    Now dedupe can work with compression.
>
Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
Thanks for your comments, Qu.

Alex.

On Wed, Mar 30, 2016 at 2:34 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
>
>
> Alex Lyakas wrote on 2016/03/29 19:22 +0200:
>>
>> Greetings Qu Wenruo,
>>
>> I have reviewed the dedup patchset found in the github account you
>> mentioned. I have several questions. Please note that by all means I
>> am not criticizing your design or code. I just want to make sure that
>> my understanding of the code is proper.
>
>
> It's OK to criticize the design or code, and that's how review works.
>
>>
>> 1) You mentioned in several emails that at some point byte-to-byte
>> comparison is to be performed. However, I do not see this in the code.
>> It seems that generic_search() only looks for the hash value match. If
>> there is a match, it goes ahead and adds a delayed ref.
>
>
> I mentioned byte-to-byte comparison as, "not to be implemented in any time
> soon".
>
> Considering the lack of facility to read out extent contents without any
> inode structure, it's not going to be done in any time soon.
>
>>
>> 2) If btrfs_dedupe_search() does not find a match, we unlock the dedup
>> mutex and proceed with the normal COW. What happens if there are
>> several IO streams to different files writing an identical block, but
>> we don't have such block in our dedup DB? Then all
>> btrfs_dedupe_search() calls will not find a match, so all streams will
>> allocate space for their block (which are all identical). At some
>> point, they will call insert_reserved_file_extent() and will call
>> btrfs_dedupe_add(). Since there is a global mutex, the first stream
>> will insert the dedup hash entries into the DB, and all other streams
>> will find that such hash entry already exists. So the end result is
>> that we have the hash entry in the DB, but still we have multiple
>> copies of the same block allocated, due to timing issues. Is this
>> correct?
>
>
> That's right, and that's also unavoidable for the hash initializing stage.
>
>>
>> 3) generic_search() competes with __btrfs_free_extent(). Meaning that
>> generic_search() wants to add a delayed ref to an existing extent,
>> whereas __btrfs_free_extent() wants to delete an entry from the dedup
>> DB. The race is resolved as follows:
>> - generic_search attempts to lock the delayed ref head
>> - if it succeeds to lock, then __btrfs_free_extent() is not running
>> right now. So we can add a delayed ref. Later, when delayed ref head
>> will be run, it will figure out what needs to be done (free the extent
>> or not)
>> - if we fail to lock, then there is a delayed ref processing for this
>> bytenr. We drop all locks and redo the search from the top. If
>> __btrfs_free_extent() has deleted the dedup hash meanwhile, we will
>> not find it, and proceed with normal COW.
>> Is my understanding correct?
>
>
> Yes that's correct.
>
>>
>> I have also few nitpicks on the code, will reply to relevant patches.
>
>
> Feel free to comment.
>
> Thanks,
> Qu
>
>>
>> Thanks for doing this work,
>> Alex.
>>
>>
>>
>> On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo <quwen...@cn.fujitsu.com>
>> wrote:
>>>
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux.git wang_dedupe_20160322
>>>
>>> This updated version of inband de-duplication has the following features:
>>> 1) ONE unified dedup framework.
>>>    Most of its code is hidden quietly in dedup.c and export the minimal
>>>    interfaces for its caller.
>>>    Reviewer and further developer would benefit from the unified
>>>    framework.
>>>
>>> 2) TWO different back-end with different trade-off
>>>    One is the improved version of previous Fujitsu in-memory only dedup.
>>>    The other one is enhanced dedup implementation from Liu Bo.
>>>    Changed its tree structure to handle bytenr -> hash search for
>>>    deleting hash, without the hideous data backref hack.
>>>
>>> 3) Support compression with dedupe
>>>    Now dedupe can work with compression.
>>>    Means that, a dedupe miss case can be compressed, and dedupe hit case
>>>    can also reuse compressed file extents.
>>>
>>> 4) Ioctl interface with persist dedup status
>>>    Advised by David, now we use ioctl to enable/disable dedup.
>>>
>>>    And we now have dedup status, recorded in the first item of dedup
>>>    tree.
>>>    Just like quota, once
Re: [PATCH v8 00/27][For 4.7] Btrfs: Add inband (write time) de-duplication framework
Greetings Qu Wenruo,

I have reviewed the dedup patchset found in the github account you mentioned. I have several questions. Please note that by all means I am not criticizing your design or code. I just want to make sure that my understanding of the code is proper.

1) You mentioned in several emails that at some point byte-to-byte comparison is to be performed. However, I do not see this in the code. It seems that generic_search() only looks for the hash value match. If there is a match, it goes ahead and adds a delayed ref.

2) If btrfs_dedupe_search() does not find a match, we unlock the dedup mutex and proceed with the normal COW. What happens if there are several IO streams to different files writing an identical block, but we don't have such block in our dedup DB? Then all btrfs_dedupe_search() calls will not find a match, so all streams will allocate space for their block (which are all identical). At some point, they will call insert_reserved_file_extent() and will call btrfs_dedupe_add(). Since there is a global mutex, the first stream will insert the dedup hash entries into the DB, and all other streams will find that such hash entry already exists. So the end result is that we have the hash entry in the DB, but still we have multiple copies of the same block allocated, due to timing issues. Is this correct?

3) generic_search() competes with __btrfs_free_extent(). Meaning that generic_search() wants to add a delayed ref to an existing extent, whereas __btrfs_free_extent() wants to delete an entry from the dedup DB. The race is resolved as follows:
- generic_search attempts to lock the delayed ref head
- if it succeeds to lock, then __btrfs_free_extent() is not running right now. So we can add a delayed ref. Later, when delayed ref head will be run, it will figure out what needs to be done (free the extent or not)
- if we fail to lock, then there is a delayed ref processing for this bytenr. We drop all locks and redo the search from the top. If __btrfs_free_extent() has deleted the dedup hash meanwhile, we will not find it, and proceed with normal COW.
Is my understanding correct?

I have also few nitpicks on the code, will reply to relevant patches.

Thanks for doing this work,
Alex.

On Tue, Mar 22, 2016 at 3:35 AM, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/linux.git wang_dedupe_20160322
>
> This updated version of inband de-duplication has the following features:
> 1) ONE unified dedup framework.
>    Most of its code is hidden quietly in dedup.c and export the minimal
>    interfaces for its caller.
>    Reviewer and further developer would benefit from the unified
>    framework.
>
> 2) TWO different back-end with different trade-off
>    One is the improved version of previous Fujitsu in-memory only dedup.
>    The other one is enhanced dedup implementation from Liu Bo.
>    Changed its tree structure to handle bytenr -> hash search for
>    deleting hash, without the hideous data backref hack.
>
> 3) Support compression with dedupe
>    Now dedupe can work with compression.
>    Means that, a dedupe miss case can be compressed, and dedupe hit case
>    can also reuse compressed file extents.
>
> 4) Ioctl interface with persist dedup status
>    Advised by David, now we use ioctl to enable/disable dedup.
>
>    And we now have dedup status, recorded in the first item of dedup
>    tree.
>    Just like quota, once enabled, no extra ioctl is needed for next
>    mount.
>
> 5) Ability to disable dedup for given dirs/files
>    It works just like the compression prop method, by adding a new
>    xattr.
>
> TODO:
> 1) Add extent-by-extent comparison for faster but more conflicting algorithm
>    Current SHA256 hash is quite slow, and for some old(5 years ago) CPU,
>    CPU may even be a bottleneck other than IO.
>    But for faster hash, it will definitely cause conflicts, so we need
>    extent comparison before we introduce new dedup algorithm.
>
> 2) Misc end-user related helpers
>    Like handy and easy to implement dedup rate report.
>    And method to query in-memory hash size for those "non-exist" users who
>    want to use 'dedup enable -l' option but didn't ever know how much
>    RAM they have.
>
> Changelog:
> v2:
>   Totally reworked to handle multiple backends
> v3:
>   Fix a stupid but deadly on-disk backend bug
>   Add handle for multiple hash on same bytenr corner case to fix abort
>   trans error
>   Increase dedup rate by enhancing delayed ref handler for both backend.
>   Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
>   Increase dedup block size up limit to 8M.
> v4:
>   Add dedup prop for disabling dedup for given files/dirs.
>   Merge inmem_search() and ondisk_search() into generic_search() to save
>   some code
>   Fix another delayed_ref related bug.
>   Use the same mutex for both inmem and ondisk backend.
>   Move dedup_add() back to btrfs_finish_ordered_io() to
Re: [RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk
Hello Filipe, I have sent two patches addressing this issue. When testing, I discovered that log tree blocks can sometimes carry chunk tree UUID which is all zeros! Does this make sense? You can take a look at a small debug-tree output demonstrating such phenomenon at https://drive.google.com/file/d/0B9rmyUifdvMLbHBuSWU5dlVKNWc. Due to this I did not include the chunk tree UUID check. Hoping very much that fs UUID should always be valid for all tree blocks)) Thanks, Alex. On Mon, Feb 22, 2016 at 12:28 PM, Filipe Manana <fdman...@kernel.org> wrote: > On Mon, Feb 22, 2016 at 9:46 AM, Alex Lyakas <a...@zadarastorage.com> wrote: >> Thank you, Filipe, for your review. >> >> On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdman...@kernel.org> wrote: >>> On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <a...@zadarastorage.com> wrote: >>>> csum_dirty_buffer was issuing a warning in case the extent buffer >>>> did not look alright, but was still returning success. >>>> Let's return error in this case, and also add two additional sanity >>>> checks on the extent buffer header. >>>> >>>> We had btrfs metadata corruption, and after looking at the logs we saw >>>> that WARN_ON(found_start != start) has been triggered. We are still >>>> investigating >>> >>> There's a warning for WARN_ON(found_start != start || !PageUptodate(page)) >>> >>> Are you sure it triggered only because of found_start != start and not >>> because of !PageUptodate(page) (or both)? >> The problem initially happened on kernel 3.8.13. In this kernel, the >> code looks like this: >> found_start = btrfs_header_bytenr(eb); >> if (found_start != start) { >> WARN_ON(1); >> return 0; >> } >> if (!PageUptodate(page)) { >> WARN_ON(1); >> return 0; >> } >> (You can see it on >> http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420) >> The WARN_ON that we hit was on the found_start comparison. > > Ok, I see now that one of those useless cleanup patches merged both > conditions into a single if some time ago. 
> >> >>> >>>> which component trashed the cache page which belonged to btrfs. But btrfs >>>> only issued a warning, and as a result, the corrupted metadata block went >>>> to >>>> disk. >>>> >>>> I think we should return an error in such case that the extent buffer >>>> doesn't look alright. >>> >>> I think so too. >>> >>>> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio >>>> will, >>>> but it is better than to have a silent metadata corruption on disk. >>>> >>>> Note: this patch has been properly tested on 3.18 kernel only. >>>> >>>> Signed-off-by: Alex Lyakas <a...@zadarastorage.com> >>>> --- >>>> fs/btrfs/disk-io.c | 14 -- >>>> 1 file changed, 12 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >>>> index 4545e2e..701e706 100644 >>>> --- a/fs/btrfs/disk-io.c >>>> +++ b/fs/btrfs/disk-io.c >>>> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info >>>> *fs_info, struct page *page) >>>> { >>>> u64 start = page_offset(page); >>>> u64 found_start; >>>> struct extent_buffer *eb; >>>> >>>> eb = (struct extent_buffer *)page->private; >>>> if (page != eb->pages[0]) >>>> return 0; >>>> found_start = btrfs_header_bytenr(eb); >>>> if (WARN_ON(found_start != start || !PageUptodate(page))) >>>> -return 0; >>>> -csum_tree_block(fs_info, eb, 0); >>>> +return -EUCLEAN; >>>> +#ifdef CONFIG_BTRFS_ASSERT >>> >>> A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do >>> assertions. >>> I would remove this #ifdef ... #endif or do the memcmp calls inside >>> ASSERT(). >> Agreed. >> >>> >>>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >>>> +(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE))) >>>> +return -EUCLEAN; >>>> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >>>> +
[PATCH 1/2] btrfs: csum_tree_block: return proper errno value
Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
---
 fs/btrfs/disk-io.c | 13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e..4420ab2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -296,52 +296,52 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
 	unsigned long map_len;
 	int err;
 	u32 crc = ~(u32)0;
 	unsigned long inline_result;
 
 	len = buf->len - offset;
 	while (len > 0) {
 		err = map_private_extent_buffer(buf, offset, 32,
 					&kaddr, &map_start, &map_len);
 		if (err)
-			return 1;
+			return err;
 		cur_len = min(len, map_len - (offset - map_start));
 		crc = btrfs_csum_data(kaddr + offset - map_start,
 				      crc, cur_len);
 		len -= cur_len;
 		offset += cur_len;
 	}
 	if (csum_size > sizeof(inline_result)) {
 		result = kzalloc(csum_size, GFP_NOFS);
 		if (!result)
-			return 1;
+			return -ENOMEM;
 	} else {
 		result = (char *)&inline_result;
 	}
 
 	btrfs_csum_final(crc, result);
 
 	if (verify) {
 		if (memcmp_extent_buffer(buf, result, 0, csum_size)) {
 			u32 val;
 			u32 found = 0;
 			memcpy(&found, result, csum_size);
 
 			read_extent_buffer(buf, &val, 0, csum_size);
 			btrfs_warn_rl(fs_info,
 				"%s checksum verify failed on %llu wanted %X found %X "
 				"level %d",
 				fs_info->sb->s_id, buf->start,
 				val, found, btrfs_header_level(buf));
 			if (result != (char *)&inline_result)
 				kfree(result);
-			return 1;
+			return -EUCLEAN;
 		}
 	} else {
 		write_extent_buffer(buf, result, 0, csum_size);
 	}
 	if (result != (char *)&inline_result)
 		kfree(result);
 	return 0;
 }
 
 /*
@@ -509,22 +509,21 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 	u64 start = page_offset(page);
 	u64 found_start;
 	struct extent_buffer *eb;
 
 	eb = (struct extent_buffer *)page->private;
 	if (page != eb->pages[0])
 		return 0;
 	found_start = btrfs_header_bytenr(eb);
 	if (WARN_ON(found_start != start || !PageUptodate(page)))
 		return 0;
-	csum_tree_block(fs_info, eb, 0);
-	return 0;
+	return csum_tree_block(fs_info, eb, 0);
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 				 struct extent_buffer *eb)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	u8 fsid[BTRFS_UUID_SIZE];
 	int ret = 1;
 
 	read_extent_buffer(eb, fsid, btrfs_header_fsid(), BTRFS_FSID_SIZE);
@@ -653,24 +652,22 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 		btrfs_err(root->fs_info, "bad tree block level %d",
 			   (int)btrfs_header_level(eb));
 		ret = -EIO;
 		goto err;
 	}
 
 	btrfs_set_buffer_lockdep_class(btrfs_header_owner(eb),
 				       eb, found_level);
 
 	ret = csum_tree_block(root->fs_info, eb, 1);
-	if (ret) {
-		ret = -EIO;
+	if (ret)
 		goto err;
-	}
 
 	/*
 	 * If this is a leaf block and it is corrupt, set the corrupt bit so
 	 * that we don't try and read the other copies of this block, just
 	 * return -EIO.
 	 */
 	if (found_level == 0 && check_leaf(root, eb)) {
 		set_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
 		ret = -EIO;
 	}
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
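The convention this patch moves to (return 0 or a negative errno, and let callers pass it through) can be sketched in userspace. This is only an illustration under stated assumptions, not the btrfs code: toy_sum(), toy_csum_block() and toy_read_block() are hypothetical names, the additive checksum stands in for the real CRC, and EUCLEAN gets a fallback definition for platforms that lack it.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; fallback for other platforms */
#endif

/* Toy checksum (sum of bytes); a stand-in for the real CRC. */
static uint32_t toy_sum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum += data[i];
	return sum;
}

/*
 * Hypothetical analogue of csum_tree_block(): with verify == 0 it stores
 * the checksum, with verify != 0 it compares against the stored one.
 * Returns 0 or a negative errno, never a bare 1.
 */
static int toy_csum_block(const uint8_t *data, size_t len,
			  uint32_t *stored, int verify)
{
	uint32_t sum;

	if (!data || !stored)
		return -EINVAL;
	sum = toy_sum(data, len);
	if (verify) {
		if (*stored != sum)
			return -EUCLEAN;
	} else {
		*stored = sum;
	}
	return 0;
}

/* The caller returns the error unchanged instead of rewriting it to -EIO. */
static int toy_read_block(const uint8_t *data, size_t len, uint32_t *stored)
{
	return toy_csum_block(data, len, stored, 1);
}
```

With this shape, a caller can tell a checksum mismatch (-EUCLEAN) from a bad argument (-EINVAL) without any extra out-parameter.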
[PATCH 2/2] btrfs: do not write corrupted metadata blocks to disk
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add an additional sanity
check on the extent buffer header.
The caller up the chain may BUG_ON on this, for example
flush_epd_write_bio will,
but it is better than to have a silent metadata corruption on disk.

Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
---
 fs/btrfs/disk-io.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4420ab2..cf85714 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -506,23 +506,34 @@ static int btree_read_extent_buffer_pages(struct btrfs_root *root,
 
 static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
 	u64 start = page_offset(page);
 	u64 found_start;
 	struct extent_buffer *eb;
 
 	eb = (struct extent_buffer *)page->private;
 	if (page != eb->pages[0])
 		return 0;
+
 	found_start = btrfs_header_bytenr(eb);
-	if (WARN_ON(found_start != start || !PageUptodate(page)))
-		return 0;
+	/*
+	 * Please do not consolidate these warnings into a single if.
+	 * It is useful to know what went wrong.
+	 */
+	if (WARN_ON(found_start != start))
+		return -EUCLEAN;
+	if (WARN_ON(!PageUptodate(page)))
+		return -EUCLEAN;
+
+	ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+			btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
+
 	return csum_tree_block(fs_info, eb, 0);
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 				 struct extent_buffer *eb)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	u8 fsid[BTRFS_UUID_SIZE];
 	int ret = 1;
 
-- 
1.9.1
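The reason for keeping the checks separate is that each failure can then be told apart. A minimal userspace sketch of that idea follows; struct toy_header, enum toy_reason and toy_check_dirty() are hypothetical names introduced for illustration only. Note one deliberate simplification: the patch above uses ASSERT() for the fsid comparison, whereas this sketch returns an error for it.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; fallback for other platforms */
#endif

#define TOY_FSID_SIZE 16

/* Hypothetical, much reduced stand-in for a tree block header. */
struct toy_header {
	uint64_t bytenr;
	uint8_t fsid[TOY_FSID_SIZE];
};

/* Which check failed; lets the caller log a precise reason. */
enum toy_reason {
	TOY_OK = 0,
	TOY_BAD_START,
	TOY_NOT_UPTODATE,
	TOY_BAD_FSID,
};

/*
 * Run each sanity check on its own so the first failing one is reported.
 * Returns 0 if the header looks sane, -EUCLEAN otherwise.
 */
static int toy_check_dirty(const struct toy_header *hdr, uint64_t start,
			   int uptodate, const uint8_t *fs_fsid,
			   enum toy_reason *why)
{
	*why = TOY_OK;
	if (hdr->bytenr != start) {
		*why = TOY_BAD_START;
		return -EUCLEAN;
	}
	if (!uptodate) {
		*why = TOY_NOT_UPTODATE;
		return -EUCLEAN;
	}
	if (memcmp(hdr->fsid, fs_fsid, TOY_FSID_SIZE) != 0) {
		*why = TOY_BAD_FSID;
		return -EUCLEAN;
	}
	return 0;
}
```

A combined condition would return the same error code but lose the `why` value, which is the information the comment in the patch asks to preserve.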
Re: [RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk
Thank you, Filipe, for your review. On Mon, Feb 22, 2016 at 3:05 AM, Filipe Manana <fdman...@kernel.org> wrote: > On Sun, Feb 21, 2016 at 3:36 PM, Alex Lyakas <a...@zadarastorage.com> wrote: >> csum_dirty_buffer was issuing a warning in case the extent buffer >> did not look alright, but was still returning success. >> Let's return error in this case, and also add two additional sanity >> checks on the extent buffer header. >> >> We had btrfs metadata corruption, and after looking at the logs we saw >> that WARN_ON(found_start != start) has been triggered. We are still >> investigating > > There's a warning for WARN_ON(found_start != start || !PageUptodate(page)) > > Are you sure it triggered only because of found_start != start and not > because of !PageUptodate(page) (or both)? The problem initially happened on kernel 3.8.13. In this kernel, the code looks like this: found_start = btrfs_header_bytenr(eb); if (found_start != start) { WARN_ON(1); return 0; } if (!PageUptodate(page)) { WARN_ON(1); return 0; } (You can see it on http://lxr.free-electrons.com/source/fs/btrfs/disk-io.c?v=3.8#L420) The WARN_ON that we hit was on the found_start comparison. > >> which component trashed the cache page which belonged to btrfs. But btrfs >> only issued a warning, and as a result, the corrupted metadata block went to >> disk. >> >> I think we should return an error in such case that the extent buffer >> doesn't look alright. > > I think so too. > >> The caller up the chain may BUG_ON on this, for example flush_epd_write_bio >> will, >> but it is better than to have a silent metadata corruption on disk. >> >> Note: this patch has been properly tested on 3.18 kernel only. 
>> >> Signed-off-by: Alex Lyakas <a...@zadarastorage.com> >> --- >> fs/btrfs/disk-io.c | 14 -- >> 1 file changed, 12 insertions(+), 2 deletions(-) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 4545e2e..701e706 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info >> *fs_info, struct page *page) >> { >> u64 start = page_offset(page); >> u64 found_start; >> struct extent_buffer *eb; >> >> eb = (struct extent_buffer *)page->private; >> if (page != eb->pages[0]) >> return 0; >> found_start = btrfs_header_bytenr(eb); >> if (WARN_ON(found_start != start || !PageUptodate(page))) >> -return 0; >> -csum_tree_block(fs_info, eb, 0); >> +return -EUCLEAN; >> +#ifdef CONFIG_BTRFS_ASSERT > > A bit odd to surround these with CONFIG_BTRFS_ASSERT if we don't do > assertions. > I would remove this #ifdef ... #endif or do the memcmp calls inside ASSERT(). Agreed. > >> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >> +(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE))) >> +return -EUCLEAN; >> +if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid, >> +(unsigned long)btrfs_header_chunk_tree_uuid(eb), >> +BTRFS_FSID_SIZE))) > > This second comparison doesn't seem correct. Second argument to > memcmp_extent_buffer should be fs_info->chunk_tree_uuid, which > shouldn't be the same as the fsid (take a look at utils.c:make_btrfs() > in the tools, both uuids are generated by different calls to > uuid_generate()) - did you make your tests only before adding this > comparison?. Also you should use BTRFS_UUID_SIZE instead of > BTRFS_FSID_SIZE (even if both have the same value). Obviously, you are right. 
In the 3.18-based code that I fixed locally here, the fix looks like this: if (found_start != start) { ZBTRFS_WARN(1, "FS[%s]: header_bytenr(eb)(%llu) != page->index<<PAGE_CACHE_SHIFT(%llu)", root->fs_info->sb->s_id, found_start, start); return -EUCLEAN; } if (!PageUptodate(page)) { ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu page->index(%llu) !PageUptodate", root->fs_info->sb->s_id, start, (u64)page->index); return -EUCLEAN; } if (memcmp_extent_buffer(eb, root->fs_info->fsid, (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)) { u8 hdr_fsid[BTRFS_FSID_SIZE] = {0}; read_extent_buffer(eb, hdr_fsid, (unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE); ZBTRFS_WARN(1, "FS[%s]: eb bytenr=%llu header->fsid["PRIX128"] != fs_info->fsid["PRIX128"]", root
[RFC - PATCH] btrfs: do not write corrupted metadata blocks to disk
csum_dirty_buffer was issuing a warning in case the extent buffer
did not look alright, but was still returning success.
Let's return error in this case, and also add two additional sanity
checks on the extent buffer header.

We had btrfs metadata corruption, and after looking at the logs we saw
that WARN_ON(found_start != start) has been triggered. We are still
investigating which component trashed the cache page which belonged to
btrfs. But btrfs only issued a warning, and as a result, the corrupted
metadata block went to disk.

I think we should return an error in such case that the extent buffer
doesn't look alright.
The caller up the chain may BUG_ON on this, for example
flush_epd_write_bio will,
but it is better than to have a silent metadata corruption on disk.

Note: this patch has been properly tested on 3.18 kernel only.

Signed-off-by: Alex Lyakas <a...@zadarastorage.com>
---
 fs/btrfs/disk-io.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4545e2e..701e706 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -508,22 +508,32 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
 {
 	u64 start = page_offset(page);
 	u64 found_start;
 	struct extent_buffer *eb;
 
 	eb = (struct extent_buffer *)page->private;
 	if (page != eb->pages[0])
 		return 0;
 	found_start = btrfs_header_bytenr(eb);
 	if (WARN_ON(found_start != start || !PageUptodate(page)))
-		return 0;
-	csum_tree_block(fs_info, eb, 0);
+		return -EUCLEAN;
+#ifdef CONFIG_BTRFS_ASSERT
+	if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+			(unsigned long)btrfs_header_fsid(), BTRFS_FSID_SIZE)))
+		return -EUCLEAN;
+	if (WARN_ON(memcmp_extent_buffer(eb, fs_info->fsid,
+			(unsigned long)btrfs_header_chunk_tree_uuid(eb),
+			BTRFS_FSID_SIZE)))
+		return -EUCLEAN;
+#endif
+	if (csum_tree_block(fs_info, eb, 0))
+		return -EUCLEAN;
 	return 0;
 }
 
 static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 				 struct extent_buffer *eb)
 {
 	struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
 	u8 fsid[BTRFS_UUID_SIZE];
 	int ret = 1;
 
-- 
1.9.1
Re: [PATCH V2] Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state
[Resending in plain text, apologies.]

Hi Chandan, Josef, Chris,

I am not sure I understand the fix to the problem.

It may happen that when updating the device tree, we need to allocate a
new chunk via do_chunk_alloc() (while we are holding the device tree root
node locked). This is a legitimate thing for find_free_extent() to do. And
the do_chunk_alloc() call may lead to a call to
btrfs_create_pending_block_groups(), which will try to update the device
tree. This may happen due to the direct call to
btrfs_create_pending_block_groups() that exists in do_chunk_alloc(), or
perhaps via the __btrfs_end_transaction() that find_free_extent() does
after it has completed chunk allocation (although in this case it will use
the transaction that already exists in current->journal_info).
So the deadlock may still happen?

Thanks,
Alex.

>
>
> On Mon, Nov 2, 2015 at 6:52 PM, Chris Mason wrote:
>>
>> On Mon, Nov 02, 2015 at 01:59:46PM +0530, Chandan Rajendra wrote:
>> > When executing generic/001 in a loop on a ppc64 machine (with both
>> > sectorsize
>> > and nodesize set to 64k), the following call trace is observed,
>>
>> Thanks Chandan, I hit this same trace on x86-64 with 16K nodes.
>>
>> -chris
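The concern above is about updating a tree from a code path that already holds a lock on that same tree. One general way to avoid that class of problem is to queue the update and apply it only after the lock is dropped. The sketch below is a userspace toy of that deferral idea and makes no claim about how btrfs implements it: struct toy_trans, toy_alloc_chunk() and toy_flush_pending() are hypothetical names, and the "lock" is a plain flag rather than a real mutex.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_PENDING 8

/* Hypothetical transaction handle carrying deferred tree updates. */
struct toy_trans {
	int tree_locked;		/* 1 while the caller holds the tree lock */
	int pending[TOY_MAX_PENDING];	/* chunk ids waiting to be inserted */
	size_t nr_pending;
	size_t nr_in_tree;		/* updates actually applied to the tree */
};

/*
 * Allocation only records the new chunk on the handle. It never touches
 * the tree, so it is safe to call with the tree lock held.
 */
static int toy_alloc_chunk(struct toy_trans *t, int id)
{
	if (t->nr_pending == TOY_MAX_PENDING)
		return -ENOSPC;
	t->pending[t->nr_pending++] = id;
	return 0;
}

/*
 * Applying the queued updates needs the tree lock. If the caller still
 * holds it, a real implementation would deadlock; the toy reports -EBUSY.
 */
static int toy_flush_pending(struct toy_trans *t)
{
	if (t->tree_locked)
		return -EBUSY;
	t->nr_in_tree += t->nr_pending;
	t->nr_pending = 0;
	return 0;
}
```

In this model the question raised in the email corresponds to asking whether any path can reach toy_flush_pending() while tree_locked is still set.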
Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
Hi Filipe Manana, My understanding of selecting delayed refs to run or merging them is far from complete. Can you please explain what will happen in the following scenario: 1) Ref1 is created, as you explain 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end up with an EXTENT_ITEM and an inline extent back ref 3) Ref2 and Ref3 are added 4) Somebody calls __btrfs_run_delayed_refs() At this point, we cannot merge Ref2 and Ref3, because they might be referencing tree blocks of completely different trees, thus comp_tree_refs() will return 1 or -1. But we will select Ref3 to be run, because we prefer BTRFS_ADD_DELAYED_REF over BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON now, because we already have Ref1 in the extent tree. So something should prevent us from running Ref3 before running Ref2. We should run Ref2 first, which should get rid of the EXTENT_ITEM and the inline backref, and then run Ref3 to create a new backref with a proper owner. What is that something? Can you please point me at what am I missing? Also, can such scenario happen in 3.18 kernel, which still has an rbtree per ref-head? Looking at the code, I don't see anything preventing that from happening. Thanks, Alex. On Sun, Oct 25, 2015 at 8:51 PM,wrote: > From: Filipe Manana > > In the kernel 4.2 merge window we had a refactoring/rework of the delayed > references implementation in order to fix certain problems with qgroups. > However that rework introduced one more regression that leads to the > following trace when running delayed references for metadata: > > [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832! 
> [35908.065201] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC > [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor > raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc > loop fuse parport_pc psmouse i2 > [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW > 4.3.0-rc5-btrfs-next-17+ #1 > [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 > [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] > [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: > 88010c4c8000 > [35908.065201] RIP: 0010:[] [] > insert_inline_extent_backref+0x52/0xb1 [btrfs] > [35908.065201] RSP: 0018:88010c4cbb08 EFLAGS: 00010293 > [35908.065201] RAX: RBX: 88008a661000 RCX: > > [35908.065201] RDX: a04dd58f RSI: 0001 RDI: > > [35908.065201] RBP: 88010c4cbb40 R08: 1000 R09: > 88010c4cb9f8 > [35908.065201] R10: R11: 002c R12: > > [35908.065201] R13: 88020a74c578 R14: R15: > > [35908.065201] FS: () GS:88023edc() > knlGS: > [35908.065201] CS: 0010 DS: ES: CR0: 8005003b > [35908.065201] CR2: 015e8708 CR3: 000102185000 CR4: > 06e0 > [35908.065201] Stack: > [35908.065201] 88010c4cbb18 0f37 88020a74c578 > 88015a408000 > [35908.065201] 880154a44000 0005 > 88010c4cbbd8 > [35908.065201] a0492b9a 0005 > > [35908.065201] Call Trace: > [35908.065201] [] __btrfs_inc_extent_ref+0x8b/0x208 [btrfs] > [35908.065201] [] ? __btrfs_run_delayed_refs+0x4d4/0xd33 > [btrfs] > [35908.065201] [] __btrfs_run_delayed_refs+0xafa/0xd33 > [btrfs] > [35908.065201] [] ? join_transaction.isra.10+0x25/0x41f > [btrfs] > [35908.065201] [] ? 
join_transaction.isra.10+0xa8/0x41f > [btrfs] > [35908.065201] [] btrfs_run_delayed_refs+0x75/0x1dd [btrfs] > [35908.065201] [] delayed_ref_async_start+0x3c/0x7b [btrfs] > [35908.065201] [] normal_work_helper+0x14c/0x32a [btrfs] > [35908.065201] [] btrfs_extent_refs_helper+0x12/0x14 > [btrfs] > [35908.065201] [] process_one_work+0x24a/0x4ac > [35908.065201] [] worker_thread+0x206/0x2c2 > [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb > [35908.065201] [] ? rescuer_thread+0x2cb/0x2cb > [35908.065201] [] kthread+0xef/0xf7 > [35908.065201] [] ? kthread_parkme+0x24/0x24 > [35908.065201] [] ret_from_fork+0x3f/0x70 > [35908.065201] [] ? kthread_parkme+0x24/0x24 > [35908.065201] Code: 6a 01 41 56 41 54 ff 75 10 41 51 4d 89 c1 49 89 c8 48 8d > 4d d0 e8 f6 f1 ff ff 48 83 c4 28 85 c0 75 2c 49 81 fc ff 00 00 00 77 02 <0f> > 0b 4c 8b 45 30 8b 4d 28 45 31 > [35908.065201] RIP [] > insert_inline_extent_backref+0x52/0xb1 [btrfs] > [35908.065201] RSP > [35908.310885] ---[ end trace fe4299baf0666457 ]--- > > This happens because the new delayed references code no longer merges > delayed references that have different sequence values.
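The scenario in this message depends on two rules: which delayed refs are allowed to cancel each other, and which ref is picked to run first. A toy userspace model of both is sketched below. It is an assumption-laden simplification, not the kernel logic: struct toy_ref, toy_merge() and toy_select() are hypothetical, and the real btrfs_merge_delayed_refs() considers more than root and sequence number.

```c
#include <stddef.h>
#include <stdint.h>

enum toy_action { TOY_ADD = 1, TOY_DROP = -1 };

/* Hypothetical, reduced delayed ref: action, owning root, sequence. */
struct toy_ref {
	enum toy_action action;
	uint64_t root;
	uint64_t seq;
	int live;	/* 0 once the ref has been merged away */
};

/*
 * Cancel ADD/DROP pairs that target the same root. When require_same_seq
 * is non-zero, only pairs with equal sequence values may cancel, which
 * models the "no longer merges refs with different seq" behaviour.
 * Returns the number of refs still live.
 */
static size_t toy_merge(struct toy_ref *refs, size_t n, int require_same_seq)
{
	size_t i, j, live = 0;

	for (i = 0; i < n; i++) {
		if (!refs[i].live)
			continue;
		for (j = i + 1; j < n; j++) {
			if (!refs[j].live || refs[j].root != refs[i].root)
				continue;
			if (refs[j].action == refs[i].action)
				continue;
			if (require_same_seq && refs[j].seq != refs[i].seq)
				continue;
			refs[i].live = 0;
			refs[j].live = 0;
			break;
		}
	}
	for (i = 0; i < n; i++)
		live += refs[i].live ? 1 : 0;
	return live;
}

/* Pick the next ref to run: the first live ADD, else the first live DROP. */
static struct toy_ref *toy_select(struct toy_ref *refs, size_t n)
{
	struct toy_ref *drop = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!refs[i].live)
			continue;
		if (refs[i].action == TOY_ADD)
			return &refs[i];
		if (!drop)
			drop = &refs[i];
	}
	return drop;
}
```

With a DROP (Ref2) and an ADD (Ref3) for the same root but different sequence values, the strict rule leaves both live and the ADD is selected first; without the sequence restriction the pair cancels and nothing is left to run.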
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
Hi Filipe Manana, Can't the call to btrfs_create_pending_block_groups() cause a deadlock, like in http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this call updates the device tree, and we may be calling do_chunk_alloc() from find_free_extent() when holding a lock on the device tree root (because we want to COW a block of the device tree). My understanding from Josef's chunk allocator rework (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now when allocating a new chunk we do not immediately update the device/chunk tree. We keep the new chunk in "pending_chunks" and in "new_bgs" on a transaction handle, and we actually update the chunk/device tree only when we are done with a particular transaction handle. This way we avoid that sort of deadlocks. But this patch breaks this rule, as it may make us update the device/chunk tree in the context of chunk allocation, which is the scenario that the rework was meant to avoid. Can you please point me at what I am missing? Thanks, Alex. On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandovalwrote: > On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote: >> From: Filipe Manana >> >> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when >> finishing block group creation"), introduced in 4.2-rc1, the following >> test was failing due to exhaustion of the system array in the superblock: >> >> #!/bin/bash >> >> truncate -s 100T big.img >> mkfs.btrfs big.img >> mount -o loop big.img /mnt/loop >> >> num=5 >> sz=10T >> for ((i = 0; i < $num; i++)); do >> echo fallocate $i $sz >> fallocate -l $sz /mnt/loop/testfile$i >> done >> btrfs filesystem sync /mnt/loop >> >> for ((i = 0; i < $num; i++)); do >> echo rm $i >> rm /mnt/loop/testfile$i >> btrfs filesystem sync /mnt/loop >> done >> umount /mnt/loop >> >> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive >> allocation of system block groups. 
This happened because the test creates >> a large number of data block groups per transaction and when committing >> the transaction we start the writeout of the block group caches for all >> the new new (dirty) block groups, which results in pre-allocating space >> for each block group's free space cache using the same transaction handle. >> That in turn often leads to creation of more block groups, and all get >> attached to the new_bgs list of the same transaction handle to the point >> of getting a list with over 1500 elements, and creation of new block groups >> leads to the need of reserving space in the chunk block reserve and often >> creating a new system block group too. >> >> So that made us quickly exhaust the chunk block reserve/system space info, >> because as of the commit mentioned before, we do reserve space for each >> new block group in the chunk block reserve, unlike before where we would >> not and would at most allocate one new system block group and therefore >> would only ensure that there was enough space in the system space info to >> allocate 1 new block group even if we ended up allocating thousands of >> new block groups using the same transaction handle. That worked most of >> the time because the computed required space at check_system_chunk() is >> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and >> that all nodes/leafs in a path will be COWed and split) and since the >> updates to the chunk tree all happen at btrfs_create_pending_block_groups >> it is unlikely that a path needs to be COWed more than once (unless >> writepages() for the btree inode is called by mm in between) and that >> compensated for the need of creating any new nodes/leads in the chunk >> tree. 
>> >> So fix this by ensuring we don't accumulate a too large list of new block >> groups in a transaction's handles new_bgs list, inserting/updating the >> chunk tree for all accumulated new block groups and releasing the unused >> space from the chunk block reserve whenever the list becomes sufficiently >> large. This is a generic solution even though the problem currently can >> only happen when starting the writeout of the free space caches for all >> dirty block groups (btrfs_start_dirty_block_groups()). >> >> Reported-by: Omar Sandoval >> Signed-off-by: Filipe Manana > > Thanks a lot for taking a look. > > Tested-by: Omar Sandoval > >> --- >> fs/btrfs/extent-tree.c | 18 ++ >> 1 file changed, 18 insertions(+) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 171312d..07204bf 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -4227,6 +4227,24 @@ out: >> space_info->chunk_alloc = 0; >> spin_unlock(_info->lock); >> mutex_unlock(_info->chunk_mutex); >> + /* >> + * When we allocate a new chunk we reserve space in the chunk block >> + * reserve
Re: [PATCH 1/2 v3] Btrfs: fix regression when running delayed references
Hi Filipe, Thank you for the explanation. On Sun, Dec 13, 2015 at 5:43 PM, Filipe Manana <fdman...@kernel.org> wrote: > On Sun, Dec 13, 2015 at 10:51 AM, Alex Lyakas <a...@zadarastorage.com> wrote: >> Hi Filipe Manana, >> >> My understanding of selecting delayed refs to run or merging them is >> far from complete. Can you please explain what will happen in the >> following scenario: >> >> 1) Ref1 is created, as you explain >> 2) Somebody calls __btrfs_run_delayed_refs() and runs Ref1, and we end >> up with an EXTENT_ITEM and an inline extent back ref >> 3) Ref2 and Ref3 are added >> 4) Somebody calls __btrfs_run_delayed_refs() >> >> At this point, we cannot merge Ref2 and Ref3, because they might be >> referencing tree blocks of completely different trees, thus >> comp_tree_refs() will return 1 or -1. But we will select Ref3 to be >> run, because we prefer BTRFS_ADD_DELAYED_REF over >> BTRFS_DROP_DELAYED_REF, as you explained. So we hit the same BUG_ON >> now, because we already have Ref1 in the extent tree. > > No, that won't happen. If the ref (Ref3) is for a different tree, than > it has a different inline extent from Ref1 > (lookup_inline_extent_backref returns -ENOENT and not 0). Understood. So in this case, we will first add inline ref for Ref3, and later drop the Ref1 inline ref via update_inline_extent_backref() by truncating the EXTENT_ITEM. All in the same transaction. > > If they are all for the same tree it means Ref3 is not merged with > Ref2 because they have different seq numbers and a seq value exist in > fs_info->tree_mod_seq_list, and we skip Ref3 through > btrfs_check_delayed_seq() until such seq number goes away from > tree_mod_seq_list. Ok, so we won't process this ref-head at all, until the "seq problem" disappears. > If no seq number exists in tree_mod_seq_list then > we merge it (Ref3) through btrfs_merge_delayed_refs(), called when > running delayed refs, with Ref2 (which removes both refs since one is > "-1" and the other "+1"). 
So in this case we don't care that the inline ref we have in the EXTENT_ITEM was actually inserted on behalf of Ref1. Because it's for the same EXTENT_ITEM and for the same root. So Ref3 and Ref1 are fully equivalent. Interesting. Thanks! Alex. > > Iow, after this regression fix, no behaviour changed from releases before 4.2. > >> >> So something should prevent us from running Ref3 before running Ref2. >> We should run Ref2 first, which should get rid of the EXTENT_ITEM and >> the inline backref, and then run Ref3 to create a new backref with a >> proper owner. What is that something? >> >> Can you please point me at what am I missing? >> >> Also, can such scenario happen in 3.18 kernel, which still has an >> rbtree per ref-head? Looking at the code, I don't see anything >> preventing that from happening. >> >> Thanks, >> Alex. >> >> >> On Sun, Oct 25, 2015 at 8:51 PM, <fdman...@kernel.org> wrote: >>> From: Filipe Manana <fdman...@suse.com> >>> >>> In the kernel 4.2 merge window we had a refactoring/rework of the delayed >>> references implementation in order to fix certain problems with qgroups. >>> However that rework introduced one more regression that leads to the >>> following trace when running delayed references for metadata: >>> >>> [35908.064664] kernel BUG at fs/btrfs/extent-tree.c:1832! 
>>> [35908.065201] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC >>> [35908.065201] Modules linked in: dm_flakey dm_mod btrfs crc32c_generic xor >>> raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache >>> sunrpc loop fuse parport_pc psmouse i2 >>> [35908.065201] CPU: 14 PID: 15014 Comm: kworker/u32:9 Tainted: GW >>> 4.3.0-rc5-btrfs-next-17+ #1 >>> [35908.065201] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >>> rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 >>> [35908.065201] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper [btrfs] >>> [35908.065201] task: 880114b7d780 ti: 88010c4c8000 task.ti: >>> 88010c4c8000 >>> [35908.065201] RIP: 0010:[] [] >>> insert_inline_extent_backref+0x52/0xb1 [btrfs] >>> [35908.065201] RSP: 0018:88010c4cbb08 EFLAGS: 00010293 >>> [35908.065201] RAX: RBX: 88008a661000 RCX: >>> >>> [35908.065201] RDX: a04dd58f RSI: 0001 RDI: >>> >>> [35908.065201] RBP: 88010c4cbb40 R08: 1000
Re: [PATCH] Btrfs: fix quick exhaustion of the system array in the superblock
Thank you, Filipe. Now it is clearer.

Fortunately, in my kernel 3.18 I do not have do_chunk_alloc() calling
btrfs_create_pending_block_groups(), so I cannot hit this deadlock. But I
can hit the issue that this call is meant to fix.

Thanks,
Alex.

On Sun, Dec 13, 2015 at 5:45 PM, Filipe Manana <fdman...@kernel.org> wrote:
> On Sun, Dec 13, 2015 at 10:29 AM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> Hi Filipe Manana,
>>
>> Can't the call to btrfs_create_pending_block_groups() cause a
>> deadlock, like in
>> http://www.spinics.net/lists/linux-btrfs/msg48744.html? Because this
>> call updates the device tree, and we may be calling do_chunk_alloc()
>> from find_free_extent() when holding a lock on the device tree root
>> (because we want to COW a block of the device tree).
>>
>> My understanding from Josef's chunk allocator rework
>> (http://www.spinics.net/lists/linux-btrfs/msg25722.html) was that now
>> when allocating a new chunk we do not immediately update the
>> device/chunk tree. We keep the new chunk in "pending_chunks" and in
>> "new_bgs" on a transaction handle, and we actually update the
>> chunk/device tree only when we are done with a particular transaction
>> handle. This way we avoid that sort of deadlocks.
>>
>> But this patch breaks this rule, as it may make us update the
>> device/chunk tree in the context of chunk allocation, which is the
>> scenario that the rework was meant to avoid.
>>
>> Can you please point me at what I am missing?
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=d9a0540a79f87456907f2ce031f058cf745c5bff
>
>>
>> Thanks,
>> Alex.
>> >> >> On Wed, Jul 22, 2015 at 1:53 AM, Omar Sandoval <osan...@fb.com> wrote: >>> On Mon, Jul 20, 2015 at 02:56:20PM +0100, fdman...@kernel.org wrote: >>>> From: Filipe Manana <fdman...@suse.com> >>>> >>>> Omar reported that after commit 4fbcdf669454 ("Btrfs: fix -ENOSPC when >>>> finishing block group creation"), introduced in 4.2-rc1, the following >>>> test was failing due to exhaustion of the system array in the superblock: >>>> >>>> #!/bin/bash >>>> >>>> truncate -s 100T big.img >>>> mkfs.btrfs big.img >>>> mount -o loop big.img /mnt/loop >>>> >>>> num=5 >>>> sz=10T >>>> for ((i = 0; i < $num; i++)); do >>>> echo fallocate $i $sz >>>> fallocate -l $sz /mnt/loop/testfile$i >>>> done >>>> btrfs filesystem sync /mnt/loop >>>> >>>> for ((i = 0; i < $num; i++)); do >>>> echo rm $i >>>> rm /mnt/loop/testfile$i >>>> btrfs filesystem sync /mnt/loop >>>> done >>>> umount /mnt/loop >>>> >>>> This made btrfs_add_system_chunk() fail with -EFBIG due to excessive >>>> allocation of system block groups. This happened because the test creates >>>> a large number of data block groups per transaction and when committing >>>> the transaction we start the writeout of the block group caches for all >>>> the new new (dirty) block groups, which results in pre-allocating space >>>> for each block group's free space cache using the same transaction handle. >>>> That in turn often leads to creation of more block groups, and all get >>>> attached to the new_bgs list of the same transaction handle to the point >>>> of getting a list with over 1500 elements, and creation of new block groups >>>> leads to the need of reserving space in the chunk block reserve and often >>>> creating a new system block group too. 
>>>> >>>> So that made us quickly exhaust the chunk block reserve/system space info, >>>> because as of the commit mentioned before, we do reserve space for each >>>> new block group in the chunk block reserve, unlike before where we would >>>> not and would at most allocate one new system block group and therefore >>>> would only ensure that there was enough space in the system space info to >>>> allocate 1 new block group even if we ended up allocating thousands of >>>> new block groups using the same transaction handle. That worked most of >>>> the time because the computed required space at check_system_chunk() is >>>> very pessimistic (assumes a chunk tree height of BTRFS_MAX_LEVEL/8 and >>>> that all nodes/leafs in a path will be C
Re: [RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly
Hi Liu, I was studying on how block reservation works, and making some modifications in reserve_metadata_bytes to understand better what it does. Then suddenly I saw this problem. I guess it depends on which value of "flush" parameter is passed to reserve_metadata_bytes. Alex. On Thu, Dec 3, 2015 at 8:14 PM, Liu Bo <bo.li@oracle.com> wrote: > On Thu, Dec 03, 2015 at 06:51:03PM +0200, Alex Lyakas wrote: >> do_chunk_alloc returns 1 when it succeeds to allocate a new chunk. >> But flush_space will not convert this to 0, and will also return 1. >> As a result, reserve_metadata_bytes will think that flush_space failed, >> and may potentially return this value "1" to the caller (depends how >> reserve_metadata_bytes was called). The caller will also treat this as an >> error. >> For example, btrfs_block_rsv_refill does: >> >> int ret = -ENOSPC; >> ... >> ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); >> if (!ret) { >> block_rsv_add_bytes(block_rsv, num_bytes, 0); >> return 0; >> } >> >> return ret; >> >> So it will return -ENOSPC. > > It will return 1 instead of -ENOSPC. > > The patch looks good, I noticed this before, but I didn't manage to trigger a > error for this, did you catch a error like that? 
> > Thanks, > > -liubo > >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index 4b89680..1ba3f0d 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root, >> btrfs_get_alloc_profile(root, 0), >> CHUNK_ALLOC_NO_FORCE); >> btrfs_end_transaction(trans, root); >> - if (ret == -ENOSPC) >> + if (ret > 0 || ret == -ENOSPC) >> ret = 0; >> break; >> case COMMIT_TRANS: >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk. But flush_space will not convert this to 0, and will also return 1. As a result, reserve_metadata_bytes will think that flush_space failed, and may potentially return this value "1" to the caller (depends how reserve_metadata_bytes was called). The caller will also treat this as an error. For example, btrfs_block_rsv_refill does: int ret = -ENOSPC; ... ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, 0); return 0; } return ret; So it will return -ENOSPC. Signed-off-by: Alex Lyakas <a...@zadarastorage.com> Reviewed-by: Josef Bacik <jba...@fb.com> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 4b89680..1ba3f0d 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root, btrfs_get_alloc_profile(root, 0), CHUNK_ALLOC_NO_FORCE); btrfs_end_transaction(trans, root); - if (ret == -ENOSPC) + if (ret > 0 || ret == -ENOSPC) ret = 0; break; case COMMIT_TRANS: On Sun, Dec 6, 2015 at 12:19 PM, Alex Lyakas <a...@zadarastorage.com> wrote: > Hi Liu, > I was studying on how block reservation works, and making some > modifications in reserve_metadata_bytes to understand better what it > does. Then suddenly I saw this problem. I guess it depends on which > value of "flush" parameter is passed to reserve_metadata_bytes. > > Alex. > > > On Thu, Dec 3, 2015 at 8:14 PM, Liu Bo <bo.li@oracle.com> wrote: >> On Thu, Dec 03, 2015 at 06:51:03PM +0200, Alex Lyakas wrote: >>> do_chunk_alloc returns 1 when it succeeds to allocate a new chunk. >>> But flush_space will not convert this to 0, and will also return 1. >>> As a result, reserve_metadata_bytes will think that flush_space failed, >>> and may potentially return this value "1" to the caller (depends how >>> reserve_metadata_bytes was called). The caller will also treat this as an >>> error. 
>>> For example, btrfs_block_rsv_refill does: >>> >>> int ret = -ENOSPC; >>> ... >>> ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush); >>> if (!ret) { >>> block_rsv_add_bytes(block_rsv, num_bytes, 0); >>> return 0; >>> } >>> >>> return ret; >>> >>> So it will return -ENOSPC. >> >> It will return 1 instead of -ENOSPC. >> >> The patch looks good, I noticed this before, but I didn't manage to trigger >> a error for this, did you catch a error like that? >> >> Thanks, >> >> -liubo >> >>> >>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >>> index 4b89680..1ba3f0d 100644 >>> --- a/fs/btrfs/extent-tree.c >>> +++ b/fs/btrfs/extent-tree.c >>> @@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root, >>> btrfs_get_alloc_profile(root, 0), >>> CHUNK_ALLOC_NO_FORCE); >>> btrfs_end_transaction(trans, root); >>> - if (ret == -ENOSPC) >>> + if (ret > 0 || ret == -ENOSPC) >>> ret = 0; >>> break; >>> case COMMIT_TRANS: >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] btrfs: flush_space: treat return value of do_chunk_alloc properly
do_chunk_alloc returns 1 when it succeeds to allocate a new chunk.
But flush_space will not convert this to 0, and will also return 1.
As a result, reserve_metadata_bytes will think that flush_space failed,
and may potentially return this value "1" to the caller (depends how
reserve_metadata_bytes was called). The caller will also treat this as an
error.
For example, btrfs_block_rsv_refill does:

int ret = -ENOSPC;
...
ret = reserve_metadata_bytes(root, block_rsv, num_bytes, flush);
if (!ret) {
	block_rsv_add_bytes(block_rsv, num_bytes, 0);
	return 0;
}

return ret;

So it will return -ENOSPC.

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4b89680..1ba3f0d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4727,7 +4727,7 @@ static int flush_space(struct btrfs_root *root,
 				     btrfs_get_alloc_profile(root, 0),
 				     CHUNK_ALLOC_NO_FORCE);
 		btrfs_end_transaction(trans, root);
-		if (ret == -ENOSPC)
+		if (ret > 0 || ret == -ENOSPC)
 			ret = 0;
 		break;
 	case COMMIT_TRANS:
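The return-value convention discussed in this patch can be illustrated with a small userspace sketch. This is not the kernel code: mock_chunk_alloc(), flush_old(), flush_new() and refill_succeeds() are hypothetical stand-ins. The sketch assumes only what the message states, namely that a positive return means "a chunk was allocated" and that the caller treats any non-zero value as a failure.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical stand-in for do_chunk_alloc(): the caller picks the outcome.
 * 1 = a new chunk was allocated, 0 = nothing was allocated,
 * negative errno = failure.
 */
int mock_chunk_alloc(int outcome)
{
	assert(outcome <= 1);
	return outcome;
}

/* Behaviour before the patch: only -ENOSPC is mapped to 0. */
int flush_old(int outcome)
{
	int ret = mock_chunk_alloc(outcome);

	if (ret == -ENOSPC)
		ret = 0;
	return ret;
}

/* Behaviour with the patch: a positive "allocated" result is success too. */
int flush_new(int outcome)
{
	int ret = mock_chunk_alloc(outcome);

	if (ret > 0 || ret == -ENOSPC)
		ret = 0;
	return ret;
}

/*
 * Caller in the style of btrfs_block_rsv_refill(): only 0 counts as
 * success. Returns 1 if the reservation would be added, 0 otherwise.
 */
int refill_succeeds(int flush_ret)
{
	return flush_ret == 0;
}
```

With the old mapping, a successful allocation (outcome 1) leaks through as a non-zero value and the refill-style caller gives up. With the new mapping the same outcome is reported as 0, while real errors other than -ENOSPC are still passed up unchanged.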
Re: [PATCH] btrfs: clear bio reference after submit_one_bio()
Hi Holger,
I think it will cause an invalid paging request, just like in the case
that Naohiro has fixed.
I am not running the "latest and greatest" btrfs in my system, and it
is not easy to set it up. That's why I cannot submit patches based on
the latest code; I can only review and comment on patches.

Alex.


On Thu, Nov 5, 2015 at 3:08 PM, Holger Hoffstätte
<holger.hoffstae...@googlemail.com> wrote:
> On 10/11/15 20:09, Alex Lyakas wrote:
>> Hi Naota,
>>
>> What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we
>> return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And
>> if *bio_ret was non-NULL upon entry into submit_extent_page, then we
>> had submitted this bio before getting to btrfs_bio_alloc(). So should
>> btrfs_bio_alloc() failure be handled in the same way?
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index 3915c94..cd443bc 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct
>> extent_io_tree *tree,
>>
>>  	bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES,
>>  			GFP_NOFS | __GFP_HIGH);
>> -	if (!bio)
>> +	if (!bio) {
>> +		if (bio_ret)
>> +			*bio_ret = NULL;
>>  		return -ENOMEM;
>> +	}
>>
>>  	bio_add_page(bio, page, page_size, offset);
>>  	bio->bi_end_io = end_io_func;
>>
>
> Did you get any feedback on this? It seems it could cause data loss or
> corruption on allocation failures, no?
>
> -h
>
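The stale-pointer concern behind the quoted hunk can be shown with a userspace sketch. Everything here is an assumption made for illustration: struct fake_bio, fake_submit(), fake_alloc() and submit_page() are hypothetical and only mimic the shape of submit_extent_page() (a cached bio handed in through a double pointer, submitted first, then replaced by a fresh allocation that may fail).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bio. */
struct fake_bio {
	int pages;
};

/* In this sketch, submitting a bio consumes (frees) it. */
static void fake_submit(struct fake_bio *bio)
{
	assert(bio != NULL);
	free(bio);
}

/* Allocation that the caller can force to fail. */
static struct fake_bio *fake_alloc(int fail)
{
	return fail ? NULL : calloc(1, sizeof(struct fake_bio));
}

/*
 * Submit the cached bio (if any), then allocate a new one and cache it.
 * On allocation failure *bio_ret is cleared, so the caller never sees a
 * pointer to the bio that was already submitted.
 */
int submit_page(struct fake_bio **bio_ret, int fail_alloc)
{
	struct fake_bio *bio;

	if (bio_ret && *bio_ret)
		fake_submit(*bio_ret);	/* *bio_ret is stale from here on */

	bio = fake_alloc(fail_alloc);
	if (!bio) {
		if (bio_ret)
			*bio_ret = NULL;
		return -ENOMEM;
	}

	bio->pages = 1;
	if (bio_ret)
		*bio_ret = bio;		/* keep it for the next call */
	else
		fake_submit(bio);
	return 0;
}
```

Without the `*bio_ret = NULL` in the failure path, the caller would keep a pointer to an already-submitted bio and hand it back on the next call, which is the situation the question above is about.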
Re: [PATCH] btrfs: clear bio reference after submit_one_bio()
Hi Naota, What happens if btrfs_bio_alloc() in submit_extent_page fails? Then we return -ENOMEM to the caller, but we do not set *bio_ret to NULL. And if *bio_ret was non-NULL upon entry into submit_extent_page, then we had submitted this bio before getting to btrfs_bio_alloc(). So should btrfs_bio_alloc() failure be handled in the same way? diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 3915c94..cd443bc 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2834,8 +2834,11 @@ static int submit_extent_page(int rw, struct extent_io_tree *tree, bio = btrfs_bio_alloc(bdev, sector, BIO_MAX_PAGES, GFP_NOFS | __GFP_HIGH); - if (!bio) + if (!bio) { + if (bio_ret) + *bio_ret = NULL; return -ENOMEM; + } bio_add_page(bio, page, page_size, offset); bio->bi_end_io = end_io_func; Thanks, Alex. On Wed, Jan 7, 2015 at 12:46 AM, Satoru Takeuchiwrote: > Hi Naota, > > On 2015/01/06 1:01, Naohiro Aota wrote: >> After submit_one_bio(), `bio' can go away. However submit_extent_page() >> leave `bio' referable if submit_one_bio() failed (e.g. -ENOMEM on OOM). >> It will cause invalid paging request when submit_extent_page() is called >> next time. >> >> I reproduced ENOMEM case with the following script (need >> CONFIG_FAIL_PAGE_ALLOC, and CONFIG_FAULT_INJECTION_DEBUG_FS). > > I confirmed that this problem reproduce with 3.19-rc3 and > not reproduce with 3.19-rc3 with your patch. > > Tested-by: Satoru Takeuchi > > Thank you for reporting this problem with the reproducer > and fixing it too. > > NOTE: > I used v3.19-rc3's tools/testing/fault-injection/failcmd.sh > for the following "./failcmd.sh". 
> > >./failcmd.sh -p $percent -t $times -i $interval \ > >--ignore-gfp-highmem=N --ignore-gfp-wait=N > --min-order=0 \ > >-- \ > >cat $directory/file > /dev/null > > * 3.19-rc1 + your patch > > === > # ./run > 512+0 records in > 512+0 records out > # > === > > * 3.19-rc3 > > === > # ./run > 512+0 records in > 512+0 records out > [ 188.433726] run (776): drop_caches: 1 > [ 188.682372] FAULT_INJECTION: forcing a failure. > name fail_page_alloc, interval 100, probability 111000, space 0, times 3 > [ 188.689986] CPU: 0 PID: 954 Comm: cat Not tainted 3.19.0-rc3-ktest #1 > [ 188.693834] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Bochs 01/01/2011 > [ 188.698466] 0064 88007b343618 816e5563 > 88007fc0fc78 > [ 188.702730] 81c655c0 88007b343638 813851b5 > 0010 > [ 188.707043] 0002 88007b343768 81188126 > 88007b3435a8 > [ 188.711283] Call Trace: > [ 188.712620] [] dump_stack+0x45/0x57 > [ 188.715330] [] should_fail+0x135/0x140 > [ 188.718218] [] __alloc_pages_nodemask+0xd6/0xb30 > [ 188.721567] [] ? blk_rq_map_sg+0x35/0x170 > [ 188.724558] [] ? virtio_queue_rq+0x145/0x2b0 > [virtio_blk] > [ 188.728191] [] ? > btrfs_submit_compressed_read+0xcf/0x4d0 [btrfs] > [ 188.732079] [] ? kmem_cache_alloc+0x1cb/0x230 > [ 188.735153] [] ? mempool_alloc_slab+0x15/0x20 > [ 188.738188] [] alloc_pages_current+0x9a/0x120 > [ 188.741153] [] btrfs_submit_compressed_read+0x1a9/0x4d0 > [btrfs] > [ 188.744835] [] btrfs_submit_bio_hook+0x1c1/0x1d0 [btrfs] > [ 188.748225] [] ? lookup_extent_mapping+0x13/0x20 [btrfs] > [ 188.751547] [] ? btrfs_get_extent+0x98/0xad0 [btrfs] > [ 188.754656] [] submit_one_bio+0x67/0xa0 [btrfs] > [ 188.757554] [] submit_extent_page.isra.35+0xd7/0x1c0 > [btrfs] > [ 188.760981] [] __do_readpage+0x31d/0x7b0 [btrfs] > [ 188.763920] [] ? btrfs_create_repair_bio+0x110/0x110 > [btrfs] > [ 188.767382] [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs] > [ 188.770671] [] ? 
btrfs_lookup_ordered_range+0x13d/0x180 > [btrfs] > [ 188.774366] [] > __extent_readpages.constprop.42+0x2ba/0x2d0 [btrfs] > [ 188.778031] [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs] > [ 188.781241] [] extent_readpages+0x169/0x1b0 [btrfs] > [ 188.784322] [] ? btrfs_submit_direct+0x7b0/0x7b0 [btrfs] > [ 188.789014] [] btrfs_readpages+0x1f/0x30 [btrfs] > [ 188.792028] [] __do_page_cache_readahead+0x18c/0x1f0 > [ 188.795078] [] ondemand_readahead+0xdf/0x260 > [ 188.797702] [] ? btrfs_congested_fn+0x5f/0xa0 [btrfs] > [ 188.800718] [] page_cache_async_readahead+0x71/0xa0 > [ 188.803650] [] generic_file_read_iter+0x40f/0x5e0 > [ 188.806480] [] new_sync_read+0x7e/0xb0 > [ 188.808832] [] __vfs_read+0x18/0x50 > [ 188.811068] [] vfs_read+0x8a/0x140 > [ 188.813298] [] SyS_read+0x46/0xb0
Re: [PATCH] Btrfs: check pending chunks when shrinking fs to avoid corruption
Hi Filipe,
Looking at the code of this patch, I see that if we discover a pending
chunk, we unlock the chunk mutex, commit the transaction (which
completes the allocation of all pending chunks and inserts the relevant
items into the device tree and chunk tree), and retry the search.

However, after we unlock the chunk mutex, somebody could have attempted
a new chunk allocation, which would have resulted in a new pending
chunk. On the other hand, we have done:
btrfs_device_set_total_bytes(device, new_size);
so this line should prevent anybody from allocating beyond the new
size. In that case, we are sure that on the second pass there will be
no pending chunks beyond the new size, so we can shrink to new_size
safely. Is my understanding correct?

Thanks,
Alex.


On Tue, Jun 2, 2015 at 3:43 PM, wrote:
> From: Filipe Manana
>
> When we shrink the usable size of a device (its total_bytes), we go over
> all the device extent items in the device tree and attempt to relocate
> the chunk of any device extent that goes beyond the new usable size for
> the device. We do that after setting the new usable size (total_bytes) in
> the device object, so that all new allocations (and reallocations) don't
> use areas of the device that go beyond the new (shorter) size. However we
> were not considering that before setting the new size in the device,
> pending chunks might have been created that use device extents that go
> beyond the new size, and those device extents are not yet in the device
> tree after we search the device tree - they are still attached to the
> list of new block group for some ongoing transaction handle, and they are
> only added to the device tree when the transaction handle is ended (via
> btrfs_create_pending_block_groups()).
>
> So check for pending chunks with device extents that go beyond the new
> size and if any exists, commit the current transaction and repeat the
> search in the device tree.
> > Not doing this it would mean we would return success to user space while > still having extents that go beyond the new size, and later user space > could override those locations on the device while the fs still references > them, causing all sorts of corruption and unexpected events. > > Signed-off-by: Filipe Manana > --- > fs/btrfs/volumes.c | 49 - > 1 file changed, 40 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index dbea12e..09e89a6 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -3984,6 +3984,7 @@ int btrfs_shrink_device(struct btrfs_device *device, > u64 new_size) > int slot; > int failed = 0; > bool retried = false; > + bool checked_pending_chunks = false; > struct extent_buffer *l; > struct btrfs_key key; > struct btrfs_super_block *super_copy = root->fs_info->super_copy; > @@ -4064,15 +4065,6 @@ again: > goto again; > } else if (failed && retried) { > ret = -ENOSPC; > - lock_chunks(root); > - > - btrfs_device_set_total_bytes(device, old_size); > - if (device->writeable) > - device->fs_devices->total_rw_bytes += diff; > - spin_lock(>fs_info->free_chunk_lock); > - root->fs_info->free_chunk_space += diff; > - spin_unlock(>fs_info->free_chunk_lock); > - unlock_chunks(root); > goto done; > } > > @@ -4084,6 +4076,35 @@ again: > } > > lock_chunks(root); > + > + /* > +* We checked in the above loop all device extents that were already > in > +* the device tree. However before we have updated the device's > +* total_bytes to the new size, we might have had chunk allocations > that > +* have not complete yet (new block groups attached to transaction > +* handles), and therefore their device extents were not yet in the > +* device tree and we missed them in the loop above. 
So if we have any > +* pending chunk using a device extent that overlaps the device range > +* that we can not use anymore, commit the current transaction and > +* repeat the search on the device tree - this way we guarantee we > will > +* not have chunks using device extents that end beyond 'new_size'. > +*/ > + if (!checked_pending_chunks) { > + u64 start = new_size; > + u64 len = old_size - new_size; > + > + if (contains_pending_extent(trans, device, , len)) { > + unlock_chunks(root); > + checked_pending_chunks = true; > + failed = 0; > + retried = false; > + ret = btrfs_commit_transaction(trans, root); > + if (ret) > +
Re: [PATCH v5 04/18] btrfs: Add threshold workqueue based on kernel workqueue
Hi Qu,

On Fri, Feb 28, 2014 at 4:46 AM, Qu Wenruo <quwen...@cn.fujitsu.com> wrote:
> The original btrfs_workers has thresholding functions to dynamically
> create or destroy kthreads.
>
> Though there is no such function in kernel workqueue because the worker
> is not created manually, we can still use the workqueue_set_max_active
> to simulate the behavior, mainly to achieve a better HDD performance by
> setting a high threshold on submit_workers.
> (Sadly, no resource can be saved)
>
> So in this patch, extra workqueue pending counters are introduced to
> dynamically change the max active of each btrfs_workqueue_struct, hoping
> to restore the behavior of the original thresholding function.
>
> Also, workqueue_set_max_active uses a mutex to protect workqueue_struct,
> which is not meant to be called too frequently, so a new interval
> mechanism is applied, that will only call workqueue_set_max_active after
> a count of work is queued. Hoping to balance both the random and
> sequence performance on HDD.
>
> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> Tested-by: David Sterba <dste...@suse.cz>
> ---
> Changelog:
> v2->v3:
>   - Add thresholding mechanism to simulate the old thresholding mechanism.
>   - Will not enable thresholding when thresh is set to small value.
> v3->v4:
>   None
> v4->v5:
>   None
> ---
>  fs/btrfs/async-thread.c | 107
>  fs/btrfs/async-thread.h |   3 +-
>  2 files changed, 101 insertions(+), 9 deletions(-)
>
> diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
> index 193c849..977bce2 100644
> --- a/fs/btrfs/async-thread.c
> +++ b/fs/btrfs/async-thread.c
> @@ -30,6 +30,9 @@
>  #define WORK_ORDER_DONE_BIT 2
>  #define WORK_HIGH_PRIO_BIT 3
>
> +#define NO_THRESHOLD (-1)
> +#define DFT_THRESHOLD (32)
> +
>  /*
>   * container for the kthread task pointer and the list of pending work
>   * One of these is allocated per thread.
> @@ -737,6 +740,14 @@ struct __btrfs_workqueue_struct {
>
>  	/* Spinlock for ordered_list */
>  	spinlock_t list_lock;
> +
> +	/* Thresholding related variants */
> +	atomic_t pending;
> +	int max_active;
> +	int current_max;
> +	int thresh;
> +	unsigned int count;
> +	spinlock_t thres_lock;
>  };
>
>  struct btrfs_workqueue_struct {
> @@ -745,19 +756,34 @@ struct btrfs_workqueue_struct {
>  };
>
>  static inline struct __btrfs_workqueue_struct
> -*__btrfs_alloc_workqueue(char *name, int flags, int max_active)
> +*__btrfs_alloc_workqueue(char *name, int flags, int max_active, int thresh)
>  {
>  	struct __btrfs_workqueue_struct *ret = kzalloc(sizeof(*ret), GFP_NOFS);
>
>  	if (unlikely(!ret))
>  		return NULL;
>
> +	ret->max_active = max_active;
> +	atomic_set(&ret->pending, 0);
> +	if (thresh == 0)
> +		thresh = DFT_THRESHOLD;
> +	/* For low threshold, disabling threshold is a better choice */
> +	if (thresh < DFT_THRESHOLD) {
> +		ret->current_max = max_active;
> +		ret->thresh = NO_THRESHOLD;
> +	} else {
> +		ret->current_max = 1;
> +		ret->thresh = thresh;
> +	}
> +
>  	if (flags & WQ_HIGHPRI)
>  		ret->normal_wq = alloc_workqueue("%s-%s-high", flags,
> -						 max_active, "btrfs", name);
> +						 ret->max_active,
> +						 "btrfs", name);
>  	else
>  		ret->normal_wq = alloc_workqueue("%s-%s", flags,
> -						 max_active, "btrfs", name);
> +						 ret->max_active, "btrfs",
> +						 name);

Shouldn't we use ret->current_max instead of ret->max_active (in both
calls)? According to the rest of the code, max_active is the absolute
maximum beyond which the normal_wq cannot go (you use clamp_value to
ensure that). And current_max is the current value of max_active of
the normal_wq. But here, you set the normal_wq to max_active
immediately. Is this intentional?

>  	if (unlikely(!ret->normal_wq)) {
>  		kfree(ret);
>  		return NULL;
> @@ -765,6 +791,7 @@ static inline struct __btrfs_workqueue_struct
>
>  	INIT_LIST_HEAD(&ret->ordered_list);
>  	spin_lock_init(&ret->list_lock);
> +	spin_lock_init(&ret->thres_lock);
>  	return ret;
>  }
>
> @@ -773,7 +800,8 @@ __btrfs_destroy_workqueue(struct __btrfs_workqueue_struct *wq);
>
>  struct btrfs_workqueue_struct *btrfs_alloc_workqueue(char *name,
>  						     int flags,
> -						     int max_active)
> +						     int max_active,
> +						     int thresh)
>  {
>  	struct btrfs_workqueue_struct *ret = kzalloc(sizeof(*ret), GFP_NOFS);
>
> @@ -781,14 +809,15 @@ struct btrfs_workqueue_struct *btrfs_alloc_workqueue(char
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
the commit thread needs to compete on tree-block locks with them (and
they hold the locks because they also read tree blocks from disk, as it
seems)

So my question is: shouldn't we be much more aggressive in
__btrfs_end_transaction, running delayed refs several times and
checking trans->delayed_ref_updates after each run, and return only if
this number is zero or small enough?
This way, when we trigger a commit, it will not have a lot of delayed
refs to run, it will get very quickly to the critical section, pass it
hopefully very quickly (get to TRANS_STATE_UNBLOCKED), and then we can
open a new transaction while the previous is doing
btrfs_write_and_wait_transaction.
That's what I wanted to ask.

Thanks!
Alex.

[1] In my case, btrfs metadata is ~10GB and the machine has 8GB of
RAM. Due to this we need to read a lot of ebs from disk, as they are
not in the page cache. We also need to keep in mind that every COW of
an eb requires a new slot in the page cache, because we index by the
bytenr that we receive from the free-space cache, which is a logical
coordinate by which EXTENT_ITEMs are sorted in the extent tree.


On Mon, Jul 13, 2015 at 7:02 PM, Chris Mason <c...@fb.com> wrote:
> On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
>> Filipe,
>> Thanks for the explanation. Those reasons were not so obvious for me.
>>
>> Would it make sense not to COW the block in case-1, if we are mounted
>> with "notreelog"? Or, perhaps, to check that the block does not belong
>> to a log tree?
>>
>
> Hi Alex,
>
> The crc rules are the most important, we have to make sure the block
> isn't changed while it is in flight. Also, think about something like
> this:
>
> transaction write block A, puts pointer to it in the btree, generation Y
>
> hard disk properly completes the IO
>
> transaction rewrites block A, same generation Y
>
> hard disk drops the IO on the floor and never does it
>
> Later on, we try to read block A again. We find it has the correct crc
> and the correct generation number, but the contents are actually wrong.
>
>> The second case is more difficult. One problem is that
>> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
>> due to memory pressure (this is what I see happening), we complete the
>> writeback, release the extent buffer, and pages are evicted from the
>> page cache of btree_inode. After some time we read the block again
>> (because we want to modify it in the same transaction), but its header
>> is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
>> this point it should be safe to avoid COW, we will re-COW.
>>
>> Would it make sense to have some runtime-only mechanism to lock-out
>> the write-back for an eb? I.e., if we know that eb is not under
>> writeback, and writeback is locked out from starting, we can redirty
>> the block without COW. Then we allow the writeback to start when it
>> wants to.
>>
>> In one of my test runs, btrfs had 6.4GB of metadata (before
>> raid-induced overhead), but during a particular transaction a total of
>> 10GB of metadata (again, before raid-induced overhead) was written to
>> disk. (This is the total of all ebs having
>> header->generation==curr_transid, not only during commit of the
>> transaction). This particular run was with "notreelog".
>>
>> Machine had 8GB of RAM. Linux allows the btree_inode to grow its
>> page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
>> But even though the used amount of metadata is less than that, this
>> re-COW'ing of already-COW'ed blocks seems to cause page-cache
>> thrashing...
>
> Interesting. We've addressed this in the past with changes to the
> writepage(s) callback for the btree, basically skipping memory pressure
> related writeback if there isn't that much dirty. There is a lot of
> room to improve those decisions, like preferring to write leaves over
> nodes, especially full leaves that are not likely to change again.
>
> -chris
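The lost-write scenario Chris describes can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not btrfs code: struct toy_block, toy_csum() (an FNV-1a hash standing in for the real crc), toy_write() and toy_verify() are all hypothetical, and the "disk" is just a struct in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAYLOAD_LEN 16

/* Toy tree block: a real btrfs header carries more fields than this. */
struct toy_block {
	uint64_t generation;
	uint32_t csum;
	char payload[PAYLOAD_LEN];
};

/* Toy checksum (FNV-1a), standing in for the real crc. */
static uint32_t toy_csum(const char *data, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= (unsigned char)data[i];
		h *= 16777619u;
	}
	return h;
}

/* Write a complete block image to the given on-disk location. */
void toy_write(struct toy_block *disk, uint64_t gen, const char *payload)
{
	assert(strlen(payload) < PAYLOAD_LEN);
	memset(disk, 0, sizeof(*disk));
	strcpy(disk->payload, payload);
	disk->generation = gen;
	disk->csum = toy_csum(disk->payload, PAYLOAD_LEN);
}

/* Reader-side check: checksum must match and generation must be expected. */
int toy_verify(const struct toy_block *disk, uint64_t expected_gen)
{
	return disk->csum == toy_csum(disk->payload, PAYLOAD_LEN) &&
	       disk->generation == expected_gen;
}
```

If block A is written once with generation Y and a second in-place write with the same generation is dropped by the disk, the location still holds the first image, and toy_verify() accepts it even though the contents are stale. If the second write had gone to a fresh location instead (COW), a dropped write would leave that location without a valid generation-Y image and the check would fail.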
Re: question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
Filipe,
Thanks for the explanation. Those reasons were not so obvious for me.

Would it make sense not to COW the block in case-1, if we are mounted
with "notreelog"? Or, perhaps, to check that the block does not belong
to a log tree?

The second case is more difficult. One problem is that
BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
due to memory pressure (this is what I see happening), we complete the
writeback, release the extent buffer, and pages are evicted from the
page cache of btree_inode. After some time we read the block again
(because we want to modify it in the same transaction), but its header
is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
this point it should be safe to avoid COW, we will re-COW.

Would it make sense to have some runtime-only mechanism to lock-out
the write-back for an eb? I.e., if we know that eb is not under
writeback, and writeback is locked out from starting, we can redirty
the block without COW. Then we allow the writeback to start when it
wants to.

In one of my test runs, btrfs had 6.4GB of metadata (before
raid-induced overhead), but during a particular transaction a total of
10GB of metadata (again, before raid-induced overhead) was written to
disk. (This is the total of all ebs having
header->generation==curr_transid, not only during commit of the
transaction). This particular run was with "notreelog".

Machine had 8GB of RAM. Linux allows the btree_inode to grow its
page-cache up to ~6.9GB (judging by btree_inode->i_mapping->nrpages).
But even though the used amount of metadata is less than that, this
re-COW'ing of already-COW'ed blocks seems to cause page-cache
thrashing...

Thanks,
Alex.

On Mon, Jul 13, 2015 at 11:27 AM, Filipe David Manana
<fdman...@gmail.com> wrote:
> On Sun, Jul 12, 2015 at 6:15 PM, Alex Lyakas <a...@zadarastorage.com> wrote:
>> Greetings,
>> Looking at the code of should_cow_block(), I see:
>>
>> if (btrfs_header_generation(buf) == trans->transid &&
>>     !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
>> ...
>> So if the extent buffer has been written to disk, and now is changed again
>> in the same transaction, we insist on COW'ing it. Can anybody explain why
>> COW is needed in this case? The transaction has not committed yet, so what
>> is the danger of rewriting to the same location on disk? My understanding
>> was that a tree block needs to be COW'ed at most once in the same
>> transaction. But I see that this is not the case.
>
> That logic is there, as far as I can see, for at least 2 obvious reasons:
>
> 1) fsync/log trees. All extent buffers (tree blocks) of a log tree
> have the same transaction id/generation, and you can have multiple
> fsyncs (log transaction commits) per transaction so you need to ensure
> consistency. If we skipped the COWing in the example below, you would
> get an inconsistent log tree at log replay time when the fs is
> mounted:
>
> transaction N start
>
>    fsync inode A start
>    creates tree block X
>    flush X to disk
>    write a new superblock
>    fsync inode A end
>
>    fsync inode B start
>    skip COW of X because its generation == current transaction id and
>    modify it in place
>    flush X to disk
>
> ========== crash ===========
>
>    write a new superblock
>    fsync inode B end
>
> transaction N commit
>
> 2) The flag BTRFS_HEADER_FLAG_WRITTEN is set not when the block is
> written to disk but instead when we trigger writeback for it. So while
> the writeback is ongoing we want to make sure the block's content
> isn't concurrently modified (we don't keep the eb write locked to
> allow concurrent reads during the writeback).
>
> All tree blocks that don't belong to a log tree are normally written
> only at the end of a transaction commit. But often, due to memory
> pressure for example, the VM can call the writepages() callback of the
> btree inode to force dirty tree blocks to be written to disk before
> the transaction commit.
>
>>
>> I am asking because I am doing some profiling of btrfs metadata work under
>> heavy loads, and I see that sometimes btrfs COWs almost twice as many tree
>> blocks as the total metadata size.
>>
>> Thanks,
>> Alex.
>>
>
> --
> Filipe David Manana,
>
> "Reasonable men adapt themselves to the world.
>  Unreasonable men adapt the world to themselves.
>  That's why all progress depends on unreasonable men."
question about should_cow_block() and BTRFS_HEADER_FLAG_WRITTEN
Greetings,
Looking at the code of should_cow_block(), I see:

if (btrfs_header_generation(buf) == trans->transid &&
    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
...

So if the extent buffer has been written to disk, and now is changed again in the same transaction, we insist on COW'ing it. Can anybody explain why COW is needed in this case? The transaction has not committed yet, so what is the danger of rewriting to the same location on disk? My understanding was that a tree block needs to be COW'ed at most once in the same transaction. But I see that this is not the case.

I am asking because I am doing some profiling of btrfs metadata work under heavy loads, and I see that sometimes btrfs COWs almost twice as many tree blocks as the total metadata size.

Thanks,
Alex.
Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible
Hi Qu,

On Wed, Dec 24, 2014 at 3:09 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
> -------- Original Message --------
> Subject: Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible
> From: Alex Lyakas alex.bt...@zadarastorage.com
> To: Qu Wenruo quwen...@cn.fujitsu.com
> Date: December 24, 2014, 00:49
>
>> Hi Qu,
>>
>> On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
>>> [snipped]
>>> +
>>> +static int __insert_block_group(struct btrfs_trans_handle *trans,
>>> +				struct chunk_record *chunk_rec,
>>> +				struct btrfs_root *extent_root,
>>> +				u64 used)
>>> +{
>>> +	struct btrfs_block_group_item bg_item;
>>> +	struct btrfs_key key;
>>> +	int ret = 0;
>>> +
>>> +	btrfs_set_block_group_used(&bg_item, used);
>>> +	btrfs_set_block_group_chunk_objectid(&bg_item, used);
>>
>> This looks like a bug. Instead of used, I think it should be BTRFS_FIRST_CHUNK_TREE_OBJECTID.
>
> Oh, my mistake, BTRFS_FIRST_CHUNK_TREE_OBJECTID is right.
> Thanks for pointing this out.
>
>>> [snipped]
>>> --
>>> 2.1.2
>>
>> Couple of questions:
>> # In remove_chunk_extent_item, should we also consider rebuild chunks now? It can happen that a rebuild chunk is a SYSTEM chunk. Should we try to handle it as well?
>
> Not quite sure about the meaning of rebuild here.
> The chunk-recovery has the rebuild_chunk_tree() function to rebuild the whole chunk tree with the good/repaired chunks we found.
>
>> # Same question for rebuild_sys_array. Should we also consider rebuild chunks?
>
> The chunk-recovery has rebuild_sys_array() to handle SYSTEM chunks too.

I meant that with this patch you have added the rebuild_chunks list:

struct list_head good_chunks;
struct list_head bad_chunks;
struct list_head rebuild_chunks;    <--- you added this
struct list_head unrepaired_chunks;

These are chunks that have no block-group record, but we are confident that we can rebuild the block-group records for these chunks by scanning all EXTENT_ITEMs in the block-group range and calculating the used value for the block-group. If we fail, we just set used==block-group size.

My question is: should we now consider those rebuild_chunks the same as good_chunks? I.e., should we also consider those chunks in the following functions:
- remove_chunk_extent_item: probably no, because we need the EXTENT_ITEMs to recalculate the used value
- rebuild_sys_array: if it happens that a rebuild_chunk is also a SYSTEM chunk, should we add it to the sys_chunk_array too? (In addition to good_chunks.)

Thanks,
Alex.

> Thanks,
> Qu
>
>> Thanks,
>> Alex.
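The recalculation of the used value described above can be sketched as follows. This is a simplified model under stated assumptions, not the btrfs-progs code: the extent record type and the demo_* names are hypothetical, and every extent is assumed to carry its length explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent record: start address and length in bytes. */
struct demo_extent {
	uint64_t bytenr;
	uint64_t length;
};

/*
 * Sums the lengths of all extents that start inside the block-group range
 * [bg_start, bg_start + bg_length). If the extent tree cannot be read, the
 * block group is reported as completely full, so that nothing new is
 * allocated from it.
 */
static uint64_t demo_calc_used(const struct demo_extent *extents, size_t count,
			       uint64_t bg_start, uint64_t bg_length,
			       bool extent_tree_readable)
{
	uint64_t used = 0;
	size_t i;

	assert(extents != NULL || count == 0);

	if (!extent_tree_readable)
		return bg_length;

	for (i = 0; i < count; i++) {
		if (extents[i].bytenr >= bg_start &&
		    extents[i].bytenr < bg_start + bg_length)
			used += extents[i].length;
	}
	return used;
}
```

Falling back to the full block-group length is the conservative choice mentioned in the thread: an overestimated used value wastes space, while an underestimated one could let later writes overwrite live data.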
Re: How btrfs-find-root knows that the block is actually a root?
Hi Qu,

On Tue, Dec 23, 2014 at 7:27 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
> -------- Original Message --------
> Subject: How btrfs-find-root knows that the block is actually a root?
> From: Alex Lyakas alex.bt...@zadarastorage.com
> To: linux-btrfs linux-btrfs@vger.kernel.org
> Date: December 22, 2014, 22:57
>
>> Greetings,
>>
>> I am looking at the code of search_iobuf() in btrfs-find-root.c (3.17.3). I see that we probe nodesize blocks one by one, and for each block we check:
>> - its owner is what we are looking for
>> - its header->bytenr is what we are looking at currently
>> - its level is not too small
>> - it has valid checksum
>> - it has the desired generation
>>
>> If all those conditions are true, we declare this block as a root and end the program.
>>
>> How do we actually know that it's a root and not a leaf or an intermediate node? What if we are searching for a root of the root tree, which has one node and two leaves (all have the same highest transid), and one of the leaves has a logical address lower than the actual root, i.e., it comes first in our scan. Then we will declare this leaf as a root, won't we? Or somehow the root always has the lowest logical address?
>
> You can refer to this patch:
> https://patchwork.kernel.org/patch/5285521/

I see that this has not been applied to any of David's branches. Do you have a repo to look at this code in its entirety?

> Your questions are mostly right.
>
> The best method should be to search through all the metadata, and only the highest level header for a given generation may be the root for that generation.
>
> But that method still has some problems.
>
> 1) Overwritten old node/leaf
> As btrfs metadata cow happens, an old node/leaf may be overwritten and become incomplete, so the above method won't always work as expected.
>
> 2) Corrupted fs
> That will make everything not work as expected. But sadly, when someone needs to use btrfs-find-root, there is a high possibility the fs is already corrupted.
>
> 3) Slow speed
> It needs to scan over all the sectors of metadata chunks, which may vary from megabytes to terabytes, which makes the complete scan impractical.
>
> So current find-root uses a trade-off: if it finds a header at the position the superblock points to, and the generation matches, then it just considers it as the desired root and exits.

I think this is a bit optimistic. What if the root tree has several leaves having the same generation as the root? Then we might declare a leaf as a root and exit. But further recovery based on that output will get us into trouble.

>> Also, I am confused by this line:
>> level = h_level;
>>
>> This means that if we encounter a block that seems good, we will skip all other blocks that have lower level. Is this intended?
>
> This is intended, for the case where the user already knows the root's level, so it will skip any header whose level is below it.

But this line is performed before the generation check. Let's say that the user did not specify any level (so search_level==0). Then assume we encounter a block which has a lower generation than what we need, but a higher level. At this point, we do
level = h_level;
and we will skip any blocks lower than this level from now on. What if the root tree got shrunk (due to subvolume deletion, for example), and the good root has a lower level? We will skip it then, and will not find the root.

Thanks for your comments,
Alex.

> Thanks,
> Qu
>
>> Thanks,
>> Alex.
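The concern about raising the level before the generation check can be illustrated with a small model. This is a sketch of the scan as it is described in the thread, not the btrfs-find-root source: the header struct and the demo_find_root name are hypothetical, and the owner, bytenr and checksum checks are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified block header: only the fields the model needs. */
struct demo_header {
	uint64_t generation;
	uint8_t level;
};

/*
 * Scans candidate blocks in address order and returns the index of the first
 * block accepted as the root, or -1 if none is accepted. When
 * raise_level_before_generation_check is true, the minimum accepted level is
 * raised to the level of every block that passes the level filter, even if
 * that block is later rejected because of its generation.
 */
static int demo_find_root(const struct demo_header *blocks, size_t count,
			  uint64_t want_generation, uint8_t search_level,
			  bool raise_level_before_generation_check)
{
	uint8_t level = search_level;
	size_t i;

	assert(blocks != NULL || count == 0);

	for (i = 0; i < count; i++) {
		if (blocks[i].level < level)
			continue;
		if (raise_level_before_generation_check)
			level = blocks[i].level;
		if (blocks[i].generation != want_generation)
			continue;
		return (int)i;
	}
	return -1;
}
```

With a stale level-2 block of generation 9 placed before the real level-1 root of generation 10, the variant that raises the level first skips the real root and finds nothing, while the variant that leaves the level alone returns the second block. That is the shrunken-tree scenario raised in the reply above.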
Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible
Hi Qu,

On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
> Before the patch, chunk will be considered bad if the corresponding block group is missing, even if the only uncertain data is the 'used' member of the block group.
>
> This patch will try to recalculate the 'used' value of the block group and rebuild it.
> So even if only the chunk item and dev extent item are found, the chunk can be recovered.
> Although if the extent tree is damaged and the needed extent item can't be read, the block group's 'used' value will be the block group length, to prevent any later write/block reserve damaging the block group.
> In that case, we will prompt the user and recommend them to use '--init-extent-tree' to rebuild the extent tree if possible.
>
> Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
> ---
>  btrfsck.h       |   3 +-
>  chunk-recover.c | 242 +---
>  cmds-check.c    |  29 ---
>  3 files changed, 234 insertions(+), 40 deletions(-)
>
> diff --git a/btrfsck.h b/btrfsck.h
> index 356c767..7a50648 100644
> --- a/btrfsck.h
> +++ b/btrfsck.h
> @@ -179,5 +179,6 @@ btrfs_new_device_extent_record(struct extent_buffer *leaf,
>  int check_chunks(struct cache_tree *chunk_cache,
>  		 struct block_group_tree *block_group_cache,
>  		 struct device_extent_tree *dev_extent_cache,
> -		 struct list_head *good, struct list_head *bad, int silent);
> +		 struct list_head *good, struct list_head *bad,
> +		 struct list_head *rebuild, int silent);
>  #endif
> diff --git a/chunk-recover.c b/chunk-recover.c
> index 6f43066..dbf98b5 100644
> --- a/chunk-recover.c
> +++ b/chunk-recover.c
> @@ -61,6 +61,7 @@ struct recover_control {
>  	struct list_head good_chunks;
>  	struct list_head bad_chunks;
> +	struct list_head rebuild_chunks;
>  	struct list_head unrepaired_chunks;
>  	pthread_mutex_t rc_lock;
>  };
> @@ -203,6 +204,7 @@ static void init_recover_control(struct recover_control *rc, int verbose,
>  	INIT_LIST_HEAD(&rc->good_chunks);
>  	INIT_LIST_HEAD(&rc->bad_chunks);
> +	INIT_LIST_HEAD(&rc->rebuild_chunks);
>  	INIT_LIST_HEAD(&rc->unrepaired_chunks);
>  	rc->verbose = verbose;
> @@ -529,22 +531,32 @@ static void print_check_result(struct recover_control *rc)
>  		return;
>  	printf("CHECK RESULT:\n");
> -	printf("Healthy Chunks:\n");
> +	printf("Recoverable Chunks:\n");
>  	list_for_each_entry(chunk, &rc->good_chunks, list) {
>  		print_chunk_info(chunk, );
>  		good++;
>  		total++;
>  	}
> -	printf("Bad Chunks:\n");
> +	list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
> +		print_chunk_info(chunk, );
> +		good++;
> +		total++;
> +	}
> +	list_for_each_entry(chunk, &rc->unrepaired_chunks, list) {
> +		print_chunk_info(chunk, );
> +		good++;
> +		total++;
> +	}
> +	printf("Unrecoverable Chunks:\n");
>  	list_for_each_entry(chunk, &rc->bad_chunks, list) {
>  		print_chunk_info(chunk, );
>  		bad++;
>  		total++;
>  	}
>  	printf("\n");
> -	printf("Total Chunks:\t%d\n", total);
> -	printf(" Heathy:\t%d\n", good);
> -	printf(" Bad:\t%d\n", bad);
> +	printf("Total Chunks:\t\t%d\n", total);
> +	printf(" Recoverable:\t\t%d\n", good);
> +	printf(" Unrecoverable:\t%d\n", bad);
>  	printf("\n");
>  	printf("Orphan Block Groups:\n");
> @@ -555,6 +567,7 @@ static void print_check_result(struct recover_control *rc)
>  	printf("Orphan Device Extents:\n");
>  	list_for_each_entry(devext, &rc->devext.no_chunk_orphans, chunk_list)
>  		print_device_extent_info(devext, );
> +	printf("\n");
>  }
>  static int check_chunk_by_metadata(struct recover_control *rc,
> @@ -938,6 +951,11 @@ static int build_device_maps_by_chunk_records(struct recover_control *rc,
>  		if (ret)
>  			return ret;
>  	}
> +	list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
> +		ret = build_device_map_by_chunk_record(root, chunk);
> +		if (ret)
> +			return ret;
> +	}
>  	return ret;
>  }
> @@ -1168,12 +1186,31 @@ static int __rebuild_device_items(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
> +static int __insert_chunk_item(struct btrfs_trans_handle *trans,
> +			       struct chunk_record *chunk_rec,
> +			       struct btrfs_root *chunk_root)
> +{
> +	struct btrfs_key key;
> +	struct btrfs_chunk *chunk = NULL;
> +	int ret = 0;
> +
> +	chunk = create_chunk_item(chunk_rec);
> +	if (!chunk)
> +		return -ENOMEM;
> +	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	key.type = BTRFS_CHUNK_ITEM_KEY;
> +	key.offset = chunk_rec->offset;
> +
> +	ret =
How btrfs-find-root knows that the block is actually a root?
Greetings,

I am looking at the code of search_iobuf() in btrfs-find-root.c (3.17.3). I see that we probe nodesize blocks one by one, and for each block we check:
- its owner is what we are looking for
- its header->bytenr is what we are looking at currently
- its level is not too small
- it has valid checksum
- it has the desired generation

If all those conditions are true, we declare this block as a root and end the program.

How do we actually know that it's a root and not a leaf or an intermediate node? What if we are searching for a root of the root tree, which has one node and two leaves (all have the same highest transid), and one of the leaves has a logical address lower than the actual root, i.e., it comes first in our scan. Then we will declare this leaf as a root, won't we? Or somehow the root always has the lowest logical address?

Also, I am confused by this line:
level = h_level;

This means that if we encounter a block that seems good, we will skip all other blocks that have lower level. Is this intended?

Thanks,
Alex.
Re: [PATCH] Btrfs: update commit root on snapshot creation after orphan cleanup
Hi Filipe,
Thank you for the explanation.
I understand that without your patch we return to user-space after deleting the orphans, but leaving the transaction open. So user-space sees the snapshot and can start send. With your patch, we return to user-space after orphan cleanup has been committed. Unless we crash in the middle, as you pointed out.

I will also look at the new patch.

Thanks!
Alex.

On Thu, Jul 31, 2014 at 3:41 PM, Filipe David Manana fdman...@gmail.com wrote:
> On Mon, Jul 28, 2014 at 6:31 PM, Filipe David Manana fdman...@gmail.com wrote:
>> On Sat, Jul 19, 2014 at 8:11 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote:
>>> Hi Filipe,
>>> It's quite possible I don't fully understand the issue. It seems that we are creating a read-only snapshot, commit a transaction, and then go and modify the snapshot once again, by deleting all the ORPHAN_ITEMs we have in its file tree (btrfs_orphan_cleanup). Shouldn't all this be part of snapshot creation, so that after we commit, we have a clean file tree with no orphans there? (not sure if this makes sense though).
>>>
>>> With your patch we do this additional commit after the cleanup. But nothing prevents send from starting before this additional commit, correct? And it would still see the orphans through the commit root. You say that it is not a problem, but I am not sure why (probably I am missing something here). So for me it looks like your patch closes a race window significantly (at the cost of an additional commit), but does not close it fully.
>>
>> Hi Alex,
>>
>> That's right, after the transaction commit finishes, the snapshot will be visible and accessible to user space - so someone may start a send before the orphan cleanup starts. It was ok only for the serialized case (create snapshot, wait for ioctl to return, call send ioctl).
>
> Actually no. If after the 1st transaction commit (the one that creates the snapshot and makes it visible to user space) and before the orphan cleanup is called another task attempts to use the snapshot for a send operation, it will block when doing the snapshot dentry lookup - because both tasks acquire the parent inode's mutex (implicitly through the vfs and explicitly via the snapshot/subvol ioctl entry point).
>
> Nevertheless, it's better to move the commit root switch part to the dentry lookup function (as the new patch does), since after the first transaction commit and before the second one commits, a reboot might happen, and after that we would get into the same issue until the first transaction commit happens after the reboot. I'll update the new patch's comment.
>
> thanks
>
>>> But most important: perhaps send should look for ORPHAN_ITEMs and treat those inodes as deleted?
>>
>> There are other cases where orphans can exist, like for file truncates for example, where ignoring the inode wouldn't be very correct. Tried that approach initially, but it's actually more complex to implement and adds some additional overhead (tree searches - and the orphan items are normally too far from the inode items, due to a very high objectid (-5ULL)).
>>
>> I've reworked this with a different approach and CC'ed you (https://patchwork.kernel.org/patch/4635471/).
>>
>> thanks
>>
>>> Thanks,
>>> Alex.
>>>
>>> On Tue, Jun 3, 2014 at 2:41 PM, Filipe David Borba Manana fdman...@gmail.com wrote:
>>>> On snapshot creation (either writable or read-only), we do orphan cleanup against the root of the snapshot. If the cleanup did remove any orphans, then the current root node will be different from the commit root node until the next transaction commit happens.
>>>>
>>>> A send operation always uses the commit root of a snapshot - this means it will see the orphans if it starts computing the send stream before the next transaction commit happens (triggered by a timer or sync(), e.g.), which is when the commit root gets assigned a reference to current root, where the orphans are not visible anymore. The consequence of send seeing the orphans is explained below.
>>>>
>>>> For example:
>>>>
>>>> mkfs.btrfs -f /dev/sdd
>>>> mount -o commit=999 /dev/sdd /mnt
>>>>
>>>> # open a file with O_TMPFILE and leave it open
>>>> # write some data to the file
>>>> btrfs subvolume snapshot -r /mnt /mnt/snap1
>>>>
>>>> btrfs send /mnt/snap1 -f /tmp/send.data
>>>>
>>>> The send operation will fail with the following error:
>>>>
>>>> ERROR: send ioctl failed with -116: Stale file handle
>>>>
>>>> What happens here is that our snapshot has an orphan inode still visible through the commit root, that corresponds to the tmpfile. However send will attempt to call inode.c:btrfs_iget(), with the goal of reading the file's data, which will return -ESTALE because it will use the current root (and not the commit root) of the snapshot.
>>>>
>>>> Of course, there are other cases where we can get orphans, but this example using a tmpfile makes it much easier to reproduce the issue.
>>>>
>>>> Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if the commit root is different from the current root, just commit
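The difference between the current root and the commit root described in the patch can be modelled in a few lines. This is a toy sketch, not kernel code: the types and the demo_* functions are hypothetical, and a tree is reduced to a single flag that says whether an orphan is still visible in it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one version of the snapshot's tree. */
struct demo_tree {
	bool has_orphan;
};

/* Hypothetical stand-in for a snapshot root with its two root pointers. */
struct demo_snapshot {
	const struct demo_tree *node;        /* current root */
	const struct demo_tree *commit_root; /* root as of the last commit */
};

/* Orphan cleanup COWs the tree: only the current root sees the result. */
static void demo_orphan_cleanup(struct demo_snapshot *snap, struct demo_tree *cowed)
{
	assert(snap != NULL && cowed != NULL);
	cowed->has_orphan = false;
	snap->node = cowed;
}

/* A transaction commit makes the commit root point at the current root. */
static void demo_commit(struct demo_snapshot *snap)
{
	assert(snap != NULL);
	snap->commit_root = snap->node;
}

/* Send reads through the commit root only. */
static bool demo_send_sees_orphan(const struct demo_snapshot *snap)
{
	assert(snap != NULL && snap->commit_root != NULL);
	return snap->commit_root->has_orphan;
}
```

In this model, a send that runs between demo_orphan_cleanup() and demo_commit() still sees the orphan, which corresponds to the window that the extra commit in the patch is meant to close.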
Re: [PATCH] Btrfs: update commit root on snapshot creation after orphan cleanup
Hi Filipe,
It's quite possible I don't fully understand the issue. It seems that we are creating a read-only snapshot, commit a transaction, and then go and modify the snapshot once again, by deleting all the ORPHAN_ITEMs we have in its file tree (btrfs_orphan_cleanup). Shouldn't all this be part of snapshot creation, so that after we commit, we have a clean file tree with no orphans there? (not sure if this makes sense though).

With your patch we do this additional commit after the cleanup. But nothing prevents send from starting before this additional commit, correct? And it would still see the orphans through the commit root. You say that it is not a problem, but I am not sure why (probably I am missing something here). So for me it looks like your patch closes a race window significantly (at the cost of an additional commit), but does not close it fully.

But most important: perhaps send should look for ORPHAN_ITEMs and treat those inodes as deleted?

Thanks,
Alex.

On Tue, Jun 3, 2014 at 2:41 PM, Filipe David Borba Manana fdman...@gmail.com wrote:
> On snapshot creation (either writable or read-only), we do orphan cleanup against the root of the snapshot. If the cleanup did remove any orphans, then the current root node will be different from the commit root node until the next transaction commit happens.
>
> A send operation always uses the commit root of a snapshot - this means it will see the orphans if it starts computing the send stream before the next transaction commit happens (triggered by a timer or sync(), e.g.), which is when the commit root gets assigned a reference to current root, where the orphans are not visible anymore. The consequence of send seeing the orphans is explained below.
>
> For example:
>
> mkfs.btrfs -f /dev/sdd
> mount -o commit=999 /dev/sdd /mnt
>
> # open a file with O_TMPFILE and leave it open
> # write some data to the file
> btrfs subvolume snapshot -r /mnt /mnt/snap1
>
> btrfs send /mnt/snap1 -f /tmp/send.data
>
> The send operation will fail with the following error:
>
> ERROR: send ioctl failed with -116: Stale file handle
>
> What happens here is that our snapshot has an orphan inode still visible through the commit root, that corresponds to the tmpfile. However send will attempt to call inode.c:btrfs_iget(), with the goal of reading the file's data, which will return -ESTALE because it will use the current root (and not the commit root) of the snapshot.
>
> Of course, there are other cases where we can get orphans, but this example using a tmpfile makes it much easier to reproduce the issue.
>
> Therefore on snapshot creation, after calling btrfs_orphan_cleanup, if the commit root is different from the current root, just commit the transaction associated with the snapshot's root (if it exists), so that a send will not see any orphans that don't exist anymore. This also guarantees a send will always see the same content regardless of whether a transaction commit happened already before the send was requested and after the orphan cleanup (meaning the commit root and current roots are the same) or it hasn't happened yet (commit and current roots are different).
>
> Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
> ---
>  fs/btrfs/ioctl.c | 29 +
>  1 file changed, 29 insertions(+)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 95194a9..6680ad9 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -712,6 +712,35 @@ static int create_snapshot(struct btrfs_root *root, struct inode *dir,
>  	if (ret)
>  		goto fail;
>
> +	/*
> +	 * If orphan cleanup did remove any orphans, it means the tree was
> +	 * modified and therefore the commit root is not the same as the
> +	 * current root anymore. This is a problem, because send uses the
> +	 * commit root and therefore can see inode items that don't exist
> +	 * in the current root anymore, and for example make calls to
> +	 * btrfs_iget, which will do tree lookups based on the current root
> +	 * and not on the commit root. Those lookups will fail, returning a
> +	 * -ESTALE error, and making send fail with that error. So make sure
> +	 * a send does not see any orphans we have just removed, and that it
> +	 * will see the same inodes regardless of whether a transaction
> +	 * commit happened before it started (meaning that the commit root
> +	 * will be the same as the current root) or not.
> +	 */
> +	if (readonly && pending_snapshot->snap->node !=
> +	    pending_snapshot->snap->commit_root) {
> +		trans = btrfs_join_transaction(pending_snapshot->snap);
> +		if (IS_ERR(trans) && PTR_ERR(trans) != -ENOENT) {
> +			ret = PTR_ERR(trans);
> +			goto fail;
> +		}
> +		if (!IS_ERR(trans)) {
> +			ret = btrfs_commit_transaction(trans,
> +
Re: Snapshot aware defrag and qgroups thoughts
Hi Josef,
thanks for the detailed description of how the extent tree works!
When I was digging through that in the past, I made some slides to remember all the call chains. Maybe somebody finds that useful to accompany your notes.
https://drive.google.com/file/d/0ByBy89zr3kJNNmM5OG5wXzQ3LUE/edit?usp=sharing

Thanks,
Alex.

On Mon, Apr 21, 2014 at 5:55 PM, Josef Bacik jba...@fb.com wrote:
> We have a big problem, but it involves a lot of moving parts, so I'm going to explain all of the parts, and then the problem, and then what I am doing to fix the problem. I want you guys to check my work to make sure I'm not missing something so when I come back from paternity leave in a few weeks I can just sit down and finish this work out.
>
> === Extent refs ===
>
> This is basically how extent refs work. You have either
>
> key.objectid = bytenr;
> key.type = BTRFS_EXTENT_ITEM_KEY;
> key.offset = length;
>
> or you have
>
> key.objectid = bytenr;
> key.type = BTRFS_METADATA_ITEM_KEY;
> key.offset = level of the metadata block;
>
> in the case of skinny metadata. Then you have the extent item which describes the number of refs and such, followed by 1 or more inline refs. All you need to know for this problem is how I'm going to describe them. What I call a normal ref or a full ref is a reference that has the actual root information in the ref. What I call a shared ref is one where we only know the tree block that owns the particular ref. So how does this work in practice?
>
> 1) Normal allocation - metadata: We allocate a tree block as we add new items to a tree. We know that this root owns this tree block so we create a normal ref with the root objectid in the extent ref. We also set the owner of the block itself to our objectid. This is important to keep in mind.
>
> 2) Normal allocation - data: We allocate some data for a given fs tree and we add an extent ref with the root objectid of the tree we are in, the inode number and the logical offset into the inode for this inode.
>
> 3) Splitting a data extent: We write to the middle of an existing extent. We will split this extent into two BTRFS_EXTENT_DATA_KEY items and then increase the ref count of the original extent by 1. This means we look up the extent ref for root->objectid, inode number and the _original_ inode offset. We don't create another extent ref, this is important to keep in mind.
>
> === btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref ===
>
> But Josef, didn't you say there were shared refs? Why yes I did, but I need to explain it in context of the people who actually do the dirty work. We'll start with the easy case
>
> 1) btrfs_copy_root - where snapshots start: When we make a snapshot we call this function, which allocates a completely new block with a new root objectid and then memcpy's the original root we are snapshotting. Then we call btrfs_inc_ref on our new buffer, which will walk all items in that buffer and add a new normal ref to each of those blocks for our new root. This is only at the level below the new root, nothing below that point.
>
> 2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots: These guys are responsible for dealing with the particular action we want to make on our given buffer. So if we are free'ing our buffer, we need to drop any refs it has to the blocks it points to. For level > 0 this means modifying refs for all of the tree blocks it points to. For level == 0 this means modifying refs for any data extents the leaf may point to.
>
> 3) update_ref_for_cow - this is where the magic happens: This has a few different modes of operation, but every operation means we check to see if the block is shared, that is, we see if we have been snapshotted and, if we have been, see if this block has changed since we snapshotted. If it is shared then we look up the extent refs and the flags. If not then we carry on. From here we have a few options.
>
> 3a) Not shared: Don't do anything, we can do our normal cow operations and carry on.
>
> 3b) Shared and cowing from the owning root: This is where the btrfs_header_owner() is important. If we owned this block and it is shared then we know that any of the upper levels won't have a normal ref to anything underneath this block, so we need to add a shared ref for anything this block points to. So the first thing we do is btrfs_inc_ref(), but we set the full backref flag. This means that when we add refs for everything this block points to we don't use a root objectid, we use the bytenr of this block. Then we set BTRFS_BLOCK_FLAG_FULL_BACKREF for the extent flags for this given block.
>
> 3c) Shared and cowing from not the owning root: So if we are cowing down from the snapshot we need to make sure that any block we own completely ourselves has normal refs for any blocks it points to. So we cow down and hit a shared block that we aren't the owner of, so we do btrfs_inc_ref() for our block and
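The two key layouts listed at the top of Josef's notes can be written out as a small helper. This is a sketch under stated assumptions: struct demo_key is a simplified in-memory stand-in for a btrfs key, the demo_* helpers are hypothetical, and the numeric key types (168 and 169) are the values I recall from the btrfs headers and should be checked against the tree in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values of BTRFS_EXTENT_ITEM_KEY and BTRFS_METADATA_ITEM_KEY. */
#define DEMO_EXTENT_ITEM_KEY   168
#define DEMO_METADATA_ITEM_KEY 169

/* Hypothetical, simplified in-memory key. */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Key of the extent item that describes a data extent. */
static struct demo_key demo_data_extent_key(uint64_t bytenr, uint64_t length)
{
	struct demo_key key = { bytenr, DEMO_EXTENT_ITEM_KEY, length };

	assert(length > 0);
	return key;
}

/*
 * Key of the extent item that describes a tree block. With skinny metadata
 * the offset holds the level of the block; otherwise it holds the length,
 * which for a tree block is the node size.
 */
static struct demo_key demo_tree_block_key(uint64_t bytenr, uint64_t nodesize,
					   uint8_t level, bool skinny_metadata)
{
	struct demo_key key = { bytenr, DEMO_EXTENT_ITEM_KEY, nodesize };

	if (skinny_metadata) {
		key.type = DEMO_METADATA_ITEM_KEY;
		key.offset = level;
	}
	return key;
}
```

The helper only builds keys; the extent item and the inline refs that follow it in the tree are not modelled.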
Re: safe/necessary to balance system chunks?
On Fri, Apr 25, 2014 at 10:14 PM, Hugo Mills h...@carfax.org.uk wrote:
> On Fri, Apr 25, 2014 at 02:12:17PM -0400, Austin S Hemmelgarn wrote:
>> On 2014-04-25 13:24, Chris Murphy wrote:
>>> On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote:
>>>> Hi list,
>>>>
>>>> I've got a 3-device RAID1 btrfs filesystem that started out life as single-device.
>>>>
>>>> btrfs fi df:
>>>>
>>>> Data, RAID1: total=1.31TiB, used=1.07TiB
>>>> System, RAID1: total=32.00MiB, used=224.00KiB
>>>> System, DUP: total=32.00MiB, used=32.00KiB
>>>> System, single: total=4.00MiB, used=0.00
>>>> Metadata, RAID1: total=66.00GiB, used=2.97GiB
>>>>
>>>> This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be?
>>>
>>> Since it's system type, it might mean the whole volume is toast if the drive containing those 32KB dies. I'm not sure what kind of information is in system chunk type, but I'd expect it's important enough that if unavailable, mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work?
>>>
>>> Anyway, it's probably a high penalty for losing only 32KB of data. I think this could use some testing to try and reproduce conversions where some amount of system or metadata type chunks are stuck in DUP. This has come up before on the list but I'm not sure how it's happening, as I've never encountered it.
>>
>> As far as I understand it, the system chunks are THE root chunk tree for the entire system, that is to say, it's the tree of tree roots that is pointed to by the superblock. (I would love to know if this understanding is wrong). Thus losing that data almost always means losing the whole filesystem.
>
> From a conversation I had with cmason a while ago, the System chunks contain the chunk tree. They're special because *everything* in the filesystem -- including the locations of all the trees, including the chunk tree and the roots tree -- is positioned in terms of the internal virtual address space. Therefore, when starting up the FS, you can read the superblock (which is at a known position on each device), which tells you the virtual address of the other trees... and you still need to find out where that really is.
>
> The superblock has (I think) a list of physical block addresses at the end of it (sys_chunk_array), which allows you to find the blocks for the chunk tree and work out this mapping, which allows you to find everything else. I'm not 100% certain of the actual format of that array -- it's declared as u8 [2048], so I'm guessing there's a load of casting to something useful going on in the code somewhere.

The format is just a list of pairs:
struct btrfs_disk_key, struct btrfs_chunk
struct btrfs_disk_key, struct btrfs_chunk
...

For each SYSTEM block-group (btrfs_chunk), we need one entry in the sys_chunk_array. During mkfs the first SYSTEM block group is created, for me it's 4MB. So only if the whole chunk tree grows over 4MB, we need to create an additional SYSTEM block group, and then we need to have a second entry in the sys_chunk_array. And so on.

Alex.

> Hugo.
>
> --
> === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
> PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
> --- Is it still called an affair if I'm sleeping with my wife ---
> behind her lover's back?
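The list-of-pairs layout described above can be walked with a simple loop. The sketch below is only a model: the demo_* structs are hypothetical, simplified stand-ins (the real btrfs_disk_key and btrfs_chunk are packed little-endian on-disk structures with more fields), and it assumes that each chunk record carries a stripe count followed by that many stripe records, so entries have variable length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the on-disk structures. */
struct demo_key    { uint64_t objectid; uint8_t type; uint64_t offset; };
struct demo_chunk  { uint64_t length; uint16_t num_stripes; };
struct demo_stripe { uint64_t devid; uint64_t offset; };

/* Appends one (key, chunk, stripes) entry; returns 0, or -1 if it does not fit. */
static int demo_append_entry(uint8_t *array, size_t capacity, size_t *used,
			     uint64_t logical, uint64_t length, uint16_t num_stripes)
{
	struct demo_key key;
	struct demo_chunk chunk;
	size_t stripes = (size_t)num_stripes * sizeof(struct demo_stripe);
	size_t need = sizeof(key) + sizeof(chunk) + stripes;

	assert(array != NULL && used != NULL);
	if (*used > capacity || capacity - *used < need)
		return -1;

	memset(&key, 0, sizeof(key));
	memset(&chunk, 0, sizeof(chunk));
	key.offset = logical;            /* logical start of the SYSTEM chunk */
	chunk.length = length;
	chunk.num_stripes = num_stripes;

	memcpy(array + *used, &key, sizeof(key));
	memcpy(array + *used + sizeof(key), &chunk, sizeof(chunk));
	memset(array + *used + sizeof(key) + sizeof(chunk), 0, stripes);
	*used += need;
	return 0;
}

/* Counts the entries in the first `used` bytes; returns -1 on a truncated entry. */
static int demo_count_entries(const uint8_t *array, size_t used)
{
	size_t pos = 0;
	int count = 0;

	assert(array != NULL || used == 0);
	while (pos < used) {
		struct demo_chunk chunk;
		size_t stripes;

		if (used - pos < sizeof(struct demo_key) + sizeof(chunk))
			return -1;
		pos += sizeof(struct demo_key);
		memcpy(&chunk, array + pos, sizeof(chunk));
		pos += sizeof(chunk);
		stripes = (size_t)chunk.num_stripes * sizeof(struct demo_stripe);
		if (used - pos < stripes)
			return -1;
		pos += stripes;
		count++;
	}
	return count;
}
```

The walker copies each record out with memcpy instead of casting the byte array, which avoids unaligned accesses; a reader of the real array would additionally have to convert the little-endian fields.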
Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv
Hi Filipe,
I finally got to debug this deeper. As it turns out, this happens only if both nospace_cache and clear_cache are specified. You need to unmount and mount again to cause this. After mounting, due to clear_cache, all the block-groups are marked as BTRFS_DC_CLEAR, and then cache_save_setup() is called on them (this function is called only in case of BTRFS_DC_CLEAR). So cache_save_setup() goes ahead and creates the free-space inode. But then it realizes that it was mounted with nospace_cache, so it does not put any content in the inode. But the inode itself gets created. The patch that fixes this for me:

alex@ubuntu-alex:/mnt/work/alex/linux-stable/source$ git diff -U10
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d170412..06f876e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2941,20 +2941,26 @@ again:
 		goto out;
 	}

 	if (IS_ERR(inode)) {
 		BUG_ON(retries);
 		retries++;

 		if (block_group->ro)
 			goto out_free;

+		/* with nospace_cache avoid creating the free-space inode */
+		if (!btrfs_test_opt(root, SPACE_CACHE)) {
+			dcs = BTRFS_DC_WRITTEN;
+			goto out_free;
+		}
+
 		ret = create_free_space_inode(root, trans, block_group, path);
 		if (ret)
 			goto out_free;
 		goto again;
 	}

 	/* We've already setup this transaction, go ahead and exit */
 	if (block_group->cache_generation == trans->transid &&
 	    i_size_read(inode)) {
 		dcs = BTRFS_DC_SETUP;

Thanks,
Alex.

On Wed, Nov 6, 2013 at 3:19 PM, Filipe David Manana fdman...@gmail.com wrote:
> On Mon, Nov 4, 2013 at 12:16 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote:
>> Hi Filipe,
>> any luck with this patch?:)
>
> Hey Alex,
>
> I haven't dug further, but I remember I couldn't reproduce your issue (with latest btrfs-next of that day) of getting the free space inodes created even when mount option nospace_cache is given.
>
> What kernel were you using?
>
>> Alex.
>>
On Wed, Oct 23, 2013 at 5:26 PM, Filipe David Manana fdman...@gmail.com wrote:
On Wed, Oct 23, 2013 at 3:14 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote:

Hello,

On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana fdman...@gmail.com wrote:
On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote:

Hi Filipe,

On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana fdman...@gmail.com wrote:

This issue is simple to reproduce and observe if kmemleak is enabled. Two simple ways to reproduce it:

** 1

$ mkfs.btrfs -f /dev/loop0
$ mount /dev/loop0 /mnt/btrfs
$ btrfs balance start /mnt/btrfs
$ umount /mnt/btrfs

So here it seems that the leak can only happen in case the block-group has a free-space inode. This is what the orphan item is added for.

Yes, here kmemleak reports.

But: if space_cache option is disabled (and nospace_cache enabled), it seems that btrfs still creates the FREE_SPACE inodes, although they are empty, because in cache_save_setup:

	inode = lookup_free_space_inode(root, block_group, path);
	if (IS_ERR(inode) && PTR_ERR(inode) != -ENOENT) {
		ret = PTR_ERR(inode);
		btrfs_release_path(path);
		goto out;
	}

	if (IS_ERR(inode)) {
		...
		ret = create_free_space_inode(root, trans, block_group, path);

and only later it actually sets BTRFS_DC_WRITTEN if space_cache option is disabled. Amazing! Although this is a different issue, do you know perhaps why these empty inodes are needed?

Don't know if they are needed. But you have a point, it seems odd to create the free space cache inode if mount option nospace_cache was supplied.

Thanks Alex. Testing the following patch:

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index c43ee8a..eb1b7da 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3162,6 +3162,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group,
 	int retries = 0;
 	int ret = 0;
 
+	if (!btrfs_test_opt(root, SPACE_CACHE))
+		return 0;
+
 	/*
 	 * If this block group is smaller than 100 megs don't bother caching the
 	 * block group.

Thanks!
Alex.

** 2

$ mkfs.btrfs -f /dev/loop0
$ mount /dev/loop0 /mnt/btrfs
$ touch /mnt/btrfs/foobar
$ rm -f /mnt/btrfs/foobar
$ umount /mnt/btrfs

I tried the second repro script on kernel 3.8.13, and kmemleak does not report a leak (even if I force the kmemleak scan). I did not try the balance-repro script, though. Am I missing something?

Maybe it's not an issue on 3.8.13 and older releases. This was on btrfs-next from August 19. Thanks for testing.

Thanks,
Alex.

After a while, kmemleak reports the leak:

$ cat /sys/kernel/debug/kmemleak
unreferenced object
Re: snapshot send with parent question
Michael,

btrfs-send doesn't really know or care how you managed to get from a to c. It is able to compare any two RO subvolumes (not necessarily related by snapshot operations), and produce a stream of commands that transforms a-content into c-content.

Send assumes that on the receive side you have a snapshot identical to a. The receive side locates the a-snapshot (by its received_UUID field) and creates a RW snapshot out of it. It then applies the stream of commands (in strict order), and at the end sets the RW snapshot to be RO. At this point, this snapshot should be identical to c.

The stream of commands most probably will not be identical to the operations that you did in order to get from a to c. But it will transform a-content into c-content (possible bugs aside), which is what's important.

Of course, if a and c are related via snapshot operations, then btrfs-send will be much more efficient, in that it will be able to skip entire btrfs subtrees (look at btrfs_compare_trees), thus avoiding many additional comparisons that some other tool like rsync would have done.

Thanks,
Alex.

On Sun, Apr 20, 2014 at 1:00 AM, Michael Welsh Duggan m...@md5i.com wrote:

Assume the following scenario: There exists a read-only snapshot called a. A read-write snapshot called b is created from a, and is then modified. A read-only snapshot of b is created, called c. A btrfs send is done for c, with a marked as its parent.

Will the send data only contain the differences between a and c? My experiments seem to indicate no, but I have no confidence that I am not doing something else incorrectly.

Also, when a btrfs receive gets a stream containing the differences between a (parent) and c, does it only look at the relative pathname differences between a and c in order to determine the matching parent on the receiving side?
-- Michael Welsh Duggan (m...@md5i.com)
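The reply above says the receive side finds the parent through a received_UUID field rather than through path names. A minimal C sketch of that kind of lookup follows. It is a toy model, not the btrfs-progs implementation: the subvol struct, the find_parent name and the flat array of subvolumes are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define UUID_SIZE 16

/* Hypothetical in-memory view of a subvolume on the receive side. */
struct subvol {
	const char *name;
	unsigned char uuid[UUID_SIZE];		/* local identity */
	unsigned char received_uuid[UUID_SIZE];	/* identity on the sender */
};

/*
 * Return the local subvolume whose received_uuid matches the parent
 * UUID named in the stream, or NULL if no such subvolume exists.
 * The subvolume's name and location play no part in the match.
 */
static const struct subvol *find_parent(const struct subvol *subvols,
					size_t n,
					const unsigned char *parent_uuid)
{
	for (size_t i = 0; i < n; i++) {
		if (memcmp(subvols[i].received_uuid, parent_uuid,
			   UUID_SIZE) == 0)
			return &subvols[i];
	}
	return NULL;
}
```

Under this model, renaming or moving the received copy of a on the destination would not stop an incremental stream from finding it, while a missing match would make the receive fail.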
Re: [PATCH] Btrfs: correctly determine if blocks are shared in btrfs_compare_trees
Hi Filipe,

Can you please explain in more detail the scenario you are worried about?

Let's say we have two FS trees (subvolumes), subv1 and subv2, subv2 being a RO snapshot of subv1, and they have a shared subtree at logical==X. Now we change subv1, so its subtree is COW'ed and some other logical address (Y) is allocated for the subtree root. But X still cannot be reused as long as subv2 exists. That's the essence of the extent tree providing a refcount for each tree/data block in the FS, no?

Now finally we delete subv2 and block X is freed. So it can be reallocated as the root of another subtree. And then it might be snapshotted again and shared as before.

So where do you see a problem? If we have two FS-tree subtrees starting at the same logical=X, how can they be different? This would mean we allocated logical=X again while it was still in use, which is very very bad.

Am I missing something here?

Thanks,
Alex.

P.S.: by logical I (and hopefully you) mean the extent-tree level addresses, i.e., if we have a tree block with logical=X, then we also have an EXTENT_ITEM with key (X, EXTENT_ITEM, nodesize/leafsize).

On Fri, Feb 21, 2014 at 12:15 AM, Filipe David Borba Manana fdman...@gmail.com wrote:

Just comparing the pointers (logical disk addresses) of the btree nodes is not completely bullet proof, we have to check if their generation numbers match too.

It is guaranteed that a COW operation will result in a block with a different logical disk address than the original block's address, but over time we can reuse that former logical disk address.

For example, creating a 2Gb filesystem on a loop device, and having a script running in a loop always updating the access timestamp of a file, resulted in the same logical disk address being reused for the same fs btree block in about only 4 minutes.

This could make us skip entire subtrees when doing an incremental send (which is currently the only user of btrfs_compare_trees). However the odds of getting 2 blocks at the same tree level, with the same logical disk address, equal first slot keys and different generations, should hopefully be very low.

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
 fs/btrfs/ctree.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index cbd3a7d..88d1b1e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -5376,6 +5376,8 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 	int advance_right;
 	u64 left_blockptr;
 	u64 right_blockptr;
+	u64 left_gen;
+	u64 right_gen;
 	u64 left_start_ctransid;
 	u64 right_start_ctransid;
 	u64 ctransid;
@@ -5640,7 +5642,14 @@ int btrfs_compare_trees(struct btrfs_root *left_root,
 			right_blockptr = btrfs_node_blockptr(
 					right_path->nodes[right_level],
 					right_path->slots[right_level]);
-			if (left_blockptr == right_blockptr) {
+			left_gen = btrfs_node_ptr_generation(
+					left_path->nodes[left_level],
+					left_path->slots[left_level]);
+			right_gen = btrfs_node_ptr_generation(
+					right_path->nodes[right_level],
+					right_path->slots[right_level]);
+			if (left_blockptr == right_blockptr &&
+			    left_gen == right_gen) {
 				/*
 				 * As we're on a shared block, don't
 				 * allow to go deeper.
--
1.7.9.5
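The check that the patch above introduces can be reduced to a small predicate. The following C sketch only illustrates the idea; node_ptr and blocks_shared are hypothetical names, and the struct stands in for the block pointer and generation that the real code reads out of the parent node.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one child pointer stored in a btree node. */
struct node_ptr {
	uint64_t blockptr;	/* logical address of the child block */
	uint64_t generation;	/* transaction id that wrote the child */
};

/*
 * Two subtrees are treated as shared only when both the logical
 * address and the generation match. An address that was freed and
 * later reused carries a newer generation, so it no longer compares
 * equal to the old block.
 */
static bool blocks_shared(const struct node_ptr *left,
			  const struct node_ptr *right)
{
	return left->blockptr == right->blockptr &&
	       left->generation == right->generation;
}
```

With the address-only comparison, a reused address would make the comparison skip a subtree that has in fact changed; adding the generation closes that case at the cost of one more field read per pointer.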
Re: [PATCH] Btrfs: attach delayed ref updates to delayed ref heads
Hi Josef,

I have a question about update_existing_head_ref() logic. The question is not specific to the rework that you have done.

You have code like this:

	if (ref->must_insert_reserved) {
		/* if the extent was freed and then
		 * reallocated before the delayed ref
		 * entries were processed, we can end up
		 * with an existing head ref without
		 * the must_insert_reserved flag set.
		 * Set it again here
		 */
		existing_ref->must_insert_reserved = ref->must_insert_reserved;

		/*
		 * update the num_bytes so we make sure the accounting
		 * is done correctly
		 */
		existing->num_bytes = update->num_bytes;
	}

How can it happen that you have a delayed_ref head for a particular bytenr, and then somebody wants to add a ref head for the same bytenr with must_insert_reserved=true? How could he have possibly allocated the same bytenr from the free-space cache?

I know that when an extent is freed by __btrfs_free_extent(), it calls update_block_groups(), which pins down the extent. So this extent will be dropped into the free-space cache only on transaction commit, when all delayed refs have been processed already.

The only close case that I see is in btrfs_free_tree_block(), where it adds a BTRFS_DROP_DELAYED_REF, and then if check_ref_cleanup()==0 and BTRFS_HEADER_FLAG_WRITTEN is not set, it drops the extent directly into the free-space cache. However, check_ref_cleanup() would have deleted the ref head, so we would not have found an existing ref head.

Can you please give a clue on this?

Thanks!
Alex.

On Thu, Jan 23, 2014 at 5:28 PM, Josef Bacik jba...@fb.com wrote:

Currently we have two rb-trees, one for delayed ref heads and one for all of the delayed refs, including the delayed ref heads. When we process the delayed refs we have to hold onto the delayed ref lock for all of the selecting and merging and such, which results in quite a bit of lock contention. This was solved by having a waitqueue and only one flusher at a time, however this hurts if we get a lot of delayed refs queued up.
So instead just have an rb tree for the delayed ref heads, and then attach the delayed ref updates to an rb tree that is per delayed ref head. Then we only need to take the delayed ref lock when adding new delayed refs and when selecting a delayed ref head to process, all the rest of the time we deal with a per delayed ref head lock which will be much less contentious. The locking rules for this get a little more complicated since we have to lock up to 3 things to properly process delayed refs, but I will address that problem later. For now this passes all of xfstests and my overnight stress tests. Thanks, Signed-off-by: Josef Bacik jba...@fb.com --- fs/btrfs/backref.c | 23 ++-- fs/btrfs/delayed-ref.c | 223 +- fs/btrfs/delayed-ref.h | 23 ++-- fs/btrfs/disk-io.c | 79 ++-- fs/btrfs/extent-tree.c | 317 - fs/btrfs/transaction.c | 7 +- 6 files changed, 267 insertions(+), 405 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 835b6c9..34a8952 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -538,14 +538,13 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq, if (extent_op extent_op-update_key) btrfs_disk_key_to_cpu(op_key, extent_op-key); - while ((n = rb_prev(n))) { + spin_lock(head-lock); + n = rb_first(head-ref_root); + while (n) { struct btrfs_delayed_ref_node *node; node = rb_entry(n, struct btrfs_delayed_ref_node, rb_node); - if (node-bytenr != head-node.bytenr) - break; - WARN_ON(node-is_head); - + n = rb_next(n); if (node-seq seq) continue; @@ -612,10 +611,10 @@ static int __add_delayed_refs(struct btrfs_delayed_ref_head *head, u64 seq, WARN_ON(1); } if (ret) - return ret; + break; } - - return 0; + spin_unlock(head-lock); + return ret; } /* @@ -882,15 +881,15 @@ again: btrfs_put_delayed_ref(head-node); goto again; } + spin_unlock(delayed_refs-lock); ret = __add_delayed_refs(head, time_seq, prefs_delayed); mutex_unlock(head-mutex); - if (ret) { - spin_unlock(delayed_refs-lock); + if (ret) goto out; - } +
Re: [PATCH] Btrfs: fix memory leak in btrfs_create_tree()
Hi Tsutomu Itoh,

On Thu, Mar 21, 2013 at 6:32 AM, Tsutomu Itoh t-i...@jp.fujitsu.com wrote:

We should free leaf and root before returning from the error handling code.

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 fs/btrfs/disk-io.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7d84651..b1b5baa 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1291,6 +1291,7 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 				      0, objectid, NULL, 0, 0, 0);
 	if (IS_ERR(leaf)) {
 		ret = PTR_ERR(leaf);
+		leaf = NULL;
 		goto fail;
 	}
 
@@ -1334,11 +1335,16 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 
 	btrfs_tree_unlock(leaf);
 
+	return root;
+
 fail:
-	if (ret)
-		return ERR_PTR(ret);
+	if (leaf) {
+		btrfs_tree_unlock(leaf);
+		free_extent_buffer(leaf);

I believe this is not enough. A few lines above, another reference on the root is taken by

	root->commit_root = btrfs_root_node(root);

So I believe the proper fix would be:

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d9698fd..260af79 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1354,10 +1354,10 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 	return root;
 
 fail:
-	if (leaf) {
+	if (leaf)
 		btrfs_tree_unlock(leaf);
-		free_extent_buffer(leaf);
-	}
+	free_extent_buffer(root->node);
+	free_extent_buffer(root->commit_root);
 	kfree(root);
 
 	return ERR_PTR(ret);

Thanks,
Alex.

+	}
+	kfree(root);
 
-	return root;
+	return ERR_PTR(ret);
 }
 
 static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans,
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
Thanks, Miao,

so the problem is that cow_file_range() joins the transaction, allocates space through btrfs_reserve_extent(), then detaches from the transaction. And then btrfs_finish_ordered_io() joins the transaction again, adds a delayed ref and detaches from the transaction. So if the transaction commits between these two and we then crash, then yes, the allocation is lost.

Alex.

On Tue, Mar 4, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
On sat, 1 Mar 2014 20:05:01 +0200, Alex Lyakas wrote:

Hi Miao,

On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:

When we mounted the filesystem after the crash, we got the following message:

BTRFS error (device xxx): block group 4315938816 has wrong amount of free space
BTRFS error (device xxx): failed to load free space cache for block group 4315938816

It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree or the free space cache. When we wrote out the free space cache at this time, those spaces were lost.

Can you please clarify more about the problem? So I understand that we allocate something from the free space cache. So after that, the free space cache does not account for this extent anymore. On the other hand the extent tree also does not account for it (yet). We need to add a delayed reference, which will be run at transaction commit and update the extent tree. But the free-space cache is also written out during transaction commit. So how does the issue happen? Can you perhaps post a flow of events?

Task1                                   Task2
btrfs_writepages()
  alloc data space
    (The allocated space was
     removed from the free
     space cache)
  submit_bio()
                                        btrfs_commit_transaction()
                                          write out space cache
                                          ...
                                        commit transaction completed
system crash
  (end_io() wasn't executed)

The system crashed before end_io was executed, so no file references the allocated space, and the extent tree also does not account for it. That space was lost.

Thanks
Miao

Thanks.
Alex.

In order to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out.

Only data block groups had this problem, so the above change is just for data space allocation.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/ctree.h            | 15 ++-
 fs/btrfs/disk-io.c          |  2 +-
 fs/btrfs/extent-tree.c      | 24
 fs/btrfs/free-space-cache.c | 42 ++
 fs/btrfs/inode.c            | 42 +++---
 5 files changed, 108 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 1667c9a..f58e1f7 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache {
 	/* free space cache stuff */
 	struct btrfs_free_space_ctl *free_space_ctl;
 
+	/*
+	 * It is used to record the extents that are allocated for
+	 * the data, but don/t update its metadata.
+	 */
+	struct extent_io_tree pinned_extents;
+
 	/* block group cache stuff */
 	struct rb_node cache_node;
 
@@ -1540,6 +1546,13 @@ struct btrfs_fs_info {
 	 */
 	struct list_head space_info;
 
+	/*
+	 * It is just used for the delayed data space allocation
+	 * because only the data space allocation can be done during
+	 * we write out the free space cache.
+	 */
+	struct rw_semaphore data_rwsem;
+
 	struct btrfs_space_info *data_sinfo;
 
 	struct reloc_control *reloc_ctl;
@@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans,
 				   struct btrfs_key *ins);
 int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes,
 			 u64 min_alloc_size, u64 empty_size, u64 hint_byte,
-			 struct btrfs_key *ins, int is_data);
+			 struct btrfs_key *ins, int is_data, bool need_pin);
 int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
 		  struct extent_buffer *buf, int full_backref, int for_cow);
 int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root,
diff --git a/fs/btrfs/disk
Re: [RFC PATCH 5/5] Btrfs: fix broken free space cache after the system crashed
Hi Miao, On Wed, Jan 15, 2014 at 2:00 PM, Miao Xie mi...@cn.fujitsu.com wrote: When we mounted the filesystem after the crash, we got the following message: BTRFS error (device xxx): block group 4315938816 has wrong amount of free space BTRFS error (device xxx): failed to load free space cache for block group 4315938816 It is because we didn't update the metadata of the allocated space until the file data was written into the disk. During this time, there was no information about the allocated spaces in either the extent tree nor the free space cache. when we wrote out the free space cache at this time, those spaces were lost. Can you please clarify more about the problem? So I understand that we allocate something from the free space cache. So after that, the free space cache does not account for this extent anymore. On the other hand the extent tree also does not account for it (yet). We need to add a delayed reference, which will be run at transaction commit and update the extent tree. But free-space cache is also written out during transaction commit. So how the issue happens? Can you perhaps post a flow of events? Thanks. Alex. In ordered to fix this problem, I use a state tree for every block group to record those allocated spaces. We record the information when they are allocated, and clean up the information after the metadata update. Besides that, we also introduce a read-write semaphore to avoid the race between the allocation and the free space cache write out. Only data block groups had this problem, so the above change is just for data space allocation. 
Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ctree.h| 15 ++- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent-tree.c | 24 fs/btrfs/free-space-cache.c | 42 ++ fs/btrfs/inode.c| 42 +++--- 5 files changed, 108 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 1667c9a..f58e1f7 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1244,6 +1244,12 @@ struct btrfs_block_group_cache { /* free space cache stuff */ struct btrfs_free_space_ctl *free_space_ctl; + /* +* It is used to record the extents that are allocated for +* the data, but don/t update its metadata. +*/ + struct extent_io_tree pinned_extents; + /* block group cache stuff */ struct rb_node cache_node; @@ -1540,6 +1546,13 @@ struct btrfs_fs_info { */ struct list_head space_info; + /* +* It is just used for the delayed data space allocation +* because only the data space allocation can be done during +* we write out the free space cache. +*/ + struct rw_semaphore data_rwsem; + struct btrfs_space_info *data_sinfo; struct reloc_control *reloc_ctl; @@ -3183,7 +3196,7 @@ int btrfs_alloc_logged_file_extent(struct btrfs_trans_handle *trans, struct btrfs_key *ins); int btrfs_reserve_extent(struct btrfs_root *root, u64 num_bytes, u64 min_alloc_size, u64 empty_size, u64 hint_byte, -struct btrfs_key *ins, int is_data); +struct btrfs_key *ins, int is_data, bool need_pin); int btrfs_inc_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, int full_backref, int for_cow); int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 8072cfa..426b558 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2276,7 +2276,6 @@ int open_ctree(struct super_block *sb, fs_info-pinned_extents = fs_info-freed_extents[0]; fs_info-do_barriers = 1; - mutex_init(fs_info-ordered_operations_mutex); mutex_init(fs_info-ordered_extent_flush_mutex); mutex_init(fs_info-tree_log_mutex); @@ 
-2287,6 +2286,7 @@ int open_ctree(struct super_block *sb, init_rwsem(fs_info-extent_commit_sem); init_rwsem(fs_info-cleanup_work_sem); init_rwsem(fs_info-subvol_sem); + init_rwsem(fs_info-data_rwsem); sema_init(fs_info-uuid_tree_rescan_sem, 1); fs_info-dev_replace.lock_owner = 0; atomic_set(fs_info-dev_replace.nesting_level, 0); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 3664cfb..7b07876 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6173,7 +6173,7 @@ enum btrfs_loop_type { static noinline int find_free_extent(struct btrfs_root *orig_root, u64 num_bytes, u64 empty_size, u64 hint_byte, struct btrfs_key *ins, -
Re: [PATCH] Btrfs: fix a deadlock on chunk mutex
Hello Josef,

On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik jba...@fusionio.com wrote:
On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote:

A user reported that he has hit an annoying deadlock while playing with ceph based on btrfs.

Currently, updating the device tree requires space from a METADATA chunk, so we -may- need to do a recursive chunk allocation when adding/updating a dev extent; that is where the deadlock comes from.

If we use SYSTEM metadata to update the device tree, we can avoid the recursive stuff.

This is going to cause us to allocate much more system chunks than we used to which could land us in trouble. Instead let's just keep us from re-entering if we're already allocating a chunk. We do the chunk allocation when we don't have enough space for a cluster, but we'll likely have plenty of space to make an allocation. Can you give this patch a try Jim and see if it fixes your problem? Thanks,

Josef

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e152809..59df5e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	int wait_for_alloc = 0;
 	int ret = 0;
 
+	/* Don't re-enter if we're already allocating a chunk */
+	if (trans->allocating_chunk)
+		return -ENOSPC;
+
 	space_info = __find_space_info(extent_root->fs_info, flags);
 	if (!space_info) {
 		ret = update_space_info(extent_root->fs_info, flags,
@@ -3606,6 +3610,8 @@ again:
 		goto again;
 	}
 
+	trans->allocating_chunk = true;
+
 	/*
 	 * If we have mixed data/metadata chunks we want to make sure we keep
 	 * allocating mixed chunks instead of individual chunks.
@@ -3632,6 +3638,7 @@ again:
 	check_system_chunk(trans, extent_root, flags);
 
 	ret = btrfs_alloc_chunk(trans, extent_root, flags);
+	trans->allocating_chunk = false;
 	if (ret < 0 && ret != -ENOSPC)
 		goto out;
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index e6509b9..47ad8be 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -388,6 +388,7 @@ again:
 	h->qgroup_reserved = qgroup_reserved;
 	h->delayed_ref_elem.seq = 0;
 	h->type = type;
+	h->allocating_chunk = false;
 	INIT_LIST_HEAD(&h->qgroup_ref_list);
 	INIT_LIST_HEAD(&h->new_bgs);
 
diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h
index 0e8aa1e..69700f7 100644
--- a/fs/btrfs/transaction.h
+++ b/fs/btrfs/transaction.h
@@ -68,6 +68,7 @@ struct btrfs_trans_handle {
 	struct btrfs_block_rsv *orig_rsv;
 	short aborted;
 	short adding_csums;
+	bool allocating_chunk;
 	enum btrfs_trans_type type;
 	/*
 	 * this root is only needed to validate that the root passed to

I hit this problem in the following scenario:
- a data chunk allocation is triggered, and locks chunk_mutex
- the same thread now also wants to allocate a metadata chunk, so it recursively calls do_chunk_alloc, but cannot lock the chunk_mutex => deadlock
- btrfs has only one metadata chunk, the one that was initially allocated by mkfs; it has:
  total_bytes=8388608 bytes_used=8130560 bytes_pinned=77824 bytes_reserved=180224
  so bytes_used + bytes_pinned + bytes_reserved == total_bytes

Your patch would have returned ENOSPC and avoided the deadlock, but there would be a failure to allocate a tree block for metadata. So the transaction would probably have aborted.

How should such a situation be handled?

Idea1:
- lock chunk mutex,
- if we are allocating a data chunk, check whether the metadata space is below some threshold. If yes, go and allocate a metadata chunk first and only then a data chunk.

Idea2:
- check if we are the same thread that already locked the chunk mutex. If yes, allow the recursive call but don't attempt to lock/unlock the chunk_mutex this time

Or some other way?

Thanks!
Alex.
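The scenario above can be modelled in a few lines of C. This is a toy sketch, not kernel code: space_info, trans_handle, space_left and chunk_alloc are names made up here, and the nested allocation is simulated by a direct recursive call. It shows two things from the message: the quoted counters leave exactly zero free bytes in the metadata chunk, and a re-entrancy flag of the kind the patch adds turns the nested allocation into -ENOSPC instead of a deadlock.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the counters quoted in the message. */
struct space_info {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_pinned;
	uint64_t bytes_reserved;
};

/* Bytes still available for new allocations (0 when fully consumed). */
static uint64_t space_left(const struct space_info *s)
{
	uint64_t busy = s->bytes_used + s->bytes_pinned + s->bytes_reserved;

	return busy >= s->total_bytes ? 0 : s->total_bytes - busy;
}

/* Toy transaction handle carrying the re-entrancy flag. */
struct trans_handle {
	bool allocating_chunk;
};

/*
 * Toy chunk allocator. If the allocation itself needs metadata space,
 * it tries a nested allocation and stores that result in *nested_ret.
 * The flag makes the nested call fail fast instead of blocking on a
 * mutex the same thread already holds.
 */
static int chunk_alloc(struct trans_handle *t, bool needs_metadata,
		       int *nested_ret)
{
	if (t->allocating_chunk)
		return -ENOSPC;		/* re-entry refused */

	t->allocating_chunk = true;
	if (needs_metadata)
		*nested_ret = chunk_alloc(t, false, nested_ret);
	t->allocating_chunk = false;
	return 0;
}
```

In this model the outer call succeeds while the nested one reports -ENOSPC, which matches the concern raised above: the deadlock is gone, but with zero bytes left the metadata allocation still has nowhere to go.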
Re: [PATCH] Btrfs: fix a deadlock on chunk mutex
Hi Josef, is this the commit to look at: 6df9a95e63395f595d0d1eb5d561dd6c91c40270 Btrfs: make the chunk allocator completely tree lockless or some other commits are also relevant? Alex. On Tue, Feb 18, 2014 at 6:06 PM, Josef Bacik jba...@fb.com wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 02/18/2014 10:47 AM, Alex Lyakas wrote: Hello Josef, On Tue, Dec 18, 2012 at 3:52 PM, Josef Bacik jba...@fusionio.com wrote: On Wed, Dec 12, 2012 at 06:52:37PM -0700, Liu Bo wrote: An user reported that he has hit an annoying deadlock while playing with ceph based on btrfs. Current updating device tree requires space from METADATA chunk, so we -may- need to do a recursive chunk allocation when adding/updating dev extent, that is where the deadlock comes from. If we use SYSTEM metadata to update device tree, we can avoid the recursive stuff. This is going to cause us to allocate much more system chunks than we used to which could land us in trouble. Instead let's just keep us from re-entering if we're already allocating a chunk. We do the chunk allocation when we don't have enough space for a cluster, but we'll likely have plenty of space to make an allocation. Can you give this patch a try Jim and see if it fixes your problem? Thanks, Josef diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e152809..59df5e7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3564,6 +3564,10 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans, int wait_for_alloc = 0; int ret = 0; + /* Don't re-enter if we're already allocating a chunk */ + if (trans-allocating_chunk) + return -ENOSPC; + space_info = __find_space_info(extent_root-fs_info, flags); if (!space_info) { ret = update_space_info(extent_root-fs_info, flags, @@ -3606,6 +3610,8 @@ again: goto again; } + trans-allocating_chunk = true; + /* * If we have mixed data/metadata chunks we want to make sure we keep * allocating mixed chunks instead of individual chunks. 
@@ -3632,6 +3638,7 @@ again: check_system_chunk(trans, extent_root, flags); ret = btrfs_alloc_chunk(trans, extent_root, flags); + trans-allocating_chunk = false; if (ret 0 ret != -ENOSPC) goto out; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index e6509b9..47ad8be 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -388,6 +388,7 @@ again: h-qgroup_reserved = qgroup_reserved; h-delayed_ref_elem.seq = 0; h-type = type; + h-allocating_chunk = false; INIT_LIST_HEAD(h-qgroup_ref_list); INIT_LIST_HEAD(h-new_bgs); diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 0e8aa1e..69700f7 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -68,6 +68,7 @@ struct btrfs_trans_handle { struct btrfs_block_rsv *orig_rsv; short aborted; short adding_csums; + bool allocating_chunk; enum btrfs_trans_type type; /* * this root is only needed to validate that the root passed to I hit this problem in a following scenario: - a data chunk allocation is triggered, and locks chunk_mutex - the same thread now also wants to allocate a metadata chunk, so it recursively calls do_chunk_alloc, but cannot lock the chunk_mutex = deadlock - btrfs has only one metadata chunk, the one that was initially allocated by mkfs, it has: total_bytes=8388608 bytes_used=8130560 bytes_pinned=77824 bytes_reserved=180224 so bytes_used + bytes_pinned + bytes_reserved == total_bytes Your patch would have returned ENOSPC and avoid the deadlock, but there would be a failure to allocate a tree block for metadata. So the transaction would have probably aborted. How such situation should be handled? Idea1: - lock chunk mutex, - if we are allocating a data chunk, check whether the metadata space is below some threshold. If yes, go and allocate a metadata chunk first and then only a data chunk. Idea2: - check if we are the same thread that already locked the chunk mutex. 
If yes, allow recursive call but don't attempt to lock/unlock the chunk_mutex this time Or some other way? I fixed this with the delayed chunk allocation stuff which doesn't actually do the block group creation stuff until we end the transaction, so we can allocate metadata chunks without any issue. Thanks, Josef -BEGIN PGP SIGNATURE- Version: GnuPG v1 Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQIcBAEBAgAGBQJTA4UMAAoJEANb+wAKly3B+KEP/RdlEyJWydetjQxllF0cgHY1 UraqWBl+mSSHlwZlHyGjmAu6cK6n+QfTZtdIBhihdY50UcvMuWtVmz2JzlbxeO5+ 88dBevADmW+QQoRl0yyQgnjlLWm+LvMTgOd1r+DZqlGs6sdX05dMI207+fQOW+c4 P+UKbT/eUYRVC4K//J1GUk4Yh3Q70U25321RWCehSUciwDVJO2LztD9VBAgh3qUc o5uh5syshS3RbEi0hnUQ8tDKXWvdZQBA2RF4loXACCmQO95e84mxVpoYPd9S1yYs J+wf+Bak5hKZxmXJkOVcjLj4GsVQFJWTBTj6FvOFrm5TAFEGSyzrEzL8xW361+VS I1q8GPSVN1fGKkVypddylLIXLHmqXb57UElvGhoBM0otxNd8+xfSpLZ045vv5qLx RKwhJI1gIWD59kBre0fdSkUJZDeYSmLWOiwG6hG3A7Yy93c6/1RLHRnHq5NEe12R
Re: [PATCH] btrfs: Fix 32/64-bit problem with BTRFS_SET_RECEIVED_SUBVOL ioctl
Hello Hugo,

Is this issue specific to the receive ioctl? Because what you are
describing applies to any user-kernel ioctl-based interface when you
compile the user-space as 32-bit while the kernel space has been
compiled as 64-bit. For that purpose, I believe, there exists the
compat_ioctl callback. Its implementation should do thunking, i.e.,
treat the user-space structure as if it were compiled for 32-bit, and
unpack it properly.

Today, however, btrfs supplies the same callback both for
unlocked_ioctl and compat_ioctl. So we should see the same problem with
all ioctls, if I am not missing anything.

Thanks,
Alex.

On Mon, Jan 6, 2014 at 10:50 AM, Hugo Mills <h...@carfax.org.uk> wrote:
On Sun, Jan 05, 2014 at 06:26:11PM +, Hugo Mills wrote:
On Sun, Jan 05, 2014 at 05:55:27PM +, Hugo Mills wrote:

The structure for BTRFS_SET_RECEIVED_IOCTL packs differently on 32-bit
and 64-bit systems. This means that it is impossible to use btrfs
receive on a system with a 64-bit kernel and 32-bit userspace, because
the structure size (and hence the ioctl number) is different.

This patch adds a compatibility structure and ioctl to deal with the
above case.

Oops, forgot to mention -- this has been compile tested, but not
actually run yet. The machine in question is several miles away and is
a production machine (it's my work desktop, and I can't afford much
downtime on it).

... And it doesn't even compile properly, now I come to build a .deb.
I'm still interested in comments about the general approach, but the
specific patch is a load of balls.

Hugo.

Hugo.
Signed-off-by: Hugo Mills h...@carfax.org.uk --- fs/btrfs/ioctl.c | 95 +++- 1 file changed, 87 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 21da576..e186439 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -57,6 +57,32 @@ #include send.h #include dev-replace.h +#ifdef CONFIG_64BIT +/* If we have a 32-bit userspace and 64-bit kernel, then the UAPI + * structures are incorrect, as the timespec structure from userspace + * is 4 bytes too small. We define these alternatives here to teach + * the kernel about the 32-bit struct packing. + */ +struct btrfs_ioctl_timespec { + __u64 sec; + __u32 nsec; +} ((__packed__)); + +struct btrfs_ioctl_received_subvol_args { + charuuid[BTRFS_UUID_SIZE]; /* in */ + __u64 stransid; /* in */ + __u64 rtransid; /* out */ + struct btrfs_ioctl_timespec stime; /* in */ + struct btrfs_ioctl_timespec rtime; /* out */ + __u64 flags; /* in */ + __u64 reserved[16]; /* in */ +} ((__packed__)); +#endif + +#define BTRFS_IOC_SET_RECEIVED_SUBVOL_32 _IOWR(BTRFS_IOCTL_MAGIC, 37, \ + struct btrfs_ioctl_received_subvol_args_32) + + static int btrfs_clone(struct inode *src, struct inode *inode, u64 off, u64 olen, u64 olen_aligned, u64 destoff); @@ -4313,10 +4339,69 @@ static long btrfs_ioctl_quota_rescan_wait(struct file *file, void __user *arg) return btrfs_qgroup_wait_for_completion(root-fs_info); } +#ifdef CONFIG_64BIT +static long btrfs_ioctl_set_received_subvol_32(struct file *file, + void __user *arg) +{ + struct btrfs_ioctl_received_subvol_args_32 *args32 = NULL; + struct btrfs_ioctl_received_subvol_args *args64 = NULL; + int ret = 0; + + args32 = memdup_user(arg, sizeof(*args32)); + if (IS_ERR(args32)) { + ret = PTR_ERR(args32); + args32 = NULL; + goto out; + } + + args64 = malloc(sizeof(*args64)); + if (IS_ERR(args64)) { + ret = PTR_ERR(args64); + args64 = NULL; + goto out; + } + + memcpy(args64-uuid, args32-uuid, BTRFS_UUID_SIZE); + args64-stransid = args32-stransid; + args64-rtransid = 
args32-rtransid; + args64-stime.sec = args32-stime.sec; + args64-stime.nsec = args32-stime.nsec; + args64-rtime.sec = args32-rtime.sec; + args64-rtime.nsec = args32-rtime.nsec; + args64-flags = args32-flags; + + ret = _btrfs_ioctl_set_received_subvol(file, args64); + +out: + kfree(args32); + kfree(args64); + return ret; +} +#endif + static long btrfs_ioctl_set_received_subvol(struct file *file, void __user *arg) { struct btrfs_ioctl_received_subvol_args *sa = NULL; + int ret = 0; + + sa = memdup_user(arg, sizeof(*sa)); + if (IS_ERR(sa)) { + ret = PTR_ERR(sa); + sa = NULL; + goto out; + } + + ret = _btrfs_ioctl_set_received_subvol(file, sa); + +out: + kfree(sa); + return ret; +} + +static long _btrfs_ioctl_set_received_subvol(struct
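The size difference behind the ioctl-number mismatch can be checked with a small stand-alone sketch. The struct names below are made up for illustration and only mirror the sec/nsec pair from the patch. `__attribute__((__packed__))` is a GCC/Clang extension, and the 4-byte padding is what one would expect on a typical 64-bit ABI, not something this message thread measured.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Natural layout: a 64-bit compiler normally pads this to 16 bytes. */
struct timespec_natural {
	uint64_t sec;
	uint32_t nsec;
};

/* Packed layout: no tail padding, 12 bytes on every ABI. */
struct timespec_packed {
	uint64_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

static size_t natural_size(void) { return sizeof(struct timespec_natural); }
static size_t packed_size(void) { return sizeof(struct timespec_packed); }

/* Bytes of padding the natural layout adds on the current ABI. */
static size_t tail_padding(void)
{
	assert(natural_size() >= packed_size());
	return natural_size() - packed_size();
}
```

Because the ioctl number encodes the argument size, two such timespec members differing by 4 bytes each are enough to make the 32-bit and 64-bit ioctl numbers disagree.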
Re: [PATCH] Btrfs: return ENOSPC when target space is full
Hi Filipe,

I think in the context of do_chunk_alloc(), 0 doesn't mean success.
0 means allocation was not attempted, while 1 means allocation was
attempted and succeeded. -ENOSPC means allocation was attempted but
failed. Any other errno deserves a transaction abort.

Anyway, the callers are OK with 0 and -ENOSPC, and re-search for a free
extent in these cases.

Alex.

On Mon, Aug 5, 2013 at 5:25 PM, Filipe David Borba Manana <fdman...@gmail.com> wrote:

In extent-tree.c:do_chunk_alloc(), early on we returned 0 (success)
when the target space was full and when chunk allocation is needed.
However, later on in that same function we return ENOSPC if
btrfs_alloc_chunk() fails (and chunk allocation was needed) and set the
space's full flag.

This was inconsistent, as -ENOSPC should be returned if the space is
full and a chunk allocation needs to be performed. If the space is full
but no chunk allocation is needed, just return 0 (success).

Signed-off-by: Filipe David Borba Manana <fdman...@gmail.com>
---
 fs/btrfs/extent-tree.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index e868c35..ef89a66 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3829,8 +3829,12 @@ again:
 	if (force < space_info->force_alloc)
 		force = space_info->force_alloc;
 	if (space_info->full) {
+		if (should_alloc_chunk(extent_root, space_info, force))
+			ret = -ENOSPC;
+		else
+			ret = 0;
 		spin_unlock(&space_info->lock);
-		return 0;
+		return ret;
 	}
 	if (!should_alloc_chunk(extent_root, space_info, force)) {
--
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
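The return-value convention discussed in this thread can be written down as a small user-space model. This is a sketch, not the kernel function: `chunk_alloc_model()`, `caller_reaction()` and the boolean parameters are hypothetical, and the early `space_full` branch follows the behaviour the patch proposes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum caller_action {
	RETRY_SEARCH,       /* search for a free extent again */
	ABORT_TRANSACTION,  /* unexpected error */
};

/*
 * Hypothetical model of the convention:
 *   0       allocation was not attempted
 *   1       allocation was attempted and succeeded
 *   -ENOSPC allocation was needed but could not be done
 */
static int chunk_alloc_model(bool space_full, bool should_alloc,
			     bool alloc_fails)
{
	if (space_full)
		return should_alloc ? -ENOSPC : 0;  /* behaviour after the patch */
	if (!should_alloc)
		return 0;        /* not attempted */
	if (alloc_fails)
		return -ENOSPC;  /* attempted, failed */
	return 1;                /* attempted, succeeded */
}

/* How a caller would react, per the description in the reply. */
static enum caller_action caller_reaction(int ret)
{
	assert(ret <= 1);
	if (ret < 0 && ret != -ENOSPC)
		return ABORT_TRANSACTION;
	return RETRY_SEARCH;
}
```

Under this model the patch changes only which of the two "retry" values is returned for a full space, which matches the observation that callers already tolerate both 0 and -ENOSPC.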
Re: [PATCH v2] Btrfs-progs: avoid using btrfs internal subvolume path to send
Hi Miguel,

On Sat, Nov 30, 2013 at 1:43 PM, Miguel Negrão <miguel.negrao-li...@friendlyvirus.org> wrote:
On 29-11-2013 16:37, Wang Shilong wrote:

From: Wang Shilong <wangsl.f...@cn.fujitsu.com>

Steps to reproduce:
 # mkfs.btrfs -f /dev/sda
 # mount /dev/sda /mnt
 # btrfs subvolume create /mnt/foo
 # umount /mnt
 # mount -o subvol=foo /dev/sda /mnt
 # btrfs sub snapshot -r /mnt /mnt/snap
 # btrfs send /mnt/snap > /dev/null

We will fail to send '/mnt/snap'. This is because btrfs send tries to
open '/mnt/snap' by the btrfs internal subvolume path 'foo/snap' rather
than by a path relative to the mount point, which returns 'no such file
or directory'. This is not right, fix it.

I was going to write to the list to report exactly this issue. In my
case, this happens when the default subvolume has been changed from
id 5 to some other id. I get the error 'no such file or directory'.
Currently my workaround is to mount the root subvolume with
-o subvolid=5 and then do the send.

Also, I'd like to ask, are there plans to make the send and receive
commands resumable somehow (or perhaps they already are, but I couldn't
see how)?

I have proposed a patch to address the resumability of send-receive
some time ago in this thread:
http://www.spinics.net/lists/linux-btrfs/msg18180.html
However, it changes the current user-kernel protocol used by send, and
is overall a big change, which is not easy to integrate.

Alex.

best,
Miguel Negrão
Re: [PATCH v7 04/13] Btrfs: disable qgroups accounting when quata_enable is 0
Hi Liu, Jan,

What happens to the struct qgroup_update objects sitting in
trans->qgroup_ref_list in case the transaction aborts? It seems that
they are not freed.

For example, if we are in btrfs_commit_transaction() and:
- call create_pending_snapshots()
- call btrfs_run_delayed_items() and this fails

then we go to cleanup_transaction() without calling
btrfs_delayed_refs_qgroup_accounting(), which would have been called by
btrfs_run_delayed_refs().

I receive kmemleak warnings about these objects not being freed,
although on an older kernel. However, looking at Josef's tree, this
still seems to be the case.

Thanks,
Alex.

On Mon, Oct 14, 2013 at 7:59 AM, Liu Bo <bo.li....@oracle.com> wrote:

It's unnecessary to do qgroups accounting without enabling quota.

Signed-off-by: Liu Bo <bo.li....@oracle.com>
---
 fs/btrfs/ctree.c       |    2 +-
 fs/btrfs/delayed-ref.c |   18 ++++++++++++++----
 fs/btrfs/qgroup.c      |    3 +++
 3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 61b5bcd..fb89235 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -407,7 +407,7 @@ u64 btrfs_get_tree_mod_seq(struct btrfs_fs_info *fs_info,
 	tree_mod_log_write_lock(fs_info);
 	spin_lock(&fs_info->tree_mod_seq_lock);
-	if (!elem->seq) {
+	if (elem && !elem->seq) {
 		elem->seq = btrfs_inc_tree_mod_seq_major(fs_info);
 		list_add_tail(&elem->list, &fs_info->tree_mod_seq_list);
 	}
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 9e1a1c9..3ec3d08 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -691,8 +691,13 @@ static noinline void add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
-	if (need_ref_seq(for_cow, ref_root))
-		seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+	if (need_ref_seq(for_cow, ref_root)) {
+		struct seq_list *elem = NULL;
+
+		if (fs_info->quota_enabled)
+			elem = &trans->delayed_ref_elem;
+		seq = btrfs_get_tree_mod_seq(fs_info, elem);
+	}
 	ref->seq = seq;
 	full_ref = btrfs_delayed_node_to_tree_ref(ref);
@@ -750,8 +755,13 @@ static noinline void add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	ref->is_head = 0;
 	ref->in_tree = 1;
-	if (need_ref_seq(for_cow, ref_root))
-		seq = btrfs_get_tree_mod_seq(fs_info, &trans->delayed_ref_elem);
+	if (need_ref_seq(for_cow, ref_root)) {
+		struct seq_list *elem = NULL;
+
+		if (fs_info->quota_enabled)
+			elem = &trans->delayed_ref_elem;
+		seq = btrfs_get_tree_mod_seq(fs_info, elem);
+	}
 	ref->seq = seq;
 	full_ref = btrfs_delayed_node_to_data_ref(ref);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4e6ef49..1cb58f9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -1188,6 +1188,9 @@ int btrfs_qgroup_record_ref(struct btrfs_trans_handle *trans,
 {
 	struct qgroup_update *u;
+	if (!trans->root->fs_info->quota_enabled)
+		return 0;
+
 	BUG_ON(!trans->delayed_ref_elem.seq);
 	u = kmalloc(sizeof(*u), GFP_NOFS);
 	if (!u)
--
1.7.7
Re: [PATCH v2] btrfs-progs: calculate disk space that a subvol could free
Hello Anand,

I have sent a similar comment to your email thread in
http://www.spinics.net/lists/linux-btrfs/msg27547.html

I believe this approach of calculating freeable space is incorrect.
Try this:
- create a fresh btrfs
- create a regular file
- write some data into it in such a way that it has, say, 4000
  EXTENT_DATA items, so that the file tree and extent tree get deep
  enough
- run btrfs-debug-tree and verify that all EXTENT_ITEMs of this file
  (in the extent tree) have refcnt=1
- create a snapshot of the subvolume
- run btrfs-debug-tree again

You will see that most (in my case, all) of the EXTENT_ITEMs still have
refcnt=1. The reason for this is as I mentioned in
http://www.spinics.net/lists/linux-btrfs/msg27547.html

But if you delete the subvolume, no space will be freed, because all
these extents are also shared by the snapshot. It seems, however, that
your tool will report that all of the subvolume's space is freeable
(based on refcnt=1).

Can you please try that experiment and comment on it? Perhaps I am
missing something here?

Thanks!
Alex.

On Thu, Oct 10, 2013 at 6:33 AM, Wang Shilong <wangsl.f...@cn.fujitsu.com> wrote:
On 10/10/2013 11:35 AM, Anand Jain wrote:

If 'btrfs_file_extent_item' can contain the ref count it would solve
the current problem quite easily. (The problem is that it takes n * n
searches to know the data extents with their refs for a given subvol.)

Just considering btrfs_file_extent_item is not enough, because we
should consider metadata (as I have said before).

But what are the challenge(s) to having a ref count in the
btrfs_file_extent_item? Any thoughts?

Haven't thought of a better idea yet.

Thanks,
Anand
Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv
Hi Filipe, any luck with this patch?:) Alex. On Wed, Oct 23, 2013 at 5:26 PM, Filipe David Manana fdman...@gmail.com wrote: On Wed, Oct 23, 2013 at 3:14 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Hello, On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana fdman...@gmail.com wrote: On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Hi Filipe, On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana fdman...@gmail.com wrote: This issue is simple to reproduce and observe if kmemleak is enabled. Two simple ways to reproduce it: ** 1 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ btrfs balance start /mnt/btrfs $ umount /mnt/btrfs So here it seems that the leak can only happen in case the block-group has a free-space inode. This is what the orphan item is added for. Yes, here kmemleak reports. But: if space_cache option is disabled (and nospace_cache) enabled, it seems that btrfs still creates the FREE_SPACE inodes, although they are empty because in cache_save_setup: inode = lookup_free_space_inode(root, block_group, path); if (IS_ERR(inode) PTR_ERR(inode) != -ENOENT) { ret = PTR_ERR(inode); btrfs_release_path(path); goto out; } if (IS_ERR(inode)) { ... ret = create_free_space_inode(root, trans, block_group, path); and only later it actually sets BTRFS_DC_WRITTEN if space_cache option is disabled. Amazing! Although this is a different issue, do you know perhaps why these empty inodes are needed? Don't know if they are needed. But you have a point, it seems odd to create the free space cache inode if mount option nospace_cache was supplied. Thanks Alex. 
Testing the following patch: diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index c43ee8a..eb1b7da 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3162,6 +3162,9 @@ static int cache_save_setup(struct btrfs_block_group_cache *block_group, int retries = 0; int ret = 0; + if (!btrfs_test_opt(root, SPACE_CACHE)) + return 0; + /* * If this block group is smaller than 100 megs don't bother caching the * block group. Thanks! Alex. ** 2 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ touch /mnt/btrfs/foobar $ rm -f /mnt/btrfs/foobar $ umount /mnt/btrfs I tried the second repro script on kernel 3.8.13, and kmemleak does not report a leak (even if I force the kmemleak scan). I did not try the balance-repro script, though. Am I missing something? Maybe it's not an issue on 3.8.13 and older releases. This was on btrfs-next from August 19. thanks for testing Thanks, Alex. After a while, kmemleak reports the leak: $ cat /sys/kernel/debug/kmemleak unreferenced object 0x880402b13e00 (size 128): comm btrfs, pid 19621, jiffies 4341648183 (age 70057.844s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de .N.. backtrace: [817275a6] kmemleak_alloc+0x26/0x50 [8117832b] kmem_cache_alloc_trace+0xeb/0x1d0 [a04db499] btrfs_alloc_block_rsv+0x39/0x70 [btrfs] [a04f8bad] btrfs_orphan_add+0x13d/0x1b0 [btrfs] [a04e2b13] btrfs_remove_block_group+0x143/0x500 [btrfs] [a0518158] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs] [a051bc27] btrfs_balance+0x8f7/0xe90 [btrfs] [a05240a0] btrfs_ioctl_balance+0x250/0x550 [btrfs] [a05269ca] btrfs_ioctl+0xdfa/0x25f0 [btrfs] [8119c936] do_vfs_ioctl+0x96/0x570 [8119cea1] SyS_ioctl+0x91/0xb0 [81750242] system_call_fastpath+0x16/0x1b [] 0x This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb (Btrfs: separate out tests into their own directory). 
Signed-off-by: Filipe David Borba Manana fdman...@gmail.com --- V2: removed atomic_t member in struct btrfs_block_rsv, as suggested by Josef Bacik, and use instead the condition reserved == 0 to decide when to free the block. V3: simplified patch, just kfree() (and not btrfs_free_block_rsv) the root's orphan_block_rsv when free'ing the root. Thanks Josef for the suggestion. V4: use btrfs_free_block_rsv() instead of kfree(). The error I was getting in xfstests when using btrfs_free_block_rsv() was unrelated, Josef just pointed it to me (separate issue). V5: move the free call below the iput() call, so that btrfs_evict_node() can process the orphan_block_rsv first to do some needed cleanup before we free it. V6: free the root's orphan_block_rsv in close_ctree() too. After a balance the orphan_block_rsv of the tree of tree roots was being leaked, because free_fs_root() is only called for filesystem
Re: [patch 3/7] btrfs: Add per-super attributes to sysfs
Hi Jeff, On Tue, Sep 10, 2013 at 7:24 AM, Jeff Mahoney je...@suse.com wrote: This patch adds per-super attributes to sysfs. It doesn't publish any attributes yet, but does the proper lifetime handling as well as the basic infrastructure to add new attributes. Signed-off-by: Jeff Mahoney je...@suse.com --- fs/btrfs/ctree.h |2 + fs/btrfs/super.c | 13 +++- fs/btrfs/sysfs.c | 58 +++ fs/btrfs/sysfs.h | 19 ++ 4 files changed, 91 insertions(+), 1 deletion(-) --- a/fs/btrfs/ctree.h 2013-09-10 00:09:12.990087784 -0400 +++ b/fs/btrfs/ctree.h 2013-09-10 00:09:35.521794520 -0400 @@ -3694,6 +3694,8 @@ int btrfs_defrag_leaves(struct btrfs_tra /* sysfs.c */ int btrfs_init_sysfs(void); void btrfs_exit_sysfs(void); +int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info); +void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info); /* xattr.c */ ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size); --- a/fs/btrfs/super.c 2013-09-10 00:09:12.994087730 -0400 +++ b/fs/btrfs/super.c 2013-09-10 00:09:35.525794464 -0400 @@ -301,6 +301,8 @@ void __btrfs_panic(struct btrfs_fs_info static void btrfs_put_super(struct super_block *sb) { + btrfs_sysfs_remove_one(btrfs_sb(sb)); + (void)close_ctree(btrfs_sb(sb)-tree_root); /* FIXME: need to fix VFS to return error? */ /* AV: return it _where_? -put_super() can be triggered by any number @@ -1143,8 +1145,17 @@ static struct dentry *btrfs_mount(struct } root = !error ? 
get_default_root(s, subvol_objectid) : ERR_PTR(error); - if (IS_ERR(root)) + if (IS_ERR(root)) { deactivate_locked_super(s); + return root; + } + + error = btrfs_sysfs_add_one(fs_info); + if (error) { + dput(root); + deactivate_locked_super(s); + return ERR_PTR(error); + } return root; --- a/fs/btrfs/sysfs.c 2013-09-10 00:09:13.002087628 -0400 +++ b/fs/btrfs/sysfs.c 2013-09-10 00:09:49.501616538 -0400 @@ -61,6 +61,64 @@ static struct attribute *btrfs_supp_feat NULL }; +static struct attribute *btrfs_attrs[] = { + NULL, +}; + +static void btrfs_fs_info_release(struct kobject *kobj) +{ + struct btrfs_fs_info *fs_info; + fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj); + complete(fs_info-kobj_unregister); +} + +static ssize_t btrfs_attr_show(struct kobject *kobj, + struct attribute *attr, char *buf) +{ + struct btrfs_attr *a = container_of(attr, struct btrfs_attr, attr); + struct btrfs_fs_info *fs_info; + fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj); + + return a-show ? a-show(a, fs_info, buf) : 0; +} + +static ssize_t btrfs_attr_store(struct kobject *kobj, + struct attribute *attr, + const char *buf, size_t len) +{ + struct btrfs_attr *a = container_of(attr, struct btrfs_attr, attr); + struct btrfs_fs_info *fs_info; + fs_info = container_of(kobj, struct btrfs_fs_info, super_kobj); + + return a-store ? 
a-store(a, fs_info, buf, len) : 0; +} + +static const struct sysfs_ops btrfs_attr_ops = { + .show = btrfs_attr_show, + .store = btrfs_attr_store, +}; + +static struct kobj_type btrfs_ktype = { + .default_attrs = btrfs_attrs, + .sysfs_ops = btrfs_attr_ops, + .release= btrfs_fs_info_release, +}; + +int btrfs_sysfs_add_one(struct btrfs_fs_info *fs_info) +{ + init_completion(fs_info-kobj_unregister); + fs_info-super_kobj.kset = btrfs_kset; + return kobject_init_and_add(fs_info-super_kobj, btrfs_ktype, NULL, + %pU, fs_info-fsid); +} + +void btrfs_sysfs_remove_one(struct btrfs_fs_info *fs_info) +{ + kobject_del(fs_info-super_kobj); Is there a reason for this explicit call? The last kobject_put will do this automatically, no? + kobject_put(fs_info-super_kobj); + wait_for_completion(fs_info-kobj_unregister); +} + static void btrfs_supp_feat_release(struct kobject *kobj) { complete(btrfs_feat-f_kobj_unregister); --- a/fs/btrfs/sysfs.h 2013-09-10 00:09:13.002087628 -0400 +++ b/fs/btrfs/sysfs.h 2013-09-10 00:09:35.525794464 -0400 @@ -8,6 +8,24 @@ enum btrfs_feature_set { FEAT_MAX }; +struct btrfs_attr { + struct attribute attr; + ssize_t (*show)(struct btrfs_attr *, struct btrfs_fs_info *, char *); + ssize_t (*store)(struct btrfs_attr *, struct btrfs_fs_info *, +const char *, size_t); +}; + +#define __INIT_BTRFS_ATTR(_name, _mode, _show, _store) \ +{ \ +
Re: [PATCH] btrfs-progs: calculate disk space that a subvol could free upon delete
Hi Anand,

1) So let's say I have a subvolume and a snapshot of this subvolume.
   In this case, I will see Sole space = 0 for both of them, correct?
   Because all extents (except inline ones) are shared.
2) How is this in terms of responsiveness? On a huge subvolume, we need
   to iterate over all the EXTENT_DATAs and then look up their
   EXTENT_ITEMs.
3) So it's kind of a poor man's replacement for quota groups, correct?

I like that it's so easy to check for exclusive data, though :)

Alex.

On Fri, Sep 13, 2013 at 6:44 PM, Wang Shilong <wangshilong1...@gmail.com> wrote:

Hello Anand,

(This patch is for review and comments only)

This patch provides a way to know how much space can be relinquished
when a subvol/snapshot is deleted. With this, the sys admin can make
better judgments in managing the filesystem when the fs is nearly full.

I think this is really *helpful*, since users cannot really know how
much (exclusive) space there is in a subvolume.

Thanks,
Wang

As shown below, the parameter 'sole space' indicates the size which is
freed when the subvol is deleted (any other better term for this? pls
suggest).
- btrfs su show /btrfs/sv1 /btrfs/sv1 Name: sv1 uuid: b078ba48-d4a5-2f49-ac03-9bd1d56cc768 Parent uuid:- Creation time: 2013-09-13 18:17:32 Object ID: 257 Generation (Gen): 18 Gen at creation:17 Parent: 5 Top Level: 5 Flags: - Sole space: 1.56MiB Snapshot(s): btrfs su snap /btrfs/sv1 /btrfs/ss2 Create a snapshot of '/btrfs/sv1' in '/btrfs/ss2' btrfs su show /btrfs/sv1 /btrfs/sv1 Name: sv1 uuid: b078ba48-d4a5-2f49-ac03-9bd1d56cc768 Parent uuid:- Creation time: 2013-09-13 18:17:32 Object ID: 257 Generation (Gen): 19 Gen at creation:17 Parent: 5 Top Level: 5 Flags: - Sole space: 0.00 - Snapshot(s): ss2 - Signed-off-by: Anand Jain anand.j...@oracle.com --- cmds-subvolume.c | 5 ++ utils.c | 154 +++ utils.h | 1 + 3 files changed, 160 insertions(+) diff --git a/cmds-subvolume.c b/cmds-subvolume.c index de246ab..2b02d66 100644 --- a/cmds-subvolume.c +++ b/cmds-subvolume.c @@ -809,6 +809,7 @@ static int cmd_subvol_show(int argc, char **argv) int fd = -1, mntfd = -1; int ret = 1; DIR *dirstream1 = NULL, *dirstream2 = NULL; + u64 freeable_bytes; if (check_argc_exact(argc, 2)) usage(cmd_subvol_show_usage); @@ -878,6 +879,8 @@ static int cmd_subvol_show(int argc, char **argv) goto out; } + freeable_bytes = get_subvol_freeable_bytes(fd); + ret = 0; /* print the info */ printf(%s\n, fullpath); @@ -915,6 +918,8 @@ static int cmd_subvol_show(int argc, char **argv) else printf(\tFlags: \t\t\t-\n); + printf(\tSole space: \t\t%s\n, + pretty_size(freeable_bytes)); /* print the snapshots of the given subvol if any*/ printf(\tSnapshot(s):\n); filter_set = btrfs_list_alloc_filter_set(); diff --git a/utils.c b/utils.c index 22c3310..f01d580 100644 --- a/utils.c +++ b/utils.c @@ -2019,3 +2019,157 @@ int is_dev_excl_op_free(int fd) ret = ioctl(fd, BTRFS_IOC_CHECK_DEV_EXCL_OPS, NULL); return ret 0 ? 
ret : -errno; } + +/* gets the ref count for given extent + * 0 = didn't find the item + * n = number of references +*/ +u64 get_extent_refcnt(int fd, u64 disk_blk) +{ + int ret = 0, i, e; + struct btrfs_ioctl_search_args args; + struct btrfs_ioctl_search_key *sk = args.key; + struct btrfs_ioctl_search_header sh; + unsigned long off = 0; + + memset(args, 0, sizeof(args)); + + sk-tree_id = BTRFS_EXTENT_TREE_OBJECTID; + + sk-min_type = BTRFS_EXTENT_ITEM_KEY; + sk-max_type = BTRFS_EXTENT_ITEM_KEY; + + sk-min_objectid = disk_blk; + sk-max_objectid = disk_blk; + + sk-max_offset = (u64)-1; + sk-max_transid = (u64)-1; + + while (1) { + sk-nr_items = 4096; + + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, args); + e = errno; + if (ret 0) { + fprintf(stderr, ERROR: search failed - %s\n, + strerror(e)); + return 0; + } + if (sk-nr_items == 0) + break; + + off = 0; + for (i = 0; i sk-nr_items; i++) { + struct btrfs_extent_item *ei; + u64 ref;
Re: [PATCH] btrfs-progs: calculate disk space that a subvol could free upon delete
Thinking about this more, I believe this way of checking for exclusive
data doesn't work. When a snapshot is created, btrfs doesn't go and
explicitly increment the refcount on *all* relevant EXTENT_ITEMs in the
extent tree; that way, creating a snapshot would take forever for large
subvolumes. Instead, it only does that for EXTENT_ITEMs that are
pointed to by EXTENT_DATAs in the root node of the snapshotted file
tree. For the rest of the nodes/leaves of a file tree, implicit
tree-block references are added (not sure if implicit is the right
term), for the top tree blocks only. This is accomplished by the
_btrfs_mod_ref() code, called from btrfs_copy_root() during the
snapshot creation flow. The snapshot deletion code is the one that is
smart enough to properly unshare shared tree blocks with such implicit
references.

What do you think?

Alex.

On Sat, Oct 26, 2013 at 10:49 PM, Alex Lyakas <alex.bt...@zadarastorage.com> wrote:

Hi Anand,

1) So let's say I have a subvolume and a snapshot of this subvolume.
   In this case, I will see Sole space = 0 for both of them, correct?
   Because all extents (except inline ones) are shared.
2) How is this in terms of responsiveness? On a huge subvolume, we need
   to iterate over all the EXTENT_DATAs and then look up their
   EXTENT_ITEMs.
3) So it's kind of a poor man's replacement for quota groups, correct?

I like that it's so easy to check for exclusive data, though :)

Alex.

On Fri, Sep 13, 2013 at 6:44 PM, Wang Shilong <wangshilong1...@gmail.com> wrote:

Hello Anand,

(This patch is for review and comments only)

This patch provides a way to know how much space can be relinquished
when a subvol/snapshot is deleted. With this, the sys admin can make
better judgments in managing the filesystem when the fs is nearly full.

I think this is really *helpful*, since users cannot really know how
much (exclusive) space there is in a subvolume.

Thanks,
Wang

As shown below, the parameter 'sole space' indicates the size which is
freed when the subvol is deleted (any other better term for this? pls
suggest).
- btrfs su show /btrfs/sv1 /btrfs/sv1 Name: sv1 uuid: b078ba48-d4a5-2f49-ac03-9bd1d56cc768 Parent uuid:- Creation time: 2013-09-13 18:17:32 Object ID: 257 Generation (Gen): 18 Gen at creation:17 Parent: 5 Top Level: 5 Flags: - Sole space: 1.56MiB Snapshot(s): btrfs su snap /btrfs/sv1 /btrfs/ss2 Create a snapshot of '/btrfs/sv1' in '/btrfs/ss2' btrfs su show /btrfs/sv1 /btrfs/sv1 Name: sv1 uuid: b078ba48-d4a5-2f49-ac03-9bd1d56cc768 Parent uuid:- Creation time: 2013-09-13 18:17:32 Object ID: 257 Generation (Gen): 19 Gen at creation:17 Parent: 5 Top Level: 5 Flags: - Sole space: 0.00 - Snapshot(s): ss2 - Signed-off-by: Anand Jain anand.j...@oracle.com --- cmds-subvolume.c | 5 ++ utils.c | 154 +++ utils.h | 1 + 3 files changed, 160 insertions(+) diff --git a/cmds-subvolume.c b/cmds-subvolume.c index de246ab..2b02d66 100644 --- a/cmds-subvolume.c +++ b/cmds-subvolume.c @@ -809,6 +809,7 @@ static int cmd_subvol_show(int argc, char **argv) int fd = -1, mntfd = -1; int ret = 1; DIR *dirstream1 = NULL, *dirstream2 = NULL; + u64 freeable_bytes; if (check_argc_exact(argc, 2)) usage(cmd_subvol_show_usage); @@ -878,6 +879,8 @@ static int cmd_subvol_show(int argc, char **argv) goto out; } + freeable_bytes = get_subvol_freeable_bytes(fd); + ret = 0; /* print the info */ printf(%s\n, fullpath); @@ -915,6 +918,8 @@ static int cmd_subvol_show(int argc, char **argv) else printf(\tFlags: \t\t\t-\n); + printf(\tSole space: \t\t%s\n, + pretty_size(freeable_bytes)); /* print the snapshots of the given subvol if any*/ printf(\tSnapshot(s):\n); filter_set = btrfs_list_alloc_filter_set(); diff --git a/utils.c b/utils.c index 22c3310..f01d580 100644 --- a/utils.c +++ b/utils.c @@ -2019,3 +2019,157 @@ int is_dev_excl_op_free(int fd) ret = ioctl(fd, BTRFS_IOC_CHECK_DEV_EXCL_OPS, NULL); return ret 0 ? 
ret : -errno; } + +/* gets the ref count for given extent + * 0 = didn't find the item + * n = number of references +*/ +u64 get_extent_refcnt(int fd, u64 disk_blk) +{ + int ret = 0, i, e; + struct btrfs_ioctl_search_args args; + struct btrfs_ioctl_search_key *sk = args.key; + struct btrfs_ioctl_search_header sh
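The objection raised in this message (refcnt == 1 on an EXTENT_ITEM does not imply the extent is exclusive once a snapshot shares the tree blocks above it) can be illustrated with a deliberately simplified model. Everything here is hypothetical: a two-level tree, made-up struct and function names, and a snapshot that only bumps the reference count of the leaves directly below the root. It is a sketch of the argument, not of the real btrfs data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LEAVES 4
#define EXTENTS_PER_LEAF 8
#define EXTENT_SIZE 4096u

struct leaf_model {
	int block_refs;                       /* tree-block references to this leaf */
	int extent_refcnt[EXTENTS_PER_LEAF];  /* refcnt of each extent it points to */
};

struct fs_model {
	struct leaf_model leaves[LEAVES];
};

static void fs_init(struct fs_model *fs)
{
	assert(fs != NULL);
	for (size_t l = 0; l < LEAVES; l++) {
		fs->leaves[l].block_refs = 1;
		for (size_t e = 0; e < EXTENTS_PER_LEAF; e++)
			fs->leaves[l].extent_refcnt[e] = 1;
	}
}

/* Snapshot: the new root shares every leaf; extent refcnts stay untouched. */
static void take_snapshot(struct fs_model *fs)
{
	for (size_t l = 0; l < LEAVES; l++)
		fs->leaves[l].block_refs++;
}

/* What a scan based only on extent refcnt == 1 would report. */
static uint64_t naive_freeable(const struct fs_model *fs)
{
	uint64_t bytes = 0;
	for (size_t l = 0; l < LEAVES; l++)
		for (size_t e = 0; e < EXTENTS_PER_LEAF; e++)
			if (fs->leaves[l].extent_refcnt[e] == 1)
				bytes += EXTENT_SIZE;
	return bytes;
}

/* Bytes released on delete: only extents below leaves nobody else shares. */
static uint64_t actual_freeable(const struct fs_model *fs)
{
	uint64_t bytes = 0;
	for (size_t l = 0; l < LEAVES; l++) {
		if (fs->leaves[l].block_refs != 1)
			continue;  /* leaf is still reachable from the snapshot */
		for (size_t e = 0; e < EXTENTS_PER_LEAF; e++)
			if (fs->leaves[l].extent_refcnt[e] == 1)
				bytes += EXTENT_SIZE;
	}
	return bytes;
}
```

In this model the two numbers agree before the snapshot and diverge completely after it, which is the discrepancy the proposed experiment with btrfs-debug-tree is meant to expose.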
Re: [PATCH v6] Btrfs: fix memory leak of orphan block rsv
Hello, On Wed, Oct 23, 2013 at 4:35 PM, Filipe David Manana fdman...@gmail.com wrote: On Wed, Oct 23, 2013 at 2:33 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Hi Filipe, On Tue, Aug 20, 2013 at 2:52 AM, Filipe David Borba Manana fdman...@gmail.com wrote: This issue is simple to reproduce and observe if kmemleak is enabled. Two simple ways to reproduce it: ** 1 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ btrfs balance start /mnt/btrfs $ umount /mnt/btrfs So here it seems that the leak can only happen in case the block-group has a free-space inode. This is what the orphan item is added for. Yes, here kmemleak reports. But: if space_cache option is disabled (and nospace_cache) enabled, it seems that btrfs still creates the FREE_SPACE inodes, although they are empty because in cache_save_setup: inode = lookup_free_space_inode(root, block_group, path); if (IS_ERR(inode) PTR_ERR(inode) != -ENOENT) { ret = PTR_ERR(inode); btrfs_release_path(path); goto out; } if (IS_ERR(inode)) { ... ret = create_free_space_inode(root, trans, block_group, path); and only later it actually sets BTRFS_DC_WRITTEN if space_cache option is disabled. Amazing! Although this is a different issue, do you know perhaps why these empty inodes are needed? Thanks! Alex. ** 2 $ mkfs.btrfs -f /dev/loop0 $ mount /dev/loop0 /mnt/btrfs $ touch /mnt/btrfs/foobar $ rm -f /mnt/btrfs/foobar $ umount /mnt/btrfs I tried the second repro script on kernel 3.8.13, and kmemleak does not report a leak (even if I force the kmemleak scan). I did not try the balance-repro script, though. Am I missing something? Maybe it's not an issue on 3.8.13 and older releases. This was on btrfs-next from August 19. thanks for testing Thanks, Alex. 
After a while, kmemleak reports the leak: $ cat /sys/kernel/debug/kmemleak unreferenced object 0x880402b13e00 (size 128): comm btrfs, pid 19621, jiffies 4341648183 (age 70057.844s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fc c6 b1 04 88 ff ff 04 00 04 00 ad 4e ad de .N.. backtrace: [817275a6] kmemleak_alloc+0x26/0x50 [8117832b] kmem_cache_alloc_trace+0xeb/0x1d0 [a04db499] btrfs_alloc_block_rsv+0x39/0x70 [btrfs] [a04f8bad] btrfs_orphan_add+0x13d/0x1b0 [btrfs] [a04e2b13] btrfs_remove_block_group+0x143/0x500 [btrfs] [a0518158] btrfs_relocate_chunk.isra.63+0x618/0x790 [btrfs] [a051bc27] btrfs_balance+0x8f7/0xe90 [btrfs] [a05240a0] btrfs_ioctl_balance+0x250/0x550 [btrfs] [a05269ca] btrfs_ioctl+0xdfa/0x25f0 [btrfs] [8119c936] do_vfs_ioctl+0x96/0x570 [8119cea1] SyS_ioctl+0x91/0xb0 [81750242] system_call_fastpath+0x16/0x1b [] 0x This affects btrfs-next, revision be8e3cd00d7293dd177e3f8a4a1645ce09ca3acb (Btrfs: separate out tests into their own directory). Signed-off-by: Filipe David Borba Manana fdman...@gmail.com --- V2: removed atomic_t member in struct btrfs_block_rsv, as suggested by Josef Bacik, and use instead the condition reserved == 0 to decide when to free the block. V3: simplified patch, just kfree() (and not btrfs_free_block_rsv) the root's orphan_block_rsv when free'ing the root. Thanks Josef for the suggestion. V4: use btrfs_free_block_rsv() instead of kfree(). The error I was getting in xfstests when using btrfs_free_block_rsv() was unrelated, Josef just pointed it to me (separate issue). V5: move the free call below the iput() call, so that btrfs_evict_node() can process the orphan_block_rsv first to do some needed cleanup before we free it. V6: free the root's orphan_block_rsv in close_ctree() too. After a balance the orphan_block_rsv of the tree of tree roots was being leaked, because free_fs_root() is only called for filesystem trees. 
 fs/btrfs/disk-io.c |5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3b12c26..5d17163 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3430,6 +3430,8 @@ static void free_fs_root(struct btrfs_root *root)
 {
 	iput(root->cache_inode);
 	WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
+	btrfs_free_block_rsv(root, root->orphan_block_rsv);
+	root->orphan_block_rsv = NULL;
 	if (root->anon_dev)
 		free_anon_bdev(root->anon_dev);
 	free_extent_buffer(root->node);
@@ -3582,6 +3584,9 @@ int close_ctree(struct btrfs_root *root)

 	btrfs_free_stripe_hash_table(fs_info);

+	btrfs_free_block_rsv(root, root->orphan_block_rsv);
+	root->orphan_block_rsv = NULL;
+
 	return 0;
 }
--
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body
Re: [PATCH] btrfs: commit transaction after deleting a subvolume
Thank you for addressing this, David.

On Sat, Aug 31, 2013 at 1:25 AM, David Sterba dste...@suse.cz wrote:

Alex pointed out the consequences of not committing the transaction when a subvolume is deleted: a crash before an actual commit happens will let the subvolume reappear.

Original post: http://www.spinics.net/lists/linux-btrfs/msg22088.html
Josef's objections: http://www.spinics.net/lists/linux-btrfs/msg22256.html

While there's no need to do a full commit for regular files, a subvolume may get a different treatment.

http://www.spinics.net/lists/linux-btrfs/msg23087.html:

That a subvol/snapshot may appear after crash if transaction commit did not happen does not feel so good. We know that the subvol is only scheduled for deletion and needs to be processed by cleaner. From that point I'd rather see the commit to happen to avoid any unexpected surprises. A subvolume that re-appears still holds the data references and consumes space although the user does not assume that.

Automated snapshotting and deleting needs some guarantees about the behaviour and what to do after a crash. So now it has to process the backlog of previously deleted snapshots and verify that they're not there, compared to "deleted: will never appear, can forget about it".

There is a performance penalty incurred by the change, but deleting a subvolume is not a frequent operation and the tradeoff seems justified by getting the guarantee stated above.

CC: Alex Lyakas alex.bt...@zadarastorage.com
CC: Josef Bacik jba...@fusionio.com
Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/ioctl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e407f75..4394632 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2268,7 +2268,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 out_end_trans:
 	trans->block_rsv = NULL;
 	trans->bytes_reserved = 0;
-	ret = btrfs_end_transaction(trans, root);
+	ret = btrfs_commit_transaction(trans, root);
 	if (ret && !err)
 		err = ret;
 	inode->i_flags |= S_DEAD;
--
1.7.9

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
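To illustrate why ending the transaction without committing lets a deleted subvolume come back, here is a minimal toy model in C. It is only a sketch: `toy_fs` and the `toy_*` functions are hypothetical names invented for this example and do not correspond to btrfs structures or APIs. The model assumes that a crash discards everything that was not committed.

```c
#include <stdbool.h>

/* Toy model (hypothetical, not btrfs code): one flag for the in-memory
 * view and one for the last committed, crash-safe view. */
struct toy_fs {
	bool subvol_in_memory;   /* what running processes see */
	bool subvol_committed;   /* what survives a crash */
};

/* Make the crash-safe state match the in-memory state. */
void toy_commit(struct toy_fs *fs)
{
	fs->subvol_committed = fs->subvol_in_memory;
}

/* Delete the subvolume; optionally commit before returning,
 * which is the behaviour the patch switches to. */
void toy_delete_subvol(struct toy_fs *fs, bool commit_now)
{
	fs->subvol_in_memory = false;
	if (commit_now)
		toy_commit(fs);
}

/* A crash throws away everything that was not committed. */
void toy_crash_and_remount(struct toy_fs *fs)
{
	fs->subvol_in_memory = fs->subvol_committed;
}

/* Returns true if the subvolume is visible again after delete + crash. */
bool toy_subvol_reappears(bool commit_now)
{
	struct toy_fs fs = { .subvol_in_memory = true, .subvol_committed = true };

	toy_delete_subvol(&fs, commit_now);
	toy_crash_and_remount(&fs);
	return fs.subvol_in_memory;
}
```

In this model `toy_subvol_reappears(false)` is true and `toy_subvol_reappears(true)` is false, which is the guarantee the commit buys at the cost of a slower delete.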
Re: [PATCH 2/2] Btrfs: stop caching thread if extent_commit_sem is contended
Thanks for addressing this issue, Josef!

On Thu, Sep 26, 2013 at 4:26 PM, Josef Bacik jba...@fusionio.com wrote:

We can starve out the transaction commit with a bunch of caching threads all running at the same time. This is because we will only drop the extent_commit_sem if we need_resched(), which isn't likely to happen since we will be reading a lot from the disk so have already schedule()'ed plenty. Alex observed that he could starve out a transaction commit for up to a minute with 32 caching threads all running at once. This will allow us to drop the extent_commit_sem to allow the transaction commit to swap the commit_root out and then all the cachers will start back up.

Here is an explanation provided by Ingo:

So, just to fill in what happens in this loop:

	mutex_unlock(&caching_ctl->mutex);
	cond_resched();
	goto again;

where 'again:' takes caching_ctl->mutex and fs_info->extent_commit_sem again:

again:
	mutex_lock(&caching_ctl->mutex);
	/* need to make sure the commit_root doesn't disappear */
	down_read(&fs_info->extent_commit_sem);

So, if I'm reading the code correct, there can be a fair amount of concurrency here: there may be multiple 'caching kthreads' per filesystem active, while there's one fs_info->extent_commit_sem per filesystem AFAICS.

So, what happens if there are a lot of CPUs all busy holding the ->extent_commit_sem rwsem read-locked and a writer arrives? They'd all rush to try to release the fs_info->extent_commit_sem, and they'd block in the down_read() because there's a writer waiting.

So there's a guarantee of forward progress. This should answer akpm's concern I think.

Thanks,

Acked-by: Ingo Molnar mi...@kernel.org
Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/extent-tree.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index cfb3cf7..cc074c34 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -442,7 +442,8 @@ next:
 			if (ret)
 				break;

-			if (need_resched()) {
+			if (need_resched() ||
+			    rwsem_is_contended(&fs_info->extent_commit_sem)) {
 				caching_ctl->progress = last;
 				btrfs_release_path(path);
 				up_read(&fs_info->extent_commit_sem);
--
1.8.3.1
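The pattern in the hunk above (save the scan position, drop the read lock when someone is waiting, then restart from the saved position) can be sketched in plain C as follows. This is a single-threaded illustration only: `toy_scan`, `toy_scan_all` and the contention callback are hypothetical names, and the callback merely stands in for `need_resched() || rwsem_is_contended(...)`. No real locking is performed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scan state; 'progress' plays the role of caching_ctl->progress. */
struct toy_scan {
	size_t progress;    /* index of the next item to scan, kept across lock drops */
	size_t lock_drops;  /* how many times the reader gave up the lock mid-scan */
	size_t scanned;     /* total number of items processed */
};

/* Stand-in for "need_resched() || rwsem_is_contended()". */
typedef bool (*toy_contended_fn)(size_t index);

/* Scan items [0, nitems). Whenever the predicate reports contention,
 * record the position, "drop the lock", and resume from that position. */
void toy_scan_all(struct toy_scan *s, size_t nitems, toy_contended_fn contended)
{
	while (s->progress < nitems) {
		/* the read lock would be taken here (down_read) */
		size_t i = s->progress;

		while (i < nitems) {
			s->scanned++;
			i++;
			if (i < nitems && contended(i))
				break;  /* let the waiting writer in */
		}
		s->progress = i;        /* remember where to restart */
		if (i < nitems)
			s->lock_drops++; /* up_read, then loop back and re-take */
	}
}

/* Example predicate: pretend a writer shows up at every 4th item. */
bool toy_writer_every_4(size_t index)
{
	return index % 4 == 0;
}
```

Because the position is saved before the lock is released, no item is scanned twice and none is skipped, however often the reader yields.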
[PATCH] Notify caching_thread()s to give up on extent_commit_sem when needed.
caching_thread()s do all their work under read access to extent_commit_sem. They give up on this read access only when need_resched() tells them, or when they exit. As a result, somebody that wants a WRITE access to this sem might wait for a long time. This is especially problematic in cache_block_group(), which can be called on critical paths like find_free_extent() and in the commit path via commit_cowonly_roots().

This patch is an RFC that attempts to fix this problem by notifying the caching threads to give up on extent_commit_sem.

On a system with a lot of metadata (~20Gb total metadata, ~10Gb extent tree), with increased number of caching_threads, commits were very slow, stuck in commit_cowonly_roots, due to this issue. With this patch, commits no longer get stuck in commit_cowonly_roots.

This patch is not intended to be applied, just a request to comment on whether you agree this problem happens, and whether the fix goes in the right direction.

Signed-off-by: Alex Lyakas alex.bt...@zadarastorage.com
---
 fs/btrfs/ctree.h |7 +++
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/extent-tree.c |9 +
 fs/btrfs/transaction.c |2 +-
 4 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c90be01..b602611 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1427,6 +1427,13 @@ struct btrfs_fs_info {
 	struct mutex ordered_extent_flush_mutex;

 	struct rw_semaphore extent_commit_sem;
+/* notifies the readers to give up on the sem ASAP */
+atomic_t extent_commit_sem_give_up_read;
+#define BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info) \
+do { atomic_inc(&(fs_info)->extent_commit_sem_give_up_read); \
+ down_write(&(fs_info)->extent_commit_sem); \
+ atomic_dec(&(fs_info)->extent_commit_sem_give_up_read); \
+} while (0)

 	struct rw_semaphore cleanup_work_sem;

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 69e9afb..b88e688 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2291,6 +2291,7 @@ int open_ctree(struct super_block *sb,
 	mutex_init(&fs_info->cleaner_mutex);
 	mutex_init(&fs_info->volume_mutex);
 	init_rwsem(&fs_info->extent_commit_sem);
+atomic_set(&fs_info->extent_commit_sem_give_up_read, 0);
 	init_rwsem(&fs_info->cleanup_work_sem);
 	init_rwsem(&fs_info->subvol_sem);
 	sema_init(&fs_info->uuid_tree_rescan_sem, 1);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 95c6539..28fee78 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -442,7 +442,8 @@ next:
 			if (ret)
 				break;

-if (need_resched()) {
+if (need_resched() ||
+atomic_read(&fs_info->extent_commit_sem_give_up_read) > 0) {
 				caching_ctl->progress = last;
 				btrfs_release_path(path);
 				up_read(&fs_info->extent_commit_sem);
@@ -632,7 +633,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
 		return 0;
 	}

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);
 	atomic_inc(&caching_ctl->count);
 	list_add_tail(&caching_ctl->list, &fs_info->caching_block_groups);
 	up_write(&fs_info->extent_commit_sem);
@@ -5462,7 +5463,7 @@ void btrfs_prepare_extent_commit(struct btrfs_trans_handle *trans,
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_space_info *space_info;

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);

 	list_for_each_entry_safe(caching_ctl, next,
 				 &fs_info->caching_block_groups, list) {
@@ -8219,7 +8220,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info)
 	struct btrfs_caching_control *caching_ctl;
 	struct rb_node *n;

-down_write(&info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);
 	while (!list_empty(&info->caching_block_groups)) {
 		caching_ctl = list_entry(info->caching_block_groups.next,
 					 struct btrfs_caching_control, list);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index cac4a3f..976d20a 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -969,7 +969,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
 			return ret;
 	}

-down_write(&fs_info->extent_commit_sem);
+BTRFS_DOWN_WRITE_EXTENT_COMMIT_SEM(fs_info);
 	switch_commit_root(fs_info->extent_root);
 	up_write(&fs_info->extent_commit_sem);
--
1.7.9.5
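The idea behind the RFC (a counter of waiting writers that readers poll between items) can be sketched in userspace C11 roughly as follows. This is an illustration under stated assumptions, not the kernel code: `toy_sem_hint` and the `toy_*` functions are hypothetical, `<stdatomic.h>` stands in for the kernel's `atomic_t`, and the actual read/write lock is left out so that only the signalling is shown.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the counter the RFC adds next to the semaphore. */
struct toy_sem_hint {
	atomic_int give_up_read;  /* number of writers currently asking for the lock */
};

void toy_hint_init(struct toy_sem_hint *h)
{
	atomic_init(&h->give_up_read, 0);
}

/* Writer side: announce intent just before blocking on the write lock. */
void toy_writer_announce(struct toy_sem_hint *h)
{
	atomic_fetch_add(&h->give_up_read, 1);
}

/* Writer side: withdraw the announcement once the write lock is held. */
void toy_writer_acquired(struct toy_sem_hint *h)
{
	atomic_fetch_sub(&h->give_up_read, 1);
}

/* Reader side: polled once per scanned item; true means
 * "save progress and drop the read lock now". */
bool toy_reader_should_yield(struct toy_sem_hint *h, bool need_resched)
{
	return need_resched || atomic_load(&h->give_up_read) > 0;
}
```

A counter rather than a single flag keeps the hint raised while any writer is still waiting, so one writer acquiring the lock does not clear the request of another.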
Re: [PATCH] Notify caching_thread()s to give up on extent_commit_sem when needed.
On Thu, Aug 29, 2013 at 10:55 PM, Josef Bacik jba...@fusionio.com wrote: On Thu, Aug 29, 2013 at 10:09:29PM +0300, Alex Lyakas wrote: Hi Josef, On Thu, Aug 29, 2013 at 5:38 PM, Josef Bacik jba...@fusionio.com wrote: On Thu, Aug 29, 2013 at 01:31:05PM +0300, Alex Lyakas wrote: caching_thread()s do all their work under read access to extent_commit_sem. They give up on this read access only when need_resched() tells them, or when they exit. As a result, somebody that wants a WRITE access to this sem, might wait for a long time. Especially this is problematic in cache_block_group(), which can be called on critical paths like find_free_extent() and in commit path via commit_cowonly_roots(). This patch is an RFC, that attempts to fix this problem, by notifying the caching threads to give up on extent_commit_sem. On a system with a lot of metadata (~20Gb total metadata, ~10Gb extent tree), with increased number of caching_threads, commits were very slow, stuck in commit_cowonly_roots, due to this issue. With this patch, commits no longer get stuck in commit_cowonly_roots. But what kind of effect do you see on overall performance/runtime? Honestly I'd expect we'd spend more of our time waiting for the caching kthread to fill in free space so we can make allocations than waiting on this lock contention. I'd like to see real numbers here to see what kind of effect this patch has on your workload. (I don't doubt it makes a difference, I'm just curious to see how big of a difference it makes.) Primarily for me it affects the commit thread right after mounting, when it spends time in the critical part of the commit, in which trans_no_join is set, i.e., it is not possible to start a new transaction. So all the new writers that want a transaction are delayed at this point. Here are some numbers (and some more logs are in the attached file). 
Filesystem has a good amount of metadata (btrfs-progs modified slightly to print exact byte values): root@dc:/home/zadara# btrfs fi df /btrfs/pool-0002/ Data: total=846116945920(788.01GB), used=842106667008(784.27GB) System: total=4194304(4.00MB), used=94208(92.00KB) Metadata: total=31146901504(29.01GB), used=25248698368(23.51GB) original code, 2 caching workers, try 1 Aug 29 13:41:22 dc kernel: [28381.203745] [17617][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_STARTED:439] FS[dm-119] txn[6627] COMMIT extwr:0 wr:1 Aug 29 13:41:25 dc kernel: [28384.624838] [17617][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:519] FS[dm-119] txn[6627] COMMIT took 3421 ms committers=1 open=0ms blocked=3188ms Aug 29 13:41:25 dc kernel: [28384.624846] [17617][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:524] FS[dm-119] txn[6627] roo:0 rdr1:0 cbg:0 rdr2:0 Aug 29 13:41:25 dc kernel: [28384.624850] [17617][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:529] FS[dm-119] txn[6627] wc:0 wpc:0 wew:0 fps:0 Aug 29 13:41:25 dc kernel: [28384.624854] [17617][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:534] -FS[dm-119] txn[6627] ww:0 cs:0 rdi:0 rdr3:0 Aug 29 13:41:25 dc kernel: [28384.624858] [17617][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:538] -FS[dm-119] txn[6627] cfr:0 ccr:2088 pec:1099 Aug 29 13:41:25 dc kernel: [28384.624862] [17617][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:541] FS[dm-119] txn[6627] wrw:230 wrs:1 I have a breakdown of commit times here, to identify bottlenecks of the commit. Times are in ms. 
Names of phases are:

roo  - btrfs_run_ordered_operations
rdr1 - btrfs_run_delayed_refs (call 1)
cbg  - btrfs_create_pending_block_groups
rdr2 - btrfs_run_delayed_refs (call 2)
wc   - wait_for_commit (if was needed)
wpc  - wait for previous commit (if was needed)
wew  - wait for external writers to detach
fps  - flush_all_pending_stuffs
ww   - wait for all the other writers to detach
cs   - create_pending_snapshots
rdi  - btrfs_run_delayed_items
rdr3 - btrfs_run_delayed_refs (call 3)
cfr  - commit_fs_roots
ccr  - commit_cowonly_roots
pec  - btrfs_prepare_extent_commit
wrw  - btrfs_write_and_wait_transaction
wrs  - write_ctree_super

Two lines marked as - are the critical part of the commit.

original code, 2 caching workers, try 2

Aug 29 13:43:30 dc kernel: [28508.683625] [22490][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_STARTED:439] FS[dm-119] txn[6630] COMMIT extwr:0 wr:1
Aug 29 13:43:31 dc kernel: [28510.569269] [22490][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:519] FS[dm-119] txn[6630] COMMIT took 1885 ms committers=1 open=0ms blocked=1550ms
Aug 29 13:43:31 dc kernel: [28510.569276] [22490][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:524] FS[dm-119] txn[6630] roo:0 rdr1:0 cbg:0 rdr2:0
Aug 29 13:43:31 dc kernel: [28510.569281] [22490][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:529] FS[dm-119] txn[6630] wc:0 wpc:0 wew:0 fps:0
Aug 29 13:43:31 dc kernel: [28510.569285] [22490][tx]btrfs [ZBTRFS_TXN_COMMIT_PHASE_DONE:534] -FS[dm-119] txn[6630] ww:0 cs:0 rdi:0 rdr3:0
Aug 29 13:43:31 dc kernel: [28510.569288] [22490][tx]btrfs
Re: btrfs:async-thread: atomic_start_pending=1 is set, but it's too late
Thanks, Chris, Josef, for confirming!

On Thu, Aug 29, 2013 at 11:08 PM, Chris Mason clma...@fusionio.com wrote:

Quoting Josef Bacik (2013-08-29 16:03:06)

On Mon, Aug 26, 2013 at 05:16:42PM +0300, Alex Lyakas wrote:

Greetings all,
I see the following issue with spawning new threads for btrfs_workers that have atomic_worker_start set:

# I have BTRFS that has 24Gb of total metadata, out of which extent tree takes 11Gb; space_cache option is not used.
# After mounting, cache_block_group() triggers ~250 work items to cache-in the needed block groups.
# At this point, fs_info->caching_workers has one thread, which is considered idle.
# Work items start to add to this thread's pending list, until this thread becomes considered busy.
# Now workers->atomic_worker_start is set, but check_pending_worker_creates() has not run yet (it is called only from worker_loop), so the same single thread is picked as fallback. The problem is that this thread is still running the caching_thread function, scanning for EXTENT_ITEMs of the first block-group. This takes 3-4 seconds for a 1Gb block group.
# Once caching_thread() exits, check_pending_worker_creates() is called, and creates the second thread, but it's too late, because all the 250 work items are already sitting in the first thread's pending list. So the second thread doesn't help at all.

As a result, all block-group caching is performed by the same thread, which, due to one-by-one scanning of EXTENT_ITEMs, takes forever for this BTRFS.

How can this be fixed?
- can perhaps check_pending_worker_creates() be called out of worker_loop, e.g., by find_worker()? Instead of just setting workers->atomic_start_pending?
- maybe for fs_info->caching_workers we don't have to create new workers asynchronously, so we can pass NULL for async_helper in btrfs_init_workers()? (probably we have to, just checking)

So I looked at this, and I'm pretty sure we have an async_helper just because of copy+paste. "Hey I want a new async group, let me copy this other one and change the name!" So yes let's just pass NULL here. In fact the only cases that we should be using an async helper is for super critical areas, so I'm pretty sure _most_ of the cases that specify an async helper don't need to. Chris is this correct, or am I missing something? Thanks,

No, I think we can just turn off the async start here.

-chris
Re: [PATCH] Btrfs: stop all workers before cleaning up roots
Hi Josef,

On Thu, May 30, 2013 at 11:58 PM, Josef Bacik jba...@fusionio.com wrote:

Dave reported a panic because the extent_root->commit_root was NULL in the caching kthread. That is because we just unset it in free_root_pointers, which is not the correct thing to do, we have to either wait for the caching kthread to complete or hold the extent_commit_sem lock so we know the thread has exited. This patch makes the kthreads all stop first and then we do our cleanup. This should fix the race. Thanks,

Reported-by: David Sterba dste...@suse.cz
Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/disk-io.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2b53afd..77cb566 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3547,13 +3547,13 @@ int close_ctree(struct btrfs_root *root)

 	btrfs_free_block_groups(fs_info);

Do you think it would be safer to stop all workers first and make sure they are stopped, then do btrfs_free_block_groups()? I see, for example, that btrfs_free_block_groups() checks:

	if (block_group->cached == BTRFS_CACHE_STARTED)

which could perhaps be racy with other people spawning caching_threads. So maybe better to stop all threads (including cleaner and committer) and then free everything?

-	free_root_pointers(fs_info, 1);
+	btrfs_stop_all_workers(fs_info);

 	del_fs_roots(fs_info);

-	iput(fs_info->btree_inode);
+	free_root_pointers(fs_info, 1);

-	btrfs_stop_all_workers(fs_info);
+	iput(fs_info->btree_inode);

 #ifdef CONFIG_BTRFS_FS_CHECK_INTEGRITY
 	if (btrfs_test_opt(root, CHECK_INTEGRITY))
--
1.7.7.6

Alex.
Re: [PATCH] Btrfs: update drop progress before stopping snapshot dropping
Thanks for posting that patch, Josef.

On Mon, Jul 15, 2013 at 6:59 PM, Josef Bacik jba...@fusionio.com wrote:

Alex pointed out a problem and fix that exists in the drop one snapshot at a time patch. If we decide we need to exit for whatever reason (umount for example) we will just exit the snapshot dropping without updating the drop progress. So the next time we go to resume we will BUG_ON() because we can't find the extent we left off at because we never updated it. This patch fixes the problem.

Cc: sta...@vger.kernel.org
Reported-by: Alex Lyakas alex.bt...@zadarastorage.com
Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/extent-tree.c | 14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index bc00b24..8c204e1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7584,11 +7584,6 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 	wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);

 	while (1) {
-		if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
-			pr_debug("btrfs: drop snapshot early exit\n");
-			err = -EAGAIN;
-			goto out_end_trans;
-		}

 		ret = walk_down_tree(trans, root, path, wc);
 		if (ret < 0) {
@@ -7616,7 +7611,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 		}

 		BUG_ON(wc->level == 0);
-		if (btrfs_should_end_transaction(trans, tree_root)) {
+		if (btrfs_should_end_transaction(trans, tree_root) ||
+		    (!for_reloc && btrfs_need_cleaner_sleep(root))) {
 			ret = btrfs_update_root(trans, tree_root,
 						&root->root_key,
 						root_item);
@@ -7627,6 +7623,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 			}

 			btrfs_end_transaction_throttle(trans, tree_root);
+			if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
+				pr_debug("btrfs: drop snapshot early exit\n");
+				err = -EAGAIN;
+				goto out_free;
+			}
+
 			trans = btrfs_start_transaction(tree_root, 0);
 			if (IS_ERR(trans)) {
 				err = PTR_ERR(trans);
--
1.7.7.6
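Why the resume point has to be persisted before an early exit can be shown with a small toy model. The sketch below is purely illustrative: `toy_snapshot`, `toy_drop` and the flat extent array are hypothetical simplifications, and the `false` return merely stands in for the BUG_ON() that the real code hits when the extent it left off at cannot be found.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NEXTENTS 8

/* Hypothetical model of a snapshot being dropped extent by extent. */
struct toy_snapshot {
	bool present[TOY_NEXTENTS]; /* extents that have not been dropped yet */
	size_t drop_progress;       /* persisted resume point */
};

void toy_snapshot_init(struct toy_snapshot *s)
{
	for (size_t i = 0; i < TOY_NEXTENTS; i++)
		s->present[i] = true;
	s->drop_progress = 0;
}

/* Drop up to 'budget' extents starting at the persisted resume point.
 * Returns false if the resume point refers to an extent that is already
 * gone, which is the situation described in the commit message. */
bool toy_drop(struct toy_snapshot *s, size_t budget, bool save_progress_on_exit)
{
	size_t i = s->drop_progress;

	if (i < TOY_NEXTENTS && !s->present[i])
		return false;

	while (i < TOY_NEXTENTS && budget > 0) {
		s->present[i] = false;
		i++;
		budget--;
	}
	/* The fix: record how far we got before leaving early. */
	if (save_progress_on_exit || i == TOY_NEXTENTS)
		s->drop_progress = i;
	return true;
}

/* Interrupt a drop after 3 extents, then resume; reports whether resume worked. */
bool toy_interrupt_then_resume(bool save_progress_on_exit)
{
	struct toy_snapshot s;

	toy_snapshot_init(&s);
	toy_drop(&s, 3, save_progress_on_exit);
	return toy_drop(&s, TOY_NEXTENTS, save_progress_on_exit);
}
```

With `save_progress_on_exit` set to false the second call starts from a stale position and fails; with it set to true the drop resumes at the fourth extent and completes.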
Re: [PATCH] Btrfs: fix all callers of read_tree_block
Hi Josef, On Tue, Apr 23, 2013 at 9:20 PM, Josef Bacik jba...@fusionio.com wrote: We kept leaking extent buffers when mounting a broken file system and it turns out it's because not everybody uses read_tree_block properly. You need to check and make sure the extent_buffer is uptodate before you use it. This patch fixes everybody who calls read_tree_block directly to make sure they check that it is uptodate and free it and return an error if it is not. With this we no longer leak EB's when things go horribly wrong. Thanks, Signed-off-by: Josef Bacik jba...@fusionio.com --- fs/btrfs/backref.c | 10 -- fs/btrfs/ctree.c | 21 - fs/btrfs/disk-io.c | 19 +-- fs/btrfs/extent-tree.c |4 +++- fs/btrfs/relocation.c | 18 +++--- 5 files changed, 59 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 23e927b..04b5b30 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -423,7 +423,10 @@ static int __add_missing_keys(struct btrfs_fs_info *fs_info, BUG_ON(!ref-wanted_disk_byte); eb = read_tree_block(fs_info-tree_root, ref-wanted_disk_byte, fs_info-tree_root-leafsize, 0); - BUG_ON(!eb); + if (!eb || !extent_buffer_uptodate(eb)) { + free_extent_buffer(eb); + return -EIO; + } btrfs_tree_read_lock(eb); if (btrfs_header_level(eb) == 0) btrfs_item_key_to_cpu(eb, ref-key_for_search, 0); @@ -913,7 +916,10 @@ again: info_level); eb = read_tree_block(fs_info-extent_root, ref-parent, bsz, 0); - BUG_ON(!eb); + if (!eb || !extent_buffer_uptodate(eb)) { + free_extent_buffer(eb); + return -EIO; + } ret = find_extent_in_eb(eb, bytenr, *extent_item_pos, eie); ref-inode_list = eie; diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 566d99b..2bc3440 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1281,7 +1281,8 @@ get_old_root(struct btrfs_root *root, u64 time_seq) free_extent_buffer(eb_root); blocksize = btrfs_level_size(root, old_root-level); old = read_tree_block(root, logical, blocksize, 0); - if (!old) { + if (!old || 
!extent_buffer_uptodate(old)) { + free_extent_buffer(old); pr_warn(btrfs: failed to read tree block %llu from get_old_root\n, logical); WARN_ON(1); @@ -1526,8 +1527,10 @@ int btrfs_realloc_node(struct btrfs_trans_handle *trans, if (!cur) { cur = read_tree_block(root, blocknr, blocksize, gen); - if (!cur) + if (!cur || !extent_buffer_uptodate(cur)) { + free_extent_buffer(cur); return -EIO; + } } else if (!uptodate) { err = btrfs_read_buffer(cur, gen); if (err) { @@ -1692,6 +1695,8 @@ static noinline struct extent_buffer *read_node_slot(struct btrfs_root *root, struct extent_buffer *parent, int slot) { int level = btrfs_header_level(parent); + struct extent_buffer *eb; + if (slot 0) return NULL; if (slot = btrfs_header_nritems(parent)) @@ -1699,9 +1704,15 @@ static noinline struct extent_buffer *read_node_slot(struct btrfs_root *root, BUG_ON(level == 0); - return read_tree_block(root, btrfs_node_blockptr(parent, slot), - btrfs_level_size(root, level - 1), - btrfs_node_ptr_generation(parent, slot)); + eb = read_tree_block(root, btrfs_node_blockptr(parent, slot), +btrfs_level_size(root, level - 1), +btrfs_node_ptr_generation(parent, slot)); + if (eb !extent_buffer_uptodate(eb)) { + free_extent_buffer(eb); + eb = NULL; + } + + return eb; } /* diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index fb0e5c2..4605cc7 100644 ---
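The check-and-free pattern that the patch applies at each read_tree_block() call site can be sketched in isolation like this. It is a toy model only: `toy_eb`, `toy_read_block` and the reference counter are hypothetical stand-ins, and the assumption (taken from the commit message) is that a read can hand back a referenced buffer that is not uptodate, so the caller must drop that reference on the error path.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an extent buffer with a reference count. */
struct toy_eb {
	int refs;
	bool uptodate;
};

static int toy_live_refs;  /* buffers that still hold a reference */

/* Stand-in for read_tree_block(): returns a referenced buffer even when
 * the read failed, in which case the buffer is simply not uptodate. */
struct toy_eb *toy_read_block(struct toy_eb *backing, bool io_ok)
{
	backing->refs = 1;
	backing->uptodate = io_ok;
	toy_live_refs++;
	return backing;
}

/* Stand-in for free_extent_buffer(): tolerates NULL, drops one reference. */
void toy_free_eb(struct toy_eb *eb)
{
	if (!eb)
		return;
	if (--eb->refs == 0)
		toy_live_refs--;
}

/* Caller pattern from the patch: check both conditions, free, return -EIO. */
int toy_use_block(bool io_ok)
{
	struct toy_eb storage;
	struct toy_eb *eb = toy_read_block(&storage, io_ok);

	if (!eb || !eb->uptodate) {
		toy_free_eb(eb);
		return -EIO;
	}
	/* ... the buffer would be used here ... */
	toy_free_eb(eb);
	return 0;
}

int toy_outstanding_refs(void)
{
	return toy_live_refs;
}
```

Checking only for NULL would let the failed-read case fall through with a stale buffer, and returning without the free would leave a reference behind, which is the leak the commit message describes.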
Re: [PATCH] Btrfs: fix lock leak when resuming snapshot deletion
On Mon, Jul 15, 2013 at 7:43 PM, Josef Bacik jba...@fusionio.com wrote:

We aren't setting path->locks[level] when we resume a snapshot deletion which means we won't unlock the buffer when we free the path. This causes deadlocks if we happen to re-allocate the block before we've evicted the extent buffer from cache. Thanks,

Reported-by: Alex Lyakas alex.bt...@zadarastorage.com
Signed-off-by: Josef Bacik jba...@fusionio.com
---
 fs/btrfs/extent-tree.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8c204e1..997a5dd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7555,6 +7555,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 		while (1) {
 			btrfs_tree_lock(path->nodes[level]);
 			btrfs_set_lock_blocking(path->nodes[level]);
+			path->locks[level] = BTRFS_WRITE_LOCK_BLOCKING;

 			ret = btrfs_lookup_extent_info(trans, root,
 						path->nodes[level]->start,
@@ -7570,6 +7571,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
 				break;

 			btrfs_tree_unlock(path->nodes[level]);
+			path->locks[level] = 0;
 			WARN_ON(wc->refs[level] != 1);
 			level--;
 		}
--
1.7.7.6

Tested-by: Liran Strugano li...@zadarastorage.com
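The lock leak comes from bookkeeping: the path only unlocks levels it has recorded as locked. A minimal sketch of that relationship, using hypothetical `toy_path` and `toy_node` types rather than the real btrfs_path, might look like this. It assumes, as the commit message implies, that releasing a path consults the per-level lock record.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_LEVEL 8

/* Hypothetical stand-ins for a tree node and a path through the tree. */
struct toy_node {
	bool locked;
};

struct toy_path {
	struct toy_node *nodes[TOY_MAX_LEVEL];
	int locks[TOY_MAX_LEVEL];  /* nonzero means "this path holds the lock" */
};

/* Take the node lock; 'record' controls whether the path remembers it,
 * which is the line the patch adds. */
void toy_lock_level(struct toy_path *p, int level, bool record)
{
	p->nodes[level]->locked = true;
	if (record)
		p->locks[level] = 1;
}

/* Release only the locks the path knows about, then forget the nodes. */
void toy_release_path(struct toy_path *p)
{
	for (int i = 0; i < TOY_MAX_LEVEL; i++) {
		if (p->nodes[i] && p->locks[i]) {
			p->nodes[i]->locked = false;
			p->locks[i] = 0;
		}
		p->nodes[i] = NULL;
	}
}

/* Returns true if the node is still locked after the path is released. */
bool toy_lock_leaks(bool record)
{
	struct toy_node node = { .locked = false };
	struct toy_path path = { { NULL }, { 0 } };

	path.nodes[2] = &node;
	toy_lock_level(&path, 2, record);
	toy_release_path(&path);
	return node.locked;
}
```

Here `toy_lock_leaks(false)` reports a node that stays locked forever, while `toy_lock_leaks(true)` does not; in the real code the leftover lock is what a later re-allocation of the same block would deadlock on.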
Re: [PATCH v3] btrfs: clean snapshots one by one
Hi, On Thu, Jul 4, 2013 at 10:52 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Hi David, On Thu, Jul 4, 2013 at 8:03 PM, David Sterba dste...@suse.cz wrote: On Thu, Jul 04, 2013 at 06:29:23PM +0300, Alex Lyakas wrote: @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root, wc-reada_count = BTRFS_NODEPTRS_PER_BLOCK(root); while (1) { + if (!for_reloc btrfs_fs_closing(root-fs_info)) { + pr_debug(btrfs: drop snapshot early exit\n); + err = -EAGAIN; + goto out_end_trans; + } Here you exit the loop, but the drop_progress in the root item is incorrect. When the system is remounted, and snapshot deletion resumes, it seems that it tries to resume from the EXTENT_ITEM that does not exist anymore, and [1] shows that btrfs_lookup_extent_info() simply does not find the needed extent. So then I hit panic in walk_down_tree(): BUG: wc-refs[level - 1] == 0 I fixed it like follows: There is a place where btrfs_drop_snapshot() checks if it needs to detach from transaction and re-attach. So I moved the exit point there and the code is like this: if (btrfs_should_end_transaction(trans, tree_root) || (!for_reloc btrfs_need_cleaner_sleep(root))) { ret = btrfs_update_root(trans, tree_root, root-root_key, root_item); if (ret) { btrfs_abort_transaction(trans, tree_root, ret); err = ret; goto out_end_trans; } btrfs_end_transaction_throttle(trans, tree_root); if (!for_reloc btrfs_need_cleaner_sleep(root)) { err = -EAGAIN; goto out_free; } trans = btrfs_start_transaction(tree_root, 0); ... With this fix, I do not hit the panic, and snapshot deletion proceeds and completes alright after mount. Do you agree to my analysis or I am missing something? It seems that Josef's btrfs-next still has this issue (as does Chris's for-linus). Sound analysis and I agree with the fix. The clean-by-one patch has been merged into 3.10 so we need a stable fix for that. Thanks for confirming, David! 
From more testing, I have two more notes: # After applying the fix, whenever snapshot deletion is resumed after mount, and successfully completes, then I unmount again, and rmmod btrfs, linux complains about loosing few struct extent_buffer during kem_cache_delete(). So somewhere on that path: if (btrfs_disk_key_objectid(root_item-drop_progress) == 0) { ... } else { === HERE and later we perhaps somehow overwrite the contents of struct btrfs_path that is used in the whole function. Because at the end of the function we always do btrfs_free_path(), which inside does btrfs_release_path(). I was not able to determine where the leak happens, do you have any hint? No other activity happens in the system except the resumed snap deletion, and this problem only happens when resuming. I found where the memory leak happens. When we abort snapshot deletion in the middle, then this btrfs_root is basically left alone hanging in the air. It is out of the dead_roots already, so when del_fs_roots() is called during unmount, it will not free this root and its root-node (which is the one that triggers memory leak warning on kmem_cache_destroy) and perhaps other stuff too. The issue still exists in btrfs-next. 
Simplest fix I came up with was: diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d275681..52a2c54 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7468,6 +7468,7 @@ int btrfs_drop_snapshot(struct btrfs_root *root, int err = 0; int ret; int level; + bool root_freed = false; path = btrfs_alloc_path(); if (!path) { @@ -7641,6 +7642,8 @@ int btrfs_drop_snapshot(struct btrfs_root *root, free_extent_buffer(root-commit_root); btrfs_put_fs_root(root); } + root_freed = true; + out_end_trans: btrfs_end_transaction_throttle(trans, tree_root); out_free: @@ -7649,6 +7652,18 @@ out_free: out: if (err) btrfs_std_error(root-fs_info, err); + + /* +* If the root was not freed by any reason, this means that FS had +* a problem and will probably be unmounted soon. +* But we need to put the root back into the 'dead_roots' list, +* so that it will be properly freed during unmount. +*/ + if (!root_freed) { + WARN_ON(err == 0); + btrfs_add_dead_root(root); + } + return err
Re: [PATCH v3] btrfs: clean snapshots one by one
Hi David,
I believe this patch has the following problem:

On Tue, Mar 12, 2013 at 5:13 PM, David Sterba <dste...@suse.cz> wrote:
> Each time pick one dead root from the list and let the caller know if
> it's needed to continue. This should improve responsiveness during
> umount and balance which at some point waits for cleaning all currently
> queued dead roots.
>
> A new dead root is added to the end of the list, so the snapshots
> disappear in the order of deletion.
>
> The snapshot cleaning work is now done only from the cleaner thread and the
> others wake it if needed.
>
> Signed-off-by: David Sterba <dste...@suse.cz>
> ---
>
> v1,v2:
> * http://thread.gmane.org/gmane.comp.file-systems.btrfs/23212
>
> v2->v3:
> * remove run_again from btrfs_clean_one_deleted_snapshot and return 1
>   unconditionally
>
>  fs/btrfs/disk-io.c     | 10 ++--
>  fs/btrfs/extent-tree.c |  8 ++
>  fs/btrfs/relocation.c  |  3 --
>  fs/btrfs/transaction.c | 56 +++
>  fs/btrfs/transaction.h |  2 +-
>  5 files changed, 53 insertions(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 988b860..4de2351 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1690,15 +1690,19 @@ static int cleaner_kthread(void *arg)
>  	struct btrfs_root *root = arg;
>
>  	do {
> +		int again = 0;
> +
>  		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
> +		    down_read_trylock(&root->fs_info->sb->s_umount) &&
>  		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
>  			btrfs_run_delayed_iputs(root);
> -			btrfs_clean_old_snapshots(root);
> +			again = btrfs_clean_one_deleted_snapshot(root);
>  			mutex_unlock(&root->fs_info->cleaner_mutex);
>  			btrfs_run_defrag_inodes(root->fs_info);
> +			up_read(&root->fs_info->sb->s_umount);
>  		}
>
> -		if (!try_to_freeze()) {
> +		if (!try_to_freeze() && !again) {
>  			set_current_state(TASK_INTERRUPTIBLE);
>  			if (!kthread_should_stop())
>  				schedule();
> @@ -3403,8 +3407,8 @@ int btrfs_commit_super(struct btrfs_root *root)
>  	mutex_lock(&root->fs_info->cleaner_mutex);
>  	btrfs_run_delayed_iputs(root);
> -	btrfs_clean_old_snapshots(root);
>  	mutex_unlock(&root->fs_info->cleaner_mutex);
> +	wake_up_process(root->fs_info->cleaner_kthread);
>
>  	/* wait until ongoing cleanup work done */
>  	down_write(&root->fs_info->cleanup_work_sem);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 742b7a7..a08d0fe 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7263,6 +7263,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
>   * reference count by one. if update_ref is true, this function
>   * also make sure backrefs for the shared block and all lower level
>   * blocks are properly updated.
> + *
> + * If called with for_reloc == 0, may exit early with -EAGAIN
>   */
>  int btrfs_drop_snapshot(struct btrfs_root *root,
>  			 struct btrfs_block_rsv *block_rsv, int update_ref,
> @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>  	wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>
>  	while (1) {
> +		if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
> +			pr_debug("btrfs: drop snapshot early exit\n");
> +			err = -EAGAIN;
> +			goto out_end_trans;
> +		}

Here you exit the loop, but the drop_progress in the root item is
incorrect. When the system is remounted, and snapshot deletion
resumes, it seems that it tries to resume from the EXTENT_ITEM that
does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
simply does not find the needed extent.
So then I hit panic in walk_down_tree():
BUG: wc->refs[level - 1] == 0

I fixed it as follows:
There is a place where btrfs_drop_snapshot() checks if it needs to
detach from transaction and re-attach. So I moved the exit point there
and the code is like this:

	if (btrfs_should_end_transaction(trans, tree_root) ||
	    (!for_reloc && btrfs_need_cleaner_sleep(root))) {
		ret = btrfs_update_root(trans, tree_root,
					&root->root_key,
					root_item);
		if (ret) {
			btrfs_abort_transaction(trans, tree_root, ret);
			err = ret;
			goto out_end_trans;
		}

		btrfs_end_transaction_throttle(trans, tree_root);
		if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
			err = -EAGAIN;
Re: [PATCH v3] btrfs: clean snapshots one by one
Hi David,

On Thu, Jul 4, 2013 at 8:03 PM, David Sterba <dste...@suse.cz> wrote:
> On Thu, Jul 04, 2013 at 06:29:23PM +0300, Alex Lyakas wrote:
>>> @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>>>  	wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>>>
>>>  	while (1) {
>>> +		if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
>>> +			pr_debug("btrfs: drop snapshot early exit\n");
>>> +			err = -EAGAIN;
>>> +			goto out_end_trans;
>>> +		}
>> Here you exit the loop, but the drop_progress in the root item is
>> incorrect. When the system is remounted, and snapshot deletion
>> resumes, it seems that it tries to resume from the EXTENT_ITEM that
>> does not exist anymore, and [1] shows that btrfs_lookup_extent_info()
>> simply does not find the needed extent.
>> So then I hit panic in walk_down_tree():
>> BUG: wc->refs[level - 1] == 0
>>
>> I fixed it like follows:
>> There is a place where btrfs_drop_snapshot() checks if it needs to
>> detach from transaction and re-attach. So I moved the exit point there
>> and the code is like this:
>>
>> 	if (btrfs_should_end_transaction(trans, tree_root) ||
>> 	    (!for_reloc && btrfs_need_cleaner_sleep(root))) {
>> 		ret = btrfs_update_root(trans, tree_root,
>> 					&root->root_key,
>> 					root_item);
>> 		if (ret) {
>> 			btrfs_abort_transaction(trans, tree_root, ret);
>> 			err = ret;
>> 			goto out_end_trans;
>> 		}
>>
>> 		btrfs_end_transaction_throttle(trans, tree_root);
>> 		if (!for_reloc && btrfs_need_cleaner_sleep(root)) {
>> 			err = -EAGAIN;
>> 			goto out_free;
>> 		}
>> 		trans = btrfs_start_transaction(tree_root, 0);
>> 		...
>>
>> With this fix, I do not hit the panic, and snapshot deletion proceeds
>> and completes alright after mount.
>>
>> Do you agree to my analysis or I am missing something? It seems that
>> Josef's btrfs-next still has this issue (as does Chris's for-linus).
>
> Sound analysis and I agree with the fix. The clean-by-one patch has been
> merged into 3.10 so we need a stable fix for that.

Thanks for confirming, David!
From more testing, I have two more notes:

# After applying the fix, whenever snapshot deletion is resumed after
mount and completes successfully, and I then unmount again and rmmod
btrfs, Linux complains about losing a few struct extent_buffer objects
during kmem_cache_destroy().
So somewhere on that path:

if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) {
	...
} else {
	<=== HERE

and later we perhaps somehow overwrite the contents of struct
btrfs_path that is used in the whole function. Because at the end of
the function we always do btrfs_free_path(), which inside does
btrfs_release_path().
I was not able to determine where the leak happens, do you have any hint?
No other activity happens in the system except the resumed snap
deletion, and this problem only happens when resuming.

# This is for Josef: after I unmount the fs with ongoing snap deletion
(after applying my fix) and run the latest btrfsck, it complains a
lot about problems in the extent tree :(
But after I mount again, snap deletion resumes and then completes; then I
unmount, and btrfsck is happy again. So probably it does not account for
orphan roots properly?

David, will you provide a fixed patch, if possible?

Thanks!
Alex.

> thanks,
> david

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
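The resume problem discussed in this thread (an early exit must not leave the saved progress pointing at something that was already deleted) can be illustrated with a small userspace model. It is a sketch only: the structure, the persisted field standing in for the saved drop_progress, and all model_* names are assumptions made for illustration and do not reflect the actual btrfs data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_ITEMS 16

/* Hypothetical model of a resumable deletion walk. */
struct model_walk {
	int next;			/* in-memory position: next item to delete */
	int persisted;			/* last position saved to stable storage */
	int total;			/* number of items to delete */
	bool exists[MODEL_MAX_ITEMS];	/* which items are still present */
};

/*
 * Deletes items until everything is gone, or until 'stop_after' items have
 * been handled in this call. The checkpoint is saved before returning
 * -EAGAIN, which is the ordering the fix in this thread relies on.
 */
static int model_drop(struct model_walk *w, int stop_after)
{
	int done = 0;

	assert(w->total <= MODEL_MAX_ITEMS);
	while (w->next < w->total) {
		if (done == stop_after) {
			w->persisted = w->next;	/* save progress first */
			return -EAGAIN;		/* then exit early */
		}
		/* Resuming from a stale checkpoint would trip this check. */
		assert(w->exists[w->next]);
		w->exists[w->next] = false;
		w->next++;
		done++;
	}
	w->persisted = w->next;
	return 0;
}

/* Models a remount: the in-memory position is reloaded from the checkpoint. */
static void model_remount(struct model_walk *w)
{
	w->next = w->persisted;
}
```

If the early return skipped the line that saves the checkpoint, model_remount() would restart from an item that no longer exists, which is the userspace analogue of the failed extent lookup described above.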
Re: question about transaction-abort-related commits
On Sun, Jun 30, 2013 at 2:36 PM, Josef Bacik jba...@fusionio.com wrote: On Sun, Jun 30, 2013 at 11:29:14AM +0300, Alex Lyakas wrote: Hi Josef, On Wed, Jun 26, 2013 at 5:16 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Hi Josef, Can you please help me with another question. I am looking at your patch: Btrfs: fix chunk allocation error handling https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0448748849ef7c593be40e2c1404f7974bd3aac6 Here you changed the order of btrfs_make_block_group() vs btrfs_alloc_dev_extent(), because we could have allocated from the in-memory block group, before we have inserted the dev extent into a tree. However, with this fix, I hit the deadlock[1] of btrfs_alloc_dev_extent() that also wants to allocate a chunk and recursively calls do_chunk_alloc, but then is stuck on chunk_mutex. Was this patch: Btrfs: don't re-enter when allocating a chunk https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c6b305a89b1903d63652691ad5eb9f05aa0326b8 introduced to fix this deadlock? With these two patches (Btrfs: fix chunk allocation error handling and Btrfs: don't re-enter when allocating a chunk), I am hitting ENOSPC during metadata chunk allocation. Upon entry into do_chunk_alloc, I have only one METADATA block-group as follows: total_bytes=8388608 bytes_used=7938048 bytes_pinned=446464 bytes_reserved=4096 bytes_readonly=0 bytes_may_use=3362816 As we see bytes_used+bytes_pinned+bytes_reserved==total_bytes What happens next is that within __btrfs_alloc_chunk(): - find_free_dev_extent() finds a free extent (metadata policy is SINGLE) - btrfs_alloc_dev_extent() fails with ENOSPC (btrfs_make_block_group() is called after btrfs_alloc_dev_extent() with these patches). What should be done in such situation, when there is not enough METADATA to allocate a device extent item, but we still don't allow allocating from the newly-created METADATA block group? 
So I had a third patch that you are likely missing that makes sure we try and
allocate chunks sooner specifically for this case

96f1bb57771f71bf1d55d5031a1cf47908494330

and then Miao made it better I think with this

3c76cd84e0c0d3ceb094a1020f8c55c2417e18d3

Thanks,

Josef

Thank you Josef, I didn't realize that.

Alex.
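The space accounting quoted in this thread can be checked mechanically. The sketch below uses the counter values reported above; the struct and the helper name are hypothetical and only serve to show the arithmetic. bytes_may_use is left out because the equality stated in the thread does not include it.

```c
#include <assert.h>
#include <stdint.h>

/* Counters as quoted in the thread; the struct itself is illustrative. */
struct space_counters {
	uint64_t total_bytes;
	uint64_t bytes_used;
	uint64_t bytes_pinned;
	uint64_t bytes_reserved;
	uint64_t bytes_readonly;
};

/* Bytes that are not used, pinned, reserved or read-only. */
static uint64_t unaccounted_bytes(const struct space_counters *c)
{
	uint64_t taken = c->bytes_used + c->bytes_pinned +
			 c->bytes_reserved + c->bytes_readonly;

	assert(taken <= c->total_bytes);
	return c->total_bytes - taken;
}
```

With the reported values, 7938048 + 446464 + 4096 + 0 equals 8388608, so the helper returns 0: the single METADATA block group has no room left for the device extent item.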
Re: question about transaction-abort-related commits
Hi Josef, On Wed, Jun 26, 2013 at 5:16 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Hi Josef, Can you please help me with another question. I am looking at your patch: Btrfs: fix chunk allocation error handling https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=0448748849ef7c593be40e2c1404f7974bd3aac6 Here you changed the order of btrfs_make_block_group() vs btrfs_alloc_dev_extent(), because we could have allocated from the in-memory block group, before we have inserted the dev extent into a tree. However, with this fix, I hit the deadlock[1] of btrfs_alloc_dev_extent() that also wants to allocate a chunk and recursively calls do_chunk_alloc, but then is stuck on chunk_mutex. Was this patch: Btrfs: don't re-enter when allocating a chunk https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=c6b305a89b1903d63652691ad5eb9f05aa0326b8 introduced to fix this deadlock? With these two patches (Btrfs: fix chunk allocation error handling and Btrfs: don't re-enter when allocating a chunk), I am hitting ENOSPC during metadata chunk allocation. Upon entry into do_chunk_alloc, I have only one METADATA block-group as follows: total_bytes=8388608 bytes_used=7938048 bytes_pinned=446464 bytes_reserved=4096 bytes_readonly=0 bytes_may_use=3362816 As we see bytes_used+bytes_pinned+bytes_reserved==total_bytes What happens next is that within __btrfs_alloc_chunk(): - find_free_dev_extent() finds a free extent (metadata policy is SINGLE) - btrfs_alloc_dev_extent() fails with ENOSPC (btrfs_make_block_group() is called after btrfs_alloc_dev_extent() with these patches). What should be done in such situation, when there is not enough METADATA to allocate a device extent item, but we still don't allow allocating from the newly-created METADATA block group? Thanks, Alex. Thanks, Alex. 
[1] [a044e57d] do_chunk_alloc+0x8d/0x510 [btrfs] [a04532ad] find_free_extent+0x9cd/0xb90 [btrfs] [a0453510] btrfs_reserve_extent+0xa0/0x1b0 [btrfs] [a0453bc9] btrfs_alloc_free_block+0xf9/0x570 [btrfs] [a043d9e6] __btrfs_cow_block+0x126/0x500 [btrfs] [a043dfba] btrfs_cow_block+0x17a/0x230 [btrfs] [a04425b1] btrfs_search_slot+0x381/0x820 [btrfs] [a044463c] btrfs_insert_empty_items+0x7c/0x120 [btrfs] [a048f31b] btrfs_alloc_dev_extent+0x9b/0x1c0 [btrfs] [a048f9ca] __btrfs_alloc_chunk+0x58a/0x850 [btrfs] [a049239f] btrfs_alloc_chunk+0xbf/0x160 [btrfs] [a044e81b] do_chunk_alloc+0x32b/0x510 [btrfs] [a04532ad] find_free_extent+0x9cd/0xb90 [btrfs] [a0453510] btrfs_reserve_extent+0xa0/0x1b0 [btrfs] [a0453bc9] btrfs_alloc_free_block+0xf9/0x570 [btrfs] [a043d9e6] __btrfs_cow_block+0x126/0x500 [btrfs] [a043dfba] btrfs_cow_block+0x17a/0x230 [btrfs] [a0441613] push_leaf_right+0x133/0x1a0 [btrfs] [a0441d51] split_leaf+0x5e1/0x770 [btrfs] [a04429b5] btrfs_search_slot+0x785/0x820 [btrfs] [a0449c0e] lookup_inline_extent_backref+0x8e/0x5b0 [btrfs] [a044a193] insert_inline_extent_backref+0x63/0x130 [btrfs] [a044abaf] __btrfs_inc_extent_ref+0x9f/0x240 [btrfs] [a0451841] run_clustered_refs+0x971/0xd00 [btrfs] [a0455db0] btrfs_run_delayed_refs+0xd0/0x330 [btrfs] [a0467a87] __btrfs_end_transaction+0xf7/0x440 [btrfs] [a0467e20] btrfs_end_transaction+0x10/0x20 [btrfs] On Mon, Jun 24, 2013 at 9:56 PM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Thanks for commenting Josef. I hope your head will get better:) Actually, while re-looking at the code, I see that there are couple of goto cleanup;, before we ensure that all the writers have detached from the committing transaction. So Liu's code is still needed, looks like. Thanks, Alex. 
On Mon, Jun 24, 2013 at 7:24 PM, Josef Bacik jba...@fusionio.com wrote: On Sun, Jun 23, 2013 at 09:52:14PM +0300, Alex Lyakas wrote: Hello Josef, Liu, I am reviewing commits in the mainline tree: e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't committing just end the transaction if we error out (call end_transaction() instead of goto cleanup_transaction) - Josef f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after aborting a transaction (wait until all writers detach, before setting running_transaction to NULL) - Liu 66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on transaction waiting list (check if transaction was already removed from the transactions list) - Liu Josef, according to your fix, if the committer encounters a problem early, it just doesn't commit. Instead it aborts the transaction (possibly setting FS to read-only) and detaches from the transaction. So if this was the only committer (e.g., the transaction_kthread), then transaction commit will not happen at all. Is this what you intended? So
Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Hi Miao,

On Mon, Jun 17, 2013 at 4:51 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
> On sun, 16 Jun 2013 13:38:42 +0300, Alex Lyakas wrote:
>> Hi Miao,
>>
>> On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>> On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote:
>>>> I reviewed the code starting from:
>>>> 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during
>>>> the transaction commit
>>>> until
>>>> 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction()
>>>>
>>>> It looks very good. Let me check if I understand the fix correctly:
>>>> # When transaction starts to commit, we want to wait only for external
>>>> writers (those that did ATTACH/START/USERSPACE)
>>>> # We guarantee at this point that no new external writers will hop on
>>>> the committing transaction, by setting ->blocked state, so we only
>>>> wait for existing extwriters to detach from transaction
>>
>> I have a doubt about this point with your new code. Example:
>> Task1 - external writer
>> Task2 - transaction kthread
>>
>> Task1                                    Task2
>> |start_transaction(TRANS_START)          |
>> |-wait_current_trans(blocked=0, so it doesn't wait)
>> |                                        |
>> |-join_transaction()                     |
>> |--lock(trans_lock)                      |
>> |--can_join_transaction() YES            |
>> |                                        |-btrfs_commit_transaction()
>> |                                        |--blocked=1
>> |                                        |--in_commit=1
>> |                                        |--wait_event(extwriter== 0);
>> |                                        |--btrfs_flush_all_pending_stuffs()
>> |                                        |
>> |--extwriter_counter_inc()               |
>> |--unlock(trans_lock)                    |
>> |                                        | lock(trans_lock)
>> |                                        | trans_no_join=1
>>
>> Basically, the blocked/in_commit check is not synchronized with
>> joining a transaction. After checking blocked, the external writer
>> may proceed and join the transaction. Right before joining, it calls
>> can_join_transaction(). But this function checks in_commit flag under
>> fs_info->trans_lock. But btrfs_commit_transaction() sets this flag not
>> under trans_lock, but under commit_lock, so checking this flag is not
>> synchronized.
>>
>> Or maybe I am wrong, because btrfs_commit_transaction() locks and
>> unlocks trans_lock to check for previous transaction, so by accident
>> there is no problem, and above scenario cannot happen?
>
> Your analysis in the last section is right, so the right process is:
>
> Task1                                    Task2
> |start_transaction(TRANS_START)          |
> |-wait_current_trans(blocked=0, so it doesn't wait)
> |                                        |
> |-join_transaction()                     |
> |--lock(trans_lock)                      |
> |--can_join_transaction() YES            |
> |                                        |-btrfs_commit_transaction()
> |                                        |--blocked=1
> |                                        |--in_commit=1
> |--extwriter_counter_inc()               |
> |--unlock(trans_lock)                    |
> |                                        |--lock(trans_lock)
> |                                        |--...
> |                                        |--unlock(trans_lock)
> |                                        |--...
> |                                        |--wait_event(extwriter== 0);
> |                                        |--btrfs_flush_all_pending_stuffs()
>
> The problem you were worried about cannot happen.
>
> Anyway, it is not good that the blocked/in_commit check is not synchronized
> with joining a transaction. So I modified the related code in this patchset.

The four patches that we applied for the extwriters issue work very
well. They definitely solve the non-deterministic behavior while
waiting for the writers to detach. Thanks for addressing this issue.
One note is that the new behavior is perhaps less friendly to the
transaction join flow. With your fix, the committer unconditionally
sets trans_no_join and waits for old writers to detach. At this
point, new joins will block. Previously, the committer would find a
convenient spot in the join pattern, when all writers had detached
(although it was non-deterministic when this would happen). So perhaps
some compromise can be made, such as waiting up to 10 seconds for all
writers to detach and, if they have not, going ahead and setting
trans_no_join anyway.

Thanks for your help!
Alex.

> Miao
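The difference between waiting on all writers and waiting only on external writers, as discussed in this thread, can be shown with a deterministic, single-threaded model. This is a sketch under stated assumptions: the model_* and MODEL_* names are hypothetical, no locking is modelled, and the committer is counted in num_writers only, which is a simplification and not necessarily how the kernel accounts for it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical join types, loosely named after the TRANS_* types above. */
enum model_join_type {
	MODEL_TRANS_START,
	MODEL_TRANS_ATTACH,
	MODEL_TRANS_USERSPACE,
	MODEL_TRANS_JOIN,
};

struct model_trans {
	int num_writers;	/* every attached handle, committer included */
	int ext_writers;	/* only START/ATTACH/USERSPACE handles */
	bool blocked;		/* commit started: external writers must wait */
	bool no_join;		/* later stage: nobody may join any more */
};

static bool model_is_external(enum model_join_type type)
{
	return type != MODEL_TRANS_JOIN;
}

/* Returns true if the handle attached, false if the caller would wait. */
static bool model_join(struct model_trans *t, enum model_join_type type)
{
	if (t->no_join)
		return false;
	if (t->blocked && model_is_external(type))
		return false;
	t->num_writers++;
	if (model_is_external(type))
		t->ext_writers++;
	return true;
}

static void model_detach(struct model_trans *t, enum model_join_type type)
{
	assert(t->num_writers > 0);
	t->num_writers--;
	if (model_is_external(type)) {
		assert(t->ext_writers > 0);
		t->ext_writers--;
	}
}

/* Old-style condition: loop until only the committer is attached. */
static bool model_old_wait_done(const struct model_trans *t)
{
	return t->num_writers == 1;
}

/* New-style condition: wait only for the external writers. */
static bool model_new_wait_done(const struct model_trans *t)
{
	return t->ext_writers == 0;
}
```

In this model, a MODEL_TRANS_JOIN handle that attaches after blocked is set keeps model_old_wait_done() false but does not affect model_new_wait_done(), which matches the explanation of why the old do-while loop could be delayed by new joins.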
Re: question about transaction-abort-related commits
Thanks for commenting Josef. I hope your head will get better:) Actually, while re-looking at the code, I see that there are couple of goto cleanup;, before we ensure that all the writers have detached from the committing transaction. So Liu's code is still needed, looks like. Thanks, Alex. On Mon, Jun 24, 2013 at 7:24 PM, Josef Bacik jba...@fusionio.com wrote: On Sun, Jun 23, 2013 at 09:52:14PM +0300, Alex Lyakas wrote: Hello Josef, Liu, I am reviewing commits in the mainline tree: e4a2bcaca9643e7430207810653222fc5187f2be Btrfs: if we aren't committing just end the transaction if we error out (call end_transaction() instead of goto cleanup_transaction) - Josef f094ac32aba3a51c00e970a2ea029339af2ca048 Btrfs: fix NULL pointer after aborting a transaction (wait until all writers detach, before setting running_transaction to NULL) - Liu 66b6135b7cf741f6f44ba938b27583ea3b83bd12 Btrfs: avoid deadlock on transaction waiting list (check if transaction was already removed from the transactions list) - Liu Josef, according to your fix, if the committer encounters a problem early, it just doesn't commit. Instead it aborts the transaction (possibly setting FS to read-only) and detaches from the transaction. So if this was the only committer (e.g., the transaction_kthread), then transaction commit will not happen at all. Is this what you intended? So then the user will notice that FS went read-only, and she will unmount the FS, and transaction will be cleaned up in btrfs_error_commit_super()=btrfs_cleanup_transaction(), and not in cleanup_transaction() via btrfs_commit_transaction(). Is my understanding correct? Liu, it looks like after having Josef's fix, the above two commits of yours are not strictly needed, right? Because Josef's fix ensures that only the real committer will call cleanup_transaction(), so at this point there is only one writer attached to the transaction, which is the committer itself (your fixes doesn't hurt though). Is that correct? 
I've looked at the patches and I'm going to say yes with the caveat that I stopped thinking about it when my head started hurting :). Thanks, Josef
question about transaction-abort-related commits
Hello Josef, Liu,
I am reviewing commits in the mainline tree:

e4a2bcaca9643e7430207810653222fc5187f2be
Btrfs: if we aren't committing just end the transaction if we error out
(call end_transaction() instead of goto cleanup_transaction) - Josef

f094ac32aba3a51c00e970a2ea029339af2ca048
Btrfs: fix NULL pointer after aborting a transaction
(wait until all writers detach, before setting running_transaction to
NULL) - Liu

66b6135b7cf741f6f44ba938b27583ea3b83bd12
Btrfs: avoid deadlock on transaction waiting list
(check if transaction was already removed from the transactions list) - Liu

Josef, according to your fix, if the committer encounters a problem
early, it just doesn't commit. Instead it aborts the transaction
(possibly setting FS to read-only) and detaches from the transaction.
So if this was the only committer (e.g., the transaction_kthread),
then the transaction commit will not happen at all. Is this what you
intended? So then the user will notice that the FS went read-only and
will unmount it, and the transaction will be cleaned up in
btrfs_error_commit_super()=>btrfs_cleanup_transaction(), and not in
cleanup_transaction() via btrfs_commit_transaction(). Is my
understanding correct?

Liu, it looks like after having Josef's fix, the above two commits of
yours are not strictly needed, right? Because Josef's fix ensures that
only the real committer will call cleanup_transaction(), so at this
point there is only one writer attached to the transaction, which is
the committer itself (your fixes don't hurt, though). Is that correct?

Thanks for helping,
Alex.
Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Hi Miao, On Thu, Jun 13, 2013 at 6:08 AM, Miao Xie mi...@cn.fujitsu.com wrote: On wed, 12 Jun 2013 23:11:02 +0300, Alex Lyakas wrote: I reviewed the code starting from: 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during the transaction commit until 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction() It looks very good. Let me check if I understand the fix correctly: # When transaction starts to commit, we want to wait only for external writers (those that did ATTACH/START/USERSPACE) # We guarantee at this point that no new external writers will hop on the committing transaction, by setting -blocked state, so we only wait for existing extwriters to detach from transaction I have a doubt about this point with your new code. Example: Task1 - external writer Task2 - transaction kthread Task1 Task2 |start_transaction(TRANS_START) | |-wait_current_trans(blocked=0, so it doesn't wait) | |-join_transaction() | |--lock(trans_lock) | |--can_join_transaction() YES | | |-btrfs_commit_transaction() | |--blocked=1 | |--in_commit=1 | |--wait_event(extwriter== 0); | |--btrfs_flush_all_pending_stuffs() || |--extwriter_counter_inc() | |--unlock(trans_lock) | | | lock(trans_lock) | | trans_no_join=1 Basically, the blocked/in_commit check is not synchronized with joining a transaction. After checking blocked, the external writer may proceed and join the transaction. Right before joining, it calls can_join_transaction(). But this function checks in_commit flag under fs_info-trans_lock. But btrfs_commit_transaction() sets this flag not under trans_lock, but under commit_lock, so checking this flag is not synchronized. Or maybe I am wrong, because btrfs_commit_transaction() locks and unlocks trans_lock to check for previous transaction, so by accident there is no problem, and above scenario cannot happen? 
# We do not care at this point for TRANS_JOIN etc, we let them hop on if they want # When all external writers have detached, we flush their delalloc and then we prevent all the others to join (TRANS_JOIN etc) # Previously, we had the do-while loop, that intended to do the same, but it used num_writers, which counts both external writers and also TRANS_JOIN. So the loop was racy because new joins prevented it from completing. Is my understanding correct? Yes, you are right. I have some questions: # Why was the do-while loop needed? Can we just delete the do-while loop as it was before, call flush_all_pending stuffs(), then set trans_no_join and wait for all writers to detach? Is there some correctness problem here? Or we need to wait for external writers to detach before calling flush_all_pending_stuffs() one last time? The external writers will introduce pending works, we need flush them after they detach, otherwise we will forget to deal with them at the current transaction just like the following case: Task1 Task2 start_transaction commit_transaction flush_all_pending_stuffs add pending works end_transaction ... # Why TRANS_ATTACH is considered external writer? - at most cases, it is done by the users' operations. - if in_commit is set, we shouldn't start it, or the deadlock will happen. it is the same as TRANS_START/TRANS_USERSPACE. # Can I apply this fix to 3.8.x kernel (manually, of course)? Or some additional things are needed that are missing in this kernel? Yes, you can rebase it against 3.8.x kernel freely. Thanks Miao Thanks, Alex. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Hi Miao, On Thu, May 9, 2013 at 10:57 AM, Miao Xie mi...@cn.fujitsu.com wrote: Hi, Alex Could you try the following patchset? git://github.com/miaoxie/linux-btrfs.git trans-commit-improve I think it can avoid the problem you said below. Note: this patchset is against chris's for-linus branch. I reviewed the code starting from: 69aef69a1bc154 Btrfs: don't wait for all the writers circularly during the transaction commit until 2ce7935bf4cdf3 Btrfs: remove the time check in btrfs_commit_transaction() It looks very good. Let me check if I understand the fix correctly: # When transaction starts to commit, we want to wait only for external writers (those that did ATTACH/START/USERSPACE) # We guarantee at this point that no new external writers will hop on the committing transaction, by setting -blocked state, so we only wait for existing extwriters to detach from transaction # We do not care at this point for TRANS_JOIN etc, we let them hop on if they want # When all external writers have detached, we flush their delalloc and then we prevent all the others to join (TRANS_JOIN etc) # Previously, we had the do-while loop, that intended to do the same, but it used num_writers, which counts both external writers and also TRANS_JOIN. So the loop was racy because new joins prevented it from completing. Is my understanding correct? I have some questions: # Why was the do-while loop needed? Can we just delete the do-while loop as it was before, call flush_all_pending stuffs(), then set trans_no_join and wait for all writers to detach? Is there some correctness problem here? Or we need to wait for external writers to detach before calling flush_all_pending_stuffs() one last time? # Why TRANS_ATTACH is considered external writer? # Can I apply this fix to 3.8.x kernel (manually, of course)? Or some additional things are needed that are missing in this kernel? Thanks, Alex. 
Thanks Miao On Wed, 10 Apr 2013 21:45:43 +0300, Alex Lyakas wrote: Hi Miao, I attempted to fix the issue by not joining a transaction that has trans-in_commit set. I did something similar to what wait_current_trans() does, but I did: smp_rmb(); if (cur_trans cur_trans-in_commit) { ... wait_event(root-fs_info-transaction_wait, !cur_trans-blocked); ... I also had to change the order of setting in_commit and blocked in btrfs_commit_transaction: trans-transaction-blocked = 1; trans-transaction-in_commit = 1; smp_wmb(); to make sure that if in_commit is set, then blocked cannot be 0, because btrfs_commit_transaction haven't set it yet to 1. However, with this fix I observe two issues: # With large trees and heavy commits, join_transaction() is delayed sometimes by 1-3 seconds. This delays the host IO by too much. # With this fix, I think too many transactions happen. Basically with this fix, once transaction-in_commit is set, then I insist to open a new transaction and not to join the current one. It has some bad influence on host response times pattern, but I cannot exactly tell why is that. Did you have other fix in mind? Without the fix, I observe sometimes commits that take like 80 seconds, out of which like 50 seconds are spent in the do-while loop of btrfs_commit_transaction. Thanks, Alex. On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas alex.bt...@zadarastorage.com wrote: Hi Miao, On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie mi...@cn.fujitsu.com wrote: On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote: Hi Miao, I am seeing another issue. Your fix prevents from TRANS_START to get in the way of a committing transaction. But it does not prevent from TRANS_JOIN. 
On the other hand, btrfs_commit_transaction has the following loop:

do {
	// attempt to do some useful stuff and/or sleep
} while (atomic_read(&cur_trans->num_writers) > 1 ||
	 (should_grow && cur_trans->num_joined != joined));

What I see is basically that new writers join the transaction, while
btrfs_commit_transaction() does this loop. I see
cur_trans->num_writers decreasing, but then it increases, then
decreases etc. This can go for several seconds during heavy IO load.
There is nothing to prevent new TRANS_JOIN writers coming and joining
a transaction over and over, thus delaying transaction commit. The IO
path uses TRANS_JOIN; for example run_delalloc_nocow() does that.

Do you observe such behavior? Do you believe it's problematic?

I know this behavior, there is no problem with it, the latter code
will prevent from TRANS_JOIN.

	spin_lock(&root->fs_info->trans_lock);
	root->fs_info->trans_no_join = 1;
	spin_unlock(&root->fs_info->trans_lock);
	wait_event(cur_trans->writer_wait,
		   atomic_read(&cur_trans->num_writers) == 1);

Yes, this code prevents anybody from joining, but before
btrfs_commit_transaction() gets to this code, it may spend sometimes
10 seconds (in my tests) in the do-while loop, while
wait_block_group_cache_progress() waits forever in case of drive failure
Greetings all,
when testing drive failures, I occasionally hit the following hang:

# Block group is being cached-in by caching_thread()

# caching_thread() experiences an error, e.g., in btrfs_search_slot,
because of drive failure:

	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
	if (ret < 0)
		goto err;

# caching thread exits:

err:
	btrfs_free_path(path);
	up_read(&fs_info->extent_commit_sem);

	free_excluded_extents(extent_root, block_group);

	mutex_unlock(&caching_ctl->mutex);
out:
	wake_up(&caching_ctl->wait);

	put_caching_control(caching_ctl);
	btrfs_put_block_group(block_group);

However, wait_block_group_cache_progress() is still stuck in a stack like this:

[816ec509] schedule+0x29/0x70
[a044bd42] wait_block_group_cache_progress+0xe2/0x110 [btrfs]
[8107fc10] ? add_wait_queue+0x60/0x60
[8107fc10] ? add_wait_queue+0x60/0x60
[a04568d6] find_free_extent+0x306/0xb90 [btrfs]
[a04462ee] ? btrfs_search_slot+0x2fe/0x820 [btrfs]
[a0457200] btrfs_reserve_extent+0xa0/0x1b0 [btrfs]
...

because of:

	wait_event(caching_ctl->wait, block_group_cache_done(cache) ||
		   (cache->free_space_ctl->free_space >= num_bytes));

But cache->cached never becomes BTRFS_CACHE_FINISHED, and
cache->free_space_ctl->free_space will also not grow enough, so the
wait never finishes. At this point, the system totally hangs.

Same problem can happen with wait_block_group_cache_done().

I am thinking: can we add additional condition, like:

	wait_event(caching_ctl->wait,
		   test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state) ||
		   block_group_cache_done(cache) ||
		   (cache->free_space_ctl->free_space >= num_bytes));

So that when transaction aborts, FS is marked as bad, and then all
these waits will complete, so that the user can unmount?

Or some other way to fix this problem?

Thanks,
Alex.

P.S: should I open a bugzilla for this?
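The proposed wait condition can be looked at as a plain predicate, separate from the wait queue machinery. The sketch below is a userspace model only: the MODEL_* states, the struct and the function name are hypothetical, and the fs_error argument stands in for the filesystem error bit mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical caching states, loosely named after BTRFS_CACHE_*. */
enum model_cache_state {
	MODEL_CACHE_NO,
	MODEL_CACHE_STARTED,
	MODEL_CACHE_FINISHED,
};

/* Hypothetical stand-in for the block group being cached. */
struct model_block_group {
	enum model_cache_state cached;
	uint64_t free_space;
};

/*
 * True when a waiter should stop sleeping: either caching finished, enough
 * free space has been found, or the filesystem is in an error state and no
 * further progress can be expected.
 */
static bool model_wait_is_over(const struct model_block_group *bg,
			       uint64_t num_bytes, bool fs_error)
{
	assert(bg != NULL);
	return fs_error ||
	       bg->cached == MODEL_CACHE_FINISHED ||
	       bg->free_space >= num_bytes;
}
```

Without the fs_error term, a block group stuck in MODEL_CACHE_STARTED with too little free space never satisfies the predicate, which corresponds to the hang described above. A waiter that is released by the error term would still have to treat the block group as unusable.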
[no subject]
Hello all, I have the following unresponsive btrfs: btrfs_end_transaction() is called and is stuck in btrfs_tree_lock(): May 27 16:13:55 vc kernel: [ 7130.421159] kworker/u:85D 0 19859 2 0x May 27 16:13:55 vc kernel: [ 7130.421159] 880095335568 0046 00010093cb38 880083b11b48 May 27 16:13:55 vc kernel: [ 7130.421159] 880095335fd8 880095335fd8 880095335fd8 00013f40 May 27 16:13:55 vc kernel: [ 7130.421159] 8800a1fddd00 88008b1fc5c0 880095335578 880090f736d8 May 27 16:13:55 vc kernel: [ 7130.421159] Call Trace: May 27 16:13:55 vc kernel: [ 7130.421159] [816eb399] schedule+0x29/0x70 May 27 16:13:55 vc kernel: [ 7130.421159] [a03665ad] btrfs_tree_lock+0xcd/0x250 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [8107fcc0] ? add_wait_queue+0x60/0x60 May 27 16:13:55 vc kernel: [ 7130.421159] [a031d558] btrfs_init_new_buffer+0x68/0x140 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a031d70d] btrfs_alloc_free_block+0xdd/0x460 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [8113ff9b] ? __set_page_dirty_nobuffers+0x1b/0x20 May 27 16:13:55 vc kernel: [ 7130.421159] [a0327b2e] ? btree_set_page_dirty+0xe/0x10 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a0307756] __btrfs_cow_block+0x126/0x4f0 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a0307cc3] btrfs_cow_block+0x123/0x1d0 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a030c281] btrfs_search_slot+0x381/0x820 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a03138ce] lookup_inline_extent_backref+0x8e/0x5b0 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a032b6e9] ? btrfs_mark_buffer_dirty+0x99/0xf0 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a031301e] ? setup_inline_extent_backref+0x18e/0x290 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a0313e53] insert_inline_extent_backref+0x63/0x130 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a030677a] ? 
btrfs_alloc_path+0x1a/0x20 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a031486f] __btrfs_inc_extent_ref+0x9f/0x240 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a0377aa9] ? btrfs_merge_delayed_refs+0x289/0x300 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a031b3a1] run_clustered_refs+0x971/0xd00 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a030714d] ? btrfs_put_tree_mod_seq+0x10d/0x150 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a031f7f0] btrfs_run_delayed_refs+0xd0/0x320 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a0330bf7] __btrfs_end_transaction+0xf7/0x410 [btrfs] May 27 16:13:55 vc kernel: [ 7130.421159] [a0330f60] btrfs_end_transaction+0x10/0x20 [btrfs] As a result, transaction cannot commit, it waits for all writers to detach in the do-while loop. May 27 16:13:55 vc kernel: [ 7130.419009] btrfs-transacti D 0 15150 2 0x May 27 16:13:55 vc kernel: [ 7130.419012] 88009f86bce8 0046 032d032d May 27 16:13:55 vc kernel: [ 7130.419016] 88009f86bfd8 88009f86bfd8 88009f86bfd8 00013f40 May 27 16:13:55 vc kernel: [ 7130.419020] 8800af1e9740 8800a03f8000 0090 88009693cb00 May 27 16:13:55 vc kernel: [ 7130.419023] Call Trace: May 27 16:13:55 vc kernel: [ 7130.419027] [816eb399] schedule+0x29/0x70 May 27 16:13:55 vc kernel: [ 7130.419031] [816e9b1d] schedule_timeout+0x1ed/0x250 May 27 16:13:55 vc kernel: [ 7130.419055] [a03497a3] ? btrfs_run_ordered_operations+0x2b3/0x2e0 [btrfs] May 27 16:13:55 vc kernel: [ 7130.419060] [81045cd9] ? default_spin_lock_flags+0x9/0x10 May 27 16:13:55 vc kernel: [ 7130.419081] [a0330388] btrfs_commit_transaction+0x3b8/0xae0 [btrfs] May 27 16:13:55 vc kernel: [ 7130.419085] [8107fcc0] ? add_wait_queue+0x60/0x60 May 27 16:13:55 vc kernel: [ 7130.419104] [a0328525] transaction_kthread+0x1b5/0x230 [btrfs] May 27 16:13:55 vc kernel: [ 7130.419124] [a0328370] ? 
btree_invalidatepage+0x80/0x80 [btrfs] May 27 16:13:55 vc kernel: [ 7130.419128] [8107f0d0] kthread+0xc0/0xd0 May 27 16:13:55 vc kernel: [ 7130.419132] [8107f010] ? flush_kthread_worker+0xb0/0xb0 May 27 16:13:55 vc kernel: [ 7130.419136] [816f506c] ret_from_fork+0x7c/0xb0 May 27 16:13:55 vc kernel: [ 7130.419140] [8107f010] ? flush_kthread_worker+0xb0/0xb0 There is additional thread stuck in btrfs_tree_lock(), not sure how it is related, perhaps there's some deadlock between the two? May 27 16:13:55 vc kernel: [ 7130.421159] flush-btrfs-2 D 0001 0 18816 2 0x May 27 16:13:55 vc kernel: [ 7130.421159] 88008b553948 0046 880017991050 May 27 16:13:55 vc kernel: [ 7130.421159]
Re: [PATCH] Btrfs: clear received_uuid field for new writable snapshots
Hi Stefan,
I fully understand the first part of your fix, and I believe it's quite
critical. Indeed, a writable snapshot should have no evidence that it has
an ancestor that was once received.

Can you please confirm that I understand the second part of your fix
correctly? In btrfs-progs, the following code in tree_search() would have
prevented us from mistakenly selecting such a snapshot as a parent for
receive:
	if (type == subvol_search_by_received_uuid) {
		entry = rb_entry(n, struct subvol_info,
				 rb_received_node);
		comp = memcmp(entry->received_uuid, uuid,
			      BTRFS_UUID_SIZE);
		if (!comp) {
			if (entry->stransid < stransid)
				comp = -1;
			else if (entry->stransid > stransid)
				comp = 1;
			else
				comp = 0;
		}

The code checks not only received_uuid (which would have been wrongly
equal to what we need), but also stransid (which was the ctransid on the
send side), which would have been zero, so it wouldn't match.

Now, after your fix, the stransid field is no longer needed, correct?
If we have a valid received_uuid, this means that either we are the
received snapshot, or our whole chain of ancestors is read-only and
eventually there was an ancestor that was received. So we have valid data
and can be used as a parent. Is it still necessary to check the stransid
field after your fix? (It doesn't hurt to check it.)

Clearing or not clearing the rtransid: does it bring any value? rtransid
is the local transid at which we completed the receive process for this
snap. Is there any interesting usage of this value?

Thanks,
Alex.


On Wed, Apr 17, 2013 at 12:11 PM, Stefan Behrens
<sbehr...@giantdisaster.de> wrote:
> For created snapshots, the full root_item is copied from the source
> root and afterwards selectively modified. The current code forgets
> to clear the field received_uuid. The only problem is that it is
> confusing when you look at it with 'btrfs subv list', since for
> writable snapshots, the contents of the snapshot can be completely
> unrelated to the previously received snapshot.
> The receiver ignores such snapshots anyway because he also checks
> the field stransid in the root_item and that value used to be reset
> to zero for all created snapshots.
>
> This commit changes two things:
> - clear the received_uuid field for new writable snapshots.
> - don't clear the send/receive related information like the stransid
>   for read-only snapshots (which makes them useable as a parent for
>   the automatic selection of parents in the receive code).
>
> Signed-off-by: Stefan Behrens <sbehr...@giantdisaster.de>
> ---
>  fs/btrfs/transaction.c | 12
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index ffac232..94cbd10 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1170,13 +1170,17 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>  	memcpy(new_root_item->uuid, new_uuid.b, BTRFS_UUID_SIZE);
>  	memcpy(new_root_item->parent_uuid, root->root_item.uuid,
>  			BTRFS_UUID_SIZE);
> +	if (!(root_flags & BTRFS_ROOT_SUBVOL_RDONLY)) {
> +		memset(new_root_item->received_uuid, 0,
> +		       sizeof(new_root_item->received_uuid));
> +		memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
> +		memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
> +		btrfs_set_root_stransid(new_root_item, 0);
> +		btrfs_set_root_rtransid(new_root_item, 0);
> +	}
>  	new_root_item->otime.sec = cpu_to_le64(cur_time.tv_sec);
>  	new_root_item->otime.nsec = cpu_to_le32(cur_time.tv_nsec);
>  	btrfs_set_root_otransid(new_root_item, trans->transid);
> -	memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
> -	memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
> -	btrfs_set_root_stransid(new_root_item, 0);
> -	btrfs_set_root_rtransid(new_root_item, 0);
>
>  	old = btrfs_lock_root_node(root);
>  	ret = btrfs_cow_block(trans, root, old, NULL, 0, &old);
> --
> 1.8.2.1
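The two-level comparison discussed in this message (received_uuid first,
then stransid) can be tried out in isolation. The following is a minimal
sketch, assuming a simplified key; struct subvol_key and cmp_received()
are hypothetical names and not the btrfs-progs API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE 16	/* same size as BTRFS_UUID_SIZE */

/* Hypothetical, reduced form of the fields the search compares. */
struct subvol_key {
	uint8_t received_uuid[UUID_SIZE];
	uint64_t stransid;
};

/*
 * Order by received_uuid first; only when the UUIDs are equal does
 * stransid decide. Returns -1, 0 or 1.
 */
static int cmp_received(const struct subvol_key *entry,
			const uint8_t *uuid, uint64_t stransid)
{
	int comp;

	assert(entry != NULL && uuid != NULL);
	comp = memcmp(entry->received_uuid, uuid, UUID_SIZE);
	if (comp)
		return comp < 0 ? -1 : 1;
	if (entry->stransid < stransid)
		return -1;
	if (entry->stransid > stransid)
		return 1;
	return 0;
}
```

A writable snapshot that kept a stale received_uuid but had its stransid
reset to zero compares unequal to a search for the same UUID with a
non-zero stransid, which is the behaviour described above.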
Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Hi Miao,
I attempted to fix the issue by not joining a transaction that has
trans->in_commit set. I did something similar to what
wait_current_trans() does, but I did:

	smp_rmb();
	if (cur_trans && cur_trans->in_commit) {
		...
		wait_event(root->fs_info->transaction_wait,
			   !cur_trans->blocked);
		...

I also had to change the order of setting in_commit and blocked in
btrfs_commit_transaction:

	trans->transaction->blocked = 1;
	trans->transaction->in_commit = 1;
	smp_wmb();

to make sure that if in_commit is set, blocked can no longer be 0 merely
because btrfs_commit_transaction hasn't set it to 1 yet.

However, with this fix I observe two issues:
# With large trees and heavy commits, join_transaction() is sometimes
delayed by 1-3 seconds. This delays the host IO by too much.
# With this fix, I think too many transactions happen. Basically, once
transaction->in_commit is set, I insist on opening a new transaction
instead of joining the current one. This has a bad influence on the
pattern of host response times, but I cannot tell exactly why.

Did you have another fix in mind?

Without the fix, I sometimes observe commits that take around 80 seconds,
of which around 50 seconds are spent in the do-while loop of
btrfs_commit_transaction.

Thanks,
Alex.


On Mon, Mar 25, 2013 at 11:11 AM, Alex Lyakas
<alex.bt...@zadarastorage.com> wrote:
> Hi Miao,
>
> On Mon, Mar 25, 2013 at 3:51 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>> On Sun, 24 Mar 2013 13:13:22 +0200, Alex Lyakas wrote:
>>> Hi Miao,
>>> I am seeing another issue. Your fix prevents from TRANS_START to get
>>> in the way of a committing transaction. But it does not prevent from
>>> TRANS_JOIN. On the other hand, btrfs_commit_transaction has the
>>> following loop:
>>>
>>> do {
>>>     // attempt to do some useful stuff and/or sleep
>>> } while (atomic_read(&cur_trans->num_writers) > 1 ||
>>>          (should_grow && cur_trans->num_joined != joined));
>>>
>>> What I see is basically that new writers join the transaction, while
>>> btrfs_commit_transaction() does this loop. I see
>>> cur_trans->num_writers decreasing, but then it increases, then
>>> decreases etc. This can go for several seconds during heavy IO load.
>>> There is nothing to prevent new TRANS_JOIN writers coming and joining
>>> a transaction over and over, thus delaying transaction commit. The IO
>>> path uses TRANS_JOIN; for example run_delalloc_nocow() does that.
>>>
>>> Do you observe such behavior? Do you believe it's problematic?
>>
>> I know this behavior, there is no problem with it, the latter code
>> will prevent from TRANS_JOIN.
>>
>> 1672         spin_lock(&root->fs_info->trans_lock);
>> 1673         root->fs_info->trans_no_join = 1;
>> 1674         spin_unlock(&root->fs_info->trans_lock);
>> 1675         wait_event(cur_trans->writer_wait,
>> 1676                    atomic_read(&cur_trans->num_writers) == 1);
>
> Yes, this code prevents anybody from joining, but before
> btrfs_commit_transaction() gets to this code, it may spend sometimes
> 10 seconds (in my tests) in the do-while loop, while new writers come
> and go. Basically, it is not deterministic when the do-while loop will
> exit, it depends on the IO pattern.
>
>> And if we block the TRANS_JOIN at the place you point out, the deadlock
>> will happen because we need deal with the ordered operations which will
>> use TRANS_JOIN here.
>>
>> (I am dealing with the problem you said above by adding a new type of
>> TRANS_* now)
>
> Thanks.
> Alex.
>
>> Thanks
>> Miao
>>
>>> Thanks,
>>> Alex.
>>>
>>> On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>>> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>>>>> Hi Miao,
>>>>> can you please explain your solution a bit more.
>>>>>
>>>>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>>>>> Now btrfs_commit_transaction() does this
>>>>>>
>>>>>> ret = btrfs_run_ordered_operations(root, 0)
>>>>>>
>>>>>> which async flushes all inodes on the ordered operations list, it
>>>>>> introduced a deadlock that transaction-start task,
>>>>>> transaction-commit task and the flush workers waited for each other.
>>>>>> (See the following URL to get the detail
>>>>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>>>>
>>>>>> As we know, if ->in_commit is set, it means someone is committing
>>>>>> the current transaction, we should not try to join it if we are not
>>>>>> JOIN or JOIN_NOLOCK, wait is the best choice for it. In this way, we
>>>>>> can avoid the above problem. In this way, there is another benefit:
>>>>>> there is no new transaction handle to block the transaction which is
>>>>>> on the way of commit, once we set ->in_commit.
>>>>>>
>>>>>> Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
>>>>>> ---
>>>>>>  fs/btrfs/transaction.c | 17 -
>>>>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>>>>> index bc2f2d1..71b7e2e 100644
>>>>>> --- a/fs/btrfs/transaction.c
>>>>>> +++ b/fs/btrfs/transaction.c
>>>>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>>>>  	root->commit_root = btrfs_root_node(root);
>>>>>>  }
>>>>>>
>>>>>> +static inline int can_join_transaction(struct btrfs_transaction *trans
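The invariant described at the top of this message (if in_commit is
observed as set, blocked must already be set) can be sketched in user
space. This is a model under stated assumptions, not the kernel code: it
uses C11 release/acquire atomics instead of smp_wmb()/smp_rmb(), and
struct trans_model, trans_init(), commit_begin() and must_wait() are
hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two flags; not struct btrfs_transaction. */
struct trans_model {
	atomic_int blocked;
	atomic_int in_commit;
};

static void trans_init(struct trans_model *t)
{
	assert(t != NULL);
	atomic_init(&t->blocked, 0);
	atomic_init(&t->in_commit, 0);
}

/* Committer: set blocked first, then publish in_commit with release. */
static void commit_begin(struct trans_model *t)
{
	atomic_store_explicit(&t->blocked, 1, memory_order_relaxed);
	atomic_store_explicit(&t->in_commit, 1, memory_order_release);
}

/*
 * Joiner: returns true if it has to wait instead of joining. An acquire
 * load that observes in_commit == 1 also observes the earlier store to
 * blocked, so the assert below cannot fire.
 */
static bool must_wait(struct trans_model *t)
{
	if (!atomic_load_explicit(&t->in_commit, memory_order_acquire))
		return false;
	assert(atomic_load_explicit(&t->blocked, memory_order_relaxed) == 1);
	return true;
}
```

The point of the ordering is that the release store is the second of the
two stores; a barrier placed only after both stores would not by itself
order them against each other.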
Re: Backup Options
Hi David,
maybe my old patch
http://www.spinics.net/lists/linux-btrfs/msg19739.html
can help with this issue?

Thanks,
Alex.


On Wed, Apr 3, 2013 at 8:23 PM, David Sterba <dste...@suse.cz> wrote:
> On Wed, Apr 03, 2013 at 04:33:22AM +0200, Harald Glatt wrote:
>> However what I actually did was:
>> # cd /mnt/restore
>> # nc -l -p | btrfs receive .
>>
>> After noticing this difference I had to try it again as described in
>> my mail and - oh wonder - it works now!! Giving 'btrfs receive' a dot
>> as a parameter seems to fail in this case. Is this expected behavior
>> or a bug?
>
> Bug. Relative paths do not work on the receive side.
>
> david
Re: btrfs stuck on
Hi David,

On Fri, Mar 29, 2013 at 8:12 PM, David Sterba <dste...@suse.cz> wrote:
> On Thu, Mar 21, 2013 at 11:56:37AM -0700, Ask Bjørn Hansen wrote:
>> A few weeks ago I replaced a ZFS backup system with one backed by
>> btrfs. A script loops over a bunch of hosts rsyncing them to each
>> their own subvolume. After each rsync I snapshot the host-specific
>> subvolume.
>>
>> The disk is an iscsi disk that in my benchmarks performs roughly like
>> a local raid with 2-3 SATA disks.
>>
>> It worked fine for about a week (~150 snapshots from ~20 sub volumes)
>> before it suddenly exploded in disk io wait. Doing anything (in
>> particular changes) on the file system is just insanely slow, rsync
>> basically can't complete (an rsync that should take 10-20 minutes
>> takes 24 hours; I have a directory of 60k files I tried deleting and
>> it's deleting one file every few minutes, that sort of thing).
>
> I'm seeing similar problem after a test that produces tons of snapshots
> and snap deletions at the same time. Accessing the directory (eg. via
> ls) containing the snapshots blocks for a long time.
>
> The contention point is a mutex of the directory entry, used for
> lookups on the 'ls' side, and the snapshot deletion process holds the
> mutex as well with obvious consequences. The contention is multiplied
> by the number of snapshots waiting to be deleted and eagerly grabbing
> the mutex, making other waiters starve.

Can you please clarify which mutex you mean? Do you mean dir->i_mutex,
taken by btrfs_ioctl_snap_destroy()? If so, this mutex is held only
while adding a snapshot to the to-do deletion list, not during the
snapshot deletion itself. Otherwise, I don't see btrfs_drop_snapshot()
locking any mutex, for example.

> You've observed this as deletion progressing very slowly and rsync
> blocked. That's really annoying and I'm working towards fixing it.
>
>> I am using 3.8.2-206.fc18.x86_64 (Fedora 18). I tried rebooting, it
>> doesn't make a difference. As soon as I boot [btrfs-cleaner] and
>> [btrfs-transacti] gets really busy.
>>
>> I wonder if it's because I deleted a few snapshots at some point?
>
> Yes. The progress or performance impact depends on amount of data
> shared among the snapshots and used / free space fragmentation.
>
> david

Thanks,
Alex.
Re: [PATCH v2] Btrfs: fix locking on ROOT_REPLACE operations in tree mod log
Hi Jan,
I have manually applied this patch and also your previous patch onto
kernel 3.8.2, but, unfortunately, I am still hitting the issue :(
I will check further whether I can be more helpful in debugging this
issue than just reporting it :(

Thanks for your help,
Alex.


On Wed, Mar 20, 2013 at 3:49 PM, Jan Schmidt <list.bt...@jan-o-sch.net> wrote:
> To resolve backrefs, ROOT_REPLACE operations in the tree mod log are
> required to be tied to at least one KEY_REMOVE_WHILE_FREEING operation.
> Therefore, those operations must be enclosed by
> tree_mod_log_write_lock() and tree_mod_log_write_unlock() calls.
>
> Those calls are private to the tree_mod_log_* functions, which means
> that removal of the elements of an old root node must be logged from
> tree_mod_log_insert_root. This partly reverts and corrects commit
> ba1bfbd5 (Btrfs: fix a tree mod logging issue for root replacement
> operations).
>
> This fixes the brand-new version of xfstest 276 as of commit cfe73f71.
>
> Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
> ---
> Has probably been Reported-by: Alex Lyakas <alex.bt...@zadarastorage.com>
> (unconfirmed).
>
> Changes for v2:
> - use the correct base (current cmason/for-linus)
>
>  fs/btrfs/ctree.c | 30 --
>  1 files changed, 20 insertions(+), 10 deletions(-)
>
> diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> index ecd25a1..ca9d8f1 100644
> --- a/fs/btrfs/ctree.c
> +++ b/fs/btrfs/ctree.c
> @@ -651,6 +651,8 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
>  	if (tree_mod_dont_log(fs_info, NULL))
>  		return 0;
>
> +	__tree_mod_log_free_eb(fs_info, old_root);
> +
>  	ret = tree_mod_alloc(fs_info, flags, &tm);
>  	if (ret < 0)
>  		goto out;
> @@ -736,7 +738,7 @@ tree_mod_log_search(struct btrfs_fs_info *fs_info, u64 start, u64 min_seq)
>  static noinline void
>  tree_mod_log_eb_copy(struct btrfs_fs_info *fs_info, struct extent_buffer *dst,
>  		     struct extent_buffer *src, unsigned long dst_offset,
> -		     unsigned long src_offset, int nr_items)
> +		     unsigned long src_offset, int nr_items, int log_removal)
>  {
>  	int ret;
>  	int i;
> @@ -750,10 +752,12 @@ tree_mod_log_eb_copy(struct btrfs_fs_info *fs_info, struct extent_buffer *dst,
>  	}
>
>  	for (i = 0; i < nr_items; i++) {
> -		ret = tree_mod_log_insert_key_locked(fs_info, src,
> -						     i + src_offset,
> -						     MOD_LOG_KEY_REMOVE);
> -		BUG_ON(ret < 0);
> +		if (log_removal) {
> +			ret = tree_mod_log_insert_key_locked(fs_info, src,
> +					i + src_offset,
> +					MOD_LOG_KEY_REMOVE);
> +			BUG_ON(ret < 0);
> +		}
>  		ret = tree_mod_log_insert_key_locked(fs_info, dst,
>  						     i + dst_offset,
>  						     MOD_LOG_KEY_ADD);
> @@ -927,7 +931,6 @@ static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
>  			ret = btrfs_dec_ref(trans, root, buf, 1, 1);
>  			BUG_ON(ret); /* -ENOMEM */
>  		}
> -		tree_mod_log_free_eb(root->fs_info, buf);
>  		clean_tree_block(trans, root, buf);
>  		*last_ref = 1;
>  	}
> @@ -1046,6 +1049,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
>  		btrfs_set_node_ptr_generation(parent, parent_slot,
>  					      trans->transid);
>  		btrfs_mark_buffer_dirty(parent);
> +		tree_mod_log_free_eb(root->fs_info, buf);
>  		btrfs_free_tree_block(trans, root, buf, parent_start,
>  				      last_ref);
>  	}
> @@ -1750,7 +1754,6 @@ static noinline int balance_level(struct btrfs_trans_handle *trans,
>  			goto enospc;
>  		}
>
> -		tree_mod_log_free_eb(root->fs_info, root->node);
>  		tree_mod_log_set_root_pointer(root, child);
>  		rcu_assign_pointer(root->node, child);
>
> @@ -2995,7 +2998,7 @@ static int push_node_left(struct btrfs_trans_handle *trans,
>  		push_items = min(src_nritems - 8, push_items);
>
>  	tree_mod_log_eb_copy(root->fs_info, dst, src, dst_nritems, 0,
> -			     push_items);
> +			     push_items, 1);
>  	copy_extent_buffer(dst, src,
>  			   btrfs_node_key_ptr_offset(dst_nritems),
>  			   btrfs_node_key_ptr_offset(0),
> @@ -3066,7 +3069,7 @@ static int balance_node_right(struct btrfs_trans_handle *trans,
>  				      sizeof(struct btrfs_key_ptr));
>
>  	tree_mod_log_eb_copy(root->fs_info
Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Hi Miao,
I am seeing another issue. Your fix prevents TRANS_START from getting in
the way of a committing transaction, but it does not prevent TRANS_JOIN
from doing so. On the other hand, btrfs_commit_transaction has the
following loop:

	do {
		// attempt to do some useful stuff and/or sleep
	} while (atomic_read(&cur_trans->num_writers) > 1 ||
		 (should_grow && cur_trans->num_joined != joined));

What I see is basically that new writers join the transaction while
btrfs_commit_transaction() runs this loop. I see
cur_trans->num_writers decreasing, but then it increases, then decreases
again, and so on. This can go on for several seconds under heavy IO load.
There is nothing to prevent new TRANS_JOIN writers from coming and
joining a transaction over and over, thus delaying the transaction
commit. The IO path uses TRANS_JOIN; for example, run_delalloc_nocow()
does that.

Do you observe such behavior? Do you believe it's problematic?

Thanks,
Alex.


On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <mi...@cn.fujitsu.com> wrote:
> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> can you please explain your solution a bit more.
>>
>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>> Now btrfs_commit_transaction() does this
>>>
>>> ret = btrfs_run_ordered_operations(root, 0)
>>>
>>> which async flushes all inodes on the ordered operations list, it
>>> introduced a deadlock that transaction-start task, transaction-commit
>>> task and the flush workers waited for each other.
>>> (See the following URL to get the detail
>>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>
>>> As we know, if ->in_commit is set, it means someone is committing the
>>> current transaction, we should not try to join it if we are not JOIN
>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can
>>> avoid the above problem. In this way, there is another benefit: there
>>> is no new transaction handle to block the transaction which is on the
>>> way of commit, once we set ->in_commit.
>>>
>>> Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/transaction.c | 17 -
>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index bc2f2d1..71b7e2e 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>  	root->commit_root = btrfs_root_node(root);
>>>  }
>>>
>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>> +				       int type)
>>> +{
>>> +	return !(trans->in_commit &&
>>> +		 type != TRANS_JOIN &&
>>> +		 type != TRANS_JOIN_NOLOCK);
>>> +}
>>> +
>>>  /*
>>>   * either allocate a new transaction or hop into the existing one
>>>   */
>>> @@ -86,6 +94,10 @@ loop:
>>>  			spin_unlock(&fs_info->trans_lock);
>>>  			return cur_trans->aborted;
>>>  		}
>>> +		if (!can_join_transaction(cur_trans, type)) {
>>> +			spin_unlock(&fs_info->trans_lock);
>>> +			return -EBUSY;
>>> +		}
>>>  		atomic_inc(&cur_trans->use_count);
>>>  		atomic_inc(&cur_trans->num_writers);
>>>  		cur_trans->num_joined++;
>>> @@ -360,8 +372,11 @@ again:
>>>
>>>  	do {
>>>  		ret = join_transaction(root, type);
>>> -		if (ret == -EBUSY)
>>> +		if (ret == -EBUSY) {
>>>  			wait_current_trans(root);
>>> +			if (unlikely(type == TRANS_ATTACH))
>>> +				ret = -ENOENT;
>>> +		}
>>
>> So I understand that instead of incrementing num_writes and joining
>> the current transaction, you do not join and wait for the current
>> transaction to unblock.
>
> More specifically, TRANS_START, TRANS_USERSPACE and TRANS_ATTACH can
> not join and just wait for the current transaction to unblock if
> ->in_commit is set.
>
>> Which task in Josef's example
>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>> task 1, task 2 or task 3 is the one that will not join the
>> transaction, but instead wait?
>
> Task1 will not join the transaction, in this way, async inode flush
> won't run, and then task3 won't do anything.
>
Before applying the patch: Start/Attach_Trans_Task Commit_Task Flush_Worker (Task1) (Task2) (Task3) -- the name in Josef's example btrfs_start_transaction() |-may_wait_transaction() | (return 0) | btrfs_commit_transaction() | |-set -in_commit and | | blocked to 1 | |-wait writers to be 1 | | (writers is 1) |-join_transaction() | | (writers is 2) | |-btrfs_commit_transaction() | | |-set
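The join rule from the patch quoted in this message can be exercised
outside the kernel. The sketch below keeps the predicate from the patch
but everything around it is assumed for illustration: the enum values,
struct trans_model and join_model() are hypothetical stand-ins, not the
btrfs definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the TRANS_* types named in the thread. */
enum trans_type {
	TRANS_START,
	TRANS_JOIN,
	TRANS_USERSPACE,
	TRANS_JOIN_NOLOCK,
	TRANS_ATTACH,
};

/* Hypothetical, reduced transaction state. */
struct trans_model {
	bool in_commit;
	int num_writers;
};

/* Same predicate as in the patch: only JOIN types may join a commit. */
static bool can_join(const struct trans_model *t, enum trans_type type)
{
	return !(t->in_commit &&
		 type != TRANS_JOIN &&
		 type != TRANS_JOIN_NOLOCK);
}

/* Returns 0 and bumps num_writers on success, -EBUSY if refused. */
static int join_model(struct trans_model *t, enum trans_type type)
{
	assert(t != NULL);
	if (!can_join(t, type))
		return -EBUSY;
	t->num_writers++;
	return 0;
}
```

In this model TRANS_START and TRANS_ATTACH are turned away once
in_commit is set, while TRANS_JOIN still increments num_writers, which
is the behaviour the reply at the top of this message asks about.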
Re: [PATCH v2] btrfs: clean snapshots one by one
Hi David,

On Thu, Mar 7, 2013 at 1:55 PM, David Sterba <dste...@suse.cz> wrote:
> On Wed, Mar 06, 2013 at 10:12:11PM -0500, Chris Mason wrote:
>>>>> Also, I want to ask, hope this is not inappropriate. Do you also
>>>>> agree with Josef, that it's ok for BTRFS_IOC_SNAP_DESTROY not to
>>>>> commit the transaction, but just to detach from it? Had we
>>>>> committed, we would have ensured that ORPHAN_ITEM is in the root
>>>>> tree, thus preventing from subvol to re-appear after crash.
>>>>> It seems a little inconsistent with snap creation, where not only
>>>>> the transaction is committed, but delalloc flush is performed to
>>>>> ensure that all data is on disk before creating the snap.
>>>>
>>>> That's another question, can you please point me to the thread where
>>>> this was discussed?
>>>
>>> http://www.spinics.net/lists/linux-btrfs/msg22256.html
>>
>> That's a really old one. The original snapshot code expected people to
>> run sync first, but that's not very user friendly. The idea is that if
>> you write a file and then take a snapshot, that file should be in the
>> snapshot.
>
> The snapshot behaviour sounds ok to me.
>
> That a subvol/snapshot may appear after crash if transaction commit did
> not happen does not feel so good. We know that the subvol is only
> scheduled for deletion and needs to be processed by cleaner.
>
> From that point I'd rather see the commit to happen to avoid any
> unexpected surprises. A subvolume that re-appears still holds the data
> references and consumes space although the user does not assume that.
>
> Automated snapshotting and deleting needs some guarantees about the
> behaviour and what to do after a crash. So now it has to process the
> backlog of previously deleted snapshots and verify that they're not
> there, compared to "deleted -> will never appear, can forget about it".

Exactly. Currently, user space has no idea when the deletion will start
or when it is completed (it has to track the ROOT_ITEM, drop progress,
ORPHAN_ITEM etc.). That's why I was thinking that at least committing a
transaction on snap_destroy could ensure that the deletion will not be
reverted.

Thanks,
Alex.
Re: [PATCH v3] btrfs: clean snapshots one by one
On Tue, Mar 12, 2013 at 5:13 PM, David Sterba <dste...@suse.cz> wrote:
> Each time pick one dead root from the list and let the caller know if
> it's needed to continue. This should improve responsiveness during
> umount and balance which at some point waits for cleaning all currently
> queued dead roots.
>
> A new dead root is added to the end of the list, so the snapshots
> disappear in the order of deletion.
>
> The snapshot cleaning work is now done only from the cleaner thread and
> the others wake it if needed.
>
> Signed-off-by: David Sterba <dste...@suse.cz>
> ---
>
> v1,v2:
> * http://thread.gmane.org/gmane.comp.file-systems.btrfs/23212
>
> v2->v3:
> * remove run_again from btrfs_clean_one_deleted_snapshot and return 1
>   unconditionally
>
>  fs/btrfs/disk-io.c     | 10 ++--
>  fs/btrfs/extent-tree.c |  8 ++
>  fs/btrfs/relocation.c  |  3 --
>  fs/btrfs/transaction.c | 56 +++
>  fs/btrfs/transaction.h |  2 +-
>  5 files changed, 53 insertions(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 988b860..4de2351 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1690,15 +1690,19 @@ static int cleaner_kthread(void *arg)
>  	struct btrfs_root *root = arg;
>
>  	do {
> +		int again = 0;
> +
>  		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
> +		    down_read_trylock(&root->fs_info->sb->s_umount) &&
>  		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
>  			btrfs_run_delayed_iputs(root);
> -			btrfs_clean_old_snapshots(root);
> +			again = btrfs_clean_one_deleted_snapshot(root);
>  			mutex_unlock(&root->fs_info->cleaner_mutex);
>  			btrfs_run_defrag_inodes(root->fs_info);
> +			up_read(&root->fs_info->sb->s_umount);
>  		}
>
> -		if (!try_to_freeze()) {
> +		if (!try_to_freeze() && !again) {
>  			set_current_state(TASK_INTERRUPTIBLE);
>  			if (!kthread_should_stop())
>  				schedule();
> @@ -3403,8 +3407,8 @@ int btrfs_commit_super(struct btrfs_root *root)
>
>  	mutex_lock(&root->fs_info->cleaner_mutex);
>  	btrfs_run_delayed_iputs(root);
> -	btrfs_clean_old_snapshots(root);
>  	mutex_unlock(&root->fs_info->cleaner_mutex);
> +	wake_up_process(root->fs_info->cleaner_kthread);
>
>  	/* wait until ongoing cleanup work done */
>  	down_write(&root->fs_info->cleanup_work_sem);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 742b7a7..a08d0fe 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7263,6 +7263,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans,
>   * reference count by one. if update_ref is true, this function
>   * also make sure backrefs for the shared block and all lower level
>   * blocks are properly updated.
> + *
> + * If called with for_reloc == 0, may exit early with -EAGAIN
>   */
>  int btrfs_drop_snapshot(struct btrfs_root *root,
>  			 struct btrfs_block_rsv *block_rsv, int update_ref,
> @@ -7363,6 +7365,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
>  	wc->reada_count = BTRFS_NODEPTRS_PER_BLOCK(root);
>
>  	while (1) {
> +		if (!for_reloc && btrfs_fs_closing(root->fs_info)) {
> +			pr_debug("btrfs: drop snapshot early exit\n");
> +			err = -EAGAIN;
> +			goto out_end_trans;
> +		}
> +
>  		ret = walk_down_tree(trans, root, path, wc);
>  		if (ret < 0) {
>  			err = ret;
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 8445000..50deb9ed 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4148,10 +4148,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
>
>  	while (1) {
>  		mutex_lock(&fs_info->cleaner_mutex);
> -
> -		btrfs_clean_old_snapshots(fs_info->tree_root);
>  		ret = relocate_block_group(rc);
> -
>  		mutex_unlock(&fs_info->cleaner_mutex);
>  		if (ret < 0) {
>  			err = ret;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index a0467eb..a2781c3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -950,7 +950,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
>  int btrfs_add_dead_root(struct btrfs_root *root)
>  {
>  	spin_lock(&root->fs_info->trans_lock);
> -	list_add(&root->root_list, &root->fs_info->dead_roots);
> +	list_add_tail(&root->root_list, &root->fs_info->dead_roots);
>  	spin_unlock(&root->fs_info->trans_lock);
>  	return 0;
>  }
> @@ -1876,31 +1876,49 @@ cleanup_transaction:
>  }
>
>  /*
> - * interface function to delete all the snapshots we have scheduled for deletion
> + * return < 0 if error
> + * 0 if
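The "append at the tail, clean one root per call" scheme from the patch
description can be modelled with a small FIFO. This is a sketch under
stated assumptions: struct dead_roots, add_dead_root() and
clean_one_deleted_snapshot() are hypothetical user-space names, a fixed
ring buffer replaces the kernel list, and the return convention (0 when
nothing is queued, 1 after cleaning one root) is my reading of the v3
changelog rather than the actual function.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEAD_ROOTS 16

/* Hypothetical FIFO of dead root ids; the kernel uses a linked list. */
struct dead_roots {
	unsigned long ids[MAX_DEAD_ROOTS];
	size_t head;
	size_t count;
};

/* Append at the tail so roots are cleaned in the order of deletion. */
static int add_dead_root(struct dead_roots *q, unsigned long id)
{
	assert(q != NULL);
	if (q->count == MAX_DEAD_ROOTS)
		return -1;
	q->ids[(q->head + q->count) % MAX_DEAD_ROOTS] = id;
	q->count++;
	return 0;
}

/*
 * Clean at most one root per call. Returns 0 if nothing was queued,
 * 1 after cleaning one root, so the caller knows to call again.
 */
static int clean_one_deleted_snapshot(struct dead_roots *q,
				      unsigned long *cleaned)
{
	assert(q != NULL && cleaned != NULL);
	if (q->count == 0)
		return 0;
	*cleaned = q->ids[q->head];
	q->head = (q->head + 1) % MAX_DEAD_ROOTS;
	q->count--;
	return 1;
}
```

A cleaner loop built on this would keep calling while the return value
is 1 and go back to sleep on 0, which leaves room for umount or balance
to get in between two roots.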
Re: [PATCH] Btrfs: fix backref walking race with tree deletions
leaf 4214784, which is the leaf of subvolume 257.

The tree-dump I showed you is taken after the test failed, and at this
point if I try btrfs send, everything is resolved alright:

btrfs [find_extent_clone] Search [rt=257 ino=277 off=0 len=8192] [found extent=4386816 extent_item_pos=0]
btrfs [iterate_extent_inodes] resolving for extent 4386816 pos=0
btrfs [iterate_extent_inodes] extent 4386816 pos=0 found 2 leafs
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 262 references leaf 4431872
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 261 references leaf 4431872
btrfs [iterate_extent_inodes] extent 4386816 pos=0 root 257 references leaf 4214784

Can you advise on how to debug this further?

Thanks,
Alex.


On Thu, Feb 21, 2013 at 5:35 PM, Jan Schmidt <list.bt...@jan-o-sch.net> wrote:
> When a subvolume is removed, we remove the root item from the root
> tree, while the tree blocks and backrefs remain for a while. When
> backref walking comes across one of those orphan tree blocks, it can
> find a backref for a no longer existing root. This is all good, we only
> must tolerate __resolve_indirect_ref returning an error and continue
> with the good refs found.
>
> Reported-by: Alex Lyakas <alex.bt...@zadarastorage.com>
> Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
> ---
>  fs/btrfs/backref.c | 5 +
>  1 files changed, 1 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
> index 04edf69..bd605c8 100644
> --- a/fs/btrfs/backref.c
> +++ b/fs/btrfs/backref.c
> @@ -352,11 +352,8 @@ static int __resolve_indirect_refs(struct btrfs_fs_info *fs_info,
>  		err = __resolve_indirect_ref(fs_info, search_commit_root,
>  					     time_seq, ref, parents,
>  					     extent_item_pos);
> -		if (err) {
> -			if (ret == 0)
> -				ret = err;
> +		if (err)
>  			continue;
> -		}
>
>  		/* we put the first parent into the ref at hand */
>  		ULIST_ITER_INIT(&uiter);
> --
> 1.7.1
Re: same EXTENT_ITEM appears twice in the extent tree
So, no advice on how this could have happened?
Ok, maybe it won't happen again...

On Sun, Mar 3, 2013 at 5:44 PM, Alex Lyakas <alex.bt...@zadarastorage.com> wrote:
> Hi Chris,
>
> On Sun, Mar 3, 2013 at 5:28 PM, Chris Mason <chris.ma...@fusionio.com> wrote:
>> On Sun, Mar 03, 2013 at 06:40:50AM -0700, Alex Lyakas wrote:
>>> Greetings all,
>>> I have an extent tree that looks like follows:
>>>
>>> item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
>>> 	extent refs 1 gen 164 flags 1
>>> item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
>>> 	extent refs 1 gen 165 flags 1
>>> 	extent data backref root 257 objectid 257 offset 17446191104 count 1
>>> item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 itemsize 4
>>> 	shared data backref count 1
>>
>> Have you been experimenting on this FS with snapshot deletion patches?
>
> No, I haven't applied any patches on top of the commit I mentioned. (I
> presume you mean David's patch for one-by-one deletion). Since created,
> this FS has only seen straight IO with parallel snapshot creation and
> deletion. However, the kernel was crashing pretty frequently during
> this test, so I presume log replay was taking place.
>
> Any particular thing I can look for in the debug-tree output, except
> searching for more double-allocations?
>
> Thanks,
> Alex.
>
>> -chris
same EXTENT_ITEM appears twice in the extent tree
Greetings all,
I have an extent tree that looks as follows:

item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
	extent refs 1 gen 164 flags 1
item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
	extent refs 1 gen 165 flags 1
	extent data backref root 257 objectid 257 offset 17446191104 count 1
item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 itemsize 4
	shared data backref count 1

As can be seen, the same extent start (27059916800) appears in two EXTENT_ITEMs with different lengths. This went undetected until __btrfs_free_extent() was called, after the cleaner deleted one of the snapshots. It then led to this assert:

	if (found_extent) {
		BUG_ON(is_data && refs_to_drop !=
		       extent_data_ref_count(root, path, iref));
		if (iref) {
			BUG_ON(path->slots[0] != extent_slot);
		} else {
			BUG_ON(path->slots[0] != extent_slot + 1); /* CRASH */
			path->slots[0] = extent_slot;
			num_to_del = 2;
		}

As for the usage of this bad extent, there are multiple snapshots sharing the 98304-length extent, but only one that uses the 16384-length extent:

file tree key (257 ROOT_ITEM 0)
	item 19 key (257 EXTENT_DATA 17446191104) itemoff 2935 itemsize 53
		extent data disk byte 27059916800 nr 98304
		extent data offset 0 nr 98304 ram 98304
		extent compression 0
...
file tree key (350 ROOT_ITEM 164)
	item 21 key (257 EXTENT_DATA 17446191104) itemoff 2829 itemsize 53
		extent data disk byte 27059916800 nr 16384
		extent data offset 0 nr 16384 ram 16384
		extent compression 0
...
file tree key (352 ROOT_ITEM 167)
	item 19 key (257 EXTENT_DATA 17446191104) itemoff 2935 itemsize 53
		extent data disk byte 27059916800 nr 98304
		extent data offset 0 nr 98304 ram 98304
		extent compression 0

Kernel is for-linus, top commit:

commit 1eafa6c73791e4f312324ddad9cbcaf6a1b6052b
Author: Miao Xie <mi...@cn.fujitsu.com>
Date: Tue Jan 22 10:49:00 2013 +

    Btrfs: fix repeated delalloc work allocation

I believe I might have more extents like this, because btrfs-debug-tree warns:

warning, bad space info total_bytes 26851934208 used 26852773888
warning, bad space info total_bytes 27925676032 used 27926892544

Mount options: nodatasum,nodatacow,noatime,nospace_cache.
Metadata profile is DUP, data profile is single.

Can anybody advise on how this could have happened? I can provide the whole debug-tree, btrfs-image or any additional info.

Thanks,
Alex.
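For anyone hunting for more extents like the one reported above, the overlap check can be sketched in a few lines of C. This is only an illustrative sketch under stated assumptions: `struct extent_key` and `count_overlapping_extents()` are hypothetical names (not btrfs-progs or kernel API), and the keys are assumed to have been extracted from the btrfs-debug-tree output already and sorted by extent start, as they are in the extent tree.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an EXTENT_ITEM key. */
struct extent_key {
	uint64_t bytenr;    /* extent start (key objectid) */
	uint64_t num_bytes; /* extent length (key offset) */
};

/*
 * Count EXTENT_ITEMs that start inside a range already covered by an
 * earlier item. Assumes keys[] is sorted by bytenr. Two items with the
 * same start, as in the report above, count as one overlap.
 */
size_t count_overlapping_extents(const struct extent_key *keys, size_t n)
{
	size_t overlaps = 0;
	uint64_t max_end = 0; /* highest end offset seen so far */

	for (size_t i = 0; i < n; i++) {
		uint64_t end = keys[i].bytenr + keys[i].num_bytes;

		if (i > 0 && keys[i].bytenr < max_end)
			overlaps++;
		if (end > max_end)
			max_end = end;
	}
	return overlaps;
}
```

Fed with items 22 and 23 from the dump (start 27059916800, lengths 16384 and 98304), the function reports one overlap; a healthy extent tree should report none.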
Re: same EXTENT_ITEM appears twice in the extent tree
Hi Chris,

On Sun, Mar 3, 2013 at 5:28 PM, Chris Mason <chris.ma...@fusionio.com> wrote:
> On Sun, Mar 03, 2013 at 06:40:50AM -0700, Alex Lyakas wrote:
>> Greetings all,
>> I have an extent tree that looks like follows:
>>
>> item 22 key (27059916800 EXTENT_ITEM 16384) itemoff 2656 itemsize 24
>> 	extent refs 1 gen 164 flags 1
>> item 23 key (27059916800 EXTENT_ITEM 98304) itemoff 2603 itemsize 53
>> 	extent refs 1 gen 165 flags 1
>> 	extent data backref root 257 objectid 257 offset 17446191104 count 1
>> item 24 key (27059916800 SHARED_DATA_REF 47169536) itemoff 2599 itemsize 4
>> 	shared data backref count 1
>
> Have you been experimenting on this FS with snapshot deletion patches?

No, I haven't applied any patches on top of the commit I mentioned. (I presume you mean David's patch for one-by-one deletion). Since created, this FS has only seen straight IO with parallel snapshot creation and deletion. However, the kernel was crashing pretty frequently during this test, so I presume log replay was taking place.

Any particular thing I can look for in the debug-tree output, except searching for more double-allocations?

Thanks,
Alex.

> -chris
Re: basic questions regarding COW in Btrfs
Hi Josef,
I hope it's ok to piggyback on this thread for the following question:

I see that in the btrfs_cross_ref_exist() => check_committed_ref() path, there is the following check:

	if (btrfs_extent_generation(leaf, ei) <=
	    btrfs_root_last_snapshot(&root->root_item))
		goto out;

So this basically means that after we have taken a snap of a subvol, all of the subvol's extents must be COW'ed, even if we delete the snap a minute later. I wonder, why is that so? Is this because file extents can be shared indirectly, like when we create a snap, we only COW the root and only mark all of the root's *immediate* children shared in the extent tree?

Can the new backref walking code be used here to check more accurately whether the extent is shared by anybody else?

Thanks,
Alex.

On Mon, Feb 25, 2013 at 9:00 PM, Aastha Mehta <aasth...@gmail.com> wrote:
> Ah okay, I now see how it works. Thanks a lot for your response.
>
> Regards,
> Aastha.
>
> On 25 February 2013 18:27, Josef Bacik <jba...@fusionio.com> wrote:
>> On Mon, Feb 25, 2013 at 08:15:40AM -0700, Aastha Mehta wrote:
>>> Thanks again Josef.
>>>
>>> I understood that cow_file_range is called for a regular file. Just to
>>> clarify, in cow_file_range is cow done at the time of reserving
>>> extents in the extent btree for the io to be done in this delalloc? I
>>> see the following comment above find_free_extent() which is called
>>> while trying to reserve extents:
>>>
>>> /*
>>>  * walks the btree of allocated extents and find a hole of a given size.
>>>  * The key ins is changed to record the hole:
>>>  * ins->objectid == block start
>>>  * ins->flags = BTRFS_EXTENT_ITEM_KEY
>>>  * ins->offset == number of blocks
>>>  * Any available blocks before search_start are skipped.
>>>  */
>>>
>>> This seems to be the only place where a cow might be done, because a
>>> key is being inserted into an extent which modifies it.
>>
>> The key isn't inserted at this time, it's just returned with those values
>> for us to do as we please. There is no update of the btree until
>> insert_reserved_extent/btrfs_mark_extent_written in
>> btrfs_finish_ordered_io. Thanks,
>>
>> Josef
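To make the question about the generation check concrete, here is a minimal stand-alone model of that comparison. It is a sketch only: `extent_may_be_shared()` is a hypothetical helper name, and it captures just the single condition quoted from check_committed_ref(), not the rest of the kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model of the generation test in check_committed_ref():
 * an extent created at or before the root's last snapshot is treated
 * as possibly shared, so a nodatacow write to it falls back to COW.
 * The test is conservative: it stays true even after the snapshot
 * itself has been deleted, which is the behaviour being asked about.
 */
bool extent_may_be_shared(uint64_t extent_generation, uint64_t last_snapshot)
{
	return extent_generation <= last_snapshot;
}
```

With a last snapshot taken in generation 165, extents from generations 164 and 165 are reported as possibly shared, while an extent written in generation 166 is not.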
Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Hi Miao,
thanks for the great ASCII graphics and detailed explanation!

Alex.

On Mon, Feb 25, 2013 at 12:20 PM, Miao Xie <mi...@cn.fujitsu.com> wrote:
> On sun, 24 Feb 2013 21:49:55 +0200, Alex Lyakas wrote:
>> Hi Miao,
>> can you please explain your solution a bit more.
>>
>> On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
>>> Now btrfs_commit_transaction() does this
>>>
>>>     ret = btrfs_run_ordered_operations(root, 0)
>>>
>>> which async flushes all inodes on the ordered operations list, it introduced
>>> a deadlock that transaction-start task, transaction-commit task and the flush
>>> workers waited for each other.
>>> (See the following URL to get the detail
>>>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>>>
>>> As we know, if ->in_commit is set, it means someone is committing the
>>> current transaction, we should not try to join it if we are not JOIN
>>> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
>>> the above problem. In this way, there is another benefit: there is no new
>>> transaction handle to block the transaction which is on the way of commit,
>>> once we set ->in_commit.
>>>
>>> Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>>>  1 files changed, 16 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>>> index bc2f2d1..71b7e2e 100644
>>> --- a/fs/btrfs/transaction.c
>>> +++ b/fs/btrfs/transaction.c
>>> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>>>  	root->commit_root = btrfs_root_node(root);
>>>  }
>>>
>>> +static inline int can_join_transaction(struct btrfs_transaction *trans,
>>> +				       int type)
>>> +{
>>> +	return !(trans->in_commit &&
>>> +		 type != TRANS_JOIN &&
>>> +		 type != TRANS_JOIN_NOLOCK);
>>> +}
>>> +
>>>  /*
>>>   * either allocate a new transaction or hop into the existing one
>>>   */
>>> @@ -86,6 +94,10 @@ loop:
>>>  			spin_unlock(&fs_info->trans_lock);
>>>  			return cur_trans->aborted;
>>>  		}
>>> +		if (!can_join_transaction(cur_trans, type)) {
>>> +			spin_unlock(&fs_info->trans_lock);
>>> +			return -EBUSY;
>>> +		}
>>>  		atomic_inc(&cur_trans->use_count);
>>>  		atomic_inc(&cur_trans->num_writers);
>>>  		cur_trans->num_joined++;
>>> @@ -360,8 +372,11 @@ again:
>>>
>>>  	do {
>>>  		ret = join_transaction(root, type);
>>> -		if (ret == -EBUSY)
>>> +		if (ret == -EBUSY) {
>>>  			wait_current_trans(root);
>>> +			if (unlikely(type == TRANS_ATTACH))
>>> +				ret = -ENOENT;
>>> +		}
>>
>> So I understand that instead of incrementing num_writes and joining
>> the current transaction, you do not join and wait for the current
>> transaction to unblock.
>
> More specifically, TRANS_START, TRANS_USERSPACE and TRANS_ATTACH can not
> join and just wait for the current transaction to unblock if ->in_commit
> is set.
>
>> Which task in Josef's example
>> http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
>> task 1, task 2 or task 3 is the one that will not join the
>> transaction, but instead wait?
>
> Task1 will not join the transaction, in this way, async inode flush
> won't run, and then task3 won't do anything.
>
> Before applying the patch:
> Start/Attach_Trans_Task         Commit_Task                     Flush_Worker
> (Task1)                         (Task2)                         (Task3)   -- the name in Josef's example
> btrfs_start_transaction()
>  |->may_wait_transaction()
>  |  (return 0)
>  |                              btrfs_commit_transaction()
>  |                               |->set ->in_commit and
>  |                               |  blocked to 1
>  |                               |->wait writers to be 1
>  |                               |  (writers is 1)
>  |->join_transaction()           |
>  |  (writers is 2)               |
>  |->btrfs_commit_transaction()   |
>      |                           |->set trans_no_join to 1
>      |                           |  (close join transaction)
>      |->btrfs_run_ordered_operations
>         (Those ordered operations|
>          are added when releasing|
>          file)                   |
>          |->async inode flush()  |
>          |->wait_flush_comlete() |
>                                  |                              work_loop()
>                                  |                               |->run_work()
>                                  |                                   |->btrfs_join_transaction()
>                                  |                                       |->wait_current_trans()
>                                  |->wait writers to be 1
>
> This three tasks waited for each other.
>
> After applying
Re: [PATCH v2] btrfs: clean snapshots one by one
Hi David, On Fri, Mar 1, 2013 at 6:17 PM, David Sterba dste...@suse.cz wrote: Each time pick one dead root from the list and let the caller know if it's needed to continue. This should improve responsiveness during umount and balance which at some point wait for cleaning all currently queued dead roots. A new dead root is added to the end of the list, so the snapshots disappear in the order of deletion. Process snapshot cleaning is now done only from the cleaner thread and the others wake it if needed. Signed-off-by: David Sterba dste...@suse.cz --- v1-v2: - added s_umount trylock in cleaner thread - added exit into drop_snapshot if fs is going down patch based on cmason/integration fs/btrfs/disk-io.c | 10 ++-- fs/btrfs/extent-tree.c |8 ++ fs/btrfs/relocation.c |3 -- fs/btrfs/transaction.c | 57 fs/btrfs/transaction.h |2 +- 5 files changed, 54 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index eb7c143..cc85fc7 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1652,15 +1652,19 @@ static int cleaner_kthread(void *arg) struct btrfs_root *root = arg; do { + int again = 0; + if (!(root-fs_info-sb-s_flags MS_RDONLY) + down_read_trylock(root-fs_info-sb-s_umount) mutex_trylock(root-fs_info-cleaner_mutex)) { btrfs_run_delayed_iputs(root); - btrfs_clean_old_snapshots(root); + again = btrfs_clean_one_deleted_snapshot(root); mutex_unlock(root-fs_info-cleaner_mutex); btrfs_run_defrag_inodes(root-fs_info); + up_read(root-fs_info-sb-s_umount); } - if (!try_to_freeze()) { + if (!try_to_freeze() !again) { set_current_state(TASK_INTERRUPTIBLE); if (!kthread_should_stop()) schedule(); @@ -3338,8 +3342,8 @@ int btrfs_commit_super(struct btrfs_root *root) mutex_lock(root-fs_info-cleaner_mutex); btrfs_run_delayed_iputs(root); - btrfs_clean_old_snapshots(root); mutex_unlock(root-fs_info-cleaner_mutex); + wake_up_process(root-fs_info-cleaner_kthread); /* wait until ongoing cleanup work done */ down_write(root-fs_info-cleanup_work_sem); diff 
--git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d2b3a5e..0119ae7 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -7078,6 +7078,8 @@ static noinline int walk_up_tree(struct btrfs_trans_handle *trans, * reference count by one. if update_ref is true, this function * also make sure backrefs for the shared block and all lower level * blocks are properly updated. + * + * If called with for_reloc == 0, may exit early with -EAGAIN */ int btrfs_drop_snapshot(struct btrfs_root *root, struct btrfs_block_rsv *block_rsv, int update_ref, @@ -7179,6 +7181,12 @@ int btrfs_drop_snapshot(struct btrfs_root *root, wc-reada_count = BTRFS_NODEPTRS_PER_BLOCK(root); while (1) { + if (!for_reloc btrfs_fs_closing(root-fs_info)) { + pr_debug(btrfs: drop snapshot early exit\n); + err = -EAGAIN; + goto out_end_trans; + } + ret = walk_down_tree(trans, root, path, wc); if (ret 0) { err = ret; diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c index ba5a321..ab6a718 100644 --- a/fs/btrfs/relocation.c +++ b/fs/btrfs/relocation.c @@ -4060,10 +4060,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start) while (1) { mutex_lock(fs_info-cleaner_mutex); - - btrfs_clean_old_snapshots(fs_info-tree_root); ret = relocate_block_group(rc); - mutex_unlock(fs_info-cleaner_mutex); if (ret 0) { err = ret; diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index a83d486..6b233c15 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -950,7 +950,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans, int btrfs_add_dead_root(struct btrfs_root *root) { spin_lock(root-fs_info-trans_lock); - list_add(root-root_list, root-fs_info-dead_roots); + list_add_tail(root-root_list, root-fs_info-dead_roots); spin_unlock(root-fs_info-trans_lock); return 0; } @@ -1858,31 +1858,50 @@ cleanup_transaction: } /* - * interface function to delete all the snapshots we have scheduled for deletion + * return 0 if error + * 
0 if there are no more
Re: [PATCH 2/3] Btrfs: fix the deadlock between the transaction start/attach and commit
Hi Miao,
can you please explain your solution a bit more.

On Wed, Feb 20, 2013 at 11:16 AM, Miao Xie <mi...@cn.fujitsu.com> wrote:
> Now btrfs_commit_transaction() does this
>
>     ret = btrfs_run_ordered_operations(root, 0)
>
> which async flushes all inodes on the ordered operations list, it introduced
> a deadlock that transaction-start task, transaction-commit task and the flush
> workers waited for each other.
> (See the following URL to get the detail
>  http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
>
> As we know, if ->in_commit is set, it means someone is committing the
> current transaction, we should not try to join it if we are not JOIN
> or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
> the above problem. In this way, there is another benefit: there is no new
> transaction handle to block the transaction which is on the way of commit,
> once we set ->in_commit.
>
> Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
> ---
>  fs/btrfs/transaction.c |   17 ++++++++++++++++-
>  1 files changed, 16 insertions(+), 1 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index bc2f2d1..71b7e2e 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -51,6 +51,14 @@ static noinline void switch_commit_root(struct btrfs_root *root)
>  	root->commit_root = btrfs_root_node(root);
>  }
>
> +static inline int can_join_transaction(struct btrfs_transaction *trans,
> +				       int type)
> +{
> +	return !(trans->in_commit &&
> +		 type != TRANS_JOIN &&
> +		 type != TRANS_JOIN_NOLOCK);
> +}
> +
>  /*
>   * either allocate a new transaction or hop into the existing one
>   */
> @@ -86,6 +94,10 @@ loop:
>  			spin_unlock(&fs_info->trans_lock);
>  			return cur_trans->aborted;
>  		}
> +		if (!can_join_transaction(cur_trans, type)) {
> +			spin_unlock(&fs_info->trans_lock);
> +			return -EBUSY;
> +		}
>  		atomic_inc(&cur_trans->use_count);
>  		atomic_inc(&cur_trans->num_writers);
>  		cur_trans->num_joined++;
> @@ -360,8 +372,11 @@ again:
>
>  	do {
>  		ret = join_transaction(root, type);
> -		if (ret == -EBUSY)
> +		if (ret == -EBUSY) {
>  			wait_current_trans(root);
> +			if (unlikely(type == TRANS_ATTACH))
> +				ret = -ENOENT;
> +		}

So I understand that instead of incrementing num_writers and joining the current transaction, you do not join and wait for the current transaction to unblock.

Which task in Josef's example
http://marc.info/?l=linux-btrfs&m=136070705732646&w=2
(task 1, task 2 or task 3) is the one that will not join the transaction, but instead wait?

Also, I think I don't fully understand Josef's example. What is preventing the async flush from completing? Is task 3 waiting because trans_no_join is set? Is task 3 the one that actually does the delalloc flush?

Thanks,
Alex.

>  	} while (ret == -EBUSY);
>
>  	if (ret < 0) {
> --
> 1.6.5.2
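The join rule in the patch above can be tried out in user space. The following is a rough sketch, not the kernel code: `enum trans_type` and `struct transaction` are hypothetical stand-ins for the kernel's transaction type constants and struct btrfs_transaction, keeping only the in_commit flag that the predicate reads.

```c
/* Hypothetical stand-ins for the kernel's transaction type constants. */
enum trans_type {
	TRANS_START,
	TRANS_USERSPACE,
	TRANS_ATTACH,
	TRANS_JOIN,
	TRANS_JOIN_NOLOCK,
};

/* Only the field the predicate needs; the real struct has many more. */
struct transaction {
	int in_commit;
};

/*
 * Same boolean logic as can_join_transaction() in the patch: once a
 * commit is in progress, only JOIN and JOIN_NOLOCK may still join.
 * Every other type is refused (the caller then sees -EBUSY and waits).
 */
int can_join_transaction(const struct transaction *trans, enum trans_type type)
{
	return !(trans->in_commit &&
		 type != TRANS_JOIN &&
		 type != TRANS_JOIN_NOLOCK);
}
```

With in_commit set, a TRANS_START or TRANS_ATTACH caller is turned away while TRANS_JOIN still gets in; with in_commit clear, every type may join.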
Re: LAST CALL FOR BTRFS-NEXT
Hi Josef,
can you please consider including these two patches from Jan 28:
https://patchwork.kernel.org/patch/2057051/
https://patchwork.kernel.org/patch/2057071/

I realize they carry a V2 label although the cover letter said V3; this was my bad. However, they both apply to what you have now in btrfs-next.

Thanks,
Alex.

On Wed, Feb 20, 2013 at 5:12 PM, David Sterba <dste...@suse.cz> wrote:
> Please add this patch to next queue
>
> btrfs: limit fallocate extent reservation to 256MB
> https://patchwork.kernel.org/patch/1752311/
>
> Tested-by: David Sterba <dste...@suse.cz>
>
> thanks,
> david
Re: [RFC][PATCH] btrfs: clean snapshots one by one
Hi David,
thank you for addressing this issue.

On Mon, Feb 11, 2013 at 6:11 PM, David Sterba <dste...@suse.cz> wrote:
> Each time pick one dead root from the list and let the caller know if
> it's needed to continue. This should improve responsiveness during
> umount and balance which at some point wait for cleaning all currently
> queued dead roots.
>
> A new dead root is added to the end of the list, so the snapshots
> disappear in the order of deletion.
>
> Process snapshot cleaning is now done only from the cleaner thread and
> the others wake it if needed.

This is great.

> Signed-off-by: David Sterba <dste...@suse.cz>
> ---
>
> * btrfs_clean_old_snapshots is removed from the reloc loop, I don't know if this
>   is safe wrt reloc's assumptions
>
> * btrfs_run_delayed_iputs is left in place in super_commit, may get removed as
>   well because transaction commit calls it in the end
>
> * the responsiveness can be improved further if btrfs_drop_snapshot check
>   fs_closing, but this needs changes to error handling in the main reloc loop
>
>  fs/btrfs/disk-io.c |8 --
>  fs/btrfs/relocation.c |3 --
>  fs/btrfs/transaction.c | 57
>  fs/btrfs/transaction.h |2 +-
>  4 files changed, 44 insertions(+), 26 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 51bff86..6a02336 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1635,15 +1635,17 @@ static int cleaner_kthread(void *arg)
>  	struct btrfs_root *root = arg;
>
>  	do {
> +		int again = 0;
> +
>  		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
>  		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
>  			btrfs_run_delayed_iputs(root);
> -			btrfs_clean_old_snapshots(root);
> +			again = btrfs_clean_one_deleted_snapshot(root);
>  			mutex_unlock(&root->fs_info->cleaner_mutex);
>  			btrfs_run_defrag_inodes(root->fs_info);
>  		}
>
> -		if (!try_to_freeze()) {
> +		if (!try_to_freeze() && !again) {
>  			set_current_state(TASK_INTERRUPTIBLE);
>  			if (!kthread_should_stop())
>  				schedule();
> @@ -3301,8 +3303,8 @@ int btrfs_commit_super(struct btrfs_root *root)
>
>  	mutex_lock(&root->fs_info->cleaner_mutex);
>  	btrfs_run_delayed_iputs(root);
> -	btrfs_clean_old_snapshots(root);
>  	mutex_unlock(&root->fs_info->cleaner_mutex);
> +	wake_up_process(root->fs_info->cleaner_kthread);

I am probably missing something, but if the cleaner wakes up here, won't it attempt cleaning the next snap? Because I don't see the cleaner checking anywhere that we are unmounting. Or at this point dead_roots is supposed to be empty?

>
>  	/* wait until ongoing cleanup work done */
>  	down_write(&root->fs_info->cleanup_work_sem);
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index ba5a321..ab6a718 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4060,10 +4060,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)
>
>  	while (1) {
>  		mutex_lock(&fs_info->cleaner_mutex);
> -
> -		btrfs_clean_old_snapshots(fs_info->tree_root);
>  		ret = relocate_block_group(rc);
> -
>  		mutex_unlock(&fs_info->cleaner_mutex);
>  		if (ret < 0) {
>  			err = ret;
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 361fb7d..f1e3606 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -895,7 +895,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
>  int btrfs_add_dead_root(struct btrfs_root *root)
>  {
>  	spin_lock(&root->fs_info->trans_lock);
> -	list_add(&root->root_list, &root->fs_info->dead_roots);
> +	list_add_tail(&root->root_list, &root->fs_info->dead_roots);
>  	spin_unlock(&root->fs_info->trans_lock);
>  	return 0;
>  }
> @@ -1783,31 +1783,50 @@ cleanup_transaction:
>  }
>
>  /*
> - * interface function to delete all the snapshots we have scheduled for deletion
> + * return < 0 if error
> + * 0 if there are no more dead_roots at the time of call
> + * 1 there are more to be processed, call me again
> + *
> + * The return value indicates there are certainly more snapshots to delete, but
> + * if there comes a new one during processing, it may return 0. We don't mind,
> + * because btrfs_commit_super will poke cleaner thread and it will process it a
> + * few seconds later.
>   */
> -int btrfs_clean_old_snapshots(struct btrfs_root *root)
> +int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root)
>  {
> -	LIST_HEAD(list);
> +	int ret;
> +	int run_again = 1;
>  	struct btrfs_fs_info *fs_info = root->fs_info;
>
> +	if (root->fs_info->sb->s_flags & MS_RDONLY) {
> +		pr_debug(G "btrfs: cleaner called for RO fs!\n");
> +
Re: Deleted subvolume reappears and other cleaner issues
Thanks for your comments, Josef.

Another thing that confuses me is that there are some cases in which btrfs_drop_snapshot() hits a failure but still returns 0, for example if btrfs_del_root() fails. (For cases when btrfs_drop_snapshot() returns non-zero there is a BUG_ON.) In this case, for me, __btrfs_abort_transaction() sees that trans->blocks_used == 0, so it doesn't call __btrfs_std_error(), which would otherwise force the filesystem to become RO. After that, btrfs_drop_snapshot() completes successfully and, basically, nobody will retry the subvol deletion.

In addition, in this case the machine completely freezes for me after a couple of seconds. I have not yet managed to determine why.

Thanks,
Alex.

On Wed, Feb 6, 2013 at 5:14 PM, Josef Bacik <jba...@fusionio.com> wrote:
> On Thu, Jan 31, 2013 at 06:03:06AM -0700, Alex Lyakas wrote:
>> Hi,
>> I want to check if any of the below issues are worth/should be fixed:
>>
>> # btrfs_ioctl_snap_destroy() does not commit a transaction. As a
>> result, user can ask to delete a subvol, he receives ok back. Even if
>> user does btrfs sub list, he will not see the deleted subvol (even
>> though the transaction was not committed yet). But if a crash happens,
>> ORPHAN_ITEM will not re-appear after crash. So after crash, the
>> subvolume still exists perfectly fine (happened couple of times here).
>
> Same thing happens to normal unlinks, I don't see a reason to have
> different rules for subvols.
>
>> # btrfs_drop_snapshot() does not commit a transaction after
>> btrfs_del_orphan_item(). So if the subvol deletion completed in one go
>> (did not have to detach and re-attach to transaction, thus committing
>> the ORPHAN_ITEM and drop_progress/level), then after crash ORPHAN_ITEM
>> will not be in the tree, and subvolume still exists.
>
> Again same thing happens with normal files.
>
>> # btrfs_drop_snapshot() checks btrfs_should_end_transaction(), and
>> then does btrfs_end_transaction_throttle() and
>> btrfs_start_transaction(). However, it looks like it can rejoin the
>> same transaction if transaction was not blocked yet. Minor issue,
>> perhaps?
>
> No if we didn't block then its ok and we wait longer, we only throttle to
> give the transaction stuff a chance to commit, so if the join logic
> decides its ok to go on then we're good.
>
>> # umount may get delayed because of pending-for-deletion subvolumes:
>> btrfs_commit_super() locks the cleaner_mutex, so it will wait for the
>> cleaner to complete.
>> On the other hand, cleaner will not give up until it completes
>> processing all its splice. If currently cleaner is not running, then
>> btrfs_commit_super() calls btrfs_clean_old_snapshots() directly.
>> So does it make sense:
>> - btrfs_commit_super() will not call btrfs_clean_old_snapshots()
>> - close_ctree() calls kthread_stop(cleaner_kthread) early, and cleaner
>> thread periodically checks if it needs to exit
>
> I don't quite follow this, but sure? Thanks,
>
> Josef
Re: [PATCH 2/2] Btrfs: fix memory leak of pending_snapshot->inherit
Arne, Miao, I also agree that it is better to move this responsibility to create_pending_snapshot(). Alex. On Thu, Feb 7, 2013 at 10:43 AM, Arne Jansen sensi...@gmx.net wrote: On 02/07/13 07:02, Miao Xie wrote: The argument inherit of btrfs_ioctl_snap_create_transid() was assigned to NULL during we created the snapshots, so we didn't free it though we called kfree() in the caller. But since we are sure the snapshot creation is done after the function - btrfs_ioctl_snap_create_transid() - completes, it is safe that we don't assign the pointer inherit to NULL, and just free it in the caller of btrfs_ioctl_snap_create_transid(). In this way, the code can become more readable. NAK. The snapshot creation is triggered from btrfs_commit_transaction, I don't want to implicitly rely on commit_transaction being called for each snapshot created. I'm not even sure the async path really commits the transaction. The responsibility for the creation is passed to the pending_snapshot data structure, and so should the responsibility for the inherit struct. -Arne Reported-by: Alex Lyakas alex.bt...@zadarastorage.com Cc: Arne Jansen sensi...@gmx.net Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/ioctl.c | 18 +++--- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 02d3035..40f2fbf 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -367,7 +367,7 @@ static noinline int create_subvol(struct btrfs_root *root, struct dentry *dentry, char *name, int namelen, u64 *async_transid, - struct btrfs_qgroup_inherit **inherit) + struct btrfs_qgroup_inherit *inherit) { struct btrfs_trans_handle *trans; struct btrfs_key key; @@ -401,8 +401,7 @@ static noinline int create_subvol(struct btrfs_root *root, if (IS_ERR(trans)) return PTR_ERR(trans); - ret = btrfs_qgroup_inherit(trans, root-fs_info, 0, objectid, -inherit ? 
*inherit : NULL); + ret = btrfs_qgroup_inherit(trans, root-fs_info, 0, objectid, inherit); if (ret) goto fail; @@ -530,7 +529,7 @@ fail: static int create_snapshot(struct btrfs_root *root, struct dentry *dentry, char *name, int namelen, u64 *async_transid, -bool readonly, struct btrfs_qgroup_inherit **inherit) +bool readonly, struct btrfs_qgroup_inherit *inherit) { struct inode *inode; struct btrfs_pending_snapshot *pending_snapshot; @@ -549,10 +548,7 @@ static int create_snapshot(struct btrfs_root *root, struct dentry *dentry, pending_snapshot-dentry = dentry; pending_snapshot-root = root; pending_snapshot-readonly = readonly; - if (inherit) { - pending_snapshot-inherit = *inherit; - *inherit = NULL;/* take responsibility to free it */ - } + pending_snapshot-inherit = inherit; trans = btrfs_start_transaction(root-fs_info-extent_root, 6); if (IS_ERR(trans)) { @@ -692,7 +688,7 @@ static noinline int btrfs_mksubvol(struct path *parent, char *name, int namelen, struct btrfs_root *snap_src, u64 *async_transid, bool readonly, -struct btrfs_qgroup_inherit **inherit) +struct btrfs_qgroup_inherit *inherit) { struct inode *dir = parent-dentry-d_inode; struct dentry *dentry; @@ -1454,7 +1450,7 @@ out: static noinline int btrfs_ioctl_snap_create_transid(struct file *file, char *name, unsigned long fd, int subvol, u64 *transid, bool readonly, - struct btrfs_qgroup_inherit **inherit) + struct btrfs_qgroup_inherit *inherit) { int namelen; int ret = 0; @@ -1563,7 +1559,7 @@ static noinline int btrfs_ioctl_snap_create_v2(struct file *file, ret = btrfs_ioctl_snap_create_transid(file, vol_args-name, vol_args-fd, subvol, ptr, - readonly, inherit); + readonly, inherit); if (ret == 0 ptr copy_to_user(arg + -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Deleted subvolume reappears and other cleaner issues
Can anyone please comment on below?

On Thu, Jan 31, 2013 at 3:03 PM, Alex Lyakas <alex.bt...@zadarastorage.com> wrote:
> Hi,
> I want to check if any of the below issues are worth/should be fixed:
>
> # btrfs_ioctl_snap_destroy() does not commit a transaction. As a
> result, user can ask to delete a subvol, he receives ok back. Even if
> user does btrfs sub list, he will not see the deleted subvol (even
> though the transaction was not committed yet). But if a crash happens,
> ORPHAN_ITEM will not re-appear after crash. So after crash, the
> subvolume still exists perfectly fine (happened couple of times here).
>
> # btrfs_drop_snapshot() does not commit a transaction after
> btrfs_del_orphan_item(). So if the subvol deletion completed in one go
> (did not have to detach and re-attach to transaction, thus committing
> the ORPHAN_ITEM and drop_progress/level), then after crash ORPHAN_ITEM
> will not be in the tree, and subvolume still exists.
>
> # btrfs_drop_snapshot() checks btrfs_should_end_transaction(), and
> then does btrfs_end_transaction_throttle() and
> btrfs_start_transaction(). However, it looks like it can rejoin the
> same transaction if transaction was not blocked yet. Minor issue,
> perhaps?
>
> # umount may get delayed because of pending-for-deletion subvolumes:
> btrfs_commit_super() locks the cleaner_mutex, so it will wait for the
> cleaner to complete.
> On the other hand, cleaner will not give up until it completes
> processing all its splice. If currently cleaner is not running, then
> btrfs_commit_super() calls btrfs_clean_old_snapshots() directly.
> So does it make sense:
> - btrfs_commit_super() will not call btrfs_clean_old_snapshots()
> - close_ctree() calls kthread_stop(cleaner_kthread) early, and cleaner
> thread periodically checks if it needs to exit
>
> Thanks,
> Alex.
Leaking btrfs_qgroup_inherit on snapshot creation?
Hi Jan, Arne,

I see this code in create_snapshot:

	if (inherit) {
		pending_snapshot->inherit = *inherit;
		*inherit = NULL;	/* take responsibility to free it */
	}

So, first, I think the check should be:

	if (*inherit)

because in btrfs_ioctl_snap_create_v2() we have:

	struct btrfs_qgroup_inherit *inherit = NULL;
	...
	btrfs_ioctl_snap_create_transid(..., &inherit)

so the pointer that the current check tests is very unlikely to be NULL.

Second, I don't see anybody freeing pending_snapshot->inherit. I guess it should be freed after calling btrfs_qgroup_inherit() and also in btrfs_destroy_pending_snapshots().

Thanks,
Alex.
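
To make the two points above concrete, here is a minimal userspace sketch of the ownership hand-off. It is only an illustration under stated assumptions: the struct and function names below (qgroup_inherit, pending_snapshot, create_snapshot_model, destroy_pending_snapshot_model) are hypothetical stand-ins, not the real btrfs definitions, and free() stands in for kfree().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures; not the btrfs definitions. */
struct qgroup_inherit { int num_qgroups; };
struct pending_snapshot { struct qgroup_inherit *inherit; };

/* Takes ownership of *inherit only if the caller actually supplied one. */
static void create_snapshot_model(struct pending_snapshot *pending,
				  struct qgroup_inherit **inherit)
{
	pending->inherit = NULL;
	if (*inherit) {		/* test the pointee; the address itself is always valid */
		pending->inherit = *inherit;
		*inherit = NULL;	/* the caller must no longer free it */
	}
}

/* Releases whatever the pending snapshot still owns; safe to call twice. */
static void destroy_pending_snapshot_model(struct pending_snapshot *pending)
{
	free(pending->inherit);	/* free(NULL) is a no-op */
	pending->inherit = NULL;
}

/* Returns 1 if ownership moved to the pending snapshot and was later released. */
static int demo_with_inherit(void)
{
	struct pending_snapshot pending;
	struct qgroup_inherit *inherit = calloc(1, sizeof(*inherit));
	int moved;

	if (!inherit)
		return 0;
	create_snapshot_model(&pending, &inherit);
	moved = (inherit == NULL && pending.inherit != NULL);
	free(inherit);		/* no-op here: ownership has moved */
	destroy_pending_snapshot_model(&pending);
	return moved && pending.inherit == NULL;
}

/* Returns 1 if nothing is taken when the caller passes no inherit data. */
static int demo_without_inherit(void)
{
	struct pending_snapshot pending;
	struct qgroup_inherit *inherit = NULL;
	int untouched;

	create_snapshot_model(&pending, &inherit);
	untouched = (pending.inherit == NULL && inherit == NULL);
	destroy_pending_snapshot_model(&pending);
	return untouched;
}
```

In this model the new owner frees the buffer exactly once, and the caller's unconditional free() after the call stays harmless because its pointer was cleared. Whether the kernel fix should free in one place or two is for the maintainers to decide.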
Deleted subvolume reappears and other cleaner issues
Hi,

I want to check whether any of the issues below are worth fixing:

# btrfs_ioctl_snap_destroy() does not commit a transaction. As a result, a user can ask to delete a subvol and receive OK back. Even if the user runs "btrfs sub list", the deleted subvol is not shown (even though the transaction has not been committed yet). But if a crash happens, the ORPHAN_ITEM will not reappear after the crash. So after the crash, the subvolume still exists and is perfectly fine (this happened a couple of times here).

# btrfs_drop_snapshot() does not commit a transaction after btrfs_del_orphan_item(). So if the subvol deletion completed in one go (did not have to detach from and re-attach to a transaction, thus committing the ORPHAN_ITEM and drop_progress/level), then after a crash the ORPHAN_ITEM will not be in the tree, and the subvolume still exists.

# btrfs_drop_snapshot() checks btrfs_should_end_transaction(), and then calls btrfs_end_transaction_throttle() and btrfs_start_transaction(). However, it looks like it can rejoin the same transaction if the transaction was not blocked yet. A minor issue, perhaps?

# umount may get delayed because of pending-for-deletion subvolumes: btrfs_commit_super() locks the cleaner_mutex, so it will wait for the cleaner to complete. On the other hand, the cleaner will not give up until it has finished processing its entire spliced list. If the cleaner is not currently running, then btrfs_commit_super() calls btrfs_clean_old_snapshots() directly. So would the following make sense?
- btrfs_commit_super() does not call btrfs_clean_old_snapshots()
- close_ctree() calls kthread_stop(cleaner_kthread) early, and the cleaner thread periodically checks whether it needs to exit

Thanks,
Alex.
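
As a rough illustration of the last proposal, here is a single-threaded userspace model of a cleaner loop that checks for a stop request between snapshots instead of draining its whole list. Everything in it is an assumption made for the sketch: struct cleaner_model and its helpers are hypothetical, the atomic flag only stands in for what kthread_should_stop() would report in the kernel, and the stop request is simulated after a fixed number of drops rather than coming from a real umount.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the cleaner state; not a btrfs structure. */
struct cleaner_model {
	atomic_bool should_stop;	/* stand-in for kthread_should_stop() */
	size_t pending;			/* dead roots still waiting to be dropped */
	size_t dropped;			/* dead roots already dropped */
};

/* Stand-in for dropping one snapshot. */
static void drop_one_snapshot(struct cleaner_model *c)
{
	c->pending--;
	c->dropped++;
}

/*
 * Drops queued snapshots one at a time, checking the stop flag before each.
 * A stop request is simulated once `stop_after` snapshots have been dropped;
 * pass 0 to never request a stop. Returns the number of snapshots left queued.
 */
static size_t cleaner_run(struct cleaner_model *c, size_t stop_after)
{
	while (c->pending > 0) {
		if (atomic_load(&c->should_stop))
			break;		/* leave the rest for the next mount */
		drop_one_snapshot(c);
		if (c->dropped == stop_after)
			atomic_store(&c->should_stop, true);
	}
	return c->pending;
}

/* Convenience wrapper: build a cleaner with `pending` items and run it. */
static size_t simulate_cleaner(size_t pending, size_t stop_after)
{
	struct cleaner_model c;

	atomic_init(&c.should_stop, false);
	c.pending = pending;
	c.dropped = 0;
	return cleaner_run(&c, stop_after);
}
```

With five queued snapshots and a stop request after the second, the loop exits with three still pending, which is the behaviour that would let umount proceed without waiting for the whole list. The real change would also have to make sure the remaining dead roots are picked up again on the next mount.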