Re: [PATCH] Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
On Thu, 09 Apr 2015 12:08:43 +0800, Dongsheng Yang wrote:

We need to fill the inode when we find a node for it in delayed_nodes_tree, but currently we do not fill ->last_trans, which makes xfstests generic/311 fail. The scenario of generic/311 is shown below:

Problem:
(1). test_fd = open(fname, O_RDWR|O_DIRECT)
(2). pwrite(test_fd, buf, 4096, 0)
(3). close(test_fd)
(4). drop_all_caches()  <- echo 3 > /proc/sys/vm/drop_caches
(5). test_fd = open(fname, O_RDWR|O_DIRECT)
(6). fsync(test_fd);    <- we did not get the correct log entry for the file

Reason:
When we re-open the file in (5), we find a node in delayed_nodes_tree and fill the inode we are looking up with its information. But ->last_trans is not filled; fsync() then checks ->last_trans, finds it is 0, and decides this inode is already in a committed tree, so it does not record the extents for it.

Fix:
This patch fills ->last_trans properly and sets the runtime_flags if needed in this situation. Then we get the log entries we expect after (6), and generic/311 passes.

Signed-off-by: Dongsheng Yang yangds.f...@cn.fujitsu.com

Good catch!
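A userspace model may make the failure concrete. The following Python sketch is purely illustrative (the function and its arguments are hypothetical simplifications of the btrfs fields, not kernel code): fsync only logs an inode whose last modifying transaction is newer than the last committed one, so a `last_trans` left at 0 always looks already-committed.

```python
def fsync_needs_log(last_trans, last_committed):
    """Simplified model of the decision btrfs fsync makes: log the
    inode only if it was modified in a not-yet-committed transaction."""
    return last_trans > last_committed

# Modified in transaction 42 while 41 is the last committed one:
# the new extents must be logged.
assert fsync_needs_log(42, 41)

# The bug: btrfs_fill_inode() left last_trans at 0, so the inode looks
# as if it were committed long ago and its extents are never logged.
assert not fsync_needs_log(0, 41)
```

This is why filling ->last_trans from the delayed inode item is enough to make generic/311 see the expected log entries.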
Reviewed-by: Miao Xie miao...@huawei.com
---
 fs/btrfs/delayed-inode.c |  2 ++
 fs/btrfs/inode.c         | 21 ++++++++++++---------
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 82f0c7c..9e8b435 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1801,6 +1801,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
 	set_nlink(inode, btrfs_stack_inode_nlink(inode_item));
 	inode_set_bytes(inode, btrfs_stack_inode_nbytes(inode_item));
 	BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
+	BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
+
 	inode->i_version = btrfs_stack_inode_sequence(inode_item);
 	inode->i_rdev = 0;
 	*rdev = btrfs_stack_inode_rdev(inode_item);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d2e732d..b132936 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3628,25 +3628,28 @@ static void btrfs_read_locked_inode(struct inode *inode)
 	BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
 	BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
 
+	inode->i_version = btrfs_inode_sequence(leaf, inode_item);
+	inode->i_generation = BTRFS_I(inode)->generation;
+	inode->i_rdev = 0;
+	rdev = btrfs_inode_rdev(leaf, inode_item);
+
+	BTRFS_I(inode)->index_cnt = (u64)-1;
+	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
+
+cache_index:
 	/*
 	 * If we were modified in the current generation and evicted from memory
 	 * and then re-read we need to do a full sync since we don't have any
 	 * idea about which extents were modified before we were evicted from
 	 * cache.
+	 *
+	 * This is required for both inode re-read from disk and delayed inode
+	 * in delayed_nodes_tree.
 	 */
 	if (BTRFS_I(inode)->last_trans == root->fs_info->generation)
 		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 			&BTRFS_I(inode)->runtime_flags);
 
-	inode->i_version = btrfs_inode_sequence(leaf, inode_item);
-	inode->i_generation = BTRFS_I(inode)->generation;
-	inode->i_rdev = 0;
-	rdev = btrfs_inode_rdev(leaf, inode_item);
-
-	BTRFS_I(inode)->index_cnt = (u64)-1;
-	BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
-
-cache_index:
 	path->slots[0]++;
 	if (inode->i_nlink != 1 ||
 	    path->slots[0] >= btrfs_header_nritems(leaf))
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC v6 6/9] vfs: Add sb_want_write() function to get vfsmount from a given sb.
On Wed, 04 Feb 2015 10:10:55 +0800, Qu Wenruo wrote:

*** Please DON'T merge this patch, it's only for discussion purposes ***

There are sysfs interfaces in some filesystems, only btrfs yet, which will modify on-disk data. Unlike the normal file operation routines, where we can use mnt_want_write_file() to protect the operation, a change through sysfs is not bound to any file in the filesystem.

So introduce a new sb_want_write() to do the protection against a super block, which acts much like mnt_want_write() but will return success as long as the super block is read-write.

Since the sysfs handler doesn't go through a normal vfsmount, it won't increase any mount refcount, and even while sb_want_write() is waiting for the sb to be unfrozen, the fs can still be unmounted without problem, leaving the module unable to be removed and the user unable to find out what's wrong.

To solve such a problem, there are different strategies:

1) Extra check on last-instance umount of a sb
This is the method this patch uses. It seems valid enough: since we want write protection on a sb, the sb is OK as long as there is *ANY* mount instance.

Problem 1.1) lsof and other tools won't help if sb_want_write() on a frozen fs makes it impossible to unmount.

Problem 1.2) When namespaces get involved, things get more complicated. Like the following case:

Alice                                 | Bob
Mount devA on /mnt1 in her ns         | Mount devA on /mnt2 in his ns
freeze /mnt1                          |
sb_want_write() (waiting)             |
umount /mnt1 (success since there is  |
another mount instance)               |
                                      | umount /mnt2 (fail since there
                                      | is sb_want_write() waiting)

So Alice can't thaw the fs since there is no mount point for it now.

2) Don't allow any umount of the sb if there is a sb_want_write() waiter
A more aggressive one, proposed by Miao Xie. It can't resolve problem 1.1) but will solve problem 1.2).

This is one of the two methods that I told you, but not the one I recommended.
What I wanted to recommend is to thaw the fs at the beginning of the sb kill process, and in sb_want_write() check whether the sb is still active after we pass sb_start_write(); if the sb is not active, go back. (This way is also not so good, but better than the above one.) Although it introduces a new problem like the following:

Alice
Mount devA on /mnt1
freeze /mnt1
sb_want_write() (waiting)
mount devA on /mnt2 and /mnt3

/mnt[123] all can't be unmounted, but new mounts can still be created.

3) sb_want_write() doesn't make any sense and breaks VFS rules!
An action which changes on-disk data should not be tunable through sysfs, and an sb_want_write() that bypasses all the VFS checks is just evil. And for btrfs, we already have the ioctl to set the label, so why bother with a new sysfs interface to do it again?

Although I use method 1) here, I am still not certain which method is the correct one. So any advice is welcome.

Thanks, Qu

[SNIP]

+/**
+ * sb_want_write - get write access to a super block
+ * @sb: the superblock of the filesystem
+ *
+ * This tells the low-level filesystem that a write is about to be performed to
+ * it, and makes sure that the writes are allowed (superblock is read-write,
+ * filesystem is not frozen) before returning success.
+ * When the write operation is finished, sb_drop_write() must be called.
+ * This is much like mnt_want_write() as a refcount, but only needs
+ * the superblock to be read-write.
+ */
+int sb_want_write(struct super_block *sb)
+{
+	spin_lock(&sb->s_want_write_lock);
+	if (sb->s_want_write_block) {
+		spin_unlock(&sb->s_want_write_lock);
+		return -EBUSY;
+	}
+	sb->s_want_write_count++;
+	spin_unlock(&sb->s_want_write_lock);
+
+	sb_start_write(sb);
+	if (sb->s_readonly_remount || sb->s_flags & MS_RDONLY) {

If someone remounts the fs R/O here (after the check), we should not continue to change the label/features. I think we need to add some check in the remount functions.
Thanks
Miao
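To summarize the intended semantics of the RFC, here is a small userspace model in Python (all names are hypothetical; it mirrors only the counter/blocked-flag logic of the proposed sb_want_write()/sb_drop_write() pair, not real VFS locking or freezing): success on a writable sb, -EBUSY while want-write is blocked by an umount, and a read-only failure that must drop the reference again, per Miao's review point.

```python
import threading

EBUSY, EROFS = 16, 30  # illustrative errno values

class SuperBlock:
    def __init__(self, readonly=False):
        self.want_write_lock = threading.Lock()  # models s_want_write_lock
        self.want_write_block = False            # set while umount is deciding
        self.want_write_count = 0
        self.readonly = readonly                 # models MS_RDONLY / s_readonly_remount

def sb_want_write(sb):
    with sb.want_write_lock:
        if sb.want_write_block:
            return -EBUSY                        # umount in progress, back off
        sb.want_write_count += 1
    if sb.readonly:
        sb_drop_write(sb)                        # drop the ref on failure
        return -EROFS
    return 0

def sb_drop_write(sb):
    with sb.want_write_lock:
        sb.want_write_count -= 1

rw = SuperBlock()
assert sb_want_write(rw) == 0 and rw.want_write_count == 1
sb_drop_write(rw)

ro = SuperBlock(readonly=True)
assert sb_want_write(ro) == -EROFS and ro.want_write_count == 0
```

Note what the model cannot capture, which is exactly the point of the thread: nothing here ties the counter to a mount reference, so a real umount would not see the waiter.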
Re: [PATCH v5 0/9] btrfs: Fix freeze/sysfs deadlock in better method.
On Fri, 30 Jan 2015 20:17:49 +0100, David Sterba wrote: On Fri, Jan 30, 2015 at 05:20:45PM +0800, Qu Wenruo wrote:

[Use VFS protection for sysfs change]
The 6th patch will introduce a new helper function sb_want_write() to claim write permission on a superblock. With this, we are able to do write protection like mnt_want_write(), but we only need to ensure that the superblock is writable. This also keeps the same synchronized behavior as the ioctl, which will block on a frozen fs until it is unfrozen.

You know what I think about the commit inside sysfs, but it looks better to me now with the sb_* protections, so I'll give it a go.

I worried about the following case:

# fsfreeze btrfs
# echo "new label" > btrfs_sysfs

It should hang up. On the other terminal:

# umount btrfs

Because the 2nd echo command didn't increase the mount reference, umount would not know someone is still blocked on the fs. It would not back off and return EBUSY as it does when someone accesses the fs through the common fs interface; it would deactivate the fs directly and then block on the sysfs removal.

Thanks
Miao
Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount options in an atomic way
On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:

-------- Original Message --------
Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
From: Miao Xie miao...@huawei.com
To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org
Date: 2015-01-30 09:29

On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:

Here we need ACCESS_ONCE to wrap info->mount_opt, or the compiler might use info->mount_opt instead of new_opt.

Thanks for pointing this one out. But I worry that this is not the key reason for the wrong space cache.

Could you explain the race condition which caused the wrong space cache?

Thanks
Miao

CPU0: remount()
      |- sync_fs()  <- after sync_fs() we can start a new trans
      |- btrfs_parse_options()
CPU1:                   |- start_transaction()
                        |- Do some bg allocation, not recorded in space_cache.

I think it is a bug if a free space is not recorded in the space cache. Could you explain why it is not recorded?

Thanks
Miao

IIRC, in that window, the SPACE_CACHE bit of fs_info->mount_opt is cleared, so the space cache is not recorded.

SPACE_CACHE is used to control cache write-out, not the in-memory cache. All the free space should be recorded in the in-memory cache. And when we write out the in-memory space cache, we need to protect the space cache from changing.

Thanks
Miao

Thanks,
Qu

                        |- set SPACE_CACHE bit due to cache_gen
                        |- commit_transaction()
                        |- write the space cache and update cache_gen,
                           but since some of the free space is not
                           recorded in the space cache, the space cache
                           is missing some records.
      |- clear SPACE_CACHE bit due to nospace_cache

So the space cache is wrong.

Thanks,
Qu

+	}
 	kfree(orig);
 	return ret;
 }
Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount options in an atomic way
On Thu, 29 Jan 2015 10:24:35 +0800, Qu Wenruo wrote: The current btrfs_parse_options() is not atomic; it can set and then clear a bit, especially in the nospace_cache case.

For example, if a fs is mounted with nospace_cache, btrfs_parse_options() will first set the SPACE_CACHE bit (since cache_generation is non-zero) and then clear the SPACE_CACHE bit due to the nospace_cache mount option.

So under heavy operations while remounting a nospace_cache btrfs, there is a window in which a commit can create a space cache. This bug can be reproduced with fstests btrfs/071 073 074 using the nospace_cache mount option. It has about a 50% chance to create a space cache, and about a 10% chance to create a wrong space cache, which can't pass btrfsck.

This patch does the mount option parsing in a copy-and-update way: first copy mount_opt from fs_info to new_opt, update options only in new_opt, and at the end copy new_opt back to fs_info->mount_opt.

This patch is already good enough to fix the above nospace_cache + remount bug, but later patches are needed to make sure mount options do not change during a transaction.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/ctree.h |  16 ++++----
 fs/btrfs/super.c | 115 ++++++++++++++++-----------
 2 files changed, 69 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5f99743..26bb47b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2119,18 +2119,18 @@ struct btrfs_ioctl_defrag_range_args {
 #define btrfs_test_opt(root, opt)	((root)->fs_info->mount_opt & \
 					 BTRFS_MOUNT_##opt)
 
-#define btrfs_set_and_info(root, opt, fmt, args...) \
+#define btrfs_set_and_info(fs_info, val, opt, fmt, args...) \
 {\
-	if (!btrfs_test_opt(root, opt)) \
-		btrfs_info(root->fs_info, fmt, ##args); \
-	btrfs_set_opt(root->fs_info->mount_opt, opt); \
+	if (!btrfs_raw_test_opt(val, opt)) \
+		btrfs_info(fs_info, fmt, ##args); \
+	btrfs_set_opt(val, opt); \
 }
 
-#define btrfs_clear_and_info(root, opt, fmt, args...) \
+#define btrfs_clear_and_info(fs_info, val, opt, fmt, args...) \
 {\
-	if (btrfs_test_opt(root, opt)) \
-		btrfs_info(root->fs_info, fmt, ##args); \
-	btrfs_clear_opt(root->fs_info->mount_opt, opt); \
+	if (btrfs_raw_test_opt(val, opt)) \
+		btrfs_info(fs_info, fmt, ##args); \
+	btrfs_clear_opt(val, opt); \
 }
 
 /*
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b0c45b2..490fe1f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -395,10 +395,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 	int ret = 0;
 	char *compress_type;
 	bool compress_force = false;
+	unsigned long new_opt;
+
+	new_opt = info->mount_opt;

Here and

 	cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
 	if (cache_gen)
-		btrfs_set_opt(info->mount_opt, SPACE_CACHE);

[SNIP]

 out:
-	if (!ret && btrfs_test_opt(root, SPACE_CACHE))
-		btrfs_info(root->fs_info, "disk space caching is enabled");
+	if (!ret) {
+		if (btrfs_raw_test_opt(new_opt, SPACE_CACHE))
+			btrfs_info(info, "disk space caching is enabled");
+		info->mount_opt = new_opt;

here we need ACCESS_ONCE to wrap info->mount_opt, or the compiler might use info->mount_opt instead of new_opt.

But I worry that this is not the key reason for the wrong space cache. Could you explain the race condition which caused the wrong space cache?

Thanks
Miao

+	}
 	kfree(orig);
 	return ret;
 }
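The copy-and-update pattern of the patch can be modeled in userspace. A minimal Python sketch (the names are simplified stand-ins for the kernel fields, not the actual code): all set/clear operations happen on a private copy, and only the final value is published, so a concurrent reader never observes the transient state where SPACE_CACHE was set and not yet cleared by nospace_cache.

```python
SPACE_CACHE = 1 << 0

def parse_options(mount_opt, cache_generation, nospace_cache):
    """Copy-and-update: mutate a private copy, publish once."""
    new_opt = mount_opt                  # copy the current options
    if cache_generation:
        new_opt |= SPACE_CACHE           # set first, since cache_gen != 0...
    if nospace_cache:
        new_opt &= ~SPACE_CACHE          # ...then clear for nospace_cache
    return new_opt                       # single write-back, no window

# Remount with nospace_cache on an fs whose cache_generation is non-zero:
# the published options never carry SPACE_CACHE.
assert parse_options(0, cache_generation=123, nospace_cache=True) & SPACE_CACHE == 0
# Without nospace_cache the bit survives.
assert parse_options(0, cache_generation=123, nospace_cache=False) & SPACE_CACHE
```

The ACCESS_ONCE concern from the review maps onto the final write-back: the single publish is only atomic if the compiler is not allowed to re-read or re-materialize the source of the assignment.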
Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount options in an atomic way
On Fri, 30 Jan 2015 10:51:52 +0800, Qu Wenruo wrote:

-------- Original Message --------
Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in a atomic way
From: Miao Xie miao...@huawei.com
To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org
Date: 2015-01-30 10:06

[SNIP - quoted thread, see the previous mails]

SPACE_CACHE is used to control cache write-out, not the in-memory cache. All the free space should be recorded in the in-memory cache. And when we write out the in-memory space cache, we need to protect the space cache from changing.

Thanks
Miao

You're right, the wrong space cache problem is not caused by the non-atomic mount option problem. But the atomic mount option change, together with the per-transaction mount option, will at least make the problem disappear when the nospace_cache mount option is used.

But we need to fix the problem, not hide it.

Thanks
Miao

Thanks,
Qu
Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
On Fri, 30 Jan 2015 04:37:14 +0000, Al Viro wrote: On Fri, Jan 30, 2015 at 12:14:24PM +0800, Miao Xie wrote: On Fri, 30 Jan 2015 02:14:45 +0000, Al Viro wrote: On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:

This shouldn't happen. If someone is ro, the whole fs should be ro, right?

Wrong. Individual vfsmounts over an r/w superblock might very well be r/o.

As for that trylock... What for? It invites transient failures for no good reason. Removal of a sysfs entry will block while a write(2) to that sucker is in progress, so btrfs shutdown will block at that point in ctree_close(). It won't go away under you.

Could you explain the race condition? I think the deadlock won't happen: during the btrfs shutdown we hold s_umount, the write operation will fail to lock it and quit quickly, and then umount will continue.

First of all, ->s_umount is not a mutex; it's an rwsem. So you mean down_read_trylock(). As for the transient failures - grep for down_write on it... E.g. have somebody call mount() from the same device. We call sget(), which finds the existing superblock and calls grab_super(). Sure, that ->s_umount will be released shortly, but in the meanwhile your trylock will fail...

I know it, so I suggested returning -EBUSY in the previous mail. I think it is an acceptable method; mount/umount operations are not so frequent after all.

Thanks
Miao

I think sb_want_write() is similar to trylock(s_umount); the difference is that sb_want_write() is more complex.

Now, you might want to move those sysfs entry removals to the very beginning of btrfs_kill_super(), but that's a different story - you only need to make sure that they are removed no later than the destruction of the data structures they need (IOW, the current location might very well be OK - I hadn't checked the details).

Yes, we need to move those sysfs entry removals, but we needn't move them to the very beginning of btrfs_kill_super(), just to the beginning of close_ctree().

So move them...
It's a matter of moving one function call around a bit.
Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
On Fri, 30 Jan 2015 10:02:26 +0800, Qu Wenruo wrote:

-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
From: Qu Wenruo quwen...@cn.fujitsu.com
To: Miao Xie miao...@huawei.com, linux-btrfs@vger.kernel.org
Date: 2015-01-30 09:44

-------- Original Message --------
Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
From: Miao Xie miao...@huawei.com
To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org
Date: 2015-01-30 08:52

On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote: There are sysfs interfaces in some filesystems, only btrfs yet, which will modify on-disk data. Unlike the normal file operation routines, where we can use mnt_want_write_file() to protect the operation, a change through sysfs is not bound to any file in the filesystem. So we can only extract the first vfsmount of a superblock and pass it to mnt_want_write() to do the protection.

This method is wrong, because one fs may be mounted on multiple places at the same time, some R/O, some R/W; you may get an R/O one and fail to get the write permission.

This shouldn't happen. If someone is ro, the whole fs should be ro, right? You can mount a device which is already mounted as rw on another point as ro, and remounting one mount point to ro will also cause all other mount points to go ro. So I don't see the problem here.

I think you can do the label/feature change by the sysfs interface in the following way:

btrfs_sysfs_change_()
{
	/* Use trylock to avoid the race with umount */
	if (!mutex_trylock(&sb->s_umount))
		return -EBUSY;
	check R/O and FREEZE
	...
	mutex_unlock(&sb->s_umount);
}

This looks better since it does not introduce changes to the VFS.

Thanks,
Qu

Oh, wait a second, this one leads to the old problem and the old solution. If we hold the s_umount mutex, we must do the freeze check and can't start a transaction, since it will deadlock.
And for the freeze check, we must use sb_try_start_intwrite() to hold the freeze lock and then add a new btrfs_start_transaction_freeze() which will not call sb_start_write()... Oh, this seems so similar - the v2 or v3 version of the RFC patch? So it still goes back to the old method?

No. Just check R/O and FREEZE; if the check fails, go out. If the check passes, we start_transaction. Because we do it under the s_umount lock, no one can change the fs to R/O or FREEZE.

Maybe the above description is not so clear, so explain it again:

btrfs_sysfs_change_()
{
	/* Use trylock to avoid the race with umount */
	if (!mutex_trylock(&sb->s_umount))
		return -EBUSY;
	if (fs is R/O or FROZEN) {
		mutex_unlock(&sb->s_umount);
		return -EACCES;
	}
	btrfs_start_transaction()
	change label/feature
	btrfs_commit_transaction()
	mutex_unlock(&sb->s_umount);
}

Thanks
Miao

Thanks,
Qu

Thanks
Miao

Cc: linux-fsdevel linux-fsde...@vger.kernel.org
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/namespace.c        | 25 +++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index cd1e968..5a16a62 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL(mntget);
 
+/*
+ * get a vfsmount from a given sb
+ *
+ * This is especially used for the case where a change through an fs' sysfs
+ * interface will lead to a write, e.g. changing the label through sysfs in
+ * btrfs. So vfs can get a vfsmount and then use mnt_want_write() to protect.
+ */
+struct vfsmount *get_vfsmount_sb(struct super_block *sb)
+{
+	struct vfsmount *ret_vfs = NULL;
+	struct mount *mnt;
+	int ret = 0;
+
+	lock_mount_hash();
+	if (list_empty(&sb->s_mounts))
+		goto out;
+	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
+	ret_vfs = &mnt->mnt;
+	ret_vfs = mntget(ret_vfs);
+out:
+	unlock_mount_hash();
+	return ret_vfs;
+}
+EXPORT_SYMBOL(get_vfsmount_sb);
+
 struct vfsmount *mnt_clone_internal(struct path *path)
 {
 	struct mount *p;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c2c561d..cf1b0f5 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
 extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(struct path *path);
+extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 struct path;
Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
On Fri, 30 Jan 2015 02:14:45 +0000, Al Viro wrote: On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:

This shouldn't happen. If someone is ro, the whole fs should be ro, right?

Wrong. Individual vfsmounts over an r/w superblock might very well be r/o.

As for that trylock... What for? It invites transient failures for no good reason. Removal of a sysfs entry will block while a write(2) to that sucker is in progress, so btrfs shutdown will block at that point in ctree_close(). It won't go away under you.

Could you explain the race condition? I think the deadlock won't happen: during the btrfs shutdown we hold s_umount, the write operation will fail to lock it and quit quickly, and then umount will continue.

I think sb_want_write() is similar to trylock(s_umount); the difference is that sb_want_write() is more complex.

Now, you might want to move those sysfs entry removals to the very beginning of btrfs_kill_super(), but that's a different story - you only need to make sure that they are removed no later than the destruction of the data structures they need (IOW, the current location might very well be OK - I hadn't checked the details).

Yes, we need to move those sysfs entry removals, but we needn't move them to the very beginning of btrfs_kill_super(), just to the beginning of close_ctree(). The current location is not right; it will introduce a use-after-free problem, because we remove the sysfs entry after we stop transaction_kthread. A use-after-free might happen in this case:

Task1                         Task2
change label by sysfs         close_ctree()
                                kthread_stop(transaction_kthread);
change label
wake_up(transaction_kthread)

Thanks
Miao

As for "it won't go r/o under us" - sb_want_write() will do that just fine.
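Miao's use-after-free scenario is an ordering problem: the sysfs entry, which can still trigger a wake_up, must be removed before the transaction kthread is stopped. A toy Python model (all names hypothetical, loosely mirroring the kernel objects) contrasts the two teardown orders:

```python
class Fs:
    def __init__(self):
        self.kthread_alive = True   # models transaction_kthread
        self.sysfs_entry = True     # models the label/feature sysfs file

    def sysfs_change_label(self):
        """May run at any moment while the sysfs entry still exists."""
        if not self.sysfs_entry:
            return False            # handler already unregistered, bail out
        if not self.kthread_alive:
            raise RuntimeError("use-after-free: woke a stopped kthread")
        return True                 # models wake_up(transaction_kthread)

    def close_ctree_buggy(self):
        self.kthread_alive = False  # kthread_stop() first...
        # <-- a sysfs write landing here hits the freed kthread
        self.sysfs_entry = False    # ...sysfs removal too late

    def close_ctree_fixed(self):
        self.sysfs_entry = False    # remove the sysfs entries first,
        self.kthread_alive = False  # then it is safe to stop the kthread
```

With the fixed order there is no window: once the sysfs entry is gone, the handler bails out before it can touch the kthread.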
Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.
On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote: There are sysfs interfaces in some filesystems, only btrfs yet, which will modify on-disk data. Unlike the normal file operation routines, where we can use mnt_want_write_file() to protect the operation, a change through sysfs is not bound to any file in the filesystem. So we can only extract the first vfsmount of a superblock and pass it to mnt_want_write() to do the protection.

This method is wrong, because one fs may be mounted on multiple places at the same time, some R/O, some R/W; you may get an R/O one and fail to get the write permission.

I think you can do the label/feature change by the sysfs interface in the following way:

btrfs_sysfs_change_()
{
	/* Use trylock to avoid the race with umount */
	if (!mutex_trylock(&sb->s_umount))
		return -EBUSY;
	check R/O and FREEZE
	...
	mutex_unlock(&sb->s_umount);
}

Thanks
Miao

Cc: linux-fsdevel linux-fsde...@vger.kernel.org
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/namespace.c        | 25 +++++++++++++++++++++++++
 include/linux/mount.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/fs/namespace.c b/fs/namespace.c
index cd1e968..5a16a62 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL(mntget);
 
+/*
+ * get a vfsmount from a given sb
+ *
+ * This is especially used for the case where a change through an fs' sysfs
+ * interface will lead to a write, e.g. changing the label through sysfs in
+ * btrfs. So vfs can get a vfsmount and then use mnt_want_write() to protect.
+ */
+struct vfsmount *get_vfsmount_sb(struct super_block *sb)
+{
+	struct vfsmount *ret_vfs = NULL;
+	struct mount *mnt;
+	int ret = 0;
+
+	lock_mount_hash();
+	if (list_empty(&sb->s_mounts))
+		goto out;
+	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
+	ret_vfs = &mnt->mnt;
+	ret_vfs = mntget(ret_vfs);
+out:
+	unlock_mount_hash();
+	return ret_vfs;
+}
+EXPORT_SYMBOL(get_vfsmount_sb);
+
 struct vfsmount *mnt_clone_internal(struct path *path)
 {
 	struct mount *p;
diff --git a/include/linux/mount.h b/include/linux/mount.h
index c2c561d..cf1b0f5 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
 extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
 extern struct vfsmount *mnt_clone_internal(struct path *path);
+extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
 extern int __mnt_is_readonly(struct vfsmount *mnt);
 
 struct path;
Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
On Fri, 23 Jan 2015 17:59:49 +0100, David Sterba wrote: On Wed, Jan 21, 2015 at 03:04:02PM +0800, Miao Xie wrote:

Pending changes are *not* only mount options. Feature change and label change are also pending changes if using sysfs.

My miss, I didn't notice the feature and label change by sysfs. But the implementation of feature and label change by sysfs is wrong: we can not change them without write permission.

Label change does not happen if the fs is read-only. If the filesystem is R/W and the label is changed through sysfs, then a remount to R/O will sync the filesystem and the new label will be saved. The sysfs features write handler is missing that protection though, I'll send a patch.

First, that R/O protection is too weak; there is a race between R/O remount and label/feature change. Please consider the following case:

Remount R/O task          Label/Attr Change Task
                          Check R/O
remount ro
                          Change label/feature

Second, it forgets to handle the freezing event.

For freeze, it's not the same problem, since the fs will be unfrozen sooner or later and a transaction will be initiated.

You can not assume the operations of the users; they might freeze the fs and then shut down the machine. The semantics of freezing should make the on-device image consistent, but still keep some changes in memory. For example, if we change the features/label through sysfs and then umount the fs, it is different from a pending change.

No, now features/label changes using sysfs both use pending changes to do the commit. See the BTRFS_PENDING_COMMIT bit. So freeze -> change features/label -> sync will still cause the deadlock in the same way, and you can try it yourself.

As I said above, the implementation of the sysfs feature and label change is wrong; it is better to separate them from the pending mount option change and make the sysfs feature and label change be done in the context of a transaction after getting the write permission. If so, we needn't do anything special when syncing the fs.

That would mean dropping the write support of the sysfs files that change global filesystem state (label and features right now). This would leave only the ioctl way to do that. I'd like to keep the sysfs write support though, for ease of use from scripts and languages that are not ioctl-friendly.

Not drop the write support of sysfs, just fix the bug and make it change the label and features in a writable context.

Thanks
Miao
Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
On Wed, 21 Jan 2015 15:47:54 +0800, Qu Wenruo wrote: On Wed, 21 Jan 2015 11:53:34 +0800, Qu Wenruo wrote: [snipped] This will cause another problem, nobody can ensure there will be next transaction and the change may never to written into disk. First, the pending changes is mount option, that is in-memory data. Second, the same problem would happen after you freeze fs. Pending changes are *not* only mount options. Feature change and label change are also pending changes if using sysfs. My miss, I don't notice feature and label change by sysfs. But the implementation of feature and label change by sysfs is wrong, we can not change them without write permission. Normal ioctl label changing is not affected. For freeze, it's not the same problem since the fs will be unfreeze sooner or later and transaction will be initiated. You can not assume the operations of the users, they might freeze the fs and then shutdown the machine. For example, if we change the features/label through sysfs, and then umount the fs, It is different from pending change. No, now features/label changing using sysfs both use pending changes to do the commit. See BTRFS_PENDING_COMMIT bit. So freeze - change features/label - sync will still cause the deadlock in the same way, and you can try it yourself. As I said above, the implementation of sysfs feature and label change is wrong, it is better to separate them from the pending mount option change, make the sysfs feature and label change be done in the context of transaction after getting the write permission. If so, we needn't do anything special when sync the fs. In short, changing the sysfs feature and label change implementation and removing the unnecessary btrfs_start_transaction in sync_fs can fix the deadlock. Your method will only fix the deadlock, but will introduce the risk like pending inode_cache will never be written to disk as I already explained. 
(If we are still using the fs_info->pending_changes mechanism) To ensure pending changes are written to disk, sync_fs() should start a transaction if needed, or there is a chance that no transaction will ever handle them. We are sure that writing down the inode cache needs a transaction. But INODE_CACHE is not a forcible flag. Sometimes, even though you set it, it is very likely that the inode cache files are not created and the data is not written down, because the fs might still be reading inode usage information, and that operation might span several transactions. So I think what you are worried about is not a problem. Thanks Miao But I don't see the necessity of deferring the current work (inode_cache, feature/label changes) to the next transaction. To David: I'm a little curious about why inode_cache needs to be delayed to the next transaction. In btrfs_remount() we hold the s_umount mutex, and we have synced the whole filesystem already, so there should be no running transaction and we can just set any mount option in fs_info. Or even in the worst case, if there is a racing window, we can still start a transaction and do the commit; a little overhead in such a minor case won't impact the overall performance. For the sysfs changes, I prefer the attach-or-start-transaction method, and such sysfs tuning is also a minor case for a filesystem. What do you think about reverting the whole patchset and reworking the sysfs interface? Thanks, Qu Thanks Miao Thanks, Qu If you want to change features/label, you should get write permission and make sure the fs is not frozen, because those are on-disk data. So the problem doesn't exist, or there is a bug. Thanks Miao Since there is no write, there is no running transaction, and if we don't start a new transaction, it won't be flushed to disk. Thanks, Qu The reasons are: - Make the behavior of the fs consistent (for both frozen and unfrozen fs) - Data on the disk is correct and integral Thanks Miao
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
On Tue, 20 Jan 2015 20:10:56 -0500, Chris Mason wrote: On Tue, Jan 20, 2015 at 8:09 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote: Original Message Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock. From: Chris Mason c...@fb.com To: Qu Wenruo quwen...@cn.fujitsu.com Date: 2015-01-21 09:05 On Tue, Jan 20, 2015 at 7:58 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote: Original Message Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock. From: David Sterba dste...@suse.cz To: Qu Wenruo quwen...@cn.fujitsu.com Date: 2015-01-21 01:13 On Mon, Jan 19, 2015 at 03:42:41PM +0800, Qu Wenruo wrote:

--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
 	 */
 	if (fs_info->pending_changes == 0)
 		return 0;
+	/*
+	 * Test if the fs is frozen, or start_transaction
+	 * will deadlock on itself.
+	 */
+	if (__sb_start_write(sb, SB_FREEZE_FS, false))
+		__sb_end_write(sb, SB_FREEZE_FS);
+	else
+		return 0;

But what if someone freezes the FS after __sb_end_write() and before btrfs_start_transaction()? I don't see what keeps new freezers from coming in. -chris Both VFS::freeze_super() and VFS::syncfs() hold the s_umount mutex, so a freeze will not happen during sync. You're right. I was worried about the sync ioctl, but the mutex won't be held there to deadlock against. We'll be fine. There is another problem which is introduced by the pending change mechanism: we will start and commit a transaction for a pending mount option change after we have set the fs to be R/O.
I think it is better that we don't start a new transaction for pending changes which are set after the transaction is committed, and just make them be handled by the next transaction. The reasons are: - Make the behavior of the fs consistent (for both frozen and unfrozen fs) - Data on the disk is correct and integral Thanks Miao
Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
On Wed, 21 Jan 2015 11:53:34 +0800, Qu Wenruo wrote:

+	/*
+	 * Test if the fs is frozen, or start_transaction
+	 * will deadlock on itself.
+	 */
+	if (__sb_start_write(sb, SB_FREEZE_FS, false))
+		__sb_end_write(sb, SB_FREEZE_FS);
+	else
+		return 0;

But what if someone freezes the FS after __sb_end_write() and before btrfs_start_transaction()? I don't see what keeps new freezers from coming in. -chris Both VFS::freeze_super() and VFS::syncfs() hold the s_umount mutex, so a freeze will not happen during sync. You're right. I was worried about the sync ioctl, but the mutex won't be held there to deadlock against. We'll be fine. There is another problem which is introduced by the pending change mechanism: we will start and commit a transaction for a pending mount option change after we have set the fs to be R/O. Oh, I missed this problem. I think it is better that we don't start a new transaction for pending changes which are set after the transaction is committed, and just make them be handled by the next transaction. This will cause another problem: nobody can ensure there will be a next transaction, and the change may never be written to disk. First, the pending changes are mount options, which are in-memory data. Second, the same problem would happen after you freeze the fs. Pending changes are *not* only mount options. Feature changes and label changes are also pending changes if done through sysfs. My miss, I didn't notice the feature and label changes via sysfs. But the implementation of feature and label changes via sysfs is wrong; we cannot change them without write permission. Normal ioctl label changing is not affected. For freeze, it's not the same problem, since the fs will be unfrozen sooner or later and a transaction will be initiated. You cannot assume the operations of the users; they might freeze the fs and then shut down the machine. For example, if we change the features/label through sysfs and then umount the fs, It is different from a pending change.
No, features/label changing via sysfs now both use pending changes to do the commit. See the BTRFS_PENDING_COMMIT bit. So freeze -> change features/label -> sync will still cause the deadlock in the same way, and you can try it yourself. As I said above, the implementation of the sysfs feature and label changes is wrong; it is better to separate them from the pending mount option changes, and make the sysfs feature and label changes be done in the context of a transaction after getting write permission. If so, we needn't do anything special when syncing the fs. In short, changing the sysfs feature and label change implementation and removing the unnecessary btrfs_start_transaction in sync_fs can fix the deadlock. Thanks Miao Thanks, Qu If you want to change features/label, you should get write permission and make sure the fs is not frozen, because those are on-disk data. So the problem doesn't exist, or there is a bug. Thanks Miao Since there is no write, there is no running transaction, and if we don't start a new transaction, it won't be flushed to disk. Thanks, Qu The reasons are: - Make the behavior of the fs consistent (for both frozen and unfrozen fs) - Data on the disk is correct and integral Thanks Miao
Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.
On Mon, 19 Jan 2015 15:42:41 +0800, Qu Wenruo wrote: Commit 6b5fe46dfa52 (btrfs: do commit in sync_fs if there are pending changes) will call btrfs_start_transaction() in sync_fs() to handle some operations that need to be done in the next transaction. However this can cause a deadlock if the filesystem is frozen, with the following sysrq-w output:

[ 143.255932] Call Trace:
[ 143.255936] [816c0e09] schedule+0x29/0x70
[ 143.255939] [811cb7f3] __sb_start_write+0xb3/0x100
[ 143.255971] [a040ec06] start_transaction+0x2e6/0x5a0 [btrfs]
[ 143.255992] [a040f1eb] btrfs_start_transaction+0x1b/0x20 [btrfs]
[ 143.256003] [a03dc0ba] btrfs_sync_fs+0xca/0xd0 [btrfs]
[ 143.256007] [811f7be0] sync_fs_one_sb+0x20/0x30
[ 143.256011] [811cbd01] iterate_supers+0xe1/0xf0
[ 143.256014] [811f7d75] sys_sync+0x55/0x90
[ 143.256017] [816c49d2] system_call_fastpath+0x12/0x17
[ 143.256111] Call Trace:
[ 143.256114] [816c0e09] schedule+0x29/0x70
[ 143.256119] [816c3405] rwsem_down_write_failed+0x1c5/0x2d0
[ 143.256123] [8133f013] call_rwsem_down_write_failed+0x13/0x20
[ 143.256131] [811caae8] thaw_super+0x28/0xc0
[ 143.256135] [811db3e5] do_vfs_ioctl+0x3f5/0x540
[ 143.256187] [811db5c1] SyS_ioctl+0x91/0xb0
[ 143.256213] [816c49d2] system_call_fastpath+0x12/0x17

The reason is the following:

(Holding s_umount)
VFS sync_fs stuff:
|- btrfs_sync_fs()
   |- btrfs_start_transaction()
      |- sb_start_intwrite() (waiting for thaw_fs to unfreeze)

VFS thaw_fs stuff:
thaw_fs() (waiting for sync_fs to release s_umount)

So the deadlock happens. This can easily be triggered by fstest/generic/068 with the inode_cache mount option. The fix is to check whether the fs is frozen; if it is, just return and wait for the next transaction.
Cc: David Sterba dste...@suse.cz Reported-by: Gui Hecheng guihc.f...@cn.fujitsu.com Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com --- fs/btrfs/super.c | 8 ++++++++ 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 60f7cbe..1d9f1e6 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
 	 */
 	if (fs_info->pending_changes == 0)
 		return 0;

I think the problem is here -- why is ->pending_changes not 0 when the filesystem is frozen? So I think the reason for this problem is that btrfs_freeze forgets to deal with the pending changes, and the correct fix is to correct the behavior of btrfs_freeze(). Thanks Miao

+	/*
+	 * Test if the fs is frozen, or start_transaction
+	 * will deadlock on itself.
+	 */
+	if (__sb_start_write(sb, SB_FREEZE_FS, false))
+		__sb_end_write(sb, SB_FREEZE_FS);
+	else
+		return 0;
 	trans = btrfs_start_transaction(root, 0);
 } else {
 	return PTR_ERR(trans);
Re: [PATCH] Btrfs: fix typo of variable in scrub_stripe
On Fri, 09 Jan 2015 17:37:52 +0900, Tsutomu Itoh wrote: The address that should be freed is not 'ppath' but 'path'. Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com --- fs/btrfs/scrub.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f2bb13a..403fbdb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3053,7 +3053,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 	ppath = btrfs_alloc_path();
 	if (!ppath) {
-		btrfs_free_path(ppath);
+		btrfs_free_path(path);

My bad. Thanks for fixing it. Reviewed-by: Miao Xie miao...@huawei.com

 		return -ENOMEM;
 	}
Re: [PATCH] btrfs: delete chunk allocation attempt when setting block group ro
On Thu, 08 Jan 2015 13:23:13 -0800, Shaohua Li wrote: Below test will fail currently: mkfs.ext4 -F /dev/sda btrfs-convert /dev/sda mount /dev/sda /mnt btrfs device add -f /dev/sdb /mnt btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt The reason is that there are some block groups with usage 0, but the whole disk has no free space to allocate a new chunk, so we can't even set such a block group read-only. This patch deletes the chunk allocation when setting a block group ro. For META, we already have a reserve. But for SYSTEM we don't, so the check_system_chunk is still required. Signed-off-by: Shaohua Li s...@fb.com --- fs/btrfs/extent-tree.c | 31 +++++++------------------------ 1 file changed, 7 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a80b971..430101b6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8493,22 +8493,8 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 {
 	struct btrfs_space_info *sinfo = cache->space_info;
 	u64 num_bytes;
-	u64 min_allocable_bytes;
 	int ret = -ENOSPC;
-
-	/*
-	 * We need some metadata space and system metadata space for
-	 * allocating chunks in some corner cases until we force to set
-	 * it to be readonly.
-	 */
-	if ((sinfo->flags &
-	     (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
-	    !force)
-		min_allocable_bytes = 1 * 1024 * 1024;
-	else
-		min_allocable_bytes = 0;
-
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
@@ -8521,8 +8507,8 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 		cache->bytes_super - btrfs_block_group_used(&cache->item);
 	if (sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned +
-	    sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes +
-	    min_allocable_bytes <= sinfo->total_bytes) {
+	    sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes
+	    <= sinfo->total_bytes) {
 		sinfo->bytes_readonly += num_bytes;
 		cache->ro = 1;
 		list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
@@ -8548,14 +8534,6 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
 	if (IS_ERR(trans))
 		return PTR_ERR(trans);
-	alloc_flags = update_block_group_flags(root, cache->flags);
-	if (alloc_flags != cache->flags) {
-		ret = do_chunk_alloc(trans, root, alloc_flags,
-				     CHUNK_ALLOC_FORCE);
-		if (ret < 0)
-			goto out;
-	}
-
 	ret = set_block_group_ro(cache, 0);
 	if (!ret)
 		goto out;
@@ -8566,6 +8544,11 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
 		goto out;
 	ret = set_block_group_ro(cache, 0);
 out:
+	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
+		alloc_flags = update_block_group_flags(root, cache->flags);
+		check_system_chunk(trans, root, alloc_flags);

Please consider the case that the following patch fixed: 199c36eaa95077a47ae1bc55532fc0fbeb80cc95 If there is no free device space, check_system_chunk cannot allocate a new system metadata chunk, so when we run the final step of the chunk allocation to update the device item and insert the new chunk item, we would fail. Thanks Miao

+	}
+
 	btrfs_end_transaction(trans, root);
 	return ret;
 }
Re: [PATCH] btrfs: delete chunk allocation attempt when setting block group ro
On Thu, 08 Jan 2015 18:06:50 -0800, Shaohua Li wrote: On Fri, Jan 09, 2015 at 09:01:57AM +0800, Miao Xie wrote: On Thu, 08 Jan 2015 13:23:13 -0800, Shaohua Li wrote: Below test will fail currently: mkfs.ext4 -F /dev/sda btrfs-convert /dev/sda mount /dev/sda /mnt btrfs device add -f /dev/sdb /mnt btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt The reason is that there are some block groups with usage 0, but the whole disk has no free space to allocate a new chunk, so we can't even set such a block group read-only. This patch deletes the chunk allocation when setting a block group ro. For META, we already have a reserve. But for SYSTEM we don't, so the check_system_chunk is still required. Signed-off-by: Shaohua Li s...@fb.com --- fs/btrfs/extent-tree.c | 31 +++++++------------------------ 1 file changed, 7 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a80b971..430101b6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8493,22 +8493,8 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 {
 	struct btrfs_space_info *sinfo = cache->space_info;
 	u64 num_bytes;
-	u64 min_allocable_bytes;
 	int ret = -ENOSPC;
-
-	/*
-	 * We need some metadata space and system metadata space for
-	 * allocating chunks in some corner cases until we force to set
-	 * it to be readonly.
-	 */
-	if ((sinfo->flags &
-	     (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
-	    !force)
-		min_allocable_bytes = 1 * 1024 * 1024;
-	else
-		min_allocable_bytes = 0;
-
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);

[SNIP]

 	ret = set_block_group_ro(cache, 0);
 	if (!ret)
 		goto out;
@@ -8566,6 +8544,11 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
 		goto out;
 	ret = set_block_group_ro(cache, 0);
 out:
+	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
+		alloc_flags = update_block_group_flags(root, cache->flags);
+		check_system_chunk(trans, root, alloc_flags);

Please consider the case that the following patch fixed: 199c36eaa95077a47ae1bc55532fc0fbeb80cc95 If there is no free device space, check_system_chunk cannot allocate a new system metadata chunk, so when we run the final step of the chunk allocation to update the device item and insert the new chunk item, we would fail. So the relocation will always fail in this case. The check just makes the failure earlier, right? We don't have the BUG_ON in do_chunk_alloc() currently. The final step of the chunk allocation is a delayed operation; we must make sure it can be done successfully, or we would abort the transaction, make the filesystem read-only, and lose the data that was written into the filesystem before we did the balance, which would make the users uncomfortable. With this patch, we will set the block group read-only successfully the first time we invoke set_block_group_ro(). But if the block group that will be set to RO is the only system metadata block group in the filesystem, and there is no device space to allocate a new one, then we have no space to deal with the pending final step of the chunk allocation, so the problem I described above will happen. Thanks Miao
[PATCH v4 03/10] Btrfs, raid56: don't change bbio and raid_map
Because we will reuse the bbio and raid_map during the scrub later, it is better that we don't change any members of the bbio and don't free it at the end of the IO request. So we introduce similar members into the raid bio, and no longer access those members of the bbio. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 -> v4: - None. --- fs/btrfs/raid56.c | 42 +++++++++++++++++++++++-------------------- 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 66944b9..cb31cc6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT	3
-
 #define RBIO_CACHE_SIZE 1024
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 	atomic_t refs;
+
+	atomic_t stripes_pending;
+
+	atomic_t error;
 	/*
 	 * these are two arrays of pointers. We allocate the
 	 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 	bio_put(bio);
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 	err = 0;
 	/* OK, we have read all the stripes we need to. */
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		err = -EIO;
 	rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->faila = -1;
 	rbio->failb = -1;
 	atomic_set(&rbio->refs, 1);
+	atomic_set(&rbio->error, 0);
+	atomic_set(&rbio->stripes_pending, 0);
 	/*
 	 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 	spin_unlock_irq(&rbio->bio_list_lock);
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 	/*
 	 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 	}
-	atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-	BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 	while (1) {
 		bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
 	if (rbio->faila == -1) {
 		/* first failure on this rbio */
 		rbio->faila = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else if (rbio->failb == -1) {
 		/* second failure on this rbio */
 		rbio->failb = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else {
 		ret = -EIO;
 	}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 	bio_put(bio);
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 	err = 0;
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		goto cleanup;
 	/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 	index_rbio_pages(rbio);
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 	/*
 	 * build a list of bios to read all the missing parts of this
 	 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 	 * the bbio may be freed once we submit the last bio. Make sure
 	 * not to touch it after that
 	 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
 	while (1) {
 		bio = bio_list_pop(&bio_list);
 		if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
 		set_bio_pages_uptodate(bio);
 	bio_put(bio);
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
-	if (atomic_read
[PATCH v4 01/10] Btrfs: remove unused bbio_ret in __btrfs_map_block in condition
From: Zhao Lei zhao...@cn.fujitsu.com bbio_ret in this condition is always !NULL, because the previous code already has a check-and-skip: 4908 if (!bbio_ret) 4909 goto out; Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com Signed-off-by: Miao Xie mi...@cn.fujitsu.com Reviewed-by: David Sterba dste...@suse.cz --- Changelog v1 -> v4: - None. --- fs/btrfs/volumes.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 54db1fb..6f80aef 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,8 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			 BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
-		if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1) &&
-		    raid_map_ret) {
+		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
 			int i, rot;
 			/* push stripe_nr back to the start of the full stripe */
-- 1.9.3
[PATCH v4 00/10] Implement device scrub/replace for RAID56
This patchset implements the device scrub/replace function for RAID56. Most of the implementation for the common data is similar to the other RAID types; the difference, and the difficulty, is the parity processing. The basic idea is to read and check the data which has a checksum outside of the raid56 stripe lock. If the data is right, then lock the raid56 stripe and read out the other data in the same stripe; if no IO error happens, calculate the parity and check it against the original one. If the original parity is right, the parity scrub passes; otherwise write out the new one. But if the common data (not parity) that we read out is wrong, we will try to recover it, and then check and repair the parity. And in order to avoid making the code more and more complex, we copied some code of the common data process for the parity; the cleanup work is in my TODO list. We have done some tests and the patchset worked well. Of course, more tests are welcome. If you are interested in using or testing it, you can pull the patchset from https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace Changelog v3 -> v4: - Fix the problem that the scrub's raid bio was cached, which was reported by Chris. - Remove the 10th patch; the deadlock that was described in that patch doesn't exist on the current kernel. - Rebase the patchset onto the top of the integration branch Changelog v2 -> v3: - Fix wrong stripe start logical address calculation, which was reported by Chris. - Fix unhandled raid bios for parity scrub, which are added into the plug list of the head raid bio. - Fix possible deadlock caused by the pending bios in the plug list when the io submitters were going to sleep. - Fix undealt-with use-after-free problem of the source device in the final device replace procedure. - Modify the code that is used to avoid the rbio merge. Changelog v1 -> v2: - Change some function names in raid56.c to make them fit the code style of the raid56.
Thanks Miao

Miao Xie (7):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56

Zhao Lei (3):
  Btrfs: remove unused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/ctree.h       |   7 +-
 fs/btrfs/dev-replace.c |   9 +-
 fs/btrfs/raid56.c      | 763 +-
 fs/btrfs/raid56.h      |  16 +-
 fs/btrfs/scrub.c       | 803 +++--
 fs/btrfs/volumes.c     |  52 +++-
 fs/btrfs/volumes.h     |  14 +-
 7 files changed, 1531 insertions(+), 133 deletions(-)
-- 1.9.3
[PATCH v4 07/10] Btrfs, replace: write dirty pages into the replace target device
The implementation is simple: - In order to avoid changing the code logic of btrfs_map_bio and RAID56, we add the stripes of the replace target devices at the end of the stripe array in the btrfs bio, and we sort those target device stripes in the array. And we keep the number of the target device stripes in the btrfs bio. - Except for the write operation on RAID56, all the other operations don't take the target device stripes into account. - When we do a write operation, we read the data from the common devices and calculate the parity. Then we write the dirty data and the new parity out; at this time, we will find the relative replace target stripes and write the relative data into them. Note: The function that copies old data on the source device to the target device was implemented in the past; it is similar to the other RAID types. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 -> v4: - None. --- fs/btrfs/raid56.c | 104 + fs/btrfs/volumes.c | 26 -- fs/btrfs/volumes.h | 10 -- 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 58a8408..16fe456 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -131,6 +131,8 @@ struct btrfs_raid_bio {
 	/* number of data stripes (no p/q) */
 	int nr_data;
+	int real_stripes;
+
 	int stripe_npages;
 	/*
 	 * set if we're doing a parity rebuild
@@ -638,7 +640,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-	if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+	if (rbio->nr_data + 1 == rbio->real_stripes)
 		return NULL;
 	index += ((rbio->nr_data + 1) * rbio->stripe_len)
@@ -981,7 +983,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 {
 	struct btrfs_raid_bio *rbio;
 	int nr_data = 0;
-	int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+	int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+	int num_pages = rbio_nr_pages(stripe_len, real_stripes);
 	int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
 	void *p;
@@ -1001,6 +1004,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->fs_info = root->fs_info;
 	rbio->stripe_len = stripe_len;
 	rbio->nr_pages = num_pages;
+	rbio->real_stripes = real_stripes;
 	rbio->stripe_npages = stripe_npages;
 	rbio->faila = -1;
 	rbio->failb = -1;
@@ -1017,10 +1021,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->bio_pages = p + sizeof(struct page *) * num_pages;
 	rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
-	if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-		nr_data = bbio->num_stripes - 2;
+	if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+		nr_data = real_stripes - 2;
 	else
-		nr_data = bbio->num_stripes - 1;
+		nr_data = real_stripes - 1;
 	rbio->nr_data = nr_data;
 	return rbio;
@@ -1132,7 +1136,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
 	if (rbio->faila >= 0 || rbio->failb >= 0) {
-		BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+		BUG_ON(rbio->faila == rbio->real_stripes - 1);
 		__raid56_parity_recover(rbio);
 	} else {
 		finish_rmw(rbio);
@@ -1193,7 +1197,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
 	struct btrfs_bio *bbio = rbio->bbio;
-	void *pointers[bbio->num_stripes];
+	void *pointers[rbio->real_stripes];
 	int stripe_len = rbio->stripe_len;
 	int nr_data = rbio->nr_data;
 	int stripe;
@@ -1207,11 +1211,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	bio_list_init(&bio_list);
-	if (bbio->num_stripes - rbio->nr_data == 1) {
-		p_stripe = bbio->num_stripes - 1;
-	} else if (bbio->num_stripes - rbio->nr_data == 2) {
-		p_stripe = bbio->num_stripes - 2;
-		q_stripe = bbio->num_stripes - 1;
+	if (rbio->real_stripes - rbio->nr_data == 1) {
+		p_stripe = rbio->real_stripes - 1;
+	} else if (rbio->real_stripes - rbio->nr_data == 2) {
+		p_stripe = rbio->real_stripes - 2;
+		q_stripe = rbio->real_stripes - 1;
 	} else {
 		BUG();
 	}
@@ -1268,7 +1272,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		SetPageUptodate(p);
 		pointers[stripe++] = kmap(p);
-		raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE
[PATCH v4 05/10] Btrfs, raid56: use a variant to record the operation type
We will introduce a new operation type later. If we were to keep using an integer variable as a bool for each operation type, we would have to add a new variable and increase the size of the raid bio structure, which is not good. With this patch, we define a different number for each operation, and we can use a single variable to record the operation type. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 -> v4: - None. --- fs/btrfs/raid56.c | 31 +++++++++++++++------------- 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c954537..4924388 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -69,6 +69,11 @@
 #define RBIO_CACHE_SIZE 1024
+enum btrfs_rbio_ops {
+	BTRFS_RBIO_WRITE	= 0,
+	BTRFS_RBIO_READ_REBUILD	= 1,
+};
+
 struct btrfs_raid_bio {
 	struct btrfs_fs_info *fs_info;
 	struct btrfs_bio *bbio;
@@ -131,7 +136,7 @@ struct btrfs_raid_bio {
 	 * differently from a parity rebuild as part of
 	 * rmw
 	 */
-	int read_rebuild;
+	enum btrfs_rbio_ops operation;
 	/* first bad stripe */
 	int faila;
@@ -154,7 +159,6 @@ struct btrfs_raid_bio {
 	atomic_t refs;
-
 	atomic_t stripes_pending;
 	atomic_t error;
@@ -590,8 +594,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
 		return 0;
 	/* reads can't merge with writes */
-	if (last->read_rebuild !=
-	    cur->read_rebuild) {
+	if (last->operation != cur->operation) {
 		return 0;
 	}
@@ -784,9 +787,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 			spin_unlock(&rbio->bio_list_lock);
 			spin_unlock_irqrestore(&h->lock, flags);
-			if (next->read_rebuild)
+			if (next->operation == BTRFS_RBIO_READ_REBUILD)
 				async_read_rebuild(next);
-			else {
+			else if (next->operation == BTRFS_RBIO_WRITE) {
 				steal_rbio(rbio, next);
 				async_rmw_stripe(next);
 			}
@@ -1720,6 +1723,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	}
 	bio_list_add(&rbio->bio_list, bio);
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
+	rbio->operation = BTRFS_RBIO_WRITE;
 	/*
 	 * don't plug on full rbios, just get them out the door
@@ -1768,7 +1772,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	faila = rbio->faila;
 	failb = rbio->failb;
-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
 		spin_lock_irq(&rbio->bio_list_lock);
 		set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 		spin_unlock_irq(&rbio->bio_list_lock);
@@ -1785,7 +1789,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 		 * if we're rebuilding a read, we have to use
 		 * pages from the bio list
 		 */
-		if (rbio->read_rebuild &&
+		if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
 		    (stripe == faila || stripe == failb)) {
 			page = page_in_rbio(rbio, stripe, pagenr, 0);
 		} else {
@@ -1878,7 +1882,7 @@ pstripe:
 	 * know they can be trusted. If this was a read reconstruction,
 	 * other endio functions will fiddle the uptodate bits
 	 */
-	if (!rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_WRITE) {
 		for (i = 0; i < nr_pages; i++) {
 			if (faila != -1) {
 				page = rbio_stripe_page(rbio, faila, i);
@@ -1895,7 +1899,7 @@ pstripe:
 		 * if we're rebuilding a read, we have to use
 		 * pages from the bio list
 		 */
-		if (rbio->read_rebuild &&
+		if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
 		    (stripe == faila || stripe == failb)) {
 			page = page_in_rbio(rbio, stripe, pagenr, 0);
 		} else {
@@ -1910,8 +1914,7 @@ cleanup:
 	kfree(pointers);
 cleanup_io:
-
-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
 		if (err == 0 &&
 		    !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags))
 			cache_rbio_pages(rbio);
@@ -2050,7 +2053,7 @@ out:
 	return 0;
 cleanup:
-	if (rbio->read_rebuild)
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
 		rbio_orig_end_io(rbio, -EIO, 0);
 	return -EIO;
 }
@@ -2076,7 +2079,7 @@ int raid56_parity_recover(struct
[PATCH v4 09/10] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
Commit c404e0dc ("Btrfs: fix use-after-free in the finishing procedure of the device replace") fixed a use-after-free problem that happened when removing the source device at the end of a device replace. At that time btrfs didn't support device replace on raid56, so the problem was not fixed for the raid56 profile. Now that we have implemented device replace for raid56, we need to eliminate that problem before enabling the function for raid56. The fix is simple: we increase the per-cpu bio counter before we submit a raid56 io, and decrease the counter when the raid56 io ends.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 -> v4:
- None.
Changelog v2 -> v3:
- New patch to fix the undealt use-after-free problem of the source device in the final device replace procedure.
Changelog v1 -> v2:
- None.
---
 fs/btrfs/ctree.h       |  7 ++++++-
 fs/btrfs/dev-replace.c |  4 ++--
 fs/btrfs/raid56.c      | 41 +++++++++++++++++++++++++++++------------
 fs/btrfs/raid56.h      |  4 ++--
 fs/btrfs/scrub.c       |  2 +-
 fs/btrfs/volumes.c     |  7 ++++++-
 6 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fc73e86..3770f4c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4156,7 +4156,12 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
+
+static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+{
+	btrfs_bio_counter_sub(fs_info, 1);
+}
 
 /* reada.c */
 struct reada_control {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 91f6b8f..326919b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,9 +928,9 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
 	percpu_counter_inc(&fs_info->bio_counter);
 }
 
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
-	percpu_counter_dec(&fs_info->bio_counter);
+	percpu_counter_sub(&fs_info->bio_counter, amount);
 
 	if (waitqueue_active(&fs_info->replace_wait))
 		wake_up(&fs_info->replace_wait);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7e6f239..44573bf 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,6 +162,8 @@ struct btrfs_raid_bio {
 	 */
 	int bio_list_bytes;
 
+	int generic_bio_cnt;
+
 	atomic_t refs;
 
 	atomic_t stripes_pending;
@@ -354,6 +356,7 @@ static void merge_rbio(struct btrfs_raid_bio *dest,
 {
 	bio_list_merge(&dest->bio_list, &victim->bio_list);
 	dest->bio_list_bytes += victim->bio_list_bytes;
+	dest->generic_bio_cnt += victim->generic_bio_cnt;
 	bio_list_init(&victim->bio_list);
 }
@@ -891,6 +894,10 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, int err, int uptodate)
 {
 	struct bio *cur = bio_list_get(&rbio->bio_list);
 	struct bio *next;
+
+	if (rbio->generic_bio_cnt)
+		btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
+
 	free_raid_bio(rbio);
 
 	while (cur) {
@@ -1775,6 +1782,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	struct btrfs_raid_bio *rbio;
 	struct btrfs_plug_cb *plug = NULL;
 	struct blk_plug_cb *cb;
+	int ret;
 
 	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
 	if (IS_ERR(rbio)) {
@@ -1785,12 +1793,19 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 	rbio->operation = BTRFS_RBIO_WRITE;
 
+	btrfs_bio_counter_inc_noblocked(root->fs_info);
+	rbio->generic_bio_cnt = 1;
+
 	/*
 	 * don't plug on full rbios, just get them out the door
 	 * as quickly as we can
 	 */
-	if (rbio_is_full(rbio))
-		return full_stripe_write(rbio);
+	if (rbio_is_full(rbio)) {
+		ret = full_stripe_write(rbio);
+		if (ret)
+			btrfs_bio_counter_dec(root->fs_info);
+		return ret;
+	}
 
 	cb = blk_check_plugged(btrfs_raid_unplug, root->fs_info,
 			       sizeof(*plug));
@@ -1801,10 +1816,13 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
 			INIT_LIST_HEAD(&plug->rbio_list);
 		}
 		list_add_tail(&rbio->plug_list, &plug->rbio_list);
+		ret = 0;
 	} else {
-		return __raid56_parity_write(rbio);
+		ret = __raid56_parity_write(rbio);
+		if (ret)
+			btrfs_bio_counter_dec(root->fs_info);
 	}
-	return 0;
+	return ret
[PATCH v4 10/10] Btrfs, replace: enable dev-replace for raid56
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/dev-replace.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 326919b..51133ea 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 	struct btrfs_device *tgt_device = NULL;
 	struct btrfs_device *src_device = NULL;
 
-	if (btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-		return -EOPNOTSUPP;
-	}
-
 	switch (args->start.cont_reading_from_srcdev_mode) {
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 04/10] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the implementation is similar to the scrub on the other RAID such as RAID1, the differentia is that we don't read the data from the mirror, we use the data repair function of RAID5/6. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v3 - v4: - Fix the problem that the scrub's raid bio was cached, which was reported by Chris. Changelog v2 - v3: - None. Changelog v1 - v2: - Change some function names in raid56.c to make them fit the code style of the raid56. --- fs/btrfs/raid56.c | 52 ++ fs/btrfs/raid56.h | 2 +- fs/btrfs/scrub.c | 194 - fs/btrfs/volumes.c | 16 - fs/btrfs/volumes.h | 4 ++ 5 files changed, 235 insertions(+), 33 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index cb31cc6..c954537 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -58,6 +58,15 @@ */ #define RBIO_CACHE_READY_BIT 3 +/* + * bbio and raid_map is managed by the caller, so we shouldn't free + * them here. And besides that, all rbios with this flag should not + * be cached, because we need raid_map to check the rbios' stripe + * is the same or not, but it is very likely that the caller has + * free raid_map, so don't cache those rbios. 
+ */ +#define RBIO_HOLD_BBIO_MAP_BIT 4 + #define RBIO_CACHE_SIZE 1024 struct btrfs_raid_bio { @@ -799,6 +808,21 @@ done_nolock: remove_rbio_from_cache(rbio); } +static inline void +__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need) +{ + if (need) { + kfree(raid_map); + kfree(bbio); + } +} + +static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio) +{ + __free_bbio_and_raid_map(rbio-bbio, rbio-raid_map, + !test_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags)); +} + static void __free_raid_bio(struct btrfs_raid_bio *rbio) { int i; @@ -817,8 +841,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio) rbio-stripe_pages[i] = NULL; } } - kfree(rbio-raid_map); - kfree(rbio-bbio); + + free_bbio_and_raid_map(rbio); + kfree(rbio); } @@ -933,11 +958,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2, GFP_NOFS); - if (!rbio) { - kfree(raid_map); - kfree(bbio); + if (!rbio) return ERR_PTR(-ENOMEM); - } bio_list_init(rbio-bio_list); INIT_LIST_HEAD(rbio-plug_list); @@ -1692,8 +1714,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct blk_plug_cb *cb; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + __free_bbio_and_raid_map(bbio, raid_map, 1); return PTR_ERR(rbio); + } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -1888,7 +1912,8 @@ cleanup: cleanup_io: if (rbio-read_rebuild) { - if (err == 0) + if (err == 0 + !test_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags)) cache_rbio_pages(rbio); else clear_bit(RBIO_CACHE_READY_BIT, rbio-flags); @@ -2038,15 +2063,19 @@ cleanup: */ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, - u64 stripe_len, int mirror_num) + u64 stripe_len, int mirror_num, int hold_bbio) { struct btrfs_raid_bio *rbio; int ret; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if 
(IS_ERR(rbio)) + if (IS_ERR(rbio)) { + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); return PTR_ERR(rbio); + } + if (hold_bbio) + set_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags); rbio-read_rebuild = 1; bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2054,8 +2083,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, rbio-faila = find_logical_bio_stripe(rbio, bio); if (rbio-faila == -1) { BUG(); - kfree(raid_map); - kfree(bbio); + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); kfree(rbio); return -EIO; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index ea5d73b..b310e8c 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map) int raid56_parity_recover(struct btrfs_root *root, struct bio *bio
[PATCH v4 02/10] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value is set again in a later line:
	stripe_index = 0;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6f80aef..eeb5b31 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5172,9 +5172,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
 			/* push stripe_nr back to the start of the full stripe */
 			stripe_nr = raid56_full_stripe_start;
-			do_div(stripe_nr, stripe_len);
-
-			stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+			do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
 			/* RAID[56] write or recovery. Return all stripes */
 			num_stripes = map->num_stripes;
-- 
1.9.3
Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
hi, Chris On Fri, 28 Nov 2014 16:32:03 -0500, Chris Mason wrote: On Wed, Nov 26, 2014 at 10:00 PM, Miao Xie mi...@cn.fujitsu.com wrote: On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote: On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote: On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote: The increase/decrease of bio counter is on the I/O path, so we should use io_schedule() instead of schedule(), or the deadlock might be triggered by the pending I/O in the plug list. io_schedule() can help us because it will flush all the pending I/O before the task is going to sleep. Can you please describe this deadlock in more detail? schedule() also triggers a flush of the plug list, and if that's no longer sufficient we can run into other problems (especially with preemption on). Sorry for my miss. I forgot to check the current implementation of schedule(), which flushes the plug list unconditionally. Please ignore this patch. I have updated my raid56-scrub-replace branch, please re-pull the branch. https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace Sorry, I wasn't clear. I do like the patch because it uses a slightly better trigger mechanism for the flush. I was just worried about a larger deadlock. I ran the raid56 work with stress.sh overnight, then scrubbed the resulting filesystem and ran balance when the scrub completed. All of these passed without errors (excellent!). Then I zero'd 4GB of one drive and ran scrub again. This was the result. Please make sure CONFIG_DEBUG_PAGEALLOC is enabled and you should be able to reproduce. I sent out the 4th version of the patchset, please try it. I have pushed the new patchset to my git tree, you can re-pull it. 
https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace Thanks Miao [192392.495260] BUG: unable to handle kernel paging request at 880303062f80 [192392.495279] IP: [a05fe77a] lock_stripe_add+0xba/0x390 [btrfs] [192392.495281] PGD 2bdb067 PUD 107e7fd067 PMD 107e7e4067 PTE 800303062060 [192392.495283] Oops: [#1] SMP DEBUG_PAGEALLOC [192392.495307] Modules linked in: ipmi_devintf loop fuse k10temp coretemp hwmon btrfs raid6_pq zlib_deflate lzo_compress xor xfs exportfs libcrc32c tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables xt_NFLOG nfnetlink_log nfnetlink xt_comment xt_statistic iptable_filter ip_tables x_tables mptctl netconsole autofs4 nfsv3 nfs lockd grace rpcsec_gss_krb5 auth_rpcgss oid_registry sunrpc ipv6 ext3 jbd dm_mod rtc_cmos ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 lpc_ich mfd_core shpchp ehci_pci ehci_hcd mlx4_en ptp pps_core mlx4_core sg ses enclosure button megaraid_sas [192392.495310] CPU: 0 PID: 11992 Comm: kworker/u65:2 Not tainted 3.18.0-rc6-mason+ #7 [192392.495310] Hardware name: ZTSYSTEMS Echo Ridge T4 /A9DRPF-10D, BIOS 1.07 05/10/2012 [192392.495323] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs] [192392.495324] task: 88013dae9110 ti: 8802296a task.ti: 8802296a [192392.495335] RIP: 0010:[a05fe77a] [a05fe77a] lock_stripe_add+0xba/0x390 [btrfs] [192392.495335] RSP: 0018:8802296a3ac8 EFLAGS: 00010006 [192392.495336] RAX: 880577e85018 RBX: 880497f0b2f8 RCX: 8801190fb000 [192392.495337] RDX: 013d RSI: 880303062f80 RDI: 040c275a [192392.495338] RBP: 8802296a3b48 R08: 880497f0 R09: 0001 [192392.495339] R10: R11: R12: 0282 [192392.495339] R13: b250 R14: 880577e85000 R15: 880497f0b2a0 [192392.495340] FS: () GS:88085fc0() knlGS: [192392.495341] CS: 0010 DS: ES: CR0: 80050033 [192392.495342] CR2: 880303062f80 CR3: 05289000 CR4: 000407f0 [192392.495342] Stack: [192392.495344] 880755e28000 880497f0 013d 8801190fb000 [192392.495346] 88013dae9110 81090d40 8802296a3b00 [192392.495347] 8802296a3b00 0010 
8802296a3b68 8801190fb000 [192392.495348] Call Trace: [192392.495353] [81090d40] ? bit_waitqueue+0xa0/0xa0 [192392.495363] [a05fea66] raid56_parity_submit_scrub_rbio+0x16/0x30 [btrfs] [192392.495372] [a05e2f0e] scrub_parity_check_and_repair+0x15e/0x1e0 [btrfs] [192392.495380] [a05e301d] scrub_block_put+0x8d/0x90 [btrfs] [192392.495388] [a05e6ed7] ? scrub_bio_end_io_worker+0xd7/0x870 [btrfs] [192392.495396] [a05e6ee9] scrub_bio_end_io_worker+0xe9/0x870 [btrfs] [192392.495405] [a05b8c44] normal_work_helper+0x84/0x330 [btrfs] [192392.495414] [a05b8f42] btrfs_scrub_helper+0x12/0x20 [btrfs] [192392.495417] [8106c50f] process_one_work+0x1bf/0x520 [192392.495419] [8106c48d
Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote: On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote: This patchset implement the device scrub/replace function for RAID56, the most implementation of the common data is similar to the other RAID type. The differentia or difficulty is the parity process. The basic idea is reading and check the data which has checksum out of the raid56 stripe lock, if the data is right, then lock the raid56 stripe, read out the other data in the same stripe, if no IO error happens, calculate the parity and check the original one, if the original parity is right, the scrub parity passes. or write out the new one. But if the common data(not parity) that we read out is wrong, we will try to recover it, and then check and repair the parity. And in order to avoid making the code more and more complex, we copy some code of common data process for the parity, the cleanup work is in my TODO list. We have done some test, the patchset worked well. Of course, more tests are welcome. If you are interesting to use it or test it, you can pull the patchset from https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace Changelog v3 - v4: - Fix the problem that the scrub's raid bio was cached, which was reported by Chris. - Remove the 10st patch, the deadlock that was described in that patch doesn't exist on the current kernel. - Rebase the patchset to the top of integration branch Thanks, I'll try this today. I need to rebase in a new version of the RCU patches, can you please cook one on top of v3.18-rc6 instead? No problem. Thanks Miao -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote: On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote: This patchset implement the device scrub/replace function for RAID56, the most implementation of the common data is similar to the other RAID type. The differentia or difficulty is the parity process. The basic idea is reading and check the data which has checksum out of the raid56 stripe lock, if the data is right, then lock the raid56 stripe, read out the other data in the same stripe, if no IO error happens, calculate the parity and check the original one, if the original parity is right, the scrub parity passes. or write out the new one. But if the common data(not parity) that we read out is wrong, we will try to recover it, and then check and repair the parity. And in order to avoid making the code more and more complex, we copy some code of common data process for the parity, the cleanup work is in my TODO list. We have done some test, the patchset worked well. Of course, more tests are welcome. If you are interesting to use it or test it, you can pull the patchset from https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace Changelog v3 - v4: - Fix the problem that the scrub's raid bio was cached, which was reported by Chris. - Remove the 10st patch, the deadlock that was described in that patch doesn't exist on the current kernel. - Rebase the patchset to the top of integration branch Thanks, I'll try this today. I need to rebase in a new version of the RCU patches, can you please cook one on top of v3.18-rc6 instead? I have updated my raid56-scrub-replace branch, please re-pull it. https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace The v4 patchset in the mail list can be applied on v3.18-rc6 successfully, so I don't update it. 
Thanks
Miao
[PATCH] Btrfs: fix wrong list access on the failure of reading out checksum
If we fail to read out the checksums, we free all the checksums in the list. But the current code accesses the list head, not the entries in the list. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/file-item.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 783a943..c26b58f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	ret = 0;
 fail:
 	while (ret < 0 && !list_empty(&tmplist)) {
-		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
+		sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
+					list);
 		list_del(&sums->list);
 		kfree(sums);
 	}
-- 
1.9.3
Re: [PATCH] Btrfs: fix wrong list access on the failure of reading out checksum
Please ignore this patch; Chris has already fixed this problem.

Thanks
Miao

On Mon, 1 Dec 2014 18:04:13 +0800, Miao Xie wrote:
If we fail to read out the checksums, we free all the checksums in the list. But the current code accesses the list head, not the entries in the list. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/file-item.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 783a943..c26b58f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	ret = 0;
 fail:
 	while (ret < 0 && !list_empty(&tmplist)) {
-		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
+		sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
+					list);
 		list_del(&sums->list);
 		kfree(sums);
 	}
[PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type
We will introduce new operation type later, if we still use integer variant as bool variant to record the operation type, we would add new variant and increase the size of raid bio structure. It is not good, by this patch, we define different number for different operation, and we can just use a variant to record the operation type. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 30 +- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 6013d88..bfc406d 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -62,6 +62,11 @@ #define RBIO_CACHE_SIZE 1024 +enum btrfs_rbio_ops { + BTRFS_RBIO_WRITE= 0, + BTRFS_RBIO_READ_REBUILD = 1, +}; + struct btrfs_raid_bio { struct btrfs_fs_info *fs_info; struct btrfs_bio *bbio; @@ -124,7 +129,7 @@ struct btrfs_raid_bio { * differently from a parity rebuild as part of * rmw */ - int read_rebuild; + enum btrfs_rbio_ops operation; /* first bad stripe */ int faila; @@ -147,7 +152,6 @@ struct btrfs_raid_bio { atomic_t refs; - atomic_t stripes_pending; atomic_t error; @@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last, return 0; /* reads can't merge with writes */ - if (last-read_rebuild != - cur-read_rebuild) { + if (last-operation != cur-operation) { return 0; } @@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) spin_unlock(rbio-bio_list_lock); spin_unlock_irqrestore(h-lock, flags); - if (next-read_rebuild) + if (next-operation == BTRFS_RBIO_READ_REBUILD) async_read_rebuild(next); - else { + else if (next-operation == BTRFS_RBIO_WRITE){ steal_rbio(rbio, next); async_rmw_stripe(next); } @@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; + rbio-operation = BTRFS_RBIO_WRITE; /* * don't plug on full rbios, just get them out the door @@ -1761,7 +1765,7 @@ static 
void __raid_recover_end_io(struct btrfs_raid_bio *rbio) faila = rbio-faila; failb = rbio-failb; - if (rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) { spin_lock_irq(rbio-bio_list_lock); set_bit(RBIO_RMW_LOCKED_BIT, rbio-flags); spin_unlock_irq(rbio-bio_list_lock); @@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) * if we're rebuilding a read, we have to use * pages from the bio list */ - if (rbio-read_rebuild + if (rbio-operation == BTRFS_RBIO_READ_REBUILD (stripe == faila || stripe == failb)) { page = page_in_rbio(rbio, stripe, pagenr, 0); } else { @@ -1871,7 +1875,7 @@ pstripe: * know they can be trusted. If this was a read reconstruction, * other endio functions will fiddle the uptodate bits */ - if (!rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_WRITE) { for (i = 0; i nr_pages; i++) { if (faila != -1) { page = rbio_stripe_page(rbio, faila, i); @@ -1888,7 +1892,7 @@ pstripe: * if we're rebuilding a read, we have to use * pages from the bio list */ - if (rbio-read_rebuild + if (rbio-operation == BTRFS_RBIO_READ_REBUILD (stripe == faila || stripe == failb)) { page = page_in_rbio(rbio, stripe, pagenr, 0); } else { @@ -1904,7 +1908,7 @@ cleanup: cleanup_io: - if (rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) { if (err == 0) cache_rbio_pages(rbio); else @@ -2042,7 +2046,7 @@ out: return 0; cleanup: - if (rbio-read_rebuild) + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) rbio_orig_end_io(rbio, -EIO, 0); return -EIO; } @@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, if (hold_bbio
[PATCH v3 00/11] Implement device scrub/replace for RAID56
This patchset implements the device scrub/replace function for RAID56. Most of the implementation for the common data is similar to the other RAID types; the difference, and the difficulty, is the parity process. The basic idea is to read and check the data that has a checksum outside the raid56 stripe lock. If the data is right, we lock the raid56 stripe and read out the other data in the same stripe. If no IO error happens, we calculate the parity and check it against the original one: if the original parity is right, the parity scrub passes; otherwise we write out the new one. But if the common data (not the parity) that we read out is wrong, we first try to recover it, and then check and repair the parity. In order to avoid making the code more and more complex, we copied some of the common-data processing code for the parity; the cleanup work is on my TODO list. We have done some tests and the patchset worked well. Of course, more tests are welcome. If you are interested in using or testing it, you can pull the patchset from https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation, which was reported by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug list of the head raid bio.
- Fix a possible deadlock caused by the pending bios in the plug list when the io submitters were going to sleep.
- Fix an undealt use-after-free problem of the source device in the final device replace procedure.
- Modify the code that is used to avoid the rbio merge.
Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style of the raid56.
Thanks
Miao

Miao Xie (8):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56
  Btrfs: fix possible deadlock caused by pending I/O in plug list

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |  20 +-
 fs/btrfs/raid56.c      | 746 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/raid56.h      |  16 +-
 fs/btrfs/scrub.c       | 803 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.c     |  52 +++-
 fs/btrfs/volumes.h     |  14 +-
 6 files changed, 1521 insertions(+), 130 deletions(-)

-- 
1.9.3
[PATCH v3 08/11] Btrfs, replace: write raid56 parity into the replace target device
This function reused the code of parity scrub, and we just write the right parity or corrected parity into the target device before the parity scrub end. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 23 +++ fs/btrfs/scrub.c | 2 +- 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 6f82c1b..cfa449f 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2311,7 +2311,9 @@ static void raid_write_parity_end_io(struct bio *bio, int err) static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check) { + struct btrfs_bio *bbio = rbio-bbio; void *pointers[rbio-real_stripes]; + DECLARE_BITMAP(pbitmap, rbio-stripe_npages); int nr_data = rbio-nr_data; int stripe; int pagenr; @@ -2321,6 +2323,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, struct page *q_page = NULL; struct bio_list bio_list; struct bio *bio; + int is_replace = 0; int ret; bio_list_init(bio_list); @@ -2334,6 +2337,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, BUG(); } + if (bbio-num_tgtdevs bbio-tgtdev_map[rbio-scrubp]) { + is_replace = 1; + bitmap_copy(pbitmap, rbio-dbitmap, rbio-stripe_npages); + } + /* * Because the higher layers(scrubber) are unlikely to * use this area of the disk again soon, so don't cache @@ -2422,6 +2430,21 @@ writeback: goto cleanup; } + if (!is_replace) + goto submit_write; + + for_each_set_bit(pagenr, pbitmap, rbio-stripe_npages) { + struct page *page; + + page = rbio_stripe_page(rbio, rbio-scrubp, pagenr); + ret = rbio_add_io_page(rbio, bio_list, page, + bbio-tgtdev_map[rbio-scrubp], + pagenr, rbio-stripe_len); + if (ret) + goto cleanup; + } + +submit_write: nr_data = bio_list_size(bio_list); if (!nr_data) { /* Every parity is right */ diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 7f95afc..0ae837f 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2714,7 +2714,7 @@ static void 
scrub_parity_check_and_repair(struct scrub_parity *sparity) goto out; length = sparity-logic_end - sparity-logic_start + 1; - ret = btrfs_map_sblock(sctx-dev_root-fs_info, REQ_GET_READ_MIRRORS, + ret = btrfs_map_sblock(sctx-dev_root-fs_info, WRITE, sparity-logic_start, length, bbio, 0, raid_map); if (ret || !bbio || !raid_map) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 01/11] Btrfs: remove unused bbio_ret in __btrfs_map_block in condition
From: Zhao Lei zhao...@cn.fujitsu.com

bbio_ret in this condition is always !NULL because the previous code already has a check-and-skip:

4908  if (!bbio_ret)
4909      goto out;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			 BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
 
-		if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1) &&
-		    raid_map_ret) {
+		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
 			int i, rot;
 
 			/* push stripe_nr back to the start of the full stripe */
-- 
1.9.3
[PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value is set again in a later line:
stripe_index = 0;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			/* push stripe_nr back to the start of the full stripe */
 			stripe_nr = raid56_full_stripe_start;
-			do_div(stripe_nr, stripe_len);
-
-			stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+			do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
 			/* RAID[56] write or recovery. Return all stripes */
 			num_stripes = map->num_stripes;
-- 
1.9.3
[PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is better that we don't change any member of bbio and don't free it at the end of the IO request. So we introduce the same members into the raid bio, and stop accessing those members of bbio.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/raid56.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT	3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
 	atomic_t refs;
+
+	atomic_t stripes_pending;
+
+	atomic_t error;
 	/*
 	 * these are two arrays of pointers. We allocate the
 	 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
 	err = 0;
 
 	/* OK, we have read all the stripes we need to. */
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		err = -EIO;
 
 	rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->faila = -1;
 	rbio->failb = -1;
 	atomic_set(&rbio->refs, 1);
+	atomic_set(&rbio->error, 0);
+	atomic_set(&rbio->stripes_pending, 0);
 
 	/*
 	 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 	spin_unlock_irq(&rbio->bio_list_lock);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 
 	/*
 	 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 	}
 
-	atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-	BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
 	while (1) {
 		bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
 	if (rbio->faila == -1) {
 		/* first failure on this rbio */
 		rbio->faila = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else if (rbio->failb == -1) {
 		/* second failure on this rbio */
 		rbio->failb = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else {
 		ret = -EIO;
 	}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
 	err = 0;
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		goto cleanup;
 
 	/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
 	index_rbio_pages(rbio);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 	/*
 	 * build a list of bios to read all the missing parts of this
 	 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 	 * the bbio may be freed once we submit the last bio. Make sure
 	 * not to touch it after that
 	 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
 	while (1) {
 		bio = bio_list_pop(&bio_list);
 		if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
 		set_bio_pages_uptodate(bio);
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
-	if (atomic_read
[PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksums in the same stripe.
  All the data which has a checksum is COW data, and we are sure that it is not changed even though we don't lock the stripe, because the space of that data can only be reclaimed after the current transaction is committed (only then can the fs use it to store other data). Since we hold the current transaction while doing scrub, that data cannot be reclaimed, so it is safe to read and check it outside the stripe lock.
- Lock the stripe.
- Read out all the data without checksums, and the parity.
  The data without checksums and the parity may be changed if we don't lock the stripe, so we need to read them in the stripe lock context.
- Check the parity.
- Re-calculate the new parity and write it back if the old parity is not right.
- Unlock the stripe.

If we cannot read out the data, or the data we read is corrupted, we will try to repair it. If the repair fails, we will mark the horizontal sub-stripe (the pages on the same horizontal) as a corrupted sub-stripe, and we will skip the parity check and repair of that horizontal sub-stripe.

In order to skip the horizontal sub-stripes that have no data, we introduce a bitmap: if there is some data on a horizontal sub-stripe, we set its bit to 1, and when we check and repair the parity, we skip those horizontal sub-stripes whose bits are 0.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug list of the head raid bio.
- Modify the code that is used to avoid the rbio merge.
Changelog v1 -> v2:
- None.
--- fs/btrfs/raid56.c | 514 - fs/btrfs/raid56.h | 12 ++ fs/btrfs/scrub.c | 609 -- 3 files changed, 1115 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index bfc406d..3b99cbc 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -65,6 +65,7 @@ enum btrfs_rbio_ops { BTRFS_RBIO_WRITE= 0, BTRFS_RBIO_READ_REBUILD = 1, + BTRFS_RBIO_PARITY_SCRUB = 2, }; struct btrfs_raid_bio { @@ -123,6 +124,7 @@ struct btrfs_raid_bio { /* number of data stripes (no p/q) */ int nr_data; + int stripe_npages; /* * set if we're doing a parity rebuild * for a read from higher up, which is handled @@ -137,6 +139,7 @@ struct btrfs_raid_bio { /* second bad stripe (for raid6 use) */ int failb; + int scrubp; /* * number of pages needed to represent the full * stripe @@ -171,6 +174,11 @@ struct btrfs_raid_bio { * here for faster lookup */ struct page **bio_pages; + + /* +* bitmap to record which horizontal stripe has data +*/ + unsigned long *dbitmap; }; static int __raid56_parity_recover(struct btrfs_raid_bio *rbio); @@ -185,6 +193,10 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio); static void index_rbio_pages(struct btrfs_raid_bio *rbio); static int alloc_rbio_pages(struct btrfs_raid_bio *rbio); +static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, +int need_check); +static void async_scrub_parity(struct btrfs_raid_bio *rbio); + /* * the stripe hash table is used for locking, and to collect * bios in hopes of making a full stripe @@ -586,10 +598,20 @@ static int rbio_can_merge(struct btrfs_raid_bio *last, cur-raid_map[0]) return 0; - /* reads can't merge with writes */ - if (last-operation != cur-operation) { + /* we can't merge with different operations */ + if (last-operation != cur-operation) + return 0; + /* +* We've need read the full stripe from the drive. +* check and repair the parity and write the new results. 
+* +* We're not allowed to add any new bios to the +* bio list here, anyone else that wants to +* change this stripe needs to do their own rmw. +*/ + if (last-operation == BTRFS_RBIO_PARITY_SCRUB || + cur-operation == BTRFS_RBIO_PARITY_SCRUB) return 0; - } return 1; } @@ -782,9 +804,12 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) if (next-operation == BTRFS_RBIO_READ_REBUILD) async_read_rebuild(next); - else if (next-operation == BTRFS_RBIO_WRITE){ + else if (next-operation == BTRFS_RBIO_WRITE) { steal_rbio(rbio, next); async_rmw_stripe(next); + } else if (next-operation == BTRFS_RBIO_PARITY_SCRUB
[PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v3:
- None.
---
 fs/btrfs/dev-replace.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 894796a..9f6a464 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 	struct btrfs_device *tgt_device = NULL;
 	struct btrfs_device *src_device = NULL;
 
-	if (btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-		return -EOPNOTSUPP;
-	}
-
 	switch (args->start.cont_reading_from_srcdev_mode) {
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3
[PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
The increase/decrease of the bio counter is on the I/O path, so we should use io_schedule() instead of schedule(); otherwise a deadlock might be triggered by the pending I/O in the plug list. io_schedule() can help us because it flushes all the pending I/O before the task goes to sleep.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 -> v3:
- New patch to fix a possible deadlock caused by the pending bios in the plug list when the io submitters were going to sleep.
Changelog v1 -> v2:
- None.
---
 fs/btrfs/dev-replace.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fa27b4e..894796a 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,16 +928,23 @@ void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 	wake_up(&fs_info->replace_wait);
 }
 
+#define btrfs_wait_event_io(wq, condition)				\
+do {									\
+	if (condition)							\
+		break;							\
+	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
+			    io_schedule());				\
+} while (0)
+
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 {
-	DEFINE_WAIT(wait);
 again:
 	percpu_counter_inc(&fs_info->bio_counter);
 	if (test_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state)) {
 		btrfs_bio_counter_dec(fs_info);
-		wait_event(fs_info->replace_wait,
-			   !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
-				     &fs_info->fs_state));
+		btrfs_wait_event_io(fs_info->replace_wait,
+				    !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
+					      &fs_info->fs_state));
 		goto again;
 	}
-- 
1.9.3
[PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device
The implementation is simple: - In order to avoid changing the code logic of btrfs_map_bio and RAID56, we add the stripes of the replace target devices at the end of the stripe array in btrfs bio, and we sort those target device stripes in the array. And we keep the number of the target device stripes in the btrfs bio. - Except write operation on RAID56, all the other operation don't take the target device stripes into account. - When we do write operation, we read the data from the common devices and calculate the parity. Then write the dirty data and new parity out, at this time, we will find the relative replace target stripes and wirte the relative data into it. Note: The function that copying old data on the source device to the target device was implemented in the past, it is similar to the other RAID type. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 104 + fs/btrfs/volumes.c | 26 -- fs/btrfs/volumes.h | 10 -- 3 files changed, 97 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 3b99cbc..6f82c1b 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -124,6 +124,8 @@ struct btrfs_raid_bio { /* number of data stripes (no p/q) */ int nr_data; + int real_stripes; + int stripe_npages; /* * set if we're doing a parity rebuild @@ -631,7 +633,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index) */ static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index) { - if (rbio-nr_data + 1 == rbio-bbio-num_stripes) + if (rbio-nr_data + 1 == rbio-real_stripes) return NULL; index += ((rbio-nr_data + 1) * rbio-stripe_len) @@ -974,7 +976,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, { struct btrfs_raid_bio *rbio; int nr_data = 0; - int num_pages = rbio_nr_pages(stripe_len, bbio-num_stripes); + int real_stripes = bbio-num_stripes - bbio-num_tgtdevs; + int num_pages = rbio_nr_pages(stripe_len, real_stripes); int 
stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE); void *p; @@ -994,6 +997,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-fs_info = root-fs_info; rbio-stripe_len = stripe_len; rbio-nr_pages = num_pages; + rbio-real_stripes = real_stripes; rbio-stripe_npages = stripe_npages; rbio-faila = -1; rbio-failb = -1; @@ -1010,10 +1014,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-bio_pages = p + sizeof(struct page *) * num_pages; rbio-dbitmap = p + sizeof(struct page *) * num_pages * 2; - if (raid_map[bbio-num_stripes - 1] == RAID6_Q_STRIPE) - nr_data = bbio-num_stripes - 2; + if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE) + nr_data = real_stripes - 2; else - nr_data = bbio-num_stripes - 1; + nr_data = real_stripes - 1; rbio-nr_data = nr_data; return rbio; @@ -1125,7 +1129,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio, static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio) { if (rbio-faila = 0 || rbio-failb = 0) { - BUG_ON(rbio-faila == rbio-bbio-num_stripes - 1); + BUG_ON(rbio-faila == rbio-real_stripes - 1); __raid56_parity_recover(rbio); } else { finish_rmw(rbio); @@ -1186,7 +1190,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio) static noinline void finish_rmw(struct btrfs_raid_bio *rbio) { struct btrfs_bio *bbio = rbio-bbio; - void *pointers[bbio-num_stripes]; + void *pointers[rbio-real_stripes]; int stripe_len = rbio-stripe_len; int nr_data = rbio-nr_data; int stripe; @@ -1200,11 +1204,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) bio_list_init(bio_list); - if (bbio-num_stripes - rbio-nr_data == 1) { - p_stripe = bbio-num_stripes - 1; - } else if (bbio-num_stripes - rbio-nr_data == 2) { - p_stripe = bbio-num_stripes - 2; - q_stripe = bbio-num_stripes - 1; + if (rbio-real_stripes - rbio-nr_data == 1) { + p_stripe = rbio-real_stripes - 1; + } else if (rbio-real_stripes - rbio-nr_data == 2) { + p_stripe = rbio-real_stripes - 2; + q_stripe = 
rbio-real_stripes - 1; } else { BUG(); } @@ -1261,7 +1265,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) SetPageUptodate(p); pointers[stripe++] = kmap(p); - raid6_call.gen_syndrome(bbio-num_stripes, PAGE_SIZE
[PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the implementation is similar to the scrub on the other RAID such as RAID1, the differentia is that we don't read the data from the mirror, we use the data repair function of RAID5/6. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v3: - None. --- fs/btrfs/raid56.c | 42 +--- fs/btrfs/raid56.h | 2 +- fs/btrfs/scrub.c | 194 - fs/btrfs/volumes.c | 16 - fs/btrfs/volumes.h | 4 ++ 5 files changed, 226 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index c54b0e6..6013d88 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -58,6 +58,8 @@ */ #define RBIO_CACHE_READY_BIT 3 +#define RBIO_HOLD_BBIO_MAP_BIT 4 + #define RBIO_CACHE_SIZE 1024 struct btrfs_raid_bio { @@ -799,6 +801,21 @@ done_nolock: remove_rbio_from_cache(rbio); } +static inline void +__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need) +{ + if (need) { + kfree(raid_map); + kfree(bbio); + } +} + +static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio) +{ + __free_bbio_and_raid_map(rbio-bbio, rbio-raid_map, + !test_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags)); +} + static void __free_raid_bio(struct btrfs_raid_bio *rbio) { int i; @@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio) rbio-stripe_pages[i] = NULL; } } - kfree(rbio-raid_map); - kfree(rbio-bbio); + + free_bbio_and_raid_map(rbio); + kfree(rbio); } @@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2, GFP_NOFS); - if (!rbio) { - kfree(raid_map); - kfree(bbio); + if (!rbio) return ERR_PTR(-ENOMEM); - } bio_list_init(rbio-bio_list); INIT_LIST_HEAD(rbio-plug_list); @@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct blk_plug_cb *cb; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + 
__free_bbio_and_raid_map(bbio, raid_map, 1); return PTR_ERR(rbio); + } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2038,15 +2055,19 @@ cleanup: */ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, - u64 stripe_len, int mirror_num) + u64 stripe_len, int mirror_num, int hold_bbio) { struct btrfs_raid_bio *rbio; int ret; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); return PTR_ERR(rbio); + } + if (hold_bbio) + set_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags); rbio-read_rebuild = 1; bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, rbio-faila = find_logical_bio_stripe(rbio, bio); if (rbio-faila == -1) { BUG(); - kfree(raid_map); - kfree(bbio); + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); kfree(rbio); return -EIO; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index ea5d73b..b310e8c 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map) int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, -u64 stripe_len, int mirror_num); +u64 stripe_len, int mirror_num, int hold_bbio); int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, u64 stripe_len); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index efa0831..ca4b9eb 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -63,6 +63,13 @@ struct scrub_ctx; */ #define SCRUB_MAX_PAGES_PER_BLOCK 16 /* 64k per node/leaf/sector */ +struct scrub_recover { + atomic_trefs; + struct btrfs_bio*bbio; + u64 *raid_map; + u64 map_length; +}; + struct scrub_page { struct scrub_block *sblock
Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list
On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
> On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
>> On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
>>> The increase/decrease of bio counter is on the I/O path, so we should use io_schedule() instead of schedule(), or the deadlock might be triggered by the pending I/O in the plug list. io_schedule() can help us because it will flush all the pending I/O before the task is going to sleep.
>>
>> Can you please describe this deadlock in more detail? schedule() also triggers a flush of the plug list, and if that's no longer sufficient we can run into other problems (especially with preemption on).
>
> Sorry for my miss. I forgot to check the current implementation of schedule(), which flushes the plug list unconditionally. Please ignore this patch.
>
> I have updated my raid56-scrub-replace branch, please re-pull the branch.
> https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
>
> Thanks
> Miao

Thanks Miao

-chris
[PATCH v2 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the implementation is similar to the scrub on the other RAID such as RAID1, the differentia is that we don't read the data from the mirror, we use the data repair function of RAID5/6. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v2: - Remove the redundant prefix underscores of the function names to make them obey the common pattern of the source in raid56.c --- fs/btrfs/raid56.c | 42 +--- fs/btrfs/raid56.h | 2 +- fs/btrfs/scrub.c | 194 - fs/btrfs/volumes.c | 16 - fs/btrfs/volumes.h | 4 ++ 5 files changed, 226 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index c54b0e6..6013d88 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -58,6 +58,8 @@ */ #define RBIO_CACHE_READY_BIT 3 +#define RBIO_HOLD_BBIO_MAP_BIT 4 + #define RBIO_CACHE_SIZE 1024 struct btrfs_raid_bio { @@ -799,6 +801,21 @@ done_nolock: remove_rbio_from_cache(rbio); } +static inline void +__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need) +{ + if (need) { + kfree(raid_map); + kfree(bbio); + } +} + +static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio) +{ + __free_bbio_and_raid_map(rbio-bbio, rbio-raid_map, + !test_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags)); +} + static void __free_raid_bio(struct btrfs_raid_bio *rbio) { int i; @@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio) rbio-stripe_pages[i] = NULL; } } - kfree(rbio-raid_map); - kfree(rbio-bbio); + + free_bbio_and_raid_map(rbio); + kfree(rbio); } @@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2, GFP_NOFS); - if (!rbio) { - kfree(raid_map); - kfree(bbio); + if (!rbio) return ERR_PTR(-ENOMEM); - } bio_list_init(rbio-bio_list); INIT_LIST_HEAD(rbio-plug_list); @@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct 
blk_plug_cb *cb; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + __free_bbio_and_raid_map(bbio, raid_map, 1); return PTR_ERR(rbio); + } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2038,15 +2055,19 @@ cleanup: */ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, - u64 stripe_len, int mirror_num) + u64 stripe_len, int mirror_num, int hold_bbio) { struct btrfs_raid_bio *rbio; int ret; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); return PTR_ERR(rbio); + } + if (hold_bbio) + set_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags); rbio-read_rebuild = 1; bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, rbio-faila = find_logical_bio_stripe(rbio, bio); if (rbio-faila == -1) { BUG(); - kfree(raid_map); - kfree(bbio); + __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); kfree(rbio); return -EIO; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index ea5d73b..b310e8c 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map) int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, -u64 stripe_len, int mirror_num); +u64 stripe_len, int mirror_num, int hold_bbio); int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, u64 stripe_len); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index efa0831..ca4b9eb 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -63,6 +63,13 @@ struct scrub_ctx; */ #define SCRUB_MAX_PAGES_PER_BLOCK 16 /* 64k per node/leaf/sector */ +struct scrub_recover { + atomic_trefs; + struct btrfs_bio*bbio; + u64 *raid_map
Re: [PATCH] Btrfs: make sure we wait on logged extents when fsyncing two subvols
On Thu, 6 Nov 2014 10:19:54 -0500, Josef Bacik wrote:
> If we have two fsync()'s race on different subvols, one will do all of its work to get into the log_tree, wait on its outstanding IO, and then allow the log_tree to finish its commit. The problem is we were just free'ing that subvol's logged extents instead of waiting on them, so whoever lost the race wouldn't really have their data on disk. Fix this by waiting properly instead of freeing the logged extents. Thanks,
>
> cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik jba...@fb.com
> ---
>  fs/btrfs/tree-log.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index 2d0fa43..70f99b1 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -2600,9 +2600,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
>  	if (atomic_read(&log_root_tree->log_commit[index2])) {
>  		blk_finish_plug(&plug);
>  		btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
> +		btrfs_wait_logged_extents(log, log_transid);

Why not add this log root into a list of the log root tree, and then let the committer wait for all the ordered extents in each log root that was added into that list? In this way, the committer can do some work while the data of the ordered extents is being transferred to the disk.

Thanks
Miao

>  		wait_log_commit(trans, log_root_tree,
>  				root_log_ctx.log_transid);
> -		btrfs_free_logged_extents(log, log_transid);
>  		mutex_unlock(&log_root_tree->log_mutex);
>  		ret = root_log_ctx.log_ret;
>  		goto out;
[PATCH 0/9] Implement device scrub/replace for RAID56
This patchset implements the device scrub/replace function for RAID56. Most of the implementation for the common data is similar to the other RAID types; the difficult part is processing the parity. In order to avoid the problem that data which can change easily is handled outside the stripe lock, we do most of the work in the RAID56 stripe lock context. And in order to avoid making the code more and more complex, we copy some of the common-data processing code for the parity; the cleanup work is on my TODO list.

We have done some tests, and the patchset worked well. Of course, more tests are welcome. If you are interested in using or testing it, you can pull the patchset from
https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

Miao Xie (6):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs,raid56: use a variant to record the operation type
  Btrfs,raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device

Zhao Lei (3):
  Btrfs: remove unused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |   5 -
 fs/btrfs/raid56.c      | 711 ++++++++++++++++++++++++++++++++++++++++-----
 fs/btrfs/raid56.h      |  14 +-
 fs/btrfs/scrub.c       | 793 ++++++++++++++++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.c     |  47 ++-
 fs/btrfs/volumes.h     |  14 +-
 6 files changed, 1471 insertions(+), 113 deletions(-)

-- 
1.9.3
[PATCH 6/9] Btrfs,raid56: support parity scrub on raid56
The implementation is:
- Read and check all the data with checksums in the same stripe.
  All the data which has a checksum is COW data, and we are sure that it is not changed even though we don't lock the stripe, because the space of that data can only be reclaimed after the current transaction is committed (only then can the fs use it to store other data). Since we hold the current transaction while doing scrub, that data cannot be reclaimed, so it is safe to read and check it outside the stripe lock.
- Lock the stripe.
- Read out all the data without checksums, and the parity.
  The data without checksums and the parity may be changed if we don't lock the stripe, so we need to read them in the stripe lock context.
- Check the parity.
- Re-calculate the new parity and write it back if the old parity is not right.
- Unlock the stripe.

If we cannot read out the data, or the data we read is corrupted, we will try to repair it. If the repair fails, we will mark the horizontal sub-stripe (the pages on the same horizontal) as a corrupted sub-stripe, and we will skip the parity check and repair of that horizontal sub-stripe.

In order to skip the horizontal sub-stripes that have no data, we introduce a bitmap: if there is some data on a horizontal sub-stripe, we set its bit to 1, and when we check and repair the parity, we skip those horizontal sub-stripes whose bits are 0.
Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/raid56.c | 500 - fs/btrfs/raid56.h | 12 ++ fs/btrfs/scrub.c | 599 +- 3 files changed, 1099 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index d550e9b..a13eb1b 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -65,6 +65,7 @@ enum btrfs_rbio_ops { BTRFS_RBIO_WRITE= 0, BTRFS_RBIO_READ_REBUILD = 1, + BTRFS_RBIO_PARITY_SCRUB = 2, }; struct btrfs_raid_bio { @@ -123,6 +124,7 @@ struct btrfs_raid_bio { /* number of data stripes (no p/q) */ int nr_data; + int stripe_npages; /* * set if we're doing a parity rebuild * for a read from higher up, which is handled @@ -137,6 +139,7 @@ struct btrfs_raid_bio { /* second bad stripe (for raid6 use) */ int failb; + int scrubp; /* * number of pages needed to represent the full * stripe @@ -171,6 +174,11 @@ struct btrfs_raid_bio { * here for faster lookup */ struct page **bio_pages; + + /* +* bitmap to record which horizontal stripe has data +*/ + unsigned long *dbitmap; }; static int __raid56_parity_recover(struct btrfs_raid_bio *rbio); @@ -185,6 +193,8 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio); static void index_rbio_pages(struct btrfs_raid_bio *rbio); static int alloc_rbio_pages(struct btrfs_raid_bio *rbio); +static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, +int need_check); /* * the stripe hash table is used for locking, and to collect * bios in hopes of making a full stripe @@ -950,9 +960,11 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, struct btrfs_raid_bio *rbio; int nr_data = 0; int num_pages = rbio_nr_pages(stripe_len, bbio-num_stripes); + int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE); void *p; - rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2, + rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2 + + DIV_ROUND_UP(stripe_npages, BITS_PER_LONG / 8), GFP_NOFS); if (!rbio) return ERR_PTR(-ENOMEM); @@ -967,6 +979,7 @@ 
static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-fs_info = root-fs_info; rbio-stripe_len = stripe_len; rbio-nr_pages = num_pages; + rbio-stripe_npages = stripe_npages; rbio-faila = -1; rbio-failb = -1; atomic_set(rbio-refs, 1); @@ -980,6 +993,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, p = rbio + 1; rbio-stripe_pages = p; rbio-bio_pages = p + sizeof(struct page *) * num_pages; + rbio-dbitmap = p + sizeof(struct page *) * num_pages * 2; if (raid_map[bbio-num_stripes - 1] == RAID6_Q_STRIPE) nr_data = bbio-num_stripes - 2; @@ -1774,6 +1788,14 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) index_rbio_pages(rbio); for (pagenr = 0; pagenr nr_pages; pagenr++) { + /* +* Now we just use bitmap to mark the horizontal stripes in +* which we have data when doing parity scrub. +*/ + if (rbio-operation == BTRFS_RBIO_PARITY_SCRUB
[PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
From: Zhao Lei zhao...@cn.fujitsu.com

bbio_ret in this condition is always non-NULL, because the earlier code already has a check-and-skip:
4908 if (!bbio_ret)
4909     goto out;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		     BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;

-		if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1) &&
-		    raid_map_ret) {
+		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
 			int i, rot;

 			/* push stripe_nr back to the start of the full stripe */
--
1.9.3
[PATCH 5/9] Btrfs,raid56: use a variant to record the operation type
We will introduce new operation type later, if we still use integer variant as bool variant to record the operation type, we would add new variant and increase the size of raid bio structure. It is not good, by this patch, we define different number for different operation, and we can just use a variant to record the operation type. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/raid56.c | 30 +- 1 file changed, 17 insertions(+), 13 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index b3e9c76..d550e9b 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -62,6 +62,11 @@ #define RBIO_CACHE_SIZE 1024 +enum btrfs_rbio_ops { + BTRFS_RBIO_WRITE= 0, + BTRFS_RBIO_READ_REBUILD = 1, +}; + struct btrfs_raid_bio { struct btrfs_fs_info *fs_info; struct btrfs_bio *bbio; @@ -124,7 +129,7 @@ struct btrfs_raid_bio { * differently from a parity rebuild as part of * rmw */ - int read_rebuild; + enum btrfs_rbio_ops operation; /* first bad stripe */ int faila; @@ -147,7 +152,6 @@ struct btrfs_raid_bio { atomic_t refs; - atomic_t stripes_pending; atomic_t error; @@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last, return 0; /* reads can't merge with writes */ - if (last-read_rebuild != - cur-read_rebuild) { + if (last-operation != cur-operation) { return 0; } @@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio) spin_unlock(rbio-bio_list_lock); spin_unlock_irqrestore(h-lock, flags); - if (next-read_rebuild) + if (next-operation == BTRFS_RBIO_READ_REBUILD) async_read_rebuild(next); - else { + else if (next-operation == BTRFS_RBIO_WRITE){ steal_rbio(rbio, next); async_rmw_stripe(next); } @@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; + rbio-operation = BTRFS_RBIO_WRITE; /* * don't plug on full rbios, just get them out the door @@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct 
btrfs_raid_bio *rbio) faila = rbio-faila; failb = rbio-failb; - if (rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) { spin_lock_irq(rbio-bio_list_lock); set_bit(RBIO_RMW_LOCKED_BIT, rbio-flags); spin_unlock_irq(rbio-bio_list_lock); @@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio) * if we're rebuilding a read, we have to use * pages from the bio list */ - if (rbio-read_rebuild + if (rbio-operation == BTRFS_RBIO_READ_REBUILD (stripe == faila || stripe == failb)) { page = page_in_rbio(rbio, stripe, pagenr, 0); } else { @@ -1871,7 +1875,7 @@ pstripe: * know they can be trusted. If this was a read reconstruction, * other endio functions will fiddle the uptodate bits */ - if (!rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_WRITE) { for (i = 0; i nr_pages; i++) { if (faila != -1) { page = rbio_stripe_page(rbio, faila, i); @@ -1888,7 +1892,7 @@ pstripe: * if we're rebuilding a read, we have to use * pages from the bio list */ - if (rbio-read_rebuild + if (rbio-operation == BTRFS_RBIO_READ_REBUILD (stripe == faila || stripe == failb)) { page = page_in_rbio(rbio, stripe, pagenr, 0); } else { @@ -1904,7 +1908,7 @@ cleanup: cleanup_io: - if (rbio-read_rebuild) { + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) { if (err == 0) cache_rbio_pages(rbio); else @@ -2042,7 +2046,7 @@ out: return 0; cleanup: - if (rbio-read_rebuild) + if (rbio-operation == BTRFS_RBIO_READ_REBUILD) rbio_orig_end_io(rbio, -EIO, 0); return -EIO; } @@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, if (hold_bbio) set_bit(RBIO_HOLD_BBIO_MAP_BIT
[PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
This patch implement the RAID5/6 common data repair function, the implementation is similar to the scrub on the other RAID such as RAID1, the differentia is that we don't read the data from the mirror, we use the data repair function of RAID5/6. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/raid56.c | 42 +--- fs/btrfs/raid56.h | 2 +- fs/btrfs/scrub.c | 194 - fs/btrfs/volumes.c | 16 - fs/btrfs/volumes.h | 4 ++ 5 files changed, 226 insertions(+), 32 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index c54b0e6..b3e9c76 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -58,6 +58,8 @@ */ #define RBIO_CACHE_READY_BIT 3 +#define RBIO_HOLD_BBIO_MAP_BIT 4 + #define RBIO_CACHE_SIZE 1024 struct btrfs_raid_bio { @@ -799,6 +801,21 @@ done_nolock: remove_rbio_from_cache(rbio); } +static inline void +___free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need) +{ + if (need) { + kfree(raid_map); + kfree(bbio); + } +} + +static inline void __free_bbio_and_raid_map(struct btrfs_raid_bio *rbio) +{ + ___free_bbio_and_raid_map(rbio-bbio, rbio-raid_map, + !test_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags)); +} + static void __free_raid_bio(struct btrfs_raid_bio *rbio) { int i; @@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio) rbio-stripe_pages[i] = NULL; } } - kfree(rbio-raid_map); - kfree(rbio-bbio); + + __free_bbio_and_raid_map(rbio); + kfree(rbio); } @@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2, GFP_NOFS); - if (!rbio) { - kfree(raid_map); - kfree(bbio); + if (!rbio) return ERR_PTR(-ENOMEM); - } bio_list_init(rbio-bio_list); INIT_LIST_HEAD(rbio-plug_list); @@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct blk_plug_cb *cb; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + ___free_bbio_and_raid_map(bbio, 
raid_map, 1); return PTR_ERR(rbio); + } bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2038,15 +2055,19 @@ cleanup: */ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, - u64 stripe_len, int mirror_num) + u64 stripe_len, int mirror_num, int hold_bbio) { struct btrfs_raid_bio *rbio; int ret; rbio = alloc_rbio(root, bbio, raid_map, stripe_len); - if (IS_ERR(rbio)) + if (IS_ERR(rbio)) { + ___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); return PTR_ERR(rbio); + } + if (hold_bbio) + set_bit(RBIO_HOLD_BBIO_MAP_BIT, rbio-flags); rbio-read_rebuild = 1; bio_list_add(rbio-bio_list, bio); rbio-bio_list_bytes = bio-bi_iter.bi_size; @@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, rbio-faila = find_logical_bio_stripe(rbio, bio); if (rbio-faila == -1) { BUG(); - kfree(raid_map); - kfree(bbio); + ___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio); kfree(rbio); return -EIO; } diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h index ea5d73b..b310e8c 100644 --- a/fs/btrfs/raid56.h +++ b/fs/btrfs/raid56.h @@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map) int raid56_parity_recover(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, -u64 stripe_len, int mirror_num); +u64 stripe_len, int mirror_num, int hold_bbio); int raid56_parity_write(struct btrfs_root *root, struct bio *bio, struct btrfs_bio *bbio, u64 *raid_map, u64 stripe_len); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index efa0831..ca4b9eb 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -63,6 +63,13 @@ struct scrub_ctx; */ #define SCRUB_MAX_PAGES_PER_BLOCK 16 /* 64k per node/leaf/sector */ +struct scrub_recover { + atomic_trefs; + struct btrfs_bio*bbio; + u64 *raid_map; + u64 map_length; +}; + struct scrub_page { struct scrub_block *sblock; struct page
[PATCH 9/9] Btrfs, replace: enable dev-replace for raid56
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/dev-replace.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6f662b3..6aa835c 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 	struct btrfs_device *tgt_device = NULL;
 	struct btrfs_device *src_device = NULL;

-	if (btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-		return -EOPNOTSUPP;
-	}
-
 	switch (args->start.cont_reading_from_srcdev_mode) {
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
 	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
--
1.9.3
[PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value is set again in a later line:
	stripe_index = 0;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			/* push stripe_nr back to the start of the full stripe */
 			stripe_nr = raid56_full_stripe_start;
-			do_div(stripe_nr, stripe_len);
-
-			stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+			do_div(stripe_nr, stripe_len * nr_data_stripes(map));

 			/* RAID[56] write or recovery. Return all stripes */
 			num_stripes = map->num_stripes;
--
1.9.3
[PATCH 8/9] Btrfs, replace: write raid56 parity into the replace target device
This function reused the code of parity scrub, and we just write the right parity or corrected parity into the target device before the parity scrub end. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/raid56.c | 23 +++ fs/btrfs/scrub.c | 2 +- 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 7ad9546a..b69c01f 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -2305,7 +2305,9 @@ static void raid_write_parity_end_io(struct bio *bio, int err) static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check) { + struct btrfs_bio *bbio = rbio-bbio; void *pointers[rbio-real_stripes]; + DECLARE_BITMAP(pbitmap, rbio-stripe_npages); int nr_data = rbio-nr_data; int stripe; int pagenr; @@ -2315,6 +2317,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, struct page *q_page = NULL; struct bio_list bio_list; struct bio *bio; + int is_replace = 0; int ret; bio_list_init(bio_list); @@ -2328,6 +2331,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio, BUG(); } + if (bbio-num_tgtdevs bbio-tgtdev_map[rbio-scrubp]) { + is_replace = 1; + bitmap_copy(pbitmap, rbio-dbitmap, rbio-stripe_npages); + } + /* * Because the higher layers(scrubber) are unlikely to * use this area of the disk again soon, so don't cache @@ -2416,6 +2424,21 @@ writeback: goto cleanup; } + if (!is_replace) + goto submit_write; + + for_each_set_bit(pagenr, pbitmap, rbio-stripe_npages) { + struct page *page; + + page = rbio_stripe_page(rbio, rbio-scrubp, pagenr); + ret = rbio_add_io_page(rbio, bio_list, page, + bbio-tgtdev_map[rbio-scrubp], + pagenr, rbio-stripe_len); + if (ret) + goto cleanup; + } + +submit_write: nr_data = bio_list_size(bio_list); if (!nr_data) { /* Every parity is right */ diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 3ef1e1b..f690c8f 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2710,7 +2710,7 @@ static void 
scrub_parity_check_and_repair(struct scrub_parity *sparity) goto out; length = sparity-logic_end - sparity-logic_start + 1; - ret = btrfs_map_sblock(sctx-dev_root-fs_info, REQ_GET_READ_MIRRORS, + ret = btrfs_map_sblock(sctx-dev_root-fs_info, WRITE, sparity-logic_start, length, bbio, 0, raid_map); if (ret || !bbio || !raid_map) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/9] Btrfs, replace: write dirty pages into the replace target device
The implementation is simple: - In order to avoid changing the code logic of btrfs_map_bio and RAID56, we add the stripes of the replace target devices at the end of the stripe array in btrfs bio, and we sort those target device stripes in the array. And we keep the number of the target device stripes in the btrfs bio. - Except write operation on RAID56, all the other operation don't take the target device stripes into account. - When we do write operation, we read the data from the common devices and calculate the parity. Then write the dirty data and new parity out, at this time, we will find the relative replace target stripes and wirte the relative data into it. Note: The function that copying old data on the source device to the target device was implemented in the past, it is similar to the other RAID type. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/raid56.c | 104 + fs/btrfs/volumes.c | 26 -- fs/btrfs/volumes.h | 10 -- 3 files changed, 97 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index a13eb1b..7ad9546a 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -124,6 +124,8 @@ struct btrfs_raid_bio { /* number of data stripes (no p/q) */ int nr_data; + int real_stripes; + int stripe_npages; /* * set if we're doing a parity rebuild @@ -619,7 +621,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index) */ static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index) { - if (rbio-nr_data + 1 == rbio-bbio-num_stripes) + if (rbio-nr_data + 1 == rbio-real_stripes) return NULL; index += ((rbio-nr_data + 1) * rbio-stripe_len) @@ -959,7 +961,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, { struct btrfs_raid_bio *rbio; int nr_data = 0; - int num_pages = rbio_nr_pages(stripe_len, bbio-num_stripes); + int real_stripes = bbio-num_stripes - bbio-num_tgtdevs; + int num_pages = rbio_nr_pages(stripe_len, real_stripes); int stripe_npages = 
DIV_ROUND_UP(stripe_len, PAGE_SIZE); void *p; @@ -979,6 +982,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-fs_info = root-fs_info; rbio-stripe_len = stripe_len; rbio-nr_pages = num_pages; + rbio-real_stripes = real_stripes; rbio-stripe_npages = stripe_npages; rbio-faila = -1; rbio-failb = -1; @@ -995,10 +999,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-bio_pages = p + sizeof(struct page *) * num_pages; rbio-dbitmap = p + sizeof(struct page *) * num_pages * 2; - if (raid_map[bbio-num_stripes - 1] == RAID6_Q_STRIPE) - nr_data = bbio-num_stripes - 2; + if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE) + nr_data = real_stripes - 2; else - nr_data = bbio-num_stripes - 1; + nr_data = real_stripes - 1; rbio-nr_data = nr_data; return rbio; @@ -1110,7 +1114,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio, static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio) { if (rbio-faila = 0 || rbio-failb = 0) { - BUG_ON(rbio-faila == rbio-bbio-num_stripes - 1); + BUG_ON(rbio-faila == rbio-real_stripes - 1); __raid56_parity_recover(rbio); } else { finish_rmw(rbio); @@ -1171,7 +1175,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio) static noinline void finish_rmw(struct btrfs_raid_bio *rbio) { struct btrfs_bio *bbio = rbio-bbio; - void *pointers[bbio-num_stripes]; + void *pointers[rbio-real_stripes]; int stripe_len = rbio-stripe_len; int nr_data = rbio-nr_data; int stripe; @@ -1185,11 +1189,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) bio_list_init(bio_list); - if (bbio-num_stripes - rbio-nr_data == 1) { - p_stripe = bbio-num_stripes - 1; - } else if (bbio-num_stripes - rbio-nr_data == 2) { - p_stripe = bbio-num_stripes - 2; - q_stripe = bbio-num_stripes - 1; + if (rbio-real_stripes - rbio-nr_data == 1) { + p_stripe = rbio-real_stripes - 1; + } else if (rbio-real_stripes - rbio-nr_data == 2) { + p_stripe = rbio-real_stripes - 2; + q_stripe = rbio-real_stripes - 1; } else 
{ BUG(); } @@ -1246,7 +1250,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) SetPageUptodate(p); pointers[stripe++] = kmap(p); - raid6_call.gen_syndrome(bbio-num_stripes, PAGE_SIZE, + raid6_call.gen_syndrome(rbio
[PATCH 3/9] Btrfs, raid56: don't change bbio and raid_map
Because we will reuse bbio and raid_map during the scrub later, it is better that we don't change any variant of bbio and don't free it at the end of IO request. So we introduced similar variants into the raid bio, and don't access those bbio's variants any more. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/raid56.c | 42 +++--- 1 file changed, 23 insertions(+), 19 deletions(-) diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index 6a41631..c54b0e6 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -58,7 +58,6 @@ */ #define RBIO_CACHE_READY_BIT 3 - #define RBIO_CACHE_SIZE 1024 struct btrfs_raid_bio { @@ -146,6 +145,10 @@ struct btrfs_raid_bio { atomic_t refs; + + atomic_t stripes_pending; + + atomic_t error; /* * these are two arrays of pointers. We allocate the * rbio big enough to hold them both and setup their @@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err) bio_put(bio); - if (!atomic_dec_and_test(rbio-bbio-stripes_pending)) + if (!atomic_dec_and_test(rbio-stripes_pending)) return; err = 0; /* OK, we have read all the stripes we need to. 
*/ - if (atomic_read(rbio-bbio-error) rbio-bbio-max_errors) + if (atomic_read(rbio-error) rbio-bbio-max_errors) err = -EIO; rbio_orig_end_io(rbio, err, 0); @@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root, rbio-faila = -1; rbio-failb = -1; atomic_set(rbio-refs, 1); + atomic_set(rbio-error, 0); + atomic_set(rbio-stripes_pending, 0); /* * the stripe_pages and bio_pages array point to the extra @@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) set_bit(RBIO_RMW_LOCKED_BIT, rbio-flags); spin_unlock_irq(rbio-bio_list_lock); - atomic_set(rbio-bbio-error, 0); + atomic_set(rbio-error, 0); /* * now that we've set rmw_locked, run through the @@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio) } } - atomic_set(bbio-stripes_pending, bio_list_size(bio_list)); - BUG_ON(atomic_read(bbio-stripes_pending) == 0); + atomic_set(rbio-stripes_pending, bio_list_size(bio_list)); + BUG_ON(atomic_read(rbio-stripes_pending) == 0); while (1) { bio = bio_list_pop(bio_list); @@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed) if (rbio-faila == -1) { /* first failure on this rbio */ rbio-faila = failed; - atomic_inc(rbio-bbio-error); + atomic_inc(rbio-error); } else if (rbio-failb == -1) { /* second failure on this rbio */ rbio-failb = failed; - atomic_inc(rbio-bbio-error); + atomic_inc(rbio-error); } else { ret = -EIO; } @@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err) bio_put(bio); - if (!atomic_dec_and_test(rbio-bbio-stripes_pending)) + if (!atomic_dec_and_test(rbio-stripes_pending)) return; err = 0; - if (atomic_read(rbio-bbio-error) rbio-bbio-max_errors) + if (atomic_read(rbio-error) rbio-bbio-max_errors) goto cleanup; /* @@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio) static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) { int bios_to_read = 0; - struct btrfs_bio *bbio = rbio-bbio; 
struct bio_list bio_list; int ret; int nr_pages = DIV_ROUND_UP(rbio-stripe_len, PAGE_CACHE_SIZE); @@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) index_rbio_pages(rbio); - atomic_set(rbio-bbio-error, 0); + atomic_set(rbio-error, 0); /* * build a list of bios to read all the missing parts of this * stripe @@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio) * the bbio may be freed once we submit the last bio. Make sure * not to touch it after that */ - atomic_set(bbio-stripes_pending, bios_to_read); + atomic_set(rbio-stripes_pending, bios_to_read); while (1) { bio = bio_list_pop(bio_list); if (!bio) @@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err) set_bio_pages_uptodate(bio); bio_put(bio); - if (!atomic_dec_and_test(rbio-bbio-stripes_pending)) + if (!atomic_dec_and_test(rbio-stripes_pending)) return; - if (atomic_read(rbio-bbio-error) rbio-bbio-max_errors) + if (atomic_read(rbio-error) rbio-bbio-max_errors
Re: [PATCH] Btrfs: fix incorrect compression ratio detection
On Tue, 7 Oct 2014 18:44:35 -0400, Wang Shilong wrote: Steps to reproduce: # mkfs.btrfs -f /dev/sdb # mount -t btrfs /dev/sdb /mnt -o compress=lzo # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1 after previous steps, inode will be detected as bad compression ratio, and NOCOMPRESS flag will be set for that inode. Reason is that compress have a max limit pages every time(128K), if a 132k write in, it will be splitted into two write(128k+4k), this bug is a leftover for commit 68bb462d42a(Btrfs: don't compress for a small write) Fix this problem by checking every time before compression, if it is a small write(=blocksize), we bail out and fall into nocompression directly. Signed-off-by: Wang Shilong wangshilong1...@gmail.com Looks good. Reviewed-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/inode.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 344a322..b78e90a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -411,14 +411,6 @@ static noinline int compress_file_range(struct inode *inode, (start 0 || end + 1 BTRFS_I(inode)-disk_i_size)) btrfs_add_inode_defrag(NULL, inode); - /* - * skip compression for a small file range(=blocksize) that - * isn't an inline extent, since it dosen't save disk space at all. - */ - if ((end - start + 1) = blocksize - (start 0 || end + 1 BTRFS_I(inode)-disk_i_size)) - goto cleanup_and_bail_uncompressed; - actual_end = min_t(u64, isize, end + 1); again: will_compress = 0; @@ -440,6 +432,14 @@ again: total_compressed = actual_end - start; + /* + * skip compression for a small file range(=blocksize) that + * isn't an inline extent, since it dosen't save disk space at all. + */ + if (total_compressed = blocksize +(start 0 || end + 1 BTRFS_I(inode)-disk_i_size)) + goto cleanup_and_bail_uncompressed; + /* we want to make sure that amount of ram required to uncompress * an extent is reasonable, so we limit the total size in ram * of a compressed extent to 128k. 
This is a crucial number
Re: [PATCH] Btrfs: don't do async reclaim during log replay V2
On Thu, 6 Nov 2014 09:39:19 -0500, Josef Bacik wrote:
On 10/23/2014 04:44 AM, Miao Xie wrote:
On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:
Trying to reproduce a log enospc bug, I hit a panic in the async reclaim code during log replay. This is because we use fs_info->fs_root as our root for shrinking and such. Technically we can use whatever root we want, but let's just not allow async reclaim while we're doing log replay. Thanks,

Why not move the fs_root initialization code to before log replay? I think that is better than the fix in this patch, because the async reclaimer could help us do some work.

Because this is simpler. We could move the initialization forward, but then say somebody comes along and adds some other dependency to the async reclaim stuff in the future, doesn't think about log replay, and suddenly some poor sap's box panics on mount. Log replay is a known quantity; we don't have to worry about enospc, so let's keep it as simple as possible. Thanks,

Yes, you are right, so this patch looks good.

Reviewed-by: Miao Xie mi...@cn.fujitsu.com
Re: [PATCH] Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2
On Mon, 3 Nov 2014 08:56:50 -0500, Josef Bacik wrote: Our gluster boxes get several thousand statfs() calls per second, which begins to suck hardcore with all of the lock contention on the chunk mutex and dev list mutex. We don't really need to hold these things, if we have transient weirdness with statfs() because of the chunk allocator we don't care, so remove this locking. We still need the dev_list lock if you mount with -o alloc_start however, which is a good argument for nuking that thing from orbit, but that's a patch for another day. Thanks, Signed-off-by: Josef Bacik jba...@fb.com --- V1-V2: make sure -alloc_start is set before doing the dev extent lookup logic. I am strange that why we need dev_list_lock if we mount with -o alloc_start. AFAIK. -alloc_start is protected by chunk_mutex. But I think we needn't care that someone changes -alloc_start, in other words, we needn't take chunk_mutex during the whole process, the following case can be tolerated by the users, I think. Task1 Task2 statfs mutex_lock(fs_info-chunk_mutex); tmp = fs_info-alloc_start; mutex_unlock(fs_info-chunk_mutex); btrfs_calc_avail_data_space(fs_info, tmp) ... mount -o remount,alloc_start= ... Thanks Miao fs/btrfs/super.c | 72 1 file changed, 47 insertions(+), 25 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 54bd91e..dc337d1 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1644,8 +1644,20 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) int i = 0, nr_devices; int ret; + /* + * We aren't under the device list lock, so this is racey-ish, but good + * enough for our purposes. 
+ */ nr_devices = fs_info-fs_devices-open_devices; - BUG_ON(!nr_devices); + if (!nr_devices) { + smp_mb(); + nr_devices = fs_info-fs_devices-open_devices; + ASSERT(nr_devices); + if (!nr_devices) { + *free_bytes = 0; + return 0; + } + } devices_info = kmalloc_array(nr_devices, sizeof(*devices_info), GFP_NOFS); @@ -1670,11 +1682,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) else min_stripe_size = BTRFS_STRIPE_LEN; - list_for_each_entry(device, fs_devices-devices, dev_list) { + if (fs_info-alloc_start) + mutex_lock(fs_devices-device_list_mutex); + rcu_read_lock(); + list_for_each_entry_rcu(device, fs_devices-devices, dev_list) { if (!device-in_fs_metadata || !device-bdev || device-is_tgtdev_for_dev_replace) continue; + if (i = nr_devices) + break; + avail_space = device-total_bytes - device-bytes_used; /* align with stripe_len */ @@ -1689,24 +1707,32 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes) skip_space = 1024 * 1024; /* user can set the offset in fs_info-alloc_start. */ - if (fs_info-alloc_start + BTRFS_STRIPE_LEN = - device-total_bytes) + if (fs_info-alloc_start + fs_info-alloc_start + BTRFS_STRIPE_LEN = + device-total_bytes) { + rcu_read_unlock(); skip_space = max(fs_info-alloc_start, skip_space); - /* - * btrfs can not use the free space in [0, skip_space - 1], - * we must subtract it from the total. In order to implement - * it, we account the used space in this range first. - */ - ret = btrfs_account_dev_extents_size(device, 0, skip_space - 1, - used_space); - if (ret) { - kfree(devices_info); - return ret; - } + /* + * btrfs can not use the free space in + * [0, skip_space - 1], we must subtract it from the + * total. In order to implement it, we account the used + * space in this range first. + */ + ret = btrfs_account_dev_extents_size(device, 0, + skip_space - 1, + used_space); + if (ret) { + kfree(devices_info); +
Re: [PATCH] Btrfs: don't do async reclaim during log replay V2
Ping..

On Thu, 23 Oct 2014 16:44:54 +0800, Miao Xie wrote:
On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:

Trying to reproduce a log enospc bug, I hit a panic in the async reclaim code during log replay. This is because we use fs_info->fs_root as our root for shrinking and such. Technically we can use whatever root we want, but let's just not allow async reclaim while we're doing log replay. Thanks,

Why not move the fs_root initialization to the front of log replay? I think that is better than the fix in this patch, because then the async reclaimer can help us do some work.

Thanks
Miao

Signed-off-by: Josef Bacik jba...@fb.com
---
V1->V2: use fs_info->log_root_recovering instead, didn't notice this existed before.

 fs/btrfs/extent-tree.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 28a27d5..44d0497 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4513,7 +4513,13 @@ again:
 		space_info->flush = 1;
 	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
 		used += orig_bytes;
-		if (need_do_async_reclaim(space_info, root->fs_info, used) &&
+		/*
+		 * We will do the space reservation dance during log replay,
+		 * which means we won't have fs_info->fs_root set, so don't do
+		 * the async reclaim as we will panic.
+		 */
+		if (!root->fs_info->log_root_recovering &&
+		    need_do_async_reclaim(space_info, root->fs_info, used) &&
 		    !work_busy(&root->fs_info->async_reclaim_work))
 			queue_work(system_unbound_wq,
 				   &root->fs_info->async_reclaim_work);

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] Btrfs: fix snapshot inconsistency after a file write followed by truncate
On Wed, 29 Oct 2014 08:21:12 +0000, Filipe Manana wrote:

If right after starting the snapshot creation ioctl we perform a write against a file followed by a truncate, with both operations increasing the file's size, we can get a snapshot tree that reflects a state of the source subvolume's tree where the file truncation happened but the write operation didn't. This leaves a gap between 2 file extent items of the inode, which makes btrfs' fsck complain about it.

For example, if we perform the following file operations:

  $ mkfs.btrfs -f /dev/vdd
  $ mount /dev/vdd /mnt
  $ xfs_io -f \
        -c "pwrite -S 0xaa -b 32K 0 32K" \
        -c "fsync" \
        -c "pwrite -S 0xbb -b 32770 16K 32770" \
        -c "truncate 90123" \
        /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often can get the following inode items in the snapshot's btree:

  item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
      inode generation 146 transid 7 size 90123 block group 0 mode 100600
      links 1 uid 0 gid 0 rdev 0 flags 0x0
  item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
      inode ref index 282 namelen 10 name: foobar
  item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
      extent data disk byte 1104855040 nr 32768
      extent data offset 0 nr 32768 ram 32768
      extent compression 0
  item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
      extent data disk byte 0 nr 0
      extent data offset 0 nr 40960 ram 40960
      extent compression 0

There's a file range, corresponding to the interval [32K, ALIGN(16K + 32770, 4096)[, for which there's no file extent item covering it. This is because the file write and file truncate operations both happened right after the snapshot creation ioctl called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the ordered extent that matches the write and, in btrfs_setsize(), we were able to call btrfs_cont_expand() before being able to commit the current transaction in the snapshot creation ioctl. So this made it possible to insert the hole file extent item in the source subvolume (which represents the region added by the truncate) right before the transaction commit from the snapshot creation ioctl.

Btrfs' fsck tool complains about such cases with a message like the following:

  root 331 inode 257 errors 100, file extent discount

From a user perspective, the expectation when a snapshot is created while those file operations are being performed is that the snapshot will have a file where either:

1) it is empty
2) only the first write was captured
3) only the 2 writes were captured
4) both writes and the truncation were captured

But it should never capture a state where only the first write and the truncation were captured (since the second write was performed before the truncation).

A test case for xfstests follows.

Signed-off-by: Filipe Manana fdman...@suse.com
---
V2: Use a different approach to solve the problem. Don't start and wait for all delalloc to finish after every expanding truncate; instead add an additional flush at transaction commit time if we're doing a transaction commit that creates snapshots.

This method will make the transaction commit spend more time. Why not use i_disk_size to expand the file size in btrfs_setsize()? Or we might rename btrfs_{start, end}_nocow_write() and use them in btrfs_setsize()?

Thanks
Miao

V3: Removed useless test condition in wait_pending_snapshot_roots_delalloc().

 fs/btrfs/transaction.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 396ae8b..5e7f004 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1714,12 +1714,65 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
 		btrfs_wait_ordered_roots(fs_info, -1);
 }
 
+static int
+start_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
+				      struct list_head *splice)
+{
+	struct btrfs_pending_snapshot *pending_snapshot;
+	int ret = 0;
+
+	if (btrfs_test_opt(trans->root, FLUSHONCOMMIT))
+		return 0;
+
+	spin_lock(&trans->root->fs_info->trans_lock);
+	list_splice_init(&trans->transaction->pending_snapshots, splice);
+	spin_unlock(&trans->root->fs_info->trans_lock);
+
+	/*
+	 * Start again delalloc for the roots our pending snapshots are made
+	 * from. We did it before starting/joining a transaction and we do it
+	 * here again because new inode operations might have happened since
+	 * then and we want to make sure the snapshot captures a fully
+	 * consistent state of the source root
Re: [PATCH] Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
On Mon, 27 Oct 2014 09:16:55 +0000, Filipe Manana wrote:

If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However, this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it. Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by completely removing the logic that looks up an equivalent skinny metadata item, since it can not exist.

I think we also need a better function name, such as btrfs_lookup_data_extent.

Thanks
Miao

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0d599ba..9141b2b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -710,7 +710,7 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
 	rcu_read_unlock();
 }
 
-/* simple helper to search for an existing extent at a given offset */
+/* simple helper to search for an existing data extent at a given offset */
 int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
 {
 	int ret;
@@ -726,12 +726,6 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
 	key.type = BTRFS_EXTENT_ITEM_KEY;
 	ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
 				0, 0);
-	if (ret > 0) {
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		if (key.objectid == start &&
-		    key.type == BTRFS_METADATA_ITEM_KEY)
-			ret = 0;
-	}
 	btrfs_free_path(path);
 	return ret;
 }
Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
On Mon, 27 Oct 2014 09:19:52 +0000, Filipe Manana wrote:

We have a race that can lead us to miss skinny extent items in the function btrfs_lookup_extent_info() when the skinny metadata feature is enabled. Basically the sequence of steps is:

1) We search in the extent tree for the skinny extent, which returns > 0 (not found);

2) We check the previous item in the returned leaf for a non-skinny extent, and we don't find it;

3) Because we didn't find the non-skinny extent in step 2), we release our path to search the extent tree again, but this time for a non-skinny extent key;

4) Right after we released our path in step 3), a skinny extent was inserted in the extent tree (delayed refs were run) - our second extent tree search will miss it, because it's not looking for a skinny extent;

5) After the second search returned (with ret > 0), we look for any delayed ref for our extent's bytenr (and we do it while holding a read lock on the leaf), but we won't find any, as such delayed ref had just run and completed after we released our path in step 3), before doing the second search.

Fix this by completely removing the path release and re-search logic. This is safe, because if we search for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists.

I think this analysis is wrong if there are some independent shared ref metadata items for a tree block, just like:

  +------------------------+-------------+-------------+
  | tree block extent item | shared ref1 | shared ref2 |
  +------------------------+-------------+-------------+

Thanks
Miao

The only case where path->slots[0] is zero is when there are no smaller keys in the tree (i.e. no left siblings for our leaf), in which case the re-search logic isn't needed as well. This race has been present since the introduction of skinny metadata (change 3173a18f70554fe7880bb2d85c7da566e364eb3c).

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9141b2b..2cedd06 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -780,7 +780,6 @@ search_again:
 	else
 		key.type = BTRFS_EXTENT_ITEM_KEY;
 
-again:
 	ret = btrfs_search_slot(trans, root->fs_info->extent_root, &key, path,
 				0, 0);
 	if (ret < 0)
@@ -796,13 +795,6 @@ search_again:
 			    key.offset == root->nodesize)
 				ret = 0;
 		}
-		if (ret) {
-			key.objectid = bytenr;
-			key.type = BTRFS_EXTENT_ITEM_KEY;
-			key.offset = root->nodesize;
-			btrfs_release_path(path);
-			goto again;
-		}
 	}
 
 	if (ret == 0) {
Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
On Mon, 27 Oct 2014 13:44:22 +0000, Filipe David Manana wrote:
On Mon, Oct 27, 2014 at 12:11 PM, Filipe David Manana fdman...@gmail.com wrote:
On Mon, Oct 27, 2014 at 11:08 AM, Miao Xie mi...@cn.fujitsu.com wrote:
On Mon, 27 Oct 2014 09:19:52 +0000, Filipe Manana wrote:

[...]

Fix this by completely removing the path release and re-search logic. This is safe, because if we search for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists.

I think this analysis is wrong if there are some independent shared ref metadata items for a tree block, just like:

  +------------------------+-------------+-------------+
  | tree block extent item | shared ref1 | shared ref2 |
  +------------------------+-------------+-------------+

Trying to guess what's in your mind. Is the concern that if, after a non-skinny extent item, we have non-inlined references, the assumption that path->slots[0] - 1 points to the extent item would be wrong when searching for a skinny extent? That wouldn't be the case, because BTRFS_EXTENT_ITEM_KEY == 168 and BTRFS_METADATA_ITEM_KEY == 169, with BTRFS_SHARED_BLOCK_REF_KEY == 182. So in the presence of such non-inlined shared tree block reference items, searching for a skinny extent item leaves us at a slot that points to the first non-inlined ref (regardless of its type, since they're all > 169), and therefore path->slots[0] - 1 is the non-skinny extent item.

You are right. I forgot to check the value of the key type. Sorry. This patch looks good to me.

Reviewed-by: Miao Xie mi...@cn.fujitsu.com

thanks.

Why does that matter? Can you elaborate on why it's not correct? We're looking for the extent item only in btrfs_lookup_extent_info(), and running a delayed ref, independently of it being inlined/shared, implies inserting a new extent item or updating an existing extent item (updating the ref count).

thanks

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men."
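Filipe's key-ordering argument can be checked mechanically: btrfs keys sort by (objectid, type, offset), and the type values he quotes are fixed by the on-disk format. A small sketch — the comparator is a simplified stand-in for the kernel's key comparison, ignoring the offset field:

```c
#include <assert.h>

/* Key type values quoted in the thread (from the on-disk format). */
#define BTRFS_EXTENT_ITEM_KEY      168
#define BTRFS_METADATA_ITEM_KEY    169
#define BTRFS_SHARED_BLOCK_REF_KEY 182

struct key {
	unsigned long long objectid;	/* the tree block's bytenr */
	int type;
};

/*
 * Keys sort by (objectid, type, ...).  For one tree block at a given
 * bytenr, the non-skinny extent item (168) sorts before the skinny key
 * position (169), and every non-inlined backref (e.g. 182) sorts after
 * both.  So a failed search for (bytenr, 169) lands on the first backref
 * slot, and slots[0] - 1 is the extent item, never a backref.
 */
int key_cmp(struct key a, struct key b)
{
	if (a.objectid != b.objectid)
		return a.objectid < b.objectid ? -1 : 1;
	return a.type - b.type;
}
```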
Re: [PATCH] Btrfs: properly clean up btrfs_end_io_wq_cache
On Wed, 15 Oct 2014 17:19:59 -0400, Josef Bacik wrote:

In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on unload, which makes us unable to unload and then re-load the btrfs module. This fixes the problem. Thanks,

Signed-off-by: Josef Bacik jba...@fb.com

Reviewed-by: Miao Xie mi...@cn.fujitsu.com

---
 fs/btrfs/super.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index b83ef15..c1d020f 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2151,6 +2151,7 @@ static void __exit exit_btrfs_fs(void)
 	extent_map_exit();
 	extent_io_exit();
 	btrfs_interface_exit();
+	btrfs_end_io_wq_exit();
 	unregister_filesystem(&btrfs_fs_type);
 	btrfs_exit_sysfs();
 	btrfs_cleanup_fs_uuids();
Re: [PATCH] Btrfs: don't do async reclaim during log replay V2
On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:

Trying to reproduce a log enospc bug, I hit a panic in the async reclaim code during log replay. This is because we use fs_info->fs_root as our root for shrinking and such. Technically we can use whatever root we want, but let's just not allow async reclaim while we're doing log replay. Thanks,

Why not move the fs_root initialization to the front of log replay? I think that is better than the fix in this patch, because then the async reclaimer can help us do some work.

Thanks
Miao

Signed-off-by: Josef Bacik jba...@fb.com
---
V1->V2: use fs_info->log_root_recovering instead, didn't notice this existed before.

 fs/btrfs/extent-tree.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 28a27d5..44d0497 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4513,7 +4513,13 @@ again:
 		space_info->flush = 1;
 	} else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
 		used += orig_bytes;
-		if (need_do_async_reclaim(space_info, root->fs_info, used) &&
+		/*
+		 * We will do the space reservation dance during log replay,
+		 * which means we won't have fs_info->fs_root set, so don't do
+		 * the async reclaim as we will panic.
+		 */
+		if (!root->fs_info->log_root_recovering &&
+		    need_do_async_reclaim(space_info, root->fs_info, used) &&
 		    !work_busy(&root->fs_info->async_reclaim_work))
 			queue_work(system_unbound_wq,
 				   &root->fs_info->async_reclaim_work);
Re: device balance times
On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote:
On 22.10.2014 03:43, Chris Murphy wrote:
On Oct 21, 2014, at 4:14 PM, Piotr Pawłow p...@siedziba.pl wrote:

Looks normal to me. Last time I started a balance after adding a 6th device to my FS, it took 4 days to move 25GBs of data.

It's long-term untenable. At some point it must be fixed. It's way, way slower than md raid. At a certain point it needs to fall back to block-level copying, with a ~32KB block. It can't keep treating things as if they're 1K files, doing file-level copying that takes forever. It's just too risky that another device fails in the meantime.

There's device replace for restoring redundancy, which is fast, but not implemented yet for RAID5/6.

Now my colleague and I are implementing scrub/replace for RAID5/6, and I have a plan to reimplement balance and split it off from the metadata/file data process. The main idea is:

- allocate a new chunk which has the same size as the relocated one, but don't insert it into the block group list, so we don't allocate free space from it
- set the source chunk to be read-only
- copy the data from the source chunk to the new chunk
- replace the extent map of the source chunk with the one of the new chunk (the new chunk has the same logical address and length as the old one)
- release the source chunk

This way, we needn't deal with the data one extent at a time, and needn't do any space reservation, so it will be very fast even if we have lots of snapshots.

Thanks
Miao

I think the problem is that balance was originally used for balancing the data/metadata split - moving stuff out of mostly empty chunks to free them and use them for something else. That pretty much has to be done at the extent level. Then balance was repurposed for things like converting RAID profiles, restoring redundancy, and balancing device usage in multi-device configurations. It works, but the extent-by-extent approach is slow.

I wonder if we could do some of these operations by just copying whole chunks in bulk. Wasn't that the point of introducing logical addresses - to be able to move chunks around quickly without changing anything except updating chunk pointers?

BTW: I'd love a simple interface to be able to select a chunk and tell it to move somewhere else. I'd like to tell chunks with metadata, or with tons of extents: "Hey, chunks! Why don't you move to my SSDs?" :)
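Miao's chunk-copy plan hinges on exactly the point raised here: the logical address stays fixed while only the logical-to-physical mapping changes, so no extent that refers to the chunk needs rewriting. A toy model of that relocation step — all structure and function names here are illustrative, not btrfs's actual extent-map code:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the plan above: a chunk maps a fixed logical address to a
 * physical location.  Relocation copies the payload to a new physical spot
 * and swaps the mapping in place, so the logical address (and therefore
 * every extent pointing into this chunk) never changes.
 */
struct chunk_map {
	unsigned long long logical;	/* stays fixed across relocation */
	unsigned char *physical;	/* storage backing this chunk */
	int read_only;
};

void relocate_chunk(struct chunk_map *map, unsigned char *new_phys,
		    size_t len)
{
	map->read_only = 1;			/* freeze the source chunk */
	memcpy(new_phys, map->physical, len);	/* bulk copy, no per-extent work */
	map->physical = new_phys;		/* swap the mapping */
	map->read_only = 0;			/* old space can now be released */
}
```

Because the copy is a single bulk transfer and no back-references are touched, the cost is independent of how many extents or snapshots reference the chunk — which is why the plan promises to stay fast with lots of snapshots.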
Re: [PATCH 2/2] Btrfs: check-int: don't complain about balanced blocks
On Thu, 16 Oct 2014 17:48:49 +0200, Stefan Behrens wrote:

The xfstest btrfs/014, which tests the balance operation, caused the check_int module to complain that known blocks changed their physical location. Since this is not an error in this case, only print such messages if verbose mode is enabled.

Reported-by: Wang Shilong wangshilong1...@gmail.com
Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
---
 fs/btrfs/check-integrity.c | 87 ++++++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 38 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 65fc2e0bbc4a..65226d7c9fe0 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1325,24 +1325,28 @@ static int btrfsic_create_link_to_next_block(
 			l = NULL;
 		next_block->generation = BTRFSIC_GENERATION_UNKNOWN;
 	} else {
-		if (next_block->logical_bytenr != next_bytenr &&
-		    !(!next_block->is_metadata &&
-		      0 == next_block->logical_bytenr)) {
-			printk(KERN_INFO
-			       "Referenced block @%llu (%s/%llu/%d)"
-			       " found in hash table, %c,"
-			       " bytenr mismatch (!= stored %llu).\n",
-			       next_bytenr, next_block_ctx->dev->name,
-			       next_block_ctx->dev_bytenr, *mirror_nump,
-			       btrfsic_get_block_type(state, next_block),
-			       next_block->logical_bytenr);
-		} else if (state->print_mask & BTRFSIC_PRINT_MASK_VERBOSE)
-			printk(KERN_INFO
-			       "Referenced block @%llu (%s/%llu/%d)"
-			       " found in hash table, %c.\n",
-			       next_bytenr, next_block_ctx->dev->name,
-			       next_block_ctx->dev_bytenr, *mirror_nump,
-			       btrfsic_get_block_type(state, next_block));
+		if (state->print_mask & BTRFSIC_PRINT_MASK_VERBOSE) {
+			if (next_block->logical_bytenr != next_bytenr &&
+			    !(!next_block->is_metadata &&
+			      0 == next_block->logical_bytenr))
+				printk(KERN_INFO
+				       "Referenced block @%llu (%s/%llu/%d)"
+				       " found in hash table, %c,"
+				       " bytenr mismatch (!= stored %llu).\n",

According to the coding style, user-visible strings should not be broken across lines.

Thanks
Miao

+				       next_bytenr, next_block_ctx->dev->name,
+				       next_block_ctx->dev_bytenr, *mirror_nump,
+				       btrfsic_get_block_type(state,
+							      next_block),
+				       next_block->logical_bytenr);
+			else
+				printk(KERN_INFO
+				       "Referenced block @%llu (%s/%llu/%d)"
+				       " found in hash table, %c.\n",
+				       next_bytenr, next_block_ctx->dev->name,
+				       next_block_ctx->dev_bytenr, *mirror_nump,
+				       btrfsic_get_block_type(state,
+							      next_block));
+		}
 
 		next_block->logical_bytenr = next_bytenr;
 		next_block->mirror_num = *mirror_nump;
@@ -1528,7 +1532,9 @@ static int btrfsic_handle_extent_data(
 			return -1;
 		}
 		if (!block_was_created) {
-			if (next_block->logical_bytenr != next_bytenr &&
+			if ((state->print_mask &
+			     BTRFSIC_PRINT_MASK_VERBOSE) &&
+			    next_block->logical_bytenr != next_bytenr &&
 			    !(!next_block->is_metadata &&
 			      0 == next_block->logical_bytenr)) {
 				printk(KERN_INFO
@@ -1881,25 +1887,30 @@ again:
 				dev_state, dev_bytenr);
 	}
 
-	if (block->logical_bytenr != bytenr &&
-	    !(!block->is_metadata &&
-	      block->logical_bytenr == 0))
-		printk(KERN_INFO
-		       "Written block @%llu (%s/%llu/%d)"
-		       " found in hash table, %c,"
-		       " bytenr mismatch"
-		       " (!= stored %llu).\n",
-
Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed
Guan,

On Sat, 11 Oct 2014 14:45:29 +0800, Eryu Guan wrote:

device replace could fail due to another running scrub process, but this failure doesn't get returned to userspace. The following steps could reproduce this issue:

  mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
  mount /dev/sdb1 /mnt/btrfs
  while true; do
      btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1
  done &
  btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
  # if this replace succeeded, do the following and repeat until
  # you see this log in dmesg
  # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
  #btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

  # once you see the error log in dmesg, check return value of
  # replace
  echo $?

Also only WARN_ON if the return code is not -EINPROGRESS.

Signed-off-by: Eryu Guan guane...@gmail.com

Ping, any comments on this patch?

Thanks,
Eryu

---
 fs/btrfs/dev-replace.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index eea26e1..44d32ab 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      &dev_replace->scrub_progress, 0, 1);
 
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
-	WARN_ON(ret);
+	/* don't warn if EINPROGRESS, someone else might be running scrub */
+	if (ret != -EINPROGRESS)
+		WARN_ON(ret);

A picky comment: I prefer WARN_ON(ret && ret != -EINPROGRESS).

Yes, this is simpler :)

-	return 0;
+	return ret;

Here we will return -EINPROGRESS if scrub is running. I think it would be better to assign some special number to args->result and then return 0, just like the case where a device replace is already running.

Thanks
Miao

That seems to require a new result type, say

  #define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS 3

and assign this result to args->result if btrfs_scrub_dev() returned -EINPROGRESS. But I don't think returning 0 unconditionally is a good idea, since btrfs_dev_replace_finishing() could return other errors too; that way those errors would be lost and userspace still won't catch them ($? is 0).

Of course. Maybe my explanation above was not so clear. In fact, I was only talking about the EINPROGRESS case; for the other cases, returning the error code is better. What I'm thinking about is something like:

  ret = btrfs_scrub_dev(...);
  ret = btrfs_dev_replace_finishing(root->fs_info, ret);
  if (ret == -EINPROGRESS) {
      args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
      ret = 0;
  } else {
      WARN_ON(ret);
  }
  return ret;

What do you think? If no objection I'll work on v2.

I like it.

Thanks
Miao

Thanks for your review!

Eryu

 leave:
 	dev_replace->srcdev = NULL;
@@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
-		return 0;
+		return scrub_ret;
 	}
 
 	printk_in_rcu(KERN_INFO
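The handling agreed on above can be sketched in isolation: map -EINPROGRESS to a dedicated args->result code plus a 0 return, and pass every other error through unchanged. This is a userspace sketch of the proposal; the SCRUB_INPROGRESS value and the stripped-down args struct are assumptions mirroring the discussion, not the kernel's actual ioctl ABI:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical result codes, mirroring the existing *_RESULT_* defines. */
#define BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR         0
#define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS 3

struct replace_args { int result; };

/*
 * -EINPROGRESS ("another scrub is running") is reported to userspace
 * through args->result and the call itself succeeds; any other non-zero
 * code is passed through so $? reflects the real failure (the kernel
 * version would also WARN_ON it).
 */
int finish_replace(int scrub_ret, struct replace_args *args)
{
	if (scrub_ret == -EINPROGRESS) {
		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
		return 0;
	}
	if (scrub_ret == 0)
		args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_ERROR;
	return scrub_ret;
}
```

The design point both sides converge on: the ioctl return value stays reserved for genuine failures, while "busy, try later" conditions travel through the result field like the already-running-replace case does.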
Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed
On Fri, 10 Oct 2014 15:13:31 +0800, Eryu Guan wrote:
On Thu, Sep 25, 2014 at 06:28:14PM +0800, Eryu Guan wrote:

device replace could fail due to another running scrub process, but this failure doesn't get returned to userspace. [...]

Also only WARN_ON if the return code is not -EINPROGRESS.

Signed-off-by: Eryu Guan guane...@gmail.com

Ping, any comments on this patch?

Thanks,
Eryu

---
 fs/btrfs/dev-replace.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index eea26e1..44d32ab 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 			      &dev_replace->scrub_progress, 0, 1);
 
 	ret = btrfs_dev_replace_finishing(root->fs_info, ret);
-	WARN_ON(ret);
+	/* don't warn if EINPROGRESS, someone else might be running scrub */
+	if (ret != -EINPROGRESS)
+		WARN_ON(ret);

A picky comment: I prefer WARN_ON(ret && ret != -EINPROGRESS).

-	return 0;
+	return ret;

Here we will return -EINPROGRESS if scrub is running. I think it would be better to assign some special number to args->result and then return 0, just like the case where a device replace is already running.

Thanks
Miao

 leave:
 	dev_replace->srcdev = NULL;
@@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
-		return 0;
+		return scrub_ret;
 	}
 
 	printk_in_rcu(KERN_INFO
Re: [PATCH] btrfs: fix ABBA deadlock in btrfs_dev_replace_finishing()
It has been fixed by https://patchwork.kernel.org/patch/4747961/ Thanks Miao On Sun, 21 Sep 2014 12:41:49 +0800, Eryu Guan wrote: btrfs_map_bio() first calls btrfs_bio_counter_inc_blocked() which checks fs state and increase bio_counter, then calls __btrfs_map_block() which will take the dev_replace lock. On the other hand, btrfs_dev_replace_finishing() takes dev_replace lock first then set fs state to BTRFS_FS_STATE_DEV_REPLACING and waits for bio_counter to be zero. The deadlock can be reproduced easily by running replace and fsstress at the same time, e.g. mkfs -t btrfs -f /dev/sdb1 /dev/sdb2 mount /dev/sdb1 /mnt/btrfs fsstress -d /mnt/btrfs -n 100 -p 2 -l 0 # fsstress from ltp supports -l option i=0 while btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs \ btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs; do echo === loop $i === let i=$i+1 done This was introduced by c404e0d Btrfs: fix use-after-free in the finishing procedure of the device replace Signed-off-by: Eryu Guan guane...@gmail.com --- Tested by the reproducer and xfstests, no new failure found. But I found kmem_cache leak if I remove btrfs module after my new test case[1], which does fsstress replace subvolume create/mount/umount/delete at the same time. BUG btrfs_extent_state (Tainted: GB ): Objects remaining in btrfs_extent_state on kmem_cache_close() .. kmem_cache_destroy btrfs_extent_state: Slab cache still has objects CPU: 3 PID: 9503 Comm: modprobe Tainted: GB 3.17.0-rc5+ #12 Hardware name: Hewlett-Packard ProLiant DL388eGen8, BIOS P73 06/01/2012 8dd09c52 880411c37eb0 81642f7a 8800b9a19300 880411c37ed0 8118ce89 a05dcd20 880411c37ee0 a056a80f 880411c37ef0 Call Trace: [81642f7a] dump_stack+0x45/0x56 [8118ce89] kmem_cache_destroy+0xf9/0x100 [a056a80f] extent_io_exit+0x1f/0x50 [btrfs] [a05c3ae3] exit_btrfs_fs+0x2c/0x549 [btrfs] [810efda2] SyS_delete_module+0x162/0x200 [81013bb7] ? do_notify_resume+0x97/0xb0 [8164af69] system_call_fastpath+0x16/0x1b The test would hang before the fix. 
I'm not sure if it's related to the fix (seems not), please help review.

Thanks,
Eryu Guan

[1] http://www.spinics.net/lists/linux-btrfs/msg37625.html

 fs/btrfs/dev-replace.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index eea26e1..5dfd292 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -510,6 +510,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	/* keep away write_all_supers() during the finishing procedure */
 	mutex_lock(&root->fs_info->chunk_mutex);
 	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
+	btrfs_rm_dev_replace_blocked(fs_info);
 	btrfs_dev_replace_lock(dev_replace);
 	dev_replace->replace_state =
 	    scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
@@ -567,12 +568,8 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	btrfs_kobj_rm_device(fs_info, src_device);
 	btrfs_kobj_add_device(fs_info, tgt_device);
 
-	btrfs_rm_dev_replace_blocked(fs_info);
-
 	btrfs_rm_dev_replace_srcdev(fs_info, src_device);
 
-	btrfs_rm_dev_replace_unblocked(fs_info);
-
 	/*
 	 * this is again a consistent state where no dev_replace procedure
 	 * is running, the target device is part of the filesystem, the
@@ -581,6 +578,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	 * belong to this filesystem.
 	 */
 	btrfs_dev_replace_unlock(dev_replace);
+	btrfs_rm_dev_replace_unblocked(fs_info);
 	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 	mutex_unlock(&root->fs_info->chunk_mutex);
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Fix the wrong condition judgment about subset extent map
This patch and the previous one (the following patch) also fixed an oops, which can be reproduced by the LTP stress test (ltpstress.sh + fsstress).
[PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map

Thanks
Miao

On Mon, 22 Sep 2014 09:13:03 +0800, Qu Wenruo wrote:

The previous commit, 'btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map', used the wrong condition to judge whether the range is a subset of an existing extent map. This may cause a bug in btrfs no-holes mode.

This patch corrects the judgment and fixes the bug.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8039021..a99ee9d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6527,7 +6527,7 @@ insert:
 		 * extent causing the -EEXIST.
 		 */
 		if (start >= extent_map_end(existing) ||
-		    start + len <= existing->start) {
+		    start <= existing->start) {
 			/*
 			 * The existing extent map is the one nearest to
 			 * the [start, start + len) range which overlaps
Re: kernel integration branch updated
Chris

On Fri, 19 Sep 2014 09:45:17 +0800, Qu Wenruo wrote:

Hi Chris,

I'm sorry that the commit 'btrfs: Fix and enhance merge_extent_mapping() to insert best fitted extent map' has a v2 patch, so the one in the tree is not up to date.

The v2 change is quite small and relatively independent, so it should not be a painful change.

I think it is better to merge it into v3.17 since it fixes a regression in the v3.17 kernel.

Thanks
Miao

Thanks,
Qu

-------- Original Message --------
Subject: kernel integration branch updated
From: Chris Mason c...@fb.com
To: linux-btrfs linux-btrfs@vger.kernel.org
Date: 2014-09-18 22:19

Hi everyone,

I've added a few more patches to the kernel integration branch, and rebased onto rc5. This should be my last rebase before sending into linux-next, please take a look.

It's still missing three patches from Josef, which we're updating. I can put more patches on top, but I'd prefer not to rebase again unless some patches need removing.

-chris
[PATCH v4 00/11] Implement the data repair function for direct read
This patchset implements the data repair function for direct read; it is implemented like the buffered read:
1. When we find the data is not right, we try to read the data from the other mirror.
2. When the IO on the mirror ends, we insert the endio work into the dedicated btrfs workqueue, not the common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue; if we inserted the endio work of the IO on the mirror into that workqueue, a deadlock would happen.
3. If we get the right data, we write it back to repair the corrupted mirror.
4. If the data on the new mirror is still corrupted, we try the next mirror until we read the right data or all the mirrors are traversed.
5. After the above work, we set the uptodate flag according to the result.

The difference is that the direct read may be split into several small IOs; in order to get the number of the mirror on which the IO error happens, we have to do the data check and repair in the end IO function of those sub-IO requests.

Besides that, we also fixed some direct IO bugs.

Changelog v3 -> v4:
- Remove the 1st patch, which has been applied to the upstream kernel.
- Use a dedicated btrfs workqueue instead of the system workqueue to deal with the completed repair bio; this suggestion was from Chris.
- Rebase the patchset onto the integration branch of Chris's git tree.
Changelog v2 -> v3:
- Fix the wrong returned bio when doing bio clone, which was reported by Filipe.

Changelog v1 -> v2:
- Fix the warning which was triggered by __GFP_ZERO in the 2nd patch.

Miao Xie (11):
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data data check and dio read data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submiting re-read bio fails
  Btrfs: Cleanup unused variant and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode is freeing

 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |  10 +-
 fs/btrfs/ctree.h        |   4 +-
 fs/btrfs/disk-io.c      |  11 +-
 fs/btrfs/disk-io.h      |   1 +
 fs/btrfs/extent_io.c    | 254 +--
 fs/btrfs/extent_io.h    |  38 -
 fs/btrfs/file-item.c    |  14 +-
 fs/btrfs/inode.c        | 446 +++-
 fs/btrfs/scrub.c        |   4 +-
 fs/btrfs/volumes.c      |   5 +
 fs/btrfs/volumes.h      |   5 +-
 13 files changed, 601 insertions(+), 193 deletions(-)

-- 
1.9.3
[PATCH v4 03/11] Btrfs: do file data check by sub-bio's self
Direct IO splits the original bio to several sub-bios because of the limit of raid stripe, and the filesystem will wait for all sub-bios and then run final end io process. But it was very hard to implement the data repair when dio read failure happens, because at the final end io function, we didn't know which mirror the data was read from. So in order to implement the data repair, we have to move the file data check in the final end io function to the sub-bio end io function, in which we can get the mirror number of the device we access. This patch did this work as the first step of the direct io data repair implementation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/btrfs_inode.h | 9 + fs/btrfs/extent_io.c | 2 +- fs/btrfs/inode.c | 100 - fs/btrfs/volumes.h | 5 ++- 4 files changed, 87 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 8bea70e..4d30947 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation) return 0; } +#define BTRFS_DIO_ORIG_BIO_SUBMITTED 0x1 + struct btrfs_dio_private { struct inode *inode; + unsigned long flags; u64 logical_offset; u64 disk_bytenr; u64 bytes; @@ -263,6 +266,12 @@ struct btrfs_dio_private { /* dio_bio came from fs/direct-io.c */ struct bio *dio_bio; + + /* +* The original bio may be splited to several sub-bios, this is +* done during endio of sub-bios +*/ + int (*subio_endio)(struct inode *, struct btrfs_io_bio *); }; /* diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index dfe1afe..92a6d9f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2472,7 +2472,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) struct inode *inode = page-mapping-host; pr_debug(end_bio_extent_readpage: bi_sector=%llu, err=%d, -mirror=%lu\n, (u64)bio-bi_iter.bi_sector, err, +mirror=%u\n, (u64)bio-bi_iter.bi_sector, err, 
io_bio-mirror_num); tree = BTRFS_I(inode)-io_tree; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e8139c6..cf79f79 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7198,29 +7198,40 @@ unlock_err: return ret; } -static void btrfs_endio_direct_read(struct bio *bio, int err) +static int btrfs_subio_endio_read(struct inode *inode, + struct btrfs_io_bio *io_bio) { - struct btrfs_dio_private *dip = bio-bi_private; struct bio_vec *bvec; - struct inode *inode = dip-inode; - struct bio *dio_bio; - struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); u64 start; - int ret; int i; + int ret; + int err = 0; - if (err || (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM)) - goto skip_checksum; + if (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM) + return 0; - start = dip-logical_offset; - bio_for_each_segment_all(bvec, bio, i) { + start = io_bio-logical; + bio_for_each_segment_all(bvec, io_bio-bio, i) { ret = __readpage_endio_check(inode, io_bio, i, bvec-bv_page, 0, start, bvec-bv_len); if (ret) err = -EIO; start += bvec-bv_len; } -skip_checksum: + + return err; +} + +static void btrfs_endio_direct_read(struct bio *bio, int err) +{ + struct btrfs_dio_private *dip = bio-bi_private; + struct inode *inode = dip-inode; + struct bio *dio_bio; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + + if (!err (dip-flags BTRFS_DIO_ORIG_BIO_SUBMITTED)) + err = btrfs_subio_endio_read(inode, io_bio); + unlock_extent(BTRFS_I(inode)-io_tree, dip-logical_offset, dip-logical_offset + dip-bytes - 1); dio_bio = dip-dio_bio; @@ -7298,6 +7309,7 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw, static void btrfs_end_dio_bio(struct bio *bio, int err) { struct btrfs_dio_private *dip = bio-bi_private; + int ret; if (err) { btrfs_err(BTRFS_I(dip-inode)-root-fs_info, @@ -7305,6 +7317,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err) btrfs_ino(dip-inode), bio-bi_rw, (unsigned long long)bio-bi_iter.bi_sector, bio-bi_iter.bi_size, err); + } else if (dip-subio_endio) 
{ + ret = dip-subio_endio(dip-inode, btrfs_io_bio(bio)); + if (ret) + err = ret; + } + + if (err
[PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io
The original code of repair_io_failure was just used for buffered read, because it got some filesystem data from page structure, it is safe for the page in the page cache. But when we do a direct read, the pages in bio are not in the page cache, that is there is no filesystem data in the page structure. In order to implement direct read data repair, we need modify repair_io_failure and pass all filesystem data it need by function parameters. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/extent_io.c | 8 +--- fs/btrfs/extent_io.h | 2 +- fs/btrfs/scrub.c | 1 + 3 files changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index cf1de40..9fbc005 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1997,7 +1997,7 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec) */ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, u64 length, u64 logical, struct page *page, - int mirror_num) + unsigned int pg_offset, int mirror_num) { struct bio *bio; struct btrfs_device *dev; @@ -2036,7 +2036,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, return -EIO; } bio-bi_bdev = dev-bdev; - bio_add_page(bio, page, length, start - page_offset(page)); + bio_add_page(bio, page, length, pg_offset); if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) { /* try to remap that extent elsewhere? 
*/ @@ -2067,7 +2067,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, for (i = 0; i num_pages; i++) { struct page *p = extent_buffer_page(eb, i); ret = repair_io_failure(root-fs_info, start, PAGE_CACHE_SIZE, - start, p, mirror_num); + start, p, start - page_offset(p), + mirror_num); if (ret) break; start += PAGE_CACHE_SIZE; @@ -2127,6 +2128,7 @@ static int clean_io_failure(u64 start, struct page *page) if (num_copies 1) { repair_io_failure(fs_info, start, failrec-len, failrec-logical, page, + start - page_offset(page), failrec-failed_mirror); } } diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 75b621b..a82ecbc 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -340,7 +340,7 @@ struct btrfs_fs_info; int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, u64 length, u64 logical, struct page *page, - int mirror_num); + unsigned int pg_offset, int mirror_num); int end_extent_writepage(struct page *page, int err, u64 start, u64 end); int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, int mirror_num); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index cce122b..3978529 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -682,6 +682,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) fs_info = BTRFS_I(inode)-root-fs_info; ret = repair_io_failure(fs_info, offset, PAGE_SIZE, fixup-logical, page, + offset - page_offset(page), fixup-mirror_num); unlock_page(page); corrected = !ret; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers
Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/extent_io.c | 26 ++ 1 file changed, 10 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index f8dda46..154cb8e 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1981,8 +1981,7 @@ struct io_failure_record { int in_validation; }; -static int free_io_failure(struct inode *inode, struct io_failure_record *rec, - int did_repair) +static int free_io_failure(struct inode *inode, struct io_failure_record *rec) { int ret; int err = 0; @@ -2109,7 +2108,6 @@ static int clean_io_failure(u64 start, struct page *page) struct btrfs_fs_info *fs_info = BTRFS_I(inode)-root-fs_info; struct extent_state *state; int num_copies; - int did_repair = 0; int ret; private = 0; @@ -2130,7 +2128,6 @@ static int clean_io_failure(u64 start, struct page *page) /* there was no real error, just free the record */ pr_debug(clean_io_failure: freeing dummy error at %llu\n, failrec-start); - did_repair = 1; goto out; } if (fs_info-sb-s_flags MS_RDONLY) @@ -2147,19 +2144,16 @@ static int clean_io_failure(u64 start, struct page *page) num_copies = btrfs_num_copies(fs_info, failrec-logical, failrec-len); if (num_copies 1) { - ret = repair_io_failure(fs_info, start, failrec-len, - failrec-logical, page, - failrec-failed_mirror); - did_repair = !ret; + repair_io_failure(fs_info, start, failrec-len, + failrec-logical, page, + failrec-failed_mirror); } - ret = 0; } out: - if (!ret) - ret = free_io_failure(inode, failrec, did_repair); + free_io_failure(inode, failrec); - return ret; + return 0; } /* @@ -2269,7 +2263,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, */ pr_debug(bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n, num_copies, failrec-this_mirror, failed_mirror); - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); return -EIO; } @@ -2312,13 +2306,13 @@ static int 
bio_readpage_error(struct bio *failed_bio, u64 phy_offset, if (failrec-this_mirror num_copies) { pr_debug(bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n, num_copies, failrec-this_mirror, failed_mirror); - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); return -EIO; } bio = btrfs_io_bio_alloc(GFP_NOFS, 1); if (!bio) { - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); return -EIO; } bio-bi_end_io = failed_bio-bi_end_io; @@ -2349,7 +2343,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, failrec-this_mirror, failrec-bio_flags, 0); if (ret) { - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); bio_put(bio); } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data data check and dio read data check
Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/inode.c | 102 +-- 1 file changed, 47 insertions(+), 55 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index af304e1..e8139c6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2893,6 +2893,40 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end, return 0; } +static int __readpage_endio_check(struct inode *inode, + struct btrfs_io_bio *io_bio, + int icsum, struct page *page, + int pgoff, u64 start, size_t len) +{ + char *kaddr; + u32 csum_expected; + u32 csum = ~(u32)0; + static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + csum_expected = *(((u32 *)io_bio-csum) + icsum); + + kaddr = kmap_atomic(page); + csum = btrfs_csum_data(kaddr + pgoff, csum, len); + btrfs_csum_final(csum, (char *)csum); + if (csum != csum_expected) + goto zeroit; + + kunmap_atomic(kaddr); + return 0; +zeroit: + if (__ratelimit(_rs)) + btrfs_info(BTRFS_I(inode)-root-fs_info, + csum failed ino %llu off %llu csum %u expected csum %u, + btrfs_ino(inode), start, csum, csum_expected); + memset(kaddr + pgoff, 1, len); + flush_dcache_page(page); + kunmap_atomic(kaddr); + if (csum_expected == 0) + return 0; + return -EIO; +} + /* * when reads are done, we need to check csums to verify the data is correct * if there's a match, we allow the bio to finish. 
If not, the code in @@ -2905,20 +2939,15 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio, size_t offset = start - page_offset(page); struct inode *inode = page-mapping-host; struct extent_io_tree *io_tree = BTRFS_I(inode)-io_tree; - char *kaddr; struct btrfs_root *root = BTRFS_I(inode)-root; - u32 csum_expected; - u32 csum = ~(u32)0; - static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); if (PageChecked(page)) { ClearPageChecked(page); - goto good; + return 0; } if (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM) - goto good; + return 0; if (root-root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) { @@ -2928,28 +2957,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio, } phy_offset = inode-i_sb-s_blocksize_bits; - csum_expected = *(((u32 *)io_bio-csum) + phy_offset); - - kaddr = kmap_atomic(page); - csum = btrfs_csum_data(kaddr + offset, csum, end - start + 1); - btrfs_csum_final(csum, (char *)csum); - if (csum != csum_expected) - goto zeroit; - - kunmap_atomic(kaddr); -good: - return 0; - -zeroit: - if (__ratelimit(_rs)) - btrfs_info(root-fs_info, csum failed ino %llu off %llu csum %u expected csum %u, - btrfs_ino(page-mapping-host), start, csum, csum_expected); - memset(kaddr + offset, 1, end - start + 1); - flush_dcache_page(page); - kunmap_atomic(kaddr); - if (csum_expected == 0) - return 0; - return -EIO; + return __readpage_endio_check(inode, io_bio, phy_offset, page, offset, + start, (size_t)(end - start + 1)); } struct delayed_iput { @@ -7194,41 +7203,24 @@ static void btrfs_endio_direct_read(struct bio *bio, int err) struct btrfs_dio_private *dip = bio-bi_private; struct bio_vec *bvec; struct inode *inode = dip-inode; - struct btrfs_root *root = BTRFS_I(inode)-root; struct bio *dio_bio; struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); - u32 *csums = (u32 *)io_bio-csum; u64 start; + int ret; int i; + if (err || 
(BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM)) + goto skip_checksum; + start = dip-logical_offset; bio_for_each_segment_all(bvec, bio, i) { - if (!(BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM)) { - struct page *page = bvec-bv_page; - char *kaddr; - u32 csum = ~(u32)0; - unsigned long flags; - - local_irq_save(flags); - kaddr = kmap_atomic(page); - csum = btrfs_csum_data(kaddr + bvec-bv_offset, - csum, bvec-bv_len
[PATCH v4 04/11] Btrfs: fix missing error handler if submiting re-read bio fails
We forgot to free the failure record and the bio if submitting the re-read bio failed; fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 92a6d9f..f8dda46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
+	if (ret) {
+		free_io_failure(inode, failrec, 0);
+		bio_put(bio);
+	}
+
 	return ret;
 }
-- 
1.9.3
[PATCH v4 10/11] Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v3 - v4: - Use a dedicated btrfs workqueue instead of the system workqueue to deal with the completed repair bio, this suggest was from Chris. Changelog v1 - v3: - None --- fs/btrfs/async-thread.c | 1 + fs/btrfs/async-thread.h | 1 + fs/btrfs/btrfs_inode.h | 2 +- fs/btrfs/ctree.h| 1 + fs/btrfs/disk-io.c | 11 +- fs/btrfs/disk-io.h | 1 + fs/btrfs/extent_io.c| 12 ++- fs/btrfs/extent_io.h| 5 +- fs/btrfs/inode.c| 276 9 files changed, 281 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c index fbd76de..2da0a66 100644 --- a/fs/btrfs/async-thread.c +++ b/fs/btrfs/async-thread.c @@ -74,6 +74,7 @@ BTRFS_WORK_HELPER(endio_helper); BTRFS_WORK_HELPER(endio_meta_helper); BTRFS_WORK_HELPER(endio_meta_write_helper); BTRFS_WORK_HELPER(endio_raid56_helper); +BTRFS_WORK_HELPER(endio_repair_helper); BTRFS_WORK_HELPER(rmw_helper); BTRFS_WORK_HELPER(endio_write_helper); BTRFS_WORK_HELPER(freespace_write_helper); diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h index e9e31c9..e386c29 100644 --- a/fs/btrfs/async-thread.h +++ b/fs/btrfs/async-thread.h @@ -53,6 +53,7 @@ BTRFS_WORK_HELPER_PROTO(endio_helper); 
BTRFS_WORK_HELPER_PROTO(endio_meta_helper); BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper); BTRFS_WORK_HELPER_PROTO(endio_raid56_helper); +BTRFS_WORK_HELPER_PROTO(endio_repair_helper); BTRFS_WORK_HELPER_PROTO(rmw_helper); BTRFS_WORK_HELPER_PROTO(endio_write_helper); BTRFS_WORK_HELPER_PROTO(freespace_write_helper); diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 4d30947..7a7521c 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -271,7 +271,7 @@ struct btrfs_dio_private { * The original bio may be splited to several sub-bios, this is * done during endio of sub-bios */ - int (*subio_endio)(struct inode *, struct btrfs_io_bio *); + int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int); }; /* diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7b54cd9..63acfd8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1538,6 +1538,7 @@ struct btrfs_fs_info { struct btrfs_workqueue *endio_workers; struct btrfs_workqueue *endio_meta_workers; struct btrfs_workqueue *endio_raid56_workers; + struct btrfs_workqueue *endio_repair_workers; struct btrfs_workqueue *rmw_workers; struct btrfs_workqueue *endio_meta_write_workers; struct btrfs_workqueue *endio_write_workers; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ff3ee22..1594d91 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -713,7 +713,11 @@ static void end_workqueue_bio(struct bio *bio, int err) func = btrfs_endio_write_helper; } } else { - if (end_io_wq-metadata == BTRFS_WQ_ENDIO_RAID56) { + if (unlikely(end_io_wq-metadata == +BTRFS_WQ_ENDIO_DIO_REPAIR)) { + wq = fs_info-endio_repair_workers; + func = btrfs_endio_repair_helper; + } else if (end_io_wq-metadata == BTRFS_WQ_ENDIO_RAID56) { wq = fs_info-endio_raid56_workers; func = btrfs_endio_raid56_helper; } else if (end_io_wq-metadata) { @@ -741,6 +745,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio, int metadata) { struct end_io_wq *end_io_wq; + end_io_wq = 
kmalloc(sizeof(*end_io_wq), GFP_NOFS); if (!end_io_wq) return -ENOMEM; @@ -2059,6 +2064,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info) btrfs_destroy_workqueue(fs_info-endio_workers); btrfs_destroy_workqueue(fs_info-endio_meta_workers); btrfs_destroy_workqueue(fs_info-endio_raid56_workers); + btrfs_destroy_workqueue(fs_info-endio_repair_workers); btrfs_destroy_workqueue(fs_info-rmw_workers); btrfs_destroy_workqueue
[PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io
The current code would load checksum data for several times when we split a whole direct read io because of the limit of the raid stripe, it would make us search the csum tree for several times. In fact, it just wasted time, and made the contention of the csum tree root be more serious. This patch improves this problem by loading the data at once. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v3 - v4: - None Changelog v2 - v3: - Fix the wrong return value of btrfs_bio_clone Changelog v1 - v2: - Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger a WARNing. It is reported by Filipe David Manana, Thanks. --- fs/btrfs/btrfs_inode.h | 1 - fs/btrfs/ctree.h | 3 +-- fs/btrfs/extent_io.c | 13 +++-- fs/btrfs/file-item.c | 14 ++ fs/btrfs/inode.c | 38 +- 5 files changed, 35 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index fd87941..8bea70e 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -263,7 +263,6 @@ struct btrfs_dio_private { /* dio_bio came from fs/direct-io.c */ struct bio *dio_bio; - u8 csum[0]; }; /* diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index ded7781..7b54cd9 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3719,8 +3719,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans, int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode, struct bio *bio, u32 *dst); int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode, - struct btrfs_dio_private *dip, struct bio *bio, - u64 logical_offset); + struct bio *bio, u64 logical_offset); int btrfs_insert_file_extent(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 objectid, u64 pos, diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 86b39de..dfe1afe 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2621,9 +2621,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs, struct bio *btrfs_bio_clone(struct bio 
*bio, gfp_t gfp_mask) { - return bio_clone_bioset(bio, gfp_mask, btrfs_bioset); -} + struct btrfs_io_bio *btrfs_bio; + struct bio *new; + new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset); + if (new) { + btrfs_bio = btrfs_io_bio(new); + btrfs_bio-csum = NULL; + btrfs_bio-csum_allocated = NULL; + btrfs_bio-end_io = NULL; + } + return new; +} /* this also allocates from the btrfs_bioset */ struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs) diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index 6e6262e..783a943 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode, } int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode, - struct btrfs_dio_private *dip, struct bio *bio, - u64 offset) + struct bio *bio, u64 offset) { - int len = (bio-bi_iter.bi_sector 9) - dip-disk_bytenr; - u16 csum_size = btrfs_super_csum_size(root-fs_info-super_copy); - int ret; - - len = inode-i_sb-s_blocksize_bits; - len *= csum_size; - - ret = __btrfs_lookup_bio_sums(root, inode, bio, offset, - (u32 *)(dip-csum + len), 1); - return ret; + return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1); } int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 2118ea6..af304e1 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7196,7 +7196,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err) struct inode *inode = dip-inode; struct btrfs_root *root = BTRFS_I(inode)-root; struct bio *dio_bio; - u32 *csums = (u32 *)dip-csum; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + u32 *csums = (u32 *)io_bio-csum; u64 start; int i; @@ -7238,6 +7239,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err) if (err) clear_bit(BIO_UPTODATE, dio_bio-bi_flags); dio_end_io(dio_bio, err); + + if (io_bio-end_io) + io_bio-end_io(io_bio, err); bio_put(bio); } @@ 
-7377,13 +7381,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode, ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1); if (ret) goto err; - } else if (!skip_sum) { - ret
[PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should cleanup the read failure record in that range because - If we set data COW for the file, the range that the failure record pointed to is mapped to a new place, so it is invalid. - If we set no data COW for the file, and if there is no error during writting, the corrupted data is corrected, so the failure record can be removed. And if some errors happen on the mirrors, we also needn't worry about it because the failure record will be recreated if we read the same place again. Sometimes, we may fail to correct the data, so the failure records will be left in the tree, we need free them when we free the inode or the memory leak happens. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/extent_io.c | 34 ++ fs/btrfs/extent_io.h | 1 + fs/btrfs/inode.c | 6 ++ 3 files changed, 41 insertions(+) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 86dc352..5427fd5 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2138,6 +2138,40 @@ out: return 0; } +/* + * Can be called when + * - hold extent lock + * - under ordered extent + * - the inode is freeing + */ +void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end) +{ + struct extent_io_tree *failure_tree = BTRFS_I(inode)-io_failure_tree; + struct io_failure_record *failrec; + struct extent_state *state, *next; + + if (RB_EMPTY_ROOT(failure_tree-state)) + return; + + spin_lock(failure_tree-lock); + state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY); + while (state) { + if (state-start end) + break; + + ASSERT(state-end = end); + + next = next_state(state); + + failrec = (struct io_failure_record *)state-private; + free_extent_state(state); + kfree(failrec); + + state = next; + } + spin_unlock(failure_tree-lock); +} + int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end, struct io_failure_record **failrec_ret) { diff --git a/fs/btrfs/extent_io.h 
b/fs/btrfs/extent_io.h index 176a4b1..5e91fb9 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -366,6 +366,7 @@ struct io_failure_record { int in_validation; }; +void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end); int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end, struct io_failure_record **failrec_ret); int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index bc8cdaf..c591af5 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2697,6 +2697,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent) goto out; } + btrfs_free_io_failure_record(inode, ordered_extent-file_offset, +ordered_extent-file_offset + +ordered_extent-len - 1); + if (test_bit(BTRFS_ORDERED_TRUNCATED, ordered_extent-flags)) { truncated = true; logical_len = ordered_extent-truncated_len; @@ -4792,6 +4796,8 @@ void btrfs_evict_inode(struct inode *inode) /* do we really want it for -i_nlink 0 and zero btrfs_root_refs? */ btrfs_wait_ordered_range(inode, 0, (u64)-1); + btrfs_free_io_failure_record(inode, 0, (u64)-1); + if (root-fs_info-log_root_recovering) { BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM, BTRFS_I(inode)-runtime_flags)); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
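The cleanup walk that btrfs_free_io_failure_record() performs can be pictured outside the kernel as a range-based free over a set of records. The sketch below is a deliberately simplified model (a flat array instead of the extent state tree, a `freed` flag instead of kfree()); all names are illustrative, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io_failure_record; the real code keeps
 * these as private data on EXTENT_DIRTY states in io_failure_tree. */
struct failure_record {
    unsigned long long start;
    unsigned long long end;
    bool freed;
};

/* Free (here: just mark) every record overlapping [start, end], mirroring
 * the walk btrfs_free_io_failure_record() does under the tree lock.
 * Returns how many records were freed. */
static size_t free_failure_records(struct failure_record *recs, size_t n,
                                   unsigned long long start,
                                   unsigned long long end)
{
    size_t freed = 0;
    for (size_t i = 0; i < n; i++) {
        if (recs[i].freed || recs[i].start > end || recs[i].end < start)
            continue;           /* outside the range: keep it */
        recs[i].freed = true;   /* kfree(failrec) in the kernel */
        freed++;
    }
    return freed;
}
```

Called with (0, (u64)-1), as btrfs_evict_inode() does, this frees every leftover record, which is exactly the leak the patch plugs.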
[PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions
The data repair function of direct read will be implemented later, and some code in bio_readpage_error will be reused, so split bio_readpage_error into several functions which will be used in direct read repair later. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/extent_io.c | 159 ++- fs/btrfs/extent_io.h | 28 + 2 files changed, 123 insertions(+), 64 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 154cb8e..cf1de40 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1962,25 +1962,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page) SetPageUptodate(page); } -/* - * When IO fails, either with EIO or csum verification fails, we - * try other mirrors that might have a good copy of the data. This - * io_failure_record is used to record state as we go through all the - * mirrors. If another mirror has good data, the page is set up to date - * and things continue. If a good mirror can't be found, the original - * bio end_io callback is called to indicate things have failed. - */ -struct io_failure_record { - struct page *page; - u64 start; - u64 len; - u64 logical; - unsigned long bio_flags; - int this_mirror; - int failed_mirror; - int in_validation; -}; - static int free_io_failure(struct inode *inode, struct io_failure_record *rec) { int ret; @@ -2156,40 +2137,24 @@ out: return 0; } -/* - * this is a generic handler for readpage errors (default - * readpage_io_failed_hook). if other copies exist, read those and write back - * good data to the failed position. 
does not investigate in remapping the - * failed extent elsewhere, hoping the device will be smart enough to do this as - * needed - */ - -static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, - struct page *page, u64 start, u64 end, - int failed_mirror) +int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end, + struct io_failure_record **failrec_ret) { - struct io_failure_record *failrec = NULL; + struct io_failure_record *failrec; u64 private; struct extent_map *em; - struct inode *inode = page-mapping-host; struct extent_io_tree *failure_tree = BTRFS_I(inode)-io_failure_tree; struct extent_io_tree *tree = BTRFS_I(inode)-io_tree; struct extent_map_tree *em_tree = BTRFS_I(inode)-extent_tree; - struct bio *bio; - struct btrfs_io_bio *btrfs_failed_bio; - struct btrfs_io_bio *btrfs_bio; - int num_copies; int ret; - int read_mode; u64 logical; - BUG_ON(failed_bio-bi_rw REQ_WRITE); - ret = get_state_private(failure_tree, start, private); if (ret) { failrec = kzalloc(sizeof(*failrec), GFP_NOFS); if (!failrec) return -ENOMEM; + failrec-start = start; failrec-len = end - start + 1; failrec-this_mirror = 0; @@ -2209,11 +2174,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, em = NULL; } read_unlock(em_tree-lock); - if (!em) { kfree(failrec); return -EIO; } + logical = start - em-start; logical = em-block_start + logical; if (test_bit(EXTENT_FLAG_COMPRESSED, em-flags)) { @@ -,8 +2187,10 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, extent_set_compress_type(failrec-bio_flags, em-compress_type); } - pr_debug(bio_readpage_error: (new) logical=%llu, start=%llu, -len=%llu\n, logical, start, failrec-len); + + pr_debug(Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n, +logical, start, failrec-len); + failrec-logical = logical; free_extent_map(em); @@ -2243,8 +2210,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, } } else { failrec = (struct 
io_failure_record *)(unsigned long)private; - pr_debug(bio_readpage_error: (found) logical=%llu, -start=%llu, len=%llu, validation=%d\n, + pr_debug(Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n, failrec-logical, failrec-start, failrec-len, failrec-in_validation); /* @@ -2253,6 +2219,17 @@ static int bio_readpage_error(struct bio *failed_bio, u64
[PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6
We need the real mirror number for RAID0/5/6 when reading data; otherwise, if a read error happens, we pass 0 as the number of the mirror on which the I/O error happened. That is wrong and causes the filesystem to read the data from the corrupted mirror again. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 -> v4: - None --- fs/btrfs/volumes.c | 5 + 1 file changed, 5 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 1aacf5f..4856547 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -5073,6 +5073,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw, num_stripes = min_t(u64, map->num_stripes, stripe_nr_end - stripe_nr_orig); stripe_index = do_div(stripe_nr, map->num_stripes); + if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))) + mirror_num = 1; } else if (map->type & BTRFS_BLOCK_GROUP_RAID1) { if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)) num_stripes = map->num_stripes; @@ -5176,6 +5178,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw, /* We distribute the parity blocks across stripes */ tmp = stripe_nr + stripe_index; stripe_index = do_div(tmp, map->num_stripes); + if (!(rw & (REQ_WRITE | REQ_DISCARD | + REQ_GET_READ_MIRRORS)) && mirror_num <= 1) + mirror_num = 1; } } else { /* -- 1.9.3
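Since RAID0 keeps only one copy of the data, the point of the fix is that a read which carries no mirror hint must report mirror 1 rather than 0 (0 means "unset" and would send a later repair back to the same bad copy). A minimal model of that mapping, using plain modular arithmetic instead of the kernel's do_div():

```c
#include <assert.h>

/* Illustrative helper, not a kernel function: map a stripe number onto a
 * RAID0 stripe index and make sure reads always carry a mirror number. */
static unsigned raid0_map(unsigned long long stripe_nr, unsigned num_stripes,
                          int is_write, int *mirror_num)
{
    unsigned stripe_index = (unsigned)(stripe_nr % num_stripes);

    /* The fix from this patch: a plain read (no write/discard/
     * get-read-mirrors flags) must report mirror 1, never 0. */
    if (!is_write && *mirror_num == 0)
        *mirror_num = 1;
    return stripe_index;
}
```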
[PATCH v2 03/10] Btrfs: fix wrong fsid check of scrub
All the metadata in the seed devices has the same fsid as the fsid of the seed filesystem which is on the seed device, so we should check them by the current filesystem. Fix it. Signed-off-by: Miao Xie mi...@cn.fujitsu.com Reviewed-by: David Sterba dste...@suse.cz --- Changelog v1 - v2: - Use const keyword to restrict the fsid. - Remove unnecessary the variant. --- fs/btrfs/scrub.c | 16 +++- 1 file changed, 11 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index dfb92a2..12a6801 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -1361,6 +1361,14 @@ static void scrub_recheck_block(struct btrfs_fs_info *fs_info, return; } +static inline int scrub_check_fsid(const u8 *fsid, + struct scrub_page *spage) +{ + struct btrfs_fs_devices *fs_devices = spage-dev-fs_devices; + + return !memcmp(fsid, fs_devices-fsid, BTRFS_UUID_SIZE); +} + static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info, struct scrub_block *sblock, int is_metadata, int have_csum, @@ -1380,7 +1388,7 @@ static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info, h = (struct btrfs_header *)mapped_buffer; if (sblock-pagev[0]-logical != btrfs_stack_header_bytenr(h) || - memcmp(h-fsid, fs_info-fsid, BTRFS_UUID_SIZE) || + !scrub_check_fsid(h-fsid, sblock-pagev[0]) || memcmp(h-chunk_tree_uuid, fs_info-chunk_tree_uuid, BTRFS_UUID_SIZE)) { sblock-header_error = 1; @@ -1750,7 +1758,7 @@ static int scrub_checksum_tree_block(struct scrub_block *sblock) if (sblock-pagev[0]-generation != btrfs_stack_header_generation(h)) ++fail; - if (memcmp(h-fsid, fs_info-fsid, BTRFS_UUID_SIZE)) + if (!scrub_check_fsid(h-fsid, sblock-pagev[0])) ++fail; if (memcmp(h-chunk_tree_uuid, fs_info-chunk_tree_uuid, @@ -1790,8 +1798,6 @@ static int scrub_checksum_super(struct scrub_block *sblock) { struct btrfs_super_block *s; struct scrub_ctx *sctx = sblock-sctx; - struct btrfs_root *root = sctx-dev_root; - struct btrfs_fs_info *fs_info = root-fs_info; u8 
calculated_csum[BTRFS_CSUM_SIZE]; u8 on_disk_csum[BTRFS_CSUM_SIZE]; struct page *page; @@ -1816,7 +1822,7 @@ static int scrub_checksum_super(struct scrub_block *sblock) if (sblock-pagev[0]-generation != btrfs_super_generation(s)) ++fail_gen; - if (memcmp(s-fsid, fs_info-fsid, BTRFS_UUID_SIZE)) + if (!scrub_check_fsid(s-fsid, sblock-pagev[0])) ++fail_cor; len = BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
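The essence of scrub_check_fsid() is that each scrubbed page is verified against the fsid of the fs_devices its device belongs to, which for a seed device differs from the fs_info->fsid of the mounted filesystem. A self-contained model with illustrative struct names:

```c
#include <assert.h>
#include <string.h>

#define UUID_SIZE 16

/* Minimal model: each device remembers the fsid of the filesystem it
 * belongs to (its fs_devices). For a seed device this is the seed fs's
 * fsid, not the fsid of the filesystem currently mounted on top of it. */
struct fs_devices_model { unsigned char fsid[UUID_SIZE]; };
struct scrub_page_model { struct fs_devices_model *fs_devices; };

static int scrub_check_fsid_model(const unsigned char *fsid,
                                  const struct scrub_page_model *spage)
{
    /* Compare against the device's own filesystem, not fs_info->fsid. */
    return !memcmp(fsid, spage->fs_devices->fsid, UUID_SIZE);
}
```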
[PATCH v2 06/10] Btrfs: Fix the problem that the dirty flag of dev stats is cleared
The io error might happen during writing out the device stats, and the device stats information and dirty flag would be update at that time, but the current code didn't consider this case, just clear the dirty flag, it would cause that we forgot to write out the new device stats information. Fix it. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v2: - Change the variant name and make some cleanup by David's comment --- fs/btrfs/volumes.c | 8 ++-- fs/btrfs/volumes.h | 16 2 files changed, 18 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 19188df..4ea73c8 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -159,6 +159,7 @@ static struct btrfs_device *__alloc_device(void) spin_lock_init(dev-reada_lock); atomic_set(dev-reada_in_flight, 0); + atomic_set(dev-dev_stats_dirty, 0); INIT_RADIX_TREE(dev-reada_zones, GFP_NOFS ~__GFP_WAIT); INIT_RADIX_TREE(dev-reada_extents, GFP_NOFS ~__GFP_WAIT); @@ -6398,16 +6399,19 @@ int btrfs_run_dev_stats(struct btrfs_trans_handle *trans, struct btrfs_root *dev_root = fs_info-dev_root; struct btrfs_fs_devices *fs_devices = fs_info-fs_devices; struct btrfs_device *device; + int dirtied; int ret = 0; mutex_lock(fs_devices-device_list_mutex); list_for_each_entry(device, fs_devices-devices, dev_list) { - if (!device-dev_stats_valid || !device-dev_stats_dirty) + dirtied = atomic_read(device-dev_stats_dirty); + + if (!device-dev_stats_valid || !dirtied) continue; ret = update_dev_stat_item(trans, dev_root, device); if (!ret) - device-dev_stats_dirty = 0; + atomic_sub(dirtied, device-dev_stats_dirty); } mutex_unlock(fs_devices-device_list_mutex); diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 6fcc8ea..9a1eff3 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -110,7 +110,8 @@ struct btrfs_device { /* disk I/O failure stats. 
For detailed description refer to * enum btrfs_dev_stat_values in ioctl.h */ int dev_stats_valid; - int dev_stats_dirty; /* counters need to be written to disk */ + + atomic_t dev_stats_dirty; /* counters need to be written to disk */ atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX]; }; @@ -359,11 +360,18 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root, int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, struct btrfs_root *extent_root, u64 chunk_offset, u64 chunk_size); + +static inline void btrfs_dev_dirty_stat(struct btrfs_device *dev) +{ + smp_mb__before_atomic(); + atomic_inc(dev-dev_stats_dirty); +} + static inline void btrfs_dev_stat_inc(struct btrfs_device *dev, int index) { atomic_inc(dev-dev_stat_values + index); - dev-dev_stats_dirty = 1; + btrfs_dev_dirty_stat(dev); } static inline int btrfs_dev_stat_read(struct btrfs_device *dev, @@ -378,7 +386,7 @@ static inline int btrfs_dev_stat_read_and_reset(struct btrfs_device *dev, int ret; ret = atomic_xchg(dev-dev_stat_values + index, 0); - dev-dev_stats_dirty = 1; + btrfs_dev_dirty_stat(dev); return ret; } @@ -386,7 +394,7 @@ static inline void btrfs_dev_stat_set(struct btrfs_device *dev, int index, unsigned long val) { atomic_set(dev-dev_stat_values + index, val); - dev-dev_stats_dirty = 1; + btrfs_dev_dirty_stat(dev); } static inline void btrfs_dev_stat_reset(struct btrfs_device *dev, -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
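The race being fixed is a lost update: stats can be bumped while btrfs_run_dev_stats() is writing them out, and unconditionally clearing the flag afterwards would discard that bump. With a counter, the flusher subtracts only the amount it observed before writing. A userspace sketch using C11 atomics in place of the kernel's atomic_t (the racing increment is injected synchronously here purely for demonstration):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int dev_stats_dirty;  /* starts at 0 */

static void stat_inc(void)
{
    /* btrfs_dev_dirty_stat() in the patch */
    atomic_fetch_add(&dev_stats_dirty, 1);
}

/* Returns the dirty count remaining after a successful write-out. */
static int flush_stats(int racing_increments)
{
    int dirtied = atomic_load(&dev_stats_dirty);

    /* ... update_dev_stat_item() writes the stats item here; meanwhile
     * another CPU may bump the counter again ... */
    while (racing_increments--)
        stat_inc();

    /* Subtract only what we observed -- not "dirty = 0" -- so a racing
     * increment leaves the device dirty and it gets flushed again. */
    atomic_fetch_sub(&dev_stats_dirty, dirtied);
    return atomic_load(&dev_stats_dirty);
}
```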
Re: [PATCH v2 11/12] Btrfs: implement repair function when direct read fails
On Tue, 2 Sep 2014 09:05:15 -0400, Chris Mason wrote: diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 08e65e9..56b1546 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err) fs_info = end_io_wq->info; end_io_wq->error = err; - btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL); + + if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR)) + btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, + NULL); + else + INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn); It's not clear why this one is using INIT_WORK instead of btrfs_init_work, or why we're calling directly into queue_work instead of btrfs_queue_work. What am I missing? I'm sorry that I forgot to write the explanation in this patch's changelog; I wrote it in Patch 0. 2. When the I/O on the mirror ends, we insert the endio work into the system workqueue, not btrfs's own endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue; if we inserted the endio work of the I/O on the mirror into that same workqueue, a deadlock would happen. Can you elaborate on the deadlock? Now that a buffered read can insert a subsequent read-mirror bio into the btrfs endio workqueue without problems, what's the difference? We do have problems if we're inserting dependent items into the same workqueue. Miao, please make a repair workqueue. I'll also have a use for it in the raid56 parity work, I think. OK, I'll update the patch soon. Thanks Miao
[PATCH 02/18] Btrfs: cleanup double assignment of device-bytes_used when device replace finishes
Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/dev-replace.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index a85b5f5..10dfb41 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -550,7 +550,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, tgt_device->is_tgtdev_for_dev_replace = 0; tgt_device->devid = src_device->devid; src_device->devid = BTRFS_DEV_REPLACE_DEVID; - tgt_device->bytes_used = src_device->bytes_used; memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp)); memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid)); memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid)); -- 1.9.3
[PATCH 06/18] Btrfs: Fix wrong free_chunk_space assignment during removing a device
During device removal, we have already modified free_chunk_space when we shrank the device, so we needn't assign a new value to it after the device shrink. Fix it. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/volumes.c | 5 - 1 file changed, 5 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f8273bb..1524b3f 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1671,11 +1671,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path) if (ret) goto error_undo; - spin_lock(&root->fs_info->free_chunk_lock); - root->fs_info->free_chunk_space = device->total_bytes - - device->bytes_used; - spin_unlock(&root->fs_info->free_chunk_lock); - device->in_fs_metadata = 0; btrfs_scrub_cancel_dev(root->fs_info, device); -- 1.9.3
[PATCH 01/18] Btrfs: cleanup unused num_can_discard in fs_devices
The member variants - num_can_discard - of fs_devices structure are set, but no one use them to do anything. so remove them. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/volumes.c | 16 ++-- fs/btrfs/volumes.h | 1 - 2 files changed, 2 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index e9676a4..483fc6d 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -720,8 +720,6 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices) fs_devices-rw_devices--; } - if (device-can_discard) - fs_devices-num_can_discard--; if (device-missing) fs_devices-missing_devices--; @@ -828,10 +826,8 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices, } q = bdev_get_queue(bdev); - if (blk_queue_discard(q)) { + if (blk_queue_discard(q)) device-can_discard = 1; - fs_devices-num_can_discard++; - } device-bdev = bdev; device-in_fs_metadata = 0; @@ -1835,8 +1831,7 @@ void btrfs_rm_dev_replace_srcdev(struct btrfs_fs_info *fs_info, if (!fs_devices-seeding) fs_devices-rw_devices++; } - if (srcdev-can_discard) - fs_devices-num_can_discard--; + if (srcdev-bdev) { fs_devices-open_devices--; @@ -1886,8 +1881,6 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info, fs_info-fs_devices-open_devices--; } fs_info-fs_devices-num_devices--; - if (tgtdev-can_discard) - fs_info-fs_devices-num_can_discard++; next_device = list_entry(fs_info-fs_devices-devices.next, struct btrfs_device, dev_list); @@ -2008,7 +2001,6 @@ static int btrfs_prepare_sprout(struct btrfs_root *root) fs_devices-num_devices = 0; fs_devices-open_devices = 0; fs_devices-missing_devices = 0; - fs_devices-num_can_discard = 0; fs_devices-rotating = 0; fs_devices-seed = seed_devices; @@ -2200,8 +2192,6 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path) root-fs_info-fs_devices-open_devices++; root-fs_info-fs_devices-rw_devices++; root-fs_info-fs_devices-total_devices++; - if (device-can_discard) - 
root-fs_info-fs_devices-num_can_discard++; root-fs_info-fs_devices-total_rw_bytes += device-total_bytes; spin_lock(root-fs_info-free_chunk_lock); @@ -2371,8 +2361,6 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path, list_add(device-dev_list, fs_info-fs_devices-devices); fs_info-fs_devices-num_devices++; fs_info-fs_devices-open_devices++; - if (device-can_discard) - fs_info-fs_devices-num_can_discard++; mutex_unlock(root-fs_info-fs_devices-device_list_mutex); *device_out = device; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index e894ac6..37f8bff 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -124,7 +124,6 @@ struct btrfs_fs_devices { u64 rw_devices; u64 missing_devices; u64 total_rw_bytes; - u64 num_can_discard; u64 total_devices; struct block_device *latest_bdev; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] Btrfs: restructure btrfs_get_bdev_and_sb and pick up some code used later
Some code in btrfs_get_bdev_and_sb will be re-used by the other function later, so restructure btrfs_get_bdev_and_sb and pick up those code to make a new function. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/volumes.c | 66 +- 1 file changed, 36 insertions(+), 30 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index bcb19d5..9d52fd8 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -193,42 +193,47 @@ static noinline struct btrfs_fs_devices *find_fsid(u8 *fsid) return NULL; } +static int __btrfs_get_sb(struct block_device *bdev, int flush, + struct buffer_head **bh) +{ + int ret; + + if (flush) + filemap_write_and_wait(bdev-bd_inode-i_mapping); + + ret = set_blocksize(bdev, 4096); + if (ret) + return ret; + + invalidate_bdev(bdev); + *bh = btrfs_read_dev_super(bdev); + if (!*bh) + return -EINVAL; + + return 0; +} + static int -btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder, - int flush, struct block_device **bdev, - struct buffer_head **bh) +btrfs_get_bdev_and_sb_by_path(const char *device_path, fmode_t flags, + void *holder, int flush, + struct block_device **bdev, + struct buffer_head **bh) { int ret; *bdev = blkdev_get_by_path(device_path, flags, holder); - if (IS_ERR(*bdev)) { - ret = PTR_ERR(*bdev); printk(KERN_INFO BTRFS: open %s failed\n, device_path); - goto error; + return PTR_ERR(*bdev); } - if (flush) - filemap_write_and_wait((*bdev)-bd_inode-i_mapping); - ret = set_blocksize(*bdev, 4096); + ret = __btrfs_get_sb(*bdev, flush, bh); if (ret) { blkdev_put(*bdev, flags); - goto error; - } - invalidate_bdev(*bdev); - *bh = btrfs_read_dev_super(*bdev); - if (!*bh) { - ret = -EINVAL; - blkdev_put(*bdev, flags); - goto error; + return ret; } return 0; - -error: - *bdev = NULL; - *bh = NULL; - return ret; } static void requeue_list(struct btrfs_pending_bios *pending_bios, @@ -806,8 +811,8 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices, continue; /* Just open everything we 
can; ignore failures here */ - if (btrfs_get_bdev_and_sb(device-name-str, flags, holder, 1, - bdev, bh)) + if (btrfs_get_bdev_and_sb_by_path(device-name-str, flags, + holder, 1, bdev, bh)) continue; disk_super = (struct btrfs_super_block *)bh-b_data; @@ -1629,10 +1634,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path) goto out; } } else { - ret = btrfs_get_bdev_and_sb(device_path, - FMODE_WRITE | FMODE_EXCL, - root-fs_info-bdev_holder, 0, - bdev, bh); + ret = btrfs_get_bdev_and_sb_by_path(device_path, + FMODE_WRITE | FMODE_EXCL, + root-fs_info-bdev_holder, + 0, bdev, bh); if (ret) goto out; disk_super = (struct btrfs_super_block *)bh-b_data; @@ -1906,8 +1911,9 @@ static int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path, struct buffer_head *bh; *device = NULL; - ret = btrfs_get_bdev_and_sb(device_path, FMODE_READ, - root-fs_info-bdev_holder, 0, bdev, bh); + ret = btrfs_get_bdev_and_sb_by_path(device_path, FMODE_READ, + root-fs_info-bdev_holder, 0, + bdev, bh); if (ret) return ret; disk_super = (struct btrfs_super_block *)bh-b_data; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/18] Btrfs: fix unprotected assignment of the target device
We didn't protect the assignment of the target device, it might cause the problem that the super block update was skipped because we might find wrong size of the target device during the assignment. Fix it by moving the assignment sentences into the initialization function of the target device. And there is another merit that we can check if the target device is suitable more early. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/dev-replace.c | 32 fs/btrfs/volumes.c | 23 +++ fs/btrfs/volumes.h | 1 + 3 files changed, 28 insertions(+), 28 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 10dfb41..72dc02e 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -330,29 +330,19 @@ int btrfs_dev_replace_start(struct btrfs_root *root, return -EINVAL; mutex_lock(fs_info-volume_mutex); - ret = btrfs_init_dev_replace_tgtdev(root, args-start.tgtdev_name, - tgt_device); - if (ret) { - btrfs_err(fs_info, target device %s is invalid!, - args-start.tgtdev_name); - mutex_unlock(fs_info-volume_mutex); - return -EINVAL; - } - ret = btrfs_dev_replace_find_srcdev(root, args-start.srcdevid, args-start.srcdev_name, src_device); - mutex_unlock(fs_info-volume_mutex); if (ret) { - ret = -EINVAL; - goto leave_no_lock; + mutex_unlock(fs_info-volume_mutex); + return ret; } - if (tgt_device-total_bytes src_device-total_bytes) { - btrfs_err(fs_info, target device is smaller than source device!); - ret = -EINVAL; - goto leave_no_lock; - } + ret = btrfs_init_dev_replace_tgtdev(root, args-start.tgtdev_name, + src_device, tgt_device); + mutex_unlock(fs_info-volume_mutex); + if (ret) + return ret; btrfs_dev_replace_lock(dev_replace); switch (dev_replace-replace_state) { @@ -380,10 +370,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root, src_device-devid, rcu_str_deref(tgt_device-name)); - tgt_device-total_bytes = src_device-total_bytes; - tgt_device-disk_total_bytes = src_device-disk_total_bytes; - tgt_device-bytes_used = 
src_device-bytes_used; - /* * from now on, the writes to the srcdev are all duplicated to * go to the tgtdev as well (refer to btrfs_map_block()). @@ -426,9 +412,7 @@ leave: dev_replace-srcdev = NULL; dev_replace-tgtdev = NULL; btrfs_dev_replace_unlock(dev_replace); -leave_no_lock: - if (tgt_device) - btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device); + btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device); return ret; } diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 483fc6d..1646659 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -2295,6 +2295,7 @@ error: } int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path, + struct btrfs_device *srcdev, struct btrfs_device **device_out) { struct request_queue *q; @@ -2307,24 +2308,37 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path, int ret = 0; *device_out = NULL; - if (fs_info-fs_devices-seeding) + if (fs_info-fs_devices-seeding) { + btrfs_err(fs_info, the filesystem is a seed filesystem!); return -EINVAL; + } bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL, fs_info-bdev_holder); - if (IS_ERR(bdev)) + if (IS_ERR(bdev)) { + btrfs_err(fs_info, target device %s is invalid!, device_path); return PTR_ERR(bdev); + } filemap_write_and_wait(bdev-bd_inode-i_mapping); devices = fs_info-fs_devices-devices; list_for_each_entry(device, devices, dev_list) { if (device-bdev == bdev) { + btrfs_err(fs_info, target device is in the filesystem!); ret = -EEXIST; goto error; } } + + if (i_size_read(bdev-bd_inode) srcdev-total_bytes) { + btrfs_err(fs_info, target device is smaller than source device!); + ret = -EINVAL; + goto error; + } + + device = btrfs_alloc_device(NULL, devid, NULL); if (IS_ERR(device)) { ret = PTR_ERR(device); @@ -2348,8 +2362,9 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root
[PATCH 3/5] Btrfs: restructure btrfs_scan_one_device
Some code in btrfs_scan_one_device will be re-used by the other function later, so restructure btrfs_scan_one_device and pick up those code to make a new function. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/volumes.c | 57 +++--- 1 file changed, 33 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 740a4f9..bcb19d5 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -885,24 +885,18 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices, return ret; } -/* - * Look for a btrfs signature on a device. This may be called out of the mount path - * and we are not allowed to call set_blocksize during the scan. The superblock - * is read via pagecache - */ -int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, - struct btrfs_fs_devices **fs_devices_ret) +static int __scan_device(struct block_device *bdev, const char *path, +struct btrfs_fs_devices **fs_devices_ret) { struct btrfs_super_block *disk_super; - struct block_device *bdev; struct page *page; void *p; - int ret = -EINVAL; u64 devid; u64 transid; u64 total_devices; u64 bytenr; pgoff_t index; + int ret; /* * we would like to check all the supers, but that would make @@ -911,38 +905,30 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, * later supers, using BTRFS_SUPER_MIRROR_MAX instead */ bytenr = btrfs_sb_offset(0); - flags |= FMODE_EXCL; - mutex_lock(uuid_mutex); - - bdev = blkdev_get_by_path(path, flags, holder); - - if (IS_ERR(bdev)) { - ret = PTR_ERR(bdev); - goto error; - } /* make sure our super fits in the device */ if (bytenr + PAGE_CACHE_SIZE = i_size_read(bdev-bd_inode)) - goto error_bdev_put; + return -EINVAL; /* make sure our super fits in the page */ if (sizeof(*disk_super) PAGE_CACHE_SIZE) - goto error_bdev_put; + return -EINVAL; /* make sure our super doesn't straddle pages on disk */ index = bytenr PAGE_CACHE_SHIFT; if ((bytenr + sizeof(*disk_super) - 1) PAGE_CACHE_SHIFT != index) - goto 
error_bdev_put; + return -EINVAL; /* pull in the page with our super */ page = read_cache_page_gfp(bdev-bd_inode-i_mapping, index, GFP_NOFS); if (IS_ERR_OR_NULL(page)) - goto error_bdev_put; + return -ENOMEM; - p = kmap(page); + ret = -EINVAL; + p = kmap(page); /* align our pointer to the offset of the super block */ disk_super = p + (bytenr ~PAGE_CACHE_MASK); @@ -974,7 +960,30 @@ error_unmap: kunmap(page); page_cache_release(page); -error_bdev_put: + return ret; +} + +/* + * Look for a btrfs signature on a device. This may be called out of the mount path + * and we are not allowed to call set_blocksize during the scan. The superblock + * is read via pagecache + */ +int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, + struct btrfs_fs_devices **fs_devices_ret) +{ + struct block_device *bdev; + int ret; + + flags |= FMODE_EXCL; + + mutex_lock(uuid_mutex); + bdev = blkdev_get_by_path(path, flags, holder); + if (IS_ERR(bdev)) { + ret = PTR_ERR(bdev); + goto error; + } + + ret = __scan_device(bdev, path, fs_devices_ret); blkdev_put(bdev, flags); error: mutex_unlock(uuid_mutex); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
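The three pagecache sanity checks that the patch keeps in the scan path can be captured as one predicate: the super block must fit in the device, fit in a single page, and not straddle a page boundary. A hedged model (fixed 4 KiB pages and simplified return codes; `super_read_ok` is an illustrative name, not a kernel function):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_MODEL 4096ULL

/* Returns 0 if the super block at byte offset `bytenr` may be read
 * through the pagecache, -1 otherwise. Mirrors the checks in the
 * (renamed) __scan_device() from this patch. */
static int super_read_ok(uint64_t bytenr, uint64_t super_size,
                         uint64_t dev_size)
{
    if (bytenr + PAGE_SIZE_MODEL >= dev_size)   /* fits in the device */
        return -1;
    if (super_size > PAGE_SIZE_MODEL)           /* fits in one page */
        return -1;
    if ((bytenr / PAGE_SIZE_MODEL) !=           /* doesn't straddle pages */
        ((bytenr + super_size - 1) / PAGE_SIZE_MODEL))
        return -1;
    return 0;
}
```

The first primary super mirror lives at offset 65536 (btrfs_sb_offset(0)), which is naturally page-aligned, so on a reasonably sized device all three checks pass.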
[PATCH 16/18] Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs
When we open a seed filesystem, if the degraded mount option is set, we continue mounting the fs even if we don't find some devices in the seed filesystem. But we should stop mounting if any other error happens. Fix it. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- fs/btrfs/volumes.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index fd8141e..cc59fcb 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -6093,7 +6093,7 @@ static int read_one_dev(struct btrfs_root *root, if (memcmp(fs_uuid, root->fs_info->fsid, BTRFS_UUID_SIZE)) { ret = open_seed_devices(root, fs_uuid); - if (ret && !btrfs_test_opt(root, DEGRADED)) + if (ret && !(ret == -ENOENT && btrfs_test_opt(root, DEGRADED))) return ret; } -- 1.9.3
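The new condition reads as: abort the mount on any error, unless the error is exactly -ENOENT and the filesystem was mounted with -o degraded. As a tiny illustrative helper (not a kernel function):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* true if the error returned by open_seed_devices() must abort the mount */
static bool seed_open_fatal(int ret, bool degraded_mount)
{
    /* Only a missing seed device (-ENOENT) on a degraded mount is
     * tolerated; -EIO, -ENOMEM, etc. always abort. */
    return ret != 0 && !(ret == -ENOENT && degraded_mount);
}
```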
[PATCH 14/18] Btrfs: fix use-after-free problem of the device during device replace
The problem is:

	Task0 (device scan task):		Task1 (device replace task):
	scan_one_device()
	mutex_lock(&uuid_mutex)
	device = find_device()
						mutex_lock(&device_list_mutex)
						lock_chunks()
						rm_and_free_source_device
						unlock_chunks()
						mutex_unlock(&device_list_mutex)
	check device

The device Task0 found can be freed by Task1 between find_device() and the
check, because Task1 never takes uuid_mutex. Destroying the target device
when the device replace fails has the same problem.

Fix this by holding uuid_mutex while destroying the source or the target
device, just as the device remove operation does. This is a temporary
solution; in the future we can fix the problem and make the code clearer
by using an atomic counter.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/dev-replace.c | 3 +++
 fs/btrfs/volumes.c     | 4 +++-
 fs/btrfs/volumes.h     | 2 ++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index aa4c828..e9cbbdb 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -509,6 +509,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	ret = btrfs_commit_transaction(trans, root);
 	WARN_ON(ret);
 
+	mutex_lock(&uuid_mutex);
 	/* keep away write_all_supers() during the finishing procedure */
 	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
 	mutex_lock(&root->fs_info->chunk_mutex);
@@ -536,6 +537,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 		btrfs_dev_replace_unlock(dev_replace);
 		mutex_unlock(&root->fs_info->chunk_mutex);
 		mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+		mutex_unlock(&uuid_mutex);
 		if (tgt_device)
 			btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 		mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
@@ -591,6 +593,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
 	 */
 	mutex_unlock(&root->fs_info->chunk_mutex);
 	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+	mutex_unlock(&uuid_mutex);
 
 	/* write back the superblocks */
 	trans = btrfs_start_transaction(root, 0);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f0173b1..24d7001 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -50,7 +50,7 @@
 static void __btrfs_reset_dev_stats(struct btrfs_device *dev);
 static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev);
 static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
-static DEFINE_MUTEX(uuid_mutex);
+DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 
 static void lock_chunks(struct btrfs_root *root)
@@ -1867,6 +1867,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 {
 	struct btrfs_device *next_device;
 
+	mutex_lock(&uuid_mutex);
 	WARN_ON(!tgtdev);
 	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	if (tgtdev->bdev) {
@@ -1886,6 +1887,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	call_rcu(&tgtdev->rcu, free_device);
 
 	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	mutex_unlock(&uuid_mutex);
 }
 
 static int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 76600a3..2b37da3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -24,6 +24,8 @@
 #include <linux/btrfs.h>
 #include "async-thread.h"
 
+extern struct mutex uuid_mutex;
+
 #define BTRFS_STRIPE_LEN	(64 * 1024)
 
 struct buffer_head;
-- 
1.9.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 08/18] Btrfs: update free_chunk_space during allocating a new chunk
We should update free_chunk_space as soon as we allocate a new chunk, not
when we deal with the pending device update and block group insertion,
because we need the real free_chunk_space value to calculate the reserved
space. If we don't update it in time, we treat disk space that has already
been allocated as free space and use it for overcommit reservation. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 45e0b5d..d8e4a3d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4432,6 +4432,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	for (i = 0; i < map->num_stripes; i++)
 		map->stripes[i].dev->bytes_used += stripe_size;
 
+	spin_lock(&extent_root->fs_info->free_chunk_lock);
+	extent_root->fs_info->free_chunk_space -= (stripe_size *
+						   map->num_stripes);
+	spin_unlock(&extent_root->fs_info->free_chunk_lock);
+
 	free_extent_map(em);
 	check_raid56_incompat_flag(extent_root->fs_info, type);
 
@@ -4515,11 +4520,6 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 		goto out;
 	}
 
-	spin_lock(&extent_root->fs_info->free_chunk_lock);
-	extent_root->fs_info->free_chunk_space -= (stripe_size *
-						   map->num_stripes);
-	spin_unlock(&extent_root->fs_info->free_chunk_lock);
-
 	stripe = &chunk->stripe;
 	for (i = 0; i < map->num_stripes; i++) {
 		device = map->stripes[i].dev;
-- 
1.9.3
[PATCH 2/5] Btrfs: don't return btrfs_fs_devices if the caller doesn't want it
We are going to make the filesystem scan all the devices in the system
and build the device set for btrfs itself. In that case, we don't need a
btrfs_fs_devices pointer back when adding a device to the list. This patch
makes the fs_devices_ret argument of device_list_add() optional.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1aacf5f..740a4f9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -568,7 +568,8 @@ static noinline int device_list_add(const char *path,
 	if (!fs_devices->opened)
 		device->generation = found_transid;
 
-	*fs_devices_ret = fs_devices;
+	if (fs_devices_ret)
+		*fs_devices_ret = fs_devices;
 
 	return ret;
 }
-- 
1.9.3
[PATCH 10/18] Btrfs: fix unprotected system chunk array insertion
We didn't protect the system chunk array when we added a new system chunk
to it, so the array could be corrupted if someone removed or added system
chunks at the same time. Fix it by taking the chunk lock.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41da102..9f22398d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4054,10 +4054,13 @@ static int btrfs_add_system_chunk(struct btrfs_root *root,
 	u32 array_size;
 	u8 *ptr;
 
+	lock_chunks(root);
 	array_size = btrfs_super_sys_array_size(super_copy);
 	if (array_size + item_size + sizeof(disk_key)
-	    > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE)
+	    > BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) {
+		unlock_chunks(root);
 		return -EFBIG;
+	}
 
 	ptr = super_copy->sys_chunk_array + array_size;
 	btrfs_cpu_key_to_disk(&disk_key, key);
@@ -4066,6 +4069,8 @@ static int btrfs_add_system_chunk(struct btrfs_root *root,
 	memcpy(ptr, chunk, item_size);
 	item_size += sizeof(disk_key);
 	btrfs_set_super_sys_array_size(super_copy, array_size + item_size);
+	unlock_chunks(root);
+
 	return 0;
 }
-- 
1.9.3
[PATCH RFC 0/5] Scan all devices to build fs device list
This patchset implements automatic building of the device list. Currently
we need to scan the devices with a user-space tool to build the device
list before mounting the filesystem, especially after re-installing the
btrfs module. That is not convenient. This patchset improves the
situation.

With this patchset, we scan all the devices in the system to build the
device list if we find at mount time that the number of devices is not
right. This way, we don't need to scan the devices with the user-space
tool, and we reduce the probability of mount failure due to an incomplete
device list.
---
Miao Xie (5):
  block: export disk_class and disk_type for btrfs
  Btrfs: don't return btrfs_fs_devices if the caller doesn't want it
  Btrfs: restructure btrfs_scan_one_device
  Btrfs: restructure btrfs_get_bdev_and_sb and pick up some code used later
  Btrfs: scan all the devices and build the fs device list by btrfs's self

 block/genhd.c         |   7 +-
 fs/btrfs/super.c      |   3 +
 fs/btrfs/volumes.c    | 227 ++++++++++++++++++++++++++++++----------
 fs/btrfs/volumes.h    |   5 +-
 include/linux/genhd.h |   1 +
 5 files changed, 177 insertions(+), 66 deletions(-)
-- 
1.9.3
[PATCH 11/18] Btrfs: fix unprotected device list access when getting the fs information
When we get the fs information, we forgot to acquire the device list
mutex, so we might access a device that was being removed. Fix it by
acquiring the device list mutex.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/super.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 089991d..6b98358 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1703,7 +1703,11 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	struct btrfs_block_rsv *block_rsv = &fs_info->global_block_rsv;
 	int ret;
 
-	/* holding chunk_muext to avoid allocating new chunks */
+	/*
+	 * holding chunk_mutex to avoid allocating new chunks, holding
+	 * device_list_mutex to avoid the device being removed
+	 */
+	mutex_lock(&fs_info->fs_devices->device_list_mutex);
 	mutex_lock(&fs_info->chunk_mutex);
 	rcu_read_lock();
 	list_for_each_entry_rcu(found, head, list) {
@@ -1744,11 +1748,13 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	ret = btrfs_calc_avail_data_space(fs_info->tree_root, &total_free_data);
 	if (ret) {
 		mutex_unlock(&fs_info->chunk_mutex);
+		mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 		return ret;
 	}
 
 	buf->f_bavail += div_u64(total_free_data, factor);
 	buf->f_bavail = buf->f_bavail >> bits;
 	mutex_unlock(&fs_info->chunk_mutex);
+	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
 
 	buf->f_type = BTRFS_SUPER_MAGIC;
 	buf->f_bsize = dentry->d_sb->s_blocksize;
-- 
1.9.3