Re: [PATCH] Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.

2015-04-09 Thread Miao Xie
On Thu, 09 Apr 2015 12:08:43 +0800, Dongsheng Yang wrote:
 We need to fill the inode when we find a node for it in delayed_nodes_tree.
 But currently we do not fill ->last_trans, which makes xfstests
 generic/311 fail. The scenario of 311 is shown below:
 
 Problem:
   (1). test_fd = open(fname, O_RDWR|O_DIRECT)
   (2). pwrite(test_fd, buf, 4096, 0)
   (3). close(test_fd)
   (4). drop_all_caches()   # echo 3 > /proc/sys/vm/drop_caches
   (5). test_fd = open(fname, O_RDWR|O_DIRECT)
   (6). fsync(test_fd);
    => we did not get the correct log entry
 for the file
 Reason:
   When we re-open this file in (5), we find a node in
 delayed_nodes_tree and fill the inode we are looking up with its
 information. But ->last_trans is not filled, so fsync() checks
 ->last_trans, finds it is 0, and concludes that this inode is already
 in our tree and committed, so it does not record the extents for it.
 
 Fix:
   This patch fills ->last_trans properly and sets the
 runtime_flags if needed in this situation. Then we get the
 log entries we expect after (6), and generic/311 passes.
 
 Signed-off-by: Dongsheng Yang yangds.f...@cn.fujitsu.com

Good catch!

Reviewed-by: Miao Xie miao...@huawei.com

 ---
  fs/btrfs/delayed-inode.c |  2 ++
  fs/btrfs/inode.c | 21 ++++++++++++---------
  2 files changed, 14 insertions(+), 9 deletions(-)
 
 diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
 index 82f0c7c..9e8b435 100644
 --- a/fs/btrfs/delayed-inode.c
 +++ b/fs/btrfs/delayed-inode.c
 @@ -1801,6 +1801,8 @@ int btrfs_fill_inode(struct inode *inode, u32 *rdev)
   set_nlink(inode, btrfs_stack_inode_nlink(inode_item));
   inode_set_bytes(inode, btrfs_stack_inode_nbytes(inode_item));
   BTRFS_I(inode)->generation = btrfs_stack_inode_generation(inode_item);
 +BTRFS_I(inode)->last_trans = btrfs_stack_inode_transid(inode_item);
 +
   inode->i_version = btrfs_stack_inode_sequence(inode_item);
   inode->i_rdev = 0;
   *rdev = btrfs_stack_inode_rdev(inode_item);
 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index d2e732d..b132936 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -3628,25 +3628,28 @@ static void btrfs_read_locked_inode(struct inode *inode)
   BTRFS_I(inode)->generation = btrfs_inode_generation(leaf, inode_item);
   BTRFS_I(inode)->last_trans = btrfs_inode_transid(leaf, inode_item);
  
 + inode->i_version = btrfs_inode_sequence(leaf, inode_item);
 + inode->i_generation = BTRFS_I(inode)->generation;
 + inode->i_rdev = 0;
 + rdev = btrfs_inode_rdev(leaf, inode_item);
 +
 + BTRFS_I(inode)->index_cnt = (u64)-1;
 + BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
 +
 +cache_index:
   /*
    * If we were modified in the current generation and evicted from memory
    * and then re-read we need to do a full sync since we don't have any
    * idea about which extents were modified before we were evicted from
    * cache.
 +  *
 +  * This is required for both inode re-read from disk and delayed inode
 +  * in delayed_nodes_tree.
    */
   if (BTRFS_I(inode)->last_trans == root->fs_info->generation)
   	set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
   		&BTRFS_I(inode)->runtime_flags);
  
 - inode->i_version = btrfs_inode_sequence(leaf, inode_item);
 - inode->i_generation = BTRFS_I(inode)->generation;
 - inode->i_rdev = 0;
 - rdev = btrfs_inode_rdev(leaf, inode_item);
 -
 - BTRFS_I(inode)->index_cnt = (u64)-1;
 - BTRFS_I(inode)->flags = btrfs_inode_flags(leaf, inode_item);
 -
 -cache_index:
   path->slots[0]++;
   if (inode->i_nlink != 1 ||
       path->slots[0] >= btrfs_header_nritems(leaf))
 


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC v6 6/9] vfs: Add sb_want_write() function to get vfsmount from a given sb.

2015-02-04 Thread Miao Xie
On Wed, 04 Feb 2015 10:10:55 +0800, Qu Wenruo wrote:
 *** Please DON'T merge this patch, it's only for discussion purposes ***
 
 There are sysfs interfaces in some filesystems (only btrfs so far)
 which will modify on-disk data.
 Unlike the normal file operation routines, where we can use
 mnt_want_write_file() to protect the operation, a change made through
 sysfs is not bound to any file in the filesystem.
 
 So introduce a new sb_want_write() to do the protection against a super
 block, which acts much like mnt_want_write() but only requires that the
 super block is read-write.
 
 Since the sysfs handler doesn't go through a normal vfsmount, it won't
 increase the refcount, so even while sb_want_write() is waiting for the
 sb to be unfrozen, the fs can still be unmounted without complaint,
 leaving the module unable to be removed and the user unable to find out
 what's wrong until 
 
 There are different strategies to solve this problem.
 1) Extra check on the last-instance umount of a sb
 This is the method this patch uses.
 It seems valid enough: since we want write protection on a sb, it's OK
 for the sb as long as there is *ANY* mount instance.
 Problem 1.1)
 lsof and other tools won't help if sb_want_write() on a frozen fs
 causes it to be unable to be unmounted.
 
 Problem 1.2)
 When namespaces get involved, things get more complicated,
 like in the following case:
   Alice                                 |  Bob
 mount devA on /mnt1 in her ns           |  mount devA on /mnt2/ in his ns
 freeze /mnt1                            |
 sb_want_write() (waiting)               |
 umount /mnt1 (succeeds since there is   |
 another mount instance)                 |
                                         |  umount /mnt2 (fails since there
                                         |  is sb_want_write() waiting)
 
 So Alice can't thaw the fs since there is no mount point for it now.
 
 2) Don't allow any umount of the sb if there is an sb_want_write().
 A more aggressive one, proposed by Miao Xie.
 It can't resolve problem 1.1) but does solve problem 1.2).

This is one of the two methods that I told you about, but not the one I
recommended. What I wanted to recommend is to thaw the fs at the
beginning of the sb kill process, and in sb_want_write(), check whether
the sb is active after we pass sb_start_write(); if the sb is not
active, back out.
(This way is also not so good, but better than the above one.)

 Although it introduces a new problem like the following:
   Alice
 mount devA on /mnt1
 freeze /mnt1
 sb_want_write() (waiting)
 mount devA on /mnt2 and /mnt3
 
 /mnt[123] all can't be unmounted, but new mounts can still be created.
 
 3) sb_want_write() doesn't make any sense and breaks VFS rules!
 Actions which change on-disk data should not be tunable through sysfs,
 and an sb_want_write() that bypasses all the VFS checks is just evil.
 And for btrfs, we already have the ioctl to set the label, why bother
 with a new sysfs interface to do it again?
 
 Although I use method 1) to do it, I am still not certain which
 method is the correct one.
 
 So any advise is welcomed.
 
 Thanks,
 Qu

[SNIP]

 +/**
 + * sb_want_write - get write access to a super block
 + * @sb: the superblock of the filesystem
 + *
 + * This tells the low-level filesystem that a write is about to be performed
 + * to it, and makes sure that the writes are allowed (superblock is
 + * read-write, filesystem is not frozen) before returning success.
 + * When the write operation is finished, sb_drop_write() must be called.
 + * This is much like mnt_want_write() as a refcount, but only needs
 + * the superblock to be read-write.
 + */
 +int sb_want_write(struct super_block *sb)
 +{
 +	spin_lock(&sb->s_want_write_lock);
 +	if (sb->s_want_write_block) {
 +		spin_unlock(&sb->s_want_write_lock);
 +		return -EBUSY;
 +	}
 +	sb->s_want_write_count++;
 +	spin_unlock(&sb->s_want_write_lock);
 +
 +	sb_start_write(sb);
 +	if (sb->s_readonly_remount || (sb->s_flags & MS_RDONLY)) {

If someone remounts the fs R/O here (after the check), we should not
continue to change the label/features. I think we need to add some
checks in the remount functions.

Thanks
Miao


Re: [PATCH v5 0/9] btrfs: Fix freeze/sysfs deadlock in better method.

2015-01-30 Thread Miao Xie
On Fri, 30 Jan 2015 20:17:49 +0100, David Sterba wrote:
 On Fri, Jan 30, 2015 at 05:20:45PM +0800, Qu Wenruo wrote:
 [Use VFS protect for sysfs change]
 The 6th patch will introduce a new helper function sb_want_write() to
 claim write permission on a superblock.
 With this, we are able to do write protection like mnt_want_write(),
 but it only needs to ensure that the superblock is writable.
 This also keeps the same synchronized behavior as using the ioctl,
 which will block on a frozen fs until it is unfrozen.
 
 You know what I think about the commit inside sysfs, but it looks better
 to me now with the sb_* protections so I give it a go.
 

I'm worried about the following case:

# fsfreeze btrfs
# echo "new label" > btrfs_sysfs
It should hang.


On the other terminal
# umount btrfs


Because the 2nd echo command didn't increase the mount reference,
umount would not know someone is still blocked on the fs. It would not
bail out and return EBUSY as it does when someone accesses the fs
through the common fs interfaces; it would deactivate the fs directly
and then block on the sysfs removal.


Thanks
Miao


Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in an atomic way

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
 
  Original Message 
 Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse
 mount option in an atomic way
 From: Miao Xie miao...@huawei.com
 To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org
 Date: 2015-01-30 09:29
 On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
 Here we need ACCESS_ONCE() to wrap info->mount_opt, or the compiler
 might use info->mount_opt instead of new_opt.
 Thanks for pointing out this one.
 But I worry that this is not the key reason for the wrong space cache.
 Could you explain the race condition which caused the wrong space cache?

 Thanks
 Miao
 CPU0:                                    CPU1:
 remount()
 |- sync_fs()  <- after sync_fs() we can start a new trans
 |- btrfs_parse_options()
                                          |- start_transaction()
                                          |- do some bg allocation, not
                                             recorded in space_cache.
 I think it is a bug if free space is not recorded in the space cache.
 Could you explain why it is not recorded?

 Thanks
 Miao
 IIRC, in that window, the SPACE_CACHE bit of fs_info->mount_opt is
 cleared, so the space cache is not recorded.

SPACE_CACHE is used to control cache write-out, not the in-memory cache.
All the free space should be recorded in the in-memory cache. And when
we write out the in-memory space cache, we need to protect it from
changing.

Thanks
Miao

 
 Thanks,
 Qu

   |- set SPACE_CACHE bit due to cache_gen
 
                                          |- commit_transaction()
                                          |- write space cache and update
                                             cache_gen; but since some of it
                                             is not recorded in the space
                                             cache, the space cache is
                                             missing some records.
   |- clear SPACE_CACHE bit due to nospace_cache
 
 So the space cache is wrong.

 Thanks,
 Qu
 +}
kfree(orig);
return ret;
}

 .

 
 



Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in an atomic way

2015-01-29 Thread Miao Xie
On Thu, 29 Jan 2015 10:24:35 +0800, Qu Wenruo wrote:
 Current btrfs_parse_options() is not atomic; it can set and then clear
 a bit, especially in the nospace_cache case.
 
 For example, if a fs is mounted with nospace_cache,
 btrfs_parse_options() will set the SPACE_CACHE bit first (since
 cache_generation is non-zero) and then clear the SPACE_CACHE bit due to
 the nospace_cache mount option.
 So under heavy operations and a remount of a nospace_cache btrfs, there
 is a window for a commit to create the space cache.
 
 This bug can be reproduced by fstests btrfs/071 073 074 with the
 nospace_cache mount option. There is about a 50% chance to create the
 space cache, and about a 10% chance to create a wrong space cache,
 which can't pass btrfsck.
 
 This patch does the mount option parsing in a copy-and-update manner:
 first copy mount_opt from fs_info to new_opt, and only update options
 in new_opt. At last, copy new_opt back to fs_info->mount_opt.
 
 This patch is already good enough to fix the above nospace_cache +
 remount bug, but a later patch is needed to make sure mount options do
 not change during a transaction.
 
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
  fs/btrfs/ctree.h |  16 
  fs/btrfs/super.c | 115 +--
  2 files changed, 69 insertions(+), 62 deletions(-)
 
 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 5f99743..26bb47b 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -2119,18 +2119,18 @@ struct btrfs_ioctl_defrag_range_args {
  #define btrfs_test_opt(root, opt)	((root)->fs_info->mount_opt & \
  					 BTRFS_MOUNT_##opt)
  
 -#define btrfs_set_and_info(root, opt, fmt, args...)	\
 +#define btrfs_set_and_info(fs_info, val, opt, fmt, args...)	\
  {\
 -	if (!btrfs_test_opt(root, opt))	\
 -		btrfs_info(root->fs_info, fmt, ##args);	\
 -	btrfs_set_opt(root->fs_info->mount_opt, opt);	\
 +	if (!btrfs_raw_test_opt(val, opt))	\
 +		btrfs_info(fs_info, fmt, ##args);	\
 +	btrfs_set_opt(val, opt);	\
  }
  
 -#define btrfs_clear_and_info(root, opt, fmt, args...)	\
 +#define btrfs_clear_and_info(fs_info, val, opt, fmt, args...)	\
  {\
 -	if (btrfs_test_opt(root, opt))	\
 -		btrfs_info(root->fs_info, fmt, ##args);	\
 -	btrfs_clear_opt(root->fs_info->mount_opt, opt);	\
 +	if (btrfs_raw_test_opt(val, opt))	\
 +		btrfs_info(fs_info, fmt, ##args);	\
 +	btrfs_clear_opt(val, opt);	\
  }
  
  /*
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index b0c45b2..490fe1f 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -395,10 +395,13 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
   int ret = 0;
   char *compress_type;
   bool compress_force = false;
 + unsigned long new_opt;
 +
 + new_opt = info->mount_opt;

Here and

  
   cache_gen = btrfs_super_cache_generation(root->fs_info->super_copy);
   if (cache_gen)
 - btrfs_set_opt(info->mount_opt, SPACE_CACHE);
[SNIP]
  out:
 - if (!ret && btrfs_test_opt(root, SPACE_CACHE))
 - btrfs_info(root->fs_info, "disk space caching is enabled");
 + if (!ret) {
 + if (btrfs_raw_test_opt(new_opt, SPACE_CACHE))
 + btrfs_info(info, "disk space caching is enabled");
 + info->mount_opt = new_opt;

Here we need ACCESS_ONCE() to wrap info->mount_opt, or the compiler
might use info->mount_opt instead of new_opt.

But I worry that this is not the key reason for the wrong space cache.
Could you explain the race condition which caused the wrong space cache?

Thanks
Miao

 + }
   kfree(orig);
   return ret;
  }
 



Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse mount option in an atomic way

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 10:51:52 +0800, Qu Wenruo wrote:
 
  Original Message 
 Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse
 mount option in an atomic way
 From: Miao Xie miao...@huawei.com
 To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org
 Date: 2015-01-30 10:06
 On Fri, 30 Jan 2015 09:33:17 +0800, Qu Wenruo wrote:
  Original Message 
 Subject: Re: [PATCH RESEND v4 2/8] btrfs: Make btrfs_parse_options() parse
 mount option in an atomic way
 From: Miao Xie miao...@huawei.com
 To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org
 Date: 2015-01-30 09:29
 On Fri, 30 Jan 2015 09:20:46 +0800, Qu Wenruo wrote:
 Here we need ACCESS_ONCE() to wrap info->mount_opt, or the compiler
 might use info->mount_opt instead of new_opt.
 Thanks for pointing out this one.
 But I worry that this is not the key reason for the wrong space cache.
 Could you explain the race condition which caused the wrong space cache?

 Thanks
 Miao
 CPU0:                                    CPU1:
 remount()
 |- sync_fs()  <- after sync_fs() we can start a new trans
 |- btrfs_parse_options()
                                          |- start_transaction()
                                          |- do some bg allocation, not
                                             recorded in space_cache.
 I think it is a bug if free space is not recorded in the space cache.
 Could you explain why it is not recorded?

 Thanks
 Miao
 IIRC, in that window, the SPACE_CACHE bit of fs_info->mount_opt is
 cleared, so the space cache is not recorded.
 SPACE_CACHE is used to control cache write-out, not the in-memory
 cache. All the free space should be recorded in the in-memory cache.
 And when we write out the in-memory space cache, we need to protect it
 from changing.

 Thanks
 Miao
 You're right, the wrong space cache problem is not caused by the
 non-atomic mount option change.
 But the atomic mount option change with per-transaction mount options
 will at least make it disappear when using the nospace_cache mount
 option.

But we need to fix the problem, not hide it.

Thanks
Miao

 
 Thanks,
 Qu

 Thanks,
 Qu
   |- set SPACE_CACHE bit due to cache_gen
 
                                          |- commit_transaction()
                                          |- write space cache and update
                                             cache_gen; but since some of it
                                             is not recorded in the space
                                             cache, the space cache is
                                             missing some records.
   |- clear SPACE_CACHE bit due to nospace_cache
 
 So the space cache is wrong.

 Thanks,
 Qu
 +}
 kfree(orig);
 return ret;
 }

 .


 
 .
 



Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 04:37:14 +, Al Viro wrote:
 On Fri, Jan 30, 2015 at 12:14:24PM +0800, Miao Xie wrote:
 On Fri, 30 Jan 2015 02:14:45 +, Al Viro wrote:
 On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:

 This shouldn't happen. If someone is ro, the whole fs should be ro, right?

 Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
 As for that trylock...  What for?  It invites transient failures for no
 good reason.  Removal of sysfs entry will block while write(2) to that 
 sucker
 is in progress, so btrfs shutdown will block at that point in ctree_close().
 It won't go away under you.

 Could you explain the race condition? I think the deadlock won't
 happen: during the btrfs shutdown we hold s_umount, the write operation
 will fail to lock it and quit quickly, and then umount will continue.
 
   First of all, ->s_umount is not a mutex; it's an rwsem.  So you mean
 down_read_trylock().  As for the transient failures - grep for down_write
 on it...  E.g. have somebody call mount() from the same device.  We call
 sget(), which finds the existing superblock and calls grab_super().  Sure,
 that ->s_umount will be released shortly, but in the meanwhile your trylock
 will fail...

I know it, so I suggested returning -EBUSY in the previous mail.
I think it is an acceptable method; mount/umount operations are not so
frequent after all.

Thanks
Miao

 
 I think sb_want_write() is similar to trylock(s_umount); the
 difference is that sb_want_write() is more complex.


 Now, you might want to move those sysfs entry removals to the very beginning
 of btrfs_kill_super(), but that's a different story - you need only to make
 sure that they are removed not later than the destruction of the data
 structures they need (IOW, the current location might very well be OK - I
 hadn't checked the details).

 Yes, we need to move those sysfs entry removals, but they needn't move
 to the very beginning of btrfs_kill_super(), just to the beginning of
 close_ctree().
 
 So move them...  It's a matter of moving one function call around a bit.
 .
 



Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 10:02:26 +0800, Qu Wenruo wrote:
 
  Original Message 
 Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
 vfsmount from a given sb.
 From: Qu Wenruo quwen...@cn.fujitsu.com
 To: Miao Xie miao...@huawei.com, linux-btrfs@vger.kernel.org
 Date: 2015-01-30 09:44

  Original Message 
 Subject: Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get
 vfsmount from a given sb.
 From: Miao Xie miao...@huawei.com
 To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org
 Date: 2015-01-30 08:52
 On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
 There are sysfs interfaces in some filesystems (only btrfs so far)
 which will modify on-disk data.
 Unlike the normal file operation routines, where we can use
 mnt_want_write_file() to protect the operation, a change made through
 sysfs is not bound to any file in the filesystem.
 So we can only extract the first vfsmount of a superblock and pass it
 to mnt_want_write() to do the protection.
 This method is wrong, because one fs may be mounted in multiple places
 at the same time, some R/O, some R/W; you may get an R/O one and
 fail to get write permission.
 This shouldn't happen. If someone is ro, the whole fs should be ro, right?
 You can mount a device which is already mounted rw at another point as
 ro, and remounting one mount point to ro will also cause all other
 mount points to go ro.
 
 So I don't see the problem here.

 I think you can do the label/feature change via the sysfs interface in
 the following way:
 
 btrfs_sysfs_change_()
 {
 	/* Use trylock to avoid the race with umount */
 	if (!mutex_trylock(&sb->s_umount))
 		return -EBUSY;
 
 	check R/O and FREEZE
 
 	mutex_unlock(&sb->s_umount);
 }
 This looks better since it does not introduce changes to the VFS.

 Thanks,
 Qu
 Oh, wait a second, this one leads to the old problem and the old solution.
 
 If we hold the s_umount mutex, we must do the freeze check and can't
 start a transaction, since it will deadlock.
 
 And for the freeze check, we must use sb_try_start_intwrite() to hold
 the freeze lock and then add a new btrfs_start_transaction_freeze()
 which will not call sb_start_write()...
 
 Oh, this seems so similar - the v2 or v3 version of the RFC patch?
 So we still go back to the old method?

No. Just check R/O and FREEZE; if the check fails, bail out. If the
check passes, we start the transaction. Because we do it under the
s_umount lock, no one can change the fs to R/O or freeze it.

Maybe the above description was not so clear, so let me explain it again.

btrfs_sysfs_change_()
{
	/* Use trylock to avoid the race with umount */
	if (!mutex_trylock(&sb->s_umount))
		return -EBUSY;

	if (fs is R/O or FROZEN) {
		mutex_unlock(&sb->s_umount);
		return -EACCES;
	}

	btrfs_start_transaction()
	change label/feature
	btrfs_commit_transaction()

	mutex_unlock(&sb->s_umount);
}

Thanks
Miao

 
 Thanks,
 Qu

 Thanks
 Miao

 Cc: linux-fsdevel linux-fsde...@vger.kernel.org
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
   fs/namespace.c| 25 +
   include/linux/mount.h |  1 +
   2 files changed, 26 insertions(+)

 diff --git a/fs/namespace.c b/fs/namespace.c
 index cd1e968..5a16a62 100644
 --- a/fs/namespace.c
 +++ b/fs/namespace.c
 @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
   }
   EXPORT_SYMBOL(mntget);
 +/*
 + * get a vfsmount from a given sb
 + *
 + * This is especially used for the case where a change via the fs'
 + * sysfs interface will lead to a write, e.g. changing the label
 + * through sysfs in btrfs. So the fs can get a vfsmount and then use
 + * mnt_want_write() to protect the operation.
 + */
 +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
 +{
 +	struct vfsmount *ret_vfs = NULL;
 +	struct mount *mnt;
 +	int ret = 0;
 +
 +	lock_mount_hash();
 +	if (list_empty(&sb->s_mounts))
 +		goto out;
 +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
 +	ret_vfs = &mnt->mnt;
 +	ret_vfs = mntget(ret_vfs);
 +out:
 +	unlock_mount_hash();
 +	return ret_vfs;
 +}
 +EXPORT_SYMBOL(get_vfsmount_sb);
 +
   struct vfsmount *mnt_clone_internal(struct path *path)
   {
   struct mount *p;
 diff --git a/include/linux/mount.h b/include/linux/mount.h
 index c2c561d..cf1b0f5 100644
 --- a/include/linux/mount.h
 +++ b/include/linux/mount.h
 @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
   extern void mntput(struct vfsmount *mnt);
   extern struct vfsmount *mntget(struct vfsmount *mnt);
   extern struct vfsmount *mnt_clone_internal(struct path *path);
 +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
   extern int __mnt_is_readonly(struct vfsmount *mnt);
 struct path;


 
 .
 



Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Fri, 30 Jan 2015 02:14:45 +, Al Viro wrote:
 On Fri, Jan 30, 2015 at 09:44:03AM +0800, Qu Wenruo wrote:
 
 This shouldn't happen. If someone is ro, the whole fs should be ro, right?
 
 Wrong.  Individual vfsmounts over an r/w superblock might very well be r/o.
 As for that trylock...  What for?  It invites transient failures for no
 good reason.  Removal of sysfs entry will block while write(2) to that sucker
 is in progress, so btrfs shutdown will block at that point in ctree_close().
 It won't go away under you.

Could you explain the race condition? I think the deadlock won't happen:
during the btrfs shutdown we hold s_umount, the write operation will
fail to lock it and quit quickly, and then umount will continue.

I think sb_want_write() is similar to trylock(s_umount); the difference
is that sb_want_write() is more complex.

 
 Now, you might want to move those sysfs entry removals to the very beginning
 of btrfs_kill_super(), but that's a different story - you need only to make
 sure that they are removed not later than the destruction of the data
 structures they need (IOW, the current location might very well be OK - I
 hadn't checked the details).

Yes, we need to move those sysfs entry removals, but they needn't move
to the very beginning of btrfs_kill_super(), just to the beginning of
close_ctree().

The current location is not right; it will introduce a use-after-free
problem, because we remove the sysfs entry after we release
transaction_kthread. A use-after-free might happen in this case:

	Task1                           Task2
	change label by sysfs
	                                close_ctree()
	                                  kthread_stop(transaction_kthread);
	  change label
	  wake_up(transaction_kthread)


Thanks
Miao

 
 As for it won't go r/o under us - sb_want_write() will do that just fine.
 .
 



Re: [PATCH RESEND v4 6/8] vfs: Add get_vfsmount_sb() function to get vfsmount from a given sb.

2015-01-29 Thread Miao Xie
On Thu, 29 Jan 2015 10:24:39 +0800, Qu Wenruo wrote:
 There are sysfs interfaces in some filesystems (only btrfs so far)
 which will modify on-disk data.
 Unlike the normal file operation routines, where we can use
 mnt_want_write_file() to protect the operation, a change made through
 sysfs is not bound to any file in the filesystem.
 So we can only extract the first vfsmount of a superblock and pass it
 to mnt_want_write() to do the protection.

This method is wrong, because one fs may be mounted in multiple places
at the same time, some R/O, some R/W; you may get an R/O one and fail
to get write permission.

I think you can do the label/feature change via the sysfs interface in
the following way:

btrfs_sysfs_change_()
{
	/* Use trylock to avoid the race with umount */
	if (!mutex_trylock(&sb->s_umount))
		return -EBUSY;

	check R/O and FREEZE

	mutex_unlock(&sb->s_umount);
}

Thanks
Miao

 
 Cc: linux-fsdevel linux-fsde...@vger.kernel.org
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
  fs/namespace.c| 25 +
  include/linux/mount.h |  1 +
  2 files changed, 26 insertions(+)
 
 diff --git a/fs/namespace.c b/fs/namespace.c
 index cd1e968..5a16a62 100644
 --- a/fs/namespace.c
 +++ b/fs/namespace.c
 @@ -1105,6 +1105,31 @@ struct vfsmount *mntget(struct vfsmount *mnt)
  }
  EXPORT_SYMBOL(mntget);
  
 +/*
 + * get a vfsmount from a given sb
 + *
 + * This is especially used for the case where a change via the fs'
 + * sysfs interface will lead to a write, e.g. changing the label
 + * through sysfs in btrfs. So the fs can get a vfsmount and then use
 + * mnt_want_write() to protect the operation.
 + */
 +struct vfsmount *get_vfsmount_sb(struct super_block *sb)
 +{
 +	struct vfsmount *ret_vfs = NULL;
 +	struct mount *mnt;
 +	int ret = 0;
 +
 +	lock_mount_hash();
 +	if (list_empty(&sb->s_mounts))
 +		goto out;
 +	mnt = list_entry(sb->s_mounts.next, struct mount, mnt_instance);
 +	ret_vfs = &mnt->mnt;
 +	ret_vfs = mntget(ret_vfs);
 +out:
 +	unlock_mount_hash();
 +	return ret_vfs;
 +}
 +EXPORT_SYMBOL(get_vfsmount_sb);
 +
  struct vfsmount *mnt_clone_internal(struct path *path)
  {
   struct mount *p;
 diff --git a/include/linux/mount.h b/include/linux/mount.h
 index c2c561d..cf1b0f5 100644
 --- a/include/linux/mount.h
 +++ b/include/linux/mount.h
 @@ -79,6 +79,7 @@ extern void mnt_drop_write_file(struct file *file);
  extern void mntput(struct vfsmount *mnt);
  extern struct vfsmount *mntget(struct vfsmount *mnt);
  extern struct vfsmount *mnt_clone_internal(struct path *path);
 +extern struct vfsmount *get_vfsmount_sb(struct super_block *sb);
  extern int __mnt_is_readonly(struct vfsmount *mnt);
  
  struct path;
 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-25 Thread Miao Xie
On Fri, 23 Jan 2015 17:59:49 +0100, David Sterba wrote:
 On Wed, Jan 21, 2015 at 03:04:02PM +0800, Miao Xie wrote:
 Pending changes are *not* only mount options. Feature change and label 
 change
 are also pending changes if using sysfs.

 My miss, I don't notice feature and label change by sysfs.

 But the implementation of feature and label change by sysfs is wrong, we can
 not change them without write permission.
 
 Label change does not happen if the fs is readonly. If the filesystem is
 RW and label is changed through sysfs, then remount to RO will sync the
 filesystem and the new label will be saved.
 
 The sysfs features write handler is missing that protection though, I'll
 send a patch.

First, that R/O protection is too weak; there is a race between an R/O
remount and a label/feature change. Please consider the following case:

	Remount R/O task                Label/Attr change task
	                                check R/O
	remount to R/O
	                                change label/feature

Second, it forgets to handle the freezing event.

 
 For freeze, it's not the same problem, since the fs will be unfrozen
 sooner or later and a transaction will be initiated.
 
 You cannot assume what operations the users will do; they might freeze
 the fs and then shut down the machine.
 
 The semantics of freezing should make the on-device image consistent,
 but still keep some changes in memory.
 
 For example, if we change the features/label through sysfs and then
 umount the fs, it is different from a pending change.
 No, now features/label changes using sysfs both use pending changes to
 do the commit.
 See the BTRFS_PENDING_COMMIT bit.
 So freeze -> change features/label -> sync will still cause the
 deadlock in the same way, and you can try it yourself.

 As I said above, the implementation of sysfs feature and label change is 
 wrong,
 it is better to separate them from the pending mount option change, make the
 sysfs feature and label change be done in the context of transaction after
 getting the write permission. If so, we needn't do anything special when sync
 the fs.
 
 That would mean to drop the write support of sysfs files that change
 global filesystem state (label and features right now). This would leave
 only the ioctl way to do that. I'd like to keep the sysfs write support
 though for ease of use from scripts and languages not ioctl-friendly.
 .

I do not mean to drop the write support of sysfs; just fix the bug and make it
change the label and features under a writable context.

Thanks
Miao


Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-21 Thread Miao Xie
On Wed, 21 Jan 2015 15:47:54 +0800, Qu Wenruo wrote:
 On Wed, 21 Jan 2015 11:53:34 +0800, Qu Wenruo wrote:
 [snipped]
 This will cause another problem, nobody can ensure there will be next
 transaction and the change may
 never to written into disk.
 First, the pending changes is mount option, that is in-memory data.
 Second, the same problem would happen after you freeze fs.
 Pending changes are *not* only mount options. Feature change and label 
 change
 are also pending changes if using sysfs.
 My miss, I don't notice feature and label change by sysfs.

 But the implementation of feature and label change by sysfs is wrong, we can
 not change them without write permission.

 Normal ioctl label changing is not affected.

 For freeze, it's not the same problem since the fs will be unfreeze sooner 
 or
 later and transaction will be initiated.
 You can not assume the operations of the users, they might freeze the fs and
 then shutdown the machine.

 For example, if we change the features/label through sysfs, and then 
 umount
 the fs,
 It is different from pending change.
 No, now features/label changing using sysfs both use pending changes to do 
 the
 commit.
 See BTRFS_PENDING_COMMIT bit.
 So freeze -> change features/label -> sync will still cause the deadlock in
 the same way, and you can try it yourself.
 As I said above, the implementation of sysfs feature and label change is 
 wrong,
 it is better to separate them from the pending mount option change, make the
 sysfs feature and label change be done in the context of transaction after
 getting the write permission. If so, we needn't do anything special when sync
 the fs.

 In short, changing the sysfs feature and label change implementation and
 removing the unnecessary btrfs_start_transaction in sync_fs can fix the
 deadlock.
 Your method will only fix the deadlock, but will introduce the risk like 
 pending
 inode_cache will never
 be written to disk as I already explained. (If still using the
 fs_info-pending_changes mechanism)
 To ensure pending changes written to disk sync_fs() should start a transaction
 if needed,
 or there will be chance that no transaction can handle it.

We are sure that writing down the inode cache needs a transaction. But INODE_CACHE
is not a forcible flag. Sometimes, though you set it, it is very likely that the
inode cache files are not created and the data is not written down, because the
fs might still be reading the inode usage information, and this operation might
span several transactions. So I think what you worried about is not a problem.

Thanks
Miao

 
 But I don't see the necessity to pending current work(inode_cache, 
 feature/label
 changes) to next transaction.
 
 To David:
 I'm a little curious about why inode_cache needs to be delayed to next 
 transaction.
 In btrfs_remount() we have s_umount mutex, and we synced the whole filesystem
 already,
 so there should be no running transaction and we can just set any mount option
 into fs_info.
 
 Or even in worst case, there is a racing window, we can still start a
 transaction and do the commit,
 a little overhead in such minor case won't impact the overall performance.
 
 For sysfs change, I prefer attach or start transaction method, and for mount
 option change, and
 such sysfs tuning is also minor case for a filesystem.
 
 What do you think about reverting the whole patchset and rework the sysfs
 interface?
 
 Thanks,
 Qu

 Thanks
 Miao

 Thanks,
 Qu

 If you want to change features/label,  you should get write permission and 
 make
 sure the fs is not be freezed because those are on-disk data. So the 
 problem
 doesn't exist, or there is a bug.

 Thanks
 Miao

 since there is no write, there is no running transaction and if we don't
 start a
 new transaction,
 it won't be flushed to disk.

 Thanks,
 Qu
 the reason is:
 - Make the behavior of the fs be consistent(both freezed fs and 
 unfreezed fs)
 - Data on the disk is right and integrated


 Thanks
 Miao
 .

 .

 
 .
 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-20 Thread Miao Xie
On Tue, 20 Jan 2015 11:17:07 +0800, Qu Wenruo wrote:
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
 		 */
 		if (fs_info->pending_changes == 0)
 			return 0;
 +		/*
 +		 * Test if the fs is frozen, or start_trasaction
 +		 * will deadlock on itself.
 +		 */
 +		if (__sb_start_write(sb, SB_FREEZE_FS, false))
 +			__sb_end_write(sb, SB_FREEZE_FS);
 +		else
 +			return 0;
 I'm not sure this is the right fix. We should use either
 mnt_want_write_file or sb_start_write around the start/commit functions.
 The fs may be frozen already, but we also have to catch transition to
 that state, or RO remount.
 But the deadlock between s_umount and frozen level is a larger problem...

 Even Miao mentioned that we can start a transaction in btrfs_freeze(), but
 there is still possibility that
 we try to change the feature of the frozen btrfs and do sync, again the
 deadlock will happen.
 Although handling in btrfs_freeze() is also needed, but can't resolve all 
 the
 problem.

 IMHO the fix is still needed, or at least as a workaround until we find a 
 real
 root solution for it
 (If nobody want to revert the patchset)

 BTW, what about put the pending changes to a workqueue? If we don't start
 transaction under
 s_umount context like sync_fs()
 I don't like this fix.
 I think we should deal with those pending changes when we freeze a 
 filesystem.
 or we break the rule of fs freeze.
 I am afraid handling it in btrfs_freeze() won't help.
 Case like freeze() -> change_feature -> sync() -> unfreeze() will still deadlock
 in sync().

We should not change features after the fs is frozen.

 Even cleared the pending changes in freeze(), it can still be set through 
 sysfs
 interface even the fs is frozen.
 
 And in fact, if we put the things like attach/create a transaction into a
 workqueue, we will not break
 the freeze rule.
 Since if the fs is frozen, there is no running transaction and we need to 
 create
 a new one,
 that will call sb_start_intwrite(), which will sleep until the fs is unfreeze.

I read the pending change code just now, and I found that the pending change is
only used for changing the mount options now. So I think, as a work-around fix,
we needn't start a new transaction to handle the pending flags which are set
after the current transaction is committed, because the data on the disk is
intact.

Thanks
Miao


 
 Thanks,
 Qu

 Thanks
 Miao

 Thanks,
 Qu
 Also, returning 0 is not right, the ioctl actually skipped the expected
 work.

trans = btrfs_start_transaction(root, 0);
} else {
return PTR_ERR(trans);
 .

 
 .
 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-20 Thread Miao Xie
On Tue, 20 Jan 2015 20:10:56 -0500, Chris Mason wrote:
 On Tue, Jan 20, 2015 at 8:09 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

  Original Message 
 Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs
 to avoid deadlock.
 From: Chris Mason c...@fb.com
 To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2015-01-21 09:05


 On Tue, Jan 20, 2015 at 7:58 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

  Original Message 
 Subject: Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen
 fs to avoid deadlock.
 From: David Sterba dste...@suse.cz
 To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2015-01-21 01:13
 On Mon, Jan 19, 2015 at 03:42:41PM +0800, Qu Wenruo wrote:
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
  @@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
  		 */
  		if (fs_info->pending_changes == 0)
  			return 0;
  +		/*
  +		 * Test if the fs is frozen, or start_trasaction
  +		 * will deadlock on itself.
  +		 */
  +		if (__sb_start_write(sb, SB_FREEZE_FS, false))
  +			__sb_end_write(sb, SB_FREEZE_FS);
  +		else
  +			return 0;

 But what if someone freezes the FS after __sb_end_write() and before
 btrfs_start_transaction()?   I don't see what keeps new freezers from 
 coming in.

 -chris
 Either VFS::freeze_super() and VFS::syncfs() will hold the s_umount mutex, so
 freeze will not happen
 during sync.
 
 You're right.  I was worried about the sync ioctl, but the mutex won't be held
 there to deadlock against.  We'll be fine.

There is another problem which is introduced by the pending change: we will
start and commit a transaction by changing a pending mount option after we set
the fs to be R/O.

I think it is better that we don't start a new transaction for pending changes
which are set after the transaction is committed; just make them be handled by
the next transaction. The reasons are:
- Make the behavior of the fs consistent (both frozen fs and unfrozen fs)
- The data on the disk stays correct and intact


Thanks
Miao


Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-20 Thread Miao Xie
On Wed, 21 Jan 2015 11:53:34 +0800, Qu Wenruo wrote:
 +	/*
 +	 * Test if the fs is frozen, or start_trasaction
 +	 * will deadlock on itself.
 +	 */
 +	if (__sb_start_write(sb, SB_FREEZE_FS, false))
 +		__sb_end_write(sb, SB_FREEZE_FS);
 +	else
 +		return 0;
 But what if someone freezes the FS after __sb_end_write() and before
 btrfs_start_transaction()?   I don't see what keeps new freezers from
 coming in.

 -chris
 Either VFS::freeze_super() and VFS::syncfs() will hold the s_umount 
 mutex, so
 freeze will not happen
 during sync.
 You're right.  I was worried about the sync ioctl, but the mutex won't be 
 held
 there to deadlock against.  We'll be fine.
 There is another problem which is introduced by pending change. That is we will
 start and commit a transaction by changing pending mount option after we set
 the fs to be R/O.
 Oh, I missed this problem.
 I think it is better that we don't start a new transaction for pending 
 changes
 which are set after the transaction is committed, just make them be 
 handled by
 the next transaction,
 This will cause another problem, nobody can ensure there will be next
 transaction and the change may
 never to written into disk.
 First, the pending changes is mount option, that is in-memory data.
 Second, the same problem would happen after you freeze fs.
 Pending changes are *not* only mount options. Feature change and label change
 are also pending changes if using sysfs.

My miss, I don't notice feature and label change by sysfs.

But the implementation of feature and label change by sysfs is wrong, we can
not change them without write permission.

 Normal ioctl label changing is not affected.
 
 For freeze, it's not the same problem since the fs will be unfreeze sooner or
 later and transaction will be initiated.

You can not assume the operations of the users, they might freeze the fs and
then shutdown the machine.


 For example, if we change the features/label through sysfs, and then umount
 the fs,
 It is different from pending change.
 No, now features/label changing using sysfs both use pending changes to do the
 commit.
 See BTRFS_PENDING_COMMIT bit.
 So freeze -> change features/label -> sync will still cause the deadlock in
 the same way, and you can try it yourself.

As I said above, the implementation of sysfs feature and label change is wrong,
it is better to separate them from the pending mount option change, make the
sysfs feature and label change be done in the context of transaction after
getting the write permission. If so, we needn't do anything special when sync
the fs.

In short, changing the sysfs feature and label change implementation and
removing the unnecessary btrfs_start_transaction in sync_fs can fix the
deadlock.

Thanks
Miao

 
 Thanks,
 Qu
 
 If you want to change features/label,  you should get write permission and 
 make
 sure the fs is not be freezed because those are on-disk data. So the problem
 doesn't exist, or there is a bug.

 Thanks
 Miao

 since there is no write, there is no running transaction and if we don't 
 start a
 new transaction,
 it won't be flushed to disk.

 Thanks,
 Qu
 the reason is:
 - Make the behavior of the fs be consistent(both freezed fs and unfreezed 
 fs)
 - Data on the disk is right and integrated


 Thanks
 Miao
 .

 
 .
 



Re: [PATCH] btrfs: Don't call btrfs_start_transaction() on frozen fs to avoid deadlock.

2015-01-19 Thread Miao Xie
On Mon, 19 Jan 2015 15:42:41 +0800, Qu Wenruo wrote:
 Commit 6b5fe46dfa52 (btrfs: do commit in sync_fs if there are pending
 changes) will call btrfs_start_transaction() in sync_fs(), to handle
 some operations needed to be done in next transaction.
 
 However this can cause deadlock if the filesystem is frozen, with the
 following sysrq+w output:
 [  143.255932] Call Trace:
 [  143.255936]  [816c0e09] schedule+0x29/0x70
 [  143.255939]  [811cb7f3] __sb_start_write+0xb3/0x100
 [  143.255971]  [a040ec06] start_transaction+0x2e6/0x5a0 [btrfs]
 [  143.255992]  [a040f1eb] btrfs_start_transaction+0x1b/0x20 [btrfs]
 [  143.256003]  [a03dc0ba] btrfs_sync_fs+0xca/0xd0 [btrfs]
 [  143.256007]  [811f7be0] sync_fs_one_sb+0x20/0x30
 [  143.256011]  [811cbd01] iterate_supers+0xe1/0xf0
 [  143.256014]  [811f7d75] sys_sync+0x55/0x90
 [  143.256017]  [816c49d2] system_call_fastpath+0x12/0x17
 [  143.256111] Call Trace:
 [  143.256114]  [816c0e09] schedule+0x29/0x70
 [  143.256119]  [816c3405] rwsem_down_write_failed+0x1c5/0x2d0
 [  143.256123]  [8133f013] call_rwsem_down_write_failed+0x13/0x20
 [  143.256131]  [811caae8] thaw_super+0x28/0xc0
 [  143.256135]  [811db3e5] do_vfs_ioctl+0x3f5/0x540
 [  143.256187]  [811db5c1] SyS_ioctl+0x91/0xb0
 [  143.256213]  [816c49d2] system_call_fastpath+0x12/0x17
 
 The reason is like the following:
 (Holding s_umount)
 VFS sync_fs stuff:
 |- btrfs_sync_fs()
    |- btrfs_start_transaction()
       |- sb_start_intwrite()
       (Waiting for thaw_fs to unfreeze)
                                 VFS thaw_fs stuff:
                                 thaw_fs()
                                 (Waiting for sync_fs to release
                                  s_umount)
 
 So deadlock happens.
 This can be easily triggered by fstest/generic/068 with inode_cache
 mount option.
 
 The fix is to check if the fs is frozen; if the fs is frozen, just
 return and wait for the next transaction.
 
 Cc: David Sterba dste...@suse.cz
 Reported-by: Gui Hecheng guihc.f...@cn.fujitsu.com
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
  fs/btrfs/super.c | 8 
  1 file changed, 8 insertions(+)
 
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 60f7cbe..1d9f1e6 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
@@ -1000,6 +1000,14 @@ int btrfs_sync_fs(struct super_block *sb, int wait)
 		 */
 		if (fs_info->pending_changes == 0)
 			return 0;


I think the problem is here -- why is ->pending_changes not 0 when the
filesystem is frozen? So I think the reason for this problem is that
btrfs_freeze forgets to deal with the pending changes, and the correct fix is
to correct the behavior of btrfs_freeze().

Thanks
Miao

 +	/*
 +	 * Test if the fs is frozen, or start_trasaction
 +	 * will deadlock on itself.
 +	 */
 +	if (__sb_start_write(sb, SB_FREEZE_FS, false))
 +		__sb_end_write(sb, SB_FREEZE_FS);
 +	else
 +		return 0;
 		trans = btrfs_start_transaction(root, 0);
 	} else {
 		return PTR_ERR(trans);
 



Re: [PATCH] Btrfs: fix typo of variable in scrub_stripe

2015-01-09 Thread Miao Xie
On Fri, 09 Jan 2015 17:37:52 +0900, Tsutomu Itoh wrote:
 The address that should be freed is not 'ppath' but 'path'.
 
 Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com 
 ---
  fs/btrfs/scrub.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
 index f2bb13a..403fbdb 100644
 --- a/fs/btrfs/scrub.c
 +++ b/fs/btrfs/scrub.c
 @@ -3053,7 +3053,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
  
   ppath = btrfs_alloc_path();
   if (!ppath) {
 - btrfs_free_path(ppath);
 + btrfs_free_path(path);

My bad. Thanks to fix it.

Reviewed-by: Miao Xie miao...@huawei.com

   return -ENOMEM;
   }
  
 



Re: [PATCH] btrfs: delete chunk allocation attempt when setting block group ro

2015-01-08 Thread Miao Xie
On Thu, 08 Jan 2015 13:23:13 -0800, Shaohua Li wrote:
 Below test will fail currently:
   mkfs.ext4 -F /dev/sda
   btrfs-convert /dev/sda
   mount /dev/sda /mnt
   btrfs device add -f /dev/sdb /mnt
   btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt
 
 The reason is there are some block groups with usage 0, but the whole
 disk hasn't free space to allocate new chunk, so we even can't set such
 block group readonly. This patch deletes the chunk allocation when
 setting block group ro. For META, we already have reserve. But for
 SYSTEM, we don't have, so the check_system_chunk is still required.
 
 Signed-off-by: Shaohua Li s...@fb.com
 ---
  fs/btrfs/extent-tree.c | 31 +++
  1 file changed, 7 insertions(+), 24 deletions(-)
 
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index a80b971..430101b6 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -8493,22 +8493,8 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
  {
 	struct btrfs_space_info *sinfo = cache->space_info;
 	u64 num_bytes;
 -	u64 min_allocable_bytes;
 	int ret = -ENOSPC;
  
 -
 -	/*
 -	 * We need some metadata space and system metadata space for
 -	 * allocating chunks in some corner cases until we force to set
 -	 * it to be readonly.
 -	 */
 -	if ((sinfo->flags &
 -	     (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
 -	    !force)
 -		min_allocable_bytes = 1 * 1024 * 1024;
 -	else
 -		min_allocable_bytes = 0;
 -
 	spin_lock(&sinfo->lock);
 	spin_lock(&cache->lock);
  
 @@ -8521,8 +8507,8 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
 		cache->bytes_super - btrfs_block_group_used(&cache->item);
 
 	if (sinfo->bytes_used + sinfo->bytes_reserved + sinfo->bytes_pinned +
 -	    sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes +
 -	    min_allocable_bytes <= sinfo->total_bytes) {
 +	    sinfo->bytes_may_use + sinfo->bytes_readonly + num_bytes
 +	    <= sinfo->total_bytes) {
 		sinfo->bytes_readonly += num_bytes;
 		cache->ro = 1;
 		list_add_tail(&cache->ro_list, &sinfo->ro_bgs);
 @@ -8548,14 +8534,6 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
   if (IS_ERR(trans))
   return PTR_ERR(trans);
  
 -	alloc_flags = update_block_group_flags(root, cache->flags);
 -	if (alloc_flags != cache->flags) {
 -		ret = do_chunk_alloc(trans, root, alloc_flags,
 -				     CHUNK_ALLOC_FORCE);
 -		if (ret < 0)
 -			goto out;
 -	}
 -
   ret = set_block_group_ro(cache, 0);
   if (!ret)
   goto out;
 @@ -8566,6 +8544,11 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
   goto out;
   ret = set_block_group_ro(cache, 0);
  out:
 +	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
 +		alloc_flags = update_block_group_flags(root, cache->flags);
 +		check_system_chunk(trans, root, alloc_flags);

Please consider the case that the following patch fixed
  199c36eaa95077a47ae1bc55532fc0fbeb80cc95

If there is no free device space, check_system_chunk can not allocate
new system metadata chunk, so when we run final step of the chunk
allocation to update the device item and insert the new chunk item, we
would fail.

Thanks
Miao

 + }
 +
   btrfs_end_transaction(trans, root);
   return ret;
  }
 



Re: [PATCH] btrfs: delete chunk allocation attempt when setting block group ro

2015-01-08 Thread Miao Xie
On Thu, 08 Jan 2015 18:06:50 -0800, Shaohua Li wrote:
 On Fri, Jan 09, 2015 at 09:01:57AM +0800, Miao Xie wrote:
 On Thu, 08 Jan 2015 13:23:13 -0800, Shaohua Li wrote:
 Below test will fail currently:
   mkfs.ext4 -F /dev/sda
   btrfs-convert /dev/sda
   mount /dev/sda /mnt
   btrfs device add -f /dev/sdb /mnt
   btrfs balance start -v -dconvert=raid1 -mconvert=raid1 /mnt

 The reason is there are some block groups with usage 0, but the whole
 disk hasn't free space to allocate new chunk, so we even can't set such
 block group readonly. This patch deletes the chunk allocation when
 setting block group ro. For META, we already have reserve. But for
 SYSTEM, we don't have, so the check_system_chunk is still required.

 Signed-off-by: Shaohua Li s...@fb.com
 ---
  fs/btrfs/extent-tree.c | 31 +++
  1 file changed, 7 insertions(+), 24 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index a80b971..430101b6 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
  @@ -8493,22 +8493,8 @@ static int set_block_group_ro(struct btrfs_block_group_cache *cache, int force)
   {
  	struct btrfs_space_info *sinfo = cache->space_info;
  	u64 num_bytes;
  -	u64 min_allocable_bytes;
  	int ret = -ENOSPC;
  
  -
  -	/*
  -	 * We need some metadata space and system metadata space for
  -	 * allocating chunks in some corner cases until we force to set
  -	 * it to be readonly.
  -	 */
  -	if ((sinfo->flags &
  -	     (BTRFS_BLOCK_GROUP_SYSTEM | BTRFS_BLOCK_GROUP_METADATA)) &&
  -	    !force)
  -		min_allocable_bytes = 1 * 1024 * 1024;
  -	else
  -		min_allocable_bytes = 0;
  -
  	spin_lock(&sinfo->lock);
  	spin_lock(&cache->lock);
  
[SNIP]
 ret = set_block_group_ro(cache, 0);
 if (!ret)
 goto out;
 @@ -8566,6 +8544,11 @@ int btrfs_set_block_group_ro(struct btrfs_root *root,
 goto out;
 ret = set_block_group_ro(cache, 0);
  out:
  +	if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) {
  +		alloc_flags = update_block_group_flags(root, cache->flags);
  +		check_system_chunk(trans, root, alloc_flags);

 Please consider the case that the following patch fixed
   199c36eaa95077a47ae1bc55532fc0fbeb80cc95

 If there is no free device space, check_system_chunk can not allocate
 new system metadata chunk, so when we run final step of the chunk
 allocation to update the device item and insert the new chunk item, we
 would fail.
 
 So the relocation will always fail in this case. The check just makes
 the failure earlier, right? We don't have the BUG_ON in
 do_chunk_alloc() currently.

The final step of the chunk allocation is a delayed operation; we must make sure
it can be done successfully, or we would abort the transaction, make the
filesystem readonly, and lose the data that was written into the filesystem
before we did the balance. That would make the users uncomfortable.

With this patch, we will set the block group RO successfully the first time we
invoke set_block_group_ro(). But if the block group that will be set to RO is
the only system metadata block group in the filesystem, and there is no device
space to allocate a new one, then we have no space to deal with the pending
final step of the chunk allocation, so the problem I said above will happen.

Thanks
Miao


[PATCH v4 03/10] Btrfs, raid56: don't change bbio and raid_map

2014-12-02 Thread Miao Xie
Because we will reuse the bbio and raid_map during the scrub later, it is
better that we don't change any members of the bbio and don't free it at
the end of the IO request. So we introduce similar members into the raid
bio, and don't access those bbio members any more.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None.
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 66944b9..cb31cc6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
+
+   atomic_t stripes_pending;
+
+   atomic_t error;
/*
 * these are two arrays of pointers.  We allocate the
 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
 
/* OK, we have read all the stripes we need to. */
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
err = -EIO;
 
rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 	rbio->faila = -1;
 	rbio->failb = -1;
 	atomic_set(&rbio->refs, 1);
+	atomic_set(&rbio->error, 0);
+	atomic_set(&rbio->stripes_pending, 0);
 
/*
 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
 	spin_unlock_irq(&rbio->bio_list_lock);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 
/*
 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		}
 	}
 
-	atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-	BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
while (1) {
bio = bio_list_pop(bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
 	if (rbio->faila == -1) {
 		/* first failure on this rbio */
 		rbio->faila = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else if (rbio->failb == -1) {
 		/* second failure on this rbio */
 		rbio->failb = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
 	} else {
 		ret = -EIO;
 	}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
 	err = 0;
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
 		goto cleanup;
 
/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
 	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
 	struct bio_list bio_list;
 	int ret;
 	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
index_rbio_pages(rbio);
 
-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
/*
 * build a list of bios to read all the missing parts of this
 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 * the bbio may be freed once we submit the last bio.  Make sure
 * not to touch it after that
 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
while (1) {
bio = bio_list_pop(bio_list);
if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
 	set_bio_pages_uptodate(bio);
 	bio_put(bio);
 
-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
 		return;
 
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read

[PATCH v4 01/10] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition

2014-12-02 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

bbio_ret in this condition is always non-NULL because the previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 - v4:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 54db1fb..6f80aef 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,8 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			 BTRFS_BLOCK_GROUP_RAID6)) {
 		u64 tmp;
 
-		if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1) &&
-		    raid_map_ret) {
+		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
 			int i, rot;
 
 			/* push stripe_nr back to the start of the full stripe */
-- 
1.9.3



[PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
This patchset implements the device scrub/replace function for RAID56. Most of
the implementation for the common data is similar to the other RAID types.
The difference, and the difficulty, is the parity processing. The basic idea is
to read and check the data which has a checksum outside of the raid56 stripe
lock; if the data is right, then lock the raid56 stripe and read out the other
data in the same stripe; if no IO error happens, calculate the parity and check
the original one; if the original parity is right, the scrub of the parity
passes, otherwise we write out the new one. But if the common data (not parity)
that we read out is wrong, we will try to recover it, and then check and repair
the parity.

And in order to avoid making the code more and more complex, we copy some
code of common data process for the parity, the cleanup work is in my
TODO list.

We have done some tests and the patchset worked well. Of course, more tests
are welcome. If you are interested in using or testing it, you can pull
the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v3 - v4:
- Fix the problem that the scrub's raid bio was cached, which was reported
  by Chris.
- Remove the 10th patch; the deadlock that was described in that patch doesn't
  exist on the current kernel.
- Rebase the patchset to the top of integration branch

Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Fix possible deadlock caused by the pending bios in the plug list
  when the io submitters were going to sleep.
- Fix undealt use-after-free problem of the source device in the final
  device replace procedure.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.

Thanks
Miao

Miao Xie (7):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace
procedure on raid56

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/ctree.h   |   7 +-
 fs/btrfs/dev-replace.c |   9 +-
 fs/btrfs/raid56.c  | 763 +-
 fs/btrfs/raid56.h  |  16 +-
 fs/btrfs/scrub.c   | 803 +++--
 fs/btrfs/volumes.c |  52 +++-
 fs/btrfs/volumes.h |  14 +-
 7 files changed, 1531 insertions(+), 133 deletions(-)

-- 
1.9.3



[PATCH v4 07/10] Btrfs, replace: write dirty pages into the replace target device

2014-12-02 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in the btrfs bio, and we sort those target
  device stripes in the array. We also keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, none of the other operations
  take the target device stripes into account.
- When we do a write operation, we read the data from the common devices
  and calculate the parity. Then we write the dirty data and the new
  parity out; at this time, we find the relative replace target stripes
  and write the relative data into them.

Note: The function that copies the old data on the source device to
the target device was implemented in the past; it is similar to
the other RAID types.
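
A toy model of that btrfs bio layout shows how the target stripes stay hidden
behind the "real" stripe count. The struct and field names below merely mirror
the patch; the real struct btrfs_bio carries much more, so treat this as an
illustrative sketch only.

```c
#include <assert.h>

struct stripe { int devid; };

/* Simplified stand-in for struct btrfs_bio: the replace target
 * stripes sit at the end of the stripes[] array, and num_tgtdevs
 * records how many of them there are. */
struct mini_bbio {
    int num_stripes;      /* source stripes + target stripes */
    int num_tgtdevs;      /* target stripes at the tail of the array */
    struct stripe stripes[8];
};

/* Everything except the RAID56 write path only looks at the first
 * real_stripes entries, so the target stripes are invisible to it. */
static int real_stripes(const struct mini_bbio *bbio)
{
    return bbio->num_stripes - bbio->num_tgtdevs;
}
```

This is exactly the `real_stripes = bbio->num_stripes - bbio->num_tgtdevs`
computation that the raid56.c hunk below adds to alloc_rbio().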

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 58a8408..16fe456 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -131,6 +131,8 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int real_stripes;
+
int stripe_npages;
/*
 * set if we're doing a parity rebuild
@@ -638,7 +640,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio 
*rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-   if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+   if (rbio->nr_data + 1 == rbio->real_stripes)
return NULL;
 
index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
PAGE_CACHE_SHIFT;
@@ -981,7 +983,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 {
struct btrfs_raid_bio *rbio;
int nr_data = 0;
-   int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+   int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+   int num_pages = rbio_nr_pages(stripe_len, real_stripes);
int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
@@ -1001,6 +1004,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio->fs_info = root->fs_info;
rbio->stripe_len = stripe_len;
rbio->nr_pages = num_pages;
+   rbio->real_stripes = real_stripes;
rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
@@ -1017,10 +1021,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio->bio_pages = p + sizeof(struct page *) * num_pages;
rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-   if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-   nr_data = bbio->num_stripes - 2;
+   if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+   nr_data = real_stripes - 2;
else
-   nr_data = bbio->num_stripes - 1;
+   nr_data = real_stripes - 1;
 
rbio->nr_data = nr_data;
return rbio;
@@ -1132,7 +1136,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
if (rbio->faila >= 0 || rbio->failb >= 0) {
-   BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+   BUG_ON(rbio->faila == rbio->real_stripes - 1);
__raid56_parity_recover(rbio);
} else {
finish_rmw(rbio);
@@ -1193,7 +1197,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
struct btrfs_bio *bbio = rbio->bbio;
-   void *pointers[bbio->num_stripes];
+   void *pointers[rbio->real_stripes];
int stripe_len = rbio->stripe_len;
int nr_data = rbio->nr_data;
int stripe;
@@ -1207,11 +1211,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
 
bio_list_init(bio_list);
 
-   if (bbio->num_stripes - rbio->nr_data == 1) {
-   p_stripe = bbio->num_stripes - 1;
-   } else if (bbio->num_stripes - rbio->nr_data == 2) {
-   p_stripe = bbio->num_stripes - 2;
-   q_stripe = bbio->num_stripes - 1;
+   if (rbio->real_stripes - rbio->nr_data == 1) {
+   p_stripe = rbio->real_stripes - 1;
+   } else if (rbio->real_stripes - rbio->nr_data == 2) {
+   p_stripe = rbio->real_stripes - 2;
+   q_stripe = rbio->real_stripes - 1;
} else {
BUG();
}
@@ -1268,7 +1272,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio 
*rbio)
SetPageUptodate(p);
pointers[stripe++] = kmap(p);
 
-   raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE

[PATCH v4 05/10] Btrfs, raid56: use a variant to record the operation type

2014-12-02 Thread Miao Xie
We will introduce a new operation type later. If we keep using integer
variables as booleans to record the operation type, we will have to add a
new variable and increase the size of the raid bio structure, which is not
good. With this patch, we define a different number for each operation,
so a single variable is enough to record the operation type.
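
The point generalizes: one enum field absorbs any number of future operation
types, while a bool-per-operation approach would grow the struct each time.
A minimal illustration, with made-up names (the real enum added by the patch
is btrfs_rbio_ops, and the third value here is hypothetical):

```c
#include <assert.h>

/* One field records which operation owns the rbio; adding an
 * operation later costs a new enumerator, not a new struct field. */
enum rbio_ops {
    RBIO_OP_WRITE = 0,
    RBIO_OP_READ_REBUILD = 1,
    RBIO_OP_PARITY_SCRUB = 2,   /* hypothetical later addition */
};

struct rbio_state {
    enum rbio_ops operation;    /* replaces int read_rebuild, etc. */
};
```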

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c954537..4924388 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -69,6 +69,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+   BTRFS_RBIO_WRITE= 0,
+   BTRFS_RBIO_READ_REBUILD = 1,
+};
+
 struct btrfs_raid_bio {
struct btrfs_fs_info *fs_info;
struct btrfs_bio *bbio;
@@ -131,7 +136,7 @@ struct btrfs_raid_bio {
 * differently from a parity rebuild as part of
 * rmw
 */
-   int read_rebuild;
+   enum btrfs_rbio_ops operation;
 
/* first bad stripe */
int faila;
@@ -154,7 +159,6 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
-
atomic_t stripes_pending;
 
atomic_t error;
@@ -590,8 +594,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
return 0;
 
/* reads can't merge with writes */
-   if (last->read_rebuild !=
-   cur->read_rebuild) {
+   if (last->operation != cur->operation) {
return 0;
}
 
@@ -784,9 +787,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
spin_unlock(&rbio->bio_list_lock);
spin_unlock_irqrestore(&h->lock, flags);
 
-   if (next->read_rebuild)
+   if (next->operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else {
+   else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
async_rmw_stripe(next);
}
@@ -1720,6 +1723,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
}
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
+   rbio->operation = BTRFS_RBIO_WRITE;
 
/*
 * don't plug on full rbios, just get them out the door
@@ -1768,7 +1772,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
faila = rbio->faila;
failb = rbio->failb;

-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
spin_lock_irq(&rbio->bio_list_lock);
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
@@ -1785,7 +1789,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1878,7 +1882,7 @@ pstripe:
 * know they can be trusted.  If this was a read reconstruction,
 * other endio functions will fiddle the uptodate bits
 */
-   if (!rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_WRITE) {
for (i = 0; i < nr_pages; i++) {
if (faila != -1) {
page = rbio_stripe_page(rbio, faila, i);
@@ -1895,7 +1899,7 @@ pstripe:
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1910,8 +1914,7 @@ cleanup:
kfree(pointers);
 
 cleanup_io:
-
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
if (err == 0 &&
    !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags))
cache_rbio_pages(rbio);
@@ -2050,7 +2053,7 @@ out:
return 0;
 
 cleanup:
-   if (rbio->read_rebuild)
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
rbio_orig_end_io(rbio, -EIO, 0);
return -EIO;
 }
@@ -2076,7 +2079,7 @@ int raid56_parity_recover(struct

[PATCH v4 09/10] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56

2014-12-02 Thread Miao Xie
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of a device
replace, but at that time btrfs didn't support device replace
on raid56, so the problem was not fixed for the raid56 profile.
Now that we have implemented device replace for raid56, we need
to kick that problem out before we enable the function for raid56.

The fix method is very simple: we just increase the per-cpu bio
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.
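
In userspace terms, the fix pairs an increment on the submit path with a
decrement in the end-io path, so the replace-finish code can wait for all
in-flight bios before freeing the source device. The sketch below is a hedged
illustration with invented names; a plain C11 atomic stands in for the
kernel's per-cpu bio_counter.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_long bio_counter;   /* stand-in for fs_info->bio_counter */

static void bio_counter_inc(void)       { atomic_fetch_add(&bio_counter, 1); }
static void bio_counter_sub(long n)     { atomic_fetch_sub(&bio_counter, n); }

/* Submit path: the counter is taken before the bio is in flight. */
static void submit_raid56_bio(void)
{
    bio_counter_inc();
    /* ... hand the bio to the raid56 code ... */
}

/* End-io path: one rbio may carry several generic bios, hence the
 * _sub variant the patch introduces instead of a plain decrement. */
static void raid56_end_io(long generic_bio_cnt)
{
    bio_counter_sub(generic_bio_cnt);
    /* ... wake up anyone waiting in the replace-finish code ... */
}
```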

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 -> v4:
- None.

Changelog v2 -> v3:
- New patch to fix undealt use-after-free problem of the source device
  in the final device replace procedure.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/ctree.h   |  7 ++-
 fs/btrfs/dev-replace.c |  4 ++--
 fs/btrfs/raid56.c  | 41 -
 fs/btrfs/raid56.h  |  4 ++--
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/volumes.c |  7 ++-
 6 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fc73e86..3770f4c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4156,7 +4156,12 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 
devid,
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
+
+static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+{
+   btrfs_bio_counter_sub(fs_info, 1);
+}
 
 /* reada.c */
 struct reada_control {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 91f6b8f..326919b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,9 +928,9 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info 
*fs_info)
percpu_counter_inc(&fs_info->bio_counter);
 }
 
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
-   percpu_counter_dec(&fs_info->bio_counter);
+   percpu_counter_sub(&fs_info->bio_counter, amount);
 
if (waitqueue_active(&fs_info->replace_wait))
wake_up(&fs_info->replace_wait);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7e6f239..44573bf 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,6 +162,8 @@ struct btrfs_raid_bio {
 */
int bio_list_bytes;
 
+   int generic_bio_cnt;
+
atomic_t refs;
 
atomic_t stripes_pending;
@@ -354,6 +356,7 @@ static void merge_rbio(struct btrfs_raid_bio *dest,
 {
bio_list_merge(&dest->bio_list, &victim->bio_list);
dest->bio_list_bytes += victim->bio_list_bytes;
+   dest->generic_bio_cnt += victim->generic_bio_cnt;
bio_list_init(&victim->bio_list);
 }
 
@@ -891,6 +894,10 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, 
int err, int uptodate)
 {
struct bio *cur = bio_list_get(&rbio->bio_list);
struct bio *next;
+
+   if (rbio->generic_bio_cnt)
+   btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
+
free_raid_bio(rbio);
 
while (cur) {
@@ -1775,6 +1782,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct btrfs_raid_bio *rbio;
struct btrfs_plug_cb *plug = NULL;
struct blk_plug_cb *cb;
+   int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
if (IS_ERR(rbio)) {
@@ -1785,12 +1793,19 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
rbio->bio_list_bytes = bio->bi_iter.bi_size;
rbio->operation = BTRFS_RBIO_WRITE;
 
+   btrfs_bio_counter_inc_noblocked(root->fs_info);
+   rbio->generic_bio_cnt = 1;
+
/*
 * don't plug on full rbios, just get them out the door
 * as quickly as we can
 */
-   if (rbio_is_full(rbio))
-   return full_stripe_write(rbio);
+   if (rbio_is_full(rbio)) {
+   ret = full_stripe_write(rbio);
+   if (ret)
+   btrfs_bio_counter_dec(root->fs_info);
+   return ret;
+   }
 
cb = blk_check_plugged(btrfs_raid_unplug, root->fs_info,
   sizeof(*plug));
@@ -1801,10 +1816,13 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
INIT_LIST_HEAD(&plug->rbio_list);
}
list_add_tail(&rbio->plug_list, &plug->rbio_list);
+   ret = 0;
} else {
-   return __raid56_parity_write(rbio);
+   ret = __raid56_parity_write(rbio);
+   if (ret)
+   btrfs_bio_counter_dec(root->fs_info);
}
-   return 0;
+   return ret;

[PATCH v4 10/10] Btrfs, replace: enable dev-replace for raid56

2014-12-02 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 326919b..51133ea 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-   if (btrfs_fs_incompat(fs_info, RAID56)) {
-   btrfs_warn(fs_info, dev_replace cannot yet handle 
RAID5/RAID6);
-   return -EOPNOTSUPP;
-   }
-
switch (args-start.cont_reading_from_srcdev_mode) {
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3



[PATCH v4 04/10] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-12-02 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to scrub on the other RAID levels such as
RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 -> v4:
- Fix the problem that the scrub's raid bio was cached, which was reported by
  Chris.

Changelog v2 -> v3:
- None.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.
---
 fs/btrfs/raid56.c  |  52 ++
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 235 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index cb31cc6..c954537 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,15 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+/*
+ * bbio and raid_map are managed by the caller, so we shouldn't free
+ * them here. Besides that, all rbios with this flag should not
+ * be cached, because we need raid_map to check whether the rbios'
+ * stripes are the same or not, but it is very likely that the caller
+ * has already freed raid_map, so don't cache those rbios.
+ */
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +808,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   __free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +841,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio->stripe_pages[i] = NULL;
}
}
-   kfree(rbio->raid_map);
-   kfree(rbio->bbio);
+
+   free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +958,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(&rbio->bio_list);
INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1714,10 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -1888,7 +1912,8 @@ cleanup:
 cleanup_io:
 
if (rbio->read_rebuild) {
-   if (err == 0)
+   if (err == 0 &&
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags))
cache_rbio_pages(rbio);
else
clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
@@ -2038,15 +2063,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
rbio->read_rebuild = 1;
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2083,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio

[PATCH v4 02/10] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-12-02 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value is set again by a later line:
stripe_index = 0;
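
The combined division in the hunk below also relies on the fact that unsigned
integer (floor) division nests: (n / a) / b == n / (a * b) whenever a * b does
not overflow, so the intermediate stripe_index result really can be dropped.
A quick self-contained check of that identity (names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Floor division nests for unsigned integers: dividing by a and then
 * by b equals dividing by a * b in one step (given a * b fits in 64 bits). */
static int nested_div_matches(uint64_t n, uint64_t a, uint64_t b)
{
    return (n / a) / b == n / (a * b);
}
```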

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6f80aef..eeb5b31 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5172,9 +5172,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
 
/* push stripe_nr back to the start of the full stripe */
stripe_nr = raid56_full_stripe_start;
-   do_div(stripe_nr, stripe_len);
-
-   stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+   do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map-num_stripes;
-- 
1.9.3



Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-12-02 Thread Miao Xie
hi, Chris

On Fri, 28 Nov 2014 16:32:03 -0500, Chris Mason wrote:
 On Wed, Nov 26, 2014 at 10:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
  On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
  On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
  The increase/decrease of bio counter is on the I/O path, so we should
  use io_schedule() instead of schedule(), or the deadlock might be
  triggered by the pending I/O in the plug list. io_schedule() can help
  us because it will flush all the pending I/O before the task is going
  to sleep.

  Can you please describe this deadlock in more detail?  schedule() also 
 triggers
  a flush of the plug list, and if that's no longer sufficient we can run 
 into other
  problems (especially with preemption on).

  Sorry for my miss. I forgot to check the current implementation of 
 schedule(), which flushes the plug list unconditionally. Please ignore this 
 patch.

 I have updated my raid56-scrub-replace branch, please re-pull the branch.

   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
 
 Sorry, I wasn't clear.  I do like the patch because it uses a slightly better 
 trigger mechanism for the flush.  I was just worried about a larger deadlock.
 
 I ran the raid56 work with stress.sh overnight, then scrubbed the resulting 
 filesystem and ran balance when the scrub completed.  All of these passed 
 without errors (excellent!).
 
 Then I zero'd 4GB of one drive and ran scrub again.  This was the result.  
 Please make sure CONFIG_DEBUG_PAGEALLOC is enabled and you should be able to 
 reproduce.

I sent out the 4th version of the patchset, please try it.

I have pushed the new patchset to my git tree, you can re-pull it.
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

 
 [192392.495260] BUG: unable to handle kernel paging request at 
 880303062f80
 [192392.495279] IP: [a05fe77a] lock_stripe_add+0xba/0x390 [btrfs]
 [192392.495281] PGD 2bdb067 PUD 107e7fd067 PMD 107e7e4067 PTE 800303062060
 [192392.495283] Oops:  [#1] SMP DEBUG_PAGEALLOC
 [192392.495307] Modules linked in: ipmi_devintf loop fuse k10temp coretemp 
 hwmon btrfs raid6_pq zlib_deflate lzo_compress xor xfs exportfs libcrc32c 
 tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables xt_NFLOG nfnetlink_log 
 nfnetlink xt_comment xt_statistic iptable_filter ip_tables x_tables mptctl 
 netconsole autofs4 nfsv3 nfs lockd grace rpcsec_gss_krb5 auth_rpcgss 
 oid_registry sunrpc ipv6 ext3 jbd dm_mod rtc_cmos ipmi_si ipmi_msghandler 
 iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 lpc_ich mfd_core shpchp ehci_pci 
 ehci_hcd mlx4_en ptp pps_core mlx4_core sg ses enclosure button megaraid_sas
 [192392.495310] CPU: 0 PID: 11992 Comm: kworker/u65:2 Not tainted 
 3.18.0-rc6-mason+ #7
 [192392.495310] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 
 1.07 05/10/2012
 [192392.495323] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
 [192392.495324] task: 88013dae9110 ti: 8802296a task.ti: 
 8802296a
 [192392.495335] RIP: 0010:[a05fe77a]  [a05fe77a] 
 lock_stripe_add+0xba/0x390 [btrfs]
 [192392.495335] RSP: 0018:8802296a3ac8  EFLAGS: 00010006
 [192392.495336] RAX: 880577e85018 RBX: 880497f0b2f8 RCX: 
 8801190fb000
 [192392.495337] RDX: 013d RSI: 880303062f80 RDI: 
 040c275a
 [192392.495338] RBP: 8802296a3b48 R08: 880497f0 R09: 
 0001
 [192392.495339] R10:  R11:  R12: 
 0282
 [192392.495339] R13: b250 R14: 880577e85000 R15: 
 880497f0b2a0
 [192392.495340] FS:  () GS:88085fc0() 
 knlGS:
 [192392.495341] CS:  0010 DS:  ES:  CR0: 80050033
 [192392.495342] CR2: 880303062f80 CR3: 05289000 CR4: 
 000407f0
 [192392.495342] Stack:
 [192392.495344]  880755e28000 880497f0 013d 
 8801190fb000
 [192392.495346]   88013dae9110 81090d40 
 8802296a3b00
 [192392.495347]  8802296a3b00 0010 8802296a3b68 
 8801190fb000
 [192392.495348] Call Trace:
 [192392.495353]  [81090d40] ? bit_waitqueue+0xa0/0xa0
 [192392.495363]  [a05fea66] 
 raid56_parity_submit_scrub_rbio+0x16/0x30 [btrfs]
 [192392.495372]  [a05e2f0e] 
 scrub_parity_check_and_repair+0x15e/0x1e0 [btrfs]
 [192392.495380]  [a05e301d] scrub_block_put+0x8d/0x90 [btrfs]
 [192392.495388]  [a05e6ed7] ? scrub_bio_end_io_worker+0xd7/0x870 
 [btrfs]
 [192392.495396]  [a05e6ee9] scrub_bio_end_io_worker+0xe9/0x870 
 [btrfs]
 [192392.495405]  [a05b8c44] normal_work_helper+0x84/0x330 [btrfs]
 [192392.495414]  [a05b8f42] btrfs_scrub_helper+0x12/0x20 [btrfs]
 [192392.495417]  [8106c50f] process_one_work+0x1bf/0x520
 [192392.495419]  [8106c48d

Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote:
 
 
 On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 This patchset implement the device scrub/replace function for RAID56, the
 most implementation of the common data is similar to the other RAID type.
 The differentia or difficulty is the parity process. The basic idea is 
 reading
 and check the data which has checksum out of the raid56 stripe lock, if the
 data is right, then lock the raid56 stripe, read out the other data in the
 same stripe, if no IO error happens, calculate the parity and check the
 original one, if the original parity is right, the scrub parity passes.
 or write out the new one. But if the common data(not parity) that we read out
 is wrong, we will try to recover it, and then check and repair the parity.

 And in order to avoid making the code more and more complex, we copy some
 code of common data process for the parity, the cleanup work is in my
 TODO list.

 We have done some test, the patchset worked well. Of course, more tests
 are welcome. If you are interesting to use it or test it, you can pull
 the patchset from

   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

 Changelog v3 - v4:
 - Fix the problem that the scrub's raid bio was cached, which was reported
   by Chris.
 - Remove the 10st patch, the deadlock that was described in that patch 
 doesn't
   exist on the current kernel.
 - Rebase the patchset to the top of integration branch
 
 Thanks, I'll try this today.  I need to rebase in a new version of the RCU 
 patches, can you please cook one on top of v3.18-rc6 instead?

No problem.

Thanks
Miao

 
 -chris
 
 



Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote:
 On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 This patchset implement the device scrub/replace function for RAID56, the
 most implementation of the common data is similar to the other RAID type.
 The differentia or difficulty is the parity process. The basic idea is 
 reading
 and check the data which has checksum out of the raid56 stripe lock, if the
 data is right, then lock the raid56 stripe, read out the other data in the
 same stripe, if no IO error happens, calculate the parity and check the
 original one, if the original parity is right, the scrub parity passes.
 or write out the new one. But if the common data(not parity) that we read out
 is wrong, we will try to recover it, and then check and repair the parity.

 And in order to avoid making the code more and more complex, we copy some
 code of common data process for the parity, the cleanup work is in my
 TODO list.

 We have done some test, the patchset worked well. Of course, more tests
 are welcome. If you are interesting to use it or test it, you can pull
 the patchset from

   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

 Changelog v3 - v4:
 - Fix the problem that the scrub's raid bio was cached, which was reported
   by Chris.
 - Remove the 10st patch, the deadlock that was described in that patch 
 doesn't
   exist on the current kernel.
 - Rebase the patchset to the top of integration branch
 
 Thanks, I'll try this today.  I need to rebase in a new version of the RCU 
 patches, can you please cook one on top of v3.18-rc6 instead?

I have updated my raid56-scrub-replace branch, please re-pull it.
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

The v4 patchset on the mailing list can be applied to v3.18-rc6 successfully,
so I didn't update it.

Thanks
Miao


[PATCH] Btrfs: fix wrong list access on the failure of reading out checksum

2014-12-01 Thread Miao Xie
If we fail to read out the checksum, we free all the checksums in the
list. But the current code accesses the list head, not the entries in the
list. Fix it.
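
The bug is easy to reproduce with a cut-down userspace version of the kernel's
intrusive list (simplified here, not the real list.h). list_entry() applied to
the head itself computes a container address from the head, which is not
embedded in any entry, while list_first_entry() follows head->next first:

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* list_entry is just container_of; list_first_entry dereferences the
 * head's next pointer first, which is what the fix relies on. */
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_first_entry(head, type, member) \
    list_entry((head)->next, type, member)

/* Stand-in for struct btrfs_ordered_sum with its embedded list node. */
struct sum {
    int value;
    struct list_head list;
};
```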

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/file-item.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 783a943..c26b58f 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 
start, u64 end,
ret = 0;
 fail:
while (ret < 0 && !list_empty(&tmplist)) {
-   sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
+   sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
+   list);
list_del(&sums->list);
kfree(sums);
}
-- 
1.9.3



Re: [PATCH] Btrfs: fix wrong list access on the failure of reading out checksum

2014-12-01 Thread Miao Xie
Please ignore this patch, Chris has fixed this problem.

Thanks
Miao

On Mon, 1 Dec 2014 18:04:13 +0800, Miao Xie wrote:
 If we failed to reading out the checksum, we would free all the checksums
 in the list. But the current code accessed the list head, not the entry
 in the list. Fix it.
 
 Signed-off-by: Miao Xie mi...@cn.fujitsu.com
 ---
  fs/btrfs/file-item.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
 index 783a943..c26b58f 100644
 --- a/fs/btrfs/file-item.c
 +++ b/fs/btrfs/file-item.c
 @@ -413,7 +413,8 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	ret = 0;
  fail:
 	while (ret < 0 && !list_empty(&tmplist)) {
 -		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
 +		sums = list_first_entry(&tmplist, struct btrfs_ordered_sum,
 +					list);
 	list_del(&sums->list);
 	kfree(sums);
 	}
 



[PATCH v3 05/11] Btrfs, raid56: use a variant to record the operation type

2014-11-26 Thread Miao Xie
We will introduce a new operation type later. If we keep using one
integer variable as a bool flag per operation type, we would have to
add a new variable and grow the raid bio structure, which is not good.
With this patch, we define a different number for each operation, so a
single variable can record the operation type.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6013d88..bfc406d 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -62,6 +62,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+	BTRFS_RBIO_WRITE	= 0,
+	BTRFS_RBIO_READ_REBUILD	= 1,
+};
+
 struct btrfs_raid_bio {
struct btrfs_fs_info *fs_info;
struct btrfs_bio *bbio;
@@ -124,7 +129,7 @@ struct btrfs_raid_bio {
 * differently from a parity rebuild as part of
 * rmw
 */
-   int read_rebuild;
+   enum btrfs_rbio_ops operation;
 
/* first bad stripe */
int faila;
@@ -147,7 +152,6 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
-
atomic_t stripes_pending;
 
atomic_t error;
@@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
return 0;
 
/* reads can't merge with writes */
-	if (last->read_rebuild !=
-	    cur->read_rebuild) {
+	if (last->operation != cur->operation) {
return 0;
}
 
@@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
	spin_unlock(&rbio->bio_list_lock);
	spin_unlock_irqrestore(&h->lock, flags);

-	if (next->read_rebuild)
+	if (next->operation == BTRFS_RBIO_READ_REBUILD)
		async_read_rebuild(next);
-	else {
+	else if (next->operation == BTRFS_RBIO_WRITE) {
		steal_rbio(rbio, next);
		async_rmw_stripe(next);
	}
@@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
	}
	bio_list_add(&rbio->bio_list, bio);
	rbio->bio_list_bytes = bio->bi_iter.bi_size;
+	rbio->operation = BTRFS_RBIO_WRITE;
 
/*
 * don't plug on full rbios, just get them out the door
@@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
	faila = rbio->faila;
	failb = rbio->failb;

-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
		spin_lock_irq(&rbio->bio_list_lock);
		set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
		spin_unlock_irq(&rbio->bio_list_lock);
@@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
	 * if we're rebuilding a read, we have to use
	 * pages from the bio list
	 */
-	if (rbio->read_rebuild &&
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
	    (stripe == faila || stripe == failb)) {
		page = page_in_rbio(rbio, stripe, pagenr, 0);
	} else {
@@ -1871,7 +1875,7 @@ pstripe:
	 * know they can be trusted.  If this was a read reconstruction,
	 * other endio functions will fiddle the uptodate bits
	 */
-	if (!rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_WRITE) {
		for (i = 0; i < nr_pages; i++) {
			if (faila != -1) {
				page = rbio_stripe_page(rbio, faila, i);
@@ -1888,7 +1892,7 @@ pstripe:
	 * if we're rebuilding a read, we have to use
	 * pages from the bio list
	 */
-	if (rbio->read_rebuild &&
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
	    (stripe == faila || stripe == failb)) {
		page = page_in_rbio(rbio, stripe, pagenr, 0);
	} else {
@@ -1904,7 +1908,7 @@ cleanup:

cleanup_io:

-	if (rbio->read_rebuild) {
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
		if (err == 0)
			cache_rbio_pages(rbio);
		else
@@ -2042,7 +2046,7 @@ out:
	return 0;

cleanup:
-	if (rbio->read_rebuild)
+	if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
		rbio_orig_end_io(rbio, -EIO, 0);
	return -EIO;
 }
@@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,

	if (hold_bbio

[PATCH v3 00/11] Implement device scrub/replace for RAID56

2014-11-26 Thread Miao Xie
This patchset implements the device scrub/replace function for RAID56.
Most of the implementation for the common data is similar to the other
RAID types; the difference, and the difficulty, is the parity processing.
The basic idea is to read and check the data that has a checksum outside
the raid56 stripe lock. If that data is right, we lock the raid56 stripe
and read out the other data in the same stripe. If no IO error happens,
we calculate the parity and check it against the original one: if the
original parity is right, the parity scrub passes; otherwise we write out
the new one. But if the common data (not parity) that we read out is
wrong, we try to recover it first, and then check and repair the parity.

And in order to avoid making the code more and more complex, we copied
some of the common-data processing code for the parity; the cleanup work
is on my TODO list.

We have done some tests and the patchset worked well. Of course, more
tests are welcome. If you are interested in using or testing it, you can
pull the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v2 - v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Fix possible deadlock caused by the pending bios in the plug list
  when the io submitters were going to sleep.
- Fix undealt use-after-free problem of the source device in the final
  device replace procedure.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 - v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.

Thanks
Miao

Miao Xie (8):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace
procedure on raid56
  Btrfs: fix possible deadlock caused by pending I/O in plug list

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |  20 +-
 fs/btrfs/raid56.c  | 746 -
 fs/btrfs/raid56.h  |  16 +-
 fs/btrfs/scrub.c   | 803 +++--
 fs/btrfs/volumes.c |  52 +++-
 fs/btrfs/volumes.h |  14 +-
 6 files changed, 1521 insertions(+), 130 deletions(-)

-- 
1.9.3



[PATCH v3 08/11] Btrfs, replace: write raid56 parity into the replace target device

2014-11-26 Thread Miao Xie
This function reuses the code of the parity scrub; we just write the
correct or repaired parity into the target device before the parity
scrub ends.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c | 23 +++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6f82c1b..cfa449f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2311,7 +2311,9 @@ static void raid_write_parity_end_io(struct bio *bio, int err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
					 int need_check)
 {
+	struct btrfs_bio *bbio = rbio->bbio;
	void *pointers[rbio->real_stripes];
+	DECLARE_BITMAP(pbitmap, rbio->stripe_npages);
	int nr_data = rbio->nr_data;
	int stripe;
	int pagenr;
@@ -2321,6 +2323,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
	struct page *q_page = NULL;
	struct bio_list bio_list;
	struct bio *bio;
+	int is_replace = 0;
	int ret;

	bio_list_init(&bio_list);
@@ -2334,6 +2337,11 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
		BUG();
	}

+	if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
+		is_replace = 1;
+		bitmap_copy(pbitmap, rbio->dbitmap, rbio->stripe_npages);
+	}
+
/*
 * Because the higher layers(scrubber) are unlikely to
 * use this area of the disk again soon, so don't cache
@@ -2422,6 +2430,21 @@ writeback:
		goto cleanup;
	}

+	if (!is_replace)
+		goto submit_write;
+
+	for_each_set_bit(pagenr, pbitmap, rbio->stripe_npages) {
+		struct page *page;
+
+		page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+		ret = rbio_add_io_page(rbio, &bio_list, page,
+				       bbio->tgtdev_map[rbio->scrubp],
+				       pagenr, rbio->stripe_len);
+		if (ret)
+			goto cleanup;
+	}
+
+submit_write:
	nr_data = bio_list_size(&bio_list);
	if (!nr_data) {
		/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 7f95afc..0ae837f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2714,7 +2714,7 @@ static void scrub_parity_check_and_repair(struct scrub_parity *sparity)
		goto out;

	length = sparity->logic_end - sparity->logic_start + 1;
-	ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+	ret = btrfs_map_sblock(sctx->dev_root->fs_info, WRITE,
			       sparity->logic_start,
			       length, &bbio, 0, &raid_map);
	if (ret || !bbio || !raid_map)
-- 
1.9.3



[PATCH v3 01/11] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition

2014-11-26 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

bbio_ret in this condition is always non-NULL because the previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
			 BTRFS_BLOCK_GROUP_RAID6)) {
		u64 tmp;

-		if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1)
-		    && raid_map_ret) {
+		if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
			int i, rot;

			/* push stripe_nr back to the start of the full stripe */
-- 
1.9.3



[PATCH v3 02/11] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-11-26 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value is set again in a later line:
stripe_index = 0;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,

		/* push stripe_nr back to the start of the full stripe */
		stripe_nr = raid56_full_stripe_start;
-		do_div(stripe_nr, stripe_len);
-
-		stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+		do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map-num_stripes;
-- 
1.9.3



[PATCH v3 03/11] Btrfs, raid56: don't change bbio and raid_map

2014-11-26 Thread Miao Xie
Because we will reuse the bbio and raid_map during the scrub later, it is
better not to change any members of the bbio and not to free it at the
end of the IO request. So we introduce similar members into the raid bio,
and no longer access those bbio members.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
+
+   atomic_t stripes_pending;
+
+   atomic_t error;
/*
 * these are two arrays of pointers.  We allocate the
 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)

	bio_put(bio);

-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
		return;

	err = 0;

	/* OK, we have read all the stripes we need to. */
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
		err = -EIO;

	rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
	rbio->faila = -1;
	rbio->failb = -1;
	atomic_set(&rbio->refs, 1);
+	atomic_set(&rbio->error, 0);
+	atomic_set(&rbio->stripes_pending, 0);
 
/*
 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
	spin_unlock_irq(&rbio->bio_list_lock);

-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
 
/*
 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
		}
	}

-	atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-	BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);

	while (1) {
		bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
	if (rbio->faila == -1) {
		/* first failure on this rbio */
		rbio->faila = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
	} else if (rbio->failb == -1) {
		/* second failure on this rbio */
		rbio->failb = failed;
-		atomic_inc(&rbio->bbio->error);
+		atomic_inc(&rbio->error);
	} else {
		ret = -EIO;
	}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)

	bio_put(bio);

-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
		return;

	err = 0;
-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
		goto cleanup;
 
/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
	int bios_to_read = 0;
-	struct btrfs_bio *bbio = rbio->bbio;
	struct bio_list bio_list;
	int ret;
	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)

	index_rbio_pages(rbio);

-	atomic_set(&rbio->bbio->error, 0);
+	atomic_set(&rbio->error, 0);
	/*
	 * build a list of bios to read all the missing parts of this
	 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
	 * the bbio may be freed once we submit the last bio.  Make sure
	 * not to touch it after that
	 */
-	atomic_set(&bbio->stripes_pending, bios_to_read);
+	atomic_set(&rbio->stripes_pending, bios_to_read);
	while (1) {
		bio = bio_list_pop(&bio_list);
		if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
	set_bio_pages_uptodate(bio);
	bio_put(bio);

-	if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+	if (!atomic_dec_and_test(&rbio->stripes_pending))
		return;

-	if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+	if (atomic_read

[PATCH v3 06/11] Btrfs, raid56: support parity scrub on raid56

2014-11-26 Thread Miao Xie
The implementation is:
- Read and check all the data that has a checksum in the same stripe.
  All data with a checksum is COW data, so we are sure it does not
  change even though we don't hold the stripe lock: the space of that
  data can only be reclaimed after the current transaction commits, and
  only then can the fs reuse it to store other data. Since scrub holds
  the current transaction open, that data cannot be reclaimed, so it is
  safe to read and check it outside the stripe lock.
- Lock the stripe.
- Read out all the data without checksum, and the parity.
  Data without a checksum and the parity may change while we don't hold
  the stripe lock, so we must read them in the stripe-lock context.
- Check the parity.
- Recalculate the new parity and write it back if the old parity is wrong.
- Unlock the stripe.

If we cannot read out the data, or the data we read is corrupted, we
try to repair it. If the repair fails, we mark the horizontal sub-stripe
(the pages at the same horizontal offset) as corrupted, and we skip the
parity check and repair of that horizontal sub-stripe.

And in order to skip horizontal sub-stripes that have no data, we
introduce a bitmap. If there is data on a horizontal sub-stripe, we set
the relative bit to 1, and when we check and repair the parity, we skip
the horizontal sub-stripes whose relative bits are 0.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 - v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 - v2:
- None.
---
 fs/btrfs/raid56.c | 514 -
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 609 --
 3 files changed, 1115 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index bfc406d..3b99cbc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -65,6 +65,7 @@
 enum btrfs_rbio_ops {
BTRFS_RBIO_WRITE= 0,
BTRFS_RBIO_READ_REBUILD = 1,
+   BTRFS_RBIO_PARITY_SCRUB = 2,
 };
 
 struct btrfs_raid_bio {
@@ -123,6 +124,7 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int stripe_npages;
/*
 * set if we're doing a parity rebuild
 * for a read from higher up, which is handled
@@ -137,6 +139,7 @@ struct btrfs_raid_bio {
/* second bad stripe (for raid6 use) */
int failb;
 
+   int scrubp;
/*
 * number of pages needed to represent the full
 * stripe
@@ -171,6 +174,11 @@ struct btrfs_raid_bio {
 * here for faster lookup
 */
struct page **bio_pages;
+
+   /*
+* bitmap to record which horizontal stripe has data
+*/
+   unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -185,6 +193,10 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+int need_check);
+static void async_scrub_parity(struct btrfs_raid_bio *rbio);
+
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -586,10 +598,20 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
	    cur->raid_map[0])
		return 0;

-	/* reads can't merge with writes */
-	if (last->operation != cur->operation) {
+	/* we can't merge with different operations */
+	if (last->operation != cur->operation)
+		return 0;
+	/*
+	 * We need to read the full stripe from the drive,
+	 * check and repair the parity and write the new results.
+	 *
+	 * We're not allowed to add any new bios to the
+	 * bio list here, anyone else that wants to
+	 * change this stripe needs to do their own rmw.
+	 */
+	if (last->operation == BTRFS_RBIO_PARITY_SCRUB ||
+	    cur->operation == BTRFS_RBIO_PARITY_SCRUB)
		return 0;
-	}

	return 1;
}
@@ -782,9 +804,12 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)

	if (next->operation == BTRFS_RBIO_READ_REBUILD)
		async_read_rebuild(next);
-	else if (next->operation == BTRFS_RBIO_WRITE){
+	else if (next->operation == BTRFS_RBIO_WRITE) {
		steal_rbio(rbio, next);
		async_rmw_stripe(next);
	} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB

[PATCH v3 11/11] Btrfs, replace: enable dev-replace for raid56

2014-11-26 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 894796a..9f6a464 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
	struct btrfs_device *tgt_device = NULL;
	struct btrfs_device *src_device = NULL;

-	if (btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-		return -EOPNOTSUPP;
-	}
-
	switch (args->start.cont_reading_from_srcdev_mode) {
	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
	case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3



[PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Miao Xie
The increase/decrease of the bio counter is on the I/O path, so we
should use io_schedule() instead of schedule(); otherwise the deadlock
might be triggered by the pending I/O in the plug list. io_schedule()
can help us because it flushes all the pending I/O before the task
goes to sleep.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 - v3:
- New patch to fix possible deadlock caused by the pending bios in the
  plug list when the io submitters were going to sleep.

Changelog v1 - v2:
- None.
---
 fs/btrfs/dev-replace.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index fa27b4e..894796a 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,16 +928,23 @@ void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
	wake_up(&fs_info->replace_wait);
 }
 
+#define btrfs_wait_event_io(wq, condition) \
+do {   \
+   if (condition)  \
+   break;  \
+   (void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,  \
+   io_schedule()); \
+} while (0)
+
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 {
-	DEFINE_WAIT(wait);
 again:
	percpu_counter_inc(&fs_info->bio_counter);
	if (test_bit(BTRFS_FS_STATE_DEV_REPLACING, &fs_info->fs_state)) {
		btrfs_bio_counter_dec(fs_info);
-		wait_event(fs_info->replace_wait,
-			   !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
-				     &fs_info->fs_state));
+		btrfs_wait_event_io(fs_info->replace_wait,
+				    !test_bit(BTRFS_FS_STATE_DEV_REPLACING,
+					      &fs_info->fs_state));
		goto again;
	}
 
-- 
1.9.3



[PATCH v3 07/11] Btrfs, replace: write dirty pages into the replace target device

2014-11-26 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in the btrfs bio, and we sort those target
  device stripes in the array. We also keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, no other operation takes
  the target device stripes into account.
- When we do a write operation, we read the data from the common devices
  and calculate the parity. Then we write the dirty data and the new
  parity out; at this time, we find the relative replace target stripes
  and write the relative data into them.

Note: The function that copies the old data on the source device to
the target device was implemented in the past; it is similar to the
other RAID types.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3b99cbc..6f82c1b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -124,6 +124,8 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int real_stripes;
+
int stripe_npages;
/*
 * set if we're doing a parity rebuild
@@ -631,7 +633,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-	if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+	if (rbio->nr_data + 1 == rbio->real_stripes)
		return NULL;

	index += ((rbio->nr_data + 1) * rbio->stripe_len) >> PAGE_CACHE_SHIFT;
@@ -974,7 +976,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 {
	struct btrfs_raid_bio *rbio;
	int nr_data = 0;
-	int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+	int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+	int num_pages = rbio_nr_pages(stripe_len, real_stripes);
	int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
	void *p;

@@ -994,6 +997,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
	rbio->fs_info = root->fs_info;
	rbio->stripe_len = stripe_len;
	rbio->nr_pages = num_pages;
+	rbio->real_stripes = real_stripes;
	rbio->stripe_npages = stripe_npages;
	rbio->faila = -1;
	rbio->failb = -1;
@@ -1010,10 +1014,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
	rbio->bio_pages = p + sizeof(struct page *) * num_pages;
	rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;

-	if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-		nr_data = bbio->num_stripes - 2;
+	if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+		nr_data = real_stripes - 2;
	else
-		nr_data = bbio->num_stripes - 1;
+		nr_data = real_stripes - 1;

	rbio->nr_data = nr_data;
	return rbio;
@@ -1125,7 +1129,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
	if (rbio->faila >= 0 || rbio->failb >= 0) {
-		BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+		BUG_ON(rbio->faila == rbio->real_stripes - 1);
		__raid56_parity_recover(rbio);
	} else {
		finish_rmw(rbio);
@@ -1186,7 +1190,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
	struct btrfs_bio *bbio = rbio->bbio;
-	void *pointers[bbio->num_stripes];
+	void *pointers[rbio->real_stripes];
	int stripe_len = rbio->stripe_len;
	int nr_data = rbio->nr_data;
	int stripe;
@@ -1200,11 +1204,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)

	bio_list_init(&bio_list);

-	if (bbio->num_stripes - rbio->nr_data == 1) {
-		p_stripe = bbio->num_stripes - 1;
-	} else if (bbio->num_stripes - rbio->nr_data == 2) {
-		p_stripe = bbio->num_stripes - 2;
-		q_stripe = bbio->num_stripes - 1;
+	if (rbio->real_stripes - rbio->nr_data == 1) {
+		p_stripe = rbio->real_stripes - 1;
+	} else if (rbio->real_stripes - rbio->nr_data == 2) {
+		p_stripe = rbio->real_stripes - 2;
+		q_stripe = rbio->real_stripes - 1;
	} else {
		BUG();
	}
@@ -1261,7 +1265,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
		SetPageUptodate(p);
		pointers[stripe++] = kmap(p);

-		raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE

[PATCH v3 04/11] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-11-26 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to the scrub on the other RAID levels such as
RAID1; the difference is that we don't read the data from a mirror, we
use the data repair function of RAID5/6.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v3:
- None.
---
 fs/btrfs/raid56.c  |  42 +---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..6013d88 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
		remove_rbio_from_cache(rbio);
 }

+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+	if (need) {
+		kfree(raid_map);
+		kfree(bbio);
+	}
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+	__free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+			!test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
	int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
			rbio->stripe_pages[i] = NULL;
		}
	}
-	kfree(rbio->raid_map);
-	kfree(rbio->bbio);
+
+	free_bbio_and_raid_map(rbio);
+
	kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,

	rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
			GFP_NOFS);
-	if (!rbio) {
-		kfree(raid_map);
-		kfree(bbio);
+	if (!rbio)
		return ERR_PTR(-ENOMEM);
-	}

	bio_list_init(&rbio->bio_list);
	INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
	struct blk_plug_cb *cb;

	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		__free_bbio_and_raid_map(bbio, raid_map, 1);
		return PTR_ERR(rbio);
+	}
	bio_list_add(&rbio->bio_list, bio);
	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
			  struct btrfs_bio *bbio, u64 *raid_map,
-			  u64 stripe_len, int mirror_num)
+			  u64 stripe_len, int mirror_num, int hold_bbio)
 {
	struct btrfs_raid_bio *rbio;
	int ret;

	rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-	if (IS_ERR(rbio))
+	if (IS_ERR(rbio)) {
+		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
		return PTR_ERR(rbio);
+	}

+	if (hold_bbio)
+		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
	rbio->read_rebuild = 1;
	bio_list_add(&rbio->bio_list, bio);
	rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
	rbio->faila = find_logical_bio_stripe(rbio, bio);
	if (rbio->faila == -1) {
		BUG();
-		kfree(raid_map);
-		kfree(bbio);
+		__free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
		kfree(rbio);
		return -EIO;
	}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 struct btrfs_bio *bbio, u64 *raid_map,
-u64 stripe_len, int mirror_num);
+u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
   struct btrfs_bio *bbio, u64 *raid_map,
   u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
+struct scrub_recover {
+   atomic_t refs;
+   struct btrfs_bio *bbio;
+   u64 *raid_map;
+   u64 map_length;
+};
+
 struct scrub_page {
struct scrub_block  *sblock;

Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-11-26 Thread Miao Xie
On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
 On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
 On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 The increase/decrease of bio counter is on the I/O path, so we should
 use io_schedule() instead of schedule(), or the deadlock might be
 triggered by the pending I/O in the plug list. io_schedule() can help
 us because it will flush all the pending I/O before the task is going
 to sleep.

 Can you please describe this deadlock in more detail?  schedule() also 
 triggers
 a flush of the plug list, and if that's no longer sufficient we can run into 
 other
 problems (especially with preemption on).
 
 Sorry for my mistake. I forgot to check the current implementation of 
 schedule(), which flushes the plug list unconditionally. Please ignore this 
 patch.

I have updated my raid56-scrub-replace branch, please re-pull the branch.

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

 
 Thanks
 Miao
 

 -chris


 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 



[PATCH v2 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-11-24 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to scrub on the other RAID levels, such as
RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v2:
- Remove the redundant prefix underscores of the function names to make
  them obey the common pattern of the source in raid56.c
---
 fs/btrfs/raid56.c  |  42 +---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..6013d88 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   __free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio->stripe_pages[i] = NULL;
}
}
-   kfree(rbio->raid_map);
-   kfree(rbio->bbio);
+
+   free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(&rbio->bio_list);
INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
rbio->read_rebuild = 1;
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 struct btrfs_bio *bbio, u64 *raid_map,
-u64 stripe_len, int mirror_num);
+u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
   struct btrfs_bio *bbio, u64 *raid_map,
   u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
+struct scrub_recover {
+   atomic_t refs;
+   struct btrfs_bio *bbio;
+   u64 *raid_map;

Re: [PATCH] Btrfs: make sure we wait on logged extents when fsycning two subvols

2014-11-20 Thread Miao Xie
On Thu, 6 Nov 2014 10:19:54 -0500, Josef Bacik wrote:
 If we have two fsync()'s race on different subvols one will do all of its work
 to get into the log_tree, wait on it's outstanding IO, and then allow the
 log_tree to finish it's commit.  The problem is we were just free'ing that
 subvols logged extents instead of waiting on them, so whoever lost the race
 wouldn't really have their data on disk.  Fix this by waiting properly instead
 of freeing the logged extents.  Thanks,
 
 cc: sta...@vger.kernel.org
 Signed-off-by: Josef Bacik jba...@fb.com
 ---
  fs/btrfs/tree-log.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
 index 2d0fa43..70f99b1 100644
 --- a/fs/btrfs/tree-log.c
 +++ b/fs/btrfs/tree-log.c
 @@ -2600,9 +2600,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
  if (atomic_read(&log_root_tree->log_commit[index2])) {
  blk_finish_plug(&plug);
  btrfs_wait_marked_extents(log, &log->dirty_log_pages, mark);
 + btrfs_wait_logged_extents(log, log_transid);

Why not add this log root into a list on the log root tree, and then have the
committer wait for all the ordered extents in each log root on that list? That
way, the committer can do some work while the data of the ordered extents is
being transferred to the disk.

Thanks
Miao

   wait_log_commit(trans, log_root_tree,
   root_log_ctx.log_transid);
 - btrfs_free_logged_extents(log, log_transid);
  mutex_unlock(&log_root_tree->log_mutex);
   ret = root_log_ctx.log_ret;
   goto out;
 



[PATCH 0/9] Implement device scrub/replace for RAID56

2014-11-14 Thread Miao Xie
This patchset implements the device scrub/replace function for RAID56. Most
of the common data handling is similar to the other RAID types; the
difference, and the difficulty, is the parity processing. To avoid the
problem that the data may easily change outside the stripe lock, we do most
of the work in the RAID56 stripe lock context.

And in order to avoid making the code more and more complex, we copied some
of the common data processing code for the parity; the cleanup work is on my
TODO list.

We have done some tests and the patchset worked well. Of course, more tests
are welcome. If you are interested in using or testing it, you can pull
the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

Miao Xie (6):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs,raid56: use a variant to record the operation type
  Btrfs,raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/dev-replace.c |   5 -
 fs/btrfs/raid56.c  | 711 +++-
 fs/btrfs/raid56.h  |  14 +-
 fs/btrfs/scrub.c   | 793 +++--
 fs/btrfs/volumes.c |  47 ++-
 fs/btrfs/volumes.h |  14 +-
 6 files changed, 1471 insertions(+), 113 deletions(-)

-- 
1.9.3



[PATCH 6/9] Btrfs,raid56: support parity scrub on raid56

2014-11-14 Thread Miao Xie
The implementation is:
- Read and check all the data with checksums in the same stripe.
  All the data which has a checksum is COW data, and we are sure that
  it has not changed even though we don't hold the stripe lock, because
  the space of that data can only be reclaimed after the current
  transaction is committed, and only then can the fs reuse it to store
  other data. Since scrub holds the current transaction open, that data
  cannot be overwritten, so it is safe to read and check it outside the
  stripe lock.
- Lock the stripe.
- Read out all the data without checksums, and the parity.
  The data without checksums and the parity may change if we don't hold
  the stripe lock, so we need to read them in the stripe lock context.
- Check the parity.
- Re-calculate the new parity and write it back if the old parity is
  not right.
- Unlock the stripe.

If we can not read out the data, or the data we read is corrupted, we
will try to repair it. If the repair fails, we will mark the horizontal
sub-stripe (pages at the same horizontal position) as a corrupted
sub-stripe, and we will skip the parity check and repair of that
horizontal sub-stripe.

And in order to skip the horizontal sub-stripes that have no data, we
introduce a bitmap. If there is some data on a horizontal sub-stripe,
we set the relative bit to 1, and when we check and repair the parity,
we skip those horizontal sub-stripes whose relative bits are 0.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/raid56.c | 500 -
 fs/btrfs/raid56.h |  12 ++
 fs/btrfs/scrub.c  | 599 +-
 3 files changed, 1099 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d550e9b..a13eb1b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -65,6 +65,7 @@
 enum btrfs_rbio_ops {
BTRFS_RBIO_WRITE= 0,
BTRFS_RBIO_READ_REBUILD = 1,
+   BTRFS_RBIO_PARITY_SCRUB = 2,
 };
 
 struct btrfs_raid_bio {
@@ -123,6 +124,7 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int stripe_npages;
/*
 * set if we're doing a parity rebuild
 * for a read from higher up, which is handled
@@ -137,6 +139,7 @@ struct btrfs_raid_bio {
/* second bad stripe (for raid6 use) */
int failb;
 
+   int scrubp;
/*
 * number of pages needed to represent the full
 * stripe
@@ -171,6 +174,11 @@ struct btrfs_raid_bio {
 * here for faster lookup
 */
struct page **bio_pages;
+
+   /*
+* bitmap to record which horizontal stripe has data
+*/
+   unsigned long *dbitmap;
 };
 
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
@@ -185,6 +193,8 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
+static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
+int need_check);
 /*
  * the stripe hash table is used for locking, and to collect
  * bios in hopes of making a full stripe
@@ -950,9 +960,11 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
struct btrfs_raid_bio *rbio;
int nr_data = 0;
int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+   int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
-   rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
+   rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2 +
+  DIV_ROUND_UP(stripe_npages, BITS_PER_LONG / 8),
GFP_NOFS);
if (!rbio)
return ERR_PTR(-ENOMEM);
@@ -967,6 +979,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio->fs_info = root->fs_info;
rbio->stripe_len = stripe_len;
rbio->nr_pages = num_pages;
+   rbio->stripe_npages = stripe_npages;
rbio->faila = -1;
rbio->failb = -1;
atomic_set(&rbio->refs, 1);
@@ -980,6 +993,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
p = rbio + 1;
rbio->stripe_pages = p;
rbio->bio_pages = p + sizeof(struct page *) * num_pages;
+   rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
nr_data = bbio->num_stripes - 2;
@@ -1774,6 +1788,14 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
index_rbio_pages(rbio);
 
for (pagenr = 0; pagenr < nr_pages; pagenr++) {
+   /*
+* Now we just use bitmap to mark the horizontal stripes in
+* which we have data when doing parity scrub.
+*/
+   if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB

[PATCH 1/9] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition

2014-11-14 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

bbio_ret in this condition is always !NULL because the previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f61278f..41b0dff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5162,8 +5162,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
BTRFS_BLOCK_GROUP_RAID6)) {
u64 tmp;
 
-   if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1) &&
-   raid_map_ret) {
+   if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
int i, rot;
 
/* push stripe_nr back to the start of the full stripe */
-- 
1.9.3



[PATCH 5/9] Btrfs,raid56: use a variant to record the operation type

2014-11-14 Thread Miao Xie
We will introduce a new operation type later. If we keep using an integer
variable as a boolean to record the operation type, we would have to add a
new variable and increase the size of the raid bio structure, which is not
good. With this patch, we define a different number for each operation, so
we can record the operation type in a single variable.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/raid56.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index b3e9c76..d550e9b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -62,6 +62,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+   BTRFS_RBIO_WRITE= 0,
+   BTRFS_RBIO_READ_REBUILD = 1,
+};
+
 struct btrfs_raid_bio {
struct btrfs_fs_info *fs_info;
struct btrfs_bio *bbio;
@@ -124,7 +129,7 @@ struct btrfs_raid_bio {
 * differently from a parity rebuild as part of
 * rmw
 */
-   int read_rebuild;
+   enum btrfs_rbio_ops operation;
 
/* first bad stripe */
int faila;
@@ -147,7 +152,6 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
-
atomic_t stripes_pending;
 
atomic_t error;
@@ -583,8 +587,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
return 0;
 
/* reads can't merge with writes */
-   if (last->read_rebuild !=
-   cur->read_rebuild) {
+   if (last->operation != cur->operation) {
return 0;
}
 
@@ -777,9 +780,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio 
*rbio)
spin_unlock(&rbio->bio_list_lock);
spin_unlock_irqrestore(&h->lock, flags);
 
-   if (next->read_rebuild)
+   if (next->operation == BTRFS_RBIO_READ_REBUILD)
async_read_rebuild(next);
-   else {
+   else if (next->operation == BTRFS_RBIO_WRITE) {
steal_rbio(rbio, next);
async_rmw_stripe(next);
}
@@ -1713,6 +1716,7 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
}
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
+   rbio->operation = BTRFS_RBIO_WRITE;
 
/*
 * don't plug on full rbios, just get them out the door
@@ -1761,7 +1765,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
faila = rbio->faila;
failb = rbio->failb;

-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
spin_lock_irq(&rbio->bio_list_lock);
set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
spin_unlock_irq(&rbio->bio_list_lock);
@@ -1778,7 +1782,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio 
*rbio)
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1871,7 +1875,7 @@ pstripe:
 * know they can be trusted.  If this was a read reconstruction,
 * other endio functions will fiddle the uptodate bits
 */
-   if (!rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_WRITE) {
for (i = 0;  i < nr_pages; i++) {
if (faila != -1) {
page = rbio_stripe_page(rbio, faila, i);
@@ -1888,7 +1892,7 @@ pstripe:
 * if we're rebuilding a read, we have to use
 * pages from the bio list
 */
-   if (rbio->read_rebuild &&
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
(stripe == faila || stripe == failb)) {
page = page_in_rbio(rbio, stripe, pagenr, 0);
} else {
@@ -1904,7 +1908,7 @@ cleanup:
 
 cleanup_io:
 
-   if (rbio->read_rebuild) {
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
if (err == 0)
cache_rbio_pages(rbio);
else
@@ -2042,7 +2046,7 @@ out:
return 0;
 
 cleanup:
-   if (rbio->read_rebuild)
+   if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
rbio_orig_end_io(rbio, -EIO, 0);
return -EIO;
 }
@@ -2068,7 +2072,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
 
if (hold_bbio)
set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);

[PATCH 4/9] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-11-14 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to scrub on the other RAID levels, such as
RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/raid56.c  |  42 +---
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 226 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c54b0e6..b3e9c76 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,8 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +801,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+___free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void __free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+   ___free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+   !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +834,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
rbio->stripe_pages[i] = NULL;
}
}
-   kfree(rbio->raid_map);
-   kfree(rbio->bbio);
+
+   __free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +951,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
bio_list_init(&rbio->bio_list);
INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1707,10 @@ int raid56_parity_write(struct btrfs_root *root, struct 
bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   ___free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -2038,15 +2055,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   ___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+   if (hold_bbio)
+   set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
rbio->read_rebuild = 1;
bio_list_add(&rbio->bio_list, bio);
rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2075,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct 
bio *bio,
rbio->faila = find_logical_bio_stripe(rbio, bio);
if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   ___free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 struct btrfs_bio *bbio, u64 *raid_map,
-u64 stripe_len, int mirror_num);
+u64 stripe_len, int mirror_num, int hold_bbio);
 int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
   struct btrfs_bio *bbio, u64 *raid_map,
   u64 stripe_len);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index efa0831..ca4b9eb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -63,6 +63,13 @@ struct scrub_ctx;
  */
 #define SCRUB_MAX_PAGES_PER_BLOCK  16  /* 64k per node/leaf/sector */
 
+struct scrub_recover {
+   atomic_t refs;
+   struct btrfs_bio *bbio;
+   u64 *raid_map;
+   u64 map_length;
+};
+
 struct scrub_page {
struct scrub_block  *sblock;
struct page

[PATCH 9/9] Btrfs, replace: enable dev-replace for raid56

2014-11-14 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 6f662b3..6aa835c 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-   if (btrfs_fs_incompat(fs_info, RAID56)) {
-   btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-   return -EOPNOTSUPP;
-   }
-
switch (args-start.cont_reading_from_srcdev_mode) {
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3



[PATCH 2/9] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-11-14 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value was set again by a later line:
stripe_index = 0;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41b0dff..66d5035 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,9 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
 
/* push stripe_nr back to the start of the full stripe */
stripe_nr = raid56_full_stripe_start;
-   do_div(stripe_nr, stripe_len);
-
-   stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+   do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map-num_stripes;
-- 
1.9.3



[PATCH 8/9] Btrfs, replace: write raid56 parity into the replace target device

2014-11-14 Thread Miao Xie
This function reuses the parity scrub code; we just write the right or
corrected parity into the target device before the parity scrub ends.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/raid56.c | 23 +++
 fs/btrfs/scrub.c  |  2 +-
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7ad9546a..b69c01f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2305,7 +2305,9 @@ static void raid_write_parity_end_io(struct bio *bio, int 
err)
 static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 int need_check)
 {
+   struct btrfs_bio *bbio = rbio->bbio;
void *pointers[rbio->real_stripes];
+   DECLARE_BITMAP(pbitmap, rbio->stripe_npages);
int nr_data = rbio->nr_data;
int stripe;
int pagenr;
@@ -2315,6 +2317,7 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
struct page *q_page = NULL;
struct bio_list bio_list;
struct bio *bio;
+   int is_replace = 0;
int ret;
 
bio_list_init(&bio_list);
@@ -2328,6 +2331,11 @@ static noinline void finish_parity_scrub(struct 
btrfs_raid_bio *rbio,
BUG();
}
 
+   if (bbio->num_tgtdevs && bbio->tgtdev_map[rbio->scrubp]) {
+   is_replace = 1;
+   bitmap_copy(pbitmap, rbio->dbitmap, rbio->stripe_npages);
+   }
+
/*
 * Because the higher layers(scrubber) are unlikely to
 * use this area of the disk again soon, so don't cache
@@ -2416,6 +2424,21 @@ writeback:
goto cleanup;
}
 
+   if (!is_replace)
+   goto submit_write;
+
+   for_each_set_bit(pagenr, pbitmap, rbio->stripe_npages) {
+   struct page *page;
+
+   page = rbio_stripe_page(rbio, rbio->scrubp, pagenr);
+   ret = rbio_add_io_page(rbio, &bio_list, page,
+  bbio->tgtdev_map[rbio->scrubp],
+  pagenr, rbio->stripe_len);
+   if (ret)
+   goto cleanup;
+   }
+
+submit_write:
nr_data = bio_list_size(&bio_list);
if (!nr_data) {
/* Every parity is right */
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3ef1e1b..f690c8f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2710,7 +2710,7 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
goto out;
 
length = sparity->logic_end - sparity->logic_start + 1;
-   ret = btrfs_map_sblock(sctx->dev_root->fs_info, REQ_GET_READ_MIRRORS,
+   ret = btrfs_map_sblock(sctx->dev_root->fs_info, WRITE,
   sparity->logic_start,
   length, &bbio, 0, &raid_map);
if (ret || !bbio || !raid_map)
-- 
1.9.3



[PATCH 7/9] Btrfs, replace: write dirty pages into the replace target device

2014-11-14 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in the btrfs bio, and we sort those target
  device stripes in the array. We also keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, no other operation takes
  the target device stripes into account.
- When we do a write operation, we read the data from the common
  devices and calculate the parity. Then we write the dirty data and
  the new parity out; at this point, we find the relative replace
  target stripes and write the relative data into them.

Note: The function that copies the old data on the source device to
the target device was implemented in the past; it is similar to the
other RAID types.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index a13eb1b..7ad9546a 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -124,6 +124,8 @@ struct btrfs_raid_bio {
/* number of data stripes (no p/q) */
int nr_data;
 
+   int real_stripes;
+
int stripe_npages;
/*
 * set if we're doing a parity rebuild
@@ -619,7 +621,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio 
*rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-   if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+   if (rbio->nr_data + 1 == rbio->real_stripes)
return NULL;
 
index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
@@ -959,7 +961,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
 {
struct btrfs_raid_bio *rbio;
int nr_data = 0;
-   int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+   int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+   int num_pages = rbio_nr_pages(stripe_len, real_stripes);
int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
void *p;
 
@@ -979,6 +982,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root 
*root,
rbio->fs_info = root->fs_info;
rbio->stripe_len = stripe_len;
rbio->nr_pages = num_pages;
+   rbio->real_stripes = real_stripes;
rbio->stripe_npages = stripe_npages;
rbio-faila = -1;
rbio-failb = -1;
@@ -995,10 +999,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct 
btrfs_root *root,
rbio->bio_pages = p + sizeof(struct page *) * num_pages;
rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-   if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-   nr_data = bbio->num_stripes - 2;
+   if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+   nr_data = real_stripes - 2;
else
-   nr_data = bbio->num_stripes - 1;
+   nr_data = real_stripes - 1;
 
rbio->nr_data = nr_data;
return rbio;
@@ -1110,7 +1114,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
if (rbio->faila >= 0 || rbio->failb >= 0) {
-   BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+   BUG_ON(rbio->faila == rbio->real_stripes - 1);
__raid56_parity_recover(rbio);
} else {
finish_rmw(rbio);
@@ -1171,7 +1175,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
	struct btrfs_bio *bbio = rbio->bbio;
-   void *pointers[bbio->num_stripes];
+   void *pointers[rbio->real_stripes];
	int stripe_len = rbio->stripe_len;
	int nr_data = rbio->nr_data;
int stripe;
@@ -1185,11 +1189,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 
	bio_list_init(&bio_list);
 
-   if (bbio->num_stripes - rbio->nr_data == 1) {
-   p_stripe = bbio->num_stripes - 1;
-   } else if (bbio->num_stripes - rbio->nr_data == 2) {
-   p_stripe = bbio->num_stripes - 2;
-   q_stripe = bbio->num_stripes - 1;
+   if (rbio->real_stripes - rbio->nr_data == 1) {
+   p_stripe = rbio->real_stripes - 1;
+   } else if (rbio->real_stripes - rbio->nr_data == 2) {
+   p_stripe = rbio->real_stripes - 2;
+   q_stripe = rbio->real_stripes - 1;
} else {
BUG();
}
@@ -1246,7 +1250,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
SetPageUptodate(p);
pointers[stripe++] = kmap(p);
 
-   raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+   raid6_call.gen_syndrome(rbio->real_stripes, PAGE_SIZE,

[PATCH 3/9] Btrfs, raid56: don't change bbio and raid_map

2014-11-14 Thread Miao Xie
Because we will reuse the bbio and raid_map during the scrub later, it is
better that we don't change any member of the bbio and don't free it at
the end of the IO request. So we introduce corresponding members in the
raid bio, and don't access those bbio members any more.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 6a41631..c54b0e6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
atomic_t refs;
 
+
+   atomic_t stripes_pending;
+
+   atomic_t error;
/*
 * these are two arrays of pointers.  We allocate the
 * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
 
/* OK, we have read all the stripes we need to. */
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
err = -EIO;
 
rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
rbio->faila = -1;
rbio->failb = -1;
atomic_set(&rbio->refs, 1);
+   atomic_set(&rbio->error, 0);
+   atomic_set(&rbio->stripes_pending, 0);
 
/*
 * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
	spin_unlock_irq(&rbio->bio_list_lock);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
 
/*
 * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
}
}
 
-   atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-   BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+   atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+   BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
while (1) {
	bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
	if (rbio->faila == -1) {
		/* first failure on this rbio */
		rbio->faila = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
	} else if (rbio->failb == -1) {
		/* second failure on this rbio */
		rbio->failb = failed;
-   atomic_inc(&rbio->bbio->error);
+   atomic_inc(&rbio->error);
} else {
ret = -EIO;
}
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
return;
 
err = 0;
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
goto cleanup;
 
/*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
int bios_to_read = 0;
-   struct btrfs_bio *bbio = rbio->bbio;
struct bio_list bio_list;
int ret;
	int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
index_rbio_pages(rbio);
 
-   atomic_set(&rbio->bbio->error, 0);
+   atomic_set(&rbio->error, 0);
/*
 * build a list of bios to read all the missing parts of this
 * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 * the bbio may be freed once we submit the last bio.  Make sure
 * not to touch it after that
 */
-   atomic_set(&bbio->stripes_pending, bios_to_read);
+   atomic_set(&rbio->stripes_pending, bios_to_read);
	while (1) {
		bio = bio_list_pop(&bio_list);
if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
set_bio_pages_uptodate(bio);
bio_put(bio);
 
-   if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+   if (!atomic_dec_and_test(&rbio->stripes_pending))
		return;
 
-   if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+   if (atomic_read(&rbio->error) > rbio->bbio->max_errors)

Re: [PATCH] Btrfs: fix incorrect compression ratio detection

2014-11-09 Thread Miao Xie
On Tue, 7 Oct 2014 18:44:35 -0400, Wang Shilong wrote:
 Steps to reproduce:
  # mkfs.btrfs -f /dev/sdb
  # mount -t btrfs /dev/sdb /mnt -o compress=lzo
  # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
 
 After the previous steps, the inode will be detected as having a bad
 compression ratio, and the NOCOMPRESS flag will be set for that inode.
 
 The reason is that compression has a per-pass limit of 128K; if a 132k
 write comes in, it will be split into two writes (128k+4k). This bug is
 a leftover from commit 68bb462d42a (Btrfs: don't compress for a small write).
 
 Fix this problem by checking before each compression pass: if it is a
 small write (<= blocksize), we bail out and fall into nocompression directly.
 
 Signed-off-by: Wang Shilong wangshilong1...@gmail.com

Looks good.

Reviewed-by: Miao Xie mi...@cn.fujitsu.com

 ---
  fs/btrfs/inode.c | 16 
  1 file changed, 8 insertions(+), 8 deletions(-)
 
 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index 344a322..b78e90a 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -411,14 +411,6 @@ static noinline int compress_file_range(struct inode *inode,
   (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
   btrfs_add_inode_defrag(NULL, inode);
  
 - /*
 -  * skip compression for a small file range(<=blocksize) that
 -  * isn't an inline extent, since it dosen't save disk space at all.
 -  */
 - if ((end - start + 1) <= blocksize &&
 - (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
 - goto cleanup_and_bail_uncompressed;
 -
   actual_end = min_t(u64, isize, end + 1);
  again:
   will_compress = 0;
 @@ -440,6 +432,14 @@ again:
  
   total_compressed = actual_end - start;
  
 + /*
 +  * skip compression for a small file range(<=blocksize) that
 +  * isn't an inline extent, since it dosen't save disk space at all.
 +  */
 + if (total_compressed <= blocksize &&
 +    (start > 0 || end + 1 < BTRFS_I(inode)->disk_i_size))
 + goto cleanup_and_bail_uncompressed;
 +
   /* we want to make sure that amount of ram required to uncompress
* an extent is reasonable, so we limit the total size in ram
* of a compressed extent to 128k.  This is a crucial number
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: don't do async reclaim during log replay V2

2014-11-06 Thread Miao Xie
On Thu, 6 Nov 2014 09:39:19 -0500, Josef Bacik wrote:
 On 10/23/2014 04:44 AM, Miao Xie wrote:
 On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:
 Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
 during log replay.  This is because we use fs_info->fs_root as our root for
 shrinking and such.  Technically we can use whatever root we want, but let's
 just not allow async reclaim while we're doing log replay.  Thanks,

 Why not move the code of fs_root initialization to the front of log replay?
 I think it is better than the fix way in this patch because the async 
 reclaimer
 can help us do some work.

 
 Because this is simpler.  We could move the initialization forward, but then 
 say somebody comes and adds some other dependency to the async reclaim stuff 
 in the future and doesn't think about log replay and suddenly some poor sap's 
 box panics on mount.  Log replay is a known quantity, we don't have to worry 
 about enospc, so lets make it as simple as possible.  Thanks,

Yes, you are right.

So this patch looks good.

Reviewed-by: Miao Xie mi...@cn.fujitsu.com



Re: [PATCH] Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2

2014-11-04 Thread Miao Xie
On Mon, 3 Nov 2014 08:56:50 -0500, Josef Bacik wrote:
 Our gluster boxes get several thousand statfs() calls per second, which begins
 to suck hardcore with all of the lock contention on the chunk mutex and dev 
 list
 mutex.  We don't really need to hold these things, if we have transient
 weirdness with statfs() because of the chunk allocator we don't care, so 
 remove
 this locking.
 
 We still need the dev_list lock if you mount with -o alloc_start however, 
 which
 is a good argument for nuking that thing from orbit, but that's a patch for
 another day.  Thanks,
 
 Signed-off-by: Josef Bacik jba...@fb.com
 ---
 V1->V2: make sure ->alloc_start is set before doing the dev extent lookup 
 logic.

I am confused about why we need dev_list_lock if we mount with -o alloc_start.
AFAIK, ->alloc_start is protected by chunk_mutex.

But I think we needn't care that someone changes ->alloc_start; in other words,
we needn't take chunk_mutex during the whole process. The following case can be
tolerated by the users, I think.

Task1   Task2
statfs
  mutex_lock(&fs_info->chunk_mutex);
  tmp = fs_info->alloc_start;
  mutex_unlock(&fs_info->chunk_mutex);
  btrfs_calc_avail_data_space(fs_info, tmp)
...
mount -o 
remount,alloc_start=
...

Thanks
Miao

 
  fs/btrfs/super.c | 72 
 
  1 file changed, 47 insertions(+), 25 deletions(-)
 
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 54bd91e..dc337d1 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1644,8 +1644,20 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
   int i = 0, nr_devices;
   int ret;
  
 + /*
 +  * We aren't under the device list lock, so this is racey-ish, but good
 +  * enough for our purposes.
 +  */
   nr_devices = fs_info->fs_devices->open_devices;
 - BUG_ON(!nr_devices);
 + if (!nr_devices) {
 + smp_mb();
 + nr_devices = fs_info->fs_devices->open_devices;
 + ASSERT(nr_devices);
 + if (!nr_devices) {
 + *free_bytes = 0;
 + return 0;
 + }
 + }
  
   devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
  GFP_NOFS);
 @@ -1670,11 +1682,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
   else
   min_stripe_size = BTRFS_STRIPE_LEN;
  
 - list_for_each_entry(device, &fs_devices->devices, dev_list) {
 + if (fs_info->alloc_start)
 + mutex_lock(&fs_devices->device_list_mutex);
 + rcu_read_lock();
 + list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
   if (!device->in_fs_metadata || !device->bdev ||
   device->is_tgtdev_for_dev_replace)
   continue;
  
 + if (i >= nr_devices)
 + break;
 +
   avail_space = device->total_bytes - device->bytes_used;
  
   /* align with stripe_len */
 @@ -1689,24 +1707,32 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
   skip_space = 1024 * 1024;
  
   /* user can set the offset in fs_info->alloc_start. */
 - if (fs_info->alloc_start + BTRFS_STRIPE_LEN <=
 - device->total_bytes)
 + if (fs_info->alloc_start &&
 + fs_info->alloc_start + BTRFS_STRIPE_LEN <=
 + device->total_bytes) {
 + rcu_read_unlock();
   skip_space = max(fs_info->alloc_start, skip_space);
  
 - /*
 -  * btrfs can not use the free space in [0, skip_space - 1],
 -  * we must subtract it from the total. In order to implement
 -  * it, we account the used space in this range first.
 -  */
 - ret = btrfs_account_dev_extents_size(device, 0, skip_space - 1,
 -  &used_space);
 - if (ret) {
 - kfree(devices_info);
 - return ret;
 - }
 + /*
 +  * btrfs can not use the free space in
 +  * [0, skip_space - 1], we must subtract it from the
 +  * total. In order to implement it, we account the used
 +  * space in this range first.
 +  */
 + ret = btrfs_account_dev_extents_size(device, 0,
 +  skip_space - 1,
 +  &used_space);
 + if (ret) {
 + kfree(devices_info);
 + 

Re: [PATCH] Btrfs: don't do async reclaim during log replay V2

2014-10-29 Thread Miao Xie
Ping..

On Thu, 23 Oct 2014 16:44:54 +0800, Miao Xie wrote:
 On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:
 Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
 during log replay.  This is because we use fs_info->fs_root as our root for
 shrinking and such.  Technically we can use whatever root we want, but let's
 just not allow async reclaim while we're doing log replay.  Thanks,
 
 Why not move the code of fs_root initialization to the front of log replay?
 I think it is better than the fix way in this patch because the async 
 reclaimer
 can help us do some work.
 
 Thanks
 Miao
 

 Signed-off-by: Josef Bacik jba...@fb.com
 ---
 V1->V2: use fs_info->log_root_recovering instead, didn't notice this existed
 before.

  fs/btrfs/extent-tree.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 28a27d5..44d0497 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -4513,7 +4513,13 @@ again:
  space_info->flush = 1;
  } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
  used += orig_bytes;
 -if (need_do_async_reclaim(space_info, root->fs_info, used) &&
 +/*
 + * We will do the space reservation dance during log replay,
 + * which means we won't have fs_info->fs_root set, so don't do
 + * the async reclaim as we will panic.
 + */
 +if (!root->fs_info->log_root_recovering &&
 +need_do_async_reclaim(space_info, root->fs_info, used) &&
  !work_busy(&root->fs_info->async_reclaim_work))
  queue_work(system_unbound_wq,
 &root->fs_info->async_reclaim_work);

 
 



Re: [PATCH v3] Btrfs: fix snapshot inconsistency after a file write followed by truncate

2014-10-29 Thread Miao Xie
On Wed, 29 Oct 2014 08:21:12 +, Filipe Manana wrote:
 If right after starting the snapshot creation ioctl we perform a write 
 against a
 file followed by a truncate, with both operations increasing the file's size, 
 we
 can get a snapshot tree that reflects a state of the source subvolume's tree 
 where
 the file truncation happened but the write operation didn't. This leaves a gap
 between 2 file extent items of the inode, which makes btrfs' fsck complain 
 about it.
 
 For example, if we perform the following file operations:
 
 $ mkfs.btrfs -f /dev/vdd
 $ mount /dev/vdd /mnt
 $ xfs_io -f \
   -c pwrite -S 0xaa -b 32K 0 32K \
   -c fsync \
   -c pwrite -S 0xbb -b 32770 16K 32770 \
   -c truncate 90123 \
   /mnt/foobar
 
 and the snapshot creation ioctl was just called before the second write, we 
 often
 can get the following inode items in the snapshot's btree:
 
 item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
 inode generation 146 transid 7 size 90123 block group 0 mode 
 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
 item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
 inode ref index 282 namelen 10 name: foobar
 item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
 extent data disk byte 1104855040 nr 32768
 extent data offset 0 nr 32768 ram 32768
 extent compression 0
 item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
 extent data disk byte 0 nr 0
 extent data offset 0 nr 40960 ram 40960
 extent compression 0
 
 There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 
 4096)[
 for which there's no file extent item covering it. This is because the file 
 write
 and file truncate operations happened both right after the snapshot creation 
 ioctl
 called btrfs_start_delalloc_inodes(), which means we didn't start and wait 
 for the
 ordered extent that matches the write and, in btrfs_setsize(), we were able 
 to call
 btrfs_cont_expand() before being able to commit the current transaction in the
 snapshot creation ioctl. So this made it possible to insert the hole file 
 extent
 item in the source subvolume (which represents the region added by the 
 truncate)
 right before the transaction commit from the snapshot creation ioctl.
 
 Btrfs' fsck tool complains about such cases with a message like the following:
 
 root 331 inode 257 errors 100, file extent discount
 
From a user perspective, the expectation when a snapshot is created while 
those
 file operations are being performed is that the snapshot will have a file that
 either:
 
 1) is empty
 2) only the first write was captured
 3) only the 2 writes were captured
 4) both writes and the truncation were captured
 
 But never capture a state where only the first write and the truncation were
 captured (since the second write was performed before the truncation).
 
 A test case for xfstests follows.
 
 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
 
 V2: Use different approach to solve the problem. Don't start and wait for all
 dellaloc to finish after every expanding truncate, instead add an 
 additional
 flush at transaction commit time if we're doing a transaction commit that
 creates snapshots.

This method will make the transaction commit spend more time, why not use
i_disk_size to expand the file size in btrfs_setsize()? Or we might rename
btrfs_{start, end}_nocow_write(), and use them in btrfs_setsize()?

Thanks
Miao

 
 V3: Removed useless test condition in +wait_pending_snapshot_roots_delalloc().
 
  fs/btrfs/transaction.c | 59 
 ++
  1 file changed, 59 insertions(+)
 
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index 396ae8b..5e7f004 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -1714,12 +1714,65 @@ static inline void btrfs_wait_delalloc_flush(struct btrfs_fs_info *fs_info)
   btrfs_wait_ordered_roots(fs_info, -1);
  }
  
 +static int
 +start_pending_snapshot_roots_delalloc(struct btrfs_trans_handle *trans,
 +   struct list_head *splice)
 +{
 + struct btrfs_pending_snapshot *pending_snapshot;
 + int ret = 0;
 +
 + if (btrfs_test_opt(trans->root, FLUSHONCOMMIT))
 + return 0;
 +
 + spin_lock(&trans->root->fs_info->trans_lock);
 + list_splice_init(&trans->transaction->pending_snapshots, splice);
 + spin_unlock(&trans->root->fs_info->trans_lock);
 +
 + /*
 +  * Start again delalloc for the roots our pending snapshots are made
 +  * from. We did it before starting/joining a transaction and we do it
 +  * here again because new inode operations might have happened since
 +  * then and we want to make sure the snapshot captures a fully
 +  * consistent state of the source root 

Re: [PATCH] Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()

2014-10-27 Thread Miao Xie
On Mon, 27 Oct 2014 09:16:55 +, Filipe Manana wrote:
 If we couldn't find our extent item, we accessed the current slot
 (path->slots[0]) to check if it corresponds to an equivalent skinny
 metadata item. However this slot could be beyond our last item in the
 leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
 we shouldn't process it.
 
 Since btrfs_lookup_extent() is only used to find extent items for data
 extents, fix this by removing completely the logic that looks up for an
 equivalent skinny metadata item, since it can not exist.

I think we also need a better function name, such as btrfs_lookup_data_extent.

Thanks
Miao

 
 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
  fs/btrfs/extent-tree.c | 8 +---
  1 file changed, 1 insertion(+), 7 deletions(-)
 
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 0d599ba..9141b2b 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -710,7 +710,7 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info 
 *info)
   rcu_read_unlock();
  }
  
 -/* simple helper to search for an existing extent at a given offset */
 +/* simple helper to search for an existing data extent at a given offset */
  int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
  {
   int ret;
 @@ -726,12 +726,6 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
   key.type = BTRFS_EXTENT_ITEM_KEY;
   ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
   0, 0);
 - if (ret > 0) {
 - btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
 - if (key.objectid == start &&
 - key.type == BTRFS_METADATA_ITEM_KEY)
 - ret = 0;
 - }
   btrfs_free_path(path);
   return ret;
  }
 



Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items

2014-10-27 Thread Miao Xie
On Mon, 27 Oct 2014 09:19:52 +, Filipe Manana wrote:
 We have a race that can lead us to miss skinny extent items in the function
 btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
 So basically the sequence of steps is:
 
 1) We search in the extent tree for the skinny extent, which returns > 0
(not found);
 
 2) We check the previous item in the returned leaf for a non-skinny extent,
and we don't find it;
 
 3) Because we didn't find the non-skinny extent in step 2), we release our
path to search the extent tree again, but this time for a non-skinny
extent key;
 
 4) Right after we released our path in step 3), a skinny extent was inserted
in the extent tree (delayed refs were run) - our second extent tree search
will miss it, because it's not looking for a skinny extent;
 
 5) After the second search returned (with ret > 0), we look for any delayed
ref for our extent's bytenr (and we do it while holding a read lock on the
leaf), but we won't find any, as such delayed ref had just run and 
 completed
after we released out path in step 3) before doing the second search.
 
 Fix this by removing completely the path release and re-search logic. This is
 safe, because if we search for a metadata item and we don't find it, we have 
 the
 guarantee that the returned leaf is the one where the item would be inserted,
 and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
 non-skinny extent item is if it exists. The only case where path->slots[0] is

I think this analysis is wrong if there are some independent shared ref 
metadata for
a tree block, just like:
+------------------------+-------------+-------------+
| tree block extent item | shared ref1 | shared ref2 |
+------------------------+-------------+-------------+

Thanks
Miao

 zero is when there are no smaller keys in the tree (i.e. no left siblings for
 our leaf), in which case the re-search logic isn't needed as well.
 
 This race has been present since the introduction of skinny metadata (change
 3173a18f70554fe7880bb2d85c7da566e364eb3c).
 
 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
  fs/btrfs/extent-tree.c | 8 
  1 file changed, 8 deletions(-)
 
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 9141b2b..2cedd06 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -780,7 +780,6 @@ search_again:
   else
   key.type = BTRFS_EXTENT_ITEM_KEY;
  
 -again:
   ret = btrfs_search_slot(trans, root->fs_info->extent_root,
   &key, path, 0, 0);
   if (ret < 0)
 @@ -796,13 +795,6 @@ again:
   key.offset == root->nodesize)
   ret = 0;
   }
 - if (ret) {
 - key.objectid = bytenr;
 - key.type = BTRFS_EXTENT_ITEM_KEY;
 - key.offset = root->nodesize;
 - btrfs_release_path(path);
 - goto again;
 - }
   }
  
   if (ret == 0) {
 



Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items

2014-10-27 Thread Miao Xie
On Mon, 27 Oct 2014 13:44:22 +, Filipe David Manana wrote:
 On Mon, Oct 27, 2014 at 12:11 PM, Filipe David Manana
 fdman...@gmail.com wrote:
 On Mon, Oct 27, 2014 at 11:08 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Mon, 27 Oct 2014 09:19:52 +, Filipe Manana wrote:
 We have a race that can lead us to miss skinny extent items in the function
 btrfs_lookup_extent_info() when the skinny metadata feature is enabled.
 So basically the sequence of steps is:

 1) We search in the extent tree for the skinny extent, which returns > 0
(not found);

 2) We check the previous item in the returned leaf for a non-skinny extent,
and we don't find it;

 3) Because we didn't find the non-skinny extent in step 2), we release our
path to search the extent tree again, but this time for a non-skinny
extent key;

 4) Right after we released our path in step 3), a skinny extent was 
 inserted
in the extent tree (delayed refs were run) - our second extent tree 
 search
will miss it, because it's not looking for a skinny extent;

 5) After the second search returned (with ret > 0), we look for any delayed
ref for our extent's bytenr (and we do it while holding a read lock on 
 the
leaf), but we won't find any, as such delayed ref had just run and 
 completed
after we released out path in step 3) before doing the second search.

 Fix this by removing completely the path release and re-search logic. This 
 is
 safe, because if we search for a metadata item and we don't find it, we 
 have the
 guarantee that the returned leaf is the one where the item would be 
 inserted,
 and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the
 non-skinny extent item is if it exists. The only case where path->slots[0] 
 is

 I think this analysis is wrong if there are some independent shared ref 
 metadata for
 a tree block, just like:
 +------------------------+-------------+-------------+
 | tree block extent item | shared ref1 | shared ref2 |
 +------------------------+-------------+-------------+
 
 Trying to guess what's in your mind.
 
 Is the concern that if after a non-skinny extent item we have
 non-inlined references, the assumption that path->slots[0] - 1 points
 to the extent item would be wrong when searching for a skinny extent?
 
 That wouldn't be the case because BTRFS_EXTENT_ITEM_KEY == 168 and
 BTRFS_METADATA_ITEM_KEY == 169, with BTRFS_SHARED_BLOCK_REF_KEY ==
 182. So in the presence of such non-inlined shared tree block
 reference items, searching for a skinny extent item leaves us at a
 slot that points to the first non-inlined ref (regardless of its type,
 since they're all > 169), and therefore path->slots[0] - 1 is the
 non-skinny extent item.

You are right. I forgot to check the value of the key type. Sorry.

This patch seems good for me.

Reviewed-by: Miao Xie mi...@cn.fujitsu.com

 
 thanks.
 

 Why does that matters? Can you elaborate why it's not correct?

 We're looking for the extent item only in btrfs_lookup_extent_info(),
 and running a delayed ref, independently of being inlined/shared, it
 implies inserting a new extent item or updating an existing extent
 item (updating ref count).

 thanks


 Thanks
 Miao

 zero is when there are no smaller keys in the tree (i.e. no left siblings 
 for
 our leaf), in which case the re-search logic isn't needed as well.

 This race has been present since the introduction of skinny metadata 
 (change
 3173a18f70554fe7880bb2d85c7da566e364eb3c).

 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
  fs/btrfs/extent-tree.c | 8 
  1 file changed, 8 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 9141b2b..2cedd06 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -780,7 +780,6 @@ search_again:
   else
   key.type = BTRFS_EXTENT_ITEM_KEY;

 -again:
   ret = btrfs_search_slot(trans, root->fs_info->extent_root,
   &key, path, 0, 0);
   if (ret < 0)
 @@ -796,13 +795,6 @@ again:
   key.offset == root->nodesize)
   ret = 0;
   }
 - if (ret) {
 - key.objectid = bytenr;
 - key.type = BTRFS_EXTENT_ITEM_KEY;
 - key.offset = root->nodesize;
 - btrfs_release_path(path);
 - goto again;
 - }
   }

   if (ret == 0) {





 --
 Filipe David Manana,

 Reasonable men adapt themselves to the world.
  Unreasonable men adapt the world to themselves.
  That's why all progress depends on unreasonable men.
 
 
 


Re: [PATCH] Btrfs: properly clean up btrfs_end_io_wq_cache

2014-10-23 Thread Miao Xie
On Wed, 15 Oct 2014 17:19:59 -0400, Josef Bacik wrote:
 In one of Dave's cleanup commits he forgot to call btrfs_end_io_wq_exit on
 unload, which makes us unable to unload and then re-load the btrfs module.  
 This
 fixes the problem.  Thanks,
 
 Signed-off-by: Josef Bacik jba...@fb.com

Reviewed-by: Miao Xie mi...@cn.fujitsu.com

 ---
  fs/btrfs/super.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index b83ef15..c1d020f 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -2151,6 +2151,7 @@ static void __exit exit_btrfs_fs(void)
   extent_map_exit();
   extent_io_exit();
   btrfs_interface_exit();
 + btrfs_end_io_wq_exit();
   unregister_filesystem(btrfs_fs_type);
   btrfs_exit_sysfs();
   btrfs_cleanup_fs_uuids();
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: don't do async reclaim during log replay V2

2014-10-23 Thread Miao Xie
On Thu, 18 Sep 2014 11:27:17 -0400, Josef Bacik wrote:
 Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
 during log replay.  This is because we use fs_info->fs_root as our root for
 shrinking and such.  Technically we can use whatever root we want, but let's
 just not allow async reclaim while we're doing log replay.  Thanks,

Why not move the code of fs_root initialization to the front of log replay?
I think it is better than the fix way in this patch because the async reclaimer
can help us do some work.

Thanks
Miao

 
 Signed-off-by: Josef Bacik jba...@fb.com
 ---
 V1->V2: use fs_info->log_root_recovering instead, didn't notice this existed
 before.
 
  fs/btrfs/extent-tree.c | 8 +++-
  1 file changed, 7 insertions(+), 1 deletion(-)
 
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 28a27d5..44d0497 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -4513,7 +4513,13 @@ again:
   space_info->flush = 1;
   } else if (!ret && space_info->flags & BTRFS_BLOCK_GROUP_METADATA) {
   used += orig_bytes;
 - if (need_do_async_reclaim(space_info, root->fs_info, used) &&
 + /*
 +  * We will do the space reservation dance during log replay,
 +  * which means we won't have fs_info->fs_root set, so don't do
 +  * the async reclaim as we will panic.
 +  */
 + if (!root->fs_info->log_root_recovering &&
 + need_do_async_reclaim(space_info, root->fs_info, used) &&
   !work_busy(&root->fs_info->async_reclaim_work))
   queue_work(system_unbound_wq,
  &root->fs_info->async_reclaim_work);
 



Re: device balance times

2014-10-23 Thread Miao Xie
On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote:
 On 22.10.2014 03:43, Chris Murphy wrote:
 On Oct 21, 2014, at 4:14 PM, Piotr Pawłowp...@siedziba.pl  wrote:
 Looks normal to me. Last time I started a balance after adding 6th device 
 to my FS, it took 4 days to move 25GBs of data.
 It's long term untenable. At some point it must be fixed. It's way, way 
 slower than md raid.
 At a certain point it needs to fallback to block level copying, with a ~ 
 32KB block. It can't be treating things as if they're 1K files, doing file 
 level copying that takes forever. It's just too risky that another device 
 fails in the meantime.
 
 There's device replace for restoring redundancy, which is fast, but not 
 implemented yet for RAID5/6.

Now my colleague and I are implementing scrub/replace for RAID5/6,
and I have a plan to reimplement balance and split it off from the
metadata/file data relocation process. The main idea is:
- allocate a new chunk with the same size as the relocated one, but don't
  insert it into the block group list, so we don't allocate free space from it
- set the source chunk to be read-only
- copy the data from the source chunk to the new chunk
- replace the extent map of the source chunk with the one of the new chunk
  (the new chunk has the same logical address and length as the old one)
- release the source chunk

This way, we needn't deal with the data one extent at a time, and needn't do
any space reservation, so the balance will be very fast even if we have lots
of snapshots.
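The steps above can be modeled with a toy simulation (plain Python; Chunk and relocate_chunk are invented names, and nothing here corresponds to real btrfs code):

```python
# Toy model of chunk-level balance: move a whole chunk in bulk without
# touching per-extent bookkeeping. All names are invented for illustration.

class Chunk:
    def __init__(self, logical, length, device_offset):
        self.logical = logical              # logical address (kept stable)
        self.length = length
        self.device_offset = device_offset  # physical location
        self.read_only = False
        self.data = bytearray(length)

def relocate_chunk(src, new_device_offset):
    # 1. allocate a destination of the same size, not visible to allocators
    dst = Chunk(src.logical, src.length, new_device_offset)
    # 2. freeze the source so no new writes land in it during the copy
    src.read_only = True
    # 3. bulk-copy the data
    dst.data[:] = src.data
    # 4. "swap the extent map": same logical address, new physical home
    # 5. the caller then releases the source chunk
    return dst

chunk = Chunk(logical=1 << 30, length=16, device_offset=0)
chunk.data[:5] = b"hello"
moved = relocate_chunk(chunk, new_device_offset=4096)
print(moved.logical == chunk.logical)   # True: logical address unchanged
print(bytes(moved.data[:5]))            # b'hello'
```

Because the logical address never changes, no snapshot or extent reference needs updating, which is why the cost stays flat even with many snapshots.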

Thanks
Miao

 
 I think the problem is that balance was originally used for balancing data / 
 metadata split - moving stuff out of mostly empty chunks to free them and use 
 for something else. It pretty much has to be done on the extent level.
 
 Then balance was repurposed for things like converting RAID profiles and 
 restoring redundancy and balancing device usage in multi-device 
 configurations. It works, but the approach to do it extent by extent is slow.
 
 I wonder if we could do some of these operations by just copying whole chunks 
 in bulk. Wasn't that the point of introducing logical addresses? - to be able 
 to move chunks around quickly without changing anything except updating chunk 
 pointers?
 
 BTW: I'd love a simple interface to be able to select a chunk and tell it to 
 move somewhere else. I'd like to tell chunks with metadata, or with tons of 
 extents: Hey, chunks! Why don't you move to my SSDs? :)


Re: [PATCH 2/2] Btrfs: check-int: don't complain about balanced blocks

2014-10-17 Thread Miao Xie
On Thu, 16 Oct 2014 17:48:49 +0200, Stefan Behrens wrote:
 The xfstest btrfs/014, which tests the balance operation, caused the
 check_int module to complain that known blocks changed their physical
 location. Since this is not an error in this case, only print such a
 message if verbose mode is enabled.
 
 Reported-by: Wang Shilong wangshilong1...@gmail.com
 Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
 ---
  fs/btrfs/check-integrity.c | 87 
 ++
  1 file changed, 49 insertions(+), 38 deletions(-)
 
 diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
 index 65fc2e0bbc4a..65226d7c9fe0 100644
 --- a/fs/btrfs/check-integrity.c
 +++ b/fs/btrfs/check-integrity.c
 @@ -1325,24 +1325,28 @@ static int btrfsic_create_link_to_next_block(
   l = NULL;
   next_block-generation = BTRFSIC_GENERATION_UNKNOWN;
   } else {
 - if (next_block-logical_bytenr != next_bytenr 
 - !(!next_block-is_metadata 
 -   0 == next_block-logical_bytenr)) {
 - printk(KERN_INFO
 -Referenced block @%llu (%s/%llu/%d)
 - found in hash table, %c,
 - bytenr mismatch (!= stored %llu).\n,
 -next_bytenr, next_block_ctx-dev-name,
 -next_block_ctx-dev_bytenr, *mirror_nump,
 -btrfsic_get_block_type(state, next_block),
 -next_block-logical_bytenr);
 - } else if (state-print_mask  BTRFSIC_PRINT_MASK_VERBOSE)
 - printk(KERN_INFO
 -Referenced block @%llu (%s/%llu/%d)
 - found in hash table, %c.\n,
 -next_bytenr, next_block_ctx-dev-name,
 -next_block_ctx-dev_bytenr, *mirror_nump,
 -btrfsic_get_block_type(state, next_block));
 + if (state-print_mask  BTRFSIC_PRINT_MASK_VERBOSE) {
 + if (next_block-logical_bytenr != next_bytenr 
 + !(!next_block-is_metadata 
 +   0 == next_block-logical_bytenr))
 + printk(KERN_INFO
 +Referenced block @%llu (%s/%llu/%d)
 + found in hash table, %c,
 + bytenr mismatch (!= stored %llu).\n,

According to the kernel coding style, user-visible strings should not be
broken across lines.

Thanks
Miao

 +next_bytenr, next_block_ctx-dev-name,
 +next_block_ctx-dev_bytenr, *mirror_nump,
 +btrfsic_get_block_type(state,
 +   next_block),
 +next_block-logical_bytenr);
 + else
 + printk(KERN_INFO
 +Referenced block @%llu (%s/%llu/%d)
 + found in hash table, %c.\n,
 +next_bytenr, next_block_ctx-dev-name,
 +next_block_ctx-dev_bytenr, *mirror_nump,
 +btrfsic_get_block_type(state,
 +   next_block));
 + }
   next_block-logical_bytenr = next_bytenr;
  
   next_block-mirror_num = *mirror_nump;
 @@ -1528,7 +1532,9 @@ static int btrfsic_handle_extent_data(
   return -1;
   }
   if (!block_was_created) {
 - if (next_block-logical_bytenr != next_bytenr 
 + if ((state-print_mask 
 +  BTRFSIC_PRINT_MASK_VERBOSE) 
 + next_block-logical_bytenr != next_bytenr 
   !(!next_block-is_metadata 
 0 == next_block-logical_bytenr)) {
   printk(KERN_INFO
 @@ -1881,25 +1887,30 @@ again:
  dev_state,
  dev_bytenr);
   }
 - if (block-logical_bytenr != bytenr 
 - !(!block-is_metadata 
 -   block-logical_bytenr == 0))
 - printk(KERN_INFO
 -Written block @%llu (%s/%llu/%d)
 - found in hash table, %c,
 - bytenr mismatch
 - (!= stored %llu).\n,
 -   

Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-12 Thread Miao Xie
Guan

On Sat, 11 Oct 2014 14:45:29 +0800, Eryu Guan wrote:
 device replace could fail due to another running scrub process, but this
 failure doesn't get returned to userspace.

 The following steps could reproduce this issue

mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
mount /dev/sdb1 /mnt/btrfs
while true; do
btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1
done &
btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
# if this replace succeeded, do the following and repeat until
# you see this log in dmesg
# BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
#btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

# once you see the error log in dmesg, check return value of
# replace
echo $?

 Also only WARN_ON if the return code is not -EINPROGRESS.

 Signed-off-by: Eryu Guan guane...@gmail.com

 Ping, any comments on this patch?

 Thanks,
 Eryu
 ---
  fs/btrfs/dev-replace.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..44d32ab 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 &dev_replace->scrub_progress, 0, 1);
  
 ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 -  WARN_ON(ret);
 +  /* don't warn if EINPROGRESS, someone else might be running scrub */
 +  if (ret != -EINPROGRESS)
 +  WARN_ON(ret);

 picky comment

 I prefer WARN_ON(ret && ret != -EINPROGRESS).
 
 Yes, this is simpler :)

  
 -  return 0;
 +  return ret;

 here we will return -EINPROGRESS if scrub is running. I think it would be
 better to assign some special value to args->result and then return 0, just
 like the case where a device replace is already running.
 
 Seems that requires a new result type, say,
 
 #define BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS  3
 
 and assign this result to args->result if btrfs_scrub_dev() returned
 -EINPROGRESS.
 
 But I don't think returning 0 unconditionally is a good idea:
 btrfs_dev_replace_finishing() could return other errors too, and those
 errors would then be lost, so userspace still wouldn't catch them ($? is 0).

Of course.
Maybe my explanation above was not so clear. In fact, I was only talking
about the EINPROGRESS case; for the other cases, returning the error code is
better.

 What I'm thinking about is something like:
 
   ret = btrfs_scrub_dev(...);
   ret = btrfs_dev_replace_finishing(root->fs_info, ret);
   if (ret == -EINPROGRESS) {
   args->result = BTRFS_IOCTL_DEV_REPLACE_RESULT_SCRUB_INPROGRESS;
   ret = 0;
   } else {
   WARN_ON(ret);
   }
 
   return ret;
 
 What do you think? If no objection I'll work on v2.

I like it.

Thanks
Miao

 Thanks for your review!
 
 Eryu

 Thanks
 Miao

  
  leave:
 dev_replace->srcdev = NULL;
 @@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct 
 btrfs_fs_info *fs_info,
btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
  
 -  return 0;
 +  return scrub_ret;
}
  
printk_in_rcu(KERN_INFO
 -- 
 1.8.3.1



Re: [PATCH] Btrfs: return failure if btrfs_dev_replace_finishing() failed

2014-10-10 Thread Miao Xie
On Fri, 10 Oct 2014 15:13:31 +0800, Eryu Guan wrote:
 On Thu, Sep 25, 2014 at 06:28:14PM +0800, Eryu Guan wrote:
 device replace could fail due to another running scrub process, but this
 failure doesn't get returned to userspace.

 The following steps could reproduce this issue

  mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
  mount /dev/sdb1 /mnt/btrfs
  while true; do
  btrfs scrub start -B /mnt/btrfs >/dev/null 2>&1
  done &
  btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs
  # if this replace succeeded, do the following and repeat until
  # you see this log in dmesg
  # BTRFS: btrfs_scrub_dev(/dev/sdb2, 2, /dev/sdb3) failed -115
  #btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs

  # once you see the error log in dmesg, check return value of
  # replace
  echo $?

 Also only WARN_ON if the return code is not -EINPROGRESS.

 Signed-off-by: Eryu Guan guane...@gmail.com
 
 Ping, any comments on this patch?
 
 Thanks,
 Eryu
 ---
  fs/btrfs/dev-replace.c | 8 +---
  1 file changed, 5 insertions(+), 3 deletions(-)

 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..44d32ab 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -418,9 +418,11 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
 &dev_replace->scrub_progress, 0, 1);
  
   ret = btrfs_dev_replace_finishing(root->fs_info, ret);
 -WARN_ON(ret);
 +/* don't warn if EINPROGRESS, someone else might be running scrub */
 +if (ret != -EINPROGRESS)
 +WARN_ON(ret);

picky comment

I prefer WARN_ON(ret && ret != -EINPROGRESS).

  
 -return 0;
 +return ret;

here we will return -EINPROGRESS if scrub is running. I think it would be
better to assign some special value to args->result and then return 0, just
like the case where a device replace is already running.

Thanks
Miao

  
  leave:
   dev_replace->srcdev = NULL;
 @@ -538,7 +540,7 @@ static int btrfs_dev_replace_finishing(struct 
 btrfs_fs_info *fs_info,
  btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
   mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
  
 -return 0;
 +return scrub_ret;
  }
  
  printk_in_rcu(KERN_INFO
 -- 
 1.8.3.1



Re: [PATCH] btrfs: fix ABBA deadlock in btrfs_dev_replace_finishing()

2014-09-21 Thread Miao Xie
It has been fixed by

https://patchwork.kernel.org/patch/4747961/

Thanks
Miao

On Sun, 21 Sep 2014 12:41:49 +0800, Eryu Guan wrote:
 btrfs_map_bio() first calls btrfs_bio_counter_inc_blocked(), which checks
 the fs state and increases the bio_counter, and then calls
 __btrfs_map_block(), which takes the dev_replace lock.
 
 On the other hand, btrfs_dev_replace_finishing() takes the dev_replace lock
 first, then sets the fs state to BTRFS_FS_STATE_DEV_REPLACING and waits for
 the bio_counter to drop to zero.
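The two paths described above acquire the same pair of resources in opposite orders, which is the classic ABBA pattern. A toy lock-order checker makes the hazard concrete (plain Python; OrderedLock and the lock names are invented, this is not btrfs code):

```python
# Toy lock-order checker modeling the ABBA hazard described above: the bio
# path takes bio_counter before the dev_replace lock, while the finishing
# path did the reverse. All names here are invented for illustration.
import threading

ORDER = {"bio_counter": 0, "dev_replace": 1}   # one global acquisition order
held = threading.local()

class OrderedLock:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
    def acquire(self):
        stack = getattr(held, "stack", [])
        # a thread may only take locks in increasing ORDER; anything else
        # permits an ABBA interleaving with another thread
        if any(ORDER[h] >= ORDER[self.name] for h in stack):
            raise RuntimeError(f"lock order violation: {stack} -> {self.name}")
        self.lock.acquire()
        held.stack = stack + [self.name]
    def release(self):
        # assumes LIFO release, which both paths here follow
        self.lock.release()
        held.stack = held.stack[:-1]

bio_counter = OrderedLock("bio_counter")
dev_replace = OrderedLock("dev_replace")

# btrfs_map_bio-like path: counter first, then dev_replace -- allowed
bio_counter.acquire(); dev_replace.acquire()
dev_replace.release(); bio_counter.release()

# finishing-like path: dev_replace first, then the counter -- this is the
# reversed order that deadlocked; the checker rejects it
msg = None
dev_replace.acquire()
try:
    bio_counter.acquire()
except RuntimeError as e:
    msg = str(e)
dev_replace.release()
print(msg)
```

The kernel's lockdep infrastructure performs exactly this kind of ordering validation; the point of the fix is to pick one global order and keep every path on it.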
 
 The deadlock can be reproduced easily by running replace and fsstress at
 the same time, e.g.
 
 mkfs -t btrfs -f /dev/sdb1 /dev/sdb2
 mount /dev/sdb1 /mnt/btrfs
 fsstress -d /mnt/btrfs -n 100 -p 2 -l 0 &  # fsstress from ltp supports the -l option
 i=0
 while btrfs replace start -Bf /dev/sdb2 /dev/sdb3 /mnt/btrfs && \
   btrfs replace start -Bf /dev/sdb3 /dev/sdb2 /mnt/btrfs; do
   echo === loop $i ===
   let i=$i+1
 done
 
 This was introduced by
 
 c404e0d Btrfs: fix use-after-free in the finishing procedure of the device 
 replace
 
 Signed-off-by: Eryu Guan guane...@gmail.com
 ---
 
 Tested by the reproducer and xfstests, no new failure found.
 
 But I found a kmem_cache leak if I remove the btrfs module after my new
 test case[1], which runs fsstress, replace, and subvolume
 create/mount/umount/delete at the same time.
 
 BUG btrfs_extent_state (Tainted: GB ): Objects remaining in 
 btrfs_extent_state on kmem_cache_close()
 ..
 kmem_cache_destroy btrfs_extent_state: Slab cache still has objects
 CPU: 3 PID: 9503 Comm: modprobe Tainted: GB  3.17.0-rc5+ #12
 Hardware name: Hewlett-Packard ProLiant DL388eGen8, BIOS P73 06/01/2012
   8dd09c52 880411c37eb0 81642f7a
  8800b9a19300 880411c37ed0 8118ce89 
  a05dcd20 880411c37ee0 a056a80f 880411c37ef0
 Call Trace:
  [81642f7a] dump_stack+0x45/0x56
  [8118ce89] kmem_cache_destroy+0xf9/0x100
  [a056a80f] extent_io_exit+0x1f/0x50 [btrfs]
  [a05c3ae3] exit_btrfs_fs+0x2c/0x549 [btrfs]
  [810efda2] SyS_delete_module+0x162/0x200
  [81013bb7] ? do_notify_resume+0x97/0xb0
  [8164af69] system_call_fastpath+0x16/0x1b
 
 The test would hang before the fix. I'm not sure if it's related to the fix
 (seems not), please help review.
 
 Thanks,
 Eryu Guan
 
 [1] http://www.spinics.net/lists/linux-btrfs/msg37625.html
 
  fs/btrfs/dev-replace.c | 6 ++
  1 file changed, 2 insertions(+), 4 deletions(-)
 
 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index eea26e1..5dfd292 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -510,6 +510,7 @@ static int btrfs_dev_replace_finishing(struct 
 btrfs_fs_info *fs_info,
   /* keep away write_all_supers() during the finishing procedure */
   mutex_lock(&root->fs_info->chunk_mutex);
   mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
 + btrfs_rm_dev_replace_blocked(fs_info);
   btrfs_dev_replace_lock(dev_replace);
   dev_replace->replace_state =
   scrub_ret ? BTRFS_IOCTL_DEV_REPLACE_STATE_CANCELED
 @@ -567,12 +568,8 @@ static int btrfs_dev_replace_finishing(struct 
 btrfs_fs_info *fs_info,
   btrfs_kobj_rm_device(fs_info, src_device);
   btrfs_kobj_add_device(fs_info, tgt_device);
  
 - btrfs_rm_dev_replace_blocked(fs_info);
 -
   btrfs_rm_dev_replace_srcdev(fs_info, src_device);
  
 - btrfs_rm_dev_replace_unblocked(fs_info);
 -
   /*
* this is again a consistent state where no dev_replace procedure
* is running, the target device is part of the filesystem, the
 @@ -581,6 +578,7 @@ static int btrfs_dev_replace_finishing(struct 
 btrfs_fs_info *fs_info,
* belong to this filesystem.
*/
   btrfs_dev_replace_unlock(dev_replace);
 + btrfs_rm_dev_replace_unblocked(fs_info);
   mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
   mutex_unlock(&root->fs_info->chunk_mutex);
  
 



Re: [PATCH] btrfs: Fix the wrong condition judgment about subset extent map

2014-09-21 Thread Miao Xie
This patch and the previous one (the following patch) also fix an oops which
can be reproduced by the LTP stress test (ltpstress.sh + fsstress).

[PATCH] btrfs: Fix and enhance merge_extent_mapping() to insert best fitted 
extent map

Thanks
Miao

On Mon, 22 Sep 2014 09:13:03 +0800, Qu Wenruo wrote:
 The previous commit btrfs: Fix and enhance merge_extent_mapping() to insert
 best fitted extent map used the wrong condition to judge whether the range
 is a subset of an existing extent map.
 
 This may cause a bug in btrfs no-holes mode.
 
 This patch corrects the judgment and fixes the bug.
 
 Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
 ---
  fs/btrfs/inode.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index 8039021..a99ee9d 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -6527,7 +6527,7 @@ insert:
* extent causing the -EEXIST.
*/
   if (start = extent_map_end(existing) ||
 - start + len = existing-start) {
 + start = existing-start) {
   /*
* The existing extent map is the one nearest to
* the [start, start + len) range which overlaps
 



Re: kernel integration branch updated

2014-09-18 Thread Miao Xie
Chris

On Fri, 19 Sep 2014 09:45:17 +0800, Qu Wenruo wrote:
 Hi Chris,
 
 
 I'm sorry that the commit 'btrfs: Fix and enhance merge_extent_mapping() to
 insert best fitted extent map' has a V2 patch, so the one in the tree is not
 up to date.
 
 Although the v2 change is quite small and relatively independent, so it
 should not be a painful change.

I think it is better to merge it into v3.17, since it fixes a regression in
the v3.17 kernel.

Thanks
Miao 

 Thanks,
 Qu
 
  Original Message 
 Subject: kernel integration branch updated
 From: Chris Mason c...@fb.com
 To: linux-btrfs linux-btrfs@vger.kernel.org
 Date: 2014年09月18日 22:19
 Hi everyone,

 I've added a few more patches to the kernel integration branch, and
 rebased onto rc5.  This should be my last rebase before sending into
 linux-next, please take a look.

 It's still missing three patches from Josef, which we're updating.  I
 can put more patches on top, but I'd prefer not to rebase again unless
 some patches need removing.

 -chris
 



[PATCH v4 00/11] Implement the data repair function for direct read

2014-09-12 Thread Miao Xie
This patchset implements the data repair function for direct read. It is
implemented like buffered read:
1.When we find the data is not right, we try to read the data from another
  mirror.
2.When the io on the mirror ends, we insert the endio work into the
  dedicated btrfs workqueue, not the common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue; if we
  inserted the endio work of the io on the mirror into that workqueue, a
  deadlock would happen.
3.If we get the right data, we write it back to repair the corrupted mirror.
4.If the data on the new mirror is still corrupted, we try the next mirror
  until we read the right data or all the mirrors have been traversed.
5.After the above work, we set the uptodate flag according to the result.

The difference is that a direct read may be split into several small ios.
In order to get the number of the mirror on which an io error happens, we
have to do the data check and repair in the end io function of those sub-io
requests.
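The five steps above amount to a retry-and-writeback loop. A toy Python sketch (read_with_repair and all other names are invented for illustration, not real btrfs code):

```python
# Toy model of the direct-read repair loop described above (steps 1-5).
# All names are invented; this only illustrates the control flow.

def read_with_repair(mirrors, checksum_ok):
    """Try each mirror in turn; on success, write the good copy back to
    every mirror that returned bad data, then report uptodate."""
    bad = []
    for num, data in enumerate(mirrors):
        if checksum_ok(data):
            for b in bad:                 # step 3: repair corrupted mirrors
                mirrors[b] = data
            return True, num              # step 5: uptodate, winning mirror
        bad.append(num)                   # step 4: try the next mirror
    return False, -1                      # every mirror was corrupted

mirrors = [b"xxxx", b"good"]              # mirror 0 corrupted, mirror 1 fine
ok, winner = read_with_repair(mirrors, lambda d: d == b"good")
print(ok, winner, mirrors[0])             # prints: True 1 b'good'
```

The patchset's extra complication, as the cover letter notes, is that this loop has to run per sub-bio so the failing mirror number is still known.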

Besides that, we also fixed some bugs of direct io.

Changelog v3 - v4:
- Remove the 1st patch which has been applied into the upstream kernel.
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this suggestion was from Chris.
- Rebase the patchset to integration branch of Chris's git tree.

Changelog v2 - v3:
- Fix wrong returned bio when doing bio clone, which was reported by Filipe

Changelog v1 - v2:
- Fix the warning which was triggered by __GFP_ZERO in the 2nd patch

Miao Xie (11):
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data data check and dio
read data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submiting re-read bio fails
  Btrfs: Cleanup unused variant and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode
is freeing

 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |  10 +-
 fs/btrfs/ctree.h|   4 +-
 fs/btrfs/disk-io.c  |  11 +-
 fs/btrfs/disk-io.h  |   1 +
 fs/btrfs/extent_io.c| 254 +--
 fs/btrfs/extent_io.h|  38 -
 fs/btrfs/file-item.c|  14 +-
 fs/btrfs/inode.c| 446 +++-
 fs/btrfs/scrub.c|   4 +-
 fs/btrfs/volumes.c  |   5 +
 fs/btrfs/volumes.h  |   5 +-
 13 files changed, 601 insertions(+), 193 deletions(-)

-- 
1.9.3



[PATCH v4 03/11] Btrfs: do file data check by sub-bio's self

2014-09-12 Thread Miao Xie
Direct IO splits the original bio into several sub-bios because of the limit
of the raid stripe, and the filesystem waits for all sub-bios and then runs
the final end io process.

But it was very hard to implement data repair when a dio read failure happens,
because at the final end io function we don't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file
data check from the final end io function to the sub-bio end io function, in
which we can get the mirror number of the device we accessed. This patch does
this work as the first step of the direct io data repair implementation.
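The move described above (checking each sub-bio in its own endio, where the mirror number is still known) can be sketched as a toy Python model; all names here are invented:

```python
# Toy model of moving the data check from the final endio into each
# sub-bio's own endio, so the serving mirror number is still available.
# All names are invented for illustration.

def subio_endio(segments, mirror_num, checksum_ok):
    """Per-sub-bio check: runs while we still know which mirror served it."""
    for seg in segments:
        if not checksum_ok(seg):
            return ("EIO", mirror_num)    # caller can now retry another mirror
    return ("OK", mirror_num)

# The original bio was split into two sub-bios served by different mirrors.
sub_bios = [([b"aa", b"bb"], 1), ([b"cc", b"BAD"], 2)]
results = [subio_endio(segs, m, lambda s: s != b"BAD") for segs, m in sub_bios]
print(results)   # prints: [('OK', 1), ('EIO', 2)]
```

Had the check stayed in the final endio, only an aggregate error would be visible, with no way to tell that mirror 2 was the bad one.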

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/btrfs_inode.h |   9 +
 fs/btrfs/extent_io.c   |   2 +-
 fs/btrfs/inode.c   | 100 -
 fs/btrfs/volumes.h |   5 ++-
 4 files changed, 87 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 8bea70e..4d30947 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, 
u64 generation)
return 0;
 }
 
+#define BTRFS_DIO_ORIG_BIO_SUBMITTED   0x1
+
 struct btrfs_dio_private {
struct inode *inode;
+   unsigned long flags;
u64 logical_offset;
u64 disk_bytenr;
u64 bytes;
@@ -263,6 +266,12 @@ struct btrfs_dio_private {
 
/* dio_bio came from fs/direct-io.c */
struct bio *dio_bio;
+
+   /*
+* The original bio may be splited to several sub-bios, this is
+* done during endio of sub-bios
+*/
+   int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
 };
 
 /*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfe1afe..92a6d9f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2472,7 +2472,7 @@ static void end_bio_extent_readpage(struct bio *bio, int 
err)
struct inode *inode = page-mapping-host;
 
pr_debug(end_bio_extent_readpage: bi_sector=%llu, err=%d, 
-mirror=%lu\n, (u64)bio-bi_iter.bi_sector, err,
+mirror=%u\n, (u64)bio-bi_iter.bi_sector, err,
 io_bio-mirror_num);
tree = BTRFS_I(inode)-io_tree;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e8139c6..cf79f79 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7198,29 +7198,40 @@ unlock_err:
return ret;
 }
 
-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static int btrfs_subio_endio_read(struct inode *inode,
+ struct btrfs_io_bio *io_bio)
 {
-   struct btrfs_dio_private *dip = bio-bi_private;
struct bio_vec *bvec;
-   struct inode *inode = dip-inode;
-   struct bio *dio_bio;
-   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
u64 start;
-   int ret;
int i;
+   int ret;
+   int err = 0;
 
-   if (err || (BTRFS_I(inode)-flags  BTRFS_INODE_NODATASUM))
-   goto skip_checksum;
+   if (BTRFS_I(inode)-flags  BTRFS_INODE_NODATASUM)
+   return 0;
 
-   start = dip-logical_offset;
-   bio_for_each_segment_all(bvec, bio, i) {
+   start = io_bio-logical;
+   bio_for_each_segment_all(bvec, io_bio-bio, i) {
ret = __readpage_endio_check(inode, io_bio, i, bvec-bv_page,
 0, start, bvec-bv_len);
if (ret)
err = -EIO;
start += bvec-bv_len;
}
-skip_checksum:
+
+   return err;
+}
+
+static void btrfs_endio_direct_read(struct bio *bio, int err)
+{
+   struct btrfs_dio_private *dip = bio-bi_private;
+   struct inode *inode = dip-inode;
+   struct bio *dio_bio;
+   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+
+   if (!err  (dip-flags  BTRFS_DIO_ORIG_BIO_SUBMITTED))
+   err = btrfs_subio_endio_read(inode, io_bio);
+
unlock_extent(BTRFS_I(inode)-io_tree, dip-logical_offset,
  dip-logical_offset + dip-bytes - 1);
dio_bio = dip-dio_bio;
@@ -7298,6 +7309,7 @@ static int __btrfs_submit_bio_start_direct_io(struct 
inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
struct btrfs_dio_private *dip = bio-bi_private;
+   int ret;
 
if (err) {
btrfs_err(BTRFS_I(dip-inode)-root-fs_info,
@@ -7305,6 +7317,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
  btrfs_ino(dip-inode), bio-bi_rw,
  (unsigned long long)bio-bi_iter.bi_sector,
  bio-bi_iter.bi_size, err);
+   } else if (dip-subio_endio) {
+   ret = dip-subio_endio(dip-inode, btrfs_io_bio(bio));
+   if (ret)
+   err = ret;
+   }
+
+   if (err

[PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io

2014-09-12 Thread Miao Xie
The original code of repair_io_failure was only used for buffered read:
it got some filesystem data from the page structure, which is safe for
pages in the page cache. But when we do a direct read, the pages in the bio
are not in the page cache, so there is no filesystem data in the page
structure. In order to implement direct read data repair, we need to modify
repair_io_failure and pass all the filesystem data it needs via function
parameters.
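The interface change can be illustrated with a toy model (plain Python, invented names): a page-cache page knows its file position, so the in-page offset can be derived from it, while a direct-IO page cannot provide that and the caller must pass the offset in:

```python
# Toy illustration of the change above: for direct IO the page isn't in the
# page cache, so the in-page offset must be passed in explicitly rather than
# derived from the page's cached file position. All names are invented.

def repair_io_failure(page, pg_offset, length):
    # after the change, the offset is always an explicit parameter
    return page["buf"][pg_offset:pg_offset + length]

cached_page = {"file_pos": 8192, "buf": b"....DATA...."}
# the buffered-read path can still compute the offset itself:
start = 8192 + 4
print(repair_io_failure(cached_page, start - cached_page["file_pos"], 4))
# a direct-IO page has no meaningful file position; the caller supplies it:
dio_page = {"file_pos": None, "buf": b"xxxxDATAyyyy"}
print(repair_io_failure(dio_page, 4, 4))
```

Both calls print b'DATA'; the point is only that the second call could not have computed the offset from the page itself.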

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 8 +---
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/scrub.c | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cf1de40..9fbc005 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1997,7 +1997,7 @@ static int free_io_failure(struct inode *inode, struct 
io_failure_record *rec)
  */
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
u64 length, u64 logical, struct page *page,
-   int mirror_num)
+   unsigned int pg_offset, int mirror_num)
 {
struct bio *bio;
struct btrfs_device *dev;
@@ -2036,7 +2036,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 
start,
return -EIO;
}
bio-bi_bdev = dev-bdev;
-   bio_add_page(bio, page, length, start - page_offset(page));
+   bio_add_page(bio, page, length, pg_offset);
 
if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) {
/* try to remap that extent elsewhere? */
@@ -2067,7 +2067,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct 
extent_buffer *eb,
for (i = 0; i  num_pages; i++) {
struct page *p = extent_buffer_page(eb, i);
ret = repair_io_failure(root-fs_info, start, PAGE_CACHE_SIZE,
-   start, p, mirror_num);
+   start, p, start - page_offset(p),
+   mirror_num);
if (ret)
break;
start += PAGE_CACHE_SIZE;
@@ -2127,6 +2128,7 @@ static int clean_io_failure(u64 start, struct page *page)
if (num_copies  1)  {
repair_io_failure(fs_info, start, failrec-len,
  failrec-logical, page,
+ start - page_offset(page),
  failrec-failed_mirror);
}
}
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 75b621b..a82ecbc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -340,7 +340,7 @@ struct btrfs_fs_info;
 
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
u64 length, u64 logical, struct page *page,
-   int mirror_num);
+   unsigned int pg_offset, int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cce122b..3978529 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -682,6 +682,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 
root, void *fixup_ctx)
fs_info = BTRFS_I(inode)-root-fs_info;
ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
fixup-logical, page,
+   offset - page_offset(page),
fixup-mirror_num);
unlock_page(page);
corrected = !ret;
-- 
1.9.3



[PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers

2014-09-12 Thread Miao Xie
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f8dda46..154cb8e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1981,8 +1981,7 @@ struct io_failure_record {
int in_validation;
 };
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
-   int did_repair)
+static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
int ret;
int err = 0;
@@ -2109,7 +2108,6 @@ static int clean_io_failure(u64 start, struct page *page)
struct btrfs_fs_info *fs_info = BTRFS_I(inode)-root-fs_info;
struct extent_state *state;
int num_copies;
-   int did_repair = 0;
int ret;
 
private = 0;
@@ -2130,7 +2128,6 @@ static int clean_io_failure(u64 start, struct page *page)
/* there was no real error, just free the record */
pr_debug(clean_io_failure: freeing dummy error at %llu\n,
 failrec-start);
-   did_repair = 1;
goto out;
}
if (fs_info-sb-s_flags  MS_RDONLY)
@@ -2147,19 +2144,16 @@ static int clean_io_failure(u64 start, struct page 
*page)
num_copies = btrfs_num_copies(fs_info, failrec-logical,
  failrec-len);
if (num_copies  1)  {
-   ret = repair_io_failure(fs_info, start, failrec-len,
-   failrec-logical, page,
-   failrec-failed_mirror);
-   did_repair = !ret;
+   repair_io_failure(fs_info, start, failrec-len,
+ failrec-logical, page,
+ failrec-failed_mirror);
}
-   ret = 0;
}
 
 out:
-   if (!ret)
-   ret = free_io_failure(inode, failrec, did_repair);
+   free_io_failure(inode, failrec);
 
-   return ret;
+   return 0;
 }
 
 /*
@@ -2269,7 +2263,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
 */
pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 num_copies, failrec->this_mirror, failed_mirror);
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
 
@@ -2312,13 +2306,13 @@ static int bio_readpage_error(struct bio *failed_bio, 
u64 phy_offset,
if (failrec-this_mirror  num_copies) {
pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 num_copies, failrec->this_mirror, failed_mirror);
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
 
bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
if (!bio) {
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
bio-bi_end_io = failed_bio-bi_end_io;
@@ -2349,7 +2343,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
 failrec-this_mirror,
 failrec-bio_flags, 0);
if (ret) {
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
bio_put(bio);
}
 
-- 
1.9.3



[PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data check and dio read data check

2014-09-12 Thread Miao Xie
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/inode.c | 102 +--
 1 file changed, 47 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index af304e1..e8139c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2893,6 +2893,40 @@ static int btrfs_writepage_end_io_hook(struct page 
*page, u64 start, u64 end,
return 0;
 }
 
+static int __readpage_endio_check(struct inode *inode,
+ struct btrfs_io_bio *io_bio,
+ int icsum, struct page *page,
+ int pgoff, u64 start, size_t len)
+{
+   char *kaddr;
+   u32 csum_expected;
+   u32 csum = ~(u32)0;
+   static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
+
+   csum_expected = *(((u32 *)io_bio-csum) + icsum);
+
+   kaddr = kmap_atomic(page);
+   csum = btrfs_csum_data(kaddr + pgoff, csum,  len);
+   btrfs_csum_final(csum, (char *)csum);
+   if (csum != csum_expected)
+   goto zeroit;
+
+   kunmap_atomic(kaddr);
+   return 0;
+zeroit:
+   if (__ratelimit(_rs))
+   btrfs_info(BTRFS_I(inode)->root->fs_info,
+  "csum failed ino %llu off %llu csum %u expected csum %u",
+  btrfs_ino(inode), start, csum, csum_expected);
+   memset(kaddr + pgoff, 1, len);
+   flush_dcache_page(page);
+   kunmap_atomic(kaddr);
+   if (csum_expected == 0)
+   return 0;
+   return -EIO;
+}
+
 /*
  * when reads are done, we need to check csums to verify the data is correct
  * if there's a match, we allow the bio to finish.  If not, the code in
@@ -2905,20 +2939,15 @@ static int btrfs_readpage_end_io_hook(struct 
btrfs_io_bio *io_bio,
size_t offset = start - page_offset(page);
struct inode *inode = page-mapping-host;
struct extent_io_tree *io_tree = BTRFS_I(inode)-io_tree;
-   char *kaddr;
struct btrfs_root *root = BTRFS_I(inode)-root;
-   u32 csum_expected;
-   u32 csum = ~(u32)0;
-   static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
- DEFAULT_RATELIMIT_BURST);
 
if (PageChecked(page)) {
ClearPageChecked(page);
-   goto good;
+   return 0;
}
 
if (BTRFS_I(inode)-flags  BTRFS_INODE_NODATASUM)
-   goto good;
+   return 0;
 
if (root-root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID 
test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
@@ -2928,28 +2957,8 @@ static int btrfs_readpage_end_io_hook(struct 
btrfs_io_bio *io_bio,
}
 
phy_offset >>= inode->i_sb->s_blocksize_bits;
-   csum_expected = *(((u32 *)io_bio-csum) + phy_offset);
-
-   kaddr = kmap_atomic(page);
-   csum = btrfs_csum_data(kaddr + offset, csum,  end - start + 1);
-   btrfs_csum_final(csum, (char *)csum);
-   if (csum != csum_expected)
-   goto zeroit;
-
-   kunmap_atomic(kaddr);
-good:
-   return 0;
-
-zeroit:
-   if (__ratelimit(_rs))
-   btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-   btrfs_ino(page->mapping->host), start, csum, csum_expected);
-   memset(kaddr + offset, 1, end - start + 1);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr);
-   if (csum_expected == 0)
-   return 0;
-   return -EIO;
+   return __readpage_endio_check(inode, io_bio, phy_offset, page, offset,
+ start, (size_t)(end - start + 1));
 }
 
 struct delayed_iput {
@@ -7194,41 +7203,24 @@ static void btrfs_endio_direct_read(struct bio *bio, 
int err)
struct btrfs_dio_private *dip = bio-bi_private;
struct bio_vec *bvec;
struct inode *inode = dip-inode;
-   struct btrfs_root *root = BTRFS_I(inode)-root;
struct bio *dio_bio;
struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-   u32 *csums = (u32 *)io_bio-csum;
u64 start;
+   int ret;
int i;
 
+   if (err || (BTRFS_I(inode)-flags  BTRFS_INODE_NODATASUM))
+   goto skip_checksum;
+
start = dip-logical_offset;
bio_for_each_segment_all(bvec, bio, i) {
-   if (!(BTRFS_I(inode)-flags  BTRFS_INODE_NODATASUM)) {
-   struct page *page = bvec-bv_page;
-   char *kaddr;
-   u32 csum = ~(u32)0;
-   unsigned long flags;
-
-   local_irq_save(flags);
-   kaddr = kmap_atomic(page);
-   csum = btrfs_csum_data(kaddr + bvec-bv_offset,
-  csum, bvec-bv_len

[PATCH v4 04/11] Btrfs: fix missing error handler if submitting re-read bio fails

2014-09-12 Thread Miao Xie
We forgot to free the failure record and the bio if submitting the re-read bio
failed. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 92a6d9f..f8dda46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,11 @@ static int bio_readpage_error(struct bio *failed_bio, 
u64 phy_offset,
ret = tree-ops-submit_bio_hook(inode, read_mode, bio,
 failrec-this_mirror,
 failrec-bio_flags, 0);
+   if (ret) {
+   free_io_failure(inode, failrec, 0);
+   bio_put(bio);
+   }
+
return ret;
 }
 
-- 
1.9.3



[PATCH v4 10/11] Btrfs: implement repair function when direct read fails

2014-09-12 Thread Miao Xie
This patch implements the data repair function for when a direct read fails.

The details of the implementation are:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we insert the endio work into a dedicated
  btrfs workqueue, not the common read endio workqueue, because the original
  endio work is still blocked in the btrfs endio workqueue; if we inserted the
  endio work of the io on the mirror into that workqueue, a deadlock would
  happen.
- After we get the right data, we write it back to the corrupted mirror.
- If the data on the new mirror is still corrupted, we try the next mirror
  until we read the right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.
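As a loose userspace sketch (not the kernel code), the retry loop in the steps above looks like this; `mirrors`, `data_is_good`, and `repair_read` are hypothetical stand-ins for the mirror bios and the endio checksum check:

```c
#include <assert.h>
#include <string.h>

#define NUM_MIRRORS 3

/* Hypothetical mirror contents: mirror 0 holds a corrupted copy. */
static const char *mirrors[NUM_MIRRORS] = { "BAD!", "GOOD", "GOOD" };

/* Stand-in for the checksum verification done at endio time. */
static int data_is_good(const char *buf)
{
	return strcmp(buf, "GOOD") == 0;
}

/*
 * Try every mirror except the one that already failed; stop at the
 * first copy that verifies. Returns the mirror the good data came
 * from (the caller would then write it back to the corrupted mirror),
 * or -1 if all mirrors were traversed without finding good data.
 */
static int repair_read(int failed_mirror, char *out, size_t len)
{
	for (int m = 0; m < NUM_MIRRORS; m++) {
		if (m == failed_mirror)
			continue;
		strncpy(out, mirrors[m], len);
		out[len - 1] = '\0';
		if (data_is_good(out))
			return m;
	}
	return -1;
}
```

The real patch additionally bounces each completed repair bio through the dedicated workqueue, for the deadlock reason given above.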

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 - v4:
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this suggestion was from Chris.

Changelog v1 - v3:
- None
---
 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |   2 +-
 fs/btrfs/ctree.h|   1 +
 fs/btrfs/disk-io.c  |  11 +-
 fs/btrfs/disk-io.h  |   1 +
 fs/btrfs/extent_io.c|  12 ++-
 fs/btrfs/extent_io.h|   5 +-
 fs/btrfs/inode.c| 276 
 9 files changed, 281 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index fbd76de..2da0a66 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -74,6 +74,7 @@ BTRFS_WORK_HELPER(endio_helper);
 BTRFS_WORK_HELPER(endio_meta_helper);
 BTRFS_WORK_HELPER(endio_meta_write_helper);
 BTRFS_WORK_HELPER(endio_raid56_helper);
+BTRFS_WORK_HELPER(endio_repair_helper);
 BTRFS_WORK_HELPER(rmw_helper);
 BTRFS_WORK_HELPER(endio_write_helper);
 BTRFS_WORK_HELPER(freespace_write_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index e9e31c9..e386c29 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -53,6 +53,7 @@ BTRFS_WORK_HELPER_PROTO(endio_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper);
 BTRFS_WORK_HELPER_PROTO(endio_raid56_helper);
+BTRFS_WORK_HELPER_PROTO(endio_repair_helper);
 BTRFS_WORK_HELPER_PROTO(rmw_helper);
 BTRFS_WORK_HELPER_PROTO(endio_write_helper);
 BTRFS_WORK_HELPER_PROTO(freespace_write_helper);
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 4d30947..7a7521c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -271,7 +271,7 @@ struct btrfs_dio_private {
 * The original bio may be splited to several sub-bios, this is
 * done during endio of sub-bios
 */
-   int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
+   int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7b54cd9..63acfd8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1538,6 +1538,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *endio_workers;
struct btrfs_workqueue *endio_meta_workers;
struct btrfs_workqueue *endio_raid56_workers;
+   struct btrfs_workqueue *endio_repair_workers;
struct btrfs_workqueue *rmw_workers;
struct btrfs_workqueue *endio_meta_write_workers;
struct btrfs_workqueue *endio_write_workers;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ff3ee22..1594d91 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -713,7 +713,11 @@ static void end_workqueue_bio(struct bio *bio, int err)
func = btrfs_endio_write_helper;
}
} else {
-   if (end_io_wq-metadata == BTRFS_WQ_ENDIO_RAID56) {
+   if (unlikely(end_io_wq-metadata ==
+BTRFS_WQ_ENDIO_DIO_REPAIR)) {
+   wq = fs_info-endio_repair_workers;
+   func = btrfs_endio_repair_helper;
+   } else if (end_io_wq-metadata == BTRFS_WQ_ENDIO_RAID56) {
wq = fs_info-endio_raid56_workers;
func = btrfs_endio_raid56_helper;
} else if (end_io_wq-metadata) {
@@ -741,6 +745,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct 
bio *bio,
int metadata)
 {
struct end_io_wq *end_io_wq;
+
end_io_wq = kmalloc(sizeof(*end_io_wq), GFP_NOFS);
if (!end_io_wq)
return -ENOMEM;
@@ -2059,6 +2064,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info 
*fs_info)
btrfs_destroy_workqueue(fs_info-endio_workers);
btrfs_destroy_workqueue(fs_info-endio_meta_workers);
btrfs_destroy_workqueue(fs_info-endio_raid56_workers);
+   btrfs_destroy_workqueue(fs_info-endio_repair_workers);
btrfs_destroy_workqueue(fs_info-rmw_workers);
btrfs_destroy_workqueue

[PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io

2014-09-12 Thread Miao Xie
The current code would load the checksum data several times when we split
a whole direct read io because of the limit of the raid stripe: it made us
search the csum tree several times. That just wasted time and made the
contention on the csum tree root more serious. This patch fixes the problem
by loading all the checksum data at once.
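A minimal userspace sketch of the idea, with a hypothetical `lookup_csums` standing in for the csum tree search (the counter shows that the whole dio now costs a single search, where the old scheme paid one per split bio):

```c
#include <assert.h>
#include <stdlib.h>

static int csum_tree_searches;	/* counts simulated csum tree lookups */

/* Hypothetical: fetch the csums for [offset, offset + len) in ONE search. */
static void lookup_csums(unsigned int *dst, unsigned long offset,
			 unsigned long len, unsigned long blocksize)
{
	csum_tree_searches++;
	for (unsigned long i = 0; i < len / blocksize; i++)
		dst[i] = (unsigned int)(offset / blocksize + i);	/* fake csums */
}

/*
 * New scheme: one lookup covers the whole direct io; each split bio
 * later just indexes into the shared array instead of searching the
 * tree again.
 */
static unsigned int *load_csums_once(unsigned long dio_len,
				     unsigned long blocksize)
{
	unsigned int *csums = malloc(sizeof(*csums) * (dio_len / blocksize));

	if (csums)
		lookup_csums(csums, 0, dio_len, blocksize);
	return csums;
}
```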

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 - v4:
- None

Changelog v2 - v3:
- Fix the wrong return value of btrfs_bio_clone

Changelog v1 - v2:
- Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
  a WARNing. It is reported by Filipe David Manana, Thanks.
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h   |  3 +--
 fs/btrfs/extent_io.c   | 13 +++--
 fs/btrfs/file-item.c   | 14 ++
 fs/btrfs/inode.c   | 38 +-
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fd87941..8bea70e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {
 
/* dio_bio came from fs/direct-io.c */
struct bio *dio_bio;
-   u8 csum[0];
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ded7781..7b54cd9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3719,8 +3719,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
- struct btrfs_dio_private *dip, struct bio *bio,
- u64 logical_offset);
+ struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 objectid, u64 pos,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86b39de..dfe1afe 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2621,9 +2621,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 
first_sector, int nr_vecs,
 
 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-   return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
-}
+   struct btrfs_io_bio *btrfs_bio;
+   struct bio *new;
 
+   new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+   if (new) {
+   btrfs_bio = btrfs_io_bio(new);
+   btrfs_bio-csum = NULL;
+   btrfs_bio-csum_allocated = NULL;
+   btrfs_bio-end_io = NULL;
+   }
+   return new;
+}
 
 /* this also allocates from the btrfs_bioset */
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6e6262e..783a943 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct 
inode *inode,
 }
 
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
- struct btrfs_dio_private *dip, struct bio *bio,
- u64 offset)
+ struct bio *bio, u64 offset)
 {
-   int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-   u16 csum_size = btrfs_super_csum_size(root-fs_info-super_copy);
-   int ret;
-
-   len >>= inode->i_sb->s_blocksize_bits;
-   len *= csum_size;
-
-   ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
- (u32 *)(dip-csum + len), 1);
-   return ret;
+   return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2118ea6..af304e1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7196,7 +7196,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int 
err)
struct inode *inode = dip-inode;
struct btrfs_root *root = BTRFS_I(inode)-root;
struct bio *dio_bio;
-   u32 *csums = (u32 *)dip-csum;
+   struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+   u32 *csums = (u32 *)io_bio-csum;
u64 start;
int i;
 
@@ -7238,6 +7239,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int 
err)
if (err)
clear_bit(BIO_UPTODATE, dio_bio-bi_flags);
dio_end_io(dio_bio, err);
+
+   if (io_bio-end_io)
+   io_bio-end_io(io_bio, err);
bio_put(bio);
 }
 
@@ -7377,13 +7381,20 @@ static inline int __btrfs_submit_dio_bio(struct bio 
*bio, struct inode *inode,
ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
if (ret)
goto err;
-   } else if (!skip_sum) {
-   ret

[PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing

2014-09-12 Thread Miao Xie
After the data is written successfully, we should clean up the read failure record
in that range because
- If we set data COW for the file, the range that the failure record points to is
  mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writing,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it because the
  failure record will be recreated if we read the same place again.

Sometimes we may fail to correct the data, so the failure records will be left
in the tree; we need to free them when we free the inode, or a memory leak
happens.
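A userspace sketch of the range-based cleanup, with a fixed array standing in for the extent-state failure tree (all names here are hypothetical, not the kernel API):

```c
#include <assert.h>

struct failrec {
	unsigned long start, end;	/* inclusive byte range */
	int live;
};

#define NREC 4
static struct failrec recs[NREC] = {
	{ 0, 4095, 1 }, { 4096, 8191, 1 }, { 8192, 12287, 1 }, { 16384, 20479, 1 },
};

/*
 * Free every failure record that overlaps [start, end], as done after
 * a successful write over that range, or over the whole file when the
 * inode is being evicted. Returns how many records were freed.
 */
static int free_failure_records(unsigned long start, unsigned long end)
{
	int freed = 0;

	for (int i = 0; i < NREC; i++) {
		if (!recs[i].live || recs[i].start > end || recs[i].end < start)
			continue;
		recs[i].live = 0;	/* kfree(failrec) in the kernel */
		freed++;
	}
	return freed;
}
```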

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 34 ++
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c |  6 ++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86dc352..5427fd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2138,6 +2138,40 @@ out:
return 0;
 }
 
+/*
+ * Can be called when
+ * - hold extent lock
+ * - under ordered extent
+ * - the inode is freeing
+ */
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end)
+{
+   struct extent_io_tree *failure_tree = BTRFS_I(inode)-io_failure_tree;
+   struct io_failure_record *failrec;
+   struct extent_state *state, *next;
+
+   if (RB_EMPTY_ROOT(failure_tree-state))
+   return;
+
+   spin_lock(failure_tree-lock);
+   state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+   while (state) {
+   if (state-start  end)
+   break;
+
+   ASSERT(state-end = end);
+
+   next = next_state(state);
+
+   failrec = (struct io_failure_record *)state-private;
+   free_extent_state(state);
+   kfree(failrec);
+
+   state = next;
+   }
+   spin_unlock(failure_tree-lock);
+}
+
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
struct io_failure_record **failrec_ret)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 176a4b1..5e91fb9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -366,6 +366,7 @@ struct io_failure_record {
int in_validation;
 };
 
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
struct io_failure_record **failrec_ret);
 int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bc8cdaf..c591af5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2697,6 +2697,10 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
goto out;
}
 
+   btrfs_free_io_failure_record(inode, ordered_extent-file_offset,
+ordered_extent-file_offset +
+ordered_extent-len - 1);
+
if (test_bit(BTRFS_ORDERED_TRUNCATED, ordered_extent-flags)) {
truncated = true;
logical_len = ordered_extent-truncated_len;
@@ -4792,6 +4796,8 @@ void btrfs_evict_inode(struct inode *inode)
/* do we really want it for -i_nlink  0 and zero btrfs_root_refs? */
btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
+   btrfs_free_io_failure_record(inode, 0, (u64)-1);
+
if (root-fs_info-log_root_recovering) {
BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 BTRFS_I(inode)-runtime_flags));
-- 
1.9.3



[PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions

2014-09-12 Thread Miao Xie
The data repair function of direct read will be implemented later, and some
code in bio_readpage_error will be reused, so this patch splits
bio_readpage_error into several functions that the direct read repair will
use later.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 159 ++-
 fs/btrfs/extent_io.h |  28 +
 2 files changed, 123 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 154cb8e..cf1de40 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1962,25 +1962,6 @@ static void check_page_uptodate(struct extent_io_tree 
*tree, struct page *page)
SetPageUptodate(page);
 }
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-   struct page *page;
-   u64 start;
-   u64 len;
-   u64 logical;
-   unsigned long bio_flags;
-   int this_mirror;
-   int failed_mirror;
-   int in_validation;
-};
-
 static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
int ret;
@@ -2156,40 +2137,24 @@ out:
return 0;
 }
 
-/*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this 
as
- * needed
- */
-
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
- struct page *page, u64 start, u64 end,
- int failed_mirror)
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+   struct io_failure_record **failrec_ret)
 {
-   struct io_failure_record *failrec = NULL;
+   struct io_failure_record *failrec;
u64 private;
struct extent_map *em;
-   struct inode *inode = page-mapping-host;
struct extent_io_tree *failure_tree = BTRFS_I(inode)-io_failure_tree;
struct extent_io_tree *tree = BTRFS_I(inode)-io_tree;
struct extent_map_tree *em_tree = BTRFS_I(inode)-extent_tree;
-   struct bio *bio;
-   struct btrfs_io_bio *btrfs_failed_bio;
-   struct btrfs_io_bio *btrfs_bio;
-   int num_copies;
int ret;
-   int read_mode;
u64 logical;
 
-   BUG_ON(failed_bio-bi_rw  REQ_WRITE);
-
ret = get_state_private(failure_tree, start, private);
if (ret) {
failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
if (!failrec)
return -ENOMEM;
+
failrec-start = start;
failrec-len = end - start + 1;
failrec-this_mirror = 0;
@@ -2209,11 +2174,11 @@ static int bio_readpage_error(struct bio *failed_bio, 
u64 phy_offset,
em = NULL;
}
read_unlock(em_tree-lock);
-
if (!em) {
kfree(failrec);
return -EIO;
}
+
logical = start - em-start;
logical = em-block_start + logical;
if (test_bit(EXTENT_FLAG_COMPRESSED, em-flags)) {
@@ -,8 +2187,10 @@ static int bio_readpage_error(struct bio *failed_bio, 
u64 phy_offset,
extent_set_compress_type(failrec-bio_flags,
 em-compress_type);
}
-   pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, len=%llu\n",
-logical, start, failrec->len);
+
+   pr_debug("Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n",
+logical, start, failrec->len);
+
failrec-logical = logical;
free_extent_map(em);
 
@@ -2243,8 +2210,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 
phy_offset,
}
} else {
failrec = (struct io_failure_record *)(unsigned long)private;
-   pr_debug("bio_readpage_error: (found) logical=%llu, "
-"start=%llu, len=%llu, validation=%d\n",
+   pr_debug("Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n",
 failrec-logical, failrec-start, failrec-len,
 failrec-in_validation);
/*
@@ -2253,6 +2219,17 @@ static int bio_readpage_error(struct bio *failed_bio, 
u64

[PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6

2014-09-12 Thread Miao Xie
We need the real mirror number for RAID0/5/6 when reading data, or if a read
error happens, we would pass 0 as the number of the mirror on which the io
error happened. That is wrong and would cause the filesystem to read the data
from the corrupted mirror again.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/volumes.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1aacf5f..4856547 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5073,6 +5073,8 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
num_stripes = min_t(u64, map-num_stripes,
stripe_nr_end - stripe_nr_orig);
stripe_index = do_div(stripe_nr, map-num_stripes);
+   if (!(rw  (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
+   mirror_num = 1;
} else if (map-type  BTRFS_BLOCK_GROUP_RAID1) {
if (rw  (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
num_stripes = map-num_stripes;
@@ -5176,6 +5178,9 @@ static int __btrfs_map_block(struct btrfs_fs_info 
*fs_info, int rw,
/* We distribute the parity blocks across stripes */
tmp = stripe_nr + stripe_index;
stripe_index = do_div(tmp, map-num_stripes);
+   if (!(rw  (REQ_WRITE | REQ_DISCARD |
+   REQ_GET_READ_MIRRORS))  mirror_num = 1)
+   mirror_num = 1;
}
} else {
/*
-- 
1.9.3



[PATCH v2 03/10] Btrfs: fix wrong fsid check of scrub

2014-09-03 Thread Miao Xie
All the metadata on the seed devices has the same fsid as the seed filesystem
that lives on those devices, so we should check it against the fsid of the
filesystem the device belongs to, not the fsid of the mounted filesystem.
Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 - v2:
- Use the const keyword to restrict the fsid.
- Remove the unnecessary variable.
---
 fs/btrfs/scrub.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index dfb92a2..12a6801 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1361,6 +1361,14 @@ static void scrub_recheck_block(struct btrfs_fs_info 
*fs_info,
return;
 }
 
+static inline int scrub_check_fsid(const u8 *fsid,
+  struct scrub_page *spage)
+{
+   struct btrfs_fs_devices *fs_devices = spage-dev-fs_devices;
+
+   return !memcmp(fsid, fs_devices-fsid, BTRFS_UUID_SIZE);
+}
+
 static void scrub_recheck_block_checksum(struct btrfs_fs_info *fs_info,
 struct scrub_block *sblock,
 int is_metadata, int have_csum,
@@ -1380,7 +1388,7 @@ static void scrub_recheck_block_checksum(struct 
btrfs_fs_info *fs_info,
h = (struct btrfs_header *)mapped_buffer;
 
if (sblock-pagev[0]-logical != btrfs_stack_header_bytenr(h) ||
-   memcmp(h-fsid, fs_info-fsid, BTRFS_UUID_SIZE) ||
+   !scrub_check_fsid(h-fsid, sblock-pagev[0]) ||
memcmp(h-chunk_tree_uuid, fs_info-chunk_tree_uuid,
   BTRFS_UUID_SIZE)) {
sblock-header_error = 1;
@@ -1750,7 +1758,7 @@ static int scrub_checksum_tree_block(struct scrub_block 
*sblock)
if (sblock-pagev[0]-generation != btrfs_stack_header_generation(h))
++fail;
 
-   if (memcmp(h-fsid, fs_info-fsid, BTRFS_UUID_SIZE))
+   if (!scrub_check_fsid(h-fsid, sblock-pagev[0]))
++fail;
 
if (memcmp(h-chunk_tree_uuid, fs_info-chunk_tree_uuid,
@@ -1790,8 +1798,6 @@ static int scrub_checksum_super(struct scrub_block 
*sblock)
 {
struct btrfs_super_block *s;
struct scrub_ctx *sctx = sblock-sctx;
-   struct btrfs_root *root = sctx-dev_root;
-   struct btrfs_fs_info *fs_info = root-fs_info;
u8 calculated_csum[BTRFS_CSUM_SIZE];
u8 on_disk_csum[BTRFS_CSUM_SIZE];
struct page *page;
@@ -1816,7 +1822,7 @@ static int scrub_checksum_super(struct scrub_block 
*sblock)
if (sblock-pagev[0]-generation != btrfs_super_generation(s))
++fail_gen;
 
-   if (memcmp(s-fsid, fs_info-fsid, BTRFS_UUID_SIZE))
+   if (!scrub_check_fsid(s-fsid, sblock-pagev[0]))
++fail_cor;
 
len = BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE;
-- 
1.9.3



[PATCH v2 06/10] Btrfs: Fix the problem that the dirty flag of dev stats is cleared

2014-09-03 Thread Miao Xie
An io error might happen while we are writing out the device stats; the
device stats information and the dirty flag would be updated at that time,
but the current code didn't consider this case and just cleared the dirty
flag, so we would forget to write out the new device stats information.
Fix it.
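The fix turns the dirty flag into a counter, reads it before the write-out, and subtracts exactly that amount afterwards, so increments that race with the write-out keep the device dirty. A C11-atomics sketch of the pattern (`run_dev_stats` is a simplified stand-in, not the kernel function):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int dev_stats_dirty;	/* was a plain int flag before the fix */

static void dev_stat_inc(void)
{
	atomic_fetch_add(&dev_stats_dirty, 1);	/* a counter changed: mark dirty */
}

/*
 * Read the dirty count first, and after a successful write-out subtract
 * only that amount. An increment that races with the write-out leaves
 * the counter non-zero, so the next transaction commit writes the stats
 * again instead of losing them. Returns the number of updates handled.
 */
static int run_dev_stats(int write_ok)
{
	int dirtied = atomic_load(&dev_stats_dirty);

	if (!dirtied)
		return 0;			/* nothing to write */
	if (write_ok)				/* the stat item update succeeded */
		atomic_fetch_sub(&dev_stats_dirty, dirtied);
	return dirtied;
}
```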

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v2:
- Change the variable name and make some cleanups per David's comment
---
 fs/btrfs/volumes.c |  8 ++--
 fs/btrfs/volumes.h | 16 
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 19188df..4ea73c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -159,6 +159,7 @@ static struct btrfs_device *__alloc_device(void)
 
spin_lock_init(dev-reada_lock);
atomic_set(dev-reada_in_flight, 0);
+   atomic_set(dev-dev_stats_dirty, 0);
INIT_RADIX_TREE(dev-reada_zones, GFP_NOFS  ~__GFP_WAIT);
INIT_RADIX_TREE(dev-reada_extents, GFP_NOFS  ~__GFP_WAIT);
 
@@ -6398,16 +6399,19 @@ int btrfs_run_dev_stats(struct btrfs_trans_handle 
*trans,
struct btrfs_root *dev_root = fs_info-dev_root;
struct btrfs_fs_devices *fs_devices = fs_info-fs_devices;
struct btrfs_device *device;
+   int dirtied;
int ret = 0;
 
mutex_lock(fs_devices-device_list_mutex);
list_for_each_entry(device, fs_devices-devices, dev_list) {
-   if (!device-dev_stats_valid || !device-dev_stats_dirty)
+   dirtied = atomic_read(device-dev_stats_dirty);
+
+   if (!device-dev_stats_valid || !dirtied)
continue;
 
ret = update_dev_stat_item(trans, dev_root, device);
if (!ret)
-   device-dev_stats_dirty = 0;
+   atomic_sub(dirtied, device-dev_stats_dirty);
}
mutex_unlock(fs_devices-device_list_mutex);
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6fcc8ea..9a1eff3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -110,7 +110,8 @@ struct btrfs_device {
/* disk I/O failure stats. For detailed description refer to
 * enum btrfs_dev_stat_values in ioctl.h */
int dev_stats_valid;
-   int dev_stats_dirty; /* counters need to be written to disk */
+
+   atomic_t dev_stats_dirty; /* counters need to be written to disk */
atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
 };
 
@@ -359,11 +360,18 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root 
*root,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
struct btrfs_root *extent_root,
u64 chunk_offset, u64 chunk_size);
+
+static inline void btrfs_dev_dirty_stat(struct btrfs_device *dev)
+{
+   smp_mb__before_atomic();
+   atomic_inc(dev-dev_stats_dirty);
+}
+
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
  int index)
 {
atomic_inc(dev-dev_stat_values + index);
-   dev-dev_stats_dirty = 1;
+   btrfs_dev_dirty_stat(dev);
 }
 
 static inline int btrfs_dev_stat_read(struct btrfs_device *dev,
@@ -378,7 +386,7 @@ static inline int btrfs_dev_stat_read_and_reset(struct 
btrfs_device *dev,
int ret;
 
ret = atomic_xchg(dev-dev_stat_values + index, 0);
-   dev-dev_stats_dirty = 1;
+   btrfs_dev_dirty_stat(dev);
return ret;
 }
 
@@ -386,7 +394,7 @@ static inline void btrfs_dev_stat_set(struct btrfs_device 
*dev,
  int index, unsigned long val)
 {
atomic_set(dev-dev_stat_values + index, val);
-   dev-dev_stats_dirty = 1;
+   btrfs_dev_dirty_stat(dev);
 }
 
 static inline void btrfs_dev_stat_reset(struct btrfs_device *dev,
-- 
1.9.3



Re: [PATCH v2 11/12] Btrfs: implement repair function when direct read fails

2014-09-03 Thread Miao Xie
On Tue, 2 Sep 2014 09:05:15 -0400, Chris Mason wrote:
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 08e65e9..56b1546 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -698,7 +719,12 @@ static void end_workqueue_bio(struct bio *bio, int err)
  
 	fs_info = end_io_wq->info;
 	end_io_wq->error = err;
 -	btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL, NULL);
 +
 +	if (likely(end_io_wq->metadata != BTRFS_WQ_ENDIO_DIO_REPAIR))
 +		btrfs_init_work(&end_io_wq->work, end_workqueue_fn, NULL,
 +				NULL);
 +	else
 +		INIT_WORK(&end_io_wq->work.normal_work, dio_end_workqueue_fn);

 It's not clear why this one is using INIT_WORK instead of
 btrfs_init_work, or why we're calling directly into queue_work instead
 of btrfs_queue_work.  What am I missing?

 I'm sorry that I forgot to write the explanation in this patch's changelog;
 I wrote it in patch 0.

 2. When the io on the mirror ends, we will insert the endio work into the
    system workqueue, not btrfs's own endio workqueue, because the original
    endio work is still blocked in the btrfs endio workqueue; if we inserted
    the endio work of the io on the mirror into that workqueue, a deadlock
    would happen.

 Can you elaborate on the deadlock?

 Now that buffer read can insert a subsequent read-mirror bio into btrfs endio
 workqueue without problems, what's the difference?
 
 We do have problems if we're inserting dependent items in the same
 workqueue.
 
 Miao, please make a repair workqueue.  I'll also have a use for it in
 the raid56 parity work I think.

OK, I'll update the patch soon.

Thanks
Miao



[PATCH 02/18] Btrfs: cleanup double assignment of device-bytes_used when device replace finishes

2014-09-03 Thread Miao Xie
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/dev-replace.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index a85b5f5..10dfb41 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -550,7 +550,6 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
	tgt_device->is_tgtdev_for_dev_replace = 0;
	tgt_device->devid = src_device->devid;
	src_device->devid = BTRFS_DEV_REPLACE_DEVID;
-	tgt_device->bytes_used = src_device->bytes_used;
	memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp));
	memcpy(tgt_device->uuid, src_device->uuid, sizeof(tgt_device->uuid));
	memcpy(src_device->uuid, uuid_tmp, sizeof(src_device->uuid));
memcpy(src_device-uuid, uuid_tmp, sizeof(src_device-uuid));
-- 
1.9.3



[PATCH 06/18] Btrfs: Fix wrong free_chunk_space assignment during removing a device

2014-09-03 Thread Miao Xie
While removing a device, we have already updated free_chunk_space when we
shrank the device, so we needn't assign a new value to it after the device
shrink. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f8273bb..1524b3f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1671,11 +1671,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
	if (ret)
		goto error_undo;
 
-	spin_lock(&root->fs_info->free_chunk_lock);
-	root->fs_info->free_chunk_space = device->total_bytes -
-		device->bytes_used;
-	spin_unlock(&root->fs_info->free_chunk_lock);
-
	device->in_fs_metadata = 0;
	btrfs_scrub_cancel_dev(root->fs_info, device);
 
-- 
1.9.3



[PATCH 01/18] Btrfs: cleanup unused num_can_discard in fs_devices

2014-09-03 Thread Miao Xie
The member variable num_can_discard of the fs_devices structure is set,
but nothing uses it, so remove it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 16 ++--
 fs/btrfs/volumes.h |  1 -
 2 files changed, 2 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e9676a4..483fc6d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -720,8 +720,6 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
		fs_devices->rw_devices--;
	}
 
-	if (device->can_discard)
-		fs_devices->num_can_discard--;
	if (device->missing)
		fs_devices->missing_devices--;
 
@@ -828,10 +826,8 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
	}
 
	q = bdev_get_queue(bdev);
-	if (blk_queue_discard(q)) {
+	if (blk_queue_discard(q))
		device->can_discard = 1;
-		fs_devices->num_can_discard++;
-	}
 
	device->bdev = bdev;
	device->in_fs_metadata = 0;
@@ -1835,8 +1831,7 @@ void btrfs_rm_dev_replace_srcdev(struct btrfs_fs_info *fs_info,
		if (!fs_devices->seeding)
			fs_devices->rw_devices++;
	}
-	if (srcdev->can_discard)
-		fs_devices->num_can_discard--;
+
	if (srcdev->bdev) {
		fs_devices->open_devices--;
 
@@ -1886,8 +1881,6 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
		fs_info->fs_devices->open_devices--;
	}
	fs_info->fs_devices->num_devices--;
-	if (tgtdev->can_discard)
-		fs_info->fs_devices->num_can_discard++;
 
	next_device = list_entry(fs_info->fs_devices->devices.next,
				 struct btrfs_device, dev_list);
@@ -2008,7 +2001,6 @@ static int btrfs_prepare_sprout(struct btrfs_root *root)
	fs_devices->num_devices = 0;
	fs_devices->open_devices = 0;
	fs_devices->missing_devices = 0;
-	fs_devices->num_can_discard = 0;
	fs_devices->rotating = 0;
	fs_devices->seed = seed_devices;
 
@@ -2200,8 +2192,6 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
	root->fs_info->fs_devices->open_devices++;
	root->fs_info->fs_devices->rw_devices++;
	root->fs_info->fs_devices->total_devices++;
-	if (device->can_discard)
-		root->fs_info->fs_devices->num_can_discard++;
	root->fs_info->fs_devices->total_rw_bytes += device->total_bytes;
 
	spin_lock(&root->fs_info->free_chunk_lock);
@@ -2371,8 +2361,6 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
	list_add(&device->dev_list, &fs_info->fs_devices->devices);
	fs_info->fs_devices->num_devices++;
	fs_info->fs_devices->open_devices++;
-	if (device->can_discard)
-		fs_info->fs_devices->num_can_discard++;
	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
*device_out = device;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index e894ac6..37f8bff 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -124,7 +124,6 @@ struct btrfs_fs_devices {
u64 rw_devices;
u64 missing_devices;
u64 total_rw_bytes;
-   u64 num_can_discard;
u64 total_devices;
struct block_device *latest_bdev;
 
-- 
1.9.3



[PATCH 4/5] Btrfs: restructure btrfs_get_bdev_and_sb and pick up some code used later

2014-09-03 Thread Miao Xie
Some code in btrfs_get_bdev_and_sb will be re-used by another function later,
so restructure btrfs_get_bdev_and_sb and split that code out into a new
function.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 66 +-
 1 file changed, 36 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bcb19d5..9d52fd8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -193,42 +193,47 @@ static noinline struct btrfs_fs_devices *find_fsid(u8 *fsid)
return NULL;
 }
 
+static int __btrfs_get_sb(struct block_device *bdev, int flush,
+			  struct buffer_head **bh)
+{
+	int ret;
+
+	if (flush)
+		filemap_write_and_wait(bdev->bd_inode->i_mapping);
+
+	ret = set_blocksize(bdev, 4096);
+	if (ret)
+		return ret;
+
+	invalidate_bdev(bdev);
+	*bh = btrfs_read_dev_super(bdev);
+	if (!*bh)
+		return -EINVAL;
+
+	return 0;
+}
+}
+
 static int
-btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
-		      int flush, struct block_device **bdev,
-		      struct buffer_head **bh)
+btrfs_get_bdev_and_sb_by_path(const char *device_path, fmode_t flags,
+			      void *holder, int flush,
+			      struct block_device **bdev,
+			      struct buffer_head **bh)
 {
	int ret;
 
	*bdev = blkdev_get_by_path(device_path, flags, holder);
-
	if (IS_ERR(*bdev)) {
-		ret = PTR_ERR(*bdev);
		printk(KERN_INFO "BTRFS: open %s failed\n", device_path);
-		goto error;
+		return PTR_ERR(*bdev);
	}
 
-	if (flush)
-		filemap_write_and_wait((*bdev)->bd_inode->i_mapping);
-	ret = set_blocksize(*bdev, 4096);
+	ret = __btrfs_get_sb(*bdev, flush, bh);
	if (ret) {
		blkdev_put(*bdev, flags);
-		goto error;
-	}
-	invalidate_bdev(*bdev);
-	*bh = btrfs_read_dev_super(*bdev);
-	if (!*bh) {
-		ret = -EINVAL;
-		blkdev_put(*bdev, flags);
-		goto error;
+		return ret;
	}
 
	return 0;
-
-error:
-	*bdev = NULL;
-	*bh = NULL;
-	return ret;
 }
 
 static void requeue_list(struct btrfs_pending_bios *pending_bios,
@@ -806,8 +811,8 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
			continue;
 
		/* Just open everything we can; ignore failures here */
-		if (btrfs_get_bdev_and_sb(device->name->str, flags, holder, 1,
-					  &bdev, &bh))
+		if (btrfs_get_bdev_and_sb_by_path(device->name->str, flags,
+						  holder, 1, &bdev, &bh))
			continue;
 
		disk_super = (struct btrfs_super_block *)bh->b_data;
@@ -1629,10 +1634,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
			goto out;
		}
	} else {
-		ret = btrfs_get_bdev_and_sb(device_path,
-					    FMODE_WRITE | FMODE_EXCL,
-					    root->fs_info->bdev_holder, 0,
-					    &bdev, &bh);
+		ret = btrfs_get_bdev_and_sb_by_path(device_path,
+						    FMODE_WRITE | FMODE_EXCL,
+						    root->fs_info->bdev_holder,
+						    0, &bdev, &bh);
		if (ret)
			goto out;
		disk_super = (struct btrfs_super_block *)bh->b_data;
@@ -1906,8 +1911,9 @@ static int btrfs_find_device_by_path(struct btrfs_root *root, char *device_path,
	struct buffer_head *bh;
 
	*device = NULL;
-	ret = btrfs_get_bdev_and_sb(device_path, FMODE_READ,
-				    root->fs_info->bdev_holder, 0, &bdev, &bh);
+	ret = btrfs_get_bdev_and_sb_by_path(device_path, FMODE_READ,
+					    root->fs_info->bdev_holder, 0,
+					    &bdev, &bh);
	if (ret)
		return ret;
	disk_super = (struct btrfs_super_block *)bh->b_data;
-- 
1.9.3



[PATCH 03/18] Btrfs: fix unprotected assignment of the target device

2014-09-03 Thread Miao Xie
We didn't protect the assignment of the target device, which could cause the
super block update to be skipped because we might see a wrong size for the
target device during the assignment. Fix it by moving the assignment
statements into the initialization function of the target device. Another
merit is that we can check whether the target device is suitable earlier.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/dev-replace.c | 32 
 fs/btrfs/volumes.c | 23 +++
 fs/btrfs/volumes.h |  1 +
 3 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 10dfb41..72dc02e 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -330,29 +330,19 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
		return -EINVAL;
 
	mutex_lock(&fs_info->volume_mutex);
-	ret = btrfs_init_dev_replace_tgtdev(root, args->start.tgtdev_name,
-					    &tgt_device);
-	if (ret) {
-		btrfs_err(fs_info, "target device %s is invalid!",
-			  args->start.tgtdev_name);
-		mutex_unlock(&fs_info->volume_mutex);
-		return -EINVAL;
-	}
-
	ret = btrfs_dev_replace_find_srcdev(root, args->start.srcdevid,
					    args->start.srcdev_name,
					    &src_device);
-	mutex_unlock(&fs_info->volume_mutex);
	if (ret) {
-		ret = -EINVAL;
-		goto leave_no_lock;
+		mutex_unlock(&fs_info->volume_mutex);
+		return ret;
	}
 
-	if (tgt_device->total_bytes < src_device->total_bytes) {
-		btrfs_err(fs_info, "target device is smaller than source device!");
-		ret = -EINVAL;
-		goto leave_no_lock;
-	}
+	ret = btrfs_init_dev_replace_tgtdev(root, args->start.tgtdev_name,
+					    src_device, &tgt_device);
+	mutex_unlock(&fs_info->volume_mutex);
+	if (ret)
+		return ret;
 
	btrfs_dev_replace_lock(dev_replace);
	switch (dev_replace->replace_state) {
@@ -380,10 +370,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
		  src_device->devid,
		  rcu_str_deref(tgt_device->name));
 
-	tgt_device->total_bytes = src_device->total_bytes;
-	tgt_device->disk_total_bytes = src_device->disk_total_bytes;
-	tgt_device->bytes_used = src_device->bytes_used;
-
	/*
	 * from now on, the writes to the srcdev are all duplicated to
	 * go to the tgtdev as well (refer to btrfs_map_block()).
@@ -426,9 +412,7 @@ leave:
	dev_replace->srcdev = NULL;
	dev_replace->tgtdev = NULL;
	btrfs_dev_replace_unlock(dev_replace);
-leave_no_lock:
-	if (tgt_device)
-		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
	return ret;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 483fc6d..1646659 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2295,6 +2295,7 @@ error:
 }
 
 int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
+ struct btrfs_device *srcdev,
  struct btrfs_device **device_out)
 {
struct request_queue *q;
@@ -2307,24 +2308,37 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
	int ret = 0;
 
	*device_out = NULL;
-	if (fs_info->fs_devices->seeding)
+	if (fs_info->fs_devices->seeding) {
+		btrfs_err(fs_info, "the filesystem is a seed filesystem!");
		return -EINVAL;
+	}
 
	bdev = blkdev_get_by_path(device_path, FMODE_WRITE | FMODE_EXCL,
				  fs_info->bdev_holder);
-	if (IS_ERR(bdev))
+	if (IS_ERR(bdev)) {
+		btrfs_err(fs_info, "target device %s is invalid!", device_path);
		return PTR_ERR(bdev);
+	}
 
	filemap_write_and_wait(bdev->bd_inode->i_mapping);
 
	devices = &fs_info->fs_devices->devices;
	list_for_each_entry(device, devices, dev_list) {
		if (device->bdev == bdev) {
+			btrfs_err(fs_info, "target device is in the filesystem!");
			ret = -EEXIST;
			goto error;
		}
	}
 
+	if (i_size_read(bdev->bd_inode) < srcdev->total_bytes) {
+		btrfs_err(fs_info, "target device is smaller than source device!");
+		ret = -EINVAL;
+		goto error;
+	}
+
	device = btrfs_alloc_device(NULL, &devid, NULL);
	if (IS_ERR(device)) {
		ret = PTR_ERR(device);
@@ -2348,8 +2362,9 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root

[PATCH 3/5] Btrfs: restructure btrfs_scan_one_device

2014-09-03 Thread Miao Xie
Some code in btrfs_scan_one_device will be re-used by another function later,
so restructure btrfs_scan_one_device and split that code out into a new
function.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 57 +++---
 1 file changed, 33 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 740a4f9..bcb19d5 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -885,24 +885,18 @@ int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
return ret;
 }
 
-/*
- * Look for a btrfs signature on a device. This may be called out of the mount path
- * and we are not allowed to call set_blocksize during the scan. The superblock
- * is read via pagecache
- */
-int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
-			  struct btrfs_fs_devices **fs_devices_ret)
+static int __scan_device(struct block_device *bdev, const char *path,
+			 struct btrfs_fs_devices **fs_devices_ret)
 {
	struct btrfs_super_block *disk_super;
-	struct block_device *bdev;
	struct page *page;
	void *p;
-	int ret = -EINVAL;
	u64 devid;
	u64 transid;
	u64 total_devices;
	u64 bytenr;
	pgoff_t index;
+	int ret;
 
/*
 * we would like to check all the supers, but that would make
@@ -911,38 +905,30 @@ int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
 * later supers, using BTRFS_SUPER_MIRROR_MAX instead
 */
	bytenr = btrfs_sb_offset(0);
-	flags |= FMODE_EXCL;
-	mutex_lock(&uuid_mutex);
-
-	bdev = blkdev_get_by_path(path, flags, holder);
-
-	if (IS_ERR(bdev)) {
-		ret = PTR_ERR(bdev);
-		goto error;
-	}
 
	/* make sure our super fits in the device */
	if (bytenr + PAGE_CACHE_SIZE >= i_size_read(bdev->bd_inode))
-		goto error_bdev_put;
+		return -EINVAL;
 
	/* make sure our super fits in the page */
	if (sizeof(*disk_super) > PAGE_CACHE_SIZE)
-		goto error_bdev_put;
+		return -EINVAL;
 
	/* make sure our super doesn't straddle pages on disk */
	index = bytenr >> PAGE_CACHE_SHIFT;
	if ((bytenr + sizeof(*disk_super) - 1) >> PAGE_CACHE_SHIFT != index)
-		goto error_bdev_put;
+		return -EINVAL;
 
	/* pull in the page with our super */
	page = read_cache_page_gfp(bdev->bd_inode->i_mapping,
				   index, GFP_NOFS);
 
	if (IS_ERR_OR_NULL(page))
-		goto error_bdev_put;
+		return -ENOMEM;
 
-	p = kmap(page);
+	ret = -EINVAL;
 
+	p = kmap(page);
	/* align our pointer to the offset of the super block */
	disk_super = p + (bytenr & ~PAGE_CACHE_MASK);
 
@@ -974,7 +960,30 @@ error_unmap:
	kunmap(page);
	page_cache_release(page);
 
-error_bdev_put:
+	return ret;
+}
+
+/*
+ * Look for a btrfs signature on a device. This may be called out of the mount path
+ * and we are not allowed to call set_blocksize during the scan. The superblock
+ * is read via pagecache
+ */
+int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder,
+			  struct btrfs_fs_devices **fs_devices_ret)
+{
+	struct block_device *bdev;
+	int ret;
+
+	flags |= FMODE_EXCL;
+
+	mutex_lock(&uuid_mutex);
+	bdev = blkdev_get_by_path(path, flags, holder);
+	if (IS_ERR(bdev)) {
+		ret = PTR_ERR(bdev);
+		goto error;
+	}
+
+	ret = __scan_device(bdev, path, fs_devices_ret);
	blkdev_put(bdev, flags);
 error:
	mutex_unlock(&uuid_mutex);
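The three placement checks that __scan_device performs before touching the pagecache can be modelled in userspace. This is only an illustrative sketch: `SB_PAGE_SHIFT` and the superblock size are stand-in values, not the kernel's `PAGE_CACHE_*` definitions, and `super_placement_ok` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for PAGE_CACHE_SHIFT/PAGE_CACHE_SIZE (illustrative only). */
#define SB_PAGE_SHIFT 12
#define SB_PAGE_SIZE  (1UL << SB_PAGE_SHIFT)

/* Returns 1 when a super of super_size bytes at offset bytenr can be read
 * through the pagecache of a dev_size-byte device, mirroring the checks:
 * fits in the device, fits in one page, does not straddle two pages. */
static int super_placement_ok(uint64_t bytenr, size_t super_size,
			      uint64_t dev_size)
{
	if (bytenr + SB_PAGE_SIZE >= dev_size)	/* fits in the device */
		return 0;
	if (super_size > SB_PAGE_SIZE)		/* fits in one page */
		return 0;
	/* first and last byte of the super must share a page index */
	if ((bytenr >> SB_PAGE_SHIFT) !=
	    ((bytenr + super_size - 1) >> SB_PAGE_SHIFT))
		return 0;
	return 1;
}
```

The straddle check works because `bytenr >> SB_PAGE_SHIFT` is the page index of the first byte; if the last byte lands on a different index, a single `read_cache_page_gfp` cannot cover the super.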
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 16/18] Btrfs: stop mounting the fs if the non-ENOENT errors happen when opening seed fs

2014-09-03 Thread Miao Xie
When we open a seed filesystem, if the degraded mount option is set, we
continue to mount the fs even if we don't find some devices in the seed
filesystem. But we should stop mounting if other errors happen. Fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fd8141e..cc59fcb 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6093,7 +6093,7 @@ static int read_one_dev(struct btrfs_root *root,
 
	if (memcmp(fs_uuid, root->fs_info->fsid, BTRFS_UUID_SIZE)) {
		ret = open_seed_devices(root, fs_uuid);
-		if (ret && !btrfs_test_opt(root, DEGRADED))
+		if (ret && !(ret == -ENOENT && btrfs_test_opt(root, DEGRADED)))
			return ret;
	}
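The logic of the fixed condition is easy to lose in the negations, so here is a userspace restatement. `must_abort_mount` is a hypothetical name for illustration; the expression inside is the one the patch installs.

```c
#include <assert.h>
#include <errno.h>

/* Sketch of the fixed check in read_one_dev: a missing seed device
 * (-ENOENT) is tolerated only when the "degraded" mount option is set;
 * any other error always aborts the mount. Returns 1 to abort. */
static int must_abort_mount(int ret, int degraded)
{
	return ret && !(ret == -ENOENT && degraded);
}
```

The old condition, `ret && !degraded`, tolerated *every* error under a degraded mount; the new one narrows the exemption to -ENOENT alone.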
 
-- 
1.9.3



[PATCH 14/18] Btrfs: fix use-after-free problem of the device during device replace

2014-09-03 Thread Miao Xie
The problem is:

Task0 (device scan task)		Task1 (device replace task)
scan_one_device()
mutex_lock(&uuid_mutex)
device = find_device()
					mutex_lock(&device_list_mutex)
					lock_chunk()
					rm_and_free_source_device
					unlock_chunk()
					mutex_unlock(&device_list_mutex)
check device

Destroying the target device if device replace fails also has the same problem.

We fix this problem by locking uuid_mutex during destroying source device or
target device, just like the device remove operation.

This is a temporary solution; we can fix this problem and make the code
clearer with an atomic counter in the future.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/dev-replace.c | 3 +++
 fs/btrfs/volumes.c | 4 +++-
 fs/btrfs/volumes.h | 2 ++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index aa4c828..e9cbbdb 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -509,6 +509,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
	ret = btrfs_commit_transaction(trans, root);
	WARN_ON(ret);
 
+	mutex_lock(&uuid_mutex);
	/* keep away write_all_supers() during the finishing procedure */
	mutex_lock(&root->fs_info->fs_devices->device_list_mutex);
	mutex_lock(&root->fs_info->chunk_mutex);
@@ -536,6 +537,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
	btrfs_dev_replace_unlock(dev_replace);
	mutex_unlock(&root->fs_info->chunk_mutex);
	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+	mutex_unlock(&uuid_mutex);
	if (tgt_device)
		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
@@ -591,6 +593,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info,
	 */
	mutex_unlock(&root->fs_info->chunk_mutex);
	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
+	mutex_unlock(&uuid_mutex);
 
/* write back the superblocks */
trans = btrfs_start_transaction(root, 0);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f0173b1..24d7001 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -50,7 +50,7 @@ static void __btrfs_reset_dev_stats(struct btrfs_device *dev);
 static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev);
 static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 
-static DEFINE_MUTEX(uuid_mutex);
+DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 
 static void lock_chunks(struct btrfs_root *root)
@@ -1867,6 +1867,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 {
	struct btrfs_device *next_device;
 
+	mutex_lock(&uuid_mutex);
	WARN_ON(!tgtdev);
	mutex_lock(&fs_info->fs_devices->device_list_mutex);
	if (tgtdev->bdev) {
@@ -1886,6 +1887,7 @@ void btrfs_destroy_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
	call_rcu(&tgtdev->rcu, free_device);
 
	mutex_unlock(&fs_info->fs_devices->device_list_mutex);
+	mutex_unlock(&uuid_mutex);
 }
 
 static int btrfs_find_device_by_path(struct btrfs_root *root, char 
*device_path,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 76600a3..2b37da3 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -24,6 +24,8 @@
 #include <linux/btrfs.h>
 #include "async-thread.h"
 
+extern struct mutex uuid_mutex;
+
 #define BTRFS_STRIPE_LEN   (64 * 1024)
 
 struct buffer_head;
-- 
1.9.3



[PATCH 08/18] Btrfs: update free_chunk_space during allocting a new chunk

2014-09-03 Thread Miao Xie
We should update free_chunk_space as soon as we allocate a new chunk,
not when we deal with the pending device update and block group insertion,
because we need accurate free_chunk_space data to calculate the reserved
space. If we don't update it in time, we would treat disk space that has
already been allocated as free space and use it for overcommit reservation.
Fix it.
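To see why the ordering matters, here is a toy model of the reservation check. Everything here is illustrative: `fs_state`, `alloc_chunk_fixed`, and `can_overcommit` are made-up stand-ins for btrfs's real structures, and the numbers are arbitrary; the point is only that deducting the stripes at allocation time keeps the overcommit check honest.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model (not kernel code) of the ordering problem the patch fixes. */
struct fs_state {
	uint64_t free_chunk_space;	/* unallocated bytes across devices */
};

/* Patched behaviour: deduct the new chunk's stripes immediately, inside
 * chunk allocation, instead of deferring to btrfs_finish_chunk_alloc. */
static void alloc_chunk_fixed(struct fs_state *fs, uint64_t stripe_size,
			      int num_stripes)
{
	fs->free_chunk_space -= stripe_size * (uint64_t)num_stripes;
}

/* Simplified stand-in for the overcommit reservation check. */
static int can_overcommit(const struct fs_state *fs, uint64_t bytes)
{
	return bytes <= fs->free_chunk_space;
}
```

With the deferred update, `can_overcommit` would still see the pre-allocation value and approve reservations against space that is no longer free.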

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 45e0b5d..d8e4a3d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4432,6 +4432,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
	for (i = 0; i < map->num_stripes; i++)
		map->stripes[i].dev->bytes_used += stripe_size;
 
+	spin_lock(&extent_root->fs_info->free_chunk_lock);
+	extent_root->fs_info->free_chunk_space -= (stripe_size *
+						   map->num_stripes);
+	spin_unlock(&extent_root->fs_info->free_chunk_lock);
+
	free_extent_map(em);
	check_raid56_incompat_flag(extent_root->fs_info, type);
 
@@ -4515,11 +4520,6 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
		goto out;
	}
 
-	spin_lock(&extent_root->fs_info->free_chunk_lock);
-	extent_root->fs_info->free_chunk_space -= (stripe_size *
-						   map->num_stripes);
-	spin_unlock(&extent_root->fs_info->free_chunk_lock);
-
	stripe = &chunk->stripe;
	for (i = 0; i < map->num_stripes; i++) {
		device = map->stripes[i].dev;
-- 
1.9.3



[PATCH 2/5] Btrfs: don't return btrfs_fs_devices if the caller doesn't want it

2014-09-03 Thread Miao Xie
We will implement a function that scans all the devices in the system and
builds the device set for btrfs. In that case, we needn't get btrfs_fs_devices
when adding a device into the list. This patch changes device_list_add to
implement this feature.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1aacf5f..740a4f9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -568,7 +568,8 @@ static noinline int device_list_add(const char *path,
	if (!fs_devices->opened)
		device->generation = found_transid;
 
-	*fs_devices_ret = fs_devices;
+	if (fs_devices_ret)
+		*fs_devices_ret = fs_devices;
 
return ret;
 }
-- 
1.9.3



[PATCH 10/18] Btrfs: fix unprotected system chunk array insertion

2014-09-03 Thread Miao Xie
We didn't protect the system chunk array when we added a new system chunk
into it; the array could be corrupted if someone removed or added a system
chunk at the same time. Fix it by taking the chunk lock.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/volumes.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 41da102..9f22398d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4054,10 +4054,13 @@ static int btrfs_add_system_chunk(struct btrfs_root *root,
	u32 array_size;
	u8 *ptr;
 
+	lock_chunks(root);
	array_size = btrfs_super_sys_array_size(super_copy);
	if (array_size + item_size + sizeof(disk_key)
-			> BTRFS_SYSTEM_CHUNK_ARRAY_SIZE)
+			> BTRFS_SYSTEM_CHUNK_ARRAY_SIZE) {
+		unlock_chunks(root);
		return -EFBIG;
+	}
 
	ptr = super_copy->sys_chunk_array + array_size;
	btrfs_cpu_key_to_disk(&disk_key, key);
@@ -4066,6 +4069,8 @@ static int btrfs_add_system_chunk(struct btrfs_root *root,
	memcpy(ptr, chunk, item_size);
	item_size += sizeof(disk_key);
	btrfs_set_super_sys_array_size(super_copy, array_size + item_size);
+	unlock_chunks(root);
+
	return 0;
 }
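The pattern the patch installs (lock around the size read, unlock on the overflow error path, bump the size only after the copy) can be sketched in userspace with a pthread mutex. All names and sizes here are illustrative, not btrfs's; `-1` stands in for the kernel's `-EFBIG`.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Userspace sketch of a lock-protected append into a fixed-size array,
 * mirroring the shape of the patched btrfs_add_system_chunk. */
#define SYS_ARRAY_SIZE 64

static pthread_mutex_t chunk_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned char sys_array[SYS_ARRAY_SIZE];
static size_t sys_array_used;

static int add_system_chunk(const void *item, size_t item_size)
{
	pthread_mutex_lock(&chunk_mutex);
	if (sys_array_used + item_size > SYS_ARRAY_SIZE) {
		pthread_mutex_unlock(&chunk_mutex);  /* error path unlocks too */
		return -1;                           /* -EFBIG in the kernel */
	}
	memcpy(sys_array + sys_array_used, item, item_size);
	sys_array_used += item_size;	/* publish the new size last */
	pthread_mutex_unlock(&chunk_mutex);
	return 0;
}
```

Without the lock, two concurrent appends could both read the same `sys_array_used`, pass the capacity check, and overwrite each other, which is exactly the corruption the changelog describes.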
 
-- 
1.9.3



[PATCH RFC 0/5] Scan all devices to build fs device list

2014-09-03 Thread Miao Xie
This patchset implements automatic device-list building. As we know,
currently we need to scan the devices with a user-space tool to build the
device list before mounting the filesystem, especially after we re-install
the btrfs module. It is not convenient. This patchset improves on that:
with it, we scan all the devices in the system to build the device list if
we find that the number of devices is not right when we mount the
filesystem. This way, we needn't scan the devices with the user-space tool,
and we reduce the probability of mount failure due to an incomplete
device list.

---
Miao Xie (5):
  block: export disk_class and disk_type for btrfs
  Btrfs: don't return btrfs_fs_devices if the caller doesn't want it
  Btrfs: restructure btrfs_scan_one_device
  Btrfs: restructure btrfs_get_bdev_and_sb and pick up some code used
later
  Btrfs: scan all the devices and build the fs device list by btrfs's
self

 block/genhd.c |   7 +-
 fs/btrfs/super.c  |   3 +
 fs/btrfs/volumes.c| 227 --
 fs/btrfs/volumes.h|   5 +-
 include/linux/genhd.h |   1 +
 5 files changed, 177 insertions(+), 66 deletions(-)

-- 
1.9.3


