Re: how long should btrfs device delete missing ... take?

2014-09-12 Thread Duncan
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

 Sure. But what's the next step? Given 260+ snapshots might mean well
 more than 350GB of data, depending on how deduplicated the fs is, it
 still probably would be faster to rsync this to a pile of drives in
 linear/concat+XFS than wait a month (?) for device delete to finish.

That was what I was getting at in my other just-finished short reply.  It 
may be time to give up on the btrfs-specific solutions for the moment and 
go with tried and tested traditional ones (tho I'd definitely *NOT* try 
rsync or the like with the delete still going; we know from other reports 
that rsync places its own stresses on btrfs, and one major stressor at a 
time, the delete-triggered rebalance, is bad enough).
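
For concreteness, the pile-of-drives approach might look something like 
the following, once the delete has been stopped (device names and mount 
points are hypothetical; adjust to the hardware at hand):

# mdadm --create /dev/md0 --level=linear --raid-devices=3 \
        /dev/sdx /dev/sdy /dev/sdz
# mkfs.xfs /dev/md0
# mount /dev/md0 /mnt/rescue
# rsync -aHAX /mnt/btrfs/ /mnt/rescue/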

 Alternatively, script some way to create 260+ ro snapshots to btrfs
 send/receive to a new btrfs volume; and turn it into a raid1 later.

No confirmation yet, but I strongly suspect most of those subvolumes are 
snapshots.  Assuming that's the case, it's very likely most of them can 
simply be eliminated as I originally suggested, a process that /should/ 
be fast, simplifying the situation dramatically.

 I'm curious if a sysrq+s followed by sysrq+u might leave the filesystem
 in a state where it could still be rw mountable. But I'm skeptical of
 anything interrupting the device delete before being fully prepared for
 the fs to be toast for rw mount. If only ro mount is possible, any
 chance of creating ro snapshots is out.

In theory, that is, barring bugs, interrupting the delete with as normal 
a shutdown as possible, then sysrq+s, sysrq+u, should not be a problem.  
The delete is basically a balance, going chunk by chunk, and either a 
chunk has been duplicated to the new device or it hasn't.  In either 
case, the existing chunk on the remaining old device shouldn't be 
affected.

So rebooting in that way in order to stop the delete temporarily /should/ 
have no bad effects.  Of course, that's barring bugs.  Btrfs is 
still not fully stabilized, and bugs do happen, so anything's possible.  
But I'd consider it safe enough to try here, certainly so if I had 
backups, as is still STRONGLY recommended for btrfs at this point, much 
more so than the routine sysadmin rule that if it's not backed up, by 
definition it's not valuable to you.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: how long should btrfs device delete missing ... take?

2014-09-12 Thread Chris Murphy

On Sep 11, 2014, at 11:19 PM, Russell Coker <russ...@coker.com.au> wrote:

 It would be nice if a file system mounted ro counted as ro snapshots for 
 btrfs send.
 
 When a file system is so messed up it can't be mounted rw it should be 
 regarded as ro for all operations.

Yes, it's come up before, and there's a question whether mount -o ro is reliably 
ro enough for this. Maybe a force option?

But then another piece is a recursive btrfs send to go along with the above. I 
might want to send them all, or all of the ones in two particular 
subvolumes, etc. It could even combine the recursive ro snapshot and recursive 
send as a btrfs rescue option that would work even if the volume is mounted 
read-only.
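
Until something like that exists, here's a rough sketch of the scripted 
version (paths hypothetical; assumes the volume is still mountable rw with 
the top-level subvolume at /mnt, the new volume at /mnt2, and subvolume 
paths without spaces):

mkdir -p /mnt/ro-snaps
for sub in $(btrfs subvolume list -o /mnt | awk '{print $NF}'); do
    btrfs subvolume snapshot -r "/mnt/$sub" "/mnt/ro-snaps/$sub"
    btrfs send "/mnt/ro-snaps/$sub" | btrfs receive /mnt2
done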


Chris Murphy


Re: [PATCH v2] btrfs-progs: deal with conflict options for btrfs fi show

2014-09-12 Thread Gui Hecheng
On Fri, 2014-09-12 at 14:56 +0900, Satoru Takeuchi wrote:
 Hi Gui,
 
 (2014/09/12 10:15), Gui Hecheng wrote:
  For btrfs fi show, -d|--all-devices and -m|--mounted will
  overwrite each other, so if both are specified, let the user
  know that he should not use them at the same time.
  
  Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
  ---
  changelog:
  v1-v2: add option conflict descriptions to manpage and usage.
  ---
Documentation/btrfs-filesystem.txt |  9 ++---
cmds-filesystem.c  | 12 ++--
2 files changed, 16 insertions(+), 5 deletions(-)
  
   diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
  index c9c0b00..d3d2dcc 100644
  --- a/Documentation/btrfs-filesystem.txt
  +++ b/Documentation/btrfs-filesystem.txt
  @@ -20,15 +20,18 @@ SUBCOMMAND
*df* path [path...]::
Show space usage information for a mount point.

  -*show* [--mounted|--all-devices|path|uuid|device|label]::
  +*show* [-m|--mounted|-d|--all-devices|path|uuid|device|label]::
 
 This line seems to be too long. Please see also the
 following thread.
 
 https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36270.html
 

Hi Satoru,

Ah, there is an earlier patch changing the same document that is not
merged yet. So I will rebase my additional wording about the
option conflict after the former patch is merged.
--
Hi David,
Sorry to bother you, but would you please take a glance at the v1 patch and
ignore this v2 for now? I will add the option-conflict material in
another patch after the former patch below is merged. Is that OK?

https://patchwork.kernel.org/patch/4711831/

-Gui

 Thanks,
 Satoru
 
 
Show the btrfs filesystem with some additional info.
+
If no option nor path|uuid|device|label is passed, btrfs shows
information of all the btrfs filesystem both mounted and unmounted.
  -If '--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
   +If '-m|--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
filesystem(s);
  -If '--all-devices' is passed, all the devices under /dev are scanned;
  +If '-d|--all-devices' is passed, all the devices under /dev are scanned;
otherwise the devices list is extracted from the /proc/partitions file.
  +Don't combine -m|--mounted and -d|--all-devices, because these two options
  +will overwrite each other, and only one scan way will be adopted,
  +probe the kernel to scan or scan devices under /dev.

*sync* path::
Force a sync for the filesystem identified by path.
  diff --git a/cmds-filesystem.c b/cmds-filesystem.c
  index 69c1ca5..51c4c55 100644
  --- a/cmds-filesystem.c
  +++ b/cmds-filesystem.c
   @@ -495,6 +495,7 @@ static const char * const cmd_show_usage[] = {
    	"-d|--all-devices   show only disks under /dev containing btrfs filesystem",
    	"-m|--mounted   show only mounted btrfs",
    	"If no argument is given, structure of all present filesystems is shown.",
   +	"Don't combine -d|--all-devices and -m|--mounted, refer to manpage for details.",
    	NULL
    };

  @@ -526,16 +527,23 @@ static int cmd_show(int argc, char **argv)
  break;
  switch (c) {
   		case 'd':
   -			where = BTRFS_SCAN_PROC;
   +			where &= ~BTRFS_SCAN_LBLKID;
   +			where |= BTRFS_SCAN_PROC;
   			break;
   		case 'm':
   -			where = BTRFS_SCAN_MOUNTED;
   +			where &= ~BTRFS_SCAN_LBLKID;
   +			where |= BTRFS_SCAN_MOUNTED;
   			break;
   		default:
   			usage(cmd_show_usage);
   		}
   	}
   
   +	if ((where & BTRFS_SCAN_PROC) && (where & BTRFS_SCAN_MOUNTED)) {
   +		fprintf(stderr,
   +			"Don't use -d|--all-devices and -m|--mounted options at the same time.\n");
   +		usage(cmd_show_usage);
   +	}
   +
  if (check_argc_max(argc, optind + 1))
  usage(cmd_show_usage);

  
 

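
For reference, the two scan modes the conflicting options select, as an 
invocation sketch (output omitted):

# btrfs fi show -d     (scan the devices under /dev)
# btrfs fi show -m     (ask the kernel for mounted btrfs only)

With the check above, passing both at once now errors out instead of 
silently letting whichever option came last win.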



RAID1 failure and recovery

2014-09-12 Thread shane-kernel
Hi,

I am testing BTRFS in a simple RAID1 environment: default mount options, with 
data and metadata mirrored between sda2 and sdb2. I have a few questions 
and a potential bug report. I don't normally have console access to the server, 
so when the server boots with 1 of 2 disks, the mount will fail without -o 
degraded. Can I use -o degraded by default to force mounting with any number of 
disks? This is the default behaviour for linux-raid, so I was rather surprised 
when the server didn't boot after a simulated disk failure.
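
For reference, the fstab form of that would just add degraded to the 
mount options, something like the line below with a real UUID filled in 
(though see the replies for why it isn't generally recommended as a 
default):

UUID=<fs-uuid>  /mnt/data  btrfs  defaults,degraded  0  0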

So I pulled sdb to simulate a disk failure. The kernel oops'd but did continue 
running. I then rebooted, encountering the above mount problem. I re-inserted 
the disk and rebooted again, and BTRFS mounted successfully. However, I am now 
getting warnings like:
BTRFS: read error corrected: ino 1615 off 86016 (dev /dev/sda2 sector 4580382824)
I take it there were writes to sda and sdb is out of sync. Btrfs is correcting 
sdb as it goes, but I won't have redundancy until sdb resyncs completely. Is 
there a way to tell btrfs that I just re-added a failed disk and to go through 
and resync the array as mdraid would do? I know I can do a btrfs fi resync 
manually, but can that be automated if the array goes out of sync for whatever 
reason (power failure)...

Finally for those using this sort of setup in production, is running btrfs on 
top of mdraid the way to go at this point?

Cheers,
Shane




[PATCH v4 00/11] Implement the data repair function for direct read

2014-09-12 Thread Miao Xie
This patchset implements the data repair function for direct read; it
is implemented like buffered read:
1. When we find the data is not right, we try to read the data from the other
   mirror.
2. When the io on the mirror ends, we insert the endio work into a
   dedicated btrfs workqueue, not the common read endio workqueue, because the
   original endio work is still blocked in the btrfs endio workqueue; if we
   inserted the endio work of the io on the mirror into that workqueue, deadlock
   would happen.
3. If we get the right data, we write it back to repair the corrupted mirror.
4. If the data on the new mirror is still corrupted, we try the next
   mirror until we read the right data or all the mirrors are traversed.
5. After the above work, we set the uptodate flag according to the result.

The difference is that a direct read may be split into several small ios;
in order to get the number of the mirror on which an io error happens, we
have to do the data check and repair in the end IO function of those sub-IO
requests.
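
As a rough user-space model of the retry policy in steps 1, 3, 4 and 5
above (the names here are illustrative stand-ins only, not the kernel
code; the real implementation works on bios and the dedicated endio
workqueue):

#include <stdio.h>
#include <string.h>

#define NUM_MIRRORS 2
#define BLKSIZE 8

/* toy "disk": mirror 0 holds a corrupted copy of the block */
static char mirrors[NUM_MIRRORS][BLKSIZE] = { "XXXXXXX", "gooddat" };

/* stand-in for reading one mirror and verifying its checksum */
static int read_mirror(int m, char *buf)
{
	memcpy(buf, mirrors[m], BLKSIZE);
	return strcmp(buf, "gooddat") == 0;	/* 1 = csum ok */
}

/*
 * Try the other mirrors in turn (steps 1 and 4); on success, write the
 * good data back over the bad copy (step 3).  The return value decides
 * the uptodate flag (step 5).
 */
static int repair_read(char *buf, int failed_mirror)
{
	for (int m = 0; m < NUM_MIRRORS; m++) {
		if (m == failed_mirror)
			continue;
		if (!read_mirror(m, buf))
			continue;	/* this copy is bad too, try the next */
		memcpy(mirrors[failed_mirror], buf, BLKSIZE);	/* repair */
		return 0;
	}
	return -1;	/* every mirror was bad: -EIO */
}

int main(void)
{
	char buf[BLKSIZE];

	if (!read_mirror(0, buf) && repair_read(buf, 0) == 0)
		printf("repaired, data now: %s\n", buf);
	return 0;
}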

Besides that, we also fixed some bugs of direct io.

Changelog v3 - v4:
- Remove the 1st patch, which has been applied to the upstream kernel.
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this suggestion was from Chris.
- Rebase the patchset onto the integration branch of Chris's git tree.

Changelog v2 - v3:
- Fix the wrong bio being returned when doing a bio clone, reported by Filipe.

Changelog v1 - v2:
- Fix the warning which was triggered by __GFP_ZERO in the 2nd patch.

Miao Xie (11):
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data data check and dio
read data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submitting re-read bio fails
  Btrfs: Cleanup unused variant and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode
is freeing

 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |  10 +-
 fs/btrfs/ctree.h|   4 +-
 fs/btrfs/disk-io.c  |  11 +-
 fs/btrfs/disk-io.h  |   1 +
 fs/btrfs/extent_io.c| 254 +--
 fs/btrfs/extent_io.h|  38 -
 fs/btrfs/file-item.c|  14 +-
 fs/btrfs/inode.c| 446 +++-
 fs/btrfs/scrub.c|   4 +-
 fs/btrfs/volumes.c  |   5 +
 fs/btrfs/volumes.h  |   5 +-
 13 files changed, 601 insertions(+), 193 deletions(-)

-- 
1.9.3



[PATCH v4 03/11] Btrfs: do file data check by sub-bio's self

2014-09-12 Thread Miao Xie
Direct IO splits the original bio into several sub-bios because of the limit of
the raid stripe, and the filesystem waits for all the sub-bios and then runs
the final end io process.

But it was very hard to implement data repair when a dio read failure happened,
because at the final end io function we didn't know which mirror the data was
read from. So in order to implement the data repair, we have to move the file
data check from the final end io function into the sub-bio end io function, in
which we can get the mirror number of the device we access. This patch does
that work as the first step of the direct io data repair implementation.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/btrfs_inode.h |   9 +
 fs/btrfs/extent_io.c   |   2 +-
 fs/btrfs/inode.c   | 100 -
 fs/btrfs/volumes.h |   5 ++-
 4 files changed, 87 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 8bea70e..4d30947 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation)
return 0;
 }
 
+#define BTRFS_DIO_ORIG_BIO_SUBMITTED   0x1
+
 struct btrfs_dio_private {
struct inode *inode;
+   unsigned long flags;
u64 logical_offset;
u64 disk_bytenr;
u64 bytes;
@@ -263,6 +266,12 @@ struct btrfs_dio_private {
 
/* dio_bio came from fs/direct-io.c */
struct bio *dio_bio;
+
+   /*
+	 * The original bio may be split into several sub-bios; this is
+	 * done during endio of the sub-bios
+*/
+   int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
 };
 
 /*
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index dfe1afe..92a6d9f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2472,7 +2472,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err)
 		struct inode *inode = page->mapping->host;
 
 		pr_debug("end_bio_extent_readpage: bi_sector=%llu, err=%d, "
-			 "mirror=%lu\n", (u64)bio->bi_iter.bi_sector, err,
+			 "mirror=%u\n", (u64)bio->bi_iter.bi_sector, err,
 			 io_bio->mirror_num);
 		tree = BTRFS_I(inode)->io_tree;
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e8139c6..cf79f79 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7198,29 +7198,40 @@ unlock_err:
return ret;
 }
 
-static void btrfs_endio_direct_read(struct bio *bio, int err)
+static int btrfs_subio_endio_read(struct inode *inode,
+				  struct btrfs_io_bio *io_bio)
 {
-	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
-	struct inode *inode = dip->inode;
-	struct bio *dio_bio;
-	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
 	u64 start;
-	int ret;
 	int i;
+	int ret;
+	int err = 0;
 
-	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto skip_checksum;
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
+		return 0;
 
-	start = dip->logical_offset;
-	bio_for_each_segment_all(bvec, bio, i) {
+	start = io_bio->logical;
+	bio_for_each_segment_all(bvec, &io_bio->bio, i) {
 		ret = __readpage_endio_check(inode, io_bio, i, bvec->bv_page,
 					     0, start, bvec->bv_len);
 		if (ret)
 			err = -EIO;
 		start += bvec->bv_len;
 	}
-skip_checksum:
+
+	return err;
+}
+
+static void btrfs_endio_direct_read(struct bio *bio, int err)
+{
+	struct btrfs_dio_private *dip = bio->bi_private;
+	struct inode *inode = dip->inode;
+	struct bio *dio_bio;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+
+	if (!err && (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED))
+		err = btrfs_subio_endio_read(inode, io_bio);
+
 	unlock_extent(&BTRFS_I(inode)->io_tree, dip->logical_offset,
 		      dip->logical_offset + dip->bytes - 1);
 	dio_bio = dip->dio_bio;
@@ -7298,6 +7309,7 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw,
 static void btrfs_end_dio_bio(struct bio *bio, int err)
 {
 	struct btrfs_dio_private *dip = bio->bi_private;
+	int ret;
 
 	if (err) {
 		btrfs_err(BTRFS_I(dip->inode)->root->fs_info,
@@ -7305,6 +7317,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err)
 			  btrfs_ino(dip->inode), bio->bi_rw,
 			  (unsigned long long)bio->bi_iter.bi_sector,
 			  bio->bi_iter.bi_size, err);
+	} else if (dip->subio_endio) {
+		ret = dip->subio_endio(dip->inode, btrfs_io_bio(bio));
+		if (ret)
+			err = ret;
+	}
+
+	if (err) {
[PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io

2014-09-12 Thread Miao Xie
The original code of repair_io_failure was used only for buffered read,
because it got some filesystem data from the page structure, which is safe
for pages in the page cache. But when we do a direct read, the pages in the
bio are not in the page cache, so there is no filesystem data in the page
structure. In order to implement direct read data repair, we need to modify
repair_io_failure and pass all the filesystem data it needs via function
parameters.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 8 +---
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/scrub.c | 1 +
 3 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index cf1de40..9fbc005 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1997,7 +1997,7 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
  */
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 			u64 length, u64 logical, struct page *page,
-			int mirror_num)
+			unsigned int pg_offset, int mirror_num)
 {
 	struct bio *bio;
 	struct btrfs_device *dev;
@@ -2036,7 +2036,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
 		return -EIO;
 	}
 	bio->bi_bdev = dev->bdev;
-	bio_add_page(bio, page, length, start - page_offset(page));
+	bio_add_page(bio, page, length, pg_offset);
 
 	if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) {
 		/* try to remap that extent elsewhere? */
@@ -2067,7 +2067,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 	for (i = 0; i < num_pages; i++) {
 		struct page *p = extent_buffer_page(eb, i);
 		ret = repair_io_failure(root->fs_info, start, PAGE_CACHE_SIZE,
-					start, p, mirror_num);
+					start, p, start - page_offset(p),
+					mirror_num);
 		if (ret)
 			break;
 		start += PAGE_CACHE_SIZE;
@@ -2127,6 +2128,7 @@ static int clean_io_failure(u64 start, struct page *page)
 	if (num_copies > 1)  {
 		repair_io_failure(fs_info, start, failrec->len,
 				  failrec->logical, page,
+				  start - page_offset(page),
 				  failrec->failed_mirror);
 	}
 }
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 75b621b..a82ecbc 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -340,7 +340,7 @@ struct btrfs_fs_info;
 
 int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start,
u64 length, u64 logical, struct page *page,
-   int mirror_num);
+   unsigned int pg_offset, int mirror_num);
 int end_extent_writepage(struct page *page, int err, u64 start, u64 end);
 int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb,
 int mirror_num);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index cce122b..3978529 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -682,6 +682,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx)
 	fs_info = BTRFS_I(inode)->root->fs_info;
 	ret = repair_io_failure(fs_info, offset, PAGE_SIZE,
 				fixup->logical, page,
+				offset - page_offset(page),
 				fixup->mirror_num);
unlock_page(page);
corrected = !ret;
-- 
1.9.3



[PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers

2014-09-12 Thread Miao Xie
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f8dda46..154cb8e 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1981,8 +1981,7 @@ struct io_failure_record {
int in_validation;
 };
 
-static int free_io_failure(struct inode *inode, struct io_failure_record *rec,
-   int did_repair)
+static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
int ret;
int err = 0;
@@ -2109,7 +2108,6 @@ static int clean_io_failure(u64 start, struct page *page)
 	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
struct extent_state *state;
int num_copies;
-   int did_repair = 0;
int ret;
 
private = 0;
@@ -2130,7 +2128,6 @@ static int clean_io_failure(u64 start, struct page *page)
/* there was no real error, just free the record */
 		pr_debug("clean_io_failure: freeing dummy error at %llu\n",
 			 failrec->start);
-   did_repair = 1;
goto out;
}
 	if (fs_info->sb->s_flags & MS_RDONLY)
@@ -2147,19 +2144,16 @@ static int clean_io_failure(u64 start, struct page *page)
 	num_copies = btrfs_num_copies(fs_info, failrec->logical,
 				      failrec->len);
 	if (num_copies > 1)  {
-		ret = repair_io_failure(fs_info, start, failrec->len,
-					failrec->logical, page,
-					failrec->failed_mirror);
-		did_repair = !ret;
+		repair_io_failure(fs_info, start, failrec->len,
+				  failrec->logical, page,
+				  failrec->failed_mirror);
 	}
-   ret = 0;
}
 
 out:
-   if (!ret)
-   ret = free_io_failure(inode, failrec, did_repair);
+   free_io_failure(inode, failrec);
 
-   return ret;
+   return 0;
 }
 
 /*
@@ -2269,7 +2263,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		 */
 		pr_debug("bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
 
@@ -2312,13 +2306,13 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	if (failrec->this_mirror > num_copies) {
 		pr_debug("bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n",
 			 num_copies, failrec->this_mirror, failed_mirror);
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
 
bio = btrfs_io_bio_alloc(GFP_NOFS, 1);
if (!bio) {
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
return -EIO;
}
 	bio->bi_end_io = failed_bio->bi_end_io;
@@ -2349,7 +2343,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
if (ret) {
-   free_io_failure(inode, failrec, 0);
+   free_io_failure(inode, failrec);
bio_put(bio);
}
 
-- 
1.9.3



[PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data data check and dio read data check

2014-09-12 Thread Miao Xie
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/inode.c | 102 +--
 1 file changed, 47 insertions(+), 55 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index af304e1..e8139c6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2893,6 +2893,40 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end,
 	return 0;
 }
 
+static int __readpage_endio_check(struct inode *inode,
+				  struct btrfs_io_bio *io_bio,
+				  int icsum, struct page *page,
+				  int pgoff, u64 start, size_t len)
+{
+	char *kaddr;
+	u32 csum_expected;
+	u32 csum = ~(u32)0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
+
+	csum_expected = *(((u32 *)io_bio->csum) + icsum);
+
+	kaddr = kmap_atomic(page);
+	csum = btrfs_csum_data(kaddr + pgoff, csum, len);
+	btrfs_csum_final(csum, (char *)&csum);
+	if (csum != csum_expected)
+		goto zeroit;
+
+	kunmap_atomic(kaddr);
+	return 0;
+zeroit:
+	if (__ratelimit(&_rs))
+		btrfs_info(BTRFS_I(inode)->root->fs_info,
+			   "csum failed ino %llu off %llu csum %u expected csum %u",
+			   btrfs_ino(inode), start, csum, csum_expected);
+	memset(kaddr + pgoff, 1, len);
+	flush_dcache_page(page);
+	kunmap_atomic(kaddr);
+	if (csum_expected == 0)
+		return 0;
+	return -EIO;
+}
+
 /*
  * when reads are done, we need to check csums to verify the data is correct
  * if there's a match, we allow the bio to finish.  If not, the code in
@@ -2905,20 +2939,15 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	size_t offset = start - page_offset(page);
 	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
-	char *kaddr;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	u32 csum_expected;
-	u32 csum = ~(u32)0;
-	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
-				      DEFAULT_RATELIMIT_BURST);
 
 	if (PageChecked(page)) {
 		ClearPageChecked(page);
-		goto good;
+		return 0;
 	}
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
-		goto good;
+		return 0;
 
 	if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID &&
 	    test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) {
@@ -2928,28 +2957,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 	}
 
 	phy_offset >>= inode->i_sb->s_blocksize_bits;
-	csum_expected = *(((u32 *)io_bio->csum) + phy_offset);
-
-	kaddr = kmap_atomic(page);
-	csum = btrfs_csum_data(kaddr + offset, csum, end - start + 1);
-	btrfs_csum_final(csum, (char *)&csum);
-	if (csum != csum_expected)
-		goto zeroit;
-
-	kunmap_atomic(kaddr);
-good:
-	return 0;
-
-zeroit:
-	if (__ratelimit(&_rs))
-		btrfs_info(root->fs_info, "csum failed ino %llu off %llu csum %u expected csum %u",
-			   btrfs_ino(page->mapping->host), start, csum,
-			   csum_expected);
-	memset(kaddr + offset, 1, end - start + 1);
-	flush_dcache_page(page);
-	kunmap_atomic(kaddr);
-	if (csum_expected == 0)
-		return 0;
-	return -EIO;
+	return __readpage_endio_check(inode, io_bio, phy_offset, page, offset,
+				      start, (size_t)(end - start + 1));
 }
 
 struct delayed_iput {
@@ -7194,41 +7203,24 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct btrfs_dio_private *dip = bio->bi_private;
 	struct bio_vec *bvec;
 	struct inode *inode = dip->inode;
-	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
+	int ret;
 	int i;
 
+	if (err || (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
+		goto skip_checksum;
+
 	start = dip->logical_offset;
 	bio_for_each_segment_all(bvec, bio, i) {
-		if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
-			struct page *page = bvec->bv_page;
-			char *kaddr;
-			u32 csum = ~(u32)0;
-			unsigned long flags;
-
-			local_irq_save(flags);
-			kaddr = kmap_atomic(page);
-			csum = btrfs_csum_data(kaddr + bvec->bv_offset,
-					       csum, bvec->bv_len);
-

[PATCH v4 04/11] Btrfs: fix missing error handler if submitting re-read bio fails

2014-09-12 Thread Miao Xie
We forgot to free the failure record and the bio if submitting the re-read
bio fails; fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 92a6d9f..f8dda46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
+	if (ret) {
+		free_io_failure(inode, failrec, 0);
+		bio_put(bio);
+	}
+
return ret;
 }
 
-- 
1.9.3



[PATCH v4 10/11] Btrfs: implement repair function when direct read fails

2014-09-12 Thread Miao Xie
This patch implements the data repair function for when a direct read fails.

The detail of the implementation is:
- When we find the data is not right, we try to read the data from the other
  mirror.
- When the io on the mirror ends, we insert the endio work into a
  dedicated btrfs workqueue, not the common read endio workqueue, because the
  original endio work is still blocked in the btrfs endio workqueue; if we
  inserted the endio work of the io on the mirror into that workqueue, deadlock
  would happen.
- After we get the right data, we write it back to the corrupted mirror.
- If the data on the new mirror is still corrupted, we try the next
  mirror until we read the right data or all the mirrors are traversed.
- After the above work, we set the uptodate flag according to the result.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 - v4:
- Use a dedicated btrfs workqueue instead of the system workqueue to
  deal with the completed repair bio; this suggestion was from Chris.

Changelog v1 - v3:
- None
---
 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |   2 +-
 fs/btrfs/ctree.h|   1 +
 fs/btrfs/disk-io.c  |  11 +-
 fs/btrfs/disk-io.h  |   1 +
 fs/btrfs/extent_io.c|  12 ++-
 fs/btrfs/extent_io.h|   5 +-
 fs/btrfs/inode.c| 276 
 9 files changed, 281 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c
index fbd76de..2da0a66 100644
--- a/fs/btrfs/async-thread.c
+++ b/fs/btrfs/async-thread.c
@@ -74,6 +74,7 @@ BTRFS_WORK_HELPER(endio_helper);
 BTRFS_WORK_HELPER(endio_meta_helper);
 BTRFS_WORK_HELPER(endio_meta_write_helper);
 BTRFS_WORK_HELPER(endio_raid56_helper);
+BTRFS_WORK_HELPER(endio_repair_helper);
 BTRFS_WORK_HELPER(rmw_helper);
 BTRFS_WORK_HELPER(endio_write_helper);
 BTRFS_WORK_HELPER(freespace_write_helper);
diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h
index e9e31c9..e386c29 100644
--- a/fs/btrfs/async-thread.h
+++ b/fs/btrfs/async-thread.h
@@ -53,6 +53,7 @@ BTRFS_WORK_HELPER_PROTO(endio_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_helper);
 BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper);
 BTRFS_WORK_HELPER_PROTO(endio_raid56_helper);
+BTRFS_WORK_HELPER_PROTO(endio_repair_helper);
 BTRFS_WORK_HELPER_PROTO(rmw_helper);
 BTRFS_WORK_HELPER_PROTO(endio_write_helper);
 BTRFS_WORK_HELPER_PROTO(freespace_write_helper);
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 4d30947..7a7521c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -271,7 +271,7 @@ struct btrfs_dio_private {
 	 * The original bio may be split into several sub-bios; this is
 	 * done during endio of the sub-bios
 */
-   int (*subio_endio)(struct inode *, struct btrfs_io_bio *);
+   int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int);
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7b54cd9..63acfd8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1538,6 +1538,7 @@ struct btrfs_fs_info {
struct btrfs_workqueue *endio_workers;
struct btrfs_workqueue *endio_meta_workers;
struct btrfs_workqueue *endio_raid56_workers;
+   struct btrfs_workqueue *endio_repair_workers;
struct btrfs_workqueue *rmw_workers;
struct btrfs_workqueue *endio_meta_write_workers;
struct btrfs_workqueue *endio_write_workers;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ff3ee22..1594d91 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -713,7 +713,11 @@ static void end_workqueue_bio(struct bio *bio, int err)
func = btrfs_endio_write_helper;
}
} else {
-		if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
+		if (unlikely(end_io_wq->metadata ==
+			     BTRFS_WQ_ENDIO_DIO_REPAIR)) {
+			wq = fs_info->endio_repair_workers;
+			func = btrfs_endio_repair_helper;
+		} else if (end_io_wq->metadata == BTRFS_WQ_ENDIO_RAID56) {
 			wq = fs_info->endio_raid56_workers;
 			func = btrfs_endio_raid56_helper;
 		} else if (end_io_wq->metadata) {
@@ -741,6 +745,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio,
int metadata)
 {
struct end_io_wq *end_io_wq;
+
end_io_wq = kmalloc(sizeof(*end_io_wq), GFP_NOFS);
if (!end_io_wq)
return -ENOMEM;
@@ -2059,6 +2064,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->endio_workers);
 	btrfs_destroy_workqueue(fs_info->endio_meta_workers);
 	btrfs_destroy_workqueue(fs_info->endio_raid56_workers);
+	btrfs_destroy_workqueue(fs_info->endio_repair_workers);
 	btrfs_destroy_workqueue(fs_info->rmw_workers);
[PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io

2014-09-12 Thread Miao Xie
The current code would load the checksum data several times when we split
a whole direct read io because of the limit of the raid stripe; it made us
search the csum tree several times. In fact, it just wasted time and made
contention on the csum tree root more serious. This patch improves on that
by loading the data all at once.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 - v4:
- None

Changelog v2 - v3:
- Fix the wrong return value of btrfs_bio_clone

Changelog v1 - v2:
- Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger
  a WARNing; reported by Filipe David Manana, thanks.
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h   |  3 +--
 fs/btrfs/extent_io.c   | 13 +++--
 fs/btrfs/file-item.c   | 14 ++
 fs/btrfs/inode.c   | 38 +-
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fd87941..8bea70e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {
 
/* dio_bio came from fs/direct-io.c */
struct bio *dio_bio;
-   u8 csum[0];
 };
 
 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ded7781..7b54cd9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3719,8 +3719,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
- struct btrfs_dio_private *dip, struct bio *bio,
- u64 logical_offset);
+ struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 objectid, u64 pos,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86b39de..dfe1afe 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2621,9 +2621,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,
 
 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-   return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
-}
+   struct btrfs_io_bio *btrfs_bio;
+   struct bio *new;
 
+   new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+   if (new) {
+   btrfs_bio = btrfs_io_bio(new);
+		btrfs_bio->csum = NULL;
+		btrfs_bio->csum_allocated = NULL;
+		btrfs_bio->end_io = NULL;
+   }
+   return new;
+}
 
 /* this also allocates from the btrfs_bioset */
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6e6262e..783a943 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 }
 
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 offset)
+			      struct bio *bio, u64 offset)
 {
-	int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-	int ret;
-
-	len >>= inode->i_sb->s_blocksize_bits;
-	len *= csum_size;
-
-	ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
-				      (u32 *)(dip->csum + len), 1);
-	return ret;
+	return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }
 
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2118ea6..af304e1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7196,7 +7196,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct inode *inode = dip->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
-	u32 *csums = (u32 *)dip->csum;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
 	int i;
 
@@ -7238,6 +7239,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	if (err)
 		clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
 	dio_end_io(dio_bio, err);
+
+	if (io_bio->end_io)
+		io_bio->end_io(io_bio, err);
 	bio_put(bio);
 }
 
@@ -7377,13 +7381,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
 		if (ret)
 			goto err;
-	} else if (!skip_sum) {
-		ret = 

[PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing

2014-09-12 Thread Miao Xie
After the data is written successfully, we should clean up the read failure
record in that range, because:
- If we set data COW for the file, the range that the failure record points
  to is mapped to a new place, so it is invalid.
- If we set no data COW for the file, and if there is no error during writing,
  the corrupted data is corrected, so the failure record can be removed. And if
  some errors happen on the mirrors, we also needn't worry about it because the
  failure record will be recreated if we read the same place again.

Sometimes we may fail to correct the data, so the failure records will be left
in the tree; we need to free them when we free the inode, or a memory leak
happens.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 34 ++
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c |  6 ++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86dc352..5427fd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2138,6 +2138,40 @@ out:
return 0;
 }
 
+/*
+ * Can be called when
+ * - hold extent lock
+ * - under ordered extent
+ * - the inode is freeing
+ */
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end)
+{
+	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	struct io_failure_record *failrec;
+	struct extent_state *state, *next;
+
+	if (RB_EMPTY_ROOT(&failure_tree->state))
+		return;
+
+	spin_lock(&failure_tree->lock);
+	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+	while (state) {
+		if (state->start > end)
+			break;
+
+		ASSERT(state->end <= end);
+
+		next = next_state(state);
+
+		failrec = (struct io_failure_record *)state->private;
+		free_extent_state(state);
+		kfree(failrec);
+
+		state = next;
+	}
+	spin_unlock(&failure_tree->lock);
+}
+
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
struct io_failure_record **failrec_ret)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 176a4b1..5e91fb9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -366,6 +366,7 @@ struct io_failure_record {
int in_validation;
 };
 
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
struct io_failure_record **failrec_ret);
 int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bc8cdaf..c591af5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2697,6 +2697,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}
 
+	btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
+				     ordered_extent->file_offset +
+				     ordered_extent->len - 1);
+
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -4792,6 +4796,8 @@ void btrfs_evict_inode(struct inode *inode)
 	/* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);
 
+	btrfs_free_io_failure_record(inode, 0, (u64)-1);
+
 	if (root->fs_info->log_root_recovering) {
 		BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 				&BTRFS_I(inode)->runtime_flags));
-- 
1.9.3



[PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions

2014-09-12 Thread Miao Xie
The data repair function for direct read will be implemented later, and some
code in bio_readpage_error will be reused, so split bio_readpage_error into
several functions which will be used in the direct read repair later.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/extent_io.c | 159 ++-
 fs/btrfs/extent_io.h |  28 +
 2 files changed, 123 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 154cb8e..cf1de40 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1962,25 +1962,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
SetPageUptodate(page);
 }
 
-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data.  This
- * io_failure_record is used to record state as we go through all the
- * mirrors.  If another mirror has good data, the page is set up to date
- * and things continue.  If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-   struct page *page;
-   u64 start;
-   u64 len;
-   u64 logical;
-   unsigned long bio_flags;
-   int this_mirror;
-   int failed_mirror;
-   int in_validation;
-};
-
 static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
int ret;
@@ -2156,40 +2137,24 @@ out:
return 0;
 }
 
-/*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this as
- * needed
- */
-
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
- struct page *page, u64 start, u64 end,
- int failed_mirror)
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+   struct io_failure_record **failrec_ret)
 {
-   struct io_failure_record *failrec = NULL;
+   struct io_failure_record *failrec;
u64 private;
struct extent_map *em;
-	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	struct bio *bio;
-	struct btrfs_io_bio *btrfs_failed_bio;
-	struct btrfs_io_bio *btrfs_bio;
-	int num_copies;
 	int ret;
-	int read_mode;
 	u64 logical;
 
-	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
-
 	ret = get_state_private(failure_tree, start, &private);
 	if (ret) {
 		failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
 		if (!failrec)
 			return -ENOMEM;
+
 		failrec->start = start;
 		failrec->len = end - start + 1;
 		failrec->this_mirror = 0;
@@ -2209,11 +2174,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			em = NULL;
 		}
 		read_unlock(&em_tree->lock);
-
 		if (!em) {
 			kfree(failrec);
 			return -EIO;
 		}
+
 		logical = start - em->start;
 		logical = em->block_start + logical;
 		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
@@ -2222,8 +2187,10 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			extent_set_compress_type(&failrec->bio_flags,
 						 em->compress_type);
 		}
-		pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
-			 "len=%llu\n", logical, start, failrec->len);
+
+		pr_debug("Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n",
+			 logical, start, failrec->len);
+
 		failrec->logical = logical;
 		free_extent_map(em);
 
@@ -2243,8 +2210,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		}
 	} else {
 		failrec = (struct io_failure_record *)(unsigned long)private;
-		pr_debug("bio_readpage_error: (found) logical=%llu, "
-			 "start=%llu, len=%llu, validation=%d\n",
+		pr_debug("Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n",
 			 failrec->logical, failrec->start, failrec->len,
 			 failrec->in_validation);
 		/*
@@ -2253,6 +2219,17 @@ static int bio_readpage_error(struct bio *failed_bio, u64

[PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6

2014-09-12 Thread Miao Xie
We need the real mirror number for RAID0/5/6 when reading data, or if a read
error happens, we would pass 0 as the number of the mirror on which the io
error happened. That is wrong and would cause the filesystem to read the data
from the corrupted mirror again.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 - v4:
- None
---
 fs/btrfs/volumes.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1aacf5f..4856547 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5073,6 +5073,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			num_stripes = min_t(u64, map->num_stripes,
 					    stripe_nr_end - stripe_nr_orig);
 		stripe_index = do_div(stripe_nr, map->num_stripes);
+		if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
+			mirror_num = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 		if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
 			num_stripes = map->num_stripes;
@@ -5176,6 +5178,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			/* We distribute the parity blocks across stripes */
 			tmp = stripe_nr + stripe_index;
 			stripe_index = do_div(tmp, map->num_stripes);
+			if (!(rw & (REQ_WRITE | REQ_DISCARD |
+			    REQ_GET_READ_MIRRORS)) && mirror_num <= 1)
+				mirror_num = 1;
 		}
 	} else {
 		/*
-- 
1.9.3



Re: RAID1 failure and recovery

2014-09-12 Thread Hugo Mills
On Fri, Sep 12, 2014 at 01:57:37AM -0700, shane-ker...@csy.ca wrote:
 Hi,

 I am testing BTRFS in a simple RAID1 environment. Default mount
 options and data and metadata are mirrored between sda2 and sdb2. I
 have a few questions and a potential bug report. I don't normally
 have console access to the server so when the server boots with 1 of
 2 disks, the mount will fail without -o degraded. Can I use -o
 degraded by default to force mounting with any number of disks? This
 is the default behaviour for linux-raid so I was rather surprised
 when the server didn't boot after a simulated disk failure.

   The problem with that is that, at the moment, you don't get any
notification that anything's wrong when the system boots. As a result,
using -o degraded as a default option is not generally recommended.

 So I pulled sdb to simulate a disk failure. The kernel oops'd but
 did continue running. I then rebooted encountering the above mount
 problem. I re-inserted the disk and rebooted again and BTRFS mounted
 successfully. However, I am now getting warnings like: BTRFS: read
 error corrected: ino 1615 off 86016 (dev /dev/sda2 sector
 4580382824)
 
 I take it there were writes to SDA and sdb is out of sync. Btrfs is
 correcting sdb as it goes but I won't have redundancy until sdb
 resyncs completely. Is there a way to tell btrfs that I just
 re-added a failed disk and to go through and resync the array as
 mdraid would do? I know I can do a btrfs fi resync manually but can
 that be automated if the array goes out of sync for whatever reason
 (power failure)...

   I've done this before, by accident (pulled the wrong drive,
reinserted it). You can fix it by running a scrub on the device (btrfs
scrub start /dev/ice, I think).
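
   For reference, that would be along the lines of (mount point
hypothetical):

# btrfs scrub start /mnt
# btrfs scrub status /mnt

   A scrub reads every copy of everything, verifies checksums, and
rewrites any copy that fails verification from a good one, which is
effectively the resync wanted here.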

 Finally for those using this sort of setup in production, is running
 btrfs on top of mdraid the way to go at this point?

   Using btrfs native RAID means that you get independent checksums on
the two copies, so that where the data differs between the copies, the
correct data can be identified.

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- SCSI is usually fixed by remembering that it needs three --- 
terminations: One at each end of the chain. And the goat.
 




Re: RAID1 failure and recovery

2014-09-12 Thread Duncan
shane-kernel posted on Fri, 12 Sep 2014 01:57:37 -0700 as excerpted:

[Last question first as it's easy to answer...]

 Finally for those using this sort of setup in production, is running
 btrfs on top of mdraid the way to go at this point?

While the latest kernel and btrfs-tools have removed the warnings, btrfs 
is still not yet fully stable and isn't really recommended for 
production.  Yes, certain distributions support it, but that's their 
support choice that you're buying from them, and if it all goes belly up, 
I guess you'll see what that money actually buys.  However, /here/ it's 
not really recommended yet.

That said, there are people doing it, and if you make sure you have 
suitable backups for the extent to which you're depending on the data on 
that btrfs and are willing to deal with the downtime or failover hassles 
if it happens...

Also important: keeping current, particularly with kernels, but without 
letting the btrfs-progs userspace get too outdated either, as is following 
this list to keep up with current status.  If you're running older than 
the latest kernel series without a specific reason, you're likely 
running without patches for the most recently discovered btrfs bugs.

There was a recent exception to the general latest-kernel rule, in the 
form of a bug that only affected the kworker threads that btrfs 
moved to in 3.15, so 3.14 was unaffected, while it took through 3.15 
and 3.16 to find and trace the bug.  3.17-rc3 got the fix, and I believe 
it's in the latest 3.16 stable as well.  But that's where staying current 
with the list, and actually having a reason to run an older than current 
kernel, comes in: while this was an exception to the general latest-kernel 
rule, it wasn't an exception to the way I put it above, because once the 
bug became known on the list there was a reason to run the older kernel.

If you're unwilling to do that, then choose something other than btrfs.

But anyway, here's a direct answer to the question...

While btrfs on top of mdraid (or dmraid or...) works in general, it 
doesn't match up well with btrfs' checksummed data-integrity features.

Consider: mdraid-1 writes to all devices but reads from only one, without 
any checksumming or other data integrity measures.  If the copy mdraid-1 
decides to read from is bad, then unless the hardware actually reports it 
as bad, mdraid is entirely oblivious and will carry on as if nothing 
happened.  There's no checking the other copies to see that they match, 
no checksums or other verification, nothing.

Btrfs OTOH has checksumming and data verification.  With btrfs raid1, 
that verification means that if whatever copy btrfs happens to pull fails 
the verify, it can verify and pull from the second copy, overwriting the 
bad-checksum copy with a good-checksum copy.  BUT THAT ONLY HAPPENS IF IT 
HAS THAT SECOND COPY, AND IT ONLY HAS THAT SECOND COPY IN BTRFS RAID1 (or 
raid10 or for metadata, dup) MODE.

Now consider what happens when btrfs data verification interacts with 
mdraid's lack of it.  If whatever copy mdraid pulls up is bad, it's going 
to fail the btrfs checksum and btrfs will reject it.  But because btrfs 
is on top of mdraid, and mdraid is oblivious, there's no mechanism for 
btrfs to know that mdraid has other copies that may be just fine -- to 
btrfs, that copy is bad, period.  And if btrfs doesn't have a second 
btrfs copy of its own, from btrfs raid1 or raid10 mode on top of mdraid, 
or for metadata from dup mode, then btrfs will simply return an error for 
that data, no second chance, because it knows nothing about the other 
copies mdraid has.

So while in general it works about as well as any other filesystem on top 
of mdraid, the interaction between mdraid's lack of data verification and 
btrfs' automated data verification is... unfortunate.

With that said, let's look at the rest of the post...

 I am testing BTRFS in a simple RAID1 environment. Default mount options
 and data and metadata are mirrored between sda2 and sdb2. I have a few
 questions and a potential bug report. I don't normally have console
 access to the server so when the server boots with 1 of 2 disks, the
 mount will fail without -o degraded. Can I use -o degraded by default to
 force mounting with any number of disks? This is the default behaviour
 for linux-raid so I was rather surprised when the server didn't boot
 after a simulated disk failure.

The idea here is that if a device is malfunctioning, the admin should 
have to take deliberate action demonstrating knowledge of that fact 
before the filesystem will mount.  Btrfs isn't yet as robust in degraded 
mode as, say, mdraid, and important btrfs features like data validation 
and scrub are seriously degraded when that second copy is no longer 
there.  In addition, btrfs raid1 mode requires that the two copies of a 
chunk be written to different devices, and once there's only a single 
device available, that can no longer happen, so unless 

breathe life into degraded raid10 (no space left on specific device)

2014-09-12 Thread Mate Gabri
Dear List,

I tried to remove a device from a 12-disk RAID10 array, but it failed with a 
no-space-left error and the system crashed. After a reset I could only mount 
the array in degraded mode because the device was marked as missing. I tried 
a replace command, but it said that it does not support RAID5/6 arrays, so I 
went with the device add command, which added the disk to the array; but the 
missing-device message was still there, so I issued the btrfs device delete 
missing [path] command, which ran for more than 24 hours without much 
happening. Also, if I try to do some disk io (rsync), the system crashes 
after 10-15 minutes.

If I try to mount the array as-is, I get this:

[   15.649539] btrfs: failed to read chunk tree on sdc
[   15.682209] btrfs: open_ctree failed

Mounting with degraded works. Now if I try to btrfs device delete missing I 
get a no-space-left error on /dev/sdc. I tried to do a rebalance, which worked 
for the data chunks, but with metadata it fails with the no-space-left error. 
I tried to delete some snapshots, which succeeded, but after that some metadata 
rebalance started which failed too with a no-space-left error. I issued a 
whole-array scrub, which complained that dev id 12 is missing. A scrub on 
/dev/sdc finished successfully without any error. I'm sure that /dev/sdc is 
broken metadata-wise, but I'm afraid to remove that device too. Do you have any 
suggestions what I should do?

Here's some info:

uname -a:
Linux backup 3.13-0.bpo.1-amd64 #1 SMP Debian 3.13.10-1~bpo70+1 (2014-04-23) 
x86_64 GNU/Linux

 btrfs --version
Btrfs v3.14.1

btrfs fi show (/dev/sdl was removed and re-added)
Label: 'backup'  uuid: 667ec955-bcaa-4175-827d-3a44eaa515bb
Total devices 13 FS bytes used 9.29TiB
devid1 size 2.71TiB used 1.56TiB path /dev/sda4
devid2 size 2.71TiB used 1.56TiB path /dev/sdb4
devid3 size 2.73TiB used 1.56TiB path /dev/sdc
devid4 size 2.73TiB used 1.56TiB path /dev/sdd
devid5 size 2.73TiB used 1.56TiB path /dev/sde
devid6 size 2.73TiB used 1.56TiB path /dev/sdf
devid7 size 2.73TiB used 1.56TiB path /dev/sdg
devid8 size 2.73TiB used 1.56TiB path /dev/sdh
devid9 size 2.73TiB used 1.56TiB path /dev/sdi
devid   10 size 2.73TiB used 1.56TiB path /dev/sdj
devid   11 size 2.73TiB used 1.56TiB path /dev/sdk
devid   13 size 2.73TiB used 1.75GiB path /dev/sdl
*** Some devices missing

Btrfs v3.14.1


Data, RAID10: total=8.87TiB, used=8.79TiB
System, RAID1: total=32.00MiB, used=1.11MiB
Metadata, RAID10: total=518.72GiB, used=516.31GiB

Best regards,
--
GABRI Mate
Rendszergazda (System Administrator)
mga...@plex.hu | +36 (30) 450 9750

Plex Online Kft. | http://plex.hu
H-1118 Budapest, Ugron Gábor u. 35.
T: +36 (1) 445 0167 | F: +36 (1) 248 3250
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: fs corruption report

2014-09-12 Thread Marc Dietrich
Hello Guy,

On Thursday, 4 September 2014 at 11:50:14, Marc Dietrich wrote:
 On Thursday, 4 September 2014 at 11:00:55, Gui Hecheng wrote:
  Hi Zooko, Marc,
  
  Firstly, thanks for your backtrace info, Marc.
  Sorry for the late reply; I've been offline these days.
  For the restore problem, I'm sure that the lzo decompress routine cannot
  handle some specific extent patterns.
  
  Here is my test result:
  I'm using a specific file for test
  /usr/lib/modules/$(uname -r)/kernel/net/irda/irda.ko.
  You can get it easily on your own box.
  
  # mkfs -t btrfs dev
  # mount -o compress-force=lzo dev mnt
  # cp irda.ko mnt
  # umount dev
  # btrfs restore -v dev restore_dir
  
  report:
  # bad compress length
  # failed to inflate
 
 uh, that's really odd. I don't use force compress, but I guess it will also
 happen with the non-forced option if the file is big enough.
 
  btrfs-progs version: v3.16.x
  
  With the same file under no-compress  zlib-compress,
  the restore will output a correct copy of irda.ko.
  
  I'm not sure whether the problem above has something to do with your
  problem. Hope that the messages above are helpful.
 
 I also get lots of bad compress length errors, so it might indeed be related.
 
 I'm not a programmer, but is it possible that we are just skipping the lzo
 header (magic + header len + rest of header)?

I'm able to reproduce it. Any new insights into this problem?

Marc


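For anyone wanting to try Gui's reproducer without dedicating a disk, a 
loop-device variant (a sketch; paths are arbitrary):

# truncate -s 1G /tmp/btrfs.img
# DEV=$(losetup -f --show /tmp/btrfs.img)
# mkfs.btrfs $DEV
# mount -o compress-force=lzo $DEV /mnt
# cp /usr/lib/modules/$(uname -r)/kernel/net/irda/irda.ko /mnt
# umount /mnt
# btrfs restore -v $DEV /tmp/restored     (expect "bad compress length" per the
                                           reports above)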


Re: [PATCH v4 00/11] Implement the data repair function for direct read

2014-09-12 Thread Chris Mason


On 09/12/2014 06:43 AM, Miao Xie wrote:
 This patchset implements the data repair function for direct read; it
 is implemented like buffered read:
 1.When we find the data is not right, we try to read the data from the other
   mirror.
 2.When the io on the mirror ends, we insert the endio work into the
   dedicated btrfs workqueue, not the common read endio workqueue, because the
   original endio work is still blocked in the btrfs endio workqueue; if we
   inserted the endio work of the io on the mirror into that workqueue, a
   deadlock would happen.
 3.If we get the right data, we write it back to repair the corrupted mirror.
 4.If the data on the new mirror is still corrupted, we try the next
   mirror until we read the right data or all the mirrors have been traversed.
 5.After the above work, we set the uptodate flag according to the result.
 
 The difference is that a direct read may be split into several small ios;
 in order to get the number of the mirror on which the io error happens, we
 have to do the data check and repair in the end IO function of those sub-IO
 requests.
 
 Besides that, we also fixed some bugs of direct io.
 
 Changelog v3 -> v4:
 - Remove the 1st patch, which has been applied to the upstream kernel.
 - Use a dedicated btrfs workqueue instead of the system workqueue to
   deal with the completed repair bio; this suggestion was from Chris.
 - Rebase the patchset onto the integration branch of Chris's git tree.
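
For readers of the digest, a highly simplified sketch of the mirror-retry flow 
described in steps 1-4 above. The names (io_range, read_mirror, csum_ok, 
write_back) are hypothetical; this is not the patch code:

	/*
	 * Sketch only, with hypothetical helpers: try each remaining
	 * mirror until one yields data that passes its checksum, then
	 * rewrite the known-bad copy with the good data.
	 */
	static int repair_read(struct io_range *r, int failed_mirror,
			       int num_copies)
	{
		int mirror;

		for (mirror = 1; mirror <= num_copies; mirror++) {
			if (mirror == failed_mirror)
				continue;	/* this copy is known bad */
			if (read_mirror(r, mirror) == 0 && csum_ok(r)) {
				/* good data: repair the corrupted mirror */
				write_back(r, failed_mirror);
				return 0;	/* caller sets uptodate */
			}
		}
		return -EIO;	/* every mirror was tried and failed */
	}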

Perfect, thank you.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL] Btrfs for rc5

2014-09-12 Thread Chris Mason

Hi Linus,

My for-linus branch has some fixes for the next rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Filipe is doing a careful pass through fsync problems, and these are the
fixes so far.  I'll have one more for rc6 that we're still testing.

My big commit is fixing up some inode hash races that Al Viro found
(thanks Al).
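
To test these fixes locally, the standard pull sequence (a sketch; applying on 
top of a current v3.17-rc tree is assumed):

git fetch git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus
git merge FETCH_HEAD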

Filipe Manana (3) commits (+75/-21):
Btrfs: fix corruption after write/fsync failure + fsync + log recovery (+9/-3)
Btrfs: fix fsync data loss after a ranged fsync (+64/-17)
Btrfs: fix crash while doing a ranged fsync (+2/-1)

Chris Mason (2) commits (+113/-70):
Btrfs: use insert_inode_locked4 for inode creation (+109/-67)
Btrfs: fix autodefrag with compression (+4/-3)

Dan Carpenter (1) commits (+15/-10):
Btrfs: kfree()ing ERR_PTRs

Total: (6) commits (+203/-101)

 fs/btrfs/file.c |   2 +-
 fs/btrfs/inode.c| 191 +---
 fs/btrfs/ioctl.c|  32 +
 fs/btrfs/tree-log.c |  77 -
 fs/btrfs/tree-log.h |   2 +
 5 files changed, 203 insertions(+), 101 deletions(-)
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: remove empty block groups automatically

2014-09-12 Thread Josef Bacik
One problem that has plagued us is that a user will use up all of his space with
data, remove a bunch of that data, and then try to create a bunch of small files
and run out of space.  This happens because all the chunks were allocated for
data since the metadata requirements were so low.  But now there's a bunch of
empty data block groups and not enough metadata space to do anything.  This
patch solves this problem by automatically deleting empty block groups.  If we
notice the used count go down to 0 when deleting, or notice on mount that a
block group has a used count of 0, then we queue it to be deleted.

When the cleaner thread runs we will double check to make sure the block group
is still empty and then we will delete it.  This patch has the side effect of no
longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
helpful for both this and relocate.  Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/ctree.h  |   8 ++-
 fs/btrfs/disk-io.c|   3 +
 fs/btrfs/extent-tree.c| 128 +++---
 fs/btrfs/tests/free-space-tests.c |   2 +-
 fs/btrfs/volumes.c| 115 ++
 fs/btrfs/volumes.h|   2 +
 6 files changed, 209 insertions(+), 49 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6db3d4b..d373c89 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1298,8 +1298,8 @@ struct btrfs_block_group_cache {
 */
struct list_head cluster_list;
 
-   /* For delayed block group creation */
-   struct list_head new_bg_list;
+   /* For delayed block group creation or deletion of empty block groups */
+   struct list_head bg_list;
 };
 
 /* delayed seq elem */
@@ -1716,6 +1716,9 @@ struct btrfs_fs_info {
 
/* Used to reclaim the metadata space in the background. */
struct work_struct async_reclaim_work;
+
+   spinlock_t unused_bgs_lock;
+   struct list_head unused_bgs;
 };
 
 struct btrfs_subvolume_writers {
@@ -3343,6 +3346,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 struct btrfs_root *root, u64 group_start);
+void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
   struct btrfs_root *root);
 u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a224fb9..3409734 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1764,6 +1764,7 @@ static int cleaner_kthread(void *arg)
	}
 
	btrfs_run_delayed_iputs(root);
+	btrfs_delete_unused_bgs(root->fs_info);
	again = btrfs_clean_one_deleted_snapshot(root);
	mutex_unlock(&root->fs_info->cleaner_mutex);
 
@@ -2224,6 +2225,7 @@ int open_ctree(struct super_block *sb,
	spin_lock_init(&fs_info->super_lock);
	spin_lock_init(&fs_info->qgroup_op_lock);
	spin_lock_init(&fs_info->buffer_lock);
+	spin_lock_init(&fs_info->unused_bgs_lock);
	rwlock_init(&fs_info->tree_mod_log_lock);
	mutex_init(&fs_info->reloc_mutex);
	mutex_init(&fs_info->delalloc_root_mutex);
@@ -2233,6 +2235,7 @@ int open_ctree(struct super_block *sb,
	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
	INIT_LIST_HEAD(&fs_info->space_info);
	INIT_LIST_HEAD(&fs_info->tree_mod_seq_list);
+	INIT_LIST_HEAD(&fs_info->unused_bgs);
	btrfs_mapping_init(&fs_info->mapping_tree);
	btrfs_init_block_rsv(&fs_info->global_block_rsv,
			     BTRFS_BLOCK_RSV_GLOBAL);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b30ddb4..c68cdb1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5433,6 +5433,20 @@ static int update_block_group(struct btrfs_root *root,
		spin_unlock(&cache->space_info->lock);
	} else {
		old_val -= num_bytes;
+
+		/*
+		 * No longer have used bytes in this block group, queue
+		 * it for deletion.
+		 */
+		if (old_val == 0) {
+			spin_lock(&info->unused_bgs_lock);
+			if (list_empty(&cache->bg_list)) {
+				btrfs_get_block_group(cache);
+				list_add_tail(&cache->bg_list,
+					      &info->unused_bgs);
+			}
+			spin_unlock(&info->unused_bgs_lock);
+		}
		btrfs_set_block_group_used(&cache->item, old_val);
		cache->pinned +=
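
An aside for admins hitting this today: until the automatic cleanup above 
lands, the usual manual workaround is a filtered balance, which rewrites (and 
thereby frees) empty or nearly-empty data chunks. A hedged sketch, assuming 
the filesystem is mounted at /mnt:

# btrfs balance start -dusage=0 /mnt      (drop completely empty data chunks)
# btrfs balance start -dusage=5 /mnt      (also compact chunks under 5% used)
# btrfs fi df /mnt                        (check that space was reclaimed)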

Re: [PATCH] Btrfs: remove empty block groups automatically

2014-09-12 Thread Chris Mason


On 09/12/2014 03:18 PM, Josef Bacik wrote:
 One problem that has plagued us is that a user will use up all of his space
 with data, remove a bunch of that data, and then try to create a bunch of
 small files and run out of space.  This happens because all the chunks were
 allocated for data since the metadata requirements were so low.  But now
 there's a bunch of empty data block groups and not enough metadata space to
 do anything.  This patch solves this problem by automatically deleting empty
 block groups.  If we notice the used count go down to 0 when deleting, or
 notice on mount that a block group has a used count of 0, then we queue it
 to be deleted.
 
 When the cleaner thread runs we will double check to make sure the block
 group is still empty and then we will delete it.  This patch has the side
 effect of no longer having a bunch of BUG_ON()'s in the chunk delete code,
 which will be helpful for both this and relocate.  Thanks,

Thanks Josef, we've needed this forever.  I'm planning on pulling it in
for integration as well.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs: device_list_add() should not update list when mounted breaks subvol mount

2014-09-12 Thread xavier.gn...@gmail.com

Hi,

On standard Ubuntu 14.04 with an encrypted (cryptsetup) /home as a btrfs 
subvolume we have the following results:

3.17-rc2: OK.
3.17-rc3 and 3.17-rc4: /home fails to mount on boot. If one tries mount 
-a, the system reports that the partition is already mounted according 
to mtab.



On 3.17-rc4, btrfs fi sh returns nothing special:

Label: none  uuid: f4f554bb-57d9-4647-ab14-ea978c9e7e9f
Total devices 1 FS bytes used 131.41GiB
devid1 size 173.31GiB used 134.03GiB path /dev/sda5

Btrfs v3.12


I'm not sure if it has something to do with cryptsetup...


Xavier


Hi Johannes,


I've two more systems with kernel version 3.17-rc3
running and no problem like this.


 Does this 3.17-rc3 setup also have the same type of subvol config
 and mount operation/sequence as you mentioned?

 - one hdd with btrfs
 - default subvolume (rootfs) is different from subovlid=0
 - at boot, several subvols are mounted at /home/$DIR

 I ran a few tests on our mainline:
 
 mount -o subvol=sv1 /dev/sdh1 /btrfs
 mount /dev/sdh1 /btrfs
  mount: /dev/sdh1 already mounted or /btrfs busy [*]
 mount /dev/sdh1 /btrfs1
 echo $?
 0
 -

 [*] hope this isn't the problem you are mentioning.

Thanks, Anand
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[bug] subvol doesn't belong to btrfs mount point

2014-09-12 Thread Chris Murphy
Summary: When a btrfs subvolume is mounted with -o subvol, and a nested ro 
subvol/snapshot is created, btrfs send returns with an error. If the top level 
(id 5) is mounted instead, the send command succeeds.

3.17.0-0.rc4.git0.1.fc22.i686
Btrfs v3.16

This may also be happening on x86_64, and this bug suggests the problem is 
commit de22c28ef31d9721606ba05965a093a8044be0de

https://bugzilla.kernel.org/show_bug.cgi?id=83741
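
A minimal sketch of the reported scenario (subvolume names and devices here 
are hypothetical, not taken from the strace below):

# mount -o subvol=home /dev/sda3 /home            (top level, id 5, not mounted)
# btrfs subvolume snapshot -r /home/rpms /home/rpms.ro
# btrfs send /home/rpms.ro > /dev/null            (errors out, per the report)
# mount -o subvolid=5 /dev/sda3 /mnt
# btrfs send /mnt/home/rpms.ro > /dev/null        (succeeds)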



[root@lati ~]# strace btrfs send /root/rpms.ro | btrfs receive /mnt/
execve(/usr/sbin/btrfs, [btrfs, send, /root/rpms.ro], [/* 26 vars */]) 
= 0
brk(0)  = 0x866a000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0xb7767000
access(/etc/ld.so.preload, R_OK)  = -1 ENOENT (No such file or directory)
open(/etc/ld.so.cache, O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=84972, ...}) = 0
mmap2(NULL, 84972, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7752000
close(3)= 0
open(/lib/libuuid.so.1, O_RDONLY|O_CLOEXEC) = 3
read(3, 
\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320\17\0\0004\0\0\0..., 512) 
= 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=19100, ...}) = 0
mmap2(NULL, 20692, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xb774c000
mmap2(0xb775, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0xb775
close(3)= 0
open(/lib/libblkid.so.1, O_RDONLY|O_CLOEXEC) = 3
read(3, \177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320P\0\0004\0\0\0..., 
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=271276, ...}) = 0
mmap2(NULL, 271832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xb7709000
mmap2(0xb7748000, 12288, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3e000) = 0xb7748000
mmap2(0xb774b000, 1496, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb774b000
close(3)= 0
open(/lib/libm.so.6, O_RDONLY|O_CLOEXEC) = 3
read(3, \177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0pF\0\0004\0\0\0..., 
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=357384, ...}) = 0
mmap2(NULL, 311456, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xb76bc000
mmap2(0xb7707000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4b000) = 0xb7707000
close(3)= 0
open(/lib/libz.so.1, O_RDONLY|O_CLOEXEC) = 3
read(3, 
\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240\30\0\0004\0\0\0..., 512) 
= 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=93444, ...}) = 0
mmap2(NULL, 94464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xb76a4000
mmap2(0xb76ba000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0xb76ba000
close(3)= 0
open(/lib/liblzo2.so.2, O_RDONLY|O_CLOEXEC) = 3
read(3, \177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@ \0\0004\0\0\0..., 
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=145984, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0xb76a3000
mmap2(NULL, 147584, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xb767e000
mmap2(0xb76a1000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0xb76a1000
close(3)= 0
open(/lib/libpthread.so.0, O_RDONLY|O_CLOEXEC) = 3
read(3, \177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320O\0\0004\0\0\0..., 
512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=142932, ...}) = 0
mmap2(NULL, 115404, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xb7661000
mprotect(0xb7679000, 4096, PROT_NONE)   = 0
mmap2(0xb767a000, 8192, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18000) = 0xb767a000
mmap2(0xb767c000, 4812, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb767c000
close(3)= 0
open(/lib/libc.so.6, O_RDONLY|O_CLOEXEC) = 3
read(3, 
\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0\200\1\0004\0\0\0..., 512) = 
512
fstat64(3, {st_mode=S_IFREG|0755, st_size=2166836, ...}) = 0
mmap2(NULL, 1933020, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 
0xb7489000
mmap2(0xb765a000, 20480, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d) = 0xb765a000
mmap2(0xb765f000, 7900, PROT_READ|PROT_WRITE, 
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb765f000
close(3)= 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0xb7488000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0xb7487000
set_thread_area({entry_number:-1, base_addr:0xb7487800, limit:1048575, 
seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, 
useable:1}) = 0 (entry_number:6)
mprotect(0xb765a000, 12288, PROT_READ)  = 0
mprotect(0xb767a000, 4096, PROT_READ)   = 0
mprotect(0xb76a1000, 4096, 

Re: Btrfs: device_list_add() should not update list when mounted breaks subvol mount

2014-09-12 Thread Anand Jain


Hi Xavier,

 Thanks for the report.

 I reproduced this: it's a very corner case. It depends on the
 device path given in the subsequent subvol mounts. The fix appears to be
 outside of this patch at the moment, and I am digging into whether we need
 to normalize the device path before using it in the btrfs kernel, just as
 btrfs-progs did recently.


 reproducer:
 ls -l /root/dev/sde-link
   /root/dev/sde-link -> /dev/sde

 mount -o device=/root/dev/sde-link /dev/sdd /btrfs1
 btrfs fi show
 Label: none  uuid: 943bf422-998c-4640-9d7f-d49f17b782ce
Total devices 2 FS bytes used 272.00KiB
devid1 size 1.52GiB used 339.50MiB path /dev/sdd
devid2 size 1.52GiB used 319.50MiB path /root/dev/sde-link

 mount -o subvol=sv1,device=/dev/sde /dev/sdd /btrfs   <-- shouldn't fail
  mount: /dev/sdd already mounted or /btrfs busy
  mount: according to mtab, /dev/sdd is mounted on /btrfs1

 mount -o device=/root/dev/sde-link /dev/sdd /btrfs
  echo $?
  0


Xavier, Johannes,

The quickest workaround for you is to make the device path in your
 fstab/mnttab entry match the one shown in the btrfs fi show -m /mnt
 output.
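
A hedged illustration of that check (device and path names hypothetical):

 # btrfs fi show -m /home       (note the exact device path printed per devid)
 # grep /home /etc/fstab        (the entry should use that same path, e.g.
                                 /dev/mapper/home rather than a symlink to it)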


Anand

On 09/13/2014 04:43 AM, xavier.gn...@gmail.com wrote:

Hi,

On standard Ubuntu 14.04 with an encrypted (cryptsetup) /home as a btrfs
subvolume we have the following results:
3.17-rc2: OK.
3.17-rc3 and 3.17-rc4: /home fails to mount on boot. If one tries mount
-a, the system reports that the partition is already mounted according
to mtab.





On 3.17-rc4, btrfs fi sh returns nothing special:

Label: none  uuid: f4f554bb-57d9-4647-ab14-ea978c9e7e9f
 Total devices 1 FS bytes used 131.41GiB
 devid1 size 173.31GiB used 134.03GiB path /dev/sda5

Btrfs v3.12


I'm not sure if it has something to do with cryptsetup...


Xavier


Hi Johannes,


I've two more systems with kernel version 3.17-rc3
running and no problem like this.


 Does this 3.17-rc3 setup also have the same type of subvol config
 and mount operation/sequence as you mentioned?

 - one hdd with btrfs
 - default subvolume (rootfs) is different from subovlid=0
 - at boot, several subvols are mounted at /home/$DIR

 I ran a few tests on our mainline:
 
 mount -o subvol=sv1 /dev/sdh1 /btrfs
 mount /dev/sdh1 /btrfs
  mount: /dev/sdh1 already mounted or /btrfs busy [*]
 mount /dev/sdh1 /btrfs1
 echo $?
 0
 -

 [*] hope this isn't the problem you are mentioning.

Thanks, Anand

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html