Re: how long should btrfs device delete missing ... take?
Chris Murphy posted on Thu, 11 Sep 2014 20:10:26 -0600 as excerpted:

> Sure. But what's the next step? Given that 260+ snapshots might mean
> well more than 350GB of data, depending on how deduplicated the fs is,
> it still probably would be faster to rsync this to a pile of drives in
> linear/concat+XFS than wait a month (?) for device delete to finish.

That was what I was getting at in my other just-finished short reply. It may be time to give up on the btrfs-specific solutions for the moment and go with tried and tested traditional solutions (tho I'd definitely *NOT* try rsync or the like with the delete still going; we know from other reports that rsync places its own stresses on btrfs, and one major stressor at a time, the delete-triggered rebalance, is bad enough).

Alternatively, script some way to create 260+ ro snapshots to btrfs send/receive to a new btrfs volume, and turn it into a raid1 later.

No confirmation yet, but I strongly suspect most of those subs are snapshots. Assuming that's the case, it's very likely most of them can simply be eliminated as I originally suggested, a process that /should/ be fast, decomplexifying the situation dramatically.

I'm curious whether a sysrq+s followed by sysrq+u might leave the filesystem in a state where it could still be rw mountable. But I'm skeptical of anything interrupting the device delete before being fully prepared for the fs to be toast for rw mount. If only ro mount is possible, any chance of creating ro snapshots is out.

In theory, that is, barring bugs, interrupting the delete with a normal shutdown to the extent possible, then sysrq+s, sysrq+u, should not be a problem. The delete is basically a balance, going chunk by chunk, and either a chunk has been duplicated to the new device or it hasn't. In either case, the existing chunk on the remaining old device shouldn't be affected. So rebooting in that way in order to stop the delete temporarily /should/ have no bad effects.

Of course, that's barring bugs. Btrfs is still not fully stabilized, and bugs do happen, so anything's possible. But I'd consider it safe enough to try here, certainly so if I had backups, as is still STRONGLY recommended for btrfs at this point, much more so than the routine sysadmin "if it's not backed up, by definition it's not valuable to you" rule.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
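The "either the chunk has been duplicated or it hasn't" argument above can be sketched as a toy model. This is illustrative C only, not btrfs code, and every name in it is invented for the sketch: each chunk moves atomically from the old device to the new one, so an interruption leaves every chunk in a valid state and a re-run simply finishes the remainder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a chunk-by-chunk relocation (device delete). Each chunk
 * is either still on the old device or fully copied to the new one;
 * there is no half-copied state visible after a crash. */
enum chunk_state { ON_OLD_DEV, COPIED_TO_NEW_DEV };

struct chunk { enum chunk_state state; };

/* Relocate up to 'budget' chunks, then stop (simulating an
 * interruption); returns how many chunks were copied this run. */
static size_t relocate(struct chunk *chunks, size_t n, size_t budget)
{
    size_t copied = 0;
    for (size_t i = 0; i < n && copied < budget; i++) {
        if (chunks[i].state == ON_OLD_DEV) {
            chunks[i].state = COPIED_TO_NEW_DEV; /* atomic per chunk */
            copied++;
        }
    }
    return copied;
}

/* After any interruption, a restarted relocate() just skips the
 * already-copied chunks and finishes the rest. */
static bool all_copied(const struct chunk *chunks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (chunks[i].state != COPIED_TO_NEW_DEV)
            return false;
    return true;
}
```

The model is only the in-theory argument made above; real-world bugs are exactly what it cannot capture.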
Re: how long should btrfs device delete missing ... take?
On Sep 11, 2014, at 11:19 PM, Russell Coker russ...@coker.com.au wrote:

> It would be nice if a file system mounted ro counted as ro snapshots
> for btrfs send. When a file system is so messed up it can't be mounted
> rw, it should be regarded as ro for all operations.

Yes, it's come up before, and there's a question whether mount -o ro is reliably ro enough for this. Maybe a force option?

But then another one is a recursive btrfs send to go along with the above. I might want them all, or I might want all of the ones in two particular subvolumes, etc. And even combine the recursive ro snapshot and recursive send as a btrfs rescue option that would work even if the volume is mounted read-only.

Chris Murphy
Re: [PATCH v2] btrfs-progs: deal with conflict options for btrfs fi show
On Fri, 2014-09-12 at 14:56 +0900, Satoru Takeuchi wrote:
> Hi Gui,
>
> (2014/09/12 10:15), Gui Hecheng wrote:
> > For btrfs fi show, -d|--all-devices and -m|--mounted will overwrite
> > each other, so if both are specified, let the user know that he
> > should not use them at the same time.
> >
> > Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
> > ---
> > changelog:
> > 	v1->v2: add option conflict descriptions to manpage and usage.
> > ---
> >  Documentation/btrfs-filesystem.txt |  9 ++---
> >  cmds-filesystem.c                  | 12 ++--
> >  2 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
> > index c9c0b00..d3d2dcc 100644
> > --- a/Documentation/btrfs-filesystem.txt
> > +++ b/Documentation/btrfs-filesystem.txt
> > @@ -20,15 +20,18 @@ SUBCOMMAND
> >  *df* path [path...]::
> >  Show space usage information for a mount point.
> >
> > -*show* [--mounted|--all-devices|path|uuid|device|label]::
> > +*show* [-m|--mounted|-d|--all-devices|path|uuid|device|label]::
>
> This line seems to be too long. Please see also the following thread.
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36270.html
>
> Thanks,
> Satoru

Hi Satoru,

Ah, there is a patch that changes the same document, but it is not merged yet. So I think I will rebase my additional words about the option conflict after the former patch is merged.

Hi David,

Sorry to bother you: would you please give a glance at the v1 patch and ignore this v2 for now? I will add the option conflict stuff in another patch after the former patch below is merged. Is that OK?
https://patchwork.kernel.org/patch/4711831/

-Gui

> >  Show the btrfs filesystem with some additional info.
> > +
> >  If no option nor path|uuid|device|label is passed, btrfs shows
> >  information of all the btrfs filesystems, both mounted and unmounted.
> > -If '--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
> > +If '-m|--mounted' is passed, it would probe btrfs kernel to list mounted btrfs
> >  filesystem(s);
> > -If '--all-devices' is passed, all the devices under /dev are scanned;
> > +If '-d|--all-devices' is passed, all the devices under /dev are scanned;
> >  otherwise the devices list is extracted from the /proc/partitions file.
> > +Don't combine -m|--mounted and -d|--all-devices, because these two options
> > +will overwrite each other, and only one scan way will be adopted:
> > +probe the kernel to scan, or scan devices under /dev.
> >
> >  *sync* path::
> >  Force a sync for the filesystem identified by path.
> >
> > diff --git a/cmds-filesystem.c b/cmds-filesystem.c
> > index 69c1ca5..51c4c55 100644
> > --- a/cmds-filesystem.c
> > +++ b/cmds-filesystem.c
> > @@ -495,6 +495,7 @@ static const char * const cmd_show_usage[] = {
> >  	"-d|--all-devices   show only disks under /dev containing btrfs filesystem",
> >  	"-m|--mounted       show only mounted btrfs",
> >  	"If no argument is given, structure of all present filesystems is shown.",
> > +	"Don't combine -d|--all-devices and -m|--mounted, refer to manpage for details.",
> >  	NULL
> >  };
> >
> > @@ -526,16 +527,23 @@ static int cmd_show(int argc, char **argv)
> >  			break;
> >  		switch (c) {
> >  		case 'd':
> > -			where = BTRFS_SCAN_PROC;
> > +			where &= ~BTRFS_SCAN_LBLKID;
> > +			where |= BTRFS_SCAN_PROC;
> >  			break;
> >  		case 'm':
> > -			where = BTRFS_SCAN_MOUNTED;
> > +			where &= ~BTRFS_SCAN_LBLKID;
> > +			where |= BTRFS_SCAN_MOUNTED;
> >  			break;
> >  		default:
> >  			usage(cmd_show_usage);
> >  		}
> >  	}
> >
> > +	if ((where & BTRFS_SCAN_PROC) && (where & BTRFS_SCAN_MOUNTED)) {
> > +		fprintf(stderr,
> > +			"Don't use -d|--all-devices and -m|--mounted options at the same time.\n");
> > +		usage(cmd_show_usage);
> > +	}
> > +
> >  	if (check_argc_max(argc, optind + 1))
> >  		usage(cmd_show_usage);
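The conflict handling in the patch above reduces to a small bitmask invariant: each of -d and -m clears the default scan bit and sets its own, and the command is rejected if both end up set. A minimal self-contained sketch of that pattern (simplified flag names and parsing, not the actual btrfs-progs code):

```c
#include <assert.h>

/* Simplified stand-ins for the BTRFS_SCAN_* constants. */
#define SCAN_LBLKID  (1u << 0)  /* default: scan via libblkid        */
#define SCAN_PROC    (1u << 1)  /* -d: scan /proc/partitions         */
#define SCAN_MOUNTED (1u << 2)  /* -m: ask the kernel for mounts     */

/* Parse an option string like "d", "m", or "dm" (a toy stand-in for
 * getopt_long), accumulating scan flags into *where. Returns 0 on
 * success, -1 on an unknown option or a -d/-m conflict. */
static int parse_show_opts(const char *opts, unsigned *where)
{
    *where = SCAN_LBLKID;          /* default scan method */
    for (; *opts; opts++) {
        switch (*opts) {
        case 'd':
            *where &= ~SCAN_LBLKID;
            *where |= SCAN_PROC;
            break;
        case 'm':
            *where &= ~SCAN_LBLKID;
            *where |= SCAN_MOUNTED;
            break;
        default:
            return -1;
        }
    }
    /* The conflict check from the patch: both bits set means the user
     * asked for two scan methods that would overwrite each other. */
    if ((*where & SCAN_PROC) && (*where & SCAN_MOUNTED))
        return -1;
    return 0;
}
```

The same shape generalizes to any pair of mutually exclusive flags: test both bits once after option parsing, rather than trying to catch every ordering of the conflict inside the option switch itself.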
RAID1 failure and recovery
Hi,

I am testing BTRFS in a simple RAID1 environment: default mount options, and data and metadata are mirrored between sda2 and sdb2. I have a few questions and a potential bug report.

I don't normally have console access to the server, so when the server boots with 1 of 2 disks, the mount will fail without -o degraded. Can I use -o degraded by default to force mounting with any number of disks? This is the default behaviour for linux-raid, so I was rather surprised when the server didn't boot after a simulated disk failure.

So I pulled sdb to simulate a disk failure. The kernel oops'd but did continue running. I then rebooted, encountering the above mount problem. I re-inserted the disk and rebooted again, and BTRFS mounted successfully. However, I am now getting warnings like:

BTRFS: read error corrected: ino 1615 off 86016 (dev /dev/sda2 sector 4580382824)

I take it there were writes to sda and sdb is out of sync. Btrfs is correcting sdb as it goes, but I won't have redundancy until sdb resyncs completely. Is there a way to tell btrfs that I just re-added a failed disk and to go through and resync the array as mdraid would do? I know I can do a btrfs fi resync manually, but can that be automated if the array goes out of sync for whatever reason (power failure)?

Finally, for those using this sort of setup in production, is running btrfs on top of mdraid the way to go at this point?

Cheers,
Shane
[PATCH v4 00/11] Implement the data repair function for direct read
This patchset implements the data repair function for direct read; it is implemented like buffered read:

1. When we find the data is not right, we try to read the data from the other mirror.

2. When the io on the mirror ends, we insert the endio work into the dedicated btrfs workqueue, not the common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue; if we inserted the endio work of the io on the mirror into that workqueue, deadlock would happen.

3. If we get right data, we write it back to repair the corrupted mirror.

4. If the data on the new mirror is still corrupted, we try the next mirror until we read right data or all the mirrors are traversed.

5. After the above work, we set the uptodate flag according to the result.

The difference is that the direct read may be split into several small ios; in order to get the number of the mirror on which the io error happened, we have to do the data check and repair in the end io function of those sub-io requests.

Besides that, we also fixed some bugs of direct io.

Changelog v3 -> v4:
- Remove the 1st patch, which has been applied to the upstream kernel.
- Use a dedicated btrfs workqueue instead of the system workqueue to deal with the completed repair bio; this suggestion was from Chris.
- Rebase the patchset onto the integration branch of Chris's git tree.
Changelog v2 -> v3:
- Fix wrong returned bio when doing bio clone, which was reported by Filipe

Changelog v1 -> v2:
- Fix the warning which was triggered by __GFP_ZERO in the 2nd patch

Miao Xie (11):
  Btrfs: load checksum data once when submitting a direct read io
  Btrfs: cleanup similar code of the buffered data data check and dio read data check
  Btrfs: do file data check by sub-bio's self
  Btrfs: fix missing error handler if submitting re-read bio fails
  Btrfs: Cleanup unused variant and argument of IO failure handlers
  Btrfs: split bio_readpage_error into several functions
  Btrfs: modify repair_io_failure and make it suit direct io
  Btrfs: modify clean_io_failure and make it suit direct io
  Btrfs: Set real mirror number for read operation on RAID0/5/6
  Btrfs: implement repair function when direct read fails
  Btrfs: cleanup the read failure record after write or when the inode is freeing

 fs/btrfs/async-thread.c |   1 +
 fs/btrfs/async-thread.h |   1 +
 fs/btrfs/btrfs_inode.h  |  10 +-
 fs/btrfs/ctree.h        |   4 +-
 fs/btrfs/disk-io.c      |  11 +-
 fs/btrfs/disk-io.h      |   1 +
 fs/btrfs/extent_io.c    | 254 +--
 fs/btrfs/extent_io.h    |  38 -
 fs/btrfs/file-item.c    |  14 +-
 fs/btrfs/inode.c        | 446 +++-
 fs/btrfs/scrub.c        |   4 +-
 fs/btrfs/volumes.c      |   5 +
 fs/btrfs/volumes.h      |   5 +-
 13 files changed, 601 insertions(+), 193 deletions(-)

--
1.9.3
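The repair loop described in steps 1-5 of the cover letter can be modeled in a few lines of userspace C. This is an illustrative simulation only, with an invented XOR checksum and invented names, not the kernel implementation: read mirrors in order until the checksum matches, then write the good data back over the corrupted copies.

```c
#include <assert.h>
#include <string.h>

#define NMIRRORS 3
#define BLKSZ    4

/* One logical block replicated across NMIRRORS mirrors, plus the
 * checksum the data is expected to have. */
struct mirrored_block {
    unsigned char data[NMIRRORS][BLKSZ];
    unsigned char expected_csum;
};

/* Toy XOR checksum standing in for btrfs_csum_data(). */
static unsigned char csum(const unsigned char *d)
{
    unsigned char c = 0;
    for (int i = 0; i < BLKSZ; i++)
        c ^= d[i];
    return c;
}

/* Try mirrors in order until one verifies; copy the good data to the
 * caller and repair every corrupted mirror with it. Returns the index
 * of the mirror the good data came from, or -1 if all mirrors are bad
 * (the kernel would surface -EIO in that case). */
static int read_and_repair(struct mirrored_block *b, unsigned char *out)
{
    for (int m = 0; m < NMIRRORS; m++) {
        if (csum(b->data[m]) == b->expected_csum) {
            memcpy(out, b->data[m], BLKSZ);
            for (int r = 0; r < NMIRRORS; r++)
                if (csum(b->data[r]) != b->expected_csum)
                    memcpy(b->data[r], out, BLKSZ); /* write-back repair */
            return m;
        }
    }
    return -1;
}
```

What the model leaves out is exactly what the patchset is about: in the kernel, the retry has to be issued asynchronously from a dedicated endio workqueue to avoid the deadlock described in step 2, and per-sub-bio so the failing mirror number is known.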
[PATCH v4 03/11] Btrfs: do file data check by sub-bio's self
Direct IO splits the original bio to several sub-bios because of the limit of raid stripe, and the filesystem will wait for all sub-bios and then run final end io process. But it was very hard to implement the data repair when dio read failure happens, because at the final end io function, we didn't know which mirror the data was read from. So in order to implement the data repair, we have to move the file data check in the final end io function to the sub-bio end io function, in which we can get the mirror number of the device we access. This patch did this work as the first step of the direct io data repair implementation. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/btrfs_inode.h | 9 + fs/btrfs/extent_io.c | 2 +- fs/btrfs/inode.c | 100 - fs/btrfs/volumes.h | 5 ++- 4 files changed, 87 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 8bea70e..4d30947 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -245,8 +245,11 @@ static inline int btrfs_inode_in_log(struct inode *inode, u64 generation) return 0; } +#define BTRFS_DIO_ORIG_BIO_SUBMITTED 0x1 + struct btrfs_dio_private { struct inode *inode; + unsigned long flags; u64 logical_offset; u64 disk_bytenr; u64 bytes; @@ -263,6 +266,12 @@ struct btrfs_dio_private { /* dio_bio came from fs/direct-io.c */ struct bio *dio_bio; + + /* +* The original bio may be splited to several sub-bios, this is +* done during endio of sub-bios +*/ + int (*subio_endio)(struct inode *, struct btrfs_io_bio *); }; /* diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index dfe1afe..92a6d9f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2472,7 +2472,7 @@ static void end_bio_extent_readpage(struct bio *bio, int err) struct inode *inode = page-mapping-host; pr_debug(end_bio_extent_readpage: bi_sector=%llu, err=%d, -mirror=%lu\n, (u64)bio-bi_iter.bi_sector, err, +mirror=%u\n, (u64)bio-bi_iter.bi_sector, err, 
io_bio-mirror_num); tree = BTRFS_I(inode)-io_tree; diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e8139c6..cf79f79 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7198,29 +7198,40 @@ unlock_err: return ret; } -static void btrfs_endio_direct_read(struct bio *bio, int err) +static int btrfs_subio_endio_read(struct inode *inode, + struct btrfs_io_bio *io_bio) { - struct btrfs_dio_private *dip = bio-bi_private; struct bio_vec *bvec; - struct inode *inode = dip-inode; - struct bio *dio_bio; - struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); u64 start; - int ret; int i; + int ret; + int err = 0; - if (err || (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM)) - goto skip_checksum; + if (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM) + return 0; - start = dip-logical_offset; - bio_for_each_segment_all(bvec, bio, i) { + start = io_bio-logical; + bio_for_each_segment_all(bvec, io_bio-bio, i) { ret = __readpage_endio_check(inode, io_bio, i, bvec-bv_page, 0, start, bvec-bv_len); if (ret) err = -EIO; start += bvec-bv_len; } -skip_checksum: + + return err; +} + +static void btrfs_endio_direct_read(struct bio *bio, int err) +{ + struct btrfs_dio_private *dip = bio-bi_private; + struct inode *inode = dip-inode; + struct bio *dio_bio; + struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); + + if (!err (dip-flags BTRFS_DIO_ORIG_BIO_SUBMITTED)) + err = btrfs_subio_endio_read(inode, io_bio); + unlock_extent(BTRFS_I(inode)-io_tree, dip-logical_offset, dip-logical_offset + dip-bytes - 1); dio_bio = dip-dio_bio; @@ -7298,6 +7309,7 @@ static int __btrfs_submit_bio_start_direct_io(struct inode *inode, int rw, static void btrfs_end_dio_bio(struct bio *bio, int err) { struct btrfs_dio_private *dip = bio-bi_private; + int ret; if (err) { btrfs_err(BTRFS_I(dip-inode)-root-fs_info, @@ -7305,6 +7317,13 @@ static void btrfs_end_dio_bio(struct bio *bio, int err) btrfs_ino(dip-inode), bio-bi_rw, (unsigned long long)bio-bi_iter.bi_sector, bio-bi_iter.bi_size, err); + } else if (dip-subio_endio) 
{ + ret = dip-subio_endio(dip-inode, btrfs_io_bio(bio)); + if (ret) + err = ret; + } + + if (err) {
[PATCH v4 07/11] Btrfs: modify repair_io_failure and make it suit direct io
The original code of repair_io_failure was just used for buffered read, because it got some filesystem data from page structure, it is safe for the page in the page cache. But when we do a direct read, the pages in bio are not in the page cache, that is there is no filesystem data in the page structure. In order to implement direct read data repair, we need modify repair_io_failure and pass all filesystem data it need by function parameters. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/extent_io.c | 8 +--- fs/btrfs/extent_io.h | 2 +- fs/btrfs/scrub.c | 1 + 3 files changed, 7 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index cf1de40..9fbc005 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1997,7 +1997,7 @@ static int free_io_failure(struct inode *inode, struct io_failure_record *rec) */ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, u64 length, u64 logical, struct page *page, - int mirror_num) + unsigned int pg_offset, int mirror_num) { struct bio *bio; struct btrfs_device *dev; @@ -2036,7 +2036,7 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, return -EIO; } bio-bi_bdev = dev-bdev; - bio_add_page(bio, page, length, start - page_offset(page)); + bio_add_page(bio, page, length, pg_offset); if (btrfsic_submit_bio_wait(WRITE_SYNC, bio)) { /* try to remap that extent elsewhere? 
*/ @@ -2067,7 +2067,8 @@ int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, for (i = 0; i num_pages; i++) { struct page *p = extent_buffer_page(eb, i); ret = repair_io_failure(root-fs_info, start, PAGE_CACHE_SIZE, - start, p, mirror_num); + start, p, start - page_offset(p), + mirror_num); if (ret) break; start += PAGE_CACHE_SIZE; @@ -2127,6 +2128,7 @@ static int clean_io_failure(u64 start, struct page *page) if (num_copies 1) { repair_io_failure(fs_info, start, failrec-len, failrec-logical, page, + start - page_offset(page), failrec-failed_mirror); } } diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h index 75b621b..a82ecbc 100644 --- a/fs/btrfs/extent_io.h +++ b/fs/btrfs/extent_io.h @@ -340,7 +340,7 @@ struct btrfs_fs_info; int repair_io_failure(struct btrfs_fs_info *fs_info, u64 start, u64 length, u64 logical, struct page *page, - int mirror_num); + unsigned int pg_offset, int mirror_num); int end_extent_writepage(struct page *page, int err, u64 start, u64 end); int repair_eb_io_failure(struct btrfs_root *root, struct extent_buffer *eb, int mirror_num); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index cce122b..3978529 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -682,6 +682,7 @@ static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *fixup_ctx) fs_info = BTRFS_I(inode)-root-fs_info; ret = repair_io_failure(fs_info, offset, PAGE_SIZE, fixup-logical, page, + offset - page_offset(page), fixup-mirror_num); unlock_page(page); corrected = !ret; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
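The motivation for the new pg_offset parameter in the patch above is simple arithmetic: for a page-cache page the in-page offset can be computed as start - page_offset(page), because the page has a known position in the file, while a direct-io page has no such position, so the caller must pass the offset in explicitly. A toy illustration (invented helper names and a fixed page size, not kernel code):

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

/* File position of the first byte of a page-cache page at the given
 * index (the role page_offset() plays in the kernel). */
static unsigned long toy_page_offset(unsigned long page_index)
{
    return page_index * TOY_PAGE_SIZE;
}

/* Old interface: derive the in-page offset from the file position.
 * Only valid when the page actually sits at 'page_index' in the file's
 * page cache; a direct-io page has no page index, which is why the
 * patch makes the caller supply pg_offset directly instead. */
static unsigned long pgoff_from_pagecache(unsigned long start,
                                          unsigned long page_index)
{
    return start - toy_page_offset(page_index);
}
```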
[PATCH v4 05/11] Btrfs: Cleanup unused variant and argument of IO failure handlers
Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/extent_io.c | 26 ++ 1 file changed, 10 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index f8dda46..154cb8e 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -1981,8 +1981,7 @@ struct io_failure_record { int in_validation; }; -static int free_io_failure(struct inode *inode, struct io_failure_record *rec, - int did_repair) +static int free_io_failure(struct inode *inode, struct io_failure_record *rec) { int ret; int err = 0; @@ -2109,7 +2108,6 @@ static int clean_io_failure(u64 start, struct page *page) struct btrfs_fs_info *fs_info = BTRFS_I(inode)-root-fs_info; struct extent_state *state; int num_copies; - int did_repair = 0; int ret; private = 0; @@ -2130,7 +2128,6 @@ static int clean_io_failure(u64 start, struct page *page) /* there was no real error, just free the record */ pr_debug(clean_io_failure: freeing dummy error at %llu\n, failrec-start); - did_repair = 1; goto out; } if (fs_info-sb-s_flags MS_RDONLY) @@ -2147,19 +2144,16 @@ static int clean_io_failure(u64 start, struct page *page) num_copies = btrfs_num_copies(fs_info, failrec-logical, failrec-len); if (num_copies 1) { - ret = repair_io_failure(fs_info, start, failrec-len, - failrec-logical, page, - failrec-failed_mirror); - did_repair = !ret; + repair_io_failure(fs_info, start, failrec-len, + failrec-logical, page, + failrec-failed_mirror); } - ret = 0; } out: - if (!ret) - ret = free_io_failure(inode, failrec, did_repair); + free_io_failure(inode, failrec); - return ret; + return 0; } /* @@ -2269,7 +2263,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, */ pr_debug(bio_readpage_error: cannot repair, num_copies=%d, next_mirror %d, failed_mirror %d\n, num_copies, failrec-this_mirror, failed_mirror); - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); return -EIO; } @@ -2312,13 +2306,13 @@ static int 
bio_readpage_error(struct bio *failed_bio, u64 phy_offset, if (failrec-this_mirror num_copies) { pr_debug(bio_readpage_error: (fail) num_copies=%d, next_mirror %d, failed_mirror %d\n, num_copies, failrec-this_mirror, failed_mirror); - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); return -EIO; } bio = btrfs_io_bio_alloc(GFP_NOFS, 1); if (!bio) { - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); return -EIO; } bio-bi_end_io = failed_bio-bi_end_io; @@ -2349,7 +2343,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset, failrec-this_mirror, failrec-bio_flags, 0); if (ret) { - free_io_failure(inode, failrec, 0); + free_io_failure(inode, failrec); bio_put(bio); } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 02/11] Btrfs: cleanup similar code of the buffered data data check and dio read data check
Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v1 - v4: - None --- fs/btrfs/inode.c | 102 +-- 1 file changed, 47 insertions(+), 55 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index af304e1..e8139c6 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -2893,6 +2893,40 @@ static int btrfs_writepage_end_io_hook(struct page *page, u64 start, u64 end, return 0; } +static int __readpage_endio_check(struct inode *inode, + struct btrfs_io_bio *io_bio, + int icsum, struct page *page, + int pgoff, u64 start, size_t len) +{ + char *kaddr; + u32 csum_expected; + u32 csum = ~(u32)0; + static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); + + csum_expected = *(((u32 *)io_bio-csum) + icsum); + + kaddr = kmap_atomic(page); + csum = btrfs_csum_data(kaddr + pgoff, csum, len); + btrfs_csum_final(csum, (char *)csum); + if (csum != csum_expected) + goto zeroit; + + kunmap_atomic(kaddr); + return 0; +zeroit: + if (__ratelimit(_rs)) + btrfs_info(BTRFS_I(inode)-root-fs_info, + csum failed ino %llu off %llu csum %u expected csum %u, + btrfs_ino(inode), start, csum, csum_expected); + memset(kaddr + pgoff, 1, len); + flush_dcache_page(page); + kunmap_atomic(kaddr); + if (csum_expected == 0) + return 0; + return -EIO; +} + /* * when reads are done, we need to check csums to verify the data is correct * if there's a match, we allow the bio to finish. 
If not, the code in @@ -2905,20 +2939,15 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio, size_t offset = start - page_offset(page); struct inode *inode = page-mapping-host; struct extent_io_tree *io_tree = BTRFS_I(inode)-io_tree; - char *kaddr; struct btrfs_root *root = BTRFS_I(inode)-root; - u32 csum_expected; - u32 csum = ~(u32)0; - static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL, - DEFAULT_RATELIMIT_BURST); if (PageChecked(page)) { ClearPageChecked(page); - goto good; + return 0; } if (BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM) - goto good; + return 0; if (root-root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID test_range_bit(io_tree, start, end, EXTENT_NODATASUM, 1, NULL)) { @@ -2928,28 +2957,8 @@ static int btrfs_readpage_end_io_hook(struct btrfs_io_bio *io_bio, } phy_offset = inode-i_sb-s_blocksize_bits; - csum_expected = *(((u32 *)io_bio-csum) + phy_offset); - - kaddr = kmap_atomic(page); - csum = btrfs_csum_data(kaddr + offset, csum, end - start + 1); - btrfs_csum_final(csum, (char *)csum); - if (csum != csum_expected) - goto zeroit; - - kunmap_atomic(kaddr); -good: - return 0; - -zeroit: - if (__ratelimit(_rs)) - btrfs_info(root-fs_info, csum failed ino %llu off %llu csum %u expected csum %u, - btrfs_ino(page-mapping-host), start, csum, csum_expected); - memset(kaddr + offset, 1, end - start + 1); - flush_dcache_page(page); - kunmap_atomic(kaddr); - if (csum_expected == 0) - return 0; - return -EIO; + return __readpage_endio_check(inode, io_bio, phy_offset, page, offset, + start, (size_t)(end - start + 1)); } struct delayed_iput { @@ -7194,41 +7203,24 @@ static void btrfs_endio_direct_read(struct bio *bio, int err) struct btrfs_dio_private *dip = bio-bi_private; struct bio_vec *bvec; struct inode *inode = dip-inode; - struct btrfs_root *root = BTRFS_I(inode)-root; struct bio *dio_bio; struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); - u32 *csums = (u32 *)io_bio-csum; u64 start; + int ret; int i; + if (err || 
(BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM)) + goto skip_checksum; + start = dip-logical_offset; bio_for_each_segment_all(bvec, bio, i) { - if (!(BTRFS_I(inode)-flags BTRFS_INODE_NODATASUM)) { - struct page *page = bvec-bv_page; - char *kaddr; - u32 csum = ~(u32)0; - unsigned long flags; - - local_irq_save(flags); - kaddr = kmap_atomic(page); - csum = btrfs_csum_data(kaddr + bvec-bv_offset, - csum, bvec-bv_len); -
[PATCH v4 04/11] Btrfs: fix missing error handler if submitting re-read bio fails
We forgot to free the failure record and the bio after submitting the re-read bio failed; fix it.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 92a6d9f..f8dda46 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2348,6 +2348,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 	ret = tree->ops->submit_bio_hook(inode, read_mode, bio,
 					 failrec->this_mirror,
 					 failrec->bio_flags, 0);
+	if (ret) {
+		free_io_failure(inode, failrec, 0);
+		bio_put(bio);
+	}
+
 	return ret;
 }
--
1.9.3
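The bug class fixed here can be demonstrated with a userspace sketch. This is illustrative only: obj_alloc/obj_free and the counter are invented stand-ins for the failure record and the cloned bio; the point is that whatever was allocated before the submit must be released on the submit's error path, or it leaks.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts allocations that have not yet been freed. */
static int live_objects;

static void *obj_alloc(void)
{
    live_objects++;
    return malloc(1);
}

static void obj_free(void *o)
{
    live_objects--;
    free(o);
}

/* Model of bio_readpage_error(): allocate a failure record and a bio,
 * then "submit". submit_err simulates the return value of the submit
 * hook (0 on success, negative on failure). On failure, both objects
 * must be released here, which is exactly the handler the patch adds. */
static int resubmit(int submit_err)
{
    void *failrec = obj_alloc();  /* stands in for io_failure_record */
    void *bio = obj_alloc();      /* stands in for the re-read bio   */
    int ret = submit_err;

    if (ret) {
        obj_free(failrec);        /* the previously missing cleanup  */
        obj_free(bio);
        return ret;
    }
    /* In the kernel the endio path owns and later frees these on
     * success; the model frees them immediately to keep it balanced. */
    obj_free(failrec);
    obj_free(bio);
    return 0;
}
```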
[PATCH v4 10/11] Btrfs: implement repair function when direct read fails
This patch implement data repair function when direct read fails. The detail of the implementation is: - When we find the data is not right, we try to read the data from the other mirror. - When the io on the mirror ends, we will insert the endio work into the dedicated btrfs workqueue, not common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue, if we insert the endio work of the io on the mirror into that workqueue, deadlock would happen. - After we get right data, we write it back to the corrupted mirror. - And if the data on the new mirror is still corrupted, we will try next mirror until we read right data or all the mirrors are traversed. - After the above work, we set the uptodate flag according to the result. Signed-off-by: Miao Xie mi...@cn.fujitsu.com --- Changelog v3 - v4: - Use a dedicated btrfs workqueue instead of the system workqueue to deal with the completed repair bio, this suggest was from Chris. Changelog v1 - v3: - None --- fs/btrfs/async-thread.c | 1 + fs/btrfs/async-thread.h | 1 + fs/btrfs/btrfs_inode.h | 2 +- fs/btrfs/ctree.h| 1 + fs/btrfs/disk-io.c | 11 +- fs/btrfs/disk-io.h | 1 + fs/btrfs/extent_io.c| 12 ++- fs/btrfs/extent_io.h| 5 +- fs/btrfs/inode.c| 276 9 files changed, 281 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/async-thread.c b/fs/btrfs/async-thread.c index fbd76de..2da0a66 100644 --- a/fs/btrfs/async-thread.c +++ b/fs/btrfs/async-thread.c @@ -74,6 +74,7 @@ BTRFS_WORK_HELPER(endio_helper); BTRFS_WORK_HELPER(endio_meta_helper); BTRFS_WORK_HELPER(endio_meta_write_helper); BTRFS_WORK_HELPER(endio_raid56_helper); +BTRFS_WORK_HELPER(endio_repair_helper); BTRFS_WORK_HELPER(rmw_helper); BTRFS_WORK_HELPER(endio_write_helper); BTRFS_WORK_HELPER(freespace_write_helper); diff --git a/fs/btrfs/async-thread.h b/fs/btrfs/async-thread.h index e9e31c9..e386c29 100644 --- a/fs/btrfs/async-thread.h +++ b/fs/btrfs/async-thread.h @@ -53,6 +53,7 @@ BTRFS_WORK_HELPER_PROTO(endio_helper); 
BTRFS_WORK_HELPER_PROTO(endio_meta_helper); BTRFS_WORK_HELPER_PROTO(endio_meta_write_helper); BTRFS_WORK_HELPER_PROTO(endio_raid56_helper); +BTRFS_WORK_HELPER_PROTO(endio_repair_helper); BTRFS_WORK_HELPER_PROTO(rmw_helper); BTRFS_WORK_HELPER_PROTO(endio_write_helper); BTRFS_WORK_HELPER_PROTO(freespace_write_helper); diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 4d30947..7a7521c 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -271,7 +271,7 @@ struct btrfs_dio_private { * The original bio may be splited to several sub-bios, this is * done during endio of sub-bios */ - int (*subio_endio)(struct inode *, struct btrfs_io_bio *); + int (*subio_endio)(struct inode *, struct btrfs_io_bio *, int); }; /* diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 7b54cd9..63acfd8 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1538,6 +1538,7 @@ struct btrfs_fs_info { struct btrfs_workqueue *endio_workers; struct btrfs_workqueue *endio_meta_workers; struct btrfs_workqueue *endio_raid56_workers; + struct btrfs_workqueue *endio_repair_workers; struct btrfs_workqueue *rmw_workers; struct btrfs_workqueue *endio_meta_write_workers; struct btrfs_workqueue *endio_write_workers; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ff3ee22..1594d91 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -713,7 +713,11 @@ static void end_workqueue_bio(struct bio *bio, int err) func = btrfs_endio_write_helper; } } else { - if (end_io_wq-metadata == BTRFS_WQ_ENDIO_RAID56) { + if (unlikely(end_io_wq-metadata == +BTRFS_WQ_ENDIO_DIO_REPAIR)) { + wq = fs_info-endio_repair_workers; + func = btrfs_endio_repair_helper; + } else if (end_io_wq-metadata == BTRFS_WQ_ENDIO_RAID56) { wq = fs_info-endio_raid56_workers; func = btrfs_endio_raid56_helper; } else if (end_io_wq-metadata) { @@ -741,6 +745,7 @@ int btrfs_bio_wq_end_io(struct btrfs_fs_info *info, struct bio *bio, int metadata) { struct end_io_wq *end_io_wq; + end_io_wq = 
kmalloc(sizeof(*end_io_wq), GFP_NOFS); if (!end_io_wq) return -ENOMEM; @@ -2059,6 +2064,7 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info) btrfs_destroy_workqueue(fs_info-endio_workers); btrfs_destroy_workqueue(fs_info-endio_meta_workers); btrfs_destroy_workqueue(fs_info-endio_raid56_workers); + btrfs_destroy_workqueue(fs_info-endio_repair_workers); btrfs_destroy_workqueue(fs_info-rmw_workers);
[PATCH v4 01/11] Btrfs: load checksum data once when submitting a direct read io
The current code loads the checksum data several times when we split a whole direct read io because of the limit of the raid stripe, which makes us search the csum tree several times. In fact, it just wastes time and makes contention on the csum tree root more serious. This patch fixes the problem by loading the data at once.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 -> v4:
- None
Changelog v2 -> v3:
- Fix the wrong return value of btrfs_bio_clone
Changelog v1 -> v2:
- Remove the __GFP_ZERO flag in btrfs_submit_direct because it would trigger a WARNing. It was reported by Filipe David Manana, thanks.
---
 fs/btrfs/btrfs_inode.h |  1 -
 fs/btrfs/ctree.h       |  3 +--
 fs/btrfs/extent_io.c   | 13 +++--
 fs/btrfs/file-item.c   | 14 ++
 fs/btrfs/inode.c       | 38 +-
 5 files changed, 35 insertions(+), 34 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index fd87941..8bea70e 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -263,7 +263,6 @@ struct btrfs_dio_private {

 	/* dio_bio came from fs/direct-io.c */
 	struct bio *dio_bio;
-	u8 csum[0];
 };

 /*
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ded7781..7b54cd9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3719,8 +3719,7 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 			  struct bio *bio, u32 *dst);
 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 logical_offset);
+			      struct bio *bio, u64 logical_offset);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86b39de..dfe1afe 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2621,9 +2621,18 @@ btrfs_bio_alloc(struct block_device *bdev, u64 first_sector, int nr_vecs,

 struct bio *btrfs_bio_clone(struct bio *bio, gfp_t gfp_mask)
 {
-	return bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
-}
+	struct btrfs_io_bio *btrfs_bio;
+	struct bio *new;

+	new = bio_clone_bioset(bio, gfp_mask, btrfs_bioset);
+	if (new) {
+		btrfs_bio = btrfs_io_bio(new);
+		btrfs_bio->csum = NULL;
+		btrfs_bio->csum_allocated = NULL;
+		btrfs_bio->end_io = NULL;
+	}
+	return new;
+}

 /* this also allocates from the btrfs_bioset */
 struct bio *btrfs_io_bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 6e6262e..783a943 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -299,19 +299,9 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 }

 int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
-			      struct btrfs_dio_private *dip, struct bio *bio,
-			      u64 offset)
+			      struct bio *bio, u64 offset)
 {
-	int len = (bio->bi_iter.bi_sector << 9) - dip->disk_bytenr;
-	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
-	int ret;
-
-	len >>= inode->i_sb->s_blocksize_bits;
-	len *= csum_size;
-
-	ret = __btrfs_lookup_bio_sums(root, inode, bio, offset,
-				      (u32 *)(dip->csum + len), 1);
-	return ret;
+	return __btrfs_lookup_bio_sums(root, inode, bio, offset, NULL, 1);
 }

 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2118ea6..af304e1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7196,7 +7196,8 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	struct inode *inode = dip->inode;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct bio *dio_bio;
-	u32 *csums = (u32 *)dip->csum;
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	u32 *csums = (u32 *)io_bio->csum;
 	u64 start;
 	int i;

@@ -7238,6 +7239,9 @@ static void btrfs_endio_direct_read(struct bio *bio, int err)
 	if (err)
 		clear_bit(BIO_UPTODATE, &dio_bio->bi_flags);
 	dio_end_io(dio_bio, err);
+
+	if (io_bio->end_io)
+		io_bio->end_io(io_bio, err);
 	bio_put(bio);
 }

@@ -7377,13 +7381,20 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
 		ret = btrfs_csum_one_bio(root, inode, bio, file_offset, 1);
 		if (ret)
 			goto err;
-	} else if (!skip_sum) {
-		ret =
[PATCH v4 11/11] Btrfs: cleanup the read failure record after write or when the inode is freeing
After the data is written successfully, we should clean up the read failure record in that range, because:
- If we set data COW for the file, the range that the failure record points to is mapped to a new place, so it is invalid.
- If we set no-data-COW for the file, and there is no error during the write, the corrupted data is corrected, so the failure record can be removed. And if some errors happen on the mirrors, we also needn't worry about it, because the failure record will be recreated if we read the same place again.

Sometimes we may fail to correct the data, so the failure records will be left in the tree; we need to free them when we free the inode, or a memory leak happens.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 34 ++
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/inode.c     |  6 ++
 3 files changed, 41 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 86dc352..5427fd5 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2138,6 +2138,40 @@ out:
 	return 0;
 }

+/*
+ * Can be called when
+ * - hold extent lock
+ * - under ordered extent
+ * - the inode is freeing
+ */
+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end)
+{
+	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
+	struct io_failure_record *failrec;
+	struct extent_state *state, *next;
+
+	if (RB_EMPTY_ROOT(&failure_tree->state))
+		return;
+
+	spin_lock(&failure_tree->lock);
+	state = find_first_extent_bit_state(failure_tree, start, EXTENT_DIRTY);
+	while (state) {
+		if (state->start > end)
+			break;
+
+		ASSERT(state->end <= end);
+
+		next = next_state(state);
+
+		failrec = (struct io_failure_record *)state->private;
+		free_extent_state(state);
+		kfree(failrec);
+
+		state = next;
+	}
+	spin_unlock(&failure_tree->lock);
+}
+
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret)
 {
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 176a4b1..5e91fb9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -366,6 +366,7 @@ struct io_failure_record {
 	int in_validation;
 };

+void btrfs_free_io_failure_record(struct inode *inode, u64 start, u64 end);
 int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
 				struct io_failure_record **failrec_ret);
 int btrfs_check_repairable(struct inode *inode, struct bio *failed_bio,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bc8cdaf..c591af5 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2697,6 +2697,10 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent)
 		goto out;
 	}

+	btrfs_free_io_failure_record(inode, ordered_extent->file_offset,
+				     ordered_extent->file_offset +
+				     ordered_extent->len - 1);
+
 	if (test_bit(BTRFS_ORDERED_TRUNCATED, &ordered_extent->flags)) {
 		truncated = true;
 		logical_len = ordered_extent->truncated_len;
@@ -4792,6 +4796,8 @@ void btrfs_evict_inode(struct inode *inode)
 	/* do we really want it for ->i_nlink > 0 and zero btrfs_root_refs? */
 	btrfs_wait_ordered_range(inode, 0, (u64)-1);

+	btrfs_free_io_failure_record(inode, 0, (u64)-1);
+
 	if (root->fs_info->log_root_recovering) {
 		BUG_ON(test_bit(BTRFS_INODE_HAS_ORPHAN_ITEM,
 				&BTRFS_I(inode)->runtime_flags));
--
1.9.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 06/11] Btrfs: split bio_readpage_error into several functions
The data repair function of direct read will be implemented later, and some code in bio_readpage_error will be reused, so this patch splits bio_readpage_error into several functions which will be used in the direct read repair later.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/extent_io.c | 159 ++-
 fs/btrfs/extent_io.h |  28 +
 2 files changed, 123 insertions(+), 64 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 154cb8e..cf1de40 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1962,25 +1962,6 @@ static void check_page_uptodate(struct extent_io_tree *tree, struct page *page)
 	SetPageUptodate(page);
 }

-/*
- * When IO fails, either with EIO or csum verification fails, we
- * try other mirrors that might have a good copy of the data. This
- * io_failure_record is used to record state as we go through all the
- * mirrors. If another mirror has good data, the page is set up to date
- * and things continue. If a good mirror can't be found, the original
- * bio end_io callback is called to indicate things have failed.
- */
-struct io_failure_record {
-	struct page *page;
-	u64 start;
-	u64 len;
-	u64 logical;
-	unsigned long bio_flags;
-	int this_mirror;
-	int failed_mirror;
-	int in_validation;
-};
-
 static int free_io_failure(struct inode *inode, struct io_failure_record *rec)
 {
 	int ret;
@@ -2156,40 +2137,24 @@ out:
 	return 0;
 }

-/*
- * this is a generic handler for readpage errors (default
- * readpage_io_failed_hook). if other copies exist, read those and write back
- * good data to the failed position. does not investigate in remapping the
- * failed extent elsewhere, hoping the device will be smart enough to do this as
- * needed
- */
-
-static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
-			      struct page *page, u64 start, u64 end,
-			      int failed_mirror)
+int btrfs_get_io_failure_record(struct inode *inode, u64 start, u64 end,
+				struct io_failure_record **failrec_ret)
 {
-	struct io_failure_record *failrec = NULL;
+	struct io_failure_record *failrec;
 	u64 private;
 	struct extent_map *em;
-	struct inode *inode = page->mapping->host;
 	struct extent_io_tree *failure_tree = &BTRFS_I(inode)->io_failure_tree;
 	struct extent_io_tree *tree = &BTRFS_I(inode)->io_tree;
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
-	struct bio *bio;
-	struct btrfs_io_bio *btrfs_failed_bio;
-	struct btrfs_io_bio *btrfs_bio;
-	int num_copies;
 	int ret;
-	int read_mode;
 	u64 logical;

-	BUG_ON(failed_bio->bi_rw & REQ_WRITE);
-
 	ret = get_state_private(failure_tree, start, &private);
 	if (ret) {
 		failrec = kzalloc(sizeof(*failrec), GFP_NOFS);
 		if (!failrec)
 			return -ENOMEM;
+
 		failrec->start = start;
 		failrec->len = end - start + 1;
 		failrec->this_mirror = 0;
@@ -2209,11 +2174,11 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			em = NULL;
 		}
 		read_unlock(&em_tree->lock);
-
 		if (!em) {
 			kfree(failrec);
 			return -EIO;
 		}
+
 		logical = start - em->start;
 		logical = em->block_start + logical;
 		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags)) {
@@ -2222,8 +2187,10 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 			extent_set_compress_type(&failrec->bio_flags,
 						 em->compress_type);
 		}
-		pr_debug("bio_readpage_error: (new) logical=%llu, start=%llu, "
-			 "len=%llu\n", logical, start, failrec->len);
+
+		pr_debug("Get IO Failure Record: (new) logical=%llu, start=%llu, len=%llu\n",
+			 logical, start, failrec->len);
+
 		failrec->logical = logical;
 		free_extent_map(em);
@@ -2243,8 +2210,7 @@ static int bio_readpage_error(struct bio *failed_bio, u64 phy_offset,
 		}
 	} else {
 		failrec = (struct io_failure_record *)(unsigned long)private;
-		pr_debug("bio_readpage_error: (found) logical=%llu, "
-			 "start=%llu, len=%llu, validation=%d\n",
+		pr_debug("Get IO Failure Record: (found) logical=%llu, start=%llu, len=%llu, validation=%d\n",
			 failrec->logical, failrec->start, failrec->len,
			 failrec->in_validation);
		/*
@@ -2253,6 +2219,17 @@ static int bio_readpage_error(struct bio *failed_bio, u64
[PATCH v4 09/11] Btrfs: Set real mirror number for read operation on RAID0/5/6
We need the real mirror number for RAID0/5/6 when reading data, or if a read error happens, we would pass 0 as the number of the mirror on which the io error happened. That is wrong and would cause the filesystem to read the data from the corrupted mirror again.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None
---
 fs/btrfs/volumes.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1aacf5f..4856547 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5073,6 +5073,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 		num_stripes = min_t(u64, map->num_stripes,
 				    stripe_nr_end - stripe_nr_orig);
 		stripe_index = do_div(stripe_nr, map->num_stripes);
+		if (!(rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS)))
+			mirror_num = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 		if (rw & (REQ_WRITE | REQ_DISCARD | REQ_GET_READ_MIRRORS))
 			num_stripes = map->num_stripes;
@@ -5176,6 +5178,9 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 			/* We distribute the parity blocks across stripes */
 			tmp = stripe_nr + stripe_index;
 			stripe_index = do_div(tmp, map->num_stripes);
+			if (!(rw & (REQ_WRITE | REQ_DISCARD |
+				    REQ_GET_READ_MIRRORS)) && mirror_num <= 1)
+				mirror_num = 1;
 		}
 	} else {
 		/*
--
1.9.3
Re: RAID1 failure and recovery
On Fri, Sep 12, 2014 at 01:57:37AM -0700, shane-ker...@csy.ca wrote:
> Hi, I am testing BTRFS in a simple RAID1 environment. Default mount
> options, and data and metadata are mirrored between sda2 and sdb2. I
> have a few questions and a potential bug report.
>
> I don't normally have console access to the server, so when the server
> boots with 1 of 2 disks, the mount will fail without -o degraded. Can I
> use -o degraded by default to force mounting with any number of disks?
> This is the default behaviour for linux-raid, so I was rather surprised
> when the server didn't boot after a simulated disk failure.

The problem with that is that at the moment, you don't get any notification that anything's wrong when the system boots. As a result, using -o degraded as a default option is not generally recommended.

> So I pulled sdb to simulate a disk failure. The kernel oops'd but did
> continue running. I then rebooted, encountering the above mount problem.
> I re-inserted the disk and rebooted again, and BTRFS mounted
> successfully. However, I am now getting warnings like:
>
> BTRFS: read error corrected: ino 1615 off 86016 (dev /dev/sda2 sector 4580382824)
>
> I take it there were writes to sda and sdb is out of sync. Btrfs is
> correcting sdb as it goes, but I won't have redundancy until sdb resyncs
> completely. Is there a way to tell btrfs that I just re-added a failed
> disk and to go through and resync the array as mdraid would do? I know I
> can do a btrfs fi resync manually, but can that be automated if the
> array goes out of sync for whatever reason (power failure)...

I've done this before, by accident (pulled the wrong drive, reinserted it). You can fix it by running a scrub on the device (btrfs scrub start /dev/ice, I think).

> Finally, for those using this sort of setup in production, is running
> btrfs on top of mdraid the way to go at this point?
Using btrfs native RAID means that you get independent checksums on the two copies, so that where the data differs between the copies, the correct data can be identified.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- SCSI is usually fixed by remembering that it needs three ---
terminations: One at each end of the chain. And the goat.
Re: RAID1 failure and recovery
shane-kernel posted on Fri, 12 Sep 2014 01:57:37 -0700 as excerpted:

[Last question first as it's easy to answer...]

> Finally, for those using this sort of setup in production, is running
> btrfs on top of mdraid the way to go at this point?

While the latest kernel and btrfs-tools have removed the warnings, btrfs is still not yet fully stable and isn't really recommended for production. Yes, certain distributions support it, but that's their support choice that you're buying from them, and if it all goes belly up, I guess you'll see what that money actually buys. However, /here/ it's not really recommended yet.

That said, there are people doing it, and if you make sure you have backups suited to the extent to which you're depending on the data on that btrfs, and are willing to deal with the downtime or failover hassles if it happens...

Also important: keep current particularly with kernels, but don't let the btrfs-progs userspace get too outdated either, and follow this list to keep up with current status. If you're running older than the latest kernel series without a specific reason, you're likely running without patches for the most recently discovered btrfs bugs.

There was a recent exception to the general latest-kernel rule, in the form of a bug that only affected the kworker threads that btrfs switched to in 3.15, so 3.14 was unaffected, while it took through 3.15 and 3.16 to find and trace the bug. 3.17-rc3 got the fix, and I believe it's in the latest 3.16 stable as well. But that's where staying current with the list, and actually having a reason to run an older-than-current kernel, comes in: once the bug became known on the list, there was a reason to run the older kernel, so it wasn't an exception to the way I put it above.

If you're unwilling to do that, then choose something other than btrfs. But anyway, here's a direct answer to the question...
While btrfs on top of mdraid (or dmraid or...) in general works, it doesn't match up well with btrfs checksummed data integrity features. Consider: mdraid-1 writes to all devices, but reads from only one, without any checksumming or other data integrity measures. If the copy mdraid-1 decides to read from is bad, unless the hardware actually reports it as bad, mdraid is entirely oblivious and will carry on as if nothing happened. There's no checking the other copies to see that they match, no checksums or other verification, nothing. Btrfs OTOH has checksumming and data verification. With btrfs raid1, that verification means that if whatever copy btrfs happens to pull fails the verify, it can verify and pull from the second copy, overwriting the bad-checksum copy with a good-checksum copy. BUT THAT ONLY HAPPENS IF IT HAS THAT SECOND COPY, AND IT ONLY HAS THAT SECOND COPY IN BTRFS RAID1 (or raid10 or for metadata, dup) MODE. Now, consider what happens when btrfs data verification interacts with mdraid's lack of data verification. If whatever copy mdraid pulls up is bad, it's going to fail the btrfs checksum and btrfs will reject it. But because btrfs is on top of mdraid and mdraid is oblivious, there's no mechanism for btrfs to know that mdraid has other copies that may be just fine -- to btrfs, that copy is bad, period. And if btrfs doesn't have a second btrfs copy, either due to btrfs raid1 or raid10 mode on top of mdraid, or for metadata, due to dup mode, then btrfs will simply return an error for that data, no second chance, because it knows nothing about the other copies mdraid has. So while in general it works about as well as any other filesystem on top of mdraid, the interaction between mdraid's lack of data verification and btrfs' automated data verification is... unfortunate. With that said, let's look at the rest of the post... I am testing BTRFS in a simple RAID1 environment. Default mount options and data and metadata are mirrored between sda2 and sdb2. 
I have a few questions and a potential bug report. I don't normally have console access to the server so when the server boots with 1 of 2 disks, the mount will fail without -o degraded. Can I use -o degraded by default to force mounting with any number of disks? This is the default behaviour for linux-raid so I was rather surprised when the server didn't boot after a simulated disk failure. The idea here is that if a device is malfunctioning, the admin should have to take deliberate action to demonstrate knowledge of that fact before the filesystem will mount. Btrfs isn't yet as robust in degraded mode as say mdraid, and important btrfs features like data validation and scrub are seriously degraded when that second copy is no longer there. In addition, btrfs raid1 mode requires that each of the two copies of a chunk be written to different devices, and once there's only a single device available, that can no longer happen, so unless
breathe life into degraded raid10 (no space left on specific device)
Dear List,

I tried to remove a device from a 12-disk RAID10 array, but it failed with "no space left" and the system crashed. After a reset, I could only mount the array in degraded mode because the device was marked as missing. I tried a replace command, but it said that it does not support RAID5/6 arrays, so I went with the device add command, which added the disk to the array. But the missing-device message was still there, so I issued the btrfs device delete missing [path] command, which ran for more than 24 hours without much happening. Also, if I try to do some disk io (rsync), the system crashes after 10-15 minutes.

If I try to mount the array as is, I get this:
[   15.649539] btrfs: failed to read chunk tree on sdc
[   15.682209] btrfs: open_ctree failed

Mounting with degraded works. Now if I try btrfs device delete missing, I get a "no space left" error on /dev/sdc. I tried to do a rebalance, which worked with the data chunks, but with metadata it fails with the "no space left" error. I tried to delete some snapshots, which did happen, but after that some metadata rebalance started, which failed too with "no space left". I issued a whole-array scrub, which complained that dev id 12 is missing. A scrub on /dev/sdc finished successfully without any error.

I'm sure that /dev/sdc is broken metadata-wise, but I'm afraid to remove that device too. Do you have any suggestions on what I should do?
Here's some info:

uname -a:
Linux backup 3.13-0.bpo.1-amd64 #1 SMP Debian 3.13.10-1~bpo70+1 (2014-04-23) x86_64 GNU/Linux

btrfs --version:
Btrfs v3.14.1

btrfs fi show (/dev/sdl was removed and re-added):
Label: 'backup'  uuid: 667ec955-bcaa-4175-827d-3a44eaa515bb
	Total devices 13 FS bytes used 9.29TiB
	devid  1 size 2.71TiB used 1.56TiB path /dev/sda4
	devid  2 size 2.71TiB used 1.56TiB path /dev/sdb4
	devid  3 size 2.73TiB used 1.56TiB path /dev/sdc
	devid  4 size 2.73TiB used 1.56TiB path /dev/sdd
	devid  5 size 2.73TiB used 1.56TiB path /dev/sde
	devid  6 size 2.73TiB used 1.56TiB path /dev/sdf
	devid  7 size 2.73TiB used 1.56TiB path /dev/sdg
	devid  8 size 2.73TiB used 1.56TiB path /dev/sdh
	devid  9 size 2.73TiB used 1.56TiB path /dev/sdi
	devid 10 size 2.73TiB used 1.56TiB path /dev/sdj
	devid 11 size 2.73TiB used 1.56TiB path /dev/sdk
	devid 13 size 2.73TiB used 1.75GiB path /dev/sdl
	*** Some devices missing
Btrfs v3.14.1

Data, RAID10: total=8.87TiB, used=8.79TiB
System, RAID1: total=32.00MiB, used=1.11MiB
Metadata, RAID10: total=518.72GiB, used=516.31GiB

Best regards,
--
GABRI Mate
Rendszergazda
mga...@plex.hu | +36 (30) 450 9750
Plex Online Kft. | http://plex.hu
H-1118 Budapest, Ugron Gábor u. 35.
T: +36 (1) 445 0167 | F: +36 (1) 248 3250
Re: fs corruption report
Hello Gui,

On Thursday, 4 September 2014, 11:50:14, Marc Dietrich wrote:
> On Thursday, 4 September 2014, 11:00:55, Gui Hecheng wrote:
> > Hi Zooko, Marc,
> > Firstly, thanks for your backtrace info, Marc. Sorry to reply late; I have been offline these days.
> > For the restore problem, I'm sure that the lzo decompress routine lacks the ability to handle some specific extent patterns. Here is my test result. I'm using a specific file for the test, /usr/lib/modules/$(uname -r)/kernel/net/irda/irda.ko. You can get it easily on your own box.
> >
> > # mkfs -t btrfs dev
> > # mount -o compress-force=lzo dev mnt
> > # cp irda.ko mnt
> > # umount dev
> > # btrfs restore -v dev restore_dir
> >
> > reports:
> > bad compress length
> > failed to inflate
>
> uh, that's really odd. I don't use force compress, but I guess it will also happen with the non-forced one if the file is big enough.
>
> > btrfs-progs version: v3.16.x
> > With the same file under no-compress or zlib-compress, the restore outputs a correct copy of irda.ko. I'm not sure whether the problem above has something to do with your problem. Hope that the messages above are helpful.
>
> I also get lots of "bad compress length", so it might indeed be related. I'm not a programmer, but is it possible that we are just skipping the lzo header (magic + header len + rest of header)? I'm able to reproduce it.

Any improved insight into this problem?

Marc
Re: [PATCH v4 00/11] Implement the data repair function for direct read
On 09/12/2014 06:43 AM, Miao Xie wrote:
> This patchset implements the data repair function for direct read. It is implemented like buffered read:
> 1. When we find the data is not right, we try to read the data from the other mirror.
> 2. When the io on the mirror ends, we insert the endio work into the dedicated btrfs workqueue, not the common read endio workqueue, because the original endio work is still blocked in the btrfs endio workqueue; if we inserted the endio work of the io on the mirror into that workqueue, a deadlock would happen.
> 3. If we get the right data, we write it back to repair the corrupted mirror.
> 4. If the data on the new mirror is still corrupted, we try the next mirror until we read the right data or all the mirrors are traversed.
> 5. After the above work, we set the uptodate flag according to the result.
>
> The difference is that the direct read may be split into several small ios; in order to get the number of the mirror on which the io error happens, we have to do the data check and repair in the end IO function of those sub-IO requests.
>
> Besides that, we also fixed some bugs of direct io.
>
> Changelog v3 -> v4:
> - Remove the 1st patch, which has been applied to the upstream kernel.
> - Use a dedicated btrfs workqueue instead of the system workqueue to deal with the completed repair bio; this suggestion was from Chris.
> - Rebase the patchset onto the integration branch of Chris's git tree.

Perfect, thank you.

-chris
[GIT PULL] Btrfs for rc5
Hi Linus,

My for-linus branch has some fixes for the next rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

Filipe is doing a careful pass through fsync problems, and these are the fixes so far. I'll have one more for rc6 that we're still testing. My big commit is fixing up some inode hash races that Al Viro found (thanks Al).

Filipe Manana (3) commits (+75/-21):
    Btrfs: fix corruption after write/fsync failure + fsync + log recovery (+9/-3)
    Btrfs: fix fsync data loss after a ranged fsync (+64/-17)
    Btrfs: fix crash while doing a ranged fsync (+2/-1)

Chris Mason (2) commits (+113/-70):
    Btrfs: use insert_inode_locked4 for inode creation (+109/-67)
    Btrfs: fix autodefrag with compression (+4/-3)

Dan Carpenter (1) commits (+15/-10):
    Btrfs: kfree()ing ERR_PTRs

Total: (6) commits (+203/-101)

 fs/btrfs/file.c     |   2 +-
 fs/btrfs/inode.c    | 191 +---
 fs/btrfs/ioctl.c    |  32 +
 fs/btrfs/tree-log.c |  77 -
 fs/btrfs/tree-log.h |   2 +
 5 files changed, 203 insertions(+), 101 deletions(-)
[PATCH] Btrfs: remove empty block groups automatically
One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything.

This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting, or on mount notice that a block group has a used count of 0, then we will queue it to be deleted. When the cleaner thread runs we will double-check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 fs/btrfs/ctree.h                  |   8 ++-
 fs/btrfs/disk-io.c                |   3 +
 fs/btrfs/extent-tree.c            | 128 +++---
 fs/btrfs/tests/free-space-tests.c |   2 +-
 fs/btrfs/volumes.c                | 115 ++
 fs/btrfs/volumes.h                |   2 +
 6 files changed, 209 insertions(+), 49 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6db3d4b..d373c89 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1298,8 +1298,8 @@ struct btrfs_block_group_cache {
 	 */
 	struct list_head cluster_list;

-	/* For delayed block group creation */
-	struct list_head new_bg_list;
+	/* For delayed block group creation or deletion of empty block groups */
+	struct list_head bg_list;
 };

 /* delayed seq elem */
@@ -1716,6 +1716,9 @@ struct btrfs_fs_info {

 	/* Used to reclaim the metadata space in the background. */
 	struct work_struct async_reclaim_work;
+
+	spinlock_t unused_bgs_lock;
+	struct list_head unused_bgs;
 };

 struct btrfs_subvolume_writers {
@@ -3343,6 +3346,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans,
 			   u64 size);
 int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 group_start);
+void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
 				       struct btrfs_root *root);
 u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a224fb9..3409734 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1764,6 +1764,7 @@ static int cleaner_kthread(void *arg)
 		}

 		btrfs_run_delayed_iputs(root);
+		btrfs_delete_unused_bgs(root->fs_info);
 		again = btrfs_clean_one_deleted_snapshot(root);
 		mutex_unlock(&root->fs_info->cleaner_mutex);
@@ -2224,6 +2225,7 @@ int open_ctree(struct super_block *sb,
 	spin_lock_init(&fs_info->super_lock);
 	spin_lock_init(&fs_info->qgroup_op_lock);
 	spin_lock_init(&fs_info->buffer_lock);
+	spin_lock_init(&fs_info->unused_bgs_lock);
 	rwlock_init(&fs_info->tree_mod_log_lock);
 	mutex_init(&fs_info->reloc_mutex);
 	mutex_init(&fs_info->delalloc_root_mutex);
@@ -2233,6 +2235,7 @@ int open_ctree(struct super_block *sb,
 	INIT_LIST_HEAD(&fs_info->dirty_cowonly_roots);
 	INIT_LIST_HEAD(&fs_info->space_info);
 	INIT_LIST_HEAD(&fs_info->tree_mod_seq_list);
+	INIT_LIST_HEAD(&fs_info->unused_bgs);
 	btrfs_mapping_init(&fs_info->mapping_tree);
 	btrfs_init_block_rsv(&fs_info->global_block_rsv,
 			     BTRFS_BLOCK_RSV_GLOBAL);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b30ddb4..c68cdb1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5433,6 +5433,20 @@ static int update_block_group(struct btrfs_root *root,
 			spin_unlock(&cache->space_info->lock);
 		} else {
 			old_val -= num_bytes;
+
+			/*
+			 * No longer have used bytes in this block group, queue
+			 * it for deletion.
+			 */
+			if (old_val == 0) {
+				spin_lock(&info->unused_bgs_lock);
+				if (list_empty(&cache->bg_list)) {
+					btrfs_get_block_group(cache);
+					list_add_tail(&cache->bg_list,
+						      &info->unused_bgs);
+				}
+				spin_unlock(&info->unused_bgs_lock);
+			}
 			btrfs_set_block_group_used(&cache->item, old_val);
 			cache->pinned +=
Re: [PATCH] Btrfs: remove empty block groups automatically
On 09/12/2014 03:18 PM, Josef Bacik wrote:
> One problem that has plagued us is that a user will use up all of his space with data, remove a bunch of that data, and then try to create a bunch of small files and run out of space. This happens because all the chunks were allocated for data since the metadata requirements were so low. But now there's a bunch of empty data block groups and not enough metadata space to do anything.
>
> This patch solves this problem by automatically deleting empty block groups. If we notice the used count go down to 0 when deleting, or on mount notice that a block group has a used count of 0, then we will queue it to be deleted. When the cleaner thread runs we will double check to make sure the block group is still empty and then we will delete it. This patch has the side effect of no longer having a bunch of BUG_ON()'s in the chunk delete code, which will be helpful for both this and relocate. Thanks,

Thanks Josef, we've needed this forever. I'm planning on pulling it in for integration as well.

-chris
Re: Btrfs: device_list_add() should not update list when mounted breaks subvol mount
Hi,

On a standard Ubuntu 14.04 with an encrypted (cryptsetup) /home as a btrfs subvolume, we have the following results:

3.17-rc2: OK.
3.17-rc3 and 3.17-rc4: /home fails to mount on boot. If one tries mount -a, the system says that the partition is already mounted according to mtab.

On 3.17-rc4, btrfs fi sh returns nothing special:

Label: none  uuid: f4f554bb-57d9-4647-ab14-ea978c9e7e9f
	Total devices 1 FS bytes used 131.41GiB
	devid    1 size 173.31GiB used 134.03GiB path /dev/sda5

Btrfs v3.12

I'm not sure if it has something to do with cryptsetup...

Xavier

Hi Johannes,

I have two more systems with kernel version 3.17-rc3 running and no problem like this. Does this 3.17-rc3 also have the same type of subvol config and mount operation/sequence as you mentioned?

- one hdd with btrfs
- default subvolume (rootfs) is different from subvolid=0
- at boot, several subvols are mounted at /home/$DIR

I ran a few tests on our mainline:

mount -o subvol=sv1 /dev/sdh1 /btrfs
mount /dev/sdh1 /btrfs
mount: /dev/sdh1 already mounted or /btrfs busy  [*]
mount /dev/sdh1 /btrfs1
echo $?
0

[*] hope this isn't the problem you are mentioning.

Thanks, Anand
[bug] subvol doesn't belong to btrfs mount point
Summary: When a btrfs subvolume is mounted with -o subvol, and a nested ro subvol/snapshot is created, btrfs send returns with an error. If the top level (id 5) is mounted instead, the send command succeeds.

3.17.0-0.rc4.git0.1.fc22.i686
Btrfs v3.16

This may also be happening on x86_64, and this bug suggests the problem is commit de22c28ef31d9721606ba05965a093a8044be0de
https://bugzilla.kernel.org/show_bug.cgi?id=83741

[root@lati ~]# strace btrfs send /root/rpms.ro | btrfs receive /mnt/
execve("/usr/sbin/btrfs", ["btrfs", "send", "/root/rpms.ro"], [/* 26 vars */]) = 0
brk(0) = 0x866a000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7767000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=84972, ...}) = 0
mmap2(NULL, 84972, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7752000
close(3) = 0
open("/lib/libuuid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320\17\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=19100, ...}) = 0
mmap2(NULL, 20692, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb774c000
mmap2(0xb775, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3000) = 0xb775
close(3) = 0
open("/lib/libblkid.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320P\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=271276, ...}) = 0
mmap2(NULL, 271832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7709000
mmap2(0xb7748000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x3e000) = 0xb7748000
mmap2(0xb774b000, 1496, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb774b000
close(3) = 0
open("/lib/libm.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0pF\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=357384, ...}) = 0
mmap2(NULL, 311456, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb76bc000
mmap2(0xb7707000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x4b000) = 0xb7707000
close(3) = 0
open("/lib/libz.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240\30\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=93444, ...}) = 0
mmap2(NULL, 94464, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb76a4000
mmap2(0xb76ba000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x15000) = 0xb76ba000
close(3) = 0
open("/lib/liblzo2.so.2", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0@ \0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=145984, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb76a3000
mmap2(NULL, 147584, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb767e000
mmap2(0xb76a1000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0xb76a1000
close(3) = 0
open("/lib/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320O\0\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=142932, ...}) = 0
mmap2(NULL, 115404, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7661000
mprotect(0xb7679000, 4096, PROT_NONE) = 0
mmap2(0xb767a000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x18000) = 0xb767a000
mmap2(0xb767c000, 4812, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb767c000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\3\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0\200\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=2166836, ...}) = 0
mmap2(NULL, 1933020, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb7489000
mmap2(0xb765a000, 20480, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1d) = 0xb765a000
mmap2(0xb765f000, 7900, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb765f000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7488000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7487000
set_thread_area({entry_number:-1, base_addr:0xb7487800, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0 (entry_number:6)
mprotect(0xb765a000, 12288, PROT_READ) = 0
mprotect(0xb767a000, 4096, PROT_READ) = 0
mprotect(0xb76a1000, 4096,
Re: Btrfs: device_list_add() should not update list when mounted breaks subvol mount
Hi Xavier,

Thanks for the report. I got this reproduced: it's a very narrow corner case that depends on the device path given in the subsequent subvol mounts. The fix appears to lie outside this patch at the moment, and I am digging into whether we need to normalize the device path before using it in the btrfs kernel code, just as btrfs-progs did recently.

Reproducer:

ls -l /root/dev/sde-link
/root/dev/sde-link -> /dev/sde

mount -o device=/root/dev/sde-link /dev/sdd /btrfs1

btrfs fi show
Label: none  uuid: 943bf422-998c-4640-9d7f-d49f17b782ce
        Total devices 2  FS bytes used 272.00KiB
        devid 1  size 1.52GiB  used 339.50MiB  path /dev/sdd
        devid 2  size 1.52GiB  used 319.50MiB  path /root/dev/sde-link

mount -o subvol=sv1,device=/dev/sde /dev/sdd /btrfs    <- shouldn't fail
mount: /dev/sdd already mounted or /btrfs busy
mount: according to mtab, /dev/sdd is mounted on /btrfs1

mount -o device=/root/dev/sde-link /dev/sdd /btrfs
echo $?
0

Xavier, Johannes: the quickest workaround for you will be to make the device path in your fstab/mnttab entry match the one shown in the btrfs fi show -m /mnt output.

Anand
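The normalization point Anand raises can be illustrated with a short Python sketch (the paths here are stand-ins created in a temp directory, not real device nodes): comparing raw path strings says a symlink and its target are different devices, while canonicalizing both with os.path.realpath makes them compare equal.

```python
import os
import tempfile

# Hypothetical illustration: a symlink (like /root/dev/sde-link) and its
# target (like /dev/sde) are the same file, but naive string comparison
# of the two paths disagrees with comparison after canonicalization.
with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "sde")        # stand-in for /dev/sde
    link = os.path.join(d, "sde-link")     # stand-in for /root/dev/sde-link
    open(target, "w").close()
    os.symlink(target, link)

    raw_match = (link == target)           # naive string comparison
    canon_match = (os.path.realpath(link) == os.path.realpath(target))

print("raw match:", raw_match)             # False: the strings differ
print("canonical match:", canon_match)     # True: both resolve to one file
```

A device list keyed on raw path strings can therefore end up treating one device as two, which is consistent with the duplicate-mount confusion in the reproducer above.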