Re: [PATCH v5 8/8] btrfs: new ioctls to do logical-inode and inode-path resolving
On 21.07.2011 22:14, Andi Kleen wrote:
> Jan Schmidt <list.bt...@jan-o-sch.net> writes:
>> +
>> +static long btrfs_ioctl_logical_to_ino(struct btrfs_root *root,
>> +					void __user *arg)
>> +{
>> +	int ret = 0;
>> +	int size;
>> +	u64 extent_offset;
>> +	struct btrfs_ioctl_logical_ino_args *loi;
>> +	struct btrfs_data_container *inodes = NULL;
>> +	struct btrfs_path *path = NULL;
>> +	struct btrfs_key key;
>
> This really needs to be root-only for obvious reasons.
>
> The same for the ino_path function
>
>> +
>> +	loi = memdup_user(arg, sizeof(*loi));
>> +	if (IS_ERR(loi)) {
>> +		ret = PTR_ERR(loi);
>> +		loi = NULL;
>> +		goto out;
>> +	}
>> +
>> +	path = btrfs_alloc_path();
>> +	if (!path) {
>> +		ret = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	size = min(loi->size, 4096);
>
> This is likely a root hole. loi->size is signed! Consider the case of
> a negative value being passed in.
>
> Same for the earlier function.

Sigh. Thanks for pointing these out. Shouldn't release code that was fine
for development without carefully reconsidering such things.

I'll send a v6.

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
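For illustration only, here is a small userspace sketch of the signed-size problem raised above. It is not kernel code; `clamp_size_unchecked`, `clamp_size_checked`, `as_length` and `MAX_BUF` are hypothetical names. It shows that capping a signed size only from above lets a negative value through, and that the value turns into a huge length once it is converted to an unsigned type.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_BUF 4096

/* Mirrors the criticised pattern: the signed size is only capped from above. */
static int clamp_size_unchecked(int requested)
{
	return requested < MAX_BUF ? requested : MAX_BUF;
}

/* Rejects non-positive sizes before capping; returns the size or -EINVAL. */
static int clamp_size_checked(int requested)
{
	if (requested <= 0)
		return -EINVAL;
	return requested < MAX_BUF ? requested : MAX_BUF;
}

/* What an allocation or copy routine taking a size_t would see. */
static size_t as_length(int size)
{
	return (size_t)size;
}
```

With a requested size of -1, the unchecked variant returns -1, and `as_length(-1)` is the largest `size_t` value, far beyond the intended 4096-byte cap.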
Re: Broken btrfs?
On 21.07.2011 23:13, Jan Schubert wrote:
> On 07/18/2011 10:29 AM, Jan Schmidt wrote:
>> If you are on a 3.0 kernel, get the most current version of btrfs tools
>> from Hugo's integration-20110705 branch at
>> http://git.darksatanic.net/repo/btrfs-progs-unstable.git/ and do a scrub.
>>
>> -Jan
>
> Thx Jan, I did. This is the result:
>
> scrub status for 03201fc0-7695-4468-9a10-f61ad79f23ca
>         scrub started at Thu Jul 21 22:27:31 2011 and finished after 787 seconds
>         total bytes scrubbed: 173.91GB with 2211 errors
>         error details: csum=2211
>         corrected errors: 0, uncorrectable errors: 2211
>
> Any help what to do now? Should I stick with this filesystem or create a
> new one?

Well, you won't be able to repair the broken files. You can create a new
filesystem. It is not guaranteed that this won't result in similar problems,
though. You might have built on a sandy hard drive.

The good thing is, running 3.0 no longer crashes the system when accessing
corrupt data; it just prints an I/O error.

Scrub should be printing inode numbers to your system log while detecting
those errors. If you want to know the exact files corrupted, you can grab my
patch set with subject "Btrfs scrub: print path to corrupted files and
trigger nodatasum fixup" from the list and give it a try.

-Jan
Re: new metadata reader/writer locks in integration-test
On fri, 22 Jul 2011 12:06:40 +0800, Miao Xie wrote:
On thu, 21 Jul 2011 20:53:24 -0400, Chris Mason wrote:

Hi everyone,

I just rebased Josef's enospc fixes into integration-test, it should fix the
warnings in extent-tree.c

Unfortunately, I got the following messages.

Jul 21 09:41:22 luna kernel: ------------[ cut here ]------------
Jul 21 09:41:22 luna kernel: WARNING: at fs/btrfs/extent-tree.c:5564 btrfs_alloc_reserved_file_extent+0xf8/0x100 [btrfs]()
Jul 21 09:41:22 luna kernel: Hardware name: PRIMERGY
Jul 21 09:41:22 luna kernel: Modules linked in: btrfs zlib_deflate crc32c libcrc32c autofs4 sunrpc 8021q garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 ext3 jbd dm_mirror dm_region_hash dm_log dm_mod kvm uinput ppdev parport_pc parport sg pcspkr i2c_i801 i2c_core iTCO_wdt iTCO_vendor_support tg3 shpchp pci_hotplug i3000_edac edac_core ext4 mbcache jbd2 crc16 sd_mod crc_t10dif sr_mod cdrom megaraid_sas floppy pata_acpi ata_generic ata_piix libata scsi_mod [last unloaded: microcode]
Jul 21 09:41:22 luna kernel: Pid: 5517, comm: btrfs-endio-wri Tainted: G W 2.6.39btrfs-tc1+ #1
Jul 21 09:41:22 luna kernel: Call Trace:
Jul 21 09:41:22 luna kernel: [8106004f] warn_slowpath_common+0x7f/0xc0
Jul 21 09:41:22 luna kernel: [810600aa] warn_slowpath_null+0x1a/0x20
Jul 21 09:41:22 luna kernel: [a044a068] btrfs_alloc_reserved_file_extent+0xf8/0x100 [btrfs]
Jul 21 09:41:22 luna kernel: [a0464121] insert_reserved_file_extent.clone.0+0x201/0x270 [btrfs]
Jul 21 09:41:22 luna kernel: [a0468c0b] btrfs_finish_ordered_io+0x2eb/0x360 [btrfs]
Jul 21 09:41:22 luna kernel: [8106fe23] ? try_to_del_timer_sync+0x83/0xe0
Jul 21 09:41:22 luna kernel: [a0468cd0] btrfs_writepage_end_io_hook+0x50/0xa0 [btrfs]
Jul 21 09:41:22 luna kernel: [a049a3c6] end_compressed_bio_write+0x86/0xf0 [btrfs]
Jul 21 09:41:22 luna kernel: [8117f96d] bio_endio+0x1d/0x40
Jul 21 09:41:22 luna kernel: [a0459d84] end_workqueue_fn+0xf4/0x130 [btrfs]
Jul 21 09:41:22 luna kernel: [a048841e] worker_loop+0x13e/0x540 [btrfs]
Jul 21 09:41:22 luna kernel: [a04882e0] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
Jul 21 09:41:22 luna kernel: [a04882e0] ? btrfs_queue_worker+0x2d0/0x2d0 [btrfs]
Jul 21 09:41:22 luna kernel: [81081756] kthread+0x96/0xa0
Jul 21 09:41:22 luna kernel: [81486004] kernel_thread_helper+0x4/0x10
Jul 21 09:41:22 luna kernel: [810816c0] ? kthread_worker_fn+0x1a0/0x1a0
Jul 21 09:41:22 luna kernel: [81486000] ? gs_change+0x13/0x13
Jul 21 09:41:22 luna kernel: ---[ end trace 02c1fa3044677043 ]---

a very similar warning here, but without compression involved:

Ok, these are probably the enospc fixes. Could you please try bisecting out
some of Josef's patches?

I did binary search and found the following patch led to this problem.

commit 97ffc7d564f55787c7d9ea557d5d30d9ecb2f003
Author: Josef Bacik <jo...@redhat.com>
Date: Fri Jul 15 18:29:11 2011 +

    Btrfs: don't be as agressive with delalloc metadata reservations

    Currently we reserve enough space to COW an entirely full btree for every ex
    we have reserved for an inode. This _sucks_, because you only need to COW o
    and then everybody else is ok. Unfortunately we don't know we'll all be abl
    get into the same transaction so that's what we have had to do. But the glo
    reserve holds a reservation large enough to cover a large percentage of all
    metadata currently in the fs. So all we really need to account for is any n
    blocks that we may allocate. So fix this by
    ……

Please ignore my analysis and patch, which can not fix the problem.

The reason is the calculation of the reservation is wrong, the nodes in the
search path may be split, and new nodes may be created, but the above patch
didn't reserve space for these new nodes.

The following patch can fix it. Though my test passed, I still need Arne's
verification to make sure it can fix all the reported problems.

Arne, Could you test it for me?

Subject: [PATCH] Btrfs: fix wrong calculation of the reservation for the transaction

At worst, Btrfs may split all the nodes in the search path, so we must take
those new nodes into account when we calculate the space that need be
reserved.

Signed-off-by: Miao Xie <mi...@cn.fujitsu.com>
---
 fs/btrfs/ctree.h |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d813a67..4f23819 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2133,10 +2133,16 @@ static inline bool btrfs_mixed_space_info(struct btrfs_space_info *space_info)
 }
 
 /* extent-tree.c */
+/*
+ * This inline function is used to calc the size of new nodes/leaves that we
+ * may create. At worst, we may split all the nodes in the path
[PATCH v6 4/8] btrfs scrub: bugfix: mirror_num off by one
Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/scrub.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 221fd5c..59caf8f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -452,7 +452,7 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)
 	 * first find a good copy
 	 */
 	for (i = 0; i < multi->num_stripes; ++i) {
-		if (i == sbio->spag[ix].mirror_num)
+		if (i + 1 == sbio->spag[ix].mirror_num)
 			continue;
 
 		if (scrub_fixup_io(READ, multi->stripes[i].dev->bdev,
@@ -930,21 +930,21 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
 		offset = map->stripe_len * num;
 		increment = map->stripe_len * map->num_stripes;
-		mirror_num = 0;
+		mirror_num = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
 		int factor = map->num_stripes / map->sub_stripes;
 		offset = map->stripe_len * (num / map->sub_stripes);
 		increment = map->stripe_len * factor;
-		mirror_num = num % map->sub_stripes;
+		mirror_num = num % map->sub_stripes + 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 		increment = map->stripe_len;
-		mirror_num = num % map->num_stripes;
+		mirror_num = num % map->num_stripes + 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
 		increment = map->stripe_len;
-		mirror_num = num % map->num_stripes;
+		mirror_num = num % map->num_stripes + 1;
 	} else {
 		increment = map->stripe_len;
-		mirror_num = 0;
+		mirror_num = 1;
 	}
 
 	path = btrfs_alloc_path();
-- 
1.7.3.4
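For illustration only, the 1-based numbering this patch establishes can be restated as a standalone userspace sketch. The enum `raid_profile` and the helpers `mirror_num_for` and `is_bad_copy` are hypothetical names, not kernel API; the arithmetic follows the "+" lines of the diff above.

```c
/* Hypothetical stand-ins for the BTRFS_BLOCK_GROUP_* profile flags. */
enum raid_profile {
	PROFILE_SINGLE,
	PROFILE_RAID0,
	PROFILE_RAID1,
	PROFILE_RAID10,
	PROFILE_DUP
};

/* Returns the 1-based mirror number for stripe index num. */
static int mirror_num_for(enum raid_profile p, int num,
			  int num_stripes, int sub_stripes)
{
	switch (p) {
	case PROFILE_RAID10:
		return num % sub_stripes + 1;
	case PROFILE_RAID1:
	case PROFILE_DUP:
		return num % num_stripes + 1;
	case PROFILE_RAID0:
	case PROFILE_SINGLE:
	default:
		/* only one copy exists, so it is always mirror 1 */
		return 1;
	}
}

/*
 * The loop over stripes is 0-based while mirror_num is 1-based, so the
 * stripe holding the known-bad copy is the one where index + 1 matches.
 */
static int is_bad_copy(int stripe_index, int mirror_num)
{
	return stripe_index + 1 == mirror_num;
}
```

For a two-device RAID1, stripe index 0 maps to mirror 1 and index 1 to mirror 2, so a value of 0 never names a real copy.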
[PATCH v6 8/8] btrfs: new ioctls to do logical-inode and inode-path resolving
these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/ioctl.c |  145 ++
 fs/btrfs/ioctl.h |   19 +++++++++++++++++++
 2 files changed, 164 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a3c4751..aac4c05 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -51,6 +51,7 @@
 #include "volumes.h"
 #include "locking.h"
 #include "inode-map.h"
+#include "backref.h"
 
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -2836,6 +2837,146 @@ static long btrfs_ioctl_scrub_progress(struct btrfs_root *root,
 	return ret;
 }
 
+static long btrfs_ioctl_ino_to_path(struct btrfs_root *root, void __user *arg)
+{
+	int ret = 0;
+	int i;
+	unsigned long rel_ptr;
+	int size;
+	struct btrfs_ioctl_ino_path_args *ipa;
+	struct inode_fs_paths *ipath = NULL;
+	struct btrfs_path *path;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ipa = memdup_user(arg, sizeof(*ipa));
+	if (IS_ERR(ipa)) {
+		ret = PTR_ERR(ipa);
+		ipa = NULL;
+		goto out;
+	}
+
+	size = min(ipa->size, 4096);
+	ipath = init_ipath(size, root, path);
+	if (IS_ERR(ipath)) {
+		ret = PTR_ERR(ipath);
+		ipath = NULL;
+		goto out;
+	}
+
+	ret = paths_from_inode(ipa->inum, ipath);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < ipath->fspath->elem_cnt; ++i) {
+		rel_ptr = ipath->fspath->str[i] - (char *)ipath->fspath->str;
+		ipath->fspath->str[i] = (void *)rel_ptr;
+	}
+
+	ret = copy_to_user(ipa->fspath, ipath->fspath, size);
+	if (ret) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+out:
+	btrfs_free_path(path);
+	free_ipath(ipath);
+	kfree(ipa);
+
+	return ret;
+}
+
+static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx)
+{
+	struct btrfs_data_container *inodes = ctx;
+
+	inodes->size -= 3 * sizeof(u64);
+	if (inodes->size > 0) {
+		inodes->val[inodes->elem_cnt] = inum;
+		inodes->val[inodes->elem_cnt + 1] = offset;
+		inodes->val[inodes->elem_cnt + 2] = root;
+		inodes->elem_cnt += 3;
+	} else {
+		inodes->elem_missed += 3;
+	}
+
+	return 0;
+}
+
+static long btrfs_ioctl_logical_to_ino(struct btrfs_root *root,
+					void __user *arg)
+{
+	int ret = 0;
+	int size;
+	u64 extent_offset;
+	struct btrfs_ioctl_logical_ino_args *loi;
+	struct btrfs_data_container *inodes = NULL;
+	struct btrfs_path *path = NULL;
+	struct btrfs_key key;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	loi = memdup_user(arg, sizeof(*loi));
+	if (IS_ERR(loi)) {
+		ret = PTR_ERR(loi);
+		loi = NULL;
+		goto out;
+	}
+
+	if (loi->size <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	size = min(loi->size, 4096);
+	inodes = init_data_container(size);
+	if (IS_ERR(inodes)) {
+		ret = PTR_ERR(inodes);
+		inodes = NULL;
+		goto out;
+	}
+
+	ret = extent_from_logical(root->fs_info, loi->logical, path, &key);
+
+	if (ret & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+		ret = -ENOENT;
+	if (ret < 0)
+		goto out;
+
+	extent_offset = loi->logical - key.objectid;
+	ret = iterate_extent_inodes(root->fs_info, path, key.objectid,
+					extent_offset, build_ino_list, inodes);
+
+	if (ret < 0)
+		goto out;
+
+	ret = copy_to_user(loi->inodes, inodes, size);
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	btrfs_free_path(path);
+	kfree(inodes);
+	kfree(loi);
+
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -2893,6 +3034,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_tree_search(file, argp);
 	case BTRFS_IOC_INO_LOOKUP:
 		return btrfs_ioctl_ino_lookup(file, argp);
+	case BTRFS_IOC_INO_PATHS:
+		return btrfs_ioctl_ino_to_path(root, argp);
+	case BTRFS_IOC_LOGICAL_INO:
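For illustration only: btrfs_ioctl_ino_to_path above rewrites each string pointer into an offset relative to the start of the pointer array before copying the buffer out, since kernel addresses are meaningless to the caller. A minimal userspace sketch of that idea follows. `struct path_container`, `to_relative`, `resolve` and `MAX_PATHS` are hypothetical names, and the layout is an assumption, not the real btrfs_data_container.

```c
#include <stdint.h>
#include <string.h>

#define MAX_PATHS 4

/* Hypothetical stand-in for the buffer handed back to user space. */
struct path_container {
	uint32_t elem_cnt;
	uintptr_t str[MAX_PATHS];	/* absolute addresses, later offsets */
	char data[64];			/* the path strings themselves */
};

/* Producer side: turn absolute addresses into offsets from str. */
static void to_relative(struct path_container *c)
{
	uint32_t i;

	for (i = 0; i < c->elem_cnt; ++i)
		c->str[i] -= (uintptr_t)c->str;
}

/* Consumer side: rebase an offset onto the consumer's own copy. */
static const char *resolve(const struct path_container *c, uint32_t i)
{
	return (const char *)c->str + c->str[i];
}
```

The consumer only needs its own address of the copied buffer; after a plain memcpy of the whole container, `resolve(&copy, 0)` points at the string inside the copy.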
[PATCH v6 1/8] btrfs: added helper functions to iterate backrefs
These helper functions iterate back references and call a function for each
backref. There is also a function to resolve an inode to a path in the file
system.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/Makefile  |    3 +-
 fs/btrfs/backref.c |  748
 fs/btrfs/backref.h |   62 +
 fs/btrfs/ioctl.h   |   10 +
 4 files changed, 822 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 9b72dcf..c63f649 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,4 +7,5 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
-	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o
+	   compression.o delayed-ref.o relocation.o delayed-inode.o backref.o \
+	   scrub.o
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
new file mode 100644
index 0000000..477f154
--- /dev/null
+++ b/fs/btrfs/backref.c
@@ -0,0 +1,748 @@
+/*
+ * Copyright (C) 2011 STRATO. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "backref.h"
+
+struct __data_ref {
+	struct list_head list;
+	u64 inum;
+	u64 root;
+	u64 extent_data_item_offset;
+};
+
+struct __shared_ref {
+	struct list_head list;
+	u64 disk_byte;
+};
+
+static int __inode_info(u64 inum, u64 ioff, u8 key_type,
+			struct btrfs_root *fs_root, struct btrfs_path *path,
+			struct btrfs_key *found_key)
+{
+	int ret;
+	struct btrfs_key key;
+	struct extent_buffer *eb;
+
+	key.type = key_type;
+	key.objectid = inum;
+	key.offset = ioff;
+
+	ret = btrfs_search_slot(NULL, fs_root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+
+	eb = path->nodes[0];
+	if (ret && path->slots[0] >= btrfs_header_nritems(eb)) {
+		ret = btrfs_next_leaf(fs_root, path);
+		if (ret)
+			return ret;
+		eb = path->nodes[0];
+	}
+
+	btrfs_item_key_to_cpu(eb, found_key, path->slots[0]);
+	if (found_key->type != key.type || found_key->objectid != key.objectid)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * this makes the path point to (inum INODE_ITEM ioff)
+ */
+int inode_item_info(u64 inum, u64 ioff, struct btrfs_root *fs_root,
+			struct btrfs_path *path)
+{
+	struct btrfs_key key;
+	return __inode_info(inum, ioff, BTRFS_INODE_ITEM_KEY, fs_root, path,
+				&key);
+}
+
+static int inode_ref_info(u64 inum, u64 ioff, struct btrfs_root *fs_root,
+				struct btrfs_path *path, int strict,
+				u64 *out_parent_inum,
+				struct extent_buffer **out_iref_eb,
+				int *out_slot)
+{
+	int ret;
+	struct btrfs_key found_key;
+
+	ret = __inode_info(inum, ioff, BTRFS_INODE_REF_KEY, fs_root, path,
+				&found_key);
+
+	if (!ret) {
+		if (out_slot)
+			*out_slot = path->slots[0];
+		if (out_iref_eb)
+			*out_iref_eb = path->nodes[0];
+		if (out_parent_inum)
+			*out_parent_inum = found_key.offset;
+	}
+
+	btrfs_release_path(path);
+	return ret;
+}
+
+/*
+ * this iterates to turn a btrfs_inode_ref into a full filesystem path. elements
+ * of the path are separated by '/' and the path is guaranteed to be
+ * 0-terminated. the path is only given within the current file system.
+ * Therefore, it never starts with a '/'. the caller is responsible to provide
+ * size bytes in dest. the dest buffer will be filled backwards. finally,
+ * the start point of the resulting string is returned. this pointer is within
+ * dest, normally.
+ * in case the path buffer would overflow, the pointer is decremented further
+ * as if output was written to the buffer, though no more output is actually
+ * generated. that way, the caller
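For illustration only, the backwards-filling scheme described in the comment above can be sketched in userspace as follows. `build_path_backwards` is a hypothetical name, and the sketch returns an offset instead of a pointer (an assumption made here so that an overflow never forms a pointer outside the buffer); the real function in backref.c is not shown in full in this message.

```c
#include <stddef.h>
#include <string.h>

/*
 * components[0] is the leaf name, components[n - 1] the topmost directory.
 * Fills dest from the end towards the front and returns the offset at which
 * the resulting string starts. If dest is too small, the offset keeps
 * decreasing below zero as if output had been written, but nothing more is
 * stored; the negative value tells the caller how many bytes were missing.
 */
static long build_path_backwards(const char *const *components, size_t n,
				 char *dest, long size)
{
	long pos = size - 1;
	size_t i;

	if (size > 0)
		dest[pos] = '\0';

	for (i = 0; i < n; ++i) {
		long len = (long)strlen(components[i]);

		pos -= len;
		if (pos >= 0)
			memcpy(dest + pos, components[i], (size_t)len);
		if (i + 1 < n) {
			--pos;
			if (pos >= 0)
				dest[pos] = '/';
		}
	}

	return pos;
}
```

Walking from the leaf upwards is the natural order when following inode refs to their parents, which is why filling from the end avoids a second pass to reverse the components.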
[PATCH v6 7/8] btrfs scrub: add fixup code for errors on nodatasum files
This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.

Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as corrected instead of
uncorrectable eventually.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/extent_io.h |    1 +
 fs/btrfs/scrub.c     |  188 --
 2 files changed, 183 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 22bf366..2734fd9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -17,6 +17,7 @@
 #define EXTENT_NODATASUM (1 << 10)
 #define EXTENT_DO_ACCOUNTING (1 << 11)
 #define EXTENT_FIRST_DELALLOC (1 << 12)
+#define EXTENT_DAMAGED (1 << 13)
 #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 41a0114..db09f01 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -22,6 +22,7 @@
 #include "volumes.h"
 #include "disk-io.h"
 #include "ordered-data.h"
+#include "transaction.h"
 #include "backref.h"
 
 /*
@@ -89,6 +90,7 @@ struct scrub_dev {
 	int			first_free;
 	int			curr;
 	atomic_t		in_flight;
+	atomic_t		fixup_cnt;
 	spinlock_t		list_lock;
 	wait_queue_head_t	list_wait;
 	u16			csum_size;
@@ -102,6 +104,14 @@ struct scrub_dev {
 	spinlock_t		stat_lock;
 };
 
+struct scrub_fixup_nodatasum {
+	struct scrub_dev	*sdev;
+	u64			logical;
+	struct btrfs_root	*root;
+	struct btrfs_work	work;
+	int			mirror_num;
+};
+
 struct scrub_warning {
 	struct btrfs_path	*path;
 	u64			extent_item_size;
@@ -190,12 +200,13 @@ struct scrub_dev *scrub_setup_dev(struct btrfs_device *dev)
 
 		if (i != SCRUB_BIOS_PER_DEV-1)
 			sdev->bios[i]->next_free = i + 1;
-		 else
+		else
 			sdev->bios[i]->next_free = -1;
 	}
 	sdev->first_free = 0;
 	sdev->curr = -1;
 	atomic_set(&sdev->in_flight, 0);
+	atomic_set(&sdev->fixup_cnt, 0);
 	atomic_set(&sdev->cancel_req, 0);
 	sdev->csum_size = btrfs_super_csum_size(&fs_info->super_copy);
 	INIT_LIST_HEAD(&sdev->csum_list);
@@ -347,6 +358,151 @@ out:
 	kfree(swarn.msg_buf);
 }
 
+static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *ctx)
+{
+	struct page *page;
+	unsigned long index;
+	struct scrub_fixup_nodatasum *fixup = ctx;
+	int ret;
+	int corrected;
+	struct btrfs_key key;
+	struct inode *inode;
+	u64 end = offset + PAGE_SIZE - 1;
+	struct btrfs_root *local_root;
+
+	key.objectid = root;
+	key.type = BTRFS_ROOT_ITEM_KEY;
+	key.offset = (u64)-1;
+	local_root = btrfs_read_fs_root_no_name(fixup->root->fs_info, &key);
+	if (IS_ERR(local_root))
+		return PTR_ERR(local_root);
+
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.objectid = inum;
+	key.offset = 0;
+	inode = btrfs_iget(fixup->root->fs_info->sb, &key, local_root, NULL);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	ret = set_extent_bit(&BTRFS_I(inode)->io_tree, offset, end,
+				EXTENT_DAMAGED, 0, NULL, NULL, GFP_NOFS);
+
+	/* set_extent_bit should either succeed or give proper error */
+	WARN_ON(ret > 0);
+	if (ret)
+		return ret < 0 ? ret : -EFAULT;
+
+	index = offset >> PAGE_CACHE_SHIFT;
+
+	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	if (!page)
+		return -ENOMEM;
+
+	ret = extent_read_full_page(&BTRFS_I(inode)->io_tree, page,
+					btrfs_get_extent, fixup->mirror_num);
+	wait_on_page_locked(page);
+	corrected = !test_range_bit(&BTRFS_I(inode)->io_tree, offset, end,
+					EXTENT_DAMAGED, 0, NULL);
+
+	if (corrected)
+		WARN_ON(!PageUptodate(page));
+	else
+		clear_extent_bit(&BTRFS_I(inode)->io_tree, offset, end,
+					EXTENT_DAMAGED, 0, 0, NULL, GFP_NOFS);
+
+	put_page(page);
+	iput(inode);
+
+	if (ret < 0)
+		return ret;
+
+	if (ret == 0 && corrected) {
+		/*
+		 * we only need to call readpage for one of the inodes belonging
+		 * to this extent. so make iterate_extent_inodes stop
+		 */
+		return 1;
+	}
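For illustration only, the "return 1 to stop the iteration" convention used by scrub_fixup_readpage can be sketched in userspace. This assumes the iterator stops on any non-zero callback return and passes that value up, which is what the comment above implies; `iterate_refs`, `struct ref`, `struct fixup_state` and `try_fixup` are hypothetical names, not the kernel's iterate_extent_inodes.

```c
#include <stddef.h>
#include <stdint.h>

/* Same callback shape as the iterator functions in this patch set. */
typedef int (*inode_cb)(uint64_t inum, uint64_t offset, uint64_t root,
			void *ctx);

struct ref {
	uint64_t inum;
	uint64_t offset;
	uint64_t root;
};

/* Calls cb for each reference; any non-zero return ends the walk. */
static int iterate_refs(const struct ref *refs, size_t n, inode_cb cb,
			void *ctx)
{
	size_t i;

	for (i = 0; i < n; ++i) {
		int ret = cb(refs[i].inum, refs[i].offset, refs[i].root, ctx);

		if (ret)
			return ret;	/* < 0: error, > 0: callback is done */
	}
	return 0;
}

struct fixup_state {
	uint64_t good_inum;	/* inode for which the re-read succeeds */
	int calls;
};

/* Pretends the fixup succeeds for exactly one inode, then asks to stop. */
static int try_fixup(uint64_t inum, uint64_t offset, uint64_t root, void *ctx)
{
	struct fixup_state *st = ctx;

	(void)offset;
	(void)root;
	++st->calls;
	return inum == st->good_inum ? 1 : 0;
}
```

With three references and the second inode marked as good, the callback runs twice and the iterator returns 1, so the remaining inodes sharing the extent are never read.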
[PATCH v6 0/8] Btrfs scrub: print path to corrupted files and trigger nodatasum fixup
Here comes the fix for the bug immediately following the very last bug in
this patch series:

Changelog v5->v6:
- fixed ioctl privilege and input sanity checking (reported by Andi Kleen)

Original message follows:

This patch set introduces two new features for scrub. They share the backref
iteration code which is the reason they made it into the same patch set.

The first feature adds printk statements in case scrub finds an error which
list all affected files. You will need patch 1, 2 and 3 for that.

The second feature adds the trigger which enables us to correct i/o errors
in case the affected extent does not have a checksum (nodatasum),
eventually. You will need patch 1, 4, 5 and 6 for that.

I tried to apply all patches to the current cmason/for-linus branch and to
Arne's current for-chris branch. They do apply with no errors (some offsets
possible).

The new ioctl()s can be tested from usermode by applying the patch series
"[PATCH v2 0/3] Btrfs-progs: add the first inspect-internal commands" from
this mailing list to the user land tools.

Please review.

Next I'm starting to make up my mind how to implement on-the-fly error
correction correctly. This will enable us to rewrite good data whenever we
encounter a bad copy. I have some preliminary patches already, the stress in
the first sentence is on "correctly". The second feature mentioned in this
patch series will then automatically use that code, too.

Changelog v1->v2:
- Various cleanup, sensible error codes as suggested by David Sterba

Changelog v2->v3:
- evaluation and iteration of shared refs
- support for in-tree refs (v2 iterated inline refs only)
- never call an iterator function without releasing the path
- iterate_irefs now returns -ENOENT in case no refs are found
- some stupid bugs removed where release_path was called too early
- ioctls added to provide new functions to user mode
- bugfixes for cases where search_slot found the very end of a leaf
- bugfix: use right fs root for readpage instead of fs_root->fs_info
- based on current cmason/for-linus

Changelog v3->v4:
- fixed a regression with mirror_num that could prevent error correction
- based on current cmason/for-linus

Changelog v4->v5:
- fixed a deadlock when fixup is taking longer while scrub is about to end

Please try it and report errors (or confirm there are none, of course). I
can provide a place to pull from if anyone likes.

-Jan

Jan Schmidt (8):
  btrfs: added helper functions to iterate backrefs
  btrfs scrub: added unverified_errors
  btrfs scrub: print paths of corrupted files
  btrfs scrub: bugfix: mirror_num off by one
  btrfs: add mirror_num to extent_read_full_page
  btrfs scrub: use int for mirror_num, not u64
  btrfs scrub: add fixup code for errors on nodatasum files
  btrfs: new ioctls to do logical-inode and inode-path resolving

 fs/btrfs/Makefile    |    3 +-
 fs/btrfs/backref.c   |  748 ++
 fs/btrfs/backref.h   |   62 +
 fs/btrfs/disk-io.c   |    2 +-
 fs/btrfs/extent_io.c |    6 +-
 fs/btrfs/extent_io.h |    3 +-
 fs/btrfs/inode.c     |    2 +-
 fs/btrfs/ioctl.c     |  145 ++
 fs/btrfs/ioctl.h     |   29 ++
 fs/btrfs/scrub.c     |  414 +---
 10 files changed, 1374 insertions(+), 40 deletions(-)
 create mode 100644 fs/btrfs/backref.c
 create mode 100644 fs/btrfs/backref.h

-- 
1.7.3.4
[PATCH v6 2/8] btrfs scrub: added unverified_errors
In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.

Userland patches carrying such conspicuous events to the administrator
should already be around.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/scrub.c |   37 ++++++++++++++++++++++++++-----------
 1 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a8d03d5..35099fa 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -201,18 +201,25 @@ nomem:
  * recheck_error gets called for every page in the bio, even though only
  * one may be bad
  */
-static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
+static int scrub_recheck_error(struct scrub_bio *sbio, int ix)
 {
+	struct scrub_dev *sdev = sbio->sdev;
+	u64 sector = (sbio->physical + ix * PAGE_SIZE) >> 9;
+
 	if (sbio->err) {
-		if (scrub_fixup_io(READ, sbio->sdev->dev->bdev,
-				   (sbio->physical + ix * PAGE_SIZE) >> 9,
+		if (scrub_fixup_io(READ, sbio->sdev->dev->bdev, sector,
 				   sbio->bio->bi_io_vec[ix].bv_page) == 0) {
 			if (scrub_fixup_check(sbio, ix) == 0)
-				return;
+				return 0;
 		}
 	}
 
+	spin_lock(&sdev->stat_lock);
+	++sdev->stat.read_errors;
+	spin_unlock(&sdev->stat_lock);
+
 	scrub_fixup(sbio, ix);
+	return 1;
 }
 
 static int scrub_fixup_check(struct scrub_bio *sbio, int ix)
@@ -382,8 +389,14 @@ static void scrub_checksum(struct btrfs_work *work)
 	int ret;
 
 	if (sbio->err) {
+		ret = 0;
 		for (i = 0; i < sbio->count; ++i)
-			scrub_recheck_error(sbio, i);
+			ret |= scrub_recheck_error(sbio, i);
+		if (!ret) {
+			spin_lock(&sdev->stat_lock);
+			++sdev->stat.unverified_errors;
+			spin_unlock(&sdev->stat_lock);
+		}
 
 		sbio->bio->bi_flags &= ~(BIO_POOL_MASK - 1);
 		sbio->bio->bi_flags |= 1 << BIO_UPTODATE;
@@ -396,10 +409,6 @@ static void scrub_checksum(struct btrfs_work *work)
 			bi->bv_offset = 0;
 			bi->bv_len = PAGE_SIZE;
 		}
-
-		spin_lock(&sdev->stat_lock);
-		++sdev->stat.read_errors;
-		spin_unlock(&sdev->stat_lock);
 		goto out;
 	}
 	for (i = 0; i < sbio->count; ++i) {
@@ -420,8 +429,14 @@ static void scrub_checksum(struct btrfs_work *work)
 			WARN_ON(1);
 		}
 		kunmap_atomic(buffer, KM_USER0);
-		if (ret)
-			scrub_recheck_error(sbio, i);
+		if (ret) {
+			ret = scrub_recheck_error(sbio, i);
+			if (!ret) {
+				spin_lock(&sdev->stat_lock);
+				++sdev->stat.unverified_errors;
+				spin_unlock(&sdev->stat_lock);
+			}
+		}
 	}
 
 out:
-- 
1.7.3.4
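For illustration only, the counting rule this patch introduces can be restated as a userspace sketch: a large read failed, every page is re-read individually, and if none of the small reads fails the event is counted as an unverified error instead of a read error. `struct scrub_stats`, `recheck_page` and `handle_failed_bio` are hypothetical names; locking and the actual fixup are left out.

```c
#include <stddef.h>

struct scrub_stats {
	unsigned long read_errors;
	unsigned long unverified_errors;
};

/*
 * page_bad[ix] != 0 means the page sized re-read of page ix failed again.
 * Returns 1 if the page is really bad (and counts a read error), else 0.
 */
static int recheck_page(const int *page_bad, size_t ix,
			struct scrub_stats *st)
{
	if (!page_bad[ix])
		return 0;
	++st->read_errors;
	return 1;
}

/* Called when the large bio failed: recheck every page it covered. */
static void handle_failed_bio(const int *page_bad, size_t count,
			      struct scrub_stats *st)
{
	int any_bad = 0;
	size_t i;

	for (i = 0; i < count; ++i)
		any_bad |= recheck_page(page_bad, i, st);

	/* the big read failed, but no single page could be blamed */
	if (!any_bad)
		++st->unverified_errors;
}
```

A bio whose pages all re-read cleanly adds one unverified error and no read errors; a bio with one persistently bad page adds one read error and no unverified error.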
[PATCH v6 6/8] btrfs scrub: use int for mirror_num, not u64
the rest of the code uses int mirror_num, and so should scrub

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/scrub.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 59caf8f..41a0114 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -65,7 +65,7 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix);
 struct scrub_page {
 	u64			flags;  /* extent flags */
 	u64			generation;
-	u64			mirror_num;
+	int			mirror_num;
 	int			have_csum;
 	u8			csum[BTRFS_CSUM_SIZE];
 };
@@ -776,7 +776,7 @@ nomem:
 }
 
 static int scrub_page(struct scrub_dev *sdev, u64 logical, u64 len,
-		      u64 physical, u64 flags, u64 gen, u64 mirror_num,
+		      u64 physical, u64 flags, u64 gen, int mirror_num,
 		      u8 *csum, int force)
 {
 	struct scrub_bio *sbio;
@@ -873,7 +873,7 @@ static int scrub_find_csum(struct scrub_dev *sdev, u64 logical, u64 len,
 
 /* scrub extent tries to collect up to 64 kB for each bio */
 static int scrub_extent(struct scrub_dev *sdev, u64 logical, u64 len,
-			u64 physical, u64 flags, u64 gen, u64 mirror_num)
+			u64 physical, u64 flags, u64 gen, int mirror_num)
 {
 	int ret;
 	u8 csum[BTRFS_CSUM_SIZE];
@@ -919,7 +919,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 	u64 physical;
 	u64 logical;
 	u64 generation;
-	u64 mirror_num;
+	int mirror_num;
 	u64 increment = map->stripe_len;
 	u64 offset;
 
-- 
1.7.3.4
[PATCH v6 5/8] btrfs: add mirror_num to extent_read_full_page
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/disk-io.c   |    2 +-
 fs/btrfs/extent_io.c |    6 +++---
 fs/btrfs/extent_io.h |    2 +-
 fs/btrfs/inode.c     |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ac8db5d..b898319 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -874,7 +874,7 @@ static int btree_readpage(struct file *file, struct page *page)
 {
 	struct extent_io_tree *tree;
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	return extent_read_full_page(tree, page, btree_get_extent);
+	return extent_read_full_page(tree, page, btree_get_extent, 0);
 }
 
 static int btree_releasepage(struct page *page, gfp_t gfp_flags)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b181a94..b78f665 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2111,16 +2111,16 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 }
 
 int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
-			    get_extent_t *get_extent)
+			    get_extent_t *get_extent, int mirror_num)
 {
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
 	int ret;
 
-	ret = __extent_read_full_page(tree, page, get_extent, &bio, 0,
+	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
 				      &bio_flags);
 	if (bio)
-		ret = submit_one_bio(READ, bio, 0, bio_flags);
+		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
 	return ret;
 }
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a11a92e..22bf366 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -177,7 +177,7 @@ int unlock_extent_cached(struct extent_io_tree *tree, u64 start, u64 end,
 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
 		    gfp_t mask);
 int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
-			  get_extent_t *get_extent);
+			  get_extent_t *get_extent, int mirror_num);
 int __init extent_io_init(void);
 void extent_io_exit(void);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a13730..730ee3d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6250,7 +6250,7 @@ int btrfs_readpage(struct file *file, struct page *page)
 {
 	struct extent_io_tree *tree;
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	return extent_read_full_page(tree, page, btrfs_get_extent);
+	return extent_read_full_page(tree, page, btrfs_get_extent, 0);
 }
 
 static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
-- 
1.7.3.4
[PATCH v7 4/8] btrfs scrub: bugfix: mirror_num off by one
Fix the mirror_num determination in scrub_stripe. The rest of the scrub code
did not use mirror_num for anything important and that error went unnoticed.
The nodatasum fixup patch of this set depends on a correct mirror_num.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/scrub.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 221fd5c..59caf8f 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -452,7 +452,7 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)
 	 * first find a good copy
 	 */
 	for (i = 0; i < multi->num_stripes; ++i) {
-		if (i == sbio->spag[ix].mirror_num)
+		if (i + 1 == sbio->spag[ix].mirror_num)
 			continue;
 
 		if (scrub_fixup_io(READ, multi->stripes[i].dev->bdev,
@@ -930,21 +930,21 @@ static noinline_for_stack int scrub_stripe(struct scrub_dev *sdev,
 	if (map->type & BTRFS_BLOCK_GROUP_RAID0) {
 		offset = map->stripe_len * num;
 		increment = map->stripe_len * map->num_stripes;
-		mirror_num = 0;
+		mirror_num = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
 		int factor = map->num_stripes / map->sub_stripes;
 		offset = map->stripe_len * (num / map->sub_stripes);
 		increment = map->stripe_len * factor;
-		mirror_num = num % map->sub_stripes;
+		mirror_num = num % map->sub_stripes + 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
 		increment = map->stripe_len;
-		mirror_num = num % map->num_stripes;
+		mirror_num = num % map->num_stripes + 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
 		increment = map->stripe_len;
-		mirror_num = num % map->num_stripes;
+		mirror_num = num % map->num_stripes + 1;
 	} else {
 		increment = map->stripe_len;
-		mirror_num = 0;
+		mirror_num = 1;
 	}
 
 	path = btrfs_alloc_path();
-- 
1.7.3.4
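Editorial illustration (not part of the patch): the fix above amounts to treating mirror_num as 1-based while stripe indices stay 0-based. A minimal user-space sketch of that convention follows. The enum and both helper names are hypothetical and only mirror the arithmetic visible in the diff; they are not kernel API.

```c
/* Hypothetical stand-ins for the block group profiles tested in the diff. */
enum profile {
	PROFILE_SINGLE,
	PROFILE_RAID0,
	PROFILE_RAID1,
	PROFILE_RAID10,
	PROFILE_DUP
};

/*
 * Return the 1-based mirror number for stripe index num, following the
 * "+ 1" arithmetic of the patched scrub_stripe().
 */
static int mirror_num_for_stripe(enum profile p, int num,
				 int num_stripes, int sub_stripes)
{
	switch (p) {
	case PROFILE_RAID10:
		return num % sub_stripes + 1;
	case PROFILE_RAID1:
	case PROFILE_DUP:
		return num % num_stripes + 1;
	default:
		/* RAID0 and single: there is only one copy */
		return 1;
	}
}

/*
 * A 0-based stripe index i refers to the copy that failed when
 * i + 1 equals the 1-based mirror number, as in the scrub_fixup() hunk.
 */
static int is_failed_copy(int stripe_index, int mirror_num)
{
	return stripe_index + 1 == mirror_num;
}
```

With two RAID1 stripes, stripe index 1 maps to mirror 2, and the loop over stripe indices skips index 1 when mirror 2 is the bad copy.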
[PATCH v7 0/8] Btrfs scrub: print path to corrupted files and trigger nodatasum fixup
Please ignore v6, was sent while only half way through :-(

Changelog v6->v7:
- include everything that was stated to be in v6

Changelog v5->v6:
- fixed ioctl privilege and input sanity checking (reported by Andi Kleen)

Original message follows:

This patch set introduces two new features for scrub. They share the backref
iteration code which is the reason they made it into the same patch set.

The first feature adds printk statements in case scrub finds an error which
list all affected files. You will need patch 1, 2 and 3 for that.

The second feature adds the trigger which enables us to correct i/o errors in
case the affected extent does not have a checksum (nodatasum), eventually.
You will need patch 1, 4, 5 and 6 for that.

I tried to apply all patches to the current cmason/for-linus branch and to
Arne's current for-chris branch. They do apply with no errors (some offsets
possible).

The new ioctl()s can be tested from usermode by applying the patch series
"[PATCH v2 0/3] Btrfs-progs: add the first inspect-internal commands" from
this mailing list to the user land tools.

Please review. Next I'm starting to make up my mind how to implement
on-the-fly error correction correctly. This will enable us to rewrite good
data whenever we encounter a bad copy. I have some preliminary patches
already, the stress in the first sentence is on "correctly". The second
feature mentioned in this patch series will then automatically use that
code, too.

Changelog v1->v2:
- Various cleanup, sensible error codes as suggested by David Sterba

Changelog v2->v3:
- evaluation and iteration of shared refs
- support for in-tree refs (v2 iterated inline refs only)
- never call an iterator function without releasing the path
- iterate_irefs now returns -ENOENT in case no refs are found
- some stupid bugs removed where release_path was called too early
- ioctls added to provide new functions to user mode
- bugfixes for cases where search_slot found the very end of a leaf
- bugfix: use right fs root for readpage instead of fs_root->fs_info
- based on current cmason/for-linus

Changelog v3->v4:
- fixed a regression with mirror_num that could prevent error correction
- based on current cmason/for-linus

Changelog v4->v5:
- fixed a deadlock when fixup is taking longer while scrub is about to end

Please try it and report errors (or confirm there are none, of course). I can
provide a place to pull from if anyone likes.

-Jan

Jan Schmidt (8):
  btrfs: added helper functions to iterate backrefs
  btrfs scrub: added unverified_errors
  btrfs scrub: print paths of corrupted files
  btrfs scrub: bugfix: mirror_num off by one
  btrfs: add mirror_num to extent_read_full_page
  btrfs scrub: use int for mirror_num, not u64
  btrfs scrub: add fixup code for errors on nodatasum files
  btrfs: new ioctls to do logical-inode and inode-path resolving

 fs/btrfs/Makefile    |    3 +-
 fs/btrfs/backref.c   |  748 ++
 fs/btrfs/backref.h   |   62 +
 fs/btrfs/disk-io.c   |    2 +-
 fs/btrfs/extent_io.c |    6 +-
 fs/btrfs/extent_io.h |    3 +-
 fs/btrfs/inode.c     |    2 +-
 fs/btrfs/ioctl.c     |  150 ++
 fs/btrfs/ioctl.h     |   29 ++
 fs/btrfs/scrub.c     |  414 +---
 10 files changed, 1379 insertions(+), 40 deletions(-)
 create mode 100644 fs/btrfs/backref.c
 create mode 100644 fs/btrfs/backref.h

-- 
1.7.3.4
[PATCH v7 8/8] btrfs: new ioctls to do logical-inode and inode-path resolving
these ioctls make use of the new functions initially added for scrub. they
return all inodes belonging to a logical address (BTRFS_IOC_LOGICAL_INO) and
all paths belonging to an inode (BTRFS_IOC_INO_PATHS).

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/ioctl.c |  150 ++
 fs/btrfs/ioctl.h |   19 +++
 2 files changed, 169 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a3c4751..798c8ed 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -51,6 +51,7 @@
 #include "volumes.h"
 #include "locking.h"
 #include "inode-map.h"
+#include "backref.h"
 
 /* Mask out flags that are inappropriate for the given type of inode. */
 static inline __u32 btrfs_mask_flags(umode_t mode, __u32 flags)
@@ -2836,6 +2837,151 @@ static long btrfs_ioctl_scrub_progress(struct btrfs_root *root,
 	return ret;
 }
 
+static long btrfs_ioctl_ino_to_path(struct btrfs_root *root, void __user *arg)
+{
+	int ret = 0;
+	int i;
+	unsigned long rel_ptr;
+	int size;
+	struct btrfs_ioctl_ino_path_args *ipa;
+	struct inode_fs_paths *ipath = NULL;
+	struct btrfs_path *path;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ipa = memdup_user(arg, sizeof(*ipa));
+	if (IS_ERR(ipa)) {
+		ret = PTR_ERR(ipa);
+		ipa = NULL;
+		goto out;
+	}
+
+	if (ipa->size <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	size = min(ipa->size, 4096);
+	ipath = init_ipath(size, root, path);
+	if (IS_ERR(ipath)) {
+		ret = PTR_ERR(ipath);
+		ipath = NULL;
+		goto out;
+	}
+
+	ret = paths_from_inode(ipa->inum, ipath);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < ipath->fspath->elem_cnt; ++i) {
+		rel_ptr = ipath->fspath->str[i] - (char *)ipath->fspath->str;
+		ipath->fspath->str[i] = (void *)rel_ptr;
+	}
+
+	ret = copy_to_user(ipa->fspath, ipath->fspath, size);
+	if (ret) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+out:
+	btrfs_free_path(path);
+	free_ipath(ipath);
+	kfree(ipa);
+
+	return ret;
+}
+
+static int build_ino_list(u64 inum, u64 offset, u64 root, void *ctx)
+{
+	struct btrfs_data_container *inodes = ctx;
+
+	inodes->size -= 3 * sizeof(u64);
+	if (inodes->size > 0) {
+		inodes->val[inodes->elem_cnt] = inum;
+		inodes->val[inodes->elem_cnt + 1] = offset;
+		inodes->val[inodes->elem_cnt + 2] = root;
+		inodes->elem_cnt += 3;
+	} else {
+		inodes->elem_missed += 3;
+	}
+
+	return 0;
+}
+
+static long btrfs_ioctl_logical_to_ino(struct btrfs_root *root,
+					void __user *arg)
+{
+	int ret = 0;
+	int size;
+	u64 extent_offset;
+	struct btrfs_ioctl_logical_ino_args *loi;
+	struct btrfs_data_container *inodes = NULL;
+	struct btrfs_path *path = NULL;
+	struct btrfs_key key;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	loi = memdup_user(arg, sizeof(*loi));
+	if (IS_ERR(loi)) {
+		ret = PTR_ERR(loi);
+		loi = NULL;
+		goto out;
+	}
+
+	if (loi->size <= 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	path = btrfs_alloc_path();
+	if (!path) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	size = min(loi->size, 4096);
+	inodes = init_data_container(size);
+	if (IS_ERR(inodes)) {
+		ret = PTR_ERR(inodes);
+		inodes = NULL;
+		goto out;
+	}
+
+	ret = extent_from_logical(root->fs_info, loi->logical, path, &key);
+
+	if (ret & BTRFS_EXTENT_FLAG_TREE_BLOCK)
+		ret = -ENOENT;
+	if (ret < 0)
+		goto out;
+
+	extent_offset = loi->logical - key.objectid;
+	ret = iterate_extent_inodes(root->fs_info, path, key.objectid,
+					extent_offset, build_ino_list, inodes);
+
+	if (ret < 0)
+		goto out;
+
+	ret = copy_to_user(loi->inodes, inodes, size);
+	if (ret)
+		ret = -EFAULT;
+
+out:
+	btrfs_free_path(path);
+	kfree(inodes);
+	kfree(loi);
+
+	return ret;
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
 		cmd, unsigned long arg)
 {
@@ -2893,6 +3039,10 @@ long btrfs_ioctl(struct file *file, unsigned int
 		return btrfs_ioctl_tree_search(file, argp);
 	case BTRFS_IOC_INO_LOOKUP:
 		return btrfs_ioctl_ino_lookup(file, argp);
+	case
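Editorial illustration (not part of the patch): both ioctls above reject a non-positive size before clamping it to 4096, which is the input sanity check requested in review of v5. A minimal user-space sketch of that check follows. The helper name checked_buf_size and its int parameter types are assumptions made for the example; the kernel code does this inline with min().

```c
#include <errno.h>

/*
 * Validate a caller-supplied signed buffer size and clamp it to limit.
 * Returns the usable size, or -EINVAL when the request is zero or
 * negative. Clamping alone would let a negative value through, because
 * the minimum of a negative number and 4096 is still negative.
 */
static int checked_buf_size(int requested, int limit)
{
	if (requested <= 0)
		return -EINVAL;
	return requested < limit ? requested : limit;
}
```

The order matters: the sign check has to come before the clamp, otherwise a negative request would reach the allocation unchanged.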
[PATCH v7 3/8] btrfs scrub: print paths of corrupted files
While scrubbing, we may encounter various errors. Previously, a logical
address was printed to the log only. Now, all paths belonging to that
address are resolved and printed separately. That should work for hardlinks
as well as reflinks.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/scrub.c |  169 --
 1 files changed, 163 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 35099fa..221fd5c 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -17,10 +17,12 @@
  */
 
 #include <linux/blkdev.h>
+#include <linux/ratelimit.h>
 #include "ctree.h"
 #include "volumes.h"
 #include "disk-io.h"
 #include "ordered-data.h"
+#include "backref.h"
 
 /*
  * This is only the first step towards a full-features scrub. It reads all
@@ -100,6 +102,19 @@ struct scrub_dev {
 	spinlock_t		stat_lock;
 };
 
+struct scrub_warning {
+	struct btrfs_path	*path;
+	u64			extent_item_size;
+	char			*scratch_buf;
+	char			*msg_buf;
+	const char		*errstr;
+	sector_t		sector;
+	u64			logical;
+	struct btrfs_device	*dev;
+	int			msg_bufsize;
+	int			scratch_bufsize;
+};
+
 static void scrub_free_csums(struct scrub_dev *sdev)
 {
 	while (!list_empty(&sdev->csum_list)) {
@@ -195,6 +210,143 @@ nomem:
 	return ERR_PTR(-ENOMEM);
 }
 
+static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, void *ctx)
+{
+	u64 isize;
+	u32 nlink;
+	int ret;
+	int i;
+	struct extent_buffer *eb;
+	struct btrfs_inode_item *inode_item;
+	struct scrub_warning *swarn = ctx;
+	struct btrfs_fs_info *fs_info = swarn->dev->dev_root->fs_info;
+	struct inode_fs_paths *ipath = NULL;
+	struct btrfs_root *local_root;
+	struct btrfs_key root_key;
+
+	root_key.objectid = root;
+	root_key.type = BTRFS_ROOT_ITEM_KEY;
+	root_key.offset = (u64)-1;
+	local_root = btrfs_read_fs_root_no_name(fs_info, &root_key);
+	if (IS_ERR(local_root)) {
+		ret = PTR_ERR(local_root);
+		goto err;
+	}
+
+	ret = inode_item_info(inum, 0, local_root, swarn->path);
+	if (ret) {
+		btrfs_release_path(swarn->path);
+		goto err;
+	}
+
+	eb = swarn->path->nodes[0];
+	inode_item = btrfs_item_ptr(eb, swarn->path->slots[0],
+					struct btrfs_inode_item);
+	isize = btrfs_inode_size(eb, inode_item);
+	nlink = btrfs_inode_nlink(eb, inode_item);
+	btrfs_release_path(swarn->path);
+
+	ipath = init_ipath(4096, local_root, swarn->path);
+	ret = paths_from_inode(inum, ipath);
+
+	if (ret < 0)
+		goto err;
+
+	/*
+	 * we deliberately ignore the bit ipath might have been too small to
+	 * hold all of the paths here
+	 */
+	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
+		printk(KERN_WARNING "btrfs: %s at logical %llu on dev "
+			"%s, sector %llu, root %llu, inode %llu, offset %llu, "
+			"length %llu, links %u (path: %s)\n", swarn->errstr,
+			swarn->logical, swarn->dev->name,
+			(unsigned long long)swarn->sector, root, inum, offset,
+			min(isize - offset, (u64)PAGE_SIZE), nlink,
+			ipath->fspath->str[i]);
+
+	free_ipath(ipath);
+	return 0;
+
+err:
+	printk(KERN_WARNING "btrfs: %s at logical %llu on dev "
+		"%s, sector %llu, root %llu, inode %llu, offset %llu: path "
+		"resolving failed with ret=%d\n", swarn->errstr,
+		swarn->logical, swarn->dev->name,
+		(unsigned long long)swarn->sector, root, inum, offset, ret);
+
+	free_ipath(ipath);
+	return 0;
+}
+
+static void scrub_print_warning(const char *errstr, struct scrub_bio *sbio,
+				int ix)
+{
+	struct btrfs_device *dev = sbio->sdev->dev;
+	struct btrfs_fs_info *fs_info = dev->dev_root->fs_info;
+	struct btrfs_path *path;
+	struct btrfs_key found_key;
+	struct extent_buffer *eb;
+	struct btrfs_extent_item *ei;
+	struct scrub_warning swarn;
+	u32 item_size;
+	int ret;
+	u64 ref_root;
+	u8 ref_level;
+	unsigned long ptr = 0;
+	const int bufsize = 4096;
+	u64 extent_offset;
+
+	path = btrfs_alloc_path();
+
+	swarn.scratch_buf = kmalloc(bufsize, GFP_NOFS);
+	swarn.msg_buf = kmalloc(bufsize, GFP_NOFS);
+	swarn.sector = (sbio->physical + ix * PAGE_SIZE) >> 9;
+	swarn.logical = sbio->logical + ix * PAGE_SIZE;
+	swarn.errstr = errstr;
+	swarn.dev = dev;
+	swarn.msg_bufsize = bufsize;
+	swarn.scratch_bufsize = bufsize;
+
+
[PATCH v7 2/8] btrfs scrub: added unverified_errors
In normal operation, scrub is reading data sequentially in large portions.
In case of an i/o error, we try to find the corrupted area(s) by issuing
page sized read requests. With this commit we increment the
unverified_errors counter if all of the small size requests succeed.

Userland patches carrying such conspicuous events to the administrator should
already be around.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/scrub.c |   37 ++++++++++++++++++++++++++-----------
 1 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a8d03d5..35099fa 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -201,18 +201,25 @@ nomem:
  * recheck_error gets called for every page in the bio, even though only
  * one may be bad
  */
-static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
+static int scrub_recheck_error(struct scrub_bio *sbio, int ix)
 {
+	struct scrub_dev *sdev = sbio->sdev;
+	u64 sector = (sbio->physical + ix * PAGE_SIZE) >> 9;
+
 	if (sbio->err) {
-		if (scrub_fixup_io(READ, sbio->sdev->dev->bdev,
-				   (sbio->physical + ix * PAGE_SIZE) >> 9,
+		if (scrub_fixup_io(READ, sbio->sdev->dev->bdev, sector,
 				   sbio->bio->bi_io_vec[ix].bv_page) == 0) {
 			if (scrub_fixup_check(sbio, ix) == 0)
-				return;
+				return 0;
 		}
 	}
 
+	spin_lock(&sdev->stat_lock);
+	++sdev->stat.read_errors;
+	spin_unlock(&sdev->stat_lock);
+
 	scrub_fixup(sbio, ix);
+	return 1;
 }
 
 static int scrub_fixup_check(struct scrub_bio *sbio, int ix)
@@ -382,8 +389,14 @@ static void scrub_checksum(struct btrfs_work *work)
 	int ret;
 
 	if (sbio->err) {
+		ret = 0;
 		for (i = 0; i < sbio->count; ++i)
-			scrub_recheck_error(sbio, i);
+			ret |= scrub_recheck_error(sbio, i);
+		if (!ret) {
+			spin_lock(&sdev->stat_lock);
+			++sdev->stat.unverified_errors;
+			spin_unlock(&sdev->stat_lock);
+		}
 
 		sbio->bio->bi_flags &= ~(BIO_POOL_MASK - 1);
 		sbio->bio->bi_flags |= 1 << BIO_UPTODATE;
@@ -396,10 +409,6 @@ static void scrub_checksum(struct btrfs_work *work)
 			bi->bv_offset = 0;
 			bi->bv_len = PAGE_SIZE;
 		}
-
-		spin_lock(&sdev->stat_lock);
-		++sdev->stat.read_errors;
-		spin_unlock(&sdev->stat_lock);
 		goto out;
 	}
 	for (i = 0; i < sbio->count; ++i) {
@@ -420,8 +429,14 @@ static void scrub_checksum(struct btrfs_work *work)
 			WARN_ON(1);
 		}
 		kunmap_atomic(buffer, KM_USER0);
-		if (ret)
-			scrub_recheck_error(sbio, i);
+		if (ret) {
+			ret = scrub_recheck_error(sbio, i);
+			if (!ret) {
+				spin_lock(&sdev->stat_lock);
+				++sdev->stat.unverified_errors;
+				spin_unlock(&sdev->stat_lock);
+			}
+		}
 	}
 
 out:
-- 
1.7.3.4
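Editorial illustration (not part of the patch): the accounting for a bio that failed as a whole can be read as "count a read error for every page that is still bad on recheck, and count one unverified error if no page was bad". A minimal user-space sketch follows. struct scrub_stats and account_failed_bio are hypothetical names, the per-page recheck results are passed in as a plain array, and locking is omitted.

```c
/* Hypothetical counters; the kernel keeps these behind stat_lock. */
struct scrub_stats {
	unsigned long read_errors;
	unsigned long unverified_errors;
};

/*
 * page_bad[i] is 1 when rechecking page i still fails and 0 when the
 * page sized re-read succeeded. This follows the sbio->err branch of
 * scrub_checksum() after the patch.
 */
static void account_failed_bio(struct scrub_stats *st,
			       const int *page_bad, int count)
{
	int any_bad = 0;
	int i;

	for (i = 0; i < count; ++i) {
		if (page_bad[i]) {
			++st->read_errors;
			any_bad = 1;
		}
	}

	/* the large read failed, yet every small read was fine */
	if (!any_bad)
		++st->unverified_errors;
}
```

One consequence of this scheme is that read_errors counts pages, while unverified_errors counts whole bios in this branch.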
[PATCH v7 5/8] btrfs: add mirror_num to extent_read_full_page
Currently, extent_read_full_page always assumes we are trying to read mirror
0, which generally is the best we can do. To add flexibility, pass it as a
parameter. This will be needed by scrub fixup code.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/disk-io.c   |    2 +-
 fs/btrfs/extent_io.c |    6 +++---
 fs/btrfs/extent_io.h |    2 +-
 fs/btrfs/inode.c     |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1ac8db5d..b898319 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -874,7 +874,7 @@ static int btree_readpage(struct file *file, struct page *page)
 {
 	struct extent_io_tree *tree;
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	return extent_read_full_page(tree, page, btree_get_extent);
+	return extent_read_full_page(tree, page, btree_get_extent, 0);
 }
 
 static int btree_releasepage(struct page *page, gfp_t gfp_flags)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b181a94..b78f665 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2111,16 +2111,16 @@ static int __extent_read_full_page(struct extent_io_tree *tree,
 }
 
 int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
-			    get_extent_t *get_extent)
+			    get_extent_t *get_extent, int mirror_num)
 {
 	struct bio *bio = NULL;
 	unsigned long bio_flags = 0;
 	int ret;
 
-	ret = __extent_read_full_page(tree, page, get_extent, &bio, 0,
+	ret = __extent_read_full_page(tree, page, get_extent, &bio, mirror_num,
 				      &bio_flags);
 	if (bio)
-		ret = submit_one_bio(READ, bio, 0, bio_flags);
+		ret = submit_one_bio(READ, bio, mirror_num, bio_flags);
 	return ret;
 }
 
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index a11a92e..22bf366 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -177,7 +177,7 @@ int unlock_extent_cached(struct extent_io_tree *tree, u64 start, u64 end,
 int try_lock_extent(struct extent_io_tree *tree, u64 start, u64 end,
 		    gfp_t mask);
 int extent_read_full_page(struct extent_io_tree *tree, struct page *page,
-			  get_extent_t *get_extent);
+			  get_extent_t *get_extent, int mirror_num);
 int __init extent_io_init(void);
 void extent_io_exit(void);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a13730..730ee3d 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6250,7 +6250,7 @@ int btrfs_readpage(struct file *file, struct page *page)
 {
 	struct extent_io_tree *tree;
 	tree = &BTRFS_I(page->mapping->host)->io_tree;
-	return extent_read_full_page(tree, page, btrfs_get_extent);
+	return extent_read_full_page(tree, page, btrfs_get_extent, 0);
 }
 
 static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
-- 
1.7.3.4
[PATCH v7 7/8] btrfs scrub: add fixup code for errors on nodatasum files
This removes a FIXME comment and introduces the first part of nodatasum
fixup: It gets the corresponding inode for a logical address and triggers a
regular readpage for the corrupted sector.

Once we have on-the-fly error correction our error will be automatically
corrected. The correction code is expected to clear the newly introduced
EXTENT_DAMAGED flag, making scrub report that error as corrected instead
of uncorrectable eventually.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/extent_io.h |    1 +
 fs/btrfs/scrub.c     |  188 --
 2 files changed, 183 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 22bf366..2734fd9 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -17,6 +17,7 @@
 #define EXTENT_NODATASUM (1 << 10)
 #define EXTENT_DO_ACCOUNTING (1 << 11)
 #define EXTENT_FIRST_DELALLOC (1 << 12)
+#define EXTENT_DAMAGED (1 << 13)
 #define EXTENT_IOBITS (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 41a0114..db09f01 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -22,6 +22,7 @@
 #include "volumes.h"
 #include "disk-io.h"
 #include "ordered-data.h"
+#include "transaction.h"
 #include "backref.h"
 
 /*
@@ -89,6 +90,7 @@ struct scrub_dev {
 	int			first_free;
 	int			curr;
 	atomic_t		in_flight;
+	atomic_t		fixup_cnt;
 	spinlock_t		list_lock;
 	wait_queue_head_t	list_wait;
 	u16			csum_size;
@@ -102,6 +104,14 @@ struct scrub_dev {
 	spinlock_t		stat_lock;
 };
 
+struct scrub_fixup_nodatasum {
+	struct scrub_dev	*sdev;
+	u64			logical;
+	struct btrfs_root	*root;
+	struct btrfs_work	work;
+	int			mirror_num;
+};
+
 struct scrub_warning {
 	struct btrfs_path	*path;
 	u64			extent_item_size;
@@ -190,12 +200,13 @@ struct scrub_dev *scrub_setup_dev(struct btrfs_device *dev)
 
 		if (i != SCRUB_BIOS_PER_DEV-1)
 			sdev->bios[i]->next_free = i + 1;
-		 else
+		else
 			sdev->bios[i]->next_free = -1;
 	}
 	sdev->first_free = 0;
 	sdev->curr = -1;
 	atomic_set(&sdev->in_flight, 0);
+	atomic_set(&sdev->fixup_cnt, 0);
 	atomic_set(&sdev->cancel_req, 0);
 	sdev->csum_size = btrfs_super_csum_size(&fs_info->super_copy);
 	INIT_LIST_HEAD(&sdev->csum_list);
@@ -347,6 +358,151 @@ out:
 	kfree(swarn.msg_buf);
 }
 
+static int scrub_fixup_readpage(u64 inum, u64 offset, u64 root, void *ctx)
+{
+	struct page *page;
+	unsigned long index;
+	struct scrub_fixup_nodatasum *fixup = ctx;
+	int ret;
+	int corrected;
+	struct btrfs_key key;
+	struct inode *inode;
+	u64 end = offset + PAGE_SIZE - 1;
+	struct btrfs_root *local_root;
+
+	key.objectid = root;
+	key.type = BTRFS_ROOT_ITEM_KEY;
+	key.offset = (u64)-1;
+	local_root = btrfs_read_fs_root_no_name(fixup->root->fs_info, &key);
+	if (IS_ERR(local_root))
+		return PTR_ERR(local_root);
+
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.objectid = inum;
+	key.offset = 0;
+	inode = btrfs_iget(fixup->root->fs_info->sb, &key, local_root, NULL);
+	if (IS_ERR(inode))
+		return PTR_ERR(inode);
+
+	ret = set_extent_bit(&BTRFS_I(inode)->io_tree, offset, end,
+				EXTENT_DAMAGED, 0, NULL, NULL, GFP_NOFS);
+
+	/* set_extent_bit should either succeed or give proper error */
+	WARN_ON(ret > 0);
+	if (ret)
+		return ret < 0 ? ret : -EFAULT;
+
+	index = offset >> PAGE_CACHE_SHIFT;
+
+	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
+	if (!page)
+		return -ENOMEM;
+
+	ret = extent_read_full_page(&BTRFS_I(inode)->io_tree, page,
+					btrfs_get_extent, fixup->mirror_num);
+	wait_on_page_locked(page);
+	corrected = !test_range_bit(&BTRFS_I(inode)->io_tree, offset, end,
+					EXTENT_DAMAGED, 0, NULL);
+
+	if (corrected)
+		WARN_ON(!PageUptodate(page));
+	else
+		clear_extent_bit(&BTRFS_I(inode)->io_tree, offset, end,
+					EXTENT_DAMAGED, 0, 0, NULL, GFP_NOFS);
+
+	put_page(page);
+	iput(inode);
+
+	if (ret < 0)
+		return ret;
+
+	if (ret == 0 && corrected) {
+		/*
+		 * we only need to call readpage for one of the inodes belonging
+		 * to this extent. so make iterate_extent_inodes stop
+		 */
+		return 1;
+	}
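Editorial illustration (not part of the patch): scrub_fixup_readpage returns 1 once one inode has been read successfully, with the stated intent of making iterate_extent_inodes stop. A minimal user-space sketch of such a callback convention follows. iterate_inodes, try_fixup and struct fixup_ctx are hypothetical; in particular, how iterate_extent_inodes itself treats and propagates a positive return value is an assumption here, since that code is not shown in this message.

```c
/* Callback type: 0 = keep going, > 0 = stop early, < 0 = error. */
typedef int (iterate_fn)(unsigned long long inum, void *ctx);

/*
 * Call fn for each inode number until it returns non-zero, then hand
 * that value back to the caller unchanged (assumed behaviour).
 */
static int iterate_inodes(const unsigned long long *inums, int count,
			  iterate_fn *fn, void *ctx)
{
	int ret = 0;
	int i;

	for (i = 0; i < count; ++i) {
		ret = fn(inums[i], ctx);
		if (ret)
			break;
	}
	return ret;
}

/* Hypothetical context: which inode can be fixed, and a call counter. */
struct fixup_ctx {
	unsigned long long good_inum;
	int calls;
};

/* Pretend the fixup succeeds for exactly one inode and stop there. */
static int try_fixup(unsigned long long inum, void *ctx)
{
	struct fixup_ctx *f = ctx;

	++f->calls;
	return inum == f->good_inum ? 1 : 0;
}
```

With inodes 10, 20 and 30 sharing an extent and the fixup succeeding on 20, the callback runs twice and inode 30 is never touched.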
[PATCH v7 1/8] btrfs: added helper functions to iterate backrefs
These helper functions iterate back references and call a function for each
backref. There is also a function to resolve an inode to a path in the
file system.

Signed-off-by: Jan Schmidt <list.bt...@jan-o-sch.net>
---
 fs/btrfs/Makefile  |    3 +-
 fs/btrfs/backref.c |  748
 fs/btrfs/backref.h |   62 +
 fs/btrfs/ioctl.h   |   10 +
 4 files changed, 822 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 9b72dcf..c63f649 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,4 +7,5 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
 	   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
 	   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
 	   export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
-	   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o
+	   compression.o delayed-ref.o relocation.o delayed-inode.o backref.o \
+	   scrub.o
diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
new file mode 100644
index 0000000..477f154
--- /dev/null
+++ b/fs/btrfs/backref.c
@@ -0,0 +1,748 @@
+/*
+ * Copyright (C) 2011 STRATO.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include "ctree.h"
+#include "disk-io.h"
+#include "backref.h"
+
+struct __data_ref {
+	struct list_head list;
+	u64 inum;
+	u64 root;
+	u64 extent_data_item_offset;
+};
+
+struct __shared_ref {
+	struct list_head list;
+	u64 disk_byte;
+};
+
+static int __inode_info(u64 inum, u64 ioff, u8 key_type,
+			struct btrfs_root *fs_root, struct btrfs_path *path,
+			struct btrfs_key *found_key)
+{
+	int ret;
+	struct btrfs_key key;
+	struct extent_buffer *eb;
+
+	key.type = key_type;
+	key.objectid = inum;
+	key.offset = ioff;
+
+	ret = btrfs_search_slot(NULL, fs_root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+
+	eb = path->nodes[0];
+	if (ret && path->slots[0] >= btrfs_header_nritems(eb)) {
+		ret = btrfs_next_leaf(fs_root, path);
+		if (ret)
+			return ret;
+		eb = path->nodes[0];
+	}
+
+	btrfs_item_key_to_cpu(eb, found_key, path->slots[0]);
+	if (found_key->type != key.type || found_key->objectid != key.objectid)
+		return 1;
+
+	return 0;
+}
+
+/*
+ * this makes the path point to (inum INODE_ITEM ioff)
+ */
+int inode_item_info(u64 inum, u64 ioff, struct btrfs_root *fs_root,
+			struct btrfs_path *path)
+{
+	struct btrfs_key key;
+	return __inode_info(inum, ioff, BTRFS_INODE_ITEM_KEY, fs_root, path,
+				&key);
+}
+
+static int inode_ref_info(u64 inum, u64 ioff, struct btrfs_root *fs_root,
+				struct btrfs_path *path, int strict,
+				u64 *out_parent_inum,
+				struct extent_buffer **out_iref_eb,
+				int *out_slot)
+{
+	int ret;
+	struct btrfs_key found_key;
+
+	ret = __inode_info(inum, ioff, BTRFS_INODE_REF_KEY, fs_root, path,
+				&found_key);
+
+	if (!ret) {
+		if (out_slot)
+			*out_slot = path->slots[0];
+		if (out_iref_eb)
+			*out_iref_eb = path->nodes[0];
+		if (out_parent_inum)
+			*out_parent_inum = found_key.offset;
+	}
+
+	btrfs_release_path(path);
+	return ret;
+}
+
+/*
+ * this iterates to turn a btrfs_inode_ref into a full filesystem path. elements
+ * of the path are separated by '/' and the path is guaranteed to be
+ * 0-terminated. the path is only given within the current file system.
+ * Therefore, it never starts with a '/'. the caller is responsible to provide
+ * "size" bytes in "dest". the dest buffer will be filled backwards. finally,
+ * the start point of the resulting string is returned. this pointer is within
+ * dest, normally.
+ * in case the path buffer would overflow, the pointer is decremented further
+ * as if output was written to the buffer, though no more output is actually
+ * generated. that way, the caller
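Editorial illustration (not part of the patch): the comment above describes filling the destination buffer backwards and continuing to move the start position even after the buffer is exhausted. A minimal user-space sketch of that technique follows. fill_path_backwards is a hypothetical name, the path components are passed as a plain array ordered from the file name up to the topmost directory, and a signed offset is returned instead of a pointer so that the overflow case stays well defined in portable C.

```c
#include <string.h>

/*
 * Build "top/dir/file" at the end of dest from names[] ordered leaf
 * first. Returns the offset of the first character of the path. A
 * negative return value means the buffer was too small: nothing more
 * was written once space ran out, and size - offset is the number of
 * bytes that would have been needed.
 */
static long fill_path_backwards(const char *const *names, int count,
				char *dest, long size)
{
	long pos = size - 1;	/* position of the terminating 0 byte */
	int i;

	if (size > 0)
		dest[pos] = '\0';

	for (i = 0; i < count; ++i) {
		long len = (long)strlen(names[i]);

		pos -= len;
		if (pos >= 0)
			memcpy(dest + pos, names[i], (size_t)len);

		/* separator between this component and its parent */
		if (i + 1 < count) {
			--pos;
			if (pos >= 0)
				dest[pos] = '/';
		}
	}
	return pos;
}
```

Working from the end of the buffer suits a walk that starts at the file and moves towards the root, because each newly found parent is simply placed in front of what is already there.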
Re: Broken btrfs?
On 07/22/2011 09:24 AM, Jan Schmidt wrote:
> Scrub should be printing inode numbers to your system log while detecting
> those errors. If you want to know the exact files corrupted, you can grab
> my patch set with subject "Btrfs scrub: print path to corrupted files and
> trigger nodatasum fixup" from the list and give it a try.

Cool Jan, this is exactly what I asked for in my original post. Your patch
set is against kernel sources (not btrfs-progs), right? I took the
opportunity to upgrade to official 3.0, where your patch applied and
compiled without any issues. I also recompiled btrfs-progs-unstable and ran
a scrub. This scrub completed without any errors:

# btrfs scrub status .
scrub status for 03201fc0-7695-4468-9a10-f61ad79f23ca
	scrub started at Fri Jul 22 14:24:21 2011, running for 706 seconds
	total bytes scrubbed: 158.01GB with 0 errors

Isn't this strange? This message was generated after rebooting the box (due
to a crash, see below). I remember having seen some more information before
the crash, but also 0 errors.

While doing the scrub I still saw csum errors in my dmesg, but no files
associated:

Jul 22 14:17:50 toral kernel: btrfs no csum found for inode 199934 start 729088
Jul 22 14:17:50 toral kernel: btrfs csum failed ino 199934 off 729088 csum 3390946210 private 0
Jul 22 14:17:51 toral kernel: btrfs no csum found for inode 199934 start 24096768
Jul 22 14:17:51 toral kernel: btrfs csum failed ino 199934 off 24096768 csum 439962552 private 0
Jul 22 14:17:51 toral kernel: btrfs no csum found for inode 199934 start 24801280
Jul 22 14:17:51 toral kernel: btrfs no csum found for inode 199934 start 24805376
Jul 22 14:17:51 toral kernel: btrfs csum failed ino 199934 off 24801280 csum 158010657 private 0
Jul 22 14:17:51 toral kernel: btrfs csum failed ino 199934 off 24805376 csum 127231121 private 0

And sorry to say, it also crashed my box, throwing a kernel exception and a
reference to something like scrub_print_warning_inode (or similar), which I
could not find after rebooting my box. It seems my kernel.log and all other
logs are empty for the last 30 minutes, sorry.

What is the most current btrfs-progs git branch to use for further
investigation?

Thx,
Jan
Re: [PATCH 11/16] Btrfs: clean up code for extent_map lookup
On Thu, Jul 14, 2011 at 11:18:15AM +0800, Li Zefan wrote:
> lookup_extent_map() and search_extent_map() can share most of code.
>
> Signed-off-by: Li Zefan l...@cn.fujitsu.com
> ---
>  fs/btrfs/extent_map.c |   85 +
>  1 files changed, 29 insertions(+), 56 deletions(-)
>
> diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
> index 911a9db..df7a803 100644
> --- a/fs/btrfs/extent_map.c
> +++ b/fs/btrfs/extent_map.c
> @@ -299,19 +299,8 @@ static u64 range_end(u64 start, u64 len)
>  	return start + len;
>  }
>
> -/**
> - * lookup_extent_mapping - lookup extent_map
> - * @tree:	tree to lookup in
> - * @start:	byte offset to start the search
> - * @len:	length of the lookup range
> - *
> - * Find and return the first extent_map struct in @tree that intersects the
> - * [start, len] range.  There may be additional objects in the tree that
> - * intersect, so check the object returned carefully to make sure that no
> - * additional lookups are needed.
> - */
> -struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
> -					 u64 start, u64 len)
> +struct extent_map *__lookup_extent_mapping(struct extent_map_tree *tree,
> +					   u64 start, u64 len, int strict)

just minor thing: can be defined static

>  {
>  	struct extent_map *em;
>  	struct rb_node *rb_node;
> @@ -320,38 +309,42 @@ struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
>  	u64 end = range_end(start, len);
>
>  	rb_node = __tree_search(&tree->map, start, &prev, &next);
> -	if (!rb_node && prev) {
> -		em = rb_entry(prev, struct extent_map, rb_node);
> -		if (end > em->start && start < extent_map_end(em))
> -			goto found;
> -	}
> -	if (!rb_node && next) {
> -		em = rb_entry(next, struct extent_map, rb_node);
> -		if (end > em->start && start < extent_map_end(em))
> -			goto found;
> -	}
>  	if (!rb_node) {
> -		em = NULL;
> -		goto out;
> -	}
> -	if (IS_ERR(rb_node)) {
> -		em = ERR_CAST(rb_node);
> -		goto out;
> +		if (prev)
> +			rb_node = prev;
> +		else if (next)
> +			rb_node = next;
> +		else
> +			return NULL;
>  	}
> +
>  	em = rb_entry(rb_node, struct extent_map, rb_node);
> -	if (end > em->start && start < extent_map_end(em))
> -		goto found;
>
> -	em = NULL;
> -	goto out;
> +	if (strict && !(end > em->start && start < extent_map_end(em)))
> +		return NULL;
>
> -found:
>  	atomic_inc(&em->refs);
> -out:
>  	return em;
>  }
>
>  /**
> + * lookup_extent_mapping - lookup extent_map
> + * @tree:	tree to lookup in
> + * @start:	byte offset to start the search
> + * @len:	length of the lookup range
> + *
> + * Find and return the first extent_map struct in @tree that intersects the
> + * [start, len] range.  There may be additional objects in the tree that
> + * intersect, so check the object returned carefully to make sure that no
> + * additional lookups are needed.
> + */
> +struct extent_map *lookup_extent_mapping(struct extent_map_tree *tree,
> +					 u64 start, u64 len)
> +{
> +	return __lookup_extent_mapping(tree, start, len, 1);
> +}
> +
> +/**
>   * search_extent_mapping - find a nearby extent map
>   * @tree:	tree to lookup in
>   * @start:	byte offset to start the search
> @@ -365,27 +358,7 @@ out:
>  struct extent_map *search_extent_mapping(struct extent_map_tree *tree,
>  					 u64 start, u64 len)
>  {
> -	struct extent_map *em;
> -	struct rb_node *rb_node;
> -	struct rb_node *prev = NULL;
> -	struct rb_node *next = NULL;
> -
> -	rb_node = __tree_search(&tree->map, start, &prev, &next);
> -	if (!rb_node && prev) {
> -		em = rb_entry(prev, struct extent_map, rb_node);
> -		goto found;
> -	}
> -	if (!rb_node && next) {
> -		em = rb_entry(next, struct extent_map, rb_node);
> -		goto found;
> -	}
> -	if (!rb_node)
> -		return NULL;
> -
> -	em = rb_entry(rb_node, struct extent_map, rb_node);
> -found:
> -	atomic_inc(&em->refs);
> -	return em;
> +	return __lookup_extent_mapping(tree, start, len, 0);
>  }
>
>  /**
> --
> 1.7.3.1
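For readers following the strict/non-strict split in the patch above, the overlap test can be tried outside the kernel. What follows is only a minimal userspace sketch: `struct range`, `ranges_overlap()` and `pick_range()` are hypothetical stand-ins invented for illustration (they replace `struct extent_map` and the rb-tree search), and reference counting is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct extent_map: covers [start, start + len). */
struct range {
	uint64_t start;
	uint64_t len;
};

/* End of a range, saturating at UINT64_MAX instead of wrapping around. */
static uint64_t range_end(uint64_t start, uint64_t len)
{
	if (start + len < start)
		return UINT64_MAX;
	return start + len;
}

/* True if [start, start + len) intersects the range r. */
static bool ranges_overlap(const struct range *r, uint64_t start, uint64_t len)
{
	uint64_t end = range_end(start, len);

	assert(r != NULL);
	return end > r->start && start < range_end(r->start, r->len);
}

/*
 * Hypothetical helper mirroring the tail of __lookup_extent_mapping():
 * "candidate" is whatever the tree search produced (exact hit, previous
 * or next neighbour). A strict lookup rejects it unless it really
 * overlaps the query; a non-strict search returns it regardless.
 */
static const struct range *pick_range(const struct range *candidate,
				      uint64_t start, uint64_t len, bool strict)
{
	if (!candidate)
		return NULL;
	if (strict && !ranges_overlap(candidate, start, len))
		return NULL;
	return candidate;
}
```

With a candidate covering bytes 4096 to 8191, a strict query for `[0, 4096)` returns NULL because the ranges only touch, while the non-strict variant still hands back the neighbour. That is the behavioural difference between `lookup_extent_mapping()` and `search_extent_mapping()` that the `strict` argument encodes.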
[RFC PATCH 0/4] btrfs: Suggestion for raid auto-repair
Hi all!

This is my suggestion for how to do on-the-fly repair for corrupted raid setups.

Currently, btrfs can cope with a hardware failure in a way that it tries to find another mirror and ... that's it. The bad mirror always stays bad and your data is lost when the last copy vanishes.

Here is where I got on my way to changing this. I built upon the retry code originally used for data (inode.c), moved it to a more central place (extent_io.c) and made it repair errors when possible. Those two steps are currently included in patch 4, because what I actually did was somewhat more iterative. If it helps reviewing, I can try to split that up into a move-commit and a change-commit - just tell me if you'd like this.

To test this, I made some bad sectors with hdparm (data and metadata) and had them corrected while reading the affected data. Anyway, this patch touches critical parts and can potentially screw up your data, in case I have an error in determining the destination for corrective writes. You have been warned! But please, try it anyway :-)

One remark concerning scrub: My latest scrub patches include a change that triggers a regular page read to correct some kinds of errors. This code is meant to end up exactly in the error correction routines added here, too. There are some special cases (nodatasum and a certain state of page cache) where scrub comes across an error that it reports as uncorrectable, which it isn't. I have a patch for that as well, but as it is only relevant when you combine those two patch series, I did not include it.

-Jan

Jan Schmidt (4):
  btrfs: btrfs_multi_bio replaced with btrfs_bio
  btrfs: Do not use bio->bi_bdev after submission
  btrfs: Put mirror_num in bi_bdev
  btrfs: Moved repair code from inode.c to extent_io.c

 fs/btrfs/extent-tree.c |   10 +-
 fs/btrfs/extent_io.c   |  386 +++-
 fs/btrfs/extent_io.h   |   11 ++-
 fs/btrfs/inode.c       |  155 +---
 fs/btrfs/scrub.c       |   20 ++--
 fs/btrfs/volumes.c     |  130 +
 fs/btrfs/volumes.h     |   10 +-
 7 files changed, 485 insertions(+), 237 deletions(-)

--
1.7.3.4
[RFC PATCH 2/4] btrfs: Do not use bio->bi_bdev after submission
The block layer modifies bio->bi_bdev and bio->bi_sector while working on the bio, they do _not_ come back unmodified in the completion callback.

To call add_page, we need at least some bi_bdev set, which is why the code was working, previously. With this patch, we use the latest_bdev from fsinfo instead of the leftover in the bio. This gives us the possibility to use the bi_bdev field for another purpose.

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/inode.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4a13730..6ec7a93 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1916,7 +1916,7 @@ static int btrfs_io_failed_hook(struct bio *failed_bio,
 	bio->bi_private = state;
 	bio->bi_end_io = failed_bio->bi_end_io;
 	bio->bi_sector = failrec->logical >> 9;
-	bio->bi_bdev = failed_bio->bi_bdev;
+	bio->bi_bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 	bio->bi_size = 0;

 	bio_add_page(bio, page, failrec->len, start - page_offset(page));
--
1.7.3.4
[RFC PATCH 1/4] btrfs: btrfs_multi_bio replaced with btrfs_bio
btrfs_bio is a bio abstraction able to split and not complete after the last bio has returned (like the old btrfs_multi_bio). Additionally, btrfs_bio tracks the mirror_num used to read data which can be used for error correction purposes.

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/extent-tree.c |   10 ++--
 fs/btrfs/scrub.c       |   20
 fs/btrfs/volumes.c     |  128 +--
 fs/btrfs/volumes.h     |   10 +++-
 4 files changed, 90 insertions(+), 78 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 71cd456..351efb3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -1772,18 +1772,18 @@ static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
 {
 	int ret;
 	u64 discarded_bytes = 0;
-	struct btrfs_multi_bio *multi = NULL;
+	struct btrfs_bio *bbio = NULL;


 	/* Tell the block device(s) that the sectors can be discarded */
 	ret = btrfs_map_block(&root->fs_info->mapping_tree, REQ_DISCARD,
-			      bytenr, &num_bytes, &multi, 0);
+			      bytenr, &num_bytes, &bbio, 0);
 	if (!ret) {
-		struct btrfs_bio_stripe *stripe = multi->stripes;
+		struct btrfs_bio_stripe *stripe = bbio->stripes;
 		int i;


-		for (i = 0; i < multi->num_stripes; i++, stripe++) {
+		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
 			ret = btrfs_issue_discard(stripe->dev->bdev,
 						  stripe->physical,
 						  stripe->length);
@@ -1792,7 +1792,7 @@ static int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
 			else if (ret != -EOPNOTSUPP)
 				break;
 		}
-		kfree(multi);
+		kfree(bbio);
 	}
 	if (discarded_bytes && ret == -EOPNOTSUPP)
 		ret = 0;
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a8d03d5..c04775e 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -250,7 +250,7 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)
 	struct scrub_dev *sdev = sbio->sdev;
 	struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
 	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
-	struct btrfs_multi_bio *multi = NULL;
+	struct btrfs_bio *bbio = NULL;
 	u64 logical = sbio->logical + ix * PAGE_SIZE;
 	u64 length;
 	int i;
@@ -269,8 +269,8 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)

 	length = PAGE_SIZE;
 	ret = btrfs_map_block(map_tree, REQ_WRITE, logical, &length,
-			      &multi, 0);
-	if (ret || !multi || length < PAGE_SIZE) {
+			      &bbio, 0);
+	if (ret || !bbio || length < PAGE_SIZE) {
 		printk(KERN_ERR
 		       "scrub_fixup: btrfs_map_block failed us for %llu\n",
 		       (unsigned long long)logical);
@@ -278,19 +278,19 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)
 		return;
 	}

-	if (multi->num_stripes == 1)
+	if (bbio->num_stripes == 1)
 		/* there aren't any replicas */
 		goto uncorrectable;

 	/*
 	 * first find a good copy
 	 */
-	for (i = 0; i < multi->num_stripes; ++i) {
+	for (i = 0; i < bbio->num_stripes; ++i) {
 		if (i == sbio->spag[ix].mirror_num)
 			continue;

-		if (scrub_fixup_io(READ, multi->stripes[i].dev->bdev,
-				   multi->stripes[i].physical >> 9,
+		if (scrub_fixup_io(READ, bbio->stripes[i].dev->bdev,
+				   bbio->stripes[i].physical >> 9,
 				   sbio->bio->bi_io_vec[ix].bv_page)) {
 			/* I/O-error, this is not a good copy */
 			continue;
@@ -299,7 +299,7 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)
 		if (scrub_fixup_check(sbio, ix) == 0)
 			break;
 	}
-	if (i == multi->num_stripes)
+	if (i == bbio->num_stripes)
 		goto uncorrectable;

 	if (!sdev->readonly) {
@@ -314,7 +314,7 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)
 		}
 	}

-	kfree(multi);
+	kfree(bbio);
 	spin_lock(&sdev->stat_lock);
 	++sdev->stat.corrected_errors;
 	spin_unlock(&sdev->stat_lock);
@@ -325,7 +325,7 @@ static void scrub_fixup(struct scrub_bio *sbio, int ix)
 	return;

 uncorrectable:
-	kfree(multi);
+	kfree(bbio);
 	spin_lock(&sdev->stat_lock);
 	++sdev->stat.uncorrectable_errors;
 	spin_unlock(&sdev->stat_lock);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 19450bc..e839b72 100644
--- a/fs/btrfs/volumes.c
[RFC PATCH 3/4] btrfs: Put mirror_num in bi_bdev
The error correction code wants to make sure that only the bad mirror is rewritten. Thus, we need to know which mirror is the bad one. I did not find a more appropriate field than bi_bdev. But I think using this is fine, because it is modified by the block layer anyway, and should not be read after the bio has returned.

Signed-off-by: Jan Schmidt list.bt...@jan-o-sch.net
---
 fs/btrfs/volumes.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e839b72..55fbd4d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3169,6 +3169,8 @@ static void btrfs_end_bio(struct bio *bio, int err)
 		}
 		bio->bi_private = bbio->private;
 		bio->bi_end_io = bbio->end_io;
+		bio->bi_bdev = (struct block_device *)
+					(unsigned long)bbio->mirror_num;
 		/* only send an error to the higher layers if it is
 		 * beyond the tolerance of the multi-bio
 		 */
--
1.7.3.4
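The trick used in this patch, storing a small integer in a pointer-typed field that nobody dereferences any more, can be shown in isolation. The following is a userspace sketch under stated assumptions: `struct fake_bio`, `stash_mirror_num()` and `fetch_mirror_num()` are hypothetical names made up for illustration, and `uintptr_t` is used where the kernel patch casts through `unsigned long`.

```c
#include <assert.h>
#include <stdint.h>

struct block_device;	/* opaque here; never dereferenced */

/* Hypothetical, cut-down stand-in for struct bio. */
struct fake_bio {
	struct block_device *bi_bdev;
};

/* Store a small mirror number in the otherwise unused pointer field. */
static void stash_mirror_num(struct fake_bio *bio, int mirror_num)
{
	assert(mirror_num >= 0);
	bio->bi_bdev = (struct block_device *)(uintptr_t)mirror_num;
}

/* Recover the mirror number; the field must not be used as a pointer. */
static int fetch_mirror_num(const struct fake_bio *bio)
{
	return (int)(uintptr_t)bio->bi_bdev;
}
```

The round trip only works as long as every reader of the field agrees that it no longer holds a device pointer, which is the precondition the commit message argues for: the block layer has already modified bi_bdev by the time the completion callback runs.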
rw_semaphore performance, was: new metadata reader/writer locks in integration-test
On Tue, Jul 19, 2011 at 01:30:22PM -0400, Chris Mason wrote:
> We've seen a number of benchmarks dominated by contention on the root
> node lock. This changes our locks into a simple reader/writer lock.
> They are based on mutexes so that we still take advantage of the mutex
> adaptive spins for write locks (rwsemaphores were much slower).

Interesting. Have you set up some artificial benchmarks for this?

I wonder if the lack of adaptive spinning has something to do with the slightly slower XFS performance on Joern's flash testing, given that we extensively use the rw_semaphore as the primary I/O mutex, while all others rely on plain mutexes as the primary synchronization primitive.
Re: rw_semaphore performance, was: new metadata reader/writer locks in integration-test
Excerpts from Christoph Hellwig's message of 2011-07-22 11:01:51 -0400:
> On Tue, Jul 19, 2011 at 01:30:22PM -0400, Chris Mason wrote:
> > We've seen a number of benchmarks dominated by contention on the root
> > node lock. This changes our locks into a simple reader/writer lock.
> > They are based on mutexes so that we still take advantage of the mutex
> > adaptive spins for write locks (rwsemaphores were much slower).
>
> Interesting. Have you set up some artificial benchmarks for this?
>
> I wonder if the lack of adaptive spinning has something to do with the
> slightly slower XFS performance on Joern's flash testing, given that we
> extensively use the rw_semaphore as the primary I/O mutex, while all
> others rely on plain mutexes as the primary synchronization primitive.

For the rw locks I had three main tests.

1) dbench 10. This is interesting only because it is mostly bound by how quickly we can do metadata operations in RAM. There's not much IO and there's a good mixture of read and write btree operations (about 50/50).

rwsemaphores ran at 200MB/s while my current code runs at 2400MB/s. The old btrfs implementation runs at 3000MB/s.

We all love and hate dbench, so I don't put a huge amount of stock in 2400 vs 3000. But, 200 vs 2400...people notice that in real world stuff.

2) fs_mark doing parallel zero byte file creates. No fsyncs here, all metadata operations. The old btrfs locking was completely bound by getting write locks on the root node. The new code is much better here, overall about 30-50% faster. I didn't do the rw semaphores on this one, I'll give it a shot.

3) A stat-hammer program. This creates a bunch of files in parallel, and then times how long it takes us to stat all the inodes. I went from 3s of CPU time down to .9s. rwsems were about the same here (very fast), but that's because it's 100% reader locks.

My money for Joern's benchmarks is end-io latencies. xfs and btrfs are doing more at endio time. But I need to sit down and run them myself and take a look.

-chris
Re: new metadata reader/writer locks in integration-test
On Wed, Jul 20, 2011 at 05:36:09PM +0900, Tsutomu Itoh wrote:
> (2011/07/20 16:58), Chris Mason wrote:
> > Excerpts from Tsutomu Itoh's message of 2011-07-19 22:08:38 -0400:
> >> (2011/07/20 2:30), Chris Mason wrote:
> >>> Hi everyone,
> >>>
> >>> I've pushed out a new integration-test branch, and it includes a new
> >>> reader/writer locking scheme for the btree locks.
> >>>
> >>> We've seen a number of benchmarks dominated by contention on the root
> >>> node lock. This changes our locks into a simple reader/writer lock.
> >>> They are based on mutexes so that we still take advantage of the mutex
> >>> adaptive spins for write locks (rwsemaphores were much slower).
> >>>
> >>> I'm also sending the individual commits, please do take a look.
> >>
> >> I pulled the new integration-test branch, and I got the following
> >> warning messages.
> >>
> >> Jul 20 10:03:30 luna kernel: ------------[ cut here ]------------
> >> Jul 20 10:03:30 luna kernel: WARNING: at fs/btrfs/extent-tree.c:5704 btrfs_alloc_free_block+0x178/0x340 [btrfs]()
> >
> > Thanks, I think this one is related to Josef's enospc changes, but I'll
> > double check. What was the test?
>
> I ran my original test script.
> This script concurrently executes the making and deletion of a lot of
> files, and the making and deletion of a big file, etc.

I'm having a hard time triggering this with Josef's current patch (after my rebase). Could you please send along the reproduction script?

-chris
Re: new metadata reader/writer locks in integration-test
On 21.07.2011 07:44, Arne Jansen wrote:
> On 20.07.2011 19:21, Chris Mason wrote:
>> Excerpts from Chris Mason's message of 2011-07-19 13:30:22 -0400:
>>> Hi everyone,
>>>
>>> I've pushed out a new integration-test branch, and it includes a new
>>> reader/writer locking scheme for the btree locks.
>>>
>>> We've seen a number of benchmarks dominated by contention on the root
>>> node lock. This changes our locks into a simple reader/writer lock.
>>> They are based on mutexes so that we still take advantage of the mutex
>>> adaptive spins for write locks (rwsemaphores were much slower).
>>>
>>> I'm also sending the individual commits, please do take a look.
>>
>> Hi everyone,
>>
>> I just rebased Josef's enospc fixes into integration-test, it should fix
>> the warnings in extent-tree.c
>
> With the current integration-test branch I get very early enospc on a
> 7G volume created with -m single -d single and
>
> fs_mark-3.3/fs_mark -d /mnt/fsm -D 512 -t 16 -n 4096 -s 51200 -L5 -S0 -R1
>
> It enospces at about 20%, but I can continue to fill it up to 94%.

I tried to bisect this, but it turned out to be hard. Sooner or later I get this early enospc on every revision, on some sooner, on others later. At least the current for-linus branch is much worse than integration-test.

-Arne
Issues with KVM
Hi,

I have a new fast computer to run many virtual machines. Everything looks very fast, except the installation of new operating systems in KVM. The installation is very fast until it begins to write to disk. It looks like it writes slower and slower. I tried Debian, FreeBSD, OpenIndiana and OpenBSD: same problem. The FreeBSD installer displays the speed: it starts at 780 KB/sec (which is already very slow) and finishes between 1 and 8 KB/sec.

darksatanic suggested that I use the nodatacow mount option: it is not faster, and it looks even slower (fewer wsec/s in iostat output, see below).

The computer is an Intel i7 2600 (4 cores with hyper threading: 8 threads), 12 GB of RAM, 2 hard drives of 1 TB (Western Digital Caviar Blue 1 TB 7200 RPM 32 MB Serial ATA 6Gb/s - WD10EALX). Both disks are connected to SATA 6 Gb/sec connectors using a P67 chipset. I'm using RAID 0 with Linux software (MD) RAID, and I have a single btrfs partition of 2 TB. The host OS is Fedora 15 (Linux kernel 2.6.38).

I'm using hardware virtualisation with KVM. Debian is installed using virtio, so it should not be an issue with the hard drive driver of KVM.

I'm watching iostat during the Debian installation. With the default mount option, wsec/s starts at 49000 to finish near 42000 (on the MD device), %usage is greater than 50% on both disks (near 80% for sda, near 60% for sdb). Using the nodatacow option, wsec/s starts at 12000 (%usage 75%) to finish near 1 (%usage always 75%). It is slower, right? A sector is 512 bytes. The Debian image size is 40 GB, its type is raw.

The system load is greater than 2 (or maybe 3) during the installation of the VM, while CPU usage is under 8% and wa% is also low (maybe 10% or lower, I don't remember).

bonnie++ output (on the Fedora host, not in a VM):

Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ned          24048M   346  98 220839 24 98489  19   245  84 251547 18 199.2 259
Latency             37256us     326ms     943ms     251ms     197ms     151ms
Version  1.96       ------Sequential Create------ --------Random Create--------
ned                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 11128  34 +++++ +++ 16558  45 14006  40 +++++ +++ 17226  48
Latency             14997us     663us   11401us    8115us     282us   10105us
1.96,1.96,ned,1,1311365747,24048M,,346,98,220839,24,98489,19,245,84,251547,18,199.2,259,16,11128,34,+,+++,16558,45,14006,40,+,+++,17226,48,37256us,326ms,943ms,251ms,197ms,151ms,14997us,663us,11401us,8115us,282us,10105us

Do you have any idea why the %usage is so high (in iostat), while the speed looks so low? The disk speed during the installation is between 5 MB/sec and 23 MB/sec, whereas the raw speed is greater than 200 MB/sec on the host (234 MB/sec according to hdparm -t /dev/md127, 220 MB/sec according to bonnie++ on sequential output). It's difficult to read bonnie++ output: random create is near 14 MB/sec if I read correctly.

Maybe btrfs behaves very badly with a raw image of 40 GB, especially on RAID 0 with 2 disks?

Should I try other KVM options (e.g. use another type of image)? Try btrfs RAID instead of Linux MD RAID? Try to disable some CPU cores? Or maybe not use btrfs for KVM images? :-)

Victor
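The iostat figures in the message above can be turned into throughput with the 512-byte sector size the poster states. A minimal sketch of that arithmetic follows; the function names and the `SECTOR_BYTES` macro are invented for illustration and are not part of iostat or any other tool.

```c
#include <stdint.h>

#define SECTOR_BYTES 512u	/* sector size stated in the message above */

/* Convert a sectors-per-second figure (wsec/s) to bytes per second. */
static uint64_t sectors_to_bytes_per_sec(uint64_t sectors_per_sec)
{
	return sectors_per_sec * SECTOR_BYTES;
}

/* The same figure in whole MiB per second, rounded down. */
static uint64_t sectors_to_mib_per_sec(uint64_t sectors_per_sec)
{
	return sectors_to_bytes_per_sec(sectors_per_sec) / (1024u * 1024u);
}
```

Applied to the reported numbers, 49000 wsec/s is 25,088,000 bytes/s (about 23 MiB/s), 42000 wsec/s is about 20 MiB/s, and 12000 wsec/s is about 5 MiB/s. That is consistent with the "between 5 MB/sec and 23 MB/sec" range quoted in the message and roughly a tenth of the 220 MB/sec sequential figure from bonnie++.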
Re: Issues with KVM
On Fri, Jul 22, 2011 at 2:44 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
> Hi,
>
> I have a new fast computer to run many virtual machines. Everything
> looks very fast, except the installation of new operating systems in
> KVM. The installation is very fast until it begins to write to disk. It
> looks like it writes slower and slower. I tried Debian, FreeBSD,
> OpenIndiana and OpenBSD: same problem. The FreeBSD installer displays
> the speed: it starts at 780 KB/sec (which is already very slow) and
> finishes between 1 and 8 KB/sec.

- is the host FS btrfs?
- are virtio modules in the initramfs (or kernel probably)?
- are you sure virtio is being used (eg. are the disks called vdX vs sdX)?
- is the disk bus set to virtio (virtmanager)?
- is the disk's cache mode set to none [or maybe writeback] (virtmanager)?
- is the disk's storage format set to raw, should be (virtmanager)?
- is caching enabled on the image? ()

probably need to change the cache mode on the disk, or if the host is btrfs you need to flag the image with whatever is needed to prevent continuous COWing.

C Anthony
Re: Issues with KVM
On Fri, Jul 22, 2011 at 2:59 PM, C Anthony Risinger anth...@xtfx.me wrote:
> On Fri, Jul 22, 2011 at 2:44 PM, Victor Stinner victor.stin...@haypocalc.com wrote:
>
> - is caching enabled on the image? ()

oops, disregard that ... remainder left over from editing copy/paste :-)

C Anthony
Re: Issues with KVM
On 07/22/2011 03:44 PM, Victor Stinner wrote:
> Hi,
>
> I have a new fast computer to run many virtual machines. Everything looks very fast, except the installation of new operating systems in KVM. The installation is very fast until it begins to write to disk. It looks like it writes slower and slower. I tried Debian, FreeBSD, OpenIndiana and OpenBSD: same problem. The FreeBSD installer displays the speed: it starts at 780 KB/sec (which is already very slow) and finishes between 1 and 8 KB/sec.
>
> darksatanic suggested that I use the nodatacow mount option: it is not faster, and it looks even slower (fewer wsec/s in iostat output, see below).
>
> The computer is an Intel i7 2600 (4 cores with hyper threading: 8 threads), 12 GB of RAM, 2 hard drives of 1 TB (Western Digital Caviar Blue 1 TB 7200 RPM 32 MB Serial ATA 6Gb/s - WD10EALX). Both disks are connected to SATA 6 Gb/sec connectors using a P67 chipset. I'm using RAID 0 with Linux software (MD) RAID, and I have a single btrfs partition of 2 TB. The host OS is Fedora 15 (Linux kernel 2.6.38).
>
> I'm using hardware virtualisation with KVM. Debian is installed using virtio, so it should not be an issue with the hard drive driver of KVM.
>
> I'm watching iostat during the Debian installation. With the default mount option, wsec/s starts at 49000 to finish near 42000 (on the MD device), %usage is greater than 50% on both disks (near 80% for sda, near 60% for sdb). Using the nodatacow option, wsec/s starts at 12000 (%usage 75%) to finish near 1 (%usage always 75%). It is slower, right? A sector is 512 bytes. The Debian image size is 40 GB, its type is raw.
>
> The system load is greater than 2 (or maybe 3) during the installation of the VM, while CPU usage is under 8% and wa% is also low (maybe 10% or lower, I don't remember).
>
> bonnie++ output (on the Fedora host, not in a VM):
>
> Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
> Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> ned          24048M   346  98 220839 24 98489  19   245  84 251547 18 199.2 259
> Latency             37256us     326ms     943ms     251ms     197ms     151ms
> Version  1.96       ------Sequential Create------ --------Random Create--------
> ned                 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16 11128  34 +++++ +++ 16558  45 14006  40 +++++ +++ 17226  48
> Latency             14997us     663us   11401us    8115us     282us   10105us
> 1.96,1.96,ned,1,1311365747,24048M,,346,98,220839,24,98489,19,245,84,251547,18,199.2,259,16,11128,34,+,+++,16558,45,14006,40,+,+++,17226,48,37256us,326ms,943ms,251ms,197ms,151ms,14997us,663us,11401us,8115us,282us,10105us
>
> Do you have any idea why the %usage is so high (in iostat), while the speed looks so low? The disk speed during the installation is between 5 MB/sec and 23 MB/sec, whereas the raw speed is greater than 200 MB/sec on the host (234 MB/sec according to hdparm -t /dev/md127, 220 MB/sec according to bonnie++ on sequential output). It's difficult to read bonnie++ output: random create is near 14 MB/sec if I read correctly.
>
> Maybe btrfs behaves very badly with a raw image of 40 GB, especially on RAID 0 with 2 disks?
>
> Should I try other KVM options (e.g. use another type of image)? Try btrfs RAID instead of Linux MD RAID? Try to disable some CPU cores? Or maybe not use btrfs for KVM images? :-)

Use the kvm option of cache=none for your device. Granted it's still going to be slow, but it should be a bit faster. Thanks,

Josef
Re: Issues with KVM
On Fri, 22 Jul 2011 21:44:24 +0200, Victor Stinner wrote:
> Should I try other KVM options (e.g. use another type of image)? Try btrfs
> RAID instead of Linux MD RAID? Try to disable some CPU cores? Or maybe
> not use btrfs for KVM images? :-)

Hi,

I would suggest the following:

- qemu-img create -f qcow2 -o size=400,preallocation=metadata vdisk.img
- disk: cache=none
- controller: virtio

Best regards,
Morten
Re: Linux 3.0 release - btrfs possible locking deadlock
On Thursday 21 July 2011 22:59:53 Linus Torvalds wrote:
> So there it is. Gone are the 2.6.<bignum> days, and 3.0 is out.

Hi,

Managed to get this with btrfs rsync(ing) from ext4 to a btrfs fs with three partitions using raid1.

[16018.211493] device fsid f7186eeb-60df-4b1a-890a-4a1eb42f81fe devid 1 transid 10 /dev/sdd4
[16018.230643] btrfs: use lzo compression
[16018.234619] btrfs: enabling disk space caching
[25949.414011]
[25949.414011] ===
[25949.416549] [ INFO: possible circular locking dependency detected ]
[25949.423187] 3.0.0-crc+ #348
[25949.423187] ---
[25949.423187] rsync/20237 is trying to acquire lock:
[25949.423187]  (btrfs-extent-01){+.+...}, at: [a047ce88] btrfs_try_spin_lock+0x78/0xb0 [btrfs]
[25949.423187]
[25949.423187] but task is already holding lock:
[25949.423187]  (&(&eb->lock)->rlock){+.+...}, at: [a047cee2] btrfs_clear_lock_blocking+0x22/0x30 [btrfs]
[25949.423187]
[25949.423187] which lock already depends on the new lock.
[25949.423187]
[25949.423187]
[25949.423187] the existing dependency chain (in reverse order) is:
[25949.423187]
[25949.423187] -> #1 (&(&eb->lock)->rlock){+.+...}:
[25949.423187]        [8108bb75] lock_acquire+0x95/0x140
[25949.423187]        [815792eb] _raw_spin_lock+0x3b/0x50
[25949.423187]        [a047ce88] btrfs_try_spin_lock+0x78/0xb0 [btrfs]
[25949.423187]        [a0427959] btrfs_search_slot+0x2e9/0x800 [btrfs]
[25949.423187]        [a0433bee] lookup_inline_extent_backref+0xbe/0x490 [btrfs]
[25949.423187]        [a0434cbb] __btrfs_free_extent+0x13b/0x900 [btrfs]
[25949.423187]        [a0435ca3] run_clustered_refs+0x823/0xaf0 [btrfs]
[25949.423187]        [a043603d] btrfs_run_delayed_refs+0xcd/0x290 [btrfs]
[25949.423187]        [a0445ecb] btrfs_commit_transaction+0x8b/0x9d0 [btrfs]
[25949.423187]        [a0440c06] transaction_kthread+0x2b6/0x2e0 [btrfs]
[25949.423187]        [81071536] kthread+0xb6/0xc0
[25949.423187]        [81582314] kernel_thread_helper+0x4/0x10
[25949.423187]
[25949.423187] -> #0 (btrfs-extent-01){+.+...}:
[25949.423187]        [8108b468] __lock_acquire+0x1588/0x16a0
[25949.423187]        [8108bb75] lock_acquire+0x95/0x140
[25949.423187]        [815792eb] _raw_spin_lock+0x3b/0x50
[25949.423187]        [a047ce88] btrfs_try_spin_lock+0x78/0xb0 [btrfs]
[25949.423187]        [a0427959] btrfs_search_slot+0x2e9/0x800 [btrfs]
[25949.423187]        [a0439dd2] btrfs_lookup_dir_item+0x82/0x120 [btrfs]
[25949.423187]        [a04532a5] btrfs_lookup_dentry+0xc5/0x4c0 [btrfs]
[25949.423187]        [a04536c4] btrfs_lookup+0x24/0x70 [btrfs]
[25949.423187]        [8115a863] d_alloc_and_lookup+0xc3/0x100
[25949.423187]        [8115cfa0] do_lookup+0x260/0x480
[25949.423187]        [8115d540] walk_component+0x60/0x1f0
[25949.423187]        [8115e7aa] path_lookupat+0xea/0x620
[25949.423187]        [8115ed15] do_path_lookup+0x35/0x1c0
[25949.423187]        [8115fc38] user_path_at+0x98/0xe0
[25949.423187]        [81153fac] vfs_fstatat+0x4c/0x90
[25949.423187]        [8115405e] vfs_lstat+0x1e/0x20
[25949.423187]        [81154084] sys_newlstat+0x24/0x50
[25949.423187]        [815814eb] system_call_fastpath+0x16/0x1b
[25949.423187]
[25949.423187] other info that might help us debug this:
[25949.423187]
[25949.423187]  Possible unsafe locking scenario:
[25949.423187]
[25949.423187]        CPU0                    CPU1
[25949.423187]        ----                    ----
[25949.423187]   lock(&(&eb->lock)->rlock);
[25949.423187]                                lock(btrfs-extent-01);
[25949.423187]                                lock(&(&eb->lock)->rlock);
[25949.423187]   lock(btrfs-extent-01);
[25949.423187]
[25949.423187]  *** DEADLOCK ***
[25949.423187]
[25949.423187] 2 locks held by rsync/20237:
[25949.423187]  #0:  (&sb->s_type->i_mutex_key#14){+.+.+.}, at: [8115cf5a] do_lookup+0x21a/0x480
[25949.423187]  #1:  (&(&eb->lock)->rlock){+.+...}, at: [a047cee2] btrfs_clear_lock_blocking+0x22/0x30 [btrfs]
[25949.423187]
[25949.423187] stack backtrace:
[25949.423187] Pid: 20237, comm: rsync Not tainted 3.0.0-crc+ #348
[25949.423187] Call Trace:
[25949.423187]  [810887de] print_circular_bug+0x20e/0x2f0
[25949.423187]  [8108b468] __lock_acquire+0x1588/0x16a0
[25949.423187]  [a0441ebb] ? verify_parent_transid+0xcb/0x290 [btrfs]
[25949.423187]  [a047ce88] ? btrfs_try_spin_lock+0x78/0xb0 [btrfs]
[25949.423187]  [8108bb75] lock_acquire+0x95/0x140
[25949.423187]  [a047ce88] ? btrfs_try_spin_lock+0x78/0xb0 [btrfs]
[25949.423187]  [815792eb] _raw_spin_lock+0x3b/0x50
[25949.423187]  [a047ce88] ? btrfs_try_spin_lock+0x78/0xb0 [btrfs]