btrfs rare silent data corruption with kernel data leak
Summary:

There seem to be two btrfs bugs here: one loses data on writes, and the other leaks data from the kernel to replace it on reads. It all happens after checksums are verified, so the corruption is entirely silent--no EIO errors, kernel messages, or device event statistics.

Compressed extents are corrupted with a kernel data leak. Uncompressed extents may be corrupted by deterministically replacing data bytes with zero, or may not be corrupted at all. No preconditions for corruption are known. Fewer than one file per hundred thousand seems to be affected. Only specific parts of any file can be affected. Kernels v4.0..v4.5.7 were tested; all have the issue.

Background, observations, and analysis:

I've been detecting silent data corruption on btrfs for over a year. Over time I've been improving data collection and controlling for confounding factors (other known btrfs bugs, RAM and CPU failures, raid5, etc). I have recently isolated the most common remaining corruption mode, and it seems to be a btrfs bug.

I don't have an easy recipe to create a corrupted file and I don't know precisely how they come to exist. In the wild, about one in 10^5..10^7 files is provably corrupted. The corruption can only occur at one point in each file, so the rate of corruption incidents follows the number of files. It seems to occur most often on software builders and rsync backup receivers. It seems to happen mostly on busier machines with mixed workloads, and not at all on idle test VMs trying to reproduce this issue with a script.

One way to get corruption is to set up a series of filesystems and rsync /usr to them sequentially (i.e. rsync -a /usr /fs-A; rsync -a /fs-A /fs-B; rsync -a /fs-B /fs-C; ...) and verify each copy by comparison afterwards. The same host needs to be doing other filesystem workloads, or it won't seem to reproduce this issue. It took me two weeks to intentionally create one corrupt file this way. Good luck.
In cases where this corruption mode is found, the files always have an extent map following this pattern:

	# filefrag -v usr/share/icons/hicolor/icon-theme.cache
	Filesystem type is: 9123683e
	File size of usr/share/icons/hicolor/icon-theme.cache is 36456 (9 blocks of 4096 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
	   0:        0..    4095:          0..      4095:   4096:             encoded,not_aligned,inline
	   1:        1..       8:  182785288.. 182785295:      8:          1: last,encoded,shared,eof
	usr/share/icons/hicolor/icon-theme.cache: 2 extents found

Note the first inline extent followed by one or more non-inline extents. I don't know enough about the writing side of btrfs to know if this is a bug in and of itself. It _looks_ wrong to me.

Once such an extent is created, the corruption is persistent but not deterministic. When I read the extent through btrfs, the file is different most of the time:

	# cp usr/share/icons/hicolor/icon-theme.cache /tmp/foo
	# ls -l usr/share/icons/hicolor/icon-theme.cache /tmp/foo
	-rw-r--r-- 1 root root 36456 Sep 20 11:41 /tmp/foo
	-rw-r--r-- 1 root root 36456 Sep  6 11:52 usr/share/icons/hicolor/icon-theme.cache
	# while sysctl vm.drop_caches=1; do cmp -l usr/share/icons/hicolor/icon-theme.cache /tmp/foo; done
	vm.drop_caches = 1
	vm.drop_caches = 1
	4093 213   0
	4094 177   0
	vm.drop_caches = 1
	4093 216   0
	4094  33   0
	4095 173   0
	4096  15   0
	vm.drop_caches = 1
	4093 352   0
	4094   3   0
	4095  37   0
	4096   2   0
	vm.drop_caches = 1
	4093 243   0
	4094 372   0
	4095 154   0
	4096 221   0
	vm.drop_caches = 1
	4093 333   0
	4094 170   0
	4095 356   0
	4096 213   0
	vm.drop_caches = 1
	4093 170   0
	4094 155   0
	4095  62   0
	4096 233   0
	vm.drop_caches = 1
	4093 263   0
	4094   6   0
	4095 363   0
	4096  44   0
	vm.drop_caches = 1
	4093 237   0
	4094 330   0
	4095 217   0
	4096 206   0
	^C

In other runs there can be 5 or more consecutive reads with no differences detected.
I fetched the raw inline extent item for this file through the SEARCH_V2 ioctl and decoded it:

	# head /tmp/bar
	27 5e 06 00 00 00 00 00 [generation 417319]
	fc 0f 00 00 00 00 00 00 [ram_bytes = 0xffc, compression = 1]
	01 00 00 00 00 78 5e 9c [zlib data starts at "78 5e..."]
	97 3d 74 14 55 14 c7 6f
	60 77 b3 9f d9 20 20 08
	28 11 22 a0 66 90 8f a0
	a8 01 a2 80 80 a2 20 e6
	28 20 42 26 bb 93 cd 30
	b3 33 9b d9 99 24 62 d4
	20 f8 51 58 58 50 58 58

Notice ram_bytes is 0xffc, or 4092, but the inline extent's position in the f
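The decoding of the dump above can be sketched in C. This is only an illustration, not the tool that produced /tmp/bar: the struct and function names are made up here, and the field layout (le64 generation, le64 ram_bytes, then one byte each for compression and encryption, a le16 other_encoding, and a one-byte type, 21 bytes in total before the inline data) is an assumption based on the btrfs on-disk file extent item format and on the annotations in the dump itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Leading fields of an inline file extent item (struct name is illustrative). */
struct inline_extent_header {
	uint64_t generation;     /* le64 at offset 0 */
	uint64_t ram_bytes;      /* le64 at offset 8: uncompressed length */
	uint8_t  compression;    /* offset 16: 1 is taken to mean zlib */
	uint8_t  encryption;     /* offset 17 */
	uint16_t other_encoding; /* le16 at offset 18 */
	uint8_t  type;           /* offset 20: 0 is taken to mean inline */
};

#define INLINE_HEADER_LEN 21

/* First 24 bytes of the dump quoted above. */
const uint8_t sample_item[24] = {
	0x27, 0x5e, 0x06, 0x00, 0x00, 0x00, 0x00, 0x00,
	0xfc, 0x0f, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x01, 0x00, 0x00, 0x00, 0x00, 0x78, 0x5e, 0x9c,
};

/* Read a little-endian 64-bit value regardless of host byte order. */
uint64_t get_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Fill *out from the raw item and return a pointer to the inline data
 * that follows the header, or NULL if the item is too short.
 */
const uint8_t *decode_inline_extent(const uint8_t *item, size_t len,
				    struct inline_extent_header *out)
{
	assert(out != NULL);
	if (item == NULL || len < INLINE_HEADER_LEN)
		return NULL;

	out->generation = get_le64(item);
	out->ram_bytes = get_le64(item + 8);
	out->compression = item[16];
	out->encryption = item[17];
	out->other_encoding = (uint16_t)(item[18] | (item[19] << 8));
	out->type = item[20];
	return item + INLINE_HEADER_LEN;
}
```

Feeding sample_item through decode_inline_extent() gives generation 417319 and ram_bytes 0xffc, and the returned pointer lands on the "78 5e" zlib header, which agrees with the bracketed annotations in the dump.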
[PATCH v2 12/14] btrfs-progs: check: fix the return value bug of cmd_check()
From: Lu Fengqi The function cmd_check() is called by the main function of btrfs.c, its return value will be returned by exit(). Resulting in the loss of significant bits in some cases, for example this value is greater than 0377. If use a bool value "err" to store all of the return value, this will solve the problem. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 47 --- 1 file changed, 36 insertions(+), 11 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 934a3dd..701fff5 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -12337,6 +12337,7 @@ int cmd_check(int argc, char **argv) u64 chunk_root_bytenr = 0; char uuidbuf[BTRFS_UUID_UNPARSED_SIZE]; int ret; + int err = 0; u64 num; int init_csum_tree = 0; int readonly = 0; @@ -12470,10 +12471,12 @@ int cmd_check(int argc, char **argv) if((ret = check_mounted(argv[optind])) < 0) { error("could not check mount status: %s", strerror(-ret)); + err |= !!ret; goto err_out; } else if(ret) { error("%s is currently mounted, aborting", argv[optind]); ret = -EBUSY; + err |= !!ret; goto err_out; } @@ -12486,6 +12489,7 @@ int cmd_check(int argc, char **argv) if (!info) { error("cannot open file system"); ret = -EIO; + err |= !!ret; goto err_out; } @@ -12500,9 +12504,11 @@ int cmd_check(int argc, char **argv) ret = ask_user("repair mode will force to clear out log tree, are you sure?"); if (!ret) { ret = 1; + err |= !!ret; goto close_out; } ret = zero_log_tree(root); + err |= !!ret; if (ret) { error("failed to zero log tree: %d", ret); goto close_out; @@ -12514,6 +12520,7 @@ int cmd_check(int argc, char **argv) printf("Print quota groups for %s\nUUID: %s\n", argv[optind], uuidbuf); ret = qgroup_verify_all(info); + err |= !!ret; if (ret == 0) report_qgroups(1); goto close_out; @@ -12522,6 +12529,7 @@ int cmd_check(int argc, char **argv) printf("Print extent state for subvolume %llu on %s\nUUID: %s\n", subvolid, argv[optind], uuidbuf); ret = print_extent_state(info, subvolid); + err |= !!ret; goto 
close_out; } printf("Checking filesystem on %s\nUUID: %s\n", argv[optind], uuidbuf); @@ -12530,6 +12538,7 @@ int cmd_check(int argc, char **argv) !extent_buffer_uptodate(info->dev_root->node) || !extent_buffer_uptodate(info->chunk_root->node)) { error("critical roots corrupted, unable to check the filesystem"); + err |= !!ret; ret = -EIO; goto close_out; } @@ -12541,12 +12550,14 @@ int cmd_check(int argc, char **argv) if (IS_ERR(trans)) { error("error starting transaction"); ret = PTR_ERR(trans); + err |= !!ret; goto close_out; } if (init_extent_tree) { printf("Creating a new extent tree\n"); ret = reinit_extent_tree(trans, info); + err |= !!ret; if (ret) goto close_out; } @@ -12558,11 +12569,13 @@ int cmd_check(int argc, char **argv) error("checksum tree initialization failed: %d", ret); ret = -EIO; + err |= !!ret; goto close_out; } ret = fill_csum_tree(trans, info->csum_root, init_extent_tree); + err |= !!ret; if (ret) { error("checksum tree refilling failed: %d", ret); return -EIO; @@ -12573,17 +12586,20 @@ int cmd_check(int argc, char **argv) * extent entries for all of the items it finds. */ ret = btrfs_commit_transaction(trans, info->extent_root); + err |= !!ret; if (ret) goto close_out; } if (!extent_buffer_uptodate(info->extent_root->node)) { error("critical: extent_root, unable to check the filesystem");
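The truncation this patch's commit message describes can be sketched as follows. It is a minimal illustration with made-up helper names, not code from btrfs-progs: POSIX makes only the low 8 bits of an exit status available to a waiting parent, so a return value such as 256 (for example an error bitmap with only bit 8 set) would be reported as 0, while folding every nonzero ret into a 0/1 flag survives.

```c
#include <assert.h>

/* What a parent calling wait()/waitpid() can see of exit(ret): the low 8 bits. */
int status_seen_by_parent(int ret)
{
	return ret & 0377;
}

/* The accumulation used in the patch: err |= !!ret keeps err at 0 or 1. */
int accumulate_error(int err, int ret)
{
	assert(err == 0 || err == 1);
	return err | !!ret;
}
```

With these helpers, status_seen_by_parent(256) is 0, so the failure is lost, whereas status_seen_by_parent(accumulate_error(0, 256)) is 1. Negative errno-style values such as -5 already map to a nonzero status (251), so the loss only bites for values whose low 8 bits are all zero.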
[PATCH v2 04/14] btrfs-progs: check: introduce function to check inode_extref
From: Lu Fengqi Introduce a new function check_inode_extref() to check INODE_EXTREF, and call find_dir_item() to find the related DIR_ITEM/DIR_INDEX. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 78 1 file changed, 78 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index 90d5fbc..fec8b6e 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -4064,6 +4064,84 @@ next: return err; } +/* + * Traverse the given INODE_EXTREF and call find_dir_item() to find related + * DIR_ITEM/DIR_INDEX. + * + * @root: the root of the fs/file tree + * @ref_key: the key of the INODE_EXTREF + * @refs: the count of INODE_EXTREF + * @mode: the st_mode of INODE_ITEM + * + * Return 0 if no error occurred. + */ +static int check_inode_extref(struct btrfs_root *root, + struct btrfs_key *ref_key, + struct extent_buffer *node, int slot, u64 *refs, + int mode) +{ + struct btrfs_key key; + struct btrfs_inode_extref *extref; + char namebuf[BTRFS_NAME_LEN] = {0}; + u32 total; + u32 cur = 0; + u32 len; + u32 name_len; + u64 index; + u64 parent; + int ret; + int err = 0; + + extref = btrfs_item_ptr(node, slot, struct btrfs_inode_extref); + total = btrfs_item_size_nr(node, slot); + +next: + /* update inode ref count */ + (*refs)++; + name_len = btrfs_inode_extref_name_len(node, extref); + index = btrfs_inode_extref_index(node, extref); + parent = btrfs_inode_extref_parent(node, extref); + if (name_len <= BTRFS_NAME_LEN) { + len = name_len; + } else { + len = BTRFS_NAME_LEN; + warning("root %llu INODE_EXTREF[%llu %llu] name too long", + root->objectid, ref_key->objectid, ref_key->offset); + } + read_extent_buffer(node, namebuf, (unsigned long)(extref + 1), len); + + /* Check root dir ref name */ + if (index == 0 && strncmp(namebuf, "..", name_len)) { + error("root %llu INODE_EXTREF[%llu %llu] ROOT_DIR name shouldn't be %s", + root->objectid, ref_key->objectid, ref_key->offset, + namebuf); + err |= ROOT_DIR_ERROR; + } + + /* find related dir_index */ + key.objectid = parent; + 
key.type = BTRFS_DIR_INDEX_KEY; + key.offset = index; + ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode); + err |= ret; + + /* find related dir_item */ + key.objectid = parent; + key.type = BTRFS_DIR_ITEM_KEY; + key.offset = btrfs_name_hash(namebuf, len); + ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode); + err |= ret; + + len = sizeof(*extref) + name_len; + extref = (struct btrfs_inode_extref *)((char *)extref + len); + cur += len; + + if (cur < total) + goto next; + + return err; +} + static int all_backpointers_checked(struct extent_record *rec, int print_errs) { struct list_head *cur = rec->backrefs.next; -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 14/14] btrfs-progs: check: Enhance leaf traversal function to handle missing inode item
The leaf traversal function in lowmem mode skips to the first inode item of a leaf and begins checking from there. That is designed to avoid checking the overlapping part of a leaf twice.

But it causes a problem in the fsck/010 test case, where the inode item of the first inode (256) is missing. The traversal then starts checking from inode 257 and finds nothing wrong.

The fix is done in 2 parts:
1) Manually check the first inode
   To handle cases like fsck/010
2) Check the inode if ino changes from the first ino of a leaf
   To avoid the missing inode_item problem.

Signed-off-by: Qu Wenruo
---
 cmds-check.c | 46 --
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index d290a66..f5be153 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1875,6 +1875,7 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
 	struct btrfs_key key;
 	u64 cur_bytenr;
 	u32 nritems;
+	u64 first_ino = 0;
 	int root_level = btrfs_header_level(root->node);
 	int i;
 	int ret = 0; /* Final return value */
@@ -1882,11 +1883,14 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
 
 	cur_bytenr = cur->start;
 
-	/* skip to first inode item in this leaf */
+	/* skip to first inode item or the first inode number change */
 	nritems = btrfs_header_nritems(cur);
 	for (i = 0; i < nritems; i++) {
 		btrfs_item_key_to_cpu(cur, &key, i);
-		if (key.type == BTRFS_INODE_ITEM_KEY)
+		if (i == 0)
+			first_ino = key.objectid;
+		if (key.type == BTRFS_INODE_ITEM_KEY ||
+		    (first_ino && first_ino != key.objectid))
 			break;
 	}
 	if (i == nritems) {
@@ -4951,6 +4955,34 @@ out:
 	return err;
 }
 
+static int check_fs_first_inode(struct btrfs_root *root, unsigned int ext_ref)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int err = 0;
+	int ret;
+
+	path = btrfs_alloc_path();
+	key.objectid = 256;
+	key.type = BTRFS_INODE_ITEM_KEY;
+	key.offset = 0;
+
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret < 0)
+		return ret;
+	if (ret > 0) {
+		ret = 0;
+		err |= INODE_ITEM_MISSING;
+	}
+
+	err |= check_inode_item(root, path, ext_ref);
+	err &= ~LAST_ITEM;
+	if (err && !ret)
+		ret = -EIO;
+	btrfs_free_path(path);
+	return ret;
+}
+
 /*
  * Iterate all item on the tree and call check_inode_item() to check.
  *
@@ -4968,6 +5000,16 @@ static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref)
 	int ret, wret;
 	int level;
 
+	/*
+	 * We need to manually check the first inode item(256)
+	 * As the following traversal function will only start from
+	 * the first inode item in the leaf, if inode item(256) is missing
+	 * we will just skip it forever.
+	 */
+	ret = check_fs_first_inode(root, ext_ref);
+	if (ret < 0)
+		return ret;
+
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
-- 
2.10.0
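The changed skip condition in process_one_leaf_v2() can be tried out in isolation with the sketch below. The key struct and function name are simplified stand-ins for illustration only; the one value taken from btrfs is the key type 1 for an inode item, which is stated here as an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_INODE_ITEM_KEY 1 /* assumed value of BTRFS_INODE_ITEM_KEY */

/* Simplified stand-in for struct btrfs_key. */
struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/*
 * Return the slot where checking should start: the first inode item, or the
 * first slot whose inode number differs from that of slot 0. Returns nritems
 * if the whole leaf belongs to one inode and holds no inode item.
 */
int first_slot_to_check(const struct sketch_key *keys, int nritems)
{
	uint64_t first_ino = 0;
	int i;

	assert(nritems >= 0);
	for (i = 0; i < nritems; i++) {
		if (i == 0)
			first_ino = keys[i].objectid;
		if (keys[i].type == SKETCH_INODE_ITEM_KEY ||
		    (first_ino && first_ino != keys[i].objectid))
			break;
	}
	return i;
}
```

For a leaf holding trailing items of inode 256 followed by items of inode 257 whose inode item is missing, the old condition (inode item only) would run off the end of the leaf, while this version stops at the first item of inode 257.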
[PATCH v2 06/14] btrfs-progs: check: introduce a function to check dir_item
From: Lu Fengqi Introduce a new function check_dir_item() to check DIR_ITEM/DIR_INDEX, and call find_inode_ref() to find the related INODE_REF/INODE_EXTREF. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 125 +++ 1 file changed, 125 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index a261821..d39a81c 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -3853,6 +3853,8 @@ out: #define DIR_ITEM_MISSING (1<<2) /* DIR_ITEM not found */ #define DIR_ITEM_MISMATCH (1<<3) /* DIR_ITEM found but not match */ #define INODE_REF_MISSING (1<<4) /* INODE_REF/INODE_EXTREF not found */ +#define INODE_ITEM_MISSING (1<<5) /* INODE_ITEM not found */ +#define INODE_ITEM_MISMATCH(1<<6) /* INODE_ITEM found but not match */ /* * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified @@ -4291,6 +4293,129 @@ out: return ret; } +/* + * Traverse the given DIR_ITEM/DIR_INDEX and check related INODE_ITEM and + * call find_inode_ref() to check related INODE_REF/INODE_EXTREF. + * + * @root: the root of the fs/file tree + * @key: the key of the INODE_REF/INODE_EXTREF + * @size: the st_size of the INODE_ITEM + * @ext_ref: the EXTENDED_IREF feature + * + * Return 0 if no error occurred. + */ +static int check_dir_item(struct btrfs_root *root, struct btrfs_key *key, + struct extent_buffer *node, int slot, u64 *size, + unsigned int ext_ref) +{ + struct btrfs_dir_item *di; + struct btrfs_inode_item *ii; + struct btrfs_path path; + struct btrfs_key location; + char namebuf[BTRFS_NAME_LEN] = {0}; + u32 total; + u32 cur = 0; + u32 len; + u32 name_len; + u32 data_len; + u8 filetype; + u32 mode; + u64 index; + int ret; + int err = 0; + + /* +* For DIR_ITEM set index to (u64)-1, so that find_inode_ref +* ignore index check. +*/ + index = (key->type == BTRFS_DIR_INDEX_KEY) ? 
key->offset : (u64)-1; + + di = btrfs_item_ptr(node, slot, struct btrfs_dir_item); + total = btrfs_item_size_nr(node, slot); + + while (cur < total) { + data_len = btrfs_dir_data_len(node, di); + if (data_len) + error("root %llu %s[%llu %llu] data_len shouldn't be %u", + root->objectid, key->type == BTRFS_DIR_ITEM_KEY ? + "DIR_ITEM" : "DIR_INDEX", + key->objectid, key->offset, data_len); + + name_len = btrfs_dir_name_len(node, di); + if (name_len <= BTRFS_NAME_LEN) { + len = name_len; + } else { + len = BTRFS_NAME_LEN; + warning("root %llu %s[%llu %llu] name too long", + root->objectid, + key->type == BTRFS_DIR_ITEM_KEY ? + "DIR_ITEM" : "DIR_INDEX", + key->objectid, key->offset); + } + (*size) += name_len; + + read_extent_buffer(node, namebuf, (unsigned long)(di + 1), len); + filetype = btrfs_dir_type(node, di); + + btrfs_init_path(&path); + btrfs_dir_item_key_to_cpu(node, di, &location); + + /* Ignore related ROOT_ITEM check */ + if (location.type == BTRFS_ROOT_ITEM_KEY) + goto next; + + /* Check relative INODE_ITEM(existence/filetype) */ + ret = btrfs_search_slot(NULL, root, &location, &path, 0, 0); + if (ret) { + err |= INODE_ITEM_MISSING; + error("root %llu %s[%llu %llu] couldn't find relative INODE_ITEM[%llu] namelen %u filename %s filetype %x", + root->objectid, key->type == BTRFS_DIR_ITEM_KEY ? + "DIR_ITEM" : "DIR_INDEX", key->objectid, + key->offset, location.objectid, name_len, + namebuf, filetype); + goto next; + } + + ii = btrfs_item_ptr(path.nodes[0], path.slots[0], + struct btrfs_inode_item); + mode = btrfs_inode_mode(path.nodes[0], ii); + + if (imode_to_type(mode) != filetype) { + err |= INODE_ITEM_MISMATCH; + error("root %llu %s[%llu %llu] relative INODE_ITEM filetype mismatch namelen %u filename %s filetype %d", + root->objectid, key->type == BTRFS_DIR_ITEM_KEY ? + "DIR_ITEM" : "DIR_INDEX", key->objectid, + key->offset, name_len, namebuf, filetype); + } + + /
[PATCH v2 03/14] btrfs-progs: check: introduce function to check inode_ref
From: Lu Fengqi

Introduce a new function check_inode_ref() to check INODE_REF, and call find_dir_item() to find the related DIR_ITEM/DIR_INDEX.

Signed-off-by: Lu Fengqi
Signed-off-by: Qu Wenruo
---
 cmds-check.c | 76
 1 file changed, 76 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 4e25804..90d5fbc 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -41,6 +41,7 @@
 #include "rbtree-utils.h"
 #include "backref.h"
 #include "ulist.h"
+#include "hash.h"
 
 enum task_position {
 	TASK_EXTENTS,
@@ -3988,6 +3989,81 @@ out:
 	return ret;
 }
 
+/*
+ * Traverse the given INODE_REF and call find_dir_item() to find related
+ * DIR_ITEM/DIR_INDEX.
+ *
+ * @root:	the root of the fs/file tree
+ * @ref_key:	the key of the INODE_REF
+ * @refs:	the count of INODE_REF
+ * @mode:	the st_mode of INODE_ITEM
+ *
+ * Return 0 if no error occurred.
+ */
+static int check_inode_ref(struct btrfs_root *root, struct btrfs_key *ref_key,
+			   struct extent_buffer *node, int slot, u64 *refs,
+			   int mode)
+{
+	struct btrfs_key key;
+	struct btrfs_inode_ref *ref;
+	char namebuf[BTRFS_NAME_LEN] = {0};
+	u32 total;
+	u32 cur = 0;
+	u32 len;
+	u32 name_len;
+	u64 index;
+	int ret, err = 0;
+
+	ref = btrfs_item_ptr(node, slot, struct btrfs_inode_ref);
+	total = btrfs_item_size_nr(node, slot);
+
+next:
+	/* Update inode ref count */
+	(*refs)++;
+
+	index = btrfs_inode_ref_index(node, ref);
+	name_len = btrfs_inode_ref_name_len(node, ref);
+	if (name_len <= BTRFS_NAME_LEN) {
+		len = name_len;
+	} else {
+		len = BTRFS_NAME_LEN;
+		warning("root %llu INODE_REF[%llu %llu] name too long",
+			root->objectid, ref_key->objectid, ref_key->offset);
+	}
+
+	read_extent_buffer(node, namebuf, (unsigned long)(ref + 1), len);
+
+	/* Check root dir ref name */
+	if (index == 0 && strncmp(namebuf, "..", name_len)) {
+		error("root %llu INODE_REF[%llu %llu] ROOT_DIR name shouldn't be %s",
+		      root->objectid, ref_key->objectid, ref_key->offset,
+		      namebuf);
+		err |= ROOT_DIR_ERROR;
+	}
+
+	/* Find related dir_index */
+	key.objectid = ref_key->offset;
+	key.type = BTRFS_DIR_INDEX_KEY;
+	key.offset = index;
+	ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode);
+	err |= ret;
+
+	/* Find related dir_item */
+	key.objectid = ref_key->offset;
+	key.type = BTRFS_DIR_ITEM_KEY;
+	key.offset = btrfs_name_hash(namebuf, len);
+	ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode);
+	err |= ret;
+
+	len = sizeof(*ref) + name_len;
+	ref = (struct btrfs_inode_ref *)((char *)ref + len);
+	cur += len;
+	if (cur < total)
+		goto next;
+
+	return err;
+}
+
 static int all_backpointers_checked(struct extent_record *rec, int print_errs)
 {
 	struct list_head *cur = rec->backrefs.next;
-- 
2.10.0
[PATCH v2 05/14] btrfs-progs: check: introduce function to find inode_ref
From: Lu Fengqi Introduce a new function find_inode_ref() to find INODE_REF/INODE_EXTREF for the given key, and check it with the specified DIR_ITEM/DIR_INDEX match. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 149 +++ 1 file changed, 149 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index fec8b6e..a261821 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -3852,6 +3852,7 @@ out: #define ROOT_DIR_ERROR (1<<1) /* bad root_dir */ #define DIR_ITEM_MISSING (1<<2) /* DIR_ITEM not found */ #define DIR_ITEM_MISMATCH (1<<3) /* DIR_ITEM found but not match */ +#define INODE_REF_MISSING (1<<4) /* INODE_REF/INODE_EXTREF not found */ /* * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified @@ -4142,6 +4143,154 @@ next: return err; } +/* + * Find INODE_REF/INODE_EXTREF for the given key and check it with the specified + * DIR_ITEM/DIR_INDEX match. + * + * @root: the root of the fs/file tree + * @key: the key of the INODE_REF/INODE_EXTREF + * @name: the name in the INODE_REF/INODE_EXTREF + * @namelen: the length of name in the INODE_REF/INODE_EXTREF + * @index: the index in the INODE_REF/INODE_EXTREF, for DIR_ITEM set index + * to (u64)-1 + * @ext_ref: the EXTENDED_IREF feature + * + * Return 0 if no error occurred. 
+ * Return >0 for error bitmap + */ +static int find_inode_ref(struct btrfs_root *root, struct btrfs_key *key, + char *name, int namelen, u64 index, + unsigned int ext_ref) +{ + struct btrfs_path path; + struct btrfs_inode_ref *ref; + struct btrfs_inode_extref *extref; + struct extent_buffer *node; + char ref_namebuf[BTRFS_NAME_LEN] = {0}; + u32 total; + u32 cur = 0; + u32 len; + u32 ref_namelen; + u64 ref_index; + u64 parent; + u64 dir_id; + int slot; + int ret; + + btrfs_init_path(&path); + ret = btrfs_search_slot(NULL, root, key, &path, 0, 0); + if (ret) { + ret = INODE_REF_MISSING; + goto extref; + } + + node = path.nodes[0]; + slot = path.slots[0]; + + ref = btrfs_item_ptr(node, slot, struct btrfs_inode_ref); + total = btrfs_item_size_nr(node, slot); + + /* Iterate all entry of INODE_REF */ + while (cur < total) { + ret = INODE_REF_MISSING; + + ref_namelen = btrfs_inode_ref_name_len(node, ref); + ref_index = btrfs_inode_ref_index(node, ref); + if (index != (u64)-1 && index != ref_index) + goto next_ref; + + if (ref_namelen <= BTRFS_NAME_LEN) { + len = ref_namelen; + } else { + len = BTRFS_NAME_LEN; + warning("root %llu INODE %s[%llu %llu] name too long", + root->objectid, + key->type == BTRFS_INODE_REF_KEY ? 
+ "REF" : "EXTREF", + key->objectid, key->offset); + } + read_extent_buffer(node, ref_namebuf, (unsigned long)(ref + 1), + len); + + if (len != namelen || strncmp(ref_namebuf, name, len)) + goto next_ref; + + ret = 0; + goto out; +next_ref: + len = sizeof(*ref) + ref_namelen; + ref = (struct btrfs_inode_ref *)((char *)ref + len); + cur += len; + } + +extref: + /* Skip if not support EXTENDED_IREF feature */ + if (!ext_ref) + goto out; + + btrfs_release_path(&path); + btrfs_init_path(&path); + + dir_id = key->offset; + key->type = BTRFS_INODE_EXTREF_KEY; + key->offset = btrfs_extref_hash(dir_id, name, namelen); + + ret = btrfs_search_slot(NULL, root, key, &path, 0, 0); + if (ret) { + ret = INODE_REF_MISSING; + goto out; + } + + node = path.nodes[0]; + slot = path.slots[0]; + + extref = btrfs_item_ptr(node, slot, struct btrfs_inode_extref); + cur = 0; + total = btrfs_item_size_nr(node, slot); + + /* Iterate all entry of INODE_EXTREF */ + while (cur < total) { + ret = INODE_REF_MISSING; + + ref_namelen = btrfs_inode_extref_name_len(node, extref); + ref_index = btrfs_inode_extref_index(node, extref); + parent = btrfs_inode_extref_parent(node, extref); + if (index != (u64)-1 && index != ref_index) + goto next_extref; + + if (parent != dir_id) + goto next_extref; + + if (ref_namelen <= BTRFS_NAME_LEN) { + len = ref_namelen; +
[PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item
From: Lu Fengqi Introduce a new function find_dir_item() to find DIR_ITEM for the given key, and check it with the specified INODE_REF/INODE_EXTREF match. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 140 +++ 1 file changed, 140 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index 998ba63..4e25804 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -3848,6 +3848,146 @@ out: return err; } +#define ROOT_DIR_ERROR (1<<1) /* bad root_dir */ +#define DIR_ITEM_MISSING (1<<2) /* DIR_ITEM not found */ +#define DIR_ITEM_MISMATCH (1<<3) /* DIR_ITEM found but not match */ + +/* + * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified + * INODE_REF/INODE_EXTREF match. + * + * @root: the root of the fs/file tree + * @ref_key: the key of the INODE_REF/INODE_EXTREF + * @key: the key of the DIR_ITEM/DIR_INDEX + * @index: the index in the INODE_REF/INODE_EXTREF, be used to + * distinguish root_dir between normal dir/file + * @name: the name in the INODE_REF/INODE_EXTREF + * @namelen: the length of name in the INODE_REF/INODE_EXTREF + * @mode: the st_mode of INODE_ITEM + * + * Return 0 if no error occurred. + * Return ROOT_DIR_ERROR if found DIR_ITEM/DIR_INDEX for root_dir. + * Return DIR_ITEM_MISSING if couldn't find DIR_ITEM/DIR_INDEX for normal + * dir/file. + * Return DIR_ITEM_MISMATCH if INODE_REF/INODE_EXTREF and DIR_ITEM/DIR_INDEX + * not match for normal dir/file. 
+ */ +static int find_dir_item(struct btrfs_root *root, struct btrfs_key *ref_key, +struct btrfs_key *key, u64 index, char *name, +u32 namelen, u32 mode) +{ + struct btrfs_path path; + struct extent_buffer *node; + struct btrfs_dir_item *di; + struct btrfs_key location; + char namebuf[BTRFS_NAME_LEN] = {0}; + u32 total; + u32 cur = 0; + u32 len; + u32 name_len; + u32 data_len; + u8 filetype; + int slot; + int ret; + + btrfs_init_path(&path); + ret = btrfs_search_slot(NULL, root, key, &path, 0, 0); + if (ret < 0) { + ret = DIR_ITEM_MISSING; + goto out; + } + + /* Process root dir and goto out*/ + if (index == 0) { + if (ret == 0) { + ret = ROOT_DIR_ERROR; + error( + "root %llu INODE %s[%llu %llu] ROOT_DIR shouldn't have %s", + root->objectid, + ref_key->type == BTRFS_INODE_REF_KEY ? + "REF" : "EXTREF", + ref_key->objectid, ref_key->offset, + key->type == BTRFS_DIR_ITEM_KEY ? + "DIR_ITEM" : "DIR_INDEX"); + } else { + ret = 0; + } + + goto out; + } + + /* Process normal file/dir */ + if (ret > 0) { + ret = DIR_ITEM_MISSING; + error( + "root %llu INODE %s[%llu %llu] doesn't have related %s[%llu %llu] namelen %u filename %s filetype %d", + root->objectid, + ref_key->type == BTRFS_INODE_REF_KEY ? "REF" : "EXTREF", + ref_key->objectid, ref_key->offset, + key->type == BTRFS_DIR_ITEM_KEY ? 
+ "DIR_ITEM" : "DIR_INDEX", + key->objectid, key->offset, namelen, name, + imode_to_type(mode)); + goto out; + } + + /* Check whether inode_id/filetype/name match */ + node = path.nodes[0]; + slot = path.slots[0]; + di = btrfs_item_ptr(node, slot, struct btrfs_dir_item); + total = btrfs_item_size_nr(node, slot); + while (cur < total) { + ret = DIR_ITEM_MISMATCH; + name_len = btrfs_dir_name_len(node, di); + data_len = btrfs_dir_data_len(node, di); + + btrfs_dir_item_key_to_cpu(node, di, &location); + if (location.objectid != ref_key->objectid || + location.type != BTRFS_INODE_ITEM_KEY || + location.offset != 0) + goto next; + + filetype = btrfs_dir_type(node, di); + if (imode_to_type(mode) != filetype) + goto next; + + if (name_len <= BTRFS_NAME_LEN) { + len = name_len; + } else { + len = BTRFS_NAME_LEN; + fprintf(stderr, + "Warning: root %llu %s[%llu %llu] name too long\n", + root->objectid, + key->type == BTRFS_DIR_ITEM_KEY ? + "DIR_ITEM" : "DIR_IN
[PATCH v2 00/14] Btrfsck low memory mode with fs/subvol tree check
The branch can be fetched from my github:
https://github.com/adam900710/btrfs-progs/tree/lowmem_fs_tree

The already merged lowmem mode fsck only covers the extent/chunk tree check. For the fs tree, it still uses the original mode code. This makes btrfs check still eat tons of memory on a large fs.

The new lowmem mode code now covers the fs tree as well, making lowmem mode a really low-memory mode.

The whole patchset passes all fsck test cases, except the following:

006: There is a bug in the root item repair code, causing a backref error. However, the old fsck has another bug that overwrites the extent tree error, so the old fsck only reports the error but still returns 0. That's an unrelated btrfsck repair bug, which I'll address later.

015: Just a wrong test case. It's not a normal check-repair-check one, so the check after repair will still report an error. Better to move it to the fuzz test cases.

Further plans for lowmem mode:
1) Add support for --repair
   A lot of work again.
2) Separate original and lowmem mode code into different files
   A 300+K single source file is really too large. Better to separate them into a dir with multiple files.
3) Avoid using find_all_parents() in the traversal function
   In lowmem mode, we use find_all_parents() to ensure that only the root with the smallest objectid checks the leaf, so we can save some IO. However, find_all_parents() is still quite a time-consuming function, so we'd better avoid calling it.
Lu Fengqi (12):
  btrfs-progs: move btrfs_extref_hash() to hash.h
  btrfs-progs: check: introduce function to find dir_item
  btrfs-progs: check: introduce function to check inode_ref
  btrfs-progs: check: introduce function to check inode_extref
  btrfs-progs: check: introduce function to find inode_ref
  btrfs-progs: check: introduce a function to check dir_item
  btrfs-progs: check: introduce function to check file extent
  btrfs-progs: check: introduce function to check inode item
  btrfs-progs: check: introduce function to check fs root
  btrfs-progs: check: introduce function to check root ref
  btrfs-progs: check: introduce low_memory mode fs_tree check
  btrfs-progs: check: fix the return value bug of cmd_check()

Qu Wenruo (1):
  btrfs-progs: check: Enhance leaf traversal function to handle missing inode item

Wang Xiaoguang (1):
  btrfs-progs: check: skip shared node or leaf check for low_memory mode

 cmds-check.c | 1763 --
 hash.h       |   10 +
 inode-item.c |    8 +-
 3 files changed, 1600 insertions(+), 181 deletions(-)

-- 
2.10.0
[PATCH v2 01/14] btrfs-progs: move btrfs_extref_hash() to hash.h
From: Lu Fengqi

Move btrfs_extref_hash() from inode-item.c to hash.h, so that the function can be called elsewhere.

Signed-off-by: Lu Fengqi
Signed-off-by: Qu Wenruo
---
 hash.h       | 10 ++
 inode-item.c |  8 +---
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/hash.h b/hash.h
index ac4c411..9ff6761 100644
--- a/hash.h
+++ b/hash.h
@@ -25,4 +25,14 @@ static inline u64 btrfs_name_hash(const char *name, int len)
 {
 	return crc32c((u32)~1, name, len);
 }
+
+/*
+ * Figure the key offset of an extended inode ref
+ */
+static inline u64 btrfs_extref_hash(u64 parent_objectid, const char *name,
+				    int len)
+{
+	return (u64)btrfs_crc32c(parent_objectid, name, len);
+}
+
 #endif
diff --git a/inode-item.c b/inode-item.c
index 913b81a..f7b6ead 100644
--- a/inode-item.c
+++ b/inode-item.c
@@ -19,7 +19,7 @@
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
-#include "crc32c.h"
+#include "hash.h"
 
 static int find_name_in_backref(struct btrfs_path *path, const char * name,
 			 int name_len, struct btrfs_inode_ref **ref_ret)
@@ -184,12 +184,6 @@ out:
 	return ret_inode_ref;
 }
 
-static inline u64 btrfs_extref_hash(u64 parent_ino, const char *name,
-				    int namelen)
-{
-	return (u64)btrfs_crc32c(parent_ino, name, namelen);
-}
-
 static int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
 		u64 parent_ino, const char *name, int namelen,
 		struct btrfs_inode_extref **extref_ret)
-- 
2.10.0
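For readers who want to experiment with the hash moved by this patch outside of btrfs-progs, a standalone sketch is given below. It assumes that btrfs_crc32c() is a plain CRC-32C (Castagnoli) update that takes a 32-bit seed and applies no initial or final inversion, so that a 64-bit parent objectid is truncated to its low 32 bits when passed as the seed. The bitwise CRC loop and both function names are written for this example and are not the library's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C update (reflected polynomial 0x82F63B78), no inversion. */
uint32_t sketch_crc32c(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;

	assert(data != NULL || len == 0);
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Key offset of an extended inode ref: CRC-32C of the name, seeded by the parent. */
uint64_t sketch_extref_hash(uint64_t parent_objectid, const char *name, int len)
{
	return (uint64_t)sketch_crc32c((uint32_t)parent_objectid, name,
				       (size_t)len);
}
```

As a sanity check on the CRC loop, seeding with 0xFFFFFFFF over "123456789" and inverting the result gives the standard CRC-32C check value 0xE3069283. Under the truncation assumption, two parents that differ only above bit 31 hash the same name to the same key offset.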
[PATCH v2 13/14] btrfs-progs: check: skip shared node or leaf check for low_memory mode
From: Wang Xiaoguang

The basic idea is simple. Assume a middle tree node A is shared and the fs/file tree roots referencing it have ids 5, 258 and 260; then we only check node A in the tree that has the smallest root id. That means in this case, when checking root tree (5), we check node A; for root trees 258 and 260, we can just skip it.

Notice that even with this patch, we may still visit a shared node or leaf multiple times. This happens when an inode's metadata occupies multiple leaves.

leaf_A leaf_B

When checking the inode item in leaf_A, assume inode[512] has file extents in leaf_B, and leaf_B is shared. In that case, for inode[512], we must visit leaf_B to complete the inode item check. After finishing the inode[512] check, we walk down from the tree root to leaf_B to check whether each node or leaf is shared; if some node or leaf is shared, we can skip it together with the nodes and leaves below it.

I also filled a disk partition with Linux source code and created 3 snapshots in it. Before this patch, one btrfsck execution took 46s on average; with this patch, it took 15s on average.
Signed-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo --- cmds-check.c | 390 --- 1 file changed, 321 insertions(+), 69 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 701fff5..d290a66 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -113,6 +113,24 @@ struct data_backref { u32 found_ref; }; +#define ROOT_DIR_ERROR (1<<1) /* bad root_dir */ +#define DIR_ITEM_MISSING (1<<2) /* DIR_ITEM not found */ +#define DIR_ITEM_MISMATCH (1<<3) /* DIR_ITEM found but not match */ +#define INODE_REF_MISSING (1<<4) /* INODE_REF/INODE_EXTREF not found */ +#define INODE_ITEM_MISSING (1<<5) /* INODE_ITEM not found */ +#define INODE_ITEM_MISMATCH(1<<6) /* INODE_ITEM found but not match */ +#define FILE_EXTENT_ERROR (1<<7) /* bad file extent */ +#define ODD_CSUM_ITEM (1<<8) /* CSUM_ITEM ERROR */ +#define CSUM_ITEM_MISSING (1<<9) /* CSUM_ITEM not found */ +#define LINK_COUNT_ERROR (1<<10) /* INODE_ITEM nlink count error */ +#define NBYTES_ERROR (1<<11) /* INODE_ITEM nbytes count error */ +#define ISIZE_ERROR(1<<12) /* INODE_ITEM size count error */ +#define ORPHAN_ITEM(1<<13) /* INODE_ITEM no reference */ +#define NO_INODE_ITEM (1<<14) /* no inode_item */ +#define LAST_ITEM (1<<15) /* Complete this tree traversal */ +#define ROOT_REF_MISSING (1<<16) /* ROOT_REF not found */ +#define ROOT_REF_MISMATCH (1<<17) /* ROOT_REF found but not match */ + static inline struct data_backref* to_data_backref(struct extent_backref *back) { return container_of(back, struct data_backref, node); @@ -1839,6 +1857,92 @@ static int process_one_leaf(struct btrfs_root *root, struct extent_buffer *eb, return ret; } +struct node_refs { + u64 bytenr[BTRFS_MAX_LEVEL]; + u64 refs[BTRFS_MAX_LEVEL]; + int need_check[BTRFS_MAX_LEVEL]; +}; + +static int update_nodes_refs(struct btrfs_root *root, u64 bytenr, +struct node_refs *nrefs, u64 level); +static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path, + unsigned int ext_ref); + +static int process_one_leaf_v2(struct btrfs_root *root, 
struct btrfs_path *path, + struct node_refs *nrefs, int *level, int ext_ref) +{ + struct extent_buffer *cur = path->nodes[0]; + struct btrfs_key key; + u64 cur_bytenr; + u32 nritems; + int root_level = btrfs_header_level(root->node); + int i; + int ret = 0; /* Final return value */ + int err = 0; /* Positive error bitmap */ + + cur_bytenr = cur->start; + + /* skip to first inode item in this leaf */ + nritems = btrfs_header_nritems(cur); + for (i = 0; i < nritems; i++) { + btrfs_item_key_to_cpu(cur, &key, i); + if (key.type == BTRFS_INODE_ITEM_KEY) + break; + } + if (i == nritems) { + path->slots[0] = nritems; + return 0; + } + path->slots[0] = i; + +again: + err |= check_inode_item(root, path, ext_ref); + + if (err & LAST_ITEM) + goto out; + + /* still have inode items in thie leaf */ + if (cur->start == cur_bytenr) + goto again; + + /* +* we have switched to another leaf, above nodes may +* have changed, here walk down the path, if a node +* or leaf is shared, check whether we can skip this +* node or leaf. +*/ + for (i = root_level; i >= 0; i--) { + if (path->nodes[i]->start == nrefs->bytenr[i]) + continue; + + ret = update_nodes_refs(root, + path->nodes[i]->start, +
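The "smallest root id wins" rule from the commit message can be sketched in isolation. This is only an illustration: owns_shared_node_check() and its array argument are hypothetical, and how the patch itself collects the referencing roots is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if the tree identified by root_id should check a shared node,
 * i.e. if it has the smallest id among all roots referencing that node.
 * 'refs' and 'nr_refs' are hypothetical inputs for this sketch.
 */
static int owns_shared_node_check(uint64_t root_id, const uint64_t *refs,
				  size_t nr_refs)
{
	assert(refs != NULL && nr_refs > 0);

	uint64_t lowest = refs[0];

	for (size_t i = 1; i < nr_refs; i++)
		if (refs[i] < lowest)
			lowest = refs[i];
	return root_id == lowest;
}
```

With the roots 5, 258 and 260 from the example, only the walk of root 5 descends into node A; the other two skip it, which is presumably where most of the reported time saving comes from.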
[PATCH v2 11/14] btrfs-progs: check: introduce low_memory mode fs_tree check
From: Lu Fengqi Introduce a new function check_fs_roots_v2() for check fs_tree in low_memory mode. It call check_fs_root_v2() to check fs_root, and call check_root_ref() to check root_ref. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 106 +++ 1 file changed, 100 insertions(+), 6 deletions(-) diff --git a/cmds-check.c b/cmds-check.c index 7593013..934a3dd 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -4774,8 +4774,8 @@ static int check_root_ref(struct btrfs_root *root, struct btrfs_key *ref_key, struct btrfs_key key; struct btrfs_root_ref *ref; struct btrfs_root_ref *backref; - char ref_name[BTRFS_NAME_LEN]; - char backref_name[BTRFS_NAME_LEN]; + char ref_name[BTRFS_NAME_LEN] = {0}; + char backref_name[BTRFS_NAME_LEN] = {0}; u64 ref_dirid; u64 ref_seq; u32 ref_namelen; @@ -4850,6 +4850,94 @@ out: return err; } +/* + * Check all fs/file tree in low_memory mode. + * + * 1. for fs tree root item, call check_fs_root_v2() + * 2. for fs tree root ref/backref, call check_root_ref() + * + * Return 0 if no error occurred. 
+ */ +static int check_fs_roots_v2(struct btrfs_fs_info *fs_info) +{ + struct btrfs_root *tree_root = fs_info->tree_root; + struct btrfs_root *cur_root = NULL; + struct btrfs_path *path; + struct btrfs_key key; + struct extent_buffer *node; + unsigned int ext_ref; + int slot; + int ret; + int err = 0; + + ext_ref = btrfs_fs_incompat(fs_info, + BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF); + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + key.objectid = BTRFS_FS_TREE_OBJECTID; + key.offset = 0; + key.type = BTRFS_ROOT_ITEM_KEY; + + ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0); + if (ret < 0) { + err = ret; + goto out; + } else if (ret > 0) { + err = -ENOENT; + goto out; + } + + while (1) { + node = path->nodes[0]; + slot = path->slots[0]; + btrfs_item_key_to_cpu(node, &key, slot); + if (key.objectid > BTRFS_LAST_FREE_OBJECTID) + goto out; + if (key.type == BTRFS_ROOT_ITEM_KEY && + fs_root_objectid(key.objectid)) { + if (key.objectid == BTRFS_TREE_RELOC_OBJECTID) { + cur_root = btrfs_read_fs_root_no_cache(fs_info, + &key); + } else { + key.offset = (u64)-1; + cur_root = btrfs_read_fs_root(fs_info, &key); + } + + if (IS_ERR(cur_root)) { + error("Fail to read fs/subvol tree: %lld", + key.objectid); + err = -EIO; + goto next; + } + + ret = check_fs_root_v2(cur_root, ext_ref); + err |= ret; + + if (key.objectid == BTRFS_TREE_RELOC_OBJECTID) + btrfs_free_fs_root(cur_root); + } else if (key.type == BTRFS_ROOT_REF_KEY || + key.type == BTRFS_ROOT_BACKREF_KEY) { + ret = check_root_ref(tree_root, &key, node, slot); + err |= ret; + } +next: + ret = btrfs_next_item(tree_root, path); + if (ret > 0) + goto out; + if (ret < 0) { + err = ret; + goto out; + } + } + +out: + btrfs_free_path(path); + return err; +} + static int all_backpointers_checked(struct extent_record *rec, int print_errs) { struct list_head *cur = rec->backrefs.next; @@ -12544,7 +12632,10 @@ int cmd_check(int argc, char **argv) BTRFS_FEATURE_INCOMPAT_NO_HOLES); if (!ctx.progress_enabled) 
fprintf(stderr, "checking fs roots\n"); - ret = check_fs_roots(root, &root_cache); + if (check_mode == CHECK_MODE_LOWMEM) + ret = check_fs_roots_v2(root->fs_info); + else + ret = check_fs_roots(root, &root_cache); if (ret) goto out; @@ -12554,9 +12645,12 @@ int cmd_check(int argc, char **argv) goto out; fprintf(stderr, "checking root refs\n"); - ret = check_root_refs(root, &root_cache); - if (ret) - goto out; + /* For low memory mode, check_fs_roots_v2 handles root refs */ + if (check_mode != CHECK_MODE_LOWMEM) { +
[PATCH v2 10/14] btrfs-progs: check: introduce function to check root ref
From: Lu Fengqi Introduce a new function check_root_ref() to check root_ref/root_backref. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 93 1 file changed, 93 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index 6d3c6a8..7593013 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -3864,6 +3864,8 @@ out: #define ORPHAN_ITEM(1<<13) /* INODE_ITEM no reference */ #define NO_INODE_ITEM (1<<14) /* no inode_item */ #define LAST_ITEM (1<<15) /* Complete this tree traversal */ +#define ROOT_REF_MISSING (1<<16) /* ROOT_REF not found */ +#define ROOT_REF_MISMATCH (1<<17) /* ROOT_REF found but not match */ /* * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified @@ -4757,6 +4759,97 @@ out: return ret; } +/* + * Find the relative ref for root_ref and root_backref. + * + * @root: the root of the root tree. + * @ref_key: the key of the root ref. + * + * Return 0 if no error occurred. + */ +static int check_root_ref(struct btrfs_root *root, struct btrfs_key *ref_key, + struct extent_buffer *node, int slot) +{ + struct btrfs_path *path; + struct btrfs_key key; + struct btrfs_root_ref *ref; + struct btrfs_root_ref *backref; + char ref_name[BTRFS_NAME_LEN]; + char backref_name[BTRFS_NAME_LEN]; + u64 ref_dirid; + u64 ref_seq; + u32 ref_namelen; + u64 backref_dirid; + u64 backref_seq; + u32 backref_namelen; + u32 len; + int ret; + int err = 0; + + ref = btrfs_item_ptr(node, slot, struct btrfs_root_ref); + ref_dirid = btrfs_root_ref_dirid(node, ref); + ref_seq = btrfs_root_ref_sequence(node, ref); + ref_namelen = btrfs_root_ref_name_len(node, ref); + + if (ref_namelen <= BTRFS_NAME_LEN) { + len = ref_namelen; + } else { + len = BTRFS_NAME_LEN; + warning("%s[%llu %llu] ref_name too long", + ref_key->type == BTRFS_ROOT_REF_KEY ? 
+ "ROOT_REF" : "ROOT_BACKREF", ref_key->objectid, + ref_key->offset); + } + read_extent_buffer(node, ref_name, (unsigned long)(ref + 1), len); + + /* Find relative root_ref */ + key.objectid = ref_key->offset; + key.type = BTRFS_ROOT_BACKREF_KEY + BTRFS_ROOT_REF_KEY - ref_key->type; + key.offset = ref_key->objectid; + + path = btrfs_alloc_path(); + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret) { + err |= ROOT_REF_MISSING; + error("%s[%llu %llu] couldn't find relative ref", + ref_key->type == BTRFS_ROOT_REF_KEY ? + "ROOT_REF" : "ROOT_BACKREF", + ref_key->objectid, ref_key->offset); + goto out; + } + + backref = btrfs_item_ptr(path->nodes[0], path->slots[0], +struct btrfs_root_ref); + backref_dirid = btrfs_root_ref_dirid(path->nodes[0], backref); + backref_seq = btrfs_root_ref_sequence(path->nodes[0], backref); + backref_namelen = btrfs_root_ref_name_len(path->nodes[0], backref); + + if (backref_namelen <= BTRFS_NAME_LEN) { + len = backref_namelen; + } else { + len = BTRFS_NAME_LEN; + warning("%s[%llu %llu] ref_name too long", + key.type == BTRFS_ROOT_REF_KEY ? + "ROOT_REF" : "ROOT_BACKREF", + key.objectid, key.offset); + } + read_extent_buffer(path->nodes[0], backref_name, + (unsigned long)(backref + 1), len); + + if (ref_dirid != backref_dirid || ref_seq != backref_seq || + ref_namelen != backref_namelen || + strncmp(ref_name, backref_name, len)) { + err |= ROOT_REF_MISMATCH; + error("%s[%llu %llu] mismatch relative ref", + ref_key->type == BTRFS_ROOT_REF_KEY ? + "ROOT_REF" : "ROOT_BACKREF", + ref_key->objectid, ref_key->offset); + } +out: + btrfs_free_path(path); + return err; +} + static int all_backpointers_checked(struct extent_record *rec, int print_errs) { struct list_head *cur = rec->backrefs.next; -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
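The lookup of the "relative ref" relies on a small arithmetic trick: adding the two key types and subtracting the current one yields the opposite type, while objectid and offset trade places. A hedged sketch follows, with the key type values (144 and 156) written out as I believe they are defined in ctree.h, and with struct key_sketch as a made-up stand-in for struct btrfs_key:

```c
#include <assert.h>
#include <stdint.h>

/* Key type values, assumed to match BTRFS_ROOT_BACKREF_KEY / BTRFS_ROOT_REF_KEY. */
#define ROOT_BACKREF_KEY 144
#define ROOT_REF_KEY     156

/* Simplified, hypothetical stand-in for struct btrfs_key. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Build the key of the item that should mirror a ROOT_REF or ROOT_BACKREF. */
static struct key_sketch mirror_root_ref_key(struct key_sketch k)
{
	struct key_sketch m;

	assert(k.type == ROOT_REF_KEY || k.type == ROOT_BACKREF_KEY);
	m.objectid = k.offset;
	/* REF becomes BACKREF and vice versa. */
	m.type = (uint8_t)(ROOT_BACKREF_KEY + ROOT_REF_KEY - k.type);
	m.offset = k.objectid;
	return m;
}
```

Applying the function twice returns the original key, which is why the same check_root_ref() code can be entered from either side of the pair.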
[PATCH v2 09/14] btrfs-progs: check: introduce function to check fs root
From: Lu Fengqi Introduce a new function check_fs_root_v2() to check fs root, and call check_inode_item to check the items in the tree. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 76 1 file changed, 76 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index 5e3ecac..6d3c6a8 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -3862,6 +3862,7 @@ out: #define NBYTES_ERROR (1<<11) /* INODE_ITEM nbytes count error */ #define ISIZE_ERROR(1<<12) /* INODE_ITEM size count error */ #define ORPHAN_ITEM(1<<13) /* INODE_ITEM no reference */ +#define NO_INODE_ITEM (1<<14) /* no inode_item */ #define LAST_ITEM (1<<15) /* Complete this tree traversal */ /* @@ -4681,6 +4682,81 @@ out: return err; } +/* + * Iterate all item on the tree and call check_inode_item() to check. + * + * @root: the root of the tree to be checked. + * @ext_ref: the EXTENDED_IREF feature + * + * Return 0 if no error found. + * Return <0 for error. + * All internal error bitmap will be converted to -EIO, to avoid + * mixing negative and postive return value. + */ +static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref) +{ + struct btrfs_path *path; + struct btrfs_key key; + u64 inode_id; + int ret, err = 0; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + key.objectid = 0; + key.type = 0; + key.offset = 0; + + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (ret < 0) + goto out; + + while (1) { + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); + + /* +* All check must start with inode item, skip if not +*/ + if (key.type == BTRFS_INODE_ITEM_KEY) { + ret = check_inode_item(root, path, ext_ref); + err |= ret; + if (err & LAST_ITEM) + goto out; + continue; + } + error("root %llu ITEM[%llu %u %llu] isn't INODE_ITEM, skip to next inode", + root->objectid, key.objectid, key.type, + key.offset); + + err |= NO_INODE_ITEM; + inode_id = key.objectid; + + /* +* skip to next inode +* TODO: Maybe search_slot() will be faster? 
+*/ + do { + ret = btrfs_next_item(root, path); + if (ret > 0) { + goto out; + } else if (ret < 0) { + err = ret; + goto out; + } + btrfs_item_key_to_cpu(path->nodes[0], &key, + path->slots[0]); + } while (inode_id == key.objectid); + } + +out: + err &= ~LAST_ITEM; + if (err && !ret) + ret = -EIO; + btrfs_free_path(path); + return ret; +} + static int all_backpointers_checked(struct extent_record *rec, int print_errs) { struct list_head *cur = rec->backrefs.next; -- 2.10.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
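The return convention described in the function comment (positive error bitmap internally, 0 or a negative errno for the caller, LAST_ITEM never treated as an error) can be sketched on its own. The helper name finalize_check_result() is hypothetical, and the sketch assumes 'ret' only ever carries 0 or a negative errno at this point:

```c
#include <assert.h>
#include <errno.h>

#define LAST_ITEM_BIT (1 << 15)	/* traversal finished; not an error */

/*
 * Fold a positive error bitmap into the errno-style return value.
 * 'ret' is assumed to be 0 or a negative errno from a tree operation.
 */
static int finalize_check_result(int ret, int err)
{
	assert(ret <= 0);

	err &= ~LAST_ITEM_BIT;		/* drop the end-of-traversal marker */
	if (err && !ret)
		ret = -EIO;		/* any remaining bit means corruption */
	return ret;
}
```

Keeping the two value spaces apart this way avoids the ambiguity the comment warns about, where a positive bitmap could be mistaken for a btrfs_next_item() style "no more items" result.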
[PATCH v2 08/14] btrfs-progs: check: introduce function to check inode item
From: Lu Fengqi Introduce a new function check_inode_item() to check INODE_ITEM and related ITEMs that have the same inode id. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 168 +++ 1 file changed, 168 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index 0afbf96..5e3ecac 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -3858,6 +3858,11 @@ out: #define FILE_EXTENT_ERROR (1<<7) /* bad file extent */ #define ODD_CSUM_ITEM (1<<8) /* CSUM_ITEM ERROR */ #define CSUM_ITEM_MISSING (1<<9) /* CSUM_ITEM not found */ +#define LINK_COUNT_ERROR (1<<10) /* INODE_ITEM nlink count error */ +#define NBYTES_ERROR (1<<11) /* INODE_ITEM nbytes count error */ +#define ISIZE_ERROR(1<<12) /* INODE_ITEM size count error */ +#define ORPHAN_ITEM(1<<13) /* INODE_ITEM no reference */ +#define LAST_ITEM (1<<15) /* Complete this tree traversal */ /* * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified @@ -4513,6 +4518,169 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey, return err; } +/* + * Check INODE_ITEM and related ITEMs(the same inode id) + * 1. check link count + * 2. check inode ref/extref + * 3. check dir item/index + * + * @ext_ref: the EXTENDED_IREF feature + * + * Return 0 if no error occurred. 
+ * Return >0 for error or hit the traversal is done(by error bitmap) + */ +static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path, + unsigned int ext_ref) +{ + struct extent_buffer *node; + struct btrfs_inode_item *ii; + struct btrfs_key key; + u64 inode_id; + u32 mode; + u64 nlink; + u64 nbytes; + u64 isize; + u64 size = 0; + u64 refs = 0; + u64 extent_end = 0; + u64 extent_size = 0; + unsigned int dir; + unsigned int nodatasum; + int slot; + int ret; + int err = 0; + + node = path->nodes[0]; + slot = path->slots[0]; + + btrfs_item_key_to_cpu(node, &key, slot); + inode_id = key.objectid; + + if (inode_id == BTRFS_ORPHAN_OBJECTID) { + ret = btrfs_next_item(root, path); + if (ret > 0) + err |= LAST_ITEM; + return err; + } + + ii = btrfs_item_ptr(node, slot, struct btrfs_inode_item); + isize = btrfs_inode_size(node, ii); + nbytes = btrfs_inode_nbytes(node, ii); + mode = btrfs_inode_mode(node, ii); + dir = imode_to_type(mode) == BTRFS_FT_DIR; + nlink = btrfs_inode_nlink(node, ii); + nodatasum = btrfs_inode_flags(node, ii) & BTRFS_INODE_NODATASUM; + + while (1) { + ret = btrfs_next_item(root, path); + if (ret < 0) { + /* out will fill 'err' rusing current statistics */ + goto out; + } else if (ret > 0) { + err |= LAST_ITEM; + goto out; + } + + node = path->nodes[0]; + slot = path->slots[0]; + btrfs_item_key_to_cpu(node, &key, slot); + if (key.objectid != inode_id) + goto out; + + switch (key.type) { + case BTRFS_INODE_REF_KEY: + ret = check_inode_ref(root, &key, node, slot, &refs, + mode); + err |= ret; + break; + case BTRFS_INODE_EXTREF_KEY: + if (key.type == BTRFS_INODE_EXTREF_KEY && !ext_ref) + warning("root %llu EXTREF[%llu %llu] isn't supported", + root->objectid, key.objectid, + key.offset); + ret = check_inode_extref(root, &key, node, slot, &refs, +mode); + err |= ret; + break; + case BTRFS_DIR_ITEM_KEY: + case BTRFS_DIR_INDEX_KEY: + if (!dir) { + warning("root %llu INODE[%llu] mode %u shouldn't have DIR_INDEX[%llu %llu]", + 
root->objectid, inode_id, + imode_to_type(mode), key.objectid, + key.offset); + } + ret = check_dir_item(root, &key, node, slot, &size, +ext_ref); + err |= ret; + break; + case BTRFS_EXTENT_DATA_KEY: + if (dir) { + warning("root %llu DIR INODE[%llu] shouldn't EXTENT_DATA[%llu %llu]", +
[PATCH v2 07/14] btrfs-progs: check: introduce function to check file extent
From: Lu Fengqi Introduce a new function check_file_extent() to check file extent, such as datasum, hole, size. Signed-off-by: Lu Fengqi Signed-off-by: Qu Wenruo --- cmds-check.c | 97 1 file changed, 97 insertions(+) diff --git a/cmds-check.c b/cmds-check.c index d39a81c..0afbf96 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -3855,6 +3855,9 @@ out: #define INODE_REF_MISSING (1<<4) /* INODE_REF/INODE_EXTREF not found */ #define INODE_ITEM_MISSING (1<<5) /* INODE_ITEM not found */ #define INODE_ITEM_MISMATCH(1<<6) /* INODE_ITEM found but not match */ +#define FILE_EXTENT_ERROR (1<<7) /* bad file extent */ +#define ODD_CSUM_ITEM (1<<8) /* CSUM_ITEM ERROR */ +#define CSUM_ITEM_MISSING (1<<9) /* CSUM_ITEM not found */ /* * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified @@ -4416,6 +4419,100 @@ next: return err; } +/* + * Check file extent datasum/hole, update the size of the file extents, + * check and update the last offset of the file extent. + * + * @root: the root of fs/file tree. + * @fkey: the key of the file extent. + * @nodatasum: INODE_NODATASUM feature. + * @size: the sum of all EXTENT_DATA items size for this inode. + * @end: the offset of the last extent. + * + * Return 0 if no error occurred. 
+ */ +static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey, +struct extent_buffer *node, int slot, +unsigned int nodatasum, u64 *size, u64 *end) +{ + struct btrfs_file_extent_item *fi; + u64 disk_bytenr; + u64 disk_num_bytes; + u64 extent_num_bytes; + u64 found; + unsigned int extent_type; + unsigned int is_hole; + int ret; + int err = 0; + + fi = btrfs_item_ptr(node, slot, struct btrfs_file_extent_item); + + extent_type = btrfs_file_extent_type(node, fi); + /* Skip if file extent is inline */ + if (extent_type == BTRFS_FILE_EXTENT_INLINE) { + struct btrfs_item *e = btrfs_item_nr(slot); + u32 item_inline_len; + + item_inline_len = btrfs_file_extent_inline_item_len(node, e); + extent_num_bytes = btrfs_file_extent_inline_len(node, slot, fi); + if (extent_num_bytes == 0 || + extent_num_bytes != item_inline_len) + err |= FILE_EXTENT_ERROR; + *size += extent_num_bytes; + return err; + } + + /* Check extent type */ + if (extent_type != BTRFS_FILE_EXTENT_REG && + extent_type != BTRFS_FILE_EXTENT_PREALLOC) { + err |= FILE_EXTENT_ERROR; + error("root %llu EXTENT_DATA[%llu %llu] type bad", + root->objectid, fkey->objectid, fkey->offset); + return err; + } + + /* Check REG_EXTENT/PREALLOC_EXTENT */ + disk_bytenr = btrfs_file_extent_disk_bytenr(node, fi); + disk_num_bytes = btrfs_file_extent_disk_num_bytes(node, fi); + extent_num_bytes = btrfs_file_extent_num_bytes(node, fi); + is_hole = (disk_bytenr == 0) && (disk_num_bytes == 0); + + /* Check EXTENT_DATA datasum */ + ret = count_csum_range(root, disk_bytenr, disk_num_bytes, &found); + if (found > 0 && nodatasum) { + err |= ODD_CSUM_ITEM; + error("root %llu EXTENT_DATA[%llu %llu] nodatasum shouldn't have datasum", + root->objectid, fkey->objectid, fkey->offset); + } else if (extent_type == BTRFS_FILE_EXTENT_REG && !nodatasum && + !is_hole && + (ret < 0 || found == 0 || found < disk_num_bytes)) { + err |= CSUM_ITEM_MISSING; + error("root %llu EXTENT_DATA[%llu %llu] datasum missing", + root->objectid, 
fkey->objectid, fkey->offset); + } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC && found > 0) { + err |= ODD_CSUM_ITEM; + error("root %llu EXTENT_DATA[%llu %llu] prealloc shouldn't have datasum", + root->objectid, fkey->objectid, fkey->offset); + } + + /* Check EXTENT_DATA hole */ + if (no_holes && is_hole) { + err |= FILE_EXTENT_ERROR; + error("root %llu EXTENT_DATA[%llu %llu] shouldn't be hole", + root->objectid, fkey->objectid, fkey->offset); + } else if (!no_holes && *end != fkey->offset) { + err |= FILE_EXTENT_ERROR; + error("root %llu EXTENT_DATA[%llu %llu] interrupt", + root->objectid, fkey->objectid, fkey->offset); + } + + *end += extent_num_bytes; + if (!is_hole) + *size += extent_num_bytes; + + return err; +} + static int all_backpointers_checked(struct extent_record *rec, int print_errs) { s
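The datasum part of check_file_extent() is essentially a three-way decision. The sketch below restates the conditions from the diff as a pure function so they can be read and tested in isolation; classify_csum() and enum extent_kind are illustrative names, not part of btrfs-progs.

```c
#include <assert.h>
#include <stdint.h>

#define ODD_CSUM_ITEM     (1 << 8)	/* checksum present where none is expected */
#define CSUM_ITEM_MISSING (1 << 9)	/* checksum absent or incomplete */

enum extent_kind { EXTENT_REG, EXTENT_PREALLOC };

/*
 * 'found' is the number of bytes covered by checksums in the extent's
 * on-disk range, 'lookup_ret' the return value of the checksum lookup.
 */
static int classify_csum(enum extent_kind kind, int nodatasum, int is_hole,
			 int lookup_ret, uint64_t found,
			 uint64_t disk_num_bytes)
{
	assert(kind == EXTENT_REG || kind == EXTENT_PREALLOC);

	if (found > 0 && nodatasum)
		return ODD_CSUM_ITEM;
	if (kind == EXTENT_REG && !nodatasum && !is_hole &&
	    (lookup_ret < 0 || found == 0 || found < disk_num_bytes))
		return CSUM_ITEM_MISSING;
	if (kind == EXTENT_PREALLOC && found > 0)
		return ODD_CSUM_ITEM;
	return 0;
}
```

Holes and NODATASUM inodes are allowed to have no checksums at all, which is why only regular, non-hole extents on checksummed inodes can raise CSUM_ITEM_MISSING.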
Re: how to understand "btrfs fi show" output? "No space left" issues
On Tue, Sep 20, 2016 at 12:47 AM, Tomasz Chmielewski wrote:
> How to understand the following "btrfs fi show" output?
>
> # btrfs fi show /var/lib/lxd
> Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
>         Total devices 2 FS bytes used 136.18GiB
>         devid    1 size 423.13GiB used 423.13GiB path /dev/sda3
>         devid    2 size 423.13GiB used 423.13GiB path /dev/sdb3
>
> Why is it "size 423.13GiB used 423.13GiB"? Is it full?
>
> I had "No space left" on this filesystem just yesterday (running kernel
> 4.7.4). This is btrfs RAID-1 on SSD disks. This filesystem is used for 20-30
> LXD containers with different roles (mongo, mysql, postgres databases,
> webservers etc.), around 150 read-only snapshots, btrfs compression is
> disabled.
>
>
> Both "btrfs fi df" and "df -h" show plenty of space:
>
> # btrfs fi df /var/lib/lxd
> Data, RAID1: total=417.12GiB, used=131.33GiB
> System, RAID1: total=8.00MiB, used=80.00KiB
> Metadata, RAID1: total=6.00GiB, used=4.86GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> # df -h
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sda3       424G  137G  286G  33% /var/lib/lxd

I'm coming into this late and realize most questions have been answered. But I take the position that this is a bug: clearly there's enough space when df reports only 33% used, and therefore it's important to gather information about the file system in its current state so the devs can make decisions. Manually running balance is the correct workaround, but it's bad UX and should not be necessary (even though it's known to sometimes be necessary).

Anyway, in this case there is room in all chunks and GlobalReserve used is 0.00B. Metadata has a bit over a gigabyte of unused space in its allocated block groups. So at the moment I'm thinking it's a bug.

The two things that'd be useful if you can reproduce this problem at some point, by NOT trying to prevent it again, are:

grep . -IR /sys/fs/btrfs/<UUID>/allocation/

Pick the UUID for the affected fs volume.

btrfs-debugfs, found in btrfs-progs upstream as a Python program but typically not packaged by distros:
https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs

It takes the form:

sudo ./btrfs-debugfs -b

It'll show you the percentage of each block group that is actually in use, so you can have a good idea what -dusage value to use (in your case) to free up space. That should help, but ultimately it's a workaround, not a real fix. There shouldn't be ENOSPC anyway.

So if it happens again, first capture the above two bits of information, and then if you feel like testing kernel 4.8-rc7, do that. It has a massive pile of ENOSPC-related rework and I bet Josef would like to know if the problem reproduces with that kernel. As in, just change kernels, don't try to fix it with balance first.

-- 
Chris Murphy
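The numbers quoted in this thread can be put side by side with a little arithmetic. The sketch below is not a btrfs tool; the two helper names are made up, and the inputs are simply the figures from the "btrfs fi show" and "btrfs fi df" output above. It separates the two notions that get confused: space allocated to chunks on a device, and space actually used inside those chunks.

```c
#include <assert.h>

/* Device space not yet assigned to any chunk: "size" minus "used" in btrfs fi show. */
static double unallocated_gib(double dev_size_gib, double dev_allocated_gib)
{
	assert(dev_allocated_gib <= dev_size_gib);
	return dev_size_gib - dev_allocated_gib;
}

/* Share of allocated chunk space that holds data: "used" over "total" in btrfs fi df. */
static double chunk_usage_percent(double total_gib, double used_gib)
{
	assert(total_gib > 0.0);
	return 100.0 * used_gib / total_gib;
}
```

Fed with the values above, unallocated_gib(423.13, 423.13) is zero on both devices, while chunk_usage_percent(417.12, 131.33) comes out between 31 and 32 percent for data and chunk_usage_percent(6.00, 4.86) at about 81 percent for metadata. So every byte of the devices is allocated, yet the data chunks are mostly empty, which is the situation a balance with a -dusage filter is meant to relieve.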
Re: Can't mount btrfs raid1
On Tue, Sep 20, 2016 at 5:16 PM, Mirak M wrote:
> Hello,
>
> I have a failure when mounting btrfs.
>
>> mount -oro,recovery /dev/sda2 sda2_btrfs
>> mount: /dev/sda2: can't read superblock

What do you get for 'btrfs super-recover -v ' and 'btrfs check '?

For this purpose any 4.4+ version is probably OK, except 4.7 and 4.7.1, which might spit out some bogus items (it's just noise; it won't hurt anything as long as you don't use --repair).

>
> The kernel log is here http://pastebin.com/tHihHT92 and at the bottom
> of the email
>
> I must admit I did the error of running btrfs check --repair at some
> point, not knowing this was not a good idea.
>
> I run ubuntu 16.04 with kernel 4.4.0-36-generic .

OK, what version of btrfs-progs? What was the output from btrfs check?

> [ 1692.712574] BTRFS critical (device sda2): corrupt leaf, bad key
> order: block=1957998690304,root=1, slot=29
> [ 1692.712819] BTRFS critical (device sda2): corrupt leaf, bad key
> order: block=1957998690304,root=1, slot=29

List archives suggest this might be due to bad RAM. I also see there are some bugs that can cause it, but I'm not finding any post-4.4 kernel patches for this (there are a metric f tonne of changes since 4.4). I would suggest kernel 4.4.21 if you need to stick with a long-term kernel; I have no idea what 4.4.0-36 translates into.

> [ 1692.713963] BTRFS: error (device sda2) in btrfs_replay_log:2401:
> errno=-5 IO failure (Failed to recover log tree)

This is kinda curious. Was there a crash or power failure?

-- 
Chris Murphy
Can't mount btrfs raid1
Hello, I have a failure when mounting btrfs. > mount -oro,recovery /dev/sda2 sda2_btrfs > mount: /dev/sda2: can't read superblock The kernel log is here http://pastebin.com/tHihHT92 and at the bottom of the email I must admit I did the error of running btrfs check --repair at some point, not knowing this was not a good idea. I run ubuntu 16.04 with kernel 4.4.0-36-generic . Regards, Mirak [ 1685.255619] BTRFS info (device sda2): enabling auto recovery [ 1685.255626] BTRFS info (device sda2): disk space caching is enabled [ 1685.255628] BTRFS: has skinny extents [ 1692.712574] BTRFS critical (device sda2): corrupt leaf, bad key order: block=1957998690304,root=1, slot=29 [ 1692.712819] BTRFS critical (device sda2): corrupt leaf, bad key order: block=1957998690304,root=1, slot=29 [ 1692.712827] [ cut here ] [ 1692.712865] WARNING: CPU: 3 PID: 6305 at /build/linux-a2WvEb/linux-4.4.0/fs/btrfs/extent-tree.c:6552 __btrfs_free_extent.isra.70+0x2e6/0xd30 [btrfs]() [ 1692.712867] BTRFS: Transaction aborted (error -5) [ 1692.712868] Modules linked in: nvram msr joydev input_leds ir_xmp_decoder ir_lirc_codec ir_mce_kbd_decoder ir_sharp_decoder ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder ir_nec_decoder ir_rc5_decoder rc_rc6_mce mceusb lirc_dev rc_core snd_hda_codec_realtek snd_hda_codec_generic binfmt_misc coretemp kvm_intel kvm irqbypass snd_hda_codec_hdmi snd_hda_intel serio_raw snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd shpchp soundcore 8250_fintek i2c_nforce2 mac_hid parport_pc ppdev lp parport autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log hid_logitech ff_memless pata_acpi hid_logitech_hidpp [ 1692.712936] hid_logitech_dj usbhid hid uas usb_storage amdkfd amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt 
firewire_ohci pata_jmicron fb_sys_fops psmouse firewire_core drm forcedeth crc_itu_t ahci libahci video wmi fjes [ 1692.712960] CPU: 3 PID: 6305 Comm: mount Tainted: P OE 4.4.0-36-generic #55-Ubuntu [ 1692.712963] Hardware name: Gigabyte Technology Co., Ltd. GA-E7AUM-DS2H/GA-E7AUM-DS2H, BIOS F2 12/17/2008 [ 1692.712965] 0286 b16cde4b 880098c3f688 813f13b3 [ 1692.712968] 880098c3f6d0 c04b5468 880098c3f6c0 810810f2 [ 1692.712972] 01c81f86 fffb 8801756c2000 [ 1692.712975] Call Trace: [ 1692.712983] [] dump_stack+0x63/0x90 [ 1692.712988] [] warn_slowpath_common+0x82/0xc0 [ 1692.712991] [] warn_slowpath_fmt+0x5c/0x80 [ 1692.713009] [] __btrfs_free_extent.isra.70+0x2e6/0xd30 [btrfs] [ 1692.713031] [] ? btrfs_merge_delayed_refs+0x66/0x650 [btrfs] [ 1692.713050] [] __btrfs_run_delayed_refs+0xaab/0x11f0 [btrfs] [ 1692.713068] [] btrfs_run_delayed_refs+0x7d/0x2a0 [btrfs] [ 1692.713085] [] ? btrfs_set_path_blocking+0x3f/0x70 [btrfs] [ 1692.713105] [] btrfs_commit_transaction+0x56/0xa90 [btrfs] [ 1692.713110] [] ? kmem_cache_free+0x1d4/0x1e0 [ 1692.713132] [] btrfs_recover_log_trees+0x3e7/0x480 [btrfs] [ 1692.713155] [] ? replay_one_extent+0x6c0/0x6c0 [btrfs] [ 1692.713175] [] open_ctree+0x1a5c/0x2460 [btrfs] [ 1692.713192] [] btrfs_mount+0x944/0xa60 [btrfs] [ 1692.713196] [] ? find_next_bit+0x15/0x20 [ 1692.713200] [] ? pcpu_alloc+0x37f/0x640 [ 1692.713203] [] mount_fs+0x38/0x160 [ 1692.713206] [] ? __alloc_percpu+0x15/0x20 [ 1692.713209] [] vfs_kern_mount+0x67/0x110 [ 1692.713226] [] btrfs_mount+0x1df/0xa60 [btrfs] [ 1692.713228] [] ? pcpu_alloc+0x37f/0x640 [ 1692.713231] [] mount_fs+0x38/0x160 [ 1692.713233] [] ? 
__alloc_percpu+0x15/0x20 [ 1692.713236] [] vfs_kern_mount+0x67/0x110 [ 1692.713239] [] do_mount+0x269/0xde0 [ 1692.713242] [] SyS_mount+0x9f/0x100 [ 1692.713246] [] entry_SYSCALL_64_fastpath+0x16/0x71 [ 1692.713249] ---[ end trace e6d60ad04bc3178e ]--- [ 1692.713252] BTRFS: error (device sda2) in __btrfs_free_extent:6552: errno=-5 IO failure [ 1692.713257] BTRFS: error (device sda2) in btrfs_run_delayed_refs:2927: errno=-5 IO failure [ 1692.713950] pending csums is 4096 [ 1692.713963] BTRFS: error (device sda2) in btrfs_replay_log:2401: errno=-5 IO failure (Failed to recover log tree) [ 1692.713994] BTRFS error (device sda2): cleaner transaction attach returned -30 [ 1692.760459] BTRFS: open_ctree failed -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ChaCha20 vs. AES performance
kent.overstr...@gmail.com (Kent Overstreet) writes: > On Tue, Sep 20, 2016 at 10:23:20AM -0400, Theodore Ts'o wrote: >> On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote: >> > Not on the list or I would've replied directly, but on Haswell, ChaCha20 >> > (in >> > software) is over 2x as fast as AES (in hardware), at realistic (for a >> > filesystem) block sizes: >> >> On Skylake and Broadwell processors, AES is faster (the posting is >> from a ChaCha20 enthusiast): >> >> https://blog.cloudflare.com/it-takes-two-to-chacha-poly/ > > The performance delta in his graphs isn't near as big as what I've measured, > which makes me suspect OpenSSL's ChaCha20 implementation isn't nearly as fast > as > the kernel's. The other thing to keep in mind is this (aka what's true for a big intel cpu isn't true everywhere): "The new cipher suites are fast. As Adam Langley described, ChaCha20-Poly1305 is three times faster than AES-128-GCM on mobile devices. Spending less time on decryption means faster page rendering and better battery life." https://blog.cloudflare.com/do-the-chacha-better-mobile-performance-with-cryptography/ The argument made by Bernstein is in a nutshell than "CPUs are optimized for video games and thus ciphers should use the same instructions which makes games 'faster'" (I'd recommend to read his whole email to understand what he means): https://moderncrypto.org/mail-archive/noise/2016/000699.html ) Or as one person commented on the net https://news.ycombinator.com/item?id=12264321 : Bernstein agrees with you. His point isn't that it's dumb that CPUs are optimized for games. It's that cipher designers should have enough awareness of trends in CPU development to design ciphers that take advantage of the same features that games do. That's what he did with Salsa/ChaCha. 
*His subtext is that over the medium term he believes his ciphers will outperform AES, despite AES having AES-NI hardware support.* (emphasis mine) -- Mathieu Chouquet-Stringer The sun itself sees not till heaven clears. -- William Shakespeare --
Re: splat in split_leaf with integrity checking
On Mon, Sep 19, 2016 at 07:21:47PM -0700, Liu Bo wrote:
> On Tue, Sep 20, 2016 at 03:39:27AM +0200, Adam Borowski wrote:
> > Hi!
> > I just had the following splat in 4.8-rc6 for the third time in a week:
>
> Sorry for the trouble, this is caused by my patch and here are two fixes[1]
> to get it right with integrity check (not sure if they've been queued yet).
>
> Here is a discussion that explains why we remove it[2].
>
> [1] https://patchwork.kernel.org/patch/9320077/
>     https://patchwork.kernel.org/patch/9311541/
>
> [2] http://www.spinics.net/lists/linux-btrfs/msg58506.html

Thanks! I should have noticed these posts and not bothered you; anyway, I've applied the patches and stress-tested during the day, just in case they break something.

All seems to work fine now -- and as I had a splat just after posting this when copying that big file, and another during the kernel compile, the bug would likely have triggered by now.

Meow!
--
Second "wet cat laying down on a powered-on box-less SoC on the desk" close shave in a week. Protect your ARMs, folks!
Re: multi-device btrfs with single data mode and disk failure
On Tue, Sep 20, 2016 at 2:18 PM, Alexandre Poux wrote:
>
>
> On 20/09/2016 at 21:46, Chris Murphy wrote:
>> On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux wrote:
>>>
>>> On 20/09/2016 at 21:11, Chris Murphy wrote:
>>>> And no backup? Umm, I'd resolve that sooner than anything else.
>>> Yeah you are absolutely right, this was a temporary solution which came
>>> to be not that temporary.
>>> And I regret it already...
>> Well on the bright side, if this were LVM or mdadm linear/concat
>> array, the whole thing would be toast because any other file system
>> would have lost too much fs metadata on the missing device.
>>
>>>> It should be true that it'll tolerate a read only mount indefinitely,
>>>> but read write? Not sure. This sort of edge case isn't well tested at
>>>> all seeing as it required changing the kernel to reduce safe guards.
>>>> So all bets are off the whole thing could become unmountable, not
>>>> even read only, and then it's a scraping job.
>>> I'm not that crazy, I tried the patch inside a virtual machine on
>>> virtual drives...
>>> And since it's only virtual, it may not work on the real partition...
>> Are you sure the virtual setup lacked a CHUNK_ITEM on the missing
>> device? That might be what pinned it in that case.
> In fact in my virtual setup there was more chunk missing (1 metadata 1
> System and 1 Data).
> I will try to do a setup closer to my real one.

Probably the reason why that missing device has no used chunks is that it's so small. Btrfs allocates block groups to devices with the most unallocated space first. Only once the unallocated space is (approximately) even on all devices would it allocate a block group to the small device.

--
Chris Murphy
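The allocation behaviour described in this message can be illustrated with a small userspace sketch. This is not btrfs code: `struct dev`, `pick_device()` and `alloc_block_groups()`, the device sizes and the block group size are all assumptions made up for illustration, and real chunk allocation has many more cases (profiles, striping, minimum sizes). It only shows why a small device can stay completely unused while larger devices still have more unallocated space.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one device in a "single" data profile array. */
struct dev {
	unsigned long long total;	/* device size, in GiB */
	unsigned long long allocated;	/* GiB already handed to block groups */
};

/*
 * Return the index of the device with the most unallocated space that
 * can still hold a block group of bg_size, or -1 if none can.
 * Ties go to the lowest index.
 */
static int pick_device(const struct dev *devs, size_t n,
		       unsigned long long bg_size)
{
	int best = -1;
	unsigned long long best_free = 0;

	assert(devs != NULL);
	for (size_t i = 0; i < n; i++) {
		unsigned long long free_space = devs[i].total - devs[i].allocated;

		if (free_space >= bg_size && free_space > best_free) {
			best_free = free_space;
			best = (int)i;
		}
	}
	return best;
}

/* Try to place count block groups; return how many were placed. */
static size_t alloc_block_groups(struct dev *devs, size_t n, size_t count,
				 unsigned long long bg_size)
{
	size_t placed = 0;

	while (placed < count) {
		int i = pick_device(devs, n, bg_size);

		if (i < 0)
			break;	/* no device has room left */
		devs[i].allocated += bg_size;
		placed++;
	}
	return placed;
}
```

With two 100 GiB devices and one 10 GiB device, placing fifty 1 GiB block groups in this model leaves the small device untouched, which matches the situation in this thread where the missing device carried no chunks.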
Re: multi-device btrfs with single data mode and disk failure
On 20/09/2016 at 22:18, Alexandre Poux wrote:
>
> On 20/09/2016 at 21:46, Chris Murphy wrote:
>> On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux wrote:
>>> On 20/09/2016 at 21:11, Chris Murphy wrote:
>>>> And no backup? Umm, I'd resolve that sooner than anything else.
>>> Yeah you are absolutely right, this was a temporary solution which came
>>> to be not that temporary.
>>> And I regret it already...
>> Well on the bright side, if this were LVM or mdadm linear/concat
>> array, the whole thing would be toast because any other file system
>> would have lost too much fs metadata on the missing device.
>>
>>>> It should be true that it'll tolerate a read only mount indefinitely,
>>>> but read write? Not sure. This sort of edge case isn't well tested at
>>>> all seeing as it required changing the kernel to reduce safe guards.
>>>> So all bets are off the whole thing could become unmountable, not
>>>> even read only, and then it's a scraping job.
>>> I'm not that crazy, I tried the patch inside a virtual machine on
>>> virtual drives...
>>> And since it's only virtual, it may not work on the real partition...
>> Are you sure the virtual setup lacked a CHUNK_ITEM on the missing
>> device? That might be what pinned it in that case.
> In fact in my virtual setup there was more chunk missing (1 metadata 1
> System and 1 Data).
> I will try to do a setup closer to my real one.

Good news: I ran a test where, in my virtual setup, the missing device held no chunks at all, and in this case there is no problem removing it!

What I did:
- make an array with 6 disks (data single, metadata raid1)
- dd if=/dev/zero of=/mnt/somefile bs=64M count=16 # make a 1G file
- use btrfs-debug-tree to identify which device was not used
- shut down the VM, remove this virtual device, and restart the VM
- mount the array degraded but read-write, thanks to the patched kernel
- btrfs device delete missing /mnt
- and voilà!

I will try with something other than /dev/zero, but this is very encouraging.
Do you think that my test is too trivial? Should I try something else before trying on the real partition with the overlay?

>> You could try some sort of overlay for your remaining drives.
>> Something like this:
>> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
>>
>> Make sure you understand the gotcha about cloning which applies here:
>> https://btrfs.wiki.kernel.org/index.php/Gotchas
>>
>> I think it's safe to use blockdev --setro on every real device you're
>> trying to protect from changes. And when mounting you'll at least need
>> to use device= mount option to explicitly mount each of the overlay
>> devices. Based on the wiki, I'm wincing, I don't really know for sure
>> if device mount option is enough to compel Btrfs to only use those
>> devices and not go off the rails and still use one of the real
>> devices, but at least if they're setro it won't matter (the mount will
>> just fail somehow due to write failures).
>>
>> So now you can try removing the missing device... and see what
>> happens. You could inspect the overlay files and see what changes were
>> made.
> Wow that looks like nice.
> So, if it work, and if we find a way to fix the filesystem inside the vm,
> I can use this over the real partion to check if it works before trying
> the fix for real.
> Nice idea.
>>>> What do you get for btrfs-debug-tree -t 3
>>>>
>>>> That should show the chunk tree, and what I'm wondering if if the
>>>> chunk tree has any references to chunks on the missing device. Even if
>>>> there are no extents on that device, if there are chunks, that might
>>>> be one of the safeguards.
>>> You'll find it attached.
>>> The missing device is the devid 8 (since it's the only one missing in
>>> btrfs fi show)
>>> I found it only once line 63
>> Yeah bummer. Not used for system, data, or metadata chunks at all.
Re: multi-device btrfs with single data mode and disk failure
On 20/09/16 19:53, Alexandre Poux wrote:
> As for moving data to an another volume, since it's only data and
> nothing fancy (no subvolume or anything), a simple rsync would do the trick.
> My problem in this case is that I don't have enough available space
> elsewhere to move my data.
> That's why I'm trying this hard to recover the partition...

I am sure you have already thought about this, but... it might be easier, and maybe even faster, to back up the data to a cloud server, then recreate the filesystem and download the data again.

Backblaze B2 is very cheap for upload and storage (don't know about download charges, though). And rclone works well to handle rsync-style copies (although you might want to use tar or dar if you need to preserve file attributes).

And if that works, rclone + B2 might make a reasonable offsite backup solution for the future!

Graham
[PATCH 0/4][V3] metadata throttling in writeback patches
This is the latest set of patches based on my conversations with Jan and Johannes. The biggest change has been changing the metadata accounting counters to be in bytes instead of pages in order to better support varying blocksizes. I've also stopped messing with the other pagecache related counters so we can keep them truly separate. Johannes suggested this change and I simply convert the bytes counter to pages when calculating dirty limits and such.

The other big change is changing WB_WRITTEN/WB_DIRTIED to be in bytes instead of pages as well. This is just a name and accounting change, it doesn't really change the core logic at all.

I'm sending this out ahead of my full battery of tests, but I want to get feedback on this direction as soon as possible. In the meantime I've changed my btrfs specific patches to work with these patches and am running long running tests now to verify everything still works. Thanks,

Josef
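As a rough illustration of the bytes-versus-pages point in this cover letter, here is a minimal userspace sketch. It is not taken from the patches: the `SKETCH_*` macros, the 4 KiB page size and the function names are assumptions for illustration only (the series itself uses shift-based helpers such as `BtoP()` and `BtoK()`). It shows how a byte-granular metadata counter can be folded into page-granular dirty accounting, which matters when the metadata block size is smaller than a page.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB pages. The kernel's PAGE_SHIFT is per-architecture. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

static_assert(SKETCH_PAGE_SIZE == 4096UL, "sketch assumes 4 KiB pages");

/* Bytes to whole pages, rounding down. */
static unsigned long bytes_to_pages(unsigned long bytes)
{
	return bytes >> SKETCH_PAGE_SHIFT;
}

/* Bytes to KiB, e.g. for printing a byte counter in a "kB" column. */
static unsigned long bytes_to_kb(unsigned long bytes)
{
	return bytes >> 10;
}

/*
 * Hypothetical combined dirty total, in pages: data is counted in pages,
 * metadata is counted in bytes and converted only at this point.
 */
static unsigned long total_dirty_pages(unsigned long file_dirty_pages,
				       unsigned long metadata_dirty_bytes)
{
	return file_dirty_pages + bytes_to_pages(metadata_dirty_bytes);
}
```

For example, four dirty 1 KiB metadata blocks add up to 4096 bytes, which this sketch reports as one page; counting each block as a page would have reported four.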
[PATCH 3/4] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes
These are counters that constantly go up in order to do bandwidth calculations. It isn't important what the units are in, as long as they are consistent between the two of them, so convert them to count bytes written/dirtied, and allow the metadata accounting stuff to change the counters as well. Signed-off-by: Josef Bacik --- fs/fuse/file.c | 4 ++-- include/linux/backing-dev-defs.h | 4 ++-- include/linux/backing-dev.h | 2 +- mm/backing-dev.c | 8 mm/page-writeback.c | 26 -- 5 files changed, 25 insertions(+), 19 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index f394aff..3f5991e 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1466,7 +1466,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) for (i = 0; i < req->num_pages; i++) { dec_wb_stat(&bdi->wb, WB_WRITEBACK); dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP); - wb_writeout_inc(&bdi->wb); + wb_writeout_inc(&bdi->wb, PAGE_SIZE); } wake_up(&fi->page_waitq); } @@ -1770,7 +1770,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req, dec_wb_stat(&bdi->wb, WB_WRITEBACK); dec_node_page_state(page, NR_WRITEBACK_TEMP); - wb_writeout_inc(&bdi->wb); + wb_writeout_inc(&bdi->wb, PAGE_SIZE); fuse_writepage_free(fc, new_req); fuse_request_free(new_req); goto out; diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h index 1a7c3c1..cef0f24 100644 --- a/include/linux/backing-dev-defs.h +++ b/include/linux/backing-dev-defs.h @@ -36,8 +36,8 @@ enum wb_stat_item { WB_WRITEBACK, WB_METADATA_DIRTY_BYTES, WB_METADATA_WRITEBACK_BYTES, - WB_DIRTIED, - WB_WRITTEN, + WB_DIRTIED_BYTES, + WB_WRITTEN_BYTES, NR_WB_STAT_ITEMS }; diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index 089acf6..742238a 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -113,7 +113,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item) return sum; } -extern void wb_writeout_inc(struct 
bdi_writeback *wb); +extern void wb_writeout_inc(struct bdi_writeback *wb, long bytes); /* * maximal error of a stat counter. diff --git a/mm/backing-dev.c b/mm/backing-dev.c index d76f432..f0695b0 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -77,8 +77,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "BdiDirtyThresh: %10lu kB\n" "DirtyThresh:%10lu kB\n" "BackgroundThresh: %10lu kB\n" - "BdiDirtied: %10lu kB\n" - "BdiWritten: %10lu kB\n" + "BdiDirtiedBytes:%10lu kB\n" + "BdiWrittenBytes:%10lu kB\n" "BdiMetadataDirty: %10lu kB\n" "BdiMetaWriteback: %10lu kB\n" "BdiWriteBandwidth: %10lu kBps\n" @@ -93,8 +93,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) K(wb_thresh), K(dirty_thresh), K(background_thresh), - (unsigned long) K(wb_stat(wb, WB_DIRTIED)), - (unsigned long) K(wb_stat(wb, WB_WRITTEN)), + (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)), + (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)), (unsigned long) BtoK(wb_stat(wb, WB_METADATA_DIRTY_BYTES)), (unsigned long) BtoK(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES)), (unsigned long) K(wb->write_bandwidth), diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 423d2f5..6d08673 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -624,11 +624,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom, * Increment @wb's writeout completion count and the global writeout * completion count. Called from test_clear_page_writeback(). 
*/ -static inline void __wb_writeout_inc(struct bdi_writeback *wb) +static inline void __wb_writeout_inc(struct bdi_writeback *wb, long bytes) { struct wb_domain *cgdom; - __inc_wb_stat(wb, WB_WRITTEN); + __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes); wb_domain_writeout_inc(&global_wb_domain, &wb->completions, wb->bdi->max_prop_frac); @@ -638,12 +638,12 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb) wb->bdi->max_prop_frac); } -void wb_writeout_inc(struct bdi_writeback *wb) +void wb_writeout_inc(struct bdi_writeback *wb, long bytes) { unsigned long flags; local_irq_save(flags); - __wb_writeout_inc(wb)
[PATCH 4/4] writeback: introduce super_operations->write_metadata
Now that we have metadata counters in the VM, we need to provide a way to kick writeback on dirty metadata. Introduce super_operations->write_metadata. This allows file systems to deal with writing back any dirty metadata we need based on the writeback needs of the system. Since there is no inode to key off of we need a list in the bdi for dirty super blocks to be added. From there we can find any dirty sb's on the bdi we are currently doing writeback on and call into their ->write_metadata callback.

Signed-off-by: Josef Bacik
---
 fs/fs-writeback.c                | 72
 fs/super.c                       | 7
 include/linux/backing-dev-defs.h | 2 ++
 include/linux/fs.h               | 4 +++
 mm/backing-dev.c                 | 2 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index aafdb11..8cd072e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1464,6 +1464,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
 	return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+				  struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode		= work->sync_mode,
+		.tagged_writepages	= work->tagged_writepages,
+		.for_kupdate		= work->for_kupdate,
+		.for_background		= work->for_background,
+		.for_sync		= work->for_sync,
+		.range_cyclic		= work->range_cyclic,
+		.range_start		= 0,
+		.range_end		= LLONG_MAX,
+	};
+	long write_chunk;
+
+	write_chunk = writeback_chunk_size(wb, work);
+	wbc.nr_to_write = write_chunk;
+	sb->s_op->write_metadata(sb, &wbc);
+	work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+	return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1490,6 +1515,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 	unsigned long start_time = jiffies;
 	long write_chunk;
 	long wrote = 0;  /* count both pages and inodes */
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1606,12 +1632,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 		 * background threshold and other termination conditions.
 		 */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+	if (!done && sb->s_op->write_metadata) {
+		spin_unlock(&wb->list_lock);
+		wrote += writeback_sb_metadata(sb, wb, work);
+		spin_lock(&wb->list_lock);
+	}
 	return wrote;
 }
 
@@ -1620,6 +1652,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 {
 	unsigned long start_time = jiffies;
 	long wrote = 0;
+	bool done = false;
 
 	while (!list_empty(&wb->b_io)) {
 		struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1639,12 +1672,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		/* refer to the same tests at the end of writeback_sb_inodes */
 		if (wrote) {
-			if (time_is_before_jiffies(start_time + HZ / 10UL))
-				break;
-			if (work->nr_pages <= 0)
+			if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+			    work->nr_pages <= 0) {
+				done = true;
 				break;
+			}
 		}
 	}
+
+	if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+		LIST_HEAD(list);
+
+		spin_unlock(&wb->list_lock);
+		spin_lock(&wb->bdi->sb_list_lock);
+		list_splice_init(&wb->bdi->dirty_sb_list, &list);
+		while (!list_empty(&list)) {
+			struct super_block *sb;
+
+			sb = list_first_entry(&list, struct super_block,
+					      s_bdi_dirty_list);
+			list_move_tail(&sb->s_bdi_dirty_list,
+				       &wb->bdi->dirty_sb_list);
+			if (!sb->s_op->write_metadata)
+				continue;
+
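The budget accounting used around the ->write_metadata callback in this patch (hand the callback an `nr_to_write` budget, then report `write_chunk - nr_to_write` as the amount written) can be sketched in plain userspace C. Everything below is a hypothetical stand-in: the `sketch_*` types and functions are not kernel structures, there is no locking, and the example callback simply writes as much as the budget allows.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for writeback_control; only the field used here. */
struct sketch_wbc {
	long nr_to_write;	/* remaining budget, decremented by the callback */
};

struct sketch_sb;

struct sketch_sb_ops {
	/* Optional: NULL for filesystems with no separate metadata writeback. */
	void (*write_metadata)(struct sketch_sb *sb, struct sketch_wbc *wbc);
};

struct sketch_sb {
	const struct sketch_sb_ops *ops;
	long dirty_metadata;	/* units the filesystem still has to write */
};

/* Example callback: write as much dirty metadata as the budget allows. */
static void sketch_write_metadata(struct sketch_sb *sb, struct sketch_wbc *wbc)
{
	long n = sb->dirty_metadata < wbc->nr_to_write ?
		 sb->dirty_metadata : wbc->nr_to_write;

	sb->dirty_metadata -= n;
	wbc->nr_to_write -= n;
}

/* Return how much was written, or 0 if the filesystem has no callback. */
static long sketch_writeback_sb_metadata(struct sketch_sb *sb, long write_chunk)
{
	struct sketch_wbc wbc = { .nr_to_write = write_chunk };

	assert(sb != NULL && write_chunk >= 0);
	if (!sb->ops || !sb->ops->write_metadata)
		return 0;
	sb->ops->write_metadata(sb, &wbc);
	return write_chunk - wbc.nr_to_write;
}
```

The caller never needs to know how the filesystem wrote its metadata; it only compares the budget before and after the call, which is the same shape as the `work->nr_pages` update in `writeback_sb_metadata()` above.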
[PATCH 2/4] writeback: allow for dirty metadata accounting
Btrfs has no bounds except memory on the amount of dirty memory that we have in use for metadata. Historically we have used a special inode so we could take advantage of the balance_dirty_pages throttling that comes with using pagecache. However as we'd like to support different blocksizes it would be nice to not have to rely on pagecache, but still get the balance_dirty_pages throttling without having to do it ourselves. So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES. These are zone and bdi_writeback counters to keep track of how many bytes we have in flight for METADATA. We need to count in bytes as blocksizes could be percentages of pagesize. We simply convert the bytes to number of pages where it is needed for the throttling. Signed-off-by: Josef Bacik --- arch/tile/mm/pgtable.c | 3 +- drivers/base/node.c | 6 ++ fs/fs-writeback.c| 2 + fs/proc/meminfo.c| 5 ++ include/linux/backing-dev-defs.h | 2 + include/linux/mm.h | 9 +++ include/linux/mmzone.h | 2 + include/trace/events/writeback.h | 13 +++- mm/backing-dev.c | 5 ++ mm/page-writeback.c | 157 +++ mm/page_alloc.c | 16 +++- mm/vmscan.c | 4 +- 12 files changed, 200 insertions(+), 24 deletions(-) diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c index 7cc6ee7..9543468 100644 --- a/arch/tile/mm/pgtable.c +++ b/arch/tile/mm/pgtable.c @@ -44,12 +44,13 @@ void show_mem(unsigned int filter) { struct zone *zone; - pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n", + pr_err("Active:%lu inactive:%lu dirty:%lu metadata_dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n", (global_node_page_state(NR_ACTIVE_ANON) + global_node_page_state(NR_ACTIVE_FILE)), (global_node_page_state(NR_INACTIVE_ANON) + global_node_page_state(NR_INACTIVE_FILE)), global_node_page_state(NR_FILE_DIRTY), + global_node_page_state(NR_METADATA_DIRTY), 
global_node_page_state(NR_WRITEBACK), global_node_page_state(NR_UNSTABLE_NFS), global_page_state(NR_FREE_PAGES), diff --git a/drivers/base/node.c b/drivers/base/node.c index 5548f96..3615264 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL); #define K(x) ((x) << (PAGE_SHIFT - 10)) +#define BtoK(x) ((x) >> 10) + static ssize_t node_read_meminfo(struct device *dev, struct device_attribute *attr, char *buf) { @@ -99,7 +101,9 @@ static ssize_t node_read_meminfo(struct device *dev, #endif n += sprintf(buf + n, "Node %d Dirty: %8lu kB\n" + "Node %d MetadataDirty: %8lu kB\n" "Node %d Writeback: %8lu kB\n" + "Node %d MetaWriteback: %8lu kB\n" "Node %d FilePages: %8lu kB\n" "Node %d Mapped: %8lu kB\n" "Node %d AnonPages: %8lu kB\n" @@ -119,7 +123,9 @@ static ssize_t node_read_meminfo(struct device *dev, #endif , nid, K(node_page_state(pgdat, NR_FILE_DIRTY)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)), nid, K(node_page_state(pgdat, NR_WRITEBACK)), + nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)), nid, K(node_page_state(pgdat, NR_FILE_PAGES)), nid, K(node_page_state(pgdat, NR_FILE_MAPPED)), nid, K(node_page_state(pgdat, NR_ANON_MAPPED)), diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 56c8fda..aafdb11 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1801,6 +1801,7 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb) return work; } +#define BtoP(x) ((x) >> PAGE_SHIFT) /* * Add in the number of potentially dirty inodes, because each inode * write can dirty pagecache in the underlying blockdev. 
@@ -1809,6 +1810,7 @@ static unsigned long get_nr_dirty_pages(void) { return global_node_page_state(NR_FILE_DIRTY) + global_node_page_state(NR_UNSTABLE_NFS) + + BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) + get_nr_dirty_inodes(); } diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c index 09e18fd..95b0d8a 100644 --- a/fs/proc/meminfo.c +++ b/fs/proc/meminfo.c @@ -36,6 +36,7 @@ static
[PATCH 1/4] remove mapping from balance_dirty_pages*()
The only reason we pass in the mapping is to get the inode in order to see if writeback cgroups is enabled, and even then it only checks the bdi and a super block flag. balance_dirty_pages() doesn't even use the mapping. Since balance_dirty_pages*() works on a bdi level, just pass in the bdi and super block directly so we can avoid using mapping. This will allow us to still use balance_dirty_pages for dirty metadata pages that are not backed by an address_mapping. Signed-off-by: Josef Bacik Reviewed-by: Jan Kara --- drivers/mtd/devices/block2mtd.c | 12 fs/btrfs/disk-io.c | 4 ++-- fs/btrfs/file.c | 3 ++- fs/btrfs/ioctl.c| 3 ++- fs/btrfs/relocation.c | 3 ++- fs/buffer.c | 3 ++- fs/iomap.c | 3 ++- fs/ntfs/attrib.c| 10 +++--- fs/ntfs/file.c | 4 ++-- include/linux/backing-dev.h | 29 +++-- include/linux/writeback.h | 3 ++- mm/filemap.c| 4 +++- mm/memory.c | 9 +++-- mm/page-writeback.c | 15 +++ 14 files changed, 71 insertions(+), 34 deletions(-) diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c index 7c887f1..7892d0b 100644 --- a/drivers/mtd/devices/block2mtd.c +++ b/drivers/mtd/devices/block2mtd.c @@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index) /* erase a specified part of the device */ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) { - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; struct page *page; int index = to >> PAGE_SHIFT; // page index int pages = len >> PAGE_SHIFT; @@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len) memset(page_address(page), 0xff, PAGE_SIZE); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); break; } @@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const 
u_char *buf, loff_t to, size_t len, size_t *retlen) { struct page *page; - struct address_space *mapping = dev->blkdev->bd_inode->i_mapping; + struct inode *inode = dev->blkdev->bd_inode; + struct address_space *mapping = inode->i_mapping; int index = to >> PAGE_SHIFT; // page index int offset = to & ~PAGE_MASK; // page offset int cpylen; @@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf, memcpy(page_address(page) + offset, buf, cpylen); set_page_dirty(page); unlock_page(page); - balance_dirty_pages_ratelimited(mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); } put_page(page); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 87dad55..4034ad6 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -4024,8 +4024,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_root *root, ret = percpu_counter_compare(&root->fs_info->dirty_metadata_bytes, BTRFS_DIRTY_METADATA_THRESH); if (ret > 0) { - balance_dirty_pages_ratelimited( - root->fs_info->btree_inode->i_mapping); + balance_dirty_pages_ratelimited(&root->fs_info->bdi, + root->fs_info->sb); } } diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 9404121..f060b08 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1686,7 +1686,8 @@ again: cond_resched(); - balance_dirty_pages_ratelimited(inode->i_mapping); + balance_dirty_pages_ratelimited(inode_to_bdi(inode), + inode->i_sb); if (dirty_pages < (root->nodesize >> PAGE_SHIFT) + 1) btrfs_btree_balance_dirty(root); diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 14ed1e9..a222bad 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1410,7 +1410,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file, } defrag_count += ret; -
Re: Post ext3 conversion problems
On Tue, Sep 20, 2016 at 01:02:38PM +0800, Qu Wenruo wrote:
> > Glad to hear you've found the core of the issue.
> >
> > At this point, I can trigger it immediately. As soon as I log in and run
> > dmenu, it will attempt to rebuild its cache file (small text file that's
> > just a list of all executables in the PATH). Once that write happens,
> > the bug triggers and the fs goes read only.
>
> Rewrite? Or write into new inode?
>
> And is the same inode always causing the problem?

It's not always the same. It seems like whatever triggers a write first is what kills it. I went to test it, and this time it triggered on my .bash_history file. I have bash set up with "history -a", so presumably that was an append, not an overwrite.

To cut down on the number of variables, I booted my system with the "rescue" systemd target, then su'd to my user. Simply running a few commands (with the history -a writes that bash triggered) was enough to trigger the bug.

This is on 4.8.0-rc6, with the following compile time options enabled:

CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_ASSERT=y

If I run the stock Arch kernel (4.7.2 at the moment), the issue still appears, but it takes longer. My most reliable trigger is Firefox, whose constant DB writes will trigger it within minutes.

--Sean
Re: ChaCha20 vs. AES performance
On Tue, 20 Sep 2016 07:51:52 -0800, Kent Overstreet wrote: > On Tue, Sep 20, 2016 at 10:23:20AM -0400, Theodore Ts'o wrote: >> On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote: >> > Not on the list or I would've replied directly, but on Haswell, >> > ChaCha20 (in software) is over 2x as fast as AES (in hardware), at >> > realistic (for a filesystem) block sizes: Apologies if this doesn't CC you - replying via gmane, since (not being subscribed via email either) I can't try the same trick I did to include Ted (i.e., reply via my mail client). One useful trick, though - if you have a Usenet client, gmane _will_ let you reply directly, even to old messages. That's what I'm doing. >> On Skylake and Broadwell processors, AES is faster (the posting is from >> a ChaCha20 enthusiast): >> >> https://blog.cloudflare.com/it-takes-two-to-chacha-poly/ > > The performance delta in his graphs isn't near as big as what I've > measured, which makes me suspect OpenSSL's ChaCha20 implementation isn't > nearly as fast as the kernel's. > >> My big worry though is that schemes that require that nonces/IV's must >> **never** be reused are fragile. It's for the same reason that DSA >> makes my skin crawl. If you ever screw up --- maybe after a crash, or >> a file system bug, you end up reusing a nonce, it's game over. >> >> So if there are hardware solutions which are faster or fast enough that >> the crypto is no longer dominant cost, why not use a cipher scheme >> which is more robust? > > Block ciphers have their own downsides though - XTS is really a big pile > of hacks and workarounds. On the whole, if you can get nonces right, a > stream cipher cryptosystem (and ChaCha20 especially) is on the whole > drastically simpler, and thus easier to understand and audit. Yes, I would entirely agree with your assessment of XTS (in particular, the doubling of the length of the key is rooted in the original authors misunderstanding the XEX paper...). 
> And if you can do nonces correctly, ChaCha20/Poly1305 is pretty much one
> of the gold standards - it's secure against pretty much any vaguely
> realistic threat model. XTS, not so much - it's just the best you can do
> given the constraints of typical disk crypto. The gold standards of
> encryption today are the AEADs - and AES/GCM fails badly with nonce
> reuse too, there aren't any AEADs yet that don't fail badly with nonce
> reuse.

Not true - SIV is a generic construction, which has been applied to AES (AES-SIV, RFC 5297) and ChaCha20 (HS1-SIV, submitted to CAESAR). There's also AES-GCM-SIV, which takes advantage of GCM hardware acceleration as well as AES acceleration.
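To make the nonce-reuse concern discussed in this thread concrete, here is a minimal sketch in C. It deliberately does not implement ChaCha20: `toy_keystream()` is a made-up, insecure xorshift generator that only stands in for "a deterministic keystream derived from key and nonce", and all function names are assumptions for illustration. The point it demonstrates holds for any stream cipher of this shape: if two messages are encrypted with the same key and nonce, XORing the two ciphertexts cancels the keystream and yields the XOR of the two plaintexts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy keystream generator: NOT ChaCha20 and NOT secure. A small xorshift
 * PRNG seeded from (key, nonce), used only as a deterministic keystream.
 */
static void toy_keystream(uint32_t key, uint32_t nonce, uint8_t *out, size_t len)
{
	uint32_t s = key ^ (nonce * 2654435761u);

	if (s == 0)
		s = 1;	/* xorshift state must not be zero */
	for (size_t i = 0; i < len; i++) {
		s ^= s << 13;
		s ^= s >> 17;
		s ^= s << 5;
		out[i] = (uint8_t)(s & 0xff);
	}
}

/* ciphertext = plaintext XOR keystream; decryption is the same operation. */
static void toy_stream_xor(uint32_t key, uint32_t nonce,
			   const uint8_t *in, uint8_t *out, size_t len)
{
	uint8_t ks[64];

	assert(len <= sizeof(ks));
	toy_keystream(key, nonce, ks, len);
	for (size_t i = 0; i < len; i++)
		out[i] = in[i] ^ ks[i];
}

/* XOR two equal-length buffers into out. */
static void xor_buffers(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t len)
{
	for (size_t i = 0; i < len; i++)
		out[i] = a[i] ^ b[i];
}
```

Encrypting two 8-byte messages with the same (key, nonce) and passing the ciphertexts to `xor_buffers()` gives exactly the XOR of the plaintexts, with no key material needed. Misuse-resistant constructions such as the SIV modes mentioned above are designed so that a repeated nonce does not lead to this kind of keystream cancellation.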
Re: multi-device btrfs with single data mode and disk failure
Le 20/09/2016 à 21:46, Chris Murphy a écrit : > On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux wrote: >> >> Le 20/09/2016 à 21:11, Chris Murphy a écrit : >>> And no backup? Umm, I'd resolve that sooner than anything else. >> Yeah you are absolutely right, this was a temporary solution which came >> to be not that temporary. >> And I regret it already... > Well on the bright side, if this were LVM or mdadm linear/concat > array, the whole thing would be toast because any other file system > would have lost too much fs metadata on the missing device. > >>> It >>> should be true that it'll tolerate a read only mount indefinitely, but >>> read write? Not sure. This sort of edge case isn't well tested at all >>> seeing as it required changing the kernel to reduce safe guards. So >>> all bets are off the whole thing could become unmountable, not even >>> read only, and then it's a scraping job. >> I'm not that crazy, I tried the patch inside a virtual machine on >> virtual drives... >> And since it's only virtual, it may not work on the real partition... > Are you sure the virtual setup lacked a CHUNK_ITEM on the missing > device? That might be what pinned it in that case. In fact in my virtual setup there was more chunk missing (1 metadata 1 System and 1 Data). I will try to do a setup closer to my real one. > You could try some sort of overlay for your remaining drives. > Something like this: > https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file > > Make sure you understand the gotcha about cloning which applies here: > https://btrfs.wiki.kernel.org/index.php/Gotchas > > I think it's safe to use blockdev --setro on every real device you're > trying to protect from changes. And when mounting you'll at least need > to use device= mount option to explicitly mount each of the overlay > devices. 
> Based on the wiki, I'm wincing, I don't really know for sure
> if device mount option is enough to compel Btrfs to only use those
> devices and not go off the rails and still use one of the real
> devices, but at least if they're setro it won't matter (the mount will
> just fail somehow due to write failures).
>
> So now you can try removing the missing device... and see what
> happens. You could inspect the overlay files and see what changes were
> made.

Wow, that looks nice.
So, if it works, and if we find a way to fix the filesystem inside the VM, I can use this over the real partition to check that it works before trying the fix for real.
Nice idea.

>>> What do you get for btrfs-debug-tree -t 3
>>>
>>> That should show the chunk tree, and what I'm wondering is if the
>>> chunk tree has any references to chunks on the missing device. Even if
>>> there are no extents on that device, if there are chunks, that might
>>> be one of the safeguards.
>>>
>> You'll find it attached.
>> The missing device is the devid 8 (since it's the only one missing in
>> btrfs fi show)
>> I found it only once, on line 63
> Yeah bummer. Not used for system, data, or metadata chunks at all.
Re: multi-device btrfs with single data mode and disk failure
On Tue, Sep 20, 2016 at 1:54 PM, Alexandre Poux wrote:
>
> OK, good idea, but to be able to do that, I have to use the patch that
> allow me to mount the partition in rw, otherwise I won't be able to
> shrink it I suppose..
> And even with the patch I'm not sure that I won't get an IO error the
> same way I get it when I try to remove the device.
> I will try it on my virtual machine.

The shrink itself is pretty trivial in that it's just moving block groups around if necessary. It's part of the balance code, and there's not much metadata being changed: just CoW the block groups, and then update the chunk tree and supers.

It is trickier when it comes to either partition map changes while the fs is still mounted, or doing it the way I was describing by deleting one of the present devices, in which case you can then just use that now empty partition as a starter for a new file system. It's a catch-22 either way.

Note that by default, if you don't specify a devid for shrink, it only resizes devid 1.

--
Chris Murphy
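The devid note above is easy to trip over on a multi-device volume, so here is a small sketch of how the resize argument is put together. It only builds the command line and does not run anything. The `[devid:]size` form is the one documented for `btrfs filesystem resize`; the mount point `/mnt`, the shrink amount `-100g`, and the choice of devid 12 are placeholders for illustration, not values recommended anywhere in this thread.

```python
from typing import List, Optional


def resize_command(mountpoint: str, size: str, devid: Optional[int] = None) -> List[str]:
    """Build a 'btrfs filesystem resize' command line (dry run, nothing is executed).

    Without a devid, btrfs applies the resize to devid 1, so on a
    multi-device volume the devid should normally be given explicitly.
    """
    target = size if devid is None else f"{devid}:{size}"
    return ["btrfs", "filesystem", "resize", target, mountpoint]


# Hypothetical examples: shrink by 100 GiB, first implicitly (devid 1), then devid 12.
implicit = resize_command("/mnt", "-100g")
explicit = resize_command("/mnt", "-100g", devid=12)

print(" ".join(implicit))
print(" ".join(explicit))
```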
Re: multi-device btrfs with single data mode and disk failure
On Tue, Sep 20, 2016 at 1:43 PM, Austin S. Hemmelgarn wrote: >> First off, as Chris said, if you can read the data and don't already have a > backup, that should be your first priority. This really is an edge case > that's not well tested, and the kernel technically doesn't officially > support it. > > Now, beyond that and his suggestions, there's another option, but it's > risky, so I wouldn't even think about trying it without a backup (unless of > course you can trivially regenerate the data). Multiple devices support and > online resizing allows for a rather neat trick to regenerate a filesystem in > place. The process is pretty simple: > 1. Shrink the existing filesystem down to the minimum size possible. > 2. Create a new partition in the free space, and format it as a temporary > BTRFS filesystem. Ideally, this FS should be mixed mode, and ideally single > profile. If you don't have much free space, you can use a flash drive to > start this temporary filesystem instead. > 3. Start copying files from the old filesystem to the temporary one. > 4. Once the new filesystem is about 95% full, stop copying, shrink the old > filesystem again, create a new partition, and add that partition to the > temporary filesystem. > 5. Repeat steps 3-4 until you have everything off of the old filesystem. > 6. Re-format the remaining portion of the old filesystem using the > parameters you want for the replacement filesystem. > 7. Start copying files from the temporary filesystem to the new filesystem. > 8. As you empty out each temporary partition, remove it from the temporary > filesystem, delete the partition, and expand the new filesystem. > > This takes a while, and is only safe if you have reliable hardware, but I've > done it before and it works reliably as long as you don't have many big > files on the old filesystem (things can get complicated if you do). 
> The other negative aspect is that if you aren't careful, it's possible to get
> stuck half-way, but in such a case, adding a flash drive to the temporary
> filesystem can usually give you enough extra space to get things unstuck.

Yes I thought of this also. Gotcha is that he'll need to apply the patch that allows degraded rw mounts with a device missing on the actual computer with these drives. He tested that patch in a VM with virtual devices.

What might be easier is just 'btrfs dev rm /dev/sda6' because that one has the least amount of data on it:

devid 12 size 728.32GiB used 312.03GiB path /dev/sda6

which should fit on all remaining devices. But, does Btrfs get pissed at some point that there's this missing device it might want to write to? I have no idea to what degree this patched kernel permits a lot of degraded writing.

The other quandary is the file system will do online shrink, but the kernel can sometimes get pissy about partition map changes on devices with active volumes, even when using partprobe to update the kernel's idea of the partition map.

--
Chris Murphy
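To get a feel for how many shrink/copy rounds the quoted eight-step procedure might take, here is a toy model of steps 3 to 5. Everything in it is an assumption made for illustration: sizes are plain GiB numbers, the old filesystem can always be shrunk by exactly the amount copied off it, metadata overhead and file granularity are ignored, and the 95% threshold is applied literally (which makes the model pessimistic, since a real temporary filesystem can usually be filled further). The function and parameter names are hypothetical.

```python
def simulate_migration(disk_gib, data_gib, fill_limit=0.95,
                       min_step_gib=1.0, max_rounds=1000):
    """Toy model of the copy / shrink / add-partition loop.

    Returns (rounds, remaining): rounds is a list of
    (round_number, moved_gib, old_data_left_gib, temp_capacity_gib),
    remaining is the data still on the old filesystem when the loop stopped.
    """
    free = disk_gib - data_gib
    if free <= 0:
        raise ValueError("no free space to start the temporary filesystem")

    old_data = float(data_gib)
    temp_capacity = float(free)   # step 2: temporary fs created in the free space
    temp_used = 0.0
    rounds = []

    for n in range(1, max_rounds + 1):
        # Step 3: copy until the temporary fs reaches the fill limit.
        room = fill_limit * temp_capacity - temp_used
        moved = min(room, old_data)
        if moved < min_step_gib and moved < old_data:
            break  # progress per round is too small: effectively stuck
        old_data -= moved
        temp_used += moved
        # Step 4: shrink the old fs by what was moved, add that space to the temp fs.
        temp_capacity += moved
        rounds.append((n, moved, old_data, temp_capacity))
        if old_data <= 0:
            break
    return rounds, old_data


# Hypothetical 1000 GiB disk holding 800 GiB of data.
rounds, remaining = simulate_migration(1000, 800)
for n, moved, left, cap in rounds:
    print(f"round {n}: moved {moved:.1f} GiB, {left:.1f} GiB left, temp fs {cap:.1f} GiB")

# Hypothetical nearly full disk: the model stalls, matching the "stuck half-way" warning.
stuck_rounds, stuck_remaining = simulate_migration(1000, 990)
```

In this model the per-round progress shrinks a little each time, which is one way to see why starting with very little free space can leave the process stuck and why adding a flash drive to the temporary filesystem helps.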
Re: multi-device btrfs with single data mode and disk failure
Le 20/09/2016 à 21:43, Austin S. Hemmelgarn a écrit : > On 2016-09-20 14:53, Alexandre Poux wrote: >> >> >> Le 20/09/2016 à 20:38, Chris Murphy a écrit : >>> On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux >>> wrote: Le 20/09/2016 à 19:54, Chris Murphy a écrit : > On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux > wrote: > >> If I wanted to try to edit my partitions with an hex editor, >> where would >> I find infos on how to do that ? >> I really don't want to go this way, but if this is relatively >> simple, it >> may be worth to try. > Simple is relative. First you'd need > https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some > understanding of where things are to edit, and then btrfs-map-logical > to convert btrfs logical addresses to physical device and sector to > know what to edit. > > I'd call it distinctly non-trivial and very tedious. > OK, another idea: would it be possible to trick btrfs with a manufactured file that the disk is present while it isn't ? I mean, looking for a few minutes on the hexdump of my trivial test partition, header of members of btrfs array seems very alike. So maybe, I can make a file wich would have enough header to make btrfs believe that this is my device, and then remove it as usual looks like a long shot, but it doesn't hurt to ask >>> There may be another test that applies to single profiles, that >>> disallows dropping a device. I think that's the place to look next. >>> The superblock is easy to copy, but you'll need the device specific >>> UUID which should be locatable with btrfs-show-super -f for each >>> devid. The bigger problem is that Btrfs at mount time doesn't just >>> look at the superblock and then mount. It actually reads parts of each >>> tree, the extent of which I don't know. And it's doing a bunch of >>> sanity tests as it reads those things, including transid (generation). >>> So I'm not sure how easily spoofable a fake device is going to be. 
>>> As a practical matter, migrate it to a new volume is faster and more >>> reliable. Unfortunately, the inability to mount it read write is going >>> to prevent you from making read only snapshots to use with btrfs >>> send/receive. What might work, is find out what on-disk modification >>> btrfs-tune does to make a device a read-only seed. Again your volume >>> is missing a device so btrfs-tune won't let you modify it. But if you >>> could force that to happen, it's probably a very minor change to >>> metadata on each device, maybe it'll act like a seed device when you >>> next mount it, in which case you'll be able to add a device and >>> remount it read write and then delete the seed causing migration of >>> everything that does remain on the volume over to the new device. I've >>> never tried anything like this so I have no idea if it'll work. And >>> even in the best case I haven't tried a multiple device seed going to >>> a single device sprout (is it even allowed when removing the seed?). >>> So...more questions than answers. >>> >> Sorry if I wasn't clear, but with the patch mentionned earlyer, I can >> get a read write mount. >> What I can't do is remove the device. >> As for moving data to an another volume, since it's only data and >> nothing fancy (no subvolume or anything), a simple rsync would do the >> trick. >> My problem in this case is that I don't have enough available space >> elsewhere to move my data. >> That's why I'm trying this hard to recover the partition... > First off, as Chris said, if you can read the data and don't already > have a backup, that should be your first priority. This really is an > edge case that's not well tested, and the kernel technically doesn't > officially support it. > > Now, beyond that and his suggestions, there's another option, but it's > risky, so I wouldn't even think about trying it without a backup > (unless of course you can trivially regenerate the data). 
Multiple > devices support and online resizing allows for a rather neat trick to > regenerate a filesystem in place. The process is pretty simple: > 1. Shrink the existing filesystem down to the minimum size possible. > 2. Create a new partition in the free space, and format it as a > temporary BTRFS filesystem. Ideally, this FS should be mixed mode, > and ideally single profile. If you don't have much free space, you > can use a flash drive to start this temporary filesystem instead. > 3. Start copying files from the old filesystem to the temporary one. > 4. Once the new filesystem is about 95% full, stop copying, shrink the > old filesystem again, create a new partition, and add that partition > to the temporary filesystem. > 5. Repeat steps 3-4 until you have everything off of the old filesystem. > 6. Re-format the remaining portion of the old filesystem using the > parameters you want for the replacement filesystem. > 7. Start copyin
Re: multi-device btrfs with single data mode and disk failure
On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux wrote: > > > Le 20/09/2016 à 21:11, Chris Murphy a écrit : >> And no backup? Umm, I'd resolve that sooner than anything else. > Yeah you are absolutely right, this was a temporary solution which came > to be not that temporary. > And I regret it already... Well on the bright side, if this were LVM or mdadm linear/concat array, the whole thing would be toast because any other file system would have lost too much fs metadata on the missing device. >> It >> should be true that it'll tolerate a read only mount indefinitely, but >> read write? Not sure. This sort of edge case isn't well tested at all >> seeing as it required changing the kernel to reduce safe guards. So >> all bets are off the whole thing could become unmountable, not even >> read only, and then it's a scraping job. > I'm not that crazy, I tried the patch inside a virtual machine on > virtual drives... > And since it's only virtual, it may not work on the real partition... Are you sure the virtual setup lacked a CHUNK_ITEM on the missing device? That might be what pinned it in that case. You could try some sort of overlay for your remaining drives. Something like this: https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file Make sure you understand the gotcha about cloning which applies here: https://btrfs.wiki.kernel.org/index.php/Gotchas I think it's safe to use blockdev --setro on every real device you're trying to protect from changes. And when mounting you'll at least need to use device= mount option to explicitly mount each of the overlay devices. Based on the wiki, I'm wincing, I don't really know for sure if device mount option is enough to compel Btrfs to only use those devices and not go off the rails and still use one of the real devices, but at least if they're setro it won't matter (the mount will just fail somehow due to write failures). 
So now you can try removing the missing device... and see what happens. You could inspect the overlay files and see what changes were made.

>> What do you get for btrfs-debug-tree -t 3
>>
>> That should show the chunk tree, and what I'm wondering is if the
>> chunk tree has any references to chunks on the missing device. Even if
>> there are no extents on that device, if there are chunks, that might
>> be one of the safeguards.
>>
> You'll find it attached.
> The missing device is the devid 8 (since it's the only one missing in
> btrfs fi show)
> I found it only once, on line 63

Yeah bummer. Not used for system, data, or metadata chunks at all.

--
Chris Murphy
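The protection steps suggested in this message (mark every real device read-only, then mount only the overlay devices by naming each one with `device=`) can be sketched as a dry-run command generator. Nothing is executed. The device paths, overlay names and mount point are hypothetical, creating the overlays themselves is left to the linked RAID wiki page, and adding `degraded` to the mount options is my assumption based on the volume having a missing device.

```python
from typing import List, Sequence


def protect_and_mount_commands(real_devices: Sequence[str],
                               overlay_devices: Sequence[str],
                               mountpoint: str,
                               extra_opts: Sequence[str] = ("degraded",)) -> List[List[str]]:
    """Return the commands to run, in order, without executing them."""
    if not overlay_devices:
        raise ValueError("at least one overlay device is required")

    # 1. Make every real device read-only so a stray write fails instead of landing.
    commands = [["blockdev", "--setro", dev] for dev in real_devices]

    # 2. Mount via the overlays, naming each one explicitly with device=.
    options = list(extra_opts) + [f"device={dev}" for dev in overlay_devices]
    commands.append(["mount", "-t", "btrfs", "-o", ",".join(options),
                     overlay_devices[0], mountpoint])
    return commands


# Hypothetical device names, for illustration only.
cmds = protect_and_mount_commands(
    real_devices=["/dev/sdb1", "/dev/sdc1"],
    overlay_devices=["/dev/mapper/ov-sdb1", "/dev/mapper/ov-sdc1"],
    mountpoint="/mnt/test",
)
for cmd in cmds:
    print(" ".join(cmd))
```

As the message says, it is not certain that `device=` alone keeps Btrfs away from the real devices, which is why the read-only flag comes first in the list.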
Re: multi-device btrfs with single data mode and disk failure
On 2016-09-20 14:53, Alexandre Poux wrote: Le 20/09/2016 à 20:38, Chris Murphy a écrit : On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote: Le 20/09/2016 à 19:54, Chris Murphy a écrit : On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote: If I wanted to try to edit my partitions with an hex editor, where would I find infos on how to do that ? I really don't want to go this way, but if this is relatively simple, it may be worth to try. Simple is relative. First you'd need https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some understanding of where things are to edit, and then btrfs-map-logical to convert btrfs logical addresses to physical device and sector to know what to edit. I'd call it distinctly non-trivial and very tedious. OK, another idea: would it be possible to trick btrfs with a manufactured file that the disk is present while it isn't ? I mean, looking for a few minutes on the hexdump of my trivial test partition, header of members of btrfs array seems very alike. So maybe, I can make a file wich would have enough header to make btrfs believe that this is my device, and then remove it as usual looks like a long shot, but it doesn't hurt to ask There may be another test that applies to single profiles, that disallows dropping a device. I think that's the place to look next. The superblock is easy to copy, but you'll need the device specific UUID which should be locatable with btrfs-show-super -f for each devid. The bigger problem is that Btrfs at mount time doesn't just look at the superblock and then mount. It actually reads parts of each tree, the extent of which I don't know. And it's doing a bunch of sanity tests as it reads those things, including transid (generation). So I'm not sure how easily spoofable a fake device is going to be. As a practical matter, migrate it to a new volume is faster and more reliable. 
Unfortunately, the inability to mount it read write is going to prevent you from making read only snapshots to use with btrfs send/receive. What might work, is find out what on-disk modification btrfs-tune does to make a device a read-only seed. Again your volume is missing a device so btrfs-tune won't let you modify it. But if you could force that to happen, it's probably a very minor change to metadata on each device, maybe it'll act like a seed device when you next mount it, in which case you'll be able to add a device and remount it read write and then delete the seed causing migration of everything that does remain on the volume over to the new device. I've never tried anything like this so I have no idea if it'll work. And even in the best case I haven't tried a multiple device seed going to a single device sprout (is it even allowed when removing the seed?). So...more questions than answers. Sorry if I wasn't clear, but with the patch mentionned earlyer, I can get a read write mount. What I can't do is remove the device. As for moving data to an another volume, since it's only data and nothing fancy (no subvolume or anything), a simple rsync would do the trick. My problem in this case is that I don't have enough available space elsewhere to move my data. That's why I'm trying this hard to recover the partition... First off, as Chris said, if you can read the data and don't already have a backup, that should be your first priority. This really is an edge case that's not well tested, and the kernel technically doesn't officially support it. Now, beyond that and his suggestions, there's another option, but it's risky, so I wouldn't even think about trying it without a backup (unless of course you can trivially regenerate the data). Multiple devices support and online resizing allows for a rather neat trick to regenerate a filesystem in place. The process is pretty simple: 1. Shrink the existing filesystem down to the minimum size possible. 2. 
Create a new partition in the free space, and format it as a temporary BTRFS filesystem. Ideally, this FS should be mixed mode, and ideally single profile. If you don't have much free space, you can use a flash drive to start this temporary filesystem instead. 3. Start copying files from the old filesystem to the temporary one. 4. Once the new filesystem is about 95% full, stop copying, shrink the old filesystem again, create a new partition, and add that partition to the temporary filesystem. 5. Repeat steps 3-4 until you have everything off of the old filesystem. 6. Re-format the remaining portion of the old filesystem using the parameters you want for the replacement filesystem. 7. Start copying files from the temporary filesystem to the new filesystem. 8. As you empty out each temporary partition, remove it from the temporary filesystem, delete the partition, and expand the new filesystem. This takes a while, and is only safe if you have reliable hardware, but I've done it before and it works reliably as long as you don't have many big files on the old filesys
Re: multi-device btrfs with single data mode and disk failure
On Tue, Sep 20, 2016 at 12:53 PM, Alexandre Poux wrote: > > > Le 20/09/2016 à 20:38, Chris Murphy a écrit : >> On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote: >>> >>> Le 20/09/2016 à 19:54, Chris Murphy a écrit : On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote: > If I wanted to try to edit my partitions with an hex editor, where would > I find infos on how to do that ? > I really don't want to go this way, but if this is relatively simple, it > may be worth to try. Simple is relative. First you'd need https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some understanding of where things are to edit, and then btrfs-map-logical to convert btrfs logical addresses to physical device and sector to know what to edit. I'd call it distinctly non-trivial and very tedious. >>> OK, another idea: >>> would it be possible to trick btrfs with a manufactured file that the >>> disk is present while it isn't ? >>> >>> I mean, looking for a few minutes on the hexdump of my trivial test >>> partition, header of members of btrfs array seems very alike. >>> So maybe, I can make a file wich would have enough header to make btrfs >>> believe that this is my device, and then remove it as usual >>> looks like a long shot, but it doesn't hurt to ask >> There may be another test that applies to single profiles, that >> disallows dropping a device. I think that's the place to look next. >> The superblock is easy to copy, but you'll need the device specific >> UUID which should be locatable with btrfs-show-super -f for each >> devid. The bigger problem is that Btrfs at mount time doesn't just >> look at the superblock and then mount. It actually reads parts of each >> tree, the extent of which I don't know. And it's doing a bunch of >> sanity tests as it reads those things, including transid (generation). >> So I'm not sure how easily spoofable a fake device is going to be. >> As a practical matter, migrate it to a new volume is faster and more >> reliable. 
Unfortunately, the inability to mount it read write is going >> to prevent you from making read only snapshots to use with btrfs >> send/receive. What might work, is find out what on-disk modification >> btrfs-tune does to make a device a read-only seed. Again your volume >> is missing a device so btrfs-tune won't let you modify it. But if you >> could force that to happen, it's probably a very minor change to >> metadata on each device, maybe it'll act like a seed device when you >> next mount it, in which case you'll be able to add a device and >> remount it read write and then delete the seed causing migration of >> everything that does remain on the volume over to the new device. I've >> never tried anything like this so I have no idea if it'll work. And >> even in the best case I haven't tried a multiple device seed going to >> a single device sprout (is it even allowed when removing the seed?). >> So...more questions than answers. >> > Sorry if I wasn't clear, but with the patch mentionned earlyer, I can > get a read write mount. > What I can't do is remove the device. > As for moving data to an another volume, since it's only data and > nothing fancy (no subvolume or anything), a simple rsync would do the trick. > My problem in this case is that I don't have enough available space > elsewhere to move my data. > That's why I'm trying this hard to recover the partition... And no backup? Umm, I'd resolve that sooner than anything else. It should be true that it'll tolerate a read only mount indefinitely, but read write? Not sure. This sort of edge case isn't well tested at all seeing as it required changing the kernel to reduce safe guards. So all bets are off the whole thing could become unmountable, not even read only, and then it's a scraping job. I think what you want to do here is reasonable, there's no missing data on the missing device. 
If the device were present and you deleted it, Btrfs would presumably have nothing to migrate; it'd just shrink the fs, update all supers, and wipe the signatures off the device being removed. That's it. So there's some safeguard in place that's disallowing the remove missing in this case even though there's no data or metadata to migrate off the drive.

In another thread about clusters and planned data loss, I describe how this functionality has a practical real world benefit other than your particular situation. So it would be nice if it were possible, but I can't tell you what the safeguard is that's preventing it from being removed, or if it's even just one safeguard.

What do you get for btrfs-debug-tree -t 3

That should show the chunk tree, and what I'm wondering is if the chunk tree has any references to chunks on the missing device. Even if there are no extents on that device, if there are chunks, that might be one of the safeguards.

--
Chris Murphy
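Checking the chunk tree dump by eye is error-prone, so here is a small sketch that counts how many chunk stripes reference each devid. It assumes the dump contains stripe lines of the form `stripe 0 devid 1 offset ...`, which is my recollection of the btrfs-debug-tree output and should be verified against your btrfs-progs version. The sample text is made up for illustration and is not the attachment from this thread.

```python
import re
from collections import Counter

# Matches chunk stripe lines only, e.g. "stripe 0 devid 1 offset 20971520".
# A DEV_ITEM line such as "dev item devid 8 ..." is deliberately not matched.
STRIPE_RE = re.compile(r"\bstripe\s+\d+\s+devid\s+(\d+)\b")


def chunk_stripes_per_devid(dump_text: str) -> Counter:
    """Count chunk stripes per devid in a chunk tree dump."""
    return Counter(int(m.group(1)) for m in STRIPE_RE.finditer(dump_text))


# Hypothetical, abbreviated dump: devid 8 has a DEV_ITEM but no chunk stripes.
SAMPLE = """\
item 0 key (DEV_ITEMS DEV_ITEM 8) itemoff 16185 itemsize 98
    dev item devid 8 total_bytes 1000204886016 bytes used 0
item 1 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 16105 itemsize 80
    chunk length 8388608 owner 2 type DATA num_stripes 1
        stripe 0 devid 1 offset 20971520
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 29360128) itemoff 16025 itemsize 80
    chunk length 8388608 owner 2 type METADATA num_stripes 1
        stripe 0 devid 12 offset 1048576
"""

counts = chunk_stripes_per_devid(SAMPLE)
missing_devid = 8
print(f"devid {missing_devid}: {counts[missing_devid]} chunk stripe(s)")
```

A count of zero for the missing devid would mean no chunk lives on that device, in which case the chunk tree is probably not what blocks the removal.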
Re: multi-device btrfs with single data mode and disk failure
On 20/09/2016 at 20:56, Austin S. Hemmelgarn wrote:
> On 2016-09-20 13:54, Chris Murphy wrote:
>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>>
>>> If I wanted to try to edit my partitions with an hex editor, where would
>>> I find infos on how to do that ?
>>> I really don't want to go this way, but if this is relatively simple, it
>>> may be worth to try.
>>
>> Simple is relative. First you'd need
>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
>> understanding of where things are to edit, and then btrfs-map-logical
>> to convert btrfs logical addresses to physical device and sector to
>> know what to edit.
>>
>> I'd call it distinctly non-trivial and very tedious.
>>
> It really is. I've done this before, but I had a copy of the on-disk
> format documentation, a couple of working filesystems, a full copy of
> the current kernel sources for reference, and about 8 cups of green
> tea (my beverage of choice for staying awake and focused). I got
> _really_ lucky and it was something that really was simple to fix once
> I found it (it amounted to about 64 bytes of changes, it took me maybe
> 5 minutes to actually correct the issue once I found where it was),
> but it took me a good couple of hours to figure out what to even look
> for, plus another hour just to find it, and I'm not sure I would be
> able to do it any faster if I had to again (unlike doing so for ext4,
> which is a walk in the park by comparison).
>
> TBH the only thing I'd worry about using a hex editor to fix in BTRFS
> is the super-blocks or system chunks, because they're pretty easy to
> find, and usually not all that hard to fix. In fact, if it hadn't
> been for the fact that I had no backup of the data I would lose by
> recreating that filesystem, and I was _really_ bored that day, I
> probably wouldn't have even tried.

OK, I will forget this.
Thank you
Re: multi-device btrfs with single data mode and disk failure
On 2016-09-20 13:54, Chris Murphy wrote:
> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>
>> If I wanted to try to edit my partitions with an hex editor, where would
>> I find infos on how to do that ?
>> I really don't want to go this way, but if this is relatively simple, it
>> may be worth to try.
>
> Simple is relative. First you'd need
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
> understanding of where things are to edit, and then btrfs-map-logical
> to convert btrfs logical addresses to physical device and sector to
> know what to edit.
>
> I'd call it distinctly non-trivial and very tedious.
>
It really is. I've done this before, but I had a copy of the on-disk format documentation, a couple of working filesystems, a full copy of the current kernel sources for reference, and about 8 cups of green tea (my beverage of choice for staying awake and focused). I got _really_ lucky and it was something that really was simple to fix once I found it (it amounted to about 64 bytes of changes, it took me maybe 5 minutes to actually correct the issue once I found where it was), but it took me a good couple of hours to figure out what to even look for, plus another hour just to find it, and I'm not sure I would be able to do it any faster if I had to again (unlike doing so for ext4, which is a walk in the park by comparison).

TBH the only thing I'd worry about using a hex editor to fix in BTRFS is the super-blocks or system chunks, because they're pretty easy to find, and usually not all that hard to fix. In fact, if it hadn't been for the fact that I had no backup of the data I would lose by recreating that filesystem, and I was _really_ bored that day, I probably wouldn't have even tried.
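As a concrete example of why the superblocks are "pretty easy to find": the primary Btrfs superblock sits at byte offset 64 KiB on each device, and its 8-byte magic `_BHRfS_M` is at offset 0x40 within it. The sketch below only checks for that magic in a device or image file, read-only. The helper names are hypothetical, and the demo image it writes is a bare magic string rather than a valid superblock (no checksum, UUIDs or tree roots), so it says nothing about whether a manufactured device would pass the mount-time checks.

```python
import os
import tempfile

SUPERBLOCK_OFFSET = 0x10000   # primary superblock starts at 64 KiB
MAGIC_OFFSET = 0x40           # magic field offset inside the superblock
BTRFS_MAGIC = b"_BHRfS_M"


def has_btrfs_magic(path: str) -> bool:
    """Return True if the primary superblock location carries the Btrfs magic."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET + MAGIC_OFFSET)
        return f.read(len(BTRFS_MAGIC)) == BTRFS_MAGIC


def make_demo_image(path: str, with_magic: bool = True) -> None:
    """Write a small sparse file, optionally with only the magic in place (demo helper)."""
    with open(path, "wb") as f:
        f.truncate(SUPERBLOCK_OFFSET + 4096)
        if with_magic:
            f.seek(SUPERBLOCK_OFFSET + MAGIC_OFFSET)
            f.write(BTRFS_MAGIC)


# Self-contained demo on temporary files; no real device is touched.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.img")
    blank = os.path.join(tmp, "blank.img")
    make_demo_image(good, with_magic=True)
    make_demo_image(blank, with_magic=False)
    good_result = has_btrfs_magic(good)
    blank_result = has_btrfs_magic(blank)

print(good_result, blank_result)
```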
Re: multi-device btrfs with single data mode and disk failure
Le 20/09/2016 à 20:38, Chris Murphy a écrit : > On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote: >> >> Le 20/09/2016 à 19:54, Chris Murphy a écrit : >>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote: >>> If I wanted to try to edit my partitions with an hex editor, where would I find infos on how to do that ? I really don't want to go this way, but if this is relatively simple, it may be worth to try. >>> Simple is relative. First you'd need >>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some >>> understanding of where things are to edit, and then btrfs-map-logical >>> to convert btrfs logical addresses to physical device and sector to >>> know what to edit. >>> >>> I'd call it distinctly non-trivial and very tedious. >>> >> OK, another idea: >> would it be possible to trick btrfs with a manufactured file that the >> disk is present while it isn't ? >> >> I mean, looking for a few minutes on the hexdump of my trivial test >> partition, header of members of btrfs array seems very alike. >> So maybe, I can make a file wich would have enough header to make btrfs >> believe that this is my device, and then remove it as usual >> looks like a long shot, but it doesn't hurt to ask > There may be another test that applies to single profiles, that > disallows dropping a device. I think that's the place to look next. > The superblock is easy to copy, but you'll need the device specific > UUID which should be locatable with btrfs-show-super -f for each > devid. The bigger problem is that Btrfs at mount time doesn't just > look at the superblock and then mount. It actually reads parts of each > tree, the extent of which I don't know. And it's doing a bunch of > sanity tests as it reads those things, including transid (generation). > So I'm not sure how easily spoofable a fake device is going to be. > As a practical matter, migrate it to a new volume is faster and more > reliable. 
> Unfortunately, the inability to mount it read write is going
> to prevent you from making read only snapshots to use with btrfs
> send/receive. What might work, is find out what on-disk modification
> btrfs-tune does to make a device a read-only seed. Again your volume
> is missing a device so btrfs-tune won't let you modify it. But if you
> could force that to happen, it's probably a very minor change to
> metadata on each device, maybe it'll act like a seed device when you
> next mount it, in which case you'll be able to add a device and
> remount it read write and then delete the seed causing migration of
> everything that does remain on the volume over to the new device. I've
> never tried anything like this so I have no idea if it'll work. And
> even in the best case I haven't tried a multiple device seed going to
> a single device sprout (is it even allowed when removing the seed?).
> So...more questions than answers.
>
Sorry if I wasn't clear, but with the patch mentioned earlier, I can get a read write mount.
What I can't do is remove the device.
As for moving data to another volume, since it's only data and nothing fancy (no subvolume or anything), a simple rsync would do the trick.
My problem in this case is that I don't have enough available space elsewhere to move my data.
That's why I'm trying this hard to recover the partition...
Re: multi-device btrfs with single data mode and disk failure
On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote: > > > Le 20/09/2016 à 19:54, Chris Murphy a écrit : >> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote: >> >>> If I wanted to try to edit my partitions with an hex editor, where would >>> I find infos on how to do that ? >>> I really don't want to go this way, but if this is relatively simple, it >>> may be worth to try. >> Simple is relative. First you'd need >> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some >> understanding of where things are to edit, and then btrfs-map-logical >> to convert btrfs logical addresses to physical device and sector to >> know what to edit. >> >> I'd call it distinctly non-trivial and very tedious. >> > OK, another idea: > would it be possible to trick btrfs with a manufactured file that the > disk is present while it isn't ? > > I mean, looking for a few minutes on the hexdump of my trivial test > partition, header of members of btrfs array seems very alike. > So maybe, I can make a file wich would have enough header to make btrfs > believe that this is my device, and then remove it as usual > looks like a long shot, but it doesn't hurt to ask There may be another test that applies to single profiles, that disallows dropping a device. I think that's the place to look next. The superblock is easy to copy, but you'll need the device specific UUID which should be locatable with btrfs-show-super -f for each devid. The bigger problem is that Btrfs at mount time doesn't just look at the superblock and then mount. It actually reads parts of each tree, the extent of which I don't know. And it's doing a bunch of sanity tests as it reads those things, including transid (generation). So I'm not sure how easily spoofable a fake device is going to be. As a practical matter, migrate it to a new volume is faster and more reliable. 
Unfortunately, the inability to mount it read write is going to prevent you from making read only snapshots to use with btrfs send/receive. What might work is to find out what on-disk modification btrfs-tune makes to turn a device into a read-only seed. Again your volume is missing a device so btrfs-tune won't let you modify it. But if you could force that to happen, it's probably a very minor change to metadata on each device, maybe it'll act like a seed device when you next mount it, in which case you'll be able to add a device and remount it read write and then delete the seed causing migration of everything that does remain on the volume over to the new device. I've never tried anything like this so I have no idea if it'll work. And even in the best case I haven't tried a multiple device seed going to a single device sprout (is it even allowed when removing the seed?). So...more questions than answers. -- Chris Murphy
Re: multi-device btrfs with single data mode and disk failure
On 20/09/2016 at 19:54, Chris Murphy wrote: > On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote: > >> If I wanted to try to edit my partitions with an hex editor, where would >> I find infos on how to do that ? >> I really don't want to go this way, but if this is relatively simple, it >> may be worth to try. > Simple is relative. First you'd need > https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some > understanding of where things are to edit, and then btrfs-map-logical > to convert btrfs logical addresses to physical device and sector to > know what to edit. > > I'd call it distinctly non-trivial and very tedious. > OK, another idea: would it be possible to trick btrfs, using a manufactured file, into believing that the disk is present when it isn't? I mean, after looking for a few minutes at the hexdump of my trivial test partition, the headers of the members of a btrfs array seem very much alike. So maybe I can make a file which would have enough of a header to make btrfs believe that this is my device, and then remove it as usual. It looks like a long shot, but it doesn't hurt to ask.
Re: [PATCH] Btrfs: kill BUG_ON in do_relocation
On Tue, Sep 20, 2016 at 10:03:43AM +0200, David Sterba wrote: > On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote: > > > > That's EIO. Sometimes the EIO is big enough we have to abort, but > > > > really the abort is just adding bonus. > > > > > > I think we misuse the EIO where we should really return EFSCORRUPTED > > > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be > > > really a message that the hardware is bad. > > > > I love this idea, but one quick question, when returning EUCLEAN, what > > message do users get? > > > > "#define EUCLEAN 117 /* Structure needs cleaning */" > > strerror(EUCLEAN) -> "Structure needs cleaning" Hmm, if I were the user, I wouldn't know how to deal with "Structure needs cleaning", so I'd still need to take a glance at the dmesg log. Thanks, -liubo
Re: [PATCH] Btrfs: memset to avoid stale content in btree node block
On Tue, Sep 20, 2016 at 03:16:36PM +0200, David Sterba wrote: > On Wed, Sep 14, 2016 at 05:22:57PM -0700, Liu Bo wrote: > > During updating btree, we could push items between sibling > > nodes/leaves, for leaves data sections starts reversely from > > the end of the block while for nodes we only have key pairs > > which are stored one by one from the start of the block. > > > > So we could do try to push key pairs from one node to the next > > node right in the tree, and after that, we update the node's > > nritems to reflect the correct end while leaving the stale > > content in the node. One may intentionally corrupt the fs > > image and access the stale content by bumping the nritems and > > causes various crashes. > > > > This takes the in-memory @nritems as the correct one and > > gets to memset the unused part of a btree node. > > > > Signed-off-by: Liu Bo > > Reviewed-by: David Sterba > > > --- > > fs/btrfs/extent_io.c | 11 +++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c > > index c2325c3..56c9dee 100644 > > --- a/fs/btrfs/extent_io.c > > +++ b/fs/btrfs/extent_io.c > > @@ -3732,6 +3732,17 @@ static noinline_for_stack int write_one_eb(struct > > extent_buffer *eb, > > if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID) > > bio_flags = EXTENT_BIO_TREE_LOG; > > > > + /* set btree node beyond nritems with 0 to avoid stale content */ > > + if (btrfs_header_level(eb) > 0) { > > We can do the same for leaves. In theory, the problem also applies to leaves, but I don't have a reproducer for the leaf case. So I'll send a v2 with the leaf memset; please review that part more carefully :) Thanks, -liubo
Re: multi-device btrfs with single data mode and disk failure
On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote: > If I wanted to try to edit my partitions with an hex editor, where would > I find infos on how to do that ? > I really don't want to go this way, but if this is relatively simple, it > may be worth to try. Simple is relative. First you'd need https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some understanding of where things are to edit, and then btrfs-map-logical to convert btrfs logical addresses to physical device and sector to know what to edit. I'd call it distinctly non-trivial and very tedious. -- Chris Murphy
Re: multi-device btrfs with single data mode and disk failure
On 20/09/2016 at 00:05, Alexandre Poux wrote: > > On 15/09/2016 at 23:54, Chris Murphy wrote: >> On Thu, Sep 15, 2016 at 3:48 PM, Alexandre Poux wrote: >>> On 15/09/2016 at 18:54, Chris Murphy wrote: On Thu, Sep 15, 2016 at 10:30 AM, Alexandre Poux wrote: > Thank you very much for your answers > > On 15/09/2016 at 17:38, Chris Murphy wrote: >> On Thu, Sep 15, 2016 at 1:44 AM, Alexandre Poux >> wrote: >>> Is it possible to do some kind of a "btrfs delete missing" on this >>> kind of setup, in order to recover access in rw to my other data, or >>> I must copy all my data on a new partition >> That *should* work :) Except that your file system with 6 drives is >> too full to be shrunk to 5 drives. Btrfs will either refuse, or get >> confused, about how to shrink a nearly full 6 drive volume into 5. >> >> So you'll have to do one of three things: >> >> 1. Add a 2+TB drive, then remove the missing one; OR >> 2. btrfs replace is faster and is raid10 reliable; OR >> 3. Read only scrub to get a file listing of bad files, then remount >> read-write degraded and delete them all. Now you maybe can do a device >> delete missing. But it's still a tight fit, it basically has to >> balance things out to get it to fit on an odd number of drives, it may >> actually not work even though there seems to be enough total space, >> there has to be enough space on FOUR drives. >> > Are you sure you are talking about data in single mode ? > I don't understand why you are talking about raid10, > or the fact that it will have to rebalance everything. Yeah sorry I got confused in that very last sentence. Single, it will find space in 1GiB increments. Of course this fails because that data doesn't exist anymore, but to start the operation it needs to be possible. >>> No problem > Moreover, even in degraded mode I cannot mount it in rw > It tells me > "too many missing devices, writeable remount is not allowed" > due to the fact I'm in single mode. Oh you're in that trap.
Well now you're stuck. I've had the case where I could mount read write degraded with metadata raid1 and data single, but it was good for only one mount and then I get the same message you get and it was only possible to mount read only. At that point it's totally stuck unless you're adept at manipulating the file system with a hex editor... Someone might have a patch somewhere that drops this check and lets a volume with too many missing devices mount anyway... I seem to recall this. It'd be in the archives if it exists. > And as far as I know, btrfs replace and btrfs delete, are not supposed > to work in read only... It doesn't. Must be read write mounted. > I would like to tell it to forget about the missing data, and give me back > my partition. This feature doesn't exist yet. I really want to see this, it'd be great for ceph and gluster if the volume could lose a drive, report all the missing files to the cluster file system, delete the device and the file references, and then the cluster knows that brick doesn't have those files and can replicate them somewhere else or even back to the brick that had them. >>> So I found this patch : https://patchwork.kernel.org/patch/7014141/ >>> >>> Does this seem ok ? >> No idea I haven't tried it. >> >>> So after patching my kernel with it, >>> I should be able to mount in rw my partition, and thus, >>> I will be able to do a btrfs delete missing >>> Which will just forget about the old disk and everything should be fine >>> afterward ? >> It will forget about the old disk but it will try to migrate all >> metadata and data that was on that disk to the remaining drives; so >> until you delete all files that are corrupt, you'll continue to get >> corruption messages about them. >> >>> Is this risky ? or not so much ? >> Probably. If you care about the data, mount read only, back up what >> you can, then see if you can fix it after that. >> >>> The scrubbing is almost finished, and as I was expecting, I lost no data >>> at all.
>> Well I'd guess the device delete should work then, but I still have no >> idea if that patch will let you mount it degraded read-write. Worth a >> shot though, it'll save time. >> > OK, so I found some time to work on it. > > I decided to do some tests in a vm (virtualbox) with 3 disks > after making an array with 3 disks, metadata in raid1 and data in single, > I remove one disk to reproduce my situation. > > I tried the patch, and, after updated it (nothing fancy), > I can indeed mount a degraded partition with data in single. > > But I can't remove the device : > #btrfs device remove missing /mnt > ERROR: error removing device 'missing': Input/output error > or > #btrfs device remove 2 /mnt > ERROR: error removing
Re: Is stability a joke?
btrfs-convert has been rewritten as of btrfs-progs 4.6, and therefore the conversion page could use an update: https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3 Anyone updating the page should note that the code is new, that the changelog should be checked, that the latest btrfs-progs version should be used, and that there may still be edge cases: https://btrfs.wiki.kernel.org/index.php/Changelog Also, the status page doesn't mention the convert feature. Chris Murphy
Re: [PATCH]btrfs-progs: Post btrfs-convert verify permissions and acls
On 2016-09-19 20:17, Qu Wenruo wrote: Hi Laksmipathi, At 09/06/2016 03:27 AM, Lakshmipathi.G wrote: Signed-off-by: Lakshmipathi.G --- tests/common.convert | 95 +++- 1 file changed, 94 insertions(+), 1 deletion(-) diff --git a/tests/common.convert b/tests/common.convert index 4e3d49c..67c99b1 100644 --- a/tests/common.convert +++ b/tests/common.convert @@ -123,6 +123,38 @@ convert_test_gen_checksums() { count=1 >/dev/null 2>&1 run_check_stdout $SUDO_HELPER find $TEST_MNT -type f ! -name 'image' -exec md5sum {} \+ > "$CHECKSUMTMP" } +# list $TEST_MNT data set file permissions. +# $1: path where the permissions will be stored +convert_test_perm() { + local PERMTMP + PERMTMP="$1" + FILES_LIST=$(mktemp --tmpdir btrfs-progs-convert.fileslistXX) + + run_check $SUDO_HELPER dd if=/dev/zero of=$TEST_MNT/test bs=$nodesize \ + count=1 >/dev/null 2>&1 + run_check_stdout $SUDO_HELPER find $TEST_MNT -type f ! -name 'image' -fprint $FILES_LIST + #Fix directory entries order. + sort $FILES_LIST -o $FILES_LIST + for file in `cat $FILES_LIST` ;do + run_check_stdout $SUDO_HELPER getfacl --absolute-names $file >> "$PERMTMP" + done + rm $FILES_LIST +} +# list acls of files on $TEST_MNT +# $1: path where the acls will be stored +convert_test_acl() { + local ACLSTMP + ACLTMP="$1" + FILES_LIST=$(mktemp --tmpdir btrfs-progs-convert.fileslistXX) + + run_check_stdout $SUDO_HELPER find $TEST_MNT/acls -type f -fprint $FILES_LIST + #Fix directory entries order. + sort $FILES_LIST -o $FILES_LIST + for file in `cat $FILES_LIST`;do + run_check_stdout $SUDO_HELPER getfattr --absolute-names -d $file >> "$ACLTMP" + done + rm $FILES_LIST +} # do conversion with given features and nodesize, fsck afterwards # $1: features, argument of -O, can be empty @@ -133,15 +165,68 @@ convert_test_do_convert() { run_check $TOP/btrfs-show-super -Ffa $TEST_DEV } +# post conversion check, verify file permissions. +# $1: file with ext permissions. 
+convert_test_post_check_permissions() { + local EXT_PERMTMP + local BTRFS_PERMTMP + + EXT_PERMTMP="$1" + BTRFS_PERMTMP=$(mktemp --tmpdir btrfs-progs-convert.permXX) + convert_test_perm "$BTRFS_PERMTMP" + + btrfs_perm=`md5sum $BTRFS_PERMTMP | cut -f1 -d' '` + ext_perm=`md5sum $EXT_PERMTMP | cut -f1 -d' ' When running test case 005, the test script hangs here. EXT_PERMTMP seems to be empty, so md5sum is waiting for input on stdin, causing the hang. Any idea how to fix it? Thanks, Qu Hi Qu, Can you confirm whether the in-place 'sort' command used in convert_test_perm() creates a properly sorted file rather than an empty one? $ sort --version sort (GNU coreutils) 8.22 Cheers, Lakshmipathi.G
Re: [PATCH]btrfs-progs: Add fast,slow symlinks and fifo types to convert test
On 2016-09-19 18:21, Qu Wenruo wrote: Just curious, did the new fifo/slow_symlink tests expose any convert bug? Thanks, Qu Unfortunately no. I was hoping something would fail, but sadly convert-tests.sh passed!
Re: [PATCH]btrfs-progs: Add fast,slow symlinks and fifo types to convert test
On 2016-09-19 11:05, David Sterba wrote: On Thu, Sep 15, 2016 at 11:34:07AM +0200, Lakshmipathi.G wrote: + slow_symlink) + for num in $(seq 1 $DATASET_SIZE); do + fname64=`date +%s | sha256sum | cut -f1 -d'-'` Do you need to generate the date and sha all the time? Right, I missed that part. We can create a single file and create multiple symlinks to that same file. The fname64 creation can be moved out of the loop. I'll re-send another patch with this fix. Cheers. Lakshmipathi.G
Re: ChaCha20 vs. AES performance
On Tue, Sep 20, 2016 at 10:23:20AM -0400, Theodore Ts'o wrote: > On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote: > > Not on the list or I would've replied directly, but on Haswell, ChaCha20 (in > > software) is over 2x as fast as AES (in hardware), at realistic (for a > > filesystem) block sizes: > > On Skylake and Broadwell processors, AES is faster (the posting is > from a ChaCha20 enthusiast): > > https://blog.cloudflare.com/it-takes-two-to-chacha-poly/ The performance delta in his graphs isn't near as big as what I've measured, which makes me suspect OpenSSL's ChaCha20 implementation isn't nearly as fast as the kernel's. > My big worry though is that schemes that require that nonces/IV's must > **never** be reused are fragile. It's for the same reason that DSA > makes my skin crawl. If you ever screw up --- maybe after a crash, or > a file system bug, you end up reusing a nonce, it's game over. > > So if there are hardware solutions which are faster or fast enough > that the crypto is no longer dominant cost, why not use a cipher > scheme which is more robust? Block ciphers have their own downsides though - XTS is really a big pile of hacks and workarounds. On the whole, if you can get nonces right, a stream cipher cryptosystem (and ChaCha20 especially) is on the whole drastically simpler, and thus easier to understand and audit. And if you can do nonces correctly, ChaCha20/Poly1305 is pretty much one of the gold standards - it's secure against pretty much any vaguely realistic threat model. XTS, not so much - it's just the best you can do given the constraints of typical disk crypto. The gold standards of encryption today are the AEADs - and AES/GCM fails badly with nonce reuse too, there aren't any AEADs yet that don't fail badly with nonce reuse. > P.S. We're also both ignoring the cost of whatever changes are needed in > the file system to guarantee that the nonce is never, ever reused... 
I'm definitely not advocating for hacking stream ciphers into existing filesystems - if you don't have the machinery you need to be 100% rigorous about nonces, then definitely stick with XTS. But I already had most of what I needed in bcachefs, and I can still break the on disk format if I need to (and encryption is a breaking change), so for me ChaCha20/Poly1305 was a no brainer. BTW though, if there do turn out to be platforms where AES is significantly faster than ChaCha20 I can still add AES support pretty easily - I've already got all the relevant switch statements, since encryption is handled as another checksum type.
Re: Experimental btrfs encryption
On 09/19/2016 10:50 PM, Theodore Ts'o wrote: On Mon, Sep 19, 2016 at 08:32:34PM -0400, Chris Mason wrote: That key is used to protect the contents of the data file, and to encrypt filenames and symlink targets --- since filenames can leak significant information about what the user is doing. (For example, in the downloads directory of their web browser, leaking filenames is just as good as leaking part of their browsing history.) One of the things that makes per-subvolume encryption attractive to me is that we're able to enforce the idea that an entire directory tree is encrypted by one key. It can't be snapshotted again without the key, and it just fits with the rest of the btrfs management code. I do want to support the existing vfs interfaces as well too though. One of the main reasons for doing fs-level encryption is so you can allow multiple users to have different keys. In some cases you can assume that different users will be in different distinct subvolumes (e.g., each user has their own home directory), but that's not always going to be possible. Agreed, they are just different use cases. I think both are important, and btrfs won't do encryption without the file-level option. One of the other things that was in the original design, but which got dropped in our initial implementation, was the concept of having the per-inode key wrapped by multiple user keys. This would allow a file to be accessible by more than one user. So something to consider is that there may very well be situations where you *want* to have more than one key associated with a directory hierarchy. The issue, here, is that inodes are fundamentally not a safe scope to attach that information to in btrfs. As extents can be shared between inodes (and thus both will need to decrypt them), and inodes can be duplicated unmodified (snapshots), attaching keys and nonces to inodes opens up a whole host of (possibly insoluble) issues, including catastrophic nonce reuse via writable snapshots. 
I'm going to have to read harder about nonce reuse. In btrfs an inode is really a pair [ root id, inode number ], so strictly speaking two writable snapshots won't have the same inode in memory and when a snapshot is modified we'd end up with a different nonce for the new modifications. Nonce reuse is not necessarily catastrophic. It all depends on the context. In the case of Counter or GCM mode, nonce (or IV) reuse is absolutely catastrophic. It must *never* be done or you completely lose all security. As the Soviets discovered the hard way courtesy of the Venona project (well, they didn't discover it until after they lost the cold war, but...) one time pads are completely secure. Two-time pads are most emphatically _not_. :-) In the case of the nonces used in fscrypt's key derivation, reuse of the nonce basically means that two files share the same key. Assuming you're using a competently designed block cipher (e.g., AES), reuse of the key is not necessarily a problem. What it would mean is that two files which are reflinked would share the same key. And if you have writable snapshots, that's definitely not a problem, since with AES we use a fixed key and a fixed IV given a logical block number, and we can do block overwrites without having to guarantee unique nonces (which you *do* need to worry about if you use counter mode or some other stream cipher such as ChaCha20 --- Kent Overstreet had some clever tricks to avoid IV reuse since he used a stream cipher in his proposed bcachefs encryption). The main issue is if you want to reflink a file and then have the two files have different permissions / ownerships. In that case, you really want to use different keys for user A and for user B --- but if you are assuming a single key per subvolume, you can't support different keys for different users anyway, so you're kind of toast for that use case in any case. So there's a matrix of possible configurations.
If you're doing a reflink between subvolumes and you're doing a subvolume granular encryption and you don't have keys to the source subvolume, the reflink shouldn't be allowed. If you do have keys, any new writes are happening into a different inode, and will be encrypted with a different key. If you're doing a file level encryption and you do have access to the source file, the destination file is a new inode. Thanks to COW any changes are going to go into new extents and will end up with different keys/nonces. Either way, we degrade down into extent based encryption. I'd take that hit to maintain sane semantics in the face of snapshots and reflinks. The btrfs extent structures on disk already have an encryption type field. So in any case, assuming you're using block encryption (which is what fscrypt uses) there really isn't a problem with nonce reuse, although in some cases if you really do want to reflink a file and have it be protected by different user keys, this would have
Re: [RFC] Preliminary BTRFS Encryption
Hi David, On 09/18/2016 02:45 AM, David Sterba wrote: On Sat, Sep 17, 2016 at 12:38:30AM -0400, Zygo Blaxell wrote: There's also a nasty problem with the extent tree--there's only one per filesystem, it's shared between all subvols and block groups, and every extent in that tree has back references to the (possibly encrypted) subvol trees. I'll leave that problem as an exercise for other readers. ;) A design point that I'm not mentioning for the first time: there would be per-subvolume group extent trees, ie. a set of subvolumes with attached extent tree where similar to what we have now. So, encrypted and unencrypted extent metadata will never be mixed. (the crypto key questions are not addressed here) This hasn't been implemented but I'm making sure this will be possible when somebody mentions changes to the extent tree or blockgroup reworks (to actually solve other problems). Now I remember this was told before, sorry this slipped my mind. Thanks, Anand
Re: ChaCha20 vs. AES performance
On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote: > Not on the list or I would've replied directly, but on Haswell, ChaCha20 (in > software) is over 2x as fast as AES (in hardware), at realistic (for a > filesystem) block sizes: On Skylake and Broadwell processors, AES is faster (the posting is from a ChaCha20 enthusiast): https://blog.cloudflare.com/it-takes-two-to-chacha-poly/ My big worry though is that schemes that require that nonces/IV's must **never** be reused are fragile. It's for the same reason that DSA makes my skin crawl. If you ever screw up --- maybe after a crash, or a file system bug, you end up reusing a nonce, it's game over. So if there are hardware solutions which are faster or fast enough that the crypto is no longer dominant cost, why not use a cipher scheme which is more robust? - Ted P.S. We're also both ignoring the cost of whatever changes are needed in the file system to guarantee that the nonce is never, ever reused...
[PATCH 4/5] btrfs: convert pr_* to btrfs_* where possible
From: Jeff Mahoney For many printks, we want to know which file system issued the message. This patch converts most pr_* calls to use the btrfs_* versions instead. In some cases, this means adding plumbing to allow call sites access to an fs_info pointer. fs/btrfs/check-integrity.c is left alone for another day. Signed-off-by: Jeff Mahoney diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index c2cd4c2..16ec215 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -390,7 +390,8 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info, /* root node has been locked, we can release @subvol_srcu safely here */ srcu_read_unlock(&fs_info->subvol_srcu, index); - pr_debug("search slot in root %llu (level %d, ref count %d) returned %d for key (%llu %u %llu)\n", + btrfs_debug(fs_info, + "search slot in root %llu (level %d, ref count %d) returned %d for key (%llu %u %llu)", ref->root_id, level, ref->count, ret, ref->key_for_search.objectid, ref->key_for_search.type, ref->key_for_search.offset); @@ -1491,7 +1492,8 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical, if (found_key->objectid > logical || found_key->objectid + size <= logical) { - pr_debug("logical %llu is not within any extent\n", logical); + btrfs_debug(fs_info, + "logical %llu is not within any extent", logical); return -ENOENT; } @@ -1502,7 +1504,8 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical, ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item); flags = btrfs_extent_flags(eb, ei); - pr_debug("logical %llu is at position %llu within the extent (%llu EXTENT_ITEM %llu) flags %#llx size %u\n", + btrfs_debug(fs_info, + "logical %llu is at position %llu within the extent (%llu EXTENT_ITEM %llu) flags %#llx size %u", logical, logical - found_key->objectid, found_key->objectid, found_key->offset, flags, item_size); @@ -1623,20 +1626,24 @@ int tree_backref_for_extent(unsigned long *ptr, struct extent_buffer *eb, return 0; } -static int 
iterate_leaf_refs(struct extent_inode_elem *inode_list, - u64 root, u64 extent_item_objectid, - iterate_extent_inodes_t *iterate, void *ctx) +static int iterate_leaf_refs(struct btrfs_fs_info *fs_info, +struct extent_inode_elem *inode_list, +u64 root, u64 extent_item_objectid, +iterate_extent_inodes_t *iterate, void *ctx) { struct extent_inode_elem *eie; int ret = 0; for (eie = inode_list; eie; eie = eie->next) { - pr_debug("ref for %llu resolved, key (%llu EXTEND_DATA %llu), root %llu\n", extent_item_objectid, -eie->inum, eie->offset, root); + btrfs_debug(fs_info, + "ref for %llu resolved, key (%llu EXTEND_DATA %llu), root %llu", + extent_item_objectid, eie->inum, + eie->offset, root); ret = iterate(eie->inum, eie->offset, root, ctx); if (ret) { - pr_debug("stopping iteration for %llu due to ret=%d\n", -extent_item_objectid, ret); + btrfs_debug(fs_info, + "stopping iteration for %llu due to ret=%d", + extent_item_objectid, ret); break; } } @@ -1664,7 +1671,7 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info, struct ulist_iterator ref_uiter; struct ulist_iterator root_uiter; - pr_debug("resolving all inodes for extent %llu\n", + btrfs_debug(fs_info, "resolving all inodes for extent %llu", extent_item_objectid); if (!search_commit_root) { @@ -1690,9 +1697,12 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info, break; ULIST_ITER_INIT(&root_uiter); while (!ret && (root_node = ulist_next(roots, &root_uiter))) { - pr_debug("root %llu references leaf %llu, data list %#llx\n", root_node->val, ref_node->val, -ref_node->aux); - ret = iterate_leaf_refs((struct extent_inode_elem *) + btrfs_debug(fs_info, + "root %llu references leaf %llu, data list %#llx", + root_node->val, ref_node->val, + ref_node->aux); + ret = iterate_leaf_refs(fs_info, + (
[PATCH 5/5] btrfs: convert send's verbose_printk to btrfs_debug
From: Jeff Mahoney

This was basically an open-coded, less flexible dynamic printk. We can
just use btrfs_debug instead.

Signed-off-by: Jeff Mahoney

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index ee10345..96bc99d 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -36,10 +36,6 @@
 #include "transaction.h"
 #include "compression.h"
 
-static int g_verbose = 0;
-
-#define verbose_printk(...) if (g_verbose) printk(__VA_ARGS__)
-
 /*
  * A fs_path is a helper to dynamically build path names with unknown size.
  * It reallocates the internal buffer on demand.
@@ -727,9 +723,10 @@ static int send_cmd(struct send_ctx *sctx)
 static int send_rename(struct send_ctx *sctx,
 		     struct fs_path *from, struct fs_path *to)
 {
+	struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
 	int ret;
 
-verbose_printk("btrfs: send_rename %s -> %s\n", from->start, to->start);
+	btrfs_debug(fs_info, "send_rename %s -> %s", from->start, to->start);
 
 	ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME);
 	if (ret < 0)
@@ -751,9 +748,10 @@ out:
 static int send_link(struct send_ctx *sctx,
 		     struct fs_path *path, struct fs_path *lnk)
 {
+	struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
 	int ret;
 
-verbose_printk("btrfs: send_link %s -> %s\n", path->start, lnk->start);
+	btrfs_debug(fs_info, "send_link %s -> %s", path->start, lnk->start);
 
 	ret = begin_cmd(sctx, BTRFS_SEND_C_LINK);
 	if (ret < 0)
@@ -774,9 +772,10 @@ out:
  */
 static int send_unlink(struct send_ctx *sctx, struct fs_path *path)
 {
+	struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
 	int ret;
 
-verbose_printk("btrfs: send_unlink %s\n", path->start);
+	btrfs_debug(fs_info, "send_unlink %s", path->start);
 
 	ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK);
 	if (ret < 0)
@@ -796,9 +795,10 @@ out:
  */
 static int send_rmdir(struct send_ctx *sctx, struct fs_path *path)
 {
+	struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
 	int ret;
 
-verbose_printk("btrfs: send_rmdir %s\n", path->start);
+	btrfs_debug(fs_info, "send_rmdir %s", path->start);
 
 	ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR);
 	if (ret < 0)
@@ -1313,6 +1313,7 @@ static int find_extent_clone(struct send_ctx *sctx,
 			     u64 ino_size,
 			     struct clone_root **found)
 {
+	struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
 	int ret;
 	int extent_type;
 	u64 logical;
@@ -1371,10 +1372,10 @@ static int find_extent_clone(struct send_ctx *sctx,
 	}
 	logical = disk_byte + btrfs_file_extent_offset(eb, fi);
 
-	down_read(&sctx->send_root->fs_info->commit_root_sem);
-	ret = extent_from_logical(sctx->send_root->fs_info, disk_byte, tmp_path,
+	down_read(&fs_info->commit_root_sem);
+	ret = extent_from_logical(fs_info, disk_byte, tmp_path,
 				  &found_key, &flags);
-	up_read(&sctx->send_root->fs_info->commit_root_sem);
+	up_read(&fs_info->commit_root_sem);
 	btrfs_release_path(tmp_path);
 
 	if (ret < 0)
@@ -1429,7 +1430,7 @@ static int find_extent_clone(struct send_ctx *sctx,
 		extent_item_pos = logical - found_key.objectid;
 	else
 		extent_item_pos = 0;
-	ret = iterate_extent_inodes(sctx->send_root->fs_info,
+	ret = iterate_extent_inodes(fs_info,
 				    found_key.objectid, extent_item_pos, 1,
 				    __iterate_backrefs, backref_ctx);
 
@@ -1439,17 +1440,18 @@ static int find_extent_clone(struct send_ctx *sctx,
 	if (!backref_ctx->found_itself) {
 		/* found a bug in backref code? */
 		ret = -EIO;
-		btrfs_err(sctx->send_root->fs_info,
+		btrfs_err(fs_info,
 			  "did not find backref in send_root. inode=%llu, offset=%llu, disk_byte=%llu found extent=%llu",
-				ino, data_offset, disk_byte, found_key.objectid);
+			  ino, data_offset, disk_byte, found_key.objectid);
 		goto out;
 	}
 
-verbose_printk(KERN_DEBUG "btrfs: find_extent_clone: data_offset=%llu, ino=%llu, num_bytes=%llu, logical=%llu\n",
-		data_offset, ino, num_bytes, logical);
+	btrfs_debug(fs_info,
+		    "find_extent_clone: data_offset=%llu, ino=%llu, num_bytes=%llu, logical=%llu",
+		    data_offset, ino, num_bytes, logical);
 
 	if (!backref_ctx->found)
-		verbose_printk("btrfs:no clones found\n");
+		btrfs_debug(fs_info, "no clones found");
 
 	cur_clone_root = NULL;
 	for (i = 0; i < sctx->clone_roots_cnt; i++) {
@@ -2420,10 +2422,11 @@ out:
 
 static int send_truncate(struct send_ctx *sctx, u64 ino, u64 gen, u64 size)
 {
+	struct btrfs_fs_info *fs_
[PATCH 3/5] btrfs: convert printk(KERN_* to use pr_* calls
From: Jeff Mahoney

This patch converts printk(KERN_* style messages to use the pr_* versions.

One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.

Signed-off-by: Jeff Mahoney

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 8d87056..dc9c93e 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -656,7 +656,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 	BUG_ON(NULL == state);
 	selected_super = kzalloc(sizeof(*selected_super), GFP_NOFS);
 	if (NULL == selected_super) {
-		printk(KERN_INFO "btrfsic: error, kmalloc failed!\n");
+		pr_info("btrfsic: error, kmalloc failed!\n");
 		return -ENOMEM;
 	}
 
@@ -681,7 +681,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 	}
 
 	if (NULL == state->latest_superblock) {
-		printk(KERN_INFO "btrfsic: no superblock found!\n");
+		pr_info("btrfsic: no superblock found!\n");
 		kfree(selected_super);
 		return -1;
 	}
@@ -698,13 +698,13 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 			next_bytenr = btrfs_super_root(selected_super);
 			if (state->print_mask &
 			    BTRFSIC_PRINT_MASK_ROOT_CHUNK_LOG_TREE_LOCATION)
-				printk(KERN_INFO "root@%llu\n", next_bytenr);
+				pr_info("root@%llu\n", next_bytenr);
 			break;
 		case 1:
 			next_bytenr = btrfs_super_chunk_root(selected_super);
 			if (state->print_mask &
 			    BTRFSIC_PRINT_MASK_ROOT_CHUNK_LOG_TREE_LOCATION)
-				printk(KERN_INFO "chunk@%llu\n", next_bytenr);
+				pr_info("chunk@%llu\n", next_bytenr);
 			break;
 		case 2:
 			next_bytenr = btrfs_super_log_root(selected_super);
@@ -712,7 +712,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 				continue;
 			if (state->print_mask &
 			    BTRFSIC_PRINT_MASK_ROOT_CHUNK_LOG_TREE_LOCATION)
-				printk(KERN_INFO "log@%llu\n", next_bytenr);
+				pr_info("log@%llu\n", next_bytenr);
 			break;
 		}
 
@@ -720,7 +720,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 		    btrfs_num_copies(state->root->fs_info,
 				     next_bytenr, state->metablock_size);
 		if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
-			printk(KERN_INFO "num_copies(log_bytenr=%llu) = %d\n",
+			pr_info("num_copies(log_bytenr=%llu) = %d\n",
 			       next_bytenr, num_copies);
 		for (mirror_num = 1; mirror_num <= num_copies; mirror_num++) {
@@ -733,7 +733,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 						&tmp_next_block_ctx,
 						mirror_num);
 			if (ret) {
-				printk(KERN_INFO "btrfsic: btrfsic_map_block(root @%llu, mirror %d) failed!\n",
+				pr_info("btrfsic: btrfsic_map_block(root @%llu, mirror %d) failed!\n",
 				       next_bytenr, mirror_num);
 				kfree(selected_super);
 				return -1;
@@ -756,8 +756,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 			ret = btrfsic_read_block(state, &tmp_next_block_ctx);
 			if (ret < (int)PAGE_SIZE) {
-				printk(KERN_INFO
-				       "btrfsic: read @logical %llu failed!\n",
+				pr_info("btrfsic: read @logical %llu failed!\n",
 				       tmp_next_block_ctx.start);
 				btrfsic_release_block_ctx(&tmp_next_block_ctx);
 				kfree(selected_super);
@@ -818,7 +817,7 @@ static int btrfsic_process_superblock_dev_mirror(
 	if (NULL == superblock_tmp) {
 		superblock_tmp = btrfsic_block_alloc();
 		if (NULL == superblock_tmp) {
-			printk(KERN_INFO "btrfsic: error, kmalloc failed!\n");
+			pr_info("btrfsic: error, kmalloc failed!\n");
 			brelse(bh);
 			return -1;
 		}
@@ -892,7 +891,7 @@ static int btrfsic_process_superblock_dev_mirror(
 		    btrfs_num_copies(state->root->fs_info,
[PATCH 2/5] btrfs: unsplit printed strings
From: Jeff Mahoney

CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 455a6b2..c2cd4c2 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -390,8 +390,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
 	/* root node has been locked, we can release @subvol_srcu safely here */
 	srcu_read_unlock(&fs_info->subvol_srcu, index);
 
-	pr_debug("search slot in root %llu (level %d, ref count %d) returned "
-		 "%d for key (%llu %u %llu)\n",
+	pr_debug("search slot in root %llu (level %d, ref count %d) returned %d for key (%llu %u %llu)\n",
 		 ref->root_id, level, ref->count, ret,
 		 ref->key_for_search.objectid, ref->key_for_search.type,
 		 ref->key_for_search.offset);
@@ -1503,8 +1502,7 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
 	ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
 	flags = btrfs_extent_flags(eb, ei);
 
-	pr_debug("logical %llu is at position %llu within the extent (%llu "
-		 "EXTENT_ITEM %llu) flags %#llx size %u\n",
+	pr_debug("logical %llu is at position %llu within the extent (%llu EXTENT_ITEM %llu) flags %#llx size %u\n",
 		 logical, logical - found_key->objectid, found_key->objectid,
 		 found_key->offset, flags, item_size);
 
@@ -1633,8 +1631,7 @@ static int iterate_leaf_refs(struct extent_inode_elem *inode_list,
 	int ret = 0;
 
 	for (eie = inode_list; eie; eie = eie->next) {
-		pr_debug("ref for %llu resolved, key (%llu EXTEND_DATA %llu), "
-			 "root %llu\n", extent_item_objectid,
+		pr_debug("ref for %llu resolved, key (%llu EXTEND_DATA %llu), root %llu\n", extent_item_objectid,
 			 eie->inum, eie->offset, root);
 		ret = iterate(eie->inum, eie->offset, root, ctx);
 		if (ret) {
@@ -1693,8 +1690,7 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
 			break;
 		ULIST_ITER_INIT(&root_uiter);
 		while (!ret && (root_node = ulist_next(roots, &root_uiter))) {
-			pr_debug("root %llu references leaf %llu, data list "
-				 "%#llx\n", root_node->val, ref_node->val,
+			pr_debug("root %llu references leaf %llu, data list %#llx\n", root_node->val, ref_node->val,
 				 ref_node->aux);
 			ret = iterate_leaf_refs((struct extent_inode_elem *)
 						(uintptr_t)ref_node->aux,
@@ -1792,8 +1788,7 @@ static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root,
 		for (cur = 0; cur < btrfs_item_size(eb, item); cur += len) {
 			name_len = btrfs_inode_ref_name_len(eb, iref);
 			/* path must be released before calling iterate()! */
-			pr_debug("following ref at offset %u for inode %llu in "
-				 "tree %llu\n", cur, found_key.objectid,
+			pr_debug("following ref at offset %u for inode %llu in tree %llu\n", cur, found_key.objectid,
 				 fs_root->objectid);
 			ret = iterate(parent, name_len,
 				      (unsigned long)(iref + 1), eb, ctx);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 6678947..8d87056 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -733,9 +733,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 						&tmp_next_block_ctx,
 						mirror_num);
 			if (ret) {
-				printk(KERN_INFO "btrfsic:"
-				       " btrfsic_map_block(root @%llu,"
-				       " mirror %d) failed!\n",
+				printk(KERN_INFO "btrfsic: btrfsic_map_block(root @%llu, mirror %d) failed!\n",
 				       next_bytenr, mirror_num);
 				kfree(selected_super);
 				return -1;
@@ -905,8 +903,7 @@ static int btrfsic_process_superblock_dev_mirror(
 					      state->metablock_size,
 					      &tmp_next_block_ctx,
 					      mirror_num)) {
-				printk(KERN_INFO "btrfsic: btrfsic_map_block("
-				       "bytenr @%llu, mirror %d) failed!\n",
+
[PATCH 1/5] btrfs: add dynamic debug support
From: Jeff Mahoney

We can re-use the dynamic debugging descriptor to make use of the dynamic
debugging mechanism but still use our own printk interface.

Defining the DEBUG macro works as it did before. When it's defined, all of
the messages default to print.

We can also enable all debug messages at boot or module-load time using
the 'dyndbg' and 'btrfs.dyndbg' options.

Signed-off-by: Jeff Mahoney

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 33fe035..9267436 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -37,6 +37,7 @@
 #include
 #include
 #include
+#include
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -3315,7 +3316,35 @@ void btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...)
 	btrfs_printk_ratelimited(fs_info, KERN_NOTICE fmt, ##args)
 #define btrfs_info_rl(fs_info, fmt, args...) \
 	btrfs_printk_ratelimited(fs_info, KERN_INFO fmt, ##args)
-#ifdef DEBUG
+
+#if defined(CONFIG_DYNAMIC_DEBUG)
+#define btrfs_debug(fs_info, fmt, args...) \
+do { \
+	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+	if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)) \
+		btrfs_printk(fs_info, KERN_DEBUG fmt, ##args); \
+} while (0)
+#define btrfs_debug_in_rcu(fs_info, fmt, args...) \
+do { \
+	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+	if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)) \
+		btrfs_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args); \
+} while (0)
+#define btrfs_debug_rl_in_rcu(fs_info, fmt, args...) \
+do { \
+	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+	if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)) \
+		btrfs_printk_rl_in_rcu(fs_info, KERN_DEBUG fmt, \
+				       ##args);\
+} while (0)
+#define btrfs_debug_rl(fs_info, fmt, args...) \
+do { \
+	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+	if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT)) \
+		btrfs_printk_ratelimited(fs_info, KERN_DEBUG fmt, \
+					 ##args); \
+} while (0)
+#elif defined(DEBUG)
 #define btrfs_debug(fs_info, fmt, args...) \
 	btrfs_printk(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_in_rcu(fs_info, fmt, args...) \
-- 
2.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
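To make the idea in the patch above easier to follow outside the kernel, here is a minimal user-space sketch of the same pattern: every call site owns a static descriptor, and the message is only formatted when that descriptor's print flag is set. Everything in it is an assumption for illustration (struct dbg_site, DBG_FLAG_PRINT, dbg_register(), dbg_enable_all(), fs_printk() and fs_debug() are hypothetical names, not kernel APIs). In particular, the kernel collects its descriptors in a dedicated section at link time; the sketch registers a call site lazily the first time it is reached, which is a simplification.

```c
#include <stdarg.h>
#include <stdio.h>

#define DBG_FLAG_PRINT 0x1u	/* hypothetical stand-in for _DPRINTK_FLAGS_PRINT */

/* Hypothetical, much simplified stand-in for a dynamic debug descriptor. */
struct dbg_site {
	unsigned int flags;
	int registered;
	struct dbg_site *next;
};

static struct dbg_site *dbg_sites;	/* call sites seen so far */
static int dbg_printed;			/* number of messages emitted */

/* Remember a call site the first time it is reached (simplification). */
static void dbg_register(struct dbg_site *site)
{
	if (site->registered)
		return;
	site->registered = 1;
	site->next = dbg_sites;
	dbg_sites = site;
}

/* Turn on printing for every call site registered so far. */
static void dbg_enable_all(void)
{
	struct dbg_site *site;

	for (site = dbg_sites; site; site = site->next)
		site->flags |= DBG_FLAG_PRINT;
}

/* Stand-in for a filesystem-aware printk: prefixes a filesystem name. */
static void fs_printk(const char *fs_name, const char *fmt, ...)
{
	va_list ap;

	printf("demo-fs (%s): ", fs_name);
	va_start(ap, fmt);
	vprintf(fmt, ap);
	va_end(ap);
	printf("\n");
	dbg_printed++;
}

/* One static descriptor per expansion, checked before any formatting. */
#define fs_debug(fs_name, ...)					\
do {								\
	static struct dbg_site site;				\
	dbg_register(&site);					\
	if (site.flags & DBG_FLAG_PRINT)			\
		fs_printk(fs_name, __VA_ARGS__);		\
} while (0)

/* Example call site, loosely modelled on the send_rename message. */
static void demo_send_rename(const char *from, const char *to)
{
	fs_debug("sda1", "send_rename %s -> %s", from, to);
}
```

Calling demo_send_rename() before dbg_enable_all() prints nothing; calling it again afterwards prints one line. The point of the design is that a disabled call site costs a single flag test.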
[PATCH 0/5] btrfs: printing cleanup patchset
From: Jeff Mahoney

This is a patchset I've been working on to clean up message printing,
make it adhere to kernel style, and be more consistent.

The end result is that we:
* use dynamic debugging for debugging messages
* merge strings that exceed 80 characters into a single greppable string
* convert printk calls to btrfs_{warn,info,err,debug,etc} calls where
  it makes sense.
* dump the ad-hoc verbose_printk garbage in send

The exception to this is check-integrity since it has a ton of messages
and it also has its own mask mechanism. I wanted to discuss if we wanted
to find another solution to that and, if so, how we want to move forward
there.

Dave, this will probably conflict with the fsinfo patchset, so please
advise on which you want to land first.

-Jeff

Jeff Mahoney (5):
  btrfs: add dynamic debug support
  btrfs: unsplit printed strings
  btrfs: convert printk(KERN_* to use pr_* calls
  btrfs: convert pr_* to btrfs_* where possible
  btrfs: convert send's verbose_printk to btrfs_debug

 fs/btrfs/backref.c          |  48 ---
 fs/btrfs/check-integrity.c  | 335 ++--
 fs/btrfs/compression.c      |   6 +-
 fs/btrfs/ctree.c            |  12 +-
 fs/btrfs/ctree.h            |  39 +-
 fs/btrfs/delayed-inode.c    |  17 +--
 fs/btrfs/delayed-ref.c      |   9 +-
 fs/btrfs/dev-replace.c      |  21 +--
 fs/btrfs/dir-item.c         |   7 +-
 fs/btrfs/disk-io.c          |  98 ++---
 fs/btrfs/extent-tree.c      | 106 +++---
 fs/btrfs/extent_io.c        |  93 ++--
 fs/btrfs/free-space-cache.c |  21 +--
 fs/btrfs/free-space-cache.h |   6 +-
 fs/btrfs/free-space-tree.c  |  14 +-
 fs/btrfs/inode-map.c        |  31 ++--
 fs/btrfs/inode.c            |  26 ++--
 fs/btrfs/ioctl.c            |  14 +-
 fs/btrfs/lzo.c              |   6 +-
 fs/btrfs/ordered-data.c     |   4 +-
 fs/btrfs/print-tree.c       |  86 +---
 fs/btrfs/qgroup.c           |  22 +--
 fs/btrfs/reada.c            |  32 ++---
 fs/btrfs/relocation.c       |  16 ++-
 fs/btrfs/root-tree.c        |  18 +--
 fs/btrfs/scrub.c            |  58
 fs/btrfs/send.c             |  71 +-
 fs/btrfs/super.c            |  60
 fs/btrfs/sysfs.c            |   8 +-
 fs/btrfs/transaction.c      |  20 +--
 fs/btrfs/transaction.h      |   1 +
 fs/btrfs/tree-log.c         |   8 +-
 fs/btrfs/uuid-tree.c        |  27 ++--
 fs/btrfs/volumes.c          | 131 +
 fs/btrfs/volumes.h          |   2 +-
 fs/btrfs/zlib.c             |   8 +-
 36 files changed, 719 insertions(+), 762 deletions(-)

-- 
2.7.1
Re: [PATCH] Btrfs: add error handling for extent buffer in print tree
On Wed, Sep 14, 2016 at 05:23:39PM -0700, Liu Bo wrote:
> Somehow we missed btrfs_print_tree when last time we
> updated error handling for read_extent_block().
>
> This keeps us from getting a NULL pointer panic when
> btrfs_print_tree's read_extent_block() fails.
>
> Signed-off-by: Liu Bo

Reviewed-by: David Sterba
Re: [PATCH] Btrfs: memset to avoid stale content in btree node block
On Wed, Sep 14, 2016 at 05:22:57PM -0700, Liu Bo wrote:
> During updating btree, we could push items between sibling
> nodes/leaves, for leaves data sections starts reversely from
> the end of the block while for nodes we only have key pairs
> which are stored one by one from the start of the block.
>
> So we could do try to push key pairs from one node to the next
> node right in the tree, and after that, we update the node's
> nritems to reflect the correct end while leaving the stale
> content in the node. One may intentionally corrupt the fs
> image and access the stale content by bumping the nritems and
> causes various crashes.
>
> This takes the in-memory @nritems as the correct one and
> gets to memset the unused part of a btree node.
>
> Signed-off-by: Liu Bo

Reviewed-by: David Sterba

> ---
>  fs/btrfs/extent_io.c | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c2325c3..56c9dee 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3732,6 +3732,17 @@ static noinline_for_stack int write_one_eb(struct
> extent_buffer *eb,
>  	if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
>  		bio_flags = EXTENT_BIO_TREE_LOG;
>
> +	/* set btree node beyond nritems with 0 to avoid stale content */
> +	if (btrfs_header_level(eb) > 0) {

We can do the same for leaves.
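For readers who want to see the node case from the quoted patch in isolation, the following is a rough user-space sketch, not the kernel code. It assumes a toy block layout (a small header followed by fixed-size key pointers stored from the start of the block); struct demo_header, struct demo_key_ptr, DEMO_NODE_SIZE and demo_clear_stale_tail() are hypothetical names, and the real btrfs on-disk structures differ. Leaves, which the review comment asks about, are deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NODE_SIZE 4096u	/* hypothetical block size */

/* Toy header: only the two fields this sketch needs. */
struct demo_header {
	uint32_t nritems;	/* number of valid key pointers */
	uint8_t level;		/* 0 = leaf, > 0 = internal node */
};

/* Toy key pointer, stored back to back right after the header. */
struct demo_key_ptr {
	uint64_t objectid;
	uint64_t offset;
	uint64_t blockptr;
};

/*
 * Zero everything in an internal node that lies beyond the last valid
 * key pointer, trusting the header's nritems. Returns the number of
 * bytes cleared (0 for leaves or for an nritems that does not fit).
 */
static size_t demo_clear_stale_tail(uint8_t *block, size_t block_size)
{
	struct demo_header hdr;
	size_t used;

	if (block_size < sizeof(hdr))
		return 0;
	memcpy(&hdr, block, sizeof(hdr));
	if (hdr.level == 0)
		return 0;	/* leaves use a different layout */

	used = sizeof(hdr) + (size_t)hdr.nritems * sizeof(struct demo_key_ptr);
	if (used > block_size)
		return 0;	/* bogus nritems; leave it to validation */

	memset(block + used, 0, block_size - used);
	return block_size - used;
}
```

After the call, bumping nritems in a crafted image would only expose zeroed slots rather than leftover key pointers, which is the effect the commit message describes.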
Re: stat(2) returning device ID not existing in mountinfo
On 9/16/16 4:28 PM, Tomasz Sterna wrote:
> Hi.
>
> I have spotted an issue with stat(2) call on files on btrfs.
> It is giving me dev_t st_dev number that does not correspond to any
> mounted filesystem in proc's mountinfo.

That's by design. Your particular file system may only use one device
but, internally, btrfs uses virtualized storage that may be spread across
multiple devices.

To make things more complicated, snapshots mean that:

sled1a:/mnt # btrfs sub list .
ID 257 gen 14 top level 5 path a
ID 258 gen 14 top level 5 path b
sled1a:/mnt # ls -laRi
.:
total 16
256 drwxr-xr-x 1 root root   4 Sep 20 09:08 .
256 drwxr-xr-x 1 root root 220 Sep 16 09:49 ..
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 a
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 b

./a:
total 4112
256 drwxr-xr-x 1 root root       8 Sep 14 10:24 .
256 drwxr-xr-x 1 root root       4 Sep 20 09:08 ..
257 -rw-r--r-- 1 root root 4194304 Sep 14 10:24 file

./b:
total 4112
256 drwxr-xr-x 1 root root       8 Sep 14 10:24 .
256 drwxr-xr-x 1 root root       4 Sep 20 09:08 ..
257 -rw-r--r-- 1 root root 4194304 Sep 14 10:24 file

Under normal circumstances those are two files with the same st_dev and
the same inode number. That would normally correspond to a hard link,
but the files do not (necessarily) correspond to the same file.

... but because we use anonymous device numbers for each subvolume, we
have different device numbers for each one.

sled1a:/mnt # stat --format "%n st_dev=%d" {a,b}/file
a/file st_dev=69
b/file st_dev=70

It's a pretty big usability wart that we don't consistently report the
device number. We do it correctly in stat() but there are other places
in the code that assume that inode->i_sb->s_dev will work. In the SUSE
kernels, we have patches that add a super_operation to report the
correct device number everywhere, but even that is a hack.

> I already attempted a illinformed-patch in fs/btrfs/super.c:
>
> @@ -1127,6 +1127,7 @@ static int btrfs_fill_super(struct super_block *sb,
>  		goto fail_close;
>  	}
>
> +	sb->s_dev = inode->i_sb->s_dev;
>  	sb->s_root = d_make_root(inode);
>  	if (!sb->s_root) {
>  		err = -ENOMEM;
>
> but it didn't help.

It wouldn't. That is assigning a variable to itself.

> I would like to dig deeper and fix it, but first I have to ask:
> - Which number is wrong?
>   The one returned by stat() or the one in mountinfo?

The one in mountinfo, but then that means that the user only sees the
anonymous devices in mount(8), which isn't what we want either.

I'm afraid the correct fix is very involved and requires non-trivial
changes in the VFS layer as well. It's on my long-term TODO list. I
currently have some patches that do the magic with vfsmounts but it's
far from being usable.

-Jeff

-- 
Jeff Mahoney
SUSE Labs
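The mismatch discussed in this thread can be checked from user space with a small program. The sketch below is only an illustration under stated assumptions: it is Linux-specific, the helper names demo_path_dev() and demo_dev_in_mountinfo() are made up here, and it relies on the third field of each /proc/self/mountinfo line being major:minor as documented in proc(5). Mountinfo lines longer than the buffer are not handled specially.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Fetch the major/minor numbers of st_dev for a path; 0 on success, -1 on error. */
static int demo_path_dev(const char *path, unsigned int *maj, unsigned int *min)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;
	*maj = major(st.st_dev);
	*min = minor(st.st_dev);
	return 0;
}

/*
 * Look for "maj:min" in the third field of /proc/self/mountinfo.
 * Returns 1 if found, 0 if not found, -1 if the file cannot be opened.
 */
static int demo_dev_in_mountinfo(unsigned int maj, unsigned int min)
{
	FILE *f = fopen("/proc/self/mountinfo", "r");
	char line[4096];
	unsigned int m, n;
	int found = 0;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		/* fields: mount ID, parent ID, major:minor, root, mount point, ... */
		if (sscanf(line, "%*d %*d %u:%u", &m, &n) == 2 &&
		    m == maj && n == min) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

For a file inside a btrfs subvolume, per the explanation above, one would expect demo_path_dev() to succeed while demo_dev_in_mountinfo() may return 0 for the same numbers, because stat() reports the per-subvolume anonymous device.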
Re: Fs: Btrfs - Fix possible ERR_PTR() dereferencing.
On 9/20/16 2:48 AM, Shailendra Verma wrote:
> This is of course wrong to call kfree() if memdup_user() fails,
> no memory was allocated and the error in the error-valued pointer
> should be returned.
>
> Reviewed-by: Ravikant Sharma
> Signed-off-by: Shailendra Verma

Hi Shailendra -

In all three cases, the return value is set using the error-valued
pointer and the pointer is set to NULL. kfree() checks to see if the
pointer is NULL and, if so, does nothing. This allows us to use a
common exit path which is an extremely common pattern in the kernel. So
there's never any possible ERR_PTR dereferencing happening.

However, in all three cases, the allocation you're checking is the first
in each routine and there's no additional cleanup to do. So your patch
is an improvement, but it's an improvement in code readability instead
of a bug fix. I'd ask that you re-submit with a commit message that
reflects that.

Thanks,

-Jeff

> ---
>  fs/btrfs/ioctl.c | 21 ++---
>  1 file changed, 6 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index da94138..58a40f8 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4512,11 +4512,8 @@ static long btrfs_ioctl_logical_to_ino(struct
> btrfs_root *root,
>  		return -EPERM;
>
>  	loi = memdup_user(arg, sizeof(*loi));
> -	if (IS_ERR(loi)) {
> -		ret = PTR_ERR(loi);
> -		loi = NULL;
> -		goto out;
> -	}
> +	if (IS_ERR(loi))
> +		return PTR_ERR(loi);
>
>  	path = btrfs_alloc_path();
>  	if (!path) {
> @@ -5143,11 +5140,8 @@ static long btrfs_ioctl_set_received_subvol_32(struct
> file *file,
>  	int ret = 0;
>
>  	args32 = memdup_user(arg, sizeof(*args32));
> -	if (IS_ERR(args32)) {
> -		ret = PTR_ERR(args32);
> -		args32 = NULL;
> -		goto out;
> -	}
> +	if (IS_ERR(args32))
> +		return PTR_ERR(args32);
>
>  	args64 = kmalloc(sizeof(*args64), GFP_NOFS);
>  	if (!args64) {
> @@ -5195,11 +5189,8 @@ static long btrfs_ioctl_set_received_subvol(struct
> file *file,
>  	int ret = 0;
>
>  	sa = memdup_user(arg, sizeof(*sa));
> -	if (IS_ERR(sa)) {
> -		ret = PTR_ERR(sa);
> -		sa = NULL;
> -		goto out;
> -	}
> +	if (IS_ERR(sa))
> +		return PTR_ERR(sa);
>
>  	ret = _btrfs_ioctl_set_received_subvol(file, sa);
>
>

-- 
Jeff Mahoney
SUSE Labs
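The two styles compared in this review can be put side by side in a user-space sketch. This is an analogy under stated assumptions rather than kernel code: demo_err_ptr(), demo_ptr_err() and demo_is_err() are hypothetical stand-ins for ERR_PTR(), PTR_ERR() and IS_ERR(), demo_memdup() stands in for memdup_user(), and free(NULL) plays the role of kfree(NULL), which likewise does nothing.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
static inline void *demo_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long demo_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

/* True if the pointer value falls in the top DEMO_MAX_ERRNO addresses. */
static inline int demo_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

/* Stand-in for memdup_user(): a copy of the buffer, or an encoded error. */
static void *demo_memdup(const void *src, size_t len)
{
	void *copy;

	if (!src)
		return demo_err_ptr(-EFAULT);
	copy = malloc(len);
	if (!copy)
		return demo_err_ptr(-ENOMEM);
	memcpy(copy, src, len);
	return copy;
}

/* Style 1: common exit path, as in the code before the patch. */
static long demo_ioctl_common_exit(const void *arg, size_t len)
{
	long ret = 0;
	void *copy = demo_memdup(arg, len);

	if (demo_is_err(copy)) {
		ret = demo_ptr_err(copy);
		copy = NULL;	/* free(NULL) below is a no-op */
		goto out;
	}
	/* ... the real work would go here ... */
out:
	free(copy);
	return ret;
}

/* Style 2: early return, as in the patch; fine while nothing else needs cleanup. */
static long demo_ioctl_early_return(const void *arg, size_t len)
{
	void *copy = demo_memdup(arg, len);

	if (demo_is_err(copy))
		return demo_ptr_err(copy);
	/* ... the real work would go here ... */
	free(copy);
	return 0;
}
```

Both functions return the same values for the same inputs, which matches the point made in the reply: the change affects readability, not correctness.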
[PATCH] btrfs: clean the old superblocks before freeing the device
From: Jeff Mahoney

btrfs_rm_device frees the block device but then re-opens it using
the saved device name. A race exists between the close and the
re-open that allows the block size to be changed. The result is
getting stuck forever in the reclaim loop in __getblk_slow.

This patch moves the superblock cleanup before closing the block
device, which is also consistent with other callers. We also don't
need a private copy of dev_name as the whole routine operates under
the uuid_mutex.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/volumes.c | 38 +++---
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c41f8c1..3adf5ce 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,7 +1846,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
 	u64 num_devices;
 	int ret = 0;
 	bool clear_super = false;
-	char *dev_name = NULL;
 
 	mutex_lock(&uuid_mutex);
 
@@ -1882,11 +1881,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
 		list_del_init(&device->dev_alloc_list);
 		device->fs_devices->rw_devices--;
 		unlock_chunks(root);
-		dev_name = kstrdup(device->name->str, GFP_KERNEL);
-		if (!dev_name) {
-			ret = -ENOMEM;
-			goto error_undo;
-		}
 		clear_super = true;
 	}
 
@@ -1936,14 +1930,21 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
 		btrfs_sysfs_rm_device_link(root->fs_info->fs_devices, device);
 	}
 
-	btrfs_close_bdev(device);
-
-	call_rcu(&device->rcu, free_device);
-
 	num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
 	btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
 	mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
+	/*
+	 * at this point, the device is zero sized and detached from
+	 * the devices list. All that's left is to zero out the old
+	 * supers and free the device.
+	 */
+	if (device->writeable)
+		btrfs_scratch_superblocks(device->bdev, device->name->str);
+
+	btrfs_close_bdev(device);
+	call_rcu(&device->rcu, free_device);
+
 	if (cur_devices->open_devices == 0) {
 		struct btrfs_fs_devices *fs_devices;
 		fs_devices = root->fs_info->fs_devices;
@@ -1962,24 +1963,7 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
 	root->fs_info->num_tolerated_disk_barrier_failures =
 		btrfs_calc_num_tolerated_disk_barrier_failures(root->fs_info);
 
-	/*
-	 * at this point, the device is zero sized. We want to
-	 * remove it from the devices list and zero out the old super
-	 */
-	if (clear_super) {
-		struct block_device *bdev;
-
-		bdev = blkdev_get_by_path(dev_name, FMODE_READ | FMODE_EXCL,
-					  root->fs_info->bdev_holder);
-		if (!IS_ERR(bdev)) {
-			btrfs_scratch_superblocks(bdev, dev_name);
-			blkdev_put(bdev, FMODE_READ | FMODE_EXCL);
-		}
-	}
-
 out:
-	kfree(dev_name);
-
 	mutex_unlock(&uuid_mutex);
 	return ret;
 
-- 
2.7.1
Re: Is stability a joke? (wiki updated)
On 2016-09-19 16:15, Zygo Blaxell wrote:
On Mon, Sep 19, 2016 at 01:38:36PM -0400, Austin S. Hemmelgarn wrote:

I'm not sure if the brfsck is really all that helpful to user as much
as it is for developers to better learn about the failure vectors of
the file system.

ReiserFS had no working fsck for all of the 8 years I used it (and still
didn't last year when I tried to use it on an old disk). "Not working"
here means "much less data is readable from the filesystem after running
fsck than before." It's not that much of an inconvenience if you have
backups.

For a small array, this may be the case. Once you start looking into
double digit TB scale arrays though, restoring backups becomes a very
expensive operation. If you had a multi-PB array with a single dentry
which had no inode, would you rather be spending multiple days restoring
files and possibly losing recent changes, or spend a few hours to check
the filesystem and fix it with minimal data loss?

I'd really prefer to be able to delete the dead dentry with 'rm' as root,
or failing that, with a ZDB-like tool or ioctl, if it's the only known
instance of such a bad metadata object and I already know where it's
located.

I entirely agree on that. The problem is that because the VFS layer
chokes on it, it can't be rm, and it would be non-trivial to implement
as an ioctl. It pretty much has to be out-of-band. I'd love to see
btrfs check add the ability to process subsets of the filesystem (for
example 'I know that something is screwed up somehow in
/path/to/random/directory, check only that path in the filesystem
(possibly recursively) and tell me what's wrong (and possibly try to fix
it)').

Usually the ultimate failure mode of a btrfs filesystem is a read-only
filesystem from which you can read most or all of your data, but you
can't ever make it writable again because of fsck limitations.

The one thing I do miss about every filesystem that isn't ext2/ext3 is
automated fsck that prioritizes availability, making the filesystem
safely writable even if it can't recover lost data.

On the other hand, fixing an ext[23] filesystem is utterly trivial
compared to any btree-based filesystem.

For a data center or corporate entity, dropping broken parts of the FS
and recovering from backups makes sense. For a traditional home user
(that is, the type of person Ubuntu and Windows traditionally target),
it usually doesn't, as they almost certainly don't have a backup.

Personally, I'd rather have a tool that gives me the option of whether
to try and fix a given path or just remove it, instead of assuming that
it knows how I want to fix it. That would allow for both use cases.
Re: State of the fuzzer
There are now 21 bugs open on bko, most of them crashes and some
undefined behavior. The nodes are now mostly running idle as no new
paths are discovered (after around one billion images tested in the
current run).

My thoughts are to wait until the current bugs have been fixed, then
restart the whole process from HEAD (together with the corpus of
~2.000 seed images discovered by now) and catch new bugs and aborts()
- we need to get rid of the reachable ones so code coverage can
improve. After those, I'll change the process to run btrfsck --repair,
which is slower but has a lot of yet uncovered code.

DigitalOcean has provided some funding for this undertaking so we are
good on CPU power. Kudos to them.

2016-09-13 22:28 GMT+02:00 Lukas Lueg :
> I've booted another instance with btrfs-progs checked out to 2b7c507
> and collected some bugs which remained from the run before the current
> one. The current run discovered what qgroups are just three days ago
> and will spend some time on that. I've also added UBSAN- and
> MSAN-logging to my setup and there were three bugs found so far (one
> is already fixed). I will boot a third instance to run lowmem-mode
> exclusively in the next few days.
>
> There are 11 bugs open at the moment, all have a reproducing image
> attached to them. The whole list is at
>
> https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=btrfs&email1=lukas.lueg%40gmail.com&emailreporter1=1&emailtype1=exact&list_id=858441&query_format=advanced
>
>
> 2016-09-09 16:00 GMT+02:00 David Sterba :
>> On Tue, Sep 06, 2016 at 10:32:28PM +0200, Lukas Lueg wrote:
>>> I'm currently fuzzing rev 2076992 and things start to slowly, slowly
>>> quiet down. We will probably run out of steam at the end of the week
>>> when a total of (roughly) half a billion BTRFS-images have passed by.
>>> I will switch revisions to current HEAD and restart the whole process
>>> then. A few things:
>>>
>>> * There are a couple of crashes (mostly segfaults) I have not reported
>>> yet. I'll report them if they show up again with the latest revision.
>>
>> Ok.
>>
>>> * The coverage-analysis shows assertion failures which are currently
>>> silenced. An assertion failure is technically a worse disaster
>>> successfully prevented, it still constitutes unexpected/unusable
>>> behaviour, though. Do you want assertions to be enabled and images
>>> triggering those assertions reported? This is basically the same
>>> conundrum as with BUG_ON and abort().
>>
>> Yes please. I'd like to turn most bugons/assertions into a normal
>> failure report if it would make sense.
>>
>>> * A few endless loops entered into by btrfsck are currently
>>> unmitigated (see bugs 155621, 155571, 11 and 155151). It would be
>>> nice if those had been taken care of by next week if possible.
>>
>> Two of them are fixed, the other two need more work, updating all
>> callers of read_node_slot and the callchain. So you may still see that
>> kind of looping in more images. I don't have an ETA for the fix, I won't
>> be available during the next week.
>>
>> At the moment, the initial sanity checks should catch most of the
>> corrupted values, so I'm expecting that you'll see different classes of
>> problems in the next rounds.
>>
>> The testsuite now contains all images that you reported and we have a
>> fix in git. There are more utilities run on the images, there may be
>> more problems for us to fix.
ChaCha20 vs. AES performance
Not on the list or I would've replied directly, but on Haswell,
ChaCha20 (in software) is over 2x as fast as AES (in hardware), at
realistic (for a filesystem) block sizes:

testing speed of ctr(aes) (ctr(aes-aesni)) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 378 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 1130 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 3981 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 15458 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 122880 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 391 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 1193 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 4212 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 16388 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 131029 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 417 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 1222 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 4398 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 17114 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 137028 cycles (8192 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 1 operation in 4356 cycles (16 bytes)
test 1 (256 bit key, 64 byte blocks): 1 operation in 4004 cycles (64 bytes)
test 2 (256 bit key, 256 byte blocks): 1 operation in 6524 cycles (256 bytes)
test 3 (256 bit key, 1024 byte blocks): 1 operation in 9248 cycles (1024 bytes)
test 4 (256 bit key, 8192 byte blocks): 1 operation in 60274 cycles (8192 bytes)

Poly1305 is also plenty fast:

testing speed of gcm(aes) (gcm_base(ctr-aes-aesni,ghash-generic)) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 7567 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 9654 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 19010 cycles (256 bytes)
test 3 (128 bit key, 512 byte blocks): 1 operation in 33118 cycles (512 bytes)
test 4 (128 bit key, 1024 byte blocks): 1 operation in 59738 cycles (1024 bytes)
test 5 (128 bit key, 2048 byte blocks): 1 operation in 106545 cycles (2048 bytes)
test 6 (128 bit key, 4096 byte blocks): 1 operation in 211189 cycles (4096 bytes)
test 7 (128 bit key, 8192 byte blocks): 1 operation in 370439 cycles (8192 bytes)
test 8 (192 bit key, 16 byte blocks): 1 operation in 6780 cycles (16 bytes)
test 9 (192 bit key, 64 byte blocks): 1 operation in 8802 cycles (64 bytes)
test 10 (192 bit key, 256 byte blocks): 1 operation in 17352 cycles (256 bytes)
test 11 (192 bit key, 512 byte blocks): 1 operation in 28680 cycles (512 bytes)
test 12 (192 bit key, 1024 byte blocks): 1 operation in 51230 cycles (1024 bytes)
test 13 (192 bit key, 2048 byte blocks): 1 operation in 96662 cycles (2048 bytes)
test 14 (192 bit key, 4096 byte blocks): 1 operation in 187287 cycles (4096 bytes)
test 15 (192 bit key, 8192 byte blocks): 1 operation in 372570 cycles (8192 bytes)
test 16 (256 bit key, 16 byte blocks): 1 operation in 6273 cycles (16 bytes)
test 17 (256 bit key, 64 byte blocks): 1 operation in 8096 cycles (64 bytes)
test 18 (256 bit key, 256 byte blocks): 1 operation in 15895 cycles (256 bytes)
test 19 (256 bit key, 512 byte blocks): 1 operation in 26259 cycles (512 bytes)
test 20 (256 bit key, 1024 byte blocks): 1 operation in 47121 cycles (1024 bytes)
test 21 (256 bit key, 2048 byte blocks): 1 operation in 91003 cycles (2048 bytes)
test 22 (256 bit key, 4096 byte blocks): 1 operation in 175883 cycles (4096 bytes)
test 23 (256 bit key, 8192 byte blocks): 1 operation in 340904 cycles (8192 bytes)

testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit
key, 16 byte blocks): 1 operation in 12145 cycles (16 bytes) test 1 (288 bit key, 64 byte blocks): 1 operation in 14538 cycles (64 bytes) test 2 (288 bit key, 256 byte blocks): 1 operation in 16435 cycles (256 bytes) test 3 (288 bit key, 512 byte blocks): 1 operation in 15622 cycles (512 bytes) test 4 (288 bit key, 1024 byte blocks): 1 operation in 18671 cycles (1024 bytes) test 5 (288 bit key, 2048 byte blocks): 1 operation in 23264 cycles (2048 bytes) test 6 (288 bit key, 4096 byte blocks): 1 operation in 36480 cycles (4096 bytes) test 7 (288 bit key, 8192 byte blocks): 1 operation in 75051 cycles (8192 bytes) When AVX-512 comes out ChaCha20 is going to get even faster - probably by more than 2x, since they're adding a rotate instruction. I haven't tested on ARM but I'd be surprised if the situation is significantly different there (the kernel's lacking a NEON ChaCha20 implementation, but I could do one). Just because it's implemented in hardware doesn't mean i
Re: how to understand "btrfs fi show" output? "No space left" issues
Output from my nightly balance script for my 15 TB RAID1 btrfs pool (3x 3TB + 1x 6TB) with ~100 snapshots:

Before balance of /media/RAID
Data, RAID1: total=5.57TiB, used=5.45TiB
System, RAID1: total=32.00MiB, used=832.00KiB
Metadata, RAID1: total=7.00GiB, used=6.03GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde        7.6T  6.1T  1.5T  81% /media/RAID
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=1
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=5
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=10
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=20
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=30
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=40
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=50
Done, had to relocate 0 out of 5710 chunks
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=1
  SYSTEM (flags 0x2): balancing, usage=1
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=5
  SYSTEM (flags 0x2): balancing, usage=5
Done, had to relocate 1 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=10
  SYSTEM (flags 0x2): balancing, usage=10
Done, had to relocate 1 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=20
  SYSTEM (flags 0x2): balancing, usage=20
Done, had to relocate 1 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=30
  SYSTEM (flags 0x2): balancing, usage=30
Done, had to relocate 1 out of 5710 chunks

After balance of /media/RAID
Data, RAID1: total=5.57TiB, used=5.45TiB
System, RAID1: total=32.00MiB, used=832.00KiB
Metadata, RAID1: total=7.00GiB, used=6.03GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde        7.6T  6.1T  1.5T  81% /media/RAID

It effectively reduces the internal fragmentation (to 0.12 TB of data and ~1 GB of metadata).

2016-09-20 10:59 GMT+02:00 Peter Becker:
> 2016-09-20 10:48 GMT+02:00 Hugo Mills:
>> On Tue, Sep 20, 2016 at 10:34:49AM +0200, Peter Becker wrote:
>>> More details on the issue and a complete explanation can be found here:
>>>
>>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>>> and
>>> (Help! I ran out of disk space!)
>>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>>>
>>> And an explanation for the "dlimit" solution:
>>
>> It's not "dlimit". It's "d" with option "limit". You could just as
>> easily write -dusage=99,limit=10 or -dlimit=10,usage=99 (although
>> those aren't the options I'd pick... see below).
>>
>>> Quote From: Uncommon solutions for BTRFS
>>> (http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)
>>>
>>> > For my purposes, I define internal fragmentation as space allocated but
>>> > not usable by the filesystem. In BTRFS, each time you delete files, the
>>> > space used by those files cannot be reused for new files automatically.
>>> > It's not a hard requirement to do this maintenance regularly, but doing
>>> > it regularly spares you waiting for hours when the disk is full and you
>>> > need to wait for a balance clean up command - and of course also reduces
>>> > the number of times you get unexpected disk full errors. As a side
>>> > note, this can also be useful to prolong the life of your SSD because it
>>> > allows the SSD to reuse space not needed by the filesystem (although
>>> > there is a trade-off, frequent balancing is bad, no balancing is bad, the
>>> > sweet spot is somewhere in between).
>>>
>>> 2016-09-20 10:20 GMT+02:00 Peter Becker:
>>> > Normally total and used should deviate by only a few GB.
>>> > Depending on your write workload, you should run
>>> >
>>> > btrfs balance start -dusage=60 /mnt
>>> >
>>> > every week to avoid "ENOSPC".
>>> >
>>> > If you use newer btrfs-progs that support balance limit filters, you
>>> > should run
>>> >
>>> > btrfs balance start -dusage=99 -dlimit=10 /mnt
>>> >
>>> > every 3 hours.
>>
>> These two options both feel horrible to me. Particularly the second
>> option, which is going to result in huge write load on the FS, and is
>> almost
Re: how to understand "btrfs fi show" output? "No space left" issues
2016-09-20 10:48 GMT+02:00 Hugo Mills:
> On Tue, Sep 20, 2016 at 10:34:49AM +0200, Peter Becker wrote:
>> More details on the issue and a complete explanation can be found here:
>>
>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>> and
>> (Help! I ran out of disk space!)
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>>
>> And an explanation for the "dlimit" solution:
>
> It's not "dlimit". It's "d" with option "limit". You could just as
> easily write -dusage=99,limit=10 or -dlimit=10,usage=99 (although
> those aren't the options I'd pick... see below).
>
>> Quote From: Uncommon solutions for BTRFS
>> (http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)
>>
>> > For my purposes, I define internal fragmentation as space allocated but
>> > not usable by the filesystem. In BTRFS, each time you delete files, the
>> > space used by those files cannot be reused for new files automatically.
>> > It's not a hard requirement to do this maintenance regularly, but doing
>> > it regularly spares you waiting for hours when the disk is full and you
>> > need to wait for a balance clean up command - and of course also reduces
>> > the number of times you get unexpected disk full errors. As a side
>> > note, this can also be useful to prolong the life of your SSD because it
>> > allows the SSD to reuse space not needed by the filesystem (although
>> > there is a trade-off, frequent balancing is bad, no balancing is bad, the
>> > sweet spot is somewhere in between).
>>
>> 2016-09-20 10:20 GMT+02:00 Peter Becker:
>> > Normally total and used should deviate by only a few GB.
>> > Depending on your write workload, you should run
>> >
>> > btrfs balance start -dusage=60 /mnt
>> >
>> > every week to avoid "ENOSPC".
>> >
>> > If you use newer btrfs-progs that support balance limit filters, you
>> > should run
>> >
>> > btrfs balance start -dusage=99 -dlimit=10 /mnt
>> >
>> > every 3 hours.
>
> These two options both feel horrible to me. Particularly the second
> option, which is going to result in huge write load on the FS, and is
> almost certainly going to be unnecessary most of the time.

I took this from kdave's btrfs maintenance scripts, and it has worked for me for a year. (https://github.com/kdave/btrfsmaintenance)

> My recommendation would be to check at regular intervals (daily,
> say) whether the used value is equal to the size value in btrfs fi
> show. If it is (and only if), then you should run a balance with no
> usage= option, and with limit=N, for some relatively small value of
> N (3, say). That will give you some unallocated space that the FS
> can take for metadata should it need it, which is all that's required
> to avoid early ENOSPC.

With no usage option, how do you avoid balancing full blocks? -dusage=99 only balances blocks that have some empty space.

> If you regularly find that your usage patterns result in large
> numbers of empty or near-empty block groups (i.e. lots of headroom in
> data shown by btrfs fi df), then a regular (but probably less
> frequent) balance with something like usage=5 should keep that down.
>
>> > This will balance 2 blocks (dlimit=10; corresponds to 10 GB) which are
>
> No, it will balance 10 complete block groups, not 10 GiB. Depending
> on the RAID configuration, that could be a very large amount of data
> indeed. (For example, an 8-disk RAID-10 would be rewriting up to 80
> GiB of data with that command).

Thanks for this clarification.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
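The "10 block groups, not 10 GiB" correction quoted above can be checked with a little arithmetic. This is only a rough sketch: it assumes each data block group occupies about 1 GiB on every device it spans (an assumption, not something stated in this thread), and the function name is made up for illustration. The result is an upper bound on the raw space touched, since partly empty block groups hold less data.

```python
# ASSUMPTION: a data block group takes roughly 1 GiB on each device it spans.
GIB_PER_DEVICE_STRIPE = 1


def raw_gib_rewritten(block_groups, devices_per_block_group):
    """Upper bound on raw GiB touched when balancing whole block groups."""
    return block_groups * devices_per_block_group * GIB_PER_DEVICE_STRIPE


if __name__ == "__main__":
    # limit=10 on an 8-disk RAID-10, the example from the quoted reply
    print(raw_gib_rewritten(10, 8), "GiB")
    # limit=3, the smaller value recommended in the quoted reply
    print(raw_gib_rewritten(3, 8), "GiB")
```

Under that assumption the limit filter scales with the number of devices per block group, which is why the same command is much heavier on a wide RAID-10 than on a two-disk RAID-1.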
Re: how to understand "btrfs fi show" output? "No space left" issues
2016-09-20 10:30 GMT+02:00 Andrei Borzenkov:
> On Tue, Sep 20, 2016 at 11:20 AM, Peter Becker wrote:
> I still do not understand where ENOSPC comes from in the first place.
> Filesystem is half empty. Do you suggest that it is normal to get
> ENOSPC in this case?

It's how the block allocator and the chunk allocator work together. As far as I know, the developers have this "bug" on their todo list.
Re: how to understand "btrfs fi show" output? "No space left" issues
On Tue, Sep 20, 2016 at 10:34:49AM +0200, Peter Becker wrote:
> More details on the issue and a complete explanation can be found here:
>
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
> and
> (Help! I ran out of disk space!)
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>
> And an explanation for the "dlimit" solution:

It's not "dlimit". It's "d" with option "limit". You could just as easily write -dusage=99,limit=10 or -dlimit=10,usage=99 (although those aren't the options I'd pick... see below).

> Quote From: Uncommon solutions for BTRFS
> (http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)
>
> > For my purposes, I define internal fragmentation as space allocated but
> > not usable by the filesystem. In BTRFS, each time you delete files, the
> > space used by those files cannot be reused for new files automatically.
> > It's not a hard requirement to do this maintenance regularly, but doing
> > it regularly spares you waiting for hours when the disk is full and you
> > need to wait for a balance clean up command - and of course also reduces
> > the number of times you get unexpected disk full errors. As a side
> > note, this can also be useful to prolong the life of your SSD because it
> > allows the SSD to reuse space not needed by the filesystem (although
> > there is a trade-off, frequent balancing is bad, no balancing is bad, the
> > sweet spot is somewhere in between).
>
> 2016-09-20 10:20 GMT+02:00 Peter Becker:
> > Normally total and used should deviate by only a few GB.
> > Depending on your write workload, you should run
> >
> > btrfs balance start -dusage=60 /mnt
> >
> > every week to avoid "ENOSPC".
> >
> > If you use newer btrfs-progs that support balance limit filters, you
> > should run
> >
> > btrfs balance start -dusage=99 -dlimit=10 /mnt
> >
> > every 3 hours.

These two options both feel horrible to me. Particularly the second option, which is going to result in huge write load on the FS, and is almost certainly going to be unnecessary most of the time.

My recommendation would be to check at regular intervals (daily, say) whether the used value is equal to the size value in btrfs fi show. If it is (and only if), then you should run a balance with no usage= option, and with limit=N, for some relatively small value of N (3, say). That will give you some unallocated space that the FS can take for metadata should it need it, which is all that's required to avoid early ENOSPC.

If you regularly find that your usage patterns result in large numbers of empty or near-empty block groups (i.e. lots of headroom in data shown by btrfs fi df), then a regular (but probably less frequent) balance with something like usage=5 should keep that down.

> > This will balance 2 blocks (dlimit=10; corresponds to 10 GB) which are

No, it will balance 10 complete block groups, not 10 GiB. Depending on the RAID configuration, that could be a very large amount of data indeed. (For example, an 8-disk RAID-10 would be rewriting up to 80 GiB of data with that command).

Hugo.

> > not completely filled into new blocks. You could/should adjust the
> > interval and the limit filter depending on your write workload.
> > For example, if you write (changed files + new files) only 10 GB a day,
> > it will be enough to run this every night.
> > The last option completely avoids the ENOSPC issue but produces
> > additional workload for your hard drives.
> >
> > Note: you should avoid making snapshots during balance. Use a simple
> > locking mechanism for that.

--
Hugo Mills             | There isn't a noun that can't be verbed.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
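The daily check recommended above (is "used" equal to "size" in btrfs fi show?) could be sketched roughly as follows. This is a minimal illustration, not a tested maintenance script: the function name and the regular expression are made up for this example, the sample text is the btrfs fi show output posted elsewhere in this thread, and comparing the printed (rounded) values is only an approximation of the real allocation state. The script only prints a suggestion and does not run any balance itself.

```python
import re

# Matches the device lines of `btrfs fi show`, for example:
#   devid    1 size 423.13GiB used 423.13GiB path /dev/sda3
DEVID_RE = re.compile(
    r"devid\s*(\d+)\s+size\s+(\S+)\s+used\s+(\S+)\s+path\s+(\S+)"
)


def fully_allocated_devices(fi_show_text):
    """Return the paths of devices whose 'used' value equals 'size'."""
    full = []
    for match in DEVID_RE.finditer(fi_show_text):
        _devid, size, used, path = match.groups()
        if size == used:  # compares the printed values, as a human would
            full.append(path)
    return full


# Sample taken from the output posted in this thread.
SAMPLE = """\
Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
        Total devices 2 FS bytes used 136.18GiB
        devid    1 size 423.13GiB used 423.13GiB path /dev/sda3
        devid    2 size 423.13GiB used 423.13GiB path /dev/sdb3
"""

if __name__ == "__main__":
    full = fully_allocated_devices(SAMPLE)
    if full:
        print("fully allocated:", ", ".join(full))
        print("suggested: btrfs balance start -dlimit=3 <mountpoint>")
    else:
        print("unallocated space left, no balance needed")
```

In a real job the text would come from running btrfs fi show against the mount point; that step is left out here so the sketch stays runnable without a btrfs filesystem.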
Re: stability matrix (was: Is stability a joke?)
On Mon, Sep 19, 2016 at 09:45:46PM +0200, Christoph Anton Mitterer wrote: > +1 for all your changes with the following comments in addition... > > > On Mon, 2016-09-19 at 17:27 +0200, David Sterba wrote: > > That's more like a usecase, thats out of the scope of the tabular > > overview. But we have an existing page UseCases that I'd like to > > transform to a more structured and complete overview of usceases of > > various features, so the UUID collisions would build on top of that > > with > > "and this could hapen if ...". > Well I don't agree here and see it basically like Austin. So we'd have to make that two separate topics so the "what if" has better visibility, and possibly marked "with security implications". > It's not that these UUID collisions can only happen in special > circumstances but plain normal situations that always used to work with > probably literally each and every fs. (So much for the accidental > corruptions). > > And an attack is probably never "usecase dependant"... it always > depends on the attacker. > And since that seems to be a pretty real attack vector, I'd also say > it's mandatory to quite clearly warn about that deficiency... > > TBH, I'm rather surprised that this situation seems to be kinda > "accepted". > > I had a chat with CM recently and he implied things might be solved > with encryption. > While this is probably the case for at least some of the described > problems, it rather seems like a workaround: > - why making btrfs-encryption mandatory for devices who have partially > secured access (e.g. where a systemdisk with btrfs is not physically > accessible but a USB port is) > - what about users that rather want to use block device encryption > instead of fs-level-encryption? > > > > > - in-band dedupe > > > deduped are IIRC not bitwise compared by the kernel before de- > > > duping, > > > as it's the case with offline dedupe. > > > Even if this is considered safe by the community... I think users > > > should be told. 
> > Only features merged are reflected. And the out-of-band dedupe does > > full > > memcpy. See btrfs_cmp_data() called from btrfs_extent_same(). > Ah,... I kinda thought it was already merged ... possibly got confused > by the countless patch iterations of it ;) > > > > > - btrfs check --repair (and others?) > > > Telling people that this may often cause more harm than good. > > I think userspace tools do not belong to the overview. > Well... I wouldn't mind if there was a btrfs-progs status page... (and > both link each other). > OTOH,... the user probably wants one central point where all relevant > info can be found... and not again having to dig through n websites. The Status page should give enough overview about all main topics, so the progs can be one section there. Any details should go to separate pages and be linked from there. > > > - even mounting a fs ro, may cause it to be changed > > > > This would go to the UseCases > Fine for me. > > > > > > > > > > - DB/VM-image like IO patterns + nodatacow + (!)checksumming > > > + (auto)defrag + snapshots > > > a) > > > People typically may have the impression: > > > btrfs = checksummed => als is guaranteed to be "valid" (or at > > > least > > > noticed) > > > However this isn't the case for nodatacow'ed files, which in turn > > > is > > > kinda "mandatory" for DB/VM-image like IO patterns, cause > > > otherwise > > > these would fragment to heavily (see (b). > > > Unless claimed by some people, none of the major DBs or VM-image > > > formats do general checksumming on their own, most even don't > > > support > > > it, some that do wouldn't do it without app-support and few > > > "just" > > > don't do it per default. > > > Thus one should bump people to this situation and that they may > > > not > > > get this "correctness" guarantee here. 
> > > b) > > > IIRC, it doesn't even help to simply not use nodatacow on such > > > files > > > and using auto-defrag instead to countermeasure the fragmenting, > > > as > > > that one doesn't perform too well on large files. > > > > Same. > Fine for me either... you already said above you would mention the > nodatacow=>no-checksumming=>no-verification-and-no-raid-repair in the > general section... this is enough for that place. > > > > > For specific features: > > > - Autodefrag > > > - didn't that also cause reflinks to be broken up? > > > > No and never had. > > Absolutely sure? One year ago, I was told that at first too so I > started using it, but later on some (IIRC) developer said auto-defrag > would also suffer from it. Reading the subthread, I have to change the statement. Autodefrag can read surrounding blocks up to 64k and write it to a new location, on that write the links will get broken. I'll update the page. > > > > - RAID* > > > No userland tools for monitoring/etc. > > > > That's a usability bug. > > Well it is and it will probably go away sooner or later...
Re: how to understand "btrfs fi show" output? "No space left" issues
More details on the issue and a complete explanation can be found here:

http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
and
(Help! I ran out of disk space!)
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21

And an explanation for the "dlimit" solution:

Quote From: Uncommon solutions for BTRFS
(http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)

> For my purposes, I define internal fragmentation as space allocated but
> not usable by the filesystem. In BTRFS, each time you delete files, the
> space used by those files cannot be reused for new files automatically.
> It's not a hard requirement to do this maintenance regularly, but doing
> it regularly spares you waiting for hours when the disk is full and you
> need to wait for a balance clean up command - and of course also reduces
> the number of times you get unexpected disk full errors. As a side
> note, this can also be useful to prolong the life of your SSD because it
> allows the SSD to reuse space not needed by the filesystem (although
> there is a trade-off, frequent balancing is bad, no balancing is bad, the
> sweet spot is somewhere in between).

2016-09-20 10:20 GMT+02:00 Peter Becker:
> Normally total and used should deviate by only a few GB.
> Depending on your write workload, you should run
>
> btrfs balance start -dusage=60 /mnt
>
> every week to avoid "ENOSPC".
>
> If you use newer btrfs-progs that support balance limit filters, you
> should run
>
> btrfs balance start -dusage=99 -dlimit=10 /mnt
>
> every 3 hours.
>
> This will balance 2 blocks (dlimit=10; corresponds to 10 GB) which are
> not completely filled into new blocks. You could/should adjust the
> interval and the limit filter depending on your write workload.
> For example, if you write (changed files + new files) only 10 GB a day,
> it will be enough to run this every night.
> The last option completely avoids the ENOSPC issue but produces
> additional workload for your hard drives.
>
> Note: you should avoid making snapshots during balance. Use a simple
> locking mechanism for that.
Re: how to understand "btrfs fi show" output? "No space left" issues
On Tue, Sep 20, 2016 at 11:20 AM, Peter Becker wrote:
> The last option completely avoids the ENOSPC issue but produces
> additional workload for your hard drives.
>

I still do not understand where ENOSPC comes from in the first place. Filesystem is half empty. Do you suggest that it is normal to get ENOSPC in this case?
Re: how to understand "btrfs fi show" output? "No space left" issues
Normally total and used should deviate by only a few GB. Depending on your write workload, you should run

btrfs balance start -dusage=60 /mnt

every week to avoid "ENOSPC".

If you use newer btrfs-progs that support balance limit filters, you should run

btrfs balance start -dusage=99 -dlimit=10 /mnt

every 3 hours.

This will balance 2 blocks (dlimit=10; corresponds to 10 GB) which are not completely filled into new blocks. You could/should adjust the interval and the limit filter depending on your write workload. For example, if you write (changed files + new files) only 10 GB a day, it will be enough to run this every night.
The last option completely avoids the ENOSPC issue but produces additional workload for your hard drives.

Note: you should avoid making snapshots during balance. Use a simple locking mechanism for that.
Re: stability matrix (was: Is stability a joke?)
On Tue, Sep 20, 2016 at 07:59:44AM +, Duncan wrote: > Christoph Anton Mitterer posted on Mon, 19 Sep 2016 21:45:46 +0200 as > excerpted: > > > On Mon, 2016-09-19 at 17:27 +0200, David Sterba wrote: > > > >> > For specific features: > >> > - Autodefrag > >> > - didn't that also cause reflinks to be broken up? > >> > >> No and never had. > > > > Absolutely sure? One year ago, I was told that at first too so I started > > using it, but later on some (IIRC) developer said auto-defrag would also > > suffer from it. > > AFAIK it was Hugo that said he looked into that, and that (if I'm > representing it correctly) autodefrag breaks reflinks and triggers space- > using duplication much as defrag does, but that it does it on a much > smaller scale, since it (1) only triggers when some parts of a file are > being rewritten anyway, thus breaking the reflink for those specific > parts of the file due to COW (COW1 on otherwise NOCOW files) in any case, > and (2) unlike defrag, doesn't rewrite and thus break the reflinks on > entire files, just somewhat larger extents than the pure rewrite by > itself without autodefrag would. > > Thus making the reflink-breaking and duplication effect of autodefrag > there, but relatively quite small compared to on-demand per-file defrag. I didn't investigate it -- It was my firmly-stated misunderstanding which caused someone (Filipe, I think) with much more actual knowledge to correct me, thus making the actual behaviour much clearer. :) I think your description is accurate as far as my current understanding goes. Hugo. > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Hugo Mills | There isn't a noun that can't be verbed. 
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4 | signature.asc Description: Digital signature
Re: [PATCH] Btrfs: kill BUG_ON in do_relocation
On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote:
> > > That's EIO. Sometimes the EIO is big enough we have to abort, but
> > > really the abort is just adding bonus.
> >
> > I think we misuse the EIO where we should really return EFSCORRUPTED
> > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
> > really a message that the hardware is bad.
>
> I love this idea, but one quick question, when returning EUCLEAN, what
> message do users get?
>
> "#define EUCLEAN 117 /* Structure needs cleaning */"

strerror(EUCLEAN) -> "Structure needs cleaning"
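For anyone who wants to see the message text from userspace, here is a small sketch that looks up both error codes discussed above. It assumes a Linux system; on platforms where Python does not expose EUCLEAN it falls back to the value 117 from the #define quoted above, and the text returned there may differ. The helper name is made up for this example.

```python
import errno
import os

# EUCLEAN is Linux-specific; fall back to its Linux value (117, per the
# #define quoted above) on platforms where Python does not expose it.
EUCLEAN = getattr(errno, "EUCLEAN", 117)


def describe(err):
    """Return the symbolic name (if known) and the strerror() text."""
    name = errno.errorcode.get(err, "unknown")
    return name, os.strerror(err)


if __name__ == "__main__":
    for err in (errno.EIO, EUCLEAN):
        name, text = describe(err)
        print(f"{name} ({err}): {text}")
```

The exact wording comes from the C library, so it can vary between libc implementations.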
Re: stability matrix (was: Is stability a joke?)
Christoph Anton Mitterer posted on Mon, 19 Sep 2016 21:45:46 +0200 as excerpted:

> On Mon, 2016-09-19 at 17:27 +0200, David Sterba wrote:
>
>> > For specific features:
>> > - Autodefrag
>> >   - didn't that also cause reflinks to be broken up?
>>
>> No and never had.
>
> Absolutely sure? One year ago, I was told that at first too so I started
> using it, but later on some (IIRC) developer said auto-defrag would also
> suffer from it.

AFAIK it was Hugo that said he looked into that, and that (if I'm representing it correctly) autodefrag breaks reflinks and triggers space-using duplication much as defrag does, but that it does it on a much smaller scale, since it (1) only triggers when some parts of a file are being rewritten anyway, thus breaking the reflink for those specific parts of the file due to COW (COW1 on otherwise NOCOW files) in any case, and (2) unlike defrag, doesn't rewrite and thus break the reflinks on entire files, just somewhat larger extents than the pure rewrite by itself without autodefrag would.

Thus making the reflink-breaking and duplication effect of autodefrag there, but relatively quite small compared to on-demand per-file defrag.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: how to understand "btrfs fi show" output? "No space left" issues
On 2016-09-20 16:27, Peter Becker wrote:
> You have 417(total)-131(used) blocks which are only partially filled.
> You should balance your file-system.
(...)
> #or a full balance
> btrfs balance start /mnt

OK, does it mean that btrfs needs some userspace daemon which does the following from time to time (how often?):

1) btrfs fi show /mountpoint(s)
2) if "used" is more than 90% (or 80%? or 70%?) of "size" - run a full balance
3) ...unless "btrfs fi df" shows that "used" is 95% (?) or more of "total", then don't bother, as we're "really" full

?

Tomasz Chmielewski
https://lxadm.com
Re: how to understand "btrfs fi show" output? "No space left" issues
Yes, have it disabled already (for their datadirs).

Tomasz Chmielewski
https://lxadm.com

On 2016-09-20 16:30, Peter Becker wrote:
> For the future: disable COW for all database containers.
>
> 2016-09-20 9:28 GMT+02:00 Peter Becker:
>> * If this does NOT solve the "No space left" issues you must remove old snapshots.
>>
>> 2016-09-20 9:27 GMT+02:00 Peter Becker:
>>> Data, RAID1: total=417.12GiB, used=131.33GiB
>>>
>>> You have 417(total)-131(used) blocks which are only partially filled.
>>> You should balance your file-system.
>>>
>>> At first you need some free space. You could remove some files / old
>>> snapshots etc. or you add an empty USB stick with min. 4 GB to your
>>> BTRFS pool (after balancing completes you can remove the stick from the
>>> pool).
>>>
>>> But at first you should try to free empty data and metadata blocks:
>>>
>>> btrfs balance start -musage=0 /mnt
>>> btrfs balance start -dusage=0 /mnt
>>>
>>> Then you can run a full balance or a partial balance:
>>>
>>> #a partial balance which reorganizes data blocks less than 50% filled
>>> btrfs balance start -dusage=50 /mnt
>>>
>>> #or a full balance
>>> btrfs balance start /mnt
>>>
>>> Because of a possible bug you should disable all snapshot scripts
>>> (like cron-jobs) during the balance.
>>>
>>> If this solves the "No space left" issues you must remove old snapshots.
>>>
>>> 2016-09-20 8:58 GMT+02:00 Hugo Mills:
>>>> On Tue, Sep 20, 2016 at 03:47:14PM +0900, Tomasz Chmielewski wrote:
>>>>> How to understand the following "btrfs fi show" output?
>>>>
>>>> This gives a write-up (and worked example) of an answer to your question:
>>>>
>>>> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>>>>
>>>> If you've got any follow-up questions after reading it, please do
>>>> come back and we can try to improve the FAQ entry. :)
>>>>
>>>> Hugo.
>>>>
>>>>> # btrfs fi show /var/lib/lxd
>>>>> Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
>>>>>         Total devices 2 FS bytes used 136.18GiB
>>>>>         devid    1 size 423.13GiB used 423.13GiB path /dev/sda3
>>>>>         devid    2 size 423.13GiB used 423.13GiB path /dev/sdb3
>>>>>
>>>>> Why is it "size 423.13GiB used 423.13GiB"? Is it full?
>>>>>
>>>>> I had "No space left" on this filesystem just yesterday (running
>>>>> kernel 4.7.4).
>>>>>
>>>>> This is btrfs RAID-1 on SSD disks. This filesystem is used for
>>>>> 20-30 LXD containers with different roles (mongo, mysql, postgres
>>>>> databases, webservers etc.), around 150 read-only snapshots, btrfs
>>>>> compression is disabled.
>>>>>
>>>>> Both "btrfs fi df" and "df -h" show plenty of space:
>>>>>
>>>>> # btrfs fi df /var/lib/lxd
>>>>> Data, RAID1: total=417.12GiB, used=131.33GiB
>>>>> System, RAID1: total=8.00MiB, used=80.00KiB
>>>>> Metadata, RAID1: total=6.00GiB, used=4.86GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>
>>>>> # df -h
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/sda3       424G  137G  286G  33% /var/lib/lxd
>>>>>
>>>>> Tomasz Chmielewski
>>>>> https://lxadm.com
>>>>
>>>> --
>>>> Hugo Mills             | I can resist everything except temptation.
>>>> hugo@... carfax.org.uk |
>>>> http://carfax.org.uk/  |
>>>> PGP: E2AB1DE4          |
Re: [PATCH] btrfs: Fix handling of -ENOENT from btrfs_uuid_iter_rem
[Resend due to gmail mobile interface defaulting to html layout]

>>
>> We know it's returning -ENOENT, so it should in theory be enough to just
>> goto again_search_slot, assuming that we just raced with the deletion.
>
> I will apply this on the machines which are exhibiting problems and will
> report whether it rectified the situation. I bump the objectid since this is
> what you suggested. I can also try without it.
>>
>> -chris
Re: how to understand "btrfs fi show" output? "No space left" issues
For the future: disable COW for all database containers.

2016-09-20 9:28 GMT+02:00 Peter Becker:
> * If this does NOT solve the "No space left" issues, you must remove old snapshots.
Re: how to understand "btrfs fi show" output? "No space left" issues
* If this does NOT solve the "No space left" issues, you must remove old snapshots.

2016-09-20 9:27 GMT+02:00 Peter Becker:
> If this solves the "No space left" issues you must remove old snapshots.
Re: how to understand "btrfs fi show" output? "No space left" issues
Data, RAID1: total=417.12GiB, used=131.33GiB

You have 417 (total) - 131 (used) GiB in blocks which are only partially filled.
You should balance your file system.

First you need some free space. You could remove some files, old
snapshots etc., or add an empty USB stick of at least 4 GB to your
btrfs pool (after the balance completes you can remove the stick from the
pool).

But first you should try to free empty data and metadata blocks:

btrfs balance start -musage=0 /mnt
btrfs balance start -dusage=0 /mnt

Then you can run a full balance or a partial balance:

# a partial balance that reorganizes data blocks less than 50% filled
btrfs balance start -dusage=50 /mnt

# or a full balance
btrfs balance start /mnt

Because of a possible bug you should disable all snapshot scripts
(like cron jobs) during the balance.

If this solves the "No space left" issues you must remove old snapshots.
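The arithmetic behind the advice above can be sketched roughly as follows. This is only an illustration in Python, not a btrfs tool: the helper name `chunk_slack` and the parsing regex are assumptions made for this example, and the sample text is the `btrfs fi df` output quoted in this thread. "Slack" here means space allocated to chunks of a given type but not used inside them.

```python
import re

# `btrfs fi df` output as quoted in this thread.
FI_DF = """\
Data, RAID1: total=417.12GiB, used=131.33GiB
System, RAID1: total=8.00MiB, used=80.00KiB
Metadata, RAID1: total=6.00GiB, used=4.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

# Binary unit suffixes as printed in the sample above.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# Assumed line layout: "<kind>, <profile>: total=<n><unit>, used=<n><unit>"
LINE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tu>[KMGT]?i?B), "
    r"used=(?P<used>[\d.]+)(?P<uu>[KMGT]?i?B)$"
)


def chunk_slack(text):
    """Hypothetical helper: map chunk kind -> (total, used, slack) in bytes."""
    result = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that is not a "total=..., used=..." line
        total = float(m["total"]) * UNITS[m["tu"]]
        used = float(m["used"]) * UNITS[m["uu"]]
        result[m["kind"]] = (total, used, total - used)
    return result


if __name__ == "__main__":
    gib = 2**30
    for kind, (total, used, slack) in chunk_slack(FI_DF).items():
        print(f"{kind:13s} allocated {total / gib:8.2f} GiB, "
              f"used {used / gib:8.2f} GiB, slack {slack / gib:8.2f} GiB")
```

For the figures in this thread, the Data line works out to roughly 286 GiB that is allocated to data chunks but not filled, which is the space a usage-filtered balance such as `-dusage=50` is meant to compact.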
Re: how to understand "btrfs fi show" output? "No space left" issues
OK, according to that - it means 423.13GiB out of total available space, 423.13GiB, has been allocated.

Is it good? Is it bad? Is it why I'm getting "No space left" issues?

Why has it allocated all available space, if only around 1/3 of space is in use, according to other tools (less than 140 GB out of 423 GB is in use)?

On other systems, I see that "used" from "btrfs fi show" more or less matches the output of "btrfs fi df"; here - everything is allocated.

Tomasz Chmielewski
https://lxadm.com

On 2016-09-20 15:58, Hugo Mills wrote:
> On Tue, Sep 20, 2016 at 03:47:14PM +0900, Tomasz Chmielewski wrote:
>> How to understand the following "btrfs fi show" output?
>
> This gives a write-up (and worked example) of an answer to your question:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>
> If you've got any follow-up questions after reading it, please do
> come back and we can try to improve the FAQ entry. :)
>
> Hugo.
>
>> # btrfs fi show /var/lib/lxd
>> Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
>>         Total devices 2 FS bytes used 136.18GiB
>>         devid    1 size 423.13GiB used 423.13GiB path /dev/sda3
>>         devid    2 size 423.13GiB used 423.13GiB path /dev/sdb3
>>
>> Why is it "size 423.13GiB used 423.13GiB"? Is it full?
>>
>> I had "No space left" on this filesystem just yesterday (running
>> kernel 4.7.4). This is btrfs RAID-1 on SSD disks. This filesystem is
>> used for 20-30 LXD containers with different roles (mongo, mysql,
>> postgres databases, webservers etc.), around 150 read-only
>> snapshots, btrfs compression is disabled.
>>
>> Both "btrfs fi df" and "df -h" show plenty of space:
>>
>> # btrfs fi df /var/lib/lxd
>> Data, RAID1: total=417.12GiB, used=131.33GiB
>> System, RAID1: total=8.00MiB, used=80.00KiB
>> Metadata, RAID1: total=6.00GiB, used=4.86GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # df -h
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/sda3       424G  137G  286G  33% /var/lib/lxd
>>
>> Tomasz Chmielewski
>> https://lxadm.com
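To make the "size" versus "used" comparison in this message concrete, here is a small sketch in Python. It is an illustration only: the function name `unallocated_per_device` is hypothetical, the regex assumes every figure is printed in GiB as in this thread, and the sample text is the `btrfs fi show` output quoted above with the "devid 1" spacing restored. As noted in the message, "used" on a devid line is the space allocated to chunks, not the space filled with data.

```python
import re

# `btrfs fi show` output as quoted in this thread (devid spacing restored).
FI_SHOW = """\
Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
    Total devices 2 FS bytes used 136.18GiB
    devid    1 size 423.13GiB used 423.13GiB path /dev/sda3
    devid    2 size 423.13GiB used 423.13GiB path /dev/sdb3
"""

# Assumption: sizes are always printed in GiB, as they are in the sample.
DEVID = re.compile(
    r"devid\s+(\d+)\s+size\s+([\d.]+)GiB\s+used\s+([\d.]+)GiB\s+path\s+(\S+)"
)


def unallocated_per_device(text):
    """Hypothetical helper: map device path -> unallocated GiB (size - allocated)."""
    out = {}
    for m in DEVID.finditer(text):
        size = float(m.group(2))
        allocated = float(m.group(3))  # "used" on a devid line = allocated to chunks
        out[m.group(4)] = round(size - allocated, 2)
    return out


if __name__ == "__main__":
    for path, free_gib in unallocated_per_device(FI_SHOW).items():
        print(f"{path}: {free_gib:.2f} GiB unallocated")
```

With the numbers from this thread both devices come out with no unallocated space, so no new chunk of any type can be allocated even though the existing data chunks are mostly empty. That appears to be consistent with the "No space left" errors while "df -h" still reports free space.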