btrfs rare silent data corruption with kernel data leak

2016-09-20 Thread Zygo Blaxell
Summary: 

There seem to be two btrfs bugs here: one loses data on writes,
and the other leaks data from the kernel to replace it on reads.  It all
happens after checksums are verified, so the corruption is entirely
silent--no EIO errors, kernel messages, or device event statistics.

Compressed extents are corrupted with a kernel data leak.  Uncompressed
extents may not be corrupted at all, or may be corrupted by
deterministically replacing data bytes with zero.  No preconditions
for corruption are known.  Fewer than one file per hundred thousand
seems to be affected.  Only specific parts of any file can be affected.
Kernels v4.0 through v4.5.7 were tested, and all have the issue.

Background, observations, and analysis:

I've been detecting silent data corruption on btrfs for over a year.
Over time I've been improving data collection and controlling for
confounding factors (other known btrfs bugs, RAM and CPU failures, raid5,
etc).  I have recently isolated the most common remaining corruption mode,
and it seems to be a btrfs bug.

I don't have an easy recipe to create a corrupted file and I don't know
precisely how they come to exist.  In the wild, about one in 10^5..10^7
files is provably corrupted.  The corruption can only occur at one point
in each file so the rate of corruption incidents follows the number
of files.  It seems to occur most often to software builders and rsync
backup receivers.  It seems to happen mostly on busier machines with
mixed workloads and not at all on idle test VMs trying to reproduce this
issue with a script.

One way to get corruption is to set up a series of filesystems and rsync
/usr to them sequentially (i.e. rsync -a /usr /fs-A; rsync -a /fs-A /fs-B;
rsync -a /fs-B /fs-C; ...) and verify each copy by comparison afterwards.
The same host needs to be doing other filesystem workloads or it won't
seem to reproduce this issue.  It took me two weeks to intentionally
create one corrupt file this way.  Good luck.

In cases where this corruption mode is found, the files always have an
extent map following this pattern:

# filefrag -v usr/share/icons/hicolor/icon-theme.cache
Filesystem type is: 9123683e
File size of usr/share/icons/hicolor/icon-theme.cache is 36456 (9 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    4095:          0..      4095:   4096:             encoded,not_aligned,inline
   1:        1..       8:  182785288.. 182785295:      8:          1: last,encoded,shared,eof
usr/share/icons/hicolor/icon-theme.cache: 2 extents found

Note the first inline extent followed by one or more non-inline
extents.  I don't know enough about the writing side of btrfs to know
if this is a bug in and of itself.  It _looks_ wrong to me.
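
The extent pattern above can be tested mechanically. The following is a minimal sketch, assuming the extent records have already been fetched with the FS_IOC_FIEMAP ioctl (the same interface filefrag uses). The helper name inline_then_regular() is hypothetical; only struct fiemap_extent and the FIEMAP_EXTENT_* flags come from <linux/fiemap.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <linux/fiemap.h>   /* struct fiemap_extent, FIEMAP_EXTENT_* flags */

/*
 * Hypothetical helper: returns true when the first extent is inline and
 * every following extent is a regular (non-inline) extent, which is the
 * layout shown in the filefrag output above.
 */
static bool inline_then_regular(const struct fiemap_extent *ext, size_t count)
{
	assert(ext != NULL || count == 0);

	/* A lone inline extent is the normal case for small files. */
	if (count < 2)
		return false;
	if (!(ext[0].fe_flags & FIEMAP_EXTENT_DATA_INLINE))
		return false;

	for (size_t i = 1; i < count; i++) {
		if (ext[i].fe_flags & FIEMAP_EXTENT_DATA_INLINE)
			return false;
	}
	return true;
}
```

A scanner built on this would only flag candidate files; whether a flagged file actually reads back wrong still has to be checked by comparing repeated reads.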

Once such an extent is created, the corruption is persistent but not
deterministic.  When I read the extent through btrfs, the file is
different most of the time:

# cp usr/share/icons/hicolor/icon-theme.cache /tmp/foo
# ls -l usr/share/icons/hicolor/icon-theme.cache /tmp/foo
-rw-r--r-- 1 root root 36456 Sep 20 11:41 /tmp/foo
-rw-r--r-- 1 root root 36456 Sep  6 11:52 usr/share/icons/hicolor/icon-theme.cache
# while sysctl vm.drop_caches=1; do cmp -l usr/share/icons/hicolor/icon-theme.cache /tmp/foo; done
vm.drop_caches = 1
vm.drop_caches = 1
 4093 213   0
 4094 177   0
vm.drop_caches = 1
 4093 216   0
 4094  33   0
 4095 173   0
 4096  15   0
vm.drop_caches = 1
 4093 352   0
 4094   3   0
 4095  37   0
 4096   2   0
vm.drop_caches = 1
 4093 243   0
 4094 372   0
 4095 154   0
 4096 221   0
vm.drop_caches = 1
 4093 333   0
 4094 170   0
 4095 356   0
 4096 213   0
vm.drop_caches = 1
 4093 170   0
 4094 155   0
 4095  62   0
 4096 233   0
vm.drop_caches = 1
 4093 263   0
 4094   6   0
 4095 363   0
 4096  44   0
vm.drop_caches = 1
 4093 237   0
 4094 330   0
 4095 217   0
 4096 206   0
^C

In other runs there can be 5 or more consecutive reads with no differences
detected.

I fetched the raw inline extent item for this file through the SEARCH_V2
ioctl and decoded it:

# head /tmp/bar
27 5e 06 00 00 00 00 00 [generation 417319]
fc 0f 00 00 00 00 00 00 [ram_bytes = 0xffc, compression = 1]
01 00 00 00 00 78 5e 9c [zlib data starts at "78 5e..."]
97 3d 74 14 55 14 c7 6f
60 77 b3 9f d9 20 20 08
28 11 22 a0 66 90 8f a0
a8 01 a2 80 80 a2 20 e6
28 20 42 26 bb 93 cd 30
b3 33 9b d9 99 24 62 d4
20 f8 51 58 58 50 58 58
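
For readers who want to repeat the decoding, here is a minimal sketch of how the first 21 bytes of the dump map onto the fixed part of a file extent item (generation, ram_bytes, compression, encryption, other_encoding, type, all little-endian). The struct and function names are illustrative and are not taken from btrfs-progs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes before the inline data: 8 + 8 + 1 + 1 + 2 + 1 */
enum { INLINE_HEADER_LEN = 21 };

/* Illustrative in-memory copy of the fixed header fields. */
struct inline_extent_header {
	uint64_t generation;
	uint64_t ram_bytes;
	uint8_t  compression;    /* 1 means zlib */
	uint8_t  encryption;
	uint16_t other_encoding;
	uint8_t  type;           /* 0 means inline */
};

/* Read an nbytes-wide little-endian integer. */
static uint64_t read_le(const uint8_t *p, size_t nbytes)
{
	uint64_t value = 0;

	for (size_t i = nbytes; i-- > 0; )
		value = (value << 8) | p[i];
	return value;
}

static struct inline_extent_header decode_inline_header(const uint8_t *buf,
							 size_t len)
{
	struct inline_extent_header h;

	assert(buf != NULL && len >= INLINE_HEADER_LEN);
	h.generation     = read_le(buf, 8);
	h.ram_bytes      = read_le(buf + 8, 8);
	h.compression    = buf[16];
	h.encryption     = buf[17];
	h.other_encoding = (uint16_t)read_le(buf + 18, 2);
	h.type           = buf[20];
	return h;
}
```

Fed the first three rows of the dump, this yields generation 417319, ram_bytes 0xffc and compression 1, and the byte at offset 21 is the 0x78 that starts the zlib stream.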

Notice ram_bytes is 0xffc, or 4092, but the inline extent's position in
the f

[PATCH v2 12/14] btrfs-progs: check: fix the return value bug of cmd_check()

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

The function cmd_check() is called by the main function in btrfs.c, and
its return value is passed to exit(). In some cases this loses
significant bits, for example when the value is greater than 0377.
Using a boolean value "err" to accumulate all of the return values
solves the problem.
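
As an illustration of the problem described above (not part of the patch), the sketch below models the truncation and the folding approach. Both helper names are hypothetical. The only outside fact relied on is that a waiting parent sees just the low 8 bits (status & 0377) of the value passed to exit().

```c
#include <assert.h>

/* Model of what a waiting parent sees of the value passed to exit(). */
static int visible_exit_status(int value)
{
	return value & 0377;
}

/* Fold any nonzero step result into a 0/1 flag, as "err |= !!ret" does. */
static int fold_error(int err, int ret)
{
	assert(err == 0 || err == 1);
	return err | !!ret;
}
```

With these helpers, a raw result of 256 is reported as exit status 0 (looks like success), while the folded value is reported as 1.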

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 47 ---
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 934a3dd..701fff5 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12337,6 +12337,7 @@ int cmd_check(int argc, char **argv)
u64 chunk_root_bytenr = 0;
char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
int ret;
+   int err = 0;
u64 num;
int init_csum_tree = 0;
int readonly = 0;
@@ -12470,10 +12471,12 @@ int cmd_check(int argc, char **argv)
 
if((ret = check_mounted(argv[optind])) < 0) {
error("could not check mount status: %s", strerror(-ret));
+   err |= !!ret;
goto err_out;
} else if(ret) {
error("%s is currently mounted, aborting", argv[optind]);
ret = -EBUSY;
+   err |= !!ret;
goto err_out;
}
 
@@ -12486,6 +12489,7 @@ int cmd_check(int argc, char **argv)
if (!info) {
error("cannot open file system");
ret = -EIO;
+   err |= !!ret;
goto err_out;
}
 
@@ -12500,9 +12504,11 @@ int cmd_check(int argc, char **argv)
ret = ask_user("repair mode will force to clear out log tree, are you sure?");
if (!ret) {
ret = 1;
+   err |= !!ret;
goto close_out;
}
ret = zero_log_tree(root);
+   err |= !!ret;
if (ret) {
error("failed to zero log tree: %d", ret);
goto close_out;
@@ -12514,6 +12520,7 @@ int cmd_check(int argc, char **argv)
printf("Print quota groups for %s\nUUID: %s\n", argv[optind],
   uuidbuf);
ret = qgroup_verify_all(info);
+   err |= !!ret;
if (ret == 0)
report_qgroups(1);
goto close_out;
@@ -12522,6 +12529,7 @@ int cmd_check(int argc, char **argv)
printf("Print extent state for subvolume %llu on %s\nUUID: %s\n",
   subvolid, argv[optind], uuidbuf);
ret = print_extent_state(info, subvolid);
+   err |= !!ret;
goto close_out;
}
printf("Checking filesystem on %s\nUUID: %s\n", argv[optind], uuidbuf);
@@ -12530,6 +12538,7 @@ int cmd_check(int argc, char **argv)
!extent_buffer_uptodate(info->dev_root->node) ||
!extent_buffer_uptodate(info->chunk_root->node)) {
error("critical roots corrupted, unable to check the filesystem");
+   err |= !!ret;
ret = -EIO;
goto close_out;
}
@@ -12541,12 +12550,14 @@ int cmd_check(int argc, char **argv)
if (IS_ERR(trans)) {
error("error starting transaction");
ret = PTR_ERR(trans);
+   err |= !!ret;
goto close_out;
}
 
if (init_extent_tree) {
printf("Creating a new extent tree\n");
ret = reinit_extent_tree(trans, info);
+   err |= !!ret;
if (ret)
goto close_out;
}
@@ -12558,11 +12569,13 @@ int cmd_check(int argc, char **argv)
error("checksum tree initialization failed: %d",
ret);
ret = -EIO;
+   err |= !!ret;
goto close_out;
}
 
ret = fill_csum_tree(trans, info->csum_root,
 init_extent_tree);
+   err |= !!ret;
if (ret) {
error("checksum tree refilling failed: %d", ret);
return -EIO;
@@ -12573,17 +12586,20 @@ int cmd_check(int argc, char **argv)
 * extent entries for all of the items it finds.
 */
ret = btrfs_commit_transaction(trans, info->extent_root);
+   err |= !!ret;
if (ret)
goto close_out;
}
if (!extent_buffer_uptodate(info->extent_root->node)) {
error("critical: extent_root, unable to check the filesystem");
  

[PATCH v2 04/14] btrfs-progs: check: introduce function to check inode_extref

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_inode_extref() to check INODE_EXTREF, and
call find_dir_item() to find the related DIR_ITEM/DIR_INDEX.

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 78 
 1 file changed, 78 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 90d5fbc..fec8b6e 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4064,6 +4064,84 @@ next:
return err;
 }
 
+/*
+ * Traverse the given INODE_EXTREF and call find_dir_item() to find related
+ * DIR_ITEM/DIR_INDEX.
+ *
+ * @root:  the root of the fs/file tree
+ * @ref_key:   the key of the INODE_EXTREF
+ * @refs:  the count of INODE_EXTREF
+ * @mode:  the st_mode of INODE_ITEM
+ *
+ * Return 0 if no error occurred.
+ */
+static int check_inode_extref(struct btrfs_root *root,
+ struct btrfs_key *ref_key,
+ struct extent_buffer *node, int slot, u64 *refs,
+ int mode)
+{
+   struct btrfs_key key;
+   struct btrfs_inode_extref *extref;
+   char namebuf[BTRFS_NAME_LEN] = {0};
+   u32 total;
+   u32 cur = 0;
+   u32 len;
+   u32 name_len;
+   u64 index;
+   u64 parent;
+   int ret;
+   int err = 0;
+
+   extref = btrfs_item_ptr(node, slot, struct btrfs_inode_extref);
+   total = btrfs_item_size_nr(node, slot);
+
+next:
+   /* update inode ref count */
+   (*refs)++;
+   name_len = btrfs_inode_extref_name_len(node, extref);
+   index = btrfs_inode_extref_index(node, extref);
+   parent = btrfs_inode_extref_parent(node, extref);
+   if (name_len <= BTRFS_NAME_LEN) {
+   len = name_len;
+   } else {
+   len = BTRFS_NAME_LEN;
+   warning("root %llu INODE_EXTREF[%llu %llu] name too long",
+   root->objectid, ref_key->objectid, ref_key->offset);
+   }
+   read_extent_buffer(node, namebuf, (unsigned long)(extref + 1), len);
+
+   /* Check root dir ref name */
+   if (index == 0 && strncmp(namebuf, "..", name_len)) {
+   error("root %llu INODE_EXTREF[%llu %llu] ROOT_DIR name shouldn't be %s",
+ root->objectid, ref_key->objectid, ref_key->offset,
+ namebuf);
+   err |= ROOT_DIR_ERROR;
+   }
+
+   /* find related dir_index */
+   key.objectid = parent;
+   key.type = BTRFS_DIR_INDEX_KEY;
+   key.offset = index;
+   ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode);
+   err |= ret;
+
+   /* find related dir_item */
+   key.objectid = parent;
+   key.type = BTRFS_DIR_ITEM_KEY;
+   key.offset = btrfs_name_hash(namebuf, len);
+   ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode);
+   err |= ret;
+
+   len = sizeof(*extref) + name_len;
+   extref = (struct btrfs_inode_extref *)((char *)extref + len);
+   cur += len;
+
+   if (cur < total)
+   goto next;
+
+   return err;
+}
+
 static int all_backpointers_checked(struct extent_record *rec, int print_errs)
 {
struct list_head *cur = rec->backrefs.next;
-- 
2.10.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 14/14] btrfs-progs: check: Enhance leaf traversal function to handle missing inode item

2016-09-20 Thread Qu Wenruo
The leaf traversal function in lowmem mode skips to the first inode
item of a leaf and begins checking from that inode.

That is designed to avoid checking the overlapping part of a leaf.

But it causes a problem in the fsck/010 test case, because there the inode
item of the first inode (256) is missing.
The traversal therefore starts checking from inode 257 and finds nothing
wrong.

The fix is done in 2 parts:
1) Manually check the first inode
   To avoid cases like fsck/010

2) Check the inode if ino changes from the first ino of a leaf
   To avoid the missing inode_item problem.
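
The second part of the fix can be sketched in isolation. This is a simplified model of the new skip loop in process_one_leaf_v2(), assuming a leaf is reduced to an array of (objectid, type) pairs. struct simple_key and first_item_to_check() are hypothetical names. The two key type values mirror BTRFS_INODE_ITEM_KEY and BTRFS_INODE_REF_KEY.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INODE_ITEM_KEY 1   /* same value as BTRFS_INODE_ITEM_KEY */
#define INODE_REF_KEY  12  /* same value as BTRFS_INODE_REF_KEY */

/* Hypothetical, stripped-down stand-in for struct btrfs_key. */
struct simple_key {
	uint64_t objectid;
	uint8_t  type;
};

/*
 * Return the index of the first item to check: either the first INODE_ITEM,
 * or the first item whose inode number differs from that of item 0.
 * Returns nritems when the whole leaf belongs to the leading inode.
 */
static size_t first_item_to_check(const struct simple_key *keys, size_t nritems)
{
	uint64_t first_ino = 0;
	size_t i;

	assert(keys != NULL || nritems == 0);
	for (i = 0; i < nritems; i++) {
		if (i == 0)
			first_ino = keys[i].objectid;
		if (keys[i].type == INODE_ITEM_KEY ||
		    (first_ino && first_ino != keys[i].objectid))
			break;
	}
	return i;
}
```

For a leaf holding two trailing items of inode 300, then an INODE_REF of inode 301 whose INODE_ITEM is missing, then the INODE_ITEM of inode 302, the old rule (stop only at an INODE_ITEM) would start at index 3 and never look at inode 301. The new rule starts at index 2.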

Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 46 --
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index d290a66..f5be153 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1875,6 +1875,7 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
struct btrfs_key key;
u64 cur_bytenr;
u32 nritems;
+   u64 first_ino = 0;
int root_level = btrfs_header_level(root->node);
int i;
int ret = 0; /* Final return value */
@@ -1882,11 +1883,14 @@ static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
 
cur_bytenr = cur->start;
 
-   /* skip to first inode item in this leaf */
+   /* skip to first inode item or the first inode number change */
nritems = btrfs_header_nritems(cur);
for (i = 0; i < nritems; i++) {
btrfs_item_key_to_cpu(cur, &key, i);
-   if (key.type == BTRFS_INODE_ITEM_KEY)
+   if (i == 0)
+   first_ino = key.objectid;
+   if (key.type == BTRFS_INODE_ITEM_KEY ||
+   (first_ino && first_ino != key.objectid))
break;
}
if (i == nritems) {
@@ -4951,6 +4955,34 @@ out:
return err;
 }
 
+static int check_fs_first_inode(struct btrfs_root *root, unsigned int ext_ref)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   int err = 0;
+   int ret;
+
+   path = btrfs_alloc_path();
+   key.objectid = 256;
+   key.type = BTRFS_INODE_ITEM_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret < 0)
+   return ret;
+   if (ret > 0) {
+   ret = 0;
+   err |= INODE_ITEM_MISSING;
+   }
+
+   err |= check_inode_item(root, path, ext_ref);
+   err &= ~LAST_ITEM;
+   if (err && !ret)
+   ret = -EIO;
+   btrfs_free_path(path);
+   return ret;
+}
+
 /*
  * Iterate all item on the tree and call check_inode_item() to check.
  *
@@ -4968,6 +5000,16 @@ static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref)
int ret, wret;
int level;
 
+   /*
+* We need to manually check the first inode item(256)
+* As the following traversal function will only start from
+* the first inode item in the leaf, if inode item(256) is missing
+* we will just skip it forever.
+*/
+   ret = check_fs_first_inode(root, ext_ref);
+   if (ret < 0)
+   return ret;
+
path = btrfs_alloc_path();
if (!path)
return -ENOMEM;
-- 
2.10.0





[PATCH v2 06/14] btrfs-progs: check: introduce a function to check dir_item

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_dir_item() to check DIR_ITEM/DIR_INDEX,
and call find_inode_ref() to find the related INODE_REF/INODE_EXTREF.

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 125 +++
 1 file changed, 125 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index a261821..d39a81c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3853,6 +3853,8 @@ out:
 #define DIR_ITEM_MISSING   (1<<2)  /* DIR_ITEM not found */
 #define DIR_ITEM_MISMATCH  (1<<3)  /* DIR_ITEM found but not match */
 #define INODE_REF_MISSING  (1<<4)  /* INODE_REF/INODE_EXTREF not found */
+#define INODE_ITEM_MISSING (1<<5)  /* INODE_ITEM not found */
+#define INODE_ITEM_MISMATCH(1<<6)  /* INODE_ITEM found but not match */
 
 /*
  * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified
@@ -4291,6 +4293,129 @@ out:
return ret;
 }
 
+/*
+ * Traverse the given DIR_ITEM/DIR_INDEX and check related INODE_ITEM and
+ * call find_inode_ref() to check related INODE_REF/INODE_EXTREF.
+ *
+ * @root:  the root of the fs/file tree
+ * @key:   the key of the INODE_REF/INODE_EXTREF
+ * @size:  the st_size of the INODE_ITEM
+ * @ext_ref:   the EXTENDED_IREF feature
+ *
+ * Return 0 if no error occurred.
+ */
+static int check_dir_item(struct btrfs_root *root, struct btrfs_key *key,
+ struct extent_buffer *node, int slot, u64 *size,
+ unsigned int ext_ref)
+{
+   struct btrfs_dir_item *di;
+   struct btrfs_inode_item *ii;
+   struct btrfs_path path;
+   struct btrfs_key location;
+   char namebuf[BTRFS_NAME_LEN] = {0};
+   u32 total;
+   u32 cur = 0;
+   u32 len;
+   u32 name_len;
+   u32 data_len;
+   u8 filetype;
+   u32 mode;
+   u64 index;
+   int ret;
+   int err = 0;
+
+   /*
+* For DIR_ITEM set index to (u64)-1, so that find_inode_ref
+* ignore index check.
+*/
+   index = (key->type == BTRFS_DIR_INDEX_KEY) ? key->offset : (u64)-1;
+
+   di = btrfs_item_ptr(node, slot, struct btrfs_dir_item);
+   total = btrfs_item_size_nr(node, slot);
+
+   while (cur < total) {
+   data_len = btrfs_dir_data_len(node, di);
+   if (data_len)
+   error("root %llu %s[%llu %llu] data_len shouldn't be %u",
+ root->objectid, key->type == BTRFS_DIR_ITEM_KEY ?
+ "DIR_ITEM" : "DIR_INDEX",
+ key->objectid, key->offset, data_len);
+
+   name_len = btrfs_dir_name_len(node, di);
+   if (name_len <= BTRFS_NAME_LEN) {
+   len = name_len;
+   } else {
+   len = BTRFS_NAME_LEN;
+   warning("root %llu %s[%llu %llu] name too long",
+   root->objectid,
+   key->type == BTRFS_DIR_ITEM_KEY ?
+   "DIR_ITEM" : "DIR_INDEX",
+   key->objectid, key->offset);
+   }
+   (*size) += name_len;
+
+   read_extent_buffer(node, namebuf, (unsigned long)(di + 1), len);
+   filetype = btrfs_dir_type(node, di);
+
+   btrfs_init_path(&path);
+   btrfs_dir_item_key_to_cpu(node, di, &location);
+
+   /* Ignore related ROOT_ITEM check */
+   if (location.type == BTRFS_ROOT_ITEM_KEY)
+   goto next;
+
+   /* Check relative INODE_ITEM(existence/filetype) */
+   ret = btrfs_search_slot(NULL, root, &location, &path, 0, 0);
+   if (ret) {
+   err |= INODE_ITEM_MISSING;
+   error("root %llu %s[%llu %llu] couldn't find relative INODE_ITEM[%llu] namelen %u filename %s filetype %x",
+ root->objectid, key->type == BTRFS_DIR_ITEM_KEY ?
+ "DIR_ITEM" : "DIR_INDEX", key->objectid,
+ key->offset, location.objectid, name_len,
+ namebuf, filetype);
+   goto next;
+   }
+
+   ii = btrfs_item_ptr(path.nodes[0], path.slots[0],
+   struct btrfs_inode_item);
+   mode = btrfs_inode_mode(path.nodes[0], ii);
+
+   if (imode_to_type(mode) != filetype) {
+   err |= INODE_ITEM_MISMATCH;
+   error("root %llu %s[%llu %llu] relative INODE_ITEM filetype mismatch namelen %u filename %s filetype %d",
+ root->objectid, key->type == BTRFS_DIR_ITEM_KEY ?
+ "DIR_ITEM" : "DIR_INDEX", key->objectid,
+ key->offset, name_len, namebuf, filetype);
+   }
+
+   /

[PATCH v2 03/14] btrfs-progs: check: introduce function to check inode_ref

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_inode_ref() to check INODE_REF,
and call find_dir_item() to find the related DIR_ITEM/DIR_INDEX.

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 76 
 1 file changed, 76 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 4e25804..90d5fbc 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -41,6 +41,7 @@
 #include "rbtree-utils.h"
 #include "backref.h"
 #include "ulist.h"
+#include "hash.h"
 
 enum task_position {
TASK_EXTENTS,
@@ -3988,6 +3989,81 @@ out:
return ret;
 }
 
+/*
+ * Traverse the given INODE_REF and call find_dir_item() to find related
+ * DIR_ITEM/DIR_INDEX.
+ *
+ * @root:  the root of the fs/file tree
+ * @ref_key:   the key of the INODE_REF
+ * @refs:  the count of INODE_REF
+ * @mode:  the st_mode of INODE_ITEM
+ *
+ * Return 0 if no error occurred.
+ */
+static int check_inode_ref(struct btrfs_root *root, struct btrfs_key *ref_key,
+  struct extent_buffer *node, int slot, u64 *refs,
+  int mode)
+{
+   struct btrfs_key key;
+   struct btrfs_inode_ref *ref;
+   char namebuf[BTRFS_NAME_LEN] = {0};
+   u32 total;
+   u32 cur = 0;
+   u32 len;
+   u32 name_len;
+   u64 index;
+   int ret, err = 0;
+
+   ref = btrfs_item_ptr(node, slot, struct btrfs_inode_ref);
+   total = btrfs_item_size_nr(node, slot);
+
+next:
+   /* Update inode ref count */
+   (*refs)++;
+
+   index = btrfs_inode_ref_index(node, ref);
+   name_len = btrfs_inode_ref_name_len(node, ref);
+   if (name_len <= BTRFS_NAME_LEN) {
+   len = name_len;
+   } else {
+   len = BTRFS_NAME_LEN;
+   warning("root %llu INODE_REF[%llu %llu] name too long",
+   root->objectid, ref_key->objectid, ref_key->offset);
+   }
+
+   read_extent_buffer(node, namebuf, (unsigned long)(ref + 1), len);
+
+   /* Check root dir ref name */
+   if (index == 0 && strncmp(namebuf, "..", name_len)) {
+   error("root %llu INODE_REF[%llu %llu] ROOT_DIR name shouldn't be %s",
+ root->objectid, ref_key->objectid, ref_key->offset,
+ namebuf);
+   err |= ROOT_DIR_ERROR;
+   }
+
+   /* Find related dir_index */
+   key.objectid = ref_key->offset;
+   key.type = BTRFS_DIR_INDEX_KEY;
+   key.offset = index;
+   ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode);
+   err |= ret;
+
+   /* Find related dir_item */
+   key.objectid = ref_key->offset;
+   key.type = BTRFS_DIR_ITEM_KEY;
+   key.offset = btrfs_name_hash(namebuf, len);
+   ret = find_dir_item(root, ref_key, &key, index, namebuf, len, mode);
+   err |= ret;
+
+   len = sizeof(*ref) + name_len;
+   ref = (struct btrfs_inode_ref *)((char *)ref + len);
+   cur += len;
+   if (cur < total)
+   goto next;
+
+   return err;
+}
+
 static int all_backpointers_checked(struct extent_record *rec, int print_errs)
 {
struct list_head *cur = rec->backrefs.next;
-- 
2.10.0





[PATCH v2 05/14] btrfs-progs: check: introduce function to find inode_ref

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function find_inode_ref() to find the
INODE_REF/INODE_EXTREF for the given key, and check that it matches
the specified DIR_ITEM/DIR_INDEX.

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 149 +++
 1 file changed, 149 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index fec8b6e..a261821 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3852,6 +3852,7 @@ out:
 #define ROOT_DIR_ERROR (1<<1)  /* bad root_dir */
 #define DIR_ITEM_MISSING   (1<<2)  /* DIR_ITEM not found */
 #define DIR_ITEM_MISMATCH  (1<<3)  /* DIR_ITEM found but not match */
+#define INODE_REF_MISSING  (1<<4)  /* INODE_REF/INODE_EXTREF not found */
 
 /*
  * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified
@@ -4142,6 +4143,154 @@ next:
return err;
 }
 
+/*
+ * Find INODE_REF/INODE_EXTREF for the given key and check it with the specified
+ * DIR_ITEM/DIR_INDEX match.
+ *
+ * @root:  the root of the fs/file tree
+ * @key:   the key of the INODE_REF/INODE_EXTREF
+ * @name:  the name in the INODE_REF/INODE_EXTREF
+ * @namelen:   the length of name in the INODE_REF/INODE_EXTREF
+ * @index: the index in the INODE_REF/INODE_EXTREF, for DIR_ITEM set index
+ * to (u64)-1
+ * @ext_ref:   the EXTENDED_IREF feature
+ *
+ * Return 0 if no error occurred.
+ * Return >0 for error bitmap
+ */
+static int find_inode_ref(struct btrfs_root *root, struct btrfs_key *key,
+ char *name, int namelen, u64 index,
+ unsigned int ext_ref)
+{
+   struct btrfs_path path;
+   struct btrfs_inode_ref *ref;
+   struct btrfs_inode_extref *extref;
+   struct extent_buffer *node;
+   char ref_namebuf[BTRFS_NAME_LEN] = {0};
+   u32 total;
+   u32 cur = 0;
+   u32 len;
+   u32 ref_namelen;
+   u64 ref_index;
+   u64 parent;
+   u64 dir_id;
+   int slot;
+   int ret;
+
+   btrfs_init_path(&path);
+   ret = btrfs_search_slot(NULL, root, key, &path, 0, 0);
+   if (ret) {
+   ret = INODE_REF_MISSING;
+   goto extref;
+   }
+
+   node = path.nodes[0];
+   slot = path.slots[0];
+
+   ref = btrfs_item_ptr(node, slot, struct btrfs_inode_ref);
+   total = btrfs_item_size_nr(node, slot);
+
+   /* Iterate all entry of INODE_REF */
+   while (cur < total) {
+   ret = INODE_REF_MISSING;
+
+   ref_namelen = btrfs_inode_ref_name_len(node, ref);
+   ref_index = btrfs_inode_ref_index(node, ref);
+   if (index != (u64)-1 && index != ref_index)
+   goto next_ref;
+
+   if (ref_namelen <= BTRFS_NAME_LEN) {
+   len = ref_namelen;
+   } else {
+   len = BTRFS_NAME_LEN;
+   warning("root %llu INODE %s[%llu %llu] name too long",
+   root->objectid,
+   key->type == BTRFS_INODE_REF_KEY ?
+   "REF" : "EXTREF",
+   key->objectid, key->offset);
+   }
+   read_extent_buffer(node, ref_namebuf, (unsigned long)(ref + 1),
+  len);
+
+   if (len != namelen || strncmp(ref_namebuf, name, len))
+   goto next_ref;
+
+   ret = 0;
+   goto out;
+next_ref:
+   len = sizeof(*ref) + ref_namelen;
+   ref = (struct btrfs_inode_ref *)((char *)ref + len);
+   cur += len;
+   }
+
+extref:
+   /* Skip if not support EXTENDED_IREF feature */
+   if (!ext_ref)
+   goto out;
+
+   btrfs_release_path(&path);
+   btrfs_init_path(&path);
+
+   dir_id = key->offset;
+   key->type = BTRFS_INODE_EXTREF_KEY;
+   key->offset = btrfs_extref_hash(dir_id, name, namelen);
+
+   ret = btrfs_search_slot(NULL, root, key, &path, 0, 0);
+   if (ret) {
+   ret = INODE_REF_MISSING;
+   goto out;
+   }
+
+   node = path.nodes[0];
+   slot = path.slots[0];
+
+   extref = btrfs_item_ptr(node, slot, struct btrfs_inode_extref);
+   cur = 0;
+   total = btrfs_item_size_nr(node, slot);
+
+   /* Iterate all entry of INODE_EXTREF */
+   while (cur < total) {
+   ret = INODE_REF_MISSING;
+
+   ref_namelen = btrfs_inode_extref_name_len(node, extref);
+   ref_index = btrfs_inode_extref_index(node, extref);
+   parent = btrfs_inode_extref_parent(node, extref);
+   if (index != (u64)-1 && index != ref_index)
+   goto next_extref;
+
+   if (parent != dir_id)
+   goto next_extref;
+
+   if (ref_namelen <= BTRFS_NAME_LEN) {
+   len = ref_namelen;
+  

[PATCH v2 02/14] btrfs-progs: check: introduce function to find dir_item

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function find_dir_item() to find the DIR_ITEM/DIR_INDEX
for the given key, and check that it matches the specified
INODE_REF/INODE_EXTREF.
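
A minimal sketch of how the error bits returned by find_dir_item() are meant to be combined by a caller. The three flag values are copied from the patch below. merge_check_results() is a hypothetical helper that mirrors the "err |= ret" lines the later patches in this series use.

```c
#include <assert.h>

#define ROOT_DIR_ERROR     (1<<1)  /* bad root_dir */
#define DIR_ITEM_MISSING   (1<<2)  /* DIR_ITEM not found */
#define DIR_ITEM_MISMATCH  (1<<3)  /* DIR_ITEM found but not match */

/*
 * Hypothetical helper: OR together the results of the DIR_INDEX lookup and
 * the DIR_ITEM lookup so that both kinds of failure remain visible.
 */
static int merge_check_results(int dir_index_result, int dir_item_result)
{
	int err = 0;

	assert(dir_index_result >= 0 && dir_item_result >= 0);
	err |= dir_index_result;
	err |= dir_item_result;
	return err;
}
```

Because each condition has its own bit, a caller can report a missing DIR_INDEX and a mismatched DIR_ITEM for the same name without one result overwriting the other.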

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 140 +++
 1 file changed, 140 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 998ba63..4e25804 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3848,6 +3848,146 @@ out:
return err;
 }
 
+#define ROOT_DIR_ERROR (1<<1)  /* bad root_dir */
+#define DIR_ITEM_MISSING   (1<<2)  /* DIR_ITEM not found */
+#define DIR_ITEM_MISMATCH  (1<<3)  /* DIR_ITEM found but not match */
+
+/*
+ * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified
+ * INODE_REF/INODE_EXTREF match.
+ *
+ * @root:  the root of the fs/file tree
+ * @ref_key:   the key of the INODE_REF/INODE_EXTREF
+ * @key:   the key of the DIR_ITEM/DIR_INDEX
+ * @index: the index in the INODE_REF/INODE_EXTREF, be used to
+ * distinguish root_dir between normal dir/file
+ * @name:  the name in the INODE_REF/INODE_EXTREF
+ * @namelen:   the length of name in the INODE_REF/INODE_EXTREF
+ * @mode:  the st_mode of INODE_ITEM
+ *
+ * Return 0 if no error occurred.
+ * Return ROOT_DIR_ERROR if found DIR_ITEM/DIR_INDEX for root_dir.
+ * Return DIR_ITEM_MISSING if couldn't find DIR_ITEM/DIR_INDEX for normal
+ * dir/file.
+ * Return DIR_ITEM_MISMATCH if INODE_REF/INODE_EXTREF and DIR_ITEM/DIR_INDEX
+ * not match for normal dir/file.
+ */
+static int find_dir_item(struct btrfs_root *root, struct btrfs_key *ref_key,
+struct btrfs_key *key, u64 index, char *name,
+u32 namelen, u32 mode)
+{
+   struct btrfs_path path;
+   struct extent_buffer *node;
+   struct btrfs_dir_item *di;
+   struct btrfs_key location;
+   char namebuf[BTRFS_NAME_LEN] = {0};
+   u32 total;
+   u32 cur = 0;
+   u32 len;
+   u32 name_len;
+   u32 data_len;
+   u8 filetype;
+   int slot;
+   int ret;
+
+   btrfs_init_path(&path);
+   ret = btrfs_search_slot(NULL, root, key, &path, 0, 0);
+   if (ret < 0) {
+   ret = DIR_ITEM_MISSING;
+   goto out;
+   }
+
+   /* Process root dir and goto out*/
+   if (index == 0) {
+   if (ret == 0) {
+   ret = ROOT_DIR_ERROR;
+   error(
+   "root %llu INODE %s[%llu %llu] ROOT_DIR shouldn't have %s",
+   root->objectid,
+   ref_key->type == BTRFS_INODE_REF_KEY ?
+   "REF" : "EXTREF",
+   ref_key->objectid, ref_key->offset,
+   key->type == BTRFS_DIR_ITEM_KEY ?
+   "DIR_ITEM" : "DIR_INDEX");
+   } else {
+   ret = 0;
+   }
+
+   goto out;
+   }
+
+   /* Process normal file/dir */
+   if (ret > 0) {
+   ret = DIR_ITEM_MISSING;
+   error(
+   "root %llu INODE %s[%llu %llu] doesn't have related %s[%llu %llu] namelen %u filename %s filetype %d",
+   root->objectid,
+   ref_key->type == BTRFS_INODE_REF_KEY ? "REF" : "EXTREF",
+   ref_key->objectid, ref_key->offset,
+   key->type == BTRFS_DIR_ITEM_KEY ?
+   "DIR_ITEM" : "DIR_INDEX",
+   key->objectid, key->offset, namelen, name,
+   imode_to_type(mode));
+   goto out;
+   }
+
+   /* Check whether inode_id/filetype/name match */
+   node = path.nodes[0];
+   slot = path.slots[0];
+   di = btrfs_item_ptr(node, slot, struct btrfs_dir_item);
+   total = btrfs_item_size_nr(node, slot);
+   while (cur < total) {
+   ret = DIR_ITEM_MISMATCH;
+   name_len = btrfs_dir_name_len(node, di);
+   data_len = btrfs_dir_data_len(node, di);
+
+   btrfs_dir_item_key_to_cpu(node, di, &location);
+   if (location.objectid != ref_key->objectid ||
+   location.type !=  BTRFS_INODE_ITEM_KEY ||
+   location.offset != 0)
+   goto next;
+
+   filetype = btrfs_dir_type(node, di);
+   if (imode_to_type(mode) != filetype)
+   goto next;
+
+   if (name_len <= BTRFS_NAME_LEN) {
+   len = name_len;
+   } else {
+   len = BTRFS_NAME_LEN;
+   fprintf(stderr,
+   "Warning: root %llu %s[%llu %llu] name too long\n",
+   root->objectid,
+   key->type == BTRFS_DIR_ITEM_KEY ?
+   "DIR_ITEM" : "DIR_IN

[PATCH v2 00/14] Btrfsck low memory mode with fs/subvol tree check

2016-09-20 Thread Qu Wenruo
The branch can be fetched from my github:
https://github.com/adam900710/btrfs-progs/tree/lowmem_fs_tree

The already-merged lowmem mode fsck only works for the extent/chunk tree
check. For the fs tree, it still uses the original mode code.

This makes btrfs check still eat tons of memory on large filesystems.

The new lowmem mode code now also covers the fs tree, making lowmem
mode a truly low-memory mode.

The whole patchset passes all of the fsck test cases, except the
following:

006: There is a bug in the root item repair code, causing a backref error.
     However, old fsck has another bug that overwrites the extent tree error,
     so old fsck only reports the error but still returns 0.

     That's an unrelated btrfsck repair bug, which I'll address later.

015: Just a wrong test case. It's not a normal check-repair-check one,
     so the check after repair still reports an error.
     Better to move it to the fuzz test cases.

Further plans for lowmem mode:
1) Add support for --repair
   A lot of work again.

2) Separate the original and lowmem mode code into different files
   A 300+K single source file is really too large.
   Better to split it into a directory with multiple files.

3) Avoid using find_all_parents() in the traversal function
   In lowmem mode, we use find_all_parents() to ensure that only the
   root with the smallest objectid checks the leaf, so we can
   save some IO.

   However, find_all_parents() is still quite time consuming, so
   we'd better avoid calling it.

Lu Fengqi (12):
  btrfs-progs: move btrfs_extref_hash() to hash.h
  btrfs-progs: check: introduce function to find dir_item
  btrfs-progs: check: introduce function to check inode_ref
  btrfs-progs: check: introduce function to check inode_extref
  btrfs-progs: check: introduce function to find inode_ref
  btrfs-progs: check: introduce a function to check dir_item
  btrfs-progs: check: introduce function to check file extent
  btrfs-progs: check: introduce function to check inode item
  btrfs-progs: check: introduce function to check fs root
  btrfs-progs: check: introduce function to check root ref
  btrfs-progs: check: introduce low_memory mode fs_tree check
  btrfs-progs: check: fix the return value bug of cmd_check()

Qu Wenruo (1):
  btrfs-progs: check: Enhance leaf traversal function to handle missing
inode item

Wang Xiaoguang (1):
  btrfs-progs: check: skip shared node or leaf check for low_memory mode

 cmds-check.c | 1763 --
 hash.h   |   10 +
 inode-item.c |8 +-
 3 files changed, 1600 insertions(+), 181 deletions(-)

-- 
2.10.0



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 01/14] btrfs-progs: move btrfs_extref_hash() to hash.h

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Move btrfs_extref_hash() from inode-item.c to hash.h,
so that the function can be called elsewhere.
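
For readers unfamiliar with the helper, here is a minimal, self-contained
sketch of what a seeded CRC-32C name hash looks like. It is an illustration
only: crc32c_raw() and extref_hash_sketch() are hypothetical names, the
bitwise loop is a slow reference form rather than the code in btrfs-progs,
and it assumes the real helper applies no initial or final bit inversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32C (Castagnoli, reflected polynomial 0x82F63B78).
 * No pre- or post-inversion is applied; the caller chooses the seed.
 */
static uint32_t crc32c_raw(uint32_t seed, const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t crc = seed;

	assert(data != NULL || len == 0);

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/*
 * Hypothetical stand-in for the moved helper: the parent objectid
 * (truncated to 32 bits) is the seed, the name is the hashed data.
 */
static uint64_t extref_hash_sketch(uint64_t parent_objectid,
				   const char *name, size_t len)
{
	return (uint64_t)crc32c_raw((uint32_t)parent_objectid, name, len);
}
```

Because the seed is the parent objectid, the same name under two different
parent directories hashes to different key offsets.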

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 hash.h   | 10 ++
 inode-item.c |  8 +---
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/hash.h b/hash.h
index ac4c411..9ff6761 100644
--- a/hash.h
+++ b/hash.h
@@ -25,4 +25,14 @@ static inline u64 btrfs_name_hash(const char *name, int len)
 {
return crc32c((u32)~1, name, len);
 }
+
+/*
+ * Figure the key offset of an extended inode ref
+ */
+static inline u64 btrfs_extref_hash(u64 parent_objectid, const char *name,
+   int len)
+{
+   return (u64)btrfs_crc32c(parent_objectid, name, len);
+}
+
 #endif
diff --git a/inode-item.c b/inode-item.c
index 913b81a..f7b6ead 100644
--- a/inode-item.c
+++ b/inode-item.c
@@ -19,7 +19,7 @@
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
-#include "crc32c.h"
+#include "hash.h"
 
 static int find_name_in_backref(struct btrfs_path *path, const char * name,
 int name_len, struct btrfs_inode_ref **ref_ret)
@@ -184,12 +184,6 @@ out:
return ret_inode_ref;
 }
 
-static inline u64 btrfs_extref_hash(u64 parent_ino, const char *name,
-   int namelen)
-{
-   return (u64)btrfs_crc32c(parent_ino, name, namelen);
-}
-
 static int btrfs_find_name_in_ext_backref(struct btrfs_path *path,
u64 parent_ino, const char *name, int namelen,
struct btrfs_inode_extref **extref_ret)
-- 
2.10.0





[PATCH v2 13/14] btrfs-progs: check: skip shared node or leaf check for low_memory mode

2016-09-20 Thread Qu Wenruo
From: Wang Xiaoguang 

The basic idea is simple. Assume a middle tree node A is shared and
the fs/file tree roots referencing it have ids 5, 258 and 260. Then we
only check node A in the tree that has the smallest root id. In this
case, that means we check node A when checking root tree 5, and for root
trees 258 and 260 we can just skip it.

Note that even with this patch, we may still visit a shared node or leaf
multiple times. This happens when the metadata of an inode occupies
multiple leaves.

 leaf_A leaf_B
When checking the inode item in leaf_A, assume inode[512] has file extents
in leaf_B, and leaf_B is shared. In that case, for inode[512], we must
visit leaf_B to complete the inode item check. After finishing the
inode[512] check, we walk down from the tree root to leaf_B to check
whether any node or leaf is shared. If some node or leaf is shared, we
can skip checking it and everything below it.

I also filled a disk partition with Linux source code and created 3
snapshots in it. Before this patch, one btrfsck execution took 46s on
average; with this patch, it took 15s on average.
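
The "smallest root id checks the shared node" rule can be sketched as
follows. This is a minimal illustration, not code from the patch:
should_check_shared_node() is a hypothetical name, and the root_ids array
stands in for whatever lookup yields the roots that reference a node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if cur_root is the smallest root id among all roots that
 * reference a shared node (so cur_root should check it), otherwise 0.
 */
static int should_check_shared_node(uint64_t cur_root,
				    const uint64_t *root_ids, size_t nr)
{
	assert(root_ids != NULL && nr > 0);

	for (size_t i = 0; i < nr; i++) {
		if (root_ids[i] < cur_root)
			return 0;	/* a smaller root will check it */
	}
	return 1;
}
```

With referencing roots 5, 258 and 260, only the walk of root 5 returns 1;
the walks of 258 and 260 skip the node and everything below it.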

Signed-off-by: Wang Xiaoguang 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 390 ---
 1 file changed, 321 insertions(+), 69 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 701fff5..d290a66 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -113,6 +113,24 @@ struct data_backref {
u32 found_ref;
 };
 
+#define ROOT_DIR_ERROR      (1<<1)  /* bad root_dir */
+#define DIR_ITEM_MISSING    (1<<2)  /* DIR_ITEM not found */
+#define DIR_ITEM_MISMATCH   (1<<3)  /* DIR_ITEM found but not match */
+#define INODE_REF_MISSING   (1<<4)  /* INODE_REF/INODE_EXTREF not found */
+#define INODE_ITEM_MISSING  (1<<5)  /* INODE_ITEM not found */
+#define INODE_ITEM_MISMATCH (1<<6)  /* INODE_ITEM found but not match */
+#define FILE_EXTENT_ERROR   (1<<7)  /* bad file extent */
+#define ODD_CSUM_ITEM       (1<<8)  /* CSUM_ITEM ERROR */
+#define CSUM_ITEM_MISSING   (1<<9)  /* CSUM_ITEM not found */
+#define LINK_COUNT_ERROR    (1<<10) /* INODE_ITEM nlink count error */
+#define NBYTES_ERROR        (1<<11) /* INODE_ITEM nbytes count error */
+#define ISIZE_ERROR         (1<<12) /* INODE_ITEM size count error */
+#define ORPHAN_ITEM         (1<<13) /* INODE_ITEM no reference */
+#define NO_INODE_ITEM       (1<<14) /* no inode_item */
+#define LAST_ITEM           (1<<15) /* Complete this tree traversal */
+#define ROOT_REF_MISSING    (1<<16) /* ROOT_REF not found */
+#define ROOT_REF_MISMATCH   (1<<17) /* ROOT_REF found but not match */
+
 static inline struct data_backref* to_data_backref(struct extent_backref *back)
 {
return container_of(back, struct data_backref, node);
@@ -1839,6 +1857,92 @@ static int process_one_leaf(struct btrfs_root *root, struct extent_buffer *eb,
return ret;
 }
 
+struct node_refs {
+   u64 bytenr[BTRFS_MAX_LEVEL];
+   u64 refs[BTRFS_MAX_LEVEL];
+   int need_check[BTRFS_MAX_LEVEL];
+};
+
+static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
+struct node_refs *nrefs, u64 level);
+static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
+   unsigned int ext_ref);
+
+static int process_one_leaf_v2(struct btrfs_root *root, struct btrfs_path *path,
+  struct node_refs *nrefs, int *level, int ext_ref)
+{
+   struct extent_buffer *cur = path->nodes[0];
+   struct btrfs_key key;
+   u64 cur_bytenr;
+   u32 nritems;
+   int root_level = btrfs_header_level(root->node);
+   int i;
+   int ret = 0; /* Final return value */
+   int err = 0; /* Positive error bitmap */
+
+   cur_bytenr = cur->start;
+
+   /* skip to first inode item in this leaf */
+   nritems = btrfs_header_nritems(cur);
+   for (i = 0; i < nritems; i++) {
+   btrfs_item_key_to_cpu(cur, &key, i);
+   if (key.type == BTRFS_INODE_ITEM_KEY)
+   break;
+   }
+   if (i == nritems) {
+   path->slots[0] = nritems;
+   return 0;
+   }
+   path->slots[0] = i;
+
+again:
+   err |= check_inode_item(root, path, ext_ref);
+
+   if (err & LAST_ITEM)
+   goto out;
+
+   /* still have inode items in this leaf */
+   if (cur->start == cur_bytenr)
+   goto again;
+
+   /*
+* we have switched to another leaf, above nodes may
+* have changed, here walk down the path, if a node
+* or leaf is shared, check whether we can skip this
+* node or leaf.
+*/
+   for (i = root_level; i >= 0; i--) {
+   if (path->nodes[i]->start == nrefs->bytenr[i])
+   continue;
+
+   ret = update_nodes_refs(root,
+   path->nodes[i]->start,
+  

[PATCH v2 11/14] btrfs-progs: check: introduce low_memory mode fs_tree check

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_fs_roots_v2() to check fs_tree in
low_memory mode. It calls check_fs_root_v2() to check fs_root, and calls
check_root_ref() to check root_ref.

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 106 +++
 1 file changed, 100 insertions(+), 6 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 7593013..934a3dd 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -4774,8 +4774,8 @@ static int check_root_ref(struct btrfs_root *root, struct btrfs_key *ref_key,
struct btrfs_key key;
struct btrfs_root_ref *ref;
struct btrfs_root_ref *backref;
-   char ref_name[BTRFS_NAME_LEN];
-   char backref_name[BTRFS_NAME_LEN];
+   char ref_name[BTRFS_NAME_LEN] = {0};
+   char backref_name[BTRFS_NAME_LEN] = {0};
u64 ref_dirid;
u64 ref_seq;
u32 ref_namelen;
@@ -4850,6 +4850,94 @@ out:
return err;
 }
 
+/*
+ * Check all fs/file tree in low_memory mode.
+ *
+ * 1. for fs tree root item, call check_fs_root_v2()
+ * 2. for fs tree root ref/backref, call check_root_ref()
+ *
+ * Return 0 if no error occurred.
+ */
+static int check_fs_roots_v2(struct btrfs_fs_info *fs_info)
+{
+   struct btrfs_root *tree_root = fs_info->tree_root;
+   struct btrfs_root *cur_root = NULL;
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct extent_buffer *node;
+   unsigned int ext_ref;
+   int slot;
+   int ret;
+   int err = 0;
+
+   ext_ref = btrfs_fs_incompat(fs_info,
+   BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF);
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = BTRFS_FS_TREE_OBJECTID;
+   key.offset = 0;
+   key.type = BTRFS_ROOT_ITEM_KEY;
+
+   ret = btrfs_search_slot(NULL, tree_root, &key, path, 0, 0);
+   if (ret < 0) {
+   err = ret;
+   goto out;
+   } else if (ret > 0) {
+   err = -ENOENT;
+   goto out;
+   }
+
+   while (1) {
+   node = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(node, &key, slot);
+   if (key.objectid > BTRFS_LAST_FREE_OBJECTID)
+   goto out;
+   if (key.type == BTRFS_ROOT_ITEM_KEY &&
+   fs_root_objectid(key.objectid)) {
+   if (key.objectid == BTRFS_TREE_RELOC_OBJECTID) {
+   cur_root = btrfs_read_fs_root_no_cache(fs_info,
+  &key);
+   } else {
+   key.offset = (u64)-1;
+   cur_root = btrfs_read_fs_root(fs_info, &key);
+   }
+
+   if (IS_ERR(cur_root)) {
+   error("Fail to read fs/subvol tree: %lld",
+ key.objectid);
+   err = -EIO;
+   goto next;
+   }
+
+   ret = check_fs_root_v2(cur_root, ext_ref);
+   err |= ret;
+
+   if (key.objectid == BTRFS_TREE_RELOC_OBJECTID)
+   btrfs_free_fs_root(cur_root);
+   } else if (key.type == BTRFS_ROOT_REF_KEY ||
+   key.type == BTRFS_ROOT_BACKREF_KEY) {
+   ret = check_root_ref(tree_root, &key, node, slot);
+   err |= ret;
+   }
+next:
+   ret = btrfs_next_item(tree_root, path);
+   if (ret > 0)
+   goto out;
+   if (ret < 0) {
+   err = ret;
+   goto out;
+   }
+   }
+
+out:
+   btrfs_free_path(path);
+   return err;
+}
+
 static int all_backpointers_checked(struct extent_record *rec, int print_errs)
 {
struct list_head *cur = rec->backrefs.next;
@@ -12544,7 +12632,10 @@ int cmd_check(int argc, char **argv)
 BTRFS_FEATURE_INCOMPAT_NO_HOLES);
if (!ctx.progress_enabled)
fprintf(stderr, "checking fs roots\n");
-   ret = check_fs_roots(root, &root_cache);
+   if (check_mode == CHECK_MODE_LOWMEM)
+   ret = check_fs_roots_v2(root->fs_info);
+   else
+   ret = check_fs_roots(root, &root_cache);
if (ret)
goto out;
 
@@ -12554,9 +12645,12 @@ int cmd_check(int argc, char **argv)
goto out;
 
fprintf(stderr, "checking root refs\n");
-   ret = check_root_refs(root, &root_cache);
-   if (ret)
-   goto out;
+   /* For low memory mode, check_fs_roots_v2 handles root refs */
+   if (check_mode != CHECK_MODE_LOWMEM) {
+

[PATCH v2 10/14] btrfs-progs: check: introduce function to check root ref

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_root_ref() to check
root_ref/root_backref.
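
A ROOT_REF and its ROOT_BACKREF mirror each other: objectid and offset are
swapped and the key type is flipped. The sketch below shows that mapping,
including the "sum minus current type" trick for flipping between two
values. The struct, the function name and the SKETCH_ constants are
hypothetical; the numeric values are assumed to match the btrfs key type
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative key types; values assumed to match btrfs. */
#define SKETCH_ROOT_BACKREF_KEY 144
#define SKETCH_ROOT_REF_KEY     156

struct sketch_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Build the key of the counterpart item for a ROOT_REF or ROOT_BACKREF. */
static struct sketch_key relative_ref_key(const struct sketch_key *ref)
{
	struct sketch_key rel;

	assert(ref->type == SKETCH_ROOT_REF_KEY ||
	       ref->type == SKETCH_ROOT_BACKREF_KEY);

	rel.objectid = ref->offset;
	/* a + b - x yields b when x == a, and a when x == b */
	rel.type = (uint8_t)(SKETCH_ROOT_BACKREF_KEY + SKETCH_ROOT_REF_KEY -
			     ref->type);
	rel.offset = ref->objectid;
	return rel;
}
```

Applying the function twice returns the original key, which is why one
function can handle both directions of the check.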

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 93 
 1 file changed, 93 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 6d3c6a8..7593013 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3864,6 +3864,8 @@ out:
 #define ORPHAN_ITEM (1<<13) /* INODE_ITEM no reference */
 #define NO_INODE_ITEM  (1<<14) /* no inode_item */
 #define LAST_ITEM  (1<<15) /* Complete this tree traversal */
+#define ROOT_REF_MISSING   (1<<16) /* ROOT_REF not found */
+#define ROOT_REF_MISMATCH  (1<<17) /* ROOT_REF found but not match */
 
 /*
  * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified
@@ -4757,6 +4759,97 @@ out:
return ret;
 }
 
+/*
+ * Find the relative ref for root_ref and root_backref.
+ *
+ * @root:  the root of the root tree.
+ * @ref_key:   the key of the root ref.
+ *
+ * Return 0 if no error occurred.
+ */
+static int check_root_ref(struct btrfs_root *root, struct btrfs_key *ref_key,
+ struct extent_buffer *node, int slot)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_root_ref *ref;
+   struct btrfs_root_ref *backref;
+   char ref_name[BTRFS_NAME_LEN];
+   char backref_name[BTRFS_NAME_LEN];
+   u64 ref_dirid;
+   u64 ref_seq;
+   u32 ref_namelen;
+   u64 backref_dirid;
+   u64 backref_seq;
+   u32 backref_namelen;
+   u32 len;
+   int ret;
+   int err = 0;
+
+   ref = btrfs_item_ptr(node, slot, struct btrfs_root_ref);
+   ref_dirid = btrfs_root_ref_dirid(node, ref);
+   ref_seq = btrfs_root_ref_sequence(node, ref);
+   ref_namelen = btrfs_root_ref_name_len(node, ref);
+
+   if (ref_namelen <= BTRFS_NAME_LEN) {
+   len = ref_namelen;
+   } else {
+   len = BTRFS_NAME_LEN;
+   warning("%s[%llu %llu] ref_name too long",
+   ref_key->type == BTRFS_ROOT_REF_KEY ?
+   "ROOT_REF" : "ROOT_BACKREF", ref_key->objectid,
+   ref_key->offset);
+   }
+   read_extent_buffer(node, ref_name, (unsigned long)(ref + 1), len);
+
+   /* Find relative root_ref */
+   key.objectid = ref_key->offset;
+   key.type = BTRFS_ROOT_BACKREF_KEY + BTRFS_ROOT_REF_KEY - ref_key->type;
+   key.offset = ref_key->objectid;
+
+   path = btrfs_alloc_path();
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret) {
+   err |= ROOT_REF_MISSING;
+   error("%s[%llu %llu] couldn't find relative ref",
+ ref_key->type == BTRFS_ROOT_REF_KEY ?
+ "ROOT_REF" : "ROOT_BACKREF",
+ ref_key->objectid, ref_key->offset);
+   goto out;
+   }
+
+   backref = btrfs_item_ptr(path->nodes[0], path->slots[0],
+struct btrfs_root_ref);
+   backref_dirid = btrfs_root_ref_dirid(path->nodes[0], backref);
+   backref_seq = btrfs_root_ref_sequence(path->nodes[0], backref);
+   backref_namelen = btrfs_root_ref_name_len(path->nodes[0], backref);
+
+   if (backref_namelen <= BTRFS_NAME_LEN) {
+   len = backref_namelen;
+   } else {
+   len = BTRFS_NAME_LEN;
+   warning("%s[%llu %llu] ref_name too long",
+   key.type == BTRFS_ROOT_REF_KEY ?
+   "ROOT_REF" : "ROOT_BACKREF",
+   key.objectid, key.offset);
+   }
+   read_extent_buffer(path->nodes[0], backref_name,
+  (unsigned long)(backref + 1), len);
+
+   if (ref_dirid != backref_dirid || ref_seq != backref_seq ||
+   ref_namelen != backref_namelen ||
+   strncmp(ref_name, backref_name, len)) {
+   err |= ROOT_REF_MISMATCH;
+   error("%s[%llu %llu] mismatch relative ref",
+ ref_key->type == BTRFS_ROOT_REF_KEY ?
+ "ROOT_REF" : "ROOT_BACKREF",
+ ref_key->objectid, ref_key->offset);
+   }
+out:
+   btrfs_free_path(path);
+   return err;
+}
+
 static int all_backpointers_checked(struct extent_record *rec, int print_errs)
 {
struct list_head *cur = rec->backrefs.next;
-- 
2.10.0





[PATCH v2 09/14] btrfs-progs: check: introduce function to check fs root

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_fs_root_v2() to check fs root,
and call check_inode_item to check the items in the tree.
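
The return convention (a positive error bitmap internally, a negative errno
externally, with the end-of-traversal bit not counted as an error) can be
expressed as a small helper. This is one possible way to write the
convention, not the code in the patch: fold_check_result() and
SKETCH_LAST_ITEM are hypothetical names.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_LAST_ITEM (1 << 15)	/* traversal finished, not an error */

/*
 * ret:        last result of a tree operation (<0 is a real errno)
 * err_bitmap: accumulated positive error bits from the item checks
 *
 * Returns 0 if nothing went wrong, otherwise a negative errno.
 */
static int fold_check_result(int ret, int err_bitmap)
{
	assert(err_bitmap >= 0);

	if (ret < 0)
		return ret;		/* keep the more specific errno */

	err_bitmap &= ~SKETCH_LAST_ITEM;
	return err_bitmap ? -EIO : 0;
}
```

Folding everything into a negative errno at this boundary lets callers
treat any non-zero result as failure without decoding the bits.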

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 76 
 1 file changed, 76 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 5e3ecac..6d3c6a8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3862,6 +3862,7 @@ out:
 #define NBYTES_ERROR   (1<<11) /* INODE_ITEM nbytes count error */
 #define ISIZE_ERROR (1<<12) /* INODE_ITEM size count error */
 #define ORPHAN_ITEM (1<<13) /* INODE_ITEM no reference */
+#define NO_INODE_ITEM  (1<<14) /* no inode_item */
 #define LAST_ITEM  (1<<15) /* Complete this tree traversal */
 
 /*
@@ -4681,6 +4682,81 @@ out:
return err;
 }
 
+/*
+ * Iterate over all items in the tree and call check_inode_item() to check them.
+ *
+ * @root:  the root of the tree to be checked.
+ * @ext_ref:   the EXTENDED_IREF feature
+ *
+ * Return 0 if no error found.
+ * Return <0 for error.
+ * Any internal error bitmap is converted to -EIO, to avoid
+ * mixing negative and positive return values.
+ */
+static int check_fs_root_v2(struct btrfs_root *root, unsigned int ext_ref)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   u64 inode_id;
+   int ret, err = 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = 0;
+   key.type = 0;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+   if (ret < 0)
+   goto out;
+
+   while (1) {
+   btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+
+   /*
+* All check must start with inode item, skip if not
+*/
+   if (key.type == BTRFS_INODE_ITEM_KEY) {
+   ret = check_inode_item(root, path, ext_ref);
+   err |= ret;
+   if (err & LAST_ITEM)
+   goto out;
+   continue;
+   }
+   error("root %llu ITEM[%llu %u %llu] isn't INODE_ITEM, skip to next inode",
+ root->objectid, key.objectid, key.type,
+ key.offset);
+
+   err |= NO_INODE_ITEM;
+   inode_id = key.objectid;
+
+   /*
+* skip to next inode
+* TODO: Maybe search_slot() will be faster?
+*/
+   do {
+   ret = btrfs_next_item(root, path);
+   if (ret > 0) {
+   goto out;
+   } else if (ret < 0) {
+   err = ret;
+   goto out;
+   }
+   btrfs_item_key_to_cpu(path->nodes[0], &key,
+ path->slots[0]);
+   } while (inode_id == key.objectid);
+   }
+
+out:
+   err &= ~LAST_ITEM;
+   if (err && !ret)
+   ret = -EIO;
+   btrfs_free_path(path);
+   return ret;
+}
+
 static int all_backpointers_checked(struct extent_record *rec, int print_errs)
 {
struct list_head *cur = rec->backrefs.next;
-- 
2.10.0





[PATCH v2 08/14] btrfs-progs: check: introduce function to check inode item

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_inode_item() to check INODE_ITEM and
related ITEMs that have the same inode id.

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 168 +++
 1 file changed, 168 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 0afbf96..5e3ecac 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3858,6 +3858,11 @@ out:
 #define FILE_EXTENT_ERROR  (1<<7)  /* bad file extent */
 #define ODD_CSUM_ITEM  (1<<8)  /* CSUM_ITEM ERROR */
 #define CSUM_ITEM_MISSING  (1<<9)  /* CSUM_ITEM not found */
+#define LINK_COUNT_ERROR   (1<<10) /* INODE_ITEM nlink count error */
+#define NBYTES_ERROR   (1<<11) /* INODE_ITEM nbytes count error */
+#define ISIZE_ERROR (1<<12) /* INODE_ITEM size count error */
+#define ORPHAN_ITEM (1<<13) /* INODE_ITEM no reference */
+#define LAST_ITEM  (1<<15) /* Complete this tree traversal */
 
 /*
  * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified
@@ -4513,6 +4518,169 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey,
return err;
 }
 
+/*
+ * Check INODE_ITEM and related ITEMs(the same inode id)
+ * 1. check link count
+ * 2. check inode ref/extref
+ * 3. check dir item/index
+ *
+ * @ext_ref:   the EXTENDED_IREF feature
+ *
+ * Return 0 if no error occurred.
+ * Return >0 (error bitmap) on error or when the traversal is done
+ */
+static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
+   unsigned int ext_ref)
+{
+   struct extent_buffer *node;
+   struct btrfs_inode_item *ii;
+   struct btrfs_key key;
+   u64 inode_id;
+   u32 mode;
+   u64 nlink;
+   u64 nbytes;
+   u64 isize;
+   u64 size = 0;
+   u64 refs = 0;
+   u64 extent_end = 0;
+   u64 extent_size = 0;
+   unsigned int dir;
+   unsigned int nodatasum;
+   int slot;
+   int ret;
+   int err = 0;
+
+   node = path->nodes[0];
+   slot = path->slots[0];
+
+   btrfs_item_key_to_cpu(node, &key, slot);
+   inode_id = key.objectid;
+
+   if (inode_id == BTRFS_ORPHAN_OBJECTID) {
+   ret = btrfs_next_item(root, path);
+   if (ret > 0)
+   err |= LAST_ITEM;
+   return err;
+   }
+
+   ii = btrfs_item_ptr(node, slot, struct btrfs_inode_item);
+   isize = btrfs_inode_size(node, ii);
+   nbytes = btrfs_inode_nbytes(node, ii);
+   mode = btrfs_inode_mode(node, ii);
+   dir = imode_to_type(mode) == BTRFS_FT_DIR;
+   nlink = btrfs_inode_nlink(node, ii);
+   nodatasum = btrfs_inode_flags(node, ii) & BTRFS_INODE_NODATASUM;
+
+   while (1) {
+   ret = btrfs_next_item(root, path);
+   if (ret < 0) {
+   /* out will fill 'err' using current statistics */
+   goto out;
+   } else if (ret > 0) {
+   err |= LAST_ITEM;
+   goto out;
+   }
+
+   node = path->nodes[0];
+   slot = path->slots[0];
+   btrfs_item_key_to_cpu(node, &key, slot);
+   if (key.objectid != inode_id)
+   goto out;
+
+   switch (key.type) {
+   case BTRFS_INODE_REF_KEY:
+   ret = check_inode_ref(root, &key, node, slot, &refs,
+ mode);
+   err |= ret;
+   break;
+   case BTRFS_INODE_EXTREF_KEY:
+   if (key.type == BTRFS_INODE_EXTREF_KEY && !ext_ref)
+   warning("root %llu EXTREF[%llu %llu] isn't supported",
+   root->objectid, key.objectid,
+   key.offset);
+   ret = check_inode_extref(root, &key, node, slot, &refs,
+mode);
+   err |= ret;
+   break;
+   case BTRFS_DIR_ITEM_KEY:
+   case BTRFS_DIR_INDEX_KEY:
+   if (!dir) {
+   warning("root %llu INODE[%llu] mode %u shouldn't have DIR_INDEX[%llu %llu]",
+   root->objectid, inode_id,
+   imode_to_type(mode), key.objectid,
+   key.offset);
+   }
+   ret = check_dir_item(root, &key, node, slot, &size,
+ext_ref);
+   err |= ret;
+   break;
+   case BTRFS_EXTENT_DATA_KEY:
+   if (dir) {
+   warning("root %llu DIR INODE[%llu] shouldn't EXTENT_DATA[%llu %llu]",
+ 

[PATCH v2 07/14] btrfs-progs: check: introduce function to check file extent

2016-09-20 Thread Qu Wenruo
From: Lu Fengqi 

Introduce a new function check_file_extent() to check file extent,
such as datasum, hole, size.
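
The hole and continuity rules can be illustrated on a plain array of
extents. This is a simplified sketch under stated assumptions: struct
sketch_extent and count_extent_errors() are hypothetical, the extents are
assumed to be sorted by file offset, and checksum checks are left out
entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_extent {
	uint64_t offset;	/* file offset of the extent */
	uint64_t num_bytes;	/* logical length of the extent */
	int is_hole;		/* no disk space backs this extent */
};

/*
 * Count extents that break the rules:
 *  - with no_holes set, explicit hole extents must not exist;
 *  - without no_holes, each extent must start where the previous ended.
 * *size receives the sum of all non-hole extent lengths.
 */
static unsigned int count_extent_errors(const struct sketch_extent *ext,
					size_t nr, int no_holes,
					uint64_t *size)
{
	uint64_t end = 0;
	unsigned int errors = 0;

	assert(size != NULL);
	*size = 0;

	for (size_t i = 0; i < nr; i++) {
		if (no_holes && ext[i].is_hole)
			errors++;
		else if (!no_holes && end != ext[i].offset)
			errors++;

		end += ext[i].num_bytes;
		if (!ext[i].is_hole)
			*size += ext[i].num_bytes;
	}
	return errors;
}
```

A gap between two extents is only an error when the no_holes feature is
off, because in that mode every hole must be described by its own extent.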

Signed-off-by: Lu Fengqi 
Signed-off-by: Qu Wenruo 
---
 cmds-check.c | 97 
 1 file changed, 97 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index d39a81c..0afbf96 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3855,6 +3855,9 @@ out:
 #define INODE_REF_MISSING  (1<<4)  /* INODE_REF/INODE_EXTREF not found */
 #define INODE_ITEM_MISSING (1<<5)  /* INODE_ITEM not found */
 #define INODE_ITEM_MISMATCH (1<<6)  /* INODE_ITEM found but not match */
+#define FILE_EXTENT_ERROR  (1<<7)  /* bad file extent */
+#define ODD_CSUM_ITEM  (1<<8)  /* CSUM_ITEM ERROR */
+#define CSUM_ITEM_MISSING  (1<<9)  /* CSUM_ITEM not found */
 
 /*
  * Find DIR_ITEM/DIR_INDEX for the given key and check it with the specified
@@ -4416,6 +4419,100 @@ next:
return err;
 }
 
+/*
+ * Check file extent datasum/hole, update the size of the file extents,
+ * check and update the last offset of the file extent.
+ *
+ * @root:  the root of fs/file tree.
+ * @fkey:  the key of the file extent.
+ * @nodatasum: INODE_NODATASUM feature.
+ * @size:  the sum of all EXTENT_DATA items size for this inode.
+ * @end:   the offset of the last extent.
+ *
+ * Return 0 if no error occurred.
+ */
+static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey,
+struct extent_buffer *node, int slot,
+unsigned int nodatasum, u64 *size, u64 *end)
+{
+   struct btrfs_file_extent_item *fi;
+   u64 disk_bytenr;
+   u64 disk_num_bytes;
+   u64 extent_num_bytes;
+   u64 found;
+   unsigned int extent_type;
+   unsigned int is_hole;
+   int ret;
+   int err = 0;
+
+   fi = btrfs_item_ptr(node, slot, struct btrfs_file_extent_item);
+
+   extent_type = btrfs_file_extent_type(node, fi);
+   /* Skip if file extent is inline */
+   if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
+   struct btrfs_item *e = btrfs_item_nr(slot);
+   u32 item_inline_len;
+
+   item_inline_len = btrfs_file_extent_inline_item_len(node, e);
+   extent_num_bytes = btrfs_file_extent_inline_len(node, slot, fi);
+   if (extent_num_bytes == 0 ||
+   extent_num_bytes != item_inline_len)
+   err |= FILE_EXTENT_ERROR;
+   *size += extent_num_bytes;
+   return err;
+   }
+
+   /* Check extent type */
+   if (extent_type != BTRFS_FILE_EXTENT_REG &&
+   extent_type != BTRFS_FILE_EXTENT_PREALLOC) {
+   err |= FILE_EXTENT_ERROR;
+   error("root %llu EXTENT_DATA[%llu %llu] type bad",
+ root->objectid, fkey->objectid, fkey->offset);
+   return err;
+   }
+
+   /* Check REG_EXTENT/PREALLOC_EXTENT */
+   disk_bytenr = btrfs_file_extent_disk_bytenr(node, fi);
+   disk_num_bytes = btrfs_file_extent_disk_num_bytes(node, fi);
+   extent_num_bytes = btrfs_file_extent_num_bytes(node, fi);
+   is_hole = (disk_bytenr == 0) && (disk_num_bytes == 0);
+
+   /* Check EXTENT_DATA datasum */
+   ret = count_csum_range(root, disk_bytenr, disk_num_bytes, &found);
+   if (found > 0 && nodatasum) {
+   err |= ODD_CSUM_ITEM;
+   error("root %llu EXTENT_DATA[%llu %llu] nodatasum shouldn't have datasum",
+ root->objectid, fkey->objectid, fkey->offset);
+   } else if (extent_type == BTRFS_FILE_EXTENT_REG && !nodatasum &&
+  !is_hole &&
+  (ret < 0 || found == 0 || found < disk_num_bytes)) {
+   err |= CSUM_ITEM_MISSING;
+   error("root %llu EXTENT_DATA[%llu %llu] datasum missing",
+ root->objectid, fkey->objectid, fkey->offset);
+   } else if (extent_type == BTRFS_FILE_EXTENT_PREALLOC && found > 0) {
+   err |= ODD_CSUM_ITEM;
+   error("root %llu EXTENT_DATA[%llu %llu] prealloc shouldn't have datasum",
+ root->objectid, fkey->objectid, fkey->offset);
+   }
+
+   /* Check EXTENT_DATA hole */
+   if (no_holes && is_hole) {
+   err |= FILE_EXTENT_ERROR;
+   error("root %llu EXTENT_DATA[%llu %llu] shouldn't be hole",
+ root->objectid, fkey->objectid, fkey->offset);
+   } else if (!no_holes && *end != fkey->offset) {
+   err |= FILE_EXTENT_ERROR;
+   error("root %llu EXTENT_DATA[%llu %llu] interrupt",
+ root->objectid, fkey->objectid, fkey->offset);
+   }
+
+   *end += extent_num_bytes;
+   if (!is_hole)
+   *size += extent_num_bytes;
+
+   return err;
+}
+
 static int all_backpointers_checked(struct extent_record *rec, int print_errs)
 {
s

Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 12:47 AM, Tomasz Chmielewski  wrote:
> How to understand the following "btrfs fi show" output?
>
> # btrfs fi show /var/lib/lxd
> Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
> Total devices 2 FS bytes used 136.18GiB
> devid1 size 423.13GiB used 423.13GiB path /dev/sda3
> devid2 size 423.13GiB used 423.13GiB path /dev/sdb3
>
> Why is it "size 423.13GiB used 423.13GiB"? Is it full?
>
> I had "No space left" on this filesystem just yesterday (running kernel
> 4.7.4). This is btrfs RAID-1 on SSD disks. This filesystem is used for 20-30
> LXD containers with different roles (mongo, mysql, postgres databases,
> webservers etc.), around 150 read-only snapshots, btrfs compression is
> disabled.
>
>
> Both "btrfs fi df" and "df -h" show plenty of space:
>
> # btrfs fi df /var/lib/lxd
> Data, RAID1: total=417.12GiB, used=131.33GiB
> System, RAID1: total=8.00MiB, used=80.00KiB
> Metadata, RAID1: total=6.00GiB, used=4.86GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
>
> # df -h
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sda3   424G  137G  286G  33% /var/lib/lxd

I'm coming into this late and realize most questions have been
answered. But I take the position that this is a bug: clearly there's
enough space when df reports only 33% used, and therefore it's
important to gather information about the file system in its current
state so the devs can make decisions. Manually running balance is the
correct workaround, but it's bad UX and should not be necessary (even
though it's known to sometimes be necessary).

Anyway, in this case there is room in all chunks and GlobalReserve
used is 0.00B. Metadata has a bit over a gigabyte of unused space in
its allocated block groups. So at the moment I'm thinking it's a bug.
The two things that'd be useful if you can reproduce this problem at
some point, by NOT trying to prevent it again, are:

grep . -IR /sys/fs/btrfs//allocation/

 pick the UUID for the affected fs volume.

btrfs-debugfs found in btrfs-progs upstream as a python program but
typically not packaged by distros
https://github.com/kdave/btrfs-progs/blob/master/btrfs-debugfs

Takes the form:

sudo ./btrfs-debugfs -b 

It'll show you the percent each block group is actually being used so
you can have a good idea what -dusage value to use (in your case) to
free up space. That should help, but ultimately it's a workaround,
not a real fix. There shouldn't be ENOSPC anyway.
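
As a rough worked example, the aggregate figures from the quoted
"btrfs fi df" output can be turned into unused space and percent used.
This is only a sketch: the struct and function names are made up for
illustration, and these are whole-profile totals, not the per-block-group
percentages that btrfs-debugfs prints.

```c
#include <assert.h>

/* Allocated ("total") and used space of one chunk type, in GiB. */
struct sketch_alloc {
	double total_gib;
	double used_gib;
};

/* Space that is allocated to chunks of this type but not used. */
static double unused_gib(struct sketch_alloc a)
{
	assert(a.used_gib <= a.total_gib);
	return a.total_gib - a.used_gib;
}

/* Percentage of the allocated space that is actually used. */
static double used_percent(struct sketch_alloc a)
{
	assert(a.total_gib > 0.0);
	return 100.0 * a.used_gib / a.total_gib;
}
```

With Data at total=417.12GiB, used=131.33GiB this gives roughly 31% used,
and with Metadata at total=6.00GiB, used=4.86GiB roughly 1.14GiB unused,
which matches the "bit over a gigabyte" mentioned above.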

So if it happens again, first capture the above two bits of
information, and then if you feel like testing kernel 4.8-rc7, do that.
It has a massive pile of ENOSPC-related rework and I bet Josef would
like to know if the problem reproduces with that kernel. As in, just
change kernels, don't try to fix it with balance first.


-- 
Chris Murphy


Re: Can't mount btrfs raid1

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 5:16 PM, Mirak M  wrote:
> Hello,
>
> I have a failure when mounting btrfs.
>
>> mount -oro,recovery /dev/sda2 sda2_btrfs
>> mount: /dev/sda2: can't read superblock

What do you get for 'btrfs super-recover -v ' and 'btrfs check '

For this purpose any 4.4+ version is probably OK, except 4.7 and 4.7.1
which might spit out some bogus items (it's just noise it won't hurt
anything as long as you don't use --repair).


>
> The kernel log is here http://pastebin.com/tHihHT92 and at the bottom
> of the email
>
> I must admit I did the error of running btrfs check --repair at some
> point, not knowing this was not a good idea.
>
> I run ubuntu 16.04 with kernel 4.4.0-36-generic .

OK, what version of btrfs-progs? What was the output from btrfs check?


> [ 1692.712574] BTRFS critical (device sda2): corrupt leaf, bad key
> order: block=1957998690304,root=1, slot=29
> [ 1692.712819] BTRFS critical (device sda2): corrupt leaf, bad key
> order: block=1957998690304,root=1, slot=29

List archives suggest this might be due to bad RAM. I also see there
are some bugs that can cause it, but I'm not finding any post kernel
4.4 patches for this (there are a metric f tonne of changes since
4.4). I would suggest kernel 4.4.21 if you need to stick with a long
term kernel, I have no idea what 4.4.0-36 translates into.


> [ 1692.713963] BTRFS: error (device sda2) in btrfs_replay_log:2401:
> errno=-5 IO failure (Failed to recover log tree)

This is kinda curious, was there a crash or power failure?




-- 
Chris Murphy


Can't mount btrfs raid1

2016-09-20 Thread Mirak M
Hello,

I have a failure when mounting btrfs.

> mount -oro,recovery /dev/sda2 sda2_btrfs
> mount: /dev/sda2: can't read superblock

The kernel log is here http://pastebin.com/tHihHT92 and at the bottom
of the email

I must admit I did the error of running btrfs check --repair at some
point, not knowing this was not a good idea.

I run ubuntu 16.04 with kernel 4.4.0-36-generic .

Regards,
Mirak


[ 1685.255619] BTRFS info (device sda2): enabling auto recovery
[ 1685.255626] BTRFS info (device sda2): disk space caching is enabled
[ 1685.255628] BTRFS: has skinny extents
[ 1692.712574] BTRFS critical (device sda2): corrupt leaf, bad key
order: block=1957998690304,root=1, slot=29
[ 1692.712819] BTRFS critical (device sda2): corrupt leaf, bad key
order: block=1957998690304,root=1, slot=29
[ 1692.712827] ------------[ cut here ]------------
[ 1692.712865] WARNING: CPU: 3 PID: 6305 at
/build/linux-a2WvEb/linux-4.4.0/fs/btrfs/extent-tree.c:6552
__btrfs_free_extent.isra.70+0x2e6/0xd30 [btrfs]()
[ 1692.712867] BTRFS: Transaction aborted (error -5)
[ 1692.712868] Modules linked in: nvram msr joydev input_leds
ir_xmp_decoder ir_lirc_codec ir_mce_kbd_decoder ir_sharp_decoder
ir_sanyo_decoder ir_sony_decoder ir_jvc_decoder ir_rc6_decoder
ir_nec_decoder ir_rc5_decoder rc_rc6_mce mceusb lirc_dev rc_core
snd_hda_codec_realtek snd_hda_codec_generic binfmt_misc coretemp
kvm_intel kvm irqbypass snd_hda_codec_hdmi snd_hda_intel serio_raw
snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi
snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer snd
shpchp soundcore 8250_fintek i2c_nforce2 mac_hid parport_pc ppdev lp
parport autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy
async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
multipath linear dm_mirror dm_region_hash dm_log hid_logitech
ff_memless pata_acpi hid_logitech_hidpp
[ 1692.712936]  hid_logitech_dj usbhid hid uas usb_storage amdkfd
amd_iommu_v2 radeon i2c_algo_bit ttm drm_kms_helper syscopyarea
sysfillrect sysimgblt firewire_ohci pata_jmicron fb_sys_fops psmouse
firewire_core drm forcedeth crc_itu_t ahci libahci video wmi fjes
[ 1692.712960] CPU: 3 PID: 6305 Comm: mount Tainted: P   OE
4.4.0-36-generic #55-Ubuntu
[ 1692.712963] Hardware name: Gigabyte Technology Co., Ltd.
GA-E7AUM-DS2H/GA-E7AUM-DS2H, BIOS F2 12/17/2008
[ 1692.712965]  0286 b16cde4b 880098c3f688
813f13b3
[ 1692.712968]  880098c3f6d0 c04b5468 880098c3f6c0
810810f2
[ 1692.712972]  01c81f86 fffb 
8801756c2000
[ 1692.712975] Call Trace:
[ 1692.712983]  [] dump_stack+0x63/0x90
[ 1692.712988]  [] warn_slowpath_common+0x82/0xc0
[ 1692.712991]  [] warn_slowpath_fmt+0x5c/0x80
[ 1692.713009]  []
__btrfs_free_extent.isra.70+0x2e6/0xd30 [btrfs]
[ 1692.713031]  [] ?
btrfs_merge_delayed_refs+0x66/0x650 [btrfs]
[ 1692.713050]  []
__btrfs_run_delayed_refs+0xaab/0x11f0 [btrfs]
[ 1692.713068]  [] btrfs_run_delayed_refs+0x7d/0x2a0 [btrfs]
[ 1692.713085]  [] ? btrfs_set_path_blocking+0x3f/0x70 [btrfs]
[ 1692.713105]  [] btrfs_commit_transaction+0x56/0xa90 [btrfs]
[ 1692.713110]  [] ? kmem_cache_free+0x1d4/0x1e0
[ 1692.713132]  [] btrfs_recover_log_trees+0x3e7/0x480 [btrfs]
[ 1692.713155]  [] ? replay_one_extent+0x6c0/0x6c0 [btrfs]
[ 1692.713175]  [] open_ctree+0x1a5c/0x2460 [btrfs]
[ 1692.713192]  [] btrfs_mount+0x944/0xa60 [btrfs]
[ 1692.713196]  [] ? find_next_bit+0x15/0x20
[ 1692.713200]  [] ? pcpu_alloc+0x37f/0x640
[ 1692.713203]  [] mount_fs+0x38/0x160
[ 1692.713206]  [] ? __alloc_percpu+0x15/0x20
[ 1692.713209]  [] vfs_kern_mount+0x67/0x110
[ 1692.713226]  [] btrfs_mount+0x1df/0xa60 [btrfs]
[ 1692.713228]  [] ? pcpu_alloc+0x37f/0x640
[ 1692.713231]  [] mount_fs+0x38/0x160
[ 1692.713233]  [] ? __alloc_percpu+0x15/0x20
[ 1692.713236]  [] vfs_kern_mount+0x67/0x110
[ 1692.713239]  [] do_mount+0x269/0xde0
[ 1692.713242]  [] SyS_mount+0x9f/0x100
[ 1692.713246]  [] entry_SYSCALL_64_fastpath+0x16/0x71
[ 1692.713249] ---[ end trace e6d60ad04bc3178e ]---
[ 1692.713252] BTRFS: error (device sda2) in __btrfs_free_extent:6552:
errno=-5 IO failure
[ 1692.713257] BTRFS: error (device sda2) in
btrfs_run_delayed_refs:2927: errno=-5 IO failure
[ 1692.713950] pending csums is 4096
[ 1692.713963] BTRFS: error (device sda2) in btrfs_replay_log:2401:
errno=-5 IO failure (Failed to recover log tree)
[ 1692.713994] BTRFS error (device sda2): cleaner transaction attach
returned -30
[ 1692.760459] BTRFS: open_ctree failed


Re: ChaCha20 vs. AES performance

2016-09-20 Thread Mathieu Chouquet-Stringer
kent.overstr...@gmail.com (Kent Overstreet) writes:
> On Tue, Sep 20, 2016 at 10:23:20AM -0400, Theodore Ts'o wrote:
>> On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote:
>> > Not on the list or I would've replied directly, but on Haswell,
>> > ChaCha20 (in software) is over 2x as fast as AES (in hardware), at
>> > realistic (for a filesystem) block sizes:
>> 
>> On Skylake and Broadwell processors, AES is faster (the posting is
>> from a ChaCha20 enthusiast):
>> 
>>  https://blog.cloudflare.com/it-takes-two-to-chacha-poly/
>
> The performance delta in his graphs isn't near as big as what I've
> measured, which makes me suspect OpenSSL's ChaCha20 implementation isn't
> nearly as fast as the kernel's.

The other thing to keep in mind is this (i.e. what's true for a big Intel
CPU isn't true everywhere): "The new cipher suites are fast. As Adam
Langley described, ChaCha20-Poly1305 is three times faster than
AES-128-GCM on mobile devices. Spending less time on decryption means
faster page rendering and better battery life."

https://blog.cloudflare.com/do-the-chacha-better-mobile-performance-with-cryptography/

The argument made by Bernstein is, in a nutshell, that "CPUs are optimized
for video games and thus ciphers should use the same instructions which
make games 'faster'" (I'd recommend reading his whole email to understand
what he means):
https://moderncrypto.org/mail-archive/noise/2016/000699.html

Or, as one person commented on the net
(https://news.ycombinator.com/item?id=12264321):

Bernstein agrees with you. His point isn't that it's dumb that CPUs are
optimized for games. It's that cipher designers should have enough
awareness of trends in CPU development to design ciphers that take
advantage of the same features that games do. That's what he did with
Salsa/ChaCha. *His subtext is that over the medium term he believes his
ciphers will outperform AES, despite AES having AES-NI hardware
support.* (emphasis mine)

-- 
Mathieu Chouquet-Stringer
The sun itself sees not till heaven clears.
 -- William Shakespeare --


Re: splat in split_leaf with integrity checking

2016-09-20 Thread Adam Borowski
On Mon, Sep 19, 2016 at 07:21:47PM -0700, Liu Bo wrote:
> On Tue, Sep 20, 2016 at 03:39:27AM +0200, Adam Borowski wrote:
> > Hi!
> > I just had the following splat in 4.8-rc6 for the third time in a week:
> 
> Sorry for the trouble, this is caused by my patch and here are two fixes[1]
> to get it right with integrity check (not sure if they've been queued yet).
> 
> Here is a discussion that explains why we remove it[2].
> 
> [1] https://patchwork.kernel.org/patch/9320077/
> https://patchwork.kernel.org/patch/9311541/
> 
> [2] http://www.spinics.net/lists/linux-btrfs/msg58506.html

Thanks!  I should have noticed these posts and not bothered you; anyway, I've
applied the patches and stress-tested during the day, just in case they
break something.  All seems to work fine now -- and as I had a splat just
after posting this when copying that big file, and another during the kernel
compile, the bug would likely have triggered by now.


Meow!
-- 
Second "wet cat laying down on a powered-on box-less SoC on the desk" close
shave in a week.  Protect your ARMs, folks!


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 2:18 PM, Alexandre Poux wrote:
>
>
> Le 20/09/2016 à 21:46, Chris Murphy a écrit :
>> On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux  wrote:
>>>
>>> Le 20/09/2016 à 21:11, Chris Murphy a écrit :
>>>> And no backup? Umm, I'd resolve that sooner than anything else.
>>> Yeah you are absolutely right, this was a temporary solution which came
>>> to be not that temporary.
>>> And I regret it already...
>> Well on the bright side, if this were LVM or mdadm linear/concat
>> array, the whole thing would be toast because any other file system
>> would have lost too much fs metadata on the missing device.
>>
>>>> It should be true that it'll tolerate a read only mount indefinitely,
>>>> but read write? Not sure. This sort of edge case isn't well tested at
>>>> all seeing as it required changing the kernel to reduce safe guards.
>>>> So all bets are off the whole thing could become unmountable, not even
>>>> read only, and then it's a scraping job.
>>> I'm not that crazy, I tried the patch inside a virtual machine on
>>> virtual drives...
>>> And since it's only virtual, it may not work on the real partition...
>> Are you sure the virtual setup lacked a CHUNK_ITEM on the missing
>> device? That might be what pinned it in that case.
> In fact in my virtual setup there was more chunk missing (1 metadata 1
> System and 1 Data).
> I will try to do a setup closer to my real one.

Probably the reason why that missing device has no used chunks is
because it's so small. Btrfs allocates block groups to devices with
the most unallocated space first. Only once the unallocated space is
even (approximately) on all devices would it allocate a block group to
the small device.
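
A minimal sketch of that allocation order, assuming a heavily simplified
model (single-profile data, fixed 1 GiB block groups, made-up device
sizes). It is not the kernel's allocator, only an illustration of "most
unallocated space first":

```python
def allocate_block_groups(unallocated, count, bg_size=1):
    """Simplified model: each new block group goes to the device that
    currently has the most unallocated space (sizes in GiB).

    `unallocated` maps a hypothetical devid to its unallocated space.
    Returns the list of devids chosen, in allocation order.
    """
    free = dict(unallocated)
    chosen = []
    for _ in range(count):
        # Pick the device with the most unallocated space; ties go to the
        # lowest devid so the result is deterministic.
        devid = max(sorted(free), key=lambda d: free[d])
        if free[devid] < bg_size:
            break  # no device can hold another block group
        free[devid] -= bg_size
        chosen.append(devid)
    return chosen


# Hypothetical array: two large devices and one small one (devid 8).
devices = {1: 100, 2: 100, 8: 10}
order = allocate_block_groups(devices, 50)
print(8 in order)  # False: the small device has not been touched yet
```

In this model the small device only receives a block group once the large
ones have been drawn down to roughly the same amount of unallocated space.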


-- 
Chris Murphy


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Alexandre Poux


Le 20/09/2016 à 22:18, Alexandre Poux a écrit :
>
> Le 20/09/2016 à 21:46, Chris Murphy a écrit :
>> On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux  wrote:
>>> Le 20/09/2016 à 21:11, Chris Murphy a écrit :
>>>> And no backup? Umm, I'd resolve that sooner than anything else.
>>> Yeah you are absolutely right, this was a temporary solution which came
>>> to be not that temporary.
>>> And I regret it already...
>> Well on the bright side, if this were LVM or mdadm linear/concat
>> array, the whole thing would be toast because any other file system
>> would have lost too much fs metadata on the missing device.
>>
>>>> It should be true that it'll tolerate a read only mount indefinitely,
>>>> but read write? Not sure. This sort of edge case isn't well tested at
>>>> all seeing as it required changing the kernel to reduce safe guards.
>>>> So all bets are off the whole thing could become unmountable, not even
>>>> read only, and then it's a scraping job.
>>> I'm not that crazy, I tried the patch inside a virtual machine on
>>> virtual drives...
>>> And since it's only virtual, it may not work on the real partition...
>> Are you sure the virtual setup lacked a CHUNK_ITEM on the missing
>> device? That might be what pinned it in that case.
> In fact in my virtual setup there was more chunk missing (1 metadata 1
> System and 1 Data).
> I will try to do a setup closer to my real one.
Good news: I ran a test where my virtual setup was missing no chunk at
all, and in this case the missing device could be removed without any
problem!
What I did:
- make an array with 6 disks (data single, metadata raid1)
- dd if=/dev/zero of=/mnt/somefile bs=64M count=16 # make a 1G file
- use btrfs-debug-tree to identify which device was not used
- shut down the VM, remove this virtual device, and restart the VM
- mount the array degraded but read-write, thanks to the patched kernel
- btrfs device delete missing /mnt
- and voilà!
I will try with something other than /dev/zero, but this is very encouraging.
Do you think that my test is too trivial?
Should I try something else before trying on the real partition with the
overlay?
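
For the "identify which device was not used" step, here is a rough sketch
of how the chunk tree dump could be scanned. It assumes the dump prints
one "stripe N devid D offset O" line per chunk stripe and one
"(DEV_ITEMS DEV_ITEM D)" key per device; the function name and the sample
text are made up for illustration and are not real output:

```python
import re


def unused_devids(dump_text):
    """Return devids that have a DEV_ITEM but no chunk stripe on them.

    Assumes (not verified against every btrfs-progs version) that the
    chunk tree dump contains keys like '(DEV_ITEMS DEV_ITEM <devid>)' and
    stripe lines like 'stripe <n> devid <devid> offset <bytes>'.
    """
    known = {int(m) for m in re.findall(r"DEV_ITEM (\d+)\)", dump_text)}
    used = {int(m) for m in re.findall(r"stripe \d+ devid (\d+)", dump_text)}
    return sorted(known - used)


# Hypothetical, hand-written excerpt in the assumed format.
SAMPLE = """
item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
item 1 key (DEV_ITEMS DEV_ITEM 8) itemoff 16087 itemsize 98
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15975 itemsize 112
    chunk length 8388608 owner 2 type SYSTEM num_stripes 1
        stripe 0 devid 1 offset 20971520
"""

print(unused_devids(SAMPLE))  # [8]
```

A devid that shows up only in its DEV_ITEM and in no stripe line is the
kind of device that holds no system, data, or metadata chunks.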

>> You could try some sort of overlay for your remaining drives.
>> Something like this:
>> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
>>
>> Make sure you understand the gotcha about cloning which applies here:
>> https://btrfs.wiki.kernel.org/index.php/Gotchas
>>
>> I think it's safe to use blockdev --setro on every real device  you're
>> trying to protect from changes. And when mounting you'll at least need
>> to use device= mount option to explicitly mount each of the overlay
>> devices. Based on the wiki, I'm wincing, I don't really know for sure
>> if device mount option is enough to compel Btrfs to only use those
>> devices and not go off the rails and still use one of the real
>> devices, but at least if they're setro it won't matter (the mount will
>> just fail somehow due to write failures).
>>
>> So now you can try removing the missing device... and see what
>> happens. You could inspect the overlay files and see what changes were
>> made.
> Wow that looks like nice.
> So, if it work, and if we find a way to fix the filesystem inside the vm,
> I can use this over the real partion to check if it works before trying
> the fix for real.
> Nice idea.
>>>> What do you get for btrfs-debug-tree -t 3
>>>>
>>>> That should show the chunk tree, and what I'm wondering is if the
>>>> chunk tree has any references to chunks on the missing device. Even if
>>>> there are no extents on that device, if there are chunks, that might
>>>> be one of the safeguards.

>>> You'll find it attached.
>>> The missing device is the devid 8 (since it's the only one missing in
>>> btrfs fi show)
>>> I found it only once line 63
>> Yeah bummer. Not used for system, data, or metadata chunks at all.
>>
>>
>




Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Graham Cobb
On 20/09/16 19:53, Alexandre Poux wrote:
> As for moving data to an another volume, since it's only data and
> nothing fancy (no subvolume or anything), a simple rsync would do the trick.
> My problem in this case is that I don't have enough available space
> elsewhere to move my data.
> That's why I'm trying this hard to recover the partition...

I am sure you have already thought about this, but... it might be
easier, and even maybe faster, to backup the data to a cloud server,
then recreate and download again.

Backblaze B2 is very cheap for upload and storage (don't know about
download charges, though).  And rclone works well to handle rsync-style
copies (although you might want to use tar or dar if you need to
preserve file attributes).

And if that works, rclone + B2 might make a reasonable offsite backup
solution for the future!

Graham


[PATCH 0/4][V3] metadata throttling in writeback patches

2016-09-20 Thread Josef Bacik
This is the latest set of patches based on my conversations with Jan and
Johannes.  The biggest change has been changing the metadata accounting counters
to be in bytes instead of pages in order to better support varying blocksizes.
I've also stopped messing with the other pagecache related counters so we can
keep them truly separate.  Johannes suggested this change and I simply convert
the bytes counter to pages when calculating dirty limits and such.

The other big change is changing WB_WRITTEN/WB_DIRTIED to be in bytes instead of
pages as well.  This is just a name and accounting change, it doesn't really
change the core logic at all.
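
To make the unit change concrete, here is a small user-space sketch of the
conversions involved. It mirrors the K(), BtoK() and BtoP() macros from the
patches in this series, assuming 4 KiB pages (PAGE_SHIFT = 12); the Python
function names are made up, and this is an illustration, not kernel code:

```python
PAGE_SHIFT = 12                 # assumption: 4 KiB pages
PAGE_SIZE = 1 << PAGE_SHIFT


def pages_to_kb(pages):
    """K(x): page count -> KiB, as used for the page-based counters."""
    return pages << (PAGE_SHIFT - 10)


def bytes_to_kb(nbytes):
    """BtoK(x): byte count -> KiB, as used for the byte-based counters."""
    return nbytes >> 10


def bytes_to_pages(nbytes):
    """BtoP(x): byte count -> whole pages (rounds down)."""
    return nbytes >> PAGE_SHIFT


# Writing out 3 pages bumps a page counter by 3, or a byte counter by
# 3 * PAGE_SIZE; both report the same number of KiB.
written_pages = 3
written_bytes = written_pages * PAGE_SIZE
print(pages_to_kb(written_pages), bytes_to_kb(written_bytes))  # 12 12
```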

I'm sending this out ahead of my full battery of tests, but I want to get
feedback on this direction as soon as possible.  In the meantime I've changed my
btrfs-specific patches to work with these patches and am running long-running
tests now to verify everything still works.  Thanks,

Josef



[PATCH 3/4] writeback: convert WB_WRITTEN/WB_DIRTIED counters to bytes

2016-09-20 Thread Josef Bacik
These are counters that constantly go up in order to do bandwidth calculations.
It isn't important what units they use, as long as they are consistent between
the two of them, so convert them to count bytes written/dirtied, and allow the
metadata accounting stuff to change the counters as well.

Signed-off-by: Josef Bacik 
---
 fs/fuse/file.c   |  4 ++--
 include/linux/backing-dev-defs.h |  4 ++--
 include/linux/backing-dev.h  |  2 +-
 mm/backing-dev.c |  8 
 mm/page-writeback.c  | 26 --
 5 files changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f394aff..3f5991e 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1466,7 +1466,7 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
for (i = 0; i < req->num_pages; i++) {
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_node_page_state(req->pages[i], NR_WRITEBACK_TEMP);
-   wb_writeout_inc(&bdi->wb);
+   wb_writeout_inc(&bdi->wb, PAGE_SIZE);
}
wake_up(&fi->page_waitq);
 }
@@ -1770,7 +1770,7 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
 
dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_node_page_state(page, NR_WRITEBACK_TEMP);
-   wb_writeout_inc(&bdi->wb);
+   wb_writeout_inc(&bdi->wb, PAGE_SIZE);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 1a7c3c1..cef0f24 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -36,8 +36,8 @@ enum wb_stat_item {
WB_WRITEBACK,
WB_METADATA_DIRTY_BYTES,
WB_METADATA_WRITEBACK_BYTES,
-   WB_DIRTIED,
-   WB_WRITTEN,
+   WB_DIRTIED_BYTES,
+   WB_WRITTEN_BYTES,
NR_WB_STAT_ITEMS
 };
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 089acf6..742238a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -113,7 +113,7 @@ static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
return sum;
 }
 
-extern void wb_writeout_inc(struct bdi_writeback *wb);
+extern void wb_writeout_inc(struct bdi_writeback *wb, long bytes);
 
 /*
  * maximal error of a stat counter.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d76f432..f0695b0 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -77,8 +77,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
   "BdiDirtyThresh: %10lu kB\n"
   "DirtyThresh:%10lu kB\n"
   "BackgroundThresh:   %10lu kB\n"
-  "BdiDirtied: %10lu kB\n"
-  "BdiWritten: %10lu kB\n"
+  "BdiDirtiedBytes:%10lu kB\n"
+  "BdiWrittenBytes:%10lu kB\n"
   "BdiMetadataDirty:   %10lu kB\n"
   "BdiMetaWriteback:   %10lu kB\n"
   "BdiWriteBandwidth:  %10lu kBps\n"
@@ -93,8 +93,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
   K(wb_thresh),
   K(dirty_thresh),
   K(background_thresh),
-  (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
-  (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
+  (unsigned long) BtoK(wb_stat(wb, WB_DIRTIED_BYTES)),
+  (unsigned long) BtoK(wb_stat(wb, WB_WRITTEN_BYTES)),
   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_DIRTY_BYTES)),
   (unsigned long) BtoK(wb_stat(wb, WB_METADATA_WRITEBACK_BYTES)),
   (unsigned long) K(wb->write_bandwidth),
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 423d2f5..6d08673 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -624,11 +624,11 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
  * Increment @wb's writeout completion count and the global writeout
  * completion count. Called from test_clear_page_writeback().
  */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static inline void __wb_writeout_inc(struct bdi_writeback *wb, long bytes)
 {
struct wb_domain *cgdom;
 
-   __inc_wb_stat(wb, WB_WRITTEN);
+   __add_wb_stat(wb, WB_WRITTEN_BYTES, bytes);
wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
   wb->bdi->max_prop_frac);
 
@@ -638,12 +638,12 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
   wb->bdi->max_prop_frac);
 }
 
-void wb_writeout_inc(struct bdi_writeback *wb)
+void wb_writeout_inc(struct bdi_writeback *wb, long bytes)
 {
unsigned long flags;
 
local_irq_save(flags);
-   __wb_writeout_inc(wb)

[PATCH 4/4] writeback: introduce super_operations->write_metadata

2016-09-20 Thread Josef Bacik
Now that we have metadata counters in the VM, we need to provide a way to kick
writeback on dirty metadata.  Introduce super_operations->write_metadata.  This
allows file systems to deal with writing back any dirty metadata we need based
on the writeback needs of the system.  Since there is no inode to key off of we
need a list in the bdi for dirty super blocks to be added.  From there we can
find any dirty sb's on the bdi we are currently doing writeback on and call into
their ->write_metadata callback.
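
A rough user-space model of that flow, assuming simplified stand-ins for
the kernel structures. The class and function names below are hypothetical,
locking is left out, and the budget is counted in bytes purely for
simplicity:

```python
from collections import deque


class SuperBlock:
    """Hypothetical stand-in for struct super_block."""

    def __init__(self, name, dirty_bytes, has_write_metadata=True):
        self.name = name
        self.dirty_bytes = dirty_bytes
        self.has_write_metadata = has_write_metadata

    def write_metadata(self, nr_to_write):
        """Write back up to nr_to_write bytes; return bytes written."""
        written = min(self.dirty_bytes, nr_to_write)
        self.dirty_bytes -= written
        return written


def writeback_dirty_sbs(dirty_sb_list, budget):
    """Walk the bdi's dirty super block list once, calling
    write_metadata on each sb that implements it, until the budget
    is used up.  Returns the total number of bytes written."""
    wrote = 0
    pending = deque(dirty_sb_list)
    while pending and wrote < budget:
        sb = pending.popleft()
        if not sb.has_write_metadata:
            continue  # this filesystem has no metadata callback
        wrote += sb.write_metadata(budget - wrote)
    return wrote


sbs = [SuperBlock("fs-a", 8192),
       SuperBlock("fs-b", 4096, has_write_metadata=False),
       SuperBlock("fs-c", 16384)]
print(writeback_dirty_sbs(sbs, 12288))  # 12288
```

Super blocks without the callback are simply skipped, which is the same
shape as the !sb->s_op->write_metadata check in the patch.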

Signed-off-by: Josef Bacik 
---
 fs/fs-writeback.c| 72 
 fs/super.c   |  7 
 include/linux/backing-dev-defs.h |  2 ++
 include/linux/fs.h   |  4 +++
 mm/backing-dev.c |  2 ++
 5 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index aafdb11..8cd072e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1464,6 +1464,31 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
return pages;
 }
 
+static long writeback_sb_metadata(struct super_block *sb,
+ struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
+{
+   struct writeback_control wbc = {
+   .sync_mode  = work->sync_mode,
+   .tagged_writepages  = work->tagged_writepages,
+   .for_kupdate= work->for_kupdate,
+   .for_background = work->for_background,
+   .for_sync   = work->for_sync,
+   .range_cyclic   = work->range_cyclic,
+   .range_start= 0,
+   .range_end  = LLONG_MAX,
+   };
+   long write_chunk;
+
+   write_chunk = writeback_chunk_size(wb, work);
+   wbc.nr_to_write = write_chunk;
+   sb->s_op->write_metadata(sb, &wbc);
+   work->nr_pages -= write_chunk - wbc.nr_to_write;
+
+   return write_chunk - wbc.nr_to_write;
+}
+
+
 /*
  * Write a portion of b_io inodes which belong to @sb.
  *
@@ -1490,6 +1515,7 @@ static long writeback_sb_inodes(struct super_block *sb,
unsigned long start_time = jiffies;
long write_chunk;
long wrote = 0;  /* count both pages and inodes */
+   bool done = false;
 
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1606,12 +1632,18 @@ static long writeback_sb_inodes(struct super_block *sb,
 * background threshold and other termination conditions.
 */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+   if (!done && sb->s_op->write_metadata) {
+   spin_unlock(&wb->list_lock);
+   wrote += writeback_sb_metadata(sb, wb, work);
+   spin_lock(&wb->list_lock);
+   }
return wrote;
 }
 
@@ -1620,6 +1652,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 {
unsigned long start_time = jiffies;
long wrote = 0;
+   bool done = false;
 
while (!list_empty(&wb->b_io)) {
struct inode *inode = wb_inode(wb->b_io.prev);
@@ -1639,12 +1672,39 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 
/* refer to the same tests at the end of writeback_sb_inodes */
if (wrote) {
-   if (time_is_before_jiffies(start_time + HZ / 10UL))
-   break;
-   if (work->nr_pages <= 0)
+   if (time_is_before_jiffies(start_time + HZ / 10UL) ||
+   work->nr_pages <= 0) {
+   done = true;
break;
+   }
}
}
+
+   if (!done && wb_stat(wb, WB_METADATA_DIRTY_BYTES)) {
+   LIST_HEAD(list);
+
+   spin_unlock(&wb->list_lock);
+   spin_lock(&wb->bdi->sb_list_lock);
+   list_splice_init(&wb->bdi->dirty_sb_list, &list);
+   while (!list_empty(&list)) {
+   struct super_block *sb;
+
+   sb = list_first_entry(&list, struct super_block,
+ s_bdi_dirty_list);
+   list_move_tail(&sb->s_bdi_dirty_list,
+  &wb->bdi->dirty_sb_list);
+   if (!sb->s_op->write_metadata)
+   continue;
+

[PATCH 2/4] writeback: allow for dirty metadata accounting

2016-09-20 Thread Josef Bacik
Btrfs has no bounds except memory on the amount of dirty memory that we have in
use for metadata.  Historically we have used a special inode so we could take
advantage of the balance_dirty_pages throttling that comes with using pagecache.
However as we'd like to support different blocksizes it would be nice to not
have to rely on pagecache, but still get the balance_dirty_pages throttling
without having to do it ourselves.

So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
zone and bdi_writeback counters to keep track of how many bytes we have in
flight for METADATA.  We need to count in bytes as blocksizes could be
fractions of the page size.  We simply convert the bytes to a number of pages
where it is needed for the throttling.
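
As an illustration of why bytes are needed, consider a hypothetical setup
with 64 KiB pages and 16 KiB metadata blocks (both values are assumptions
for this example, as are the function names). Charging whole pages per
block either overcounts or loses the blocks entirely, while a byte counter
that is converted to pages only at throttling time stays close to the truth:

```python
PAGE_SIZE = 64 * 1024    # assumption: 64 KiB pages
BLOCK_SIZE = 16 * 1024   # assumption: 16 KiB metadata blocks


def dirty_pages_from_bytes(num_blocks):
    """Account in bytes, convert to whole pages only when needed."""
    dirty_bytes = num_blocks * BLOCK_SIZE
    return dirty_bytes // PAGE_SIZE


def dirty_pages_rounding_up(num_blocks):
    """Naive page accounting: charge one full page per dirty block."""
    return num_blocks * 1


def dirty_pages_rounding_down(num_blocks):
    """Naive page accounting: convert each block to pages first."""
    return num_blocks * (BLOCK_SIZE // PAGE_SIZE)


blocks = 10  # 10 * 16 KiB = 160 KiB = 2.5 pages
print(dirty_pages_from_bytes(blocks),
      dirty_pages_rounding_up(blocks),
      dirty_pages_rounding_down(blocks))  # 2 10 0
```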

Signed-off-by: Josef Bacik 
---
 arch/tile/mm/pgtable.c   |   3 +-
 drivers/base/node.c  |   6 ++
 fs/fs-writeback.c|   2 +
 fs/proc/meminfo.c|   5 ++
 include/linux/backing-dev-defs.h |   2 +
 include/linux/mm.h   |   9 +++
 include/linux/mmzone.h   |   2 +
 include/trace/events/writeback.h |  13 +++-
 mm/backing-dev.c |   5 ++
 mm/page-writeback.c  | 157 +++
 mm/page_alloc.c  |  16 +++-
 mm/vmscan.c  |   4 +-
 12 files changed, 200 insertions(+), 24 deletions(-)

diff --git a/arch/tile/mm/pgtable.c b/arch/tile/mm/pgtable.c
index 7cc6ee7..9543468 100644
--- a/arch/tile/mm/pgtable.c
+++ b/arch/tile/mm/pgtable.c
@@ -44,12 +44,13 @@ void show_mem(unsigned int filter)
 {
struct zone *zone;
 
-   pr_err("Active:%lu inactive:%lu dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
+   pr_err("Active:%lu inactive:%lu dirty:%lu metadata_dirty:%lu writeback:%lu unstable:%lu free:%lu\n slab:%lu mapped:%lu pagetables:%lu bounce:%lu pagecache:%lu swap:%lu\n",
   (global_node_page_state(NR_ACTIVE_ANON) +
global_node_page_state(NR_ACTIVE_FILE)),
   (global_node_page_state(NR_INACTIVE_ANON) +
global_node_page_state(NR_INACTIVE_FILE)),
   global_node_page_state(NR_FILE_DIRTY),
+  global_node_page_state(NR_METADATA_DIRTY),
   global_node_page_state(NR_WRITEBACK),
   global_node_page_state(NR_UNSTABLE_NFS),
   global_page_state(NR_FREE_PAGES),
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 5548f96..3615264 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -51,6 +51,8 @@ static DEVICE_ATTR(cpumap,  S_IRUGO, node_read_cpumask, NULL);
 static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL);
 
 #define K(x) ((x) << (PAGE_SHIFT - 10))
+#define BtoK(x) ((x) >> 10)
+
 static ssize_t node_read_meminfo(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -99,7 +101,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
n += sprintf(buf + n,
   "Node %d Dirty:  %8lu kB\n"
+  "Node %d MetadataDirty:  %8lu kB\n"
   "Node %d Writeback:  %8lu kB\n"
+  "Node %d MetaWriteback:  %8lu kB\n"
   "Node %d FilePages:  %8lu kB\n"
   "Node %d Mapped: %8lu kB\n"
   "Node %d AnonPages:  %8lu kB\n"
@@ -119,7 +123,9 @@ static ssize_t node_read_meminfo(struct device *dev,
 #endif
,
   nid, K(node_page_state(pgdat, NR_FILE_DIRTY)),
+  nid, BtoK(node_page_state(pgdat, NR_METADATA_DIRTY_BYTES)),
   nid, K(node_page_state(pgdat, NR_WRITEBACK)),
+  nid, BtoK(node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES)),
   nid, K(node_page_state(pgdat, NR_FILE_PAGES)),
   nid, K(node_page_state(pgdat, NR_FILE_MAPPED)),
   nid, K(node_page_state(pgdat, NR_ANON_MAPPED)),
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 56c8fda..aafdb11 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1801,6 +1801,7 @@ static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
return work;
 }
 
+#define BtoP(x) ((x) >> PAGE_SHIFT)
 /*
  * Add in the number of potentially dirty inodes, because each inode
  * write can dirty pagecache in the underlying blockdev.
@@ -1809,6 +1810,7 @@ static unsigned long get_nr_dirty_pages(void)
 {
return global_node_page_state(NR_FILE_DIRTY) +
global_node_page_state(NR_UNSTABLE_NFS) +
+   BtoP(global_node_page_state(NR_METADATA_DIRTY_BYTES)) +
get_nr_dirty_inodes();
 }
 
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 09e18fd..95b0d8a 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -36,6 +36,7 @@ static

[PATCH 1/4] remove mapping from balance_dirty_pages*()

2016-09-20 Thread Josef Bacik
The only reason we pass in the mapping is to get the inode in order to see if
cgroup writeback is enabled, and even then it only checks the bdi and a super
block flag.  balance_dirty_pages() doesn't even use the mapping.  Since
balance_dirty_pages*() works on a bdi level, just pass in the bdi and super
block directly so we can avoid using mapping.  This will allow us to still use
balance_dirty_pages for dirty metadata pages that are not backed by an
address_space.

Signed-off-by: Josef Bacik 
Reviewed-by: Jan Kara 
---
 drivers/mtd/devices/block2mtd.c | 12 
 fs/btrfs/disk-io.c  |  4 ++--
 fs/btrfs/file.c |  3 ++-
 fs/btrfs/ioctl.c|  3 ++-
 fs/btrfs/relocation.c   |  3 ++-
 fs/buffer.c |  3 ++-
 fs/iomap.c  |  3 ++-
 fs/ntfs/attrib.c| 10 +++---
 fs/ntfs/file.c  |  4 ++--
 include/linux/backing-dev.h | 29 +++--
 include/linux/writeback.h   |  3 ++-
 mm/filemap.c|  4 +++-
 mm/memory.c |  9 +++--
 mm/page-writeback.c | 15 +++
 14 files changed, 71 insertions(+), 34 deletions(-)

diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 7c887f1..7892d0b 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -52,7 +52,8 @@ static struct page *page_read(struct address_space *mapping, int index)
 /* erase a specified part of the device */
 static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
 {
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
struct page *page;
int index = to >> PAGE_SHIFT;   // page index
int pages = len >> PAGE_SHIFT;
@@ -71,7 +72,8 @@ static int _block2mtd_erase(struct block2mtd_dev *dev, loff_t to, size_t len)
memset(page_address(page), 0xff, PAGE_SIZE);
set_page_dirty(page);
unlock_page(page);
-   balance_dirty_pages_ratelimited(mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
break;
}
 
@@ -141,7 +143,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
loff_t to, size_t len, size_t *retlen)
 {
struct page *page;
-   struct address_space *mapping = dev->blkdev->bd_inode->i_mapping;
+   struct inode *inode = dev->blkdev->bd_inode;
+   struct address_space *mapping = inode->i_mapping;
int index = to >> PAGE_SHIFT;   // page index
int offset = to & ~PAGE_MASK;   // page offset
int cpylen;
@@ -162,7 +165,8 @@ static int _block2mtd_write(struct block2mtd_dev *dev, const u_char *buf,
memcpy(page_address(page) + offset, buf, cpylen);
set_page_dirty(page);
unlock_page(page);
-   balance_dirty_pages_ratelimited(mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
}
put_page(page);
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 87dad55..4034ad6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4024,8 +4024,8 @@ static void __btrfs_btree_balance_dirty(struct btrfs_root *root,
ret = percpu_counter_compare(&root->fs_info->dirty_metadata_bytes,
 BTRFS_DIRTY_METADATA_THRESH);
if (ret > 0) {
-   balance_dirty_pages_ratelimited(
-  root->fs_info->btree_inode->i_mapping);
+   balance_dirty_pages_ratelimited(&root->fs_info->bdi,
+   root->fs_info->sb);
}
 }
 
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 9404121..f060b08 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1686,7 +1686,8 @@ again:
 
cond_resched();
 
-   balance_dirty_pages_ratelimited(inode->i_mapping);
+   balance_dirty_pages_ratelimited(inode_to_bdi(inode),
+   inode->i_sb);
if (dirty_pages < (root->nodesize >> PAGE_SHIFT) + 1)
btrfs_btree_balance_dirty(root);
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 14ed1e9..a222bad 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1410,7 +1410,8 @@ int btrfs_defrag_file(struct inode *inode, struct file *file,
}
 
defrag_count += ret;
-   

Re: Post ext3 conversion problems

2016-09-20 Thread Sean Greenslade
On Tue, Sep 20, 2016 at 01:02:38PM +0800, Qu Wenruo wrote:
> > Glad to hear you've found the core of the issue.
> > 
> > At this point, I can trigger it immediately. As soon as I log in and run
> > dmenu, it will attempt to rebuild its cache file (small text file that's
> > just a list of all executables in the PATH). Once that write happens,
> > the bug triggers and the fs goes read only.
> 
> Rewrite? Or write into new inode?
> 
> And is the same inode always causing the problem?

It's not always the same. It seems like whatever triggers a write first
is what kills it. I went to test it, and this time it triggered on my
.bash_history file. I have bash set up with "history -a", so presumably
that was an append, not an overwrite.

To cut down on the number of variables, I booted my system with the
"rescue" systemd target, then su'd to my user. Simply running a few
commands (with the history -a writes that bash triggered) was enough to
trigger the bug. This is on 4.8.0-rc6, with the following compile time
options enabled:

CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y
CONFIG_BTRFS_DEBUG=y
CONFIG_BTRFS_ASSERT=y

If I run the stock Arch kernel (4.7.2 at the moment), the issue still
appears, but it takes longer. My most reliable trigger is Firefox, whose
constant DB writes will trigger it within minutes.

--Sean

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ChaCha20 vs. AES performance

2016-09-20 Thread Alex Elsayed
On Tue, 20 Sep 2016 07:51:52 -0800, Kent Overstreet wrote:

> On Tue, Sep 20, 2016 at 10:23:20AM -0400, Theodore Ts'o wrote:
>> On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote:
>> > Not on the list or I would've replied directly, but on Haswell,
>> > ChaCha20 (in software) is over 2x as fast as AES (in hardware), at
>> > realistic (for a filesystem) block sizes:

Apologies if this doesn't CC you - replying via gmane, since (not being 
subscribed via email either) I can't try the same trick I did to include 
Ted (i.e., reply via my mail client).

One useful trick, though - if you have a Usenet client, gmane _will_ let 
you reply directly, even to old messages. That's what I'm doing.

>> On Skylake and Broadwell processors, AES is faster (the posting is from
>> a ChaCha20 enthusiast):
>> 
>>  https://blog.cloudflare.com/it-takes-two-to-chacha-poly/
> 
> The performance delta in his graphs isn't near as big as what I've
> measured, which makes me suspect OpenSSL's ChaCha20 implementation isn't
> nearly as fast as the kernel's.
> 
>> My big worry though is that schemes that require that nonces/IV's must
>> **never** be reused are fragile.  It's for the same reason that DSA
>> makes my skin crawl.  If you ever screw up --- maybe after a crash, or
>> a file system bug, you end up reusing a nonce, it's game over.
>> 
>> So if there are hardware solutions which are faster or fast enough that
>> the crypto is no longer dominant cost, why not use a cipher scheme
>> which is more robust?
> 
> Block ciphers have their own downsides though - XTS is really a big pile
> of hacks and workarounds. On the whole, if you can get nonces right, a
> stream cipher cryptosystem (and ChaCha20 especially) is on the whole
> drastically simpler, and thus easier to understand and audit.

Yes, I would entirely agree with your assessment of XTS (in particular, 
the doubling of the length of the key is rooted in the original authors 
misunderstanding the XEX paper...).

> And if you can do nonces correctly, ChaCha20/Poly1305 is pretty much one
> of the gold standards - it's secure against pretty much any vaguely
> realistic threat model. XTS, not so much - it's just the best you can do
> given the constraints of typical disk crypto. The gold standards of
> encryption today are the AEADs - and AES/GCM fails badly with nonce
> reuse too, there aren't any AEADs yet that don't fail badly with nonce
> reuse.

Not true - SIV is a generic construction, which has been applied to AES 
(AES-SIV, RFC 5297) and ChaCha20 (HS1-SIV, submitted to CAESAR). There's 
also AES-GCM-SIV, which takes advantage of GCM hardware acceleration as 
well as AES acceleration.
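
To make the nonce-reuse concern concrete, here is a toy sketch in Python. It does not use ChaCha20 or AES: the keystream comes from SHA-256 in counter mode and the synthetic IV from HMAC-SHA256, both stand-ins chosen so the example needs only the standard library. All function names are made up for illustration, and real SIV (RFC 5297) additionally handles associated data and uses different primitives.

```python
import hashlib
import hmac


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256(key || nonce || counter). NOT ChaCha20."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "little")).digest()
        counter += 1
    return bytes(out[:length])


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def stream_encrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR with the keystream; the same call also decrypts."""
    return xor(data, keystream(key, nonce, len(data)))


def siv_encrypt(mac_key: bytes, enc_key: bytes, plaintext: bytes):
    """SIV-style: the IV is a MAC of the plaintext, so there is no counter to reuse."""
    iv = hmac.new(mac_key, plaintext, hashlib.sha256).digest()[:16]
    return iv, stream_encrypt(enc_key, iv, plaintext)


def siv_decrypt(mac_key: bytes, enc_key: bytes, iv: bytes, ciphertext: bytes) -> bytes:
    plaintext = stream_encrypt(enc_key, iv, ciphertext)
    expected = hmac.new(mac_key, plaintext, hashlib.sha256).digest()[:16]
    if not hmac.compare_digest(iv, expected):
        raise ValueError("authentication failed")
    return plaintext


key = b"k" * 32
mac_key = b"m" * 32
p1 = b"block A contents"
p2 = b"block B contents"

# Plain stream cipher with a reused nonce: the keystream cancels out,
# so XORing the two ciphertexts yields the XOR of the two plaintexts.
c1 = stream_encrypt(key, b"same-nonce", p1)
c2 = stream_encrypt(key, b"same-nonce", p2)
leaked = xor(c1, c2)
print("nonce reuse leaks p1 XOR p2:", leaked == xor(p1, p2))

# SIV-style: different plaintexts derive different IVs automatically.
iv1, s1 = siv_encrypt(mac_key, key, p1)
iv2, s2 = siv_encrypt(mac_key, key, p2)
print("SIV IVs differ:", iv1 != iv2)
```

With this construction, encrypting the same plaintext twice produces the same IV and ciphertext, so the only thing revealed is that two blocks are equal, rather than the XOR of their contents.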



Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Alexandre Poux


On 20/09/2016 at 21:46, Chris Murphy wrote:
> On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux wrote:
>>
>> On 20/09/2016 at 21:11, Chris Murphy wrote:
>>> And no backup? Umm, I'd resolve that sooner than anything else.
>> Yeah you are absolutely right, this was a temporary solution which came
>> to be not that temporary.
>> And I regret it already...
> Well on the bright side, if this were LVM or mdadm linear/concat
> array, the whole thing would be toast because any other file system
> would have lost too much fs metadata on the missing device.
>
>>>  It
>>> should be true that it'll tolerate a read only mount indefinitely, but
>>> read write? Not sure. This sort of edge case isn't well tested at all
>>> seeing as it required changing the kernel to reduce safeguards. So
>>> all bets are off: the whole thing could become unmountable, not even
>>> read only, and then it's a scraping job.
>> I'm not that crazy, I tried the patch inside a virtual machine on
>> virtual drives...
>> And since it's only virtual, it may not work on the real partition...
> Are you sure the virtual setup lacked a CHUNK_ITEM on the missing
> device? That might be what pinned it in that case.
In fact, in my virtual setup there were more chunks missing (1 metadata,
1 system, and 1 data).
I will try to build a setup closer to my real one.
> You could try some sort of overlay for your remaining drives.
> Something like this:
> https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file
>
> Make sure you understand the gotcha about cloning which applies here:
> https://btrfs.wiki.kernel.org/index.php/Gotchas
>
> I think it's safe to use blockdev --setro on every real device  you're
> trying to protect from changes. And when mounting you'll at least need
> to use device= mount option to explicitly mount each of the overlay
> devices. Based on the wiki, I'm wincing, I don't really know for sure
> if device mount option is enough to compel Btrfs to only use those
> devices and not go off the rails and still use one of the real
> devices, but at least if they're setro it won't matter (the mount will
> just fail somehow due to write failures).
>
> So now you can try removing the missing device... and see what
> happens. You could inspect the overlay files and see what changes were
> made.
Wow, that looks nice.
So, if it works, and if we find a way to fix the filesystem inside the VM,
I can use this over the real partition to check whether it works before
trying the fix for real.
Nice idea.
>>> What do you get for btrfs-debug-tree -t 3 
>>>
>>> That should show the chunk tree, and what I'm wondering is if the
>>> chunk tree has any references to chunks on the missing device. Even if
>>> there are no extents on that device, if there are chunks, that might
>>> be one of the safeguards.
>>>
>> You'll find it attached.
>> The missing device is the devid 8 (since it's the only one missing in
>> btrfs fi show)
>> I found it only once line 63
> Yeah bummer. Not used for system, data, or metadata chunks at all.
>
>




Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 1:54 PM, Alexandre Poux wrote:
>
> OK, good idea, but to be able to do that, I have to use the patch that
> allow me to mount the partition in rw, otherwise I won't be able to
> shrink it I suppose..
> And even with the patch I'm not sure that I won't get an IO error the
> same way I get it when I try to remove the device.
> I will try it on my virtual machine.

The shrink itself is pretty trivial in that it's just moving block
groups around if necessary. It's part of the balance code, and there's
not much metadata being changed: just CoW the block groups, and then
update the chunk tree and supers. It is trickier when it comes to
either partition map changes while the fs is still mounted, or doing
it the way I was describing, by deleting one of the present devices, in
which case you can then just use that now-empty partition as a starter
for a new file system.

It's a catch-22 either way.

Note that by default if you don't specify a devid for shrink, it's
only resizing devid1.
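
As a small illustration of the devid point, the sketch below builds a `btrfs filesystem resize` command line with an explicit devid prefix. The mount point `/mnt`, the size `-10g`, devid 2, and the helper name are hypothetical; only the `devid:size` argument form comes from the btrfs tool. It prints the command rather than running it.

```python
import shlex


def resize_command(mountpoint, size, devid=None):
    """Build argv for `btrfs filesystem resize`.

    Without a devid prefix the tool acts on devid 1, so on a multi-device
    filesystem the devid should normally be given explicitly.
    """
    spec = size if devid is None else f"{devid}:{size}"
    return ["btrfs", "filesystem", "resize", spec, mountpoint]


# Hypothetical values: shrink devid 2 by 10 GiB on a filesystem mounted at /mnt.
print(shlex.join(resize_command("/mnt", "-10g", devid=2)))

# No devid given: this form would only touch devid 1.
print(shlex.join(resize_command("/mnt", "-10g")))
```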



-- 
Chris Murphy


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 1:43 PM, Austin S. Hemmelgarn wrote:

> First off, as Chris said, if you can read the data and don't already have a
> backup, that should be your first priority.  This really is an edge case
> that's not well tested, and the kernel technically doesn't officially
> support it.
>
> Now, beyond that and his suggestions, there's another option, but it's
> risky, so I wouldn't even think about trying it without a backup (unless of
> course you can trivially regenerate the data).  Multiple devices support and
> online resizing allows for a rather neat trick to regenerate a filesystem in
> place.  The process is pretty simple:
> 1. Shrink the existing filesystem down to the minimum size possible.
> 2. Create a new partition in the free space, and format it as a temporary
> BTRFS filesystem.  Ideally, this FS should be mixed mode, and ideally single
> profile.  If you don't have much free space, you can use a flash drive to
> start this temporary filesystem instead.
> 3. Start copying files from the old filesystem to the temporary one.
> 4. Once the new filesystem is about 95% full, stop copying, shrink the old
> filesystem again, create a new partition, and add that partition to the
> temporary filesystem.
> 5. Repeat steps 3-4 until you have everything off of the old filesystem.
> 6. Re-format the remaining portion of the old filesystem using the
> parameters you want for the replacement filesystem.
> 7. Start copying files from the temporary filesystem to the new filesystem.
> 8. As you empty out each temporary partition, remove it from the temporary
> filesystem, delete the partition, and expand the new filesystem.
>
> This takes a while, and is only safe if you have reliable hardware, but I've
> done it before and it works reliably as long as you don't have many big
> files on the old filesystem (things can get complicated if you do). The
> other negative aspect is that if you aren't careful, it's possible to get
> stuck half-way, but in such a case, adding a flash drive to the temporary
> filesystem can usually give you enough extra space to get things unstuck.

Yes I thought of this also.

Gotcha is that he'll need to apply the patch that allows degraded rw
mounts with a device missing on the actual computer with these drives.
He tested that patch in a VM with virtual devices.

What might be easier is just 'btrfs dev rm /dev/sda6' because that one
has the least amount of data on it:

devid   12 size 728.32GiB used 312.03GiB path /dev/sda6

which should fit on all remaining devices. But, does Btrfs get pissed
at some point that there's this missing device it might want to write
to? I have no idea to what degree this patched kernel permits a lot of
degraded writing.

The other quandary is the file system will do online shrink, but the
kernel can sometimes get pissy about partition map changes on devices
with active volumes, even when using partprobe to update the kernel's
idea of the partition map.

-- 
Chris Murphy


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Alexandre Poux


On 20/09/2016 at 21:43, Austin S. Hemmelgarn wrote:
> On 2016-09-20 14:53, Alexandre Poux wrote:
>>
>>
>> On 20/09/2016 at 20:38, Chris Murphy wrote:
>>> On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote:
>>>>
>>>> On 20/09/2016 at 19:54, Chris Murphy wrote:
>>>>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>>>>>
>>>>>> If I wanted to try to edit my partitions with a hex editor,
>>>>>> where would I find info on how to do that?
>>>>>> I really don't want to go this way, but if this is relatively
>>>>>> simple, it may be worth trying.
>>>>> Simple is relative. First you'd need
>>>>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
>>>>> understanding of where things are to edit, and then btrfs-map-logical
>>>>> to convert btrfs logical addresses to physical device and sector to
>>>>> know what to edit.
>>>>>
>>>>> I'd call it distinctly non-trivial and very tedious.
>>>>>
>>>> OK, another idea:
>>>> would it be possible to trick btrfs with a manufactured file that the
>>>> disk is present while it isn't?
>>>>
>>>> I mean, looking for a few minutes on the hexdump of my trivial test
>>>> partition, header of members of btrfs array seems very alike.
>>>> So maybe, I can make a file which would have enough header to make
>>>> btrfs believe that this is my device, and then remove it as usual
>>>> looks like a long shot, but it doesn't hurt to ask
>>> There may be another test that applies to single profiles, that
>>> disallows dropping a device. I think that's the place to look next.
>>> The superblock is easy to copy, but you'll need the device specific
>>> UUID which should be locatable with btrfs-show-super -f for each
>>> devid. The bigger problem is that Btrfs at mount time doesn't just
>>> look at the superblock and then mount. It actually reads parts of each
>>> tree, the extent of which I don't know. And it's doing a bunch of
>>> sanity tests as it reads those things, including transid (generation).
>>> So I'm not sure how easily spoofable a fake device is going to be.
>>> As a practical matter, migrate it to a new volume is faster and more
>>> reliable. Unfortunately, the inability to mount it read write is going
>>> to prevent you from making read only snapshots to use with btrfs
>>> send/receive. What might work, is find out what on-disk modification
>>> btrfs-tune does to make a device a read-only seed. Again your volume
>>> is missing a device so btrfs-tune won't let you modify it. But if you
>>> could force that to happen, it's probably a very minor change to
>>> metadata on each device, maybe it'll act like a seed device when you
>>> next mount it, in which case you'll be able to add a device and
>>> remount it read write and then delete the seed causing migration of
>>> everything that does remain on the volume over to the new device. I've
>>> never tried anything like this so I have no idea if it'll work. And
>>> even in the best case I haven't tried a multiple device seed going to
>>> a single device sprout (is it even allowed when removing the seed?).
>>> So...more questions than answers.
>>>
>> Sorry if I wasn't clear, but with the patch mentioned earlier, I can
>> get a read write mount.
>> What I can't do is remove the device.
>> As for moving data to an another volume, since it's only data and
>> nothing fancy (no subvolume or anything), a simple rsync would do the
>> trick.
>> My problem in this case is that I don't have enough available space
>> elsewhere to move my data.
>> That's why I'm trying this hard to recover the partition...
> First off, as Chris said, if you can read the data and don't already
> have a backup, that should be your first priority.  This really is an
> edge case that's not well tested, and the kernel technically doesn't
> officially support it.
>
> Now, beyond that and his suggestions, there's another option, but it's
> risky, so I wouldn't even think about trying it without a backup
> (unless of course you can trivially regenerate the data).  Multiple
> devices support and online resizing allows for a rather neat trick to
> regenerate a filesystem in place.  The process is pretty simple:
> 1. Shrink the existing filesystem down to the minimum size possible.
> 2. Create a new partition in the free space, and format it as a
> temporary BTRFS filesystem.  Ideally, this FS should be mixed mode,
> and ideally single profile.  If you don't have much free space, you
> can use a flash drive to start this temporary filesystem instead.
> 3. Start copying files from the old filesystem to the temporary one.
> 4. Once the new filesystem is about 95% full, stop copying, shrink the
> old filesystem again, create a new partition, and add that partition
> to the temporary filesystem.
> 5. Repeat steps 3-4 until you have everything off of the old filesystem.
> 6. Re-format the remaining portion of the old filesystem using the
> parameters you want for the replacement filesystem.
> 7. Start copyin

Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 1:31 PM, Alexandre Poux wrote:
>
>
> On 20/09/2016 at 21:11, Chris Murphy wrote:

>> And no backup? Umm, I'd resolve that sooner than anything else.
> Yeah you are absolutely right, this was a temporary solution which came
> to be not that temporary.
> And I regret it already...

Well on the bright side, if this were LVM or mdadm linear/concat
array, the whole thing would be toast because any other file system
would have lost too much fs metadata on the missing device.

>>  It
>> should be true that it'll tolerate a read only mount indefinitely, but
>> read write? Not sure. This sort of edge case isn't well tested at all
>> seeing as it required changing the kernel to reduce safeguards. So
>> all bets are off: the whole thing could become unmountable, not even
>> read only, and then it's a scraping job.
> I'm not that crazy, I tried the patch inside a virtual machine on
> virtual drives...
> And since it's only virtual, it may not work on the real partition...

Are you sure the virtual setup lacked a CHUNK_ITEM on the missing
device? That might be what pinned it in that case.

You could try some sort of overlay for your remaining drives.
Something like this:
https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

Make sure you understand the gotcha about cloning which applies here:
https://btrfs.wiki.kernel.org/index.php/Gotchas

I think it's safe to use blockdev --setro on every real device  you're
trying to protect from changes. And when mounting you'll at least need
to use device= mount option to explicitly mount each of the overlay
devices. Based on the wiki, I'm wincing, I don't really know for sure
if device mount option is enough to compel Btrfs to only use those
devices and not go off the rails and still use one of the real
devices, but at least if they're setro it won't matter (the mount will
just fail somehow due to write failures).

So now you can try removing the missing device... and see what
happens. You could inspect the overlay files and see what changes were
made.

>> What do you get for btrfs-debug-tree -t 3 
>>
>> That should show the chunk tree, and what I'm wondering is if the
>> chunk tree has any references to chunks on the missing device. Even if
>> there are no extents on that device, if there are chunks, that might
>> be one of the safeguards.
>>
> You'll find it attached.
> The missing device is the devid 8 (since it's the only one missing in
> btrfs fi show)
> I found it only once line 63

Yeah bummer. Not used for system, data, or metadata chunks at all.


-- 
Chris Murphy


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Austin S. Hemmelgarn

On 2016-09-20 14:53, Alexandre Poux wrote:
>
>
> On 20/09/2016 at 20:38, Chris Murphy wrote:
>> On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote:
>>>
>>> On 20/09/2016 at 19:54, Chris Murphy wrote:
>>>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>>>>
>>>>> If I wanted to try to edit my partitions with a hex editor, where would
>>>>> I find info on how to do that?
>>>>> I really don't want to go this way, but if this is relatively simple, it
>>>>> may be worth trying.
>>>> Simple is relative. First you'd need
>>>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
>>>> understanding of where things are to edit, and then btrfs-map-logical
>>>> to convert btrfs logical addresses to physical device and sector to
>>>> know what to edit.
>>>>
>>>> I'd call it distinctly non-trivial and very tedious.
>>>>
>>> OK, another idea:
>>> would it be possible to trick btrfs with a manufactured file that the
>>> disk is present while it isn't?
>>>
>>> I mean, looking for a few minutes on the hexdump of my trivial test
>>> partition, header of members of btrfs array seems very alike.
>>> So maybe, I can make a file which would have enough header to make btrfs
>>> believe that this is my device, and then remove it as usual
>>> looks like a long shot, but it doesn't hurt to ask
>> There may be another test that applies to single profiles, that
>> disallows dropping a device. I think that's the place to look next.
>> The superblock is easy to copy, but you'll need the device specific
>> UUID which should be locatable with btrfs-show-super -f for each
>> devid. The bigger problem is that Btrfs at mount time doesn't just
>> look at the superblock and then mount. It actually reads parts of each
>> tree, the extent of which I don't know. And it's doing a bunch of
>> sanity tests as it reads those things, including transid (generation).
>> So I'm not sure how easily spoofable a fake device is going to be.
>> As a practical matter, migrate it to a new volume is faster and more
>> reliable. Unfortunately, the inability to mount it read write is going
>> to prevent you from making read only snapshots to use with btrfs
>> send/receive. What might work, is find out what on-disk modification
>> btrfs-tune does to make a device a read-only seed. Again your volume
>> is missing a device so btrfs-tune won't let you modify it. But if you
>> could force that to happen, it's probably a very minor change to
>> metadata on each device, maybe it'll act like a seed device when you
>> next mount it, in which case you'll be able to add a device and
>> remount it read write and then delete the seed causing migration of
>> everything that does remain on the volume over to the new device. I've
>> never tried anything like this so I have no idea if it'll work. And
>> even in the best case I haven't tried a multiple device seed going to
>> a single device sprout (is it even allowed when removing the seed?).
>> So...more questions than answers.
>>
> Sorry if I wasn't clear, but with the patch mentioned earlier, I can
> get a read write mount.
> What I can't do is remove the device.
> As for moving data to an another volume, since it's only data and
> nothing fancy (no subvolume or anything), a simple rsync would do the trick.
> My problem in this case is that I don't have enough available space
> elsewhere to move my data.
> That's why I'm trying this hard to recover the partition...

First off, as Chris said, if you can read the data and don't already
have a backup, that should be your first priority.  This really is an 
edge case that's not well tested, and the kernel technically doesn't 
officially support it.


Now, beyond that and his suggestions, there's another option, but it's 
risky, so I wouldn't even think about trying it without a backup (unless 
of course you can trivially regenerate the data).  Multiple devices 
support and online resizing allows for a rather neat trick to regenerate 
a filesystem in place.  The process is pretty simple:

1. Shrink the existing filesystem down to the minimum size possible.
2. Create a new partition in the free space, and format it as a
temporary BTRFS filesystem.  Ideally, this FS should be mixed mode, and
ideally single profile.  If you don't have much free space, you can use
a flash drive to start this temporary filesystem instead.
3. Start copying files from the old filesystem to the temporary one.
4. Once the new filesystem is about 95% full, stop copying, shrink the
old filesystem again, create a new partition, and add that partition to
the temporary filesystem.
5. Repeat steps 3-4 until you have everything off of the old filesystem.
6. Re-format the remaining portion of the old filesystem using the
parameters you want for the replacement filesystem.
7. Start copying files from the temporary filesystem to the new filesystem.
8. As you empty out each temporary partition, remove it from the
temporary filesystem, delete the partition, and expand the new filesystem.
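
The loop in steps 1-5 can be modelled roughly in Python to estimate how many shrink-and-copy passes a given amount of free space implies. This is a back-of-the-envelope sketch, not something from the thread: the function name, the single-device assumption, and the idea that each pass frees exactly what was copied are all simplifications, and metadata overhead and file granularity are ignored.

```python
def plan_migration(device_size, used, fill_limit=0.95, max_passes=10_000):
    """Model the shrink/copy loop on one device.

    Returns a list of (copied, temp_capacity) tuples, one per pass.
    Units are arbitrary (for example GiB).
    """
    if used > fill_limit * device_size:
        raise ValueError("data would not fit in the temporary filesystem "
                         "at this fill limit; extra space is needed")
    old_used = float(used)
    temp_used = 0.0
    passes = []
    while old_used > 1e-9:
        if len(passes) >= max_passes:
            raise RuntimeError("migration did not converge")
        # Steps 1, 2 and 4: shrink the old filesystem to its remaining data
        # and hand everything else on the device to the temporary filesystem.
        temp_capacity = device_size - old_used
        # Step 3: copy until the temporary filesystem reaches the fill limit.
        copied = min(old_used, fill_limit * temp_capacity - temp_used)
        if copied <= 0:
            raise RuntimeError("stuck: no room left to copy into")
        old_used -= copied
        temp_used += copied
        passes.append((copied, temp_capacity))
    return passes


# Hypothetical numbers: a 1000 GiB device holding 900 GiB of data.
for number, (copied, capacity) in enumerate(plan_migration(1000, 900), start=1):
    print(f"pass {number}: copy {copied:.1f}, temporary fs capacity {capacity:.1f}")
```

In this model each pass moves a little less than the one before, because only the space just vacated becomes available, which is one way to see why a nearly full filesystem needs many passes or a borrowed flash drive.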


This takes a while, and is only safe if you have reliable hardware, but 
I've done it before and it works reliably as long as you don't have many 
big files on the old filesys

Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 12:53 PM, Alexandre Poux wrote:
>
>
> On 20/09/2016 at 20:38, Chris Murphy wrote:
>> On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote:
>>>
>>> On 20/09/2016 at 19:54, Chris Murphy wrote:
>>>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>>>>
>>>>> If I wanted to try to edit my partitions with a hex editor, where would
>>>>> I find info on how to do that?
>>>>> I really don't want to go this way, but if this is relatively simple, it
>>>>> may be worth trying.
>>>> Simple is relative. First you'd need
>>>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
>>>> understanding of where things are to edit, and then btrfs-map-logical
>>>> to convert btrfs logical addresses to physical device and sector to
>>>> know what to edit.
>>>>
>>>> I'd call it distinctly non-trivial and very tedious.
>>>>
>>> OK, another idea:
>>> would it be possible to trick btrfs with a manufactured file that the
>>> disk is present while it isn't?
>>>
>>> I mean, looking for a few minutes on the hexdump of my trivial test
>>> partition, header of members of btrfs array seems very alike.
>>> So maybe, I can make a file which would have enough header to make btrfs
>>> believe that this is my device, and then remove it as usual
>>> looks like a long shot, but it doesn't hurt to ask
>> There may be another test that applies to single profiles, that
>> disallows dropping a device. I think that's the place to look next.
>> The superblock is easy to copy, but you'll need the device specific
>> UUID which should be locatable with btrfs-show-super -f for each
>> devid. The bigger problem is that Btrfs at mount time doesn't just
>> look at the superblock and then mount. It actually reads parts of each
>> tree, the extent of which I don't know. And it's doing a bunch of
>> sanity tests as it reads those things, including transid (generation).
>> So I'm not sure how easily spoofable a fake device is going to be.
>> As a practical matter, migrate it to a new volume is faster and more
>> reliable. Unfortunately, the inability to mount it read write is going
>> to prevent you from making read only snapshots to use with btrfs
>> send/receive. What might work, is find out what on-disk modification
>> btrfs-tune does to make a device a read-only seed. Again your volume
>> is missing a device so btrfs-tune won't let you modify it. But if you
>> could force that to happen, it's probably a very minor change to
>> metadata on each device, maybe it'll act like a seed device when you
>> next mount it, in which case you'll be able to add a device and
>> remount it read write and then delete the seed causing migration of
>> everything that does remain on the volume over to the new device. I've
>> never tried anything like this so I have no idea if it'll work. And
>> even in the best case I haven't tried a multiple device seed going to
>> a single device sprout (is it even allowed when removing the seed?).
>> So...more questions than answers.
>>
> Sorry if I wasn't clear, but with the patch mentioned earlier, I can
> get a read write mount.
> What I can't do is remove the device.
> As for moving data to an another volume, since it's only data and
> nothing fancy (no subvolume or anything), a simple rsync would do the trick.
> My problem in this case is that I don't have enough available space
> elsewhere to move my data.
> That's why I'm trying this hard to recover the partition...

And no backup? Umm, I'd resolve that sooner than anything else. It
should be true that it'll tolerate a read only mount indefinitely, but
read write? Not sure. This sort of edge case isn't well tested at all
seeing as it required changing the kernel to reduce safeguards. So
all bets are off: the whole thing could become unmountable, not even
read only, and then it's a scraping job.

I think what you want to do here is reasonable, there's no missing
data on the missing device. If the device were present and you deleted
it, Btrfs would presumably have nothing to migrate, it'd just shrink
the fs, update all supers, wipe the signatures off the device being
removed, that's it. So there's some safeguard in place that's
disallowing the remove missing in this case even though there's no
data or metadata to migrate off the drive.

In another thread about clusters and planned data loss, I describe how
this functionality has a practical real world benefit other than your
particular situation. So it would be nice if it were possible, but I
can't tell you what the safeguard is that's preventing it from being
removed, or if it's even just one safeguard.

What do you get for btrfs-debug-tree -t 3 

That should show the chunk tree, and what I'm wondering is if the
chunk tree has any references to chunks on the missing device. Even if
there are no extents on that device, if there are chunks, that might
be one of the safeguards.
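
The check described above can be sketched as a small Python script that counts how many chunk stripes reference each devid. The sample text is an illustrative imitation of chunk-tree output, not real output from this filesystem, and the exact layout printed by btrfs-debug-tree differs between btrfs-progs versions, so the regular expressions are assumptions to adapt.

```python
import re
from collections import Counter

# Illustrative sample only; real btrfs-debug-tree output may be laid out differently.
SAMPLE = """\
item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
    devid 1 total_bytes 1000204886016 bytes_used 335007449088
item 1 key (DEV_ITEMS DEV_ITEM 8) itemoff 16087 itemsize 98
    devid 8 total_bytes 500107862016 bytes_used 0
item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 16007 itemsize 80
    chunk length 8388608 owner 2 type SYSTEM num_stripes 1
        stripe 0 devid 1 offset 20971520
"""

DEV_ITEM_RE = re.compile(r"key \(DEV_ITEMS DEV_ITEM (\d+)\)")
STRIPE_RE = re.compile(r"stripe \d+ devid (\d+)")


def summarize(dump):
    """Return (set of devids with a DEV_ITEM, Counter of stripes per devid)."""
    dev_items = set()
    stripes = Counter()
    for line in dump.splitlines():
        match = DEV_ITEM_RE.search(line)
        if match:
            dev_items.add(int(match.group(1)))
        match = STRIPE_RE.search(line)
        if match:
            stripes[int(match.group(1))] += 1
    return dev_items, stripes


dev_items, stripes = summarize(SAMPLE)
for devid in sorted(dev_items):
    # A device with a DEV_ITEM but zero stripes holds no chunks.
    print(f"devid {devid}: {stripes[devid]} chunk stripe(s)")
```

To use it on a real dump, the saved tool output would be read from a file and passed to `summarize` in place of `SAMPLE`.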


-- 
Chris Murphy

Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Alexandre Poux
On 20/09/2016 at 20:56, Austin S. Hemmelgarn wrote:
> On 2016-09-20 13:54, Chris Murphy wrote:
>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>>
>>> If I wanted to try to edit my partitions with a hex editor, where
>>> would I find info on how to do that?
>>> I really don't want to go this way, but if this is relatively
>>> simple, it may be worth trying.
>>
>> Simple is relative. First you'd need
>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
>> understanding of where things are to edit, and then btrfs-map-logical
>> to convert btrfs logical addresses to physical device and sector to
>> know what to edit.
>>
>> I'd call it distinctly non-trivial and very tedious.
>>
> It really is.  I've done this before, but I had a copy of the on-disk
> format documentation, a couple of working filesystems, a full copy of
> the current kernel sources for reference, and about 8 cups of green
> tea (my beverage of choice for staying awake and focused).  I got
> _really_ lucky and it was something that really was simple to fix once
> I found it (it amounted to about 64 bytes of changes, it took me maybe
> 5 minutes to actually correct the issue once I found where it was),
> but it took me a good couple of hours to figure out what to even look
> for, plus another hour just to find it, and I'm not sure I would be
> able to do it any faster if I had to again (unlike doing so for ext4,
> which is a walk in the park by comparison).
>
> TBH the only thing I'd worry about using a hex editor to fix in BTRFS
> is the super-blocks or system chunks, because they're pretty easy to
> find, and usually not all that hard to fix.  In fact, if it hadn't
> been for the fact that I had no backup of the data I would lose by
> recreating that filesystem, and I was _really_ bored that day, I
> probably wouldn't have even tried.
OK I will forget this.
Thank you



Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Austin S. Hemmelgarn

On 2016-09-20 13:54, Chris Murphy wrote:
> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>
>> If I wanted to try to edit my partitions with an hex editor, where would
>> I find infos on how to do that ?
>> I really don't want to go this way, but if this is relatively simple, it
>> may be worth to try.
>
> Simple is relative. First you'd need
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
> understanding of where things are to edit, and then btrfs-map-logical
> to convert btrfs logical addresses to physical device and sector to
> know what to edit.
>
> I'd call it distinctly non-trivial and very tedious.

It really is.  I've done this before, but I had a copy of the on-disk 
format documentation, a couple of working filesystems, a full copy of 
the current kernel sources for reference, and about 8 cups of green tea 
(my beverage of choice for staying awake and focused).  I got _really_ 
lucky and it was something that really was simple to fix once I found it 
(it amounted to about 64 bytes of changes, it took me maybe 5 minutes to 
actually correct the issue once I found where it was), but it took me a 
good couple of hours to figure out what to even look for, plus another 
hour just to find it, and I'm not sure I would be able to do it any 
faster if I had to again (unlike doing so for ext4, which is a walk in 
the park by comparison).


TBH the only thing I'd worry about using a hex editor to fix in BTRFS is 
the super-blocks or system chunks, because they're pretty easy to find, 
and usually not all that hard to fix.  In fact, if it hadn't been for 
the fact that I had no backup of the data I would lose by recreating 
that filesystem, and I was _really_ bored that day, I probably wouldn't 
have even tried.



Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Alexandre Poux


On 20/09/2016 at 20:38, Chris Murphy wrote:
> On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux wrote:
>>
>> On 20/09/2016 at 19:54, Chris Murphy wrote:
>>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux wrote:
>>>
>>>> If I wanted to try to edit my partitions with an hex editor, where would
>>>> I find infos on how to do that ?
>>>> I really don't want to go this way, but if this is relatively simple, it
>>>> may be worth to try.
>>> Simple is relative. First you'd need
>>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
>>> understanding of where things are to edit, and then btrfs-map-logical
>>> to convert btrfs logical addresses to physical device and sector to
>>> know what to edit.
>>>
>>> I'd call it distinctly non-trivial and very tedious.
>>>
>> OK, another idea:
>> would it be possible to trick btrfs with a manufactured file that the
>> disk is present while it isn't ?
>>
>> I mean, looking for a few minutes on the hexdump of my trivial test
>> partition, header of members of btrfs array seems very alike.
>> So maybe, I can make a file wich would have enough header to make btrfs
>> believe that this is my device, and then remove it as usual
>> looks like a long shot, but it doesn't hurt to ask
> There may be another test that applies to single profiles, that
> disallows dropping a device. I think that's the place to look next.
> The superblock is easy to copy, but you'll need the device specific
> UUID which should be locatable with btrfs-show-super -f for each
> devid. The bigger problem is that Btrfs at mount time doesn't just
> look at the superblock and then mount. It actually reads parts of each
> tree, the extent of which I don't know. And it's doing a bunch of
> sanity tests as it reads those things, including transid (generation).
> So I'm not sure how easily spoofable a fake device is going to be.
> As a practical matter, migrate it to a new volume is faster and more
> reliable. Unfortunately, the inability to mount it read write is going
> to prevent you from making read only snapshots to use with btrfs
> send/receive. What might work, is find out what on-disk modification
> btrfs-tune does to make a device a read-only seed. Again your volume
> is missing a device so btrfs-tune won't let you modify it. But if you
> could force that to happen, it's probably a very minor change to
> metadata on each device, maybe it'll act like a seed device when you
> next mount it, in which case you'll be able to add a device and
> remount it read write and then delete the seed causing migration of
> everything that does remain on the volume over to the new device. I've
> never tried anything like this so I have no idea if it'll work. And
> even in the best case I haven't tried a multiple device seed going to
> a single device sprout (is it even allowed when removing the seed?).
> So...more questions than answers.
>
Sorry if I wasn't clear, but with the patch mentioned earlier, I can
get a read-write mount.
What I can't do is remove the device.
As for moving the data to another volume, since it's only data and
nothing fancy (no subvolumes or anything), a simple rsync would do the trick.
My problem in this case is that I don't have enough available space
elsewhere to move my data.
That's why I'm trying this hard to recover the partition...


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 12:19 PM, Alexandre Poux  wrote:
>
>
> On 20/09/2016 at 19:54, Chris Murphy wrote:
>> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux  wrote:
>>
>>> If I wanted to try to edit my partitions with an hex editor, where would
>>> I find infos on how to do that ?
>>> I really don't want to go this way, but if this is relatively simple, it
>>> may be worth to try.
>> Simple is relative. First you'd need
>> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
>> understanding of where things are to edit, and then btrfs-map-logical
>> to convert btrfs logical addresses to physical device and sector to
>> know what to edit.
>>
>> I'd call it distinctly non-trivial and very tedious.
>>
> OK, another idea:
> would it be possible to trick btrfs with a manufactured file that the
> disk is present while it isn't ?
>
> I mean, looking for a few minutes on the hexdump of my trivial test
> partition, header of members of btrfs array seems very alike.
> So maybe, I can make a file wich would have enough header to make btrfs
> believe that this is my device, and then remove it as usual
> looks like a long shot, but it doesn't hurt to ask

There may be another test that applies to single profiles, that
disallows dropping a device. I think that's the place to look next.
The superblock is easy to copy, but you'll need the device specific
UUID which should be locatable with btrfs-show-super -f for each
devid. The bigger problem is that Btrfs at mount time doesn't just
look at the superblock and then mount. It actually reads parts of each
tree, the extent of which I don't know. And it's doing a bunch of
sanity tests as it reads those things, including transid (generation).
So I'm not sure how easily spoofable a fake device is going to be.

As a practical matter, migrate it to a new volume is faster and more
reliable. Unfortunately, the inability to mount it read write is going
to prevent you from making read only snapshots to use with btrfs
send/receive. What might work, is find out what on-disk modification
btrfs-tune does to make a device a read-only seed. Again your volume
is missing a device so btrfs-tune won't let you modify it. But if you
could force that to happen, it's probably a very minor change to
metadata on each device, maybe it'll act like a seed device when you
next mount it, in which case you'll be able to add a device and
remount it read write and then delete the seed causing migration of
everything that does remain on the volume over to the new device. I've
never tried anything like this so I have no idea if it'll work. And
even in the best case I haven't tried a multiple device seed going to
a single device sprout (is it even allowed when removing the seed?).
So...more questions than answers.


-- 
Chris Murphy


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Alexandre Poux


On 20/09/2016 at 19:54, Chris Murphy wrote:
> On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux  wrote:
>
>> If I wanted to try to edit my partitions with an hex editor, where would
>> I find infos on how to do that ?
>> I really don't want to go this way, but if this is relatively simple, it
>> may be worth to try.
> Simple is relative. First you'd need
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
> understanding of where things are to edit, and then btrfs-map-logical
> to convert btrfs logical addresses to physical device and sector to
> know what to edit.
>
> I'd call it distinctly non-trivial and very tedious.
>
OK, another idea:
would it be possible to trick btrfs, with a manufactured file, into
believing that the disk is present when it isn't?

I mean, after looking for a few minutes at the hexdump of my trivial test
partition, the headers of the members of a btrfs array seem very much alike.
So maybe I can make a file which would have enough of a header to make btrfs
believe that this is my device, and then remove it as usual.
It looks like a long shot, but it doesn't hurt to ask.


Re: [PATCH] Btrfs: kill BUG_ON in do_relocation

2016-09-20 Thread Liu Bo
On Tue, Sep 20, 2016 at 10:03:43AM +0200, David Sterba wrote:
> On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote:
> > > > That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> > > > really the abort is just adding bonus.
> > > 
> > > I think we misuse the EIO where we should really return EFSCORRUPTED
> > > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
> > > really a message that the hardware is bad.
> > 
> > I love this idea, but one quick question, when returning EUCLEAN, what
> > message do users get? 
> > 
> > "#define EUCLEAN 117 /* Structure needs cleaning */"
> 
> strerror(EUCLEAN) -> "Structure needs cleaning"

Hmm, if I were the user, I wouldn't know how to deal with "Structure needs
cleaning", so I'd still need to take a glance at the dmesg log.

Thanks,

-liubo
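
A minimal sketch of what a user-space caller would see for the two error
codes discussed above. It assumes a Linux host, where the thread gives
EUCLEAN as 117 with the text "Structure needs cleaning"; the helper name
`describe` is hypothetical and only for illustration.

```python
import errno
import os

# EUCLEAN is Linux-specific; fall back to the value quoted in the thread
# (117) if this platform's errno module does not define it.
EUCLEAN = getattr(errno, "EUCLEAN", 117)


def describe(err: int) -> str:
    """Format an errno value the way a command-line tool might report it."""
    name = errno.errorcode.get(err, "?")
    return f"{name} ({err}): {os.strerror(err)}"


if __name__ == "__main__":
    # What the user gets if the filesystem reports corruption via EUCLEAN...
    print(describe(EUCLEAN))
    # ...versus the generic I/O error.
    print(describe(errno.EIO))
```

Either message is terse, which is consistent with the point that the user
would still need the kernel log to learn what actually went wrong.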


Re: [PATCH] Btrfs: memset to avoid stale content in btree node block

2016-09-20 Thread Liu Bo
On Tue, Sep 20, 2016 at 03:16:36PM +0200, David Sterba wrote:
> On Wed, Sep 14, 2016 at 05:22:57PM -0700, Liu Bo wrote:
> > During updating btree, we could push items between sibling
> > nodes/leaves, for leaves data sections starts reversely from
> > the end of the block while for nodes we only have key pairs
> > which are stored one by one from the start of the block.
> > 
> > So we could do try to push key pairs from one node to the next
> > node right in the tree, and after that, we update the node's
> > nritems to reflect the correct end while leaving the stale
> > content in the node.  One may intentionally corrupt the fs
> > image and access the stale content by bumping the nritems and
> > causes various crashes.
> > 
> > This takes the in-memory @nritems as the correct one and
> > gets to memset the unused part of a btree node.
> > 
> > Signed-off-by: Liu Bo 
> 
> Reviewed-by: David Sterba 
> 
> > ---
> >  fs/btrfs/extent_io.c | 11 +++
> >  1 file changed, 11 insertions(+)
> > 
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index c2325c3..56c9dee 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -3732,6 +3732,17 @@ static noinline_for_stack int write_one_eb(struct 
> > extent_buffer *eb,
> > if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
> > bio_flags = EXTENT_BIO_TREE_LOG;
> >  
> > +   /* set btree node beyond nritems with 0 to avoid stale content */
> > +   if (btrfs_header_level(eb) > 0) {
> 
> We can do the same for leaves.

In theory, the problem also applies to leaves, but I haven't got a
reproducer for the leaf case.

So I'll send a v2 with the leaf memset; please review that part more
carefully :)

Thanks,

-liubo
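
The idea behind the node memset can be sketched in user space as follows.
This is an illustration only, not the kernel patch: it reads nritems from
the block header, whereas the patch trusts the in-memory value. The layout
constants (101-byte header, 33-byte key pointer, nritems at offset 96,
level at offset 100) are my understanding of the on-disk format and should
be treated as assumptions, as should the helper names.

```python
import struct

# Assumed on-disk layout constants (see the lead-in above).
HEADER_SIZE = 101      # size of the tree block header
KEY_PTR_SIZE = 33      # 17-byte key + 8-byte block pointer + 8-byte generation
NRITEMS_OFFSET = 96    # little-endian u32 inside the header
LEVEL_OFFSET = 100     # u8 inside the header; 0 means leaf


def make_node(nodesize: int, nritems: int, level: int, fill: int = 0xAA) -> bytearray:
    """Build a fake tree block whose payload is entirely 'stale' fill bytes."""
    block = bytearray([fill]) * nodesize
    block[:HEADER_SIZE] = bytes(HEADER_SIZE)
    struct.pack_into("<I", block, NRITEMS_OFFSET, nritems)
    block[LEVEL_OFFSET] = level
    return block


def zero_unused_node_tail(block: bytearray) -> int:
    """Zero everything after the last valid key pointer; return bytes cleared."""
    (nritems,) = struct.unpack_from("<I", block, NRITEMS_OFFSET)
    if block[LEVEL_OFFSET] == 0:
        return 0  # leaves are laid out differently; not handled in this sketch
    start = HEADER_SIZE + nritems * KEY_PTR_SIZE
    if start > len(block):
        raise ValueError("nritems points past the end of the block")
    cleared = len(block) - start
    block[start:] = bytes(cleared)
    return cleared


if __name__ == "__main__":
    node = make_node(nodesize=4096, nritems=2, level=1)
    print("bytes cleared:", zero_unused_node_tail(node))
```

After the call, bumping nritems in a corrupted image would expose only
zeroes rather than leftover key pointers.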


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Chris Murphy
On Tue, Sep 20, 2016 at 11:03 AM, Alexandre Poux  wrote:

> If I wanted to try to edit my partitions with an hex editor, where would
> I find infos on how to do that ?
> I really don't want to go this way, but if this is relatively simple, it
> may be worth to try.

Simple is relative. First you'd need
https://btrfs.wiki.kernel.org/index.php/On-disk_Format to get some
understanding of where things are to edit, and then btrfs-map-logical
to convert btrfs logical addresses to physical device and sector to
know what to edit.

I'd call it distinctly non-trivial and very tedious.

-- 
Chris Murphy
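
As a rough illustration of the first step described above (finding where
things are on disk), here is a read-only sketch that locates the primary
superblock in a device image and checks its magic. The offsets (primary
superblock at 64 KiB, magic at 0x40, generation at 0x48) are taken from my
reading of the On-disk Format page and should be verified against it; the
function names are hypothetical. It does not modify anything.

```python
import struct

SUPERBLOCK_OFFSET = 0x10000   # primary superblock, 64 KiB into the device
SUPERBLOCK_SIZE = 4096
MAGIC_OFFSET = 0x40           # offset of the magic inside the superblock
GENERATION_OFFSET = 0x48      # little-endian u64 following the magic
BTRFS_MAGIC = b"_BHRfS_M"


def parse_superblock(sb: bytes) -> dict:
    """Extract the magic check and generation from raw superblock bytes."""
    if len(sb) < GENERATION_OFFSET + 8:
        raise ValueError("buffer too small to hold the superblock fields")
    magic = sb[MAGIC_OFFSET:MAGIC_OFFSET + 8]
    (generation,) = struct.unpack_from("<Q", sb, GENERATION_OFFSET)
    return {"is_btrfs": magic == BTRFS_MAGIC, "generation": generation}


def read_primary_superblock(path: str) -> dict:
    """Open a device or image read-only and parse its primary superblock."""
    with open(path, "rb") as dev:
        dev.seek(SUPERBLOCK_OFFSET)
        return parse_superblock(dev.read(SUPERBLOCK_SIZE))
```

Everything beyond the superblock lives at logical addresses, which is where
btrfs-map-logical comes in, and that is the tedious part.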


Re: multi-device btrfs with single data mode and disk failure

2016-09-20 Thread Alexandre Poux


On 20/09/2016 at 00:05, Alexandre Poux wrote:
>
> On 15/09/2016 at 23:54, Chris Murphy wrote:
>> On Thu, Sep 15, 2016 at 3:48 PM, Alexandre Poux wrote:
>>> On 15/09/2016 at 18:54, Chris Murphy wrote:
>>>> On Thu, Sep 15, 2016 at 10:30 AM, Alexandre Poux wrote:
>>>>> Thank you very much for your answers
>>>>>
>>>>> On 15/09/2016 at 17:38, Chris Murphy wrote:
>>>>>> On Thu, Sep 15, 2016 at 1:44 AM, Alexandre Poux wrote:
>>>>>>> Is it possible to do some king of a "btrfs delete missing" on this
>>>>>>> kind of setup, in order to recover access in rw to my other data, or
>>>>>>> I must copy all my data on a new partition
>>>>>> That *should* work :) Except that your file system with 6 drives is
>>>>>> too full to be shrunk to 5 drives. Btrfs will either refuse, or get
>>>>>> confused, about how to shrink a nearly full 6 drive volume into 5.
>>>>>>
>>>>>> So you'll have to do one of three things:
>>>>>>
>>>>>> 1. Add a 2+TB drive, then remove the missing one; OR
>>>>>> 2. btrfs replace is faster and is raid10 reliable; OR
>>>>>> 3. Read only scrub to get a file listing of bad files, then remount
>>>>>> read-write degraded and delete them all. Now you maybe can do a device
>>>>>> delete missing. But it's still a tight fit, it basically has to
>>>>>> balance things out to get it to fit on an odd number of drives, it may
>>>>>> actually not work even though there seems to be enough total space,
>>>>>> there has to be enough space on FOUR drives.
>>>>>>
>>>>> Are you sure you are talking about data in single mode ?
>>>>> I don't understand why you are talking about raid10,
>>>>> or the fact that it will have to rebalance everything.
>>>> Yeah sorry I got confused in that very last sentence. Single, it will
>>>> find space in 1GiB increments. Of course this fails because that data
>>>> doesn't exist anymore, but to start the operation it needs to be
>>>> possible.
>>> No problem
>>>>> Moreover, even in degraded mode I cannot mount it in rw
>>>>> It tells me
>>>>> "too many missing devices, writeable remount is not allowed"
>>>>> due to the fact I'm in single mode.
>>>> Oh you're in that trap. Well now you're stuck. I've had the case where
>>>> I could mount read write degraded with metadata raid1 and data single,
>>>> but it was good for only one mount and then I get the same message you
>>>> get and it was only possible to mount read only. At that point it's
>>>> totally suck unless you're adept at manipulating the file system with
>>>> a hex editor...
>>>>
>>>> Someone might have a patch somewhere that drops this check and lets
>>>> too many missing devices to mount anyway... I seem to recall this.
>>>> It'd be in the archives if it exists.
>>>>
>>>>> And as far as as know, btrfs replace and btrfs delete, are not supposed
>>>>> to work in read only...
>>>> It doesn't. Must be read write mounted.
>>>>
>>>>> I would like to tell him forgot about the missing data, and give me back
>>>>> my partition.
>>>> This feature doesn't exist yet. I really want to see this, it'd be
>>>> great for ceph and gluster if the volume could lose a drive, report
>>>> all the missing files to the cluster file system, delete the device
>>>> and the file references, and then the cluster knows that brick doesn't
>>>> have those files and can replicate them somewhere else or even back to
>>>> the brick that had them.

>>> So I found this patch : https://patchwork.kernel.org/patch/7014141/
>>>
>>> Does this seems ok ?
>> No idea I haven't tried it.
>>
>>> So after patching my kernel with it,
>>> I should be able to mount in rw my partition, and thus,
>>> I will be able to do a btrfs delete missing
>>> Which will just forgot about the old disk and everything should be fine
>>> afterward ?
>> It will forget about the old disk but it will try to migrate all
>> metadata and data that was on that disk to the remaining drives; so
>> until you delete all files that are corrupt, you'll continue to get
>> corruption messages about them.
>>
>>> Is this risky ? or not so much ?
>> Probably. If you care about the data, mount read only, back up what
>> you can, then see if you can fix it after that.
>>
>>> The scrubing is almost finished, and as I was expecting, I lost no data
>>> at all.
>> Well I'd guess the device delete should work then, but I still have no
>> idea if that patch will let you mount it degraded read-write. Worth a
>> shot though, it'll save time.
>>
> OK, so I found some time to work on it.
>
> I decided to do some tests in a vm (virtualbox) with 3 disks
> after making an array with 3 disks, metadata in raid1 and data in single,
> I remove one disk to reproduce my situation.
>
> I tried the patch, and, after updated it (nothing fancy),
> I can indeed mount a degraded partition with data in single.
>
> But I can't remove the device :
> #btrfs device remove missing /mnt
> ERROR: error removing device 'missing': Input/output error
> or
> #btrfs device remove 2 /mnt
> ERROR: error removing

Re: Is stability a joke?

2016-09-20 Thread Chris Murphy
btrfs-convert has been rewritten as of btrfs-progs 4.6, and therefore
the conversion page could use an update:
https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3

Anyone wanting to update the page should advise that the code is new,
that readers should check the changelog and use the latest btrfs-progs
version, and that there may still be edge cases:
https://btrfs.wiki.kernel.org/index.php/Changelog

Also, the status page doesn't mention the convert feature.


Chris Murphy


Re: [PATCH]btrfs-progs: Post btrfs-convert verify permissions and acls

2016-09-20 Thread lakshmipathi . g

On 2016-09-19 20:17, Qu Wenruo wrote:
> Hi Laksmipathi,
>
> At 09/06/2016 03:27 AM, Lakshmipathi.G wrote:
>> Signed-off-by: Lakshmipathi.G
>> ---
>>  tests/common.convert | 95 +++-
>>  1 file changed, 94 insertions(+), 1 deletion(-)
>>
>> diff --git a/tests/common.convert b/tests/common.convert
>> index 4e3d49c..67c99b1 100644
>> --- a/tests/common.convert
>> +++ b/tests/common.convert
>> @@ -123,6 +123,38 @@ convert_test_gen_checksums() {
>>      count=1 >/dev/null 2>&1
>>      run_check_stdout $SUDO_HELPER find $TEST_MNT -type f ! -name 'image' -exec md5sum {} \+ > "$CHECKSUMTMP"
>>  }
>> +# list $TEST_MNT data set file permissions.
>> +# $1: path where the permissions will be stored
>> +convert_test_perm() {
>> +   local PERMTMP
>> +   PERMTMP="$1"
>> +   FILES_LIST=$(mktemp --tmpdir btrfs-progs-convert.fileslistXX)
>> +
>> +   run_check $SUDO_HELPER dd if=/dev/zero of=$TEST_MNT/test bs=$nodesize \
>> +   count=1 >/dev/null 2>&1
>> +   run_check_stdout $SUDO_HELPER find $TEST_MNT -type f ! -name 'image' -fprint $FILES_LIST
>> +   #Fix directory entries order.
>> +   sort $FILES_LIST -o $FILES_LIST
>> +   for file in `cat $FILES_LIST` ;do
>> +       run_check_stdout $SUDO_HELPER getfacl --absolute-names $file >> "$PERMTMP"
>> +   done
>> +   rm $FILES_LIST
>> +}
>> +# list acls of files on $TEST_MNT
>> +# $1: path where the acls will be stored
>> +convert_test_acl() {
>> +   local ACLSTMP
>> +   ACLTMP="$1"
>> +   FILES_LIST=$(mktemp --tmpdir btrfs-progs-convert.fileslistXX)
>> +
>> +   run_check_stdout $SUDO_HELPER find $TEST_MNT/acls -type f -fprint $FILES_LIST
>> +   #Fix directory entries order.
>> +   sort $FILES_LIST -o $FILES_LIST
>> +   for file in `cat $FILES_LIST`;do
>> +       run_check_stdout $SUDO_HELPER getfattr --absolute-names -d $file >> "$ACLTMP"
>> +   done
>> +   rm $FILES_LIST
>> +}
>>
>>  # do conversion with given features and nodesize, fsck afterwards
>>  # $1: features, argument of -O, can be empty
>> @@ -133,15 +165,68 @@ convert_test_do_convert() {
>>      run_check $TOP/btrfs-show-super -Ffa $TEST_DEV
>>  }
>>
>> +# post conversion check, verify file permissions.
>> +# $1: file with ext permissions.
>> +convert_test_post_check_permissions() {
>> +   local EXT_PERMTMP
>> +   local BTRFS_PERMTMP
>> +
>> +   EXT_PERMTMP="$1"
>> +   BTRFS_PERMTMP=$(mktemp --tmpdir btrfs-progs-convert.permXX)
>> +   convert_test_perm "$BTRFS_PERMTMP"
>> +
>> +   btrfs_perm=`md5sum $BTRFS_PERMTMP | cut -f1 -d' '`
>> +   ext_perm=`md5sum $EXT_PERMTMP | cut -f1 -d' '
>
> When running test case 005, the test script hangs here.
> And EXT_PERMTMP seems to be empty, so md5sum is waiting input from
> stdio, causing the hang.
>
> Any idea to fix it?
>
> Thanks,
> Qu



Hi Qu,

Can you confirm whether the in-place 'sort' command used in
convert_test_perm() creates an appropriately sorted file rather than an
empty one?



$ sort --version
sort (GNU coreutils) 8.22


Cheers,
Lakshmipathi.G
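
The hang reported above comes from md5sum falling back to standard input
when its file argument expands to nothing. A minimal sketch of the same
comparison with an explicit guard against an empty path is shown below; the
function names are hypothetical and this is not the test suite's code.

```python
import hashlib
import os
import tempfile


def file_md5(path: str) -> str:
    """Return the MD5 hex digest of a file, refusing an empty path."""
    if not path:
        # An unquoted empty shell variable would make md5sum read stdin here.
        raise ValueError("empty path: nothing to hash")
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(ext_dump: str, btrfs_dump: str) -> bool:
    """Compare the pre- and post-conversion dumps by checksum."""
    return file_md5(ext_dump) == file_md5(btrfs_dump)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        before = os.path.join(tmp, "ext.perm")
        after = os.path.join(tmp, "btrfs.perm")
        for name in (before, after):
            with open(name, "w") as fh:
                fh.write("# file: /mnt/test\nuser::rw-\n")
        print("identical:", same_contents(before, after))
```

Failing fast on the empty path turns a silent hang into an immediate error,
which makes the missing argument much easier to trace.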


Re: [PATCH]btrfs-progs: Add fast,slow symlinks and fifo types to convert test

2016-09-20 Thread lakshmipathi . g

On 2016-09-19 18:21, Qu Wenruo wrote:
> Just curious, did the new fifo/slow_symlink exposed any convert bug?
>
> Thanks,
> Qu

Unfortunately no. I was hoping something will fail, but sadly
convert-tests.sh passed!



Re: [PATCH]btrfs-progs: Add fast,slow symlinks and fifo types to convert test

2016-09-20 Thread lakshmipathi . g

On 2016-09-19 11:05, David Sterba wrote:
> On Thu, Sep 15, 2016 at 11:34:07AM +0200, Lakshmipathi.G wrote:
>> +   slow_symlink)
>> +   for num in $(seq 1 $DATASET_SIZE); do
>> +   fname64=`date +%s | sha256sum | cut -f1 -d'-'`
>
> Do you need to generate the date and sha all the time?

Right, I missed that part. We can create a single file and create
multiple symlinks to that same file. The fname64 creation can be moved out
of the loop. I'll re-send another patch with this fix.



Cheers.
Lakshmipathi.G
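
The proposed fix (compute the long name once, then point every symlink at
the same file) can be sketched as below. This is an illustration rather
than the test-suite change itself; the link naming scheme and the function
name are assumptions.

```python
import hashlib
import os
import tempfile
import time


def make_slow_symlinks(directory: str, count: int) -> list:
    """Create one long-named target and `count` symlinks pointing at it."""
    # Computed once, outside the loop: a 64-character hex name, long enough
    # that the link target cannot be stored inline in an ext inode.
    target_name = hashlib.sha256(str(int(time.time())).encode()).hexdigest()
    target = os.path.join(directory, target_name)
    with open(target, "w"):
        pass  # an empty file is enough to act as the link target

    links = []
    for num in range(1, count + 1):
        link = os.path.join(directory, f"slow_symlink.{num}")  # hypothetical name
        os.symlink(target, link)
        links.append(link)
    return links


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        created = make_slow_symlinks(tmp, 5)
        print(len(created), "symlinks ->", os.readlink(created[0]))
```

Hashing once keeps the loop cheap while still exercising the long-target
code path on every iteration.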


Re: ChaCha20 vs. AES performance

2016-09-20 Thread Kent Overstreet
On Tue, Sep 20, 2016 at 10:23:20AM -0400, Theodore Ts'o wrote:
> On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote:
> > Not on the list or I would've replied directly, but on Haswell, ChaCha20 (in
> > software) is over 2x as fast as AES (in hardware), at realistic (for a
> > filesystem) block sizes:
> 
> On Skylake and Broadwell processors, AES is faster (the posting is
> from a ChaCha20 enthusiast):
> 
>  https://blog.cloudflare.com/it-takes-two-to-chacha-poly/

The performance delta in his graphs isn't nearly as big as what I've measured,
which makes me suspect OpenSSL's ChaCha20 implementation isn't nearly as fast as
the kernel's.

> My big worry though is that schemes that require that nonces/IV's must
> **never** be reused are fragile.  It's for the same reason that DSA
> makes my skin crawl.  If you ever screw up --- maybe after a crash, or
> a file system bug, you end up reusing a nonce, it's game over.
> 
> So if there are hardware solutions which are faster or fast enough
> that the crypto is no longer dominant cost, why not use a cipher
> scheme which is more robust?

Block ciphers have their own downsides though - XTS is really a big pile of
hacks and workarounds. If you can get nonces right, a stream cipher
cryptosystem (and ChaCha20 especially) is on the whole drastically
simpler, and thus easier to understand and audit.

And if you can do nonces correctly, ChaCha20/Poly1305 is pretty much one of the
gold standards - it's secure against pretty much any vaguely realistic threat
model. XTS, not so much - it's just the best you can do given the constraints of
typical disk crypto. The gold standards of encryption today are the AEADs - and
AES/GCM fails badly with nonce reuse too, there aren't any AEADs yet that don't
fail badly with nonce reuse.

> P.S.  We're also both ignoring the cost of whatever changes are needed in
> the file system to guarantee that the nonce is never, ever reused...

I'm definitely not advocating for hacking stream ciphers into existing
filesystems - if you don't have the machinery you need to be 100% rigorous about
nonces, then definitely stick with XTS. But I already had most of what I needed
in bcachefs, and I can still break the on disk format if I need to (and
encryption is a breaking change), so for me ChaCha20/Poly1305 was a no brainer.

BTW though, if there do turn out to be platforms where AES is significantly
faster than ChaCha20 I can still add AES support pretty easily - I've already
got all the relevant switch statements, since encryption is handled as another
checksum type.
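
Why nonce reuse is fatal for a stream cipher can be shown with a toy
example. The keystream below is built from SHA-256 in a counter
construction purely for illustration; it is not ChaCha20 and must not be
used for real encryption. All names and values are hypothetical.

```python
import hashlib


def toy_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256(key || nonce || counter), concatenated."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "little")).digest()
        counter += 1
    return bytes(out[:length])


def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """Stream-cipher style encryption: plaintext XOR keystream."""
    return xor(plaintext, toy_keystream(key, nonce, len(plaintext)))


key = b"k" * 32
nonce = b"n" * 12
p1 = b"old block contents"
p2 = b"new block contents"

c1 = encrypt(key, nonce, p1)
c2 = encrypt(key, nonce, p2)  # same key AND same nonce: the mistake

# The keystream cancels out, so an attacker holding both ciphertexts learns
# p1 XOR p2 without knowing the key; knowing either plaintext reveals the other.
leaked = xor(c1, c2)
recovered_p2 = xor(leaked, p1)
```

This is the overwrite-in-place scenario: a filesystem that rewrites a block
under the same key and nonce hands out exactly such a ciphertext pair.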


Re: Experimental btrfs encryption

2016-09-20 Thread Chris Mason



On 09/19/2016 10:50 PM, Theodore Ts'o wrote:

On Mon, Sep 19, 2016 at 08:32:34PM -0400, Chris Mason wrote:

That key is used to protect the contents of the data file, and to
encrypt filenames and symlink targets --- since filenames can leak
significant information about what the user is doing.  (For example,
in the downloads directory of their web browser, leaking filenames is
just as good as leaking part of their browsing history.)


One of the things that makes per-subvolume encryption attractive to me is
that we're able to enforce the idea that an entire directory tree is
encrypted by one key.  It can't be snapshotted again without the key, and it
just fits with the rest of the btrfs management code.  I do want to support
the existing vfs interfaces as well too though.


One of the main reasons for doing fs-level encryption is so you can
allow multiple users to have different keys.  In some cases you can
assume that different users will be in different distinct subvolumes
(e.g., each user has their own home directory), but that's not always
going to be possible.



Agreed, they are just different use cases.  I think both are important, 
and btrfs won't do encryption without the file-level option.



One of the other things that was in the original design, but which got
dropped in our initial implementation, was the concept of having the
per-inode key wrapped by multiple user keys.  This would allow a file
to be accessible by more than one user.  So something to consider is
that there may very well be situations where you *want* to have more
than one key associated with a directory hierarchy.


The issue, here, is that inodes are fundamentally not a safe scope to
attach that information to in btrfs. As extents can be shared between
inodes (and thus both will need to decrypt them), and inodes can be
duplicated unmodified (snapshots), attaching keys and nonces to inodes
opens up a whole host of (possibly insoluble) issues, including
catastrophic nonce reuse via writable snapshots.


I'm going to have to read harder about nonce reuse.  In btrfs an inode is
really a pair [ root id, inode number ], so strictly speaking two writable
snapshots won't have the same inode in memory and when a snapshot is
modified we'd end up with a different nonce for the new modifications.


Nonce reuse is not necessarily catastrophic.  It all depends on the
context.  In the case of Counter or GCM mode, nonce (or IV) reuse is
absolutely catastrophic.  It must *never* be done or you completely
lose all security.  As the Soviets discovered the hard way courtesy of
the Venona project (well, they didn't discover it until after they
lost the cold war, but...) one time pads are completely secure.
Two-time pads, are most emphatically _not_.  :-)

In the case of the nonces used in fscrypt's key derivation, reuse of
the nonce basically means that two files share the same key.  Assuming
you're using a competently designed block cipher (e.g., AES), reuse of
the key is not necessarily a problem.  What it would mean is that two
files which are reflinked would share the same key.  And if you
have writable snapshots, that's definitely not a problem, since with
AES we use a fixed key and a fixed IV given a logical block
number, and we can do block overwrites without having to guarantee
unique nonces (which you *do* need to worry about if you use counter
mode or some other stream cipher such as ChaCha20 --- Kent Overstreet
had some clever tricks to avoid IV reuse since he used a stream cipher
in his proposed bcachefs encryption).

The main issue is if you want to reflink a file and then have the two
files have different permissions / ownerships.  In that case, you
really want to use different keys for user A and for user B --- but if
you are assuming a single key per subvolume, you can't support
different keys for different users anyway, so you're kind of toast for
that use case in any case.


So there's a matrix of possible configurations.  If you're doing a 
reflink between subvolumes and you're doing a subvolume granular 
encryption and you don't have keys to the source subvolume, the reflink 
shouldn't be allowed.  If you do have keys, any new writes are happening 
into a different inode, and will be encrypted with a different key.


If you're doing a file level encryption and you do have access to the 
source file, the destination file is a new inode.  Thanks to COW any 
changes are going to go into new extents and will end up with different 
keys/nonces.


Either way, we degrade down into extent based encryption.  I'd take that 
hit to maintain sane semantics in the face of snapshots and reflinks. 
The btrfs extent structures on disk already have an encryption type field.




So in any case, assuming you're using block encryption (which is what
fscrypt uses) there really isn't a problem with nonce reuse, although
in some cases if you really do want to reflink a file and have it be
protected by different user keys, this would have 

Re: [RFC] Preliminary BTRFS Encryption

2016-09-20 Thread Anand Jain




Hi David,

On 09/18/2016 02:45 AM, David Sterba wrote:

On Sat, Sep 17, 2016 at 12:38:30AM -0400, Zygo Blaxell wrote:

There's also a nasty problem with the extent tree--there's only one per
filesystem, it's shared between all subvols and block groups, and every
extent in that tree has back references to the (possibly encrypted) subvol
trees.  I'll leave that problem as an exercise for other readers.  ;)


A design point that I'm not mentioning for the first time: there would
be per-subvolume-group extent trees, i.e. a set of subvolumes with an
attached extent tree, similar to what we have now.  So encrypted
and unencrypted extent metadata will never be mixed.
(The crypto key questions are not addressed here.)

This hasn't been implemented but I'm making sure this will be possible
when somebody mentions changes to the extent tree or blockgroup reworks
(to actually solve other problems).


 Now I remember this was mentioned before; sorry, it slipped my mind.

Thanks, Anand



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html




Re: ChaCha20 vs. AES performance

2016-09-20 Thread Theodore Ts'o
On Tue, Sep 20, 2016 at 03:15:19AM -0800, Kent Overstreet wrote:
> Not on the list or I would've replied directly, but on Haswell, ChaCha20 (in
> software) is over 2x as fast as AES (in hardware), at realistic (for a
> filesystem) block sizes:

On Skylake and Broadwell processors, AES is faster (the posting is
from a ChaCha20 enthusiast):

 https://blog.cloudflare.com/it-takes-two-to-chacha-poly/

My big worry though is that schemes that require that nonces/IV's must
**never** be reused are fragile.  It's for the same reason that DSA
makes my skin crawl.  If you ever screw up --- maybe after a crash, or
a file system bug, you end up reusing a nonce, it's game over.

So if there are hardware solutions which are faster or fast enough
that the crypto is no longer dominant cost, why not use a cipher
scheme which is more robust?
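To make the fragility concrete, here is a minimal userspace sketch of what
nonce reuse does to any XOR-keystream scheme.  The keystream generator below
is a toy xorshift PRNG, not ChaCha20 or AES-CTR, and every name in it is
hypothetical; it only illustrates that two ciphertexts produced with the same
(key, nonce) pair XOR together to the XOR of the two plaintexts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy keystream generator (NOT a real cipher): a xorshift PRNG seeded
 * from key and nonce.  It stands in for a stream cipher only to show
 * the structural problem.
 */
void toy_keystream(uint64_t key, uint64_t nonce, uint8_t *out, size_t len)
{
	uint64_t s = key ^ (nonce * 0x9E3779B97F4A7C15ULL);
	size_t i;

	if (s == 0)
		s = 1;	/* xorshift must not start from zero */
	for (i = 0; i < len; i++) {
		s ^= s << 13;
		s ^= s >> 7;
		s ^= s << 17;
		out[i] = (uint8_t)(s >> 32);
	}
}

/* Stream "encryption": output = input XOR keystream(key, nonce). */
void toy_stream_crypt(uint64_t key, uint64_t nonce,
		      const uint8_t *in, uint8_t *out, size_t len)
{
	uint8_t ks[64];
	size_t i;

	assert(len <= sizeof(ks));
	toy_keystream(key, nonce, ks, len);
	for (i = 0; i < len; i++)
		out[i] = in[i] ^ ks[i];
}

/*
 * Returns 1 if XORing two ciphertexts made with the same (key, nonce)
 * equals the XOR of the two plaintexts, i.e. the keystream cancels and
 * an observer learns p1 ^ p2 without knowing the key.
 */
int nonce_reuse_leaks_xor(uint64_t key, uint64_t nonce,
			  const uint8_t *p1, const uint8_t *p2, size_t len)
{
	uint8_t c1[64], c2[64];
	size_t i;

	assert(len <= sizeof(c1));
	toy_stream_crypt(key, nonce, p1, c1, len);
	toy_stream_crypt(key, nonce, p2, c2, len);
	for (i = 0; i < len; i++)
		if ((uint8_t)(c1[i] ^ c2[i]) != (uint8_t)(p1[i] ^ p2[i]))
			return 0;
	return 1;
}
```

The check holds for any key, nonce, and keystream function, which is why an
overwritten block encrypted under a reused nonce is a problem regardless of
how strong the underlying cipher is.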

- Ted

P.S.  We're also both ignoring the cost of whatever changes are needed in
the file system to guarantee that the nonce is never, ever reused...


[PATCH 4/5] btrfs: convert pr_* to btrfs_* where possible

2016-09-20 Thread jeffm
From: Jeff Mahoney 

For many printks, we want to know which file system issued the message.

This patch converts most pr_* calls to use the btrfs_* versions instead.
In some cases, this means adding plumbing to allow call sites access to
an fs_info pointer.

fs/btrfs/check-integrity.c is left alone for another day.

Signed-off-by: Jeff Mahoney 

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index c2cd4c2..16ec215 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -390,7 +390,8 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
/* root node has been locked, we can release @subvol_srcu safely here */
srcu_read_unlock(&fs_info->subvol_srcu, index);
 
-   pr_debug("search slot in root %llu (level %d, ref count %d) returned %d for key (%llu %u %llu)\n",
+   btrfs_debug(fs_info,
+   "search slot in root %llu (level %d, ref count %d) returned %d for key (%llu %u %llu)",
 ref->root_id, level, ref->count, ret,
 ref->key_for_search.objectid, ref->key_for_search.type,
 ref->key_for_search.offset);
@@ -1491,7 +1492,8 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
 
if (found_key->objectid > logical ||
found_key->objectid + size <= logical) {
-   pr_debug("logical %llu is not within any extent\n", logical);
+   btrfs_debug(fs_info,
+   "logical %llu is not within any extent", logical);
return -ENOENT;
}
 
@@ -1502,7 +1504,8 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
flags = btrfs_extent_flags(eb, ei);
 
-   pr_debug("logical %llu is at position %llu within the extent (%llu EXTENT_ITEM %llu) flags %#llx size %u\n",
+   btrfs_debug(fs_info,
+   "logical %llu is at position %llu within the extent (%llu EXTENT_ITEM %llu) flags %#llx size %u",
 logical, logical - found_key->objectid, found_key->objectid,
 found_key->offset, flags, item_size);
 
@@ -1623,20 +1626,24 @@ int tree_backref_for_extent(unsigned long *ptr, struct extent_buffer *eb,
return 0;
 }
 
-static int iterate_leaf_refs(struct extent_inode_elem *inode_list,
-   u64 root, u64 extent_item_objectid,
-   iterate_extent_inodes_t *iterate, void *ctx)
+static int iterate_leaf_refs(struct btrfs_fs_info *fs_info,
+struct extent_inode_elem *inode_list,
+u64 root, u64 extent_item_objectid,
+iterate_extent_inodes_t *iterate, void *ctx)
 {
struct extent_inode_elem *eie;
int ret = 0;
 
for (eie = inode_list; eie; eie = eie->next) {
-   pr_debug("ref for %llu resolved, key (%llu EXTEND_DATA %llu), root %llu\n", extent_item_objectid,
-eie->inum, eie->offset, root);
+   btrfs_debug(fs_info,
+   "ref for %llu resolved, key (%llu EXTEND_DATA %llu), root %llu",
+   extent_item_objectid, eie->inum,
+   eie->offset, root);
ret = iterate(eie->inum, eie->offset, root, ctx);
if (ret) {
-   pr_debug("stopping iteration for %llu due to ret=%d\n",
-extent_item_objectid, ret);
+   btrfs_debug(fs_info,
+   "stopping iteration for %llu due to ret=%d",
+   extent_item_objectid, ret);
break;
}
}
@@ -1664,7 +1671,7 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
struct ulist_iterator ref_uiter;
struct ulist_iterator root_uiter;
 
-   pr_debug("resolving all inodes for extent %llu\n",
+   btrfs_debug(fs_info, "resolving all inodes for extent %llu",
extent_item_objectid);
 
if (!search_commit_root) {
@@ -1690,9 +1697,12 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
break;
ULIST_ITER_INIT(&root_uiter);
while (!ret && (root_node = ulist_next(roots, &root_uiter))) {
-   pr_debug("root %llu references leaf %llu, data list %#llx\n", root_node->val, ref_node->val,
-ref_node->aux);
-   ret = iterate_leaf_refs((struct extent_inode_elem *)
+   btrfs_debug(fs_info,
+   "root %llu references leaf %llu, data list %#llx",
+   root_node->val, ref_node->val,
+   ref_node->aux);
+   ret = iterate_leaf_refs(fs_info,
+   (

[PATCH 5/5] btrfs: convert send's verbose_printk to btrfs_debug

2016-09-20 Thread jeffm
From: Jeff Mahoney 

This was basically an open-coded, less flexible dynamic printk.  We can
just use btrfs_debug instead.

Signed-off-by: Jeff Mahoney 

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index ee10345..96bc99d 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -36,10 +36,6 @@
 #include "transaction.h"
 #include "compression.h"
 
-static int g_verbose = 0;
-
-#define verbose_printk(...) if (g_verbose) printk(__VA_ARGS__)
-
 /*
  * A fs_path is a helper to dynamically build path names with unknown size.
  * It reallocates the internal buffer on demand.
@@ -727,9 +723,10 @@ static int send_cmd(struct send_ctx *sctx)
 static int send_rename(struct send_ctx *sctx,
 struct fs_path *from, struct fs_path *to)
 {
+   struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
int ret;
 
-verbose_printk("btrfs: send_rename %s -> %s\n", from->start, to->start);
+   btrfs_debug(fs_info, "send_rename %s -> %s", from->start, to->start);
 
ret = begin_cmd(sctx, BTRFS_SEND_C_RENAME);
if (ret < 0)
@@ -751,9 +748,10 @@ out:
 static int send_link(struct send_ctx *sctx,
 struct fs_path *path, struct fs_path *lnk)
 {
+   struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
int ret;
 
-verbose_printk("btrfs: send_link %s -> %s\n", path->start, lnk->start);
+   btrfs_debug(fs_info, "send_link %s -> %s", path->start, lnk->start);
 
ret = begin_cmd(sctx, BTRFS_SEND_C_LINK);
if (ret < 0)
@@ -774,9 +772,10 @@ out:
  */
 static int send_unlink(struct send_ctx *sctx, struct fs_path *path)
 {
+   struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
int ret;
 
-verbose_printk("btrfs: send_unlink %s\n", path->start);
+   btrfs_debug(fs_info, "send_unlink %s", path->start);
 
ret = begin_cmd(sctx, BTRFS_SEND_C_UNLINK);
if (ret < 0)
@@ -796,9 +795,10 @@ out:
  */
 static int send_rmdir(struct send_ctx *sctx, struct fs_path *path)
 {
+   struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
int ret;
 
-verbose_printk("btrfs: send_rmdir %s\n", path->start);
+   btrfs_debug(fs_info, "send_rmdir %s", path->start);
 
ret = begin_cmd(sctx, BTRFS_SEND_C_RMDIR);
if (ret < 0)
@@ -1313,6 +1313,7 @@ static int find_extent_clone(struct send_ctx *sctx,
 u64 ino_size,
 struct clone_root **found)
 {
+   struct btrfs_fs_info *fs_info = sctx->send_root->fs_info;
int ret;
int extent_type;
u64 logical;
@@ -1371,10 +1372,10 @@ static int find_extent_clone(struct send_ctx *sctx,
}
logical = disk_byte + btrfs_file_extent_offset(eb, fi);
 
-   down_read(&sctx->send_root->fs_info->commit_root_sem);
-   ret = extent_from_logical(sctx->send_root->fs_info, disk_byte, tmp_path,
+   down_read(&fs_info->commit_root_sem);
+   ret = extent_from_logical(fs_info, disk_byte, tmp_path,
  &found_key, &flags);
-   up_read(&sctx->send_root->fs_info->commit_root_sem);
+   up_read(&fs_info->commit_root_sem);
btrfs_release_path(tmp_path);
 
if (ret < 0)
@@ -1429,7 +1430,7 @@ static int find_extent_clone(struct send_ctx *sctx,
extent_item_pos = logical - found_key.objectid;
else
extent_item_pos = 0;
-   ret = iterate_extent_inodes(sctx->send_root->fs_info,
+   ret = iterate_extent_inodes(fs_info,
found_key.objectid, extent_item_pos, 1,
__iterate_backrefs, backref_ctx);
 
@@ -1439,17 +1440,18 @@ static int find_extent_clone(struct send_ctx *sctx,
if (!backref_ctx->found_itself) {
/* found a bug in backref code? */
ret = -EIO;
-   btrfs_err(sctx->send_root->fs_info,
+   btrfs_err(fs_info,
  "did not find backref in send_root. inode=%llu, offset=%llu, disk_byte=%llu found extent=%llu",
-   ino, data_offset, disk_byte, found_key.objectid);
+ ino, data_offset, disk_byte, found_key.objectid);
goto out;
}
 
-verbose_printk(KERN_DEBUG "btrfs: find_extent_clone: data_offset=%llu, ino=%llu, num_bytes=%llu, logical=%llu\n",
-   data_offset, ino, num_bytes, logical);
+   btrfs_debug(fs_info,
+   "find_extent_clone: data_offset=%llu, ino=%llu, num_bytes=%llu, logical=%llu",
+   data_offset, ino, num_bytes, logical);
 
if (!backref_ctx->found)
-   verbose_printk("btrfs:no clones found\n");
+   btrfs_debug(fs_info, "no clones found");
 
cur_clone_root = NULL;
for (i = 0; i < sctx->clone_roots_cnt; i++) {
@@ -2420,10 +2422,11 @@ out:
 
 static int send_truncate(struct send_ctx *sctx, u64 ino, u64 gen, u64 size)
 {
+   struct btrfs_fs_info *fs_

[PATCH 3/5] btrfs: convert printk(KERN_* to use pr_* calls

2016-09-20 Thread jeffm
From: Jeff Mahoney 

This patch converts printk(KERN_* style messages to use the pr_* versions.

One side effect is that anything that was KERN_DEBUG is now automatically
a dynamic debug message.

Signed-off-by: Jeff Mahoney 

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 8d87056..dc9c93e 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -656,7 +656,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
BUG_ON(NULL == state);
selected_super = kzalloc(sizeof(*selected_super), GFP_NOFS);
if (NULL == selected_super) {
-   printk(KERN_INFO "btrfsic: error, kmalloc failed!\n");
+   pr_info("btrfsic: error, kmalloc failed!\n");
return -ENOMEM;
}
 
@@ -681,7 +681,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
}
 
if (NULL == state->latest_superblock) {
-   printk(KERN_INFO "btrfsic: no superblock found!\n");
+   pr_info("btrfsic: no superblock found!\n");
kfree(selected_super);
return -1;
}
@@ -698,13 +698,13 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
next_bytenr = btrfs_super_root(selected_super);
if (state->print_mask &
BTRFSIC_PRINT_MASK_ROOT_CHUNK_LOG_TREE_LOCATION)
-   printk(KERN_INFO "root@%llu\n", next_bytenr);
+   pr_info("root@%llu\n", next_bytenr);
break;
case 1:
next_bytenr = btrfs_super_chunk_root(selected_super);
if (state->print_mask &
BTRFSIC_PRINT_MASK_ROOT_CHUNK_LOG_TREE_LOCATION)
-   printk(KERN_INFO "chunk@%llu\n", next_bytenr);
+   pr_info("chunk@%llu\n", next_bytenr);
break;
case 2:
next_bytenr = btrfs_super_log_root(selected_super);
@@ -712,7 +712,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
continue;
if (state->print_mask &
BTRFSIC_PRINT_MASK_ROOT_CHUNK_LOG_TREE_LOCATION)
-   printk(KERN_INFO "log@%llu\n", next_bytenr);
+   pr_info("log@%llu\n", next_bytenr);
break;
}
 
@@ -720,7 +720,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
btrfs_num_copies(state->root->fs_info,
 next_bytenr, state->metablock_size);
if (state->print_mask & BTRFSIC_PRINT_MASK_NUM_COPIES)
-   printk(KERN_INFO "num_copies(log_bytenr=%llu) = %d\n",
+   pr_info("num_copies(log_bytenr=%llu) = %d\n",
   next_bytenr, num_copies);
 
for (mirror_num = 1; mirror_num <= num_copies; mirror_num++) {
@@ -733,7 +733,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
&tmp_next_block_ctx,
mirror_num);
if (ret) {
-   printk(KERN_INFO "btrfsic: btrfsic_map_block(root @%llu, mirror %d) failed!\n",
+   pr_info("btrfsic: btrfsic_map_block(root @%llu, mirror %d) failed!\n",
   next_bytenr, mirror_num);
kfree(selected_super);
return -1;
@@ -756,8 +756,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
 
ret = btrfsic_read_block(state, &tmp_next_block_ctx);
if (ret < (int)PAGE_SIZE) {
-   printk(KERN_INFO
-  "btrfsic: read @logical %llu failed!\n",
+   pr_info("btrfsic: read @logical %llu failed!\n",
   tmp_next_block_ctx.start);
btrfsic_release_block_ctx(&tmp_next_block_ctx);
kfree(selected_super);
@@ -818,7 +817,7 @@ static int btrfsic_process_superblock_dev_mirror(
if (NULL == superblock_tmp) {
superblock_tmp = btrfsic_block_alloc();
if (NULL == superblock_tmp) {
-   printk(KERN_INFO "btrfsic: error, kmalloc failed!\n");
+   pr_info("btrfsic: error, kmalloc failed!\n");
brelse(bh);
return -1;
}
@@ -892,7 +891,7 @@ static int btrfsic_process_superblock_dev_mirror(
btrfs_num_copies(state->root->fs_info,

[PATCH 2/5] btrfs: unsplit printed strings

2016-09-20 Thread jeffm
From: Jeff Mahoney 

CodingStyle chapter 2:
"[...] never break user-visible strings such as printk messages,
because that breaks the ability to grep for them."

This patch unsplits user-visible strings.

Signed-off-by: Jeff Mahoney 

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index 455a6b2..c2cd4c2 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -390,8 +390,7 @@ static int __resolve_indirect_ref(struct btrfs_fs_info *fs_info,
/* root node has been locked, we can release @subvol_srcu safely here */
srcu_read_unlock(&fs_info->subvol_srcu, index);
 
-   pr_debug("search slot in root %llu (level %d, ref count %d) returned "
-"%d for key (%llu %u %llu)\n",
+   pr_debug("search slot in root %llu (level %d, ref count %d) returned %d for key (%llu %u %llu)\n",
 ref->root_id, level, ref->count, ret,
 ref->key_for_search.objectid, ref->key_for_search.type,
 ref->key_for_search.offset);
@@ -1503,8 +1502,7 @@ int extent_from_logical(struct btrfs_fs_info *fs_info, u64 logical,
ei = btrfs_item_ptr(eb, path->slots[0], struct btrfs_extent_item);
flags = btrfs_extent_flags(eb, ei);
 
-   pr_debug("logical %llu is at position %llu within the extent (%llu "
-"EXTENT_ITEM %llu) flags %#llx size %u\n",
+   pr_debug("logical %llu is at position %llu within the extent (%llu EXTENT_ITEM %llu) flags %#llx size %u\n",
 logical, logical - found_key->objectid, found_key->objectid,
 found_key->offset, flags, item_size);
 
@@ -1633,8 +1631,7 @@ static int iterate_leaf_refs(struct extent_inode_elem *inode_list,
int ret = 0;
 
for (eie = inode_list; eie; eie = eie->next) {
-   pr_debug("ref for %llu resolved, key (%llu EXTEND_DATA %llu), "
-"root %llu\n", extent_item_objectid,
+   pr_debug("ref for %llu resolved, key (%llu EXTEND_DATA %llu), root %llu\n", extent_item_objectid,
 eie->inum, eie->offset, root);
ret = iterate(eie->inum, eie->offset, root, ctx);
if (ret) {
@@ -1693,8 +1690,7 @@ int iterate_extent_inodes(struct btrfs_fs_info *fs_info,
break;
ULIST_ITER_INIT(&root_uiter);
while (!ret && (root_node = ulist_next(roots, &root_uiter))) {
-   pr_debug("root %llu references leaf %llu, data list "
-"%#llx\n", root_node->val, ref_node->val,
+   pr_debug("root %llu references leaf %llu, data list %#llx\n", root_node->val, ref_node->val,
 ref_node->aux);
ret = iterate_leaf_refs((struct extent_inode_elem *)
(uintptr_t)ref_node->aux,
@@ -1792,8 +1788,7 @@ static int iterate_inode_refs(u64 inum, struct btrfs_root *fs_root,
for (cur = 0; cur < btrfs_item_size(eb, item); cur += len) {
name_len = btrfs_inode_ref_name_len(eb, iref);
/* path must be released before calling iterate()! */
-   pr_debug("following ref at offset %u for inode %llu in "
-"tree %llu\n", cur, found_key.objectid,
+   pr_debug("following ref at offset %u for inode %llu in tree %llu\n", cur, found_key.objectid,
 fs_root->objectid);
ret = iterate(parent, name_len,
  (unsigned long)(iref + 1), eb, ctx);
diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 6678947..8d87056 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -733,9 +733,7 @@ static int btrfsic_process_superblock(struct btrfsic_state *state,
&tmp_next_block_ctx,
mirror_num);
if (ret) {
-   printk(KERN_INFO "btrfsic:"
-  " btrfsic_map_block(root @%llu,"
-  " mirror %d) failed!\n",
+   printk(KERN_INFO "btrfsic: btrfsic_map_block(root @%llu, mirror %d) failed!\n",
   next_bytenr, mirror_num);
kfree(selected_super);
return -1;
@@ -905,8 +903,7 @@ static int btrfsic_process_superblock_dev_mirror(
  state->metablock_size,
  &tmp_next_block_ctx,
  mirror_num)) {
-   printk(KERN_INFO "btrfsic: btrfsic_map_block("
-  "bytenr @%llu, mirror %d) failed!\n",
+

[PATCH 1/5] btrfs: add dynamic debug support

2016-09-20 Thread jeffm
From: Jeff Mahoney 

We can re-use the dynamic debugging descriptor to make use of the dynamic
debugging mechanism but still use our own printk interface.

Defining the DEBUG macro works as it did before.  When it's defined,
all of the messages default to print.  We can also enable all debug
messages at boot or module-load time using the 'dyndbg' and
'btrfs.dyndbg' options.

Signed-off-by: Jeff Mahoney 

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 33fe035..9267436 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -3315,7 +3316,35 @@ void btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...)
btrfs_printk_ratelimited(fs_info, KERN_NOTICE fmt, ##args)
 #define btrfs_info_rl(fs_info, fmt, args...) \
btrfs_printk_ratelimited(fs_info, KERN_INFO fmt, ##args)
-#ifdef DEBUG
+
+#if defined(CONFIG_DYNAMIC_DEBUG)
+#define btrfs_debug(fs_info, fmt, args...) \
+do {   \
+   DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+   if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))  \
+   btrfs_printk(fs_info, KERN_DEBUG fmt, ##args);  \
+} while (0)
+#define btrfs_debug_in_rcu(fs_info, fmt, args...)  \
+do {   \
+   DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+   if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))  \
+   btrfs_printk_in_rcu(fs_info, KERN_DEBUG fmt, ##args);   \
+} while (0)
+#define btrfs_debug_rl_in_rcu(fs_info, fmt, args...)   \
+do {   \
+   DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+   if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))  \
+   btrfs_printk_rl_in_rcu(fs_info, KERN_DEBUG fmt, \
+   ##args);\
+} while (0)
+#define btrfs_debug_rl(fs_info, fmt, args...)  \
+do {   \
+   DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt); \
+   if (unlikely(descriptor.flags & _DPRINTK_FLAGS_PRINT))  \
+   btrfs_printk_ratelimited(fs_info, KERN_DEBUG fmt,   \
+##args);   \
+} while (0)
+#elif defined(DEBUG)
 #define btrfs_debug(fs_info, fmt, args...) \
btrfs_printk(fs_info, KERN_DEBUG fmt, ##args)
 #define btrfs_debug_in_rcu(fs_info, fmt, args...) \
-- 
2.7.1
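
For readers unfamiliar with the mechanism, the following is a rough userspace
sketch of the idea behind the new btrfs_debug() macros: each message is
guarded by a descriptor whose print flag can be flipped at run time.  All
names here (sketch_ddebug, SKETCH_FLAGS_PRINT, sketch_printk, sketch_debug)
are hypothetical stand-ins, not the kernel's types, and the descriptor is
passed explicitly, whereas the kernel macro defines one statically per call
site.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's print flag. */
#define SKETCH_FLAGS_PRINT 0x1u

/* Hypothetical per-call-site descriptor; the kernel's struct differs. */
struct sketch_ddebug {
	const char *format;
	unsigned int flags;
};

/* Counts emitted messages so callers can observe the gating. */
int sketch_messages_printed;

/* Prefixes each message with the file system name, like btrfs_printk. */
void sketch_printk(const char *fs_name, const char *fmt, ...)
{
	va_list ap;

	assert(fs_name != NULL);
	printf("sketch (%s): ", fs_name);
	va_start(ap, fmt);
	vprintf(fmt, ap);
	va_end(ap);
	putchar('\n');
	sketch_messages_printed++;
}

/* Only format and print when the descriptor's print flag is set. */
#define sketch_debug(desc, fs_name, ...)				\
do {									\
	if ((desc)->flags & SKETCH_FLAGS_PRINT)				\
		sketch_printk((fs_name), __VA_ARGS__);			\
} while (0)
```

With the flag clear, a sketch_debug() call costs one branch and prints
nothing; setting SKETCH_FLAGS_PRINT on the descriptor turns that one call
site on without rebuilding.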



[PATCH 0/5] btrfs: printing cleanup patchset

2016-09-20 Thread jeffm
From: Jeff Mahoney 

This is a patchset I've been working on to clean up message printing,
make it adhere to kernel style, and be more consistent.

The end result is that we:
* use dynamic debugging for debugging messages
* merge strings that exceed 80 characters into a single greppable string
* convert printk calls to btrfs_{warn,info,err,debug,etc} calls where it
  makes sense.
* dump the ad-hoc verbose_printk garbage in send

The exception to this is check-integrity since it has a ton of messages
and it also has its own mask mechanism.  I wanted to discuss if we wanted
to find another solution to that and, if so, how we want to move forward
there.

Dave, this will probably conflict with the fsinfo patchset, so please
advise on which you want to land first.

-Jeff

Jeff Mahoney (5):
  btrfs: add dynamic debug support
  btrfs: unsplit printed strings
  btrfs: convert printk(KERN_* to use pr_* calls
  btrfs: convert pr_* to btrfs_* where possible
  btrfs: convert send's verbose_printk to btrfs_debug

 fs/btrfs/backref.c  |  48 ---
 fs/btrfs/check-integrity.c  | 335 ++--
 fs/btrfs/compression.c  |   6 +-
 fs/btrfs/ctree.c|  12 +-
 fs/btrfs/ctree.h|  39 +-
 fs/btrfs/delayed-inode.c|  17 +--
 fs/btrfs/delayed-ref.c  |   9 +-
 fs/btrfs/dev-replace.c  |  21 +--
 fs/btrfs/dir-item.c |   7 +-
 fs/btrfs/disk-io.c  |  98 ++---
 fs/btrfs/extent-tree.c  | 106 +++---
 fs/btrfs/extent_io.c|  93 ++--
 fs/btrfs/free-space-cache.c |  21 +--
 fs/btrfs/free-space-cache.h |   6 +-
 fs/btrfs/free-space-tree.c  |  14 +-
 fs/btrfs/inode-map.c|  31 ++--
 fs/btrfs/inode.c|  26 ++--
 fs/btrfs/ioctl.c|  14 +-
 fs/btrfs/lzo.c  |   6 +-
 fs/btrfs/ordered-data.c |   4 +-
 fs/btrfs/print-tree.c   |  86 +---
 fs/btrfs/qgroup.c   |  22 +--
 fs/btrfs/reada.c|  32 ++---
 fs/btrfs/relocation.c   |  16 ++-
 fs/btrfs/root-tree.c|  18 +--
 fs/btrfs/scrub.c|  58 
 fs/btrfs/send.c |  71 +-
 fs/btrfs/super.c|  60 
 fs/btrfs/sysfs.c|   8 +-
 fs/btrfs/transaction.c  |  20 +--
 fs/btrfs/transaction.h  |   1 +
 fs/btrfs/tree-log.c |   8 +-
 fs/btrfs/uuid-tree.c|  27 ++--
 fs/btrfs/volumes.c  | 131 +
 fs/btrfs/volumes.h  |   2 +-
 fs/btrfs/zlib.c |   8 +-
 36 files changed, 719 insertions(+), 762 deletions(-)

-- 
2.7.1



Re: [PATCH] Btrfs: add error handling for extent buffer in print tree

2016-09-20 Thread David Sterba
On Wed, Sep 14, 2016 at 05:23:39PM -0700, Liu Bo wrote:
> Somehow we missed btrfs_print_tree the last time we
> updated error handling for read_extent_block().
> 
> This keeps us from getting a NULL pointer panic when
> btrfs_print_tree's read_extent_block() fails.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 


Re: [PATCH] Btrfs: memset to avoid stale content in btree node block

2016-09-20 Thread David Sterba
On Wed, Sep 14, 2016 at 05:22:57PM -0700, Liu Bo wrote:
> While updating the btree, we may push items between sibling
> nodes/leaves.  For leaves, the data section grows backwards from
> the end of the block, while for nodes we only have key pairs,
> which are stored one by one from the start of the block.
> 
> So we may push key pairs from one node to the next node to its
> right in the tree, and after that we update the node's
> nritems to reflect the correct end while leaving the stale
> content in the node.  One may intentionally corrupt the fs
> image and access the stale content by bumping nritems, which
> causes various crashes.
> 
> This takes the in-memory @nritems as the correct one and
> memsets the unused part of a btree node.
> 
> Signed-off-by: Liu Bo 

Reviewed-by: David Sterba 

> ---
>  fs/btrfs/extent_io.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c2325c3..56c9dee 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3732,6 +3732,17 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>   if (btrfs_header_owner(eb) == BTRFS_TREE_LOG_OBJECTID)
>   bio_flags = EXTENT_BIO_TREE_LOG;
>  
> + /* set btree node beyond nritems with 0 to avoid stale content */
> + if (btrfs_header_level(eb) > 0) {

We can do the same for leaves.
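
As an illustration of the node case only, here is a small userspace sketch of
clearing everything past the last used key pointer.  The block size, header
size, and key-pointer size are made-up constants and sketch_node is a
hypothetical type; the real patch works on extent buffers, which this sketch
does not model.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes, chosen only to keep the sketch small. */
#define SKETCH_NODE_SIZE	256
#define SKETCH_HDR_SIZE		16
#define SKETCH_KEY_PTR_SIZE	24

struct sketch_node {
	unsigned char bytes[SKETCH_NODE_SIZE];
};

/*
 * Zero the part of the block that lies beyond the header and the
 * nritems key pointers actually in use.  Returns the number of bytes
 * cleared.
 */
size_t sketch_clear_unused_key_ptrs(struct sketch_node *node, size_t nritems)
{
	size_t used = SKETCH_HDR_SIZE + nritems * SKETCH_KEY_PTR_SIZE;

	assert(used <= SKETCH_NODE_SIZE);
	memset(node->bytes + used, 0, SKETCH_NODE_SIZE - used);
	return SKETCH_NODE_SIZE - used;
}
```

A leaf would need a different range, since item headers grow from the front
and item data from the back, leaving the unused gap in the middle.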


Re: stat(2) returning device ID not existing in mountinfo

2016-09-20 Thread Jeff Mahoney
On 9/16/16 4:28 PM, Tomasz Sterna wrote:
> Hi.
> 
> I have spotted an issue with stat(2) call on files on btrfs.
> It is giving me dev_t st_dev number that does not correspond to any
> mounted filesystem in proc's mountinfo.

That's by design.  Your particular file system may only use one device
but, internally, btrfs uses virtualized storage that may be spread
across multiple devices.  To make things more complicated, snapshots
mean that:

sled1a:/mnt # btrfs sub list .
ID 257 gen 14 top level 5 path a
ID 258 gen 14 top level 5 path b

sled1a:/mnt # ls -laRi
.:
total 16
256 drwxr-xr-x 1 root root   4 Sep 20 09:08 .
256 drwxr-xr-x 1 root root 220 Sep 16 09:49 ..
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 a
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 b

./a:
total 4112
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 .
256 drwxr-xr-x 1 root root   4 Sep 20 09:08 ..
257 -rw-r--r-- 1 root root 4194304 Sep 14 10:24 file

./b:
total 4112
256 drwxr-xr-x 1 root root   8 Sep 14 10:24 .
256 drwxr-xr-x 1 root root   4 Sep 20 09:08 ..
257 -rw-r--r-- 1 root root 4194304 Sep 14 10:24 file

Under normal circumstances those are two files with the same st_dev and
the same inode number.  That would normally correspond to a hard link,
but the files do not (necessarily) correspond to the same file.

... but because we use anonymous device numbers for each subvolume, we
have different device numbers for each one.

sled1a:/mnt # stat --format "%n st_dev=%d" {a,b}/file
a/file st_dev=69
b/file st_dev=70
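
The same comparison can be done from a program.  This is a minimal sketch
using only stat(2); the helper name same_file_identity is made up for
illustration.  It treats two paths as the same file only when both st_dev and
st_ino match, which is the test that the per-subvolume device numbers keep
working.

```c
#include <assert.h>
#include <sys/stat.h>

/*
 * Returns 1 if both paths resolve to the same (st_dev, st_ino) pair,
 * 0 if they differ, and -1 if either stat() call fails.
 */
int same_file_identity(const char *path_a, const char *path_b)
{
	struct stat a, b;

	assert(path_a && path_b);
	if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
		return -1;
	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

With the st_dev values shown above (69 and 70), a/file and b/file would
compare as different files even though both report inode 257.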

It's a pretty big usability wart that we don't consistently report the
device number.  We do it correctly in stat() but there are other places
in the code that assume that inode->i_sb->s_dev will work.  In the SUSE
kernels, we have patches that add a super_operation to report the
correct device number everywhere, but even that is a hack.

> I already attempted an ill-informed patch in fs/btrfs/super.c:
> 
> @@ -1127,6 +1127,7 @@ static int btrfs_fill_super(struct super_block *sb,
>   goto fail_close;
>   }
>  
> + sb->s_dev = inode->i_sb->s_dev;
>   sb->s_root = d_make_root(inode);
>   if (!sb->s_root) {
>   err = -ENOMEM;
> 
> but it didn't help.

It wouldn't.  That is assigning a variable to itself.

> I would like to dig deeper and fix it, but first I have to ask:
> - Which number is wrong?
>   The one returned by stat() or the one in mountinfo?

The one in mountinfo, but then that means that the user only sees the
anonymous devices in mount(8), which isn't what we want either.

I'm afraid the correct fix is very involved and requires non-trivial
changes in the VFS layer as well.  It's on my long-term TODO list.  I
currently have some patches that do the magic with vfsmounts but it's
far from being usable.

-Jeff

-- 
Jeff Mahoney
SUSE Labs





Re: Fs: Btrfs - Fix possible ERR_PTR() dereferencing.

2016-09-20 Thread Jeff Mahoney
On 9/20/16 2:48 AM, Shailendra Verma wrote:
> This is of course wrong to call kfree() if memdup_user() fails,
> no memory was allocated and the error in the error-valued pointer
> should be returned.
> 
> Reviewed-by: Ravikant Sharma 
> Signed-off-by: Shailendra Verma 

Hi Shailendra -

In all three cases, the return value is set using the error-valued
pointer and the pointer is set to NULL.  kfree() checks to see if the
pointer is NULL and, if so, does nothing.  This allows us to use a
common exit path which is an extremely common pattern in the kernel.  So
there's never any possible ERR_PTR dereferencing happening.

However, in all three cases, the allocation you're checking is the first
in each routine and there's no additional cleanup to do.  So your patch
is an improvement, but it's an improvement in code readability instead
of a bug fix.  I'd ask that you re-submit with a commit message that
reflects that.
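
To show the two shapes side by side, here is a userspace sketch.  The
sketch_* helpers are simplified stand-ins for ERR_PTR(), PTR_ERR(), IS_ERR()
and memdup_user(), not the kernel implementations, and free() plays the role
of kfree().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-ins for the kernel's error-pointer helpers. */
#define SKETCH_MAX_ERRNO 4095

void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

long sketch_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical memdup_user() replacement: copies len bytes or fails. */
void *sketch_memdup(const void *src, size_t len)
{
	void *p;

	assert(len > 0);
	if (!src)
		return sketch_err_ptr(-EFAULT);
	p = malloc(len);
	if (!p)
		return sketch_err_ptr(-ENOMEM);
	memcpy(p, src, len);
	return p;
}

/* Existing shape: reset to NULL, then share the common exit path. */
long sketch_ioctl_common_exit(const void *arg, size_t len)
{
	long ret = 0;
	void *buf = sketch_memdup(arg, len);

	if (sketch_is_err(buf)) {
		ret = sketch_ptr_err(buf);
		buf = NULL;	/* free(NULL) below is a no-op */
		goto out;
	}
	/* ... real work would go here ... */
out:
	free(buf);
	return ret;
}

/* Proposed shape: nothing to clean up yet, so return directly. */
long sketch_ioctl_early_return(const void *arg, size_t len)
{
	void *buf = sketch_memdup(arg, len);

	if (sketch_is_err(buf))
		return sketch_ptr_err(buf);
	/* ... real work would go here ... */
	free(buf);
	return 0;
}
```

Both functions return the same value for every input and neither ever passes
an error-valued pointer to free(); the second is just shorter.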

Thanks,

-Jeff

> ---
>  fs/btrfs/ioctl.c | 21 ++---
>  1 file changed, 6 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index da94138..58a40f8 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -4512,11 +4512,8 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_root *root,
>   return -EPERM;
>  
>   loi = memdup_user(arg, sizeof(*loi));
> - if (IS_ERR(loi)) {
> - ret = PTR_ERR(loi);
> - loi = NULL;
> - goto out;
> - }
> + if (IS_ERR(loi))
> + return PTR_ERR(loi);
>  
>   path = btrfs_alloc_path();
>   if (!path) {


> @@ -5143,11 +5140,8 @@ static long btrfs_ioctl_set_received_subvol_32(struct file *file,
>   int ret = 0;
>  
>   args32 = memdup_user(arg, sizeof(*args32));
> - if (IS_ERR(args32)) {
> - ret = PTR_ERR(args32);
> - args32 = NULL;
> - goto out;
> - }
> + if (IS_ERR(args32))
> + return PTR_ERR(args32);
>  
>   args64 = kmalloc(sizeof(*args64), GFP_NOFS);
>   if (!args64) {
> @@ -5195,11 +5189,8 @@ static long btrfs_ioctl_set_received_subvol(struct file *file,
>   int ret = 0;
>  
>   sa = memdup_user(arg, sizeof(*sa));
> - if (IS_ERR(sa)) {
> - ret = PTR_ERR(sa);
> - sa = NULL;
> - goto out;
> - }
> + if (IS_ERR(sa))
> + return PTR_ERR(sa);
>  
>   ret = _btrfs_ioctl_set_received_subvol(file, sa);
>  
> 


-- 
Jeff Mahoney
SUSE Labs





[PATCH] btrfs: clean the old superblocks before freeing the device

2016-09-20 Thread jeffm
From: Jeff Mahoney 

btrfs_rm_device frees the block device but then re-opens it using
the saved device name.  A race exists between the close and the
re-open that allows the block size to be changed.  The result
is getting stuck forever in the reclaim loop in __getblk_slow.

This patch moves the superblock cleanup before closing the block
device, which is also consistent with other callers.  We also don't
need a private copy of dev_name as the whole routine operates under
the uuid_mutex.

Signed-off-by: Jeff Mahoney 
---
 fs/btrfs/volumes.c | 38 +++---
 1 file changed, 11 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c41f8c1..3adf5ce 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1846,7 +1846,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
u64 num_devices;
int ret = 0;
bool clear_super = false;
-   char *dev_name = NULL;
 
mutex_lock(&uuid_mutex);
 
@@ -1882,11 +1881,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
list_del_init(&device->dev_alloc_list);
device->fs_devices->rw_devices--;
unlock_chunks(root);
-   dev_name = kstrdup(device->name->str, GFP_KERNEL);
-   if (!dev_name) {
-   ret = -ENOMEM;
-   goto error_undo;
-   }
clear_super = true;
}
 
@@ -1936,14 +1930,21 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
btrfs_sysfs_rm_device_link(root->fs_info->fs_devices, device);
}
 
-   btrfs_close_bdev(device);
-
-   call_rcu(&device->rcu, free_device);
-
num_devices = btrfs_super_num_devices(root->fs_info->super_copy) - 1;
btrfs_set_super_num_devices(root->fs_info->super_copy, num_devices);
mutex_unlock(&root->fs_info->fs_devices->device_list_mutex);
 
+   /*
+* at this point, the device is zero sized and detached from
+* the devices list.  All that's left is to zero out the old
+* supers and free the device.
+*/
+   if (device->writeable)
+   btrfs_scratch_superblocks(device->bdev, device->name->str);
+
+   btrfs_close_bdev(device);
+   call_rcu(&device->rcu, free_device);
+
if (cur_devices->open_devices == 0) {
struct btrfs_fs_devices *fs_devices;
fs_devices = root->fs_info->fs_devices;
@@ -1962,24 +1963,7 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path, u64 devid)
root->fs_info->num_tolerated_disk_barrier_failures =
btrfs_calc_num_tolerated_disk_barrier_failures(root->fs_info);
 
-   /*
-* at this point, the device is zero sized.  We want to
-* remove it from the devices list and zero out the old super
-*/
-   if (clear_super) {
-   struct block_device *bdev;
-
-   bdev = blkdev_get_by_path(dev_name, FMODE_READ | FMODE_EXCL,
-   root->fs_info->bdev_holder);
-   if (!IS_ERR(bdev)) {
-   btrfs_scratch_superblocks(bdev, dev_name);
-   blkdev_put(bdev, FMODE_READ | FMODE_EXCL);
-   }
-   }
-
 out:
-   kfree(dev_name);
-
mutex_unlock(&uuid_mutex);
return ret;
 
-- 
2.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is stability a joke? (wiki updated)

2016-09-20 Thread Austin S. Hemmelgarn

On 2016-09-19 16:15, Zygo Blaxell wrote:

On Mon, Sep 19, 2016 at 01:38:36PM -0400, Austin S. Hemmelgarn wrote:

I'm not sure if btrfsck is really all that helpful to users, as much
as it is for developers to better learn about the failure vectors of
the file system.


ReiserFS had no working fsck for all of the 8 years I used it (and still
didn't last year when I tried to use it on an old disk).  "Not working"
here means "much less data is readable from the filesystem after running
fsck than before."  It's not that much of an inconvenience if you have
backups.

For a small array, this may be the case.  Once you start looking into double
digit TB scale arrays though, restoring backups becomes a very expensive
operation.  If you had a multi-PB array with a single dentry which had no
inode, would you rather be spending multiple days restoring files and
possibly losing recent changes, or spend a few hours to check the filesystem
and fix it with minimal data loss?


I'd really prefer to be able to delete the dead dentry with 'rm' as root,
or failing that, with a ZDB-like tool or ioctl, if it's the only known
instance of such a bad metadata object and I already know where it's
located.
I entirely agree on that.  The problem is that because the VFS layer
chokes on it, it can't be removed with rm, and it would be non-trivial to
implement as an ioctl.  It pretty much has to be out-of-band.  I'd love to see
btrfs check add the ability to process subsets of the filesystem (for 
example 'I know that something is screwed up somehow in 
/path/to/random/directory, check only that path in the filesystem 
(possibly recursively) and tell me what's wrong (and possibly try to fix 
it)').


Usually the ultimate failure mode of a btrfs filesystem is a read-only
filesystem from which you can read most or all of your data, but you
can't ever make it writable again because of fsck limitations.

The one thing I do miss about every filesystem that isn't ext2/ext3 is
automated fsck that prioritizes availability, making the filesystem
safely writable even if it can't recover lost data.  On the other
hand, fixing an ext[23] filesystem is utterly trivial compared to any
btree-based filesystem.
For a data center or corporate entity, dropping broken parts of the FS 
and recovering from backups makes sense.  For a traditional home user 
(that is, the type of person Ubuntu and Windows traditionally target), 
it usually doesn't, as they almost certainly don't have a backup. 
Personally, I'd rather have a tool that gives me the option of whether 
to try and fix a given path or just remove it, instead of assuming that 
it knows how I want to fix it.  That would allow for both use cases.




Re: State of the fuzzer

2016-09-20 Thread Lukas Lueg
There are now 21 bugs open on bko, most of them crashes and some
undefined behavior. The nodes are now mostly running idle as no new
paths are discovered (after around one billion images tested in the
current run).

My thoughts are to wait until the current bugs have been fixed, then
restart the whole process from HEAD (together with the corpus of
~2,000 seed images discovered by now) and catch new bugs and abort()s
- we need to get rid of the reachable ones so code coverage can
improve. After those, I'll change the process to run btrfsck --repair,
which is slower but has a lot of as-yet-uncovered code.

DigitalOcean has provided some funding for this undertaking so we are
good on CPU power. Kudos to them.

2016-09-13 22:28 GMT+02:00 Lukas Lueg :
> I've booted another instance with btrfs-progs checked out to 2b7c507
> and collected some bugs which remained from the run before the current
> one. The current run discovered what qgroups are just three days ago
> and will spend some time on that. I've also added UBSAN- and
> MSAN-logging to my setup and there were three bugs found so far (one
> is already fixed). I will boot a third instance to run lowmem-mode
> exclusively in the next few days.
>
> There are 11 bugs open at the moment, all have a reproducing image
> attached to them. The whole list is at
>
> https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=btrfs&email1=lukas.lueg%40gmail.com&emailreporter1=1&emailtype1=exact&list_id=858441&query_format=advanced
>
>
> 2016-09-09 16:00 GMT+02:00 David Sterba :
>> On Tue, Sep 06, 2016 at 10:32:28PM +0200, Lukas Lueg wrote:
>>> I'm currently fuzzing rev 2076992 and things start to slowly, slowly
>>> quiet down. We will probably run out of steam at the end of the week
>>> when a total of (roughly) half a billion BTRFS-images have passed by.
>>> I will switch revisions to current HEAD and restart the whole process
>>> then. A few things:
>>>
>>> * There are a couple of crashes (mostly segfaults) I have not reported
>>> yet. I'll report them if they show up again with the latest revision.
>>
>> Ok.
>>
>>> * The coverage-analysis shows assertion failures which are currently
>>> silenced. An assertion failure is technically a worse disaster
>>> successfully prevented, it still constitutes unexpected/unusable
>>> behaviour, though. Do you want assertions to be enabled and images
>>> triggering those assertions reported? This is basically the same
>>> conundrum as with BUG_ON and abort().
>>
>> Yes please. I'd like to turn most bugons/assertions into a normal
>> failure report if it would make sense.
>>
>>> * A few endless loops entered into by btrfsck are currently
>>> unmitigated (see bugs 155621, 155571, 11 and 155151). It would be
>>> nice if those had been taken care of by next week if possible.
>>
>> Two of them are fixed, the other two need more work, updating all
>> callers of read_node_slot and the callchain. So you may still see that
>> kind of looping in more images. I don't have an ETA for the fix, I won't
>> be available during the next week.
>>
>> At the moment, the initial sanity checks should catch most of the
>> corrupted values, so I'm expecting that you'll see different classes of
>> problems in the next rounds.
>>
>> The testsuite now contains all images that you reported and we have a
>> fix in git. There are more utilities run on the images, there may be
>> more problems for us to fix.


ChaCha20 vs. AES performance

2016-09-20 Thread Kent Overstreet
Not on the list or I would've replied directly, but on Haswell, ChaCha20 (in
software) is over 2x as fast as AES (in hardware), at realistic (for a
filesystem) block sizes:
 
testing speed of ctr(aes) (ctr(aes-aesni)) decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 378 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 1130 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 3981 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 15458 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 122880 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 391 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 1193 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 4212 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 16388 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 131029 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 417 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 1222 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 4398 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 17114 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 137028 cycles (8192 bytes)

testing speed of chacha20 (chacha20-simd) encryption
test 0 (256 bit key, 16 byte blocks): 1 operation in 4356 cycles (16 bytes)
test 1 (256 bit key, 64 byte blocks): 1 operation in 4004 cycles (64 bytes)
test 2 (256 bit key, 256 byte blocks): 1 operation in 6524 cycles (256 bytes)
test 3 (256 bit key, 1024 byte blocks): 1 operation in 9248 cycles (1024 bytes)
test 4 (256 bit key, 8192 byte blocks): 1 operation in 60274 cycles (8192 bytes)

Poly1305 is also plenty fast:

testing speed of gcm(aes) (gcm_base(ctr-aes-aesni,ghash-generic)) encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 7567 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 9654 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 19010 cycles (256 bytes)
test 3 (128 bit key, 512 byte blocks): 1 operation in 33118 cycles (512 bytes)
test 4 (128 bit key, 1024 byte blocks): 1 operation in 59738 cycles (1024 bytes)
test 5 (128 bit key, 2048 byte blocks): 1 operation in 106545 cycles (2048 bytes)
test 6 (128 bit key, 4096 byte blocks): 1 operation in 211189 cycles (4096 bytes)
test 7 (128 bit key, 8192 byte blocks): 1 operation in 370439 cycles (8192 bytes)
test 8 (192 bit key, 16 byte blocks): 1 operation in 6780 cycles (16 bytes)
test 9 (192 bit key, 64 byte blocks): 1 operation in 8802 cycles (64 bytes)
test 10 (192 bit key, 256 byte blocks): 1 operation in 17352 cycles (256 bytes)
test 11 (192 bit key, 512 byte blocks): 1 operation in 28680 cycles (512 bytes)
test 12 (192 bit key, 1024 byte blocks): 1 operation in 51230 cycles (1024 bytes)
test 13 (192 bit key, 2048 byte blocks): 1 operation in 96662 cycles (2048 bytes)
test 14 (192 bit key, 4096 byte blocks): 1 operation in 187287 cycles (4096 bytes)
test 15 (192 bit key, 8192 byte blocks): 1 operation in 372570 cycles (8192 bytes)
test 16 (256 bit key, 16 byte blocks): 1 operation in 6273 cycles (16 bytes)
test 17 (256 bit key, 64 byte blocks): 1 operation in 8096 cycles (64 bytes)
test 18 (256 bit key, 256 byte blocks): 1 operation in 15895 cycles (256 bytes)
test 19 (256 bit key, 512 byte blocks): 1 operation in 26259 cycles (512 bytes)
test 20 (256 bit key, 1024 byte blocks): 1 operation in 47121 cycles (1024 bytes)
test 21 (256 bit key, 2048 byte blocks): 1 operation in 91003 cycles (2048 bytes)
test 22 (256 bit key, 4096 byte blocks): 1 operation in 175883 cycles (4096 bytes)
test 23 (256 bit key, 8192 byte blocks): 1 operation in 340904 cycles (8192 bytes)

testing speed of rfc7539esp(chacha20,poly1305) (rfc7539esp(chacha20-simd,poly1305-simd)) encryption
test 0 (288 bit key, 16 byte blocks): 1 operation in 12145 cycles (16 bytes)
test 1 (288 bit key, 64 byte blocks): 1 operation in 14538 cycles (64 bytes)
test 2 (288 bit key, 256 byte blocks): 1 operation in 16435 cycles (256 bytes)
test 3 (288 bit key, 512 byte blocks): 1 operation in 15622 cycles (512 bytes)
test 4 (288 bit key, 1024 byte blocks): 1 operation in 18671 cycles (1024 bytes)
test 5 (288 bit key, 2048 byte blocks): 1 operation in 23264 cycles (2048 bytes)
test 6 (288 bit key, 4096 byte blocks): 1 operation in 36480 cycles (4096 bytes)
test 7 (288 bit key, 8192 byte blocks): 1 operation in 75051 cycles (8192 bytes)

When AVX-512 comes out ChaCha20 is going to get even faster - probably by more
than 2x, since they're adding a rotate instruction. I haven't tested on ARM but
I'd be surprised if the situation is significantly different there (the kernel's
lacking a NEON ChaCha20 implementation, but I could do one).
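
To make the comparison easier to read, the figures above can be reduced to cycles per byte. The following is only a sketch: it uses the 8192-byte rows quoted above (256-bit keys for AES, the 288-bit row for rfc7539esp), and the dictionary and function names are mine, not part of tcrypt. Note that the AES-CTR run reports decryption while the others report encryption, and that the GCM run uses ghash-generic according to the driver name shown, so the AEAD ratio should be read with that in mind.

```python
# Cycle counts copied from the 8192-byte rows of the tcrypt output above.
BLOCK_BYTES = 8192
cycles = {
    "ctr(aes-aesni)": 137028,                  # test 14, 256 bit key
    "chacha20-simd": 60274,                    # test 4, 256 bit key
    "gcm(aes)": 340904,                        # test 23, 256 bit key
    "rfc7539esp(chacha20,poly1305)": 75051,    # test 7, 288 bit key
}


def cycles_per_byte(total_cycles, nbytes=BLOCK_BYTES):
    """Average cost of one byte within a single operation."""
    return total_cycles / nbytes


for name, total in cycles.items():
    print(f"{name}: {cycles_per_byte(total):.2f} cycles/byte")

# Ratios at 8192-byte blocks; values above 1 mean the ChaCha20 variant is faster.
stream_ratio = cycles["ctr(aes-aesni)"] / cycles["chacha20-simd"]
aead_ratio = cycles["gcm(aes)"] / cycles["rfc7539esp(chacha20,poly1305)"]
print(f"stream cipher ratio: {stream_ratio:.2f}x, AEAD ratio: {aead_ratio:.2f}x")
```

The advantage only holds at larger block sizes: in the 16-byte rows above, AES-CTR needs 417 cycles against 4356 for ChaCha20.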

Just because it's implemented in hardware doesn't mean i

Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
Output from my nightly balance script for my 15 TB Raid 1 btrfs pool
(3x 3TB + 1x 6TB) with ~100 snapshots:

Before balance of /media/RAID
Data, RAID1: total=5.57TiB, used=5.45TiB
System, RAID1: total=32.00MiB, used=832.00KiB
Metadata, RAID1: total=7.00GiB, used=6.03GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde        7.6T  6.1T  1.5T  81% /media/RAID
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=1
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=5
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=10
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=20
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=30
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=40
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x1, state 0x0, force is off
  DATA (flags 0x2): balancing, usage=50
Done, had to relocate 0 out of 5710 chunks
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=1
  SYSTEM (flags 0x2): balancing, usage=1
Done, had to relocate 0 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=5
  SYSTEM (flags 0x2): balancing, usage=5
Done, had to relocate 1 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=10
  SYSTEM (flags 0x2): balancing, usage=10
Done, had to relocate 1 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=20
  SYSTEM (flags 0x2): balancing, usage=20
Done, had to relocate 1 out of 5710 chunks
Dumping filters: flags 0x6, state 0x0, force is off
  METADATA (flags 0x2): balancing, usage=30
  SYSTEM (flags 0x2): balancing, usage=30
Done, had to relocate 1 out of 5710 chunks
After balance of /media/RAID
Data, RAID1: total=5.57TiB, used=5.45TiB
System, RAID1: total=32.00MiB, used=832.00KiB
Metadata, RAID1: total=7.00GiB, used=6.03GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde        7.6T  6.1T  1.5T  81% /media/RAID


It effectively reduces the internal fragmentation (to 0.12 TiB of data and
~1 GiB of metadata).
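
The two figures in parentheses are the gap between `total` (allocated) and `used` in the `btrfs fi df` lines above. A small sketch of that arithmetic follows; the helper name and the regular expression are my own, and it assumes the `total=..., used=...` line layout shown in the output.

```python
import re

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def slack_bytes(line):
    """Return allocated-but-unused bytes for one 'btrfs fi df' style line."""
    m = re.search(r"total=([\d.]+)(\w+), used=([\d.]+)(\w+)", line)
    total = float(m.group(1)) * UNITS[m.group(2)]
    used = float(m.group(3)) * UNITS[m.group(4)]
    return total - used


# Lines copied from the "After balance" output above.
data = slack_bytes("Data, RAID1: total=5.57TiB, used=5.45TiB")
meta = slack_bytes("Metadata, RAID1: total=7.00GiB, used=6.03GiB")
print(f"data slack: {data / 2**40:.2f} TiB, metadata slack: {meta / 2**30:.2f} GiB")
```

In this particular run the before and after figures are identical and at most one chunk was relocated per step, so the numbers describe the state the nightly script maintains rather than a change made by this one run.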

2016-09-20 10:59 GMT+02:00 Peter Becker :
> 2016-09-20 10:48 GMT+02:00 Hugo Mills :
>> On Tue, Sep 20, 2016 at 10:34:49AM +0200, Peter Becker wrote:
>>> More details on the issue and a complete explanation you can find here:
>>>
>>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>>> and
>>> (Help! I ran out of disk space! )
>>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>>>
>>> And an explanation for the "dlimit" solution:
>>
>>It's not "dlimit". It's "d" with option "limit". You could just as
>> easily write -dusage=99,limit=10 or -dlimit=10,usage=99 (although
>> those aren't the options I'd pick... see below).
>>
>>> Quote From: Uncommon solutions for BTRFS
>>> (http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)
>>>
>>> > For my purposes, I define internal fragmentation as space allocated but 
>>> > not usable by the filesystem. In BTRFS, each time you delete files, the 
>>> > space used by those files cannot be reused for new files automatically.
>>> > It's not a hard requirement to do this maintenance regularly, but doing 
>>> > it regularly spares you waiting for hours when the disk is full and you 
>>> > need to wait for a balance clean up command - and of course also reduces 
>>> > the number of times you get unexpected disk full errors. As a side
>>> > note, this can also be useful to prolong the life of your SSD because it 
>>> > allows the SSD to reuse space not needed by the filesystem (although 
>>> > there is a trade-off, frequent balancing is bad, no balancing is bad, the 
>>> > sweet spot is somewhere in between).
>>>
>>> 2016-09-20 10:20 GMT+02:00 Peter Becker :
>>> > Normally total and used should differ by only a few GB.
>>> > Depending on your write workload you should run
>>> >
>>> > btrfs balance start -dusage=60 /mnt
>>> >
>>> > every week to avoid "ENOSPC"
>>> >
>>> > If you use a newer btrfs-progs that supports balance limit filters, you
>>> > should run
>>> >
>>> > btrfs balance start -dusage=99 -dlimit=10 /mnt
>>> >
>>> > every 3 hours.
>>
>>These two options both feel horrible to me. Particularly the second
>> option, which is going to result in huge write load on the FS, and is
>> almost

Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
2016-09-20 10:48 GMT+02:00 Hugo Mills :
> On Tue, Sep 20, 2016 at 10:34:49AM +0200, Peter Becker wrote:
>> More details on the issue and a complete explanation you can find here:
>>
>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>> and
>> (Help! I ran out of disk space! )
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>>
>> And an explanation for the "dlimit" solution:
>
>It's not "dlimit". It's "d" with option "limit". You could just as
> easily write -dusage=99,limit=10 or -dlimit=10,usage=99 (although
> those aren't the options I'd pick... see below).
>
>> Quote From: Uncommon solutions for BTRFS
>> (http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)
>>
>> > For my purposes, I define internal fragmentation as space allocated but 
>> > not usable by the filesystem. In BTRFS, each time you delete files, the 
>> > space used by those files cannot be reused for new files automatically.
>> > It's not a hard requirement to do this maintenance regularly, but doing it 
>> > regularly spares you waiting for hours when the disk is full and you need 
>> > to wait for a balance clean up command - and of course also reduces the 
>> > number of times you get unexpected disk full errors. As a side note,
>> > this can also be useful to prolong the life of your SSD because it allows 
>> > the SSD to reuse space not needed by the filesystem (although there is a 
>> > trade-off, frequent balancing is bad, no balancing is bad, the sweet spot 
>> > is somewhere in between).
>>
>> 2016-09-20 10:20 GMT+02:00 Peter Becker :
>> > Normally total and used should differ by only a few GB.
>> > Depending on your write workload you should run
>> >
>> > btrfs balance start -dusage=60 /mnt
>> >
>> > every week to avoid "ENOSPC"
>> >
>> > If you use a newer btrfs-progs that supports balance limit filters, you
>> > should run
>> >
>> > btrfs balance start -dusage=99 -dlimit=10 /mnt
>> >
>> > every 3 hours.
>
>These two options both feel horrible to me. Particularly the second
> option, which is going to result in huge write load on the FS, and is
> almost certainly going to be unnecessary most of the time.

I took this from kdave's btrfs maintenance scripts, and it has worked for
me for a year. (https://github.com/kdave/btrfsmaintenance)

>My recommendation would be to check at regular intervals (daily,
> say) whether the used value is equal to the size value in btrfs fi
> show. If it is (and only if), then you should run a balance with no
> usage= option, and with limit=<n>, for some relatively small value of
> <n> (3, say). That will give you some unallocated space that the FS
> can take for metadata should it need it, which is all that's required
> to avoid early ENOSPC.

With no usage option, how do I avoid balancing full block groups?
-dusage=99 only balances block groups that still have some empty space.

>If you regularly find that your usage patterns result in large
> numbers of empty or near-empty block groups (i.e. lots of headroom in
> data shown by btrfs fi df), then a regular (but probably less
> frequent) balance with something like usage=5 should keep that down.
>
>> > This will balance 2 Blocks (dlimit=10; corresponds to 10 gb) with are
>
>No, it will balance 10 complete block groups, not 10 GiB. Depending
> on the RAID configuration, that could be a very large amount of data
> indeed. (For example, an 8-disk RAID-10 would be rewriting up to 80
> GiB of data with that command).

Thanks for this clarification.


Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
2016-09-20 10:30 GMT+02:00 Andrei Borzenkov :
> On Tue, Sep 20, 2016 at 11:20 AM, Peter Becker  wrote:
> I still do not understand where ENOSPC comes from in the first place.
> Filesystem is half empty. Do you suggest that it is normal to get
> ENOSPC in this case?

It's how the block allocator and the chunk allocator work together. As
far as I know, the developers have this "bug" on their to-do list.


Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Hugo Mills
On Tue, Sep 20, 2016 at 10:34:49AM +0200, Peter Becker wrote:
> More details on the issue and a complete explanation you can find here:
> 
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
> and
> (Help! I ran out of disk space! )
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
> 
> And an explanation for the "dlimit" solution:

   It's not "dlimit". It's "d" with option "limit". You could just as
easily write -dusage=99,limit=10 or -dlimit=10,usage=99 (although
those aren't the options I'd pick... see below).

> Quote From: Uncommon solutions for BTRFS
> (http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)
> 
> > For my purposes, I define internal fragmentation as space allocated but not 
> > usable by the filesystem. In BTRFS, each time you delete files, the space 
> > used by those files cannot be reused for new files automatically.
> > It's not a hard requirement to do this maintenance regularly, but doing it 
> > regularly spares you waiting for hours when the disk is full and you need 
> > to wait for a balance clean up command - and of course also reduces the 
> > number of times you get unexpected disk full errors. As a side note, this
> > can also be useful to prolong the life of your SSD because it allows the 
> > SSD to reuse space not needed by the filesystem (although there is a 
> > trade-off, frequent balancing is bad, no balancing is bad, the sweet spot 
> > is somewhere in between).
> 
> 2016-09-20 10:20 GMT+02:00 Peter Becker :
> > Normally total and used should differ by only a few GB.
> > Depending on your write workload you should run
> >
> > btrfs balance start -dusage=60 /mnt
> >
> > every week to avoid "ENOSPC"
> > 
> > If you use a newer btrfs-progs that supports balance limit filters, you should run
> >
> > btrfs balance start -dusage=99 -dlimit=10 /mnt
> >
> > every 3 hours.

   These two options both feel horrible to me. Particularly the second
option, which is going to result in huge write load on the FS, and is
almost certainly going to be unnecessary most of the time.

   My recommendation would be to check at regular intervals (daily,
say) whether the used value is equal to the size value in btrfs fi
show. If it is (and only if), then you should run a balance with no
usage= option, and with limit=<n>, for some relatively small value of
<n> (3, say). That will give you some unallocated space that the FS
can take for metadata should it need it, which is all that's required
to avoid early ENOSPC.
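
The check described above could be automated roughly as follows. This is only a sketch under stated assumptions: it relies on the `devid ... size ... used ... path ...` line layout that `btrfs fi show` prints, it treats the filesystem as fully allocated only when every device line shows used equal to size, the function names are hypothetical, the sample text is made up, and the command is built but not run.

```python
import re

# Matches device lines such as: "devid    1 size 2.73TiB used 2.73TiB path /dev/sda"
DEVID_LINE = re.compile(r"devid\s+\d+\s+size\s+(\S+)\s+used\s+(\S+)\s+path\s+(\S+)")


def fully_allocated(fi_show_output):
    """True if every device line reports used == size (no unallocated space)."""
    devices = DEVID_LINE.findall(fi_show_output)
    return bool(devices) and all(size == used for size, used, _path in devices)


def balance_command(mountpoint, limit=3):
    """Build (not run) a data balance with a limit filter and no usage filter."""
    return ["btrfs", "balance", "start", f"-dlimit={limit}", mountpoint]


# Made-up sample output, for illustration only.
sample = """\
Label: none  uuid: 00000000-0000-0000-0000-000000000000
\tTotal devices 2 FS bytes used 1.80TiB
\tdevid    1 size 2.73TiB used 2.73TiB path /dev/sda
\tdevid    2 size 2.73TiB used 2.73TiB path /dev/sdb
"""

if fully_allocated(sample):
    print(" ".join(balance_command("/mnt")))
```

Comparing the rounded human-readable strings is coarse; a real script would be better off comparing byte-exact values.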

   If you regularly find that your usage patterns result in large
numbers of empty or near-empty block groups (i.e. lots of headroom in
data shown by btrfs fi df), then a regular (but probably less
frequent) balance with something like usage=5 should keep that down.

> > This will balance 2 Blocks (dlimit=10; corresponds to 10 gb) with are

   No, it will balance 10 complete block groups, not 10 GiB. Depending
on the RAID configuration, that could be a very large amount of data
indeed. (For example, an 8-disk RAID-10 would be rewriting up to 80
GiB of data with that command).
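
The 80 GiB figure comes from counting raw bytes per block group. A back-of-the-envelope sketch, assuming each data block group takes a 1 GiB stripe on every device it spans (an assumption made here for illustration) and that a RAID-10 block group spans all 8 disks; the function name is hypothetical:

```python
GIB = 2**30


def raw_bytes_rewritten(block_groups, devices_per_group, stripe_bytes=GIB):
    """Upper bound on raw bytes a balance rewrites for the given block groups."""
    return block_groups * devices_per_group * stripe_bytes


# limit=10 on an 8-disk RAID-10: 10 block groups, each striped over 8 devices.
raid10 = raw_bytes_rewritten(block_groups=10, devices_per_group=8)
# The same limit on RAID-1, where each block group lives on 2 devices.
raid1 = raw_bytes_rewritten(block_groups=10, devices_per_group=2)
print(raid10 // GIB, "GiB vs", raid1 // GIB, "GiB")
```

This is an upper bound ("up to"), since block groups that are only partly full hold less data to move.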

   Hugo.

> > not filled full into new blocks. You could/should adjust the interval
> > and the limit filter depending on your write workload.
> > For example, if you write (changed files + new files) only 10GB a day, it
> > will be enough to run this every night.
> > The last option completely avoids the ENOSPC issue but produces additional
> > workload for your hard drives.
> >
> > Note: you should avoid making snapshots during balance. Use a simple
> > lock mechanism for that.

-- 
Hugo Mills | There isn't a noun that can't be verbed.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: stability matrix (was: Is stability a joke?)

2016-09-20 Thread David Sterba
On Mon, Sep 19, 2016 at 09:45:46PM +0200, Christoph Anton Mitterer wrote:
> +1 for all your changes with the following comments in addition...
> 
> 
> On Mon, 2016-09-19 at 17:27 +0200, David Sterba wrote:
> > That's more like a use case, which is out of the scope of the tabular
> > overview. But we have an existing page UseCases that I'd like to
> > transform to a more structured and complete overview of use cases of
> > various features, so the UUID collisions would build on top of that
> > with "and this could happen if ...".
> Well I don't agree here and see it basically like Austin.

So we'd have to make that two separate topics so the "what if" has
better visibility, and possibly marked "with security implications".

> It's not that these UUID collisions can only happen in special
> circumstances but plain normal situations that always used to work with
> probably literally each and every fs. (So much for the accidental
> corruptions).
> 
> And an attack is probably never "usecase dependant"... it always
> depends on the attacker.
> And since that seems to be a pretty real attack vector, I'd also say
> it's mandatory to quite clearly warn about that deficiency...
> 
> TBH, I'm rather surprised that this situation seems to be kinda
> "accepted".
> 
> I had a chat with CM recently and he implied things might be solved
> with encryption.
> While this is probably the case for at least some of the described
> problems, it rather seems like a workaround:
> - why making btrfs-encryption mandatory for devices who have partially
>   secured access (e.g. where a systemdisk with btrfs is not physically
>   accessible but a USB port is)
> - what about users that rather want to use block device encryption
>   instead of fs-level-encryption?
> 
> 
> > > - in-band dedupe
> > >   Deduped data is IIRC not bitwise compared by the kernel before
> > >   de-duping, as is the case with offline dedupe.
> > >   Even if this is considered safe by the community... I think users
> > >   should be told.
> > Only features merged are reflected. And the out-of-band dedupe does a
> > full memcmp. See btrfs_cmp_data() called from btrfs_extent_same().
> Ah,... I kinda thought it was already merged ... possibly got confused
> by the countless patch iterations of it ;)
> 
> 
> > > - btrfs check --repair (and others?)
> > >   Telling people that this may often cause more harm than good.
> > I think userspace tools do not belong to the overview.
> Well... I wouldn't mind if there was a btrfs-progs status page... (and
> both link each other).
> OTOH,... the user probably wants one central point where all relevant
> info can be found... and not again having to dig through n websites.

The Status page should give enough overview about all main topics, so
the progs can be one section there. Any details should go to separate
pages and be linked from there.

> > > - even mounting a fs ro, may cause it to be changed
> > 
> > This would go to the UseCases
> Fine for me.
> 
> 
> > 
> > > 
> > > - DB/VM-image like IO patterns + nodatacow + (!)checksumming
> > >   + (auto)defrag + snapshots
> > >   a)
> > >   People typically may have the impression:
> > >   btrfs = checksummed => all is guaranteed to be "valid" (or at
> > >   least noticed)
> > >   However this isn't the case for nodatacow'ed files, which in turn
> > >   is kinda "mandatory" for DB/VM-image like IO patterns, because
> > >   otherwise these would fragment too heavily (see (b)).
> > >   Contrary to what some people claim, none of the major DBs or
> > >   VM-image formats do general checksumming on their own; most don't
> > >   even support it, some that do wouldn't do it without app support,
> > >   and a few "just" don't do it by default.
> > >   Thus one should point people to this situation and tell them that
> > >   they may not get this "correctness" guarantee here.
> > >   b)
> > >   IIRC, it doesn't even help to simply not use nodatacow on such
> > >   files and to use auto-defrag instead to counter the fragmentation,
> > >   as that one doesn't perform too well on large files.
> > 
> > Same.
> Fine for me either... you already said above you would mention the
> nodatacow=>no-checksumming=>no-verification-and-no-raid-repair in the
> general section... this is enough for that place.
> 
> 
> > > For specific features:
> > > - Autodefrag
> > >   - didn't that also cause reflinks to be broken up?
> > 
> > No and never had.
> 
> Absolutely sure? One year ago, I was told that at first too so I
> started using it, but later on some (IIRC) developer said auto-defrag
> would also suffer from it.

Reading the subthread, I have to change the statement. Autodefrag can
read surrounding blocks up to 64k and write them to a new location; on
that write the reflinks will get broken. I'll update the page.

> 
> > > - RAID*
> > >   No userland tools for monitoring/etc.
> > 
> > That's a usability bug.
> 
> Well it is and it will probably go away sooner or later... 

Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
You can find more details on the issue and a complete explanation here:

http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
and
(Help! I ran out of disk space! )
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21

And an explanation for the "dlimit" solution:

Quote From: Uncommon solutions for BTRFS
(http://blog.schmorp.de/2015-10-08-smr-archive-drives-fast-now.html)

> For my purposes, I define internal fragmentation as space allocated but not 
> usable by the filesystem. In BTRFS, each time you delete files, the space 
> used by those files cannot be reused for new files automatically.
> It's not a hard requirement to do this maintenance regularly, but doing it 
> regularly spares you waiting for hours when the disk is full and you need to 
> wait for a balance clean up command - and of course also reduces the number 
> of times you get unexpected disk full errors. As a side note, this can also 
> be useful to prolong the life of your SSD because it allows the SSD to reuse 
> space not needed by the filesystem (although there is a trade-off, frequent 
> balancing is bad, no balancing is bad, the sweet spot is somewhere in 
> between).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Andrei Borzenkov
On Tue, Sep 20, 2016 at 11:20 AM, Peter Becker  wrote:
> The last option completely avoids the ENOSPC issue but produces additional
> workload for your hard drives.
>

I still do not understand where ENOSPC comes from in the first place.
Filesystem is half empty. Do you suggest that it is normal to get
ENOSPC in this case?


Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
Normally, total and used should differ by only a few GB.
Depending on your write workload, you should run

btrfs balance start -dusage=60 /mnt

every week to avoid "ENOSPC".

If you use a newer btrfs-progs that supports balance limit filters, you should run

btrfs balance start -dusage=99 -dlimit=10 /mnt

every 3 hours.

This will balance up to 10 blocks (dlimit=10; corresponds to 10 GB) which
are not completely full into new blocks. You could/should adjust the interval
and the limit filter depending on your write workload.
For example, if you write (changed files + new files) only 10 GB a day, it
will be enough to run this every night.
The last option completely avoids the ENOSPC issue but produces additional
workload for your hard drives.

Note: you should avoid making snapshots during a balance. Use a simple
locking mechanism for that.
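
One possible shape for such a lock, as a minimal Python sketch rather than a tested maintenance script: the balance job takes an exclusive advisory lock before it starts, and the snapshot job is assumed to take the same lock before creating snapshots. The lock path and the function names are hypothetical; only the btrfs command line comes from the message above.

```python
import fcntl
import subprocess
from contextlib import contextmanager

LOCK_PATH = "/run/lock/btrfs-maintenance.lock"  # hypothetical path


@contextmanager
def maintenance_lock(path=LOCK_PATH):
    """Hold an exclusive advisory lock; the snapshot job must use the same file."""
    with open(path, "w") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks while the other job holds it
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def balance_command(mountpoint, usage=99, limit=10):
    """Build the limited balance command quoted in the message above."""
    return ["btrfs", "balance", "start",
            f"-dusage={usage}", f"-dlimit={limit}", mountpoint]


def run_limited_balance(mountpoint="/mnt"):
    """Run the balance while holding the lock (needs root and btrfs-progs)."""
    with maintenance_lock():
        subprocess.run(balance_command(mountpoint), check=True)
```

A cron entry could call run_limited_balance() at the chosen interval, while the snapshot script wraps its own work in the same maintenance_lock().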


Re: stability matrix (was: Is stability a joke?)

2016-09-20 Thread Hugo Mills
On Tue, Sep 20, 2016 at 07:59:44AM +, Duncan wrote:
> Christoph Anton Mitterer posted on Mon, 19 Sep 2016 21:45:46 +0200 as
> excerpted:
> 
> > On Mon, 2016-09-19 at 17:27 +0200, David Sterba wrote:
> > 
> >> > For specific features:
> >> > - Autodefrag
> >> >   - didn't that also cause reflinks to be broken up?
> >> 
> >> No and never had.
> > 
> > Absolutely sure? One year ago, I was told that at first too so I started
> > using it, but later on some (IIRC) developer said auto-defrag would also
> > suffer from it.
> 
> AFAIK it was Hugo that said he looked into that, and that (if I'm 
> representing it correctly) autodefrag breaks reflinks and triggers space-
> using duplication much as defrag does, but that it does it on a much 
> smaller scale, since it (1) only triggers when some parts of a file are 
> being rewritten anyway, thus breaking the reflink for those specific 
> parts of the file due to COW (COW1 on otherwise NOCOW files) in any case, 
> and (2) unlike defrag, doesn't rewrite and thus break the reflinks on 
> entire files, just somewhat larger extents than the pure rewrite by 
> itself without autodefrag would.
> 
> Thus making the reflink-breaking and duplication effect of autodefrag 
> there, but relatively quite small compared to on-demand per-file defrag.

   I didn't investigate it -- It was my firmly-stated misunderstanding
which caused someone (Filipe, I think) with much more actual knowledge
to correct me, thus making the actual behaviour much clearer. :)

   I think your description is accurate as far as my current
understanding goes.

   Hugo.


-- 
Hugo Mills | There isn't a noun that can't be verbed.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] Btrfs: kill BUG_ON in do_relocation

2016-09-20 Thread David Sterba
On Mon, Sep 19, 2016 at 04:11:44PM -0700, Liu Bo wrote:
> > > That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> > > really the abort is just adding bonus.
> > 
> > I think we misuse the EIO where we should really return EFSCORRUPTED
> > that's an alias for EUCLEAN, looking at xfs or ext4. EIO should be
> > really a message that the hardware is bad.
> 
> I love this idea, but one quick question, when returning EUCLEAN, what
> message do users get? 
> 
> "#define EUCLEAN 117 /* Structure needs cleaning */"

strerror(EUCLEAN) -> "Structure needs cleaning"
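
For anyone who wants to check what users would see, the lookup can be reproduced from Python on a Linux system. This is only a sketch: the exact text comes from the C library, so it may differ between libc implementations, and EUCLEAN is not defined on every platform.

```python
import errno
import os

# EUCLEAN is Linux-specific, so fall back gracefully elsewhere.
code = getattr(errno, "EUCLEAN", None)
if code is None:
    message = "EUCLEAN is not defined on this platform"
else:
    # os.strerror() returns the C library's message for the error code.
    message = os.strerror(code)
print(message)
```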


Re: stability matrix (was: Is stability a joke?)

2016-09-20 Thread Duncan
Christoph Anton Mitterer posted on Mon, 19 Sep 2016 21:45:46 +0200 as
excerpted:

> On Mon, 2016-09-19 at 17:27 +0200, David Sterba wrote:
> 
>> > For specific features:
>> > - Autodefrag
>> >   - didn't that also cause reflinks to be broken up?
>> 
>> No and never had.
> 
> Absolutely sure? One year ago, I was told that at first too so I started
> using it, but later on some (IIRC) developer said auto-defrag would also
> suffer from it.

AFAIK it was Hugo that said he looked into that, and that (if I'm 
representing it correctly) autodefrag breaks reflinks and triggers space-
using duplication much as defrag does, but that it does it on a much 
smaller scale, since it (1) only triggers when some parts of a file are 
being rewritten anyway, thus breaking the reflink for those specific 
parts of the file due to COW (COW1 on otherwise NOCOW files) in any case, 
and (2) unlike defrag, doesn't rewrite and thus break the reflinks on 
entire files, just somewhat larger extents than the pure rewrite by 
itself without autodefrag would.

Thus making the reflink-breaking and duplication effect of autodefrag 
there, but relatively quite small compared to on-demand per-file defrag.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Tomasz Chmielewski

On 2016-09-20 16:27, Peter Becker wrote:

> You have 417(total)-131(used) blocks which are only partially filled.
> You should balance your file-system.
>
> (...)
>
> #or a full balance
> btrfs balance start /mnt

OK, does it mean that btrfs needs some userspace daemon which does the 
following from time to time (how often?):


1) btrfs fi show /mountpoint(s)

2) if "used" is more than 90% (or 80%? or 70%?) of "size" - run a full 
balance


3) ...unless "btrfs fi df" shows that "used" is 95% (?) or more of 
"total", then don't bother, as we're "really" full


?
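
The three steps above can be written down as a small decision function. This is a hedged Python sketch of the policy being asked about, not an existing tool: the function name is made up, and both thresholds are placeholders because the question leaves the right values open. The sample numbers are the ones from the "btrfs fi show" and "btrfs fi df" output quoted in this thread.

```python
def balance_decision(dev_size_gib, dev_allocated_gib,
                     data_total_gib, data_used_gib,
                     allocated_threshold=0.90, full_threshold=0.95):
    """Return "ok", "full" or "balance" following the three steps above."""
    allocated_ratio = dev_allocated_gib / dev_size_gib  # from "btrfs fi show"
    data_fill_ratio = data_used_gib / data_total_gib    # from "btrfs fi df"
    if allocated_ratio < allocated_threshold:
        return "ok"       # step 2 not triggered: unallocated space remains
    if data_fill_ratio >= full_threshold:
        return "full"     # step 3: the chunks really are full
    return "balance"      # step 2: nearly all allocated, but chunks are sparse


# devid 1: size 423.13GiB, used 423.13GiB; Data: total=417.12GiB, used=131.33GiB
print(balance_decision(423.13, 423.13, 417.12, 131.33))  # prints "balance"
```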


Tomasz Chmielewski
https://lxadm.com



Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Tomasz Chmielewski

Yes, have it disabled already (for their datadirs).


Tomasz Chmielewski
https://lxadm.com


On 2016-09-20 16:30, Peter Becker wrote:
> for the future. disable COW for all database containers



Re: [PATCH] btrfs: Fix handling of -ENOENT from btrfs_uuid_iter_rem

2016-09-20 Thread Nikolay Borisov
[Resend due to gmail mobile interface defaulting to html layout]
>>
>> We know its returning -ENOENT, so it should in theory be enough to just
>> goto again_search_slot, assuming that we just raced with the deletion.
>
>
> I will apply this on the machines which are exhibiting problems and will
> report whether it rectified the situation. I bump the objectid since this is
> what you suggested. I can also try without it.
>>
>>
>> -chris


Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
For the future: disable COW for all database containers.



Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
* If this does NOT solve the "No space left" issues, you must remove old snapshots.



Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Peter Becker
Data, RAID1: total=417.12GiB, used=131.33GiB

You have 417 (total) - 131 (used) GiB of blocks which are only partially filled.
You should balance your file-system.
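
The 417 - 131 figure can be reproduced from the "btrfs fi df" output quoted further down in this message. The following Python sketch parses that text and computes the allocated-but-unused data space; the parser is a hypothetical helper written for this sample and is not part of btrfs-progs.

```python
import re

FI_DF_OUTPUT = """\
Data, RAID1: total=417.12GiB, used=131.33GiB
System, RAID1: total=8.00MiB, used=80.00KiB
Metadata, RAID1: total=6.00GiB, used=4.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
"""

UNIT_TO_GIB = {"B": 1 / 1024**3, "KiB": 1 / 1024**2, "MiB": 1 / 1024,
               "GiB": 1.0, "TiB": 1024.0}
LINE_RE = re.compile(
    r"^(?P<kind>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tunit>[KMGT]?i?B), "
    r"used=(?P<used>[\d.]+)(?P<uunit>[KMGT]?i?B)$"
)


def parse_fi_df(text):
    """Return {kind: (total_gib, used_gib)} for each recognised line."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            total = float(m["total"]) * UNIT_TO_GIB[m["tunit"]]
            used = float(m["used"]) * UNIT_TO_GIB[m["uunit"]]
            result[m["kind"]] = (total, used)
    return result


usage = parse_fi_df(FI_DF_OUTPUT)
data_total, data_used = usage["Data"]
slack_gib = data_total - data_used
print(f"allocated but unused data space: {slack_gib:.2f} GiB")  # 285.79 GiB
```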

First you need some free space. You could remove some files / old
snapshots etc., or add an empty USB stick of at least 4 GB to your
btrfs pool (after the balance completes you can remove the stick from
the pool).

But first you should try to free empty data and metadata blocks:

btrfs balance start -musage=0 /mnt
btrfs balance start -dusage=0 /mnt

Then you can run a full balance or a partial balance:

# a partial balance which reorganizes data blocks less than 50% filled
btrfs balance start -dusage=50 /mnt

# or a full balance
btrfs balance start /mnt

Because of a possible bug you should disable all snapshot scripts
(like cron jobs) during the balance.

If this solves the "No space left" issues you must remove old snapshots.

2016-09-20 8:58 GMT+02:00 Hugo Mills :
> On Tue, Sep 20, 2016 at 03:47:14PM +0900, Tomasz Chmielewski wrote:
>> How to understand the following "btrfs fi show" output?
>
> This gives a write-up (and worked example) of an answer to your question:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>
>    If you've got any follow-up questions after reading it, please do
> come back and we can try to improve the FAQ entry. :)
>
>    Hugo.
>
>> # btrfs fi show /var/lib/lxd
>> Label: 'btrfs'  uuid: f5f30428-ec5b-4497-82de-6e20065e6f61
>> Total devices 2 FS bytes used 136.18GiB
>> devid    1 size 423.13GiB used 423.13GiB path /dev/sda3
>> devid    2 size 423.13GiB used 423.13GiB path /dev/sdb3
>>
>> Why is it "size 423.13GiB used 423.13GiB"? Is it full?
>>
>> I had "No space left" on this filesystem just yesterday (running
>> kernel 4.7.4). This is btrfs RAID-1 on SSD disks. This filesystem is
>> used for 20-30 LXD containers with different roles (mongo, mysql,
>> postgres databases, webservers etc.), around 150 read-only
>> snapshots, btrfs compression is disabled.
>>
>>
>> Both "btrfs fi df" and "df -h" show plenty of space:
>>
>> # btrfs fi df /var/lib/lxd
>> Data, RAID1: total=417.12GiB, used=131.33GiB
>> System, RAID1: total=8.00MiB, used=80.00KiB
>> Metadata, RAID1: total=6.00GiB, used=4.86GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>>
>> # df -h
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/sda3   424G  137G  286G  33% /var/lib/lxd
>>
>>
>>
>> Tomasz Chmielewski
>> https://lxadm.com
>
> --
> Hugo Mills | I can resist everything except temptation.
> hugo@... carfax.org.uk |
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |


Re: how to understand "btrfs fi show" output? "No space left" issues

2016-09-20 Thread Tomasz Chmielewski
OK, according to that - it means 423.13GiB out of total available space, 
423.13GiB, has been allocated.


Is it good? Is it bad? Is it why I'm getting "No space left" issues?

Why has it allocated all available space, if only around 1/3 of space is 
in use, according to other tools (less than 140 GB out of 423 GB is in 
use)?



On other systems, I see that "used" from "btrfs fi show" more or less 
matches the output of "btrfs fi df"; here - everything is allocated.
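
As a worked check of the numbers in this thread, the per-device "used" value from "btrfs fi show" is the sum of the chunk totals reported by "btrfs fi df". The Python sketch below assumes a two-device RAID1 where each device holds one copy of every chunk, and treats GlobalReserve as part of the metadata allocation rather than a separate one.

```python
# Chunk totals from the "btrfs fi df" output quoted in this thread, in GiB.
chunk_totals_gib = {
    "Data": 417.12,
    "System": 8.00 / 1024,  # 8.00MiB
    "Metadata": 6.00,
}
data_used_gib = 131.33

# RAID1 assumption: each device stores one copy of every chunk.
allocated_per_device = sum(chunk_totals_gib.values())
data_fill = data_used_gib / chunk_totals_gib["Data"]

print(f"allocated per device: {allocated_per_device:.2f} GiB")  # 423.13 GiB
print(f"data chunks filled:   {data_fill:.0%}")                 # 31%
```

So every byte of each device has been handed out to chunks, while the data chunks themselves are only about a third full.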



Tomasz Chmielewski
https://lxadm.com


On 2016-09-20 15:58, Hugo Mills wrote:
> On Tue, Sep 20, 2016 at 03:47:14PM +0900, Tomasz Chmielewski wrote:
>> How to understand the following "btrfs fi show" output?
>
> This gives a write-up (and worked example) of an answer to your question:
>
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>
>    If you've got any follow-up questions after reading it, please do
> come back and we can try to improve the FAQ entry. :)
>
>    Hugo.

