Re: [PATCH 0/4] Small coding style cleanups

2019-10-21 Thread Johannes Thumshirn
On 18/10/2019 14:02, Nikolay Borisov wrote:
[...]

> Patches 2-4 LGTM you can add, :

What's wrong with 1/4?



-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error

2019-10-21 Thread Johannes Thumshirn
On 19/10/2019 21:29, Peter Hjalmarsson wrote:

Hi Peter,

Thanks for the report.

> While trying to reproduce another problem I have seen with BTRFS while
> running balance and raid6 I hit an issue resulting in:
> BUG: kernel NULL pointer dereference, address: 02ce
> 
> I created a script trying to pinpoint the problem utilizing zram, and
> the run goes like this:
> 
> 1. run the scripts which sets up a "SINGEL" filesystem on two larger
> devices, fills it with an adequate amount of data, and then tries to
> convert it to raid6
> 2. the filesystem will fail to convert due to not enough space to
> convert all data to raid6 (not enough space on at least 3 devices at
> the same time), hitting an enospc-error
> * Up until this point stuff still seems to work without crashing, and
> the system seems stable, just with two different profiles fot the data
> 3. issue the umount command which will be "Killed", and a backtrace
> will be written to dmesg
> 
> This results in a filesystem that cannot be synced, unmounted, and in
> all just seems crashed.
> I have ran the script 6 times. 1 time it passed, 5 times it has
> crashed, and I have rebooted the computer, since that is the only way
> it seems to get rid of the filesysem, so the issue seems pretty
> reproducable.
> 
> The system is a laptop running Fedora 31 Beta with the latest updates,
> and the latest kernel (5.3.6-300.fc31.x86_64)
> 
> Do you have any input, or any other questions or stuff you want me to test?
> 
> The script looks as follow:
> -
> $ cat run-btrfs-test
> modprobe -iv zram num_devices=8
> udevadm trigger
> sync
> zramctl /dev/zram0 -s 8G && \
> zramctl /dev/zram1 -s 8G && \
> zramctl /dev/zram2 -s 4G && \
> zramctl /dev/zram3 -s 4G && \
> zramctl /dev/zram4 -s 4G && \
> zramctl /dev/zram5 -s 2G && \
> zramctl /dev/zram6 -s 2G && \
> zramctl /dev/zram7 -s 4G && \
> mkfs.btrfs /dev/zram0 && \
> mkdir -p /mnt/btrfs-test && \
> mount /dev/zram0 /mnt/btrfs-test && \
> echo "FS Mounted" && \
> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
> echo "Devices added" && \
> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int}
> bs=32M count=1 && sync ; done
> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test && \
> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
> btrfs fi sh /mnt/btrfs-test && \
> btrfs fi df /mnt/btrfs-test
> =
> 
> The output from the script, I will trim the output down to after the for-loop:
> --
> Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.64GiB
> devid1 size 8.00GiB used 8.00GiB path /dev/zram0
> devid2 size 8.00GiB used 8.00GiB path /dev/zram1
> devid3 size 4.00GiB used 0.00B path /dev/zram2
> devid4 size 4.00GiB used 0.00B path /dev/zram3
> devid5 size 4.00GiB used 0.00B path /dev/zram4
> devid6 size 2.00GiB used 0.00B path /dev/zram5
> devid7 size 2.00GiB used 0.00B path /dev/zram6
> devid8 size 4.00GiB used 0.00B path /dev/zram7
> 
> Data, single: total=15.74GiB, used=15.63GiB
> System, single: total=4.00MiB, used=16.00KiB
> Metadata, single: total=264.00MiB, used=17.06MiB
> GlobalReserve, single: total=16.70MiB, used=0.00B
> Done, had to relocate 3 out of 20 chunks
> ERROR: error during balancing '/mnt/btrfs-test': No space left on device
> There may be more info in syslog - try dmesg | tail
> Label: none  uuid: 87150918-e487-4f59-994b-ccb13ee05446
> Total devices 8 FS bytes used 15.65GiB
> devid1 size 8.00GiB used 2.99GiB path /dev/zram0
> devid2 size 8.00GiB used 2.00GiB path /dev/zram1
> devid3 size 4.00GiB used 4.00GiB path /dev/zram2
> devid4 size 4.00GiB used 4.00GiB path /dev/zram3
> devid5 size 4.00GiB used 4.00GiB path /dev/zram4
> devid6 size 2.00GiB used 2.00GiB path /dev/zram5
> devid7 size 2.00GiB used 2.00GiB path /dev/zram6
> devid8 size 4.00GiB used 4.00GiB path /dev/zram7
> 
> Data, single: total=2.00GiB, used=1.97GiB
> Data, RAID6: total=14.41GiB, used=13.66GiB
> System, RAID6: total=80.00MiB, used=16.00KiB
> Metadata, RAID6: total=512.00MiB, used=17.52MiB
> GlobalReserve, single: total=17.16MiB, used=0.00B
> ===
> 
> Issueing the umount:
> 
> # umount /mnt/btrfs-test && modprobe -rv zram

OK, I couldn't reproduce it in my environment (5.4-rc3+ based
btrfs-devel/misc-next form David) with this script. I'll dig deeper.


> Killed
> ===
> 
> And last but not least: the output in dmesg:
> ---
> [  205.960233] BTRFS info (device zram0): 2 enospc errors during balance
> [  205.960235] BTRFS info (device zram0): balance: ended with status: -28

Here balance ended with -ENOSPC (28).

> [  235.774821] BUG: kernel NULL pointer dereference, address: 02ce

That's a NULL pointer deference with an offset of 0x2ce (718).

> [  235.774826] #PF: supervisor read access in kernel mode
> [  235.774828] #PF: error_code(0x) - not-present page

[PATCH 0/3] btrfs-progs: check: Introduce optional argument for -b|--backup

2019-10-21 Thread Qu Wenruo
Before this patchset, if we want to use backup roots, it's only possible
to let btrfs-check to automatically choose the backup.

If user want to use a specified backup, it can only use -r|--tree-root
option along with backup roots dump from "btrfs ins dump-super".

This patchset will introduce optional argument for -b|--backup, so user
can specify which backup to use by providing the generation difference
(-3, -2, -1).

If the optional argument is not provided, the default value is -1, and
the behavior should be pretty much the same.

Qu Wenruo (3):
  btrfs-progs: utils-lib: Use error() to replace fprintf(stderr, "ERROR:
")
  btrfs-progs: disk-io: Handle backup root more correctly
  btrfs-progs: check: Introduce optional argument for -b|--backup

 Documentation/btrfs-check.asciidoc |  6 ++--
 check/main.c   | 33 +++---
 common/utils.h |  1 +
 ctree.h|  8 +
 disk-io.c  | 55 --
 disk-io.h  | 33 +++---
 utils-lib.c| 25 +++---
 7 files changed, 127 insertions(+), 34 deletions(-)

-- 
2.23.0



[PATCH 1/3] btrfs-progs: utils-lib: Use error() to replace fprintf(stderr, "ERROR: ")

2019-10-21 Thread Qu Wenruo
This saves several lines of code.

Signed-off-by: Qu Wenruo 
---
 utils-lib.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/utils-lib.c b/utils-lib.c
index c2b6097f5df9..0202dd7677f0 100644
--- a/utils-lib.c
+++ b/utils-lib.c
@@ -23,8 +23,7 @@ u64 arg_strtou64(const char *str)
 
value = strtoull(str, &ptr_parse_end, 0);
if (ptr_parse_end && *ptr_parse_end != '\0') {
-   fprintf(stderr, "ERROR: %s is not a valid numeric value.\n",
-   str);
+   error("%s is not a valid numeric value.", str);
exit(1);
}
 
@@ -33,12 +32,11 @@ u64 arg_strtou64(const char *str)
 * unexpected number to us, so let's do the check ourselves.
 */
if (str[0] == '-') {
-   fprintf(stderr, "ERROR: %s: negative value is invalid.\n",
-   str);
+   error("%s: negative value is invalid.", str);
exit(1);
}
if (value == ULLONG_MAX) {
-   fprintf(stderr, "ERROR: %s is too large.\n", str);
+   error("%s is too large.", str);
exit(1);
}
return value;
-- 
2.23.0



[PATCH 3/3] btrfs-progs: check: Introduce optional argument for -b|--backup

2019-10-21 Thread Qu Wenruo
Add an optional argument,  for -b|--backup.

This optional argument allow user to specify the generation difference
to search for the best backup root.

The value values are: -3, -2, -1.

To co-operate with this change, the following modifications are made:
- Man page and help string update

- New OPEN_CTREE flags and helpers are introduced
  OPEN_CTREE_BACKUP_GEN_DIFF[123] are introduced to replace old single
  bit OPEN_CTREE_BACKUP_ROOT.
  New helpers backup_gen_diff_to_cflags() and
  cflags_to_backup_gen_diff() are introduced to do the convert.
  So that we can keep the old open_ctree() interface without introducing
  new parameters.

- New arg_strol() helper to get negative int

- New btrfs_fs_info::backup_gen_diff member
  Now btrfs_setup_all_roots() uses that member to search backup roots.

Signed-off-by: Qu Wenruo 
---
 Documentation/btrfs-check.asciidoc |  6 --
 check/main.c   | 33 ++
 common/utils.h |  1 +
 ctree.h|  8 
 disk-io.c  | 25 ++
 disk-io.h  | 33 ++
 utils-lib.c| 17 +++
 7 files changed, 100 insertions(+), 23 deletions(-)

diff --git a/Documentation/btrfs-check.asciidoc 
b/Documentation/btrfs-check.asciidoc
index b963eae5cdce..84a4241069cf 100644
--- a/Documentation/btrfs-check.asciidoc
+++ b/Documentation/btrfs-check.asciidoc
@@ -41,8 +41,10 @@ filesystem, similarly the run time.
 SAFE OR ADVISORY OPTIONS
 
 
--b|--backup::
-use the first valid set of backup roots stored in the superblock
+-b[] | --backup[=]::
+use the newest valid set of backup roots which is no older than 'generation + 
'
++
+'' is optional, if not given, default value is -1. Valid values are 
-3, -2 ,-1.
 +
 This can be combined with '--super' if some of the superblocks are damaged.
 
diff --git a/check/main.c b/check/main.c
index c2d0f3949c5e..46f3e3d4c5b5 100644
--- a/check/main.c
+++ b/check/main.c
@@ -9773,7 +9773,9 @@ static const char * const cmd_check_usage[] = {
"Options:",
"  starting point selection:",
"   -s|--super  use this superblock copy",
-   "   -b|--backup use the first valid backup root 
copy",
+   "   -b|--backup[=]use the backup root with 
",
+   "is an optional 
parameter,",
+   "   valid values are -3, -2, -1 
(default).",
"   -r|--tree-root  use the given bytenr for the tree 
root",
"   --chunk-rootuse the given bytenr for the chunk 
tree root",
"  operation modes:",
@@ -9799,6 +9801,17 @@ static const char * const cmd_check_usage[] = {
NULL
 };
 
+static unsigned backup_gen_diff_to_cflags(int backup_gen_diff)
+{
+   ASSERT(-3 <= backup_gen_diff && backup_gen_diff <= -1);
+
+   if (backup_gen_diff == -3)
+   return OPEN_CTREE_BACKUP_GEN_DIFF3;
+   if (backup_gen_diff == -2)
+   return OPEN_CTREE_BACKUP_GEN_DIFF2;
+   return OPEN_CTREE_BACKUP_GEN_DIFF1;
+}
+
 static int cmd_check(const struct cmd_struct *cmd, int argc, char **argv)
 {
struct cache_tree root_cache;
@@ -9818,6 +9831,7 @@ static int cmd_check(const struct cmd_struct *cmd, int 
argc, char **argv)
int qgroup_report = 0;
int qgroups_repaired = 0;
int qgroup_report_ret;
+   int backup_gen_diff;
unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
int force = 0;
 
@@ -9838,7 +9852,7 @@ static int cmd_check(const struct cmd_struct *cmd, int 
argc, char **argv)
GETOPT_VAL_INIT_EXTENT },
{ "check-data-csum", no_argument, NULL,
GETOPT_VAL_CHECK_CSUM },
-   { "backup", no_argument, NULL, 'b' },
+   { "backup", optional_argument, NULL, 'b' },
{ "subvol-extents", required_argument, NULL, 'E' },
{ "qgroup-report", no_argument, NULL, 'Q' },
{ "tree-root", required_argument, NULL, 'r' },
@@ -9853,13 +9867,24 @@ static int cmd_check(const struct cmd_struct *cmd, int 
argc, char **argv)
{ NULL, 0, NULL, 0}
};
 
-   c = getopt_long(argc, argv, "as:br:pE:Q", long_options, NULL);
+   c = getopt_long(argc, argv, "as:b::r:pE:Q", long_options, NULL);
if (c < 0)
break;
switch(c) {
case 'a': /* ignored */ break;
case 'b':
-   ctree_flags |= OPEN_CTREE_BACKUP_ROOT;
+   if (optarg) {
+   backup_gen_diff = arg_strtol(optarg);
+   } else
+

[PATCH 2/3] btrfs-progs: disk-io: Handle backup root more correctly

2019-10-21 Thread Qu Wenruo
Current backup root handling has extra check on super generation:

static int find_best_backup_root(struct btrfs_super_block *super)
{
u64 orig_gen = btrfs_super_generation(super);
...
if (btrfs_backup_tree_root_gen(backup) != orig_gen &&
btrfs_backup_tree_root_gen(backup) > gen) {
best_index = i;
gen = btrfs_backup_tree_root_gen(backup);

This check is to ensure we don't get backup root with current
generation, but it can still return backup root newer than current root.

So for the following super:
generation: 10
backup[0] generation:   8
backup[1] generation:   9
backup[2] generation:   10
backup[3] generation:   11

If we're calling find_best_backup_root() then we can pick up slot 3
which is newer than current generation.

This patch introduce a new parameter for find_best_backup_root() to
specify the max generation.

So with above superblock, calling find_best_backup_root(sb, sb_gen - 1)
will ensure we get slot 1, other than slot 3.
This also affects how we update backup roots.

Furthermore, due to the change in the return value,
find_best_backup_root() can now return -1 to indicates error (no valid
backup found), so change callers to co-operate.

Signed-off-by: Qu Wenruo 
---
 disk-io.c | 34 ++
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/disk-io.c b/disk-io.c
index be44eead5cef..36db1be264cd 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -845,17 +845,22 @@ int btrfs_check_fs_compatibility(struct btrfs_super_block 
*sb,
return 0;
 }
 
-static int find_best_backup_root(struct btrfs_super_block *super)
+/*
+ * Find the newest backup slot whose generation <= @max_gen
+ *
+ * Can return <0 for error, indicating no valid backup slot for @max_gen.
+ */
+static int find_best_backup_root(struct btrfs_super_block *super,
+u64 max_gen)
 {
struct btrfs_root_backup *backup;
-   u64 orig_gen = btrfs_super_generation(super);
u64 gen = 0;
-   int best_index = 0;
+   int best_index = -1;
int i;
 
for (i = 0; i < BTRFS_NUM_BACKUP_ROOTS; i++) {
backup = super->super_roots + i;
-   if (btrfs_backup_tree_root_gen(backup) != orig_gen &&
+   if (btrfs_backup_tree_root_gen(backup) <= max_gen &&
btrfs_backup_tree_root_gen(backup) > gen) {
best_index = i;
gen = btrfs_backup_tree_root_gen(backup);
@@ -908,9 +913,10 @@ int btrfs_setup_all_roots(struct btrfs_fs_info *fs_info, 
u64 root_tree_bytenr,
root_tree_bytenr = btrfs_super_root(sb);
} else if (flags & OPEN_CTREE_BACKUP_ROOT) {
struct btrfs_root_backup *backup;
-   int index = find_best_backup_root(sb);
-   if (index >= BTRFS_NUM_BACKUP_ROOTS) {
-   fprintf(stderr, "Invalid backup root number\n");
+   int index = find_best_backup_root(sb,
+   btrfs_super_generation(sb) - 1);
+   if (index < 0) {
+   error("can't find any valid backup root");
return -EIO;
}
backup = fs_info->super_copy->super_roots + index;
@@ -1707,10 +1713,22 @@ static int write_dev_supers(struct btrfs_fs_info 
*fs_info,
 static void backup_super_roots(struct btrfs_fs_info *info)
 {
struct btrfs_root_backup *root_backup;
+   u64 current_gen = btrfs_super_generation(info->super_copy);
int next_backup;
int last_backup;
 
-   last_backup = find_best_backup_root(info->super_copy);
+   last_backup = find_best_backup_root(info->super_copy, current_gen - 1);
+   /* No older backups, retry current gen */
+   if (last_backup < 0) {
+   last_backup = find_best_backup_root(info->super_copy,
+   current_gen);
+   /*
+* Still failed, means no valid backup root at all, restart
+* from slot 0.
+*/
+   if (last_backup < 0)
+   last_backup = 0;
+   }
next_backup = (last_backup + 1) % BTRFS_NUM_BACKUP_ROOTS;
 
/* just overwrite the last backup if we're at the same generation */
-- 
2.23.0



[PATCH v3] tools/lib/traceevent, perf tools: Handle %pU format correctly

2019-10-21 Thread Qu Wenruo
[BUG]
For btrfs related events, there is a field for fsid, but perf never
parse it correctly.

 # perf trace -e btrfs:qgroup_meta_convert xfs_io -f -c "pwrite 0 4k" \
   /mnt/btrfs/file1
 0.000 xfs_io/77915 btrfs:qgroup_meta_reserve:(nil)U: refroot=5(FS_TREE) 
type=0x0 diff=2
  ^^ Not a correct UUID
 ...

[CAUSE]
The pretty_print() function doesn't handle the %pU format correctly.
In fact it doesn't handle %pU as uuid at all.

[FIX]
Add a new function, print_uuid_arg(), to handle %pU correctly.

Now perf trace can at least print fsid correctly:
 0.000 xfs_io/79619 
btrfs:qgroup_meta_reserve:23ad1511-dd83-47d4-a79c-e96625a15a6e 
refroot=5(FS_TREE) type=0x0 diff=2

Signed-off-by: Qu Wenruo 
---
Changelog:
v2:
- Use more comment explaining the finetunings we skipped for %pU*
- Extra check for the field before reading the data
- Use more elegant way to output uuid string
v3:
- Use a even more elegant way to output uuid string
---
 tools/lib/traceevent/event-parse.c | 51 ++
 1 file changed, 51 insertions(+)

diff --git a/tools/lib/traceevent/event-parse.c 
b/tools/lib/traceevent/event-parse.c
index d948475585ce..a71f4a86b6ca 100644
--- a/tools/lib/traceevent/event-parse.c
+++ b/tools/lib/traceevent/event-parse.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -4508,6 +4509,40 @@ get_bprint_format(void *data, int size __maybe_unused,
return format;
 }
 
+static void print_uuid_arg(struct trace_seq *s, void *data, int size,
+  struct tep_event *event, struct tep_print_arg *arg)
+{
+   unsigned char *buf;
+   int i;
+
+   if (arg->type != TEP_PRINT_FIELD) {
+   trace_seq_printf(s, "ARG TYPE NOT FIELID but %d", arg->type);
+   return;
+   }
+
+   if (!arg->field.field) {
+   arg->field.field = tep_find_any_field(event, arg->field.name);
+   if (!arg->field.field) {
+   do_warning("%s: field %s not found",
+  __func__, arg->field.name);
+   return;
+   }
+   }
+   if (arg->field.field->size < 16) {
+   trace_seq_printf(s, "INVALID UUID: size have %u expect 16",
+   arg->field.field->size);
+   return;
+   }
+   buf = data + arg->field.field->offset;
+
+   for (i = 0; i < 8; i++) {
+   trace_seq_printf(s, "%02x", buf[2 * i]);
+   trace_seq_printf(s, "%02x", buf[2 * i + 1]);
+   if (1 <= i && i <= 4)
+   trace_seq_putc(s, '-');
+   }
+}
+
 static void print_mac_arg(struct trace_seq *s, int mac, void *data, int size,
  struct tep_event *event, struct tep_print_arg *arg)
 {
@@ -5074,6 +5109,22 @@ static void pretty_print(struct trace_seq *s, void 
*data, int size, struct tep_e
arg = arg->next;
break;
}
+   } else if (*ptr == 'U') {
+   /*
+* %pU has several finetunings variants
+* like %pUb and %pUL.
+* Here we ignore them, default to
+* byte-order no endian, lower case
+* letters.
+*/
+   if (isalpha(ptr[1]))
+   ptr += 2;
+   else
+   ptr++;
+
+   print_uuid_arg(s, data, size, event, 
arg);
+   arg = arg->next;
+   break;
}
 
/* fall through */
-- 
2.23.0



Re: [PATCH 1/3] btrfs-progs: utils-lib: Use error() to replace fprintf(stderr, "ERROR: ")

2019-10-21 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 

-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH] btrfs: Do not check for PagePrivate twice

2019-10-21 Thread Filipe Manana
On Sat, Oct 19, 2019 at 10:05 AM Goldwyn Rodrigues  wrote:
>
> From: Goldwyn Rodrigues 
>
> We are checking PagePrivate twice, once with lock and once without.
> Perform the check only once.

Have you checked if there's some performance degradation after
removing the check?
My guess is it's there to avoid taking the lock, as the lock can be
heavily used on a system under heavy load (maybe even if it's not too
heavy, since we generate a lot of dirty metadata due to cow).
The page may have been released after locking the mapping, that's why
we check it twice, and after unlocking we are sure it can not be
released due to taking a reference on the extent buffer.

Thanks.

>
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/extent_io.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index cceaf05aada2..425ba359178c 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -3959,9 +3959,6 @@ int btree_write_cache_pages(struct address_space 
> *mapping,
> for (i = 0; i < nr_pages; i++) {
> struct page *page = pvec.pages[i];
>
> -   if (!PagePrivate(page))
> -   continue;
> -
> spin_lock(&mapping->private_lock);
> if (!PagePrivate(page)) {
> spin_unlock(&mapping->private_lock);
> --
> 2.16.4
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[RFC PATCH 03/14] btrfs-progs: migrate filesystem defragment to global verbose

2019-10-21 Thread Anand Jain
btrfs filesystem deframent already supports local sub-command
verbose option, enable the same when the global verbose option is set.
And as well make sure the same remains enabled at the local level.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/filesystem.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/cmds/filesystem.c b/cmds/filesystem.c
index 4f22089abeaa..ee4d366fbf64 100644
--- a/cmds/filesystem.c
+++ b/cmds/filesystem.c
@@ -832,13 +832,13 @@ static const char * const cmd_filesystem_defrag_usage[] = 
{
"btrfs filesystem defragment [options] | [|...]",
"Defragment a file or a directory",
"",
-   "-v  be verbose",
-   "-r  defragment files recursively",
-   "-c[zlib,lzo,zstd]   compress the file while defragmenting",
-   "-f  flush data to disk immediately after 
defragmenting",
-   "-s startdefragment only from byte onward",
-   "-l len  defragment only up to len bytes",
-   "-t size target extent size hint (default: 32M)",
+   HELPINFO_INSERT_VERBOSE_SHORT,
+   "-r defragment files recursively",
+   "-c[zlib,lzo,zstd]  compress the file while defragmenting",
+   "-f flush data to disk immediately after defragmenting",
+   "-s start   defragment only from byte onward",
+   "-l len defragment only up to len bytes",
+   "-t sizetarget extent size hint (default: 32M)",
"",
"Warning: most Linux kernels will break up the ref-links of COW data",
"(e.g., files copied with 'cp --reflink', snapshots) which may cause",
@@ -848,7 +848,7 @@ static const char * const cmd_filesystem_defrag_usage[] = {
 };
 
 static struct btrfs_ioctl_defrag_range_args defrag_global_range;
-static int defrag_global_verbose;
+extern bool global_verbose;
 static int defrag_global_errors;
 static int defrag_callback(const char *fpath, const struct stat *sb,
int typeflag, struct FTW *ftwbuf)
@@ -857,8 +857,7 @@ static int defrag_callback(const char *fpath, const struct 
stat *sb,
int fd = 0;
 
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
-   if (defrag_global_verbose)
-   printf("%s\n", fpath);
+   pr_verbose(global_verbose, "%s\n", fpath);
fd = open(fpath, defrag_open_mode);
if (fd < 0) {
goto error;
@@ -913,7 +912,6 @@ static int cmd_filesystem_defrag(const struct cmd_struct 
*cmd,
thresh = SZ_32M;
 
defrag_global_errors = 0;
-   defrag_global_verbose = 0;
defrag_global_errors = 0;
optind = 0;
while(1) {
@@ -931,7 +929,7 @@ static int cmd_filesystem_defrag(const struct cmd_struct 
*cmd,
flush = 1;
break;
case 'v':
-   defrag_global_verbose = 1;
+   global_verbose = true;
break;
case 's':
start = parse_size(optarg);
@@ -1031,8 +1029,7 @@ static int cmd_filesystem_defrag(const struct cmd_struct 
*cmd,
/* errors are handled in the callback */
ret = 0;
} else {
-   if (defrag_global_verbose)
-   printf("%s\n", argv[i]);
+   pr_verbose(global_verbose, "%s\n", argv[i]);
ret = ioctl(fd, BTRFS_IOC_DEFRAG_RANGE,
&defrag_global_range);
defrag_err = errno;
-- 
1.8.3.1



[RFC PATCH 00/14] btrfs-progs: global-verbose option

2019-10-21 Thread Anand Jain
This patch set brings --verbose option to the top level btrfs command,
such as 'btrfs --verbose'. With this we don't have to add or remember
verbose option at the sub-commands level.

As there are already verbose options to 11 sub-commands as listed
below [1][2]. So the top level --verbose option here takes care to transpire
verbose request from the top level to the sub-command level for 9 (not 11)
sub-commands as in [1] as of now.

This patch is RFC still for the following two reasons (comments appreciated).

1.
The sub-commands as in [2] uses multi-level compile time verbose option,
such as %g_verbose = 0 (quite), %g_verbose = 1 (default), %g_verbose > 1
(real-verbose). And verbose at default is also part the .out files in
fstests. So it needs further discussions on how to handle the multi-
level verbose option using the global verbose option, and so sub-
commands in [2] are untouched.

2.
These patch has been unit-tested individually.
These patches does not alter the verbose output.
But it fixes the indentation in the command's help output, which may be
used in fstests and btrfs-progs/tests and their verification is pending
still, which I am planning to do it before v1.

[1]
btrfs subvolume delete --help
-v|--verbose   verbose output of operations
btrfs filesystem defragment --help
-v  be verbose
btrfs balance start --help
-v|--verbosebe verbose
btrfs balance status --help
-v|--verbosebe verbose
btrfs rescue chunk-recover --help
-v  Verbose mode
btrfs rescue super-recover --help
-v  Verbose mode
btrfs restore --help
-v|--verbose verbose
btrfs inspect-internal inode-resolve --help
-v   verbose mode
btrfs inspect-internal logical-resolve --help
-v  verbose mode

[2]
btrfs send --help
-v|--verbose enable verbose output to stderr, each occurrence of
btrfs receive --help
-v   increase verbosity about performed action

Anand Jain (14):
  btrfs-progs: add global verbose helper functions
  btrfs-progs: migrate subvolume delete to global verbose
  btrfs-progs: migrate filesystem defragment to global verbose
  btrfs-progs: migrate btrfs balance start to global verbose
  btrfs-progs: migrate balance status to global verbose
  btrfs-progs: fix help, show long option in balance start and status
  btrfs-progs: migrate rescue chunk-recover to global verbose
  btrfs-progs: migrate rescue super-recover to global verbose
  btrfs-progs: restore: delete unreachable code
  btrfs-progs: migrate restore to global verbose
  btrfs-progs: migrate inspect-internal inode-resolve to global verbose
  btrfs-progs: migrate inspect-internal logical-resolve to global
verbose
  btrfs-progs: refactor btrfs_scan_devices() to accept verbose argument
  btrfs-progs: enable verbose for btrfs device scan

 btrfs.c  | 12 ++--
 cmds/balance.c   | 29 +-
 cmds/device.c|  2 +-
 cmds/filesystem.c| 27 -
 cmds/inspect.c   | 36 +++---
 cmds/rescue.c| 22 +++---
 cmds/restore.c   | 86 +---
 cmds/subvolume.c | 35 +++--
 common/device-scan.c |  4 ++-
 common/device-scan.h |  2 +-
 common/help.h|  8 +
 common/messages.c| 19 
 common/messages.h|  5 +++
 common/utils.c   |  2 +-
 disk-io.c|  2 +-
 15 files changed, 155 insertions(+), 136 deletions(-)

-- 
1.8.3.1



[RFC PATCH 02/14] btrfs-progs: migrate subvolume delete to global verbose

2019-10-21 Thread Anand Jain
btrfs subvolume delete already supports verbose at the sub-command
level, this patch restores same verbose which can be either enabled
by the sub-command or from the top level command.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/subvolume.c | 35 +--
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/cmds/subvolume.c b/cmds/subvolume.c
index 7a5fd79bb1f3..18efd0cf6e4a 100644
--- a/cmds/subvolume.c
+++ b/cmds/subvolume.c
@@ -43,6 +43,8 @@
 #include "common/path-utils.h"
 #include "common/device-scan.h"
 
+extern bool global_verbose;
+
 static int wait_for_subvolume_cleaning(int fd, size_t count, uint64_t *ids,
   int sleep_interval)
 {
@@ -231,9 +233,9 @@ static const char * const cmd_subvol_delete_usage[] = {
"after a crash). Use one of the --commit options to wait until the",
"operation is safely stored on the media.",
"",
-   "-c|--commit-after  wait for transaction commit at the end of the 
operation",
-   "-C|--commit-each   wait for transaction commit after deleting each 
subvolume",
-   "-v|--verbose   verbose output of operations",
+   "-c|--commit-after  wait for transaction commit at the end of the 
operation",
+   "-C|--commit-each   wait for transaction commit after deleting each 
subvolume",
+   HELPINFO_INSERT_VERBOSE,
NULL
 };
 
@@ -248,7 +250,6 @@ static int cmd_subvol_delete(const struct cmd_struct *cmd,
char*dupvname = NULL;
char*path;
DIR *dirstream = NULL;
-   int verbose = 0;
int commit_mode = 0;
u8 fsid[BTRFS_FSID_SIZE];
char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
@@ -278,7 +279,7 @@ static int cmd_subvol_delete(const struct cmd_struct *cmd,
commit_mode = COMMIT_EACH;
break;
case 'v':
-   verbose++;
+   global_verbose = true;
break;
default:
usage_unknown_option(cmd, argv);
@@ -288,11 +289,9 @@ static int cmd_subvol_delete(const struct cmd_struct *cmd,
if (check_argc_min(argc - optind, 1))
return 1;
 
-   if (verbose > 0) {
-   printf("Transaction commit: %s\n",
-   !commit_mode ? "none (default)" :
-   commit_mode == COMMIT_AFTER ? "at the end" : "after 
each");
-   }
+   pr_verbose(global_verbose, "Transaction commit: %s\n",
+  !commit_mode ? "none (default)" :
+  commit_mode == COMMIT_AFTER ? "at the end" : "after each");
 
cnt = optind;
 
@@ -353,11 +352,10 @@ again:
}
 
if (add_seen_fsid(fsid, seen_fsid_hash, fd, dirstream) == 0) {
-   if (verbose > 0) {
-   uuid_unparse(fsid, uuidbuf);
-   printf("  new fs is found for '%s', fsid: %s\n",
-   path, uuidbuf);
-   }
+   uuid_unparse(fsid, uuidbuf);
+   pr_verbose(global_verbose,
+  "  new fs is found for '%s', fsid: %s\n",
+  path, uuidbuf);
/*
 * This is the first time a subvolume on this
 * filesystem is deleted, keep fd in order to issue
@@ -398,10 +396,11 @@ keep_fd:
"unable to do final sync after deletion: %m, fsid: %s",
uuidbuf);
ret = 1;
-   } else if (verbose > 0) {
+   } else {
uuid_unparse(seen->fsid, uuidbuf);
-   printf("final sync is done for fsid: 
%s\n",
-   uuidbuf);
+   pr_verbose(global_verbose,
+  "final sync is done for 
fsid: %s\n",
+  uuidbuf);
}
seen = seen->next;
}
-- 
1.8.3.1



[RFC PATCH 10/14] btrfs-progs: migrate restore to global verbose

2019-10-21 Thread Anand Jain
Command btrfs restore provides local verbose option, this patch makes it
enable-able by using the global --verbose option as well.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/restore.c | 69 ++
 1 file changed, 31 insertions(+), 38 deletions(-)

diff --git a/cmds/restore.c b/cmds/restore.c
index 79caf6734e76..3592faeb6bca 100644
--- a/cmds/restore.c
+++ b/cmds/restore.c
@@ -51,7 +51,7 @@ static char fs_name[PATH_MAX];
 static char path_name[PATH_MAX];
 static char symlink_target[PATH_MAX];
 static int get_snaps = 0;
-static int verbose = 0;
+extern bool global_verbose;
 static int restore_metadata = 0;
 static int restore_symlinks = 0;
 static int ignore_errors = 0;
@@ -375,8 +375,7 @@ static int copy_one_extent(struct btrfs_root *root, int fd,
if (compress == BTRFS_COMPRESS_NONE)
bytenr += offset;
 
-   if (verbose && offset)
-   printf("offset is %Lu\n", offset);
+   pr_verbose(global_verbose && offset, "offset is %Lu\n", offset);
/* we found a hole */
if (disk_size == 0)
return 0;
@@ -832,9 +831,8 @@ static int overwrite_ok(const char * path)
if (overwrite)
return 2;
 
-   if (verbose || !warn)
-   printf("Skipping existing file"
-  " %s\n", path);
+   pr_verbose(global_verbose || !warn,
+  "Skipping existing file %s\n", path);
if (!warn)
printf("If you wish to overwrite use -o\n");
warn = 1;
@@ -994,9 +992,8 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
goto out;
} else if (ret > 0) {
/* No more leaves to search */
-   if (verbose)
-   printf("Reached the end of the tree looking "
-  "for the directory\n");
+   pr_verbose(global_verbose,
+  "Reached the end of the tree looking for the directory\n");
ret = 0;
goto out;
}
@@ -1020,10 +1017,8 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
goto out;
} else if (ret > 0) {
/* No more leaves to search */
-   if (verbose)
-   printf("Reached the end of "
-  "the tree searching the"
-  " directory\n");
+   pr_verbose(global_verbose,
+   "Reached the end of the tree searching the directory\n");
ret = 0;
goto out;
}
@@ -1063,8 +1058,7 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
if (!overwrite_ok(path_name))
goto next;
 
-   if (verbose)
-   printf("Restoring %s\n", path_name);
+   pr_verbose(global_verbose, "Restoring %s\n", path_name);
if (dry_run)
goto next;
fd = open(path_name, O_CREAT|O_WRONLY, 0644);
@@ -1136,8 +1130,7 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
location.objectid = BTRFS_FIRST_FREE_OBJECTID;
}
 
-   if (verbose)
-   printf("Restoring %s\n", path_name);
+   pr_verbose(global_verbose, "Restoring %s\n", path_name);
 
errno = 0;
if (dry_run)
@@ -1200,8 +1193,7 @@ next:
}
}
 
-   if (verbose)
-   printf("Done searching %s\n", in_dir);
+   pr_verbose(global_verbose, "Done searching %s\n", in_dir);
 out:
btrfs_release_path(&path);
return ret;
@@ -1392,25 +1384,26 @@ static const char * const cmd_restore_usage[] = {
"btrfs restore [options]   | -l ",
"Try to restore files from a damaged filesystem (unmounted)",
"",
-   "-s|--snapshots   get snapshots",
-   "-x|--xattr   restore extended attributes",
-   "-m|--metadatarestore owner, mode and times",
-   "-S|--symlink restore symbolic links",
-   "-v|--verbose verbose",
-   "-i|--ignore-errors   ignore errors",
-   "-o|--overwrite   overwrite",
-   "-t   tree location",
-   "-f   filesystem location",
-   "-u|-

[RFC PATCH 05/14] btrfs-progs: migrate balance status to global verbose

2019-10-21 Thread Anand Jain
Make sure top level verbose option can enable the blalance status
subcommand's verbose option.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/balance.c | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/cmds/balance.c b/cmds/balance.c
index 7e84efd6a80d..d4916c5fb34e 100644
--- a/cmds/balance.c
+++ b/cmds/balance.c
@@ -822,7 +822,7 @@ static const char * const cmd_balance_status_usage[] = {
"btrfs balance status [-v] ",
"Show status of running or paused balance",
"",
-   "-v be verbose",
+   HELPINFO_INSERT_VERBOSE_SHORT,
NULL
 };
 
@@ -839,7 +839,6 @@ static int cmd_balance_status(const struct cmd_struct *cmd,
const char *path;
DIR *dirstream = NULL;
int fd;
-   int verbose = 0;
int ret;
 
optind = 0;
@@ -856,7 +855,7 @@ static int cmd_balance_status(const struct cmd_struct *cmd,
 
switch (opt) {
case 'v':
-   verbose = 1;
+   global_verbose = true;
break;
default:
usage_unknown_option(cmd, argv);
@@ -902,7 +901,7 @@ static int cmd_balance_status(const struct cmd_struct *cmd,
   (unsigned long long)args.stat.considered,
   100 * (1 - (float)args.stat.completed/args.stat.expected));
 
-   if (verbose)
+   if (global_verbose)
dump_ioctl_balance_args(&args);
 
ret = 1;
-- 
1.8.3.1



[RFC PATCH 08/14] btrfs-progs: migrate rescue super-recover to global verbose

2019-10-21 Thread Anand Jain
Now with this patch 'btrfs rescue super-recover' can show verbose output
either by the top level --verbose option or by the sub-command -v
option.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/rescue.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/cmds/rescue.c b/cmds/rescue.c
index 1785bc164264..bd11241478a8 100644
--- a/cmds/rescue.c
+++ b/cmds/rescue.c
@@ -101,8 +101,8 @@ static const char * const cmd_rescue_super_recover_usage[] 
= {
"btrfs rescue super-recover [options] ",
"Recover bad superblocks from good copies",
"",
-   "-y Assume an answer of `yes' to all questions",
-   "-v Verbose mode",
+   "-y Assume an answer of `yes' to all questions",
+   HELPINFO_INSERT_VERBOSE_SHORT,
NULL
 };
 
@@ -118,7 +118,6 @@ static int cmd_rescue_super_recover(const struct cmd_struct 
*cmd,
int argc, char **argv)
 {
int ret;
-   int verbose = 0;
int yes = 0;
char *dname;
 
@@ -129,7 +128,7 @@ static int cmd_rescue_super_recover(const struct cmd_struct 
*cmd,
break;
switch (c) {
case 'v':
-   verbose = 1;
+   global_verbose = true;
break;
case 'y':
yes = 1;
@@ -151,7 +150,7 @@ static int cmd_rescue_super_recover(const struct cmd_struct 
*cmd,
error("the device is busy");
return 1;
}
-   ret = btrfs_recover_superblocks(dname, verbose, yes);
+   ret = btrfs_recover_superblocks(dname, global_verbose, yes);
return ret;
 }
 static DEFINE_SIMPLE_COMMAND(rescue_super_recover, "super-recover");
-- 
1.8.3.1



[RFC PATCH 06/14] btrfs-progs: fix help, show long option in balance start and status

2019-10-21 Thread Anand Jain
btrfs balance start|status support both short and long option
-v|--verbose however failed to show it in its --help. This patch fixes
the --help.

Signed-off-by: Anand Jain 
---
 cmds/balance.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds/balance.c b/cmds/balance.c
index d4916c5fb34e..06bab9f7f96f 100644
--- a/cmds/balance.c
+++ b/cmds/balance.c
@@ -492,7 +492,7 @@ static const char * const cmd_balance_start_usage[] = {
"-d[filters]act on data chunks",
"-m[filters]act on metadata chunks",
"-s[filters]act on system chunks (only under -f)",
-   HELPINFO_INSERT_VERBOSE_SHORT,
+   HELPINFO_INSERT_VERBOSE,
"-f force a reduction of metadata integrity",
"--full-balance do not print warning and do not delay start",
"--background|--bg  run the balance as a background process",
@@ -822,7 +822,7 @@ static const char * const cmd_balance_status_usage[] = {
"btrfs balance status [-v] ",
"Show status of running or paused balance",
"",
-   HELPINFO_INSERT_VERBOSE_SHORT,
+   HELPINFO_INSERT_VERBOSE,
NULL
 };
 
-- 
1.8.3.1



[RFC PATCH 01/14] btrfs-progs: add global verbose helper functions

2019-10-21 Thread Anand Jain
The idea is to use the global --verbose command option to show
verbose output from the sub-commands. This patch adds a global
bool variable, %global_verbose, to transpire the verbose requisites
to the sub-command level. And provides pr_verbose() helper
function to log the verbose messages.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 btrfs.c   | 12 ++--
 common/help.h |  8 
 common/messages.c | 19 +++
 common/messages.h |  5 +
 4 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/btrfs.c b/btrfs.c
index 72dad6fb3983..ac10d8110495 100644
--- a/btrfs.c
+++ b/btrfs.c
@@ -26,8 +26,10 @@
 #include "common/help.h"
 #include "common/box.h"
 
+extern bool global_verbose;
+
 static const char * const btrfs_cmd_group_usage[] = {
-   "btrfs [--help] [--version] [--format ]  [...] 
 []",
+   "btrfs [--help] [--version] [--format ] [--verbose]  
[...]  []",
NULL
 };
 
@@ -242,12 +244,13 @@ static void handle_output_format(const char *format)
  */
 static int handle_global_options(int argc, char **argv)
 {
-   enum { OPT_HELP = 256, OPT_VERSION, OPT_FULL, OPT_FORMAT };
+   enum { OPT_HELP = 256, OPT_VERSION, OPT_FULL, OPT_FORMAT, OPT_VERBOSE };
static const struct option long_options[] = {
{ "help", no_argument, NULL, OPT_HELP },
{ "version", no_argument, NULL, OPT_VERSION },
{ "format", required_argument, NULL, OPT_FORMAT },
{ "full", no_argument, NULL, OPT_FULL },
+   { "verbose", no_argument, NULL, OPT_VERBOSE },
{ NULL, 0, NULL, 0}
};
int shift;
@@ -270,6 +273,7 @@ static int handle_global_options(int argc, char **argv)
case OPT_FORMAT:
handle_output_format(optarg);
break;
+   case OPT_VERBOSE: break;
default:
fprintf(stderr, "Unknown global option: %s\n",
argv[optind - 1]);
@@ -310,6 +314,10 @@ static void handle_special_globals(int shift, int argc, 
char **argv)
cmd_execute(&cmd_struct_version, argc, argv);
exit(0);
}
+
+   for (i = 0; i < shift; i++)
+   if (strcmp(argv[i], "--verbose") == 0)
+   global_verbose = true;
 }
 
 static const struct cmd_group btrfs_cmd_group = {
diff --git a/common/help.h b/common/help.h
index 01dfc68a7c8d..7bb3074b0be6 100644
--- a/common/help.h
+++ b/common/help.h
@@ -53,6 +53,14 @@
"-t|--tbytesshow sizes in TiB, or TB with --si"
 
 /*
+ * Global verbose option for the sub-commands
+ */
+#define HELPINFO_INSERT_VERBOSE
\
+   "-v|--verbose   show verbose output"
+#define HELPINFO_INSERT_VERBOSE_SHORT  
\
+   "-v show verbose output"
+
+/*
  * Special marker in the help strings that will preemptively insert the global
  * options and then continue with the following text that possibly follows
  * after the regular options
diff --git a/common/messages.c b/common/messages.c
index 0e5694ecd467..e14c112ebbbf 100644
--- a/common/messages.c
+++ b/common/messages.c
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include 
 #include "common/messages.h"
 
 __attribute__ ((format (printf, 1, 2)))
@@ -75,3 +76,21 @@ int __btrfs_error_on(int condition, const char *fmt, ...)
 
return 1;
 }
+
+bool global_verbose = false;
+
+__attribute__ ((format (printf, 2, 3)))
+void pr_verbose(bool condition, const char *fmt, ...)
+{
+   va_list args;
+
+   if (condition == false)
+   return;
+
+   if (global_verbose == false)
+   return;
+
+   va_start(args, fmt);
+   vfprintf(stdout, fmt, args);
+   va_end(args);
+}
diff --git a/common/messages.h b/common/messages.h
index 596047948fef..a14e2d62f3a0 100644
--- a/common/messages.h
+++ b/common/messages.h
@@ -14,6 +14,8 @@
  * Boston, MA 021110-1307, USA.
  */
 
+#include 
+
 #ifndef __BTRFS_MESSAGES_H__
 #define __BTRFS_MESSAGES_H__
 
@@ -96,3 +98,6 @@ __attribute__ ((format (printf, 2, 3)))
 int __btrfs_error_on(int condition, const char *fmt, ...);
 
 #endif
+
+__attribute__ ((format (printf, 2, 3)))
+void pr_verbose(bool condition, const char *fmt, ...);
-- 
1.8.3.1



[RFC PATCH 04/14] btrfs-progs: migrate btrfs balance start to global verbose

2019-10-21 Thread Anand Jain
Make sure the sub command balance start calls verbose when the global
verbose is set and vise versa.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/balance.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/cmds/balance.c b/cmds/balance.c
index 32830002f3a0..7e84efd6a80d 100644
--- a/cmds/balance.c
+++ b/cmds/balance.c
@@ -34,6 +34,8 @@
 #include "common/utils.h"
 #include "common/help.h"
 
+extern bool global_verbose;
+
 static const char * const balance_cmd_group_usage[] = {
"btrfs balance  [options] ",
"btrfs balance ",
@@ -487,14 +489,13 @@ static const char * const cmd_balance_start_usage[] = {
"long operation and the user is warned before this start, with",
"a delay to stop it.",
"",
-   "-d[filters]act on data chunks",
-   "-m[filters]act on metadata chunks",
-   "-s[filters]act on system chunks (only under -f)",
-   "-v be verbose",
-   "-f force a reduction of metadata integrity",
-   "--full-balance do not print warning and do not delay start",
-   "--background|--bg",
-   "   run the balance as a background process",
+   "-d[filters]act on data chunks",
+   "-m[filters]act on metadata chunks",
+   "-s[filters]act on system chunks (only under -f)",
+   HELPINFO_INSERT_VERBOSE_SHORT,
+   "-f force a reduction of metadata integrity",
+   "--full-balance do not print warning and do not delay start",
+   "--background|--bg  run the balance as a background process",
NULL
 };
 
@@ -505,7 +506,6 @@ static int cmd_balance_start(const struct cmd_struct *cmd,
struct btrfs_balance_args *ptrs[] = { &args.data, &args.sys,
&args.meta, NULL };
int force = 0;
-   int verbose = 0;
int background = 0;
unsigned start_flags = 0;
int i;
@@ -560,7 +560,7 @@ static int cmd_balance_start(const struct cmd_struct *cmd,
force = 1;
break;
case 'v':
-   verbose = 1;
+   global_verbose = true;
break;
case GETOPT_VAL_FULL_BALANCE:
start_flags |= BALANCE_START_NOWARN;
@@ -636,7 +636,7 @@ static int cmd_balance_start(const struct cmd_struct *cmd,
 
if (force)
args.flags |= BTRFS_BALANCE_FORCE;
-   if (verbose)
+   if (global_verbose)
dump_ioctl_balance_args(&args);
if (background) {
switch (fork()) {
-- 
1.8.3.1



[RFC PATCH 07/14] btrfs-progs: migrate rescue chunk-recover to global verbose

2019-10-21 Thread Anand Jain
Now with this patch the btrfs rescue chunk-recover can display verbose
output either at the sub-command level or at the top level. For example
'btrfs --verbose rescue chunk-recover <>' or 'btrfs rescue chunk-recover
-v <>'.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/rescue.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/cmds/rescue.c b/cmds/rescue.c
index e8eab6808bc3..1785bc164264 100644
--- a/cmds/rescue.c
+++ b/cmds/rescue.c
@@ -28,6 +28,8 @@
 #include "common/help.h"
 #include "cmds/rescue.h"
 
+extern bool global_verbose;
+
 static const char * const rescue_cmd_group_usage[] = {
"btrfs rescue  [options] ",
NULL
@@ -37,9 +39,9 @@ static const char * const cmd_rescue_chunk_recover_usage[] = {
"btrfs rescue chunk-recover [options] ",
"Recover the chunk tree by scanning the devices one by one.",
"",
-   "-y Assume an answer of `yes' to all questions",
-   "-v Verbose mode",
-   "-h Help",
+   "-yAssume an answer of `yes' to all questions",
+   HELPINFO_INSERT_VERBOSE_SHORT,
+   "-hHelp",
NULL
 };
 
@@ -49,7 +51,6 @@ static int cmd_rescue_chunk_recover(const struct cmd_struct 
*cmd,
int ret = 0;
char *file;
int yes = 0;
-   int verbose = 0;
 
optind = 0;
while (1) {
@@ -61,7 +62,7 @@ static int cmd_rescue_chunk_recover(const struct cmd_struct 
*cmd,
yes = 1;
break;
case 'v':
-   verbose = 1;
+   global_verbose = true;
break;
default:
usage_unknown_option(cmd, argv);
@@ -83,7 +84,7 @@ static int cmd_rescue_chunk_recover(const struct cmd_struct 
*cmd,
return 1;
}
 
-   ret = btrfs_recover_chunk_tree(file, verbose, yes);
+   ret = btrfs_recover_chunk_tree(file, global_verbose, yes);
if (!ret) {
fprintf(stdout, "Chunk tree recovered successfully\n");
} else if (ret > 0) {
-- 
1.8.3.1



[RFC PATCH 09/14] btrfs-progs: restore: delete unreachable code

2019-10-21 Thread Anand Jain
Maximum value of %verbose is 1 when %verbose is enabled using
'btrfs restore -v  ', and the code under the condition
%verbose > 1 is never reached. So delete them.

Signed-off-by: Anand Jain 
---
 cmds/restore.c | 17 -
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/cmds/restore.c b/cmds/restore.c
index c104b01aef69..79caf6734e76 100644
--- a/cmds/restore.c
+++ b/cmds/restore.c
@@ -987,9 +987,6 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
 
leaf = path.nodes[0];
while (!leaf) {
-   if (verbose > 1)
-   printf("No leaf after search, looking for the next "
-  "leaf\n");
ret = next_leaf(root, &path);
if (ret < 0) {
fprintf(stderr, "Error getting next leaf %d\n",
@@ -1035,18 +1032,12 @@ static int search_dir(struct btrfs_root *root, struct 
btrfs_key *key,
continue;
}
btrfs_item_key_to_cpu(leaf, &found_key, path.slots[0]);
-   if (found_key.objectid != key->objectid) {
-   if (verbose > 1)
-   printf("Found objectid=%Lu, key=%Lu\n",
-  found_key.objectid, key->objectid);
+   if (found_key.objectid != key->objectid)
break;
-   }
-   if (found_key.type != key->type) {
-   if (verbose > 1)
-   printf("Found type=%u, want=%u\n",
-  found_key.type, key->type);
+
+   if (found_key.type != key->type)
break;
-   }
+
dir_item = btrfs_item_ptr(leaf, path.slots[0],
  struct btrfs_dir_item);
name_ptr = (unsigned long)(dir_item + 1);
-- 
1.8.3.1



[RFC PATCH 14/14] btrfs-progs: enable verbose for btrfs device scan

2019-10-21 Thread Anand Jain
Enable verbose output for the device scan only through the global
verbose option.

For example:
./btrfs --verbose device scan
Scanning for Btrfs filesystems
registered: /dev/sda1
registered: /dev/sda2
registered: /dev/sda3
registered: /dev/sda5
registered: /dev/sda6

Signed-off-by: Anand Jain 
---
 cmds/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds/device.c b/cmds/device.c
index b429a169cd5d..b38bf78a798e 100644
--- a/cmds/device.c
+++ b/cmds/device.c
@@ -354,7 +354,7 @@ static int cmd_device_scan(const struct cmd_struct *cmd, 
int argc, char **argv)
}
} else {
printf("Scanning for Btrfs filesystems\n");
-   ret = btrfs_scan_devices(false);
+   ret = btrfs_scan_devices(true);
error_on(ret, "error %d while scanning", ret);
ret = btrfs_register_all_devices();
error_on(ret,
-- 
1.8.3.1



[RFC PATCH 12/14] btrfs-progs: migrate inspect-internal logical-resolve to global verbose

2019-10-21 Thread Anand Jain
Command btrfs inspect-internal logical-resolve provides local verbose
option this patch makes it enable-able by using the global --verbose
option.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/inspect.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/cmds/inspect.c b/cmds/inspect.c
index f872b471b420..3f35cdea56b5 100644
--- a/cmds/inspect.c
+++ b/cmds/inspect.c
@@ -40,7 +40,7 @@ static const char * const inspect_cmd_group_usage[] = {
NULL
 };
 
-static int __ino_to_path_fd(u64 inum, int fd, int verbose, const char *prepend)
+static int __ino_to_path_fd(u64 inum, int fd, const char *prepend)
 {
int ret;
int i;
@@ -118,8 +118,7 @@ static int cmd_inspect_inode_resolve(const struct 
cmd_struct *cmd,
if (fd < 0)
return 1;
 
-   ret = __ino_to_path_fd(arg_strtou64(argv[optind]), fd,
-  global_verbose ? 1 : 0, argv[optind+1]);
+   ret = __ino_to_path_fd(arg_strtou64(argv[optind]), fd, argv[optind+1]);
close_file_or_dir(fd, dirstream);
return !!ret;
 
@@ -130,11 +129,11 @@ static const char * const 
cmd_inspect_logical_resolve_usage[] = {
"btrfs inspect-internal logical-resolve [-Pv] [-s bufsize]  
",
"Get file system paths for the given logical address",
"",
-   "-P  skip the path resolving and print the inodes instead",
-   "-v  verbose mode",
-   "-s bufsize  set inode container's size. This is used to increase 
inode",
-   "container's size in case it is not enough to read all the 
",
-   "resolved results. The max value one can set is 64k",
+   "-P skip the path resolving and print the inodes 
instead",
+   HELPINFO_INSERT_VERBOSE_SHORT,
+   "-s bufsize set inode container's size. This is used to 
increase inode",
+   "   container's size in case it is not enough to read 
all the ",
+   "   resolved results. The max value one can set is 64k",
NULL
 };
 
@@ -144,7 +143,6 @@ static int cmd_inspect_logical_resolve(const struct 
cmd_struct *cmd,
int ret;
int fd;
int i;
-   int verbose = 0;
int getpath = 1;
int bytes_left;
struct btrfs_ioctl_logical_ino_args loi;
@@ -165,7 +163,7 @@ static int cmd_inspect_logical_resolve(const struct 
cmd_struct *cmd,
getpath = 0;
break;
case 'v':
-   verbose = 1;
+   global_verbose = true;
break;
case 's':
size = arg_strtou64(optarg);
@@ -200,8 +198,8 @@ static int cmd_inspect_logical_resolve(const struct 
cmd_struct *cmd,
goto out;
}
 
-   if (verbose)
-   printf("ioctl ret=%d, total_size=%llu, bytes_left=%lu, "
+   pr_verbose(global_verbose,
+   "ioctl ret=%d, total_size=%llu, bytes_left=%lu, "
"bytes_missing=%lu, cnt=%d, missed=%d\n",
ret, size,
(unsigned long)inodes->bytes_left,
@@ -251,8 +249,7 @@ static int cmd_inspect_logical_resolve(const struct 
cmd_struct *cmd,
goto out;
}
}
-   ret = __ino_to_path_fd(inum, path_fd, verbose,
-   full_path);
+   ret = __ino_to_path_fd(inum, path_fd, full_path);
if (path_fd != fd)
close_file_or_dir(path_fd, dirs);
} else {
-- 
1.8.3.1



[RFC PATCH 11/14] btrfs-progs: migrate inspect-internal inode-resolve to global verbose

2019-10-21 Thread Anand Jain
Command btrfs inspect-internal inode-resolve provides verbose option at
the sub-command level, this patch makes it enable-able by using the
global --verbose option.

Suggested-by: David Sterba 
Signed-off-by: Anand Jain 
---
 cmds/inspect.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/cmds/inspect.c b/cmds/inspect.c
index 758b6e60c591..f872b471b420 100644
--- a/cmds/inspect.c
+++ b/cmds/inspect.c
@@ -33,6 +33,8 @@
 #include "btrfs-list.h"
 #include "common/help.h"
 
+extern bool global_verbose;
+
 static const char * const inspect_cmd_group_usage[] = {
"btrfs inspect-internal  ",
NULL
@@ -56,8 +58,8 @@ static int __ino_to_path_fd(u64 inum, int fd, int verbose, 
const char *prepend)
goto out;
}
 
-   if (verbose)
-   printf("ioctl ret=%d, bytes_left=%lu, bytes_missing=%lu, "
+   pr_verbose(global_verbose,
+   "ioctl ret=%d, bytes_left=%lu, bytes_missing=%lu, "
"cnt=%d, missed=%d\n", ret,
(unsigned long)fspath->bytes_left,
(unsigned long)fspath->bytes_missing,
@@ -83,7 +85,7 @@ static const char * const cmd_inspect_inode_resolve_usage[] = 
{
"btrfs inspect-internal inode-resolve [-v]  ",
"Get file system paths for the given inode",
"",
-   "-v   verbose mode",
+   HELPINFO_INSERT_VERBOSE_SHORT,
NULL
 };
 
@@ -91,7 +93,6 @@ static int cmd_inspect_inode_resolve(const struct cmd_struct 
*cmd,
 int argc, char **argv)
 {
int fd;
-   int verbose = 0;
int ret;
DIR *dirstream = NULL;
 
@@ -103,7 +104,7 @@ static int cmd_inspect_inode_resolve(const struct 
cmd_struct *cmd,
 
switch (c) {
case 'v':
-   verbose = 1;
+   global_verbose = true;
break;
default:
usage_unknown_option(cmd, argv);
@@ -117,8 +118,8 @@ static int cmd_inspect_inode_resolve(const struct 
cmd_struct *cmd,
if (fd < 0)
return 1;
 
-   ret = __ino_to_path_fd(arg_strtou64(argv[optind]), fd, verbose,
-  argv[optind+1]);
+   ret = __ino_to_path_fd(arg_strtou64(argv[optind]), fd,
+  global_verbose ? 1 : 0, argv[optind+1]);
close_file_or_dir(fd, dirstream);
return !!ret;
 
-- 
1.8.3.1



[RFC PATCH 13/14] btrfs-progs: refactor btrfs_scan_devices() to accept verbose argument

2019-10-21 Thread Anand Jain
Function btrfs_scan_devices() is being used by commands such as
'btrfs filesystem' and 'btrfs device', by having the verbose argument in
the btrfs_scan_devices() we can control which threads to show the
verbose when verbose is enabled by the global verbose option.

So add an option %verbose to btrfs_scan_devices().

Signed-off-by: Anand Jain 
---
 cmds/device.c| 2 +-
 cmds/filesystem.c| 2 +-
 common/device-scan.c | 4 +++-
 common/device-scan.h | 2 +-
 common/utils.c   | 2 +-
 disk-io.c| 2 +-
 6 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/cmds/device.c b/cmds/device.c
index 24158308a41b..b429a169cd5d 100644
--- a/cmds/device.c
+++ b/cmds/device.c
@@ -354,7 +354,7 @@ static int cmd_device_scan(const struct cmd_struct *cmd, 
int argc, char **argv)
}
} else {
printf("Scanning for Btrfs filesystems\n");
-   ret = btrfs_scan_devices();
+   ret = btrfs_scan_devices(false);
error_on(ret, "error %d while scanning", ret);
ret = btrfs_register_all_devices();
error_on(ret,
diff --git a/cmds/filesystem.c b/cmds/filesystem.c
index ee4d366fbf64..fb6e2e998dcf 100644
--- a/cmds/filesystem.c
+++ b/cmds/filesystem.c
@@ -746,7 +746,7 @@ devs_only:
else
ret = 1;
} else {
-   ret = btrfs_scan_devices();
+   ret = btrfs_scan_devices(false);
}
 
if (ret) {
diff --git a/common/device-scan.c b/common/device-scan.c
index 48dbd9e19715..a5963d789f49 100644
--- a/common/device-scan.c
+++ b/common/device-scan.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "kernel-lib/overflow.h"
 #include "common/path-utils.h"
 #include "common/device-scan.h"
@@ -360,7 +361,7 @@ void free_seen_fsid(struct seen_fsid *seen_fsid_hash[])
}
 }
 
-int btrfs_scan_devices(void)
+int btrfs_scan_devices(bool verbose)
 {
int fd = -1;
int ret;
@@ -404,6 +405,7 @@ int btrfs_scan_devices(void)
close (fd);
continue;
}
+   pr_verbose(verbose, "registered: %s\n", path);
 
close(fd);
}
diff --git a/common/device-scan.h b/common/device-scan.h
index eda2bae5c6c4..c50fe2fbf91f 100644
--- a/common/device-scan.h
+++ b/common/device-scan.h
@@ -29,7 +29,7 @@ struct seen_fsid {
int fd;
 };
 
-int btrfs_scan_devices(void);
+int btrfs_scan_devices(bool verbose);
 int btrfs_register_one_device(const char *fname);
 int btrfs_register_all_devices(void);
 int btrfs_add_to_fsid(struct btrfs_trans_handle *trans,
diff --git a/common/utils.c b/common/utils.c
index 6617b3ef38b1..9027de596f5d 100644
--- a/common/utils.c
+++ b/common/utils.c
@@ -277,7 +277,7 @@ int check_mounted_where(int fd, const char *file, char 
*where, int size,
 
/* scan other devices */
if (is_btrfs && total_devs > 1) {
-   ret = btrfs_scan_devices();
+   ret = btrfs_scan_devices(false);
if (ret)
return ret;
}
diff --git a/disk-io.c b/disk-io.c
index a9744af90a43..33bd003167fe 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -1100,7 +1100,7 @@ int btrfs_scan_fs_devices(int fd, const char *path,
}
 
if (!skip_devices && total_devs != 1) {
-   ret = btrfs_scan_devices();
+   ret = btrfs_scan_devices(false);
if (ret)
return ret;
}
-- 
1.8.3.1



Re: first it froze, now the (btrfs) root fs won't mount ...

2019-10-21 Thread Christian Pernegger
[Please CC me, I'm not on the list.]

Am So., 20. Okt. 2019 um 12:28 Uhr schrieb Qu Wenruo :
> > Question: Can I work with the mounted backup image on the machine that
> > also contains the original disc? I vaguely recall something about
> > btrfs really not liking clones.
>
> If your fs only contains one device (single fs on single device), then
> you should be mostly fine. [...] mostly OK.

Should? Mostly? What a nightmare-inducing, yet pleasantly Adams-esqe
way of putting things ... :-)

Anyway, I have an image of the whole disk on a server now and am
feeling all the more adventurous for it. (The first try failed a
couple of MB from completion due to spurious network issues, which is
why I've taken so long to reply.)

> > You wouldn't happen to know of a [suitable] bootable rescue image [...]?
>
> Archlinux iso at least has the latest btrfs-progs.

I'm on the Ubuntu 19.10 live CD (btrfs-progs 5.2.1, kernel 5.3.0)
until further notice. Exploring other options (incl. running your
rescue kernel on another machine and serving the disk via nbd) in
parallel.

> I'd recommend the following safer methods before trying --init-extent-tree:
>
> - Dump backup roots first:
>   # btrfs ins dump-super -f  | grep backup_treee_root
>   Then grab all big numbers.

# btrfs inspect-internal dump-super -f /dev/nvme0n1p2 | grep backup_tree_root
backup_tree_root:284041969664gen: 58600level: 1
backup_tree_root:284041953280gen: 58601level: 1
backup_tree_root:284042706944gen: 58602level: 1
backup_tree_root:284045410304gen: 58603level: 1

> - Try backup_extent_root numbers in btrfs check first
>   # btrfs check -r  
>   Use the number with highest generation first.

Assuming backup_extent_root == backup_tree_root ...

# btrfs check --tree-root 284045410304 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284041084928 found E4E3BDB6 wanted 
checksum verify failed on 284041084928 found E4E3BDB6 wanted 
bad tree block 284041084928, bytenr mismatch, want=284041084928, have=0
ERROR: cannot open file system

# btrfs check --tree-root 284042706944 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284042706944 found E4E3BDB6 wanted 
checksum verify failed on 284042706944 found E4E3BDB6 wanted 
bad tree block 284042706944, bytenr mismatch, want=284042706944, have=0
Couldn't read tree root
ERROR: cannot open file system

# btrfs check --tree-root 284041953280 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284041953280 found E4E3BDB6 wanted 
checksum verify failed on 284041953280 found E4E3BDB6 wanted 
bad tree block 284041953280, bytenr mismatch, want=284041953280, have=0
Couldn't read tree root
ERROR: cannot open file system

# btrfs check --tree-root 284041969664 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284041969664 found E4E3BDB6 wanted 
checksum verify failed on 284041969664 found E4E3BDB6 wanted 
bad tree block 284041969664, bytenr mismatch, want=284041969664, have=0
Couldn't read tree root
ERROR: cannot open file system

>   If all backup fails to pass basic btrfs check, and all happen to have
>   the same "wanted " then it means a big range of tree blocks
>   get wiped out, not really related to btrfs but some hardware wipe.

Doesn't look good, does it? Any further ideas at all or is this the
end of the line? TBH, at this point, I don't mind having to re-install
the box so much as the idea that the same thing might happen again --
either to this one, or to my work machine, which is very similar. If
nothing else, I'd really appreciate knowing what exactly happened here
and why -- a bug in the GPU and/or its driver shouldn't cause this --;
and an avoidance strategy that goes beyond-upgrade-and-pray.

Cheers,
Christian


Re: first it froze, now the (btrfs) root fs won't mount ...

2019-10-21 Thread Qu Wenruo


On 2019/10/21 下午6:47, Christian Pernegger wrote:
> [Please CC me, I'm not on the list.]
> 
> Am So., 20. Okt. 2019 um 12:28 Uhr schrieb Qu Wenruo :
>>> Question: Can I work with the mounted backup image on the machine that
>>> also contains the original disc? I vaguely recall something about
>>> btrfs really not liking clones.
>>
>> If your fs only contains one device (single fs on single device), then
>> you should be mostly fine. [...] mostly OK.
> 
> Should? Mostly? What a nightmare-inducing, yet pleasantly Adams-esqe
> way of putting things ... :-)
> 
> Anyway, I have an image of the whole disk on a server now and am
> feeling all the more adventurous for it. (The first try failed a
> couple of MB from completion due to spurious network issues, which is
> why I've taken so long to reply.)
> 
>>> You wouldn't happen to know of a [suitable] bootable rescue image [...]?
>>
>> Archlinux iso at least has the latest btrfs-progs.
> 
> I'm on the Ubuntu 19.10 live CD (btrfs-progs 5.2.1, kernel 5.3.0)
> until further notice. Exploring other options (incl. running your
> rescue kernel on another machine and serving the disk via nbd) in
> parallel.
> 
>> I'd recommend the following safer methods before trying --init-extent-tree:
>>
>> - Dump backup roots first:
>>   # btrfs ins dump-super -f  | grep backup_treee_root
>>   Then grab all big numbers.
> 
> # btrfs inspect-internal dump-super -f /dev/nvme0n1p2 | grep backup_tree_root
> backup_tree_root:284041969664gen: 58600level: 1
> backup_tree_root:284041953280gen: 58601level: 1
> backup_tree_root:284042706944gen: 58602level: 1
> backup_tree_root:284045410304gen: 58603level: 1
> 
>> - Try backup_extent_root numbers in btrfs check first
>>   # btrfs check -r  
>>   Use the number with highest generation first.
> 
> Assuming backup_extent_root == backup_tree_root ...
> 
> # btrfs check --tree-root 284045410304 /dev/nvme0n1p2
> Opening filesystem to check...
> checksum verify failed on 284041084928 found E4E3BDB6 wanted 
> checksum verify failed on 284041084928 found E4E3BDB6 wanted 
> bad tree block 284041084928, bytenr mismatch, want=284041084928, have=0
> ERROR: cannot open file system
> 
> # btrfs check --tree-root 284042706944 /dev/nvme0n1p2
> Opening filesystem to check...
> checksum verify failed on 284042706944 found E4E3BDB6 wanted 
> checksum verify failed on 284042706944 found E4E3BDB6 wanted 
> bad tree block 284042706944, bytenr mismatch, want=284042706944, have=0
> Couldn't read tree root
> ERROR: cannot open file system
> 
> # btrfs check --tree-root 284041953280 /dev/nvme0n1p2
> Opening filesystem to check...
> checksum verify failed on 284041953280 found E4E3BDB6 wanted 
> checksum verify failed on 284041953280 found E4E3BDB6 wanted 
> bad tree block 284041953280, bytenr mismatch, want=284041953280, have=0
> Couldn't read tree root
> ERROR: cannot open file system
> 
> # btrfs check --tree-root 284041969664 /dev/nvme0n1p2
> Opening filesystem to check...
> checksum verify failed on 284041969664 found E4E3BDB6 wanted 
> checksum verify failed on 284041969664 found E4E3BDB6 wanted 
> bad tree block 284041969664, bytenr mismatch, want=284041969664, have=0
> Couldn't read tree root
> ERROR: cannot open file system

This doesn't look good at all.

All 4 copies are wiped out, so it doesn't look like a bug in btrfs, but
some other problem wiping out a full range of tree blocks.

> 
>>   If all backup fails to pass basic btrfs check, and all happen to have
>>   the same "wanted " then it means a big range of tree blocks
>>   get wiped out, not really related to btrfs but some hardware wipe.
> 
> Doesn't look good, does it? Any further ideas at all or is this the
> end of the line? TBH, at this point, I don't mind having to re-install
> the box so much as the idea that the same thing might happen again --

I don't have good idea. The result looks like something have wiped part
of your tree blocks (not a single one, but a range).

> either to this one, or to my work machine, which is very similar. If
> nothing else, I'd really appreciate knowing what exactly happened here
> and why -- a bug in the GPU and/or its driver shouldn't cause this --;
> and an avoidance strategy that goes beyond-upgrade-and-pray.

At this stage, I'm sorry that I have no idea at all.

If you're 100% sure that you haven't enabled discard for a while, then I
guess it doesn't look like btrfs at least.
Btrfs shouldn't cause so many tree blocks wiped at all, even for v5.0
kernel.

Thanks,
Qu

> 
> Cheers,
> Christian
> 



signature.asc
Description: OpenPGP digital signature


Lots of btrfs_dump_space_info in kernel log

2019-10-21 Thread Patrik Lundquist
I'm running Debian Testing with kernel 5.2.17-1. Five disk raid1 with
at least 393.01GiB unallocated on each disk. No device errors. No
kernel WARNINGs or ERRORs.

BTRFS info (device dm-1): enabling auto defrag
BTRFS info (device dm-1): using free space tree
BTRFS info (device dm-1): has skinny extents

Mounted with noauto,noatime,autodefrag,skip_balance,space_cache=v2,enospc_debug.

First btrfs_dump_space_info() is

BTRFS info (device dm-1): space_info 4 has 18446744073353838592 free,
is not full
BTRFS info (device dm-1): space_info total=10737418240,
used=9636544512, pinned=0, reserved=27066368, may_use=1429454848,
readonly=65536
BTRFS info (device dm-1): global_block_rsv: size 536870912 reserved 536870912
BTRFS info (device dm-1): trans_block_rsv: size 0 reserved 0
BTRFS info (device dm-1): chunk_block_rsv: size 0 reserved 0
BTRFS info (device dm-1): delayed_block_rsv: size 20185088 reserved 20185088
BTRFS info (device dm-1): delayed_refs_rsv: size 868745216 reserved 868745216

and current last is

BTRFS info (device dm-1): space_info 4 has 0 free, is not full
BTRFS info (device dm-1): space_info total=10737418240,
used=9664561152, pinned=458752, reserved=2523136, may_use=1069809664,
readonly=65536
BTRFS info (device dm-1): global_block_rsv: size 536870912 reserved 536821760
BTRFS info (device dm-1): trans_block_rsv: size 0 reserved 0
BTRFS info (device dm-1): chunk_block_rsv: size 0 reserved 0
BTRFS info (device dm-1): delayed_block_rsv: size 0 reserved 0
BTRFS info (device dm-1): delayed_refs_rsv: size 556531712 reserved 498647040

with lots more in between and no other Btrfs kernel printing preceding
it since mounting.

It's the first time I'm seeing it but I have always mounted with
enospc_debug since it always seems too late to add the option when you
finally need it.

What does it mean? Should I be worried? What to do? No apparent problems yet.


Re: first it froze, now the (btrfs) root fs won't mount ...

2019-10-21 Thread Austin S. Hemmelgarn

On 2019-10-21 06:47, Christian Pernegger wrote:

[Please CC me, I'm not on the list.]

Am So., 20. Okt. 2019 um 12:28 Uhr schrieb Qu Wenruo :

Question: Can I work with the mounted backup image on the machine that
also contains the original disc? I vaguely recall something about
btrfs really not liking clones.


If your fs only contains one device (single fs on single device), then
you should be mostly fine. [...] mostly OK.


Should? Mostly? What a nightmare-inducing, yet pleasantly Adams-esqe
way of putting things ... :-)

Anyway, I have an image of the whole disk on a server now and am
feeling all the more adventurous for it. (The first try failed a
couple of MB from completion due to spurious network issues, which is
why I've taken so long to reply.)
I've done stuff like this dozens of times on single-device volumes with 
exactly zero issues.  The only time you're likely to see problems is if 
the kernel thinks (either correctly or incorrectly) that the volume 
should consist of multiple devices.


Ultimately, the issue is that the kernel tries to use all devices it 
knows of with the same volume UUID when you mount the volume, without 
validating the number of devices and that there are no duplicate device 
UUID's in the volume, so it can accidentally pull in multiple instances 
of the same 'device' when mounting.



You wouldn't happen to know of a [suitable] bootable rescue image [...]?


Archlinux iso at least has the latest btrfs-progs.


I'm on the Ubuntu 19.10 live CD (btrfs-progs 5.2.1, kernel 5.3.0)
until further notice. Exploring other options (incl. running your
rescue kernel on another machine and serving the disk via nbd) in
parallel.


I'd recommend the following safer methods before trying --init-extent-tree:

- Dump backup roots first:
   # btrfs ins dump-super -f  | grep backup_treee_root
   Then grab all big numbers.


# btrfs inspect-internal dump-super -f /dev/nvme0n1p2 | grep backup_tree_root
backup_tree_root:284041969664gen: 58600level: 1
backup_tree_root:284041953280gen: 58601level: 1
backup_tree_root:284042706944gen: 58602level: 1
backup_tree_root:284045410304gen: 58603level: 1


- Try backup_extent_root numbers in btrfs check first
   # btrfs check -r  
   Use the number with highest generation first.


Assuming backup_extent_root == backup_tree_root ...

# btrfs check --tree-root 284045410304 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284041084928 found E4E3BDB6 wanted 
checksum verify failed on 284041084928 found E4E3BDB6 wanted 
bad tree block 284041084928, bytenr mismatch, want=284041084928, have=0
ERROR: cannot open file system

# btrfs check --tree-root 284042706944 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284042706944 found E4E3BDB6 wanted 
checksum verify failed on 284042706944 found E4E3BDB6 wanted 
bad tree block 284042706944, bytenr mismatch, want=284042706944, have=0
Couldn't read tree root
ERROR: cannot open file system

# btrfs check --tree-root 284041953280 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284041953280 found E4E3BDB6 wanted 
checksum verify failed on 284041953280 found E4E3BDB6 wanted 
bad tree block 284041953280, bytenr mismatch, want=284041953280, have=0
Couldn't read tree root
ERROR: cannot open file system

# btrfs check --tree-root 284041969664 /dev/nvme0n1p2
Opening filesystem to check...
checksum verify failed on 284041969664 found E4E3BDB6 wanted 
checksum verify failed on 284041969664 found E4E3BDB6 wanted 
bad tree block 284041969664, bytenr mismatch, want=284041969664, have=0
Couldn't read tree root
ERROR: cannot open file system


   If all backup fails to pass basic btrfs check, and all happen to have
   the same "wanted " then it means a big range of tree blocks
   get wiped out, not really related to btrfs but some hardware wipe.


Doesn't look good, does it? Any further ideas at all or is this the
end of the line? TBH, at this point, I don't mind having to re-install
the box so much as the idea that the same thing might happen again --
either to this one, or to my work machine, which is very similar. If
nothing else, I'd really appreciate knowing what exactly happened here
and why -- a bug in the GPU and/or its driver shouldn't cause this --;
and an avoidance strategy that goes beyond-upgrade-and-pray.
There are actually two possible ways I can think of a buggy GPU driver 
causing this type of issue:


* The GPU driver in some way caused memory corruption, which in turn 
caused other problems.
* The GPU driver confused the GPU enough that it issued a P2P transfer 
on the PCI-e bus to the NVMe device, which in turn caused data 
corruption on the NVMe device.


Both are reasonably unlikely, but definitely possible.  Your best option 
for mitigation (other than just not using that version of that GPU 
driver) is to ensure that your ha

Btrfs progs release 5.3

2019-10-21 Thread David Sterba
Hi,

btrfs-progs version 5.3 have been released.

There are some changes that fit the major release time, like changing default
tree traversal for dump-tree, new CI integration and documentation generation.
There's preparatory work for the checksums that will be in 5.5 and we decided
to release kernel and progs support at the same time. So mkfs now understands
--checksum and --csum but will accept only crc32c.

No change since 5.3-rc1

Changelog:

  * mkfs:
* new option to specify checksum algorithm (only crc32c)
* fix xattr enumeration
  * dump-tree: BFS (breadth-first) traversal now default
  * libbtrfsutil: remove stale BTRFS_DEV_REPLACE_ITEM_STATE_x defines
  * ci: add support for gitlab
  * other:
* preparatory work for more checksum algorithms
* docs update
* switch to docbook5 backend for asciidoc
* fix build on uClibc due to missing backtrace()
* lots of printf format fixups

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (4):
  libbtrfsutil: remove stale BTRFS_DEV_REPLACE_ITEM_STATE_x defines
  btrfs-progs: print-tree: add BTRFS_DEV_ITEMS_OBJECTID in comment
  btrfs-progs: tests: fix misc/021 when restoring overlapping stale data
  btrfs-progs: mkfs: match devid order to the stripe index

David Sterba (8):
  btrfs-progs: dump-tree: update help and docs regarding DFS/BFS
  btrfs-progs: move qgroup-verify.[ch] to check/
  btrfs-progs: copy btrfsck.h to check/common.h
  btrfs-progs: check: remove flat include switch from common.h
  btrfs-progs: check: move device_record to main.c
  btrfs-progs: build: add missing objects to libbtrfs
  btrfs-progs: update CHANGES for 5.3
  Btrfs progs v5.3

Dennis Zhou (1):
  btrfs-progs: docs: add compression level support for mount options

Dimitri John Ledkov (1):
  btrfs-progs: docs: use docbook5 backend for asciidoctor

Fabrice Fontaine (1):
  btrfs-progs: kerncompat: define BTRFS_DISABLE_BACKTRACE when building 
with uClibc

Jeff Mahoney (2):
  btrfs-progs: qgroups: use parse_size instead of open coding it
  btrfs-progs: constify argument of parse_size

Johannes Thumshirn (15):
  btrfs-progs: use btrfs_csum_data() in __csum_tree_block_size()
  btrfs-progs: pass in a btrfs_mkfs_config to write_temp_extent_buffer
  btrfs-progs: make checksum type explicit in mkfs context structure
  btrfs-progs: don't blindly assume crc32c in csum_tree_block_size()
  btrfs-progs: cache csum_type in recover_control
  btrfs-progs: add checksum type to checksumming functions
  btrfs-progs: don't assume checksums are always 4 bytes
  btrfs-progs: pass checksum type to btrfs_csum_data()/btrfs_csum_final()
  btrfs-progs: sb-mod: simplify update_block_csum()
  btrfs-progs: update checksumming api
  btrfs-progs: mkfs: new option to specify checksum type
  btrfs-progs: add is_valid_csum_type() helper
  btrfs-progs: add table for checksum type and name
  btrfs-progs: mkfs: print checksum type when running mkfs
  btrfs-progs: unbreak btrfs-sb-mod compilation

Lakshmipathi.G (1):
  btrfs-progs: ci: setup GitLab-CI

Long An (1):
  btrfs-progs: testsadd clean-test.sh to testsuite-files

Nikolay Borisov (1):
  btrfs-progs: corrupt-block: Fix description of 'r' option

Qu Wenruo (1):
  btrfs-progs: print-tree: Use BFS as default traversal method

Rosen Penev (1):
  btrfs-progs: Fix printf formats

Vladimir Panteleev (2):
  btrfs-progs: docs: document btrfs-balance exit status in detail
  btrfs-progs: mkfs: fix xattr enumeration



Re: [PATCH v2] btrfs-progs: small fixes/cleanup in Documentation

2019-10-21 Thread David Sterba
On Fri, Oct 18, 2019 at 06:23:31PM +0200, Merlin Büge wrote:
> 
> 
> On Fri, 18 Oct 2019 18:14:33 +0200
> Merlin Büge  wrote:
> ...
> > diff --git a/Documentation/btrfs-man5.asciidoc 
> > b/Documentation/btrfs-man5.asciidoc
> > index 6a1a04b7..ee6010fe 100644
> > --- a/Documentation/btrfs-man5.asciidoc
> > +++ b/Documentation/btrfs-man5.asciidoc
> ...
> > @@ -659,7 +653,7 @@ swapfile extents or may fail:
> ...
> > -* device replace - dtto
> > +* device replace - ditto
> 
> I think I may have got that last one wrong, sorry.
> I've never seen it spelled 'dtto', but I now see further references e.g.
> here:
> https://github.com/kdave/btrfs-progs/blob/master/Makefile
> 
> So, feel free to pick v1/v2 which you find the most appropriate!

I've merged the patch as-is, thank. The 'ditto' spelling is probably
more widely used in english texts. 'dtto' is in sources and thus not
visible to wide audience so we can live with that.


Re: [PATCH 0/4] Small coding style cleanups

2019-10-21 Thread Nikolay Borisov



On 21.10.19 г. 11:06 ч., Johannes Thumshirn wrote:
> On 18/10/2019 14:02, Nikolay Borisov wrote:
> [...]
> 
>> Patches 2-4 LGTM you can add, :
> 
> What's wrong with 1/4?

Nothing per-se but it looked ugly and my brain refused to make sense of
it so I haven't really reviewed it :) .

> 
> 
> 


Re: [PATCH 0/4] Small coding style cleanups

2019-10-21 Thread Johannes Thumshirn
On 21/10/2019 14:32, Nikolay Borisov wrote:
> 
> 
> On 21.10.19 г. 11:06 ч., Johannes Thumshirn wrote:
>> On 18/10/2019 14:02, Nikolay Borisov wrote:
>> [...]
>>
>>> Patches 2-4 LGTM you can add, :
>>
>> What's wrong with 1/4?
> 
> Nothing per-se but it looked ugly and my brain refused to make sense of
> it so I haven't really reviewed it :) .

It will look less ugly after the patch is applied ;-)

https://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git/tree/fs/btrfs/raid56.c?h=btrfs-raid-cleanups&id=ad33a49606b6f31c7fe20aa089f5603ad3065bd5#n684

-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 1/4] btrfs: reduce indentation in lock_stripe_add

2019-10-21 Thread Nikolay Borisov



On 18.10.19 г. 12:58 ч., Johannes Thumshirn wrote:
> In lock_stripe_add() we're traversing the stripe hash list and check if
> the current list element's raid_map equals is equal to the raid bio's
> raid_map. If both are equal we continue processing.
> 
> If we'd check for inequality instead of equality we can reduce one level
> of indentation.
> 
> Signed-off-by: Johannes Thumshirn 

After comparing before/after applying the patch I can say:

Reviewed-by: Nikolay Borisov 


Re: [PATCH v2] btrfs-progs: small fixes/cleanup in Documentation

2019-10-21 Thread Merlin Büge



On Mon, 21 Oct 2019 14:30:50 +0200
David Sterba  wrote:
...
> I've merged the patch as-is, thank. The 'ditto' spelling is probably
> more widely used in english texts. 'dtto' is in sources and thus not
> visible to wide audience so we can live with that.

Thank you!

-- 
Merlin Büge


Re: first it froze, now the (btrfs) root fs won't mount ...

2019-10-21 Thread Christian Pernegger
[Please CC me, I'm not on the list.]

Am Mo., 21. Okt. 2019 um 13:47 Uhr schrieb Austin S. Hemmelgarn
:
> I've [worked with fs clones] like this dozens of times on single-device 
> volumes with exactly zero issues.

Thank you, I have taken precautions, but it does seem to work fine.

> There are actually two possible ways I can think of a buggy GPU driver 
> causing this type of issue: [snip]

Interesting and plausible, but ...

> Your best option for mitigation [...] is to ensure that your hardware has an 
> IOMMU [...] and ensure it's enabled in firmware.

It has and it is. (The machine's been specced so GPU pass-through is
an option, should it be required. I haven't gotten around to setting
that up yet, haven't even gotten a second GPU, but I have laid the
groundwork, the IOMMU is enabled and, as far as one can tell from logs
and such, working.)

> However, there's also the possibility that you may have hardware issues.

Don't I know it ... The problem is, if there are hardware issues,
that's the first I've seen of them, and while I didn't run torture
tests, there was quite a lot of benchmarking when it was new. Needle
in a haystack. Some memory testing can't hurt, I suppose. Any other
ideas (for hardware testing)?

Back on the topic of TRIM: I'm 99 % certain discard wasn't set on the
mount (not by me, in any case), but I think Mint runs fstrim
periodically by default. Just to be sure, should any form of TRIM be
disabled?
The only other idea I've got is Timeshift's hourly snapshots. (How)
would btrfs deal with a crash during snapshot creation?


In other news, I've still not quite given up, mainly because the fs
doesn't look all that broken. The output of btrfs inspect-internal
dump-tree (incl. options), for instance, looks like gibberish to me of
course, but it looks sane, doesn't spew warnings, doesn't error out or
crash. Also plain btrfs check --init-extent-tree errored out, same
with -s0, but with -s1 it's now chugging along. (BTW, is there a
hierarchy among the super block slots, a best or newest one?)

Will keep you posted.

Cheers,
C.


Re: Lots of btrfs_dump_space_info in kernel log

2019-10-21 Thread David Sterba
On Mon, Oct 21, 2019 at 12:58:12PM +0200, Patrik Lundquist wrote:
> I'm running Debian Testing with kernel 5.2.17-1. Five disk raid1 with
> at least 393.01GiB unallocated on each disk. No device errors. No
> kernel WARNINGs or ERRORs.
> 
> BTRFS info (device dm-1): enabling auto defrag
> BTRFS info (device dm-1): using free space tree
> BTRFS info (device dm-1): has skinny extents
> 
> Mounted with 
> noauto,noatime,autodefrag,skip_balance,space_cache=v2,enospc_debug.
> 
> First btrfs_dump_space_info() is
> 
> BTRFS info (device dm-1): space_info 4 has 18446744073353838592 free,
> is not full
> BTRFS info (device dm-1): space_info total=10737418240,
> used=9636544512, pinned=0, reserved=27066368, may_use=1429454848,
> readonly=65536
> BTRFS info (device dm-1): global_block_rsv: size 536870912 reserved 536870912
> BTRFS info (device dm-1): trans_block_rsv: size 0 reserved 0
> BTRFS info (device dm-1): chunk_block_rsv: size 0 reserved 0
> BTRFS info (device dm-1): delayed_block_rsv: size 20185088 reserved 20185088
> BTRFS info (device dm-1): delayed_refs_rsv: size 868745216 reserved 868745216
> 
> and current last is
> 
> BTRFS info (device dm-1): space_info 4 has 0 free, is not full
> BTRFS info (device dm-1): space_info total=10737418240,
> used=9664561152, pinned=458752, reserved=2523136, may_use=1069809664,
> readonly=65536
> BTRFS info (device dm-1): global_block_rsv: size 536870912 reserved 536821760
> BTRFS info (device dm-1): trans_block_rsv: size 0 reserved 0
> BTRFS info (device dm-1): chunk_block_rsv: size 0 reserved 0
> BTRFS info (device dm-1): delayed_block_rsv: size 0 reserved 0
> BTRFS info (device dm-1): delayed_refs_rsv: size 556531712 reserved 498647040
> 
> with lots more in between and no other Btrfs kernel printing preceding
> it since mounting.
> 
> It's the first time I'm seeing it but I have always mounted with
> enospc_debug since it always seems too late to add the option when you
> finally need it.
> 
> What does it mean? Should I be worried? What to do? No apparent problems yet.

enospc_debug is known to be noisy, in many cases it prints the state of
the space management structures at some interesting points. It's
possible that the amount of the information dumped changes over time,
there have been updates to the space reservations and related code.

I've checked the message levels of the messages printed under the
option, some of them are 'debug' some of them are 'info'. I would be
possible to make them all 'debug' so they are in the log but you can
filter them out on console.


HOW ARE YOU

2019-10-21 Thread Nayef Abu Sakran
Do you receive the mail i send to you?


Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes

2019-10-21 Thread David Sterba
On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> > > + nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > 
> > nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)
> 
> disk_num_bytes is a u64, so that would expand to a 64-bit division. The
> compiler is probably smart enough to optimize it to a shift, but I
> didn't want to rely on that, because that would cause build failures on
> 32-bit.

There are several DIV_ROUND_UP(u64, PAGE_SIZE) in btrfs code, no build
brekages have been reported so far, you can use it.


Re: [PATCH -next] btrfs: Make init_tree_roots static

2019-10-21 Thread David Sterba
On Fri, Oct 18, 2019 at 03:09:58PM +0300, Nikolay Borisov wrote:
> 
> 
> On 18.10.19 г. 15:06 ч., YueHaibing wrote:
> > Fix sparse warning:
> > 
> > fs/btrfs/disk-io.c:2534:12: warning:
> >  symbol 'init_tree_roots' was not declared. Should it be static?
> > 
> > Reported-by: Hulk Robot 
> > Signed-off-by: YueHaibing 
> 
> Huhz, I thought I had added static... Anyway this could be folded in the
> original patch. Thanks for the report.

Folded.


Re: [PATCH 1/4] btrfs: reduce indentation in lock_stripe_add

2019-10-21 Thread David Sterba
On Fri, Oct 18, 2019 at 11:58:20AM +0200, Johannes Thumshirn wrote:
> - /* can we steal this cached rbio's pages? */

> + /* can we steal this cached rbio's pages? */

> + /* no merging, put us on the tail of the plug list, our rbio
> +  * will be started with the currently running rbio unlocks
> +  */

In patches that touch comments it's allowed if not encouraged to
reformat the comments to the preferred style. Ie. capital first letter,
aligned to 80, and not the ugl^Wnet code format. I'm fixing that in many
other patches anyway, no need to resend.


Re: first it froze, now the (btrfs) root fs won't mount ...

2019-10-21 Thread Qu Wenruo


On 2019/10/21 下午9:02, Christian Pernegger wrote:
> [Please CC me, I'm not on the list.]
> 
> Am Mo., 21. Okt. 2019 um 13:47 Uhr schrieb Austin S. Hemmelgarn
> :
>> I've [worked with fs clones] like this dozens of times on single-device 
>> volumes with exactly zero issues.
> 
> Thank you, I have taken precautions, but it does seem to work fine.
> 
>> There are actually two possible ways I can think of a buggy GPU driver 
>> causing this type of issue: [snip]
> 
> Interesting and plausible, but ...
> 
>> Your best option for mitigation [...] is to ensure that your hardware has an 
>> IOMMU [...] and ensure it's enabled in firmware.
> 
> It has and it is. (The machine's been specced so GPU pass-through is
> an option, should it be required. I haven't gotten around to setting
> that up yet, haven't even gotten a second GPU, but I have laid the
> groundwork, the IOMMU is enabled and, as far as one can tell from logs
> and such, working.)
> 
>> However, there's also the possibility that you may have hardware issues.
> 
> Don't I know it ... The problem is, if there are hardware issues,
> that's the first I've seen of them, and while I didn't run torture
> tests, there was quite a lot of benchmarking when it was new. Needle
> in a haystack. Some memory testing can't hurt, I suppose. Any other
> ideas (for hardware testing)?
> 
> Back on the topic of TRIM: I'm 99 % certain discard wasn't set on the
> mount (not by me, in any case), but I think Mint runs fstrim
> periodically by default.

Oh, that explains why only one root (the current generation one) is not
all zero.

Then it should be a false alert, just fstrim wiped some old tree blocks.
But maybe it's some unfortunate race, that fstrim trimmed some tree
blocks still in use.

This means, it's a bug of btrfs.
However I can't find a bug fix during v5.0..v5.3 related to trim.
(Only some v5.2 regression and its fixes)

So it may be a hidden bug.

> Just to be sure, should any form of TRIM be
> disabled?

Not exactly.

There are some thing that is completely safe to trim, the unallocated space.
And there may be something tricky to trim, tree blocks. (the bug you hit)

One good compromise is, only trim unallocated space.

Then you need to pass something parameter for fstrim.
Like -o 0 -l 1M.

Newer kernel would try to trim block groups in that range, and modern
btrfs has no block groups in that range, then fstrim will go trim all
unallocated space them.
So with that options, fstrim should be safe.


BTW, as you can found already, trimmed blocks can make recovery
trickier, as old tree blocks are just trimmed, no way to rely on trimmed
data.

> The only other idea I've got is Timeshift's hourly snapshots. (How)
> would btrfs deal with a crash during snapshot creation?

In theory, btrfs has transaction and tree block CoW to take care of
everything. So no matter when the crash happens, it should be safe.
But in real world, you know life is always hard...

> 
> 
> In other news, I've still not quite given up, mainly because the fs
> doesn't look all that broken. The output of btrfs inspect-internal
> dump-tree (incl. options), for instance, looks like gibberish to me of
> course, but it looks sane, doesn't spew warnings, doesn't error out or
> crash. Also plain btrfs check --init-extent-tree errored out, same
> with -s0, but with -s1 it's now chugging along. (BTW, is there a
> hierarchy among the super block slots, a best or newest one?)

As your corruption is only in extent tree.
With my patchset, you should be able to mount it, so it's not that
screwed up.

But extent tree update is really somehow trickier than I thought.

Thanks,
Qu

> 
> Will keep you posted.
> 
> Cheers,
> C.
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 0/4] Small coding style cleanups

2019-10-21 Thread David Sterba
On Fri, Oct 18, 2019 at 11:58:19AM +0200, Johannes Thumshirn wrote:
> Here are some minor coding style cleanups which I think are neat as they make
> some functions a bit easier to read.
> 
> None of these patches is really needed though.
> 
> The patches have no regressions with regrads to the basline
> btrfs-devel/misc-next on an xfstests -g auto run.
> 
> A gitweb preview can be found at
> https://git.kernel.org/pub/scm/linux/kernel/git/jth/linux.git/log/?h=btrfs-raid-cleanups
> 
> Note I did rebase the branch because one patch did not have a changelog.
> 
> Johannes Thumshirn (4):
>   btrfs: reduce indentation in lock_stripe_add
>   btrfs: remove pointless local variable in lock_stripe_add()
>   btrfs: reduce indentation in btrfs_may_alloc_data_chunk
>   btrfs: remove pointless indentation in btrfs_read_sys_array()

Added to misc-next, thanks.


Re: [PATCH v3] btrfs-progs: add verbose option to btrfs device scan

2019-10-21 Thread David Sterba
On Tue, Oct 15, 2019 at 11:29:34AM +0800, Anand Jain wrote:
>   I was thinking there might be some common code between the
>   sub-commands in btrfs-progs now or in future, and if the printf()
>   due to verbose is required in one sub-command and the same printf()
>   due to verbose is not required in another sub-command (which I
>   called unwanted message) then we won't have any choice to not
>   to print those unwanted printf().
>   But as this is just an anticipatory only, so probably we could try
>   global verbose and see how it fares. I will try.

I see, but it would be better to have a concrete example where it's
problematic so we can figure out ways how to filter unwanted messages.


Re: [PATCH v3] tools/lib/traceevent, perf tools: Handle %pU format correctly

2019-10-21 Thread Steven Rostedt
On Mon, 21 Oct 2019 17:47:30 +0800
Qu Wenruo  wrote:

> +static void print_uuid_arg(struct trace_seq *s, void *data, int size,
> +struct tep_event *event, struct tep_print_arg *arg)
> +{
> + unsigned char *buf;
> + int i;
> +
> + if (arg->type != TEP_PRINT_FIELD) {
> + trace_seq_printf(s, "ARG TYPE NOT FIELID but %d", arg->type);
> + return;
> + }
> +
> + if (!arg->field.field) {
> + arg->field.field = tep_find_any_field(event, arg->field.name);
> + if (!arg->field.field) {
> + do_warning("%s: field %s not found",
> +__func__, arg->field.name);
> + return;
> + }
> + }
> + if (arg->field.field->size < 16) {
> + trace_seq_printf(s, "INVALID UUID: size have %u expect 16",
> + arg->field.field->size);
> + return;
> + }
> + buf = data + arg->field.field->offset;
> +
> + for (i = 0; i < 8; i++) {
> + trace_seq_printf(s, "%02x", buf[2 * i]);
> + trace_seq_printf(s, "%02x", buf[2 * i + 1]);
> + if (1 <= i && i <= 4)

I'm fine with this patch except for one nit. The above is hard to read
(in my opinion), and I absolutely hate the "constant" compare to
"variable" notation. Please change the above to:

if (i >= 1 && i <= 4)

Thanks,

-- Steve

> + trace_seq_putc(s, '-');
> + }
> +}
> +


Re: first it froze, now the (btrfs) root fs won't mount ...

2019-10-21 Thread Austin S. Hemmelgarn

On 2019-10-21 09:02, Christian Pernegger wrote:

[Please CC me, I'm not on the list.]

Am Mo., 21. Okt. 2019 um 13:47 Uhr schrieb Austin S. Hemmelgarn
:

I've [worked with fs clones] like this dozens of times on single-device volumes 
with exactly zero issues.


Thank you, I have taken precautions, but it does seem to work fine.


There are actually two possible ways I can think of a buggy GPU driver causing 
this type of issue: [snip]


Interesting and plausible, but ...


Your best option for mitigation [...] is to ensure that your hardware has an 
IOMMU [...] and ensure it's enabled in firmware.


It has and it is. (The machine's been specced so GPU pass-through is
an option, should it be required. I haven't gotten around to setting
that up yet, haven't even gotten a second GPU, but I have laid the
groundwork, the IOMMU is enabled and, as far as one can tell from logs
and such, working.)


However, there's also the possibility that you may have hardware issues.


Don't I know it ... The problem is, if there are hardware issues,
that's the first I've seen of them, and while I didn't run torture
tests, there was quite a lot of benchmarking when it was new. Needle
in a haystack. Some memory testing can't hurt, I suppose. Any other
ideas (for hardware testing)?
The power supply would be the other big one I'd suggest testing, as a 
bad PSU can cause all kinds of odd intermittent issues. Just like with 
RAM, you can't really easily cover everything, but you can check some 
things that have very low false negative rates when indicating problems.


Typical procedure I use is:

1. Completely disconnect the PSU from _everything_ inside the computer. 
(If you're really paranoid, you can completely remove the PSU from the 
case too, though that won't really make the testing more reliable or safer).
2. Make sure the PSU itself is plugged in to mains power, with the 
switch on the back (if it has one) turned on.
3. Connect a good multimeter to the 24-pin main power connector, with 
the positive probe on pin 8 and the negative probe on pin 7, set to 
measure DC voltages in the double-digit range with the highest precision 
possible.
4. Short pins 15 and 16 of the 24-pin main power connector using a short 
piece of solid copper wire. At this point, if the PSU has a fan, the fan 
should turn on. The multimeter should read +5 volts within half a second 
or less.
5. Check voltages of each of the power rails relative to ground. Make 
sure and check each one for a couple of seconds to watch for any 
fluctuations, and make a point to check _each_ set of wires coming off 
of the PSU separately (as well as checking each wire in each connector 
independently, even if they're supposed to be tied together internally).
6. Check the =5V standby power by hooking up the multimeter to that and 
a ground pin, then disconnecting the copper wire mentioned in step 3. 
It should maintain it's voltage while you're disconnecting the wire and 
afterwards, even once the fan stops.


You can find the respective pinouts online in many places (for example, 
[1]).  Tolerances are +/- 5% on everything except the negative voltages 
which are +/- 10%. The -5V pin may show nothing, which is normal (modern 
systems do not use -5V for anything, and actually most don't use -12V 
anymore either, though that's still provided). This won't confirm that 
the PSU isn't suspect (it could still have issues under load), but if 
any of this testing fails, you can be 100% certain you have either a bad 
PSU, or that your mains power is suspect (usually the issue there is 
very high line noise, though you'll need special equipment to test for 
that).


Back on the topic of TRIM: I'm 99 % certain discard wasn't set on the
mount (not by me, in any case), but I think Mint runs fstrim
periodically by default. Just to be sure, should any form of TRIM be
disabled?
The issue with TRIM is that it drops old copies of the on-disk data 
structures used by BTRFS, which can make recovery more difficult in the 
event of a crash. Running `fstrim` at regular intervals is not as much 
of an issue as inline discard, but still drops the old trees, so there's 
a window of time right after it gets run when you are more vulnerable.


Additionally, some SSD's have had issues with TRIM causing data 
corruption elsewhere on the disk, but it's been years since I've seen a 
report of such issues, and I don't think a Samsung device as recent as 
yours is likely to have such problems.



The only other idea I've got is Timeshift's hourly snapshots. (How)
would btrfs deal with a crash during snapshot creation?
It should have no issues whatsoever most of the time.  The only case I 
can think of where it might is if you're snapshotting a subvolume that's 
being written to at the same time. Snapshots on BTRFS are only truly 
atomic if none of the data being snapshotted is being written to at the 
same time. If there are pending writes, there are some indeterminate 
states involved, and crashing the

Re: [PATCH v3] tools/lib/traceevent, perf tools: Handle %pU format correctly

2019-10-21 Thread Qu Wenruo


On 2019/10/21 下午9:56, Steven Rostedt wrote:
> On Mon, 21 Oct 2019 17:47:30 +0800
> Qu Wenruo  wrote:
> 
>> +static void print_uuid_arg(struct trace_seq *s, void *data, int size,
>> +   struct tep_event *event, struct tep_print_arg *arg)
>> +{
>> +unsigned char *buf;
>> +int i;
>> +
>> +if (arg->type != TEP_PRINT_FIELD) {
>> +trace_seq_printf(s, "ARG TYPE NOT FIELID but %d", arg->type);
>> +return;
>> +}
>> +
>> +if (!arg->field.field) {
>> +arg->field.field = tep_find_any_field(event, arg->field.name);
>> +if (!arg->field.field) {
>> +do_warning("%s: field %s not found",
>> +   __func__, arg->field.name);
>> +return;
>> +}
>> +}
>> +if (arg->field.field->size < 16) {
>> +trace_seq_printf(s, "INVALID UUID: size have %u expect 16",
>> +arg->field.field->size);
>> +return;
>> +}
>> +buf = data + arg->field.field->offset;
>> +
>> +for (i = 0; i < 8; i++) {
>> +trace_seq_printf(s, "%02x", buf[2 * i]);
>> +trace_seq_printf(s, "%02x", buf[2 * i + 1]);
>> +if (1 <= i && i <= 4)
> 
> I'm fine with this patch except for one nit. The above is hard to read
> (in my opinion), and I absolutely hate the "constant" compare to
> "variable" notation. Please change the above to:
> 
>   if (i >= 1 && i <= 4)

Isn't this ( 1 <= i && i <= 4 ) easier to find out the lower and upper
boundary? only two numbers, both at the end of the expression.

I feel that ( i >= 1 && i <= 4 ) easier to write, but takes me extra
half second to read, thus I changed to the current one.

Thanks,
Qu

> 
> Thanks,
> 
> -- Steve
> 
>> +trace_seq_putc(s, '-');
>> +}
>> +}
>> +



signature.asc
Description: OpenPGP digital signature


Re: [PATCH v3] tools/lib/traceevent, perf tools: Handle %pU format correctly

2019-10-21 Thread Steven Rostedt
On Mon, 21 Oct 2019 22:03:21 +0800
Qu Wenruo  wrote:

> On 2019/10/21 下午9:56, Steven Rostedt wrote:
> > On Mon, 21 Oct 2019 17:47:30 +0800
> > Qu Wenruo  wrote:
> >   
> >> +static void print_uuid_arg(struct trace_seq *s, void *data, int size,
> >> + struct tep_event *event, struct tep_print_arg *arg)
> >> +{
> >> +  unsigned char *buf;
> >> +  int i;
> >> +
> >> +  if (arg->type != TEP_PRINT_FIELD) {
> >> +  trace_seq_printf(s, "ARG TYPE NOT FIELID but %d", arg->type);
> >> +  return;
> >> +  }
> >> +
> >> +  if (!arg->field.field) {
> >> +  arg->field.field = tep_find_any_field(event, arg->field.name);
> >> +  if (!arg->field.field) {
> >> +  do_warning("%s: field %s not found",
> >> + __func__, arg->field.name);
> >> +  return;
> >> +  }
> >> +  }
> >> +  if (arg->field.field->size < 16) {
> >> +  trace_seq_printf(s, "INVALID UUID: size have %u expect 16",
> >> +  arg->field.field->size);
> >> +  return;
> >> +  }
> >> +  buf = data + arg->field.field->offset;
> >> +
> >> +  for (i = 0; i < 8; i++) {
> >> +  trace_seq_printf(s, "%02x", buf[2 * i]);
> >> +  trace_seq_printf(s, "%02x", buf[2 * i + 1]);
> >> +  if (1 <= i && i <= 4)  
> > 
> > I'm fine with this patch except for one nit. The above is hard to read
> > (in my opinion), and I absolutely hate the "constant" compare to
> > "variable" notation. Please change the above to:
> > 
> > if (i >= 1 && i <= 4)  
> 
> Isn't this ( 1 <= i && i <= 4 ) easier to find out the lower and upper
> boundary? only two numbers, both at the end of the expression.

I don't read it like that.

> 
> I feel that ( i >= 1 && i <= 4 ) easier to write, but takes me extra
> half second to read, thus I changed to the current one.

How do you read it in English?

  "If one is less than or equal to i and i is less than or equal to
  four."

Or

  "If i is greater than or equal to one and i is less than or equal to
   four."

?

I read it the second way, and I believe most English speakers read it
that way too.

It took me a minute or two to understand the original method, because
my mind likes to take a variable and keep it on the same side of the
comparison, and the variable should always be first.

-- Steve




Re: [PATCH v3] tools/lib/traceevent, perf tools: Handle %pU format correctly

2019-10-21 Thread Qu Wenruo


On 2019/10/21 下午10:24, Steven Rostedt wrote:
> On Mon, 21 Oct 2019 22:03:21 +0800
> Qu Wenruo  wrote:
> 
>> On 2019/10/21 下午9:56, Steven Rostedt wrote:
>>> On Mon, 21 Oct 2019 17:47:30 +0800
>>> Qu Wenruo  wrote:
>>>   
 +static void print_uuid_arg(struct trace_seq *s, void *data, int size,
 + struct tep_event *event, struct tep_print_arg *arg)
 +{
 +  unsigned char *buf;
 +  int i;
 +
 +  if (arg->type != TEP_PRINT_FIELD) {
 +  trace_seq_printf(s, "ARG TYPE NOT FIELID but %d", arg->type);
 +  return;
 +  }
 +
 +  if (!arg->field.field) {
 +  arg->field.field = tep_find_any_field(event, arg->field.name);
 +  if (!arg->field.field) {
 +  do_warning("%s: field %s not found",
 + __func__, arg->field.name);
 +  return;
 +  }
 +  }
 +  if (arg->field.field->size < 16) {
 +  trace_seq_printf(s, "INVALID UUID: size have %u expect 16",
 +  arg->field.field->size);
 +  return;
 +  }
 +  buf = data + arg->field.field->offset;
 +
 +  for (i = 0; i < 8; i++) {
 +  trace_seq_printf(s, "%02x", buf[2 * i]);
 +  trace_seq_printf(s, "%02x", buf[2 * i + 1]);
 +  if (1 <= i && i <= 4)  
>>>
>>> I'm fine with this patch except for one nit. The above is hard to read
>>> (in my opinion), and I absolutely hate the "constant" compare to
>>> "variable" notation. Please change the above to:
>>>
>>> if (i >= 1 && i <= 4)  
>>
>> Isn't this ( 1 <= i && i <= 4 ) easier to find out the lower and upper
>> boundary? only two numbers, both at the end of the expression.
> 
> I don't read it like that.
> 
>>
>> I feel that ( i >= 1 && i <= 4 ) easier to write, but takes me extra
>> half second to read, thus I changed to the current one.
> 
> How do you read it in English?

How about mathematics interval?

i in [1, 4].

It looks way easier and simpler no matter what language you speak.

Thanks,
Qu
> 
>   "If one is less than or equal to i and i is less than or equal to
>   four."
> 
> Or
> 
>   "If i is greater than or equal to one and i is less than or equal to
>four."
> 
> ?
> 
> I read it the second way, and I believe most English speakers read it
> that way too.
> 
> It took me a minute or two to understand the original method, because
> my mind likes to take a variable and keep it on the same side of the
> comparison, and the variable should always be first.
> 
> -- Steve
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: "BUG: kernel NULL pointer dereference," when unmounting filesystem hitted by enospc error

2019-10-21 Thread Johannes Thumshirn
On 21/10/2019 11:17, Johannes Thumshirn wrote:
[...]
>> -
>> $ cat run-btrfs-test
>> modprobe -iv zram num_devices=8
>> udevadm trigger
>> sync
>> zramctl /dev/zram0 -s 8G && \
>> zramctl /dev/zram1 -s 8G && \
>> zramctl /dev/zram2 -s 4G && \
>> zramctl /dev/zram3 -s 4G && \
>> zramctl /dev/zram4 -s 4G && \
>> zramctl /dev/zram5 -s 2G && \
>> zramctl /dev/zram6 -s 2G && \
>> zramctl /dev/zram7 -s 4G && \
>> mkfs.btrfs /dev/zram0 && \
>> mkdir -p /mnt/btrfs-test && \
>> mount /dev/zram0 /mnt/btrfs-test && \
>> echo "FS Mounted" && \
>> btrfs dev add /dev/zram1 /mnt/btrfs-test && \
>> echo "Devices added" && \
>> for int in {1..500} ; do dd if=/dev/zero of=/mnt/btrfs-test/file${int}
>> bs=32M count=1 && sync ; done
>> btrfs dev add /dev/zram[2-7] /mnt/btrfs-test && \
>> btrfs fi sh /mnt/btrfs-test && \
>> btrfs fi df /mnt/btrfs-test && \
>> btrfs bal star -mconvert=raid6 /mnt/btrfs-test && \
>> btrfs bal star -dconvert=raid6 /mnt/btrfs-test
>> btrfs fi sh /mnt/btrfs-test && \
>> btrfs fi df /mnt/btrfs-test

I'm sorry. I ran this script in a loop for 35 iterations on 5.3.6 and
couldn't reproduce a single crash.


Is there anything else needed?
-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5
90409 Nürnberg
Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PULL REQUEST] btrfs-progs: For next merge window

2019-10-21 Thread David Sterba
On Fri, Oct 11, 2019 at 10:07:16AM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/10/11 上午12:17, David Sterba wrote:
> > On Thu, Oct 10, 2019 at 04:59:32PM +0800, Qu Wenruo wrote:
> >> This patchset can be fetched from github:
> >> https://github.com/adam900710/btrfs-progs/tree/for_next
> >> Which is based on devel branch, with the following HEAD:
> > 
> > Thanks for putting the branch together, I've been too busy with kernel
> > work so progs are lagging behind constantly.
> > 
> > The current devel branch is almost ready for a release, some patches
> > will be moved to 5.4 but I'm planning to do release by tomorrow. I can
> > pull your branch as-is to 5.4 so we have time to fix any problems.
> 
> No problem at all.
> Since I don't expect this pull get merged in the next release either.

I had to rebase the branch, all patches from you (image, check) are now
in devel. I did some fixups (error messages, coding style, changelogs)
on the way.

Next will be the patchset from Omar. The fix from Johanness regarding
zstd needs to be reworked so I'll drop it for now.


Re: [PATCH v2 1/2] btrfs-progs: warn users about the possible dangers of check --repair

2019-10-21 Thread David Sterba
On Fri, Oct 18, 2019 at 01:16:03PM +0200, Johannes Thumshirn wrote:
> The manual page of btrfsck clearly states 'btrfs check --repair' is a
> dangerous operation.
> 
> Although this warning is in place users do not read the manual page and/or
> are used to the behaviour of fsck utilities which repair the filesystem,
> and thus potentially cause harm.
> 
> Similar to 'btrfs balance' without any filters, add a warning and a
> countdown, so users can bail out before eventual corrupting the filesystem
> more than it already is.
> 
> Signed-off-by: Johannes Thumshirn 
> 
> ---
> Changes to v1:
> - Fix grammar mistakes in warning message
> - Skip delay with --force

--force was added for a different reason, to allow check on a mounted
filesystem. I don't think that combining --repair and --force just to
allow repair is a good idea. There's a 'dangerous repair' mode for eg.
xfs that allows to do live surgery on a mounted filesytem (followed by
immediate reboot). We want to be able to do that eventually.

I understand where the motivation comes from, let me have a second
thought on that.


Re: [PATCH v3 0/7] btrfs-progs: Support for BG_TREE feature

2019-10-21 Thread David Sterba
On Thu, Oct 10, 2019 at 02:41:49PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from github:
> https://github.com/adam900710/btrfs-progs/tree/bg_tree
> Which is based on v5.2.2 tag.
> 
> This patchset provides the needed user space infrastructure for BG_TREE
> feature.
> 
> Since it's an new incompatible feature, unlike SKINNY_METADATA, btrfs-progs
> is needed to convert existing fs (unmounted) to new format.
> 
> Now btrfstune can convert regular extent tree fs to bg tree fs to
> improve mount time.
> 
> For the performance improvement, please check the kernel patchset cover
> letter or the last patch.
> (SPOILER ALERT: It's super fast)
> 
> Changelog:
> v2:
> - Rebase to v5.2.2 tag
> - Add btrfstune ability to convert existing fs to BG_TREE feature
> 
> v3:
> - Fix a bug that temp chunks are not cleaned up properly
>   This is caused by wrong timing btrfs_convert_to_bg_tree() is called.
>   It should be called after temp chunks cleaned up.
> 
> - Fix a bug that an extent buffer get leaked
>   This is caused by newly created bg tree not added to dirty list.
> 
> Qu Wenruo (7):
>   btrfs-progs: Refactor excluded extent functions to use fs_info
>   btrfs-progs: Refactor btrfs_read_block_groups()

I'll add 1 and 2 to devel as they're independent. Thanks.


Re: MD RAID 5/6 vs BTRFS RAID 5/6

2019-10-21 Thread Edmund Urbani
On 10/16/19 9:42 PM, Zygo Blaxell wrote:
>
> For raid5 I'd choose btrfs -draid5 -mraid1 over mdadm raid5
> sometimes--even with the write hole, I'd expect smaller average data
> losses than mdadm raid5 + write hole mitigation due to the way disk
> failure modes are distributed.  

What about the write hole and RAID-1? I understand the write hole is most
commonly associated with RAID-5, but it is also a problem with other RAID 
levels.

You would need to scrub after a power failure to be sure that no meta data gets
corrupted even with RAID-1. Otherwise you might still have inconsistent copies
and the problem may only become apparent when one drive fails. Can we be certain
that scrubbing is always able to fix such inconsistencies with RAID-1?

Regards,
 Edmund


-- 
*Liland IT GmbH*


Ferlach ● Wien ● München
Tel: +43 463 220111
Tel: +49 89 
458 15 940
off...@liland.com
https://Liland.com  



Copyright © 2019 Liland IT GmbH 

Diese Mail enthaelt vertrauliche und/oder 
rechtlich geschuetzte Informationen. 
Wenn Sie nicht der richtige Adressat 
sind oder diese Email irrtuemlich erhalten haben, informieren Sie bitte 
sofort den Absender und vernichten Sie diese Mail. Das unerlaubte Kopieren 
sowie die unbefugte Weitergabe dieser Mail ist nicht gestattet. 

This 
email may contain confidential and/or privileged information. 
If you are 
not the intended recipient (or have received this email in error) please 
notify the sender immediately and destroy this email. Any unauthorised 
copying, disclosure or distribution of the material in this email is 
strictly forbidden.



Re: [PATCH v3] tools/lib/traceevent, perf tools: Handle %pU format correctly

2019-10-21 Thread Steven Rostedt
On Mon, 21 Oct 2019 22:30:37 +0800
Qu Wenruo  wrote:

> > How do you read it in English?  
> 
> How about mathematics interval?
> 
> i in [1, 4].
> 
> It looks way easier and simpler no matter what language you speak.

But C doesn't accept that syntax ;-)

-- Steve


Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature

2019-10-21 Thread David Sterba
On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
> That's wonderful.
> Although I guess my patchset should provide the hint of where to modify
> the code, since there are only a limited number of places we modify
> block group item.

I indeed started at your patchset and grepped fro BG_TREE, adding
another branch.

> > We agree on the point that the block group items must be packed. The key
> > approach should move the new BGI to the beginning, ie. key type is
> > smaller than anything that appears in the extent tree. I chose 100 for
> > the prototype, it could change.
> > 
> > To keep changes to minimum, the new BGI uses the same block group item,
> > so the only difference then becomes how we search for the items.
> 
> If we're introducing new block group item, I hope to do a minor change.
> 
> Remove the chunk_objectid member, or even flags. to make it more
> compact. So that you can make the BGI subtree even smaller.

Yeah that can be done.

> But I guess since you don't want to modify the BGI structure, and keep
> the code modification minimal, it may not be a good idea right now.

As long as the changes are bearable, eg. a minor refactoring of block
group access (the cache.key serving a as offset and length) is fine. And
if we can make the b-tree item more then let's do it.

> > Packing of the items is done by swapping the key objectid and offset.
> > 
> > Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
> > the objectid is the thing that scatters the items all over the tree.
> > 
> > So the new BGI has bg.length == key.objectid and bg.start == key.offset.
> > As most of block groups are of same size, or from a small set, they're
> > packed.
> 
> That doesn't look optimized enough.
> 
> bg.length can be at 1G, that means if extents starts below 1G can still
> be before BGIs.

This shold not be a big problem, the items are still grouped togethers.
Mkfs does 8M, we can have 256M or 1G. On average there could be several
packed groups, which I think is fine and the estimated overhead would be
a few more seeks.

> I believe we should have a fixed objectid for this new BGIs, so that
> they are ensured to be at the beginning of extent tree.

That was my idea too, but that also requires to add one more member to
the item to track the length. Currently the key is saves the bytes. With
the proposed changes to drop chunk_objectid, the overall size of BGI
would not increase so this still sounds ok. And all the problems with
searching would go away.

> > The nice thing is that a lot of code can be shared between BGI and new
> > BGI, just needs some care with searches, inserts and search key
> > advances.
> 
> Exactly, but since we're introducing a new key type, I prefer to perfect
> it. Not only change the key, but also the block group item structure to
> make it more compact.
> 
> Although from the design aspect, it looks like BG tree along with new
> BGI would be the best design.
> 
> New BG key goes (bg start, NEW BGI TYPE, used) no data. It would provide
> the most compact on-disk format.

That's very compact. Given the 'bg start' we can't use the same for the
extent tree item.

> > This would be easy with the bg_tree, because we'd know that all items in
> > the tree are just the block group items. Some sort of enumeration could
> > work for bg_key too, but I don't have something solid.
> 
> Why not fixed objectid for BGI and just ignore the bg.len part?

I wanted to avoid storing a useless number, it costs 8 bytes per item,
and the simple swap of objectid/offset was first idea how to avoid it.

> We have chunk<->BGI verification code, no bg.len is not a problem at
> all, we can still make sure chunk<->bg is 1:1 mapped and still verify
> the bg.used.

This is all great, sure, and makes the extensions easy. Let's try to
work out best for each approach we have so far. Having a separate tree
for block groups may open some future options.


Re: [PATCH v3 00/10] btrfs-progs: image: Enhancement with new data dump feature

2019-10-21 Thread David Sterba
On Wed, Sep 25, 2019 at 08:13:46PM +0800, Qu Wenruo wrote:
> This patchset includes the following features:
> 
> - Small fixes to improve btrfs-image error report
>   * Output error message for chunk tree build error
>   * Fix error output to show correct return value
>   Patch 1 and 2.
> 
> - Reduce memory usage when decompress super block
>   Independent change, for current btrfs-image, it will reduce buffer
>   size from 256K to fixed 4K.
>   Patch 3.
> 
> - Rework how we search chunk tree blocks
>   Instead of iterating clusters again and again for each chunk tree
>   block, record system chunk array and iterate clusters once for all
>   chunk tree blocks.
>   This should reduce restore time for large image dump.
>   Patch 4, 5 and 6.
> 
> - Introduce data dump feature to dump the whole fs.
>   This will introduce a new magic number to prevent old btrfs-image to
>   hit failure as the item size limit is enlarged.
>   Patch 7 and 8.
> 
> - Reduce memory usage for data dump restore
>   This is mostly due to the fact that we have much larger
>   max_pending_size introduced by data dump(256K -> 256M).
>   Using 4 * max_pending_size for each decompress thread as buffer is way
>   too expensive now.
>   Use proper inflate() to replace uncompress() calls.
>   Patch 9 and 10.
> 
> Changelog:
> v2:
> - New small fixes:
>   * Fix a confusing error message due to unpopulated errno
>   * Output error message for chunk tree build error
>   
> - Fix a regression of previous version
>   Patch "btrfs-progs: image: Rework how we search chunk tree blocks"
>   deleted a "ret = 0" line which could cause false early exit.
> 
> - Reduce memory usage for data dump
> 
> v2.1:
> - Rebased to devel branch
>   Removing 4 already merged patches from the patchset.
> 
> - Re-order the patchset
>   Put small and independent patches at the top of queue, and put the
>   data dump related feature at the end.
> 
> - Fix -Wmaybe-uninitialized warnings
>   Strangely, D=1 won't trigger these warnings thus they sneak into v2
>   without being detected.
> 
> - Fix FROM: line
>   Reverted to old smtp setup. The new setup will override FROM: line,
>   messing up the name of author.
> 
> v3:
> - Fix a wrong option in error string
> - Fix a bug that we always dump data extents
> 
> Qu Wenruo (10):
>   btrfs-progs: image: Output error message for chunk tree build error
>   btrfs-progs: image: Fix error output to show correct return value
>   btrfs-progs: image: Don't waste memory when we're just extracting
> super block
>   btrfs-progs: image: Allow restore to record system chunk ranges for
> later usage
>   btrfs-progs: image: Introduce helper to determine if a tree block is
> in the range of system chunks
>   btrfs-progs: image: Rework how we search chunk tree blocks
>   btrfs-progs: image: Introduce framework for more dump versions
>   btrfs-progs: image: Introduce -d option to dump data
>   btrfs-progs: image: Reduce memory requirement for decompression
>   btrfs-progs: image: Reduce memory usage for chunk tree search

Added to devel. Thanks.


Re: [PATCH 0/3] btrfs-progs: Add check and repair for invalid inode generation

2019-10-21 Thread Ferry Toth




Op 20-10-2019 om 16:24 schreef Ferry Toth:

Op 20-10-2019 om 16:11 schreef Qu Wenruo:



On 2019/10/20 下午9:29, Ferry Toth wrote:

Op 20-10-2019 om 15:15 schreef Qu WenRuo:



On 2019/10/20 下午9:04, Ferry Toth wrote:

Op 20-10-2019 om 02:51 schreef Qu Wenruo:



On 2019/10/20 上午8:26, Qu Wenruo wrote:



On 2019/10/20 上午12:24, Ferry Toth wrote:

Hi,

Op 19-10-2019 om 01:50 schreef Qu WenRuo:



On 2019/10/19 上午4:32, Ferry Toth wrote:

Op 24-09-2019 om 10:11 schreef Qu Wenruo:
We have at least two user reports about bad inode generation 
makes

kernel reject the fs.


May I add my report? I just upgraded Ubuntu from 19.04 -> 
19.10 so

kernel went from 5.0 -> 5.3 (but I was using 4.15 too).

Booting 5.3 leaves me in initramfs as I have /boot on @boot and /
on /@

In initramfs I can try to mount but get something like
btrfs critical corrupt leaf invalid inode generation open_ctree
failed

Booting old kernel works just as before, no errors.

According to the creation time, the inode is created by some 
2014

kernel.


How do I get the creation time?


# btrfs ins dump-tree -b  


I just went back to the office to reboot to 5.3 and check the
creation
times and found they were 2013 - 2014.



And the generation member of INODE_ITEM is not updated 
(unlike the

transid member) so the error persists until latest tree-checker
detects.

Even the situation can be fixed by reverting back to older 
kernel

and
copying the offending dir/file to another inode and delete the
offending
one, it still should be done by btrfs-progs.

How to find the offending dir/file from the command line 
manually?


# find  -inum 


This works, thanks.

But appears unpractical. After fix 2 files and reboot, I found 4
more,
then 16, then I gave up.


Another solution is use "find" to locate the files with creation time
before 2015, and copy them to a new file, then replace the old file
with
the new file.


Hmm. But how do I "find" by creation time (otime)? Do you have a
suggestion for this?


$ touch -t 20150101 /tmp/sample
$ find  -not -cnewer /tmp/sample


AFAIK this compares file modified date with status changed date. So, no
search for creation date.

And stat /tmp/sample (sorry dutch lang output):

ferry@ferry-quad:~$ stat /tmp/sample
   Bestand: /tmp/sample
   Grootte: 0    Blokken: 0    IO-blok: 4096   leeg
normaal bestand
Apparaat: 1bh/27d   Inode: 62005381 Koppelingen: 1
Toegang: (0664/-rw-rw-r--)   UID: ( 1001/   ferry)   GID: ( 1001/   
ferry)

Toegang:   2015-01-01 00:00:00.0 +0100
Gewijzigd: 2015-01-01 00:00:00.0 +0100
Veranderd: 2019-10-20 15:20:50.366163766 +0200
Ontstaan:  -


My bad, always got confused by o/a/c/mtime, as c really looks like *c*
reation, so I always got confused between ctime and otime.

Then considering not all fs supports otime, find doesn't support that.
I guess it's only possible by other tools


New stat will support it, but not in Ubuntu 19.10. We did find this:
https://github.com/torvalds/linux/blob/master/samples/vfs/test-statx.c

and are trying to work out a script that will gzip all files with 
creation < 2015. Then we can unzip again and overwrite.


So we combined find + test-statx + awk to produce a list of files with 
creation date 1970 that are then fed to tar. Finally we untar that to 
overwrite the original files.


I found the following:
- the problematic files are both ordinary and symlinks
- they all seem to have the same inode generation
- all had no creation date (so were set to jan 1970)
- not all files without creation date were problematic (only ~10%)
- modified dates were 2013/2014 afaikt

The procedure keeps the modified date, but sets creation date to today, 
creates a new inode and fixes the error.



BTW, did you find any patterns in those existing offending inodes?
I guess it would be faster than finding a tool supporting otime search.


I didn't see any logic. A mix of logs, cached files, journal files etc. 
+ a .kde directory in the /



Thanks,
Qu




If you want, you can add -exec to that find, but I'd only add that 
after

confirming the execution load is verified.

Thanks,
Qu




It would be much safer than btrfs check --repair.

Thanks,
Qu





Thanks,
Qu




This patchset adds such check and repair ability to btrfs-check,
with a
simple test image.

Qu Wenruo (3):
   btrfs-progs: check/lowmem: Add check and repair for 
invalid

inode
 generation
   btrfs-progs: check/original: Add check and repair for
invalid inode
 generation
   btrfs-progs: fsck-tests: Add test image for invalid inode
generation
 repair

  check/main.c  |  50
+++-
  check/mode-lowmem.c   |  76
++
  check/mode-original.h |   1 +
  .../.lowmem_repairable    |   0
  .../bad_inode_geneartion.img.xz   | Bin 0 
-> 2012

bytes
  5 files changed, 126 insertions(+), 1 dele

Re: [RFC PATCH 00/14] btrfs-progs: global-verbose option

2019-10-21 Thread David Sterba
On Mon, Oct 21, 2019 at 06:01:08PM +0800, Anand Jain wrote:
> This patch set brings --verbose option to the top level btrfs command,
> such as 'btrfs --verbose'. With this we don't have to add or remember
> verbose option at the sub-commands level.
> 
> As there are already verbose options to 11 sub-commands as listed
> below [1][2]. So the top level --verbose option here takes care to transpire
> verbose request from the top level to the sub-command level for 9 (not 11)
> sub-commands as in [1] as of now.
> 
> This patch is RFC still for the following two reasons (comments appreciated).
> 
> 1.
> The sub-commands as in [2] uses multi-level compile time verbose option,
> such as %g_verbose = 0 (quite), %g_verbose = 1 (default), %g_verbose > 1
> (real-verbose). And verbose at default is also part the .out files in
> fstests. So it needs further discussions on how to handle the multi-
> level verbose option using the global verbose option, and so sub-
> commands in [2] are untouched.

The idea is to unify all verbosity options. Default is 1, 0 is for quiet
(only errors are printed), the rest is up to the commands what to print
on the higher levels.

> 2.
> These patch has been unit-tested individually.
> These patches does not alter the verbose output.
> But it fixes the indentation in the command's help output, which may be
> used in fstests and btrfs-progs/tests and their verification is pending
> still, which I am planning to do it before v1.

The indentation does not need to be changed if the glboal options are
split from the per-command, like

---
usage: btrfs subvolume delete [options]  [...]

Delete subvolume(s)

Delete subvolumes from the filesystem. The corresponding directory
is removed instantly but the data blocks are removed later.
The deletion does not involve full commit by default due to
performance reasons (as a consequence, the subvolume may appear again
after a crash). Use one of the --commit options to wait until the
operation is safely stored on the media.

-c|--commit-after  wait for transaction commit at the end of the operation
-C|--commit-each   wait for transaction commit after deleting each subvolume

Global options:
-v|--verbose show verbose output
---

Some commands can have long option names or the argument names make it
long in some cases, the global options could stay indented. I think
visually it'll be ok. We can introduce some way to automatically format
the options and help texts so we don't have to adjust them manually each
time, but this would be more intrusive and can be done later.

With the global verbose option there shouldbe also -q|--quiet. Both
short and long versions should be available for all commands. So the
help would look like:

---
Global options:
-v|--verbose verbose output, repeat for more verbosity
-q|--quiet   print only errors
---

In code this looks like:

  "",
  "-c|--commit-after  wait for transaction commit at the end of the 
operation",
  "-C|--commit-each   wait for transaction commit after deleting each 
subvolume",
  HELPINFO_GLOBAL_OPTIONS_HEADER,
  HELPINFO_INSERT_VERBOSE,
  NULL

#define HELPINFO_GLOBAL_OPTIONS_HEADER  \
"", \
"Global options:"

and HELPINFO_INSERT_VERBOSE also contains the quiet option.

The global option value is stored in 'btrfs_config_init bconf', so
everything can access it directly.

Thanks for working on this, I'll have more comments on v2 as I probably
forgot a few more things to do, the above is the base for all further
changes.


[PATCH 0/2 + 0/2] Add BLAKE2 checksumming support

2019-10-21 Thread David Sterba
The two patches apply on top of Johaness' series adding SHA256. To
make it actually work the patch adding BLAKE2 to kernel is needed. You
can find it here (v5) https://patchwork.kernel.org/patch/11188109/ .

WARNING: This is for testing only!

All in one can be pulled from

  git://github.com/kdave/btrfs-devel preview/checksums-5

and the btrfs-progs patches are in the devel branch

  git://github.com/kdave/btrfs-progs devel

David Sterba (2):
  btrfs: add member for a specific checksum driver
  btrfs: add blake2b to checksumming algorithms

 fs/btrfs/ctree.c| 14 ++
 fs/btrfs/ctree.h|  1 +
 fs/btrfs/disk-io.c  |  7 ---
 fs/btrfs/super.c|  1 +
 include/uapi/linux/btrfs_tree.h |  1 +
 5 files changed, 21 insertions(+), 3 deletions(-)

-- 
2.23.0



[PATCH 2/2] btrfs: add blake2b to checksumming algorithms

2019-10-21 Thread David Sterba
Add blake2b (with 256 bit digest) to the list of possible checksumming
algorithms used by BTRFS.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c| 2 ++
 fs/btrfs/disk-io.c  | 1 +
 fs/btrfs/super.c| 1 +
 include/uapi/linux/btrfs_tree.h | 1 +
 4 files changed, 5 insertions(+)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c211216b4524..5b6e86aaf2e1 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -37,6 +37,8 @@ static const struct btrfs_csums {
[BTRFS_CSUM_TYPE_CRC32] = { .size = 4, .name = "crc32c" },
[BTRFS_CSUM_TYPE_XXHASH] = { .size = 8, .name = "xxhash64" },
[BTRFS_CSUM_TYPE_SHA256] = { .size = 32, .name = "sha256" },
+   [BTRFS_CSUM_TYPE_BLAKE2] = { .size = 32, .name = "blake2b",
+.driver = "blake2b-256" },
 };
 
 int btrfs_super_csum_size(const struct btrfs_super_block *s)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 948ed5ac0794..7c345c4bc817 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -354,6 +354,7 @@ static bool btrfs_supported_super_csum(u16 csum_type)
case BTRFS_CSUM_TYPE_CRC32:
case BTRFS_CSUM_TYPE_XXHASH:
case BTRFS_CSUM_TYPE_SHA256:
+   case BTRFS_CSUM_TYPE_BLAKE2:
return true;
default:
return false;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 36440336c08b..5c6796caebb5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2464,3 +2464,4 @@ MODULE_LICENSE("GPL");
 MODULE_SOFTDEP("pre: crc32c");
 MODULE_SOFTDEP("pre: xxhash64");
 MODULE_SOFTDEP("pre: sha256");
+MODULE_SOFTDEP("pre: blake2b");
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index e2823c9b2061..5160be1d7332 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -304,6 +304,7 @@ enum btrfs_csum_type {
BTRFS_CSUM_TYPE_CRC32   = 0,
BTRFS_CSUM_TYPE_XXHASH  = 1,
BTRFS_CSUM_TYPE_SHA256  = 2,
+   BTRFS_CSUM_TYPE_BLAKE2  = 3,
 };
 
 /*
-- 
2.23.0



[PATCH 1/2] btrfs: add member for a specific checksum driver

2019-10-21 Thread David Sterba
Currently all the checksum algorithms generate a fixed size digest size
and we use it.  The on-disk format can hold up to BTRFS_CSUM_SIZE bytes
and BLAKE2b produces digest of 512 bits by default. We can't do that and
will use the blake2b-256, this needs to be passed to the crypto API.

Separate that from the base algorithm name and add a member to request
specific driver, in this case with the digest size.

The only place that uses the driver name is the crypto API setup.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c   | 12 
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/disk-io.c |  6 +++---
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index c1561097e5a3..c211216b4524 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -32,6 +32,7 @@ static void del_ptr(struct btrfs_root *root, struct 
btrfs_path *path,
 static const struct btrfs_csums {
u16 size;
const char  *name;
+   const char  *driver;
 } btrfs_csums[] = {
[BTRFS_CSUM_TYPE_CRC32] = { .size = 4, .name = "crc32c" },
[BTRFS_CSUM_TYPE_XXHASH] = { .size = 8, .name = "xxhash64" },
@@ -53,6 +54,17 @@ const char *btrfs_super_csum_name(u16 csum_type)
return btrfs_csums[csum_type].name;
 }
 
+/*
+ * Return driver name if defined, otherwise the name that's also a valid driver
+ * name
+ */
+const char *btrfs_super_csum_driver(u16 csum_type)
+{
+   /* csum type is validated at mount time */
+   return btrfs_csums[csum_type].driver ?:
+   btrfs_csums[csum_type].name;
+}
+
 size_t __const btrfs_get_num_csums(void)
 {
return ARRAY_SIZE(btrfs_csums);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b17944995e09..4a842c0fa062 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2163,6 +2163,7 @@ BTRFS_SETGET_STACK_FUNCS(super_uuid_tree_generation, 
struct btrfs_super_block,
 
 int btrfs_super_csum_size(const struct btrfs_super_block *s);
 const char *btrfs_super_csum_name(u16 csum_type);
+const char *btrfs_super_csum_driver(u16 csum_type);
 size_t __const btrfs_get_num_csums(void);
 
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 21fab336f8ca..948ed5ac0794 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2219,13 +2219,13 @@ static int btrfs_init_workqueues(struct btrfs_fs_info 
*fs_info,
 static int btrfs_init_csum_hash(struct btrfs_fs_info *fs_info, u16 csum_type)
 {
struct crypto_shash *csum_shash;
-   const char *csum_name = btrfs_super_csum_name(csum_type);
+   const char *csum_driver = btrfs_super_csum_driver(csum_type);
 
-   csum_shash = crypto_alloc_shash(csum_name, 0, 0);
+   csum_shash = crypto_alloc_shash(csum_driver, 0, 0);
 
if (IS_ERR(csum_shash)) {
btrfs_err(fs_info, "error allocating %s hash for checksum",
- csum_name);
+ csum_driver);
return PTR_ERR(csum_shash);
}
 
-- 
2.23.0



[PATCH 3/4] btrfs-progs: add blake2b reference implementation

2019-10-21 Thread David Sterba
Upstream commit 997fa5ba1e14b52c554fb03ce39e579e6f27b90c,
git repository: git://github.com/BLAKE2/BLAKE2

The reference implemetation added in this patch is unchanged and will be
modified only to compile in current code base and with minimal other
modifications in case of future sync with upstream code. IOW, the coding
style should stay as-is and does not conform to the other btrfs-progs
code. This is an exception for xxhash and sha256 code as well.

Signed-off-by: David Sterba 
---
 Makefile |   3 +-
 crypto/blake2-impl.h | 160 ++
 crypto/blake2.h  | 195 ++
 crypto/blake2b-ref.c | 379 +++
 4 files changed, 736 insertions(+), 1 deletion(-)
 create mode 100644 crypto/blake2-impl.h
 create mode 100644 crypto/blake2.h
 create mode 100644 crypto/blake2b-ref.c

diff --git a/Makefile b/Makefile
index 9104764e974c..e9d01d9838d9 100644
--- a/Makefile
+++ b/Makefile
@@ -153,7 +153,8 @@ libbtrfs_objects = send-stream.o send-utils.o 
kernel-lib/rbtree.o btrfs-list.o \
   kernel-lib/radix-tree.o extent-cache.o extent_io.o \
   crypto/crc32c.o common/messages.o \
   uuid-tree.o utils-lib.o common/rbtree-utils.o \
-  crypto/hash.o crypto/xxhash.o crypto/sha224-256.o
+  crypto/hash.o crypto/xxhash.o crypto/sha224-256.o \
+  crypto/blake2b-ref.o
 libbtrfs_headers = send-stream.h send-utils.h send.h kernel-lib/rbtree.h 
btrfs-list.h \
   crypto/crc32c.h kernel-lib/list.h kerncompat.h \
   kernel-lib/radix-tree.h kernel-lib/sizes.h kernel-lib/raid56.h \
diff --git a/crypto/blake2-impl.h b/crypto/blake2-impl.h
new file mode 100644
index ..c1df82e0c95d
--- /dev/null
+++ b/crypto/blake2-impl.h
@@ -0,0 +1,160 @@
+/*
+   BLAKE2 reference source code package - reference C implementations
+
+   Copyright 2012, Samuel Neves .  You may use this under the
+   terms of the CC0, the OpenSSL Licence, or the Apache Public License 2.0, at
+   your option.  The terms of these licenses can be found at:
+
+   - CC0 1.0 Universal : http://creativecommons.org/publicdomain/zero/1.0
+   - OpenSSL license   : https://www.openssl.org/source/license.html
+   - Apache 2.0: http://www.apache.org/licenses/LICENSE-2.0
+
+   More information about the BLAKE2 hash function can be found at
+   https://blake2.net.
+*/
+#ifndef BLAKE2_IMPL_H
+#define BLAKE2_IMPL_H
+
+#include 
+#include 
+
+#if !defined(__cplusplus) && (!defined(__STDC_VERSION__) || __STDC_VERSION__ < 
199901L)
+  #if   defined(_MSC_VER)
+#define BLAKE2_INLINE __inline
+  #elif defined(__GNUC__)
+#define BLAKE2_INLINE __inline__
+  #else
+#define BLAKE2_INLINE
+  #endif
+#else
+  #define BLAKE2_INLINE inline
+#endif
+
+static BLAKE2_INLINE uint32_t load32( const void *src )
+{
+#if defined(NATIVE_LITTLE_ENDIAN)
+  uint32_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = ( const uint8_t * )src;
+  return (( uint32_t )( p[0] ) <<  0) |
+ (( uint32_t )( p[1] ) <<  8) |
+ (( uint32_t )( p[2] ) << 16) |
+ (( uint32_t )( p[3] ) << 24) ;
+#endif
+}
+
+static BLAKE2_INLINE uint64_t load64( const void *src )
+{
+#if defined(NATIVE_LITTLE_ENDIAN)
+  uint64_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = ( const uint8_t * )src;
+  return (( uint64_t )( p[0] ) <<  0) |
+ (( uint64_t )( p[1] ) <<  8) |
+ (( uint64_t )( p[2] ) << 16) |
+ (( uint64_t )( p[3] ) << 24) |
+ (( uint64_t )( p[4] ) << 32) |
+ (( uint64_t )( p[5] ) << 40) |
+ (( uint64_t )( p[6] ) << 48) |
+ (( uint64_t )( p[7] ) << 56) ;
+#endif
+}
+
+static BLAKE2_INLINE uint16_t load16( const void *src )
+{
+#if defined(NATIVE_LITTLE_ENDIAN)
+  uint16_t w;
+  memcpy(&w, src, sizeof w);
+  return w;
+#else
+  const uint8_t *p = ( const uint8_t * )src;
+  return ( uint16_t )((( uint32_t )( p[0] ) <<  0) |
+  (( uint32_t )( p[1] ) <<  8));
+#endif
+}
+
+static BLAKE2_INLINE void store16( void *dst, uint16_t w )
+{
+#if defined(NATIVE_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = ( uint8_t * )dst;
+  *p++ = ( uint8_t )w; w >>= 8;
+  *p++ = ( uint8_t )w;
+#endif
+}
+
+static BLAKE2_INLINE void store32( void *dst, uint32_t w )
+{
+#if defined(NATIVE_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = ( uint8_t * )dst;
+  p[0] = (uint8_t)(w >>  0);
+  p[1] = (uint8_t)(w >>  8);
+  p[2] = (uint8_t)(w >> 16);
+  p[3] = (uint8_t)(w >> 24);
+#endif
+}
+
+static BLAKE2_INLINE void store64( void *dst, uint64_t w )
+{
+#if defined(NATIVE_LITTLE_ENDIAN)
+  memcpy(dst, &w, sizeof w);
+#else
+  uint8_t *p = ( uint8_t * )dst;
+  p[0] = (uint8_t)(w >>  0);
+  p[1] = (uint8_t)(w >>  8);
+  p[2] = (uint8_t)(w >> 16);
+  p[3] = (uint8_t)(w >> 24);
+  p[4] = (uint8_t)(w >> 32);
+  p[5] = (uint8_t)(w >> 40);
+  p[6] = (uint8_t)(w >> 48)

More checksumming algorithms for btrfs

2019-10-21 Thread David Sterba
The work to add more checksums is nearly finished, we're now in a good state to
let interested users do some testing and benchmarking.

New hashes: xxhash64, sha256, blake2b-256

Quick start:

  git://github.com/kdave/btrfs-devel preview/checksums-5
  (build with CRYPTO_BLAKE2B=m)

  git://github.com/kdave/btrfs-progs devel

  $ mkfs.btrfs --csum blake2 /dev/sda
  $ mount /dev/sda /mnt

Warning: use only for testing!

The increased size of checksums is allocating more memory when the data are
being written so this can produce some warnings regarding size of the
allocation. This will be addressed later.


Selection results
~

fast hash: xxhash
- 64bit digest
- optimized for 64bit platforms, leveraging CPU pipelining

cryptographically strong hash 1: sha256
- 256bit digest
- FIPS certification

cryptographically strong hash 2: blake2
- blake2b with 256bit digest
- '2b' as it targets 64bit platforms


Microbenchmark
~~

  $ cd btrfs-progs.git
  $ make hash-speedtest
  $ ./hash-speedtest [iterations]

Block size: 4096
Iterations: 10

NULL-NOP: cycles: 53638797, c/i  536
 NULL-MEMCPY: cycles: 59547932, c/i  595
  CRC32C: cycles:179251924, c/i 1792
  XXHASH: cycles:137327470, c/i 1373
  SHA256: cycles:  10719756126, c/i   107197
 BLAKE2b: cycles:   2264316924, c/i22643


Compatibility
~

There's no new incompat bit, the checksum algorithm is detected at mount time
and unknown type will fail to mount.

The crypto modules implementing the digests must be either built-in or
loadable, lack of thereof will fail mount. The actual digest implementation is
up to the crypto API to choose. Check /sys/fs/UUID/features/checksum .


Target release
~~

The plan is to queue the patches for 5.5, the blake2b patches seem to be on a
good track so both shall be in the same release.


[PATCH 3/4] btrfs-progs: add blake2b support

2019-10-21 Thread David Sterba
Add definition, crypto wrappers and support to mkfs for blake2 for
checksumming. There are 2 aliases either blake2 or blake2b.

Signed-off-by: David Sterba 
---
 cmds/inspect-dump-super.c |  1 +
 crypto/hash.c | 12 
 crypto/hash.h |  1 +
 ctree.c   |  1 +
 ctree.h   |  1 +
 disk-io.c |  2 ++
 mkfs/main.c   |  3 +++
 7 files changed, 21 insertions(+)

diff --git a/cmds/inspect-dump-super.c b/cmds/inspect-dump-super.c
index 47f56f78934f..fc06488dde32 100644
--- a/cmds/inspect-dump-super.c
+++ b/cmds/inspect-dump-super.c
@@ -317,6 +317,7 @@ static bool is_valid_csum_type(u16 csum_type)
case BTRFS_CSUM_TYPE_CRC32:
case BTRFS_CSUM_TYPE_XXHASH:
case BTRFS_CSUM_TYPE_SHA256:
+   case BTRFS_CSUM_TYPE_BLAKE2:
return true;
default:
return false;
diff --git a/crypto/hash.c b/crypto/hash.c
index ee4c7f4e7112..48623c798739 100644
--- a/crypto/hash.c
+++ b/crypto/hash.c
@@ -2,6 +2,7 @@
 #include "crypto/crc32c.h"
 #include "crypto/xxhash.h"
 #include "crypto/sha.h"
+#include "crypto/blake2.h"
 
 int hash_crc32c(const u8* buf, size_t length, u8 *out)
 {
@@ -37,3 +38,14 @@ int hash_sha256(const u8 *buf, size_t len, u8 *out)
 
return 0;
 }
+
+int hash_blake2b(const u8 *buf, size_t len, u8 *out)
+{
+   blake2b_state S;
+
+   blake2b_init(&S, CRYPTO_HASH_SIZE_MAX);
+   blake2b_update(&S, buf, len);
+   blake2b_final(&S, out, CRYPTO_HASH_SIZE_MAX);
+
+   return 0;
+}
diff --git a/crypto/hash.h b/crypto/hash.h
index 49d586541af9..fefccbd59a09 100644
--- a/crypto/hash.h
+++ b/crypto/hash.h
@@ -8,5 +8,6 @@
 int hash_crc32c(const u8 *buf, size_t length, u8 *out);
 int hash_xxhash(const u8 *buf, size_t length, u8 *out);
 int hash_sha256(const u8 *buf, size_t length, u8 *out);
+int hash_blake2b(const u8 *buf, size_t length, u8 *out);
 
 #endif
diff --git a/ctree.c b/ctree.c
index 19052e8ec128..97b44d48dc85 100644
--- a/ctree.c
+++ b/ctree.c
@@ -45,6 +45,7 @@ static const struct btrfs_csum {
[BTRFS_CSUM_TYPE_CRC32] = {  4, "crc32c" },
[BTRFS_CSUM_TYPE_XXHASH]= {  8, "xxhash64" },
[BTRFS_CSUM_TYPE_SHA256]= { 32, "sha256" },
+   [BTRFS_CSUM_TYPE_BLAKE2]= { 32, "blake2" },
 };
 
 u16 btrfs_super_csum_size(const struct btrfs_super_block *sb)
diff --git a/ctree.h b/ctree.h
index f7f7e2963b0e..b99ba11279ad 100644
--- a/ctree.h
+++ b/ctree.h
@@ -169,6 +169,7 @@ enum btrfs_csum_type {
BTRFS_CSUM_TYPE_CRC32   = 0,
BTRFS_CSUM_TYPE_XXHASH  = 1,
BTRFS_CSUM_TYPE_SHA256  = 2,
+   BTRFS_CSUM_TYPE_BLAKE2  = 3,
 };
 
 #define BTRFS_EMPTY_DIR_SIZE 0
diff --git a/disk-io.c b/disk-io.c
index 62ec8c1a84bf..a9744af90a43 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -150,6 +150,8 @@ int btrfs_csum_data(u16 csum_type, const u8 *data, u8 *out, 
size_t len)
return hash_xxhash(data, len, out);
case BTRFS_CSUM_TYPE_SHA256:
return hash_sha256(data, len, out);
+   case BTRFS_CSUM_TYPE_BLAKE2:
+   return hash_blake2b(data, len, out);
default:
fprintf(stderr, "ERROR: unknown csum type: %d\n", csum_type);
ASSERT(0);
diff --git a/mkfs/main.c b/mkfs/main.c
index 607cfe3e0cf5..1a4578412b41 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -397,6 +397,9 @@ static enum btrfs_csum_type parse_csum_type(const char *s)
return BTRFS_CSUM_TYPE_XXHASH;
} else if (strcasecmp(s, "sha256") == 0) {
return BTRFS_CSUM_TYPE_SHA256;
+   } else if (strcasecmp(s, "blake2b") == 0 ||
+  strcasecmp(s, "blake2") == 0) {
+   return BTRFS_CSUM_TYPE_BLAKE2;
} else {
error("unknown csum type %s", s);
exit(1);
-- 
2.23.0



Re: [RFC PATCH v2 5/5] btrfs: implement RWF_ENCODED writes

2019-10-21 Thread Omar Sandoval
On Mon, Oct 21, 2019 at 03:14:52PM +0200, David Sterba wrote:
> On Fri, Oct 18, 2019 at 03:55:13PM -0700, Omar Sandoval wrote:
> > > > +   nr_pages = (disk_num_bytes + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > > 
> > > nit: nr_pages = DIV_ROUND_UP(disk_num_bytes, PAGE_SIZE)
> > 
> > disk_num_bytes is a u64, so that would expand to a 64-bit division. The
> > compiler is probably smart enough to optimize it to a shift, but I
> > didn't want to rely on that, because that would cause build failures on
> > 32-bit.
> 
> There are several DIV_ROUND_UP(u64, PAGE_SIZE) in btrfs code, no build
> brekages have been reported so far, you can use it.

Good to know, I'll fix both places I'm doing this, then.


Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Darrick J. Wong
On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Btrfs supports transparent compression: data written by the user can be
> compressed when written to disk and decompressed when read back.
> However, we'd like to add an interface to write pre-compressed data
> directly to the filesystem, and the matching interface to read
> compressed data without decompressing it. This adds support for
> so-called "encoded I/O" via preadv2() and pwritev2().
> 
> A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> this flag is set, iov[0].iov_base points to a struct encoded_iov which
> is used for metadata: namely, the compression algorithm, unencoded
> (i.e., decompressed) length, and what subrange of the unencoded data
> should be used (needed for truncated or hole-punched extents and when
> reading in the middle of an extent). For reads, the filesystem returns
> this information; for writes, the caller provides it to the filesystem.
> iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> used to extend the interface in the future. The remaining iovecs contain
> the encoded extent.
> 
> Filesystems must indicate that they support encoded writes by setting
> FMODE_ENCODED_IO in ->file_open().
> 
> Signed-off-by: Omar Sandoval 
> ---
>  include/linux/fs.h  | 14 +++
>  include/uapi/linux/fs.h | 26 -
>  mm/filemap.c| 82 ++---
>  3 files changed, 108 insertions(+), 14 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e0d909d35763..54681f21e05e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> offset,
>  /* File does not contribute to nr_files count */
>  #define FMODE_NOACCOUNT  ((__force fmode_t)0x2000)
>  
> +/* File supports encoded IO */
> +#define FMODE_ENCODED_IO ((__force fmode_t)0x4000)
> +
>  /*
>   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
>   * that indicates that they should check the contents of the iovec are
> @@ -314,6 +317,7 @@ enum rw_hint {
>  #define IOCB_SYNC(1 << 5)
>  #define IOCB_WRITE   (1 << 6)
>  #define IOCB_NOWAIT  (1 << 7)
> +#define IOCB_ENCODED (1 << 8)
>  
>  struct kiocb {
>   struct file *ki_filp;
> @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, int);
>  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
>  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct 
> *);
>  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> +struct encoded_iov;
> +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov 
> *);
> +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> + struct iov_iter *);
>  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
>   struct file *file_out, loff_t pos_out,
>   loff_t *count, unsigned int remap_flags);
> @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb *ki, 
> rwf_t flags)
>   return -EOPNOTSUPP;
>   ki->ki_flags |= IOCB_NOWAIT;
>   }
> + if (flags & RWF_ENCODED) {
> + if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> + return -EOPNOTSUPP;
> + ki->ki_flags |= IOCB_ENCODED;
> + }
>   if (flags & RWF_HIPRI)
>   ki->ki_flags |= IOCB_HIPRI;
>   if (flags & RWF_DSYNC)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 379a612f8f1d..ed92a8a257cb 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -284,6 +284,27 @@ struct fsxattr {
>  
>  typedef int __bitwise __kernel_rwf_t;
>  
> +enum {
> + ENCODED_IOV_COMPRESSION_NONE,
> + ENCODED_IOV_COMPRESSION_ZLIB,
> + ENCODED_IOV_COMPRESSION_LZO,
> + ENCODED_IOV_COMPRESSION_ZSTD,
> + ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> +};
> +
> +enum {
> + ENCODED_IOV_ENCRYPTION_NONE,
> + ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> +};
> +
> +struct encoded_iov {
> + __u64 len;
> + __u64 unencoded_len;
> + __u64 unencoded_offset;
> + __u32 compression;
> + __u32 encryption;

Can we add some must-be-zero padding space at the end here for whomever
comes along next wanting to add more encoding info?

(And maybe a manpage and some basic testing, to reiterate Dave...)

--D

> +};
> +
>  /* high priority request, poll if possible */
>  #define RWF_HIPRI((__force __kernel_rwf_t)0x0001)
>  
> @@ -299,8 +320,11 @@ typedef int __bitwise __kernel_rwf_t;
>  /* per-IO O_APPEND */
>  #define RWF_APPEND   ((__force __kernel_rwf_t)0x

Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Aleksa Sarai
On 2019-10-21, Darrick J. Wong  wrote:
> On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  include/linux/fs.h  | 14 +++
> >  include/uapi/linux/fs.h | 26 -
> >  mm/filemap.c| 82 ++---
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> > offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT((__force fmode_t)0x2000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO   ((__force fmode_t)0x4000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC  (1 << 5)
> >  #define IOCB_WRITE (1 << 6)
> >  #define IOCB_NOWAIT(1 << 7)
> > +#define IOCB_ENCODED   (1 << 8)
> >  
> >  struct kiocb {
> > struct file *ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, 
> > int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct 
> > *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov 
> > *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +   struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > struct file *file_out, loff_t pos_out,
> > loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb 
> > *ki, rwf_t flags)
> > return -EOPNOTSUPP;
> > ki->ki_flags |= IOCB_NOWAIT;
> > }
> > +   if (flags & RWF_ENCODED) {
> > +   if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +   return -EOPNOTSUPP;
> > +   ki->ki_flags |= IOCB_ENCODED;
> > +   }
> > if (flags & RWF_HIPRI)
> > ki->ki_flags |= IOCB_HIPRI;
> > if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +   ENCODED_IOV_COMPRESSION_NONE,
> > +   ENCODED_IOV_COMPRESSION_ZLIB,
> > +   ENCODED_IOV_COMPRESSION_LZO,
> > +   ENCODED_IOV_COMPRESSION_ZSTD,
> > +   ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +   ENCODED_IOV_ENCRYPTION_NONE,
> > +   ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +   __u64 len;
> > +   __u64 unencoded_len;
> > +   __u64 unencoded_offset;
> > +   __u32 compression;
> > +   __u32 encryption;
> 
> Can we add some must-be-zero padding space at the end here for whomever
> comes along next wanting to add more encoding info?

I would suggest to copy the extension design of copy_struct_from_user().
Adding must-be-zero padding is a less-ideal solut

Re: [PATCH man-pages] Document encoded I/O

2019-10-21 Thread Omar Sandoval
On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> CC: Ted
> 
> What ever happened to read/write ext4 encrypted data API?
> https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> 
> Can we learn anything from the ext4 experience to improve
> the new proposed API?

I wasn't aware of these patches, thanks for pointing them out. Ted, do
you have any thoughts about making this API work for fscrypt?

> On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval  wrote:
> >
> > From: Omar Sandoval 
> >
> > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > reference it.
> >
> > Signed-off-by: Omar Sandoval 
> > ---
> >  man2/fcntl.2   |  10 +-
> >  man2/open.2|  13 ++
> >  man2/readv.2   |  46 +++
> >  man7/rwf_encoded.7 | 297 +
> >  4 files changed, 365 insertions(+), 1 deletion(-)
> >  create mode 100644 man7/rwf_encoded.7
> >
> > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > index fce4f4c2b..76fe9cc6f 100644
> > --- a/man2/fcntl.2
> > +++ b/man2/fcntl.2
> > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> >  .BR O_ASYNC ,
> >  .BR O_DIRECT ,
> >  .BR O_NOATIME ,
> > +.BR O_NONBLOCK ,
> >  and
> > -.B O_NONBLOCK
> > +.B O_ENCODED
> >  flags.
> >  It is not possible to change the
> >  .BR O_DSYNC
> > @@ -1803,6 +1804,13 @@ Attempted to clear the
> >  flag on a file that has the append-only attribute set.
> >  .TP
> >  .B EPERM
> > +Attempted to set the
> > +.B O_ENCODED
> > +flag and the calling process did not have the
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.TP
> > +.B EPERM
> >  .I cmd
> >  was
> >  .BR F_ADD_SEALS ,
> > diff --git a/man2/open.2 b/man2/open.2
> > index b0f485b41..cdd3c549c 100644
> > --- a/man2/open.2
> > +++ b/man2/open.2
> > @@ -421,6 +421,14 @@ was followed by a call to
> >  .BR fdatasync (2)).
> >  .IR "See NOTES below" .
> >  .TP
> > +.B O_ENCODED
> > +Open the file with encoded I/O permissions;
> 
> 1. I find the name of the flag confusing.
> Yes, most people don't read documentation so carefully (or at all)
> so they will assume O_ENCODED will affect read/write or that it
> relates to RWF_ENCODED in a similar way that O_SYNC relates
> to RWF_SYNC (i.e. logical OR and not logical AND).
> 
> I am not good at naming and to prove it I will propose:
> O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED

Agreed, the name is misleading. I can't think of anything better than
O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
something better :)

> 2. While I see no harm in adding O_ flag to open(2) for this
> use case, I also don't see a major benefit in adding it.
> What if we only allowed setting the flag via fcntl(2) which returns
> an error on old kernels?
> Since unlike most O_ flags, O_ENCODED does NOT affect file
> i/o without additional opt-in flags, it is not standard anyway and
> therefore I find that setting it only via fcntl(2) is less error prone.

If I make this fcntl-only, then it probably shouldn't be through
F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
for open(), and also awkward to mix some non-O_ flag with O_ flags for
F_GETFL/F_SETFL). So that leaves a couple of options:

1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
   FD_CLOEXEC. That also silently ignores unknown flags, but as with the
   O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
   overkill to me.

However, both of these options are annoying to implement. Ideally, we
wouldn't have to add another flags field to struct file. But, to reuse
f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
with other O_ flags, and we'd probably want to hide it from F_GETFL. At
that point, it might as well be an O_ flag.

It seems to me that it's more trouble than it's worth to make this not
an O_ flag, but please let me know if you see a nice way to do so.

> > +see
> > +.BR rwf_encoded (7).
> > +The caller must have the
> > +.B CAP_SYS_ADMIN
> > +capabilty.
> > +.TP
> >  .B O_EXCL
> >  Ensure that this call creates the file:
> >  if this flag is specified in conjunction with
> > @@ -1168,6 +1176,11 @@ did not match the owner of the file and the caller 
> > was not privileged.
> >  The operation was prevented by a file seal; see
> >  .BR fcntl (2).
> >  .TP
> > +.B EPERM
> > +The
> > +.B O_ENCODED
> > +flag was specified, but the caller was not privileged.
> > +.TP
> >  .B EROFS
> >  .I pathname
> >  refers to a file on a read-only filesystem and write access was
> > diff --git a/man2/readv.2 b/man2/readv.2
> > index af27aa63e..aa60b980a 100644
> > --- a/man2/readv.2
> > +++ b/man2/readv.2
> > @@ -265,6 +265,11 @@ the data is always appended to the end of the file.
> >  However, if the
> >  .I offset
> >  argument is \-1, the current file offset is updated.
> > +.TP
> > +.BR RWF_ENCODED " (sinc

Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Darrick J. Wong
On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> On 2019-10-21, Darrick J. Wong  wrote:
> > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > From: Omar Sandoval 
> > > 
> > > Btrfs supports transparent compression: data written by the user can be
> > > compressed when written to disk and decompressed when read back.
> > > However, we'd like to add an interface to write pre-compressed data
> > > directly to the filesystem, and the matching interface to read
> > > compressed data without decompressing it. This adds support for
> > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > 
> > > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > is used for metadata: namely, the compression algorithm, unencoded
> > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > should be used (needed for truncated or hole-punched extents and when
> > > reading in the middle of an extent). For reads, the filesystem returns
> > > this information; for writes, the caller provides it to the filesystem.
> > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > used to extend the interface in the future. The remaining iovecs contain
> > > the encoded extent.
> > > 
> > > Filesystems must indicate that they support encoded writes by setting
> > > FMODE_ENCODED_IO in ->file_open().
> > > 
> > > Signed-off-by: Omar Sandoval 
> > > ---
> > >  include/linux/fs.h  | 14 +++
> > >  include/uapi/linux/fs.h | 26 -
> > >  mm/filemap.c| 82 ++---
> > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > 
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index e0d909d35763..54681f21e05e 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> > > offset,
> > >  /* File does not contribute to nr_files count */
> > >  #define FMODE_NOACCOUNT  ((__force fmode_t)0x2000)
> > >  
> > > +/* File supports encoded IO */
> > > +#define FMODE_ENCODED_IO ((__force fmode_t)0x4000)
> > > +
> > >  /*
> > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > >   * that indicates that they should check the contents of the iovec are
> > > @@ -314,6 +317,7 @@ enum rw_hint {
> > >  #define IOCB_SYNC(1 << 5)
> > >  #define IOCB_WRITE   (1 << 6)
> > >  #define IOCB_NOWAIT  (1 << 7)
> > > +#define IOCB_ENCODED (1 << 8)
> > >  
> > >  struct kiocb {
> > >   struct file *ki_filp;
> > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, 
> > > int);
> > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > >  extern int generic_file_readonly_mmap(struct file *, struct 
> > > vm_area_struct *);
> > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > +struct encoded_iov;
> > > +extern int generic_encoded_write_checks(struct kiocb *, struct 
> > > encoded_iov *);
> > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > + struct iov_iter *);
> > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > >   struct file *file_out, loff_t pos_out,
> > >   loff_t *count, unsigned int remap_flags);
> > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb 
> > > *ki, rwf_t flags)
> > >   return -EOPNOTSUPP;
> > >   ki->ki_flags |= IOCB_NOWAIT;
> > >   }
> > > + if (flags & RWF_ENCODED) {
> > > + if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > + return -EOPNOTSUPP;
> > > + ki->ki_flags |= IOCB_ENCODED;
> > > + }
> > >   if (flags & RWF_HIPRI)
> > >   ki->ki_flags |= IOCB_HIPRI;
> > >   if (flags & RWF_DSYNC)
> > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > index 379a612f8f1d..ed92a8a257cb 100644
> > > --- a/include/uapi/linux/fs.h
> > > +++ b/include/uapi/linux/fs.h
> > > @@ -284,6 +284,27 @@ struct fsxattr {
> > >  
> > >  typedef int __bitwise __kernel_rwf_t;
> > >  
> > > +enum {
> > > + ENCODED_IOV_COMPRESSION_NONE,
> > > + ENCODED_IOV_COMPRESSION_ZLIB,
> > > + ENCODED_IOV_COMPRESSION_LZO,
> > > + ENCODED_IOV_COMPRESSION_ZSTD,
> > > + ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > > +};
> > > +
> > > +enum {
> > > + ENCODED_IOV_ENCRYPTION_NONE,
> > > + ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > > +};
> > > +
> > > +struct encoded_iov {
> > > + __u64 len;
> > > + __u64 unencoded_len;
> > > + __u64 unencoded_offset;
> > > + __u32 compression;
> > > + __u32 encryption;
> > 
> > Can we add so

Re: [RFC PATCH v2 0/5] fs: interface for directly reading/writing compressed data

2019-10-21 Thread Omar Sandoval
On Mon, Oct 21, 2019 at 10:05:01AM +1100, Dave Chinner wrote:
> On Tue, Oct 15, 2019 at 11:42:38AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Hello,
> > 
> > This series adds an API for reading compressed data on a filesystem
> > without decompressing it as well as support for writing compressed data
> > directly to the filesystem. It is based on my previous series which
> > added a Btrfs-specific ioctl [1], but it is now an extension to
> > preadv2()/pwritev2() as suggested by Dave Chinner [2]. I've included a
> > man page patch describing the API in detail. Test cases and examples
> > programs are available [3].
> > 
> > The use case that I have in mind is Btrfs send/receive: currently, when
> > sending data from one compressed filesystem to another, the sending side
> > decompresses the data and the receiving side recompresses it before
> > writing it out. This is wasteful and can be avoided if we can just send
> > and write compressed extents. The send part will be implemented in a
> > separate series, as this API can stand alone.
> > 
> > Patches 1 and 2 add the VFS support. Patch 3 is a Btrfs prep patch.
> > Patch 4 implements encoded reads for Btrfs, and patch 5 implements
> > encoded writes.
> > 
> > Changes from v1 [4]:
> > 
> > - Encoded reads are now also implemented.
> > - The encoded_iov structure now includes metadata for referring to a
> >   subset of decoded data. This is required to handle certain cases where
> >   a compressed extent is truncated, hole punched, or otherwise sliced up
> >   and Btrfs chooses to reflect this in metadata instead of decompressing
> >   the whole extent and rewriting the pieces. We call these "bookend
> >   extents" in Btrfs, but any filesystem supporting transparent encoding
> >   is likely to have a similar concept.
> 
> Where's the in-kernel documentation for this API? You're encoding a
> specific set of behaviours into the user API, so this needs a whole
> heap of documentation in the generic code to describe how it works
> so that other filesystems implementing have a well defined guideline
> to what they need to support.

The man-page I sent is quite detailed, but sure, I can add the relevant
information to the generic code, as well.

> Also, I don't see any test code for this -

It's in the cover letter: https://github.com/osandov/xfstests/tree/rwf-encoded

I haven't sent those patches up because it's tedious to rework and
resend them for each little tweak we make to the API.

> can you please add
> support for RWF_ENCODED to xfs_io and write a suite of unit tests
> for fstests that exercise the user API fully?

Reading requires filesystem-specific decoding, and I wasn't sure if that
would be a good fit for xfs_io. Alternatively, it could dump the raw
buffer to stdout, but whatever interprets it also needs the metadata, so
there'd need to be some sort of protocol between xfs_io and whatever
interprets it. I added a btrfs_read_encoded program in my xfstests
branch above instead. It should be easy enough to move the encoded_write
test program to xfs_io pwrite, though.

> Given our history of
> screwing up new user APIs, this absolutely should not be merged
> until there is a full set of generic unit tests written and reviewed
> for it and support has been added to fsstress, fsx, and other test
> utilities to fuzz and stress the implementation as part of normal
> day-to-day filesystem development...

Sure thing, I'll add support to those tools once the API isn't in flux
so much.


Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Omar Sandoval
On Mon, Oct 21, 2019 at 11:28:06AM -0700, Darrick J. Wong wrote:
> On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Btrfs supports transparent compression: data written by the user can be
> > compressed when written to disk and decompressed when read back.
> > However, we'd like to add an interface to write pre-compressed data
> > directly to the filesystem, and the matching interface to read
> > compressed data without decompressing it. This adds support for
> > so-called "encoded I/O" via preadv2() and pwritev2().
> > 
> > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > is used for metadata: namely, the compression algorithm, unencoded
> > (i.e., decompressed) length, and what subrange of the unencoded data
> > should be used (needed for truncated or hole-punched extents and when
> > reading in the middle of an extent). For reads, the filesystem returns
> > this information; for writes, the caller provides it to the filesystem.
> > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > used to extend the interface in the future. The remaining iovecs contain
> > the encoded extent.
> > 
> > Filesystems must indicate that they support encoded writes by setting
> > FMODE_ENCODED_IO in ->file_open().
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  include/linux/fs.h  | 14 +++
> >  include/uapi/linux/fs.h | 26 -
> >  mm/filemap.c| 82 ++---
> >  3 files changed, 108 insertions(+), 14 deletions(-)
> > 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e0d909d35763..54681f21e05e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t 
> > offset,
> >  /* File does not contribute to nr_files count */
> >  #define FMODE_NOACCOUNT((__force fmode_t)0x2000)
> >  
> > +/* File supports encoded IO */
> > +#define FMODE_ENCODED_IO   ((__force fmode_t)0x4000)
> > +
> >  /*
> >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> >   * that indicates that they should check the contents of the iovec are
> > @@ -314,6 +317,7 @@ enum rw_hint {
> >  #define IOCB_SYNC  (1 << 5)
> >  #define IOCB_WRITE (1 << 6)
> >  #define IOCB_NOWAIT(1 << 7)
> > +#define IOCB_ENCODED   (1 << 8)
> >  
> >  struct kiocb {
> > struct file *ki_filp;
> > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block *, 
> > int);
> >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> >  extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct 
> > *);
> >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > +struct encoded_iov;
> > +extern int generic_encoded_write_checks(struct kiocb *, struct encoded_iov 
> > *);
> > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > +   struct iov_iter *);
> >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > struct file *file_out, loff_t pos_out,
> > loff_t *count, unsigned int remap_flags);
> > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct kiocb 
> > *ki, rwf_t flags)
> > return -EOPNOTSUPP;
> > ki->ki_flags |= IOCB_NOWAIT;
> > }
> > +   if (flags & RWF_ENCODED) {
> > +   if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > +   return -EOPNOTSUPP;
> > +   ki->ki_flags |= IOCB_ENCODED;
> > +   }
> > if (flags & RWF_HIPRI)
> > ki->ki_flags |= IOCB_HIPRI;
> > if (flags & RWF_DSYNC)
> > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > index 379a612f8f1d..ed92a8a257cb 100644
> > --- a/include/uapi/linux/fs.h
> > +++ b/include/uapi/linux/fs.h
> > @@ -284,6 +284,27 @@ struct fsxattr {
> >  
> >  typedef int __bitwise __kernel_rwf_t;
> >  
> > +enum {
> > +   ENCODED_IOV_COMPRESSION_NONE,
> > +   ENCODED_IOV_COMPRESSION_ZLIB,
> > +   ENCODED_IOV_COMPRESSION_LZO,
> > +   ENCODED_IOV_COMPRESSION_ZSTD,
> > +   ENCODED_IOV_COMPRESSION_TYPES = ENCODED_IOV_COMPRESSION_ZSTD,
> > +};
> > +
> > +enum {
> > +   ENCODED_IOV_ENCRYPTION_NONE,
> > +   ENCODED_IOV_ENCRYPTION_TYPES = ENCODED_IOV_ENCRYPTION_NONE,
> > +};
> > +
> > +struct encoded_iov {
> > +   __u64 len;
> > +   __u64 unencoded_len;
> > +   __u64 unencoded_offset;
> > +   __u32 compression;
> > +   __u32 encryption;
> 
> Can we add some must-be-zero padding space at the end here for whomever
> comes along next wanting to add more encoding info?

>From the commit message:

iov_len must be set to sizeof(struct encoded_iov), which can be used 

Re: MD RAID 5/6 vs BTRFS RAID 5/6

2019-10-21 Thread Zygo Blaxell
On Mon, Oct 21, 2019 at 05:27:54PM +0200, Edmund Urbani wrote:
> On 10/16/19 9:42 PM, Zygo Blaxell wrote:
> >
> > For raid5 I'd choose btrfs -draid5 -mraid1 over mdadm raid5
> > sometimes--even with the write hole, I'd expect smaller average data
> > losses than mdadm raid5 + write hole mitigation due to the way disk
> > failure modes are distributed.  
> 
> What about the write hole and RAID-1? I understand the write hole is most
> commonly associated with RAID-5, but it is also a problem with other RAID 
> levels.

Filesystem tree updates are atomic on btrfs.  Everything persistent on
btrfs is part of a committed tree.  The current writes in progress are
initially stored in an uncommitted tree, which consists of blocks that
are isolated from any committed tree block.  The algorithm relies on
two things:

Isolation:  every write to any uncommitted data block must not
affect the correctness of any data in any committed data block.

Ordering:  a commit completes all uncommitted tree updates on
all disks in any order, then updates superblocks to point to the
updated tree roots.  A barrier is used to separate these phases
of updates across disks.

Isolation and ordering make each transaction atomic.  If either
requirement is not implemented correctly, data or metadata may be
corrupted.  If metadata is corrupted, the filesystem can be destroyed.

Transactions close write holes in all btrfs RAID profiles except 5
and 6.  mdadm RAID levels 0, 1, 10, and linear have isolation properties
sufficient for btrfs.  mdadm RAID 4/5/6 work only if the mdadm write hole
mitigation feature is enabled to provide isolation.  All mdadm and btrfs
RAID profiles provide sufficient ordering.

The problem on mdadm/btrfs raid5/6 is that committed data blocks are not
fully isolated from uncommitted data blocks when they share a parity block
in the RAID5/6 layer (wherever that layer is).  This is why the problem
only affects raid5/6 on btrfs (and why it also applies to mdadm raid5/6).
In this case, writing to any single block within a stripe makes the parity
block inconsistent with previously committed data blocks.  This violates
the isolation requirement for CoW transactions: a committed data block
is related to uncommitted data blocks in the same stripe through the
parity block and the RAID5/6 data/parity equation.

The isolation failure affects only parity blocks.  You could kill
power all day long and not lose any committed data on any btrfs raid
profile--as long as none of the disks fail and each disk's firmware
implements write barriers correctly or write cache is disabled (sadly,
even in 2019, a few drive models still don't have working barriers).
btrfs on raid5/6 is as robust as raid0 if you ignore the parity blocks.

> You would need to scrub after a power failure to be sure that no meta data 
> gets
> corrupted even with RAID-1. Otherwise you might still have inconsistent copies
> and the problem may only become apparent when one drive fails. 

There are no inconsistencies expected with RAID1 unless the hardware is
already failing (or there's a software/firmware bug).

Scrub after power failure is required only if raid5/6 is used.
All committed data and metadata blocks will be consistent on all
RAID profiles--even in raid5/6, only parity blocks are inconsistent.
Uncommitted data returns to free space when the filesystem is mounted,
so the the consistency of uncommitted data blocks doesn't matter.

Parity block updates are not handled by the btrfs transaction update
mechanism.  The scrub after power failure is required specifically
to repair inconsistent parity blocks.  This should happen as soon as
possible after a crash, while all the data blocks are still readable.

> Can we be certain that scrubbing is always able to fix such
> inconsistencies...

Scrub can reliably repair raid5/6 parity blocks assuming all data blocks
are still correct and readable.

> ...with RAID-1?

We cannot be certain that scrub fixes inconsistencies in the general case.
If the hardware is failing, scrub relies on crc32c to detect missing
writes.  Error detection and repair is highly likely, but not certain.

Collisions between old/bad data crc32c and correct data crc32c are
not detected.  nodatasum files (which have no csums) are corrupted.
Metadata has a greater chance of successful error detection because more
fields in metadata are checked than just CRC.

If all the errors are confined to a single drive, device 'replace' is more
appropriate than 'scrub'; however, it can be hard to determine if errors
are confined to a single disk without running 'scrub' first to find out.

> Regards,
>  Edmund
> 
> 
> -- 
> *Liland IT GmbH*
> 
> 
> Ferlach ● Wien ● München
> Tel: +43 463 220111
> Tel: +49 89 
> 458 15 940
> off...@liland.com
> https://Liland.com  
> 
> 
> 
> Copyright © 2019 Liland IT GmbH 
> 
> Diese Mail enthaelt vertrauliche und/oder 
> rechtlich geschuetzte Informationen. 
> 

Re: [PATCH v3] btrfs-progs: add verbose option to btrfs device scan

2019-10-21 Thread Anand Jain




On 10/21/19 9:43 PM, David Sterba wrote:

On Tue, Oct 15, 2019 at 11:29:34AM +0800, Anand Jain wrote:

   I was thinking there might be some common code between the
   sub-commands in btrfs-progs now or in future, and if the printf()
   due to verbose is required in one sub-command and the same printf()
   due to verbose is not required in another sub-command (which I
   called unwanted message) then we won't have any choice to not
   to print those unwanted printf().
   But as this is just an anticipatory only, so probably we could try
   global verbose and see how it fares. I will try.


I see, but it would be better to have a concrete example where it's
problematic so we can figure out ways how to filter unwanted messages.



I solved with an argument to btrfs_scan_devcies() [1], by adding
%verbose argument to btrfs_scan_devices() to make sure
only btrfs dev scan would print the verbose and not the btrfs fi show.
If btrfs fi show prints the verbose it shall break few test-cases
in fstests.

[1]
https://patchwork.kernel.org/patch/11201791/
https://patchwork.kernel.org/patch/11201793/


Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature

2019-10-21 Thread Qu Wenruo


On 2019/10/21 下午11:44, David Sterba wrote:
> On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
>> That's wonderful.
>> Although I guess my patchset should provide the hint of where to modify
>> the code, since there are only a limited number of places we modify
>> block group item.
> 
> I indeed started at your patchset and grepped fro BG_TREE, adding
> another branch.
> 
>>> We agree on the point that the block group items must be packed. The key
>>> approach should move the new BGI to the beginning, ie. key type is
>>> smaller than anything that appears in the extent tree. I chose 100 for
>>> the prototype, it could change.
>>>
>>> To keep changes to minimum, the new BGI uses the same block group item,
>>> so the only difference then becomes how we search for the items.
>>
>> If we're introducing new block group item, I hope to do a minor change.
>>
>> Remove the chunk_objectid member, or even flags. to make it more
>> compact. So that you can make the BGI subtree even smaller.
> 
> Yeah that can be done.
> 
>> But I guess since you don't want to modify the BGI structure, and keep
>> the code modification minimal, it may not be a good idea right now.
> 
> As long as the changes are bearable, eg. a minor refactoring of block
> group access (the cache.key serving a as offset and length) is fine. And
> if we can make the b-tree item more then let's do it.
> 
>>> Packing of the items is done by swapping the key objectid and offset.
>>>
>>> Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
>>> the objectid is the thing that scatters the items all over the tree.
>>>
>>> So the new BGI has bg.length == key.objectid and bg.start == key.offset.
>>> As most of block groups are of same size, or from a small set, they're
>>> packed.
>>
>> That doesn't look optimized enough.
>>
>> bg.length can be at 1G, that means if extents starts below 1G can still
>> be before BGIs.
> 
> This shold not be a big problem, the items are still grouped togethers.
> Mkfs does 8M, we can have 256M or 1G. On average there could be several
> packed groups, which I think is fine and the estimated overhead would be
> a few more seeks.
> 
>> I believe we should have a fixed objectid for this new BGIs, so that
>> they are ensured to be at the beginning of extent tree.
> 
> That was my idea too, but that also requires to add one more member to
> the item to track the length. Currently the key is saves the bytes. With
> the proposed changes to drop chunk_objectid, the overall size of BGI
> would not increase so this still sounds ok. And all the problems with
> searching would go away.
> 
>>> The nice thing is that a lot of code can be shared between BGI and new
>>> BGI, just needs some care with searches, inserts and search key
>>> advances.
>>
>> Exactly, but since we're introducing a new key type, I prefer to perfect
>> it. Not only change the key, but also the block group item structure to
>> make it more compact.
>>
>> Although from the design aspect, it looks like BG tree along with new
>> BGI would be the best design.
>>
>> New BG key goes (bg start, NEW BGI TYPE, used) no data. It would provide
>> the most compact on-disk format.
> 
> That's very compact. Given the 'bg start' we can't use the same for the
> extent tree item.
> 
>>> This would be easy with the bg_tree, because we'd know that all items in
>>> the tree are just the block group items. Some sort of enumeration could
>>> work for bg_key too, but I don't have something solid.
>>
>> Why not fixed objectid for BGI and just ignore the bg.len part?
> 
> I wanted to avoid storing a useless number, it costs 8 bytes per item,
> and the simple swap of objectid/offset was first idea how to avoid it.
> 
>> We have chunk<->BGI verification code, no bg.len is not a problem at
>> all, we can still make sure chunk<->bg is 1:1 mapped and still verify
>> the bg.used.
> 
> This is all great, sure, and makes the extensions easy. Let's try to
> work out best for each approach we have so far. Having a separate tree
> for block groups may open some future options.

Great, I'll explore the most compact key only method with BG_TREE.

And maybe also try the fixed key objectid solution, just dropping the
bg.length, while keep the current BGI structure.

Thanks,
Qu



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Aleksa Sarai
On 2019-10-21, Darrick J. Wong  wrote:
> On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > On 2019-10-21, Darrick J. Wong  wrote:
> > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval 
> > > > 
> > > > Btrfs supports transparent compression: data written by the user can be
> > > > compressed when written to disk and decompressed when read back.
> > > > However, we'd like to add an interface to write pre-compressed data
> > > > directly to the filesystem, and the matching interface to read
> > > > compressed data without decompressing it. This adds support for
> > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > 
> > > > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > should be used (needed for truncated or hole-punched extents and when
> > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > this information; for writes, the caller provides it to the filesystem.
> > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > used to extend the interface in the future. The remaining iovecs contain
> > > > the encoded extent.
> > > > 
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > > 
> > > > Signed-off-by: Omar Sandoval 
> > > > ---
> > > >  include/linux/fs.h  | 14 +++
> > > >  include/uapi/linux/fs.h | 26 -
> > > >  mm/filemap.c| 82 ++---
> > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index e0d909d35763..54681f21e05e 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, 
> > > > loff_t offset,
> > > >  /* File does not contribute to nr_files count */
> > > >  #define FMODE_NOACCOUNT((__force fmode_t)0x2000)
> > > >  
> > > > +/* File supports encoded IO */
> > > > +#define FMODE_ENCODED_IO   ((__force fmode_t)0x4000)
> > > > +
> > > >  /*
> > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > >   * that indicates that they should check the contents of the iovec are
> > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > >  #define IOCB_SYNC  (1 << 5)
> > > >  #define IOCB_WRITE (1 << 6)
> > > >  #define IOCB_NOWAIT(1 << 7)
> > > > +#define IOCB_ENCODED   (1 << 8)
> > > >  
> > > >  struct kiocb {
> > > > struct file *ki_filp;
> > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block 
> > > > *, int);
> > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > >  extern int generic_file_readonly_mmap(struct file *, struct 
> > > > vm_area_struct *);
> > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > +struct encoded_iov;
> > > > +extern int generic_encoded_write_checks(struct kiocb *, struct 
> > > > encoded_iov *);
> > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > +   struct iov_iter *);
> > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > > struct file *file_out, loff_t pos_out,
> > > > loff_t *count, unsigned int 
> > > > remap_flags);
> > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct 
> > > > kiocb *ki, rwf_t flags)
> > > > return -EOPNOTSUPP;
> > > > ki->ki_flags |= IOCB_NOWAIT;
> > > > }
> > > > +   if (flags & RWF_ENCODED) {
> > > > +   if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > +   return -EOPNOTSUPP;
> > > > +   ki->ki_flags |= IOCB_ENCODED;
> > > > +   }
> > > > if (flags & RWF_HIPRI)
> > > > ki->ki_flags |= IOCB_HIPRI;
> > > > if (flags & RWF_DSYNC)
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > >  
> > > >  typedef int __bitwise __kernel_rwf_t;
> > > >  
> > > > +enum {
> > > > +   ENCODED_IOV_COMPRESSION_NONE,
> > > > +   ENCODED_IOV_COMPRESSION_ZLIB,
> > > > +   ENCODED_IOV_COMPRESSION_LZO,
> > > > +   ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +   ENCODED_IOV_COMPRESSION_TYPES = 

Re: [RFC PATCH 00/14] btrfs-progs: global-verbose option

2019-10-21 Thread Anand Jain




On 10/22/19 12:12 AM, David Sterba wrote:

On Mon, Oct 21, 2019 at 06:01:08PM +0800, Anand Jain wrote:

This patch set brings --verbose option to the top level btrfs command,
such as 'btrfs --verbose'. With this we don't have to add or remember
verbose option at the sub-commands level.

As there are already verbose options to 11 sub-commands as listed
below [1][2]. So the top level --verbose option here takes care to transpire
verbose request from the top level to the sub-command level for 9 (not 11)
sub-commands as in [1] as of now.

This patch is RFC still for the following two reasons (comments appreciated).

1.
The sub-commands as in [2] uses multi-level compile time verbose option,
such as %g_verbose = 0 (quite), %g_verbose = 1 (default), %g_verbose > 1
(real-verbose). And verbose at default is also part the .out files in
fstests. So it needs further discussions on how to handle the multi-
level verbose option using the global verbose option, and so sub-
commands in [2] are untouched.


The idea is to unify all verbosity options. Default is 1, 0 is for quiet
(only errors are printed), the rest is up to the commands what to print
on the higher levels.


As of now verbosity level is a compile time option. [3]

[3]
---
cmds/send.c

 51 /*
 52  * Default is 1 for historical reasons, changing may break scripts 
that expect

 53  * the 'At subvol' message.
 54  */
 55 static int g_verbose = 1;





2.
These patch has been unit-tested individually.
These patches does not alter the verbose output.
But it fixes the indentation in the command's help output, which may be
used in fstests and btrfs-progs/tests and their verification is pending
still, which I am planning to do it before v1.


The indentation does not need to be changed if the glboal options are
split from the per-command, like

>

---
usage: btrfs subvolume delete [options]  [...]

 Delete subvolume(s)

 Delete subvolumes from the filesystem. The corresponding directory
 is removed instantly but the data blocks are removed later.
 The deletion does not involve full commit by default due to
 performance reasons (as a consequence, the subvolume may appear again
 after a crash). Use one of the --commit options to wait until the
 operation is safely stored on the media.

 -c|--commit-after  wait for transaction commit at the end of the operation
 -C|--commit-each   wait for transaction commit after deleting each 
subvolume

 Global options:
 -v|--verbose show verbose output
---



 Oh split it. ok. Makes sense.


Some commands can have long option names or the argument names make it
long in some cases, the global options could stay indented. I think
visually it'll be ok. We can introduce some way to automatically format
the options and help texts so we don't have to adjust them manually each
time, but this would be more intrusive and can be done later.


 ok. But my pertaining question is if the sub-command verbose option
 should still remain? if no I will be happy to take it out as the same
 verbose will anyway be activated using the global verbose option.


With the global verbose option there shouldbe also -q|--quiet. Both
short and long versions should be available for all commands. So the
help would look like:

---
 Global options:
 -v|--verbose verbose output, repeat for more verbosity
 -q|--quiet   print only errors
---

In code this looks like:

   "",
   "-c|--commit-after  wait for transaction commit at the end of the 
operation",
   "-C|--commit-each   wait for transaction commit after deleting each 
subvolume",
   HELPINFO_GLOBAL_OPTIONS_HEADER,
   HELPINFO_INSERT_VERBOSE,
   NULL

#define HELPINFO_GLOBAL_OPTIONS_HEADER  \
"",   \
"Global options:"

and HELPINFO_INSERT_VERBOSE also contains the quiet option.

The global option value is stored in 'btrfs_config_init bconf', so
everything can access it directly.


 Oh ok.

 In the above code-snap [3].

 g_verbose = 0 and g_verbose = 1 can be mapped to the global
 -q|--quite and --verbose respectively.


 But any idea what to do with g_verbose > 1? which we support
 in send.c and receive.c. And in defrag which the patch [4] removed it.
  [4]
  [RFC PATCH 09/14] btrfs-progs: restore: delete unreachable code

 Another way is

  btrfs [--quite] [--verbose[=n]]

n=1 default
n=2 verbose

Can't imagine anything better.

Thanks for helping to shape this.

Anand


Thanks for working on this, I'll have more comments on v2 as I probably
forgot a few more things to do, the above is the base for all further
changes.



[PATCH 2/2] btrfs-progs: Makefile: Add -Wimplicit-fallthrough

2019-10-21 Thread Marcos Paulo de Souza
From: Marcos Paulo de Souza 

Avoid introducing new cases of implicit fallthrough by having this flag
always set.

Signed-off-by: Marcos Paulo de Souza 
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 21bf2717..2f04e880 100644
--- a/Makefile
+++ b/Makefile
@@ -86,6 +86,7 @@ CFLAGS = $(SUBST_CFLAGS) \
 -D_XOPEN_SOURCE=700  \
 -fno-strict-aliasing \
 -fPIC \
+-Wimplicit-fallthrough \
 -I$(TOPDIR) \
 -I$(TOPDIR)/libbtrfsutil \
 $(DISABLE_WARNING_FLAGS) \
-- 
2.23.0



[PATCH 1/2] btrfs-progs: utils: Replace __attribute__(fallthrough)

2019-10-21 Thread Marcos Paulo de Souza
From: Marcos Paulo de Souza 

When compiling with clang, this warning is shown:

common/utils.c:404:3: warning: declaration does not declare anything 
[-Wmissing-declarations]
__attribute__ ((fallthrough));

This attribute seems to silence the same warning in GCC. Changing this
attribute with /* fallthrough */ fixes the warning for both gcc and
clang.

Signed-off-by: Marcos Paulo de Souza 
---
 common/utils.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/common/utils.c b/common/utils.c
index 2cf15c33..a88336b3 100644
--- a/common/utils.c
+++ b/common/utils.c
@@ -401,15 +401,15 @@ int pretty_size_snprintf(u64 size, char *str, size_t 
str_size, unsigned unit_mod
case UNITS_TBYTES:
base *= mult;
num_divs++;
-   __attribute__ ((fallthrough));
+   /* fallthrough */
case UNITS_GBYTES:
base *= mult;
num_divs++;
-   __attribute__ ((fallthrough));
+   /* fallthrough */
case UNITS_MBYTES:
base *= mult;
num_divs++;
-   __attribute__ ((fallthrough));
+   /* fallthrough */
case UNITS_KBYTES:
num_divs++;
break;
@@ -1135,14 +1135,14 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 
data_profile,
default:
case 4:
allowed |= BTRFS_BLOCK_GROUP_RAID10;
-   __attribute__ ((fallthrough));
+   /* fallthrough */
case 3:
allowed |= BTRFS_BLOCK_GROUP_RAID6;
-   __attribute__ ((fallthrough));
+   /* fallthrough */
case 2:
allowed |= BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
BTRFS_BLOCK_GROUP_RAID5;
-   __attribute__ ((fallthrough));
+   /* fallthrough */
case 1:
allowed |= BTRFS_BLOCK_GROUP_DUP;
}
-- 
2.23.0



[PATCH 0/2] btrfs-progs: Setting implicit-fallthrough by default

2019-10-21 Thread Marcos Paulo de Souza
From: Marcos Paulo de Souza 

While compiling btrfs-progs using clang I found an issue using
__attribute__(fallthrough), which does not seems to work in clang.

To solve this issue, the code was changed to use /* fallthrough */, which is the
same notation adopted by linux kernel.

Once these places were changed, -Wimplicit-fallthrough was set in Makefile, to
avoid further implicit-fallthrough cases being added in the future.

Marcos Paulo de Souza (2):
  btrfs-progs: utils: Replace __attribute__(fallthrough)
  btrfs-progs: Makefile: Add -Wimplicit-fallthrough

 Makefile   |  1 +
 common/utils.c | 12 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

-- 
2.23.0



Re: [RFC PATCH v2 2/5] fs: add RWF_ENCODED for reading/writing compressed data

2019-10-21 Thread Aleksa Sarai
On 2019-10-21, Darrick J. Wong  wrote:
> On Tue, Oct 22, 2019 at 05:38:31AM +1100, Aleksa Sarai wrote:
> > On 2019-10-21, Darrick J. Wong  wrote:
> > > On Tue, Oct 15, 2019 at 11:42:40AM -0700, Omar Sandoval wrote:
> > > > From: Omar Sandoval 
> > > > 
> > > > Btrfs supports transparent compression: data written by the user can be
> > > > compressed when written to disk and decompressed when read back.
> > > > However, we'd like to add an interface to write pre-compressed data
> > > > directly to the filesystem, and the matching interface to read
> > > > compressed data without decompressing it. This adds support for
> > > > so-called "encoded I/O" via preadv2() and pwritev2().
> > > > 
> > > > A new RWF_ENCODED flags indicates that a read or write is "encoded". If
> > > > this flag is set, iov[0].iov_base points to a struct encoded_iov which
> > > > is used for metadata: namely, the compression algorithm, unencoded
> > > > (i.e., decompressed) length, and what subrange of the unencoded data
> > > > should be used (needed for truncated or hole-punched extents and when
> > > > reading in the middle of an extent). For reads, the filesystem returns
> > > > this information; for writes, the caller provides it to the filesystem.
> > > > iov[0].iov_len must be set to sizeof(struct encoded_iov), which can be
> > > > used to extend the interface in the future. The remaining iovecs contain
> > > > the encoded extent.
> > > > 
> > > > Filesystems must indicate that they support encoded writes by setting
> > > > FMODE_ENCODED_IO in ->file_open().
> > > > 
> > > > Signed-off-by: Omar Sandoval 
> > > > ---
> > > >  include/linux/fs.h  | 14 +++
> > > >  include/uapi/linux/fs.h | 26 -
> > > >  mm/filemap.c| 82 ++---
> > > >  3 files changed, 108 insertions(+), 14 deletions(-)
> > > > 
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index e0d909d35763..54681f21e05e 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, 
> > > > loff_t offset,
> > > >  /* File does not contribute to nr_files count */
> > > >  #define FMODE_NOACCOUNT((__force fmode_t)0x2000)
> > > >  
> > > > +/* File supports encoded IO */
> > > > +#define FMODE_ENCODED_IO   ((__force fmode_t)0x4000)
> > > > +
> > > >  /*
> > > >   * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
> > > >   * that indicates that they should check the contents of the iovec are
> > > > @@ -314,6 +317,7 @@ enum rw_hint {
> > > >  #define IOCB_SYNC  (1 << 5)
> > > >  #define IOCB_WRITE (1 << 6)
> > > >  #define IOCB_NOWAIT(1 << 7)
> > > > +#define IOCB_ENCODED   (1 << 8)
> > > >  
> > > >  struct kiocb {
> > > > struct file *ki_filp;
> > > > @@ -3088,6 +3092,11 @@ extern int sb_min_blocksize(struct super_block 
> > > > *, int);
> > > >  extern int generic_file_mmap(struct file *, struct vm_area_struct *);
> > > >  extern int generic_file_readonly_mmap(struct file *, struct 
> > > > vm_area_struct *);
> > > >  extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> > > > +struct encoded_iov;
> > > > +extern int generic_encoded_write_checks(struct kiocb *, struct 
> > > > encoded_iov *);
> > > > +extern ssize_t check_encoded_read(struct kiocb *, struct iov_iter *);
> > > > +extern int import_encoded_write(struct kiocb *, struct encoded_iov *,
> > > > +   struct iov_iter *);
> > > >  extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > > struct file *file_out, loff_t pos_out,
> > > > loff_t *count, unsigned int 
> > > > remap_flags);
> > > > @@ -3403,6 +3412,11 @@ static inline int kiocb_set_rw_flags(struct 
> > > > kiocb *ki, rwf_t flags)
> > > > return -EOPNOTSUPP;
> > > > ki->ki_flags |= IOCB_NOWAIT;
> > > > }
> > > > +   if (flags & RWF_ENCODED) {
> > > > +   if (!(ki->ki_filp->f_mode & FMODE_ENCODED_IO))
> > > > +   return -EOPNOTSUPP;
> > > > +   ki->ki_flags |= IOCB_ENCODED;
> > > > +   }
> > > > if (flags & RWF_HIPRI)
> > > > ki->ki_flags |= IOCB_HIPRI;
> > > > if (flags & RWF_DSYNC)
> > > > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> > > > index 379a612f8f1d..ed92a8a257cb 100644
> > > > --- a/include/uapi/linux/fs.h
> > > > +++ b/include/uapi/linux/fs.h
> > > > @@ -284,6 +284,27 @@ struct fsxattr {
> > > >  
> > > >  typedef int __bitwise __kernel_rwf_t;
> > > >  
> > > > +enum {
> > > > +   ENCODED_IOV_COMPRESSION_NONE,
> > > > +   ENCODED_IOV_COMPRESSION_ZLIB,
> > > > +   ENCODED_IOV_COMPRESSION_LZO,
> > > > +   ENCODED_IOV_COMPRESSION_ZSTD,
> > > > +   ENCODED_IOV_COMPRESSION_TYPES = 

Re: [PATCH v2 0/7] btrfs-progs: Support for BG_TREE feature

2019-10-21 Thread Qu Wenruo
BTW, there is one important compatibility problem related to all the BGI
related features.

Although I'm putting the BG_TREE feature as incompatible feature, but in
theory, it should be RO compatible.

As except extent/bg tree, we *should* read the fs without any problem.

But the problem is, current btrfs mount (including btrfs-check) still
need to go through the block group item search, even for permanent RO mount.

This get my rescue mount option patchset to be involved.
If we have such skip_bg feature earlier, we can completely afford to
make all these potential features as RO compatible.


Now my question is,  should we put this feature still as incompatible
feature?

Thanks,
Qu


On 2019/10/22 上午8:49, Qu Wenruo wrote:
> 
> 
> On 2019/10/21 下午11:44, David Sterba wrote:
>> On Sat, Oct 19, 2019 at 08:04:51AM +0800, Qu Wenruo wrote:
>>> That's wonderful.
>>> Although I guess my patchset should provide the hint of where to modify
>>> the code, since there are only a limited number of places we modify
>>> block group item.
>>
>> I indeed started at your patchset and grepped fro BG_TREE, adding
>> another branch.
>>
 We agree on the point that the block group items must be packed. The key
 approach should move the new BGI to the beginning, ie. key type is
 smaller than anything that appears in the extent tree. I chose 100 for
 the prototype, it could change.

 To keep changes to minimum, the new BGI uses the same block group item,
 so the only difference then becomes how we search for the items.
>>>
>>> If we're introducing new block group item, I hope to do a minor change.
>>>
>>> Remove the chunk_objectid member, or even flags. to make it more
>>> compact. So that you can make the BGI subtree even smaller.
>>
>> Yeah that can be done.
>>
>>> But I guess since you don't want to modify the BGI structure, and keep
>>> the code modification minimal, it may not be a good idea right now.
>>
>> As long as the changes are bearable, eg. a minor refactoring of block
>> group access (the cache.key serving a as offset and length) is fine. And
>> if we can make the b-tree item more then let's do it.
>>
 Packing of the items is done by swapping the key objectid and offset.

 Normal BGI has bg.start == key.objectid and bg.length == key.offset. As
 the objectid is the thing that scatters the items all over the tree.

 So the new BGI has bg.length == key.objectid and bg.start == key.offset.
 As most of block groups are of same size, or from a small set, they're
 packed.
>>>
>>> That doesn't look optimized enough.
>>>
>>> bg.length can be at 1G, that means if extents starts below 1G can still
>>> be before BGIs.
>>
>> This shold not be a big problem, the items are still grouped togethers.
>> Mkfs does 8M, we can have 256M or 1G. On average there could be several
>> packed groups, which I think is fine and the estimated overhead would be
>> a few more seeks.
>>
>>> I believe we should have a fixed objectid for this new BGIs, so that
>>> they are ensured to be at the beginning of extent tree.
>>
>> That was my idea too, but that also requires to add one more member to
>> the item to track the length. Currently the key is saves the bytes. With
>> the proposed changes to drop chunk_objectid, the overall size of BGI
>> would not increase so this still sounds ok. And all the problems with
>> searching would go away.
>>
 The nice thing is that a lot of code can be shared between BGI and new
 BGI, just needs some care with searches, inserts and search key
 advances.
>>>
>>> Exactly, but since we're introducing a new key type, I prefer to perfect
>>> it. Not only change the key, but also the block group item structure to
>>> make it more compact.
>>>
>>> Although from the design aspect, it looks like BG tree along with new
>>> BGI would be the best design.
>>>
>>> New BG key goes (bg start, NEW BGI TYPE, used) no data. It would provide
>>> the most compact on-disk format.
>>
>> That's very compact. Given the 'bg start' we can't use the same for the
>> extent tree item.
>>
 This would be easy with the bg_tree, because we'd know that all items in
 the tree are just the block group items. Some sort of enumeration could
 work for bg_key too, but I don't have something solid.
>>>
>>> Why not fixed objectid for BGI and just ignore the bg.len part?
>>
>> I wanted to avoid storing a useless number, it costs 8 bytes per item,
>> and the simple swap of objectid/offset was first idea how to avoid it.
>>
>>> We have chunk<->BGI verification code, no bg.len is not a problem at
>>> all, we can still make sure chunk<->bg is 1:1 mapped and still verify
>>> the bg.used.
>>
>> This is all great, sure, and makes the extensions easy. Let's try to
>> work out best for each approach we have so far. Having a separate tree
>> for block groups may open some future options.
> 
> Great, I'll explore the most compact key only method with BG_TREE.
> 
> And maybe also try the fi

Re: [PATCH man-pages] Document encoded I/O

2019-10-21 Thread Amir Goldstein
On Mon, Oct 21, 2019 at 9:54 PM Omar Sandoval  wrote:
>
> On Mon, Oct 21, 2019 at 09:18:13AM +0300, Amir Goldstein wrote:
> > CC: Ted
> >
> > What ever happened to read/write ext4 encrypted data API?
> > https://marc.info/?l=linux-ext4&m=145030599010416&w=2
> >
> > Can we learn anything from the ext4 experience to improve
> > the new proposed API?
>
> I wasn't aware of these patches, thanks for pointing them out. Ted, do
> you have any thoughts about making this API work for fscrypt?
>
> > On Wed, Oct 16, 2019 at 12:29 AM Omar Sandoval  wrote:
> > >
> > > From: Omar Sandoval 
> > >
> > > This adds a new page, rwf_encoded(7), providing an overview of encoded
> > > I/O and updates fcntl(2), open(2), and preadv2(2)/pwritev2(2) to
> > > reference it.
> > >
> > > Signed-off-by: Omar Sandoval 
> > > ---
> > >  man2/fcntl.2   |  10 +-
> > >  man2/open.2|  13 ++
> > >  man2/readv.2   |  46 +++
> > >  man7/rwf_encoded.7 | 297 +
> > >  4 files changed, 365 insertions(+), 1 deletion(-)
> > >  create mode 100644 man7/rwf_encoded.7
> > >
> > > diff --git a/man2/fcntl.2 b/man2/fcntl.2
> > > index fce4f4c2b..76fe9cc6f 100644
> > > --- a/man2/fcntl.2
> > > +++ b/man2/fcntl.2
> > > @@ -222,8 +222,9 @@ On Linux, this command can change only the
> > >  .BR O_ASYNC ,
> > >  .BR O_DIRECT ,
> > >  .BR O_NOATIME ,
> > > +.BR O_NONBLOCK ,
> > >  and
> > > -.B O_NONBLOCK
> > > +.B O_ENCODED
> > >  flags.
> > >  It is not possible to change the
> > >  .BR O_DSYNC
> > > @@ -1803,6 +1804,13 @@ Attempted to clear the
> > >  flag on a file that has the append-only attribute set.
> > >  .TP
> > >  .B EPERM
> > > +Attempted to set the
> > > +.B O_ENCODED
> > > +flag and the calling process did not have the
> > > +.B CAP_SYS_ADMIN
> > > +capability.
> > > +.TP
> > > +.B EPERM
> > >  .I cmd
> > >  was
> > >  .BR F_ADD_SEALS ,
> > > diff --git a/man2/open.2 b/man2/open.2
> > > index b0f485b41..cdd3c549c 100644
> > > --- a/man2/open.2
> > > +++ b/man2/open.2
> > > @@ -421,6 +421,14 @@ was followed by a call to
> > >  .BR fdatasync (2)).
> > >  .IR "See NOTES below" .
> > >  .TP
> > > +.B O_ENCODED
> > > +Open the file with encoded I/O permissions;
> >
> > 1. I find the name of the flag confusing.
> > Yes, most people don't read documentation so carefully (or at all)
> > so they will assume O_ENCODED will affect read/write or that it
> > relates to RWF_ENCODED in a similar way that O_SYNC relates
> > to RWF_SYNC (i.e. logical OR and not logical AND).
> >
> > I am not good at naming and to prove it I will propose:
> > O_PROMISCUOUS, O_MAINTENANCE, O_ALLOW_ENCODED
>
> Agreed, the name is misleading. I can't think of anything better than
> O_ALLOW_ENCODED, so I'll go with that unless someone comes up with
> something better :)
>
> > 2. While I see no harm in adding O_ flag to open(2) for this
> > use case, I also don't see a major benefit in adding it.
> > What if we only allowed setting the flag via fcntl(2) which returns
> > an error on old kernels?
> > Since unlike most O_ flags, O_ENCODED does NOT affect file
> > i/o without additional opt-in flags, it is not standard anyway and
> > therefore I find that setting it only via fcntl(2) is less error prone.
>
> If I make this fcntl-only, then it probably shouldn't be through
> F_GETFL/F_SETFL (it'd be pretty awkward for an O_ flag to not be valid
> for open(), and also awkward to mix some non-O_ flag with O_ flags for
> F_GETFL/F_SETFL). So that leaves a couple of options:
>
> 1. Get/set it with F_GETFD/F_SETFD, which is currently only used for
>FD_CLOEXEC. That also silently ignores unknown flags, but as with the
>O_ flag option, I don't think that's a big deal for FD_ALLOW_ENCODED.
> 2. Add a new fcntl command (F_GETFD2/F_SETFD2?). This seems like
>overkill to me.
>
> However, both of these options are annoying to implement. Ideally, we
> wouldn't have to add another flags field to struct file. But, to reuse
> f_flags, we'd need to make sure that FD_ALLOW_ENCODED doesn't collide
> with other O_ flags, and we'd probably want to hide it from F_GETFL. At
> that point, it might as well be an O_ flag.
>
> It seems to me that it's more trouble than it's worth to make this not
> an O_ flag, but please let me know if you see a nice way to do so.
>

No, I see why you choose to add the flag to open(2).
I have no objection.

I once had a crazy thought how to add new open flags
in a non racy manner without adding a new syscall,
but as you wrote, this is not relevant for O_ALLOW_ENCODED.

Something like:

/*
 * Old kernels silently ignore unsupported open flags.
 * New kernels that gets __O_CHECK_NEWFLAGS do
 * the proper checking for unsupported flags AND set the
 * flag __O_HAVE_NEWFLAGS.
 */
#define O_FLAG1 __O_CHECK_NEWFLAGS|__O_FLAG1
#define O_HAVE_FLAG1 __O_HAVE_NEWFLAGS|__O_FLAG1

fd = open(path, O_FLAG1);
if (fd < 0)
return -errno;
flags = fcntl(fd, F_GETFL, 0);
if (flags < 0)
return flags;
if ((flags & O_HAVE_FLAG1