Re: [RFC PATCH 1/6] btrfs-progs: splitcmd-gen.sh: create btrfs- binaries for selected subcommands

2018-08-29 Thread Misono Tomohiro
On 2018/08/30 2:24, Axel Burri wrote:
> Create separate binaries for each subcommand ("btrfs foo bar").
> Least invasive approach, generate c-files for each command:
> 
> # ./splitcmd-gen.sh
> # make V=1 btrfs-subvolume-show
> # make V=1 btrfs-send
> # [...]
> 
> Alternative approach: instead of including the c-file, link with obj
> in Makefile, e.g.:
> 
> btrfs_subvolume_show_objects = cmds-subvolume.o
> btrfs_send_objects = cmds-send.o
> [...]
> 
> This implies adaptions in cmds-subvolume.c (and others):
> 
> -static int cmd_filesystem_show(int argc, char **argv)
> +int cmd_filesystem_show(int argc, char **argv)
> 
> If they are defined non-static, we could probably simplify further and
> add `-Wl,-eentry` flags (changing entry point from "main" to "entry").
> 
> With this, and if handle_command_group() was declared in some library
> instead of btrfs.c, we would get rid of generated files completely.
> 
> Signed-off-by: Axel Burri 
> ---
>  splitcmd-gen.sh | 70 
> +
>  splitcmd.c.in   | 17 ++
>  2 files changed, 87 insertions(+)
>  create mode 100755 splitcmd-gen.sh
>  create mode 100644 splitcmd.c.in
> 
> diff --git a/splitcmd-gen.sh b/splitcmd-gen.sh
> new file mode 100755
> index ..4d2e0509
> --- /dev/null
> +++ b/splitcmd-gen.sh
> @@ -0,0 +1,70 @@
> +#!/bin/bash
> +
> +#
> +# Generate c-files for btrfs subcommands defined below
> +#
> +
> +# Notes on linux capabilities:
> +#
> +# btrfs-subvolume-show, btrfs-subvolume-list, btrfs-send:
> +#  - CAP_FOWNER is only needed for O_NOATIME flag in open() system calls
> +#  - why CAP_SYS_ADMIN? shouldn't CAP_DAC_READ_SEARCH be enough?

Hello,

Not directly related this series and just FYI,
I'm working to allow sub show/list to non-privileged user as long
as he can access to the subvolume:
  https://www.spinics.net/lists/linux-btrfs/msg79285.html

Hopefully this will be merged to master in near future
(any comments from user/dev is welcome).

Thanks,
Misono

> +#
> +# btrfs-receive:
> +#  - dependent on send-stream (see cmds-receive.c: "send_ops"):
> +#CAP_CHOWN, CAP_MKNOD, CAP_SETFCAP (for "lsetxattr")
> +#
> +# btrfs-filesystem-usage:
> +#  - CAP_SYS_ADMIN is for BTRFS_IOC_TREE_SEARCH and BTRFS_IOC_FS_INFO
> +#in order to provide full level of detail, see btrfs-filesystem(8)
> +
> +
> +makefile_out="Makefile.install_setcap"
> +
> +splitcmd_list=""
> +setcap_lines=""
> +
> +function gen_splitcmd {
> +local name="$1"
> +local dest="${1}.c"
> +local cfile="$2"
> +local entry="$3"
> +local caps="$4"
> +echo "generating: ${dest} (cfile=${cfile}, entry=${entry})"
> +echo -e "/*\n * ${name}\n *\n * GENERATED BY splitcmd-gen.sh\n */\n" > 
> $dest
> +sed -e "s|@BTRFS_SPLITCMD_CFILE_INCLUDE@|${cfile}|g" \
> +-e "s|@BTRFS_SPLITCMD_ENTRY@|${entry}|g" \
> +splitcmd.c.in >> $dest
> +}
> +
> +gen_splitcmd "btrfs-subvolume-show" \
> + "cmds-subvolume.c" "cmd_subvol_show" \
> + "cap_sys_admin,cap_fowner,cap_dac_read_search"
> +
> +gen_splitcmd "btrfs-subvolume-list" \
> + "cmds-subvolume.c" "cmd_subvol_list" \
> + "cap_sys_admin,cap_fowner,cap_dac_read_search"
> +
> +gen_splitcmd "btrfs-subvolume-snapshot" \
> + "cmds-subvolume.c" "cmd_subvol_snapshot" \
> + "cap_sys_admin,cap_fowner,cap_dac_override,cap_dac_read_search"
> +
> +gen_splitcmd "btrfs-subvolume-delete" \
> + "cmds-subvolume.c" "cmd_subvol_delete" \
> + "cap_sys_admin,cap_dac_override"
> +
> +gen_splitcmd "btrfs-send" \
> + "cmds-send.c" "cmd_send" \
> + "cap_sys_admin,cap_fowner,cap_dac_read_search"
> +
> +gen_splitcmd "btrfs-receive" \
> + "cmds-receive.c" "cmd_receive" \
> + 
> "cap_sys_admin,cap_fowner,cap_chown,cap_mknod,cap_setfcap,cap_dac_override,cap_dac_read_search"
> +
> +gen_splitcmd "btrfs-filesystem-usage" \
> + "cmds-fi-usage.c" "cmd_filesystem_usage" \
> + "cap_sys_admin"
> +
> +gen_splitcmd "btrfs-qgroup-destroy" \
> + "cmds-qgroup.c" "cmd_qgroup_destroy" \
> + "cap_sys_admin,cap_dac_override"
> diff --git a/splitcmd.c.in b/splitcmd.c.in
> new file mode 100644
> index ..aa07af9a
> --- /dev/null
> +++ b/splitcmd.c.in
> @@ -0,0 +1,17 @@
> +#include "@BTRFS_SPLITCMD_CFILE_INCLUDE@"
> +
> +/*
> + * Dummy object: used from second-level command groups (e.g. in
> + * "cmds-subvolume.c"), is never called in splitcmd executables.
> + */
> +int handle_command_group(const struct cmd_group *grp, int argc,
> +  char **argv)
> +{
> + exit(1);
> +}
> +
> +
> +int main(int argc, char **argv)
> +{
> + return @BTRFS_SPLITCMD_ENTRY@(argc, argv);
> +}
> 



Re: [PATCH v2] btrfs: Always check nocow for quota enabled case to make sure we won't reserve unnecessary data space

2018-08-29 Thread Qu Wenruo


On 2018/8/30 上午9:57, Misono Tomohiro wrote:
> On 2018/08/28 14:21, Qu Wenruo wrote:
>> On 2018/8/24 下午4:09, Misono Tomohiro wrote:
>> [snip]

 BTW, what's the possibility of such problem in your test environment?
>>>
>>> It's like one in several times.
>>> It may depend on hardware performance? (the machine is not so fast),
>>>
>>> I also noticed following warning happens too (not always):
>>>
>>
>> After digging into the case, it's more complex than just my patch.
>>
>> Firstly, we lacks a lot of underflow check when modifying bytes_may_use.
>> So we need to do all the underflow detection for every modifier of
>> bytes_may_use.
>>
>> Secondly, btrfs_cross_ref_exist() check makes NODATACOW check in
>> __btrfs_buffered_write() unreliable.
>>
>> For the following case, at __btrfs_buffered_write() time we're pretty
>> sure we could do NODATACOW, but at sync time, due to cloned range,
>> btrfs_cross_ref_exist() would detect reflinked prealloc extent, then
>> falls back to CoW, and finally cause bytes_may_use underflow:
>>
>> ---
>> mkfs.btrfs -f $dev > $full_log
>>
>> mount $dev $mnt -o nospace_cache
>> btrfs quota enable $mnt
>> btrfs quota rescan -w $mnt
>>
>> xfs_io -f -c "falloc 0 2M" $mnt/file1 > /dev/null
>> xfs_io -c "pwrite -b 1M 0 1M" $mnt/file1 > /dev/null
>> xfs_io -c "reflink $mnt/file1 1M 4M 1M" $mnt/file1 > /dev/null
>> sync
>> 
> 
> Thanks for the explanation. I will try to understand the relevant code.

Feel free to ask if there is anything I forgot to explain.

> 
>>
>> Even without my patch, the "pwrite" command is still CoWed, which could
>> be avoided.
>> And that's the reason my patch is causing the underflow.
>>
>> To fix this, we need more accurate btrfs_cross_ref_exist() check, not
>> only for @disk_bytenr but also check @len.
>>
>> Or we could try to flush the whole inode in clone_range() so we could go
>> through NOCOW routine before clone really happens.
> 
> So as your RFC patch does not work, the option is first one?

Yes, and it may be more complex to do that, while even fixed we could
only get diminishing return.

Thanks,
Qu

> 
> Thanks,
> Misono
> 
>>
>> Thanks,
>> Qu
>>
> 



signature.asc
Description: OpenPGP digital signature


[PATCH] btrfs: extent-tree.c: Remove redundant variable from btrfs_cross_ref_exist()

2018-08-29 Thread Misono Tomohiro
Since commit d7df2c796d7e ("Btrfs attach delayed ref updates to
delayed ref heads"), check_delaed_ref() won't return -ENOENT.

In btrfs_cross_ref_exist(), two variable 'ret' and 'ret2' is
originally used to handle -ENOENT error case.

Since the code is not needed anymore, let's just remove 'ret2'.

Signed-off-by: Misono Tomohiro 
---
 fs/btrfs/extent-tree.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2d9074295d7f..0c87472d5719 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3139,7 +3139,6 @@ int btrfs_cross_ref_exist(struct btrfs_root *root, u64 
objectid, u64 offset,
 {
struct btrfs_path *path;
int ret;
-   int ret2;
 
path = btrfs_alloc_path();
if (!path)
@@ -3151,17 +3150,10 @@ int btrfs_cross_ref_exist(struct btrfs_root *root, u64 
objectid, u64 offset,
if (ret && ret != -ENOENT)
goto out;
 
-   ret2 = check_delayed_ref(root, path, objectid,
+   ret = check_delayed_ref(root, path, objectid,
 offset, bytenr);
-   } while (ret2 == -EAGAIN);
+   } while (ret == -EAGAIN);
 
-   if (ret2 && ret2 != -ENOENT) {
-   ret = ret2;
-   goto out;
-   }
-
-   if (ret != -ENOENT || ret2 != -ENOENT)
-   ret = 0;
 out:
btrfs_free_path(path);
if (root->root_key.objectid == BTRFS_DATA_RELOC_TREE_OBJECTID)
-- 
2.17.1




Re: [PATCH v2] btrfs: Always check nocow for quota enabled case to make sure we won't reserve unnecessary data space

2018-08-29 Thread Misono Tomohiro
On 2018/08/28 14:21, Qu Wenruo wrote:
> On 2018/8/24 下午4:09, Misono Tomohiro wrote:
> [snip]
>>>
>>> BTW, what's the possibility of such problem in your test environment?
>>
>> It's like one in several times.
>> It may depend on hardware performance? (the machine is not so fast),
>>
>> I also noticed following warning happens too (not always):
>>
> 
> After digging into the case, it's more complex than just my patch.
> 
> Firstly, we lacks a lot of underflow check when modifying bytes_may_use.
> So we need to do all the underflow detection for every modifier of
> bytes_may_use.
> 
> Secondly, btrfs_cross_ref_exist() check makes NODATACOW check in
> __btrfs_buffered_write() unreliable.
> 
> For the following case, at __btrfs_buffered_write() time we're pretty
> sure we could do NODATACOW, but at sync time, due to cloned range,
> btrfs_cross_ref_exist() would detect reflinked prealloc extent, then
> falls back to CoW, and finally cause bytes_may_use underflow:
> 
> ---
> mkfs.btrfs -f $dev > $full_log
> 
> mount $dev $mnt -o nospace_cache
> btrfs quota enable $mnt
> btrfs quota rescan -w $mnt
> 
> xfs_io -f -c "falloc 0 2M" $mnt/file1 > /dev/null
> xfs_io -c "pwrite -b 1M 0 1M" $mnt/file1 > /dev/null
> xfs_io -c "reflink $mnt/file1 1M 4M 1M" $mnt/file1 > /dev/null
> sync
> 

Thanks for the explanation. I will try to understand the relevant code.

> 
> Even without my patch, the "pwrite" command is still CoWed, which could
> be avoided.
> And that's the reason my patch is causing the underflow.
> 
> To fix this, we need more accurate btrfs_cross_ref_exist() check, not
> only for @disk_bytenr but also check @len.
> 
> Or we could try to flush the whole inode in clone_range() so we could go
> through NOCOW routine before clone really happens.

So as your RFC patch does not work, the option is first one?

Thanks,
Misono

> 
> Thanks,
> Qu
> 



Re: How to erase a RAID1 ?

2018-08-29 Thread Qu Wenruo
[Forgot to Cc the list]

On 2018/8/29 下午10:04, Pierre Couderc wrote:
>
> On 08/29/2018 02:52 PM, Qu Wenruo wrote:
>>
>> On 2018/8/29 下午8:49, Pierre Couderc wrote:
>>> I want to reinstall a RAID1 btrfs system (wchis is now under debian
>>> stretch, and will be reinstalled in stretch).
>> If you still want to use btrfs, just umount the original fs, and
>>
>> # mkfs.btrfs -f 
>>
>> Then a completely new btrfs.
>>
> The problem is to "unmount" the RAID1 system..

If it's not the root fs, it could be unmount if there is no user
reading/writing or has any opened file.

>
> I have not found how to do that.
>
> I have found a solution by reinstalling under debian, it now asks if
> need to reformat thez btrfs disk...

If it's root fs, either go booting from another device, or just let the
installer (well, it's booting from memory/another device already) to
re-format the fs.

Thanks,
Qu

>




signature.asc
Description: OpenPGP digital signature


Re: [PATCH 6/6] btrfs-progs: rescue-super: Don't double free fs_devices

2018-08-29 Thread Qu Wenruo



On 2018/8/29 下午11:38, Nikolay Borisov wrote:
> 
> 
> On  3.08.2018 08:50, Qu Wenruo wrote:
>> [BUG]
>> During fuzz/007 we hit the following error:
>> --
>> == RUN MAYFAIL /home/adam/btrfs/btrfs-progs/btrfs rescue super-recover 
>> -y -v 
>> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-200409.raw.restored.scratch
>> ERROR: tree_root block unaligned: 33554431
>> ERROR: superblock checksum matches but it has invalid members
>> ERROR: tree_root block unaligned: 33554431
>> ERROR: superblock checksum matches but it has invalid members
>> ERROR: tree_root block unaligned: 33554431
>> ERROR: superblock checksum matches but it has invalid members
>> ERROR: failed to add chunk map start=12582912 len=8454144: -17 (File exists)
>> Couldn't read chunk tree
>> failed (ignored, ret=139): /home/adam/btrfs/btrfs-progs/btrfs rescue 
>> super-recover -y -v 
>> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-200409.raw.restored.scratch
>> mayfail: returned code 139 (SEGFAULT), not ignored
>> test failed for case 007-simple-super-recover
>> --
>>
>> [CAUSE]
>> In __open_ctree_fd(), if we have valid @open_ctree_flags and
>> btrfs_scan_fs_devices() successes without problem, no matter what
>> happens we will call btrfs_close_devices(), thus free all related
>> devices.
> 
> Why do you think it's _always_ going to be called? Looking into that
> function it seems this can happen if
> btrfs_setup_chunk_tree_and_device_map fails.

No need to reach btrfS_setup_chunk_tree_and_device_map().

As long as we could reach btrfs_open_devices(), no matter whether if
succeeded or not, we will call btrfs_close_devices() to cleanup the
@fs_devices.

If btrfs_open_devices() fails, we goto fail label in
btrfs_open_devices() which calls btrfs_close_devices().

Or we succeeded in btrfs_open_devices(), then next error label is
out_devices in __open_ctree_fd(), and will call btrfs_close_devices() too.

And since in super recovery we have already called
btrfs_scan_fs_devices() so in __open_ctree_fd() it shouldn't fail.

So either we will hit btrfs_close_devices(), no matter whatever happens.

>>
>> In super-recover, before we call open_ctree(), we have called
>> btrfs_scan_fs_devices() already, so btrfs_scan_fs_devices() should not
>> fail in open_ctree(), fs_devices will always be freed in open_ctree() or
>> close_ctree().
> 
> Isn't the actual issue just that we call close_ctree. So the actual life
> time of fs_devices is :

No, no need to call close_ctree().
Just as described above, even failed __open_ctree_fd() could call
btrfs_close_devices() and free @fs_devices.

Thanks,
Qu

> 
> 1. Create in btrfs_scan_fs_devices called from btrfs_recover_superblocks
> 2. All other references to those fs_devices will just return the same
> reference.
> 3. Calling close_ctree frees fs_devices.
> 
>>
>> [FIX]
>> So in super-recover.c, we should not call btrfs_close_devices(), or we
>> will find fs_devices->list get poisoned, and trigger segfault when
>> exiting.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  super-recover.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/super-recover.c b/super-recover.c
>> index 880fd7712546..86b3df9867dc 100644
>> --- a/super-recover.c
>> +++ b/super-recover.c
>> @@ -292,9 +292,6 @@ int btrfs_recover_superblocks(const char *dname,
>>  no_recover:
>>  recover_err_str(ret);
>>  free_recover_superblock();
>> -/* check if we have freed fs_devices in close_ctree() */
>> -if (!root)
>> -btrfs_close_devices(recover.fs_devices);
>>  return ret;
>>  }
>>  
>>


Re: [PATCH 3/6] btrfs-progs: Don't report dirty leaked eb using BUG_ON

2018-08-29 Thread Qu Wenruo



On 2018/8/29 下午10:52, Nikolay Borisov wrote:
> 
> 
> On  3.08.2018 08:50, Qu Wenruo wrote:
>> Another BUG_ON() during fuzz/003:
>> --
>> == RUN MAYFAIL /home/adam/btrfs/btrfs-progs/btrfs check --init-csum-tree 
>> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-161821.raw.restored
>> [1/7] checking root items
>> Fixed 0 roots.
>> [2/7] checking extents
>> parent transid verify failed on 4198400 wanted 14 found 1114126
>> parent transid verify failed on 4198400 wanted 14 found 1114126
>> Ignoring transid failure
>> owner ref check failed [4198400 4096]
>> repair deleting extent record: key [4198400,169,0]
>> adding new tree backref on start 4198400 len 4096 parent 0 root 5
>> Repaired extent references for 4198400
>> ref mismatch on [4222976 4096] extent item 1, found 0
>> backref 4222976 root 7 not referenced back 0x5617f8ecf780
>> incorrect global backref count on 4222976 found 1 wanted 0
>> backpointer mismatch on [4222976 4096]
>> owner ref check failed [4222976 4096]
>> repair deleting extent record: key [4222976,169,0]
>> Repaired extent references for 4222976
>> [3/7] checking free space cache
>> [4/7] checking fs roots
>> parent transid verify failed on 4198400 wanted 14 found 1114126
>> Ignoring transid failure
>> Wrong generation of child node/leaf, wanted: 1114126, have: 14
>> root 5 missing its root dir, recreating
>> parent transid verify failed on 4198400 wanted 14 found 1114126
>> Ignoring transid failure
>> ERROR: child eb corrupted: parent bytenr=4222976 item=0 parent level=1 child 
>> level=2
>> ERROR: errors found in fs roots
>> extent buffer leak: start 4222976 len 4096
>> extent_io.c:611: free_extent_buffer_internal: BUG_ON `eb->flags & 
>> EXTENT_DIRTY` triggered, value 1
>> failed (ignored, ret=134): /home/adam/btrfs/btrfs-progs/btrfs check 
>> --init-csum-tree 
>> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-161821.raw.restored
>> mayfail: returned code 134 (SIGABRT), not ignored
>> test failed for case 003-multi-check-unmounted
>> --
>>
>> Since we're shifting to using btrfs_abort_transaction() in btrfs-progs,
>> it will be more and more common to see dirty leaked eb.
>> Instead of BUG_ON(), we only needs to report it as warning.
> 
> 
> So how are such leaked extents supposed to be cleaned? So when
> transaction_aborted is set we just return some errors from various
> functions but I don't see how modified extents in the transaction are freed?

They're freed by extent_io_tree_cleanup(), called by the following call
trace:
close_ctree_fs_info()
|- btrfs_cleanup_all_caches()
   |- extent_io_tree_cleanup(_info->extent_cache)
  |- free_extent_buffer_nocache()

And inside extent_io_tree_cleanup(), it's also where we do leaked extent
buffer detection.

Thanks,
Qu

>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  extent_io.c | 6 +-
>>  1 file changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/extent_io.c b/extent_io.c
>> index 198492699438..b8510b0ae94e 100644
>> --- a/extent_io.c
>> +++ b/extent_io.c
>> @@ -608,7 +608,11 @@ static void free_extent_buffer_internal(struct 
>> extent_buffer *eb, bool free_now)
>>  eb->refs--;
>>  BUG_ON(eb->refs < 0);
>>  if (eb->refs == 0) {
>> -BUG_ON(eb->flags & EXTENT_DIRTY);
>> +if (eb->flags & EXTENT_DIRTY) {
>> +warning(
>> +"dirty eb leak (aborted trans): start %llu len %u",
>> +eb->start, eb->len);
>> +}
>>  list_del_init(>recow);
>>  if (eb->flags & EXTENT_BUFFER_DUMMY || free_now)
>>  free_extent_buffer_final(eb);
>>


Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-08-29 Thread Austin S. Hemmelgarn

On 2018-08-29 13:24, Axel Burri wrote:

This patch allows to build distinct binaries for specific btrfs
subcommands, e.g. "btrfs-subvolume-show" which would be identical to
"btrfs subvolume show".


Motivation:

While btrfs-progs offer the all-inclusive "btrfs" command, it gets
pretty cumbersome to restrict privileges to the subcommands [1].
Common approaches are to either setuid root for "/sbin/btrfs" (which
is not recommended at all), or to write sudo rules for each
subcommand.

Separating the subcommands into distinct binaries makes it easy to set
elevated privileges using capabilities(7) or setuid. A typical use
case where this is needed is when it comes to automated scripts,
e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh.
Let me start by saying I think this is a great idea to have as an 
option, and that the motivation is a particularly good one.


I've posted my opinions on your two open questions below, but there's 
two other comments I'd like to make:


* Is there some particular reason that this only includes the commands 
it does, and _hard codes_ which ones it works with?  if we just do 
everything instead of only the stuff we think needs certain 
capabilities, then we can auto-generate the list of commands to be 
processed based on function names in the C files, and it will 
automatically pick up any newly added commands.  At the very least, it 
could still parse through the C files and look for tags in the comments 
for the functions to indicate which ones need to be processed this way. 
Either case will make it significantly easier to add new commands, and 
would also better justify the overhead of shipping all the files 
pre-generated (because there would be much more involved in 
pre-generating them).


* While not essential, it would be really neat to have the `btrfs` 
command detect if an associated binary exists for whatever command was 
just invoked, and automatically exec that (possibly with some 
verification) instead of calling the command directly so that desired 
permissions are enforced.  This would mitigate the need for users to 
remember different command names depending on execution context.



Description:

Patch 1 adds a template as well as a generator shell script for the
splitted subcommands.

Patch 2 adds the generated subcommand source files.

Patch 3-5 adds a "install-splitcmd-setcap" make target, with different
approaches (either hardcoded in Makefile, or more generically by
including "Makefile.install_setcap" generated by "splitcmd-gen.sh").


Open Questions:

1. "make install-splitcmd-setcap" installs the binaries with hardcoded
group "btrfs". This needs to be configurable (how?). Another approach
would be to not set the group at all, and leave this to the user or
distro packaging script.
Leave it to the user or distro.  It's likely to end up standardized on 
the name 'btrfs', but it should be agnostic of that.


2. Instead of the "install-splitcmd-setcap" make target, we could
introduce a "configure --enable-splitted-subcommands" option, which
would simply add all splitcmd binaries to the "all" and "install"
targets without special treatment, and leave the setcap stuff to the
user or distro packaging script (at least in gentoo, this needs to be
specified using the "fcaps" eclass anyways [5]).
A bit of a nitpick, but 'split' is the proper past tense of the word 
'split', it's one of those exceptions that English has all over the 
place.  Even aside from that though, I think `separate` sounds more 
natural for the configure option, or better yet, just make it 
`--enable-fscaps` like most other packages do.


That aside, I think having a configure option is the best way to do 
this, it makes it very easy for distro build systems to handle it 
because this is what they're used to doing anyway.  It also makes it a 
bit easier on the user, because it just becomes `make` to build 
whichever version you want installed.


[RFC PATCH 5/6] btrfs-progs: Makefile: move progs_splitcmd variable to Makefile.install_setcap

2018-08-29 Thread Axel Burri
Signed-off-by: Axel Burri 
---
 Makefile| 17 -
 Makefile.install_setcap |  2 ++
 splitcmd-gen.sh | 17 ++---
 3 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/Makefile b/Makefile
index acf5a677..3f16434c 100644
--- a/Makefile
+++ b/Makefile
@@ -232,22 +232,13 @@ progs_install =
 progs_build =
 endif
 
-# split-command executables, generated by splitcmd-gen.sh
-progs_splitcmd = btrfs-send \
-   btrfs-receive \
-   btrfs-subvolume-list \
-   btrfs-subvolume-show \
-   btrfs-subvolume-snapshot \
-   btrfs-subvolume-delete \
-   btrfs-filesystem-usage \
-   btrfs-qgroup-destroy
-
-progs_install_splitcmd = $(progs_splitcmd)
+# defines "progs_splitcmd" as well as "btrfs_*_fcaps" variables
+# used by "install-splitcmd-setcap-%" below
+include Makefile.install_setcap
 
 INSTALL_SETCAP_FLAGS = -m710 -gbtrfs
 
-# defines btrfs_*_caps; used by "install-splitcmd-setcap-%" below
-include Makefile.install_setcap
+progs_install_splitcmd = $(progs_splitcmd)
 
 # external libs required by various binaries; for btrfs-foo,
 # specify btrfs_foo_libs = ; see $($(subst...)) rules below
diff --git a/Makefile.install_setcap b/Makefile.install_setcap
index 7705db74..e1efb686 100644
--- a/Makefile.install_setcap
+++ b/Makefile.install_setcap
@@ -1,5 +1,7 @@
 # capabilities(7) for splitcmd executables
 
+progs_splitcmd = btrfs-subvolume-show btrfs-subvolume-list 
btrfs-subvolume-snapshot btrfs-subvolume-delete btrfs-send btrfs-receive 
btrfs-filesystem-usage btrfs-qgroup-destroy
+
 btrfs_subvolume_show_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
 btrfs_subvolume_list_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
 btrfs_subvolume_snapshot_fcaps = 
"cap_sys_admin,cap_fowner,cap_dac_override,cap_dac_read_search"
diff --git a/splitcmd-gen.sh b/splitcmd-gen.sh
index d34c5cbd..7492594f 100755
--- a/splitcmd-gen.sh
+++ b/splitcmd-gen.sh
@@ -36,12 +36,14 @@ function gen_splitcmd {
 -e "s|@BTRFS_SPLITCMD_ENTRY@|${entry}|g" \
 splitcmd.c.in >> $dest
 
-echo "${name//-/_}_fcaps = \"${caps}\"" >> $makefile_out
+splitcmd_list="${splitcmd_list} ${name}";
+setcap_lines="${setcap_lines}\n${name//-/_}_fcaps = \"${caps}\""
 }
 
-echo "generating: ${makefile_out}"
-echo -e "# capabilities(7) for splitcmd executables\n" > $makefile_out
 
+#
+# generate c-files
+#
 gen_splitcmd "btrfs-subvolume-show" \
  "cmds-subvolume.c" "cmd_subvol_show" \
  "cap_sys_admin,cap_fowner,cap_dac_read_search"
@@ -73,3 +75,12 @@ gen_splitcmd "btrfs-filesystem-usage" \
 gen_splitcmd "btrfs-qgroup-destroy" \
  "cmds-qgroup.c" "cmd_qgroup_destroy" \
  "cap_sys_admin,cap_dac_override"
+
+
+#
+# generate Makefile includes
+#
+echo "generating: ${makefile_out}"
+echo -e "# capabilities(7) for splitcmd executables" > $makefile_out
+echo -e "\nprogs_splitcmd =$splitcmd_list" >> $makefile_out
+echo -e "${setcap_lines}" >> $makefile_out
-- 
2.16.4



[RFC PATCH 1/6] btrfs-progs: splitcmd-gen.sh: create btrfs- binaries for selected subcommands

2018-08-29 Thread Axel Burri
Create separate binaries for each subcommand ("btrfs foo bar").
Least invasive approach, generate c-files for each command:

# ./splitcmd-gen.sh
# make V=1 btrfs-subvolume-show
# make V=1 btrfs-send
# [...]

Alternative approach: instead of including the c-file, link with obj
in Makefile, e.g.:

btrfs_subvolume_show_objects = cmds-subvolume.o
btrfs_send_objects = cmds-send.o
[...]

This implies adaptions in cmds-subvolume.c (and others):

-static int cmd_filesystem_show(int argc, char **argv)
+int cmd_filesystem_show(int argc, char **argv)

If they are defined non-static, we could probably simplify further and
add `-Wl,-eentry` flags (changing entry point from "main" to "entry").

With this, and if handle_command_group() was declared in some library
instead of btrfs.c, we would get rid of generated files completely.

Signed-off-by: Axel Burri 
---
 splitcmd-gen.sh | 70 +
 splitcmd.c.in   | 17 ++
 2 files changed, 87 insertions(+)
 create mode 100755 splitcmd-gen.sh
 create mode 100644 splitcmd.c.in

diff --git a/splitcmd-gen.sh b/splitcmd-gen.sh
new file mode 100755
index ..4d2e0509
--- /dev/null
+++ b/splitcmd-gen.sh
@@ -0,0 +1,70 @@
+#!/bin/bash
+
+#
+# Generate c-files for btrfs subcommands defined below
+#
+
+# Notes on linux capabilities:
+#
+# btrfs-subvolume-show, btrfs-subvolume-list, btrfs-send:
+#  - CAP_FOWNER is only needed for O_NOATIME flag in open() system calls
+#  - why CAP_SYS_ADMIN? shouldn't CAP_DAC_READ_SEARCH be enough?
+#
+# btrfs-receive:
+#  - dependent on send-stream (see cmds-receive.c: "send_ops"):
+#CAP_CHOWN, CAP_MKNOD, CAP_SETFCAP (for "lsetxattr")
+#
+# btrfs-filesystem-usage:
+#  - CAP_SYS_ADMIN is for BTRFS_IOC_TREE_SEARCH and BTRFS_IOC_FS_INFO
+#in order to provide full level of detail, see btrfs-filesystem(8)
+
+
+makefile_out="Makefile.install_setcap"
+
+splitcmd_list=""
+setcap_lines=""
+
+function gen_splitcmd {
+local name="$1"
+local dest="${1}.c"
+local cfile="$2"
+local entry="$3"
+local caps="$4"
+echo "generating: ${dest} (cfile=${cfile}, entry=${entry})"
+echo -e "/*\n * ${name}\n *\n * GENERATED BY splitcmd-gen.sh\n */\n" > 
$dest
+sed -e "s|@BTRFS_SPLITCMD_CFILE_INCLUDE@|${cfile}|g" \
+-e "s|@BTRFS_SPLITCMD_ENTRY@|${entry}|g" \
+splitcmd.c.in >> $dest
+}
+
+gen_splitcmd "btrfs-subvolume-show" \
+ "cmds-subvolume.c" "cmd_subvol_show" \
+ "cap_sys_admin,cap_fowner,cap_dac_read_search"
+
+gen_splitcmd "btrfs-subvolume-list" \
+ "cmds-subvolume.c" "cmd_subvol_list" \
+ "cap_sys_admin,cap_fowner,cap_dac_read_search"
+
+gen_splitcmd "btrfs-subvolume-snapshot" \
+ "cmds-subvolume.c" "cmd_subvol_snapshot" \
+ "cap_sys_admin,cap_fowner,cap_dac_override,cap_dac_read_search"
+
+gen_splitcmd "btrfs-subvolume-delete" \
+ "cmds-subvolume.c" "cmd_subvol_delete" \
+ "cap_sys_admin,cap_dac_override"
+
+gen_splitcmd "btrfs-send" \
+ "cmds-send.c" "cmd_send" \
+ "cap_sys_admin,cap_fowner,cap_dac_read_search"
+
+gen_splitcmd "btrfs-receive" \
+ "cmds-receive.c" "cmd_receive" \
+ 
"cap_sys_admin,cap_fowner,cap_chown,cap_mknod,cap_setfcap,cap_dac_override,cap_dac_read_search"
+
+gen_splitcmd "btrfs-filesystem-usage" \
+ "cmds-fi-usage.c" "cmd_filesystem_usage" \
+ "cap_sys_admin"
+
+gen_splitcmd "btrfs-qgroup-destroy" \
+ "cmds-qgroup.c" "cmd_qgroup_destroy" \
+ "cap_sys_admin,cap_dac_override"
diff --git a/splitcmd.c.in b/splitcmd.c.in
new file mode 100644
index ..aa07af9a
--- /dev/null
+++ b/splitcmd.c.in
@@ -0,0 +1,17 @@
+#include "@BTRFS_SPLITCMD_CFILE_INCLUDE@"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in
+ * "cmds-subvolume.c"), is never called in splitcmd executables.
+ */
+int handle_command_group(const struct cmd_group *grp, int argc,
+char **argv)
+{
+   exit(1);
+}
+
+
+int main(int argc, char **argv)
+{
+   return @BTRFS_SPLITCMD_ENTRY@(argc, argv);
+}
-- 
2.16.4



[RFC PATCH 3/6] btrfs-progs: Makefile: add "install-splitcmd-setcap" target, installs splitcmd binaries with appropriate capabilities

2018-08-29 Thread Axel Burri
Install all $progs_install_splitcmd, and set appropriate linux file
capabilities(7) using setcap(8).

NOTE: while installing, group is hardcoded to "btrfs"! This needs
further discussion.

Signed-off-by: Axel Burri 
---
 Makefile| 36 
 Makefile.inc.in |  1 +
 configure.ac|  1 +
 3 files changed, 38 insertions(+)

diff --git a/Makefile b/Makefile
index fcfc815a..5a1e2747 100644
--- a/Makefile
+++ b/Makefile
@@ -4,6 +4,7 @@
 #   static  build static bnaries, requires static version of the libraries
 #   testrun the full testsuite
 #   install install to default location (/usr/local)
+#   install-splitcmd-setcap  install splitcmd binaries, and set linux 
capabilities
 #   clean   clean built binaries (not the documentation)
 #   clean-all   clean as above, clean docs and generated files
 #
@@ -231,6 +232,30 @@ progs_install =
 progs_build =
 endif
 
+# split-command executables, generated by splitcmd-gen.sh
+progs_splitcmd = btrfs-send \
+   btrfs-receive \
+   btrfs-subvolume-list \
+   btrfs-subvolume-show \
+   btrfs-subvolume-snapshot \
+   btrfs-subvolume-delete \
+   btrfs-filesystem-usage \
+   btrfs-qgroup-destroy
+
+progs_install_splitcmd = $(progs_splitcmd)
+
+INSTALL_SETCAP_FLAGS = -m710 -gbtrfs
+
+# linux capabilities(7) needed; used by "install-splitcmd-setcap-%" below
+btrfs_subvolume_show_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
+btrfs_subvolume_list_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
+btrfs_subvolume_snapshot_fcaps = 
"cap_sys_admin,cap_fowner,cap_dac_override,cap_dac_read_search"
+btrfs_subvolume_delete_fcaps = "cap_sys_admin,cap_dac_override"
+btrfs_send_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
+btrfs_receive_fcaps = 
"cap_sys_admin,cap_fowner,cap_chown,cap_mknod,cap_setfcap,cap_dac_override,cap_dac_read_search"
+btrfs_filesystem_usage_fcaps = "cap_sys_admin"
+btrfs_qgroup_destroy_fcaps = "cap_sys_admin,cap_dac_override"
+
 # external libs required by various binaries; for btrfs-foo,
 # specify btrfs_foo_libs = ; see $($(subst...)) rules below
 btrfs_convert_cflags = -DBTRFSCONVERT_EXT2=$(BTRFSCONVERT_EXT2)
@@ -318,6 +343,7 @@ endif
$($(subst -,_,btrfs-$(@:%/$(notdir $@)=%)-cflags))
 
 all: $(progs_build) $(libs_build) $(BUILDDIRS)
+splitcmd: $(progs_splitcmd)
 ifeq ($(PYTHON_BINDINGS),1)
 all: libbtrfsutil_python
 endif
@@ -618,6 +644,7 @@ clean: $(CLEANDIRS)
  $(check_defs) \
  $(libs) $(lib_links) \
  $(progs_static) \
+ $(progs_splitcmd) \
  libbtrfsutil/*.o libbtrfsutil/*.o.d
 ifeq ($(PYTHON_BINDINGS),1)
$(Q)cd libbtrfsutil/python; \
@@ -678,6 +705,15 @@ install-static: $(progs_static) $(INSTALLDIRS)
# btrfsck is a link to btrfs in the src tree, make it so for installed 
file as well
$(LN_S) -f btrfs.static $(DESTDIR)$(bindir)/btrfsck.static
 
+# install split-command binary, and set linux capabilities(7) defined
+# in btrfs_*_fcaps above, using setcap(8)
+install-splitcmd-setcap-%: %
+   @echo $(INSTALL) -m755 -d $(DESTDIR)$(bindir)
+   @echo $(INSTALL) $(INSTALL_SETCAP_FLAGS) $< $(DESTDIR)$(bindir)
+   @echo $(SETCAP) $($(subst -,_,$<)_fcaps)+ep $(DESTDIR)$(bindir)/$<
+
+install-splitcmd-setcap: $(progs_install_splitcmd) $(patsubst 
%,install-splitcmd-setcap-%,$(progs_install_splitcmd))
+
 $(INSTALLDIRS):
@echo "Making install in $(patsubst install-%,%,$@)"
$(Q)$(MAKE) $(MAKEOPTS) -C $(patsubst install-%,%,$@) install
diff --git a/Makefile.inc.in b/Makefile.inc.in
index a86c528e..567e4e6f 100644
--- a/Makefile.inc.in
+++ b/Makefile.inc.in
@@ -10,6 +10,7 @@ AR = @AR@
 RM = @RM@
 RMDIR = @RMDIR@
 INSTALL = @INSTALL@
+SETCAP = @SETCAP@
 DISABLE_DOCUMENTATION = @DISABLE_DOCUMENTATION@
 DISABLE_BTRFSCONVERT = @DISABLE_BTRFSCONVERT@
 BUILD_PROGRAMS = @BUILD_PROGRAMS@
diff --git a/configure.ac b/configure.ac
index df02f206..fefbfd9c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -39,6 +39,7 @@ AC_PROG_LN_S
 AC_CHECK_TOOL([AR], [ar])
 AC_PATH_PROG([RM], [rm], [rm])
 AC_PATH_PROG([RMDIR], [rmdir], [rmdir])
+AC_PATH_PROG([SETCAP], [setcap], [setcap])
 
 
 AC_CHECK_FUNCS([openat], [],
-- 
2.16.4



[RFC PATCH 6/6] btrfs-progs: add splitcmd binaries to gitignore

2018-08-29 Thread Axel Burri
Signed-off-by: Axel Burri 
---
 .gitignore | 9 +
 1 file changed, 9 insertions(+)

diff --git a/.gitignore b/.gitignore
index 144ebb3b..299019e0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -42,6 +42,15 @@ library-test-static
 /fssum
 testsuite-id
 
+btrfs-filesystem-usage
+btrfs-qgroup-destroy
+btrfs-receive
+btrfs-send
+btrfs-subvolume-delete
+btrfs-subvolume-list
+btrfs-subvolume-show
+btrfs-subvolume-snapshot
+
 /tests/*-tests-results.txt
 /tests/test-console.txt
 /tests/test.img
-- 
2.16.4



[RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-08-29 Thread Axel Burri
This patch allows to build distinct binaries for specific btrfs
subcommands, e.g. "btrfs-subvolume-show" which would be identical to
"btrfs subvolume show".


Motivation:

While btrfs-progs offer the all-inclusive "btrfs" command, it gets
pretty cumbersome to restrict privileges to the subcommands [1].
Common approaches are to either setuid root for "/sbin/btrfs" (which
is not recommended at all), or to write sudo rules for each
subcommand.

Separating the subcommands into distinct binaries makes it easy to set
elevated privileges using capabilities(7) or setuid. A typical use
case where this is needed is when it comes to automated scripts,
e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh.


Description:

Patch 1 adds a template as well as a generator shell script for the
splitted subcommands.

Patch 2 adds the generated subcommand source files.

Patch 3-5 adds a "install-splitcmd-setcap" make target, with different
approaches (either hardcoded in Makefile, or more generically by
including "Makefile.install_setcap" generated by "splitcmd-gen.sh").


Open Questions:

1. "make install-splitcmd-setcap" installs the binaries with hardcoded
group "btrfs". This needs to be configurable (how?). Another approach
would be to not set the group at all, and leave this to the user or
distro packaging script.

2. Instead of the "install-splitcmd-setcap" make target, we could
introduce a "configure --enable-splitted-subcommands" option, which
would simply add all splitcmd binaries to the "all" and "install"
targets without special treatment, and leave the setcap stuff to the
user or distro packaging script (at least in gentoo, this needs to be
specified using the "fcaps" eclass anyways [5]).


References:

  [1] https://www.spinics.net/lists/linux-btrfs/msg75736.html
  [2] https://github.com/digint/btrbk
  [3] https://github.com/digint/btrfs-progs-btrbk
  [4] https://github.com/digint/btrfs-progs/tree/splitcmd-setcap
  [5] https://dev.tty0.ch/portage/digint-overlay.git (sys-fs/btrfs-progs-btrbk)



Axel Burri (6):
  btrfs-progs: splitcmd-gen.sh: create btrfs- binaries for
selected subcommands
  btrfs-progs: add btrfs- source files generated by
splitcmd-gen.sh
  btrfs-progs: Makefile: add "install-splitcmd-setcap" target, installs
splitcmd binaries with appropriate capabilities
  btrfs-progs: Makefile: include Makefile.install_setcap generated by
splitcmd-gen.sh
  btrfs-progs: Makefile: move progs_splitcmd variable to
Makefile.install_setcap
  btrfs-progs: add splitcmd binaries to gitignore

 .gitignore |  9 +
 Makefile   | 20 +++
 Makefile.inc.in|  1 +
 Makefile.install_setcap| 12 +++
 btrfs-filesystem-usage.c   | 23 +
 btrfs-qgroup-destroy.c | 23 +
 btrfs-receive.c| 23 +
 btrfs-send.c   | 23 +
 btrfs-subvolume-delete.c   | 23 +
 btrfs-subvolume-list.c | 23 +
 btrfs-subvolume-show.c | 23 +
 btrfs-subvolume-snapshot.c | 23 +
 configure.ac   |  1 +
 splitcmd-gen.sh| 86 ++
 splitcmd.c.in  | 17 +
 15 files changed, 330 insertions(+)
 create mode 100644 Makefile.install_setcap
 create mode 100644 btrfs-filesystem-usage.c
 create mode 100644 btrfs-qgroup-destroy.c
 create mode 100644 btrfs-receive.c
 create mode 100644 btrfs-send.c
 create mode 100644 btrfs-subvolume-delete.c
 create mode 100644 btrfs-subvolume-list.c
 create mode 100644 btrfs-subvolume-show.c
 create mode 100644 btrfs-subvolume-snapshot.c
 create mode 100755 splitcmd-gen.sh
 create mode 100644 splitcmd.c.in

-- 
2.16.4



[RFC PATCH 2/6] btrfs-progs: add btrfs- source files generated by splitcmd-gen.sh

2018-08-29 Thread Axel Burri
Another approach would be to generate the splitted commands in the
Makefile on-demand, which is probably not desired.

Signed-off-by: Axel Burri 
---
 btrfs-filesystem-usage.c   | 23 +++
 btrfs-qgroup-destroy.c | 23 +++
 btrfs-receive.c| 23 +++
 btrfs-send.c   | 23 +++
 btrfs-subvolume-delete.c   | 23 +++
 btrfs-subvolume-list.c | 23 +++
 btrfs-subvolume-show.c | 23 +++
 btrfs-subvolume-snapshot.c | 23 +++
 8 files changed, 184 insertions(+)
 create mode 100644 btrfs-filesystem-usage.c
 create mode 100644 btrfs-qgroup-destroy.c
 create mode 100644 btrfs-receive.c
 create mode 100644 btrfs-send.c
 create mode 100644 btrfs-subvolume-delete.c
 create mode 100644 btrfs-subvolume-list.c
 create mode 100644 btrfs-subvolume-show.c
 create mode 100644 btrfs-subvolume-snapshot.c

diff --git a/btrfs-filesystem-usage.c b/btrfs-filesystem-usage.c
new file mode 100644
index ..e102ebc8
--- /dev/null
+++ b/btrfs-filesystem-usage.c
@@ -0,0 +1,23 @@
+/*
+ * btrfs-filesystem-usage
+ *
+ * GENERATED BY splitcmd-gen.sh
+ */
+
+#include "cmds-fi-usage.c"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in
+ * "cmds-subvolume.c"), is never called in splitcmd executables.
+ */
+int handle_command_group(const struct cmd_group *grp, int argc,
+char **argv)
+{
+   exit(1);
+}
+
+
+int main(int argc, char **argv)
+{
+   return cmd_filesystem_usage(argc, argv);
+}
diff --git a/btrfs-qgroup-destroy.c b/btrfs-qgroup-destroy.c
new file mode 100644
index ..4fb32210
--- /dev/null
+++ b/btrfs-qgroup-destroy.c
@@ -0,0 +1,23 @@
+/*
+ * btrfs-qgroup-destroy
+ *
+ * GENERATED BY splitcmd-gen.sh
+ */
+
+#include "cmds-qgroup.c"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in
+ * "cmds-subvolume.c"), is never called in splitcmd executables.
+ */
+int handle_command_group(const struct cmd_group *grp, int argc,
+char **argv)
+{
+   exit(1);
+}
+
+
+int main(int argc, char **argv)
+{
+   return cmd_qgroup_destroy(argc, argv);
+}
diff --git a/btrfs-receive.c b/btrfs-receive.c
new file mode 100644
index ..9fe94080
--- /dev/null
+++ b/btrfs-receive.c
@@ -0,0 +1,23 @@
+/*
+ * btrfs-receive
+ *
+ * GENERATED BY splitcmd-gen.sh
+ */
+
+#include "cmds-receive.c"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in
+ * "cmds-subvolume.c"), is never called in splitcmd executables.
+ */
+int handle_command_group(const struct cmd_group *grp, int argc,
+char **argv)
+{
+   exit(1);
+}
+
+
+int main(int argc, char **argv)
+{
+   return cmd_receive(argc, argv);
+}
diff --git a/btrfs-send.c b/btrfs-send.c
new file mode 100644
index ..0eff2931
--- /dev/null
+++ b/btrfs-send.c
@@ -0,0 +1,23 @@
+/*
+ * btrfs-send
+ *
+ * GENERATED BY splitcmd-gen.sh
+ */
+
+#include "cmds-send.c"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in
+ * "cmds-subvolume.c"), is never called in splitcmd executables.
+ */
+int handle_command_group(const struct cmd_group *grp, int argc,
+char **argv)
+{
+   exit(1);
+}
+
+
+int main(int argc, char **argv)
+{
+   return cmd_send(argc, argv);
+}
diff --git a/btrfs-subvolume-delete.c b/btrfs-subvolume-delete.c
new file mode 100644
index ..d6018d8b
--- /dev/null
+++ b/btrfs-subvolume-delete.c
@@ -0,0 +1,23 @@
+/*
+ * btrfs-subvolume-delete
+ *
+ * GENERATED BY splitcmd-gen.sh
+ */
+
+#include "cmds-subvolume.c"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in
+ * "cmds-subvolume.c"), is never called in splitcmd executables.
+ */
+int handle_command_group(const struct cmd_group *grp, int argc,
+char **argv)
+{
+   exit(1);
+}
+
+
+int main(int argc, char **argv)
+{
+   return cmd_subvol_delete(argc, argv);
+}
diff --git a/btrfs-subvolume-list.c b/btrfs-subvolume-list.c
new file mode 100644
index ..8529aef5
--- /dev/null
+++ b/btrfs-subvolume-list.c
@@ -0,0 +1,23 @@
+/*
+ * btrfs-subvolume-list
+ *
+ * GENERATED BY splitcmd-gen.sh
+ */
+
+#include "cmds-subvolume.c"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in
+ * "cmds-subvolume.c"), is never called in splitcmd executables.
+ */
+int handle_command_group(const struct cmd_group *grp, int argc,
+char **argv)
+{
+   exit(1);
+}
+
+
+int main(int argc, char **argv)
+{
+   return cmd_subvol_list(argc, argv);
+}
diff --git a/btrfs-subvolume-show.c b/btrfs-subvolume-show.c
new file mode 100644
index ..856f97cf
--- /dev/null
+++ b/btrfs-subvolume-show.c
@@ -0,0 +1,23 @@
+/*
+ * btrfs-subvolume-show
+ *
+ * GENERATED BY splitcmd-gen.sh
+ */
+
+#include "cmds-subvolume.c"
+
+/*
+ * Dummy object: used from second-level command groups (e.g. in

[RFC PATCH 4/6] btrfs-progs: Makefile: include Makefile.install_setcap generated by splitcmd-gen.sh

2018-08-29 Thread Axel Burri
Signed-off-by: Axel Burri 
---
 Makefile| 11 ++-
 Makefile.install_setcap | 10 ++
 splitcmd-gen.sh |  5 +
 3 files changed, 17 insertions(+), 9 deletions(-)
 create mode 100644 Makefile.install_setcap

diff --git a/Makefile b/Makefile
index 5a1e2747..acf5a677 100644
--- a/Makefile
+++ b/Makefile
@@ -246,15 +246,8 @@ progs_install_splitcmd = $(progs_splitcmd)
 
 INSTALL_SETCAP_FLAGS = -m710 -gbtrfs
 
-# linux capabilities(7) needed; used by "install-splitcmd-setcap-%" below
-btrfs_subvolume_show_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
-btrfs_subvolume_list_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
-btrfs_subvolume_snapshot_fcaps = 
"cap_sys_admin,cap_fowner,cap_dac_override,cap_dac_read_search"
-btrfs_subvolume_delete_fcaps = "cap_sys_admin,cap_dac_override"
-btrfs_send_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
-btrfs_receive_fcaps = 
"cap_sys_admin,cap_fowner,cap_chown,cap_mknod,cap_setfcap,cap_dac_override,cap_dac_read_search"
-btrfs_filesystem_usage_fcaps = "cap_sys_admin"
-btrfs_qgroup_destroy_fcaps = "cap_sys_admin,cap_dac_override"
+# defines btrfs_*_caps; used by "install-splitcmd-setcap-%" below
+include Makefile.install_setcap
 
 # external libs required by various binaries; for btrfs-foo,
 # specify btrfs_foo_libs = ; see $($(subst...)) rules below
diff --git a/Makefile.install_setcap b/Makefile.install_setcap
new file mode 100644
index ..7705db74
--- /dev/null
+++ b/Makefile.install_setcap
@@ -0,0 +1,10 @@
+# capabilities(7) for splitcmd executables
+
+btrfs_subvolume_show_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
+btrfs_subvolume_list_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
+btrfs_subvolume_snapshot_fcaps = 
"cap_sys_admin,cap_fowner,cap_dac_override,cap_dac_read_search"
+btrfs_subvolume_delete_fcaps = "cap_sys_admin,cap_dac_override"
+btrfs_send_fcaps = "cap_sys_admin,cap_fowner,cap_dac_read_search"
+btrfs_receive_fcaps = 
"cap_sys_admin,cap_fowner,cap_chown,cap_mknod,cap_setfcap,cap_dac_override,cap_dac_read_search"
+btrfs_filesystem_usage_fcaps = "cap_sys_admin"
+btrfs_qgroup_destroy_fcaps = "cap_sys_admin,cap_dac_override"
diff --git a/splitcmd-gen.sh b/splitcmd-gen.sh
index 4d2e0509..d34c5cbd 100755
--- a/splitcmd-gen.sh
+++ b/splitcmd-gen.sh
@@ -35,8 +35,13 @@ function gen_splitcmd {
 sed -e "s|@BTRFS_SPLITCMD_CFILE_INCLUDE@|${cfile}|g" \
 -e "s|@BTRFS_SPLITCMD_ENTRY@|${entry}|g" \
 splitcmd.c.in >> $dest
+
+echo "${name//-/_}_fcaps = \"${caps}\"" >> $makefile_out
 }
 
+echo "generating: ${makefile_out}"
+echo -e "# capabilities(7) for splitcmd executables\n" > $makefile_out
+
 gen_splitcmd "btrfs-subvolume-show" \
  "cmds-subvolume.c" "cmd_subvol_show" \
  "cap_sys_admin,cap_fowner,cap_dac_read_search"
-- 
2.16.4



Re: [PATCH 6/6] btrfs-progs: rescue-super: Don't double free fs_devices

2018-08-29 Thread Nikolay Borisov



On  3.08.2018 08:50, Qu Wenruo wrote:
> [BUG]
> During fuzz/007 we hit the following error:
> --
> == RUN MAYFAIL /home/adam/btrfs/btrfs-progs/btrfs rescue super-recover -y 
> -v 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-200409.raw.restored.scratch
> ERROR: tree_root block unaligned: 33554431
> ERROR: superblock checksum matches but it has invalid members
> ERROR: tree_root block unaligned: 33554431
> ERROR: superblock checksum matches but it has invalid members
> ERROR: tree_root block unaligned: 33554431
> ERROR: superblock checksum matches but it has invalid members
> ERROR: failed to add chunk map start=12582912 len=8454144: -17 (File exists)
> Couldn't read chunk tree
> failed (ignored, ret=139): /home/adam/btrfs/btrfs-progs/btrfs rescue 
> super-recover -y -v 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-200409.raw.restored.scratch
> mayfail: returned code 139 (SEGFAULT), not ignored
> test failed for case 007-simple-super-recover
> --
> 
> [CAUSE]
> In __open_ctree_fd(), if we have valid @open_ctree_flags and
> btrfs_scan_fs_devices() successes without problem, no matter what
> happens we will call btrfs_close_devices(), thus free all related
> devices.

Why do you think it's _always_ going to be called? Looking into that
function it seems this can happen if
btrfs_setup_chunk_tree_and_device_map fails.
> 
> In super-recover, before we call open_ctree(), we have called
> btrfs_scan_fs_devices() already, so btrfs_scan_fs_devices() should not
> fail in open_ctree(), fs_devices will always be freed in open_ctree() or
> close_ctree().

Isn't the actual issue just that we call close_ctree. So the actual life
time of fs_devices is :

1. Create in btrfs_scan_fs_devices called from btrfs_recover_superblocks
2. All other references to those fs_devices will just return the same
reference.
3. Calling close_ctree frees fs_devices.

> 
> [FIX]
> So in super-recover.c, we should not call btrfs_close_devices(), or we
> will find fs_devices->list get poisoned, and trigger segfault when
> exiting.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  super-recover.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/super-recover.c b/super-recover.c
> index 880fd7712546..86b3df9867dc 100644
> --- a/super-recover.c
> +++ b/super-recover.c
> @@ -292,9 +292,6 @@ int btrfs_recover_superblocks(const char *dname,
>  no_recover:
>   recover_err_str(ret);
>   free_recover_superblock();
> - /* check if we have freed fs_devices in close_ctree() */
> - if (!root)
> - btrfs_close_devices(recover.fs_devices);
>   return ret;
>  }
>  
> 


Re: [PATCH 5/6] btrfs-progs: Exit gracefull when we failed to alloc dev extent

2018-08-29 Thread Nikolay Borisov



On  3.08.2018 08:50, Qu Wenruo wrote:
> Another BUG_ON() during fuzz/003:
> --
> == RUN MAYFAIL /home/adam/btrfs/btrfs-progs/btrfs check --repair 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-199833-reloc-recovery-crash.raw.restored
> [1/7] checking root items
> Fixed 0 roots.
> [2/7] checking extents
> ctree.c:1650: leaf_space_used: Warning: assertion `data_len < 0` failed, 
> value 1
> bad key ordering 18 19
> bad block 29409280
> ERROR: errors found in extent allocation tree or chunk allocation
> WARNING: minor unaligned/mismatch device size detected
> WARNING: recommended to use 'btrfs rescue fix-device-size' to fix it
> [3/7] checking free space cache
> [4/7] checking fs roots
> ctree.c:1650: leaf_space_used: Warning: assertion `data_len < 0` failed, 
> value 1
> bad key ordering 18 19
> root 18446744073709551608 missing its root dir, recreating
> Unable to find block group for 0
> Unable to find block group for 0
> Unable to find block group for 0
> volumes.c:564: btrfs_alloc_dev_extent: BUG_ON `ret` triggered, value -28
> failed (ignored, ret=134): /home/adam/btrfs/btrfs-progs/btrfs check --repair 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-199833-reloc-recovery-crash.raw.restored
> mayfail: returned code 134 (SIGABRT), not ignored
> test failed for case 003-multi-check-unmounted
> --
> 
> However the culprit function btrfs_alloc_dev_extent() has proper error
> handler tag err:, just use that tag would solve the problem easily.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  volumes.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/volumes.c b/volumes.c
> index d81b348eb14d..f7a413b71d52 100644
> --- a/volumes.c
> +++ b/volumes.c
> @@ -561,7 +561,8 @@ static int btrfs_alloc_dev_extent(struct 
> btrfs_trans_handle *trans,
>   key.type = BTRFS_DEV_EXTENT_KEY;
>   ret = btrfs_insert_empty_item(trans, root, path, ,
> sizeof(*extent));
> - BUG_ON(ret);
> + if (ret < 0)
> + goto err;
>  
>   leaf = path->nodes[0];
>   extent = btrfs_item_ptr(leaf, path->slots[0],
> 


Re: [PATCH 3/6] btrfs-progs: Don't report dirty leaked eb using BUG_ON

2018-08-29 Thread Nikolay Borisov



On  3.08.2018 08:50, Qu Wenruo wrote:
> Another BUG_ON() during fuzz/003:
> --
> == RUN MAYFAIL /home/adam/btrfs/btrfs-progs/btrfs check --init-csum-tree 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-161821.raw.restored
> [1/7] checking root items
> Fixed 0 roots.
> [2/7] checking extents
> parent transid verify failed on 4198400 wanted 14 found 1114126
> parent transid verify failed on 4198400 wanted 14 found 1114126
> Ignoring transid failure
> owner ref check failed [4198400 4096]
> repair deleting extent record: key [4198400,169,0]
> adding new tree backref on start 4198400 len 4096 parent 0 root 5
> Repaired extent references for 4198400
> ref mismatch on [4222976 4096] extent item 1, found 0
> backref 4222976 root 7 not referenced back 0x5617f8ecf780
> incorrect global backref count on 4222976 found 1 wanted 0
> backpointer mismatch on [4222976 4096]
> owner ref check failed [4222976 4096]
> repair deleting extent record: key [4222976,169,0]
> Repaired extent references for 4222976
> [3/7] checking free space cache
> [4/7] checking fs roots
> parent transid verify failed on 4198400 wanted 14 found 1114126
> Ignoring transid failure
> Wrong generation of child node/leaf, wanted: 1114126, have: 14
> root 5 missing its root dir, recreating
> parent transid verify failed on 4198400 wanted 14 found 1114126
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=4222976 item=0 parent level=1 child 
> level=2
> ERROR: errors found in fs roots
> extent buffer leak: start 4222976 len 4096
> extent_io.c:611: free_extent_buffer_internal: BUG_ON `eb->flags & 
> EXTENT_DIRTY` triggered, value 1
> failed (ignored, ret=134): /home/adam/btrfs/btrfs-progs/btrfs check 
> --init-csum-tree 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-161821.raw.restored
> mayfail: returned code 134 (SIGABRT), not ignored
> test failed for case 003-multi-check-unmounted
> --
> 
> Since we're shifting to using btrfs_abort_transaction() in btrfs-progs,
> it will be more and more common to see dirty leaked eb.
> Instead of BUG_ON(), we only needs to report it as warning.


So how are such leaked extents supposed to be cleaned? So when
transaction_aborted is set we just return some errors from various
functions but I don't see how modified extents in the transaction are freed?
> 
> Signed-off-by: Qu Wenruo 
> ---
>  extent_io.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/extent_io.c b/extent_io.c
> index 198492699438..b8510b0ae94e 100644
> --- a/extent_io.c
> +++ b/extent_io.c
> @@ -608,7 +608,11 @@ static void free_extent_buffer_internal(struct 
> extent_buffer *eb, bool free_now)
>   eb->refs--;
>   BUG_ON(eb->refs < 0);
>   if (eb->refs == 0) {
> - BUG_ON(eb->flags & EXTENT_DIRTY);
> + if (eb->flags & EXTENT_DIRTY) {
> + warning(
> + "dirty eb leak (aborted trans): start %llu len %u",
> + eb->start, eb->len);
> + }
>   list_del_init(>recow);
>   if (eb->flags & EXTENT_BUFFER_DUMMY || free_now)
>   free_extent_buffer_final(eb);
> 


Re: btrfs fi du unreliable?

2018-08-29 Thread Graham Cobb
On 29/08/18 14:31, Jorge Bastos wrote:
> Thanks, that makes sense, so it's only possible to see how much space
> a snapshot is using with quotas enable, I remember reading that
> somewhere before, though there was a new way after reading this latest
> post .

My extents lists scripts (https://github.com/GrahamCobb/extents-lists)
can tell you the answers to questions like this. In particular, see the
extents-to-remove script.

However, be aware of the warning in the documentation:
---
Be warned: the last three examples take a very LONG TIME (and require a
lot of space in $TMPDIR) as they effectively have to get the file
extents for almost every file on the disk (and sort them multiple
times). They take over 12 hours on my system!
---

I don't know if there are any better tools which work faster.


Re: [PATCH v3 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2018-08-29 Thread Nikolay Borisov



On 29.08.2018 16:53, Qu Wenruo wrote:
> 
> 
> On 2018/8/29 下午9:43, Nikolay Borisov wrote:
>>
>>
>> On 29.08.2018 08:15, Qu Wenruo wrote:
>>> Function btrfs_trim_fs() doesn't handle errors in a consistent way, if
>>> error happens when trimming existing block groups, it will skip the
>>> remaining blocks and continue to trim unallocated space for each device.
>>>
>>> And the return value will only reflect the final error from device
>>> trimming.
>>>
>>> This patch will fix such behavior by:
>>>
>>> 1) Recording last error from block group or device trimming
>>>So return value will also reflect the last error during trimming.
>>>Make developer more aware of the problem.
>>>
>>> 2) Continuing trimming if we can
>>>If we failed to trim one block group or device, we could still try
>>>next block group or device.
>>>
>>> 3) Report number of failures during block group and device trimming
>>>So it would be less noisy, but still gives user a brief summary of
>>>what's going wrong.
>>>
>>> Such behavior can avoid confusion for case like failure to trim the
>>> first block group and then only unallocated space is trimmed.
>>>
>>> Reported-by: Chris Murphy 
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  fs/btrfs/extent-tree.c | 57 ++
>>>  1 file changed, 41 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>>> index de6f75f5547b..7768f206196a 100644
>>> --- a/fs/btrfs/extent-tree.c
>>> +++ b/fs/btrfs/extent-tree.c
>>> @@ -10832,6 +10832,16 @@ static int btrfs_trim_free_extents(struct 
>>> btrfs_device *device,
>>> return ret;
>>>  }
>>>  
>>> +/*
>>> + * Trim the whole fs, by:
>>> + * 1) Trimming free space in each block group
>>> + * 2) Trimming unallocated space in each device
>>> + *
>>> + * Will try to continue trimming even if we failed to trim one block group 
>>> or
>>> + * device.
>>> + * The return value will be the last error during trim.
>>> + * Or 0 if nothing wrong happened.
>>> + */
>>>  int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range 
>>> *range)
>>>  {
>>> struct btrfs_block_group_cache *cache = NULL;
>>> @@ -10842,6 +10852,10 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>>> struct fstrim_range *range)
>>> u64 end;
>>> u64 trimmed = 0;
>>> u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
>>> +   u64 bg_failed = 0;
>>> +   u64 dev_failed = 0;
>>> +   int bg_ret = 0;
>>> +   int dev_ret = 0;
>>> int ret = 0;
>>>  
>>> /*
>>> @@ -10852,7 +10866,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>>> struct fstrim_range *range)
>>> else
>>> cache = btrfs_lookup_block_group(fs_info, range->start);
>>>  
>>> -   while (cache) {
>>> +   for (; cache; cache = next_block_group(fs_info, cache)) {
>>> if (cache->key.objectid >= (range->start + range->len)) {
>>> btrfs_put_block_group(cache);
>>> break;
>>> @@ -10866,45 +10880,56 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>>> struct fstrim_range *range)
>>> if (!block_group_cache_done(cache)) {
>>> ret = cache_block_group(cache, 0);
>>> if (ret) {
>>> -   btrfs_put_block_group(cache);
>>> -   break;
>>> +   bg_failed++;
>>> +   bg_ret = ret;
>>> +   continue;
>>> }
>>> ret = wait_block_group_cache_done(cache);
>>> if (ret) {
>>> -   btrfs_put_block_group(cache);
>>> -   break;
>>> +   bg_failed++;
>>> +   bg_ret = ret;
>>> +   continue;
>>> }
>>> }
>>> -   ret = btrfs_trim_block_group(cache,
>>> -_trimmed,
>>> -start,
>>> -end,
>>> -range->minlen);
>>> +   ret = btrfs_trim_block_group(cache, _trimmed,
>>> +   start, end, range->minlen);
>>>  
>>> trimmed += group_trimmed;
>>> if (ret) {
>>> -   btrfs_put_block_group(cache);
>>> -   break;
>>> +   bg_failed++;
>>> +   bg_ret = ret;
>>> +   continue;
>>> }
>>> }
>>> -
>>> -   cache = next_block_group(fs_info, cache);
>>> }
>>>  
>>> +   if (bg_failed)
>>> +   

Re: [PATCH 2/6] btrfs-progs: Exit gracefully when failed to repair root dir item

2018-08-29 Thread Nikolay Borisov



On  3.08.2018 08:50, Qu Wenruo wrote:
> Another BUG_ON() during fuzz/003:
> --
> == RUN MAYFAIL /home/adam/btrfs/btrfs-progs/btrfs check --init-csum-tree 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-161821.raw.restored
> [1/7] checking root items
> Fixed 0 roots.
> [2/7] checking extents
> parent transid verify failed on 4198400 wanted 14 found 1114126
> parent transid verify failed on 4198400 wanted 14 found 1114126
> Ignoring transid failure
> owner ref check failed [4198400 4096]
> repair deleting extent record: key [4198400,169,0]
> adding new tree backref on start 4198400 len 4096 parent 0 root 5
> Repaired extent references for 4198400
> ref mismatch on [4222976 4096] extent item 1, found 0
> backref 4222976 root 7 not referenced back 0x55e9cc694780
> incorrect global backref count on 4222976 found 1 wanted 0
> backpointer mismatch on [4222976 4096]
> owner ref check failed [4222976 4096]
> repair deleting extent record: key [4222976,169,0]
> Repaired extent references for 4222976
> [3/7] checking free space cache
> [4/7] checking fs roots
> parent transid verify failed on 4198400 wanted 14 found 1114126
> Ignoring transid failure
> Wrong generation of child node/leaf, wanted: 1114126, have: 14
> root 5 missing its root dir, recreating
> parent transid verify failed on 4198400 wanted 14 found 1114126
> Ignoring transid failure
> ERROR: child eb corrupted: parent bytenr=4222976 item=0 parent level=1 child 
> level=2
> check/main.c:2738: check_inode_recs: BUG_ON `ret` triggered, value -5
> failed (ignored, ret=134): /home/adam/btrfs/btrfs-progs/btrfs check 
> --init-csum-tree 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-161821.raw.restored
> mayfail: returned code 134 (SIGABRT), not ignored
> test failed for case 003-multi-check-unmounted
> --
> 
> Just abort current transaction and exit gracefully in this case.
> 
> Signed-off-by: Qu Wenruo 

LGTM:

Reviewed-by: Nikolay Borisov 

> ---
>  check/main.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/check/main.c b/check/main.c
> index b597e2a607e5..e6756e0f852b 100644
> --- a/check/main.c
> +++ b/check/main.c
> @@ -2735,7 +2735,10 @@ static int check_inode_recs(struct btrfs_root *root,
>   (unsigned long long)root->objectid);
>  
>   ret = btrfs_make_root_dir(trans, root, root_dirid);
> - BUG_ON(ret);
> + if (ret < 0) {
> + btrfs_abort_transaction(trans, ret);
> + return ret;
> + }
>  
>   btrfs_commit_transaction(trans, root);
>   return -EAGAIN;
> 


Re: [PATCH 1/6] btrfs-progs: Exit gracefully if we hit ENOSPC when allocating tree block

2018-08-29 Thread Nikolay Borisov



On  3.08.2018 08:50, Qu Wenruo wrote:
> When running test fuzz/003, we could hit the following BUG_ON:
> --
> == RUN MAYFAIL /home/adam/btrfs/btrfs-progs/btrfs check --init-csum-tree 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-155621-bad-block-group-offset.raw.restored
> Unable to find block group for 0
> Unable to find block group for 0
> Unable to find block group for 0
> extent-tree.c:2657: alloc_tree_block: BUG_ON `ret` triggered, value -28
> failed (ignored, ret=134): /home/adam/btrfs/btrfs-progs/btrfs check 
> --init-csum-tree 
> /home/adam/btrfs/btrfs-progs/tests//fuzz-tests/images/bko-155621-bad-block-group-offset.raw.restored
> mayfail: returned code 134 (SIGABRT), not ignored
> test failed for case 003-multi-check-unmounted
> --
> 
> Just remove that BUG_ON() and allow us to exit gracefully.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

However there are a lot more BUG_ONs in btrfs_reserve_extent :(
> ---
>  extent-tree.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/extent-tree.c b/extent-tree.c
> index b9d51b388c9a..a1f711ece7a8 100644
> --- a/extent-tree.c
> +++ b/extent-tree.c
> @@ -2654,7 +2654,10 @@ static int alloc_tree_block(struct btrfs_trans_handle 
> *trans,
>  
>   ret = btrfs_reserve_extent(trans, root, num_bytes, empty_size,
>  hint_byte, search_end, ins, 0);
> - BUG_ON(ret);
> + if (ret < 0) {
> + btrfs_free_delayed_extent_op(extent_op);
> + return ret;
> + }
>  
>   if (key)
>   memcpy(_op->key, key, sizeof(extent_op->key));
> 


Re: [PATCH v3 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs

2018-08-29 Thread Nikolay Borisov



On 29.08.2018 08:15, Qu Wenruo wrote:
> [BUG]
> fstrim on some btrfs only trims the unallocated space, not trimming any
> space in existing block groups.
> 
> [CAUSE]
> Before fstrim_range passed to btrfs_trim_fs(), it get truncated to
> range [0, super->total_bytes).
> So later btrfs_trim_fs() will only be able to trim block groups in range
> [0, super->total_bytes).
> 
> While for btrfs, any bytenr aligned to sector size is valid, since btrfs use
> its logical address space, there is nothing limiting the location where
> we put block groups.
> 
> For btrfs with routine balance, it's quite easy to relocate all
> block groups and bytenr of block groups will start beyond super->total_bytes.
> 
> In that case, btrfs will not trim existing block groups.
> 
> [FIX]
> Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
> can get the unmodified range, which is normally set to [0, U64_MAX].
> 
> Reported-by: Chris Murphy 
> Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim 
> ioctl")
> Cc:  # v4.0+
> Signed-off-by: Qu Wenruo 

Seems legit,

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 10 +-
>  fs/btrfs/ioctl.c   | 11 +++
>  2 files changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 7768f206196a..d1478d66c7a5 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10851,21 +10851,13 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
> struct fstrim_range *range)
>   u64 start;
>   u64 end;
>   u64 trimmed = 0;
> - u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
>   u64 bg_failed = 0;
>   u64 dev_failed = 0;
>   int bg_ret = 0;
>   int dev_ret = 0;
>   int ret = 0;
>  
> - /*
> -  * try to trim all FS space, our block group may start from non-zero.
> -  */
> - if (range->len == total_bytes)
> - cache = btrfs_lookup_first_block_group(fs_info, range->start);
> - else
> - cache = btrfs_lookup_block_group(fs_info, range->start);
> -
> + cache = btrfs_lookup_first_block_group(fs_info, range->start);
>   for (; cache; cache = next_block_group(fs_info, cache)) {
>   if (cache->key.objectid >= (range->start + range->len)) {
>   btrfs_put_block_group(cache);
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 63600dc2ac4c..8165a4bfa579 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -491,7 +491,6 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, 
> void __user *arg)
>   struct fstrim_range range;
>   u64 minlen = ULLONG_MAX;
>   u64 num_devices = 0;
> - u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
>   int ret;
>  
>   if (!capable(CAP_SYS_ADMIN))
> @@ -515,11 +514,15 @@ static noinline int btrfs_ioctl_fitrim(struct file 
> *file, void __user *arg)
>   return -EOPNOTSUPP;
>   if (copy_from_user(, arg, sizeof(range)))
>   return -EFAULT;
> - if (range.start > total_bytes ||
> - range.len < fs_info->sb->s_blocksize)
> +
> + /*
> +  * NOTE: Don't truncate the range using super->total_bytes.
> +  * Bytenr of btrfs block group is in btrfs logical address space,
> +  * which can be any sector size aligned bytenr in [0, U64_MAX].
> +  */
> + if (range.len < fs_info->sb->s_blocksize)
>   return -EINVAL;
>  
> - range.len = min(range.len, total_bytes - range.start);
>   range.minlen = max(range.minlen, minlen);
>   ret = btrfs_trim_fs(fs_info, );
>   if (ret < 0)
> 


Re: How to erase a RAID1 ?

2018-08-29 Thread Pierre Couderc




On 08/29/2018 02:52 PM, Qu Wenruo wrote:


On 2018/8/29 下午8:49, Pierre Couderc wrote:

I want to reinstall a RAID1 btrfs system (wchis is now under debian
stretch, and will be reinstalled in stretch).

If you still want to use btrfs, just umount the original fs, and

# mkfs.btrfs -f 



The problem is to "unmount" the RAID1 system..

I have not found how to do that.

I have found a solution by reinstalling under debian, it now asks if 
need to reformat thez btrfs disk...


Re: [PATCH v3 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2018-08-29 Thread Qu Wenruo



On 2018/8/29 下午9:43, Nikolay Borisov wrote:
> 
> 
> On 29.08.2018 08:15, Qu Wenruo wrote:
>> Function btrfs_trim_fs() doesn't handle errors in a consistent way, if
>> error happens when trimming existing block groups, it will skip the
>> remaining blocks and continue to trim unallocated space for each device.
>>
>> And the return value will only reflect the final error from device
>> trimming.
>>
>> This patch will fix such behavior by:
>>
>> 1) Recording last error from block group or device trimming
>>So return value will also reflect the last error during trimming.
>>Make developer more aware of the problem.
>>
>> 2) Continuing trimming if we can
>>If we failed to trim one block group or device, we could still try
>>next block group or device.
>>
>> 3) Report number of failures during block group and device trimming
>>So it would be less noisy, but still gives user a brief summary of
>>what's going wrong.
>>
>> Such behavior can avoid confusion for case like failure to trim the
>> first block group and then only unallocated space is trimmed.
>>
>> Reported-by: Chris Murphy 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/extent-tree.c | 57 ++
>>  1 file changed, 41 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index de6f75f5547b..7768f206196a 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -10832,6 +10832,16 @@ static int btrfs_trim_free_extents(struct 
>> btrfs_device *device,
>>  return ret;
>>  }
>>  
>> +/*
>> + * Trim the whole fs, by:
>> + * 1) Trimming free space in each block group
>> + * 2) Trimming unallocated space in each device
>> + *
>> + * Will try to continue trimming even if we failed to trim one block group 
>> or
>> + * device.
>> + * The return value will be the last error during trim.
>> + * Or 0 if nothing wrong happened.
>> + */
>>  int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>>  {
>>  struct btrfs_block_group_cache *cache = NULL;
>> @@ -10842,6 +10852,10 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>> struct fstrim_range *range)
>>  u64 end;
>>  u64 trimmed = 0;
>>  u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
>> +u64 bg_failed = 0;
>> +u64 dev_failed = 0;
>> +int bg_ret = 0;
>> +int dev_ret = 0;
>>  int ret = 0;
>>  
>>  /*
>> @@ -10852,7 +10866,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>> struct fstrim_range *range)
>>  else
>>  cache = btrfs_lookup_block_group(fs_info, range->start);
>>  
>> -while (cache) {
>> +for (; cache; cache = next_block_group(fs_info, cache)) {
>>  if (cache->key.objectid >= (range->start + range->len)) {
>>  btrfs_put_block_group(cache);
>>  break;
>> @@ -10866,45 +10880,56 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
>> struct fstrim_range *range)
>>  if (!block_group_cache_done(cache)) {
>>  ret = cache_block_group(cache, 0);
>>  if (ret) {
>> -btrfs_put_block_group(cache);
>> -break;
>> +bg_failed++;
>> +bg_ret = ret;
>> +continue;
>>  }
>>  ret = wait_block_group_cache_done(cache);
>>  if (ret) {
>> -btrfs_put_block_group(cache);
>> -break;
>> +bg_failed++;
>> +bg_ret = ret;
>> +continue;
>>  }
>>  }
>> -ret = btrfs_trim_block_group(cache,
>> - _trimmed,
>> - start,
>> - end,
>> - range->minlen);
>> +ret = btrfs_trim_block_group(cache, _trimmed,
>> +start, end, range->minlen);
>>  
>>  trimmed += group_trimmed;
>>  if (ret) {
>> -btrfs_put_block_group(cache);
>> -break;
>> +bg_failed++;
>> +bg_ret = ret;
>> +continue;
>>  }
>>  }
>> -
>> -cache = next_block_group(fs_info, cache);
>>  }
>>  
>> +if (bg_failed)
>> +btrfs_warn(fs_info,
>> +"failed to trim %llu block group(s), last error was %d",
>> +   

Re: [PATCH v3 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2018-08-29 Thread Nikolay Borisov



On 29.08.2018 08:15, Qu Wenruo wrote:
> Function btrfs_trim_fs() doesn't handle errors in a consistent way, if
> error happens when trimming existing block groups, it will skip the
> remaining blocks and continue to trim unallocated space for each device.
> 
> And the return value will only reflect the final error from device
> trimming.
> 
> This patch will fix such behavior by:
> 
> 1) Recording last error from block group or device trimming
>So return value will also reflect the last error during trimming.
>Make developer more aware of the problem.
> 
> 2) Continuing trimming if we can
>If we failed to trim one block group or device, we could still try
>next block group or device.
> 
> 3) Report number of failures during block group and device trimming
>So it would be less noisy, but still gives user a brief summary of
>what's going wrong.
> 
> Such behavior can avoid confusion for case like failure to trim the
> first block group and then only unallocated space is trimmed.
> 
> Reported-by: Chris Murphy 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/extent-tree.c | 57 ++
>  1 file changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index de6f75f5547b..7768f206196a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -10832,6 +10832,16 @@ static int btrfs_trim_free_extents(struct 
> btrfs_device *device,
>   return ret;
>  }
>  
> +/*
> + * Trim the whole fs, by:
> + * 1) Trimming free space in each block group
> + * 2) Trimming unallocated space in each device
> + *
> + * Will try to continue trimming even if we failed to trim one block group or
> + * device.
> + * The return value will be the last error during trim.
> + * Or 0 if nothing wrong happened.
> + */
>  int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
>  {
>   struct btrfs_block_group_cache *cache = NULL;
> @@ -10842,6 +10852,10 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
> struct fstrim_range *range)
>   u64 end;
>   u64 trimmed = 0;
>   u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
> + u64 bg_failed = 0;
> + u64 dev_failed = 0;
> + int bg_ret = 0;
> + int dev_ret = 0;
>   int ret = 0;
>  
>   /*
> @@ -10852,7 +10866,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
> struct fstrim_range *range)
>   else
>   cache = btrfs_lookup_block_group(fs_info, range->start);
>  
> - while (cache) {
> + for (; cache; cache = next_block_group(fs_info, cache)) {
>   if (cache->key.objectid >= (range->start + range->len)) {
>   btrfs_put_block_group(cache);
>   break;
> @@ -10866,45 +10880,56 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, 
> struct fstrim_range *range)
>   if (!block_group_cache_done(cache)) {
>   ret = cache_block_group(cache, 0);
>   if (ret) {
> - btrfs_put_block_group(cache);
> - break;
> + bg_failed++;
> + bg_ret = ret;
> + continue;
>   }
>   ret = wait_block_group_cache_done(cache);
>   if (ret) {
> - btrfs_put_block_group(cache);
> - break;
> + bg_failed++;
> + bg_ret = ret;
> + continue;
>   }
>   }
> - ret = btrfs_trim_block_group(cache,
> -  _trimmed,
> -  start,
> -  end,
> -  range->minlen);
> + ret = btrfs_trim_block_group(cache, _trimmed,
> + start, end, range->minlen);
>  
>   trimmed += group_trimmed;
>   if (ret) {
> - btrfs_put_block_group(cache);
> - break;
> + bg_failed++;
> + bg_ret = ret;
> + continue;
>   }
>   }
> -
> - cache = next_block_group(fs_info, cache);
>   }
>  
> + if (bg_failed)
> + btrfs_warn(fs_info,
> + "failed to trim %llu block group(s), last error was %d",
> +bg_failed, bg_ret);

IMO this error handling strategy doesn't really bring any value. The
only thing which the 

Re: btrfs fi du unreliable?

2018-08-29 Thread Jorge Bastos
On Wed, Aug 29, 2018 at 1:42 PM Remi Gauvin  wrote:
>
>
> Exclusive means... exclusive... to that 1 snapshop/subvolume.  If the
> data also exists on the 23'rd snapshot, it's not exclusive.
>
> If you wanted to report how much data is exclusive to a group of
> snapshots, (say, July 22 *and* 23rd). you would have to make them
> members of a parent qgroup, then, you could see the exclusive value of
> the whole group.
>

Thanks, that makes sense, so it's only possible to see how much space
a snapshot is using with quotas enable, I remember reading that
somewhere before, though there was a new way after reading this latest
post .

Jorge


Re: How to erase a RAID1 ?

2018-08-29 Thread Qu Wenruo


On 2018/8/29 下午8:49, Pierre Couderc wrote:
> I want to reinstall a RAID1 btrfs system (wchis is now under debian
> stretch, and will be reinstalled in stretch).

If you still want to use btrfs, just umount the original fs, and

# mkfs.btrfs -f 

Then a completely new btrfs.

> 
> How to correctly "erase" it ? Not truly hard erase it, but so that old
> data does not appear...

If not planing to use btrfs, umount the fs and then wipefs -fa 

Thanks,
Qu

> 
> It is not clear in the wiki.
> 
> Thanks
> 
> PC
> 



signature.asc
Description: OpenPGP digital signature


How to erase a RAID1 ?

2018-08-29 Thread Pierre Couderc
I want to reinstall a RAID1 btrfs system (wchis is now under debian 
stretch, and will be reinstalled in stretch).


How to correctly "erase" it ? Not truly hard erase it, but so that old 
data does not appear...


It is not clear in the wiki.

Thanks

PC



Re: [PATCH 0/4] Userspace support for FSID change

2018-08-29 Thread Qu Wenruo



On 2018/8/29 下午8:33, Nikolay Borisov wrote:
> 
> 
> On 29.08.2018 15:09, Qu Wenruo wrote:
>>
>>
>> On 2018/8/29 下午4:35, Nikolay Borisov wrote:
>>> Here is the userspace tooling support for utilising the new metadata_uuid 
>>> field, 
>>> enabling the change of fsid without having to rewrite every metadata block. 
>>> This
>>> patchset consists of adding support for the new field to various tools and 
>>> files (Patch 1). The actual implementation of the new -m|-M options (which 
>>> are 
>>> described in more detail in Patch 2). A new misc-tests testcasei (Patch 3) 
>>> which
>>> exercises the new options and verifies certain invariants hold (these are 
>>> also
>>> described in Patch2). Patch 4 is more or less copy of the kernel 
>>> conuterpart 
>>> just reducing some duplication between btrfs_fs_info and btrfs_fs_devices 
>>> structures.
>>
>> So to my understand, now we have another layer of UUID.
>>
>> Before we have one fsid, both used in superblock and tree blocks.
>>
>> Now we have 2 fsid, the one used in tree blocks are kept the same, but
>> changed its name to metadata_uuid in superblock.
>> And superblock::fsid will become a new field, and although they are the
>> same at mkfs time, they could change several times during its operation.
>>
>> This indeed makes uuid change super fast, only needs to update all
>> superblocks of the fs, instead of all tree blocks.
>>
>> However I have one nitpick of the design. Unlike XFS, btrfs supports
>> multiple devices.
>> If we have a raid10 fs with 4 devices, and it has already gone through
>> several UUID change (so its metadata uuid is already different from fsid).
>>
>> And during another UUID change procedure, we lost power while only
>> updated 2 super blocks, what will happen for kernel device assembly?
>>
>> (Although considering how fast the UUID change would happen, such case
>> should be super niche)
> 
> Then I guess you will be fucked. I'm all ears for suggestion how to
> rectify this without skyrocketing the complexity. The current UUID
> rewrite method sets a flag int he superblock that FSID change is in
> progress and clears it once every metadatablock has been rewritten. I
> can piggyback on this mechanism but I'm not sure it provides 100%
> guarantee. Because by the some token you can set this flag, start
> writing the super blocks then lose power and then only some of the
> superblocks could have this flag set so we back at square 1.

Well, forget it, considering how fast the new method is, such case
should be really rare.

> 
>>
>>>
>>> The intended usecase of this feature is to give the sysadmin the ability to 
>>> create copies of filesystesm, change their uuid quickly and mount them 
>>> alongside
>>> the original filesystem for, say, forensic purposes. 
>>>
>>> One thing which still hasn't been set in stone is whether the new options 
>>> will remain as -m|-M or whether they should subsume the current -u|-U - 
>>> from 
>>> the point of view of users nothing should change.
>>
>> Well, user would be surprised by how fast the new -m is, thus there is
>> still something changed :)
>>
>> I prefer to subsume current -u/-U, and use the new one if the incompat
>> feature is already set. Or fall back to original behavior.
>>
>> But I'm not a fan of using INCOMPAT flags as an indicator of changed
>> fsid/metadata uuid.
>> INCOMPAT feature should not change so easily nor acts as an indicator.
>>
>> That's to say, the flag should only be set at mkfs time, and then never
>> change unlike the 2nd patch (I don't even like btrfstune to change
>> incompat flags).
>>
>> E.g.
>> mkfs.btrfs -O metadata_uuid , then we could use the new way to
>> change fsid without touching metadata uuid.
>> Or we could only use the old method.
> 
> I disagree, I don't see any benefit in this but only added complexity.
> Can you elaborate more ?

Well, the incompat feature is really introducing some incompatible
on-disk format change.
So if you're introducing the new metadata_uuid field, no matter if it
differs from fsid or not, it's a new field, and the incompat flag should
be set.

To me, older kernel could recognize the new format when fsid matches
metadata uuid (since there in your current patchset, such case will not
have incompat flag set) is a little dangerous.

What will happen if such old kernel/btrfs-progs tries to change fsid?
Older btrfs-progs doesn't know there is a new field, it will not touch
the metadata uuid field, just changing the fsid field along with all
tree blocks.

This will cause a fs whose fsid doesn't match metadata uuid and has no
incompat flag set. This is definitely leading to compatibility problem.

So we need to follow the incompat flags rule strictly, if on-disk format
is changed, we need the new incompat flag.

Thanks,
Qu
> 
> 
>>
>> Thanks,
>> Qu
>>
>>> So this is something which 
>>> I'd like to hear from the community. Of course the alternative of rewriting 
>>> the metadata blocks will be assigne new options - perhaps -m|M ?
>>>

Re: [PATCH 0/4] Userspace support for FSID change

2018-08-29 Thread Austin S. Hemmelgarn

On 2018-08-29 08:33, Nikolay Borisov wrote:



On 29.08.2018 15:09, Qu Wenruo wrote:



On 2018/8/29 下午4:35, Nikolay Borisov wrote:

Here is the userspace tooling support for utilising the new metadata_uuid field,
enabling the change of fsid without having to rewrite every metadata block. This
patchset consists of adding support for the new field to various tools and
files (Patch 1). The actual implementation of the new -m|-M options (which are
described in more detail in Patch 2). A new misc-tests testcasei (Patch 3) which
exercises the new options and verifies certain invariants hold (these are also
described in Patch2). Patch 4 is more or less copy of the kernel conuterpart
just reducing some duplication between btrfs_fs_info and btrfs_fs_devices
structures.


So to my understand, now we have another layer of UUID.

Before we have one fsid, both used in superblock and tree blocks.

Now we have 2 fsid, the one used in tree blocks are kept the same, but
changed its name to metadata_uuid in superblock.
And superblock::fsid will become a new field, and although they are the
same at mkfs time, they could change several times during its operation.

This indeed makes uuid change super fast, only needs to update all
superblocks of the fs, instead of all tree blocks.

However I have one nitpick of the design. Unlike XFS, btrfs supports
multiple devices.
If we have a raid10 fs with 4 devices, and it has already gone through
several UUID change (so its metadata uuid is already different from fsid).

And during another UUID change procedure, we lost power while only
updated 2 super blocks, what will happen for kernel device assembly?

(Although considering how fast the UUID change would happen, such case
should be super niche)


Then I guess you will be fucked. I'm all ears for suggestion how to
rectify this without skyrocketing the complexity. The current UUID
rewrite method sets a flag int he superblock that FSID change is in
progress and clears it once every metadatablock has been rewritten. I
can piggyback on this mechanism but I'm not sure it provides 100%
guarantee. Because by the some token you can set this flag, start
writing the super blocks then lose power and then only some of the
superblocks could have this flag set so we back at square 1.





The intended usecase of this feature is to give the sysadmin the ability to
create copies of filesystesm, change their uuid quickly and mount them alongside
the original filesystem for, say, forensic purposes.

One thing which still hasn't been set in stone is whether the new options
will remain as -m|-M or whether they should subsume the current -u|-U - from
the point of view of users nothing should change.


Well, user would be surprised by how fast the new -m is, thus there is
still something changed :)

I prefer to subsume current -u/-U, and use the new one if the incompat
feature is already set. Or fall back to original behavior.

But I'm not a fan of using INCOMPAT flags as an indicator of changed
fsid/metadata uuid.
INCOMPAT feature should not change so easily nor acts as an indicator.

That's to say, the flag should only be set at mkfs time, and then never
change unlike the 2nd patch (I don't even like btrfstune to change
incompat flags).

E.g.
mkfs.btrfs -O metadata_uuid , then we could use the new way to
change fsid without touching metadata uuid.
Or we could only use the old method.


I disagree, I don't see any benefit in this but only added complexity.
Can you elaborate more ?
Same here, I see essentially zero benefit to this, and one _big_ 
drawback, namely that you can't convert an existing volume to use this 
approach if it's a feature that can only be set at mkfs time.


That one drawback means that this is effectively useless for all 
existing BTRFS volumes, which is a pretty big limitation.


I also do think an INCOMPAT feature bit is appropriate here.  Volumes 
with this feature will potentially be enumerated with the wrong UUID on 
older kernels, which is a pretty big behavioral issue (on the level of 
completely breaking boot on some systems, keep in mind that almost all 
major distros use volume UUID's to identify volumes in /etc/fstab).





Thanks,
Qu


So this is something which
I'd like to hear from the community. Of course the alternative of rewriting
the metadata blocks will be assigne new options - perhaps -m|M ?

I've tested this with multiple xfstest runs with the new tools installed as
well as running btrfs-progs test and have observed no regressions.

Nikolay Borisov (4):
   btrfs-progs: Add support for metadata_uuid field.
   btrfstune: Add support for changing the user uuid
   btrfs-progs: tests: Add tests for changing fsid feature
   btrfs-progs: Remove fsid/metdata_uuid fields from fs_info

  btrfstune.c| 174 -
  check/main.c   |   2 +-
  chunk-recover.c|  17 ++-
  cmds-filesystem.c  |  

Re: btrfs fi du unreliable?

2018-08-29 Thread Remi Gauvin
On 2018-08-29 08:00 AM, Jorge Bastos wrote:

> 
> Look for example at snapshots from July 21st and 22nd, total used
> space went from 199 to 277GiB, this is mostly from new added files, as
> I confirmed from browsing those snapshots, there were no changes on
> the 23rd, and a lot of files were deleted before the 24th, so should't
> there be about 80GiB of exclusive content for the 22nd, or am I
> misunderstanding how this is reported? ? Those were new files only,
> never existed on previous snapshots, If I delete both snapshots from
> the 22nd and the 23rd I expect to get about 80GiB freed space.

Exclusive means... exclusive... to that 1 snapshop/subvolume.  If the
data also exists on the 23'rd snapshot, it's not exclusive.

If you wanted to report how much data is exclusive to a group of
snapshots, (say, July 22 *and* 23rd). you would have to make them
members of a parent qgroup, then, you could see the exclusive value of
the whole group.

<>

Re: [PATCH 0/4] Userspace support for FSID change

2018-08-29 Thread Nikolay Borisov



On 29.08.2018 15:09, Qu Wenruo wrote:
> 
> 
> On 2018/8/29 下午4:35, Nikolay Borisov wrote:
>> Here is the userspace tooling support for utilising the new metadata_uuid 
>> field, 
>> enabling the change of fsid without having to rewrite every metadata block. 
>> This
>> patchset consists of adding support for the new field to various tools and 
>> files (Patch 1). The actual implementation of the new -m|-M options (which 
>> are 
>> described in more detail in Patch 2). A new misc-tests testcasei (Patch 3) 
>> which
>> exercises the new options and verifies certain invariants hold (these are 
>> also
>> described in Patch2). Patch 4 is more or less copy of the kernel conuterpart 
>> just reducing some duplication between btrfs_fs_info and btrfs_fs_devices 
>> structures.
> 
> So to my understand, now we have another layer of UUID.
> 
> Before we have one fsid, both used in superblock and tree blocks.
> 
> Now we have 2 fsid, the one used in tree blocks are kept the same, but
> changed its name to metadata_uuid in superblock.
> And superblock::fsid will become a new field, and although they are the
> same at mkfs time, they could change several times during its operation.
> 
> This indeed makes uuid change super fast, only needs to update all
> superblocks of the fs, instead of all tree blocks.
> 
> However I have one nitpick of the design. Unlike XFS, btrfs supports
> multiple devices.
> If we have a raid10 fs with 4 devices, and it has already gone through
> several UUID change (so its metadata uuid is already different from fsid).
> 
> And during another UUID change procedure, we lost power while only
> updated 2 super blocks, what will happen for kernel device assembly?
> 
> (Although considering how fast the UUID change would happen, such case
> should be super niche)

Then I guess you will be fucked. I'm all ears for suggestion how to
rectify this without skyrocketing the complexity. The current UUID
rewrite method sets a flag int he superblock that FSID change is in
progress and clears it once every metadatablock has been rewritten. I
can piggyback on this mechanism but I'm not sure it provides 100%
guarantee. Because by the some token you can set this flag, start
writing the super blocks then lose power and then only some of the
superblocks could have this flag set so we back at square 1.

> 
>>
>> The intended usecase of this feature is to give the sysadmin the ability to 
>> create copies of filesystesm, change their uuid quickly and mount them 
>> alongside
>> the original filesystem for, say, forensic purposes. 
>>
>> One thing which still hasn't been set in stone is whether the new options 
>> will remain as -m|-M or whether they should subsume the current -u|-U - from 
>> the point of view of users nothing should change.
> 
> Well, user would be surprised by how fast the new -m is, thus there is
> still something changed :)
> 
> I prefer to subsume current -u/-U, and use the new one if the incompat
> feature is already set. Or fall back to original behavior.
> 
> But I'm not a fan of using INCOMPAT flags as an indicator of changed
> fsid/metadata uuid.
> INCOMPAT feature should not change so easily nor acts as an indicator.
> 
> That's to say, the flag should only be set at mkfs time, and then never
> change unlike the 2nd patch (I don't even like btrfstune to change
> incompat flags).
> 
> E.g.
> mkfs.btrfs -O metadata_uuid , then we could use the new way to
> change fsid without touching metadata uuid.
> Or we could only use the old method.

I disagree, I don't see any benefit in this but only added complexity.
Can you elaborate more ?


> 
> Thanks,
> Qu
> 
>> So this is something which 
>> I'd like to hear from the community. Of course the alternative of rewriting 
>> the metadata blocks will be assigne new options - perhaps -m|M ?
>>
>> I've tested this with multiple xfstest runs with the new tools installed as 
>> well as running btrfs-progs test and have observed no regressions. 
>>
>> Nikolay Borisov (4):
>>   btrfs-progs: Add support for metadata_uuid field.
>>   btrfstune: Add support for changing the user uuid
>>   btrfs-progs: tests: Add tests for changing fsid feature
>>   btrfs-progs: Remove fsid/metdata_uuid fields from fs_info
>>
>>  btrfstune.c| 174 
>> -
>>  check/main.c   |   2 +-
>>  chunk-recover.c|  17 ++-
>>  cmds-filesystem.c  |   2 +
>>  cmds-inspect-dump-super.c  |  22 +++-
>>  convert/common.c   |   2 +
>>  ctree.c|  15 +--
>>  ctree.h|   8 +-
>>  disk-io.c  |  62 --
>>  image/main.c   |  25 +++--
>>  tests/misc-tests/033-metadata-uuid/test.sh | 137 +++
>>  volumes.c  |  37 --
>>  volumes.h  

Re: [PATCH 0/4] Userspace support for FSID change

2018-08-29 Thread Qu Wenruo



On 2018/8/29 下午4:35, Nikolay Borisov wrote:
> Here is the userspace tooling support for utilising the new metadata_uuid 
> field, 
> enabling the change of fsid without having to rewrite every metadata block. 
> This
> patchset consists of adding support for the new field to various tools and 
> files (Patch 1). The actual implementation of the new -m|-M options (which 
> are 
> described in more detail in Patch 2). A new misc-tests testcasei (Patch 3) 
> which
> exercises the new options and verifies certain invariants hold (these are also
> described in Patch2). Patch 4 is more or less copy of the kernel conuterpart 
> just reducing some duplication between btrfs_fs_info and btrfs_fs_devices 
> structures.

So to my understand, now we have another layer of UUID.

Before we have one fsid, both used in superblock and tree blocks.

Now we have 2 fsid, the one used in tree blocks are kept the same, but
changed its name to metadata_uuid in superblock.
And superblock::fsid will become a new field, and although they are the
same at mkfs time, they could change several times during its operation.

This indeed makes uuid change super fast, only needs to update all
superblocks of the fs, instead of all tree blocks.

However I have one nitpick of the design. Unlike XFS, btrfs supports
multiple devices.
If we have a raid10 fs with 4 devices, and it has already gone through
several UUID change (so its metadata uuid is already different from fsid).

And during another UUID change procedure, we lost power while only
updated 2 super blocks, what will happen for kernel device assembly?

(Although considering how fast the UUID change would happen, such case
should be super niche)

> 
> The intended usecase of this feature is to give the sysadmin the ability to 
> create copies of filesystesm, change their uuid quickly and mount them 
> alongside
> the original filesystem for, say, forensic purposes. 
> 
> One thing which still hasn't been set in stone is whether the new options 
> will remain as -m|-M or whether they should subsume the current -u|-U - from 
> the point of view of users nothing should change.

Well, user would be surprised by how fast the new -m is, thus there is
still something changed :)

I prefer to subsume current -u/-U, and use the new one if the incompat
feature is already set. Or fall back to original behavior.

But I'm not a fan of using INCOMPAT flags as an indicator of changed
fsid/metadata uuid.
INCOMPAT feature should not change so easily nor acts as an indicator.

That's to say, the flag should only be set at mkfs time, and then never
change unlike the 2nd patch (I don't even like btrfstune to change
incompat flags).

E.g.
mkfs.btrfs -O metadata_uuid , then we could use the new way to
change fsid without touching metadata uuid.
Or we could only use the old method.

Thanks,
Qu

> So this is something which 
> I'd like to hear from the community. Of course the alternative of rewriting 
> the metadata blocks will be assigne new options - perhaps -m|M ?
> 
> I've tested this with multiple xfstest runs with the new tools installed as 
> well as running btrfs-progs test and have observed no regressions. 
> 
> Nikolay Borisov (4):
>   btrfs-progs: Add support for metadata_uuid field.
>   btrfstune: Add support for changing the user uuid
>   btrfs-progs: tests: Add tests for changing fsid feature
>   btrfs-progs: Remove fsid/metdata_uuid fields from fs_info
> 
>  btrfstune.c| 174 
> -
>  check/main.c   |   2 +-
>  chunk-recover.c|  17 ++-
>  cmds-filesystem.c  |   2 +
>  cmds-inspect-dump-super.c  |  22 +++-
>  convert/common.c   |   2 +
>  ctree.c|  15 +--
>  ctree.h|   8 +-
>  disk-io.c  |  62 --
>  image/main.c   |  25 +++--
>  tests/misc-tests/033-metadata-uuid/test.sh | 137 +++
>  volumes.c  |  37 --
>  volumes.h  |   1 +
>  13 files changed, 431 insertions(+), 73 deletions(-)
>  create mode 100755 tests/misc-tests/033-metadata-uuid/test.sh
> 


btrfs fi du unreliable?

2018-08-29 Thread Jorge Bastos
Hi,

Read on another post that you can use sudo btrfs fi du -s
' to get a report on  how much exclusive space is
being used by the snapshots, I wasn't aware of this command and
believed that there wasn't an easy way to get this, but after using it
on some of my btrfs file systems used for backups it seems unreliable.

This file system is used to backup one of my desktop disks where I
temporally save mostly large multimedia files to later archive to
another server, everyday before making the backup I do a snapshot, so
there are practically no modified files, only new or deleted files,
these are just a few days of snapshots:

btrfs fi du -s /mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-2*
 Total   Exclusive  Set shared  Filename
 199.98GiB   648.22MiB   199.35GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-20-070001
 198.74GiB 2.45GiB   196.29GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-21-070001
 277.13GiB   0.00B   277.13GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-22-070001
 277.13GiB   0.00B   277.13GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-23-070001
  80.04GiB 2.44GiB77.60GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-24-070001
  93.71GiB 1.17GiB92.55GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-25-070001
  94.85GiB   0.00B94.85GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-26-070001
 339.82GiB   127.77GiB   212.05GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-27-070001
 279.94GiB79.69GiB   200.25GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-28-070002
 118.00GiB 5.77GiB   112.22GiB
/mnt/disks/BackupSS/snaps/sspaces_daily_2018-07-29-070001

Look for example at snapshots from July 21st and 22nd, total used
space went from 199 to 277GiB, this is mostly from new added files, as
I confirmed from browsing those snapshots, there were no changes on
the 23rd, and a lot of files were deleted before the 24th, so should't
there be about 80GiB of exclusive content for the 22nd, or am I
misunderstanding how this is reported? ? Those were new files only,
never existed on previous snapshots, If I delete both snapshots from
the 22nd and the 23rd I expect to get about 80GiB freed space.

kernel 4.14.49 / btrfs-progs v4.15.1

Label: none  uuid: e12ba68e-1200-467c-99e7-95e986c1d6f2
Total devices 2 FS bytes used 1.43TiB
devid2 size 1.82TiB used 1.38TiB path /dev/sdh1
devid3 size 931.51GiB used 706.51GiB path /dev/sdi1

Data, single: total=2.06TiB, used=1.43TiB
System, RAID1: total=32.00MiB, used=336.00KiB
Metadata, RAID1: total=4.00GiB, used=1.55GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread Hugo Mills
On Wed, Aug 29, 2018 at 09:58:58AM +, Duncan wrote:
> Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:
> 
> > Thinking again, this is totally acceptable. If the requirement was a
> > good health disk, then I think I must check the disk health by myself.
> > I may believe that the disk is in a good state, or make a quick test or
> > make some very detailed tests to be sure.
> 
> For testing you might try badblocks.  It's most useful on a device that 
> doesn't have a filesystem on it you're trying to save, so you can use the 
> -w write-test option.  See the manpage for details.
> 
> The -w option should force the device to remap bad blocks where it can as 
> well, and you can take your previous smartctl read and compare it to a 
> new one after the test.
> 
> Hint if testing multiple spinning-rust devices:  Try running multiple 
> tests at once.  While this might have been slower on old EIDE, at least 
> with spinning rust, on SATA and similar you should be able to test 
> multiple devices at once without them slowing down significantly, because 
> the bottleneck is the spinning rust, not the bus, controller or CPU.  I 
> used badblocks years ago to test my new disks before setting up mdraid on 
> them, and with full disk tests on spinning rust taking (at the time) 
> nearly a day a pass and four passes for the -w test, the multiple tests 
> at once trick saved me quite a bit of time!

   Hah. Only a day? It's up to 2 days now.

   The devices get bigger. The interfaces don't get faster at the same
rate. Back in the late '90s, it was only an hour or so to run a
badblocks pass on a big disk...

   Hugo.

-- 
Hugo Mills | Nostalgia isn't what it used to be.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread Duncan
Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:

> Thinking again, this is totally acceptable. If the requirement was a
> good health disk, then I think I must check the disk health by myself.
> I may believe that the disk is in a good state, or make a quick test or
> make some very detailed tests to be sure.

For testing you might try badblocks.  It's most useful on a device that 
doesn't have a filesystem on it you're trying to save, so you can use the 
-w write-test option.  See the manpage for details.

The -w option should force the device to remap bad blocks where it can as 
well, and you can take your previous smartctl read and compare it to a 
new one after the test.

Hint if testing multiple spinning-rust devices:  Try running multiple 
tests at once.  While this might have been slower on old EIDE, at least 
with spinning rust, on SATA and similar you should be able to test 
multiple devices at once without them slowing down significantly, because 
the bottleneck is the spinning rust, not the bus, controller or CPU.  I 
used badblocks years ago to test my new disks before setting up mdraid on 
them, and with full disk tests on spinning rust taking (at the time) 
nearly a day a pass and four passes for the -w test, the multiple tests 
at once trick saved me quite a bit of time!

It's not a great idea to do the test on new SSDs as it's unnecessary 
wear, writing the entire device four times with different patterns each 
time for a -w, but it might be worthwhile to try it on an ssd you're just 
trying to salvage, forcing it to swap out any bad sectors it encounters 
in the process.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread ein
On 08/28/2018 12:51 AM, Cerem Cem ASLAN wrote:
> Hi,
>

Good morning.

>
> I'm not sure how to proceed at the moment. Taking succesfull backups
> made me think that everything might be okay but I'm not sure if I
> should continue trusting the drive or not. What additional checks
> should I perform?
> 

Can you please show also:

btrfs dev stats /path/to/the/mount/point


-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10


Re: Strange behavior (possible bugs) in btrfs

2018-08-29 Thread Filipe Manana
On Tue, Aug 28, 2018 at 9:35 PM Jayashree Mohan  wrote:
>
> Hi Filipe,
>
> This is to follow up the status of crash consistency bugs we reported
> on btrfs. We see that there has been a patch(not in the kernel yet)
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg77875.html)
> that resolves one of the reported bugs. However, the other bugs we
> reported still exist on the latest kernel (4.19-rc1), even with the
> submitted patch. Here is the list of other inconsistencies we
> reported, along with the workload to reproduce them :
> https://www.spinics.net/lists/linux-btrfs/msg77219.html
>
> We just wanted to ensure that resolving these are on your to-do list.
> Additionally, if there are more patches queued to address these
> issues, please let us know.

Hi,

I go through the issues as time allows. Not all of these are top
priorities for me right now.
Been working on a fix for some of them but they are not yet ready to
submit (need more testing, or cause other problems or are too
complex).
If suddenly there are people hitting any of these issues frequently
and causing trouble I'll give them higher priority.

Thanks.

>
> Thanks,
> Jayashree Mohan
>
> Thanks,
> Jayashree Mohan
>
>
>
> On Fri, May 11, 2018 at 10:45 AM Filipe Manana  wrote:
> >
> > On Mon, Apr 30, 2018 at 5:04 PM, Vijay Chidambaram  
> > wrote:
> > > Hi,
> > >
> > > We found two more cases where the btrfs behavior is a little strange.
> > > In one case, an fsync-ed file goes missing after a crash. In the
> > > other, a renamed file shows up in both directories after a crash.
> > >
> > > Workload 1:
> > >
> > > mkdir A
> > > mkdir B
> > > mkdir A/C
> > > creat B/foo
> > > fsync B/foo
> > > link B/foo A/C/foo
> > > fsync A
> > > -- crash --
> > >
> > > Expected state after recovery:
> > > B B/foo A A/C exist
> > >
> > > What we find:
> > > Only B B/foo exist
> > >
> > > A is lost even after explicit fsync to A.
> > >
> > > Workload 2:
> > >
> > > mkdir A
> > > mkdir A/C
> > > rename A/C B
> > > touch B/bar
> > > fsync B/bar
> > > rename B/bar A/bar
> > > rename A B (replacing B with A at this point)
> > > fsync B/bar
> > > -- crash --
> > >
> > > Expected contents after recovery:
> > > A/bar
> > >
> > > What we find after recovery:
> > > A/bar
> > > B/bar
> > >
> > > We think this breaks rename's atomicity guarantee. bar should be
> > > present in either A or B, but now it is present in both.
> >
> > I'll take a look at these, and all the other potential issues you
> > reported in other threads, next week and let you know.
> > Thanks.
> >
> > >
> > > Thanks,
> > > Vijay
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > > the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> >
> >
> > --
> > Filipe David Manana,
> >
> > “Whether you think you can, or you think you can't — you're right.”



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH 1/2] btrfs: Introduce support for FSID change without metadata rewrite

2018-08-29 Thread Nikolay Borisov
This field is going to be used when the user wants to change the UUID
of the filesystem without having to rewrite all metadata blocks. This
field adds another level of indirection such that when the FSID is
changed what really happens is the current uuid (the one with which the
fs was created) is copied to the 'metadata_uuid' field in the superblock
as well as a new incompat flag is set METADATA_UUID. When the kernel
detects this flag is set it knows that the superblock in fact has 2
uuids:

1. Is the UUID which is user-visible, currently known as FSID.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.

When the new incompat flag is present device scaning checks whether
both fsid/metadata_uuid of the scanned device match to any of the
registed filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.

Additionally a new metadata_uuid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metdata_uuid of the superblock
otherwise.

This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.c   |  4 +--
 fs/btrfs/ctree.h   | 12 +---
 fs/btrfs/disk-io.c | 32 +
 fs/btrfs/extent-tree.c |  2 +-
 fs/btrfs/volumes.c | 72 --
 fs/btrfs/volumes.h |  1 +
 include/uapi/linux/btrfs.h |  1 +
 7 files changed, 96 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 1124d236291d..0a6a519723c2 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -260,7 +260,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->fsid);
+   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
@@ -1069,7 +1069,7 @@ static noinline int __btrfs_cow_block(struct 
btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->fsid);
+   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
 
ret = update_ref_for_cow(trans, root, buf, cow, _ref);
if (ret) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4133de59aeee..57aae731d4a1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -203,7 +203,7 @@ struct btrfs_root_backup {
 struct btrfs_super_block {
u8 csum[BTRFS_CSUM_SIZE];
/* the first 4 fields must match struct btrfs_header */
-   u8 fsid[BTRFS_FSID_SIZE];/* FS specific uuid */
+   u8 fsid[BTRFS_FSID_SIZE];/* userfacing FS specific uuid */
__le64 bytenr; /* this block number */
__le64 flags;
 
@@ -240,8 +240,10 @@ struct btrfs_super_block {
__le64 cache_generation;
__le64 uuid_tree_generation;
 
+   u8 metadata_uuid[BTRFS_FSID_SIZE]; /* The uuid written into btree 
blocks */
+
/* future expansion */
-   __le64 reserved[30];
+   __le64 reserved[28];
u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 } __attribute__ ((__packed__));
@@ -271,7 +273,8 @@ struct btrfs_super_block {
 BTRFS_FEATURE_INCOMPAT_RAID56 |\
 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |   \
-BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+BTRFS_FEATURE_INCOMPAT_NO_HOLES|   \
+BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET\
(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
@@ -750,7 +753,8 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BALANCE_RUNNING   18
 
 struct btrfs_fs_info {
-   u8 fsid[BTRFS_FSID_SIZE];
+   u8 fsid[BTRFS_FSID_SIZE]; /* User-visible fs UUID */
+   u8 metadata_fsid[BTRFS_FSID_SIZE]; /* UUID written to btree blocks */
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
unsigned long flags;
struct btrfs_root *extent_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3611df2ce5c1..565db699afee 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -551,7 +551,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, 
struct page *page)
if (WARN_ON(!PageUptodate(page)))
return -EUCLEAN;
 
-   ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+   ASSERT(memcmp_extent_buffer(eb, fs_info->metadata_fsid,
btrfs_header_fsid(), BTRFS_FSID_SIZE) 

[PATCH 2/2] btrfs: Remove fsid/metadata_fsid fields from btrfs_info

2018-08-29 Thread Nikolay Borisov
Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/check-integrity.c   |  2 +-
 fs/btrfs/ctree.c |  5 +++--
 fs/btrfs/ctree.h |  2 --
 fs/btrfs/disk-io.c   | 21 -
 fs/btrfs/extent-tree.c   |  2 +-
 fs/btrfs/ioctl.c |  2 +-
 fs/btrfs/super.c |  2 +-
 fs/btrfs/volumes.c   | 10 --
 include/trace/events/btrfs.h |  2 +-
 9 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 833cf3c35b4d..4336de746f9c 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1720,7 +1720,7 @@ static int btrfsic_test_for_metadata(struct btrfsic_state 
*state,
num_pages = state->metablock_size >> PAGE_SHIFT;
h = (struct btrfs_header *)datav[0];
 
-   if (memcmp(h->fsid, fs_info->fsid, BTRFS_FSID_SIZE))
+   if (memcmp(h->fsid, fs_info->fs_devices->fsid, BTRFS_FSID_SIZE))
return 1;
 
for (i = 0; i < num_pages; i++) {
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0a6a519723c2..f00bfb6dd8f5 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -12,6 +12,7 @@
 #include "transaction.h"
 #include "print-tree.h"
 #include "locking.h"
+#include "volumes.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_path *path, int level);
@@ -260,7 +261,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
+   write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
@@ -1069,7 +1070,7 @@ static noinline int __btrfs_cow_block(struct 
btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
+   write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
 
ret = update_ref_for_cow(trans, root, buf, cow, _ref);
if (ret) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 57aae731d4a1..670820b28a5a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -753,8 +753,6 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BALANCE_RUNNING   18
 
 struct btrfs_fs_info {
-   u8 fsid[BTRFS_FSID_SIZE]; /* User-visible fs UUID */
-   u8 metadata_fsid[BTRFS_FSID_SIZE]; /* UUID written to btree blocks */
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
unsigned long flags;
struct btrfs_root *extent_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 565db699afee..adc5dd3ae318 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -551,7 +551,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, 
struct page *page)
if (WARN_ON(!PageUptodate(page)))
return -EUCLEAN;
 
-   ASSERT(memcmp_extent_buffer(eb, fs_info->metadata_fsid,
+   ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
 
return csum_tree_block(fs_info, eb, 0);
@@ -2491,11 +2491,12 @@ static int validate_super(struct btrfs_fs_info *fs_info,
ret = -EINVAL;
}
 
-   if (memcmp(fs_info->metadata_fsid, sb->dev_item.fsid,
+   if (memcmp(fs_info->fs_devices->metadata_uuid, sb->dev_item.fsid,
   BTRFS_FSID_SIZE) != 0) {
btrfs_err(fs_info,
   "dev_item UUID does not match metadata fsid: %pU != 
%pU",
-  fs_info->metadata_fsid, sb->dev_item.fsid);
+  fs_info->fs_devices->metadata_uuid,
+  sb->dev_item.fsid);
ret = -EINVAL;
}
 
@@ -2834,14 +2835,16 @@ int open_ctree(struct super_block *sb,
   sizeof(*fs_info->super_for_commit));
brelse(bh);
 
-   memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid,
+  BTRFS_FSID_SIZE));
+
if (btrfs_fs_incompat(fs_info, METADATA_UUID)) {
-   memcpy(fs_info->metadata_fsid,
-  fs_info->super_copy->metadata_uuid, BTRFS_FSID_SIZE);
-   } else {
-   memcpy(fs_info->metadata_fsid, fs_info->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(fs_info->fs_devices->metadata_uuid,
+ 

[PATCH 0/2] FSID change kernel support

2018-08-29 Thread Nikolay Borisov
Here is the kernel support for another level of indirection when dealing with 
FSID. I've taken inspiration from XFS so if people are familiar how FSID is 
changed there it's identical in BTRFS. In short, another field is introduced
which will record the original fsid (which is stamped into every metadata block
on-disk) if the user chooses to change it. This enables filesytems to have 
arbitrary user-visible UUIDS but immutable on-disk UUIDs. 

The first patch adjusts the kernel code to cope with this by adding an incompat
flag and introducing a new field to the on-disk super block. This means old
kernels will not be able to mount fsid-changed filesystems. Additionally, 
the device scanning logic is modified so that now devices are being 
matched for fsid && metadata_uuid when added to the system. This handles 
scenarios where two filesystems with identical on-disk metadata uuids (but 
different user-visible fsids) exist on the same machine. 

The second patch just removes some duplication of the fsid/metadata_uuid fields
since they were copied in both btrfs_fs_info and btrfs_fs_devices structs. 

Those patches survived multiple xfstest runs so I think they should be pretty 
solid. Please review and give them a spin. 

Nikolay Borisov (2):
  btrfs: Introduce support for FSID change without metadata rewrite
  btrfs: Remove fsid/metadata_fsid fields from btrfs_info

 fs/btrfs/check-integrity.c   |  2 +-
 fs/btrfs/ctree.c |  5 +--
 fs/btrfs/ctree.h | 10 +++---
 fs/btrfs/disk-io.c   | 39 ++-
 fs/btrfs/extent-tree.c   |  2 +-
 fs/btrfs/ioctl.c |  2 +-
 fs/btrfs/super.c |  2 +-
 fs/btrfs/volumes.c   | 74 ++--
 fs/btrfs/volumes.h   |  1 +
 include/trace/events/btrfs.h |  2 +-
 include/uapi/linux/btrfs.h   |  1 +
 11 files changed, 104 insertions(+), 36 deletions(-)

-- 
2.7.4



[PATCH 1/4] btrfs-progs: Add support for metadata_uuid field.

2018-08-29 Thread Nikolay Borisov
Add support for a new metadata_uuid field. This is just a preparatory
commit which switches all users of the fsid field for metdata comparison
purposes to utilize the new field. This more or less mirrors the
kernel patch, additionally:

 * Update 'btrfs inspect-internal dump-super' to account for the new
 field. This involes introducing the 'metadata_uuid' line to the
 output and updating the logic for comparing the fs uuid to the
 dev_item uuid.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c   |  3 +++
 check/main.c  |  2 +-
 chunk-recover.c   | 16 ++---
 cmds-filesystem.c |  2 ++
 cmds-inspect-dump-super.c | 22 ++---
 convert/common.c  |  2 ++
 ctree.c   | 14 +--
 ctree.h   |  8 +--
 disk-io.c | 60 ++-
 image/main.c  | 25 +---
 volumes.c | 34 +--
 volumes.h |  1 +
 12 files changed, 146 insertions(+), 43 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index 889b931c55d8..ea04159a81ce 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -248,6 +248,9 @@ static int change_fsid_prepare(struct btrfs_fs_info 
*fs_info)
if (ret < 0)
return ret;
 
+   /* Also need to change the metadatauuid of the fs info */
+   memcpy(fs_info->metadata_uuid, fs_info->new_fsid, BTRFS_FSID_SIZE);
+
/* also restore new chunk_tree_id into tree_root for restore */
write_extent_buffer(tree_root->node, fs_info->new_chunk_tree_uuid,
btrfs_header_chunk_tree_uuid(tree_root->node),
diff --git a/check/main.c b/check/main.c
index bc2ee22f7943..0790264190f2 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8418,7 +8418,7 @@ static int btrfs_fsck_reinit_root(struct 
btrfs_trans_handle *trans,
btrfs_set_header_backref_rev(c, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(c, root->root_key.objectid);
 
-   write_extent_buffer(c, root->fs_info->fsid,
+   write_extent_buffer(c, root->fs_info->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(c, root->fs_info->chunk_tree_uuid,
diff --git a/chunk-recover.c b/chunk-recover.c
index 1d30db51d8ed..31325bfc54e0 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -759,7 +759,7 @@ static int scan_one_device(void *dev_scan_struct)
rc->nodesize)
break;
 
-   if (memcmp_extent_buffer(buf, rc->fs_devices->fsid,
+   if (memcmp_extent_buffer(buf, rc->fs_devices->metadata_uuid,
 btrfs_header_fsid(),
 BTRFS_FSID_SIZE)) {
bytenr += rc->sectorsize;
@@ -1155,7 +1155,7 @@ static int __rebuild_chunk_root(struct btrfs_trans_handle 
*trans,
btrfs_set_header_level(cow, 0);
btrfs_set_header_backref_rev(cow, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(cow, BTRFS_CHUNK_TREE_OBJECTID);
-   write_extent_buffer(cow, root->fs_info->fsid,
+   write_extent_buffer(cow, root->fs_info->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(cow, root->fs_info->chunk_tree_uuid,
@@ -1192,7 +1192,8 @@ static int __rebuild_device_items(struct 
btrfs_trans_handle *trans,
btrfs_set_stack_device_io_width(dev_item, dev->io_width);
btrfs_set_stack_device_sector_size(dev_item, dev->sector_size);
memcpy(dev_item->uuid, dev->uuid, BTRFS_UUID_SIZE);
-   memcpy(dev_item->fsid, dev->fs_devices->fsid, BTRFS_UUID_SIZE);
+   memcpy(dev_item->fsid, dev->fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE);
 
ret = btrfs_insert_item(trans, root, ,
dev_item, sizeof(*dev_item));
@@ -1432,6 +1433,7 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
struct btrfs_fs_info *fs_info;
struct btrfs_super_block *disk_super;
struct extent_buffer *eb;
+   u64 features;
int ret;
 
fs_info = btrfs_new_fs_info(1, BTRFS_SUPER_INFO_OFFSET);
@@ -1464,6 +1466,14 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
if (ret)
goto out_devices;
 
+   features = btrfs_super_incompat_flags(disk_super);
+
+   if (features & BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+   memcpy(fs_info->metadata_uuid, disk_super->metadata_uuid,
+  BTRFS_FSID_SIZE);
+   else
+   memcpy(fs_info->metadata_uuid, fs_info->fsid, BTRFS_FSID_SIZE);
+
btrfs_setup_root(fs_info->chunk_root, fs_info,
 BTRFS_CHUNK_TREE_OBJECTID);
 
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311bdfe3..10946b31976f 100644
--- 

[PATCH 2/4] btrfstune: Add support for changing the user uuid

2018-08-29 Thread Nikolay Borisov
This allows us to change the use-visible UUID on filesytems from
userspace if desired, by copying the existing UUID to the new location
for metadata comparisons. If this is done, an incompat flag must be
set to prevent older filesystems from mounting the filesystem, but
the original UUID can be restored, and the incompat flag removed.

This introduces the new -m|-M UUID options similar to current
-u|-U UUID ones with the difference that we don't rewrite the fsid but
just copy the old uuid and set a new one. Additionally running with
[-M old-uuid] clears the incompat flag and retains only fsid on-disk.

Additionally it's not allowed to intermix -m/-u/-U/-M options in a
single invocation of btrfstune, nor is it allowed to change the uuid
while there is a uuid rewrite in-progress. Also changing the uuid of a
seed device is not currently allowed (can change in the future).

Example:

btrfstune -m /dev/loop1
btrfs inspect-internal dump-super /dev/loop1

superblock: bytenr=65536, device=/dev/loop1
-
csum_type   0 (crc32c)
csum_size   4
csum0x4b7ea749 [match]



fsid0efc41d3-4451-49f3-8108-7b8bdbcf5ae8
metadata_uuid   352715e7-62cf-4ae0-92ee-85a574adc318



incompat_flags  0x541
( MIXED_BACKREF |
  EXTENDED_IREF |
  SKINNY_METADATA |
  METADATA_UUID )



dev_item.uuid   0610deee-dfc3-498b-9449-a06533cdec98
dev_item.fsid   352715e7-62cf-4ae0-92ee-85a574adc318 [match]



mount /dev/loop1 btrfs-mnt/
btrfs fi show btrfs-mnt/

Label: none  uuid: 0efc41d3-4451-49f3-8108-7b8bdbcf5ae8
Total devices 1 FS bytes used 128.00KiB
devid1 size 5.00GiB used 536.00MiB path /dev/loop1

In this case a new btrfs filesystem was created and the original uuid
was 352715e7-62cf-4ae0-92ee-85a574adc318, then btrfstune was run which
copied that value over to metadata_uuid field and set the current fsid
to 0efc41d3-4451-49f3-8108-7b8bdbcf5ae8. And as far as userspace is
concerned this is the fsid of the fs.

Signed-off-by: Nikolay Borisov 
---
 btrfstune.c | 168 +++-
 1 file changed, 143 insertions(+), 25 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index ea04159a81ce..697285ef3bf6 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -73,6 +73,101 @@ static int update_seeding_flag(struct btrfs_root *root, int 
set_flag)
return ret;
 }
 
+/*
+ * Return 0 for no unfinished fsid change.
+ * Return >0 for unfinished fsid change, and restore unfinished fsid/
+ * chunk_tree_id into fsid_ret/chunk_id_ret.
+ */
+static int check_unfinished_fsid_change(struct btrfs_fs_info *fs_info,
+   uuid_t fsid_ret, uuid_t chunk_id_ret)
+{
+   struct btrfs_root *tree_root = fs_info->tree_root;
+   u64 flags = btrfs_super_flags(fs_info->super_copy);
+
+   if (flags & BTRFS_SUPER_FLAG_CHANGING_FSID) {
+   memcpy(fsid_ret, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
+   read_extent_buffer(tree_root->node, chunk_id_ret,
+   btrfs_header_chunk_tree_uuid(tree_root->node),
+   BTRFS_UUID_SIZE);
+   return 1;
+   }
+   return 0;
+}
+
+static int set_metadata_uuid(struct btrfs_root *root, const char *uuid_string)
+{
+   struct btrfs_super_block *disk_super;
+   uuid_t new_fsid, unused1, unused2;
+   struct btrfs_trans_handle *trans;
+   bool new_uuid = true;
+   u64 incompat_flags;
+   bool uuid_changed;
+   u64 super_flags;
+   int ret;
+
+   disk_super = root->fs_info->super_copy;
+   super_flags = btrfs_super_flags(disk_super);
+   incompat_flags = btrfs_super_incompat_flags(disk_super);
+   uuid_changed = incompat_flags & BTRFS_FEATURE_INCOMPAT_METADATA_UUID;
+
+   if (super_flags & BTRFS_SUPER_FLAG_SEEDING) {
+   fprintf(stderr, "Cannot set metadata UUID on a seed device\n");
+   return 1;
+   }
+
+   if (check_unfinished_fsid_change(root->fs_info, unused1, unused2)) {
+   fprintf(stderr, "UUID rewrite in progress, cannot change "
+   "fsid\n");
+   return 1;
+   }
+
+   if (uuid_string)
+   uuid_parse(uuid_string, new_fsid);
+   else
+   uuid_generate(new_fsid);
+
+   new_uuid = (memcmp(new_fsid, disk_super->fsid, BTRFS_FSID_SIZE) != 0);
+
+   if (new_uuid && uuid_changed && memcmp(disk_super->metadata_uuid,
+  new_fsid, BTRFS_FSID_SIZE) == 0) 
{
+   /*
+* Changing fsid to be the same as metadata uuid, so just
+* disable the flag
+*/
+   memcpy(disk_super->fsid, _fsid, BTRFS_FSID_SIZE);
+   

[PATCH 4/4] btrfs-progs: Remove fsid/metdata_uuid fields from fs_info

2018-08-29 Thread Nikolay Borisov
Signed-off-by: Nikolay Borisov 
---
 btrfstune.c |  5 +++--
 check/main.c|  2 +-
 chunk-recover.c | 11 +--
 ctree.c | 11 ++-
 ctree.h |  2 --
 disk-io.c   | 18 +-
 volumes.c   |  5 +++--
 7 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/btrfstune.c b/btrfstune.c
index 697285ef3bf6..98e366a3b3ce 100644
--- a/btrfstune.c
+++ b/btrfstune.c
@@ -344,7 +344,8 @@ static int change_fsid_prepare(struct btrfs_fs_info 
*fs_info)
return ret;
 
/* Also need to change the metadatauuid of the fs info */
-   memcpy(fs_info->metadata_uuid, fs_info->new_fsid, BTRFS_FSID_SIZE);
+   memcpy(fs_info->fs_devices->metadata_uuid, fs_info->new_fsid,
+  BTRFS_FSID_SIZE);
 
/* also restore new chunk_tree_id into tree_root for restore */
write_extent_buffer(tree_root->node, fs_info->new_chunk_tree_uuid,
@@ -401,7 +402,7 @@ static int change_uuid(struct btrfs_fs_info *fs_info, const 
char *new_fsid_str)
fs_info->new_fsid = new_fsid;
fs_info->new_chunk_tree_uuid = new_chunk_id;
 
-   memcpy(old_fsid, (const char*)fs_info->fsid, BTRFS_UUID_SIZE);
+   memcpy(old_fsid, (const char*)fs_info->fs_devices->fsid, 
BTRFS_UUID_SIZE);
uuid_unparse(old_fsid, uuid_buf);
printf("Current fsid: %s\n", uuid_buf);
 
diff --git a/check/main.c b/check/main.c
index 0790264190f2..0aede2742dcf 100644
--- a/check/main.c
+++ b/check/main.c
@@ -8418,7 +8418,7 @@ static int btrfs_fsck_reinit_root(struct 
btrfs_trans_handle *trans,
btrfs_set_header_backref_rev(c, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(c, root->root_key.objectid);
 
-   write_extent_buffer(c, root->fs_info->metadata_uuid,
+   write_extent_buffer(c, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(c, root->fs_info->chunk_tree_uuid,
diff --git a/chunk-recover.c b/chunk-recover.c
index 31325bfc54e0..959c169f79a4 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1155,7 +1155,7 @@ static int __rebuild_chunk_root(struct btrfs_trans_handle 
*trans,
btrfs_set_header_level(cow, 0);
btrfs_set_header_backref_rev(cow, BTRFS_MIXED_BACKREF_REV);
btrfs_set_header_owner(cow, BTRFS_CHUNK_TREE_OBJECTID);
-   write_extent_buffer(cow, root->fs_info->metadata_uuid,
+   write_extent_buffer(cow, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
write_extent_buffer(cow, root->fs_info->chunk_tree_uuid,
@@ -1457,7 +1457,7 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
goto out_devices;
}
 
-   memcpy(fs_info->fsid, _super->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(disk_super->fsid, rc->fs_devices->fsid, 
BTRFS_FSID_SIZE));
fs_info->sectorsize = btrfs_super_sectorsize(disk_super);
fs_info->nodesize = btrfs_super_nodesize(disk_super);
fs_info->stripesize = btrfs_super_stripesize(disk_super);
@@ -1469,10 +1469,9 @@ open_ctree_with_broken_chunk(struct recover_control *rc)
features = btrfs_super_incompat_flags(disk_super);
 
if (features & BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
-   memcpy(fs_info->metadata_uuid, disk_super->metadata_uuid,
-  BTRFS_FSID_SIZE);
-   else
-   memcpy(fs_info->metadata_uuid, fs_info->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(disk_super->metadata_uuid,
+  fs_info->fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE));
 
btrfs_setup_root(fs_info->chunk_root, fs_info,
 BTRFS_CHUNK_TREE_OBJECTID);
diff --git a/ctree.c b/ctree.c
index 883c2ae1861d..c79837e0206f 100644
--- a/ctree.c
+++ b/ctree.c
@@ -23,6 +23,7 @@
 #include "internal.h"
 #include "sizes.h"
 #include "messages.h"
+#include "volumes.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_path *path, int level);
@@ -134,7 +135,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer(cow, root->fs_info->metadata_uuid,
+   write_extent_buffer(cow, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
@@ -308,7 +309,7 @@ int __btrfs_cow_block(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer(cow, root->fs_info->metadata_uuid,
+   write_extent_buffer(cow, root->fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE);
 
WARN_ON(!(buf->flags & EXTENT_BAD_TRANSID) &&
@@ -1474,7 

[PATCH 3/4] btrfs-progs: tests: Add tests for changing fsid feature

2018-08-29 Thread Nikolay Borisov
Add a bunch of tests exercising the new btrfstune functionality. In
particular check that various restrictions are implemented correctly,
test that btrfs-image works as expected and also test the output of
btrfs inspect-internal dump-super is correct.

Signed-off-by: Nikolay Borisov 
---
 tests/misc-tests/033-metadata-uuid/test.sh | 137 +
 1 file changed, 137 insertions(+)
 create mode 100755 tests/misc-tests/033-metadata-uuid/test.sh

diff --git a/tests/misc-tests/033-metadata-uuid/test.sh 
b/tests/misc-tests/033-metadata-uuid/test.sh
new file mode 100755
index ..b08220c257c8
--- /dev/null
+++ b/tests/misc-tests/033-metadata-uuid/test.sh
@@ -0,0 +1,137 @@
+#!/bin/bash
+
+source "$TEST_TOP/common"
+
+check_prereq mkfs.btrfs
+check_prereq btrfs
+check_prereq btrfstune
+check_prereq btrfs-image
+
+setup_root_helper
+
+prepare_test_dev
+
+function read_fsid {
+   local dev="$1"
+   echo $(run_check_stdout $SUDO_HELPER "$TOP/btrfs" inspect-internal \
+   dump-super "$dev" | awk '/fsid/ {print $2}' | head -n 1)
+}
+
+function read_metadata_uuid {
+   local dev="$1"
+   echo $(run_check_stdout $SUDO_HELPER "$TOP/btrfs" inspect-internal \
+   dump-super "$dev" | awk '/metadata_uuid/ {print $2}')
+}
+
+function check_btrfstune {
+   local fsid
+   echo "Checking btrfstune logic" >> "$RESULTS"
+   #Test with random uuid
+   run_check $SUDO_HELPER "$TOP/btrfstune" -m "$TEST_DEV"
+
+   #check that specific uuid can set
+   run_check $SUDO_HELPER "$TOP/btrfstune" -M 
d88c8333-a652-4476-b225-2e9284eb59f1 "$TEST_DEV"
+
+   #test that having seed on already changed device doesn't work 
+   run_mustfail "Managed to set seed on metadata uuid fs" \
+   $SUDO_HELPER "$TOP/btrfstune" -S 1 "$TEST_DEV"
+
+   #test that setting both seed and -m|M is forbidden 
+   run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f "$TEST_DEV"
+   run_mustfail "Succeeded setting seed and changing fs uuid" \
+   $SUDO_HELPER "$TOP/btrfstune" -S 1 -m "$TEST_DEV"
+
+   #test that having -m|-M on seed device is forbidden
+   run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f "$TEST_DEV"
+   run_check $SUDO_HELPER "$TOP/btrfstune" -S 1 "$TEST_DEV" 
+   run_mustfail "Succeded changing fsid on a seed device" $SUDO_HELPER 
"$TOP/btrfstune" -m "$TEST_DEV"
+
+   #test that using -U|-u on an fs with METADATA_UUID flag is forbidden
+   run_check $SUDO_HELPER "$TOP/mkfs.btrfs" -f "$TEST_DEV"
+   run_check $SUDO_HELPER "$TOP/btrfstune" -m "$TEST_DEV"
+   run_mustfail "Succeeded triggering FSID rewrite while METADATA_UUID is 
active" \
+   $SUDO_HELPER "$TOP/btrfstune" -u  "$TEST_DEV"
+
+}
+
+function check_dump_super_output {
+   local fsid
+   local metadata_uuid
+   local dev_item_match
+   local old_metadata_uuid
+
+   echo "Checking dump-super output" >> "$RESULTS"
+   #Assert that metadata/fsid match on non-changed fs
+   fsid=$(read_fsid "$TEST_DEV")
+   metadata_uuid=$(read_metadata_uuid "$TEST_DEV")
+   [ "$fsid" = "$metadata_uuid" ] || _fail "fsid ("$fsid") doesn't match 
metadata_uuid ("$metadata_uuid")"
+
+   dev_item_match=$(run_check_stdout $SUDO_HELPER "$TOP/btrfs" 
inspect-internal dump-super \
+   "$TEST_DEV" | awk '/dev_item.fsid/ {print $3}')
+
+   [ $dev_item_match = "[match]" ] || _fail "dev_item.fsid doesn't match 
on non-metadata uuid fs"
+
+
+   echo "Checking output after fsid change" >> "$RESULTS"
+   #change metadatauuid and ensure everything in the output is still 
correct 
+   old_metadata_uuid=$metadata_uuid
+   run_check $SUDO_HELPER "$TOP/btrfstune" -M 
d88c8333-a652-4476-b225-2e9284eb59f1 "$TEST_DEV"
+   fsid=$(read_fsid "$TEST_DEV")
+   metadata_uuid=$(read_metadata_uuid "$TEST_DEV")
+   dev_item_match=$(run_check_stdout $SUDO_HELPER "$TOP/btrfs" \
+   inspect-internal dump-super "$TEST_DEV" | awk '/dev_item.fsid/ 
{print $3}') 
+   

+[ "$dev_item_match" = "[match]" ] || _fail "dev_item.fsid doesn't match on 
metadata uuid fs"
+   [ "$fsid" = "d88c8333-a652-4476-b225-2e9284eb59f1" ] || _fail 
"btrfstune metadata UUID change failed"
+   [ "$old_metadata_uuid" = "$metadata_uuid" ] || _fail "Metadata uuid 
change unexpectedly"
+
+   echo "Checking for incompat textual representation" >> "$RESULTS"
+   #check for textual output of the new incompat feature
+   run_check_stdout $SUDO_HELPER "$TOP/btrfs" inspect-internal dump-super \
+   "$TEST_DEV" | grep -q METADATA_UUID
+   [ $? -eq 0 ] || _fail "Didn't find textual representation of 
METADATA_UUID feature"
+
+   echo "Checking setting fsid back to original" >> "$RESULTS"
+   #ensure that  setting the fsid back to the original works
+   run_check 

[PATCH 0/4] Userspace support for FSID change

2018-08-29 Thread Nikolay Borisov
Here is the userspace tooling support for utilising the new metadata_uuid 
field, 
enabling the change of fsid without having to rewrite every metadata block. This
patchset consists of adding support for the new field to various tools and 
files (Patch 1). The actual implementation of the new -m|-M options (which are 
described in more detail in Patch 2). A new misc-tests testcasei (Patch 3) which
exercises the new options and verifies certain invariants hold (these are also
described in Patch2). Patch 4 is more or less copy of the kernel conuterpart 
just reducing some duplication between btrfs_fs_info and btrfs_fs_devices 
structures. 

The intended usecase of this feature is to give the sysadmin the ability to 
create copies of filesystesm, change their uuid quickly and mount them alongside
the original filesystem for, say, forensic purposes. 

One thing which still hasn't been set in stone is whether the new options 
will remain as -m|-M or whether they should subsume the current -u|-U - from 
the point of view of users nothing should change. So this is something which 
I'd like to hear from the community. Of course the alternative of rewriting 
the metadata blocks will be assigne new options - perhaps -m|M ?

I've tested this with multiple xfstest runs with the new tools installed as 
well as running btrfs-progs test and have observed no regressions. 

Nikolay Borisov (4):
  btrfs-progs: Add support for metadata_uuid field.
  btrfstune: Add support for changing the user uuid
  btrfs-progs: tests: Add tests for changing fsid feature
  btrfs-progs: Remove fsid/metdata_uuid fields from fs_info

 btrfstune.c| 174 -
 check/main.c   |   2 +-
 chunk-recover.c|  17 ++-
 cmds-filesystem.c  |   2 +
 cmds-inspect-dump-super.c  |  22 +++-
 convert/common.c   |   2 +
 ctree.c|  15 +--
 ctree.h|   8 +-
 disk-io.c  |  62 --
 image/main.c   |  25 +++--
 tests/misc-tests/033-metadata-uuid/test.sh | 137 +++
 volumes.c  |  37 --
 volumes.h  |   1 +
 13 files changed, 431 insertions(+), 73 deletions(-)
 create mode 100755 tests/misc-tests/033-metadata-uuid/test.sh

-- 
2.7.4



Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread Cerem Cem ASLAN
Chris Murphy , 29 Ağu 2018 Çar, 02:58
tarihinde şunu yazdı:
>
> On Tue, Aug 28, 2018 at 5:04 PM, Cerem Cem ASLAN  
> wrote:
> > What I want to achive is that I want to add the problematic disk as
> > raid1 and see how/when it fails and how BTRFS recovers these fails.
> > While the party goes on, the main system shouldn't be interrupted
> > since this is a production system. For example, I would never expect
> > to be ended up with such a readonly state while trying to add a disk
> > with "unknown health" to the system. Was it somewhat expected?
>
> I don't know. I also can't tell you how LVM or mdraid behave in the
> same situation either though. For sure I've come across bug reports
> where underlying devices go read only and the file system falls over
> totally and developers shrug and say they can't do anything.
>
> This situation is a little different and difficult. You're starting
> out with a one drive setup so the profile is single/DUP or
> single/single, and that doesn't change when adding. So the 2nd drive
> is actually *mandatory* for a brief period of time before you've made
> it raid1 or higher. It's a developer question what is the design, and
> if this is a bug: maybe the device being added should be written to
> with placeholder supers or even just zeros in all the places for 'dev
> add' metadata, and only if that succeeds, to then write real updated
> supers to all devices. It's possible the 'dev add' presently writes
> updated supers to all devices at the same time, and has a brief period
> where the state is fragile and if it fails, it goes read only to
> prevent damaging the file system.

Thinking again, this is totally acceptable. If the requirement was a
good health disk, then I think I must check the disk health by myself.
I may believe that the disk is in a good state, or make a quick test
or make some very detailed tests to be sure.

Likewise, ending up with readonly state is not the end of the world,
even over SSH, because system still functions and all I need to do is
a reboot in the worst case. That's also acceptable *while adding a new
disk*.

>
> Anyway, without a call trace, no idea why it ended up read only. So I
> have to speculate.
>

I may try adding the disk again any time and provide any requested
logs, it is still attached to the server. I'm only not sure if this is
a useful experiment from the point of view of the rest of the people.

>
> >
> > Although we know that disk is about to fail, it still survives.
>
> That's very tenuous rationalization, a drive that rejects even a
> single write is considered failed by the md driver. Btrfs is still
> very tolerant of this, so if it had successfully added and you were
> running in production, you should expect to see thousands of write
> errors dumped to the kernel log

That's exactly what I expected :)

because Btrfs never ejects a bad drive
> still. It keeps trying. And keeps reporting the failures. And all
> those errors being logged can end up causing more write demand if the
> logs are on the same volume as the failing device, even more errors to
> record, and you get an escalating situation with heavy log writing.
>

Good to point this. Maybe I should arrange an on-ram virtual machine
that writes back to local disk if no hardware errors found and start
sending logs in a different server *if* such a hardware failure
occurs.

>
> > Shouldn't we expect in such a scenario that when system tries to read
> > or write some data from/to that BROKEN_DISK and when it recognizes it
> > failed, it will try to recover the part of the data from GOOD_DISK and
> > try to store that recovered data in some other part of the
> > BROKEN_DISK?
>
> Nope. Btrfs can only write supers to fixed locations on the drive,
> same as any other file system. Btrfs metadata could possibly go
> elsewhere because it doesn't have fixed locations, but Btrfs doesn't
> do bad sector tracking. So once it decides metadata goes in location
> X, if X reports a write error it will not try to write elsewhere and
> insofar as I'm aware ext4 and XFS and LVM and md don't either; md does
> have an optional bad block map it will use for tracking bad sectors
> and remap to known good sectors. Normally the drive firmware should do
> this and when that fails the drive is considered toast for production
> purpose

That's also plausible. Thinking again (again? :) if BTRFS would behave
as I expected, that retries might never end if the disk is in a very
bad situation and that would add very intensive IO load on a
production system.

I think in such a situation I should remove the raid device, try to
reformat it and attach it again.

>
> >Or did I misunderstood the whole thing?
>
> Well in a way this is sorta user sabotage. It's a valid test and I'd
> say ideally things should fail safely, rather than fall over. But at
> the same time it's not wrong for developers to say: "look if you add a
> bad device there's a decent chance we're going face plant and go read
> only to avoid 

Re: [LKP] [lkp-robot] [mm] 9092c71bb7: blogbench.write_score -12.3% regression

2018-08-29 Thread Huang, Ying
"Huang, Ying"  writes:

> Josef Bacik  writes:
>
>> On Thu, Aug 02, 2018 at 01:55:23PM +0800, Huang, Ying wrote:
>>> "Huang, Ying"  writes:
>>> 
>>> > Hi, Chris,
>>> >
>>> > Chris Mason  writes:
>>> >
>>> >> On 19 Jun 2018, at 23:51, Huang, Ying wrote:
>>> > "Huang, Ying"  writes:
>>> >
>>> >> Hi, Josef,
>>> >>
>>> >> Do you have time to take a look at the regression?
>>> >>
>>> >> kernel test robot  writes:
>>> >>
>>> >>> Greeting,
>>> >>>
>>> >>> FYI, we noticed a -12.3% regression of blogbench.write_score and
>>> >>> a +9.6% improvement
>>> >>> of blogbench.read_score due to commit:
>>> >>>
>>> >>>
>>> >>> commit: 9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use
>>> >>> sc->priority for slab shrink targets")
>>> >>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git
>>> >>> master
>>> >>>
>>> >>> in testcase: blogbench
>>> >>> on test machine: 16 threads Intel(R) Xeon(R) CPU D-1541 @
>>> >>> 2.10GHz with 8G memory
>>> >>> with following parameters:
>>> >>>
>>> >>> disk: 1SSD
>>> >>> fs: btrfs
>>> >>> cpufreq_governor: performance
>>> >>>
>>> >>> test-description: Blogbench is a portable filesystem benchmark
>>> >>> that tries to reproduce the load of a real-world busy file
>>> >>> server.
>>> >>> test-url:
>>> >>
>>> >> I'm surprised, this patch is a big win in production here at FB.  I'll
>>> >> have to reproduce these results to better understand what is going on.
>>> >> My first guess is that since we have fewer inodes in slab, we're
>>> >> reading more inodes from disk in order to do the writes.
>>> >>
>>> >> But that should also make our read scores lower.
>>> >
>>> > Any update on this?
>>> 
>>> Ping.
>>> 
>>
>> I can't reproduce this, and what's more it appears that blogbench doesn't use
>> much memory at all.  I have the slab shrinking tracepoints on and we never go
>> into this code at all, so I'm pretty sure these results are bogus.  How are 
>> you
>> running blogbench?  I'm doing blogbench -d /whatever, if I need to be doing
>> something else let me know.  But from what I can tell this thing uses less 
>> than
>> 100m of memory, and on an 8gig of ram box we're never going to trip over this
>> code.  Thanks,
>
> Thanks for looking at this.  In my testing, blogbench will eat up system
> memory.  Please check the vmstat result attached.  The SSD disk size is
> about 745GB.

Hi, Josef,

Do you need more information?

Best Regards,
Huang, Ying

> Best Regards,
> Huang, Ying