Re: Where is my disk space ?

2018-10-30 Thread Chris Murphy
Also, since you don't have any snapshots, you could find this
conventionally:

# du -sh /*


Chris Murphy


Re: Where is my disk space ?

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 4:44 PM, Barbet Alain  wrote:
> Thanks for the answer!
> alian@alian:~>  sudo btrfs sub list -ta /
> [sudo] root password:
> ID  gen top level   path
> --  --- -   
> 257 79379   5   /@
> 258 79386   257 @/var
> 259 79000   257 @/usr/local
> 260 79376   257 @/tmp
> 261 79001   257 @/srv
> 262 79062   257 @/root
> 263 79001   257 @/opt
> 264 78898   257 @/boot/grub2/x86_64-efi
> 265 78933   257 @/boot/grub2/i386-pc
>
> Yes, it's openSUSE, but I don't see any snapper config enabled.
> For the record, I used Docker, which filled my disk; I removed the
> subvolumes, but it looks like something is still missing somewhere.

Try

mount -o subvolid=5 <dev> /mnt
cd /mnt
btrfs fi du -s *

Maybe that will help reveal where it's hiding. It's possible btrfs fi
du does not cross bind mounts. I know the Total column does include
amounts in nested subvolumes.



-- 
Chris Murphy


Re: Salvage files from broken btrfs

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 4:11 PM, Mirko Klingmann  wrote:
> Hi all,
>
> my btrfs root file system on a SD card broke down and did not mount anymore.

It might mount with -o ro,nologreplay

Typically an SD card will break in a way that it can't write, and
mount will just hang (with mmcblk errors). Mounting with both ro and
nologreplay will ensure no writes are needed, allowing the mount to
succeed. Of course, any changes that are in the log tree will be
missing, so recent transactions may be unrecoverable, but so far I've
had good luck recovering from broken SD cards this way.




-- 
Chris Murphy


[PATCH] Btrfs: fix missing delayed iputs on unmount

2018-10-30 Thread Omar Sandoval
From: Omar Sandoval 

There's a race between close_ctree() and cleaner_kthread().
close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
sees it set, but this is racy; the cleaner might have already checked
the bit and could be cleaning stuff. In particular, if it deletes unused
block groups, it will create delayed iputs for the free space cache
inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
longer running delayed iputs after a commit. Therefore, if the cleaner
creates more delayed iputs after delayed iputs are run in
btrfs_commit_super(), we will leak inodes on unmount and get a busy
inode crash from the VFS.

Fix it by parking the cleaner before we actually close anything. Then,
any remaining delayed iputs will always be handled in
btrfs_commit_super(). This also ensures that the commit in close_ctree()
is really the last commit, so we can get rid of the commit in
cleaner_kthread().

Fixes: 30928e9baac2 ("btrfs: don't run delayed_iputs in commit")
Signed-off-by: Omar Sandoval 
---
We found this with a stress test that our containers team runs. I'm
wondering if this same race could have caused any other issues other
than this new iput thing, but I couldn't identify any.

 fs/btrfs/disk-io.c | 40 +++-
 1 file changed, 7 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..7c17284ae3c2 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1664,9 +1664,8 @@ static int cleaner_kthread(void *arg)
struct btrfs_root *root = arg;
struct btrfs_fs_info *fs_info = root->fs_info;
int again;
-   struct btrfs_trans_handle *trans;
 
-   do {
+   while (1) {
again = 0;
 
/* Make the cleaner go to sleep early. */
@@ -1715,42 +1714,16 @@ static int cleaner_kthread(void *arg)
 */
btrfs_delete_unused_bgs(fs_info);
 sleep:
+   if (kthread_should_park())
+   kthread_parkme();
+   if (kthread_should_stop())
+   return 0;
if (!again) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (!kthread_should_stop())
-   schedule();
+   schedule();
__set_current_state(TASK_RUNNING);
}
-   } while (!kthread_should_stop());
-
-   /*
-* Transaction kthread is stopped before us and wakes us up.
-* However we might have started a new transaction and COWed some
-* tree blocks when deleting unused block groups for example. So
-* make sure we commit the transaction we started to have a clean
-* shutdown when evicting the btree inode - if it has dirty pages
-* when we do the final iput() on it, eviction will trigger a
-* writeback for it which will fail with null pointer dereferences
-* since work queues and other resources were already released and
-* destroyed by the time the iput/eviction/writeback is made.
-*/
-   trans = btrfs_attach_transaction(root);
-   if (IS_ERR(trans)) {
-   if (PTR_ERR(trans) != -ENOENT)
-   btrfs_err(fs_info,
- "cleaner transaction attach returned %ld",
- PTR_ERR(trans));
-   } else {
-   int ret;
-
-   ret = btrfs_commit_transaction(trans);
-   if (ret)
-   btrfs_err(fs_info,
- "cleaner open transaction commit returned %d",
- ret);
}
-
-   return 0;
 }
 
 static int transaction_kthread(void *arg)
@@ -3931,6 +3904,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
int ret;
 
set_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
+   kthread_park(fs_info->cleaner_kthread);
 
/* wait for the qgroup rescan worker to stop */
btrfs_qgroup_wait_for_completion(fs_info, false);
-- 
2.19.1
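
[Editorial sketch, not part of the patch] The idea of parking a worker before the final cleanup can be illustrated in userspace. The following is only an analogy written with Python threads: the class and method names are made up for illustration and are not the kernel kthread API. It shows that once park() has returned, the worker can no longer queue new items, so a single final drain is enough.

```python
import threading
import time


class ParkableWorker:
    """Userspace analogy of a parkable worker thread (hypothetical names)."""

    def __init__(self):
        self.delayed = []                      # stands in for delayed iputs
        self._lock = threading.Lock()
        self._park_requested = threading.Event()
        self._parked = threading.Event()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            if self._park_requested.is_set():
                self._parked.set()             # acknowledge: no more work from now on
                self._stop.wait()              # stay parked until told to stop
                return
            with self._lock:
                self.delayed.append("iput")    # work that creates cleanup items
            time.sleep(0.001)

    def park(self):
        """Return only after the worker has acknowledged that it is parked."""
        self._park_requested.set()
        self._parked.wait()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def drain(self):
        """Process and count everything queued so far."""
        with self._lock:
            count = len(self.delayed)
            self.delayed.clear()
        return count


worker = ParkableWorker()
worker.start()
time.sleep(0.01)
worker.park()                 # plays the role of parking the cleaner first
worker.drain()                # the final cleanup pass
time.sleep(0.01)
leftover = worker.drain()     # nothing can have been queued after park()
worker.stop()
print("items queued after the final drain:", leftover)
```

Checking a "closing" flag alone would not give this guarantee, because the worker may already be past the check when the flag is set; waiting for an explicit acknowledgement closes that window.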



Re: Salvage files from broken btrfs

2018-10-30 Thread Qu Wenruo


On 2018/10/31 4:11 AM, Mirko Klingmann wrote:
> Hi all,
> 
> my btrfs root file system on a SD card broke down and did not mount anymore.
> 
> In retrospect, I think it reached its endurance limit, so I know that there
> is nothing to repair. All I want to do is to salvage some configuration
> and data files from the remains left in my ISO file copy. The SD card is
> no longer readable, so all I have is the 30GB "dd" copy of the btrfs
> partition.
> 
> I also tried some things on the ISO file I later found I shouldn't have
> done with the "btrfs" tools, which I think broke the file system in it
> even more.

Not exactly.

For your case, your best friend would be btrfs-restore plus some way to
recover the chunk tree,
unless you want to do all the salvage manually.

> 
> So at this stage, this is the "dmesg" output when trying to mount the
> ISO file, which then fails:
> 
> [  249.239883] BTRFS: device fsid 4235aa4f-7721-4e73-97f0-7fe0e9a3ce9c
> devid 1 transid 1757933 /dev/loop2
> [  249.241504] BTRFS info (device loop2): disk space caching is enabled
> [  249.275950] BTRFS error (device loop2): bad tree block start 0 20987904
> [  249.280936] BTRFS error (device loop2): bad tree block start 0 20987904
> [  249.280946] BTRFS error (device loop2): failed to read chunk root
> [  249.336291] BTRFS error (device loop2): open_ctree failed
> 
> Output of "uname -a":
> 
> Linux desinfect 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5
> 16:43:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> 
> Output of "btrfs --version":
> 
> btrfs-progs v4.4
> 
> When reading the ISO file with "Active@ Disk Editor" (a hex file editor)
> I find a super block at offset 0x1 that looks like this:

That's the primary super block.

BTW, you could just use 'grep' to locate the btrfs superblock:

  # grep -obUaP "\x5F\x42\x48\x52\x66\x53\x5F\x4D" <image file>

It's better to use "btrfs ins dump-super -fFa" to show the superblock
info in a human-readable way.
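
[Editorial sketch] The same search can be done without grep. This is a minimal sketch, assuming the standard btrfs layout in which the 8-byte magic sits 0x40 bytes into the superblock (after csum, fsid, bytenr and flags) and superblock copies live at 64 KiB, 64 MiB and 256 GiB. The function name and the synthetic test image are made up for illustration, and a hit is only a candidate: the magic bytes can also occur inside ordinary file data.

```python
import mmap
import os
import tempfile

BTRFS_MAGIC = b"_BHRfS_M"       # bytes 5F 42 48 52 66 53 5F 4D
MAGIC_OFFSET = 0x40             # csum (32) + fsid (16) + bytenr (8) + flags (8)
SUPERBLOCK_OFFSETS = (0x10000, 0x4000000, 0x4000000000)  # 64 KiB, 64 MiB, 256 GiB


def find_superblock_candidates(path):
    """Return byte offsets in the image where a btrfs superblock may start."""
    hits = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as image:
        pos = image.find(BTRFS_MAGIC)
        while pos != -1:
            if pos >= MAGIC_OFFSET:
                hits.append(pos - MAGIC_OFFSET)
            pos = image.find(BTRFS_MAGIC, pos + 1)
    return hits


# Demo on a synthetic image: zeros with the magic planted where the
# primary superblock would keep it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * (0x10000 + MAGIC_OFFSET))
    tmp.write(BTRFS_MAGIC)
    tmp.write(b"\0" * 4096)
    demo_path = tmp.name

candidates = find_superblock_candidates(demo_path)
os.unlink(demo_path)

for offset in candidates:
    tag = "standard location" if offset in SUPERBLOCK_OFFSETS else "unexpected location"
    print(hex(offset), tag)
```

On a real image, candidates at one of the standard offsets are the ones worth passing to "btrfs ins dump-super"; the other hits are most likely stray data.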
> 
> B8E15DD74235AA4F77214E7397F07FE0E9A3CE9C010001005F42485266535F4DEDD21A40FF34004040010E35E023076092F1030006000110004000400010E200A6380E00610100E0230700E02307100010001090185CF6B93749BBB19191D08677EE224235AA4F77214E7397F07FE0E9A3CE9C
> 
> The super block at offset 0x400 is zeroed out.
> 
> When looking at the addresses of chunk root (0x1404000), root of tree
> root (0x34FF4000) and log tree root (0x350E) in the first super
> block they are all zeroed out as well. So I think I understand why the
> error "failed to read chunk root" crops up.

Not a big problem really.

We can still find the chunk root fairly easily using just the system
chunk array (and some time), since system chunks are normally small and
we can afford to check all tree blocks in that range.

That's why I recommend using "btrfs ins dump-super" to inspect the
superblock, as that allows us to inspect the system chunk array.

IIRC btrfs-find-root is pretty good at such a job, if that works.


> 
> If I try to "restore" using "btrfs restore sdcard.iso /outdir" I get
> this output:
> 
> checksum verify failed on 20987904 found E4E3BDB6 wanted 
> checksum verify failed on 20987904 found E4E3BDB6 wanted 
> checksum verify failed on 20987904 found E4E3BDB6 wanted 
> checksum verify failed on 20987904 found E4E3BDB6 wanted 
> bytenr mismatch, want=20987904, have=0
> Couldn't read chunk root
> Could not open root, trying backup super
> No valid Btrfs found on sdcard.iso
> Could not open root, trying backup super
> Superblock bytenr is larger than device size
> Could not open root, trying backup super

My plan for such recovery is:

1) btrfs ins dump-super to make sure system chunk array is valid
2) btrfs-find-root to find any valid chunk tree blocks
3) pass that chunk tree bytenr to btrfs-restore
   Unfortunately, btrfs-restore doesn't support specifying chunk root
   yet. But it's pretty easy to add such support.

So, please provide the "btrfs ins dump-super -Ffa" output to start with.

> 
> And, finally, I can see "/etc" someplace near "fstab" in the ISO which
> looks like a directory listing as well as content of files I remember,
> which tells me that the data is still in there somewhere.
> 
> So, what can I do to get the files I need out of this blob. I am willing
> to follow data pointers as described in
> https://btrfs.wiki.kernel.org/index.php/On-disk_Format in the hex editor
> and copy the data from there.

If there is something for which a hex editor is really needed, it means we
should add a new function to btrfs-progs. :)

Thanks,
Qu

> 
> Can anyone give me any pointers into the ISO file (maybe starting from
> the super block) to help me extract the 

Re: [PATCH] btrfs: add zstd compression level support

2018-10-30 Thread Omar Sandoval
On Tue, Oct 30, 2018 at 12:06:21PM -0700, Nick Terrell wrote:
> From: Jennifer Liu 
> 
> Adds zstd compression level support to btrfs. Zstd requires
> different amounts of memory for each level, so the design had
> to be modified to allow set_level() to allocate memory. We
> preallocate one workspace of the maximum size to guarantee
> forward progress. This feature is expected to be useful for
> read-mostly filesystems, or when creating images.
> 
> Benchmarks run in qemu on Intel x86 with a single core.
> The benchmark measures the time to copy the Silesia corpus [0] to
> a btrfs filesystem 10 times, then read it back.
> 
> The two important things to note are:
> - The decompression speed and memory remains constant.
>   The memory required to decompress is the same as level 1.
> - The compression speed and ratio will vary based on the source.
> 
> Level   Ratio   Compression   Decompression   Compression Memory
> 1       2.59    153 MB/s      112 MB/s        0.8 MB
> 2       2.67    136 MB/s      113 MB/s        1.0 MB
> 3       2.72    106 MB/s      115 MB/s        1.3 MB
> 4       2.78    86  MB/s      109 MB/s        0.9 MB
> 5       2.83    69  MB/s      109 MB/s        1.4 MB
> 6       2.89    53  MB/s      110 MB/s        1.5 MB
> 7       2.91    40  MB/s      112 MB/s        1.4 MB
> 8       2.92    34  MB/s      110 MB/s        1.8 MB
> 9       2.93    27  MB/s      109 MB/s        1.8 MB
> 10      2.94    22  MB/s      109 MB/s        1.8 MB
> 11      2.95    17  MB/s      114 MB/s        1.8 MB
> 12      2.95    13  MB/s      113 MB/s        1.8 MB
> 13      2.95    10  MB/s      111 MB/s        2.3 MB
> 14      2.99    7   MB/s      110 MB/s        2.6 MB
> 15      3.03    6   MB/s      110 MB/s        2.6 MB
> 
> [0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia

Reviewed-by: Omar Sandoval 

> Signed-off-by: Jennifer Liu 
> Signed-off-by: Nick Terrell 
> ---
>  fs/btrfs/compression.c | 172 +
>  fs/btrfs/compression.h |  18 +++--
>  fs/btrfs/lzo.c |   5 +-
>  fs/btrfs/super.c   |   7 +-
>  fs/btrfs/zlib.c|  33 
>  fs/btrfs/zstd.c|  74 ++
>  6 files changed, 204 insertions(+), 105 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2955a4ea2fa8..bd8e69381dc9 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -822,11 +822,15 @@ void __init btrfs_init_compress(void)
>  
>   /*
>* Preallocate one workspace for each compression type so
> -  * we can guarantee forward progress in the worst case
> +  * we can guarantee forward progress in the worst case.
> +  * Provide the maximum compression level to guarantee large
> +  * enough workspace.
>*/
> - workspace = btrfs_compress_op[i]->alloc_workspace();
> + workspace = btrfs_compress_op[i]->alloc_workspace(
> + btrfs_compress_op[i]->max_level);
>   if (IS_ERR(workspace)) {
> - pr_warn("BTRFS: cannot preallocate compression 
> workspace, will try later\n");
> + pr_warn("BTRFS: cannot preallocate compression "
> + "workspace, will try later\n");

Nit: since you didn't change this line, don't rewrap it.
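
[Editorial sketch] The trade-off in the benchmark table quoted in this thread is easier to see relative to level 1. The numbers below are copied from that table; the helper name is made up for illustration, and the results only describe this one benchmark (Silesia corpus, qemu, single core), not zstd in general.

```python
# (level, ratio, compression MB/s, decompression MB/s, compression memory MB)
ROWS = [
    (1, 2.59, 153, 112, 0.8),
    (2, 2.67, 136, 113, 1.0),
    (3, 2.72, 106, 115, 1.3),
    (4, 2.78, 86, 109, 0.9),
    (5, 2.83, 69, 109, 1.4),
    (6, 2.89, 53, 110, 1.5),
    (7, 2.91, 40, 112, 1.4),
    (8, 2.92, 34, 110, 1.8),
    (9, 2.93, 27, 109, 1.8),
    (10, 2.94, 22, 109, 1.8),
    (11, 2.95, 17, 114, 1.8),
    (12, 2.95, 13, 113, 1.8),
    (13, 2.95, 10, 111, 2.3),
    (14, 2.99, 7, 110, 2.6),
    (15, 3.03, 6, 110, 2.6),
]


def relative_to_level1(rows):
    """Map level -> (ratio gain in percent, compression slowdown factor)."""
    base_ratio, base_speed = rows[0][1], rows[0][2]
    result = {}
    for level, ratio, comp_speed, _decomp_speed, _memory in rows:
        gain_pct = (ratio / base_ratio - 1) * 100
        slowdown = base_speed / comp_speed
        result[level] = (gain_pct, slowdown)
    return result


rel = relative_to_level1(ROWS)
for level, (gain_pct, slowdown) in rel.items():
    print(f"level {level:2d}: ratio +{gain_pct:4.1f}%, compression {slowdown:4.1f}x slower")

decomp_speeds = [row[3] for row in ROWS]
print("decompression range:", min(decomp_speeds), "to", max(decomp_speeds), "MB/s")
```

This matches the commit message's two points: decompression speed stays within a narrow band across levels, while compression speed drops much faster than the ratio improves, which is why the higher levels are pitched at read-mostly filesystems and image creation.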


Re: Where is my disk space ?

2018-10-30 Thread Barbet Alain
Thanks for the answer!
alian@alian:~>  sudo btrfs sub list -ta /
[sudo] root password:
ID  gen top level   path
--  --- -   
257 79379   5   /@
258 79386   257 @/var
259 79000   257 @/usr/local
260 79376   257 @/tmp
261 79001   257 @/srv
262 79062   257 @/root
263 79001   257 @/opt
264 78898   257 @/boot/grub2/x86_64-efi
265 78933   257 @/boot/grub2/i386-pc

Yes, it's openSUSE, but I don't see any snapper config enabled.
For the record, I used Docker, which filled my disk; I removed the
subvolumes, but it looks like something is still missing somewhere.
On Tue, Oct 30, 2018 at 19:01, Chris Murphy wrote:
>
> On Tue, Oct 30, 2018 at 9:17 AM, Barbet Alain  
> wrote:
> > Hi,
> > I ran into a disk out-of-space issue:
> > alian:~ # df -h
> > Filesystem      Size  Used Avail Use% Mounted on
> > devtmpfs        7.8G     0  7.8G   0% /dev
> > tmpfs           7.8G   47M  7.8G   1% /dev/shm
> > tmpfs           7.8G   18M  7.8G   1% /run
> > tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
> > /dev/sda6        41G   35G  5.1G  88% /
> > /dev/sda6        41G   35G  5.1G  88% /var
> > /dev/sda6        41G   35G  5.1G  88% /root
> > /dev/sda6        41G   35G  5.1G  88% /srv
> > /dev/sda6        41G   35G  5.1G  88% /opt
> > /dev/sda6        41G   35G  5.1G  88% /boot/grub2/i386-pc
> > /dev/sda6        41G   35G  5.1G  88% /usr/local
> > /dev/sda6        41G   35G  5.1G  88% /tmp
> > /dev/sda6        41G   35G  5.1G  88% /boot/grub2/x86_64-efi
> > /dev/sda7       424G  200G  225G  48% /home
> >
> >
> > It says I use 35 GB of 41 GB, but I only have 5.8 GB of data:
> > alian:~ # btrfs fi du -s /
> >      Total   Exclusive  Set shared  Filename
> >    5.84GiB     5.84GiB       0.00B  /
> > alian:/ # du -h --exclude ./home --max-depth=0
> > 6.2G    .
>
> I suspect there are snapshots taking up space that are not located in
> the search path starting at /
>
> What do you get for:
>
> $ sudo btrfs sub list -ta /
>
> Is this an openSUSE system? If snapper is enabled, you'll need to ask
> it to delete some of the snapshots to free up space rather than doing
> it with btrfs user space tools.
>
>
>
>
> > alian:/ # btrfs fi df /
> > Data, single: total=35.00GiB, used=34.18GiB
> > System, DUP: total=32.00MiB, used=16.00KiB
> > Metadata, DUP: total=384.00MiB, used=216.75MiB
> > GlobalReserve, single: total=22.05MiB, used=0.00B
> >
> > I tried running btrfs balance multiple times with various parameters, but
> > it doesn't change anything, nor does running btrfs check in single-user mode.
> >
> > Where is my missing 30 GB?
>
>
>
> --
> Chris Murphy


Salvage files from broken btrfs

2018-10-30 Thread Mirko Klingmann
Hi all,

my btrfs root file system on a SD card broke down and did not mount anymore.

In retrospect, I think it reached its endurance limit, so I know that there
is nothing to repair. All I want to do is to salvage some configuration
and data files from the remains left in my ISO file copy. The SD card is
no longer readable, so all I have is the 30GB "dd" copy of the btrfs
partition.

I also tried some things on the ISO file I later found I shouldn't have
done with the "btrfs" tools, which I think broke the file system in it
even more.

So at this stage, this is the "dmesg" output when trying to mount the
ISO file, which then fails:

[  249.239883] BTRFS: device fsid 4235aa4f-7721-4e73-97f0-7fe0e9a3ce9c
devid 1 transid 1757933 /dev/loop2
[  249.241504] BTRFS info (device loop2): disk space caching is enabled
[  249.275950] BTRFS error (device loop2): bad tree block start 0 20987904
[  249.280936] BTRFS error (device loop2): bad tree block start 0 20987904
[  249.280946] BTRFS error (device loop2): failed to read chunk root
[  249.336291] BTRFS error (device loop2): open_ctree failed

Output of "uname -a":

Linux desinfect 4.13.0-39-generic #44~16.04.1-Ubuntu SMP Thu Apr 5
16:43:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Output of "btrfs --version":

btrfs-progs v4.4

When reading the ISO file with "Active@ Disk Editor" (a hex file editor)
I find a super block at offset 0x1 that looks like this:

B8E15DD74235AA4F77214E7397F07FE0E9A3CE9C010001005F42485266535F4DEDD21A40FF34004040010E35E023076092F1030006000110004000400010E200A6380E00610100E0230700E02307100010001090185CF6B93749BBB19191D08677EE224235AA4F77214E7397F07FE0E9A3CE9C

The super block at offset 0x400 is zeroed out.

When looking at the addresses of chunk root (0x1404000), root of tree
root (0x34FF4000) and log tree root (0x350E) in the first super
block they are all zeroed out as well. So I think I understand why the
error "failed to read chunk root" crops up.

If I try to "restore" using "btrfs restore sdcard.iso /outdir" I get
this output:

checksum verify failed on 20987904 found E4E3BDB6 wanted 
checksum verify failed on 20987904 found E4E3BDB6 wanted 
checksum verify failed on 20987904 found E4E3BDB6 wanted 
checksum verify failed on 20987904 found E4E3BDB6 wanted 
bytenr mismatch, want=20987904, have=0
Couldn't read chunk root
Could not open root, trying backup super
No valid Btrfs found on sdcard.iso
Could not open root, trying backup super
Superblock bytenr is larger than device size
Could not open root, trying backup super

And, finally, I can see "/etc" someplace near "fstab" in the ISO which
looks like a directory listing as well as content of files I remember,
which tells me that the data is still in there somewhere.

So, what can I do to get the files I need out of this blob. I am willing
to follow data pointers as described in
https://btrfs.wiki.kernel.org/index.php/On-disk_Format in the hex editor
and copy the data from there.

Can anyone give me any pointers into the ISO file (maybe starting from
the super block) to help me extract the data I need?

Cheers,

Mirko




[PATCH] btrfs: add zstd compression level support

2018-10-30 Thread Nick Terrell
From: Jennifer Liu 

Adds zstd compression level support to btrfs. Zstd requires
different amounts of memory for each level, so the design had
to be modified to allow set_level() to allocate memory. We
preallocate one workspace of the maximum size to guarantee
forward progress. This feature is expected to be useful for
read-mostly filesystems, or when creating images.

Benchmarks run in qemu on Intel x86 with a single core.
The benchmark measures the time to copy the Silesia corpus [0] to
a btrfs filesystem 10 times, then read it back.

The two important things to note are:
- The decompression speed and memory remains constant.
  The memory required to decompress is the same as level 1.
- The compression speed and ratio will vary based on the source.

Level   Ratio   Compression   Decompression   Compression Memory
1       2.59    153 MB/s      112 MB/s        0.8 MB
2       2.67    136 MB/s      113 MB/s        1.0 MB
3       2.72    106 MB/s      115 MB/s        1.3 MB
4       2.78    86  MB/s      109 MB/s        0.9 MB
5       2.83    69  MB/s      109 MB/s        1.4 MB
6       2.89    53  MB/s      110 MB/s        1.5 MB
7       2.91    40  MB/s      112 MB/s        1.4 MB
8       2.92    34  MB/s      110 MB/s        1.8 MB
9       2.93    27  MB/s      109 MB/s        1.8 MB
10      2.94    22  MB/s      109 MB/s        1.8 MB
11      2.95    17  MB/s      114 MB/s        1.8 MB
12      2.95    13  MB/s      113 MB/s        1.8 MB
13      2.95    10  MB/s      111 MB/s        2.3 MB
14      2.99    7   MB/s      110 MB/s        2.6 MB
15      3.03    6   MB/s      110 MB/s        2.6 MB

[0] http://sun.aei.polsl.pl/~sdeor/index.php?page=silesia

Signed-off-by: Jennifer Liu 
Signed-off-by: Nick Terrell 
---
 fs/btrfs/compression.c | 172 +
 fs/btrfs/compression.h |  18 +++--
 fs/btrfs/lzo.c |   5 +-
 fs/btrfs/super.c   |   7 +-
 fs/btrfs/zlib.c|  33 
 fs/btrfs/zstd.c|  74 ++
 6 files changed, 204 insertions(+), 105 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2955a4ea2fa8..bd8e69381dc9 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -822,11 +822,15 @@ void __init btrfs_init_compress(void)
 
/*
 * Preallocate one workspace for each compression type so
-* we can guarantee forward progress in the worst case
+* we can guarantee forward progress in the worst case.
+* Provide the maximum compression level to guarantee large
+* enough workspace.
 */
-   workspace = btrfs_compress_op[i]->alloc_workspace();
+   workspace = btrfs_compress_op[i]->alloc_workspace(
+   btrfs_compress_op[i]->max_level);
if (IS_ERR(workspace)) {
-   pr_warn("BTRFS: cannot preallocate compression 
workspace, will try later\n");
+   pr_warn("BTRFS: cannot preallocate compression "
+   "workspace, will try later\n");
} else {
atomic_set(&btrfs_comp_ws[i].total_ws, 1);
btrfs_comp_ws[i].free_ws = 1;
@@ -835,23 +839,78 @@ void __init btrfs_init_compress(void)
}
 }
 
+/*
+ * put a workspace struct back on the list or free it if we have enough
+ * idle ones sitting around
+ */
+static void __free_workspace(int type, struct list_head *workspace,
+bool heuristic)
+{
+   int idx = type - 1;
+   struct list_head *idle_ws;
+   spinlock_t *ws_lock;
+   atomic_t *total_ws;
+   wait_queue_head_t *ws_wait;
+   int *free_ws;
+
+   if (heuristic) {
+   idle_ws  = &btrfs_heuristic_ws.idle_ws;
+   ws_lock  = &btrfs_heuristic_ws.ws_lock;
+   total_ws = &btrfs_heuristic_ws.total_ws;
+   ws_wait  = &btrfs_heuristic_ws.ws_wait;
+   free_ws  = &btrfs_heuristic_ws.free_ws;
+   } else {
+   idle_ws  = &btrfs_comp_ws[idx].idle_ws;
+   ws_lock  = &btrfs_comp_ws[idx].ws_lock;
+   total_ws = &btrfs_comp_ws[idx].total_ws;
+   ws_wait  = &btrfs_comp_ws[idx].ws_wait;
+   free_ws  = &btrfs_comp_ws[idx].free_ws;
+   }
+
+   spin_lock(ws_lock);
+   if (*free_ws <= num_online_cpus()) {
+   list_add(workspace, idle_ws);
+   (*free_ws)++;
+   spin_unlock(ws_lock);
+   goto wake;
+   }
+   spin_unlock(ws_lock);
+
+   if (heuristic)
+   free_heuristic_ws(workspace);
+   else
+   btrfs_compress_op[idx]->free_workspace(workspace);
+   atomic_dec(total_ws);
+wake:
+   cond_wake_up(ws_wait);
+}
+
+static void free_workspace(int type, struct list_head *ws)
+{
+   return __free_workspace(type, ws, false);
+}
+
 /*
  * This finds an available workspace or allocates a 

Re: [RFC][PATCH v3 10/10] btrfs: use common file type conversion

2018-10-30 Thread David Sterba
On Sat, Oct 27, 2018 at 01:53:48AM +0100, Phillip Potter wrote:
> Deduplicate the btrfs file type conversion implementation - file systems
> that use the same file types as defined by POSIX do not need to define
> their own versions and can use the common helper functions declared in
> fs_types.h and implemented in fs_types.c
> 
> Signed-off-by: Amir Goldstein 
> Signed-off-by: Phillip Potter 

Acked-by: David Sterba 


Re: Where is my disk space ?

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 9:17 AM, Barbet Alain  wrote:
> Hi,
> I ran into a disk out-of-space issue:
> alian:~ # df -h
> Filesystem      Size  Used Avail Use% Mounted on
> devtmpfs        7.8G     0  7.8G   0% /dev
> tmpfs           7.8G   47M  7.8G   1% /dev/shm
> tmpfs           7.8G   18M  7.8G   1% /run
> tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
> /dev/sda6        41G   35G  5.1G  88% /
> /dev/sda6        41G   35G  5.1G  88% /var
> /dev/sda6        41G   35G  5.1G  88% /root
> /dev/sda6        41G   35G  5.1G  88% /srv
> /dev/sda6        41G   35G  5.1G  88% /opt
> /dev/sda6        41G   35G  5.1G  88% /boot/grub2/i386-pc
> /dev/sda6        41G   35G  5.1G  88% /usr/local
> /dev/sda6        41G   35G  5.1G  88% /tmp
> /dev/sda6        41G   35G  5.1G  88% /boot/grub2/x86_64-efi
> /dev/sda7       424G  200G  225G  48% /home
>
>
> It says I use 35 GB of 41 GB, but I only have 5.8 GB of data:
> alian:~ # btrfs fi du -s /
>      Total   Exclusive  Set shared  Filename
>    5.84GiB     5.84GiB       0.00B  /
> alian:/ # du -h --exclude ./home --max-depth=0
> 6.2G    .

I suspect there are snapshots taking up space that are not located in
the search path starting at /

What do you get for:

$ sudo btrfs sub list -ta /

Is this an openSUSE system? If snapper is enabled, you'll need to ask
it to delete some of the snapshots to free up space rather than doing
it with btrfs user space tools.




> alian:/ # btrfs fi df /
> Data, single: total=35.00GiB, used=34.18GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=384.00MiB, used=216.75MiB
> GlobalReserve, single: total=22.05MiB, used=0.00B
>
> I tried running btrfs balance multiple times with various parameters, but
> it doesn't change anything, nor does running btrfs check in single-user mode.
>
> Where is my missing 30 GB?



-- 
Chris Murphy
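
[Editorial sketch] The size of the discrepancy can be computed directly from the two outputs quoted in this thread. This is a minimal sketch: the parser below is written for the "btrfs fi df" text as it appears here, its function names are made up for illustration, and it makes no claim about where the space went, only how much is unaccounted for.

```python
import re

FI_DF = """\
Data, single: total=35.00GiB, used=34.18GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=384.00MiB, used=216.75MiB
GlobalReserve, single: total=22.05MiB, used=0.00B
"""

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def to_bytes(text):
    """Convert a size such as '34.18GiB' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)(B|KiB|MiB|GiB|TiB)", text)
    return float(match.group(1)) * UNITS[match.group(2)]


def parse_fi_df(output):
    """Parse 'btrfs fi df' lines into {type: {profile, total, used}}."""
    parsed = {}
    for line in output.splitlines():
        match = re.match(r"(\w+), (\w+): total=([^,]+), used=(\S+)", line)
        if match:
            kind, profile, total, used = match.groups()
            parsed[kind] = {
                "profile": profile,
                "total": to_bytes(total),
                "used": to_bytes(used),
            }
    return parsed


parsed = parse_fi_df(FI_DF)
data_used_gib = parsed["Data"]["used"] / 2**30
du_total_gib = 5.84          # "Total" column from the quoted 'btrfs fi du -s /'
unaccounted_gib = data_used_gib - du_total_gib
print(f"data used by the filesystem: {data_used_gib:.2f} GiB")
print(f"reachable from /:            {du_total_gib:.2f} GiB")
print(f"unaccounted for:             {unaccounted_gib:.2f} GiB")
```

The gap is data that the filesystem considers in use but that is not reachable from the mounted root, which is why the suggestion in this thread is to mount the top-level subvolume (subvolid=5) and look from there.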


Re: [PATCH 3/3] btrfs: fix pinned underflow after transaction aborted

2018-10-30 Thread David Sterba
On Wed, Oct 24, 2018 at 08:24:03PM +0800, Lu Fengqi wrote:
> When running generic/475, we may get the following warning in the dmesg.
> 
> [ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 
> btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
> [ 6902.104886] Modules linked in: btrfs(O) xor zstd_decompress zstd_compress 
> xxhash raid6_pq efivarfs xfs nvme nvme_core [last unloaded: btrfs]
> [ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: GW  O  
> 4.19.0-rc8+ #8
> [ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 
> 02/06/2015
> [ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
> [ 6902.114377] Code: c6 48 89 04 24 48 8b 83 50 17 00 00 48 39 c6 0f 84 ab 00 
> 00 00 4c 8b ab 50 17 00 00 49 83 bd 50 ff ff ff 00 0f 84 b4 00 00 00 <0f> 0b 
> 31 c9 49 8d b5 f8 fe ff ff 31 d2 48
> 89 df e8 fc 76 ff ff 49

You can remove this

> [ 6902.118921] RSP: 0018:c9000459bdb0 EFLAGS: 00010286
> [ 6902.120315] RAX: 880175050bb0 RBX: 8801124a8000 RCX: 
> 00170007
> [ 6902.121969] RDX: 0002 RSI: 00170007 RDI: 
> 8125fb74
> [ 6902.123716] RBP: 880175055d10 R08:  R09: 
> 
> [ 6902.125417] R10:  R11:  R12: 
> 880175055d88
> [ 6902.127129] R13: 880175050bb0 R14:  R15: 
> dead0100
> [ 6902.129060] FS:  7f4507223780() GS:88017ba0() 
> knlGS:
> [ 6902.130996] CS:  0010 DS:  ES:  CR0: 80050033
> [ 6902.132558] CR2: 5623599cac78 CR3: 00014b71 CR4: 
> 003606e0
> [ 6902.134270] DR0:  DR1:  DR2: 
> 
> [ 6902.135981] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [ 6902.137836] Call Trace:
> [ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
> [ 6902.140181]  ? kthread_stop+0x146/0x1f0
> [ 6902.141277]  generic_shutdown_super+0x6c/0x100
> [ 6902.142517]  kill_anon_super+0x14/0x30
> [ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
> [ 6902.144790]  deactivate_locked_super+0x2f/0x70
> [ 6902.146014]  cleanup_mnt+0x3b/0x70
> [ 6902.147020]  task_work_run+0x9e/0xd0
> [ 6902.148036]  do_syscall_64+0x470/0x600
> [ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 6902.151640] RIP: 0033:0x7f45077a6a7b
> [ 6902.152782] Code: 23 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa 
> 31 f6 e9 05 00 00 00 90 0f 1f 40 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 
> 01 f0 ff ff 73 01 c3 48 8b 0d b5 23
> 0c 00 f7 d8 64 89 01 48

and this line from the changelog (unless there's a reason to keep them).

> [ 6902.157324] RSP: 002b:7ffd589f3e68 EFLAGS: 0246 ORIG_RAX: 
> 00a6
> [ 6902.159187] RAX:  RBX: 55e8eec732b0 RCX: 
> 7f45077a6a7b
> [ 6902.160834] RDX: 0001 RSI:  RDI: 
> 55e8eec73490
> [ 6902.162526] RBP:  R08: 55e8eec734b0 R09: 
> 7ffd589f26c0
> [ 6902.164141] R10:  R11: 0246 R12: 
> 55e8eec73490
> [ 6902.165815] R13: 7f4507ac61a4 R14:  R15: 
> 7ffd589f40d8
> [ 6902.167553] irq event stamp: 0
> [ 6902.168998] hardirqs last  enabled at (0): [<>]   
> (null)
> [ 6902.170731] hardirqs last disabled at (0): [] 
> copy_process.part.55+0x3b0/0x1f00
> [ 6902.172773] softirqs last  enabled at (0): [] 
> copy_process.part.55+0x3b0/0x1f00
> [ 6902.174671] softirqs last disabled at (0): [<>]   
> (null)
> [ 6902.176407] ---[ end trace 463138c2986b275c ]---
> [ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is 
> not full
> [ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, 
> used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, 
> readonly=65536
> ^^^ pinned: obviously underflow

I'll reformat that a bit so the text is actually in the visible range
and we don't need to put signs like '^^^' to point at it.

> When transaction_kthread is running cleanup_transaction(), another
> fsstress is running btrfs_commit_transaction(). The
> btrfs_finish_extent_commit() may get the same range as
> btrfs_destroy_pinned_extent() got, which causes the pinned underflow.

So this completes what d4b450cd4b33 ("Btrfs: fix race between
transaction commit and empty block group removal") fixed in the
automatic block group removal 47ab2a6c689913d. I'll add the stable tags
too, and queue it for 4.20. Thanks.
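
[Editorial sketch] The pinned value in the quoted log is what a small negative number looks like when printed as an unsigned 64-bit counter. A worked check, assuming only that the counter is a u64 (the helper name is made up for illustration):

```python
U64 = 1 << 64
pinned_reported = 18446744073708158976   # from the space_info line in the log


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as a signed one."""
    return value - U64 if value >= (1 << 63) else value


deficit = -as_signed64(pinned_reported)
print("bytes below zero:", deficit)
print("as 4 KiB units:  ", deficit // 4096, "remainder", deficit % 4096)

# Subtracting more than the counter holds wraps around modulo 2**64.
assert (0 - deficit) % U64 == pinned_reported
```

So the counter ended up 1392640 bytes (340 units of 4 KiB) below zero, which is consistent with the same range being subtracted from pinned by two paths, as the changelog describes.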


Re: Understanding "btrfs filesystem usage"

2018-10-30 Thread Chris Murphy
On Tue, Oct 30, 2018 at 10:10 AM, Ulli Horlacher
 wrote:
>
> On Mon 2018-10-29 (17:57), Remi Gauvin wrote:
>> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
>>
>> > I want to know how many free space is left and have problems in
>> > interpreting the output of:
>> >
>> > btrfs filesystem usage
>> > btrfs filesystem df
>> > btrfs filesystem show
>>
>> In my not so humble opinion, the filesystem usage command has the
>> easiest to understand output.  It lays out all the pertinent information.
>>
>> You can clearly see 825GiB is allocated, with 494GiB used, therefore,
>> filesystem show is actually using the "Allocated" value as "Used".
>> Allocated can be thought of "Reserved For".
>
> And what is "Device unallocated"? Not reserved?

That's a reasonable interpretation. Unallocated space is space that's
not used for anything: no data, no metadata, and it isn't referenced by
any block group.

It's not a relevant number day to day; I'd say it's advanced, leaning
toward esoteric, knowledge of Btrfs internals. At this point I'd like
to see a simpler output by default, a verbose option for advanced
users, and an export option that spits out a superset of all available
information in a format parsable by scripts. But I know there are other
projects that depend on btrfs user space output, rather than having
something specifically invented for them that's easily parsed and can
be kept consistent and extensible, separate from human user
consumption. Oh well!
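
[Editorial sketch] The three terms can be related with simple arithmetic. The 825 GiB allocated and 494 GiB used figures are the ones quoted in this thread; the 1000 GiB device size is a made-up value for illustration, and the model ignores profiles such as DUP or RAID1 where raw and usable space differ.

```python
GIB = 1024 ** 3


def summarize(device_size, allocated, used):
    """Split a device into unallocated space and slack inside block groups."""
    unallocated = device_size - allocated   # not referenced by any block group
    slack = allocated - used                # reserved by block groups but empty
    return {
        "unallocated": unallocated,
        "slack": slack,
        "fill_ratio": used / allocated,
    }


summary = summarize(1000 * GIB, 825 * GIB, 494 * GIB)
print("unallocated:", summary["unallocated"] // GIB, "GiB")
print("allocated but unused:", summary["slack"] // GIB, "GiB")
print("fill ratio of allocated space:", round(summary["fill_ratio"], 3))
```

New writes can land either in the slack of existing block groups or in newly allocated block groups carved out of unallocated space, so both numbers contribute to what is practically free.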




>> The disparity between 498GiB used and 823Gib is pretty high.  This is
>> probably the result of using an SSD with an older kernel.  If your
>> kernel is not very recent, (sorry, I forget where this was fixed,
>> somewhere around 4.14 or 4.15), then consider mounting with the nossd
>> option.
>
> I am running kernel 4.4 (it is a Ubuntu 16.04 system)
> But /local is on a SSD. Should I really use nossd mount option?!

Yes. But it's not a file system integrity suggestion, it's an optimization.


>
>
>
>> You can improve this by running a balance.
>>
>> Something like:
>> btrfs balance start -dusage=55
>
> I run balance via cron weekly (adapted
> https://software.opensuse.org/package/btrfsmaintenance)

With a newer kernel you can probably reduce this further, depending on
your workload and use case. Optionally, follow it up by running fstrim,
or just enable fstrim.timer (we don't recommend using the discard
mount option for most use cases, as it too aggressively discards very
recently stale Btrfs metadata and can make recovery from crashes
harder).
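
[Editorial sketch] What a usage filter such as -dusage=55 selects can be modeled in a few lines. This is a simplified illustration, assuming the filter picks data block groups whose fill level is below the given percentage; the block group sizes and fill levels are hypothetical, and the real selection is done by the kernel.

```python
GIB = 1024 ** 3


def select_for_balance(block_groups, usage_limit):
    """Return the (size, used) block groups filled below usage_limit percent."""
    return [(size, used) for size, used in block_groups
            if used * 100 < usage_limit * size]


# Hypothetical 1 GiB data block groups filled 10%, 50%, 60% and 95%.
groups = [(GIB, GIB * pct // 100) for pct in (10, 50, 60, 95)]
selected = select_for_balance(groups, 55)
live_bytes = sum(used for _size, used in selected)

print("block groups selected:", len(selected))
print("live data to rewrite: ", round(live_bytes / GIB, 2), "GiB")
```

In this example the two selected block groups hold about 0.6 GiB of live data between them, so rewriting them compactly can return roughly one block group's worth of space to unallocated. A lower percentage touches fewer block groups and rewrites less data.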

There is a trim bug that causes FITRIM to be applied only to
unallocated space on older file systems that have been balanced such
that block group logical addresses are outside the physical address
space of the device; the free space inside such block groups gets
passed over by FITRIM. It looks like this will be fixed in
kernel 4.20/5.0.




-- 
Chris Murphy


Re: [PATCH] Btrfs: remove no longer used stuff for tracking pending ordered extents

2018-10-30 Thread David Sterba
On Fri, Oct 26, 2018 at 05:15:21PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Tracking pending ordered extents per transaction was introduced in commit
> 50d9aa99bd35 ("Btrfs: make sure logged extents complete in the current
> transaction V3") and later updated in commit 161c3549b45a ("Btrfs: change
> how we wait for pending ordered extents").
> 
> However now that on fsync we always wait for ordered extents to complete
> before logging, done in commit 5636cf7d6dc8 ("btrfs: remove the logged
> extents infrastructure"), we no longer need the stuff to track for pending
> ordered extents, which was not completely removed in the mentioned commit.
> So remove the remaining of the pending ordered extents infrastructure.
> 
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.


Re: [PATCH] Btrfs: remove no longer used logged range variables when logging extents

2018-10-30 Thread David Sterba
On Fri, Oct 26, 2018 at 09:26:40PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> The logged_start and logged_end variables, at btrfs_log_changed_extents(),
> were added in commit 8c6c592831a0 ("btrfs: log csums for all modified
> extents"). However since the recent simplification for fsync, which makes
> us wait for all ordered extents to complete before logging extents, we
> no longer need those variables. Commit a2120a473a80 ("btrfs: clean up the
> left over logged_list usage") forgot to remove them.
> 
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.


Re: Understanding "btrfs filesystem usage"

2018-10-30 Thread Austin S. Hemmelgarn

On 10/30/2018 12:10 PM, Ulli Horlacher wrote:
> On Mon 2018-10-29 (17:57), Remi Gauvin wrote:
>> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
>>
>>> I want to know how much free space is left and have problems in
>>> interpreting the output of:
>>>
>>> btrfs filesystem usage
>>> btrfs filesystem df
>>> btrfs filesystem show
>>
>> In my not so humble opinion, the filesystem usage command has the
>> easiest to understand output.  It lays out all the pertinent information.
>>
>> You can clearly see 825GiB is allocated, with 494GiB used, therefore,
>> filesystem show is actually using the "Allocated" value as "Used".
>> Allocated can be thought of "Reserved For".
>
> And what is "Device unallocated"? Not reserved?
>
>> As the output of the Usage command and df command clearly show, you have
>> almost 400GiB space available.
>
> This is the good part :-)
>
>> The disparity between 498GiB used and 823GiB is pretty high.  This is
>> probably the result of using an SSD with an older kernel.  If your
>> kernel is not very recent, (sorry, I forget where this was fixed,
>> somewhere around 4.14 or 4.15), then consider mounting with the nossd
>> option.
>
> I am running kernel 4.4 (it is an Ubuntu 16.04 system)
> But /local is on an SSD. Should I really use the nossd mount option?!

Probably, and you may even want to use it on newer (patched) kernels.

This requires some explanation though.

SSDs are write-limited media (write to them too much, and they stop
working).  This is generally a pretty well known fact, and while it is
true, it's not anywhere near as much of an issue on modern SSDs as
people make it out to be (pretty much, if you've got an SSD made in the
last 5 years, you almost certainly don't have to worry about this).  The
`ssd` code in BTRFS behaves as if this is still an issue (and does so in 
a way that doesn't even solve it well).


Put simply, when BTRFS goes to look for space, it treats requests for 
space that ask for less than a certain size as if they are that minimum 
size, and only tries to look for smaller spots if it can't find one at 
least that minimum size.  This has a couple of advantages in terms of 
write performance, especially in the common case of a mostly empty 
filesystem.


For the default (`nossd`) case, that minimum size is 64kB.  So, in most 
cases, the potentially wasted space actually doesn't matter much (most 
writes are bigger than 64k) unless you're doing certain things.


For the old (`ssd`) case, that minimum size is 2MB.  Even with the 
common cases that would normally not have an issue with the 64k default, 
this ends up wasting a _huge_ amount of space.


For the new `ssd` behavior, the minimum is different for data and 
metadata (IIRC, metadata uses the 64k default, while data still uses the 
2M size).  This solves the biggest issues (which were seen with 
metadata), but doesn't completely remove the problem.


Expanding on this further, some unusual workloads actually benefit from 
the old `ssd` behavior, so on newer kernels `ssd_spread` gives that 
behavior.  However, many workloads actually do better with the `nossd` 
behavior (especially the pathological worst case stuff like databases 
and VM disk images), so if you have a recent SSD, you probably want to 
just use that.
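
To make the effect of that minimum size concrete, here is a rough sketch of the behavior as described above. It is a simplified model, not the kernel allocator; the function names are hypothetical, and the two constants are just the 64kB and 2MB figures quoted in the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define MIN_NOSSD   (64ULL * 1024)        /* 64kB, per the text above */
#define MIN_OLD_SSD (2ULL * 1024 * 1024)  /* 2MB, per the text above */

/* Size the allocator looks for on its first pass: small requests are padded. */
static uint64_t first_pass_size(uint64_t request, uint64_t min_size)
{
	return request < min_size ? min_size : request;
}

/*
 * A free hole that could hold the data is still passed over on the first
 * pass when it is smaller than the padded size.
 */
static bool hole_skipped_first_pass(uint64_t hole, uint64_t request,
				    uint64_t min_size)
{
	return hole >= request && hole < first_pass_size(request, min_size);
}
```

In this model a 16kB write skips a 1MB hole under the old `ssd` minimum but takes it under the `nossd` minimum, which is one way to picture allocated space running well ahead of used space.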




>> You can improve this by running a balance.
>>
>> Something like:
>> btrfs balance start -dusage=55
>
> I run balance via cron weekly (adapted
> https://software.opensuse.org/package/btrfsmaintenance)






Re: [PATCH] Btrfs: fix cur_offset in the error case for nocow

2018-10-30 Thread David Sterba
On Tue, Oct 30, 2018 at 06:04:04PM +0800, robbieko wrote:
> From: Robbie Ko 
> 
> When cow_file_range fails, the related resources are
> unlocked according to the range (start-end), so the unlock
> cannot be repeated in run_delalloc_nocow.
> 
> In some cases (e.g. cur_offset <= end && cow_start!= -1),
> cur_offset is not updated correctly, so move the cur_offset
> update before cow_file_range.
> 
> [ cut here ]
> kernel BUG at mm/page-writeback.c:2663!
> Internal error: Oops - BUG: 0 [#1] SMP
> CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
> Hardware name: Realtek_RTD1296 (DT)
> Workqueue: writeback wb_workfn (flush-btrfs-1)
> task: ffc076db3380 ti: ffc02e9ac000 task.ti: ffc02e9ac000
> PC is at clear_page_dirty_for_io+0x1bc/0x1e8
> LR is at clear_page_dirty_for_io+0x14/0x1e8
> pc : [] lr : [] pstate: 4145
> sp : ffc02e9af4f0
> Process kworker/u8:7 (pid: 31525, stack limit = 0xffc02e9ac020)
> Call trace:
> [] clear_page_dirty_for_io+0x1bc/0x1e8
> [] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
> [] run_delalloc_nocow+0x3b8/0x948 [btrfs]
> [] run_delalloc_range+0x250/0x3a8 [btrfs]
> [] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
> [] __extent_writepage+0xe8/0x248 [btrfs]
> [] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
> [] extent_writepages+0x48/0x68 [btrfs]
> [] btrfs_writepages+0x20/0x30 [btrfs]
> [] do_writepages+0x30/0x88
> [] __writeback_single_inode+0x34/0x198
> [] writeback_sb_inodes+0x184/0x3c0
> [] __writeback_inodes_wb+0x6c/0xc0
> [] wb_writeback+0x1b8/0x1c0
> [] wb_workfn+0x150/0x250
> [] process_one_work+0x1dc/0x388
> [] worker_thread+0x130/0x500
> [] kthread+0x10c/0x110
> [] ret_from_fork+0x10/0x40
> Code: d503201f a9025bb5 a90363b7 f90023b9 (d421)
> ---[ end trace 65fecee7c2296f25 ]---
> 
> Signed-off-by: Robbie Ko 

As there's a reviewed-by I can fix the small issues at commit time, no
need to resend. Thanks.


Re: Understanding "btrfs filesystem usage"

2018-10-30 Thread Ulli Horlacher


On Mon 2018-10-29 (17:57), Remi Gauvin wrote:
> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
> 
> > I want to know how much free space is left and have problems in
> > interpreting the output of: 
> > 
> > btrfs filesystem usage
> > btrfs filesystem df
> > btrfs filesystem show
> 
> In my not so humble opinion, the filesystem usage command has the
> easiest to understand output.  It lays out all the pertinent information.
> 
> You can clearly see 825GiB is allocated, with 494GiB used, therefore,
> filesystem show is actually using the "Allocated" value as "Used".
> Allocated can be thought of "Reserved For".

And what is "Device unallocated"? Not reserved?


> As the output of the Usage command and df command clearly show, you have
> almost 400GiB space available.

This is the good part :-)


> The disparity between 498GiB used and 823GiB is pretty high.  This is
> probably the result of using an SSD with an older kernel.  If your
> kernel is not very recent, (sorry, I forget where this was fixed,
> somewhere around 4.14 or 4.15), then consider mounting with the nossd
> option.

I am running kernel 4.4 (it is an Ubuntu 16.04 system)
But /local is on an SSD. Should I really use the nossd mount option?!



> You can improve this by running a balance.
> 
> Something like:
> btrfs balance start -dusage=55

I run balance via cron weekly (adapted 
https://software.opensuse.org/package/btrfsmaintenance)


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<85a63523-7e77-f4ca-9947-2c957c5c5...@georgianit.com>


Where is my disk space ?

2018-10-30 Thread Barbet Alain
Hi,
I ran into a disk out-of-space issue:
alian:~ # df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.8G     0  7.8G   0% /dev
tmpfs           7.8G   47M  7.8G   1% /dev/shm
tmpfs           7.8G   18M  7.8G   1% /run
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/sda6        41G   35G  5.1G  88% /
/dev/sda6        41G   35G  5.1G  88% /var
/dev/sda6        41G   35G  5.1G  88% /root
/dev/sda6        41G   35G  5.1G  88% /srv
/dev/sda6        41G   35G  5.1G  88% /opt
/dev/sda6        41G   35G  5.1G  88% /boot/grub2/i386-pc
/dev/sda6        41G   35G  5.1G  88% /usr/local
/dev/sda6        41G   35G  5.1G  88% /tmp
/dev/sda6        41G   35G  5.1G  88% /boot/grub2/x86_64-efi
/dev/sda7       424G  200G  225G  48% /home


It says I use 35 GB out of 41, but I have only 5.8 GB of data:
alian:~ # btrfs fi du -s /
     Total   Exclusive  Set shared  Filename
   5.84GiB     5.84GiB       0.00B  /
alian:/ # du -h --exclude ./home --max-depth=0
6.2G    .

alian:/ # btrfs fi df /
Data, single: total=35.00GiB, used=34.18GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=384.00MiB, used=216.75MiB
GlobalReserve, single: total=22.05MiB, used=0.00B

I tried running btrfs balance multiple times with various parameters, but
it didn't change anything, and neither did running btrfs check in
single-user mode.

Where are my missing 30 GB?

Thank you for any help


[PATCH 6/6] btrfs: Handle final split-brain possibility during fsid change

2018-10-30 Thread Nikolay Borisov
This patch lands the last case which needs to be handled by the fsid
change code. Namely, this is the case where a multidisk filesystem has
already undergone at least one successful fsid change, i.e. all disks
have the METADATA_UUID incompat bit and power failure occurs as another
fsid change is in progress. When such an event occurs, disks could be
split in 2 groups. One of the groups will have both METADATA_UUID and
CHANGING_FSID_V2 flags set coupled with old fsid/metadata_uuid pairs.
The other group of disks will have only METADATA_UUID bit set and their
fsid will be different than the one in disks in the first group. Here
we look at the following cases:

  a) A disk from the first group is scanned first, so fs_devices is
  created with stale fsid/metadata_uuid. Then when a disk from the
  second group is scanned it needs to first check whether there exists
  such an fs_devices that has fsid_change set to true (because it was
  created with a disk having the CHANGING_FSID_V2 flag), the
  metadata_uuid and fsid of the fs_devices will be different (since it was
  created by a disk which already has had at least 1 successful fsid change)
  and finally the metadata_uuid of the fs_devices will equal that of the
  currently scanned disk (because metadata_uuid never really changes).
  When the correct fs_devices is found the information from the scanned
  disk will replace the current one in fs_devices since the scanned disk
  will have higher generation number.

  b) A disk from the second group is scanned so fs_devices is created
  as usual with differing fsid/metadata_uuid. Then when a disk from the
  first group is scanned the code detects that it has both
  CHANGING_FSID_V2 and METADATA_UUID flags set and will search for
  fs_devices that has differing metadata_uuid/fsid and whose
  metadata_uuid is the same as that of the scanned device.
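
The two cases above can be modelled in userspace as a pair of predicates. This is only a sketch: the `*_sketch` structs and the `matches_case_*` names are invented for illustration, and the conditions simply mirror the memcmp() checks in the hunks of this patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16 /* stand-in for BTRFS_FSID_SIZE */

/* Hypothetical cut-down versions of the kernel structures. */
struct fs_devices_sketch {
	uint8_t fsid[FSID_SIZE];
	uint8_t metadata_uuid[FSID_SIZE];
	bool fsid_change;
};

struct super_sketch {
	uint8_t fsid[FSID_SIZE];
	uint8_t metadata_uuid[FSID_SIZE];
};

/* Case a): fs_devices came from a disk that still has CHANGING_FSID_V2. */
static bool matches_case_a(const struct fs_devices_sketch *fd,
			   const struct super_sketch *sb)
{
	return fd->fsid_change &&
	       memcmp(fd->metadata_uuid, fd->fsid, FSID_SIZE) != 0 &&
	       memcmp(sb->metadata_uuid, fd->metadata_uuid, FSID_SIZE) == 0;
}

/* Case b): fs_devices came from a disk that already finished the change. */
static bool matches_case_b(const struct fs_devices_sketch *fd,
			   const struct super_sketch *sb)
{
	return memcmp(fd->metadata_uuid, fd->fsid, FSID_SIZE) != 0 &&
	       memcmp(fd->metadata_uuid, sb->metadata_uuid, FSID_SIZE) == 0 &&
	       memcmp(fd->fsid, sb->fsid, FSID_SIZE) != 0;
}
```

In both cases the metadata_uuid is the stable key; the fsid is only used to tell the two groups of disks apart.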

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 65 --
 1 file changed, 53 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f967e995feff..0ce2600c63b8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -383,7 +383,6 @@ find_fsid(const u8 *fsid, const u8 *metadata_fsid)
ASSERT(fsid);
 
if (metadata_fsid) {
-
/*
 * Handle scanned device having completed its fsid change but
 * belonging to a fs_devices that was created by first scanning
@@ -399,6 +398,21 @@ find_fsid(const u8 *fsid, const u8 *metadata_fsid)
return fs_devices;
}
}
+   /*
+* Handle scanned device having completed its fsid change but
+* belonging to a fs_devices that was created by a device that
+* has an outdated pair of fsid/metadata_uuid and
+* CHANGING_FSID_V2 flag set.
+*/
+   list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
+   if (fs_devices->fsid_change &&
+   memcmp(fs_devices->metadata_uuid,
+  fs_devices->fsid, BTRFS_FSID_SIZE) != 0 &&
+   memcmp(metadata_fsid, fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE) == 0) {
+   return fs_devices;
+   }
+   }
}
 
/* Handle non-split brain cases */
@@ -808,6 +822,30 @@ static struct btrfs_fs_devices *find_fsid_inprogress(
return NULL;
 }
 
+
+static struct btrfs_fs_devices *find_fsid_changed(
+   struct btrfs_super_block *disk_super)
+{
+   struct btrfs_fs_devices *fs_devices;
+
+   /*
+* Handles the case where scanned device is part of an fs that had
+* multiple successful changes of FSID but curently device didn't
+* observe it. Meaning our fsid will be different than theirs.
+*/
+   list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
+   if (memcmp(fs_devices->metadata_uuid, fs_devices->fsid,
+  BTRFS_FSID_SIZE) != 0 &&
+   memcmp(fs_devices->metadata_uuid, disk_super->metadata_uuid,
+  BTRFS_FSID_SIZE) == 0 &&
+   memcmp(fs_devices->fsid, disk_super->fsid,
+  BTRFS_FSID_SIZE) != 0) {
+   return fs_devices;
+   }
+   }
+
+   return NULL;
+}
 /*
  * Add new device to list of registered devices
  *
@@ -829,17 +867,20 @@ static noinline struct btrfs_device *device_list_add(const char *path,
bool fsid_change_in_progress = (btrfs_super_flags(disk_super) &
BTRFS_SUPER_FLAG_CHANGING_FSID_V2);
 
-   if (fsid_change_in_progress && !has_metadata_uuid) {
-   /*
-* When we have an image which has 

[PATCH 5/6] btrfs: Handle one more split-brain scenario during fsid change

2018-10-30 Thread Nikolay Borisov
This commit continues hardening the scanning code to handle cases where
power loss could have caused disks in a multi-disk filesystem to be
in inconsistent state. Namely handle the situation that can occur when
some of the disks in a multi-disk fs have completed their fsid change, i.e.
they have METADATA_UUID incompat flag set, have cleared the
CHANGING_FSID_V2 flag and their fsid/metadata_uuid are different. At
the same time the other half of the disks will have their
fsid/metadata_uuid unchanged and will only have CHANGING_FSID_V2 flag.

This is handled by introducing code in the scan path which:

 a) Handles the case when a device with CHANGING_FSID_V2 flag is
 scanned and as a result btrfs_fs_devices is created with matching
 fsid/metadata_uuid. Subsequently, when a device with completed fsid
 change is scanned it will detect this via the new code in find_fsid,
 i.e. that such an fs_devices exists whose fsid_change flag is set to true,
 its metadata_uuid/fsid match and the metadata_uuid of the scanned
 device matches that of the fs_devices. In this case, it's important to
 note that the device which has its fsid change completed will have a
 higher generation number than the device with CHANGING_FSID_V2 flag
 set, so its superblock will be used during mount. To prevent an
 assertion triggering because the sb used for mounting will have
 differing fsid/metadata_uuid than the ones in the fs_devices struct
 also add code in device_list_add which overwrites the values in
 fs_devices.

 b) Alternatively we can end up with a device that completed its
 fsid change being scanned first, which will create the respective
 btrfs_fs_devices struct with differing fsid/metadata_uuid. In this
 case when a device with CHANGING_FSID_V2 flag set is scanned it will
 call the newly added find_fsid_inprogress function which will return
 the correct fs_devices.
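
The overwrite step mentioned in a) could look roughly like the following. The corresponding device_list_add hunk is cut off in this archive, so this is an assumption about its shape rather than the patch's code; all names ending in `_sketch` and the function `adopt_newer_ids` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16 /* stand-in for BTRFS_FSID_SIZE */

struct fs_devices_sketch {
	uint8_t fsid[FSID_SIZE];
	uint8_t metadata_uuid[FSID_SIZE];
	bool fsid_change;
	uint64_t latest_generation;
};

struct scanned_sb_sketch {
	uint8_t fsid[FSID_SIZE];
	uint8_t metadata_uuid[FSID_SIZE];
	uint64_t generation;
};

/*
 * If fs_devices was created from a disk with the in-progress flag and a
 * newer superblock shows up, take over its fsid/metadata_uuid so they
 * match the superblock that will be used for mounting.
 * Returns true when the values were overwritten.
 */
static bool adopt_newer_ids(struct fs_devices_sketch *fd,
			    const struct scanned_sb_sketch *sb)
{
	if (!fd->fsid_change || sb->generation <= fd->latest_generation)
		return false;

	memcpy(fd->fsid, sb->fsid, FSID_SIZE);
	memcpy(fd->metadata_uuid, sb->metadata_uuid, FSID_SIZE);
	fd->fsid_change = false;
	fd->latest_generation = sb->generation;
	return true;
}
```

Keying the decision on the generation number matches the reasoning in a): the disk that finished its fsid change is the one whose superblock gets used at mount.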

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 78 +++---
 1 file changed, 74 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f9dcbe74093c..f967e995feff 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -382,6 +382,26 @@ find_fsid(const u8 *fsid, const u8 *metadata_fsid)
 
ASSERT(fsid);
 
+   if (metadata_fsid) {
+
+   /*
+* Handle scanned device having completed its fsid change but
+* belonging to a fs_devices that was created by first scanning
+* a device which didn't have it's fsid/metadata_uuid changed
+* at all and the CHANGING_FSID_V2 flag set.
+*/
+   list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
+   if (fs_devices->fsid_change &&
+   memcmp(metadata_fsid, fs_devices->fsid,
+  BTRFS_FSID_SIZE) == 0 &&
+   memcmp(fs_devices->fsid, fs_devices->metadata_uuid,
+  BTRFS_FSID_SIZE) == 0) {
+   return fs_devices;
+   }
+   }
+   }
+
+   /* Handle non-split brain cases */
list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
if (metadata_fsid) {
if (memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE) == 0
@@ -768,6 +788,27 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 }
 
 /*
+ * Handle scanned device having its CHANGING_FSID_V2 flag set and the fs_devices
+ * being created with a disk that has already completed its fsid change.
+ */
+static struct btrfs_fs_devices *find_fsid_inprogress(
+   struct btrfs_super_block *disk_super)
+{
+   struct btrfs_fs_devices *fs_devices;
+
+   list_for_each_entry(fs_devices, &fs_uuids, fs_list) {
+   if (memcmp(fs_devices->metadata_uuid, fs_devices->fsid,
+  BTRFS_FSID_SIZE) != 0 &&
+   memcmp(fs_devices->metadata_uuid, disk_super->fsid,
+  BTRFS_FSID_SIZE) == 0 && !fs_devices->fsid_change) {
+   return fs_devices;
+   }
+   }
+
+   return NULL;
+}
+
+/*
  * Add new device to list of registered devices
  *
  * Returns:
@@ -779,7 +820,7 @@ static noinline struct btrfs_device *device_list_add(const char *path,
   bool *new_device_added)
 {
struct btrfs_device *device;
-   struct btrfs_fs_devices *fs_devices;
+   struct btrfs_fs_devices *fs_devices = NULL;
struct rcu_string *name;
u64 found_transid = btrfs_super_generation(disk_super);
u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
@@ -788,10 +829,24 @@ static noinline struct btrfs_device *device_list_add(const char *path,
bool fsid_change_in_progress = (btrfs_super_flags(disk_super) &
BTRFS_SUPER_FLAG_CHANGING_FSID_V2);
 
-   

[PATCH 3/6] btrfs: Add handling for disk split-brain scenario during fsid change

2018-10-30 Thread Nikolay Borisov
Even though fsid change without rewrite is a very quick operation it's
still possible to experience a split-brain scenario if power loss
occurs at the right time. This patch handles the case where power
failure occurs while the first transaction (the one setting the
CHANGING_FSID_V2 flag) is being persisted on disk. This can cause
the btrfs_fs_devices of this filesystem to be created by a device which:

 a) has the CHANGING_FSID_V2 flag set but its fsid value is intact

 b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
 fsid value is intact

This situation is trivially handled by the current find_fsid code since
in both cases the devices are going to be treated like ordinary devices.
Since btrfs is always mounted using the superblock of the latest
device (the one with highest generation number), meaning it will have
the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
the first transaction commit following mount all disks will have it
cleared.
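
As a rough userspace illustration of the reasoning above: pick the superblock with the highest generation, then clear the in-progress flag in the copy used for mounting. The struct, the function names, and the bit position are assumptions made for this sketch; check the kernel headers for the real flag value.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative bit only; not taken from this patch. */
#define CHANGING_FSID_V2_SKETCH (1ULL << 36)

struct sb_sketch {
	uint64_t generation;
	uint64_t flags;
};

/* Return the superblock with the highest generation, or NULL if none. */
static struct sb_sketch *pick_latest(struct sb_sketch *sbs, size_t n)
{
	struct sb_sketch *latest = NULL;

	for (size_t i = 0; i < n; i++)
		if (!latest || sbs[i].generation > latest->generation)
			latest = &sbs[i];
	return latest;
}

/* Drop the in-progress flag so the next commit writes it out cleared. */
static void clear_changing_fsid(struct sb_sketch *sb)
{
	sb->flags &= ~CHANGING_FSID_V2_SKETCH;
}
```

Because the flag is cleared in the in-memory copy, the older disks only lose it once the first transaction commit after mount rewrites every superblock.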

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a458ef5b605e..6498434c2e06 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2831,10 +2831,10 @@ int open_ctree(struct super_block *sb,
 * the whole block of INFO_SIZE
 */
memcpy(fs_info->super_copy, bh->b_data, sizeof(*fs_info->super_copy));
-   memcpy(fs_info->super_for_commit, fs_info->super_copy,
-  sizeof(*fs_info->super_for_commit));
brelse(bh);
 
+   disk_super = fs_info->super_copy;
+
ASSERT(!memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid,
   BTRFS_FSID_SIZE));
 
@@ -2844,6 +2844,15 @@ int open_ctree(struct super_block *sb,
BTRFS_FSID_SIZE));
}
 
+   features = btrfs_super_flags(disk_super);
+   if (features & BTRFS_SUPER_FLAG_CHANGING_FSID_V2) {
+   features &= ~BTRFS_SUPER_FLAG_CHANGING_FSID_V2;
+   btrfs_set_super_flags(disk_super, features);
+   btrfs_info(fs_info, "found metadata uuid in progress flag. Clearing");
+   }
+
+   memcpy(fs_info->super_for_commit, fs_info->super_copy,
+  sizeof(*fs_info->super_for_commit));
 
ret = btrfs_validate_mount_super(fs_info);
if (ret) {
@@ -2852,7 +2861,6 @@ int open_ctree(struct super_block *sb,
goto fail_alloc;
}
 
-   disk_super = fs_info->super_copy;
if (!btrfs_super_root(disk_super))
goto fail_alloc;
 
-- 
2.7.4



[PATCH 4/6] btrfs: Introduce 2 more members to struct btrfs_fs_devices

2018-10-30 Thread Nikolay Borisov
In order to gracefully handle split-brain scenarios, which are very
unlikely yet possible, while performing the FSID change I'm
going to need two more pieces of information:

 1. The highest generation number among all devices registered to a
 particular btrfs_fs_devices

 2. A boolean flag indicating whether a given btrfs_fs_devices was
 created by a device which had the CHANGING_FSID_V2 flag set.

This is a preparatory patch and just introduces the variables as well
as code which sets them, their actual use is going to happen in a later
patch.
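
A small userspace sketch of how the two new members get their values, under the assumption that the flag is recorded when the structure is created and the generation is updated on every scanned device. The names are hypothetical and calloc() stands in for the kernel allocator.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in holding only the two new members. */
struct fs_devices_sketch {
	bool fsid_change;
	uint64_t latest_generation;
};

/* Record whether the creating device had the in-progress flag set. */
static struct fs_devices_sketch *alloc_sketch(bool fsid_change_in_progress)
{
	struct fs_devices_sketch *fd = calloc(1, sizeof(*fd));

	if (!fd)
		return NULL; /* check for failure before touching the struct */
	fd->fsid_change = fsid_change_in_progress;
	return fd;
}

/* Keep the highest generation seen among scanned devices. */
static void note_scanned_generation(struct fs_devices_sketch *fd,
				    uint64_t found_transid)
{
	if (found_transid > fd->latest_generation)
		fd->latest_generation = found_transid;
}
```

The sketch writes to the structure only after the allocation has been checked, which is the ordering one would want around an IS_ERR() test as well.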

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/volumes.c | 9 -
 fs/btrfs/volumes.h | 5 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bf0aa900f22c..f9dcbe74093c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -785,6 +785,8 @@ static noinline struct btrfs_device *device_list_add(const char *path,
u64 devid = btrfs_stack_device_id(&disk_super->dev_item);
bool has_metadata_uuid = (btrfs_super_incompat_flags(disk_super) &
BTRFS_FEATURE_INCOMPAT_METADATA_UUID);
+   bool fsid_change_in_progress = (btrfs_super_flags(disk_super) &
+   BTRFS_SUPER_FLAG_CHANGING_FSID_V2);
 
if (has_metadata_uuid)
fs_devices = find_fsid(disk_super->fsid, 
disk_super->metadata_uuid);
@@ -798,6 +800,8 @@ static noinline struct btrfs_device *device_list_add(const char *path,
else
fs_devices = alloc_fs_devices(disk_super->fsid, NULL);
 
+   fs_devices->fsid_change = fsid_change_in_progress;
+
if (IS_ERR(fs_devices))
return ERR_CAST(fs_devices);
 
@@ -904,8 +908,11 @@ static noinline struct btrfs_device *device_list_add(const char *path,
 * it back. We need it to pick the disk with largest generation
 * (as above).
 */
-   if (!fs_devices->opened)
+   if (!fs_devices->opened) {
device->generation = found_transid;
+   fs_devices->latest_generation = max(found_transid,
+   fs_devices->latest_generation);
+   }
 
fs_devices->total_devices = btrfs_super_num_devices(disk_super);
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 04860497b33c..6b2a01c55426 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -211,6 +211,7 @@ BTRFS_DEVICE_GETSET_FUNCS(bytes_used);
 struct btrfs_fs_devices {
u8 fsid[BTRFS_FSID_SIZE]; /* FS specific uuid */
u8 metadata_uuid[BTRFS_FSID_SIZE];
+   bool fsid_change;
struct list_head fs_list;
 
u64 num_devices;
@@ -219,6 +220,10 @@ struct btrfs_fs_devices {
u64 missing_devices;
u64 total_rw_bytes;
u64 total_devices;
+
+   /* Highest generation number of seen devices */
+   u64 latest_generation;
+
struct block_device *latest_bdev;
 
/* all of the devices in the FS, protected by a mutex
-- 
2.7.4



[PATCH 1/6] btrfs: Introduce support for FSID change without metadata rewrite

2018-10-30 Thread Nikolay Borisov
The metadata_uuid field introduced here is going to be used when the
user wants to change the UUID of the filesystem without having to
rewrite all metadata blocks. The field adds another level of
indirection such that when the FSID is changed what really happens is
that the current UUID (the one with which the fs was created) is copied
to the 'metadata_uuid' field in the superblock and a new incompat flag,
METADATA_UUID, is set. When the kernel detects this flag is set it knows
that the superblock in fact has 2 UUIDs:

1. FSID - this is the UUID which is user-visible.
2. Metadata UUID - this is the UUID which is stamped into all on-disk
datastructures belonging to this file system.

When the new incompat flag is present device scanning checks whether
both fsid/metadata_uuid of the scanned device match any of the
registered filesystems. When the flag is not set then both UUIDs are
equal and only the FSID is retained on disk, metadata_uuid is set only
in-memory during mount.

Additionally a new metadata_fsid field is also added to the fs_info
struct. It's initialised either with the FSID in case METADATA_UUID
incompat flag is not set or with the metadata_uuid of the superblock
otherwise.

This commit introduces the new fields as well as the new incompat flag
and switches all users of the fsid to the new logic.
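
The indirection can be pictured with a small userspace model. It is a sketch only: the struct, the function names, and the flag's bit position are assumptions for illustration, and the change_fsid() helper models what a tool performing the change would do rather than anything in this kernel patch.

```c
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16 /* stand-in for BTRFS_FSID_SIZE */
/* Illustrative bit only; see the kernel headers for the real value. */
#define INCOMPAT_METADATA_UUID_SKETCH (1ULL << 10)

struct super_sketch {
	uint8_t fsid[FSID_SIZE];          /* user-visible UUID */
	uint8_t metadata_uuid[FSID_SIZE]; /* UUID stamped into btree blocks */
	uint64_t incompat_flags;
};

/* Without the flag both UUIDs are equal and only fsid is meaningful. */
static const uint8_t *effective_metadata_uuid(const struct super_sketch *sb)
{
	return (sb->incompat_flags & INCOMPAT_METADATA_UUID_SKETCH) ?
		sb->metadata_uuid : sb->fsid;
}

/* Change the visible FSID while keeping the metadata UUID untouched. */
static void change_fsid(struct super_sketch *sb, const uint8_t *new_fsid)
{
	if (!(sb->incompat_flags & INCOMPAT_METADATA_UUID_SKETCH)) {
		/* First change: preserve the original UUID for the metadata. */
		memcpy(sb->metadata_uuid, sb->fsid, FSID_SIZE);
		sb->incompat_flags |= INCOMPAT_METADATA_UUID_SKETCH;
	}
	memcpy(sb->fsid, new_fsid, FSID_SIZE);
}
```

However many times the FSID is changed, effective_metadata_uuid() keeps returning the UUID the filesystem was created with, which is why no metadata block needs rewriting.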

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ctree.c|  4 +--
 fs/btrfs/ctree.h| 12 ---
 fs/btrfs/disk-io.c  | 32 ++
 fs/btrfs/extent-tree.c  |  2 +-
 fs/btrfs/volumes.c  | 72 -
 fs/btrfs/volumes.h  |  1 +
 include/uapi/linux/btrfs.h  |  1 +
 include/uapi/linux/btrfs_tree.h |  1 +
 8 files changed, 97 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 539901fb5165..75cd41bf12f7 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -224,7 +224,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->fsid);
+   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
@@ -1050,7 +1050,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->fsid);
+   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
 
ret = update_ref_for_cow(trans, root, buf, cow, &last_ref);
if (ret) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68ca41dbbef3..501ada9ec7bd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -197,7 +197,7 @@ struct btrfs_root_backup {
 struct btrfs_super_block {
u8 csum[BTRFS_CSUM_SIZE];
/* the first 4 fields must match struct btrfs_header */
-   u8 fsid[BTRFS_FSID_SIZE];/* FS specific uuid */
+   u8 fsid[BTRFS_FSID_SIZE];/* userfacing FS specific uuid */
__le64 bytenr; /* this block number */
__le64 flags;
 
@@ -234,8 +234,10 @@ struct btrfs_super_block {
__le64 cache_generation;
__le64 uuid_tree_generation;
 
+   u8 metadata_uuid[BTRFS_FSID_SIZE]; /* The uuid written into btree blocks */
+
/* future expansion */
-   __le64 reserved[30];
+   __le64 reserved[28];
u8 sys_chunk_array[BTRFS_SYSTEM_CHUNK_ARRAY_SIZE];
struct btrfs_root_backup super_roots[BTRFS_NUM_BACKUP_ROOTS];
 } __attribute__ ((__packed__));
@@ -265,7 +267,8 @@ struct btrfs_super_block {
 BTRFS_FEATURE_INCOMPAT_RAID56 |\
 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF | \
 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |   \
-BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+BTRFS_FEATURE_INCOMPAT_NO_HOLES|   \
+BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET\
(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
@@ -746,7 +749,8 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BALANCE_RUNNING   18
 
 struct btrfs_fs_info {
-   u8 fsid[BTRFS_FSID_SIZE];
+   u8 fsid[BTRFS_FSID_SIZE]; /* User-visible fs UUID */
+   u8 metadata_fsid[BTRFS_FSID_SIZE]; /* UUID written to btree blocks */
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
unsigned long flags;
struct btrfs_root *extent_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b0ab41da91d1..b76b18388b93 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -551,7 +551,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
if (WARN_ON(!PageUptodate(page)))
return -EUCLEAN;
 
-   ASSERT(memcmp_extent_buffer(eb, fs_info->fsid,
+   ASSERT(memcmp_extent_buffer(eb, 

[PATCH 2/6] btrfs: Remove fsid/metadata_fsid fields from btrfs_info

2018-10-30 Thread Nikolay Borisov
Currently btrfs_fs_info structure contains a copy of the
fsid/metadata_uuid fields. Same values are also contained in the
btrfs_fs_devices structure which fs_info has a reference to. Let's
reduce duplication by removing the fields from fs_info and always refer
to the ones in fs_devices. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/check-integrity.c   |  2 +-
 fs/btrfs/ctree.c |  5 +++--
 fs/btrfs/ctree.h |  2 --
 fs/btrfs/disk-io.c   | 21 -
 fs/btrfs/extent-tree.c   |  2 +-
 fs/btrfs/ioctl.c |  2 +-
 fs/btrfs/super.c |  2 +-
 fs/btrfs/volumes.c   | 10 --
 include/trace/events/btrfs.h |  2 +-
 9 files changed, 24 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 2e43fba44035..781cae168d2a 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1720,7 +1720,7 @@ static int btrfsic_test_for_metadata(struct btrfsic_state *state,
num_pages = state->metablock_size >> PAGE_SHIFT;
h = (struct btrfs_header *)datav[0];
 
-   if (memcmp(h->fsid, fs_info->fsid, BTRFS_FSID_SIZE))
+   if (memcmp(h->fsid, fs_info->fs_devices->fsid, BTRFS_FSID_SIZE))
return 1;
 
for (i = 0; i < num_pages; i++) {
diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 75cd41bf12f7..61f14a9836a1 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -12,6 +12,7 @@
 #include "transaction.h"
 #include "print-tree.h"
 #include "locking.h"
+#include "volumes.h"
 
 static int split_node(struct btrfs_trans_handle *trans, struct btrfs_root
  *root, struct btrfs_path *path, int level);
@@ -224,7 +225,7 @@ int btrfs_copy_root(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, new_root_objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
+   write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
 
WARN_ON(btrfs_header_generation(buf) > trans->transid);
if (new_root_objectid == BTRFS_TREE_RELOC_OBJECTID)
@@ -1050,7 +1051,7 @@ static noinline int __btrfs_cow_block(struct btrfs_trans_handle *trans,
else
btrfs_set_header_owner(cow, root->root_key.objectid);
 
-   write_extent_buffer_fsid(cow, fs_info->metadata_fsid);
+   write_extent_buffer_fsid(cow, fs_info->fs_devices->metadata_uuid);
 
ret = update_ref_for_cow(trans, root, buf, cow, &last_ref);
if (ret) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 501ada9ec7bd..8531f0f5d672 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -749,8 +749,6 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BALANCE_RUNNING   18
 
 struct btrfs_fs_info {
-   u8 fsid[BTRFS_FSID_SIZE]; /* User-visible fs UUID */
-   u8 metadata_fsid[BTRFS_FSID_SIZE]; /* UUID written to btree blocks */
u8 chunk_tree_uuid[BTRFS_UUID_SIZE];
unsigned long flags;
struct btrfs_root *extent_root;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b76b18388b93..a458ef5b605e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -551,7 +551,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info *fs_info, struct page *page)
if (WARN_ON(!PageUptodate(page)))
return -EUCLEAN;
 
-   ASSERT(memcmp_extent_buffer(eb, fs_info->metadata_fsid,
+   ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
btrfs_header_fsid(), BTRFS_FSID_SIZE) == 0);
 
return csum_tree_block(fs_info, eb, 0);
@@ -2490,11 +2490,12 @@ static int validate_super(struct btrfs_fs_info *fs_info,
ret = -EINVAL;
}
 
-   if (memcmp(fs_info->metadata_fsid, sb->dev_item.fsid,
+   if (memcmp(fs_info->fs_devices->metadata_uuid, sb->dev_item.fsid,
   BTRFS_FSID_SIZE) != 0) {
btrfs_err(fs_info,
   "dev_item UUID does not match metadata fsid: %pU != %pU",
-  fs_info->metadata_fsid, sb->dev_item.fsid);
+  fs_info->fs_devices->metadata_uuid,
+  sb->dev_item.fsid);
ret = -EINVAL;
}
 
@@ -2834,14 +2835,16 @@ int open_ctree(struct super_block *sb,
   sizeof(*fs_info->super_for_commit));
brelse(bh);
 
-   memcpy(fs_info->fsid, fs_info->super_copy->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(fs_info->fs_devices->fsid, fs_info->super_copy->fsid,
+  BTRFS_FSID_SIZE));
+
if (btrfs_fs_incompat(fs_info, METADATA_UUID)) {
-   memcpy(fs_info->metadata_fsid,
-  fs_info->super_copy->metadata_uuid, BTRFS_FSID_SIZE);
-   } else {
-   memcpy(fs_info->metadata_fsid, fs_info->fsid, BTRFS_FSID_SIZE);
+   ASSERT(!memcmp(fs_info->fs_devices->metadata_uuid,
+ 

[PATCH v3 0/6] FSID change kernel support

2018-10-30 Thread Nikolay Borisov
Here is the 3rd submission for the kernel counterpart of the uuid change
patchset. The only difference is that I have (I hope) addressed all cosmetic
feedback from David and reworded some change logs to ease
understanding. I've also re-run the regression tests and no failures were
observed.

For background information refer to the first posting [0] and the second one [1].

[0] 
https://lore.kernel.org/linux-btrfs/1535531754-29774-1-git-send-email-nbori...@suse.com/
[1] 
https://lore.kernel.org/linux-btrfs/1539270244-27076-1-git-send-email-nbori...@suse.com/

Nikolay Borisov (6):
  btrfs: Introduce support for FSID change without metadata rewrite
  btrfs: Remove fsid/metadata_fsid fields from btrfs_info
  btrfs: Add handling for disk split-brain scenario during fsid change
  btrfs: Introduce 2 more members to struct btrfs_fs_devices
  btrfs: Handle one more split-brain scenario during fsid change
  btrfs: Handle final split-brain possibility during fsid change

 fs/btrfs/check-integrity.c      |   2 +-
 fs/btrfs/ctree.c                |   5 +-
 fs/btrfs/ctree.h                |  10 +-
 fs/btrfs/disk-io.c              |  53 ---
 fs/btrfs/extent-tree.c          |   2 +-
 fs/btrfs/ioctl.c                |   2 +-
 fs/btrfs/super.c                |   2 +-
 fs/btrfs/volumes.c              | 196
 fs/btrfs/volumes.h              |   6 ++
 include/trace/events/btrfs.h    |   2 +-
 include/uapi/linux/btrfs.h      |   1 +
 include/uapi/linux/btrfs_tree.h |   1 +
 12 files changed, 241 insertions(+), 41 deletions(-)

-- 
2.7.4
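
For readers skimming the series, here is a minimal sketch of the selection
rule it relies on, as it can be read from the open_ctree() hunk that patch 2
removes: the UUID stamped into btree blocks is the superblock's metadata_uuid
when the METADATA_UUID incompat bit is set, and the user-visible fsid
otherwise. The struct, the flag value and the function name below are
illustrative stand-ins made up for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FSID_SIZE 16                      /* same length as BTRFS_FSID_SIZE */
#define SKETCH_INCOMPAT_METADATA_UUID 1u  /* hypothetical bit, not the kernel's value */

/* Hypothetical, cut-down stand-in for the on-disk superblock. */
struct sketch_super {
    uint64_t incompat_flags;
    uint8_t fsid[FSID_SIZE];           /* user-visible filesystem UUID */
    uint8_t metadata_uuid[FSID_SIZE];  /* UUID written to btree blocks */
};

/* Copy into 'out' the UUID that metadata block headers are expected to carry. */
static void sketch_metadata_uuid(const struct sketch_super *sb,
                                 uint8_t out[FSID_SIZE])
{
    assert(sb != NULL && out != NULL);

    if (sb->incompat_flags & SKETCH_INCOMPAT_METADATA_UUID)
        memcpy(out, sb->metadata_uuid, FSID_SIZE);  /* fsid was changed */
    else
        memcpy(out, sb->fsid, FSID_SIZE);           /* both UUIDs are the same */
}
```

With the flag clear the function hands back the fsid; with it set it hands
back metadata_uuid, which is why the user-visible fsid can change without
rewriting every metadata block.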



Re: Understanding "btrfs filesystem usage"

2018-10-30 Thread Eli V
On Mon, Oct 29, 2018 at 6:46 PM Hugo Mills  wrote:
>
> On Mon, Oct 29, 2018 at 05:57:10PM -0400, Remi Gauvin wrote:
> > On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
> > > I want to know how much free space is left and have problems in
> > > interpreting the output of:
> > >
> > > btrfs filesystem usage
> > > btrfs filesystem df
> > > btrfs filesystem show
> > >
> > >
> >
> > In my not so humble opinion, the filesystem usage command has the
> > easiest to understand output.  It lays out all the pertinent information.
>
>Opinions are divided. I find it almost impossible to read, and
> always use btrfs fi df and btrfs fi show together.

I find the tabular output via -T makes btrfs filesystem usage much easier to
read, and it's now the only command I use to look at space usage on
btrfs.

>
>There are short tutorials on how to read the output in both cases in
> the FAQ, which is where I start out by directing people in this
> instance.
>
>Hugo.
>
> > You can clearly see 825GiB is allocated, with 494GiB used, therefore,
> > filesystem show is actually using the "Allocated" value as "Used".
> > Allocated can be thought of "Reserved For".  As the output of the Usage
> > command and df command clearly show, you have almost 400GiB space available.
> >
> > Note that the btrfs commands are clearly and explicitly displaying
> > values in Binary units, (Mi, and Gi prefix, respectively).  If you want
> > df command to match, use -h instead of -H (see man df)
> >
> > An observation:
> >
> > The disparity between 498GiB used and 823GiB is pretty high.  This is
> > probably the result of using an SSD with an older kernel.  If your
> > kernel is not very recent, (sorry, I forget where this was fixed,
> > somewhere around 4.14 or 4.15), then consider mounting with the nossd
> > option.  You can improve this by running a balance.
> >
> > Something like:
> > btrfs balance start -dusage=55
> >
> > You do *not* want to end up with all your space allocated to Data, but
> > not actually used by data.  Bad things can happen if you run out of
> > Unallocated space for more metadata. (not catastrophic, but awkward and
> > unexpected downtime that can be a little tricky to sort out.)
> >
> >
>
> > begin:vcard
> > fn:Remi Gauvin
> > n:Gauvin;Remi
> > org:Georgian Infotech
> > adr:;;3-51 Sykes St. N.;Meaford;ON;N4L 1X3;Canada
> > email;internet:r...@georgianit.com
> > tel;work:226-256-1545
> > version:2.1
> > end:vcard
> >
>
>
> --
> Hugo Mills | Great oxymorons of the world, no. 8:
> hugo@... carfax.org.uk | The Latest In Proven Technology
> http://carfax.org.uk/  |
> PGP: E2AB1DE4  |
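
To make the arithmetic in the quoted advice concrete: with the first pair of
figures above, 825GiB allocated minus 494GiB used leaves about 331GiB sitting
in partly filled chunks, and that is the space a filtered balance tries to
return to the unallocated pool. The sketch below is only a model of what a
usage=N balance filter selects (block groups with usage under the given
percentage); the function names are made up for illustration and this is not
btrfs-progs or kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Space allocated to chunks but not holding data, in the unit of the inputs. */
static uint64_t sketch_allocated_unused(uint64_t allocated, uint64_t used)
{
    assert(used <= allocated);
    return allocated - used;
}

/* Model of a usage=N filter: select block groups filled to less than N percent. */
static bool sketch_usage_filter_matches(uint64_t bg_size, uint64_t bg_used,
                                        unsigned int percent)
{
    assert(bg_size > 0 && bg_used <= bg_size);
    /* Compare bg_used / bg_size < percent / 100 without dividing. */
    return bg_used * 100 < bg_size * (uint64_t)percent;
}
```

Under this model a half-full 1GiB block group matches -dusage=55 and gets
relocated, while one that is 60% full is left alone.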


Re: [PATCH] Btrfs: fix cur_offset in the error case for nocow

2018-10-30 Thread Filipe Manana
On Tue, Oct 30, 2018 at 10:05 AM robbieko  wrote:
>
> From: Robbie Ko 
>
> When cow_file_range fails, the related resources are
> unlocked according to the range (start-end), so the unlock
> must not be repeated in run_delalloc_nocow.
>
> In some cases (e.g. cur_offset <= end && cow_start != -1),
> cur_offset is not updated correctly, so move the cur_offset
> update before cow_file_range.
>
> [ cut here ]
> kernel BUG at mm/page-writeback.c:2663!
> Internal error: Oops - BUG: 0 [#1] SMP
> CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
> Hardware name: Realtek_RTD1296 (DT)
> Workqueue: writeback wb_workfn (flush-btrfs-1)
> task: ffc076db3380 ti: ffc02e9ac000 task.ti: ffc02e9ac000
> PC is at clear_page_dirty_for_io+0x1bc/0x1e8
> LR is at clear_page_dirty_for_io+0x14/0x1e8
> pc : [] lr : [] pstate: 4145
> sp : ffc02e9af4f0
> Process kworker/u8:7 (pid: 31525, stack limit = 0xffc02e9ac020)
> Call trace:
> [] clear_page_dirty_for_io+0x1bc/0x1e8
> [] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
> [] run_delalloc_nocow+0x3b8/0x948 [btrfs]
> [] run_delalloc_range+0x250/0x3a8 [btrfs]
> [] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
> [] __extent_writepage+0xe8/0x248 [btrfs]
> [] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
> [] extent_writepages+0x48/0x68 [btrfs]
> [] btrfs_writepages+0x20/0x30 [btrfs]
> [] do_writepages+0x30/0x88
> [] __writeback_single_inode+0x34/0x198
> [] writeback_sb_inodes+0x184/0x3c0
> [] __writeback_inodes_wb+0x6c/0xc0
> [] wb_writeback+0x1b8/0x1c0
> [] wb_workfn+0x150/0x250
> [] process_one_work+0x1dc/0x388
> [] worker_thread+0x130/0x500
> [] kthread+0x10c/0x110
> [] ret_from_fork+0x10/0x40
> Code: d503201f a9025bb5 a90363b7 f90023b9 (d421)
> ---[ end trace 65fecee7c2296f25 ]---
>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 181c58b..b62299b 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1532,10 +1532,10 @@ static noinline int run_delalloc_nocow(struct inode *inode,
>
> if (cur_offset <= end && cow_start == (u64)-1) {
> cow_start = cur_offset;
> -   cur_offset = end;
> }

Also remove the { }

Other than that, it looks good to me and you can add:

Reviewed-by: Filipe Manana 

thanks

>
> if (cow_start != (u64)-1) {
> +   cur_offset = end;
> ret = cow_file_range(inode, locked_page, cow_start, end, end,
>  page_started, nr_written, 1, NULL);
> if (ret)
> --
> 1.9.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] Btrfs: incremental send, fix infinite loop when apply children dir moves

2018-10-30 Thread Filipe Manana
On Tue, Oct 30, 2018 at 7:00 AM robbieko  wrote:
>
> From: Robbie Ko 
>
> In apply_children_dir_moves, we first create an empty list (stack),
> then we get an entry from pending_dir_moves and add it to the stack,
> but we didn't delete the entry from rb_tree.
>
> So, in add_pending_dir_move, we create a new entry and then use the
> parent_ino in the current rb_tree to find the corresponding entry,
> and if so, add the new entry to the corresponding list.
>
> However, the entry may have been added to the stack, causing new
> entries to be added to the stack as well.
>
> Finally, each time we take the first entry from the stack and start
> processing, it ends up with an infinite loop.
>
> Fix this problem by remove node from pending_dir_moves,
> avoid add new pending_dir_move to error list.

I can't parse that explanation.
Can you give a concrete example (reproducer), or did this come out of thin air?

Thanks.

>
> Signed-off-by: Robbie Ko 
> ---
>  fs/btrfs/send.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
> index 094cc144..5be83b5 100644
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3340,7 +3340,8 @@ static void free_pending_move(struct send_ctx *sctx, struct pending_dir_move *m)
> kfree(m);
>  }
>
> -static void tail_append_pending_moves(struct pending_dir_move *moves,
> +static void tail_append_pending_moves(struct send_ctx *sctx,
> + struct pending_dir_move *moves,
>   struct list_head *stack)
>  {
> if (list_empty(&moves->list)) {
> @@ -3351,6 +3352,10 @@ static void tail_append_pending_moves(struct pending_dir_move *moves,
> list_add_tail(&moves->list, stack);
> list_splice_tail(&list, stack);
> }
> +   if (!RB_EMPTY_NODE(&moves->node)) {
> +   rb_erase(&moves->node, &sctx->pending_dir_moves);
> +   RB_CLEAR_NODE(&moves->node);
> +   }
>  }
>
>  static int apply_children_dir_moves(struct send_ctx *sctx)
> @@ -3365,7 +3370,7 @@ static int apply_children_dir_moves(struct send_ctx *sctx)
> return 0;
>
> INIT_LIST_HEAD(&stack);
> -   tail_append_pending_moves(pm, &stack);
> +   tail_append_pending_moves(sctx, pm, &stack);
>
> while (!list_empty(&stack)) {
> pm = list_first_entry(&stack, struct pending_dir_move, list);
> @@ -3376,7 +3381,7 @@ static int apply_children_dir_moves(struct send_ctx *sctx)
> goto out;
> pm = get_pending_dir_moves(sctx, parent_ino);
> if (pm)
> -   tail_append_pending_moves(pm, &stack);
> +   tail_append_pending_moves(sctx, pm, &stack);
> }
> return 0;
>
> --
> 1.9.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH v3] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-30 Thread Qu Wenruo
For any recent kernel, there is a chance that btrfs/057 reports false
errors.

The false error would look like:
  btrfs/057 4s ... - output mismatch (see /home/adam/xfstests-dev/results//btrfs/057.out.bad)
  --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
  +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad   2018-10-29 14:07:28.443651293 +0800
  @@ -1,3 +1,3 @@
   QA output created by 057
   4096 4096
  -4096 4096
  +28672 28672

This is because "btrfs subvolume sync" (or a vanilla sync) does not
guarantee that orphan (unlinked but still existing) files have been
removed.

In fact, for that false error case, if you inspect the fs after umount,
its qgroup numbers are correct and btrfs check won't report a qgroup error.

To fix the false alerts, just skip any manual qgroup number comparison,
and let the fsck run after the test case detect any problem.

This also eliminates the need for specific mount and mkfs
options, allowing us to improve coverage.

Reported-by: Nikolay Borisov 
Signed-off-by: Qu Wenruo 
Reviewed-by: Filipe Manana 
---
Changelog:
v2:
  Update commit message to show this is a long existing bug.
v3:
  Remove an old comment since now we don't need to specify the leaf
  size.
  Added Reviewed-by tags.
---
 tests/btrfs/057 | 18 --
 tests/btrfs/057.out |  3 +--
 2 files changed, 5 insertions(+), 16 deletions(-)

diff --git a/tests/btrfs/057 b/tests/btrfs/057
index b019f4e1e054..82e3162ebfeb 100755
--- a/tests/btrfs/057
+++ b/tests/btrfs/057
@@ -32,13 +32,9 @@ _require_scratch
 
 rm -f $seqres.full
 
-# use small leaf size to get higher btree height.
-run_check _scratch_mkfs "-b 1g --nodesize 4096"
+run_check _scratch_mkfs "-b 1g"
 
-# inode cache is saved in the FS tree itself for every
-# individual FS tree,that affects the sizes reported by qgroup show
-# so we need to explicitly turn it off to get consistent values.
-_scratch_mount "-o noinode_cache"
+_scratch_mount
 
 # -w ensures that the only ops are ones which cause write I/O
 run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
@@ -53,14 +49,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 1000 \
 _run_btrfs_util_prog quota enable $SCRATCH_MNT
 _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
 
-# remove all file/dir other than subvolume
-rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
-rm -rf $SCRATCH_MNT/p* >& /dev/null
-
-_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
-units=`_btrfs_qgroup_units`
-$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
-   | $AWK_PROG '{print $2" "$3}'
+echo "Silence is golden"
+# btrfs check will detect any qgroup number mismatch.
 
 status=0
 exit
diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
index 60cb92d0926c..185023c79961 100644
--- a/tests/btrfs/057.out
+++ b/tests/btrfs/057.out
@@ -1,3 +1,2 @@
 QA output created by 057
-4096 4096
-4096 4096
+Silence is golden
-- 
2.19.1



[PATCH] Btrfs: fix cur_offset in the error case for nocow

2018-10-30 Thread robbieko
From: Robbie Ko 

When cow_file_range fails, the related resources are
unlocked according to the range (start-end), so the unlock
must not be repeated in run_delalloc_nocow.

In some cases (e.g. cur_offset <= end && cow_start != -1),
cur_offset is not updated correctly, so move the cur_offset
update before cow_file_range.

[ cut here ]
kernel BUG at mm/page-writeback.c:2663!
Internal error: Oops - BUG: 0 [#1] SMP
CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
Hardware name: Realtek_RTD1296 (DT)
Workqueue: writeback wb_workfn (flush-btrfs-1)
task: ffc076db3380 ti: ffc02e9ac000 task.ti: ffc02e9ac000
PC is at clear_page_dirty_for_io+0x1bc/0x1e8
LR is at clear_page_dirty_for_io+0x14/0x1e8
pc : [] lr : [] pstate: 4145
sp : ffc02e9af4f0
Process kworker/u8:7 (pid: 31525, stack limit = 0xffc02e9ac020)
Call trace:
[] clear_page_dirty_for_io+0x1bc/0x1e8
[] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
[] run_delalloc_nocow+0x3b8/0x948 [btrfs]
[] run_delalloc_range+0x250/0x3a8 [btrfs]
[] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
[] __extent_writepage+0xe8/0x248 [btrfs]
[] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
[] extent_writepages+0x48/0x68 [btrfs]
[] btrfs_writepages+0x20/0x30 [btrfs]
[] do_writepages+0x30/0x88
[] __writeback_single_inode+0x34/0x198
[] writeback_sb_inodes+0x184/0x3c0
[] __writeback_inodes_wb+0x6c/0xc0
[] wb_writeback+0x1b8/0x1c0
[] wb_workfn+0x150/0x250
[] process_one_work+0x1dc/0x388
[] worker_thread+0x130/0x500
[] kthread+0x10c/0x110
[] ret_from_fork+0x10/0x40
Code: d503201f a9025bb5 a90363b7 f90023b9 (d421)
---[ end trace 65fecee7c2296f25 ]---

Signed-off-by: Robbie Ko 
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 181c58b..b62299b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1532,10 +1532,10 @@ static noinline int run_delalloc_nocow(struct inode *inode,
 
if (cur_offset <= end && cow_start == (u64)-1) {
cow_start = cur_offset;
-   cur_offset = end;
}
 
if (cow_start != (u64)-1) {
+   cur_offset = end;
ret = cow_file_range(inode, locked_page, cow_start, end, end,
 page_started, nr_written, 1, NULL);
if (ret)
-- 
1.9.1
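
A minimal model of the control flow this patch touches, to show why moving
the assignment matters. It assumes (this is not visible in the hunk above)
that the function's error path unlocks the range cur_offset..end again
whenever cur_offset < end, and that cow_file_range() has already unlocked up
to end when it fails. The function name and the constant are made up for the
sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NO_COW_START UINT64_MAX  /* stands in for (u64)-1 */

/*
 * Model the tail of run_delalloc_nocow() and report whether the error
 * path would unlock the range a second time after cow_file_range() fails.
 */
static bool sketch_error_path_unlocks_again(uint64_t cur_offset, uint64_t end,
                                            uint64_t cow_start, bool patched)
{
    assert(cow_start == SKETCH_NO_COW_START || cow_start <= end);

    if (cur_offset <= end && cow_start == SKETCH_NO_COW_START) {
        cow_start = cur_offset;
        if (!patched)
            cur_offset = end;   /* old code: only updated in this branch */
    }

    if (cow_start != SKETCH_NO_COW_START) {
        if (patched)
            cur_offset = end;   /* new code: always updated before the call */
        /* Assume cow_file_range() fails here after unlocking up to end. */
        return cur_offset < end; /* assumed condition of the error path */
    }
    return false;
}
```

In this model the old code repeats the unlock only when cow_start was already
set on entry to the tail, which is the case named in the commit message; the
patched code never does.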



Btrfs progs pre-release 4.19-rc1

2018-10-30 Thread David Sterba
Hi,

this is a pre-release of btrfs-progs, 4.19-rc1. Version 4.18 was skipped to
keep the release time close to the kernel's. The sort-of promise that
'progs version X supports features from kernel X' does not hold for the user
accessible ioctls to list subvolumes. As this is not a critical feature that's
missing, hopefully this is bearable.

On the downside this blocked the whole 4.18 release as this is a user interface
change that must be done right on the first try. I don't want to repeat this in
future releases so the kernel/userspace feature parity will be more relaxed.

The 4.19 release is scheduled for this Friday, +4 days (2018-11-02).

Changelog:

* check: support repair of fs with free-space-tree feature
* core:
  * port delayed ref infrastructure from kernel
  * support write to free space tree
* dump-tree: new options for BFS and DFS enumeration of b-trees
* quota: rescan is now done automatically after 'assign'
* btrfstune: incomplete fix to uuid change
* subvol: fix 255 char limit checks
* completion: complete block devices and now regular files too
* docs:
  * ship uncompressed manual pages
  * btrfsck uses a manual page link instead of symlink
* other
  * improved error handling
  * docs
  * new tests

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (9):
  btrfs-progs: btrfstune: allow to continue uuid change
  btrfs-progs: tests: renumber last fsck test to 036-rescan-not-kicked-in
  btrfs-progs: docs: use manual page link instead of symlink
  btrfs-progs: build: remove gzip dependency
  btrfs-progs: docs: update clean target file masks
  btrfs-progs: clean up .gitignore
  btrfs-progs: tests: add runtime check for free-space-tree
  btrfs-progs: convert strerror to implicit %m
  btrfs-progs: update CHANGES for v4.19

Mike Gilbert (1):
  btrfs-progs: docs: install uncompressed manual pages

Misono Tomohiro (3):
  btrfs-progs: doc: update manual page of btrfs subvolume
  btrfs-progs: ioctl/libbtrfsutil: add 3 definitions of new unprivileged ioctl
  libbtrfsutil: factor out btrfs_util_subvolume_info_fd

Nikolay Borisov (23):
  btrfs-progs: tests: add test for missing device delete error value
  btrfs-progs: add __free_extent2 function
  btrfs-progs: add alloc_reserved_tree_block2 function
  btrfs-progs: Add delayed refs infrastructure
  btrfs-progs: Make btrfs_write_dirty_block_groups take only trans argument
  btrfs-progs: Wire up delayed refs
  btrfs-progs: Remove old delayed refs infrastructure
  btrfs-progs: Remove __free_extent2, now unused
  btrfs-progs: Merge alloc_reserved_tree_block2 and alloc_reserved_tree_block
  btrfs-progs: Add support for freespace tree in btrfs_read_fs_root
  btrfs-progs: Add extent buffer bitmap manipulation infrastructure
  btrfs-progs: Replace homegrown bitops related functions with kernel counterparts
  btrfs-progs: Implement find_*_bit_le operations
  btrfs-progs: Pull free space tree related code from kernel
  btrfs-progs: Hook FST code in extent (de)alloc
  btrfs-progs: Add freespace tree as compat_ro supported feature
  btrfs-progs: check: Add support for freespace tree fixing
  btrfs-progs: tests: Test for FST corruption detection/repair
  btrfs-progs: check: lowmem: Factor out inline extent checking code in its own function
  btrfs-progs: check: lowmem: Refactor extent len test in check_file_extent_inline
  btrfs-progs: check: lowmem: Refactor extent type checks in check_file_extent
  btrfs-progs: btrfstune: Remove fs_info arg from change_device_uuid
  btrfs-progs: btrfstune: Rename change_header_uuid to change_buffer_header_uuid

Qu Wenruo (22):
  btrfs-progs: transaction: do proper error handling in transaction commit
  btrfs-progs: completion: use _filedir to replace _btrfs_devs
  btrfs-progs: completion: let dump-tree/dump-super/inode-resolve accept any file
  btrfs-progs: print-tree: skip deprecated blockptr / nodesize output
  btrfs-progs: exit gracefully if we hit ENOSPC when allocating tree block
  btrfs-progs: exit gracefully when root dir item repair fails
  btrfs-progs: only warn if there are leaked extent buffers after transaction abort
  btrfs-progs: fix infinite loop when bad key order repair fails
  btrfs-progs: exit gracefully when device extent allocation fails
  btrfs-progs: rescue-super: don't double free fs_devices
  btrfs-progs: qgroup: don't return 1 if qgroup is marked inconsistent during relationship assignment
  btrfs-progs: convert: Make read_disk_extent return more -EIO instead of -1
  btrfs-progs: convert: Output meaningful error messages for create_image
  btrfs-progs: image: Warn about log tree generation mismatch when restoring
  btrfs-progs: Replace root parameter using fs_info for 

Re: [PATCH v2] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-30 Thread Nikolay Borisov



On 30.10.18 at 11:07, Qu Wenruo wrote:
> For any recent kernel, there is a chance that btrfs/057 reports false
> errors.
> 
> The false error would look like:
>   btrfs/057 4s ... - output mismatch (see /home/adam/xfstests-dev/results//btrfs/057.out.bad)
>   --- tests/btrfs/057.out 2017-08-21 09:25:33.1 +0800
>   +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad  2018-10-29 14:07:28.443651293 +0800
>   @@ -1,3 +1,3 @@
>QA output created by 057
>4096 4096
>   -4096 4096
>   +28672 28672
> 
> This is because "btrfs subvolume sync" (or a vanilla sync) does not
> guarantee that orphan (unlinked but still existing) files have been
> removed.
> 
> In fact, for that false error case, if you inspect the fs after umount,
> its qgroup numbers are correct and btrfs check won't report a qgroup error.
> 
> To fix the false alerts, just skip any manual qgroup number comparison,
> and let the fsck run after the test case detect any problem.
> 
> This also eliminates the need for specific mount and mkfs
> options, allowing us to improve coverage.
> 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 
> ---
> Changelog:
> v2:
>   Update commit message to show this is a long existing bug.
> ---
>  tests/btrfs/057 | 17 -
>  tests/btrfs/057.out |  3 +--
>  2 files changed, 5 insertions(+), 15 deletions(-)
> 
> diff --git a/tests/btrfs/057 b/tests/btrfs/057
> index b019f4e1e054..0b5a36d34852 100755
> --- a/tests/btrfs/057
> +++ b/tests/btrfs/057
> @@ -33,12 +33,9 @@ _require_scratch
>  rm -f $seqres.full
>  
>  # use small leaf size to get higher btree height.
> -run_check _scratch_mkfs "-b 1g --nodesize 4096"
> +run_check _scratch_mkfs "-b 1g"

There was feedback from Filipe on V1 that you also need to delete the
above comment since it's no longer valid.

>  
> -# inode cache is saved in the FS tree itself for every
> -# individual FS tree,that affects the sizes reported by qgroup show
> -# so we need to explicitly turn it off to get consistent values.
> -_scratch_mount "-o noinode_cache"
> +_scratch_mount
>  
>  # -w ensures that the only ops are ones which cause write I/O
>  run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
> @@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 1000 \
>  _run_btrfs_util_prog quota enable $SCRATCH_MNT
>  _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>  
> -# remove all file/dir other than subvolume
> -rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
> -rm -rf $SCRATCH_MNT/p* >& /dev/null
> -
> -_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
> -units=`_btrfs_qgroup_units`
> -$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
> - | $AWK_PROG '{print $2" "$3}'
> +echo "Silence is golden"
> +# btrfs check will detect any qgroup number mismatch.
>  
>  status=0
>  exit
> diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
> index 60cb92d0926c..185023c79961 100644
> --- a/tests/btrfs/057.out
> +++ b/tests/btrfs/057.out
> @@ -1,3 +1,2 @@
>  QA output created by 057
> -4096 4096
> -4096 4096
> +Silence is golden
> 


[PATCH v2] fstests: btrfs/057: Fix false alerts due to orphan files

2018-10-30 Thread Qu Wenruo
For any recent kernel, there is a chance that btrfs/057 reports false
errors.

The false error would look like:
  btrfs/057 4s ... - output mismatch (see /home/adam/xfstests-dev/results//btrfs/057.out.bad)
  --- tests/btrfs/057.out   2017-08-21 09:25:33.1 +0800
  +++ /home/adam/xfstests-dev/results//btrfs/057.out.bad   2018-10-29 14:07:28.443651293 +0800
  @@ -1,3 +1,3 @@
   QA output created by 057
   4096 4096
  -4096 4096
  +28672 28672

This is because "btrfs subvolume sync" (or a vanilla sync) does not
guarantee that orphan (unlinked but still existing) files have been
removed.

In fact, for that false error case, if you inspect the fs after umount,
its qgroup numbers are correct and btrfs check won't report a qgroup error.

To fix the false alerts, just skip any manual qgroup number comparison,
and let the fsck run after the test case detect any problem.

This also eliminates the need for specific mount and mkfs
options, allowing us to improve coverage.

Reported-by: Nikolay Borisov 
Signed-off-by: Qu Wenruo 
---
Changelog:
v2:
  Update commit message to show this is a long existing bug.
---
 tests/btrfs/057 | 17 -
 tests/btrfs/057.out |  3 +--
 2 files changed, 5 insertions(+), 15 deletions(-)

diff --git a/tests/btrfs/057 b/tests/btrfs/057
index b019f4e1e054..0b5a36d34852 100755
--- a/tests/btrfs/057
+++ b/tests/btrfs/057
@@ -33,12 +33,9 @@ _require_scratch
 rm -f $seqres.full
 
 # use small leaf size to get higher btree height.
-run_check _scratch_mkfs "-b 1g --nodesize 4096"
+run_check _scratch_mkfs "-b 1g"
 
-# inode cache is saved in the FS tree itself for every
-# individual FS tree,that affects the sizes reported by qgroup show
-# so we need to explicitly turn it off to get consistent values.
-_scratch_mount "-o noinode_cache"
+_scratch_mount
 
 # -w ensures that the only ops are ones which cause write I/O
 run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
@@ -53,14 +50,8 @@ run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 1000 \
 _run_btrfs_util_prog quota enable $SCRATCH_MNT
 _run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
 
-# remove all file/dir other than subvolume
-rm -rf $SCRATCH_MNT/snap1/* >& /dev/null
-rm -rf $SCRATCH_MNT/p* >& /dev/null
-
-_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
-units=`_btrfs_qgroup_units`
-$BTRFS_UTIL_PROG qgroup show $units $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
-   | $AWK_PROG '{print $2" "$3}'
+echo "Silence is golden"
+# btrfs check will detect any qgroup number mismatch.
 
 status=0
 exit
diff --git a/tests/btrfs/057.out b/tests/btrfs/057.out
index 60cb92d0926c..185023c79961 100644
--- a/tests/btrfs/057.out
+++ b/tests/btrfs/057.out
@@ -1,3 +1,2 @@
 QA output created by 057
-4096 4096
-4096 4096
+Silence is golden
-- 
2.19.1



[PATCH] Btrfs: incremental send, fix infinite loop when apply children dir moves

2018-10-30 Thread robbieko
From: Robbie Ko 

In apply_children_dir_moves, we first create an empty list (stack),
then we get an entry from pending_dir_moves and add it to the stack,
but we didn't delete the entry from rb_tree.

So, in add_pending_dir_move, we create a new entry and then use the
parent_ino in the current rb_tree to find the corresponding entry,
and if so, add the new entry to the corresponding list.

However, the entry may have been added to the stack, causing new
entries to be added to the stack as well.

Finally, each time we take the first entry from the stack and start
processing, it ends up with an infinite loop.

Fix this problem by removing the node from pending_dir_moves, which
avoids adding a new pending_dir_move to the wrong list.

Signed-off-by: Robbie Ko 
---
 fs/btrfs/send.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 094cc144..5be83b5 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -3340,7 +3340,8 @@ static void free_pending_move(struct send_ctx *sctx, struct pending_dir_move *m)
kfree(m);
 }
 
-static void tail_append_pending_moves(struct pending_dir_move *moves,
+static void tail_append_pending_moves(struct send_ctx *sctx,
+ struct pending_dir_move *moves,
  struct list_head *stack)
 {
if (list_empty(&moves->list)) {
@@ -3351,6 +3352,10 @@ static void tail_append_pending_moves(struct pending_dir_move *moves,
list_add_tail(&moves->list, stack);
list_splice_tail(&list, stack);
}
+   if (!RB_EMPTY_NODE(&moves->node)) {
+   rb_erase(&moves->node, &sctx->pending_dir_moves);
+   RB_CLEAR_NODE(&moves->node);
+   }
 }
 
 static int apply_children_dir_moves(struct send_ctx *sctx)
@@ -3365,7 +3370,7 @@ static int apply_children_dir_moves(struct send_ctx *sctx)
return 0;
 
INIT_LIST_HEAD(&stack);
-   tail_append_pending_moves(pm, &stack);
+   tail_append_pending_moves(sctx, pm, &stack);
 
while (!list_empty(&stack)) {
pm = list_first_entry(&stack, struct pending_dir_move, list);
@@ -3376,7 +3381,7 @@ static int apply_children_dir_moves(struct send_ctx *sctx)
goto out;
pm = get_pending_dir_moves(sctx, parent_ino);
if (pm)
-   tail_append_pending_moves(pm, &stack);
+   tail_append_pending_moves(sctx, pm, &stack);
}
return 0;
 
-- 
1.9.1