Recovery options for damaged beginning of the filesystem

2018-10-09 Thread Shapranov Vladimir
Hi,

I've got a filesystem with first ~50Mb accidentally dd'ed.

"btrfs check" fails with a following error (regardless of "-s"):
checksum verify failed on 21037056 found FC8A6557 wanted 2F51D090
checksum verify failed on 21037056 found FC8A6557 wanted 2F51D090
checksum verify failed on 21037056 found 1EDD5E47 wanted 222F7E7F
checksum verify failed on 21037056 found 1EDD5E47 wanted 222F7E7F
bytenr mismatch, want=21037056, have=13515002166904211737
ERROR: cannot read chunk root
ERROR: cannot open file system

"mount -o ro /dev/sdf1 /mnt/tmp" fails, while "mount -o ro,subvol=X /mnt/tmp" 
succeeds for "/" and couple subvolumes.

I can even ls the root directory, but at some point something can't be read 
anymore and following errors appear in dmesg:
[385647.81] btrfs_dev_stat_print_on_error: 158686 callbacks suppressed
[385648.079187] BTRFS error (device sdd1): bdev /dev/sdf1 errs: wr 0, rd 
22094847, flush 0, corrupt 0, gen 0
[385648.193733] BTRFS error (device sdd1): bdev /dev/sdf1 errs: wr 0, rd 
22094848, flush 0, corrupt 0, gen 0

Also it is worth mentioning that after corruption one of the subvolumes was 
mounted read-write and possibly some files were written.

Is there any chance to recover the FS?



-- 
Best regards,
Vladimir


Re: CoW behavior when writing same content

2018-10-09 Thread Chris Murphy
On Tue, Oct 9, 2018 at 11:25 AM, Andrei Borzenkov  wrote:
> 09.10.2018 18:52, Chris Murphy пишет:

>>> In this case is root/big_file and snapshot/big_file still share the same 
>>> data?
>>
>> You'll be left with three files. /big_file and root/big_file will
>> share extents,
>
> How comes they share extents? This requires --reflink, is it default now?

Good catch. It's not the default. I meant to write that initially only

root/big_file and snapshot/big_file have shared extents

And the shared extents are lost when snapshot/big_file is
"overwritten" by the copy into snapshot/


>> and snapshot/big_file will have its own extents. You'd
>> need to copy with --reflink for snapshot/big_file to have shared
>> extents with /big_file - or deduplicate.
>>
> This still overwrites the whole file in the sense original file content
> of "snapshot/big_file" is lost. That new content happens to be identical
> and that new content will probably be reflinked does not change the fact
> that original file is gone.

Agreed.

-- 
Chris Murphy


Re: [PATCH v3 4/6] btrfs-progs: lowmem check: Add dev_item check for used bytes and total bytes

2018-10-09 Thread Hans van Kranenburg
On 10/09/2018 03:14 AM, Qu Wenruo wrote:
> 
> 
> On 2018/10/9 上午6:20, Hans van Kranenburg wrote:
>> On 10/08/2018 02:30 PM, Qu Wenruo wrote:
>>> Obviously, used bytes can't be larger than total bytes.
>>>
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  check/mode-lowmem.c | 5 +
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
>>> index 07c03cad77af..1173b963b8f3 100644
>>> --- a/check/mode-lowmem.c
>>> +++ b/check/mode-lowmem.c
>>> @@ -4074,6 +4074,11 @@ static int check_dev_item(struct btrfs_fs_info 
>>> *fs_info,
>>> used = btrfs_device_bytes_used(eb, dev_item);
>>> total_bytes = btrfs_device_total_bytes(eb, dev_item);
>>>  
>>> +   if (used > total_bytes) {
>>> +   error("device %llu has incorrect used bytes %llu > total bytes 
>>> %llu",
>>> +   dev_id, used, total_bytes);
>>> +   return ACCOUNTING_MISMATCH;
>>
>> The message and return code point at an error in accounting logic.
> 
> One of the biggest problem in lowmem is we don't always have the error
> code we really want.
> 
> And that's the case for this error message.
> It's indeed not an accounting error, as in that case (just like that
> test case image) the used bytes is correct accounted.
> 
> The good news is, the return value is never really used to classify the
> error.
> So as long as the error message makes sense, it's not a big problem.

Aha. Clear, thanks for explaining.

> 
> Thanks,
> Qu
> 
>>
>> However, if you have a fully allocated device and a DUP chunk ending
>> beyond device, then having used > total_bytes is expected...
>>
>> So maybe there's two possibilities... There's an error in the accounting
>> logic, or there's an "over-allocation", which is another type of issue
>> which produces used > total with correct accounting logic.
>>
>>> +   }
>>> key.objectid = dev_id;
>>> key.type = BTRFS_DEV_EXTENT_KEY;
>>> key.offset = 0;
>>>
>>
>>
> 


-- 
Hans van Kranenburg


Re: CoW behavior when writing same content

2018-10-09 Thread Andrei Borzenkov
09.10.2018 18:52, Chris Murphy пишет:
> On Tue, Oct 9, 2018 at 8:48 AM, Gervais, Francois
>  wrote:
>> Hi,
>>
>> If I have a snapshot where I overwrite a big file but which only a
>> small portion of it is different, will the whole file be rewritten in
>> the snapshot? Or only the different part of the file?
> 

If you overwrite the whole file, the whole file will be overwritten.

> Depends on how the application modifies files. Many applications write
> out a whole new file with a pseudorandom filename, fsync, then rename.
> 
>>
>> Something like:
>>
>> $ dd if=/dev/urandom of=/big_file bs=1M count=1024
>> $ cp /big_file root/
>> $ btrfs sub snap root snapshot
>> $ cp /big_file snapshot/
>>

And which portion of these three files is different? They must be
identical. Not that it really matters, but that does not match your
question.

>> In this case is root/big_file and snapshot/big_file still share the same 
>> data?
> 
> You'll be left with three files. /big_file and root/big_file will
> share extents,

How comes they share extents? This requires --reflink, is it default now?

> and snapshot/big_file will have its own extents. You'd
> need to copy with --reflink for snapshot/big_file to have shared
> extents with /big_file - or deduplicate.
> 
This still overwrites the whole file in the sense original file content
of "snapshot/big_file" is lost. That new content happens to be identical
and that new content will probably be reflinked does not change the fact
that original file is gone.


Next btrfs development cycle open - 4.21

2018-10-09 Thread David Sterba
From: David Sterba 

Hi,

a friendly reminder of the timetable and what's expected at this phase.

4.18 - current
4.19 - upcoming, urgent regression fixes only
4.20 - development closed, pull request in prep, fixes or regressions only
4.21 - development open, until 4.20-rc5 (at least)

(https://btrfs.wiki.kernel.org/index.php/Developer%27s_FAQ#Development_schedule)

Whether the next version is going to be 4.20 or 5.0 I don't know, I'll use the
4.x for references. There will be 4.19-rc8 milestone, probably the last one
before the final release.


Current status
--

The branch misc-4.20 contains patches that will be sent in the first 4.20
batch.  There are about 90, fewer than in previous cycles.

There might be 2nd pull sent during the merge window, the reviews are going
slow but there are still patches/fixes that I'd like to get merged.


Hilights of 4.20 changes


The more detailed description will be in the pull request, brief summary:

* (performance) relocation with qgroups on -- skip unnecessary qgroup
  accounting work during merging b-trees after relocation, claimed improvement
  is 20-40%+ (run time) but highly depends on the extent layout

* (performance) b-tree path traversal and locking improvements -- no more
  switching between blocking and spinning mode, sample measurements show
  noticeable improvements in latency and run time

* (performance) rb-tree caching -- repeated tree traversal does not need to
  spend time chasing pointers up to the first node


Git development repos
-

  k.org: https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git
  devel1: https://gitlab.com/kdave/btrfs-devel
  devel2: https://github.com/kdave/btrfs-devel

Note: git repository at repo.or.cz is discouraged from use

The devel repos should be identical regarding the main development branches
like misc-next or for-next-MMDD.


Usual points


* the current patch queue (as is in misc-next) looks stable, so no big
  changes are going to be applied at this time. The usual exceptions are
  bugfixes or obvious cleanups.

* the base of the patches should be the last announced pull request,
  which is going to be named 'for-4.20' in my k.org tree.  Reviewed
  patches will be collected in a branch that's usually named 'misc-next'
  in my devel git repos and is part of the for-next at k.org git repo.

* merging of new patches to misc-next will be slow during the
  merge window, also because there's a btrfs-progs release scheduled

* everybody is encouraged to review and test other's patches


Re: CoW behavior when writing same content

2018-10-09 Thread Roman Mamedov
On Tue, 9 Oct 2018 09:52:00 -0600
Chris Murphy  wrote:

> You'll be left with three files. /big_file and root/big_file will
> share extents, and snapshot/big_file will have its own extents. You'd
> need to copy with --reflink for snapshot/big_file to have shared
> extents with /big_file - or deduplicate.

Or use rsync for copying, in the mode where it reads and checksums blocks of
both files, to copy only the non-matching portions.

rsync --inplace

  This  option  is  useful  for  transferring  large  files   with
  block-based  changes  or appended data, and also on systems that
  are disk bound, not network bound.  It  can  also  help  keep  a
  copy-on-write filesystem snapshot from diverging the entire con‐
  tents of a file that only has minor changes.

-- 
With respect,
Roman


Re: [PATCH] Btrfs: fix wrong dentries after fsync of file that got its parent replaced

2018-10-09 Thread David Sterba
On Tue, Oct 09, 2018 at 03:05:29PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> In a scenario like the following:
> 
>   mkdir /mnt/A   # inode 258
>   mkdir /mnt/B   # inode 259
>   touch /mnt/B/bar   # inode 260
> 
>   sync
> 
>   mv /mnt/B/bar /mnt/A/bar
>   mv -T /mnt/A /mnt/B
>   fsync /mnt/B/bar
> 
>   
> 
> After replaying the log we end up with file bar having 2 hard links, both
> with the name 'bar' and one in the directory with inode number 258 and the
> other in the directory with inode number 259. Also, we end up with the
> directory inode 259 still existing and with the directory inode 258 still
> named as 'A', instead of 'B'. In this scenario, file 'bar' should only
> have one hard link, located at directory inode 258, the directory inode
> 259 should not exist anymore and the name for directory inode 258 should
> be 'B'.
> 
> This incorrect behaviour happens because when attempting to log the old
> parents of an inode, we skip any parents that no longer exist. Fix this
> by forcing a full commit if an old parent no longer exists.
> 
> A test case for fstests follows soon.
> 
> Signed-off-by: Filipe Manana 

Thanks, queued for 4.20 with CC: stable 4.4+ .


Re: [PATCH] Btrfs: fix warning when replaying log after fsync of a tmpfile

2018-10-09 Thread David Sterba
On Tue, Oct 09, 2018 at 06:01:56PM +0200, David Sterba wrote:
> > Reported-by: Martin Steigerwald 
> > Link: https://lore.kernel.org/linux-btrfs/319.NTnn27ZJZE@merkaba/
> > Fixes: 471d557afed1 ("Btrfs: fix loss of prealloc extents past i_size after 
> > fsync log replay")
> > Signed-off-by: Filipe Manana 
> 
> Thanks, the tmpfile logging logic makes sense to me. Patch added to
> misc-next and queued for 4.20 and I've added stable tag for 4.9+.

Err, no sorry, it's 4.18+ as there's the Fixes: reference.


Re: [PATCH] Btrfs: fix warning when replaying log after fsync of a tmpfile

2018-10-09 Thread David Sterba
On Mon, Oct 08, 2018 at 11:12:55AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When replaying a log which contains a tmpfile (which necessarily has a
> link count of 0) we end up calling inc_nlink(), at
> fs/btrfs/tree-log.c:replay_one_buffer(), which produces a warning like
> the following:
> 
>   [195191.943673] WARNING: CPU: 0 PID: 6924 at fs/inode.c:342 
> inc_nlink+0x33/0x40
>   [195191.943674] Modules linked in: btrfs dm_flakey dm_mod xor raid6_pq 
> libcrc32c kvm_intel bochs_drm ttm kvm drm_kms_helper drm irqbypass 
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 
> crypto_simd cryptd glue_helper joydev sg button evdev pcspkr qemu_fw_cfg 
> serio_raw parport_pc ppdev lp parport ip_tables x_tables autofs4 ext4 
> crc32c_generic crc16 mbcache jbd2 fscrypto sd_mod virtio_scsi ata_generic 
> virtio_pci virtio_ring virtio ata_piix floppy crc32c_intel libata psmouse 
> e1000 scsi_mod i2c_piix4 [last unloaded: btrfs]
>   [195191.943723] CPU: 0 PID: 6924 Comm: mount Not tainted 
> 4.19.0-rc6-btrfs-next-38 #1
>   [195191.943724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
>   [195191.943726] RIP: 0010:inc_nlink+0x33/0x40
>   [195191.943727] Code: c0 74 07 83 c0 01 89 47 48 c3 f6 87 d1 00 00 00 04 74 
> 17 48 8b 47 28 f0 48 83 a8 70 07 00 00 01 8b 47 48 83 c0 01 89 47 48 c3 <0f> 
> 0b eb e5 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 65 ff 05 54
>   [195191.943728] RSP: 0018:b96e425e3870 EFLAGS: 00010246
>   [195191.943730] RAX:  RBX: 8c0d1e6af4f0 RCX: 
> 0006
>   [195191.943731] RDX:  RSI:  RDI: 
> 8c0d1e6af4f0
>   [195191.943731] RBP: 0097 R08: 0001 R09: 
> 
>   [195191.943732] R10:  R11:  R12: 
> b96e425e3a60
>   [195191.943733] R13: 8c0d10cff0c8 R14: 8c0d0d515348 R15: 
> 8c0d78a1b3f8
>   [195191.943735] FS:  7f570ee24480() GS:8c0dfb20() 
> knlGS:
>   [195191.943736] CS:  0010 DS:  ES:  CR0: 80050033
>   [195191.943737] CR2: 5593286277c8 CR3: bb8f2006 CR4: 
> 003606f0
>   [195191.943739] DR0:  DR1:  DR2: 
> 
>   [195191.943740] DR3:  DR6: fffe0ff0 DR7: 
> 0400
>   [195191.943741] Call Trace:
>   [195191.943778]  replay_one_buffer+0x797/0x7d0 [btrfs]
>   [195191.943802]  walk_up_log_tree+0x1c1/0x250 [btrfs]
>   [195191.943809]  ? rcu_read_lock_sched_held+0x3f/0x70
>   [195191.943825]  walk_log_tree+0xae/0x1d0 [btrfs]
>   [195191.943840]  btrfs_recover_log_trees+0x1d7/0x4d0 [btrfs]
>   [195191.943856]  ? replay_dir_deletes+0x280/0x280 [btrfs]
>   [195191.943870]  open_ctree+0x1c3b/0x22a0 [btrfs]
>   [195191.943887]  btrfs_mount_root+0x6b4/0x800 [btrfs]
>   [195191.943894]  ? rcu_read_lock_sched_held+0x3f/0x70
>   [195191.943899]  ? pcpu_alloc+0x55b/0x7c0
>   [195191.943906]  ? mount_fs+0x3b/0x140
>   [195191.943908]  mount_fs+0x3b/0x140
>   [195191.943912]  ? __init_waitqueue_head+0x36/0x50
>   [195191.943916]  vfs_kern_mount+0x62/0x160
>   [195191.943927]  btrfs_mount+0x134/0x890 [btrfs]
>   [195191.943936]  ? rcu_read_lock_sched_held+0x3f/0x70
>   [195191.943938]  ? pcpu_alloc+0x55b/0x7c0
>   [195191.943943]  ? mount_fs+0x3b/0x140
>   [195191.943952]  ? btrfs_remount+0x570/0x570 [btrfs]
>   [195191.943954]  mount_fs+0x3b/0x140
>   [195191.943956]  ? __init_waitqueue_head+0x36/0x50
>   [195191.943960]  vfs_kern_mount+0x62/0x160
>   [195191.943963]  do_mount+0x1f9/0xd40
>   [195191.943967]  ? memdup_user+0x4b/0x70
>   [195191.943971]  ksys_mount+0x7e/0xd0
>   [195191.943974]  __x64_sys_mount+0x21/0x30
>   [195191.943977]  do_syscall_64+0x60/0x1b0
>   [195191.943980]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>   [195191.943983] RIP: 0033:0x7f570e4e524a
>   [195191.943985] Code: 48 8b 0d 51 fc 2a 00 f7 d8 64 89 01 48 83 c8 ff c3 66 
> 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1e fc 2a 00 f7 d8 64 89 01 48

Minor nit: if the Code: and Modules: lines are not essential for the
report, you can remove them from the trace as the lines are quite long.
I've removed them in the committed patch.

>   [195191.943986] RSP: 002b:7ffd83589478 EFLAGS: 0206 ORIG_RAX: 
> 00a5
>   [195191.943989] RAX: ffda RBX: 563f335b2060 RCX: 
> 7f570e4e524a
>   [195191.943990] RDX: 563f335b2240 RSI: 563f335b2280 RDI: 
> 563f335b2260
>   [195191.943992] RBP:  R08:  R09: 
> 0020
>   [195191.943993] R10: c0ed R11: 0206 R12: 
> 563f335b2260
>   [195191.943994] R13: 563f335b2240 R14:  R15: 
> 
>   [195191.944002] irq event stamp: 8688
>   [195191.944010] hardirqs last  enabled at 

Re: CoW behavior when writing same content

2018-10-09 Thread Chris Murphy
On Tue, Oct 9, 2018 at 8:48 AM, Gervais, Francois
 wrote:
> Hi,
>
> If I have a snapshot where I overwrite a big file but which only a
> small portion of it is different, will the whole file be rewritten in
> the snapshot? Or only the different part of the file?

Depends on how the application modifies files. Many applications write
out a whole new file with a pseudorandom filename, fsync, then rename.

>
> Something like:
>
> $ dd if=/dev/urandom of=/big_file bs=1M count=1024
> $ cp /big_file root/
> $ btrfs sub snap root snapshot
> $ cp /big_file snapshot/
>
> In this case is root/big_file and snapshot/big_file still share the same data?

You'll be left with three files. /big_file and root/big_file will
share extents, and snapshot/big_file will have its own extents. You'd
need to copy with --reflink for snapshot/big_file to have shared
extents with /big_file - or deduplicate.


-- 
Chris Murphy


Re: [PATCH] btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled

2018-10-09 Thread David Sterba
On Tue, Oct 09, 2018 at 02:36:45PM +0800, Qu Wenruo wrote:
> Some qgroup trace events like btrfs_qgroup_release_data() and
> btrfs_qgroup_free_delayed_ref() can still be triggered even qgroup is
> not enabled.
> 
> This is caused by the lack of qgroup status check before really calling
> qgroup functions.
> Thankfully related functions can handle quota disabled case well and just
> do nothing for qgroup disabled case.
> 
> This patch will do earlier check before triggering related trace events.
> 
> And for enabled <-> disabled race case:
> 1) For enabled->disabled case
>Disable will wipe out all qgroups data including reservation and
>excl/rfer. Even we leaks some reserveration or numbers, it will
>still be wiped, so nothing can go wrong.
> 
> 2) For disabled -> enabled case
>Current btrfs_qgroup_release_data() will use extent_io tree to ensure
>we won't underflow reservation. And for delayed_ref we use
>head->qgroup_reserved to record reserved space, so in that case
>head->qgroup_reserved should be 0 and we won't underflow.
> 
> Reported-by: Chris Murphy 
> Signed-off-by: Qu Wenruo 

Reviewed-by: David Sterba 

I've added the link to the report.


[PATCH] generic: test for file fsync after moving it to a new parent directory

2018-10-09 Thread fdmanana
From: Filipe Manana 

Test that if we move a file from a directory B to a directory A, replace
directory B with directory A, fsync the file and then power fail, after
mounting the filesystem the file has a single parent, named B and there
is no longer any directory with the name A.

This test is motivated by a bug found in btrfs which is fixed by a patch
for the linux kernel titled:

  "Btrfs: fix wrong dentries after fsync of file that got its parent
   replaced"

This test passes on ext4, xfs and patched btrfs but it hangs on f2fs (the
fsck.f2fs process seems stuck).

Signed-off-by: Filipe Manana 
---
 tests/generic/507 | 71 +++
 tests/generic/507.out |  7 +
 tests/generic/group   |  1 +
 3 files changed, 79 insertions(+)
 create mode 100755 tests/generic/507
 create mode 100644 tests/generic/507.out

diff --git a/tests/generic/507 b/tests/generic/507
new file mode 100755
index ..f23db677
--- /dev/null
+++ b/tests/generic/507
@@ -0,0 +1,71 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+#
+# FS QA Test No. 507
+#
+# Test that if we move a file from a directory B to a directory A, replace
+# directory B with directory A, fsync the file and then power fail, after
+# mounting the filesystem the file has a single parent, named B and there
+# is no longer any directory with the name A.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+
+# Create our test directories and file.
+mkdir $SCRATCH_MNT/testdir
+mkdir $SCRATCH_MNT/testdir/A
+mkdir $SCRATCH_MNT/testdir/B
+touch $SCRATCH_MNT/testdir/B/bar
+
+# Make sure everything done so far is durably persisted.
+sync
+
+# Now move our file bar from directory B to directory A and then replace
+# directory B with directory A, also renaming directory A to B. Finally
+# fsync file bar.
+mv $SCRATCH_MNT/testdir/B/bar $SCRATCH_MNT/testdir/A/bar
+mv -T $SCRATCH_MNT/testdir/A $SCRATCH_MNT/testdir/B
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/B/bar
+
+# Simulate a power failure and mount the filesystem. We expect file bar
+# to exist and have a single parent directory, named B, and that no
+# directory named A exists.
+_flakey_drop_and_remount
+
+echo "Filesystem content after power failure:"
+ls -R $SCRATCH_MNT/testdir | _filter_scratch
+
+_unmount_flakey
+
+status=0
+exit
diff --git a/tests/generic/507.out b/tests/generic/507.out
new file mode 100644
index ..49877654
--- /dev/null
+++ b/tests/generic/507.out
@@ -0,0 +1,7 @@
+QA output created by 507
+Filesystem content after power failure:
+SCRATCH_MNT/testdir:
+B
+
+SCRATCH_MNT/testdir/B:
+bar
diff --git a/tests/generic/group b/tests/generic/group
index 2e2a6247..f4d1524b 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -509,3 +509,4 @@
 504 auto quick locks
 505 shutdown auto quick metadata
 506 auto quick log
+507 auto quick log
-- 
2.11.0



[PATCH] Btrfs: fix wrong dentries after fsync of file that got its parent replaced

2018-10-09 Thread fdmanana
From: Filipe Manana 

In a scenario like the following:

  mkdir /mnt/A   # inode 258
  mkdir /mnt/B   # inode 259
  touch /mnt/B/bar   # inode 260

  sync

  mv /mnt/B/bar /mnt/A/bar
  mv -T /mnt/A /mnt/B
  fsync /mnt/B/bar

  

After replaying the log we end up with file bar having 2 hard links, both
with the name 'bar' and one in the directory with inode number 258 and the
other in the directory with inode number 259. Also, we end up with the
directory inode 259 still existing and with the directory inode 258 still
named as 'A', instead of 'B'. In this scenario, file 'bar' should only
have one hard link, located at directory inode 258, the directory inode
259 should not exist anymore and the name for directory inode 258 should
be 'B'.

This incorrect behaviour happens because when attempting to log the old
parents of an inode, we skip any parents that no longer exist. Fix this
by forcing a full commit if an old parent no longer exists.

A test case for fstests follows soon.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 30 +++---
 1 file changed, 27 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 5e83991eb064..ed455dba 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5580,9 +5580,33 @@ static int btrfs_log_all_parents(struct 
btrfs_trans_handle *trans,
 
dir_inode = btrfs_iget(fs_info->sb, _key,
   root, NULL);
-   /* If parent inode was deleted, skip it. */
-   if (IS_ERR(dir_inode))
-   continue;
+   /*
+* If the parent inode was deleted, return an error to
+* fallback to a transaction commit. This is to prevent
+* getting an inode that was moved from one parent A to
+* a parent B, got its former parent A deleted and then
+* it got fsync'ed, from existing at both parents after
+* a log replay (and the old parent still existing).
+* Example:
+*
+* mkdir /mnt/A
+* mkdir /mnt/B
+* touch /mnt/B/bar
+* sync
+* mv /mnt/B/bar /mnt/A/bar
+* mv -T /mnt/A /mnt/B
+* fsync /mnt/B/bar
+* 
+*
+* If we ignore the old parent B which got deleted,
+* after a log replay we would have file bar linked
+* at both parents and the old parent B would still
+* exist.
+*/
+   if (IS_ERR(dir_inode)) {
+   ret = PTR_ERR(dir_inode);
+   goto out;
+   }
 
if (ctx)
ctx->log_new_dentries = false;
-- 
2.11.0



[PATCH] btrfs: qgroup: Avoid calling qgroup functions if qgroup is not enabled

2018-10-09 Thread Qu Wenruo
Some qgroup trace events like btrfs_qgroup_release_data() and
btrfs_qgroup_free_delayed_ref() can still be triggered even qgroup is
not enabled.

This is caused by the lack of qgroup status check before really calling
qgroup functions.
Thankfully related functions can handle quota disabled case well and just
do nothing for qgroup disabled case.

This patch will do earlier check before triggering related trace events.

And for enabled <-> disabled race case:
1) For enabled->disabled case
   Disable will wipe out all qgroups data including reservation and
   excl/rfer. Even we leaks some reserveration or numbers, it will
   still be wiped, so nothing can go wrong.

2) For disabled -> enabled case
   Current btrfs_qgroup_release_data() will use extent_io tree to ensure
   we won't underflow reservation. And for delayed_ref we use
   head->qgroup_reserved to record reserved space, so in that case
   head->qgroup_reserved should be 0 and we won't underflow.

Reported-by: Chris Murphy 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 4 
 fs/btrfs/qgroup.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4353bb69bb86..0656f4e78db6 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3107,6 +3107,10 @@ static int __btrfs_qgroup_release_data(struct inode 
*inode,
int trace_op = QGROUP_RELEASE;
int ret;
 
+   if (!test_bit(BTRFS_FS_QUOTA_ENABLED,
+ _I(inode)->root->fs_info->flags))
+   return 0;
+
/* In release case, we shouldn't have @reserved */
WARN_ON(!free && reserved);
if (free && reserved)
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 54b8bb282c0e..4bbcc1e92a93 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -249,6 +249,8 @@ void btrfs_qgroup_free_refroot(struct btrfs_fs_info 
*fs_info,
 static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
 u64 ref_root, u64 num_bytes)
 {
+   if (!test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags))
+   return;
trace_btrfs_qgroup_free_delayed_ref(fs_info, ref_root, num_bytes);
btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes,
  BTRFS_QGROUP_RSV_DATA);
-- 
2.19.1