Re: System unable to mount partition after a power loss

2018-12-06 Thread Qu Wenruo


On 2018/12/7 下午1:24, Doni Crosby wrote:
> All,
> 
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
> 
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM it's not an actual true virtual block device
> 
> [ 499.704398] BTRFS info (device vda1): disk space caching is enabled
> [  499.704401] BTRFS info (device vda1): has skinny extents
> [  499.739522] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327

Transid mismatch normally means the fs is screwed up more or less.

And according to your mount failure, it looks the fs get screwed up badly.

What's the kernel version used in the VM?
I don't really think the VM is always using the latest kernel.

> [  499.740257] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327
> [  499.770847] BTRFS error (device vda1): open_ctree failed
> 
> I have tried running btrfsck:
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689

According to your later dump-super output, it looks pretty possible that
the corrupted extents are all belonging to extent tree.

So it's still possible that your fs tree and other essential trees are OK.

Please dump the following output (with its stderr) to further confirm
the damage.
# btrfs ins dump-tree -b 31801344 --follow /dev/vda1

If your objective is only to recover data, then you could start to try
btrfs-restore.
It's pretty hard to fix the heavily damaged extent tree.

Thanks,
Qu
> Ignoring transid failure
> Checking filesystem on /dev/vda1
> UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
> checking extents
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> Ignoring transid failure
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> Ignoring transid failure
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify 

Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 6.12.18 г. 19:54 ч., David Sterba wrote:
> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>> delayed refs space we need.  This means we can drastically simplify
>>
>> IMO it will be helpful if there is a sentence here referring back to
>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>> But I guess this is something David can also do.
> 
> I'll update the changelog, but I'm not sure what exactly you want to see
> there, please post the replacement text. Thanks.

With the introduction of dealyed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv we now know exactly how much pending
delayed refs space is required.

> 
>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>> know it's time to commit the transaction and stop any more delayed refs
>>> from being generated.
>>>
>>> Signed-off-by: Josef Bacik 
> 


Re: System unable to mount partition after a power loss

2018-12-06 Thread Doni Crosby
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.
Yes it is a proxmox host running the system so it is a qemu vm, I'm
unsure on the caching situation.

> Old version of progs, I suggest upgrading to 4.17.1 and run
I updated the progs to 4.17 and ran the following

btrfs insp dump-s -f /device/:
See attachment

btrfs rescue super -v /device/ (insp rescue super wasn't valid)
All Devices:
Device: id = 1, name = /dev/vda1

Before Recovering:
[All good supers]:
device name = /dev/vda1
superblock bytenr = 65536

device name = /dev/vda1
superblock bytenr = 67108864

device name = /dev/vda1
superblock bytenr = 274877906944

[All bad supers]:

All supers are valid, no need to recover

btrfs check --mode=lowmem /dev/vda1:
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=3563210342400 item=120 parent
level=1 child level=1
ERROR: cannot open file system

mount -o ro,norecovery,usebackuproot /dev/vda1 /mnt:
Same dmesg output as before.
On Fri, Dec 7, 2018 at 12:56 AM Chris Murphy  wrote:
>
> On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby  wrote:
> >
> > All,
> >
> > I'm coming to you to see if there is a way to fix or at least recover
> > most of the data I have from a btrfs filesystem. The system went down
> > after both a breaker and the battery backup failed. I cannot currently
> > mount the system, with the following error from dmesg:
> >
> > Note: The vda1 is just the entire disk being passed from the VM host
> > to the VM it's not an actual true virtual block device
>
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.
>
>
>
> > btrfs check --recover also ends in a segmentation fault
>
> I'm not familiar with --recover option, the --repair option is flagged
> with a warning in the man page.
>Warning
>Do not use --repair unless you are advised to do so by a
> developer or an experienced user,
>
>
> > btrfs --version:
> > btrfs-progs v4.7.3
>
> Old version of progs, I suggest upgrading to 4.17.1 and run
>
> btrfs insp dump-s -f /device/
> btrfs insp rescue super -v /device/
> btrfs check --mode=lowmem /device/
>
> These are all read only commands. Please post output to the list,
> hopefully a developer will get around to looking at it.
>
> It is safe to try:
>
> mount -o ro,norecovery,usebackuproot /device/ /mnt/
>
> If that works, I suggest updating your backup while it's still
> possible in the meantime.
>
>
> --
> Chris Murphy
superblock: bytenr=65536, device=/dev/vda1
-
csum_type   0 (crc32c)
csum_size   4
csum0xbfa6fd72 [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid7c76bb05-b3dc-4804-bf56-88d010a214c6
label   Array
generation  5184693
root31801344
sys_array_size  226
chunk_root_generation   5183734
root_level  1
chunk_root  20971520
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 32003947737088
bytes_used  6652776640512
sectorsize  4096
nodesize16384
leafsize (deprecated)   16384
stripesize  4096
root_dir6
num_devices 1
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x161
( MIXED_BACKREF |
  

Re: System unable to mount partition after a power loss

2018-12-06 Thread Chris Murphy
On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby  wrote:
>
> All,
>
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
>
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM it's not an actual true virtual block device

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.



> btrfs check --recover also ends in a segmentation fault

I'm not familiar with --recover option, the --repair option is flagged
with a warning in the man page.
   Warning
   Do not use --repair unless you are advised to do so by a
developer or an experienced user,


> btrfs --version:
> btrfs-progs v4.7.3

Old version of progs, I suggest upgrading to 4.17.1 and run

btrfs insp dump-s -f /device/
btrfs insp rescue super -v /device/
btrfs check --mode=lowmem /device/

These are all read only commands. Please post output to the list,
hopefully a developer will get around to looking at it.

It is safe to try:

mount -o ro,norecovery,usebackuproot /device/ /mnt/

If that works, I suggest updating your backup while it's still
possible in the meantime.


-- 
Chris Murphy


System unable to mount partition after a power loss

2018-12-06 Thread Doni Crosby
All,

I'm coming to you to see if there is a way to fix or at least recover
most of the data I have from a btrfs filesystem. The system went down
after both a breaker and the battery backup failed. I cannot currently
mount the system, with the following error from dmesg:

Note: The vda1 is just the entire disk being passed from the VM host
to the VM it's not an actual true virtual block device

[ 499.704398] BTRFS info (device vda1): disk space caching is enabled
[  499.704401] BTRFS info (device vda1): has skinny extents
[  499.739522] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327
[  499.740257] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327
[  499.770847] BTRFS error (device vda1): open_ctree failed

I have tried running btrfsck:
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
Ignoring transid failure
Checking filesystem on /dev/vda1
UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
checking extents
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
Ignoring transid failure
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
Ignoring transid failure
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
Ignoring transid failure
Segmentation fault

btrfs check --recover also ends in a segmentation fault

I am aware of chunk-recover and have tried to run it but got weary
when I saw dev0 not vda1.

Any help would be appreciated,
Doni

uname -a:
Linux Homophone 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1
(2018-09-13) 

Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Andrei Borzenkov
06.12.2018 16:04, Austin S. Hemmelgarn пишет:
> 
> * On SCSI devices, a discard operation translates to a SCSI UNMAP
> command.  As pointed out by Ronnie Sahlberg in his reply, this command
> is purely advisory, may not result in any actual state change on the
> target device, and is not guaranteed to wipe the data.  To actually wipe
> things, you have to explicitly write bogus data to the given regions
> (using either regular writes, or a WRITESAME command with the desired
> pattern), and _then_ call UNMAP on them.

WRITE SAME command has UNMAP bit and depending on device and kernel
version kernel may actually issue either UNMAP or WRITE SAME with UNMAP
bit set when doing discard.



Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-12-06 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 02:24:41PM +0900, Misono Tomohiro wrote:
> Hello,
> 
> This is basically the resend of 
>   "[PATCH v2 00/20] btrfs-progs: Rework of "subvolume list/show" and relax the
>   root privileges of them" [1]
> which I submitted in June. The aim of this series is to allow non-privileged 
> user
> to use basic subvolume functionality (create/list/snapshot/delete; this 
> allows "list")
> 
> They were once in devel branch with some whitespace/comment modification by 
> david.
> I rebased them to current devel branch.
> 
> github: https://github.com/t-msn/btrfs-progs/tree/rework-sub-list
> 
> Basic logic/code is the same as before. Some differences are:
>  - Use latest libbtrfsutil from Omar [2] (thus drop first part of patches).
>As a result, "sub list" cannot accept an ordinary directry to be
>specified (which is allowed in previous version)
>  - Drop patches which add new options to "sub list"
>  - Use 'nobody' as non-privileged test user just like libbtrfsutil test
>  - Update comments
> 
> Importantly, in order to make output consistent for both root and 
> non-privileged
> user, this changes the behavior of "subvolume list": 
>  - (default) Only list in subvolume under the specified path.
>Path needs to be a subvolume.
>  - (-a) filter is dropped. i.e. its output is the same as the
> default behavior of "sub list" in progs <= 4.19
> 
> Therefore, existent scripts may need to update to add -a option
> (I believe nobody uses current -a option).
> If anyone thinks this is not good, please let me know.

I think there are a few options in the case that the path isn't a
subvolume:

1. List all subvolumes in the filesystem with randomly mangled paths,
   which is what we currently do.
2. Error out, which is what this version of the series does.
3. List all subvolumes under the containing subvolume, which is what the
   previous version does.
4. List all subvolumes under the containing subvolume that are
   underneath the given path.

Option 1 won't work well for unprivileged users. Option 2 (this series)
is definitely going to break people's workflows/scripts. Option 3 is
unintuitive. In my opinion, option 4 is the nicest, but it may also
break scripts that expect all subvolumes to be printed.

There's also an option 5, which is to keep the behavior the same for
root (like what my previous patch [1] did) and implement option 4 for
unprivileged users.

I think 4 and 5 are the two main choices: do we want to preserve
backwards compatibility as carefully as possible (at the cost of
consistency), or do we want to risk it and improve the interface?

1: 
https://github.com/osandov/btrfs-progs/commit/fb61c21aeb998b12c1d02532639083d7f40c41e0


[PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support

2018-12-06 Thread Omar Sandoval
From: Omar Sandoval 

I apparently didn't test this on a pre-4.18 kernel.
test_subvolume_info_unprivileged() checks for an ENOTTY, but this
doesn't seem to work correctly with subTest().
test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
an explicit check to both before doing the actual test.

Signed-off-by: Omar Sandoval 
---
Based on the devel branch.

 libbtrfsutil/python/tests/test_subvolume.py | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/python/tests/test_subvolume.py 
b/libbtrfsutil/python/tests/test_subvolume.py
index 99ec97bc..b06a1d3d 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -168,12 +168,13 @@ class TestSubvolume(BtrfsTestCase):
 
 with drop_privs():
 try:
-self._test_subvolume_info(subvol, snapshot)
+btrfsutil.subvolume_info(self.mountpoint)
 except OSError as e:
 if e.errno == errno.ENOTTY:
 self.skipTest('BTRFS_IOC_GET_SUBVOL_INFO is not available')
 else:
 raise
+self._test_subvolume_info(subvol, snapshot)
 
 def test_read_only(self):
 for arg in self.path_or_fd(self.mountpoint):
@@ -487,6 +488,13 @@ class TestSubvolume(BtrfsTestCase):
 try:
 os.chdir(self.mountpoint)
 with drop_privs():
+try:
+list(btrfsutil.SubvolumeIterator('.'))
+except OSError as e:
+if e.errno == errno.ENOTTY:
+self.skipTest('BTRFS_IOC_GET_SUBVOL_ROOTREF is not 
available')
+else:
+raise
 self._test_subvolume_iterator()
 finally:
 os.chdir(pwd)
-- 
2.19.2



Re: BTRFS RAID filesystem unmountable

2018-12-06 Thread Qu Wenruo


On 2018/12/7 上午7:15, Michael Wade wrote:
> Hi Qu,
> 
> Me again! Having formatted the drives and rebuilt the RAID array I
> seem to have be having the same problem as before (no power cut this
> time [I bought a UPS]).

But strangely, your super block shows it has log tree, which means
either your hit a kernel panic/transaction abort, or a unexpected power
loss.

> The brtfs volume is broken on my ReadyNAS.
> 
> I have attached the results of some of the commands you asked me to
> run last time, and I am hoping you might be able to help me out.

This time, the problem is more serious, some chunk tree blocks are not
even inside system chunk range, no wonder it fails to mount.

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376
" and paste the output.

But I don't have any clue. My guess is some kernel problem related to
new chunk allocation, or the chunk root node itself is already seriously
corrupted.

Considering how old your kernel is (4.4), it's not recommended to use
btrfs on such old kernel, unless it's well backported with tons of btrfs
fixes.

Thanks,
Qu

> 
> Kind regards
> Michael
> On Sat, 19 May 2018 at 12:43, Michael Wade  wrote:
>>
>> I have let the find root command run for 14+ days, its produced a
>> pretty huge log file 1.6 GB but still hasn't completed. I think I will
>> start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年05月05日 00:18, Michael Wade wrote:
 Hi Qu,

 The tool is still running and the log file is now ~300mb. I guess it
 shouldn't normally take this long.. Is there anything else worth
 trying?
>>>
>>> I'm afraid not much.
>>>
>>> Although there is a possibility to modify btrfs-find-root to do much
>>> faster but limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and not
>>> much we can do right now.
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards
 Michael

 On 2 May 2018 at 06:29, Michael Wade  wrote:
> Thanks Qu,
>
> I actually aborted the run with the old btrfs tools once I saw its
> output. The new btrfs tools is still running and has produced a log
> file of ~85mb filled with that content so far.
>
> Kind regards
> Michael
>
> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>
>>
>> On 2018年05月01日 23:50, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Oh dear that is not good news!
>>>
>>> I have been running the find root command since yesterday but it only
>>> seems to be only be outputting the following message:
>>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> It's mostly fine, as find-root will go through all tree blocks and try
>> to read them as tree blocks.
>> Although btrfs-find-root will suppress csum error output, but such basic
>> tree validation check is not suppressed, thus you get such message.
>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> I tried with the latest btrfs tools compiled from source and the ones
>>> I have installed with the same result. Is there a CLI utility I could
>>> use to determine if the log contains any other content?
>>
>> Did it report any useful info at the end?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 30 April 2018 at 04:02, Qu Wenruo  wrote:


 On 2018年04月29日 22:08, Michael Wade wrote:
> Hi Qu,
>
> Got this error message:
>
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
>
> I have attached the dumps for:
>
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K 
> skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K 
> skip=266359275520

 Unfortunately, both dumps are corrupted and contain mostly garbage.
 I think it's the underlying stack (mdraid) has something wrong or 
 failed
 to recover its data.

 This means your last 

Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread Qu Wenruo


On 2018/12/7 上午3:35, David Sterba wrote:
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>
>>> Which is based on v4.20-rc1.
>>
>> Thanks, I'll add it to for-next soon.
> 
> The branch was there for some time but not for at least a week (my
> mistake I did not notice in time). I've rebased it on top of recent
> misc-next, but without the delayed refs patchset from Josef.
> 
> At the moment I'm considering it for merge to 4.21, there's still some
> time to pull it out in case it shows up to be too problematic. I'm
> mostly worried about the unknown interactions with the enospc updates or

For that part, I don't think it would have some obvious problem for
enospc updates.

As the user-noticeable effect is the delay of reloc tree deletion.

Despite that, it's mostly transparent to extent allocation.

> generally because of lack of qgroup and reloc code reviews.

That's the biggest problem.

However most of the current qgroup + balance optimization is done inside
qgroup code (to skip certain qgroup record), if we're going to hit some
problem then this patchset would have the highest possibility to hit
problem.

Later patches will just keep tweaking qgroup to without affecting any
other parts mostly.

So I'm fine if you decide to pull it out for now.

Thanks,
Qu

> 
> I'm going to do some testing of the rebased branch before I add it to
> for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
> plase check if everyghing is still ok there. Thanks.
> 



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-06 Thread Julia Lawall



On Thu, 6 Dec 2018, Johannes Thumshirn wrote:

> On 05/12/2018 15:46, Julia Lawall wrote:
> [...]
> >> +@r_patch depends on !context && patch && !org && !report@
> >> +expression E;
> >> +type T;
> >> +@@
> >> +
> >> +(
> >> +- E & ~PAGE_MASK
> >> ++ offset_in_page(E)
> >> +|
> >> +- E & (PAGE_SIZE - 1)
> >> ++ offset_in_page(E)
> >
> > The two lines above should be subsumed by the two lines below.  When there
> > is a type metavariable that has no other dependencies, an isomorphism will
> > consider that it is either present or absent.
>
> Oh OK, I'm sorry I'm not really into cocinelle so I guessed it might
> take some iterations.
>
> Do you have an example for this?

Expanation 1:

Coccinelle as a file standard.iso that shows the isomorphisms (rewrite
rules) that may be applied to semantic patches.  One of the rules is:

Expression
@ not_ptr1 @
expression *X;
@@
 !X => X == NULL

So if you have a pointer typed expression X and you write a transformation
on !X, it will also apply to occurrences of X == NULL in the source code.
In this way, you don't have to write so many variants.

Likewise there is an isomorphism:

Expression
@ drop_cast @
expression E;
pure type T;
@@

 (T)E => E

That is, if you have a semantic patch with (T)X, then it will also apply
to code that matches just X, without the cast.  The word pure means that
this isomorphism metavariable has to match a semantic patch term that is a
metavariable and this metavariable can't be used elsewhere.  If you wrote

- (char)x

Then you would probably not want that to apply without the (char) cast.
But if you have just

- (T)x

for some randome unbound metavariable T, then perhaps you don't case about
the cast to T.  If you actually do, then you can put disable drop_cast in
the header of your rule.



Explanation 2:

To see what your semantic patch is really doing, you can run

spatch --parse-cocci sp.cocci

Here is what I get for your patch rule, with some annotations added:

@@
expression E;
type T;
@@


(
-E
  >>> offset_in_page(E)
 -& -~-PAGE_MASK
|
-~
  >>> offset_in_page(E)
-PAGE_MASK -& -E
|

// the following come from
// - E & (PAGE_SIZE - 1)
// + offset_in_page(E)

-E   // 1
  >>> offset_in_page(E)
 -& -(-PAGE_SIZE -- -1-)
|
-E   // 2
  >>> offset_in_page(E)
 -& -PAGE_SIZE -- -1
|
-(   // 3
>>> offset_in_page(E)
  -PAGE_SIZE -- -1-) -& -E
|
-PAGE_SIZE   // 4
  >>> offset_in_page(E)
 -- -1 -& -E
|

// the following come from:
// - E & ((T)PAGE_SIZE - 1)
// + offset_in_page(E)

-E
  >>> offset_in_page(E)
 -& -(-(-T -)-PAGE_SIZE -- -1-)
|
-E   // same as 1
  >>> offset_in_page(E)
 -& -(-PAGE_SIZE -- -1-)
|
-E
  >>> offset_in_page(E)
 -& -(-T -)-PAGE_SIZE -- -1
|
-E   // same as 2
  >>> offset_in_page(E)
 -& -PAGE_SIZE -- -1
|
-(
>>> offset_in_page(E)
  -(-T -)-PAGE_SIZE -- -1-) -& -E
|
-(   // same as 3
>>> offset_in_page(E)
  -PAGE_SIZE -- -1-) -& -E
|
-(
>>> offset_in_page(E)
  -T -)-PAGE_SIZE -- -1 -& -E
|
-PAGE_SIZE   // same as 4
  >>> offset_in_page(E)
 -- -1 -& -E
)

So all the transformation generated by

- E & (PAGE_SIZE - 1)
+ offset_in_page(E)

are also generated by

- E & ((T)PAGE_SIZE - 1)
+ offset_in_page(E)

I hope that is helpful.

julia


Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread David Sterba
On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > This patchset can be fetched from github:
> > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > 
> > Which is based on v4.20-rc1.
> 
> Thanks, I'll add it to for-next soon.

The branch was there for some time but not for at least a week (my
mistake I did not notice in time). I've rebased it on top of recent
misc-next, but without the delayed refs patchset from Josef.

At the moment I'm considering it for merge to 4.21, there's still some
time to pull it out in case it shows up to be too problematic. I'm
mostly worried about the unknown interactions with the enospc updates or
generally because of lack of qgroup and reloc code reviews.

I'm going to do some testing of the rebased branch before I add it to
for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
plase check if everyghing is still ok there. Thanks.


Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 12:12:21PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> 
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
> 
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
> 35964  owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
> 
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to do
> something.  This is a problem because if we modify the tree we could cow a 
> block
> in our path, which free's our reference to that block.  Then once we get back 
> to
> that shared block we'll free our reference to it again, and get ENOENT when
> trying to lookup our extent reference to that block in __btrfs_free_extent.
> 
> This is ultimately happening because we have delayed items left to be 
> processed
> for our deleted snapshot _after_ all of the inodes are closed for the 
> snapshot.
> We only run the delayed inode item if we're deleting the inode, and even then 
> we
> do not run the delayed insertions or delayed removals.  These can be run at 
> any
> point after our final inode does it's last iput, which is what triggers the
> snapshot deletion.  We can end up with the snapshot deletion happening and 
> then
> have the delayed items run on that file system, resulting in the above 
> problem.
> 
> This problem has existed forever, however my patches made it much easier to 
> hit
> as I wake up the cleaner much more often to deal with delayed iputs, which 
> made
> us more likely to start the snapshot dropping work before the transaction
> commits, which is when the delayed items would generally be run.  Before,
> generally speaking, we would run the delayed items, commit the transaction, 
> and
> wakeup the cleaner thread to start deleting snapshots, which means we were 
> less
> likely to hit this problem.  You could still hit it if you had multiple
> snapshots to be deleted and ended up with lots of delayed items, but it was
> definitely harder.
> 
> Fix for now by simply running all the delayed items before starting to drop 
> the
> snapshot.  We could make this smarter in the future by making the delayed 
> items
> per-root, and then simply drop any delayed items for roots that we are going 
> to
> delete.  But for now just a quick and easy solution is the safest.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 
> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave I can reroll the series, but the second version of patch 1 is the same,
>   let me know what you want.

As this is a small update it's fine to send just that patch. You may
also use --in-reply-to so it threads to the original series. Resending
series makes most sense (to me) when there's a discussion and many
changes, so a fresh series makes it clear what's the current status.

Patch replaced in for-next topic branch, thanks.


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-12-06 Thread David Sterba
On Fri, Nov 30, 2018 at 12:19:18PM -0500, Josef Bacik wrote:
> On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote:
> > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> > >
> > > From: Josef Bacik 
> > >
> > > When debugging some weird extent reference bug I suspected that we were
> > > changing a snapshot while we were deleting it, which could explain my
> > > bug.  This was indeed what was happening, and this patch helped me
> > > verify my theory.  It is never correct to modify the snapshot once it's
> > > being deleted, so mark the root when we are deleting it and make sure we
> > > complain about it when it happens.
> > >
> > > Signed-off-by: Josef Bacik 
> > > ---
> > >  fs/btrfs/ctree.c   | 3 +++
> > >  fs/btrfs/ctree.h   | 1 +
> > >  fs/btrfs/extent-tree.c | 9 +
> > >  3 files changed, 13 insertions(+)
> > >
> > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > > index 5912a97b07a6..5f82f86085e8 100644
> > > --- a/fs/btrfs/ctree.c
> > > +++ b/fs/btrfs/ctree.c
> > > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct 
> > > btrfs_trans_handle *trans,
> > > u64 search_start;
> > > int ret;
> > >
> > > +   if (test_bit(BTRFS_ROOT_DELETING, >state))
> > > +   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats 
> > > being dropped\n");
> > 
> > Please use btrfs_warn(), it makes sure we use a consistent message
> > style, identifies the fs, etc.
> > Also, "thats" should be "that is" or "that's".
> > 
> 
> Ah yeah, I was following the other convention in there but we should probably
> convert all of those to btrfs_warn.  I'll fix the grammer thing as well, just 
> a
> leftover from the much less code of conduct friendly message I originally had
> there.  Thanks,

Committed with the following fixup:

-   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
dropped\n");
+   btrfs_error(fs_info,
+   "COW'ing blocks on a fs root that's being dropped");



Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread David Sterba
On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> > Now with the delayed_refs_rsv we can now know exactly how much pending
> > delayed refs space we need.  This means we can drastically simplify
> 
> IMO it will be helpful if there is a sentence here referring back to
> btrfs_update_delayed_refs_rsv to put your first sentence into context.
> But I guess this is something David can also do.

I'll update the changelog, but I'm not sure what exactly you want to see
there, please post the replacement text. Thanks.

> > btrfs_check_space_for_delayed_refs by simply checking how much space we
> > have reserved for the global rsv (which acts as a spill over buffer) and
> > the delayed refs rsv.  If our total size is beyond that amount then we
> > know it's time to commit the transaction and stop any more delayed refs
> > from being generated.
> > 
> > Signed-off-by: Josef Bacik 


Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-06 Thread Johannes Thumshirn
On 05/12/2018 15:46, Julia Lawall wrote:
[...]
>> +@r_patch depends on !context && patch && !org && !report@
>> +expression E;
>> +type T;
>> +@@
>> +
>> +(
>> +- E & ~PAGE_MASK
>> ++ offset_in_page(E)
>> +|
>> +- E & (PAGE_SIZE - 1)
>> ++ offset_in_page(E)
> 
> The two lines above should be subsumed by the two lines below.  When there
> is a type metavariable that has no other dependencies, an isomorphism will
> consider that it is either present or absent.

Oh OK, I'm sorry I'm not really into cocinelle so I guessed it might
take some iterations.

Do you have an example for this?

> Why not include the cast case for the context and org cases?

This is an oversight actually as I couldn't test it due to a bug in
openSUSE's coccinelle rpm.

Thanks,
Johannes
-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Now with the delayed_refs_rsv we can now know exactly how much pending
> delayed refs space we need.  This means we can drastically simplify

IMO it will be helpful if there is a sentence here referring back to
btrfs_update_delayed_refs_rsv to put your first sentence into context.
But I guess this is something David can also do.

> btrfs_check_space_for_delayed_refs by simply checking how much space we
> have reserved for the global rsv (which acts as a spill over buffer) and
> the delayed refs rsv.  If our total size is beyond that amount then we
> know it's time to commit the transaction and stop any more delayed refs
> from being generated.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/extent-tree.c | 48 ++--
>  fs/btrfs/inode.c   |  4 ++--
>  fs/btrfs/transaction.c |  2 +-
>  4 files changed, 22 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2eba398c722b..30da075c042e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
> btrfs_fs_info *fs_info,
>  }
>  
>  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
>  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
>const u64 start);
>  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5a2d0b061f57..07ef1b8087f7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info 
> *fs_info, u64 csum_bytes)
>   return num_csums;
>  }
>  
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_rsv *global_rsv;
> - u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
> - u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
> - unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
> - u64 num_bytes, num_dirty_bgs_bytes;
> - int ret = 0;
> + struct btrfs_block_rsv *delayed_refs_rsv = _info->delayed_refs_rsv;
> + struct btrfs_block_rsv *global_rsv = _info->global_block_rsv;
> + bool ret = false;
> + u64 reserved;
>  
> - num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> - num_heads = heads_to_leaves(fs_info, num_heads);
> - if (num_heads > 1)
> - num_bytes += (num_heads - 1) * fs_info->nodesize;
> - num_bytes <<= 1;
> - num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
> - fs_info->nodesize;
> - num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
> -  num_dirty_bgs);
> - global_rsv = _info->global_block_rsv;
> + spin_lock(_rsv->lock);
> + reserved = global_rsv->reserved;
> + spin_unlock(_rsv->lock);
>  
>   /*
> -  * If we can't allocate any more chunks lets make sure we have _lots_ of
> -  * wiggle room since running delayed refs can create more delayed refs.
> +  * Since the global reserve is just kind of magic we don't really want
> +  * to rely on it to save our bacon, so if our size is more than the
> +  * delayed_refs_rsv and the global rsv then it's time to think about
> +  * bailing.
>*/
> - if (global_rsv->space_info->full) {
> - num_dirty_bgs_bytes <<= 1;
> - num_bytes <<= 1;
> - }
> -
> - spin_lock(_rsv->lock);
> - if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
> - ret = 1;
> - spin_unlock(_rsv->lock);
> + spin_lock(_refs_rsv->lock);
> + reserved += delayed_refs_rsv->reserved;
> + if (delayed_refs_rsv->size >= reserved)
> + ret = true;
> + spin_unlock(_refs_rsv->lock);
>   return ret;
>  }
>  
> @@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct 
> btrfs_trans_handle *trans)
>   if (val >= NSEC_PER_SEC / 2)
>   return 2;
>  
> - return btrfs_check_space_for_delayed_refs(trans);
> + return btrfs_check_space_for_delayed_refs(trans->fs_info);
>  }
>  
>  struct async_delayed_refs {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a097f5fde31d..8532a2eb56d1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle 
> *evict_refill_and_join(struct btrfs_root *root,
> 

Re: [PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Over the years we have built up a lot of infrastructure to keep delayed
> refs in check, mostly by running them at btrfs_end_transaction() time.
> We have a lot of different maths we do to figure out how much, if we
> should do it inline or async, etc.  This existed because we had no
> feedback mechanism to force the flushing of delayed refs when they
> became a problem.  However with the enospc flushing infrastructure in
> place for flushing delayed refs when they put too much pressure on the
> enospc system we have this problem solved.  Rip out all of this code as
> it is no longer needed.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 38 --
>  1 file changed, 38 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2d8401bf8df9..01f39401619a 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -798,22 +798,12 @@ static int should_end_transaction(struct 
> btrfs_trans_handle *trans)
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>  {
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - int updates;
> - int err;
>  
>   smp_mb();
>   if (cur_trans->state >= TRANS_STATE_BLOCKED ||
>   cur_trans->delayed_refs.flushing)
>   return 1;
>  
> - updates = trans->delayed_ref_updates;
> - trans->delayed_ref_updates = 0;
> - if (updates) {
> - err = btrfs_run_delayed_refs(trans, updates * 2);
> - if (err) /* Error code will also eval true */
> - return err;
> - }
> -
>   return should_end_transaction(trans);
>  }
>  
> @@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>  {
>   struct btrfs_fs_info *info = trans->fs_info;
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - u64 transid = trans->transid;
> - unsigned long cur = trans->delayed_ref_updates;
>   int lock = (trans->type != TRANS_JOIN_NOLOCK);
>   int err = 0;
> - int must_run_delayed_refs = 0;
>  
>   if (refcount_read(>use_count) > 1) {
>   refcount_dec(>use_count);
> @@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - if (!list_empty(>new_bgs))
> - btrfs_create_pending_block_groups(trans);

Is this being deleted because in delayed_refs_rsv you account also fo
new block groups?

> -
> - trans->delayed_ref_updates = 0;
> - if (!trans->sync) {
> - must_run_delayed_refs =
> - btrfs_should_throttle_delayed_refs(trans);
> - cur = max_t(unsigned long, cur, 32);
> -
> - /*
> -  * don't make the caller wait if they are from a NOLOCK
> -  * or ATTACH transaction, it will deadlock with commit
> -  */
> - if (must_run_delayed_refs == 1 &&
> - (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
> - must_run_delayed_refs = 2;
> - }
> -
> - btrfs_trans_release_metadata(trans);
> - trans->block_rsv = NULL;

Why remove those 2 lines as well ?

> -
>   if (!list_empty(>new_bgs))
>   btrfs_create_pending_block_groups(trans);
>  
> @@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   kmem_cache_free(btrfs_trans_handle_cachep, trans);
> - if (must_run_delayed_refs) {
> - btrfs_async_run_delayed_refs(info, cur, transid,
> -  must_run_delayed_refs == 1);
> - }
>   return err;
>  }
>  
> 


Re: [PATCH 00/10][V2] Delayed refs rsv

2018-12-06 Thread David Sterba
On Mon, Dec 03, 2018 at 10:20:28AM -0500, Josef Bacik wrote:
> v1->v2:
> - addressed the comments from the various reviewers.
> - split "introduce delayed_refs_rsv" into 5 patches.  The patches are the same
>   together as they were, just split out more logically.  They can't really be
>   bisected across in that you will likely have fun enospc failures, but they
>   compile properly.  This was done to make it easier for review.

Thanks, that was helpful. The bisectability is IMO not hurt, as it
compiles and runs despite the potential ENOSPC problems. I guess the
point is to be able to reach any commit & compile & run and not be hit
by unrelated bugs. If somebody is bisecting an ENOSPC problem and lands
in the middle of the series, there's a hint in the changelogs that
something is going on.

> -- Original message --
> 
> This patchset changes how we do space reservations for delayed refs.  We were
> hitting probably 20-40 enospc abort's per day in production while running
> delayed refs at transaction commit time.  This means we ran out of space in 
> the
> global reserve and couldn't easily get more space in use_block_rsv().
> 
> The global reserve has grown to cover everything we don't reserve space
> explicitly for, and we've grown a lot of weird ad-hoc hueristics to know if
> we're running short on space and when it's time to force a commit.  A failure
> rate of 20-40 file systems when we run hundreds of thousands of them isn't 
> super
> high, but cleaning up this code will make things less ugly and more 
> predictible.
> 
> Thus the delayed refs rsv.  We always know how many delayed refs we have
> outstanding, and although running them generates more we can use the global
> reserve for that spill over, which fits better into it's desired use than a 
> full
> blown reservation.  This first approach is to simply take how many times we're
> reserving space for and multiply that by 2 in order to save enough space for 
> the
> delayed refs that could be generated.  This is a niave approach and will
> probably evolve, but for now it works.
> 
> With this patchset we've gone down to 2-8 failures per week.  It's not 
> perfect,
> there are some corner cases that still need to be addressed, but is
> significantly better than what we had.  Thanks,

I've merged the series to misc-next, the amount of reviews and testing
is sufficient. Though there are some more comments or suggestions for
improvements, they can be done as followups.

The other patchsets from you are namely fixes, that can be applied at
the -rc time, for that reason I'm glad the main infrastructure change
can be merged before the 4.21 code freeze.


Re: [PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 03:23:02PM +0100, Johannes Thumshirn wrote:
> Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them
> throughout btrfs.
> 
> This series also includes a patch for 'make coccicheck' which is marked as an
> RFC and I've CCed Julia in the hoping to get input from her.
> 
> Johannes Thumshirn (3):
>   btrfs: use offset_in_page instead of open-coding it
>   btrfs: use PAGE_ALIGNED instead of open-coding it

Added to misc-next, thanks.


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Austin S. Hemmelgarn

On 2018-12-06 01:11, Robert White wrote:
(1) Automatic and selective wiping of unused and previously used disk 
blocks is a good security measure, particularly when there is an 
encryption layer beneath the file system.


(2) USB attached devices _never_ support TRIM and they are the most 
likely to fall into strangers hands.
Not true on the first count.  Some really nice UAS devices do support 
SCSI UNMAP and WRITESAME commands.


(3) I vaguely recall that some flash chips will take bulk writhes of 
full sectors of 0x00 or 0xFF (I don't remember which) were second-best 
to TRIM for letting the flash controllers defragment their internals.


So it would be dog-slow, but it would be neat if BTRFS had a mount 
option to convert any TRIM command from above into the write of a zero, 
0xFF, or trash block to the device below if that device doesn't support 
TRIM. Real TRIM support would override the block write.


Obviously doing an fstrim would involve a lot of slow device writes but 
only for people likely to do that sort of thing.


For testing purposes the destruction of unused pages in this manner 
might catch file system failures or coding errors.


(The other layer where this might be most appropriate is in cryptsetup 
et al, where it could lie about TRIM support, but that sort of stealth 
lag might be bad for filesystem-level operations. Doing it there would 
also loose the simpler USB use cases.)


...Just a thought...
First off, TRIM is an ATA command, not the kernel term.  `fstrim` 
inherited the ATA name, but in kernel it's called a discard operation, 
and it's kind of important to understand here that a discard operation 
can result in a number of different behaviors.


In particular, you have at least the following implementations:

* On SCSI devices, a discard operation translates to a SCSI UNMAP 
command.  As pointed out by Ronnie Sahlberg in his reply, this command 
is purely advisory, may not result in any actual state change on the 
target device, and is not guaranteed to wipe the data.  To actually wipe 
things, you have to explicitly write bogus data to the given regions 
(using either regular writes, or a WRITESAME command with the desired 
pattern), and _then_ call UNMAP on them.
* On dm-thinp devices, a discard operation results in simply unmapping 
the blocks in the region it covers.  The underlying blocks themselves 
are not wiped until they get reallocated (which may not happen when you 
write to that region of the dm-thinp device again), and may not even be 
wiped then (depending on how the dm-thinp device is configured).  Thus, 
the same behavior as for SCSI is required here.
* On SD/MMC devices, a discard operation results in an SD ERASE command 
being issued.  This one is non-advisory (that is, it's guaranteed to 
happen), and is supposed to guarantee an overwrite of the region with 
zeroes or ones.
* eMMC devices additionally define a discard operation independent of 
the SD ERASE command which unmaps the region in the translation layer, 
but does not wipe the blocks either on issuing the command or on 
re-allocating the low-level blocks.  Essentially, it's just a hint for 
the wear-leveling algorithm.
* NVMe provides two different discard operations, and I'm not sure which 
the kernel uses for NVMe block emulation.  They correspond almost 
exactly to the SCSI UNMAP and SD ERASE commands in terms of behavior.
* For ATA devices, a discard operation translates to an ATA TRIM 
command.  This command doesn't even require that the data read back from 
a region the command has been issued against be consistent between 
reads, let alone that it actually returns zeroes, and it is completely 
silent on how the device should actually implement the operation.  In 
practice, most drives that implement it actually behave like dm-thinp 
devices, unmapping the low-level blocks in the region and only clearing 
them when they get reallocated, while returning any data they want on 
subsequent reads to that logical region until a write happens.
* The MTD subsystem has support for discard operations in the various 
FTL's, and they appear from a cursory look at the code to behave like a 
non-advisory version of the SCSI UNMAP command (FWIW, MTD's are what the 
concept of a discard operation was originally implemented in Linux for).


Notice that the only implementations that are actually guaranteed to 
clear out the low-level physical blocks are the SD ERASE and one of the 
two NVMe options, and all others require you to manually wipe the data 
before issuing the discard operation to guarantee that no data is retained.


Given this, I don't think this should be done as a mechanism of 
intercepting or translating discard operations, but as something else 
entirely.  Perhaps as a block-layer that wipes the region then issues a 
discard for it to the lower level device if the device supports it?


Re: [PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Any space used in the delayed_refs_rsv will be freed up by a transaction
> commit, so instead of just counting the pinned space we also need to
> account for any space in the delayed_refs_rsv when deciding if it will
> make a different to commit the transaction to satisfy our space
> reservation.  If we have enough bytes to satisfy our reservation ticket
> then we are good to go, otherwise subtract out what space we would gain
> back by committing the transaction and compare that against the pinned
> space to make our decision.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

However, look below for one suggestion: 

> ---
>  fs/btrfs/extent-tree.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa0a638d0263..63ff9d832867 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>  {
>   struct reserve_ticket *ticket = NULL;
>   struct btrfs_block_rsv *delayed_rsv = _info->delayed_block_rsv;
> + struct btrfs_block_rsv *delayed_refs_rsv = _info->delayed_refs_rsv;
>   struct btrfs_trans_handle *trans;
> - u64 bytes;
> + u64 bytes_needed;
> + u64 reclaim_bytes = 0;
>  
>   trans = (struct btrfs_trans_handle *)current->journal_info;
>   if (trans)
> @@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   else if (!list_empty(_info->tickets))
>   ticket = list_first_entry(_info->tickets,
> struct reserve_ticket, list);
> - bytes = (ticket) ? ticket->bytes : 0;
> + bytes_needed = (ticket) ? ticket->bytes : 0;
>   spin_unlock(_info->lock);
>  
> - if (!bytes)
> + if (!bytes_needed)
>   return 0;
>  
>   /* See if there is enough pinned space to make this reservation */
>   if (__percpu_counter_compare(_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>   goto commit;
>  
> @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   return -ENOSPC;

If we remove this :
 if (space_info != delayed_rsv->space_info)  
return -ENOSPC; 

Check, can't we move the reclaim_bytes calc code above the 
__percpu_counter_compare 
and eventually be left with just a single invocation to percpu_compare. 
The diff should looke something along the lines of: 

@@ -4828,19 +4827,6 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-   goto commit;
-
-   /*
-* See if there is some space in the delayed insertion reservation for
-* this reservation.
-*/
-   if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
-
spin_lock(_rsv->lock);
if (delayed_rsv->size > bytes)
bytes = 0;
@@ -4850,9 +4836,8 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
if (__percpu_counter_compare(_info->total_bytes_pinned,
   bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
+  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
return -ENOSPC;
-   }
 
 commit:
trans = btrfs_join_transaction(fs_info->extent_root);


>  
>   spin_lock(_rsv->lock);
> - if (delayed_rsv->size > bytes)
> - bytes = 0;
> - else
> - bytes -= delayed_rsv->size;
> + reclaim_bytes += delayed_rsv->reserved;
>   spin_unlock(_rsv->lock);
>  
> + spin_lock(_refs_rsv->lock);
> + reclaim_bytes += delayed_refs_rsv->reserved;
> + spin_unlock(_refs_rsv->lock);
> + if (reclaim_bytes >= bytes_needed)
> + goto commit;
> + bytes_needed -= reclaim_bytes;
> +
>   if (__percpu_counter_compare(_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
>   return -ENOSPC;
>   }
> 


Re: [PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 

Doesn't this also need a stable tag? Furthermore, doesn't the missing
code dealing with total_bytes_pinned in check_ref_cleanup mean that
every time the last reference for a block was freed we were leaking
bytes in total_bytes_pinned? Shouldn't this have lead to eventually
total_bytes_pinned dominating the usage in a space_info ?

Codewise lgtm:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + >transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> + if (head->is_data) {
> + spin_lock(_refs->lock);
> + delayed_refs->pending_csums -= head->num_bytes;
> + spin_unlock(_refs->lock);
> + }
> + }
> +
> + /* Also free its reserved qgroup space */
> + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +   head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>   struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>  
> - trace_run_delayed_ref_head(fs_info, head, 0);
> -
> - if (head->total_ref_mod < 0) {
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (head->is_data)
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (head->is_system)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - space_info = __find_space_info(fs_info, flags);
> - ASSERT(space_info);
> - percpu_counter_add_batch(_info->total_bytes_pinned,
> --head->num_bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> - if (head->is_data) {
> - spin_lock(_refs->lock);
> - delayed_refs->pending_csums -= head->num_bytes;
> - spin_unlock(_refs->lock);
> - }
> - }
> -
>   if (head->must_insert_reserved) {
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - /* Also free its reserved qgroup space */
> - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -   head->qgroup_reserved);
> + cleanup_ref_head_accounting(trans, head);
> +
> + trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
>   btrfs_put_delayed_ref_head(head);
>   return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> + cleanup_ref_head_accounting(trans, head);
>   mutex_unlock(>mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 


Re: [PATCH 01/10] btrfs: add btrfs_delete_ref_head helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(_refs->lock);
> + lockdep_assert_held(>lock);
> +
> + rb_erase_cached(>href_node, _refs->href_root);
> + RB_CLEAR_NODE(>href_node);
> + atomic_dec(_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(>mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase_cached(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
> - atomic_dec(_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(>mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase_cached(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> - atomic_dec(_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;
> +
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>  
> 


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Goldwyn Rodrigues
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> > 
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs. Even though I wish, I cannot get it correct the first time
> > around :/.. Here are some questions for which I need suggestions:
> 
> Hi Goldwyn,
> 
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage)

> 
> mayhem:~/:[0]# ndctl list
> [
>   {
> "dev":"namespace1.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
> "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
> "sector_size":512,
> "blockdev":"pmem1",
> "numa_node":1
>   },
>   {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
> "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
> "sector_size":4096,
> "blockdev":"pmem0",
> "numa_node":0
>   }
> ]
> 
> Unfortunately I hit a btrfs_panic() with btrfs/002.
> export TEST_DEV=/dev/pmem0
> export SCRATCH_DEV=/dev/pmem1
> export MOUNT_OPTIONS="-o dax"
> ./check
> [...]
> [  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
> [  178.357044] BTRFS info (device pmem0): disk space caching is enabled
> [  178.357047] BTRFS info (device pmem0): has skinny extents
> [  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
> [  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
> devid 1 transid 3 /dev/pmem1
> [  178.505717] BTRFS info (device pmem1): disk space caching is enabled
> [  178.513593] BTRFS info (device pmem1): has skinny extents
> [  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
> feature
> [  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
> [  178.538331] BTRFS info (device pmem1): creating UUID tree
> [  178.587200] BTRFS critical (device pmem1): panic in
> ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
> (errno=-17 Object already exists)
> [  178.603129] [ cut here ]
> [  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
> [  178.614333] invalid opcode:  [#1] SMP PTI
> [  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
>   E 4.20.0-rc5-default-btrfs-dax #920
> [  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
> SE5C620.86B.0D.01.0010.072020182008 07/20/2018
> [  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
> [  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
> 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
> 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
> [  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
> [  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:
> 
> [  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
> 97711f9d6868
> [  178.687831] RBP: 97711d523000 R08:  R09:
> 065a
> [  178.695411] R10: 03ff R11: 0001 R12:
> 97710d66da70
> [  178.702993] R13: 9770c2ac5600 R14:  R15:
> 97710d66d9c0
> [  178.710573] FS:  7fe11ef90700() GS:97711f9c()
> knlGS:
> [  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
> [  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
> 007606e0
> [  178.732999] DR0:  DR1:  DR2:
> 
> [  178.740574] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [  178.748147] PKRU: 5554
> [  178.751297] Call Trace:
> [  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
> [  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
> [  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
> [  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
> [  178.777508]  ? current_time+0x3f/0x70
> [  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
> [  178.787265]  ? pipe_read+0x243/0x2a0
> [  178.791298]  __vfs_write+0xee/0x170
> [  178.795241]  vfs_write+0xad/0x1a0
> [  178.799008]  ? vfs_read+0x111/0x130
> [  178.802949]  ksys_write+0x42/0x90
> [  178.806712]  do_syscall_64+0x5b/0x180
> [  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  178.816334] RIP: 0033:0x7fe11eabb3d0
> [  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
> c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
> 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 

Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-06 Thread Goldwyn Rodrigues
On  6:52 05/12, Christoph Hellwig wrote:
> If you want to export these at all they have to be EXPORT_SYMBOL_GPL.
> 

Understood.

> But I'd really like to avoid seeing another duplicate DAX I/O path.
> Please try to adopt the existing iomap-based infrastructure for your
> needs first.

This is not worth with btrfs. With non-page aligned I/O on btrfs, we
need to copy the first/last page of the extents for CoW. So, we
would end up using the exported functions anyways. Believe me, I have
spent some time getting btrfs iomap compatible before giving up. The
problems are btrfs needs to carry a lot of information across
iomap_begin and iomap_end. While the added private variable helps in
this, it also needs hooks in bio_submit() functions for crc calculations
during direct writes.

-- 
Goldwyn


HELP unmountable partition after btrfs balance to RAID0

2018-12-06 Thread Thomas Mohr

Dear developers of BTRFS,

we have a problem. We wanted to convert a file system to a RAID0 with 
two partitions. Unfortunately we had to reboot the server during the 
balance operation before it could complete.


Now following happens:

A mount attempt of the array fails with following error code:

btrfs recover yields roughly 1.6 out of 4 TB.

to recover the rest we have tried:

mount:

[18192.357444] BTRFS info (device sdb1): disk space caching is enabled
[18192.357447] BTRFS info (device sdb1): has skinny extents
[18192.370664] BTRFS error (device sdb1): parent transid verify failed 
on 30523392 wanted 7432 found 7445
[18192.370810] BTRFS error (device sdb1): parent transid verify failed 
on 30523392 wanted 7432 found 7445

[18192.394745] BTRFS error (device sdb1): open_ctree failed

mounting with options ro, degraded, cache_clear etc yields the same errors.


btrfs rescue zero-log. This operation works, however, the error persists 
and the array remains unmountable


parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Clearing log on /dev/sdb1, previous log_root 0, level 0

btrfs rescue chunk-recover fails with following error message:

btrfs check results in:

Opening filesystem to check...
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Checking filesystem on /dev/sdb1
UUID: 6c9ed4e1-d63f-46f0-b1e9-608b8fa43bb8
[1/7] checking root items
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
Ignoring transid failure
leaf parent key incorrect 30523392ERROR: failed to repair root items: 
Operation not permitted


Any ideas what is going on or how to recover the file system ? I would 
greatly appreciate your help !!!


best,

Thomas


uname -a:

Linux server2 4.19.5-1-default #1 SMP PREEMPT Tue Nov 27 19:56:09 UTC 
2018 (6210279) x86_64 x86_64 x86_64 GNU/Linux


btrfs-progs version 4.19


--
ScienceConsult - DI Thomas Mohr KG
DI Thomas Mohr
Enzianweg 10a
2353 Guntramsdorf
Austria
+43 2236 56793
+43 660 461 1966
http://www.mohrkeg.co.at



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:

Hi Goldwyn,

I've thrown your patches (from your git tree) onto one of my pmem test
machines with this pmem config:

mayhem:~/:[0]# ndctl list
[
  {
"dev":"namespace1.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
"raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
"sector_size":512,
"blockdev":"pmem1",
"numa_node":1
  },
  {
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
"raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
"sector_size":4096,
"blockdev":"pmem0",
"numa_node":0
  }
]

Unfortunately I hit a btrfs_panic() with btrfs/002.
export TEST_DEV=/dev/pmem0
export SCRATCH_DEV=/dev/pmem1
export MOUNT_OPTIONS="-o dax"
./check
[...]
[  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
[  178.357044] BTRFS info (device pmem0): disk space caching is enabled
[  178.357047] BTRFS info (device pmem0): has skinny extents
[  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
[  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
devid 1 transid 3 /dev/pmem1
[  178.505717] BTRFS info (device pmem1): disk space caching is enabled
[  178.513593] BTRFS info (device pmem1): has skinny extents
[  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
feature
[  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
[  178.538331] BTRFS info (device pmem1): creating UUID tree
[  178.587200] BTRFS critical (device pmem1): panic in
ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
(errno=-17 Object already exists)
[  178.603129] [ cut here ]
[  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
[  178.614333] invalid opcode:  [#1] SMP PTI
[  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
  E 4.20.0-rc5-default-btrfs-dax #920
[  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
SE5C620.86B.0D.01.0010.072020182008 07/20/2018
[  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
[  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
[  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
[  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:

[  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
97711f9d6868
[  178.687831] RBP: 97711d523000 R08:  R09:
065a
[  178.695411] R10: 03ff R11: 0001 R12:
97710d66da70
[  178.702993] R13: 9770c2ac5600 R14:  R15:
97710d66d9c0
[  178.710573] FS:  7fe11ef90700() GS:97711f9c()
knlGS:
[  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
[  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
007606e0
[  178.732999] DR0:  DR1:  DR2:

[  178.740574] DR3:  DR6: fffe0ff0 DR7:
0400
[  178.748147] PKRU: 5554
[  178.751297] Call Trace:
[  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
[  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
[  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
[  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
[  178.777508]  ? current_time+0x3f/0x70
[  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
[  178.787265]  ? pipe_read+0x243/0x2a0
[  178.791298]  __vfs_write+0xee/0x170
[  178.795241]  vfs_write+0xad/0x1a0
[  178.799008]  ? vfs_read+0x111/0x130
[  178.802949]  ksys_write+0x42/0x90
[  178.806712]  do_syscall_64+0x5b/0x180
[  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  178.816334] RIP: 0033:0x7fe11eabb3d0
[  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
[  178.840052] RSP: 002b:7ffec969d978 EFLAGS: 0246 ORIG_RAX:
0001
[  178.848100] RAX: ffda RBX:  RCX:
7fe11eabb3d0
[  178.855715] RDX: 0400 RSI: 0156a000 RDI:
0001
[  178.863326] RBP: 0400 R08: 0003 R09:
7fe11ed7a698
[  178.870928] R10: 10a8b550 R11: 0246 

Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread ronnie sahlberg
Hi,

I am more of a SCSI guy than ATA so forgive where I am ignorant.

The SCSI equivalent to TRIM is called UNMAP.
UNMAP is unfortunately only a "hint" to the device so if the device
for any reason
is busy,   it can just do a NO-OP, leave the data as is and still
return status SUCCESS.
That is not good if you want to wipe data for confidentiality reasons :-)

As UNMAP and TRIM are related make sure that TRIM actually provides a
guarantee to wipe the
data. I do not know ATA enough to know if it does or not.

In SCSI, instead of using UNMAP/TRIM  you can use WRITESAME10/16 which
can be used
and which does providde an overwrite/wipe guarantee. Maybe ATA has
something equivalent to
WRITESAME10/16.

I just want to say: be careful, sometimes these commands do not
provide a guarantee that they will actually
make the data overwritten/unretrievable.

ronnie sahlberg


On Thu, Dec 6, 2018 at 4:24 PM Robert White  wrote:
>
> (1) Automatic and selective wiping of unused and previously used disk
> blocks is a good security measure, particularly when there is an
> encryption layer beneath the file system.
>
> (2) USB attached devices _never_ support TRIM and they are the most
> likely to fall into strangers hands.
>
> (3) I vaguely recall that some flash chips will take bulk writhes of
> full sectors of 0x00 or 0xFF (I don't remember which) were second-best
> to TRIM for letting the flash controllers defragment their internals.
>
> So it would be dog-slow, but it would be neat if BTRFS had a mount
> option to convert any TRIM command from above into the write of a zero,
> 0xFF, or trash block to the device below if that device doesn't support
> TRIM. Real TRIM support would override the block write.
>
> Obviously doing an fstrim would involve a lot of slow device writes but
> only for people likely to do that sort of thing.
>
> For testing purposes the destruction of unused pages in this manner
> might catch file system failures or coding errors.
>
> (The other layer where this might be most appropriate is in cryptsetup
> et al, where it could lie about TRIM support, but that sort of stealth
> lag might be bad for filesystem-level operations. Doing it there would
> also loose the simpler USB use cases.)
>
> ...Just a thought...
>
> --Rob White.
>
>