
Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-08 Thread David Sterba
On Sat, Dec 08, 2018 at 08:50:32AM +0800, Qu Wenruo wrote:
> > I've adapted a stress test that unpacks a large tarball, snapshots
> > every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> > files from the original subvolume, now enhanced with qgroups just for the
> > new snapshots inheriting the toplevel subvolume. Lockup.
> > 
> > It gets stuck in a snapshot call with the following stacktrace
> > 
> > [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> > [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
> 
> This looks like something is wrong in the original subtree tracing.

Yes, I ran the test on current master and it locked up too, so it's not
due to your patchset.

> Thanks for the report, I'll investigate it.

Thanks.


Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread Qu Wenruo


On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but not for at least a week (my
>>> mistake I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21, there's still some
>>> time to pull it out in case it shows up to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would cause any obvious problem for the
>> enospc updates.
>>
>> The user-noticeable effect is the delayed deletion of reloc trees.
>>
>> Apart from that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However, most of the current qgroup + balance optimization is done inside
>> the qgroup code (to skip certain qgroup records), so if we're going to hit
>> a problem, this patchset has the highest chance of hitting it.
>>
>> Later patches will mostly just keep tweaking the qgroup code without
>> affecting other parts.
>>
>> So I'm fine if you decide to pull it out for now.
> 
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting the toplevel subvolume. Lockup.
> 
> It gets stuck in a snapshot call with the following stacktrace
> 
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]

This looks like something is wrong in the original subtree tracing.

Thanks for the report, I'll investigate it.
Qu

> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
> 
> and that's around the 10th snapshot and ~3rd deletion. This is qgroup show:
> 
> qgroupid          rfer          excl  parent
> --------          ----          ----  ------
> 0/5          865.27MiB       1.66MiB  ---
> 0/257            0.00B         0.00B  ---
> 0/259            0.00B         0.00B  ---
> 0/260        806.58MiB     637.25MiB  ---
> 0/262            0.00B         0.00B  ---
> 0/263            0.00B         0.00B  ---
> 0/264            0.00B         0.00B  ---
> 0/265            0.00B         0.00B  ---
> 0/266            0.00B         0.00B  ---
> 0/267            0.00B         0.00B  ---
> 0/268            0.00B         0.00B  ---
> 0/269            0.00B         0.00B  ---
> 0/270        989.04MiB       1.22MiB  ---
> 0/271            0.00B         0.00B  ---
> 0/272        922.25MiB     416.00KiB  ---
> 0/273        931.02MiB       1.50MiB  ---
> 0/274        910.94MiB       1.52MiB  ---
> 1/1            1.64GiB       1.64GiB  0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> 
> No IO or CPU activity at this point; the stacktrace and show output
> remain the same.
> 
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing; any fixups or updates will be applied.
> 





Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread David Sterba
On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2018/12/7 3:35 AM, David Sterba wrote:
> > On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> >> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from github:
> >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> >>>
> >>> Which is based on v4.20-rc1.
> >>
> >> Thanks, I'll add it to for-next soon.
> > 
> > The branch was there for some time but not for at least a week (my
> > mistake I did not notice in time). I've rebased it on top of recent
> > misc-next, but without the delayed refs patchset from Josef.
> > 
> > At the moment I'm considering it for merge to 4.21, there's still some
> > time to pull it out in case it shows up to be too problematic. I'm
> > mostly worried about the unknown interactions with the enospc updates or
> 
> For that part, I don't think it would cause any obvious problem for the
> enospc updates.
> 
> The user-noticeable effect is the delayed deletion of reloc trees.
> 
> Apart from that, it's mostly transparent to extent allocation.
> 
> > generally because of lack of qgroup and reloc code reviews.
> 
> That's the biggest problem.
> 
> However, most of the current qgroup + balance optimization is done inside
> the qgroup code (to skip certain qgroup records), so if we're going to hit
> a problem, this patchset has the highest chance of hitting it.
> 
> Later patches will mostly just keep tweaking the qgroup code without
> affecting other parts.
> 
> So I'm fine if you decide to pull it out for now.

I've adapted a stress test that unpacks a large tarball, snapshots
every 20 seconds, deletes a random snapshot every 50 seconds, deletes
files from the original subvolume, now enhanced with qgroups just for the
new snapshots inheriting the toplevel subvolume. Lockup.

It gets stuck in a snapshot call with the following stacktrace

[<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
[<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
[<0>] do_walk_down+0x681/0xb20 [btrfs]
[<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
[<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
[<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
[<0>] cleaner_kthread+0xf8/0x170 [btrfs]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x27/0x50

and that's around the 10th snapshot and ~3rd deletion. This is qgroup show:

qgroupid          rfer          excl  parent
--------          ----          ----  ------
0/5          865.27MiB       1.66MiB  ---
0/257            0.00B         0.00B  ---
0/259            0.00B         0.00B  ---
0/260        806.58MiB     637.25MiB  ---
0/262            0.00B         0.00B  ---
0/263            0.00B         0.00B  ---
0/264            0.00B         0.00B  ---
0/265            0.00B         0.00B  ---
0/266            0.00B         0.00B  ---
0/267            0.00B         0.00B  ---
0/268            0.00B         0.00B  ---
0/269            0.00B         0.00B  ---
0/270        989.04MiB       1.22MiB  ---
0/271            0.00B         0.00B  ---
0/272        922.25MiB     416.00KiB  ---
0/273        931.02MiB       1.50MiB  ---
0/274        910.94MiB       1.52MiB  ---
1/1            1.64GiB       1.64GiB  0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274

No IO or CPU activity at this point; the stacktrace and show output
remain the same.

So, considering this, I'm not going to add the patchset to 4.21 but will
keep it in for-next for testing; any fixups or updates will be applied.
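As a rough illustration of the reproducer described above, here is a minimal
sketch in C; the mount point, tarball path, qgroup id and timings are
assumptions for illustration, not the actual script that was used:

/* Rough reproducer sketch: unpack a tarball, snapshot the subvolume
 * periodically with qgroup inheritance, delete random snapshots and drop
 * some files from the source subvolume.  All paths/ids/timings assumed. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        char cmd[256];
        int i;

        srand((unsigned)time(NULL));
        system("btrfs quota enable /mnt/test");
        system("tar xf /tmp/large.tar -C /mnt/test/subvol &");

        for (i = 0; ; i++) {
                /* snapshot every 20s, adding the snapshot to qgroup 1/1 */
                snprintf(cmd, sizeof(cmd),
                         "btrfs subvolume snapshot -i 1/1 /mnt/test/subvol /mnt/test/snap%d", i);
                system(cmd);
                if (i > 0 && i % 3 == 0) {
                        /* roughly every 50s, delete a random earlier snapshot
                         * (it may already be gone; the delete then just fails) */
                        snprintf(cmd, sizeof(cmd),
                                 "btrfs subvolume delete /mnt/test/snap%d", rand() % i);
                        system(cmd);
                }
                /* drop some files from the original subvolume */
                system("find /mnt/test/subvol -type f | shuf -n 100 | xargs -r rm -f");
                sleep(20);
        }
        return 0;
}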


Re: [PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support

2018-12-07 Thread David Sterba
On Thu, Dec 06, 2018 at 04:29:32PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> I apparently didn't test this on a pre-4.18 kernel.
> test_subvolume_info_unprivileged() checks for an ENOTTY, but this
> doesn't seem to work correctly with subTest().
> test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
> an explicit check to both before doing the actual test.
> 
> Signed-off-by: Omar Sandoval 

Applied, thanks.


Re: [PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-07 Thread Nikolay Borisov



On 6.12.18 at 8:58, Qu Wenruo wrote:
> The process_func function hook is never used anywhere else.
> 
> Open code it to make later delayed ref refactor easier, so we can
> refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
> patches.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ea2c3d5220f0..ea68d288d761 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>   int i;
>   int level;
>   int ret = 0;
> - int (*process_func)(struct btrfs_trans_handle *,
> - struct btrfs_root *,
> - u64, u64, u64, u64, u64, u64, bool);
> -
>  
>   if (btrfs_is_testing(fs_info))
>   return 0;
> @@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>   if (!test_bit(BTRFS_ROOT_REF_COWS, >state) && level == 0)
>   return 0;
>  
> - if (inc)
> - process_func = btrfs_inc_extent_ref;
> - else
> - process_func = btrfs_free_extent;
> -
>   if (full_backref)
>   parent = buf->start;
>   else
> @@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>  
>   num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
>   key.offset -= btrfs_file_extent_offset(buf, fi);
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, key.objectid,
> -key.offset, for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
>   if (ret)
>   goto fail;
>   } else {
>   bytenr = btrfs_node_blockptr(buf, i);
>   num_bytes = fs_info->nodesize;
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, level - 1, 0,
> -for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
>   if (ret)
>   goto fail;
>   }
> 


Re: System unable to mount partition after a power loss

2018-12-07 Thread Doni Crosby
I ran that command, but I cannot get the email to send properly to the
mailing list because the attached output is over 4.6M.


On 12/7/2018 11:49 AM, Doni Crosby wrote:

The output of the command is attached. These are the errors that showed up
on the system:
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
Ignoring transid failure
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
Ignoring transid failure
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
Ignoring transid failure
WARNING: eb corrupted: parent bytenr 31801344 slot 132 level 1 child
bytenr 3563231461376 level has 1 expect 0, skipping the slot
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
Ignoring transid failure
On Fri, Dec 7, 2018 at 2:22 AM Qu Wenruo  wrote:




On 2018/12/7 1:24 PM, Doni Crosby wrote:

All,

I'm coming to you to see if there is a way to fix or at least recover
most of the data I have from a btrfs filesystem. The system went down
after both a breaker and the battery backup failed. I cannot currently
mount the system, with the following error from dmesg:

Note: The vda1 is just the entire disk being passed from the VM host
to the VM; it's not a true virtual block device

[ 499.704398] BTRFS info (device vda1): disk space caching is enabled
[  499.704401] BTRFS info (device vda1): has skinny extents
[  499.739522] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327


Transid mismatch normally means the fs is screwed up more or less.

And according to your mount failure, it looks like the fs got screwed up badly.

What's the kernel version used in the VM?
I don't really think the VM is always using the latest kernel.


[  499.740257] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327

Re: System unable to mount partition after a power loss

2018-12-07 Thread Doni Crosby
I just looked at the VM; it does not have a cache. That's the default
in proxmox to improve performance.
On Fri, Dec 7, 2018 at 7:25 AM Austin S. Hemmelgarn
 wrote:
>
> On 2018-12-07 01:43, Doni Crosby wrote:
> >> This is qemu-kvm? What's the cache mode being used? It's possible the
> >> usual write guarantees are thwarted by VM caching.
> > Yes, it is a proxmox host running the system, so it is a qemu vm; I'm
> > unsure of the caching situation.
> On the note of QEMU and the cache mode, the only cache mode I've seen to
> actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe',
> but that causes problems for most filesystems, so it's probably not the
> issue here.
>
> OTOH, I've seen issues with most of the cache modes other than
> 'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as
> the back-end storage on the host system, and most of the time such
> issues will manifest as both problems with the volume inside the VM
> _and_ the volume the disk images are being stored on.


[PATCH v2] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped
   (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap
   files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start
and wait for writeback by calling filemap_write_and_wait_range(); however,
that is not enough for Btrfs, for two reasons:

1) With compression, we need to start writeback twice in order to get the
   pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits
   for the IO to complete, but we need to wait for the ordered extents to
   finish, so that when we do the actual reflinking operations the file
   extent items are in the fs tree. This is also important due to the fact
   that the generic helper, for the deduplication case, compares the
   contents of the pages in the requested range, which might require
   reading extents from disk in the very unlikely case that pages get
   invalidated after writeback finishes (so the file extent items must be
   up to date in the fs tree).

Since these reasons are specific to Btrfs, we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results in
a simpler way of dealing with existing delalloc in the source/target
ranges, especially for the deduplication case, where we used to lock all
the pages first and then, if we found any delalloc or ordered extent for
the range, unlock the pages, trigger writeback, wait for the ordered
extents to complete, then lock all the pages again and check whether
deduplication can be done. So now we get a simpler approach: lock the
inodes, then trigger writeback, and then wait for the ordered extents to
complete.

So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example, the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic helper,
since they affected all filesystems supporting these operations, so we no
longer need special checks in Btrfs for them.
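
As an aside, a minimal sketch of the per-inode "flush twice, then wait for
ordered extents" step described above, using only the existing helpers named
in this changelog; it is an illustration of the idea, not a hunk from the
patch:

/* Illustration only (kernel context): flush an inode range twice and then
 * wait for its ordered extents, so the file extent items are in the fs tree
 * before the clone/dedupe comparison runs. */
#include <linux/fs.h>
#include "ordered-data.h"       /* btrfs_wait_ordered_range() */

static int flush_and_wait_range(struct inode *inode, u64 off, u64 len)
{
        int ret;

        /* With compression, one writeback pass may not be enough (see the
         * changelog above), hence two passes over the same range. */
        ret = filemap_write_and_wait_range(inode->i_mapping, off, off + len - 1);
        if (ret)
                return ret;
        ret = filemap_write_and_wait_range(inode->i_mapping, off, off + len - 1);
        if (ret)
                return ret;
        /* filemap_write_and_wait_range() only waits for the IO, so also
         * wait for the ordered extents to finish. */
        return btrfs_wait_ordered_range(inode, off, len);
}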

Signed-off-by: Filipe Manana 
---

V2: Removed the check that verifies whether either of the inodes is a directory,
as it is done by generic_remap_file_range_prep(). Oddly, in btrfs it was being
done only for cloning but not for dedupe.

 fs/btrfs/ioctl.c | 612 ---
 1 file changed, 129 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..321fb9bc149d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-   struct page *page;
-
-   page = grab_cache_page(inode->i_mapping, index);
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   if (!PageUptodate(page)) {
-   int ret;
-
-   ret = btrfs_readpage(NULL, page);
-   if (ret)
-   return ERR_PTR(ret);
-   lock_page(page);
-   if (!PageUptodate(page)) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EIO);
-   }
-   if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EAGAIN);
-   }
-   }
-
-   return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-  int num_pages, u64 off)
-{
-   int i;
-   pgoff_t index = off >> PAGE_SHIFT;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = extent_same_get_page(inode, index + i);
-   if (IS_ERR(pages[i])) {
-   int err = PTR_ERR(pages[i]);
-
-   if (err == -EAGAIN)
-   goto again;
-   pages[i] = NULL;
-   return err;
-   }
-   }
-   return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-bool retry_range_locking)
-{
-   /*
-* Do any pending 

Re: [PATCH 05/10] btrfs: introduce delayed_refs_rsv

2018-12-07 Thread Nikolay Borisov



On 3.12.18 at 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if its a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.


I see one difference in the way that the space is managed. Essentially,
for the delayed refs rsv you only ever increase ->size, and you only
increase ->reserved when you have to refill. This is opposite to the way
other metadata space is managed, i.e. by using use_block_rsv, which
subtracts ->reserved every time a block has to be CoW'ed. Why this
difference?
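
For readers following along, a simplified sketch of the two counters being
discussed; the field names follow struct btrfs_block_rsv as used in the
quoted patches, everything else is an illustration rather than the actual
kernel code:

#include <stdint.h>

typedef uint64_t u64;                   /* stand-in for the kernel's u64 */

/* Illustration of the ->size / ->reserved split described above. */
struct rsv_sketch {
        u64 size;       /* how much space we expect to need */
        u64 reserved;   /* how much space we actually hold right now */
};

/* Queuing more delayed ref heads only grows the expected need. */
static void rsv_sketch_track(struct rsv_sketch *rsv, u64 bytes)
{
        rsv->size += bytes;             /* ->reserved is untouched here */
}

/* ->reserved only catches up when space is migrated/refilled into the rsv,
 * and never beyond what ->size says is needed. */
static void rsv_sketch_refill(struct rsv_sketch *rsv, u64 bytes)
{
        u64 missing = rsv->size > rsv->reserved ? rsv->size - rsv->reserved : 0;

        rsv->reserved += (bytes < missing) ? bytes : missing;
}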


> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  14 +++-
>  fs/btrfs/delayed-ref.c |  43 --
>  fs/btrfs/disk-io.c |   4 +
>  fs/btrfs/extent-tree.c | 212 
> +
>  fs/btrfs/transaction.c |  37 -
>  5 files changed, 284 insertions(+), 26 deletions(-)
> 



> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +struct btrfs_block_rsv *src,
> +u64 num_bytes)

This function is currently used only during transaction start; it seems
to be rather specific to the delayed refs, so I'd suggest making it
private to transaction.c

> +{
> + struct btrfs_block_rsv *delayed_refs_rsv = _info->delayed_refs_rsv;
> + u64 to_free = 0;
> +
> + spin_lock(>lock);
> + src->reserved -= num_bytes;
> + src->size -= num_bytes;
> + spin_unlock(>lock);
> +
> + spin_lock(_refs_rsv->lock);
> + if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> + u64 delta = delayed_refs_rsv->size -
> + delayed_refs_rsv->reserved;
> + if (num_bytes > delta) {
> + to_free = num_bytes - delta;
> + num_bytes = delta;
> + }
> + } else {
> + to_free = num_bytes;
> + num_bytes = 0;
> + }
> +
> + if (num_bytes)
> + delayed_refs_rsv->reserved += num_bytes;
> + if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> + delayed_refs_rsv->full = 1;
> + spin_unlock(_refs_rsv->lock);
> +
> + if (num_bytes)
> + trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +   0, num_bytes, 1);
> + if (to_free)
> + space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +  to_free);
> +}
> +
> +/**
> + * btrfs_delayed_refs_rsv_refill - refill based on our delayed refs usage.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 items size worth of space 
> and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
> +   enum btrfs_reserve_flush_enum flush)
> +{
> + struct btrfs_block_rsv *block_rsv = _info->delayed_refs_rsv;
> + u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> + u64 num_bytes = 0;
> + int ret = -ENOSPC;
> +
> + spin_lock(_rsv->lock);
> + if (block_rsv->reserved < block_rsv->size) {
> + num_bytes = block_rsv->size - block_rsv->reserved;
> + num_bytes = min(num_bytes, limit);
> + }
> + 



[PATCH] Btrfs: use generic_remap_file_range_prep() for cloning and deduplication

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since cloning and deduplication are no longer Btrfs specific operations, we
now have generic code to handle parameter validation, compare file ranges
used for deduplication, clear capabilities when cloning, etc. This change
makes Btrfs use it, eliminating a lot of code in Btrfs and also fixing a
few bugs, such as:

1) When cloning, the destination file's capabilities were not dropped
   (the fstest generic/513 tests this);

2) We were not checking if the destination file is immutable;

3) Not checking if either the source or destination files are swap
   files (swap file support is coming soon for Btrfs);

4) System limits were not checked (resource limits and O_LARGEFILE).

Note that the generic helper generic_remap_file_range_prep() does start
and wait for writeback by calling filemap_write_and_wait_range(); however,
that is not enough for Btrfs, for two reasons:

1) With compression, we need to start writeback twice in order to get the
   pages marked for writeback and ordered extents created;

2) filemap_write_and_wait_range() (and all its other variants) only waits
   for the IO to complete, but we need to wait for the ordered extents to
   finish, so that when we do the actual reflinking operations the file
   extent items are in the fs tree. This is also important due to the fact
   that the generic helper, for the deduplication case, compares the
   contents of the pages in the requested range, which might require
   reading extents from disk in the very unlikely case that pages get
   invalidated after writeback finishes (so the file extent items must be
   up to date in the fs tree).

Since these reasons are specific to Btrfs, we have to do it in the Btrfs
code before calling generic_remap_file_range_prep(). This also results in
a simpler way of dealing with existing delalloc in the source/target
ranges, especially for the deduplication case, where we used to lock all
the pages first and then, if we found any delalloc or ordered extent for
the range, unlock the pages, trigger writeback, wait for the ordered
extents to complete, then lock all the pages again and check whether
deduplication can be done. So now we get a simpler approach: lock the
inodes, then trigger writeback, and then wait for the ordered extents to
complete.

So make btrfs use generic_remap_file_range_prep() (XFS and OCFS2 use it)
to eliminate duplicated code, fix a few bugs and benefit from future bug
fixes done there - for example, the recent clone and dedupe bugs involving
reflinking a partial EOF block got a counterpart fix in the generic helper,
since they affected all filesystems supporting these operations, so we no
longer need special checks in Btrfs for them.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/ioctl.c | 615 ---
 1 file changed, 132 insertions(+), 483 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..261e116dddb2 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3191,92 +3191,6 @@ static long btrfs_ioctl_dev_info(struct btrfs_fs_info *fs_info,
return ret;
 }
 
-static struct page *extent_same_get_page(struct inode *inode, pgoff_t index)
-{
-   struct page *page;
-
-   page = grab_cache_page(inode->i_mapping, index);
-   if (!page)
-   return ERR_PTR(-ENOMEM);
-
-   if (!PageUptodate(page)) {
-   int ret;
-
-   ret = btrfs_readpage(NULL, page);
-   if (ret)
-   return ERR_PTR(ret);
-   lock_page(page);
-   if (!PageUptodate(page)) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EIO);
-   }
-   if (page->mapping != inode->i_mapping) {
-   unlock_page(page);
-   put_page(page);
-   return ERR_PTR(-EAGAIN);
-   }
-   }
-
-   return page;
-}
-
-static int gather_extent_pages(struct inode *inode, struct page **pages,
-  int num_pages, u64 off)
-{
-   int i;
-   pgoff_t index = off >> PAGE_SHIFT;
-
-   for (i = 0; i < num_pages; i++) {
-again:
-   pages[i] = extent_same_get_page(inode, index + i);
-   if (IS_ERR(pages[i])) {
-   int err = PTR_ERR(pages[i]);
-
-   if (err == -EAGAIN)
-   goto again;
-   pages[i] = NULL;
-   return err;
-   }
-   }
-   return 0;
-}
-
-static int lock_extent_range(struct inode *inode, u64 off, u64 len,
-bool retry_range_locking)
-{
-   /*
-* Do any pending delalloc/csum calculations on inode, one way or
-* another, and lock file content.
-* The locking order is:
-*
-*   1) pages
-*   2) range in the inode's io tree
-

[PATCH] Btrfs: scrub, move setup of nofs contexts higher in the stack

2018-12-07 Thread fdmanana
From: Filipe Manana 

Since scrub workers only do memory allocation with GFP_KERNEL when they
need to perform repair, we can move the recent setup of the nofs context
up to scrub_handle_errored_block() instead of setting it up down the call
chain at insert_full_stripe_lock() and scrub_add_page_to_wr_bio(),
removing some duplicate code and comment. So the only paths for which a
scrub worker can do memory allocations using GFP_KERNEL are the following:

 scrub_bio_end_io_worker()
   scrub_block_complete()
 scrub_handle_errored_block()
   lock_full_stripe()
 insert_full_stripe_lock()
   -> kmalloc with GFP_KERNEL

  scrub_bio_end_io_worker()
scrub_block_complete()
  scrub_handle_errored_block()
scrub_write_page_to_dev_replace()
  scrub_add_page_to_wr_bio()
-> kzalloc with GFP_KERNEL
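
For context, a minimal self-contained sketch of the memalloc_nofs_save()/
memalloc_nofs_restore() pattern that this patch moves up the call chain; it
is illustrative only and not a hunk from the patch:

/* Illustration: scope a NOFS context around a section that may allocate
 * with GFP_KERNEL, so any allocation done down the call chain is treated
 * as GFP_NOFS and cannot recurse into the filesystem. */
#include <linux/sched/mm.h>
#include <linux/slab.h>

static int nofs_scope_example(void)
{
        unsigned int nofs_flag;
        void *buf;
        int ret = 0;

        nofs_flag = memalloc_nofs_save();       /* enter NOFS scope */

        /* GFP_KERNEL behaves as GFP_NOFS while the scope is active. */
        buf = kmalloc(4096, GFP_KERNEL);
        if (!buf)
                ret = -ENOMEM;
        kfree(buf);

        memalloc_nofs_restore(nofs_flag);       /* leave NOFS scope */
        return ret;
}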

Signed-off-by: Filipe Manana 
---

Applies on top of:

  Btrfs: fix deadlock with memory reclaim during scrub

 fs/btrfs/scrub.c | 34 ++
 1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index bbd1b36f4918..f996f4064596 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -322,7 +322,6 @@ static struct full_stripe_lock *insert_full_stripe_lock(
struct rb_node *parent = NULL;
struct full_stripe_lock *entry;
struct full_stripe_lock *ret;
-   unsigned int nofs_flag;
 
lockdep_assert_held(_root->lock);
 
@@ -342,15 +341,8 @@ static struct full_stripe_lock *insert_full_stripe_lock(
 
/*
 * Insert new lock.
-*
-* We must use GFP_NOFS because the scrub task might be waiting for a
-* worker task executing this function and in turn a transaction commit
-* might be waiting the scrub task to pause (which needs to wait for all
-* the worker tasks to complete before pausing).
 */
-   nofs_flag = memalloc_nofs_save();
ret = kmalloc(sizeof(*ret), GFP_KERNEL);
-   memalloc_nofs_restore(nofs_flag);
if (!ret)
return ERR_PTR(-ENOMEM);
ret->logical = fstripe_logical;
@@ -842,6 +834,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
int page_num;
int success;
bool full_stripe_locked;
+   unsigned int nofs_flag;
static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
  DEFAULT_RATELIMIT_BURST);
 
@@ -867,6 +860,16 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
dev = sblock_to_check->pagev[0]->dev;
 
/*
+* We must use GFP_NOFS because the scrub task might be waiting for a
+* worker task executing this function and in turn a transaction commit
+* might be waiting the scrub task to pause (which needs to wait for all
+* the worker tasks to complete before pausing).
+* We do allocations in the workers through insert_full_stripe_lock()
+* and scrub_add_page_to_wr_bio(), which happens down the call chain of
+* this function.
+*/
+   nofs_flag = memalloc_nofs_save();
+   /*
 * For RAID5/6, race can happen for a different device scrub thread.
 * For data corruption, Parity and Data threads will both try
 * to recovery the data.
@@ -875,6 +878,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 */
ret = lock_full_stripe(fs_info, logical, _stripe_locked);
if (ret < 0) {
+   memalloc_nofs_restore(nofs_flag);
spin_lock(>stat_lock);
if (ret == -ENOMEM)
sctx->stat.malloc_errors++;
@@ -914,7 +918,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 */
 
sblocks_for_recheck = kcalloc(BTRFS_MAX_MIRRORS,
- sizeof(*sblocks_for_recheck), GFP_NOFS);
+ sizeof(*sblocks_for_recheck), GFP_KERNEL);
if (!sblocks_for_recheck) {
spin_lock(>stat_lock);
sctx->stat.malloc_errors++;
@@ -1212,6 +1216,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
}
 
ret = unlock_full_stripe(fs_info, logical, full_stripe_locked);
+   memalloc_nofs_restore(nofs_flag);
if (ret < 0)
return ret;
return 0;
@@ -1630,19 +1635,8 @@ static int scrub_add_page_to_wr_bio(struct scrub_ctx *sctx,
mutex_lock(>wr_lock);
 again:
if (!sctx->wr_curr_bio) {
-   unsigned int nofs_flag;
-
-   /*
-* We must use GFP_NOFS because the scrub task might be waiting
-* for a worker task executing this function and in turn a
-* transaction commit might be waiting the scrub task to pause
-* (which needs to wait for all the worker tasks 

Re: [PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates

2018-12-07 Thread Nikolay Borisov



On 3.12.18 at 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter what the number of items on it.  Fix the accounting to only be
> adjusted when we add/remove a ref head.

David,

I think this also warrants a forward-looking sentence stating that the number
is also going to be used to calculate the required number of bytes in
the delayed refs rsv. Something along the lines of:

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;
>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(>add_list, >ref_add_list);
>   atomic_inc(>num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(>lock);
>   return ret;
>  }
> 


Re: System unable to mount partition after a power loss

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-07 01:43, Doni Crosby wrote:

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.

Yes, it is a proxmox host running the system, so it is a qemu vm; I'm
unsure of the caching situation.
On the note of QEMU and the cache mode, the only cache mode I've seen to 
actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe', 
but that causes problems for most filesystems, so it's probably not the 
issue here.


OTOH, I've seen issues with most of the cache modes other than 
'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as 
the back-end storage on the host system, and most of the time such 
issues will manifest as both problems with the volume inside the VM 
_and_ the volume the disk images are being stored on.


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-06 23:09, Andrei Borzenkov wrote:

On 06.12.2018 16:04, Austin S. Hemmelgarn wrote:


* On SCSI devices, a discard operation translates to a SCSI UNMAP
command.  As pointed out by Ronnie Sahlberg in his reply, this command
is purely advisory, may not result in any actual state change on the
target device, and is not guaranteed to wipe the data.  To actually wipe
things, you have to explicitly write bogus data to the given regions
(using either regular writes, or a WRITESAME command with the desired
pattern), and _then_ call UNMAP on them.


The WRITE SAME command has an UNMAP bit, and depending on the device and
kernel version, the kernel may actually issue either UNMAP or WRITE SAME
with the UNMAP bit set when doing a discard.

Good to know.  I've not looked at the SCSI code much, and actually 
didn't know about the UNMAP bit for the WRITE SAME command, so I just 
assumed that the kernel only used the UNMAP command.
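
For reference, a minimal userspace sketch of how a discard is requested on
Linux; the kernel then maps it to UNMAP or WRITE SAME with the UNMAP bit on
SCSI (or TRIM on ATA), as discussed above. The device path is a placeholder,
and the ioctl really does discard data, so treat it as illustrative only:

/* Illustration: discard the first 1 MiB of a block device via BLKDISCARD.
 * Whether the device actually wipes anything is up to the device; as noted
 * above, the operation may be purely advisory. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>           /* BLKDISCARD */

int main(void)
{
        uint64_t range[2] = { 0, 1ULL << 20 };  /* offset, length in bytes */
        int fd = open("/dev/sdX", O_WRONLY);    /* placeholder device */

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ioctl(fd, BLKDISCARD, &range) < 0)
                perror("BLKDISCARD");
        close(fd);
        return 0;
}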


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-07 Thread Nikolay Borisov



On 7.12.18 at 9:09, Nikolay Borisov wrote:
> 
> 
> On 6.12.18 at 19:54, David Sterba wrote:
>> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.12.18 at 17:20, Josef Bacik wrote:
>>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>>> delayed refs space we need.  This means we can drastically simplify
>>>
>>> IMO it will be helpful if there is a sentence here referring back to
>>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>>> But I guess this is something David can also do.
>>
>> I'll update the changelog, but I'm not sure what exactly you want to see
>> there, please post the replacement text. Thanks.
> 
> With the introduction of the delayed_refs_rsv infrastructure, namely
> btrfs_update_delayed_refs_rsv we now know exactly how much pending
> delayed refs space is required.

To put things into context as to why I deem this change beneficial:
basically, migrating a reservation from the transaction to the delayed
refs rsv modifies both size and reserved - they will be equal. Calling
btrfs_update_delayed_refs_rsv actually increases ->size and doesn't
really decrement ->reserved. Also, we never do
btrfs_block_rsv_migrate/use_block_rsv on the delayed refs block rsv, so
managing the ->reserved value for the delayed refs rsv is different from
the rest of the block rsvs.


> 
>>
>>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>>> know it's time to commit the transaction and stop any more delayed refs
>>>> from being generated.
>>>>
>>>> Signed-off-by: Josef Bacik 
>>
> 
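
To make the above concrete, here is a sketch of the simplified check the
quoted changelog describes: treat the global rsv's reserved space as a
spill-over buffer and compare it, together with the delayed refs rsv's
reserved space, against the delayed refs rsv's size. The struct and field
names follow the quoted patches; the function itself is an illustration,
not the exact hunk from the series:

#include "ctree.h"      /* struct btrfs_fs_info, struct btrfs_block_rsv */

/* Sketch: decide whether delayed refs have outgrown the space we hold. */
static bool delayed_refs_need_throttle(struct btrfs_fs_info *fs_info)
{
        struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
        struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
        u64 reserved;
        bool ret = false;

        spin_lock(&global_rsv->lock);
        reserved = global_rsv->reserved;
        spin_unlock(&global_rsv->lock);

        spin_lock(&delayed_rsv->lock);
        reserved += delayed_rsv->reserved;
        /* Expected need has grown past what we hold plus the spill-over. */
        if (delayed_rsv->size >= reserved)
                ret = true;
        spin_unlock(&delayed_rsv->lock);

        return ret;
}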


Re: HELP unmountable partition after btrfs balance to RAID0

2018-12-07 Thread Duncan
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted:

> We wanted to convert a file system to a RAID0 with two partitions.
> Unfortunately we had to reboot the server during the balance operation
> before it could complete.
> 
> Now following happens:
> 
> A mount attempt of the array fails with following error code:
> 
> btrfs recover yields roughly 1.6 out of 4 TB.

[Just another btrfs user and list regular, not a dev.  A dev may reply to 
your specific case, but meanwhile, for next time...]

That shouldn't be a problem.  Because with raid0 a failure of any of the 
components will take down the entire raid, making it less reliable than a 
single device, raid0 (in general, not just btrfs) is considered only 
useful for data of low enough value that its loss is no big deal, either 
because it's truly of little value (internet cache being a good example), 
or because backups are kept available and updated for whenever the raid0 
array fails.  Because with raid0, it's always a question of when it'll 
fail, not if.

So loss of a filesystem being converted to raid0 isn't a problem, because 
the data on it, by virtue of being in the process of conversion to raid0, 
is defined as of throw-away value in any case.  If it's of higher value 
than that, it's not going to be raid0 (or in the process of conversion to 
it) in the first place.

Of course that's simply an extension of the more general first sysadmin's 
rule of backups, that the true value of data is defined not by arbitrary 
claims, but by the number of backups of that data it's worth having.  
Because "things happen", whether it's fat-fingering, bad hardware, buggy 
software, or simply someone tripping over the power cable or running into 
the power pole outside at the wrong time.

So no backup is simply defining the data as worth less than the time/
trouble/resources necessary to make that backup.

Note that you ALWAYS save what was of most value to you, either the time/
trouble/resources to do the backup, if your actions defined that to be of 
more value than the data, or the data, if you had that backup, thereby 
defining the value of the data to be worth backing up.

Similarly, failure of the only backup isn't a problem because by virtue 
of there being only that one backup, the data is defined as not worth 
having more than one, and likewise, having an outdated backup isn't a 
problem, because that's simply the special case of defining the data in 
the delta between the backup time and the present as not (yet) worth the 
time/hassle/resources to make/refresh that backup.

(And FWIW, the second sysadmin's rule of backups is that it's not a 
backup until you've successfully tested it recoverable in the same sort 
of conditions you're likely to need to recover it in.  Because so many 
people have /thought/ they had backups, that turned out not to be, 
because they never tested that they could actually recover the data from 
them.  For instance, if the backup tools you'll need to recover the 
backup are on the backup itself, how do you get to them?  Can you create 
a filesystem for the new copy of the data and recover it from the backup 
with just the tools and documentation available from your emergency boot 
media?  Untested backup == no backup, or at best, backup still in 
process!)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: System unable to mount partition after a power loss

2018-12-06 Thread Qu Wenruo


On 2018/12/7 1:24 PM, Doni Crosby wrote:
> All,
> 
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
> 
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM; it's not a true virtual block device
> 
> [ 499.704398] BTRFS info (device vda1): disk space caching is enabled
> [  499.704401] BTRFS info (device vda1): has skinny extents
> [  499.739522] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327

Transid mismatch normally means the fs is screwed up more or less.

And according to your mount failure, it looks like the fs got screwed up badly.

What's the kernel version used in the VM?
I don't really think the VM is always using the latest kernel.

> [  499.740257] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327
> [  499.770847] BTRFS error (device vda1): open_ctree failed
> 
> I have tried running btrfsck:
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689

According to your later dump-super output, it looks quite possible that
the corrupted extents all belong to the extent tree.

So it's still possible that your fs tree and other essential trees are OK.

Please dump the following output (with its stderr) to further confirm
the damage.
# btrfs ins dump-tree -b 31801344 --follow /dev/vda1

If your objective is only to recover data, then you could start by trying
btrfs-restore.
It's pretty hard to fix a heavily damaged extent tree.

Thanks,
Qu
> Ignoring transid failure
> Checking filesystem on /dev/vda1
> UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
> checking extents
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> Ignoring transid failure
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> Ignoring transid failure
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify 

Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 6.12.18 at 19:54, David Sterba wrote:
> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 3.12.18 at 17:20, Josef Bacik wrote:
>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>> delayed refs space we need.  This means we can drastically simplify
>>
>> IMO it will be helpful if there is a sentence here referring back to
>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>> But I guess this is something David can also do.
> 
> I'll update the changelog, but I'm not sure what exactly you want to see
> there, please post the replacement text. Thanks.

With the introduction of the delayed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv we now know exactly how much pending
delayed refs space is required.

> 
>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>> know it's time to commit the transaction and stop any more delayed refs
>>> from being generated.
>>>
>>> Signed-off-by: Josef Bacik 
> 


Re: System unable to mount partition after a power loss

2018-12-06 Thread Doni Crosby
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.
Yes, it is a proxmox host running the system, so it is a qemu vm; I'm
unsure of the caching situation.

> Old version of progs, I suggest upgrading to 4.17.1 and run
I updated the progs to 4.17 and ran the following

btrfs insp dump-s -f /device/:
See attachment

btrfs rescue super -v /device/ (insp rescue super wasn't valid)
All Devices:
Device: id = 1, name = /dev/vda1

Before Recovering:
[All good supers]:
device name = /dev/vda1
superblock bytenr = 65536

device name = /dev/vda1
superblock bytenr = 67108864

device name = /dev/vda1
superblock bytenr = 274877906944

[All bad supers]:

All supers are valid, no need to recover

btrfs check --mode=lowmem /dev/vda1:
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=3563210342400 item=120 parent
level=1 child level=1
ERROR: cannot open file system

mount -o ro,norecovery,usebackuproot /dev/vda1 /mnt:
Same dmesg output as before.
On Fri, Dec 7, 2018 at 12:56 AM Chris Murphy  wrote:
>
> On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby  wrote:
> >
> > All,
> >
> > I'm coming to you to see if there is a way to fix or at least recover
> > most of the data I have from a btrfs filesystem. The system went down
> > after both a breaker and the battery backup failed. I cannot currently
> > mount the system, with the following error from dmesg:
> >
> > Note: The vda1 is just the entire disk being passed from the VM host
> > to the VM; it's not a true virtual block device
>
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.
>
>
>
> > btrfs check --recover also ends in a segmentation fault
>
> I'm not familiar with --recover option, the --repair option is flagged
> with a warning in the man page.
>Warning
>Do not use --repair unless you are advised to do so by a
> developer or an experienced user,
>
>
> > btrfs --version:
> > btrfs-progs v4.7.3
>
> Old version of progs, I suggest upgrading to 4.17.1 and run
>
> btrfs insp dump-s -f /device/
> btrfs insp rescue super -v /device/
> btrfs check --mode=lowmem /device/
>
> These are all read only commands. Please post output to the list,
> hopefully a developer will get around to looking at it.
>
> It is safe to try:
>
> mount -o ro,norecovery,usebackuproot /device/ /mnt/
>
> If that works, I suggest updating your backup while it's still
> possible in the meantime.
>
>
> --
> Chris Murphy
superblock: bytenr=65536, device=/dev/vda1
-
csum_type   0 (crc32c)
csum_size   4
csum0xbfa6fd72 [match]
bytenr  65536
flags   0x1
( WRITTEN )
magic   _BHRfS_M [match]
fsid7c76bb05-b3dc-4804-bf56-88d010a214c6
label   Array
generation  5184693
root31801344
sys_array_size  226
chunk_root_generation   5183734
root_level  1
chunk_root  20971520
chunk_root_level1
log_root0
log_root_transid0
log_root_level  0
total_bytes 32003947737088
bytes_used  6652776640512
sectorsize  4096
nodesize16384
leafsize (deprecated)   16384
stripesize  4096
root_dir6
num_devices 1
compat_flags0x0
compat_ro_flags 0x0
incompat_flags  0x161
( MIXED_BACKREF |
  

Re: System unable to mount partition after a power loss

2018-12-06 Thread Chris Murphy
On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby  wrote:
>
> All,
>
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
>
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM; it's not a true virtual block device

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.



> btrfs check --recover also ends in a segmentation fault

I'm not familiar with the --recover option; the --repair option is
flagged with a warning in the man page.
   Warning
   Do not use --repair unless you are advised to do so by a
developer or an experienced user,


> btrfs --version:
> btrfs-progs v4.7.3

Old version of progs, I suggest upgrading to 4.17.1 and run

btrfs insp dump-s -f /device/
btrfs insp rescue super -v /device/
btrfs check --mode=lowmem /device/

These are all read only commands. Please post output to the list,
hopefully a developer will get around to looking at it.

It is safe to try:

mount -o ro,norecovery,usebackuproot /device/ /mnt/

If that works, I suggest updating your backup while it's still
possible in the meantime.


-- 
Chris Murphy


System unable to mount partition after a power loss

2018-12-06 Thread Doni Crosby
All,

I'm coming to you to see if there is a way to fix or at least recover
most of the data I have from a btrfs filesystem. The system went down
after both a breaker and the battery backup failed. I cannot currently
mount the system, with the following error from dmesg:

Note: The vda1 is just the entire disk being passed from the VM host
to the VM it's not an actual true virtual block device

[ 499.704398] BTRFS info (device vda1): disk space caching is enabled
[  499.704401] BTRFS info (device vda1): has skinny extents
[  499.739522] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327
[  499.740257] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327
[  499.770847] BTRFS error (device vda1): open_ctree failed

I have tried running btrfsck:
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
Ignoring transid failure
Checking filesystem on /dev/vda1
UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
checking extents
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
Ignoring transid failure
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
Ignoring transid failure
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
Ignoring transid failure
Segmentation fault

btrfs check --recover also ends in a segmentation fault

I am aware of chunk-recover and have tried to run it but got wary
when I saw dev0, not vda1.

Any help would be appreciated,
Doni

uname -a:
Linux Homophone 4.18.0-0.bpo.1-amd64 #1 SMP Debian 4.18.6-1~bpo9+1
(2018-09-13) 

Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Andrei Borzenkov
06.12.2018 16:04, Austin S. Hemmelgarn пишет:
> 
> * On SCSI devices, a discard operation translates to a SCSI UNMAP
> command.  As pointed out by Ronnie Sahlberg in his reply, this command
> is purely advisory, may not result in any actual state change on the
> target device, and is not guaranteed to wipe the data.  To actually wipe
> things, you have to explicitly write bogus data to the given regions
> (using either regular writes, or a WRITESAME command with the desired
> pattern), and _then_ call UNMAP on them.

The WRITE SAME command has an UNMAP bit, and depending on the device and
kernel version the kernel may actually issue either UNMAP or WRITE SAME
with the UNMAP bit set when doing a discard.



Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-12-06 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 02:24:41PM +0900, Misono Tomohiro wrote:
> Hello,
> 
> This is basically the resend of 
>   "[PATCH v2 00/20] btrfs-progs: Rework of "subvolume list/show" and relax the
>   root privileges of them" [1]
> which I submitted in June. The aim of this series is to allow non-privileged 
> user
> to use basic subvolume functionality (create/list/snapshot/delete; this 
> allows "list")
> 
> They were once in devel branch with some whitespace/comment modification by 
> david.
> I rebased them to current devel branch.
> 
> github: https://github.com/t-msn/btrfs-progs/tree/rework-sub-list
> 
> Basic logic/code is the same as before. Some differences are:
>  - Use latest libbtrfsutil from Omar [2] (thus drop first part of patches).
>    As a result, "sub list" cannot accept an ordinary directory to be
>    specified (which is allowed in the previous version)
>  - Drop patches which add new options to "sub list"
>  - Use 'nobody' as non-privileged test user just like libbtrfsutil test
>  - Update comments
> 
> Importantly, in order to make output consistent for both root and 
> non-privileged
> user, this changes the behavior of "subvolume list": 
>  - (default) Only list subvolumes under the specified path.
>    The path needs to be a subvolume.
>  - (-a) filter is dropped. i.e. its output is the same as the
> default behavior of "sub list" in progs <= 4.19
> 
> Therefore, existing scripts may need to be updated to add the -a option
> (I believe nobody uses the current -a option).
> If anyone thinks this is not good, please let me know.

I think there are a few options in the case that the path isn't a
subvolume:

1. List all subvolumes in the filesystem with randomly mangled paths,
   which is what we currently do.
2. Error out, which is what this version of the series does.
3. List all subvolumes under the containing subvolume, which is what the
   previous version does.
4. List all subvolumes under the containing subvolume that are
   underneath the given path.

Option 1 won't work well for unprivileged users. Option 2 (this series)
is definitely going to break people's workflows/scripts. Option 3 is
unintuitive. In my opinion, option 4 is the nicest, but it may also
break scripts that expect all subvolumes to be printed.

There's also an option 5, which is to keep the behavior the same for
root (like what my previous patch [1] did) and implement option 4 for
unprivileged users.

I think 4 and 5 are the two main choices: do we want to preserve
backwards compatibility as carefully as possible (at the cost of
consistency), or do we want to risk it and improve the interface?

1: 
https://github.com/osandov/btrfs-progs/commit/fb61c21aeb998b12c1d02532639083d7f40c41e0
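
As a rough illustration of the default behaviour described in the cover
letter (list only the subvolumes below a given path), here is a sketch
using the libbtrfsutil C API.  The function names and signatures are
recalled from btrfsutil.h and should be treated as assumptions, not a
tested example:

/*
 * Sketch only: list the subvolumes below @path via libbtrfsutil.
 * btrfs_util_create_subvolume_iterator() with top == 0 starts from the
 * subvolume that contains @path (assumed semantics, check btrfsutil.h).
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <btrfsutil.h>

static int list_subvolumes_under(const char *path)
{
	struct btrfs_util_subvolume_iterator *iter;
	enum btrfs_util_error err;

	err = btrfs_util_create_subvolume_iterator(path, 0, 0, &iter);
	if (err) {
		fprintf(stderr, "%s\n", btrfs_util_strerror(err));
		return 1;
	}

	for (;;) {
		char *subvol_path;
		uint64_t id;

		err = btrfs_util_subvolume_iterator_next(iter, &subvol_path, &id);
		if (err == BTRFS_UTIL_ERROR_STOP_ITERATION)
			break;
		if (err) {
			fprintf(stderr, "%s\n", btrfs_util_strerror(err));
			break;
		}
		printf("%" PRIu64 "\t%s\n", id, subvol_path);
		free(subvol_path);
	}

	btrfs_util_destroy_subvolume_iterator(iter);
	return err == BTRFS_UTIL_ERROR_STOP_ITERATION ? 0 : 1;
}

Build with -lbtrfsutil; whether it works without root is exactly the
question of which ioctls the running kernel provides, as discussed above.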


[PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support

2018-12-06 Thread Omar Sandoval
From: Omar Sandoval 

I apparently didn't test this on a pre-4.18 kernel.
test_subvolume_info_unprivileged() checks for an ENOTTY, but this
doesn't seem to work correctly with subTest().
test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
an explicit check to both before doing the actual test.

Signed-off-by: Omar Sandoval 
---
Based on the devel branch.

 libbtrfsutil/python/tests/test_subvolume.py | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/libbtrfsutil/python/tests/test_subvolume.py 
b/libbtrfsutil/python/tests/test_subvolume.py
index 99ec97bc..b06a1d3d 100644
--- a/libbtrfsutil/python/tests/test_subvolume.py
+++ b/libbtrfsutil/python/tests/test_subvolume.py
@@ -168,12 +168,13 @@ class TestSubvolume(BtrfsTestCase):
 
 with drop_privs():
 try:
-self._test_subvolume_info(subvol, snapshot)
+btrfsutil.subvolume_info(self.mountpoint)
 except OSError as e:
 if e.errno == errno.ENOTTY:
 self.skipTest('BTRFS_IOC_GET_SUBVOL_INFO is not available')
 else:
 raise
+self._test_subvolume_info(subvol, snapshot)
 
 def test_read_only(self):
 for arg in self.path_or_fd(self.mountpoint):
@@ -487,6 +488,13 @@ class TestSubvolume(BtrfsTestCase):
 try:
 os.chdir(self.mountpoint)
 with drop_privs():
+try:
+list(btrfsutil.SubvolumeIterator('.'))
+except OSError as e:
+if e.errno == errno.ENOTTY:
+self.skipTest('BTRFS_IOC_GET_SUBVOL_ROOTREF is not 
available')
+else:
+raise
 self._test_subvolume_iterator()
 finally:
 os.chdir(pwd)
-- 
2.19.2



Re: BTRFS RAID filesystem unmountable

2018-12-06 Thread Qu Wenruo


On 2018/12/7 上午7:15, Michael Wade wrote:
> Hi Qu,
> 
> Me again! Having formatted the drives and rebuilt the RAID array I
> seem to have be having the same problem as before (no power cut this
> time [I bought a UPS]).

But strangely, your superblock shows it has a log tree, which means
either you hit a kernel panic/transaction abort, or an unexpected power
loss.

> The btrfs volume is broken on my ReadyNAS.
> 
> I have attached the results of some of the commands you asked me to
> run last time, and I am hoping you might be able to help me out.

This time the problem is more serious: some chunk tree blocks are not
even inside the system chunk range, so no wonder it fails to mount.

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376
" and paste the output.

But I don't have any clue. My guess is some kernel problem related to
new chunk allocation, or the chunk root node itself is already seriously
corrupted.

Considering how old your kernel is (4.4), it's not recommended to use
btrfs on such an old kernel, unless tons of btrfs fixes have been well
backported to it.

Thanks,
Qu

> 
> Kind regards
> Michael
> On Sat, 19 May 2018 at 12:43, Michael Wade  wrote:
>>
>> I have let the find root command run for 14+ days, its produced a
>> pretty huge log file 1.6 GB but still hasn't completed. I think I will
>> start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年05月05日 00:18, Michael Wade wrote:
 Hi Qu,

 The tool is still running and the log file is now ~300mb. I guess it
 shouldn't normally take this long.. Is there anything else worth
 trying?
>>>
>>> I'm afraid not much.
>>>
>>> Although there is a possibility to modify btrfs-find-root to do much
>>> faster but limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and not
>>> much we can do right now.
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards
 Michael

 On 2 May 2018 at 06:29, Michael Wade  wrote:
> Thanks Qu,
>
> I actually aborted the run with the old btrfs tools once I saw its
> output. The new btrfs tools is still running and has produced a log
> file of ~85mb filled with that content so far.
>
> Kind regards
> Michael
>
> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>
>>
>> On 2018年05月01日 23:50, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Oh dear that is not good news!
>>>
>>> I have been running the find root command since yesterday but it only
>>> seems to be only be outputting the following message:
>>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> It's mostly fine, as find-root will go through all tree blocks and try
>> to read them as tree blocks.
>> Although btrfs-find-root will suppress csum error output, but such basic
>> tree validation check is not suppressed, thus you get such message.
>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> I tried with the latest btrfs tools compiled from source and the ones
>>> I have installed with the same result. Is there a CLI utility I could
>>> use to determine if the log contains any other content?
>>
>> Did it report any useful info at the end?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 30 April 2018 at 04:02, Qu Wenruo  wrote:


 On 2018年04月29日 22:08, Michael Wade wrote:
> Hi Qu,
>
> Got this error message:
>
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
>
> I have attached the dumps for:
>
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K 
> skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K 
> skip=266359275520

 Unfortunately, both dumps are corrupted and contain mostly garbage.
 I think it's the underlying stack (mdraid) has something wrong or 
 failed
 to recover its data.

 This means your last 

Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread Qu Wenruo


On 2018/12/7 上午3:35, David Sterba wrote:
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>
>>> Which is based on v4.20-rc1.
>>
>> Thanks, I'll add it to for-next soon.
> 
> The branch was there for some time but not for at least a week (my
> mistake I did not notice in time). I've rebased it on top of recent
> misc-next, but without the delayed refs patchset from Josef.
> 
> At the moment I'm considering it for merge to 4.21, there's still some
> time to pull it out in case it shows up to be too problematic. I'm
> mostly worried about the unknown interactions with the enospc updates or

For that part, I don't think it would have some obvious problem for
enospc updates.

As the user-noticeable effect is the delay of reloc tree deletion.

Despite that, it's mostly transparent to extent allocation.

> generally because of lack of qgroup and reloc code reviews.

That's the biggest problem.

However most of the current qgroup + balance optimization is done inside
qgroup code (to skip certain qgroup record), if we're going to hit some
problem then this patchset would have the highest possibility to hit
problem.

Later patches will just keep tweaking qgroup to without affecting any
other parts mostly.

So I'm fine if you decide to pull it out for now.

Thanks,
Qu

> 
> I'm going to do some testing of the rebased branch before I add it to
> for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
> plase check if everyghing is still ok there. Thanks.
> 



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-06 Thread Julia Lawall



On Thu, 6 Dec 2018, Johannes Thumshirn wrote:

> On 05/12/2018 15:46, Julia Lawall wrote:
> [...]
> >> +@r_patch depends on !context && patch && !org && !report@
> >> +expression E;
> >> +type T;
> >> +@@
> >> +
> >> +(
> >> +- E & ~PAGE_MASK
> >> ++ offset_in_page(E)
> >> +|
> >> +- E & (PAGE_SIZE - 1)
> >> ++ offset_in_page(E)
> >
> > The two lines above should be subsumed by the two lines below.  When there
> > is a type metavariable that has no other dependencies, an isomorphism will
> > consider that it is either present or absent.
>
> Oh OK, I'm sorry I'm not really into coccinelle so I guessed it might
> take some iterations.
>
> Do you have an example for this?

Explanation 1:

Coccinelle has a file standard.iso that shows the isomorphisms (rewrite
rules) that may be applied to semantic patches.  One of the rules is:

Expression
@ not_ptr1 @
expression *X;
@@
 !X => X == NULL

So if you have a pointer typed expression X and you write a transformation
on !X, it will also apply to occurrences of X == NULL in the source code.
In this way, you don't have to write so many variants.

Likewise there is an isomorphism:

Expression
@ drop_cast @
expression E;
pure type T;
@@

 (T)E => E

That is, if you have a semantic patch with (T)X, then it will also apply
to code that matches just X, without the cast.  The word pure means that
this isomorphism metavariable has to match a semantic patch term that is a
metavariable and this metavariable can't be used elsewhere.  If you wrote

- (char)x

Then you would probably not want that to apply without the (char) cast.
But if you have just

- (T)x

for some random unbound metavariable T, then perhaps you don't care about
the cast to T.  If you actually do, then you can put disable drop_cast in
the header of your rule.



Explanation 2:

To see what your semantic patch is really doing, you can run

spatch --parse-cocci sp.cocci

Here is what I get for your patch rule, with some annotations added:

@@
expression E;
type T;
@@


(
-E
  >>> offset_in_page(E)
 -& -~-PAGE_MASK
|
-~
  >>> offset_in_page(E)
-PAGE_MASK -& -E
|

// the following come from
// - E & (PAGE_SIZE - 1)
// + offset_in_page(E)

-E   // 1
  >>> offset_in_page(E)
 -& -(-PAGE_SIZE -- -1-)
|
-E   // 2
  >>> offset_in_page(E)
 -& -PAGE_SIZE -- -1
|
-(   // 3
>>> offset_in_page(E)
  -PAGE_SIZE -- -1-) -& -E
|
-PAGE_SIZE   // 4
  >>> offset_in_page(E)
 -- -1 -& -E
|

// the following come from:
// - E & ((T)PAGE_SIZE - 1)
// + offset_in_page(E)

-E
  >>> offset_in_page(E)
 -& -(-(-T -)-PAGE_SIZE -- -1-)
|
-E   // same as 1
  >>> offset_in_page(E)
 -& -(-PAGE_SIZE -- -1-)
|
-E
  >>> offset_in_page(E)
 -& -(-T -)-PAGE_SIZE -- -1
|
-E   // same as 2
  >>> offset_in_page(E)
 -& -PAGE_SIZE -- -1
|
-(
>>> offset_in_page(E)
  -(-T -)-PAGE_SIZE -- -1-) -& -E
|
-(   // same as 3
>>> offset_in_page(E)
  -PAGE_SIZE -- -1-) -& -E
|
-(
>>> offset_in_page(E)
  -T -)-PAGE_SIZE -- -1 -& -E
|
-PAGE_SIZE   // same as 4
  >>> offset_in_page(E)
 -- -1 -& -E
)

So all the transformation generated by

- E & (PAGE_SIZE - 1)
+ offset_in_page(E)

are also generated by

- E & ((T)PAGE_SIZE - 1)
+ offset_in_page(E)

I hope that is helpful.
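
As a quick standalone check of why those open-coded forms are
interchangeable with offset_in_page(), here is a small sketch; PAGE_SIZE
is hard-coded to 4096 for the demo and the macros just mirror the usual
kernel definitions:

/* Sketch: the open-coded forms the rule matches vs. offset_in_page(). */
#include <assert.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL	/* assumed page size, for illustration only */
#define PAGE_MASK (~(PAGE_SIZE - 1))
#define offset_in_page(p) ((unsigned long)(p) & ~PAGE_MASK)

int main(void)
{
	unsigned long addr = 0x12345678;

	assert((addr & ~PAGE_MASK) == offset_in_page(addr));
	assert((addr & (PAGE_SIZE - 1)) == offset_in_page(addr));
	assert((addr & ((unsigned long)PAGE_SIZE - 1)) == offset_in_page(addr));

	printf("offset_in_page(0x%lx) = %lu\n", addr, offset_in_page(addr));
	return 0;
}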

julia


Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread David Sterba
On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > This patchset can be fetched from github:
> > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > 
> > Which is based on v4.20-rc1.
> 
> Thanks, I'll add it to for-next soon.

The branch was there for some time but not for at least a week (my
mistake I did not notice in time). I've rebased it on top of recent
misc-next, but without the delayed refs patchset from Josef.

At the moment I'm considering it for merge to 4.21, there's still some
time to pull it out in case it shows up to be too problematic. I'm
mostly worried about the unknown interactions with the enospc updates or
generally because of lack of qgroup and reloc code reviews.

I'm going to do some testing of the rebased branch before I add it to
for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
please check if everything is still ok there. Thanks.


Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 12:12:21PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> 
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
> 
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
> 35964  owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
> 
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to do
> something.  This is a problem because if we modify the tree we could cow a 
> block
> in our path, which frees our reference to that block.  Then once we get back
> to
> that shared block we'll free our reference to it again, and get ENOENT when
> trying to lookup our extent reference to that block in __btrfs_free_extent.
> 
> This is ultimately happening because we have delayed items left to be 
> processed
> for our deleted snapshot _after_ all of the inodes are closed for the 
> snapshot.
> We only run the delayed inode item if we're deleting the inode, and even then 
> we
> do not run the delayed insertions or delayed removals.  These can be run at 
> any
> point after our final inode does it's last iput, which is what triggers the
> snapshot deletion.  We can end up with the snapshot deletion happening and 
> then
> have the delayed items run on that file system, resulting in the above 
> problem.
> 
> This problem has existed forever, however my patches made it much easier to 
> hit
> as I wake up the cleaner much more often to deal with delayed iputs, which 
> made
> us more likely to start the snapshot dropping work before the transaction
> commits, which is when the delayed items would generally be run.  Before,
> generally speaking, we would run the delayed items, commit the transaction, 
> and
> wakeup the cleaner thread to start deleting snapshots, which means we were 
> less
> likely to hit this problem.  You could still hit it if you had multiple
> snapshots to be deleted and ended up with lots of delayed items, but it was
> definitely harder.
> 
> Fix for now by simply running all the delayed items before starting to drop 
> the
> snapshot.  We could make this smarter in the future by making the delayed 
> items
> per-root, and then simply drop any delayed items for roots that we are going 
> to
> delete.  But for now just a quick and easy solution is the safest.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 
> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave I can reroll the series, but the second version of patch 1 is the same,
>   let me know what you want.

As this is a small update it's fine to send just that patch. You may
also use --in-reply-to so it threads to the original series. Resending
series makes most sense (to me) when there's a discussion and many
changes, so a fresh series makes it clear what's the current status.

Patch replaced in for-next topic branch, thanks.
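
For reference, a minimal sketch of the shape of the fix described in the
changelog above (not the committed diff): flush the delayed items before
btrfs_drop_snapshot() starts walking down the tree.  This is a kernel-style
fragment; the function names follow 4.20-era btrfs code and the error
labels are assumed from the surrounding function:

	/*
	 * Sketch only: run pending delayed insertions/removals up front so
	 * nothing modifies the root while the snapshot is being dropped.
	 */
	trans = btrfs_start_transaction(tree_root, 0);
	if (IS_ERR(trans)) {
		err = PTR_ERR(trans);
		goto out_free;
	}

	err = btrfs_run_delayed_items(trans);
	if (err)
		goto out_end_trans;

	/* ... continue with the existing walk_down/walk_up drop loop ... */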


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-12-06 Thread David Sterba
On Fri, Nov 30, 2018 at 12:19:18PM -0500, Josef Bacik wrote:
> On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote:
> > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> > >
> > > From: Josef Bacik 
> > >
> > > When debugging some weird extent reference bug I suspected that we were
> > > changing a snapshot while we were deleting it, which could explain my
> > > bug.  This was indeed what was happening, and this patch helped me
> > > verify my theory.  It is never correct to modify the snapshot once it's
> > > being deleted, so mark the root when we are deleting it and make sure we
> > > complain about it when it happens.
> > >
> > > Signed-off-by: Josef Bacik 
> > > ---
> > >  fs/btrfs/ctree.c   | 3 +++
> > >  fs/btrfs/ctree.h   | 1 +
> > >  fs/btrfs/extent-tree.c | 9 +
> > >  3 files changed, 13 insertions(+)
> > >
> > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > > index 5912a97b07a6..5f82f86085e8 100644
> > > --- a/fs/btrfs/ctree.c
> > > +++ b/fs/btrfs/ctree.c
> > > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct 
> > > btrfs_trans_handle *trans,
> > > u64 search_start;
> > > int ret;
> > >
> > > +   if (test_bit(BTRFS_ROOT_DELETING, >state))
> > > +   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats 
> > > being dropped\n");
> > 
> > Please use btrfs_warn(), it makes sure we use a consistent message
> > style, identifies the fs, etc.
> > Also, "thats" should be "that is" or "that's".
> > 
> 
> Ah yeah, I was following the other convention in there but we should probably
> convert all of those to btrfs_warn.  I'll fix the grammar thing as well, just
> a
> leftover from the much less code of conduct friendly message I originally had
> there.  Thanks,

Committed with the following fixup:

-   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
dropped\n");
+   btrfs_error(fs_info,
+   "COW'ing blocks on a fs root that's being dropped");



Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread David Sterba
On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> > Now with the delayed_refs_rsv we can now know exactly how much pending
> > delayed refs space we need.  This means we can drastically simplify
> 
> IMO it will be helpful if there is a sentence here referring back to
> btrfs_update_delayed_refs_rsv to put your first sentence into context.
> But I guess this is something David can also do.

I'll update the changelog, but I'm not sure what exactly you want to see
there, please post the replacement text. Thanks.

> > btrfs_check_space_for_delayed_refs by simply checking how much space we
> > have reserved for the global rsv (which acts as a spill over buffer) and
> > the delayed refs rsv.  If our total size is beyond that amount then we
> > know it's time to commit the transaction and stop any more delayed refs
> > from being generated.
> > 
> > Signed-off-by: Josef Bacik 


Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-06 Thread Johannes Thumshirn
On 05/12/2018 15:46, Julia Lawall wrote:
[...]
>> +@r_patch depends on !context && patch && !org && !report@
>> +expression E;
>> +type T;
>> +@@
>> +
>> +(
>> +- E & ~PAGE_MASK
>> ++ offset_in_page(E)
>> +|
>> +- E & (PAGE_SIZE - 1)
>> ++ offset_in_page(E)
> 
> The two lines above should be subsumed by the two lines below.  When there
> is a type metavariable that has no other dependencies, an isomorphism will
> consider that it is either present or absent.

Oh OK, I'm sorry I'm not really into coccinelle so I guessed it might
take some iterations.

Do you have an example for this?

> Why not include the cast case for the context and org cases?

This is an oversight actually as I couldn't test it due to a bug in
openSUSE's coccinelle rpm.

Thanks,
Johannes
-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Now with the delayed_refs_rsv we can now know exactly how much pending
> delayed refs space we need.  This means we can drastically simplify

IMO it will be helpful if there is a sentence here referring back to
btrfs_update_delayed_refs_rsv to put your first sentence into context.
But I guess this is something David can also do.

> btrfs_check_space_for_delayed_refs by simply checking how much space we
> have reserved for the global rsv (which acts as a spill over buffer) and
> the delayed refs rsv.  If our total size is beyond that amount then we
> know it's time to commit the transaction and stop any more delayed refs
> from being generated.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/extent-tree.c | 48 ++--
>  fs/btrfs/inode.c   |  4 ++--
>  fs/btrfs/transaction.c |  2 +-
>  4 files changed, 22 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2eba398c722b..30da075c042e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
> btrfs_fs_info *fs_info,
>  }
>  
>  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
>  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
>const u64 start);
>  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5a2d0b061f57..07ef1b8087f7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info 
> *fs_info, u64 csum_bytes)
>   return num_csums;
>  }
>  
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_rsv *global_rsv;
> - u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
> - u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
> - unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
> - u64 num_bytes, num_dirty_bgs_bytes;
> - int ret = 0;
> + struct btrfs_block_rsv *delayed_refs_rsv = _info->delayed_refs_rsv;
> + struct btrfs_block_rsv *global_rsv = _info->global_block_rsv;
> + bool ret = false;
> + u64 reserved;
>  
> - num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> - num_heads = heads_to_leaves(fs_info, num_heads);
> - if (num_heads > 1)
> - num_bytes += (num_heads - 1) * fs_info->nodesize;
> - num_bytes <<= 1;
> - num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
> - fs_info->nodesize;
> - num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
> -  num_dirty_bgs);
> - global_rsv = _info->global_block_rsv;
> + spin_lock(_rsv->lock);
> + reserved = global_rsv->reserved;
> + spin_unlock(_rsv->lock);
>  
>   /*
> -  * If we can't allocate any more chunks lets make sure we have _lots_ of
> -  * wiggle room since running delayed refs can create more delayed refs.
> +  * Since the global reserve is just kind of magic we don't really want
> +  * to rely on it to save our bacon, so if our size is more than the
> +  * delayed_refs_rsv and the global rsv then it's time to think about
> +  * bailing.
>*/
> - if (global_rsv->space_info->full) {
> - num_dirty_bgs_bytes <<= 1;
> - num_bytes <<= 1;
> - }
> -
> - spin_lock(_rsv->lock);
> - if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
> - ret = 1;
> - spin_unlock(_rsv->lock);
> + spin_lock(_refs_rsv->lock);
> + reserved += delayed_refs_rsv->reserved;
> + if (delayed_refs_rsv->size >= reserved)
> + ret = true;
> + spin_unlock(_refs_rsv->lock);
>   return ret;
>  }
>  
> @@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct 
> btrfs_trans_handle *trans)
>   if (val >= NSEC_PER_SEC / 2)
>   return 2;
>  
> - return btrfs_check_space_for_delayed_refs(trans);
> + return btrfs_check_space_for_delayed_refs(trans->fs_info);
>  }
>  
>  struct async_delayed_refs {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a097f5fde31d..8532a2eb56d1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle 
> *evict_refill_and_join(struct btrfs_root *root,
> 

Re: [PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Over the years we have built up a lot of infrastructure to keep delayed
> refs in check, mostly by running them at btrfs_end_transaction() time.
> We have a lot of different maths we do to figure out how much, if we
> should do it inline or async, etc.  This existed because we had no
> feedback mechanism to force the flushing of delayed refs when they
> became a problem.  However with the enospc flushing infrastructure in
> place for flushing delayed refs when they put too much pressure on the
> enospc system we have this problem solved.  Rip out all of this code as
> it is no longer needed.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 38 --
>  1 file changed, 38 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2d8401bf8df9..01f39401619a 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -798,22 +798,12 @@ static int should_end_transaction(struct 
> btrfs_trans_handle *trans)
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>  {
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - int updates;
> - int err;
>  
>   smp_mb();
>   if (cur_trans->state >= TRANS_STATE_BLOCKED ||
>   cur_trans->delayed_refs.flushing)
>   return 1;
>  
> - updates = trans->delayed_ref_updates;
> - trans->delayed_ref_updates = 0;
> - if (updates) {
> - err = btrfs_run_delayed_refs(trans, updates * 2);
> - if (err) /* Error code will also eval true */
> - return err;
> - }
> -
>   return should_end_transaction(trans);
>  }
>  
> @@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>  {
>   struct btrfs_fs_info *info = trans->fs_info;
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - u64 transid = trans->transid;
> - unsigned long cur = trans->delayed_ref_updates;
>   int lock = (trans->type != TRANS_JOIN_NOLOCK);
>   int err = 0;
> - int must_run_delayed_refs = 0;
>  
>   if (refcount_read(>use_count) > 1) {
>   refcount_dec(>use_count);
> @@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - if (!list_empty(>new_bgs))
> - btrfs_create_pending_block_groups(trans);

Is this being deleted because in delayed_refs_rsv you account also for
new block groups?

> -
> - trans->delayed_ref_updates = 0;
> - if (!trans->sync) {
> - must_run_delayed_refs =
> - btrfs_should_throttle_delayed_refs(trans);
> - cur = max_t(unsigned long, cur, 32);
> -
> - /*
> -  * don't make the caller wait if they are from a NOLOCK
> -  * or ATTACH transaction, it will deadlock with commit
> -  */
> - if (must_run_delayed_refs == 1 &&
> - (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
> - must_run_delayed_refs = 2;
> - }
> -
> - btrfs_trans_release_metadata(trans);
> - trans->block_rsv = NULL;

Why remove those 2 lines as well?

> -
>   if (!list_empty(>new_bgs))
>   btrfs_create_pending_block_groups(trans);
>  
> @@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   kmem_cache_free(btrfs_trans_handle_cachep, trans);
> - if (must_run_delayed_refs) {
> - btrfs_async_run_delayed_refs(info, cur, transid,
> -  must_run_delayed_refs == 1);
> - }
>   return err;
>  }
>  
> 


Re: [PATCH 00/10][V2] Delayed refs rsv

2018-12-06 Thread David Sterba
On Mon, Dec 03, 2018 at 10:20:28AM -0500, Josef Bacik wrote:
> v1->v2:
> - addressed the comments from the various reviewers.
> - split "introduce delayed_refs_rsv" into 5 patches.  The patches are the same
>   together as they were, just split out more logically.  They can't really be
>   bisected across in that you will likely have fun enospc failures, but they
>   compile properly.  This was done to make it easier for review.

Thanks, that was helpful. The bisectability is IMO not hurt, as it
compiles and runs despite the potential ENOSPC problems. I guess the
point is to be able to reach any commit & compile & run and not be hit
by unrelated bugs. If somebody is bisecting an ENOSPC problem and lands
in the middle of the series, there's a hint in the changelogs that
something is going on.

> -- Original message --
> 
> This patchset changes how we do space reservations for delayed refs.  We were
> hitting probably 20-40 enospc abort's per day in production while running
> delayed refs at transaction commit time.  This means we ran out of space in 
> the
> global reserve and couldn't easily get more space in use_block_rsv().
> 
> The global reserve has grown to cover everything we don't reserve space
> explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if
> we're running short on space and when it's time to force a commit.  A failure
> rate of 20-40 file systems when we run hundreds of thousands of them isn't 
> super
> high, but cleaning up this code will make things less ugly and more 
> predictable.
> 
> Thus the delayed refs rsv.  We always know how many delayed refs we have
> outstanding, and although running them generates more we can use the global
> reserve for that spill over, which fits better into it's desired use than a 
> full
> blown reservation.  This first approach is to simply take how many times we're
> reserving space for and multiply that by 2 in order to save enough space for 
> the
> delayed refs that could be generated.  This is a naive approach and will
> probably evolve, but for now it works.
> 
> With this patchset we've gone down to 2-8 failures per week.  It's not 
> perfect,
> there are some corner cases that still need to be addressed, but is
> significantly better than what we had.  Thanks,

I've merged the series to misc-next, the amount of reviews and testing
is sufficient. Though there are some more comments or suggestions for
improvements, they can be done as followups.

The other patchsets from you are mainly fixes that can be applied at
-rc time; for that reason I'm glad the main infrastructure change
can be merged before the 4.21 code freeze.
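
To make the "multiply that by 2" sizing from the cover letter concrete,
here is a small standalone sketch; the per-item formula (nodesize *
BTRFS_MAX_LEVEL * 2) is assumed from the 4.20-era
btrfs_calc_trans_metadata_size() and the numbers are purely illustrative:

/* Sketch of the delayed refs rsv sizing idea, not the in-kernel code. */
#include <stdio.h>
#include <stdint.h>

#define BTRFS_MAX_LEVEL 8

static uint64_t calc_trans_metadata_size(uint64_t nodesize, uint64_t num_items)
{
	return nodesize * BTRFS_MAX_LEVEL * 2 * num_items;
}

int main(void)
{
	uint64_t nodesize = 16384;		/* default 16K nodes */
	uint64_t pending_ref_heads = 1000;

	/* one unit per pending head, doubled as spill-over headroom */
	uint64_t rsv = calc_trans_metadata_size(nodesize, pending_ref_heads * 2);

	printf("delayed refs rsv for %llu heads: %llu MiB\n",
	       (unsigned long long)pending_ref_heads,
	       (unsigned long long)(rsv >> 20));
	return 0;
}

With these assumptions, 1000 pending ref heads reserve 500 MiB.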


Re: [PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 03:23:02PM +0100, Johannes Thumshirn wrote:
> Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them
> throughout btrfs.
> 
> This series also includes a patch for 'make coccicheck' which is marked as an
> RFC and I've CCed Julia in the hoping to get input from her.
> 
> Johannes Thumshirn (3):
>   btrfs: use offset_in_page instead of open-coding it
>   btrfs: use PAGE_ALIGNED instead of open-coding it

Added to misc-next, thanks.


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Austin S. Hemmelgarn

On 2018-12-06 01:11, Robert White wrote:
> (1) Automatic and selective wiping of unused and previously used disk
> blocks is a good security measure, particularly when there is an
> encryption layer beneath the file system.
> 
> (2) USB attached devices _never_ support TRIM and they are the most
> likely to fall into strangers' hands.
Not true on the first count.  Some really nice UAS devices do support 
SCSI UNMAP and WRITESAME commands.


> (3) I vaguely recall that some flash chips will take bulk writes of
> full sectors of 0x00 or 0xFF (I don't remember which) were second-best
> to TRIM for letting the flash controllers defragment their internals.
> 
> So it would be dog-slow, but it would be neat if BTRFS had a mount
> option to convert any TRIM command from above into the write of a zero,
> 0xFF, or trash block to the device below if that device doesn't support
> TRIM. Real TRIM support would override the block write.
> 
> Obviously doing an fstrim would involve a lot of slow device writes but
> only for people likely to do that sort of thing.
> 
> For testing purposes the destruction of unused pages in this manner
> might catch file system failures or coding errors.
> 
> (The other layer where this might be most appropriate is in cryptsetup
> et al, where it could lie about TRIM support, but that sort of stealth
> lag might be bad for filesystem-level operations. Doing it there would
> also lose the simpler USB use cases.)
> 
> ...Just a thought...
First off, TRIM is an ATA command, not the kernel term.  `fstrim` 
inherited the ATA name, but in kernel it's called a discard operation, 
and it's kind of important to understand here that a discard operation 
can result in a number of different behaviors.


In particular, you have at least the following implementations:

* On SCSI devices, a discard operation translates to a SCSI UNMAP 
command.  As pointed out by Ronnie Sahlberg in his reply, this command 
is purely advisory, may not result in any actual state change on the 
target device, and is not guaranteed to wipe the data.  To actually wipe 
things, you have to explicitly write bogus data to the given regions 
(using either regular writes, or a WRITESAME command with the desired 
pattern), and _then_ call UNMAP on them.
* On dm-thinp devices, a discard operation results in simply unmapping 
the blocks in the region it covers.  The underlying blocks themselves 
are not wiped until they get reallocated (which may not happen when you 
write to that region of the dm-thinp device again), and may not even be 
wiped then (depending on how the dm-thinp device is configured).  Thus, 
the same behavior as for SCSI is required here.
* On SD/MMC devices, a discard operation results in an SD ERASE command 
being issued.  This one is non-advisory (that is, it's guaranteed to 
happen), and is supposed to guarantee an overwrite of the region with 
zeroes or ones.
* eMMC devices additionally define a discard operation independent of 
the SD ERASE command which unmaps the region in the translation layer, 
but does not wipe the blocks either on issuing the command or on 
re-allocating the low-level blocks.  Essentially, it's just a hint for 
the wear-leveling algorithm.
* NVMe provides two different discard operations, and I'm not sure which 
the kernel uses for NVMe block emulation.  They correspond almost 
exactly to the SCSI UNMAP and SD ERASE commands in terms of behavior.
* For ATA devices, a discard operation translates to an ATA TRIM 
command.  This command doesn't even require that the data read back from 
a region the command has been issued against be consistent between 
reads, let alone that it actually returns zeroes, and it is completely 
silent on how the device should actually implement the operation.  In 
practice, most drives that implement it actually behave like dm-thinp 
devices, unmapping the low-level blocks in the region and only clearing 
them when they get reallocated, while returning any data they want on 
subsequent reads to that logical region until a write happens.
* The MTD subsystem has support for discard operations in the various 
FTL's, and they appear from a cursory look at the code to behave like a 
non-advisory version of the SCSI UNMAP command (FWIW, MTD's are what the 
concept of a discard operation was originally implemented in Linux for).


Notice that the only implementations that are actually guaranteed to 
clear out the low-level physical blocks are the SD ERASE and one of the 
two NVMe options, and all others require you to manually wipe the data 
before issuing the discard operation to guarantee that no data is retained.


Given this, I don't think this should be done as a mechanism of
intercepting or translating discard operations, but as something else
entirely.  Perhaps as a block-layer target that wipes the region and then
issues a discard for it to the lower-level device if the device supports it?
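
For what it's worth, a rough user-space sketch of that wipe-then-discard
ordering, zeroing a range with BLKZEROOUT before handing it to BLKDISCARD.
Both ioctls take a {start, length} pair in bytes (alignment to the logical
block size is assumed), and the discard is treated as best-effort:

/* Sketch only: explicit zero-out followed by a best-effort discard. */
#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

static int wipe_then_discard(int fd, uint64_t start, uint64_t len)
{
	uint64_t range[2] = { start, len };

	/* Overwrite the range with zeroes first (the kernel may offload
	 * this as a write-zeroes command on capable devices). */
	if (ioctl(fd, BLKZEROOUT, range) < 0) {
		perror("BLKZEROOUT");
		return -1;
	}

	/* Then let the device reclaim it; failure here is non-fatal. */
	if (ioctl(fd, BLKDISCARD, range) < 0)
		perror("BLKDISCARD (ignored)");

	return 0;
}

int main(int argc, char **argv)
{
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <block device>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Example: wipe and discard the first 1 MiB of the device. */
	wipe_then_discard(fd, 0, 1ULL << 20);
	close(fd);
	return 0;
}

A block-layer or filesystem implementation would presumably do the same
two steps at the bio level; the ordering (write the pattern, then discard)
is the important part.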


Re: [PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Any space used in the delayed_refs_rsv will be freed up by a transaction
> commit, so instead of just counting the pinned space we also need to
> account for any space in the delayed_refs_rsv when deciding if it will
> make a different to commit the transaction to satisfy our space
> reservation.  If we have enough bytes to satisfy our reservation ticket
> then we are good to go, otherwise subtract out what space we would gain
> back by committing the transaction and compare that against the pinned
> space to make our decision.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

However, look below for one suggestion: 

> ---
>  fs/btrfs/extent-tree.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa0a638d0263..63ff9d832867 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>  {
>   struct reserve_ticket *ticket = NULL;
>   struct btrfs_block_rsv *delayed_rsv = _info->delayed_block_rsv;
> + struct btrfs_block_rsv *delayed_refs_rsv = _info->delayed_refs_rsv;
>   struct btrfs_trans_handle *trans;
> - u64 bytes;
> + u64 bytes_needed;
> + u64 reclaim_bytes = 0;
>  
>   trans = (struct btrfs_trans_handle *)current->journal_info;
>   if (trans)
> @@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   else if (!list_empty(_info->tickets))
>   ticket = list_first_entry(_info->tickets,
> struct reserve_ticket, list);
> - bytes = (ticket) ? ticket->bytes : 0;
> + bytes_needed = (ticket) ? ticket->bytes : 0;
>   spin_unlock(_info->lock);
>  
> - if (!bytes)
> + if (!bytes_needed)
>   return 0;
>  
>   /* See if there is enough pinned space to make this reservation */
>   if (__percpu_counter_compare(_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>   goto commit;
>  
> @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   return -ENOSPC;

If we remove this :
 if (space_info != delayed_rsv->space_info)  
return -ENOSPC; 

check, can't we move the reclaim_bytes calculation above
__percpu_counter_compare and eventually be left with just a single
invocation of percpu_compare? The diff should look something along the
lines of:

@@ -4828,19 +4827,6 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-   goto commit;
-
-   /*
-* See if there is some space in the delayed insertion reservation for
-* this reservation.
-*/
-   if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
-
spin_lock(_rsv->lock);
if (delayed_rsv->size > bytes)
bytes = 0;
@@ -4850,9 +4836,8 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
if (__percpu_counter_compare(_info->total_bytes_pinned,
   bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
+  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
return -ENOSPC;
-   }
 
 commit:
trans = btrfs_join_transaction(fs_info->extent_root);


>  
>   spin_lock(_rsv->lock);
> - if (delayed_rsv->size > bytes)
> - bytes = 0;
> - else
> - bytes -= delayed_rsv->size;
> + reclaim_bytes += delayed_rsv->reserved;
>   spin_unlock(_rsv->lock);
>  
> + spin_lock(_refs_rsv->lock);
> + reclaim_bytes += delayed_refs_rsv->reserved;
> + spin_unlock(_refs_rsv->lock);
> + if (reclaim_bytes >= bytes_needed)
> + goto commit;
> + bytes_needed -= reclaim_bytes;
> +
>   if (__percpu_counter_compare(_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
>   return -ENOSPC;
>   }
> 


Re: [PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 

Doesn't this also need a stable tag? Furthermore, doesn't the missing
code dealing with total_bytes_pinned in check_ref_cleanup mean that
every time the last reference for a block was freed we were leaking
bytes in total_bytes_pinned? Shouldn't this eventually have led to
total_bytes_pinned dominating the usage in a space_info?

Codewise lgtm:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + >transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> + if (head->is_data) {
> + spin_lock(_refs->lock);
> + delayed_refs->pending_csums -= head->num_bytes;
> + spin_unlock(_refs->lock);
> + }
> + }
> +
> + /* Also free its reserved qgroup space */
> + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +   head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>   struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
>  
> - trace_run_delayed_ref_head(fs_info, head, 0);
> -
> - if (head->total_ref_mod < 0) {
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (head->is_data)
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (head->is_system)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - space_info = __find_space_info(fs_info, flags);
> - ASSERT(space_info);
> - percpu_counter_add_batch(_info->total_bytes_pinned,
> --head->num_bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> - if (head->is_data) {
> - spin_lock(_refs->lock);
> - delayed_refs->pending_csums -= head->num_bytes;
> - spin_unlock(_refs->lock);
> - }
> - }
> -
>   if (head->must_insert_reserved) {
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - /* Also free its reserved qgroup space */
> - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -   head->qgroup_reserved);
> + cleanup_ref_head_accounting(trans, head);
> +
> + trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
>   btrfs_put_delayed_ref_head(head);
>   return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> + cleanup_ref_head_accounting(trans, head);
>   mutex_unlock(>mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 


Re: [PATCH 01/10] btrfs: add btrfs_delete_ref_head helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(_refs->lock);
> + lockdep_assert_held(>lock);
> +
> + rb_erase_cached(>href_node, _refs->href_root);
> + RB_CLEAR_NODE(>href_node);
> + atomic_dec(_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
>   mutex_unlock(>mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase_cached(>href_node, _refs->href_root);
> - RB_CLEAR_NODE(>href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(>lock);
>   spin_unlock(_refs->lock);
> - atomic_dec(_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(>mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> - atomic_dec(&delayed_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;
> +
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> 


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Goldwyn Rodrigues
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> > 
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs. Even though I wish, I cannot get it correct the first time
> > around :/.. Here are some questions for which I need suggestions:
> 
> Hi Goldwyn,
> 
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage)

> 
> mayhem:~/:[0]# ndctl list
> [
>   {
> "dev":"namespace1.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
> "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
> "sector_size":512,
> "blockdev":"pmem1",
> "numa_node":1
>   },
>   {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
> "raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
> "sector_size":4096,
> "blockdev":"pmem0",
> "numa_node":0
>   }
> ]
> 
> Unfortunately I hit a btrfs_panic() with btrfs/002.
> export TEST_DEV=/dev/pmem0
> export SCRATCH_DEV=/dev/pmem1
> export MOUNT_OPTIONS="-o dax"
> ./check
> [...]
> [  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
> [  178.357044] BTRFS info (device pmem0): disk space caching is enabled
> [  178.357047] BTRFS info (device pmem0): has skinny extents
> [  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
> [  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
> devid 1 transid 3 /dev/pmem1
> [  178.505717] BTRFS info (device pmem1): disk space caching is enabled
> [  178.513593] BTRFS info (device pmem1): has skinny extents
> [  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
> feature
> [  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
> [  178.538331] BTRFS info (device pmem1): creating UUID tree
> [  178.587200] BTRFS critical (device pmem1): panic in
> ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
> (errno=-17 Object already exists)
> [  178.603129] [ cut here ]
> [  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
> [  178.614333] invalid opcode:  [#1] SMP PTI
> [  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
>   E 4.20.0-rc5-default-btrfs-dax #920
> [  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
> SE5C620.86B.0D.01.0010.072020182008 07/20/2018
> [  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
> [  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
> 39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
> 00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
> [  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
> [  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:
> 
> [  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
> 97711f9d6868
> [  178.687831] RBP: 97711d523000 R08:  R09:
> 065a
> [  178.695411] R10: 03ff R11: 0001 R12:
> 97710d66da70
> [  178.702993] R13: 9770c2ac5600 R14:  R15:
> 97710d66d9c0
> [  178.710573] FS:  7fe11ef90700() GS:97711f9c()
> knlGS:
> [  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
> [  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
> 007606e0
> [  178.732999] DR0:  DR1:  DR2:
> 
> [  178.740574] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [  178.748147] PKRU: 5554
> [  178.751297] Call Trace:
> [  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
> [  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
> [  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
> [  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
> [  178.777508]  ? current_time+0x3f/0x70
> [  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
> [  178.787265]  ? pipe_read+0x243/0x2a0
> [  178.791298]  __vfs_write+0xee/0x170
> [  178.795241]  vfs_write+0xad/0x1a0
> [  178.799008]  ? vfs_read+0x111/0x130
> [  178.802949]  ksys_write+0x42/0x90
> [  178.806712]  do_syscall_64+0x5b/0x180
> [  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  178.816334] RIP: 0033:0x7fe11eabb3d0
> [  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
> c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
> 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 

Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-06 Thread Goldwyn Rodrigues
On  6:52 05/12, Christoph Hellwig wrote:
> If you want to export these at all they have to be EXPORT_SYMBOL_GPL.
> 

Understood.

> But I'd really like to avoid seeing another duplicate DAX I/O path.
> Please try to adopt the existing iomap-based infrastructure for your
> needs first.

This is not worthwhile with btrfs. With non-page-aligned I/O on btrfs, we
need to copy the first/last page of the extents for CoW. So we
would end up using the exported functions anyway. Believe me, I have
spent some time trying to make btrfs iomap-compatible before giving up. The
problem is that btrfs needs to carry a lot of information across
iomap_begin and iomap_end. While the added private variable helps with
this, btrfs also needs hooks in the bio_submit() functions for crc
calculations during direct writes.
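
For illustration only, the kind of state that would have to travel from
->iomap_begin() to ->iomap_end() could look roughly like the sketch below.
Every name in it (btrfs_iomap_ctx, its fields, btrfs_dax_iomap_begin) is
hypothetical and not from any posted patch; it is only meant to show why a
single iomap->private pointer ends up carrying a lot:

  /* Hypothetical per-operation context threaded through iomap->private. */
  struct btrfs_iomap_ctx {
          struct extent_changeset *data_reserved; /* space reservation to release */
          struct extent_map *em;                  /* extent picked at iomap_begin time */
          u64 lockstart;
          u64 lockend;                            /* extent range locked for this I/O */
          bool nocow;                             /* whether the COW path was skipped */
  };

  static int btrfs_dax_iomap_begin(struct inode *inode, loff_t pos, loff_t length,
                                   unsigned int flags, struct iomap *iomap)
  {
          struct btrfs_iomap_ctx *ctx = kzalloc(sizeof(*ctx), GFP_NOFS);

          if (!ctx)
                  return -ENOMEM;
          /* reserve space, lock the extent range, look up or create the extent... */
          iomap->private = ctx;   /* handed back to us in ->iomap_end() */
          return 0;
  }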

-- 
Goldwyn


HELP unmountable partition after btrfs balance to RAID0

2018-12-06 Thread Thomas Mohr

Dear developers of BTRFS,

we have a problem. We wanted to convert a file system to a RAID0 with 
two partitions. Unfortunately we had to reboot the server during the 
balance operation before it could complete.


Now following happens:

A mount attempt of the array fails with the following error code:

btrfs recover yields roughly 1.6 out of 4 TB.

To recover the rest, we have tried:

mount:

[18192.357444] BTRFS info (device sdb1): disk space caching is enabled
[18192.357447] BTRFS info (device sdb1): has skinny extents
[18192.370664] BTRFS error (device sdb1): parent transid verify failed 
on 30523392 wanted 7432 found 7445
[18192.370810] BTRFS error (device sdb1): parent transid verify failed 
on 30523392 wanted 7432 found 7445

[18192.394745] BTRFS error (device sdb1): open_ctree failed

Mounting with options ro, degraded, clear_cache etc. yields the same errors.


btrfs rescue zero-log: this operation works; however, the error persists 
and the array remains unmountable.


parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Clearing log on /dev/sdb1, previous log_root 0, level 0

btrfs rescue chunk-recover fails with the following error message:

btrfs check results in:

Opening filesystem to check...
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
parent transid verify failed on 59768832 wanted 7422 found 7187
Ignoring transid failure
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
parent transid verify failed on 30408704 wanted 7430 found 7443
Ignoring transid failure
Checking filesystem on /dev/sdb1
UUID: 6c9ed4e1-d63f-46f0-b1e9-608b8fa43bb8
[1/7] checking root items
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
parent transid verify failed on 30523392 wanted 7432 found 7443
Ignoring transid failure
leaf parent key incorrect 30523392
ERROR: failed to repair root items: Operation not permitted


Any ideas what is going on or how to recover the file system ? I would 
greatly appreciate your help !!!


best,

Thomas


uname -a:

Linux server2 4.19.5-1-default #1 SMP PREEMPT Tue Nov 27 19:56:09 UTC 
2018 (6210279) x86_64 x86_64 x86_64 GNU/Linux


btrfs-progs version 4.19


--
ScienceConsult - DI Thomas Mohr KG
DI Thomas Mohr
Enzianweg 10a
2353 Guntramsdorf
Austria
+43 2236 56793
+43 660 461 1966
http://www.mohrkeg.co.at



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:

Hi Goldwyn,

I've thrown your patches (from your git tree) onto one of my pmem test
machines with this pmem config:

mayhem:~/:[0]# ndctl list
[
  {
"dev":"namespace1.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
"raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
"sector_size":512,
"blockdev":"pmem1",
"numa_node":1
  },
  {
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
"raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
"sector_size":4096,
"blockdev":"pmem0",
"numa_node":0
  }
]

Unfortunately I hit a btrfs_panic() with btrfs/002.
export TEST_DEV=/dev/pmem0
export SCRATCH_DEV=/dev/pmem1
export MOUNT_OPTIONS="-o dax"
./check
[...]
[  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
[  178.357044] BTRFS info (device pmem0): disk space caching is enabled
[  178.357047] BTRFS info (device pmem0): has skinny extents
[  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
[  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
devid 1 transid 3 /dev/pmem1
[  178.505717] BTRFS info (device pmem1): disk space caching is enabled
[  178.513593] BTRFS info (device pmem1): has skinny extents
[  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
feature
[  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
[  178.538331] BTRFS info (device pmem1): creating UUID tree
[  178.587200] BTRFS critical (device pmem1): panic in
ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
(errno=-17 Object already exists)
[  178.603129] [ cut here ]
[  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
[  178.614333] invalid opcode:  [#1] SMP PTI
[  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
  E 4.20.0-rc5-default-btrfs-dax #920
[  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
SE5C620.86B.0D.01.0010.072020182008 07/20/2018
[  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
[  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
[  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
[  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:

[  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
97711f9d6868
[  178.687831] RBP: 97711d523000 R08:  R09:
065a
[  178.695411] R10: 03ff R11: 0001 R12:
97710d66da70
[  178.702993] R13: 9770c2ac5600 R14:  R15:
97710d66d9c0
[  178.710573] FS:  7fe11ef90700() GS:97711f9c()
knlGS:
[  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
[  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
007606e0
[  178.732999] DR0:  DR1:  DR2:

[  178.740574] DR3:  DR6: fffe0ff0 DR7:
0400
[  178.748147] PKRU: 5554
[  178.751297] Call Trace:
[  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
[  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
[  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
[  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
[  178.777508]  ? current_time+0x3f/0x70
[  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
[  178.787265]  ? pipe_read+0x243/0x2a0
[  178.791298]  __vfs_write+0xee/0x170
[  178.795241]  vfs_write+0xad/0x1a0
[  178.799008]  ? vfs_read+0x111/0x130
[  178.802949]  ksys_write+0x42/0x90
[  178.806712]  do_syscall_64+0x5b/0x180
[  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  178.816334] RIP: 0033:0x7fe11eabb3d0
[  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
[  178.840052] RSP: 002b:7ffec969d978 EFLAGS: 0246 ORIG_RAX:
0001
[  178.848100] RAX: ffda RBX:  RCX:
7fe11eabb3d0
[  178.855715] RDX: 0400 RSI: 0156a000 RDI:
0001
[  178.863326] RBP: 0400 R08: 0003 R09:
7fe11ed7a698
[  178.870928] R10: 10a8b550 R11: 0246 

Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread ronnie sahlberg
Hi,

I am more of a SCSI guy than ATA so forgive where I am ignorant.

The SCSI equivalent to TRIM is called UNMAP.
UNMAP is unfortunately only a "hint" to the device, so if the device
is busy for any reason, it can just do a NO-OP, leave the data as is, and
still return status SUCCESS.
That is not good if you want to wipe data for confidentiality reasons :-)

As UNMAP and TRIM are related, make sure that TRIM actually provides a
guarantee to wipe the data. I do not know ATA well enough to know whether
it does or not.

In SCSI, instead of using UNMAP/TRIM you can use WRITESAME10/16, which
does provide an overwrite/wipe guarantee. Maybe ATA has something
equivalent to WRITESAME10/16.
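
On Linux there is also a middle ground between a hinted discard and a plain
loop of writes: the block layer can be asked to zero a range explicitly. The
fragment below is only an illustration of that idea from userspace (the
device path and range are made up); BLKZEROOUT writes zeroes, offloading to
the device when it can:

  /* Minimal sketch: explicitly zero a byte range on a block device. */
  #include <fcntl.h>
  #include <linux/fs.h>        /* BLKZEROOUT */
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          /* start and length in bytes, multiples of the logical block size */
          uint64_t range[2] = { 0, 1024 * 1024 };
          int fd;

          if (argc < 2)
                  return 1;
          fd = open(argv[1], O_WRONLY);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          if (ioctl(fd, BLKZEROOUT, range) < 0)
                  perror("BLKZEROOUT");
          close(fd);
          return 0;
  }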

I just want to say: be careful, sometimes these commands do not
provide a guarantee that they will actually overwrite the data or make it
unretrievable.

ronnie sahlberg


On Thu, Dec 6, 2018 at 4:24 PM Robert White  wrote:
>
> (1) Automatic and selective wiping of unused and previously used disk
> blocks is a good security measure, particularly when there is an
> encryption layer beneath the file system.
>
> (2) USB attached devices _never_ support TRIM and they are the most
> likely to fall into strangers hands.
>
> (3) I vaguely recall that some flash chips will take bulk writes of
> full sectors of 0x00 or 0xFF (I don't remember which) as second-best
> to TRIM for letting the flash controllers defragment their internals.
>
> So it would be dog-slow, but it would be neat if BTRFS had a mount
> option to convert any TRIM command from above into the write of a zero,
> 0xFF, or trash block to the device below if that device doesn't support
> TRIM. Real TRIM support would override the block write.
>
> Obviously doing an fstrim would involve a lot of slow device writes but
> only for people likely to do that sort of thing.
>
> For testing purposes the destruction of unused pages in this manner
> might catch file system failures or coding errors.
>
> (The other layer where this might be most appropriate is in cryptsetup
> et al, where it could lie about TRIM support, but that sort of stealth
> lag might be bad for filesystem-level operations. Doing it there would
> also lose the simpler USB use cases.)
>
> ...Just a thought...
>
> --Rob White.
>
>



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Robert White

On 12/5/18 9:37 PM, Jeff Mahoney wrote:
The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.


This is a potentially wild idea, so "grain of salt" and all that. I may 
misuse the exact wording.


So the essential problem of DAX is basically the opposite of 
data-deduplication. Instead of merging two duplicate data regions, you 
want to mark regions as at-risk while keeping the original content 
intact if there are snapshots in conflict.


So suppose you _require_ data checksums and data mode of "dup" or mirror 
or one of the other fault tolerant layouts.


By definition any block that gets written with content that it didn't 
have before will now have a bad checksum.


If the inode is flagged for direct IO that's an indication that the 
block has been updated.


At this point you really just need to do the opposite of deduplication, 
as in find/recover the original contents and assign/leave assigned those 
to the old/other snapshots, then compute the new checksum on the 
"original block" and assign it to the active subvolume.


So when a region is mapped for direct IO, and its refcount is greater 
than one, and you get to a sync or close event, you "recover" the old 
contents into a new location and assign those to "all the other users". 
Now that original storage region has only one user, so on sync or close 
you fix its checksums on the cheap.


Instead of the new data being a small rock sitting over a large rug to 
make a lump, the new data is like a rock being slid under the rug to 
make a lump.


So the first write to an extent creates a burdensome copy to retain the 
old contents, but second and subsequent writes to the same extent only 
have the cost of an _eventual_ checksum of the original block list.


Maybe If the data isn't already duplicated then the write mapping or the 
DAX open or the setting of the S_DUP flag could force the file into an 
extent block that _is_ duplicated.


The mental leap required is that the new blocks don't need to belong to 
the new state being created. The new blocks can be associated to the 
snapshots since data copy is idempotent.


The side note is that it only ever matters if the usage count is greater 
than one, so at worst taking a snapshot, which is already a _little_ 
racy anyway, would/could trigger a semi-lightweight copy of any S_DAX files:


If S_DAX :
  If checksum invalid :
copy data as-is and checksum, store in snapshot
  else : look for duplicate checksum
if duplicate found :
  assign that extent to the snapshot
else :
  If file opened for writing and has any mmaps for write :
copy extent and assign to new snapshot.
  else :
increment usage count and assign current block to snapshot

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap is speed during actual update, this idea is 
basically about amortizing the copy costs into the task of maintaining 
the snapshots instead of leaving them in the immediate hands of the 
time-critical updater.


The flush, munmap, or close by the user, or a system-wide sync event, 
are also good points at which to spend the bookkeeping time.


[PATCH 4/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_data_ref()

2018-12-05 Thread Qu Wenruo
Just like btrfs_add_delayed_tree_ref(), use btrfs_ref to refactor
btrfs_add_delayed_data_ref().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 19 +--
 fs/btrfs/delayed-ref.h |  8 +++-
 fs/btrfs/extent-tree.c | 24 +++-
 3 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index c42b8ade7b07..09caf1e6fc22 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -800,21 +800,27 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root,
-  u64 bytenr, u64 num_bytes,
-  u64 parent, u64 ref_root,
-  u64 owner, u64 offset, u64 reserved, int action,
-  int *old_ref_mod, int *new_ref_mod)
+  struct btrfs_ref *generic_ref,
+  u64 reserved, int *old_ref_mod,
+  int *new_ref_mod)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_delayed_data_ref *ref;
struct btrfs_delayed_ref_head *head_ref;
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
+   int action = generic_ref->action;
int qrecord_inserted;
int ret;
+   u64 bytenr = generic_ref->bytenr;
+   u64 num_bytes = generic_ref->len;
+   u64 parent = generic_ref->parent;
+   u64 ref_root = generic_ref->data_ref.ref_root;
+   u64 owner = generic_ref->data_ref.ino;
+   u64 offset = generic_ref->data_ref.offset;
u8 ref_type;
 
+   ASSERT(generic_ref && generic_ref->type == BTRFS_REF_DATA && action);
ref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
if (!ref)
return -ENOMEM;
@@ -838,7 +844,8 @@ int btrfs_add_delayed_data_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root) && is_fstree(root->root_key.objectid)) {
+   is_fstree(ref_root) && is_fstree(generic_ref->real_root) &&
+   !generic_ref->skip_qgroup) {
record = kzalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_data_ref_cachep, ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index dbe029c4e01b..a8fde33b43fd 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -337,11 +337,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
-  struct btrfs_root *root,
-  u64 bytenr, u64 num_bytes,
-  u64 parent, u64 ref_root,
-  u64 owner, u64 offset, u64 reserved, int action,
-  int *old_ref_mod, int *new_ref_mod);
+  struct btrfs_ref *generic_ref,
+  u64 reserved, int *old_ref_mod,
+  int *new_ref_mod);
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ecfa0234863b..fa5dd3dfe2e7 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2049,10 +2049,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
NULL, &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
-num_bytes, parent,
-root_objectid, owner, offset,
-0, BTRFS_ADD_DELAYED_REF,
+   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
+   ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
 &old_ref_mod, &new_ref_mod);
}
 
@@ -7114,10 +7112,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   ret = btrfs_add_delayed_data_ref(trans, root, bytenr,
-num_bytes, parent,
-

[PATCH 8/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

2018-12-05 Thread Qu Wenruo
Similar to btrfs_inc_extent_ref(), just use btrfs_ref to replace the
long parameter list and the confusing @owner parameter.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  5 +---
 fs/btrfs/extent-tree.c | 53 ++
 fs/btrfs/file.c| 23 ++
 fs/btrfs/inode.c   | 13 +++
 fs/btrfs/relocation.c  | 26 +
 5 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index db3df5ce6087..9ed55a29993d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2671,10 +2671,7 @@ int btrfs_set_disk_extent_flags(struct 
btrfs_trans_handle *trans,
struct btrfs_fs_info *fs_info,
u64 bytenr, u64 num_bytes, u64 flags,
int level, int is_data);
-int btrfs_free_extent(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, bool for_reloc);
+int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref);
 
 int btrfs_free_reserved_extent(struct btrfs_fs_info *fs_info,
   u64 start, u64 len, int delalloc);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ff60091aef6b..8a6a73006dc4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3255,10 +3255,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
if (inc)
ret = btrfs_inc_extent_ref(trans, &generic_ref);
else
-   ret = btrfs_free_extent(trans, root, bytenr,
-   num_bytes, parent, ref_root,
-   key.objectid, key.offset,
-   for_reloc);
+   ret = btrfs_free_extent(trans, &generic_ref);
if (ret)
goto fail;
} else {
@@ -3272,9 +3269,7 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
if (inc)
ret = btrfs_inc_extent_ref(trans, &generic_ref);
else
-   ret = btrfs_free_extent(trans, root, bytenr,
-   num_bytes, parent, ref_root,
-   level - 1, 0, for_reloc);
+   ret = btrfs_free_extent(trans, &generic_ref);
if (ret)
goto fail;
}
@@ -7073,47 +7068,43 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
 }
 
 /* Can return -ENOMEM */
-int btrfs_free_extent(struct btrfs_trans_handle *trans,
- struct btrfs_root *root,
- u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
- u64 owner, u64 offset, bool for_reloc)
+int btrfs_free_extent(struct btrfs_trans_handle *trans, struct btrfs_ref *ref)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_ref generic_ref = { 0 };
+   struct btrfs_fs_info *fs_info = trans->fs_info;
int old_ref_mod, new_ref_mod;
int ret;
 
if (btrfs_is_testing(fs_info))
return 0;
 
-   btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF, bytenr,
-  num_bytes, root->root_key.objectid, parent);
-   generic_ref.skip_qgroup = for_reloc;
/*
 * tree log blocks never actually go into the extent allocation
 * tree, just update pinning info and exit early.
 */
-   if (root_objectid == BTRFS_TREE_LOG_OBJECTID) {
-   WARN_ON(owner >= BTRFS_FIRST_FREE_OBJECTID);
+   if ((ref->type == BTRFS_REF_METADATA &&
+ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID) ||
+   (ref->type == BTRFS_REF_DATA &&
+ref->data_ref.ref_root == BTRFS_TREE_LOG_OBJECTID)) {
/* unlocks the pinned mutex */
-   btrfs_pin_extent(fs_info, bytenr, num_bytes, 1);
+   btrfs_pin_extent(fs_info, ref->bytenr, ref->len, 1);
old_ref_mod = new_ref_mod = 0;
ret = 0;
-   } else if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-   btrfs_init_tree_ref(&generic_ref, (int)owner, root_objectid);
-   ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
+   } else if (ref->type == BTRFS_REF_METADATA) {
+   ret = btrfs_add_delayed_tree_ref(trans, ref, NULL,
 &old_ref_mod, &new_ref_mod);
} else {
-   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
-   ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
+   ret 

[PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-05 Thread Qu Wenruo
The process_func function hook is never used anywhere else.

Open code it to make the later delayed ref refactoring easier, so we can
refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
patches.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ea2c3d5220f0..ea68d288d761 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
int i;
int level;
int ret = 0;
-   int (*process_func)(struct btrfs_trans_handle *,
-   struct btrfs_root *,
-   u64, u64, u64, u64, u64, u64, bool);
-
 
if (btrfs_is_testing(fs_info))
return 0;
@@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
return 0;
 
-   if (inc)
-   process_func = btrfs_inc_extent_ref;
-   else
-   process_func = btrfs_free_extent;
-
if (full_backref)
parent = buf->start;
else
@@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
key.offset -= btrfs_file_extent_offset(buf, fi);
-   ret = process_func(trans, root, bytenr, num_bytes,
-  parent, ref_root, key.objectid,
-  key.offset, for_reloc);
+   if (inc)
+   ret = btrfs_inc_extent_ref(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   key.objectid, key.offset,
+   for_reloc);
+   else
+   ret = btrfs_free_extent(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   key.objectid, key.offset,
+   for_reloc);
if (ret)
goto fail;
} else {
bytenr = btrfs_node_blockptr(buf, i);
num_bytes = fs_info->nodesize;
-   ret = process_func(trans, root, bytenr, num_bytes,
-  parent, ref_root, level - 1, 0,
-  for_reloc);
+   if (inc)
+   ret = btrfs_inc_extent_ref(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   level - 1, 0, for_reloc);
+   else
+   ret = btrfs_free_extent(trans, root, bytenr,
+   num_bytes, parent, ref_root,
+   level - 1, 0, for_reloc);
if (ret)
goto fail;
}
-- 
2.19.2



[PATCH 7/8] btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()

2018-12-05 Thread Qu Wenruo
Now we don't need to play the dirty game of reusing @owner for tree block
level.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  6 ++---
 fs/btrfs/extent-tree.c | 58 ++
 fs/btrfs/file.c| 20 ++-
 fs/btrfs/inode.c   | 10 +---
 fs/btrfs/ioctl.c   | 17 -
 fs/btrfs/relocation.c  | 44 
 fs/btrfs/tree-log.c| 12 ++---
 7 files changed, 100 insertions(+), 67 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 6f4b1e605736..db3df5ce6087 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -40,6 +40,7 @@ extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
 extern struct kmem_cache *btrfs_free_space_cachep;
 struct btrfs_ordered_sum;
+struct btrfs_ref;
 
 #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
 
@@ -2682,10 +2683,7 @@ int btrfs_free_and_pin_reserved_extent(struct 
btrfs_fs_info *fs_info,
 void btrfs_prepare_extent_commit(struct btrfs_fs_info *fs_info);
 int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans);
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
-u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset,
-bool for_reloc);
+struct btrfs_ref *generic_ref);
 
 int btrfs_start_dirty_block_groups(struct btrfs_trans_handle *trans);
 int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 70c05ca30d9a..ff60091aef6b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2026,36 +2026,28 @@ int btrfs_discard_extent(struct btrfs_fs_info *fs_info, 
u64 bytenr,
 
 /* Can return -ENOMEM */
 int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
-struct btrfs_root *root,
-u64 bytenr, u64 num_bytes, u64 parent,
-u64 root_objectid, u64 owner, u64 offset,
-bool for_reloc)
+struct btrfs_ref *generic_ref)
 {
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   struct btrfs_ref generic_ref = { 0 };
+   struct btrfs_fs_info *fs_info = trans->fs_info;
int old_ref_mod, new_ref_mod;
int ret;
 
-   BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
-  root_objectid == BTRFS_TREE_LOG_OBJECTID);
+   BUG_ON(generic_ref->type == BTRFS_REF_NOT_SET ||
+  !generic_ref->action);
+   BUG_ON(generic_ref->type == BTRFS_REF_METADATA &&
+  generic_ref->tree_ref.root == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF, bytenr,
-  num_bytes, root->root_key.objectid, parent);
-   generic_ref.skip_qgroup = for_reloc;
-   if (owner < BTRFS_FIRST_FREE_OBJECTID) {
-   btrfs_init_tree_ref(&generic_ref, (int)owner, root_objectid);
-   ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
+   if (generic_ref->type == BTRFS_REF_METADATA)
+   ret = btrfs_add_delayed_tree_ref(trans, generic_ref,
NULL, &old_ref_mod, &new_ref_mod);
-   } else {
-   btrfs_init_data_ref(&generic_ref, root_objectid, owner, offset);
-   ret = btrfs_add_delayed_data_ref(trans, &generic_ref, 0,
+   else
+   ret = btrfs_add_delayed_data_ref(trans, generic_ref, 0,
 &old_ref_mod, &new_ref_mod);
-   }
 
-   btrfs_ref_tree_mod(fs_info, &generic_ref);
+   btrfs_ref_tree_mod(fs_info, generic_ref);
 
if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0)
-   add_pinned_bytes(fs_info, &generic_ref);
+   add_pinned_bytes(fs_info, generic_ref);
 
return ret;
 }
@@ -3212,8 +3204,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
u32 nritems;
struct btrfs_key key;
struct btrfs_file_extent_item *fi;
+   struct btrfs_ref generic_ref = { 0 };
bool for_reloc = btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC);
int i;
+   int action;
int level;
int ret = 0;
 
@@ -3231,6 +3225,10 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
parent = buf->start;
else
parent = 0;
+   if (inc)
+   action = BTRFS_ADD_DELAYED_REF;
+   else
+   action = BTRFS_DROP_DELAYED_REF;
 
for (i = 0; i < nritems; i++) {
if (level == 0) {
@@ -3248,11 +3246,14 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle 
*trans,
 
num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
key.offset -= btrfs_file_extent_offset(buf, fi);
+   btrfs_init_generic_ref(&generic_ref, 

[PATCH 5/8] btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()

2018-12-05 Thread Qu Wenruo
It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as
btrfs_ref describes a metadata/data reference update comprehensively.

Now we have one less function using the confusing owner/level trick.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 27 +++--
 fs/btrfs/ref-verify.c  | 53 --
 fs/btrfs/ref-verify.h  | 10 
 3 files changed, 42 insertions(+), 48 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fa5dd3dfe2e7..1d812bc2c7fc 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2038,9 +2038,6 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
BUG_ON(owner < BTRFS_FIRST_FREE_OBJECTID &&
   root_objectid == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid,
-  owner, offset, BTRFS_ADD_DELAYED_REF);
-
btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_REF, bytenr,
   num_bytes, root->root_key.objectid, parent);
generic_ref.skip_qgroup = for_reloc;
@@ -2054,6 +2051,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 _ref_mod, _ref_mod);
}
 
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
+
if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) {
bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
 
@@ -7025,10 +7024,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
if (root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID) {
int old_ref_mod, new_ref_mod;
 
-   btrfs_ref_tree_mod(root, buf->start, buf->len, parent,
-  root->root_key.objectid,
-  btrfs_header_level(buf), 0,
-  BTRFS_DROP_DELAYED_REF);
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
ret = btrfs_add_delayed_tree_ref(trans, &generic_ref, NULL,
 &old_ref_mod, &new_ref_mod);
BUG_ON(ret); /* -ENOMEM */
@@ -7089,11 +7085,6 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
if (btrfs_is_testing(fs_info))
return 0;
 
-   if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
-   btrfs_ref_tree_mod(root, bytenr, num_bytes, parent,
-  root_objectid, owner, offset,
-  BTRFS_DROP_DELAYED_REF);
-
-   btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF, bytenr,
   num_bytes, root->root_key.objectid, parent);
generic_ref.skip_qgroup = for_reloc;
@@ -7117,6 +7108,9 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
 &old_ref_mod, &new_ref_mod);
}
 
+   if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
+
if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) {
bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
 
@@ -8083,14 +8077,11 @@ int btrfs_alloc_reserved_file_extent(struct 
btrfs_trans_handle *trans,
 
BUG_ON(root->root_key.objectid == BTRFS_TREE_LOG_OBJECTID);
 
-   btrfs_ref_tree_mod(root, ins->objectid, ins->offset, 0,
-  root->root_key.objectid, owner, offset,
-  BTRFS_ADD_DELAYED_EXTENT);
-
  btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
     ins->objectid, ins->offset,
     root->root_key.objectid, 0);
  btrfs_init_data_ref(&generic_ref, root->root_key.objectid, owner, 
offset);
+   btrfs_ref_tree_mod(root->fs_info, &generic_ref);
  ret = btrfs_add_delayed_data_ref(trans, &generic_ref,
 ram_bytes, NULL, NULL);
return ret;
@@ -8338,13 +8329,11 @@ struct extent_buffer *btrfs_alloc_tree_block(struct 
btrfs_trans_handle *trans,
extent_op->is_data = false;
extent_op->level = level;
 
-   btrfs_ref_tree_mod(root, ins.objectid, ins.offset, parent,
-  root_objectid, level, 0,
-  BTRFS_ADD_DELAYED_EXTENT);
  btrfs_init_generic_ref(&generic_ref, BTRFS_ADD_DELAYED_EXTENT,
     ins.objectid, ins.offset,
     root->root_key.objectid, parent);
  btrfs_init_tree_ref(&generic_ref, level, root_objectid);
+   btrfs_ref_tree_mod(fs_info, &generic_ref);
  ret = btrfs_add_delayed_tree_ref(trans, &generic_ref,
 extent_op, NULL, NULL);
if (ret)
diff --git a/fs/btrfs/ref-verify.c b/fs/btrfs/ref-verify.c
index d69fbfb30aa9..f6c15b612570 100644
--- a/fs/btrfs/ref-verify.c
+++ b/fs/btrfs/ref-verify.c
@@ -670,36 +670,43 @@ static 

[PATCH 3/8] btrfs: delayed-ref: Use btrfs_ref to refactor btrfs_add_delayed_tree_ref()

2018-12-05 Thread Qu Wenruo
btrfs_add_delayed_tree_ref() has a longer and longer parameter list, and
some callers like btrfs_inc_extent_ref() are using @owner as the level for
the delayed tree ref.

Instead of making the parameter list longer and longer, use btrfs_ref to
refactor it, so each parameter assignment is self-explanatory, without the
dirty level/owner trick, and provides the basis for later refactoring.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 24 ++---
 fs/btrfs/delayed-ref.h |  4 +---
 fs/btrfs/extent-tree.c | 48 --
 3 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 11dd46be4017..c42b8ade7b07 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -710,9 +710,7 @@ static void init_delayed_ref_common(struct btrfs_fs_info 
*fs_info,
  * transaction commits.
  */
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
-  u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root,  int level, bool for_reloc,
-  int action,
+  struct btrfs_ref *generic_ref,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod)
 {
@@ -722,10 +720,17 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
struct btrfs_delayed_ref_root *delayed_refs;
struct btrfs_qgroup_extent_record *record = NULL;
int qrecord_inserted;
-   bool is_system = (ref_root == BTRFS_CHUNK_TREE_OBJECTID);
+   bool is_system = (generic_ref->real_root == BTRFS_CHUNK_TREE_OBJECTID);
+   int action = generic_ref->action;
+   int level = generic_ref->tree_ref.level;
int ret;
+   u64 bytenr = generic_ref->bytenr;
+   u64 num_bytes = generic_ref->len;
+   u64 parent = generic_ref->parent;
u8 ref_type;
 
+   ASSERT(generic_ref && generic_ref->type == BTRFS_REF_METADATA &&
+   generic_ref->action);
BUG_ON(extent_op && extent_op->is_data);
ref = kmem_cache_alloc(btrfs_delayed_tree_ref_cachep, GFP_NOFS);
if (!ref)
@@ -738,7 +743,9 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
}
 
if (test_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags) &&
-   is_fstree(ref_root) && !for_reloc) {
+   is_fstree(generic_ref->real_root) &&
+   is_fstree(generic_ref->tree_ref.root) &&
+   !generic_ref->skip_qgroup) {
record = kzalloc(sizeof(*record), GFP_NOFS);
if (!record) {
kmem_cache_free(btrfs_delayed_tree_ref_cachep, ref);
@@ -753,13 +760,14 @@ int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle 
*trans,
ref_type = BTRFS_TREE_BLOCK_REF_KEY;
 
init_delayed_ref_common(fs_info, &ref->node, bytenr, num_bytes,
-   ref_root, action, ref_type);
-   ref->root = ref_root;
+   generic_ref->tree_ref.root, action, ref_type);
+   ref->root = generic_ref->tree_ref.root;
ref->parent = parent;
ref->level = level;
 
init_delayed_ref_head(head_ref, record, bytenr, num_bytes,
- ref_root, 0, action, false, is_system);
+ generic_ref->tree_ref.root, 0, action, false,
+ is_system);
head_ref->extent_op = extent_op;
 
delayed_refs = &trans->transaction->delayed_refs;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e36d6b05d85e..dbe029c4e01b 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -333,9 +333,7 @@ static inline void btrfs_put_delayed_ref_head(struct 
btrfs_delayed_ref_head *hea
 }
 
 int btrfs_add_delayed_tree_ref(struct btrfs_trans_handle *trans,
-  u64 bytenr, u64 num_bytes, u64 parent,
-  u64 ref_root, int level, bool for_reloc,
-  int action,
+  struct btrfs_ref *generic_ref,
   struct btrfs_delayed_extent_op *extent_op,
   int *old_ref_mod, int *new_ref_mod);
 int btrfs_add_delayed_data_ref(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ea68d288d761..ecfa0234863b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2031,6 +2031,7 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans,
 bool for_reloc)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_ref generic_ref = { 0 };
int old_ref_mod, new_ref_mod;
int ret;
 
@@ -2040,13 +2041,13 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
btrfs_ref_tree_mod(root, bytenr, num_bytes, parent, root_objectid,
   

[PATCH 1/8] btrfs: delayed-ref: Introduce better documented delayed ref structures

2018-12-05 Thread Qu Wenruo
Current delayed ref interface has several problems:
- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
  ref_root
  owner
  offset
  for_reloc << Only qgroup code cares.

- Different interpretation for the same parameter
  Above, @owner for a data ref is the ino owning this extent,
  while for a tree ref it's the level. They are even in different size ranges.
  For level we only need 0~8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID.

  And @offset doesn't even make sense for a tree ref.

  Such parameter reuse may look clever as a hidden union, but it
  destroys code readability.

To solve both problems, we introduce a new structure, btrfs_ref:

- Structure instead of long parameter list
  This makes later expansion easier, and better documented.

- Use btrfs_ref::type to distinguish data and tree ref

- Use proper union to store data/tree ref specific structures.

- Use separate functions to fill data/tree ref data, with a common generic
  function to fill common bytenr/num_bytes members.

All parameters will find their place in btrfs_ref, and an extra member,
real_root, inspired by the ref-verify code, is newly introduced for later
qgroup code, to record which tree triggered this extent modification.

This patch doesn't touch any code, but provides the basis for incoming
refactors.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.h | 109 +
 1 file changed, 109 insertions(+)

diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d8fa12d3f2cc..e36d6b05d85e 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -176,6 +176,81 @@ struct btrfs_delayed_ref_root {
u64 qgroup_to_skip;
 };
 
+enum btrfs_ref_type {
+   BTRFS_REF_NOT_SET = 0,
+   BTRFS_REF_DATA,
+   BTRFS_REF_METADATA,
+   BTRFS_REF_LAST,
+};
+
+struct btrfs_data_ref {
+   /*
+* For EXTENT_DATA_REF
+*
+* @ref_root:   current owner of the extent.
+*  may differ from btrfs_ref::real_root.
+* @ino:inode number of the owner.
+* @offset: *CALCULATED* offset. Not EXTENT_DATA key offset.
+*
+*/
+   u64 ref_root;
+   u64 ino;
+   u64 offset;
+};
+
+struct btrfs_tree_ref {
+   /* Common for all sub types and skinny combination */
+   int level;
+
+   /*
+* For TREE_BLOCK_REF (skinny metadata, either inline or keyed)
+*
+* root here may differ from btrfs_ref::real_root.
+*/
+   u64 root;
+
+   /* For non-skinny metadata, no special member needed */
+};
+
+struct btrfs_ref {
+   enum btrfs_ref_type type;
+   int action;
+
+   /*
+* Use full backref(SHARED_BLOCK_REF or SHARED_DATA_REF) for this
+* extent and its children.
+* Set for reloc trees.
+*/
+   unsigned int use_fullback:1;
+
+   /*
+* Whether this extent should go through qgroup record.
+* Normally false, but for certain case like delayed subtree scan,
+* this can hugely reduce qgroup overhead.
+*/
+   unsigned int skip_qgroup:1;
+
+   /*
+* Who owns this reference modification, optional.
+*
+* One example:
+* When creating reloc tree for source fs, it will increase tree block
+* ref for children tree blocks.
+* In that case, btrfs_ref::real_root = reloc tree,
+* while btrfs_ref::tree_ref::root = fs tree.
+*/
+   u64 real_root;
+   u64 bytenr;
+   u64 len;
+
+   /* Common @parent for SHARED_DATA_REF/SHARED_BLOCK_REF */
+   u64 parent;
+   union {
+   struct btrfs_data_ref data_ref;
+   struct btrfs_tree_ref tree_ref;
+   };
+};
+
 extern struct kmem_cache *btrfs_delayed_ref_head_cachep;
 extern struct kmem_cache *btrfs_delayed_tree_ref_cachep;
 extern struct kmem_cache *btrfs_delayed_data_ref_cachep;
@@ -184,6 +259,40 @@ extern struct kmem_cache *btrfs_delayed_extent_op_cachep;
 int __init btrfs_delayed_ref_init(void);
 void __cold btrfs_delayed_ref_exit(void);
 
+static inline void btrfs_init_generic_ref(struct btrfs_ref *generic_ref,
+   int action, u64 bytenr, u64 len, u64 real_root,
+   u64 parent)
+{
+   generic_ref->action = action;
+   generic_ref->bytenr = bytenr;
+   generic_ref->len = len;
+   generic_ref->real_root = real_root;
+   generic_ref->parent = parent;
+}
+
+static inline void btrfs_init_tree_ref(struct btrfs_ref *generic_ref,
+   int level, u64 root)
+{
+   /* If @real_root not set, use @root as fallback */
+   if (!generic_ref->real_root)
+   generic_ref->real_root = root;
+   generic_ref->tree_ref.level = level;
+   generic_ref->tree_ref.root = root;
+   generic_ref->type = BTRFS_REF_METADATA;
+}
+
+static inline void btrfs_init_data_ref(struct btrfs_ref 

[PATCH 6/8] btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()

2018-12-05 Thread Qu Wenruo
Since add_pinned_bytes() only needs to know if the extent is metadata
and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as
we don't need the various owner/level tricks to determine the extent type.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1d812bc2c7fc..70c05ca30d9a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -738,14 +738,15 @@ static struct btrfs_space_info *__find_space_info(struct 
btrfs_fs_info *info,
return NULL;
 }
 
-static void add_pinned_bytes(struct btrfs_fs_info *fs_info, s64 num_bytes,
-bool metadata, u64 root_objectid)
+static void add_pinned_bytes(struct btrfs_fs_info *fs_info,
+struct btrfs_ref *ref)
 {
struct btrfs_space_info *space_info;
+   s64 num_bytes = -ref->len;
u64 flags;
 
-   if (metadata) {
-   if (root_objectid == BTRFS_CHUNK_TREE_OBJECTID)
+   if (ref->type == BTRFS_REF_METADATA) {
+   if (ref->tree_ref.root == BTRFS_CHUNK_TREE_OBJECTID)
flags = BTRFS_BLOCK_GROUP_SYSTEM;
else
flags = BTRFS_BLOCK_GROUP_METADATA;
@@ -2053,11 +2054,8 @@ int btrfs_inc_extent_ref(struct btrfs_trans_handle 
*trans,
 
btrfs_ref_tree_mod(fs_info, &generic_ref);
 
-   if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0) {
-   bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
-
-   add_pinned_bytes(fs_info, -num_bytes, metadata, root_objectid);
-   }
+   if (ret == 0 && old_ref_mod < 0 && new_ref_mod >= 0)
+   add_pinned_bytes(fs_info, &generic_ref);
 
return ret;
 }
@@ -7059,8 +7057,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle 
*trans,
}
 out:
if (pin)
-   add_pinned_bytes(fs_info, buf->len, true,
-root->root_key.objectid);
+   add_pinned_bytes(fs_info, &generic_ref);
 
if (last_ref) {
/*
@@ -7111,11 +7108,8 @@ int btrfs_free_extent(struct btrfs_trans_handle *trans,
if (root_objectid != BTRFS_TREE_LOG_OBJECTID)
btrfs_ref_tree_mod(fs_info, &generic_ref);
 
-   if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0) {
-   bool metadata = owner < BTRFS_FIRST_FREE_OBJECTID;
-
-   add_pinned_bytes(fs_info, num_bytes, metadata, root_objectid);
-   }
+   if (ret == 0 && old_ref_mod >= 0 && new_ref_mod < 0)
+   add_pinned_bytes(fs_info, &generic_ref);
 
return ret;
 }
-- 
2.19.2



[PATCH 0/8] btrfs: Refactor delayed ref parameter list

2018-12-05 Thread Qu Wenruo
Current delayed ref interface has several problems:
- Longer and longer parameter lists
  bytenr
  num_bytes
  parent
   So far so good
  ref_root
  owner
  offset
   I don't feel well now
  for_reloc
   This parameter only makes sense for qgroup code, but we need
   to pass the parameter a long way.

  This makes later expansion of the parameter list more and more tricky.

- Different interpretation for the same parameter
  Above, @owner for a data ref is the ino that owns this extent,
  while for a tree ref it's the level. They are even in different size ranges.

  For level we only need 0~8, while for ino it's
  BTRFS_FIRST_FREE_OBJECTID~BTRFS_LAST_FREE_OBJECTID, so it's still
  possible to distinguish them, but it's never a straight-forward thing
  to grasp.

  And @offset doesn't even make sense for a tree ref.

  Such parameter reuse may look clever as a hidden union, but it
  destroys code readability.

This patchset will change the way how we pass parameters for delayed
ref.
Instead of calling delayed ref interface like:
  ret = btrfs_inc_extent_ref(trans, root, bytenr, num_bytes, parent,
 ref_root, owner, offset);
Or
  ret = btrfs_inc_extent_ref(trans, root, bytenr, nodesize, parent,
 level, ref_root, 0);

We now call like:
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
 root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ret = btrfs_inc_extent_ref(trans, &ref);
Or
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
 root->root_key.objectid, parent);
  btrfs_init_tree_ref(&ref, level, ref_root);
  ret = btrfs_inc_extent_ref(trans, &ref);

To determine if a ref is tree or data, instead of calling like:
  if (owner < BTRFS_FIRST_FREE_OBJECTID) {
  } else {
  }
We do it straight-forward:
  if (ref->type == BTRFS_REF_METADATA) {
  } else {
  }

And for new, minor members we don't need to add a new
parameter to btrfs_add_delayed_tree|data_ref() or
btrfs_inc_extent_ref(); just assign them after the generic/data/tree init:
  btrfs_init_generic_ref(&ref, bytenr, num_bytes,
 root->root_key.objectid, parent);
  btrfs_init_data_ref(&ref, ref_root, owner, offset);
  ref.skip_qgroup = true; /* @skip_qgroup defaults to false, so new
  code doesn't need to care */
  ret = btrfs_inc_extent_ref(trans, &ref);

This should improve the code readability and make later code easier to
write.


Qu Wenruo (8):
  btrfs: delayed-ref: Introduce better documented delayed ref structures
  btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref
  btrfs: delayed-ref: Use btrfs_ref to refactor
btrfs_add_delayed_tree_ref()
  btrfs: delayed-ref: Use btrfs_ref to refactor
btrfs_add_delayed_data_ref()
  btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()
  btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()
  btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()

 fs/btrfs/ctree.h   |  11 +--
 fs/btrfs/delayed-ref.c |  43 ++---
 fs/btrfs/delayed-ref.h | 121 +++--
 fs/btrfs/extent-tree.c | 195 +++--
 fs/btrfs/file.c|  43 +
 fs/btrfs/inode.c   |  23 +++--
 fs/btrfs/ioctl.c   |  17 ++--
 fs/btrfs/ref-verify.c  |  53 ++-
 fs/btrfs/ref-verify.h  |  10 +--
 fs/btrfs/relocation.c  |  70 +--
 fs/btrfs/tree-log.c|  12 ++-
 11 files changed, 375 insertions(+), 223 deletions(-)

-- 
2.19.2



Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-05 Thread Roman Mamedov
On Thu, 6 Dec 2018 06:11:46 +
Robert White  wrote:

> So it would be dog-slow, but it would be neat if BTRFS had a mount 
> option to convert any TRIM command from above into the write of a zero, 
> 0xFF, or trash block to the device below if that device doesn't support 
> TRIM. Real TRIM support would override the block write.

There is such a project:

  "dm-linear like target which provides discard, but replaces it with write of
  random data to a discarded region. Thus, discarded data is securely deleted."

  https://github.com/vt-alt/dm-secdel

-- 
With respect,
Roman


What if TRIM issued a wipe on devices that don't TRIM?

2018-12-05 Thread Robert White
(1) Automatic and selective wiping of unused and previously used disk 
blocks is a good security measure, particularly when there is an 
encryption layer beneath the file system.


(2) USB attached devices _never_ support TRIM and they are the most 
likely to fall into strangers hands.


(3) I vaguely recall that some flash chips will take bulk writes of 
full sectors of 0x00 or 0xFF (I don't remember which) as second-best 
to TRIM for letting the flash controllers defragment their internals.


So it would be dog-slow, but it would be neat if BTRFS had a mount 
option to convert any TRIM command from above into the write of a zero, 
0xFF, or trash block to the device below if that device doesn't support 
TRIM. Real TRIM support would override the block write.


Obviously doing an fstrim would involve a lot of slow device writes but 
only for people likely to do that sort of thing.


For testing purposes the destruction of unused pages in this manner 
might catch file system failures or coding errors.


(The other layer where this might be most appropriate is in cryptsetup 
et al, where it could lie about TRIM support, but that sort of stealth 
lag might be bad for filesystem-level operations. Doing it there would 
also lose the simpler USB use cases.)


...Just a thought...

--Rob White.






Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote:

This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


Yep.  It has to be nodatasum, at least within the confines of datasum 
today.  DAX mmap writes are essentially in the same situation as with 
direct i/o when another thread modifies the buffer being submitted. 
Except rather than it being a race, it happens every time.  An 
alternative here could be to add the ability to mark a crc as unreliable 
and then go back and update them once the last DAX mmap reference is 
dropped on a range.  There's no reason to make this a requirement of the 
initial implementation, though.



2. Currently, a user can continue writing on "old" extents of an mmaped file
after a snapshot has been created. How can we enforce writes to be directed
to new extents after snapshots have been created? Do we keep a list of
all mmap()s, and re-mmap them after a snapshot?


It's the second question that's the hard part.  As Adam describes later, 
setting each pfn read-only will ensure page faults cause the remapping.


The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.


-Jeff


Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

  fs/btrfs/Makefile   |1
  fs/btrfs/ctree.h|   17 ++
  fs/btrfs/dax.c  |  303 ++--
  fs/btrfs/file.c |   29 
  fs/btrfs/inode.c|   54 +
  fs/btrfs/ioctl.c|5
  fs/btrfs/super.c|   15 ++
  fs/dax.c|   35 --
  include/linux/dax.h |   16 ++
  9 files changed, 430 insertions(+), 45 deletions(-)




--
Jeff Mahoney
SUSE Labs



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 8:03 AM, Qu Wenruo wrote:



On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:

This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.


The general idea is:

1) there is no page cache involved. read() and write() are like direct 
i/o writes in concept.  The user buffer is written directly (via what is 
essentially a specialized memcpy) to the NVDIMM.
2) for mmap, once the mapping is established and mapped, the file system 
is not involved.  The application writes directly to the memory as it 
would a normal mmap, except it's persistent.  All that's required to 
ensure persistence is a CPU cache flush.  The only way the file system 
is involved again is if some operation has occurred to reset the WP bit.
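
Just to illustrate that model, a minimal userspace sketch using libpmem
(the path is a placeholder, error handling trimmed):

#include <libpmem.h>
#include <string.h>

int main(void)
{
	size_t mapped_len;
	int is_pmem;
	/* placeholder path: any file on a dax-capable mount */
	char *buf = pmem_map_file("/mnt/dax/file", 4096, PMEM_FILE_CREATE,
				  0600, &mapped_len, &is_pmem);

	if (!buf)
		return 1;
	strcpy(buf, "hello");		/* store goes straight to the NVDIMM */
	if (is_pmem)
		pmem_persist(buf, 6);	/* CPU cache flush, no kernel involved */
	else
		pmem_msync(buf, 6);	/* not real pmem: fall back to msync */
	pmem_unmap(buf, mapped_len);
	return 0;
}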



If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.


It can't, because it is only invoked on the page fault path and we want 
to try to limit those as much as possible.



Forcing page fault for every mmap() read/write (completely disable page
cache like DIO).
So that we could get some control since we're informed to read the page
and do some hacks there.
There's no way to force a page fault for every mmap read/write.  Even if 
there was, we wouldn't want that.  No user would turn that on when they 
can just make similar guarantees in their app (which are typically apps 
that do this already) and not pay any performance penalty.   The idea 
with DAX mmap is that the file system manages the namespace, space 
allocation, and permissions.  Otherwise we stay out of the way.


-Jeff
--
Jeff Mahoney
SUSE Labs



Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-05 Thread Filipe Manana
On Wed, Dec 5, 2018 at 5:14 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
>
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
> 35964  owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
>
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to do
> something.  This is a problem because if we modify the tree we could cow a 
> block
> in our path, which frees our reference to that block.  Then once we get back 
> to
> that shared block we'll free our reference to it again, and get ENOENT when
> trying to lookup our extent reference to that block in __btrfs_free_extent.
>
> This is ultimately happening because we have delayed items left to be 
> processed
> for our deleted snapshot _after_ all of the inodes are closed for the 
> snapshot.
> We only run the delayed inode item if we're deleting the inode, and even then 
> we
> do not run the delayed insertions or delayed removals.  These can be run at 
> any
> point after our final inode does its last iput, which is what triggers the
> snapshot deletion.  We can end up with the snapshot deletion happening and 
> then
> have the delayed items run on that file system, resulting in the above 
> problem.
>
> This problem has existed forever, however my patches made it much easier to 
> hit
> as I wake up the cleaner much more often to deal with delayed iputs, which 
> made
> us more likely to start the snapshot dropping work before the transaction
> commits, which is when the delayed items would generally be run.  Before,
> generally speaking, we would run the delayed items, commit the transaction, 
> and
> wakeup the cleaner thread to start deleting snapshots, which means we were 
> less
> likely to hit this problem.  You could still hit it if you had multiple
> snapshots to be deleted and ended up with lots of delayed items, but it was
> definitely harder.
>
> Fix for now by simply running all the delayed items before starting to drop 
> the
> snapshot.  We could make this smarter in the future by making the delayed 
> items
> per-root, and then simply drop any delayed items for roots that we are going 
> to
> delete.  But for now just a quick and easy solution is the safest.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 

Reviewed-by: Filipe Manana 

Looks good now. Thanks.

> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave I can reroll the series, but the second version of patch 1 is the same,
>   let me know what you want.
>
>  fs/btrfs/extent-tree.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dcb699dd57f3..473084eb7a2d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> goto out_free;
> }
>
> +   err = btrfs_run_delayed_items(trans);
> +   if (err)
> +   goto out_end_trans;
> +
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: btrfs progs always assume devid 1?

2018-12-05 Thread Austin S. Hemmelgarn

On 2018-12-05 14:50, Roman Mamedov wrote:

Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
Total devices 1 FS bytes used 36.63GiB
devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such operation), then count
that as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for hard-coded devid 1.



I've been meaning to try and write up a patch to special-case this for a 
while now, but have not gotten around to it yet.


FWIW, this is one of multiple reasons that it's highly recommended to 
use `btrfs replace` instead of adding a new device and deleting the old 
one when replacing a device.  Other benefits include:


* It doesn't have to run in the foreground (and doesn't by default).
* It usually takes less time.
* Replace operations can be queried while running to get a nice 
indication of the completion percentage.


The only disadvantage is that the new device has to be at least as large 
as the old one (though you can get around this to a limited degree by 
shrinking the old device), and it needs the old and new device to be 
plugged in at the same time (add/delete doesn't, if you flip the order 
of the add and delete commands).
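
For reference, the whole flow is just (device paths below are examples):

# btrfs replace start /dev/old /dev/new /mnt
# btrfs replace status /mnt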


btrfs progs always assume devid 1?

2018-12-05 Thread Roman Mamedov
Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
Total devices 1 FS bytes used 36.63GiB
devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such operation), then count
that as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for hard-coded devid 1.

-- 
With respect,
Roman


Re: Linux-next regression?

2018-12-05 Thread Chris Mason
On 5 Dec 2018, at 5:59, Andrea Gelmini wrote:

> On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote:
>> I think (hope) this is:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Which was just nailed down to a blkmq bug.  It triggers when you have
>> scsi devices using elevator=none over blkmq.
>
> Thanks a lot Chris. Really.
> Good news: I confirm I recompiled and used blkmq and no-op (at that 
> time).
> Also, the massive write of btrfs defrag can explain the massive 
> trigger of
> the bug, and next corruption.

Sorry this happened, but glad you were able to confirm that it explains 
the trouble you hit.  Thanks for the report, I did end up using this as 
a datapoint to convince myself the bugzilla above wasn't ext4 specific.
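
For anyone wanting to check their own setup: the active scheduler is
visible in sysfs, and the bracketed entry is the one in use (sda and the
output below are just an example):

# cat /sys/block/sda/queue/scheduler
[none] mq-deadline kyber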

-chris


[PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-05 Thread Josef Bacik
From: Josef Bacik 

With my delayed refs patches in place we started seeing a large amount
of aborts in __btrfs_free_extent

BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
35964  owner 1 offset 0
Call Trace:
 ? btrfs_merge_delayed_refs+0xaf/0x340
 __btrfs_run_delayed_refs+0x6ea/0xfc0
 ? btrfs_set_path_blocking+0x31/0x60
 btrfs_run_delayed_refs+0xeb/0x180
 btrfs_commit_transaction+0x179/0x7f0
 ? btrfs_check_space_for_delayed_refs+0x30/0x50
 ? should_end_transaction.isra.19+0xe/0x40
 btrfs_drop_snapshot+0x41c/0x7c0
 btrfs_clean_one_deleted_snapshot+0xb5/0xd0
 cleaner_kthread+0xf6/0x120
 kthread+0xf8/0x130
 ? btree_invalidatepage+0x90/0x90
 ? kthread_bind+0x10/0x10
 ret_from_fork+0x35/0x40

This was because btrfs_drop_snapshot depends on the root not being modified
while it's dropping the snapshot.  It will unlock the root node (and really
every node) as it walks down the tree, only to re-lock it when it needs to do
something.  This is a problem because if we modify the tree we could cow a block
in our path, which frees our reference to that block.  Then once we get back to
that shared block we'll free our reference to it again, and get ENOENT when
trying to lookup our extent reference to that block in __btrfs_free_extent.

This is ultimately happening because we have delayed items left to be processed
for our deleted snapshot _after_ all of the inodes are closed for the snapshot.
We only run the delayed inode item if we're deleting the inode, and even then we
do not run the delayed insertions or delayed removals.  These can be run at any
point after our final inode does its last iput, which is what triggers the
snapshot deletion.  We can end up with the snapshot deletion happening and then
have the delayed items run on that file system, resulting in the above problem.

This problem has existed forever, however my patches made it much easier to hit
as I wake up the cleaner much more often to deal with delayed iputs, which made
us more likely to start the snapshot dropping work before the transaction
commits, which is when the delayed items would generally be run.  Before,
generally speaking, we would run the delayed items, commit the transaction, and
wakeup the cleaner thread to start deleting snapshots, which means we were less
likely to hit this problem.  You could still hit it if you had multiple
snapshots to be deleted and ended up with lots of delayed items, but it was
definitely harder.

Fix for now by simply running all the delayed items before starting to drop the
snapshot.  We could make this smarter in the future by making the delayed items
per-root, and then simply drop any delayed items for roots that we are going to
delete.  But for now just a quick and easy solution is the safest.

Cc: sta...@vger.kernel.org
Signed-off-by: Josef Bacik 
---
v1->v2:
- check for errors from btrfs_run_delayed_items.
- Dave I can reroll the series, but the second version of patch 1 is the same,
  let me know what you want.

 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dcb699dd57f3..473084eb7a2d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
goto out_free;
}
 
+   err = btrfs_run_delayed_items(trans);
+   if (err)
+   goto out_end_trans;
+
if (block_rsv)
trans->block_rsv = block_rsv;
 
-- 
2.14.3



Re: [RFC PATCH] btrfs: Remove __extent_readpages

2018-12-05 Thread Josef Bacik
On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as an 
> example)
> then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) so are ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages. 
> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance 
> measurements with fio I was shocked to see that it actually is detrimental to 
> performance. Here are the results: 
> 
> The command line used for fio: 
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller 
> than 
> the memory, so pages should have been nicely readahead. 
> 

What this patch changes is now we aren't reading all of the pages we are passed
by the readahead, so now we fall back to per-page reading when we go to read
those pages because the readahead window has already moved past them.  This
isn't great behavior, what we have is nice in that it tries to group the entire
range together as much as possible.  What your patch changes is that as soon as
we stop having a contiguous range we just bail and submit what we have.  Testing
it in isolation is likely to show very little change, but having recently
touched the fault in code where we definitely do not count major faults in some
cases I'd suspect that we're not reflecting this higher fault rate in the
performance counters properly.  We should preserve the existing behavior, what
hurts a little bit on a lightly loaded box is going to hurt a whole lot more on
a heavily loaded box.  Thanks,

Josef


Re: [PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-12-05 Thread David Sterba
On Wed, Nov 21, 2018 at 05:10:52PM +0200, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>btrfs_balance+0xbeb/0x1740 [btrfs]
>btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This happens because during page writeback it's valid for
> writepage_delalloc to instantiate a delalloc range which doesn't
> belong to the page currently being written back.
> 
> The reason this case is valid is due to find_lock_delalloc_range
> returning any available range after the passed delalloc_start and
> ignoring whether the page under writeback is within that range.
> In turn ordered extents (OE) are always created for the returned range
> from find_lock_delalloc_range. If, however, a failure occurs while OE
> are being created then the clean up code in btrfs_cleanup_ordered_extents
> will be called.
> 
> Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
> the case of such 'foreign' range being processed and instead it always
> assumes that the range OE are created for belongs to the page. This
> leads to the first page of such foreign range not being cleaned up, since
> it's deliberately skipped by the current cleanup code.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Reviewed-by: Josef Bacik 

Added to misc-next, thanks.


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 02:43:03PM +0200, Nikolay Borisov wrote:
> One question below though .
> 
> > +++ b/fs/btrfs/super.c
> > @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> > char *options,
> > case Opt_user_subvol_rm_allowed:
> > btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
> > break;
> > +#ifdef CONFIG_FS_DAX
> > +   case Opt_dax:
> > +   if (btrfs_super_num_devices(info->super_copy) > 1) {
> > +   btrfs_info(info,
> > +  "dax not supported for multi-device 
> > btrfs partition\n");
> 
> What prevents supporting dax for multiple devices so long as all devices
> are dax?

As I mentioned in a separate mail, most profiles are either redundant
(RAID0), require hardware support (RAID1, DUP) or are impossible (RAID5,
RAID6).

But, "single" profile multi-device would be useful and actually provide
something other dax-supporting filesystems don't have: combining multiple
devices into one logical piece.
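
For example, something like this (device names are placeholders):

# mkfs.btrfs -d single -m single /dev/pmem0 /dev/pmem1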

On the other hand, DUP profiles need to be banned.  In particular, the
filesystem you mount might have existing DUP block groups.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Christoph Hellwig
If you want to export these at all they have to be EXPORT_SYMBOL_GPL.

But I'd really like to avoid seeing another duplicate DAX I/O path.
Please try to adopt the existing iomap-based infrastructure for your
needs first.


Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-05 Thread Julia Lawall



On Wed, 5 Dec 2018, Johannes Thumshirn wrote:

> Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can be
> replaced by the offset_in_page() macro instead of open-coding it.
>
> Add a coccinelle semantic patch to ease detection and conversion of these.
>
> This unfortunately doesn't account for the case when we want PAGE_ALIGNED()
> instead of offset_in_page() yet.
>
> Cc: Julia Lawall 
> Signed-off-by: Johannes Thumshirn 
> ---
>  scripts/coccinelle/api/offset_in_page.cocci | 81 
> +
>  1 file changed, 81 insertions(+)
>  create mode 100644 scripts/coccinelle/api/offset_in_page.cocci
>
> diff --git a/scripts/coccinelle/api/offset_in_page.cocci 
> b/scripts/coccinelle/api/offset_in_page.cocci
> new file mode 100644
> index ..ea5b3a8e0390
> --- /dev/null
> +++ b/scripts/coccinelle/api/offset_in_page.cocci
> @@ -0,0 +1,81 @@
> +// SPDX-License-Identifier: GPL-2.0
> +///
> +/// Use offset_in_page macro on address instead of explicit computation.
> +///
> +//  Confidence: High
> +//  Keywords: offset_in_page
> +//  Comment: Based on vma_pages.cocci
> +
> +virtual context
> +virtual patch
> +virtual org
> +virtual report
> +
> +
> +//--
> +//  For context mode
> +//--
> +
> +@r_context depends on context && !patch && !org && !report@
> +expression E;
> +@@
> +
> +(
> +* E & ~PAGE_MASK
> +|
> +* E & (PAGE_SIZE - 1)
> +)
> +
> +
> +//--
> +//  For patch mode
> +//--
> +
> +@r_patch depends on !context && patch && !org && !report@
> +expression E;
> +type T;
> +@@
> +
> +(
> +- E & ~PAGE_MASK
> ++ offset_in_page(E)
> +|
> +- E & (PAGE_SIZE - 1)
> ++ offset_in_page(E)

The two lines above should be subsumed by the two lines below.  When there
is a type metavariable that has no other dependencies, an isomorphism will
consider that it is either present or absent.

Why not include the cast case for the context and org cases?

Masahiro will ultimately commit this.  I have added him to CC.

Thanks for the contribution.

julia


> +|
> +- E & ((T)PAGE_SIZE - 1)
> ++ offset_in_page(E)
> +)
> +
> +//--
> +//  For org mode
> +//--
> +
> +@r_org depends on !context && !patch && (org || report)@
> +expression E;
> +position p;
> +@@
> +
> +  (
> +  * E@p & ~PAGE_MASK
> +  |
> +  * E@p & (PAGE_SIZE - 1)
> +  )
> +
> +@script:python depends on report@
> +p << r_org.p;
> +x << r_org.E;
> +@@
> +
> +msg="WARNING: Consider using offset_in_page helper on %s" % (x)
> +coccilib.report.print_report(p[0], msg)
> +
> +@script:python depends on org@
> +p << r_org.p;
> +x << r_org.E;
> +@@
> +
> +msg="WARNING: Consider using offset_in_page helper on %s" % (x)
> +msg_safe=msg.replace("[","@(").replace("]",")")
> +coccilib.org.print_todo(p[0], msg_safe)
> +
> --
> 2.16.4
>
>


Fix for https://bugzilla.kernel.org/show_bug.cgi?id=200085

2018-12-05 Thread Саша Ковтуненко
Hello,
Is adding a maximum retry count the correct way to fix this issue? I have
done that in the pull request below:
https://github.com/kdave/btrfs-progs/pull/151/files

Thanks
AK


Re: [PATCH 1/3] btrfs: use offset_in_page instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote:
> Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
> offset into a page.
> 
> So replace them by the offset_in_page() macro instead of open-coding it if
> they're not used as an alignment check.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/check-integrity.c | 12 +--
>  fs/btrfs/compression.c |  2 +-
>  fs/btrfs/extent_io.c   | 53 
> +-
>  fs/btrfs/file.c|  4 ++--
>  fs/btrfs/inode.c   |  7 +++---
>  fs/btrfs/send.c|  2 +-
>  fs/btrfs/volumes.c |  2 +-
>  7 files changed, 38 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 781cae168d2a..d319c3020c09 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data(
>   void *dstv, u32 offset, size_t len)
>  {
>   size_t cur;
> - size_t offset_in_page;
> + size_t pgoff;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(block_ctx->start);
>   unsigned long i = (start_offset + offset) >> PAGE_SHIFT;
>  
>   WARN_ON(offset + len > block_ctx->len);
> - offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1);
> + pgoff = offset_in_page(start_offset + offset);
>  
>   while (len > 0) {
> - cur = min(len, ((size_t)PAGE_SIZE - offset_in_page));
> + cur = min(len, ((size_t)PAGE_SIZE - pgoff));
>   BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE));
>   kaddr = block_ctx->datav[i];
> - memcpy(dst, kaddr + offset_in_page, cur);
> + memcpy(dst, kaddr + pgoff, cur);
>  
>   dst += cur;
>   len -= cur;
> - offset_in_page = 0;
> + pgoff = 0;
>   i++;
>   }
>  }
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index dba59ae914b8..2ab5591449f2 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  
>   if (page->index == end_index) {
>   char *userpage;
> - size_t zero_offset = isize & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(isize);
>  
>   if (zero_offset) {
>   int zeros;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b2769e92b556..e365c5272e6b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   unsigned off;
>  
>   /* Zero out the end if this page straddles i_size */
> - off = i_size & (PAGE_SIZE-1);
> + off = offset_in_page(i_size);
>   if (page->index == end_index && off)
>   zero_user_segment(page, off, PAGE_SIZE);
>   SetPageUptodate(page);
> @@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree,
>  
>   if (page->index == last_byte >> PAGE_SHIFT) {
>   char *userpage;
> - size_t zero_offset = last_byte & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(last_byte);
>  
>   if (zero_offset) {
>   iosize = PAGE_SIZE - zero_offset;
> @@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct 
> writeback_control *wbc,
>  
>   ClearPageError(page);
>  
> - pg_offset = i_size & (PAGE_SIZE - 1);
> + pg_offset = offset_in_page(i_size);
>   if (page->index > end_index ||
>  (page->index == end_index && !pg_offset)) {
>   page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
> @@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
> void *dstv,
>   struct page *page;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(eb->start);
>   unsigned long i = (start_offset + start) >> PAGE_SHIFT;
>  
>   if (start + len > eb->len) {
> @@ -5317,7 +5317,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
> void *dstv,
>   return;
>   }
>  
> - offset = (start_offset + start) & (PAGE_SIZE - 1);
> + offset = offset_in_page(start_offset + start);
>  
>   while (len > 0) {
>   page = eb->pages[i];
> @@ -5342,14 +5342,14 @@ int read_extent_buffer_to_user(const struct 
> extent_buffer *eb,
>   struct page *page;
>   char *kaddr;
>   char __user *dst = (char __user *)dstv;
> 

Re: [PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 16:23 ч., Johannes Thumshirn wrote:
> When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
> alignment and thus should use the PAGE_ALIGNED() macro instead of
> open-coding it.
> 
> Convert all open-coded occurrences of PAGE_ALIGNED().
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/check-integrity.c | 8 
>  fs/btrfs/compression.c | 2 +-
>  fs/btrfs/inode.c   | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index d319c3020c09..84e9729badaa 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state 
> *state,
>   BUG_ON(block_ctx->datav);
>   BUG_ON(block_ctx->pagev);
>   BUG_ON(block_ctx->mem_to_free);
> - if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) {
>   pr_info("btrfsic: read_block() with unaligned bytenr %llu\n",
>  block_ctx->dev_bytenr);
>   return -1;
> @@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct 
> btrfsic_dev_state *dev_state,
>   return;
>   }
>   is_metadata = 1;
> - BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1));
> + BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE));
>   processed_len = BTRFS_SUPER_INFO_SIZE;
>   if (state->print_mask &
>   BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) {
> @@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
> 	struct list_head *dev_head = &fs_devices->devices;
>   struct btrfs_device *device;
>  
> - if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->nodesize)) {
>   pr_info("btrfsic: cannot handle nodesize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->nodesize, PAGE_SIZE);
>   return -1;
>   }
> - if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->sectorsize)) {
>   pr_info("btrfsic: cannot handle sectorsize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->sectorsize, PAGE_SIZE);
>   return -1;
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2ab5591449f2..d5381f39a9e8 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
> *inode, u64 start,
>   blk_status_t ret;
>   int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>  
> - WARN_ON(start & ((u64)PAGE_SIZE - 1));
> + WARN_ON(!PAGE_ALIGNED(start));
>   cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
>   if (!cb)
>   return BLK_STS_RESOURCE;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bc0564c384de..5c52e91b01e8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 
> start, u64 end,
> unsigned int extra_bits,
> struct extent_state **cached_state, int dedupe)
>  {
> - WARN_ON((end & (PAGE_SIZE - 1)) == 0);
> + WARN_ON(PAGE_ALIGNED(end));
> 	return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
>  extra_bits, cached_state);
>  }
> 


[PATCH 1/3] btrfs: use offset_in_page instead of open-coding it

2018-12-05 Thread Johannes Thumshirn
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
offset into a page.

So replace them by the offset_in_page() macro instead of open-coding it if
they're not used as an alignment check.

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/check-integrity.c | 12 +--
 fs/btrfs/compression.c |  2 +-
 fs/btrfs/extent_io.c   | 53 +-
 fs/btrfs/file.c|  4 ++--
 fs/btrfs/inode.c   |  7 +++---
 fs/btrfs/send.c|  2 +-
 fs/btrfs/volumes.c |  2 +-
 7 files changed, 38 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 781cae168d2a..d319c3020c09 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data(
void *dstv, u32 offset, size_t len)
 {
size_t cur;
-   size_t offset_in_page;
+   size_t pgoff;
char *kaddr;
char *dst = (char *)dstv;
-   size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1);
+   size_t start_offset = offset_in_page(block_ctx->start);
unsigned long i = (start_offset + offset) >> PAGE_SHIFT;
 
WARN_ON(offset + len > block_ctx->len);
-   offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1);
+   pgoff = offset_in_page(start_offset + offset);
 
while (len > 0) {
-   cur = min(len, ((size_t)PAGE_SIZE - offset_in_page));
+   cur = min(len, ((size_t)PAGE_SIZE - pgoff));
BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE));
kaddr = block_ctx->datav[i];
-   memcpy(dst, kaddr + offset_in_page, cur);
+   memcpy(dst, kaddr + pgoff, cur);
 
dst += cur;
len -= cur;
-   offset_in_page = 0;
+   pgoff = 0;
i++;
}
 }
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index dba59ae914b8..2ab5591449f2 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
 
if (page->index == end_index) {
char *userpage;
-   size_t zero_offset = isize & (PAGE_SIZE - 1);
+   size_t zero_offset = offset_in_page(isize);
 
if (zero_offset) {
int zeros;
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b2769e92b556..e365c5272e6b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio)
unsigned off;
 
/* Zero out the end if this page straddles i_size */
-   off = i_size & (PAGE_SIZE-1);
+   off = offset_in_page(i_size);
if (page->index == end_index && off)
zero_user_segment(page, off, PAGE_SIZE);
SetPageUptodate(page);
@@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 
if (page->index == last_byte >> PAGE_SHIFT) {
char *userpage;
-   size_t zero_offset = last_byte & (PAGE_SIZE - 1);
+   size_t zero_offset = offset_in_page(last_byte);
 
if (zero_offset) {
iosize = PAGE_SIZE - zero_offset;
@@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct 
writeback_control *wbc,
 
ClearPageError(page);
 
-   pg_offset = i_size & (PAGE_SIZE - 1);
+   pg_offset = offset_in_page(i_size);
if (page->index > end_index ||
   (page->index == end_index && !pg_offset)) {
page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
@@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
void *dstv,
struct page *page;
char *kaddr;
char *dst = (char *)dstv;
-   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+   size_t start_offset = offset_in_page(eb->start);
unsigned long i = (start_offset + start) >> PAGE_SHIFT;
 
if (start + len > eb->len) {
@@ -5317,7 +5317,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
void *dstv,
return;
}
 
-   offset = (start_offset + start) & (PAGE_SIZE - 1);
+   offset = offset_in_page(start_offset + start);
 
while (len > 0) {
page = eb->pages[i];
@@ -5342,14 +5342,14 @@ int read_extent_buffer_to_user(const struct 
extent_buffer *eb,
struct page *page;
char *kaddr;
char __user *dst = (char __user *)dstv;
-   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
+   size_t start_offset = offset_in_page(eb->start);
unsigned long i = (start_offset + start) >> PAGE_SHIFT;
int ret 

[RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-05 Thread Johannes Thumshirn
Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can be
replaced by the offset_in_page() macro instead of open-coding it.

Add a coccinelle semantic patch to ease detection and conversion of these.

This unfortunately doesn't account for the case when we want PAGE_ALIGNED()
instead of offset_in_page() yet.

Cc: Julia Lawall 
Signed-off-by: Johannes Thumshirn 
---
 scripts/coccinelle/api/offset_in_page.cocci | 81 +
 1 file changed, 81 insertions(+)
 create mode 100644 scripts/coccinelle/api/offset_in_page.cocci

diff --git a/scripts/coccinelle/api/offset_in_page.cocci 
b/scripts/coccinelle/api/offset_in_page.cocci
new file mode 100644
index ..ea5b3a8e0390
--- /dev/null
+++ b/scripts/coccinelle/api/offset_in_page.cocci
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+///
+/// Use offset_in_page macro on address instead of explicit computation.
+///
+//  Confidence: High
+//  Keywords: offset_in_page
+//  Comment: Based on vma_pages.cocci
+
+virtual context
+virtual patch
+virtual org
+virtual report
+
+
+//--
+//  For context mode
+//--
+
+@r_context depends on context && !patch && !org && !report@
+expression E;
+@@
+
+(
+* E & ~PAGE_MASK
+|
+* E & (PAGE_SIZE - 1)
+)
+
+
+//--
+//  For patch mode
+//--
+
+@r_patch depends on !context && patch && !org && !report@
+expression E;
+type T;
+@@
+
+(
+- E & ~PAGE_MASK
++ offset_in_page(E)
+|
+- E & (PAGE_SIZE - 1)
++ offset_in_page(E)
+|
+- E & ((T)PAGE_SIZE - 1)
++ offset_in_page(E)
+)
+
+//--
+//  For org mode
+//--
+
+@r_org depends on !context && !patch && (org || report)@
+expression E;
+position p;
+@@
+
+  (
+  * E@p & ~PAGE_MASK
+  |
+  * E@p & (PAGE_SIZE - 1)
+  )
+
+@script:python depends on report@
+p << r_org.p;
+x << r_org.E;
+@@
+
+msg="WARNING: Consider using offset_in_page helper on %s" % (x)
+coccilib.report.print_report(p[0], msg)
+
+@script:python depends on org@
+p << r_org.p;
+x << r_org.E;
+@@
+
+msg="WARNING: Consider using offset_in_page helper on %s" % (x)
+msg_safe=msg.replace("[","@(").replace("]",")")
+coccilib.org.print_todo(p[0], msg_safe)
+
-- 
2.16.4



[PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED

2018-12-05 Thread Johannes Thumshirn
Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them
throughout btrfs.

This series also includes a patch for 'make coccicheck' which is marked as an
RFC; I've CCed Julia in the hope of getting input from her.
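
For reference, the two helpers expand to roughly this (see
include/linux/mm.h):

#define offset_in_page(p)	((unsigned long)(p) & ~PAGE_MASK)
#define PAGE_ALIGNED(addr)	IS_ALIGNED((unsigned long)(addr), PAGE_SIZE)

so a typical conversion in the series looks like:

-	off = i_size & (PAGE_SIZE - 1);
+	off = offset_in_page(i_size);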

Johannes Thumshirn (3):
  btrfs: use offset_in_page instead of open-coding it
  btrfs: use PAGE_ALIGNED instead of open-coding it
  coccinelle: api: add offset_in_page.cocci

 fs/btrfs/check-integrity.c  | 20 +++
 fs/btrfs/compression.c  |  4 +-
 fs/btrfs/extent_io.c| 53 +--
 fs/btrfs/file.c |  4 +-
 fs/btrfs/inode.c|  9 ++--
 fs/btrfs/send.c |  2 +-
 fs/btrfs/volumes.c  |  2 +-
 scripts/coccinelle/api/offset_in_page.cocci | 81 +
 8 files changed, 125 insertions(+), 50 deletions(-)
 create mode 100644 scripts/coccinelle/api/offset_in_page.cocci

-- 
2.16.4



[PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it

2018-12-05 Thread Johannes Thumshirn
When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
alignment and thus should use the PAGE_ALIGNED() macro instead of
open-coding it.

Convert all open-coded occurrences of PAGE_ALIGNED().

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/check-integrity.c | 8 
 fs/btrfs/compression.c | 2 +-
 fs/btrfs/inode.c   | 2 +-
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index d319c3020c09..84e9729badaa 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state *state,
BUG_ON(block_ctx->datav);
BUG_ON(block_ctx->pagev);
BUG_ON(block_ctx->mem_to_free);
-   if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) {
+   if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) {
pr_info("btrfsic: read_block() with unaligned bytenr %llu\n",
   block_ctx->dev_bytenr);
return -1;
@@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct 
btrfsic_dev_state *dev_state,
return;
}
is_metadata = 1;
-   BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1));
+   BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE));
processed_len = BTRFS_SUPER_INFO_SIZE;
if (state->print_mask &
BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) {
@@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
	struct list_head *dev_head = &fs_devices->devices;
struct btrfs_device *device;
 
-   if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) {
+   if (!PAGE_ALIGNED(fs_info->nodesize)) {
pr_info("btrfsic: cannot handle nodesize %d not being a 
multiple of PAGE_SIZE %ld!\n",
   fs_info->nodesize, PAGE_SIZE);
return -1;
}
-   if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) {
+   if (!PAGE_ALIGNED(fs_info->sectorsize)) {
pr_info("btrfsic: cannot handle sectorsize %d not being a 
multiple of PAGE_SIZE %ld!\n",
   fs_info->sectorsize, PAGE_SIZE);
return -1;
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 2ab5591449f2..d5381f39a9e8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
*inode, u64 start,
blk_status_t ret;
int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
 
-   WARN_ON(start & ((u64)PAGE_SIZE - 1));
+   WARN_ON(!PAGE_ALIGNED(start));
cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
if (!cb)
return BLK_STS_RESOURCE;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index bc0564c384de..5c52e91b01e8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 
start, u64 end,
  unsigned int extra_bits,
  struct extent_state **cached_state, int dedupe)
 {
-   WARN_ON((end & (PAGE_SIZE - 1)) == 0);
+   WARN_ON(PAGE_ALIGNED(end));
	return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
   extra_bits, cached_state);
 }
-- 
2.16.4



Btrfs progs release 4.19.1

2018-12-05 Thread David Sterba
Hi,

btrfs-progs version 4.19.1 has been released.  There are build fixes, minor
updates to libbtrfsutil, and documentation updates.

Changes since 4.19.1-rc1: fix typos in CHANGES

Changes:

  * build fixes
* big-endian builds fail due to bswap helper clashes
* 'swap' macro is too generic, renamed to prevent build failures
  * libbtrfs
* minor version update to 1.1.0
* fix default search to top=0 as documented
* rename 'async' to avoid future python binding problems
* add support for unprivileged subvolume listing ioctls
* added tests, API docs
  * other
* lot of typos fixed
* warning cleanups
* doc formatting updates
* CI tests against zstd 1.3.7

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (7):
  btrfs-progs: kerncompat: rename swap to __swap
  btrfs-progs: README: add link to INSTALL
  btrfs-progs: docs: fix rendering of exponents in manual pages
  btrfs-progs: link to libbtrfsutil/README from the main README
  btrfs-progs: tests: pull zstd version 1.3.7 to the travis CI
  btrfs-progs: update CHANGES for v4.19.1
  Btrfs progs v4.19.1

Josh Soref (11):
  btrfs-progs: docs: fix typos in Documentation
  btrfs-progs: docs: fix typos in READMEs, INSTALL and CHANGES
  btrfs-progs: fix typos in Makefile
  btrfs-progs: tests: fix typos in test comments
  btrfs-progs: tests: fsck/025, fix typo in helpre name
  btrfs-progs: fix typos in comments
  btrfs-progs: fix typos in user-visible strings
  btrfs-progs: check: fix typo in device_extent_record::chunk_objectid
  btrfs-progs: utils: fix typo in a variable
  btrfs-progs: mkfs: fix typo in "multipler" variables
  btrfs-progs: fix typo in btrfs-list function export

Omar Sandoval (10):
  libbtrfsutil: use top=0 as default for SubvolumeIterator()
  libbtrfsutil: change async parameters to async_ in Python bindings
  libbtrfsutil: document qgroup_inherit parameter in Python bindings
  libbtrfsutil: use SubvolumeIterator as context manager in tests
  libbtrfsutil: add test helpers for dropping privileges
  libbtrfsutil: allow tests to create multiple Btrfs instances
  libbtrfsutil: relax the privileges of subvolume_info()
  libbtrfsutil: relax the privileges of subvolume iterator
  libbtrfsutil: bump version to 1.1.0
  libbtrfsutil: document API in README

Rosen Penev (3):
  btrfs-progs: kernel-lib: bitops: Fix big endian compilation
  btrfs-progs: task-utils: Fix comparison between pointer and integer
  btrfs-progs: treewide: Fix missing declarations



Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-05 Thread Qu Wenruo


On 2018/12/5 下午9:40, David Sterba wrote:
> On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
>> GCC 8.2.1 will report the following warning with "make W=1":
>>
>>   ctree.c: In function 'btrfs_next_sibling_tree_block':
>>   ctree.c:2990:21: warning: 'slot' may be used uninitialized in this 
>> function [-Wmaybe-uninitialized]
>> path->slots[level] = slot;
>> ~~~^~
>>
>> The culprit is the following code:
>>
>>  int slot;   << Not initialized
>>  int level = path->lowest_level + 1;
>>  BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>  while(level < BTRFS_MAX_LEVEL) {
>>  slot = path->slots[level] + 1;
>>  ^^ but we initialize @slot here.
>>  ...
>>  }
>>  path->slots[level] = slot;
>>
>> It's possible that the compiler doesn't get enough of a hint from BUG_ON()
>> for the lowest_level + 1 >= BTRFS_MAX_LEVEL case.
>>
>> Fix it by using a do {} while() loop instead of a while() {} loop, to
>> ensure we run the loop at least once.
> 
> I was hoping that we can actually add the hint to BUG_ON so the code
> does not continue if the condition is true.
> 
I checked that method, but I'm not that confident about things like:

bugon_trace()
{
if (!val)
return;
__bugon_trace();
}

__attribute__ ((noreturn))
static inline void __bugon_trace();

This is as simple as just one extra function call, but the original
problem is just one more function call before hitting abort().

So I just gave up on tweaking things I'm not comfortable enough with.

The current do {} while() loop is the most direct solution; if gcc one
day still gives such a warning, I can say some harsh words then.
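
Roughly, the restructured loop looks like this (just a sketch, not the
exact btrfs-progs code):

	int slot;
	int level = path->lowest_level + 1;

	BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
	do {
		slot = path->slots[level] + 1;
		/* ... level++ while walking up, break once a sibling exists ... */
	} while (level < BTRFS_MAX_LEVEL);
	path->slots[level] = slot;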

Thanks,
Qu





Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:

[...]

> -static void *grab_mapping_entry(struct xa_state *xas,
> +void *grab_mapping_entry(struct xa_state *xas,
>   struct address_space *mapping, unsigned long size_flag)
>  {
>   unsigned long index = xas->xa_index;
> @@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
>   xas_unlock_irq(xas);
>   return xa_mk_internal(VM_FAULT_FALLBACK);
>  }
> +EXPORT_SYMBOL(grab_mapping_entry);

dax_grab_mapping_entry() please.



-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel, you
mmap the file then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.

CoW via page faults are fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

Only way I see would be to CoW once then pretend the page is nodatacow until
the next commit, when we checksum it, add to the metadata trees, and mark
for CoWing on the next write.  Lots of complexity, and you still need to
copy the whole thing every commit (so no gain).

Ie, we're in nodatacow land.  CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (ie, "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 

What would make sense, is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and full set of btrfs features for the rest.  Write dax is dangerous
for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.
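
Per-file nodatacow already exists today, e.g. (paths are placeholders,
and the attribute must be set while the file is still empty):

# touch /mnt/pool.img
# chattr +C /mnt/pool.img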


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 06/10] btrfs: dax write support

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
[...]

> +static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
> +{
> +struct dax_device *dax_dev;

^ space instead of tabs?

> + void *saddr;
> + sector_t start;
> + size_t len;
> +
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + memset(daddr, 0, PAGE_SIZE);
> + } else {
> + dax_dev = fs_dax_get_by_bdev(em->bdev);
> + start = (get_start_sect(em->bdev) << 9) + (em->block_start + 
> (pos - em->start));
> +		len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr, 
> NULL);
> + memcpy(daddr, saddr, PAGE_SIZE);
> + }
> + free_extent_map(em);
> +
> + return 0;
> +}
> +
> +

copy_extent_page() always returns 0, why not make it void?
Plus a nit: double newline.

> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + ssize_t ret, done = 0, count = iov_iter_count(from);
> +struct inode *inode = file_inode(iocb->ki_filp);
^ again spaces vs tabs.

> + u64 pos = iocb->ki_pos;
> + u64 start = round_down(pos, PAGE_SIZE);
> + u64 end = round_up(pos + count, PAGE_SIZE);
> + struct extent_state *cached_state = NULL;
> + struct extent_changeset *data_reserved = NULL;
> + struct extent_map *first = NULL, *last = NULL;
> +
> +	ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, end - 
> start);
> + if (ret < 0)
> + return ret;
> +
> + /* Grab a reference of the first extent to copy data */
> + if (start < pos) {
> + first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - 
> start, 0);
> + if (IS_ERR(first)) {
> + ret = PTR_ERR(first);
> + goto out2;
> + }
> + }

You're using 'end - start' at least twice here, maybe you could move
'len' out of the loop and use it for btrfs_delalloc_reserve_space() and
btrfs_get_extent() as well.

> +
> + /* Grab a reference of the last extent to copy data */
> + if (pos + count < end) {
> + last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - 
> PAGE_SIZE, PAGE_SIZE, 0);
> + if (IS_ERR(last)) {
> + ret = PTR_ERR(last);
> + goto out2;
> + }
> + }
> +
> +	lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
> + while (done < count) {
> + struct extent_map *em;
> + struct dax_device *dax_dev;
> + int offset = pos & (PAGE_SIZE - 1);
> + u64 estart = round_down(pos, PAGE_SIZE);
> + u64 elen = end - estart;
> + size_t len = count - done;
> + sector_t dstart;
> + void *daddr;
> + ssize_t maplen;
> +
> + /* Read the current extent */
> +em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, 
> 0);

Space again.

> + if (IS_ERR(em)) {
> + ret = PTR_ERR(em);
> + goto out;
> + }
> +
> + /* Get a new extent */
> +		ret = btrfs_get_extent_map_write(&em, NULL, inode, estart, 
> elen);
> + if (ret < 0)
> + goto out;
> +
> + dax_dev = fs_dax_get_by_bdev(em->bdev);
> + /* Calculate start address start of destination extent */
> + dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
> + maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
> +				PHYS_PFN(em->len), &daddr, NULL);
> +
> + /* Copy front of extent page */
> + if (offset)
> + ret = copy_extent_page(first, daddr, estart);
> +
> + /* Copy end of extent page */
> + if ((pos + len > estart + PAGE_SIZE) && (pos + len < em->start 
> + em->len))
> + ret = copy_extent_page(last, daddr + em->len - 
> PAGE_SIZE, em->start + em->len - PAGE_SIZE);
> +
> + /* Copy the data from the iter */
> + maplen = PFN_PHYS(maplen);
> + maplen -= offset;
> + ret = dax_copy_from_iter(dax_dev, dstart, daddr + offset, 
> maplen, from);
> + if (ret < 0)
> + goto out;
> + pos += ret;
> + done += ret;
> + }
> +out:

out_unlock?

> +	unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end, 
> +			     &cached_state);
> + if (done) {
> + btrfs_update_ordered_extent(inode, start,
> + end - start, true);
> + iocb->ki_pos += done;
> + if (iocb->ki_pos > i_size_read(inode))
> + i_size_write(inode, iocb->ki_pos);
> + }
> +
> + btrfs_delalloc_release_extents(BTRFS_I(inode), count, false);
> +out2:

out?

> + if (count - done > 0)
> + btrfs_delalloc_release_space(inode, data_reserved, pos,
> + count 

Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-05 Thread David Sterba
On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
> GCC 8.2.1 will report the following warning with "make W=1":
> 
>   ctree.c: In function 'btrfs_next_sibling_tree_block':
>   ctree.c:2990:21: warning: 'slot' may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
> path->slots[level] = slot;
> ~~~^~
> 
> The culprit is the following code:
> 
>   int slot;   << Not initialized
>   int level = path->lowest_level + 1;
>   BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>   while(level < BTRFS_MAX_LEVEL) {
>   slot = path->slots[level] + 1;
>   ^^ but we initialize @slot here.
>   ...
>   }
>   path->slots[level] = slot;
> 
> It's possible that the compiler doesn't get enough of a hint from BUG_ON()
> for the lowest_level + 1 >= BTRFS_MAX_LEVEL case.
> 
> Fix it by using a do {} while() loop rather than a while() {} loop, to
> ensure we run the loop at least once.

I was hoping that we can actually add the hint to BUG_ON so the code
does not continue if the condition is true.
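Something along these lines would do it (a rough sketch, not the actual
btrfs-progs macro; the point is only that an unconditional abort() on the
failure path lets gcc see that the loop body always runs):

#include <stdio.h>
#include <stdlib.h>

/*
 * Illustrative only: give the compiler a noreturn path when the condition
 * is true, so code after BUG_ON() is known to be unreachable in that case
 * and -Wmaybe-uninitialized stays quiet.
 */
#define BUG_ON(c)						\
do {								\
	if (__builtin_expect(!!(c), 0)) {			\
		fprintf(stderr, "assertion failed: %s:%d: %s\n",\
			__FILE__, __LINE__, #c);		\
		abort();	/* noreturn, ends the path */	\
	}							\
} while (0)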


Re: [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Since we will be using it in another part of the code, use a
> better name and declare it non-static.
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  7 +--
>  fs/btrfs/inode.c | 14 +-
>  2 files changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 038d64ecebe5..5144d28216b0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3170,8 +3170,11 @@ struct inode *btrfs_iget_path(struct super_block *s, 
> struct btrfs_key *location,
>  struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
>struct btrfs_root *root, int *was_new);
>  struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> - struct page *page, size_t pg_offset,
> - u64 start, u64 end, int create);
> + struct page *page, size_t pg_offset,
> + u64 start, u64 end, int create);
> +void btrfs_update_ordered_extent(struct inode *inode,
> + const u64 offset, const u64 bytes,
> + const bool uptodate);
>  int btrfs_update_inode(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct inode *inode);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9ea4c6f0352f..96e9fe9e4150 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -97,10 +97,6 @@ static struct extent_map *create_io_em(struct inode 
> *inode, u64 start, u64 len,
>  u64 ram_bytes, int compress_type,
>  int type);
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> -  const u64 offset, const u64 bytes,
> -  const bool uptodate);
> -
>  /*
>   * Cleanup all submitted ordered extents in specified range to handle errors
>   * from the fill_dellaloc() callback.
> @@ -130,7 +126,7 @@ static inline void btrfs_cleanup_ordered_extents(struct 
> inode *inode,
>   ClearPagePrivate2(page);
>   put_page(page);
>   }
> - return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> + return btrfs_update_ordered_extent(inode, offset + PAGE_SIZE,
>   bytes - PAGE_SIZE, false);
>  }
>  
> @@ -8059,7 +8055,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>   bio_put(bio);
>  }
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> +void btrfs_update_ordered_extent(struct inode *inode,
>const u64 offset, const u64 bytes,
>const bool uptodate)

Since you are exporting the function I'd suggest using the occasion to
introduce proper kernel-doc. The primary help will be if you document
the context under which the function is called - when writes have
finished and it's used to update the respective portion of the ordered extent.
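Something along these lines, say (wording is only a suggestion, the
parameter descriptions are my reading of the call sites):

/**
 * btrfs_update_ordered_extent - finish part of an ordered extent after write
 * @inode:    inode the ordered extent belongs to
 * @offset:   file offset where the completed range starts
 * @bytes:    length of the completed range
 * @uptodate: true if the write finished without error
 *
 * Called from the write endio paths once I/O for the given range has
 * completed, to mark the matching part of the ordered extent as done.
 */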

>  {
> @@ -8112,7 +8108,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
>   struct btrfs_dio_private *dip = bio->bi_private;
>   struct bio *dio_bio = dip->dio_bio;
>  
> - __endio_write_update_ordered(dip->inode, dip->logical_offset,
> + btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
>dip->bytes, !bio->bi_status);
>  
>   kfree(dip);
> @@ -8432,7 +8428,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
> struct inode *inode,
>   bio = NULL;
>   } else {
>   if (write)
> - __endio_write_update_ordered(inode,
> + btrfs_update_ordered_extent(inode,
>   file_offset,
>   dio_bio->bi_iter.bi_size,
>   false);
> @@ -8572,7 +8568,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter)
>*/
>   if (dio_data.unsubmitted_oe_range_start <
>   dio_data.unsubmitted_oe_range_end)
> - __endio_write_update_ordered(inode,
> + btrfs_update_ordered_extent(inode,
>   dio_data.unsubmitted_oe_range_start,
>   dio_data.unsubmitted_oe_range_end -
>   dio_data.unsubmitted_oe_range_start,
> 


Re: [PATCH 03/10] btrfs: dax: read zeros from holes

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/dax.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index d614bf73bf8e..5a297674adec 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
> iov_iter *to)

nit: I think it's better if you rename the iterator variable to "iter".

>  
>  BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
>  
> -ret = em_dax_rw(inode, em, pos, len, to);
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + u64 zero_len = min(em->len - (em->start - pos), len);

Shouldn't this be em->len - (pos - em->start) since this gives the
remaining length of the extent? Isn't pos guaranteed to be >= em->start ?
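E.g. with em->start = 0, em->len = 8192 and pos = 4096, the bytes left in
the extent are 8192 - (4096 - 0) = 4096, while em->len - (em->start - pos)
would wrap around because em->start - pos underflows in u64 arithmetic.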

> + ret = iov_iter_zero(zero_len, to);
> + } else {
> + ret = em_dax_rw(inode, em, pos, len, to);
> + }
>  if (ret < 0)
>  goto out;
>  pos += ret;
> 


Re: [PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
Can you explain why we can't use the dax_iomap_rw() interface like XFS or
EXT4?

[...]

> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> + u64 len, struct iov_iter *iter)
> +{
> +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +ssize_t map_len;
> +pgoff_t blk_pg;
> +void *kaddr;
> +sector_t blk_start;
> +unsigned offset = pos & (PAGE_SIZE - 1);

Nit: unsigned offset = offset_in_page(pos);




-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/Makefile |  1 +
>  fs/btrfs/ctree.h  |  5 
>  fs/btrfs/dax.c| 68 
> +++
>  fs/btrfs/file.c   | 13 ++-
>  4 files changed, 86 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/dax.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index ca693dd554e9..1fa77b875ae9 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
> root-tree.o dir-item.o \
>  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  uuid-tree.o props.o free-space-tree.o tree-checker.o
>  
> +btrfs-$(CONFIG_FS_DAX) += dax.o
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5cc470fa6a40..038d64ecebe5 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
>  void btrfs_reada_detach(void *handle);
>  int btree_readahead_hook(struct extent_buffer *eb, int err);
>  
> +#ifdef CONFIG_FS_DAX
> +/* dax.c */
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline int is_fstree(u64 rootid)
>  {
>   if (rootid == BTRFS_FS_TREE_OBJECTID ||
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> new file mode 100644
> index ..d614bf73bf8e
> --- /dev/null
> +++ b/fs/btrfs/dax.c
> @@ -0,0 +1,68 @@
> +#include 
> +#include 
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +
> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> + u64 len, struct iov_iter *iter)
> +{
> +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +ssize_t map_len;
> +pgoff_t blk_pg;
> +void *kaddr;
> +sector_t blk_start;
> +unsigned offset = pos & (PAGE_SIZE - 1);

offset = offset_in_page(pos)

> +
> +len = min(len + offset, em->len - (pos - em->start));
> +len = ALIGN(len, PAGE_SIZE);

len = PAGE_ALIGN(len);

> +blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + 
> (pos - em->start));
> +blk_pg = blk_start - offset;
> +map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg), 
> PHYS_PFN(len), &kaddr, NULL);
> +map_len = PFN_PHYS(map_len);
> +kaddr += offset;
> +map_len -= offset;
> +if (map_len > len)
> +map_len = len;

map_len = min(map_len, len);

> +if (iov_iter_rw(iter) == WRITE)
> +return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, 
> iter);
> +else
> +return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, 
> iter);

Have you looked at the implementation of dax_iomap_actor, which has
pretty similar code? In case either of those copy calls returns 0 it sets
ret to EFAULT - should the same be done in btrfs_file_dax_read? IMO it
will be good if you can follow dax_iomap_actor's logic as much as possible,
since that code has been used for quite some time and is deemed robust.
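I.e. something like this in the read loop (just a sketch of the shape of
the check, matching what dax_iomap_actor does when nothing gets copied):

		ret = em_dax_rw(inode, em, pos, len, to);
		if (ret == 0)
			ret = -EFAULT;	/* nothing was copied */
		if (ret < 0)
			goto out;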

> +}
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +size_t ret = 0, done = 0, count = iov_iter_count(to);
> +struct extent_map *em;
> +u64 pos = iocb->ki_pos;
> +u64 end = pos + count;
> +struct inode *inode = file_inode(iocb->ki_filp);
> +
> +if (!count)
> +return 0;
> +
> +end = i_size_read(inode) < end ? i_size_read(inode) : end;

end = min(i_size_read(inode), end)

> +
> +while (pos < end) {
> +u64 len = end - pos;
> +
> +em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
> +if (IS_ERR(em)) {
> +if (!ret)
> +ret = PTR_ERR(em);
> +goto out;
> +}
> +
> +BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);

I think this can never trigger, because EXTENT_FLAG_FS_MAPPING is set
for extents that map chunks and those are housed in the chunk tree at
fs_info->mapping_tree. Since the write callback is only ever called for
file inodes I'd say this BUG_ON can be eliminated. Did you manage to
trigger it during development?


> +
> +ret = em_dax_rw(inode, em, pos, len, to);
> +if (ret < 0)
> +goto out;
> +pos += ret;
> +done += ret;
> +}
> +
> +out:
> +iocb->ki_pos += done;
> +return done ? done : ret;
> +}
> +
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 58e93bce3036..ef6ed93f44d1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ 

Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Qu Wenruo


On 2018/12/5 8:28 PM, Goldwyn Rodrigues wrote:
> This is support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs list. Even though I wish I could, I cannot get it correct the first
> time around :/ Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because the btrfs kernel module does not get back control after
> an mmap() write. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.

Force a page fault for every mmap() read/write (completely disabling the
page cache, like DIO).
That way we could get some control, since we're informed of each page
access and can do our hacks there.

Thanks,
Qu
> 
> 2. Currently, a user can continue writing to "old" extents of an mmapped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?
> 
> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
> 
> 
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
> 
>  fs/btrfs/Makefile   |1 
>  fs/btrfs/ctree.h|   17 ++
>  fs/btrfs/dax.c  |  303 
> ++--
>  fs/btrfs/file.c |   29 
>  fs/btrfs/inode.c|   54 +
>  fs/btrfs/ioctl.c|5 
>  fs/btrfs/super.c|   15 ++
>  fs/dax.c|   35 --
>  include/linux/dax.h |   16 ++
>  9 files changed, 430 insertions(+), 45 deletions(-)
> 
> 





Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues 

Reviewed-by: Nikolay Borisov 

One question below though .

> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 -
>  fs/btrfs/super.c | 15 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL(30)
>  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 802a628e9f7d..e9146c157816 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode 
> *inode)
>   if (binode->flags & BTRFS_INODE_DIRSYNC)
>   new_fl |= S_DIRSYNC;
>  
> + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && 
> S_ISREG(inode->i_mode))
> + new_fl |= S_DAX;
> +
>   set_mask_bits(&inode->i_flags,
> -   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> +   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | 
> S_DAX,
> new_fl);
>  }
>  
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 645fc81e2a94..035263b61cf5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
>   Opt_treelog, Opt_notreelog,
>   Opt_usebackuproot,
>   Opt_user_subvol_rm_allowed,
> + Opt_dax,
>  
>   /* Deprecated options */
>   Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>   {Opt_notreelog, "notreelog"},
>   {Opt_usebackuproot, "usebackuproot"},
>   {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> + {Opt_dax, "dax"},
>  
>   /* Deprecated options */
>   {Opt_alloc_start, "alloc_start=%s"},
> @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   case Opt_user_subvol_rm_allowed:
>   btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
>   break;
> +#ifdef CONFIG_FS_DAX
> + case Opt_dax:
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> +"dax not supported for multi-device 
> btrfs partition\n");

What prevents supporting dax for multiple devices so long as all devices
are dax?



> 


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 -
>  fs/btrfs/super.c | 15 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)

Just as a heads up, this will collide with the patch called '[RFC PATCH
02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE' from Su Yue.

[...]

> +#ifdef CONFIG_FS_DAX
> + case Opt_dax:
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> +"dax not supported for multi-device 
> btrfs partition\n");
> + ret = -EOPNOTSUPP;
> + goto out;
> + }
> + btrfs_set_opt(info->mount_opt, DAX);
> + break;
> +#endif

Can you please explain why we can't enable DAX on a multi device FS in
the changelog? It's (for me at least) not obvious.

Thanks,
Johannes
-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


[PATCH 10/10] btrfs: dax mmap write

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Create a page-size extent, copy the contents of the original
extent into the new one, and present it to user space as the page
to write.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 6d68d39cc5da..4634917877f3 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -231,6 +231,45 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
sector >>= 9;
ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, 
vmf->cow_page, vaddr);
goto out;
+   } else if (vmf->flags & FAULT_FLAG_WRITE) {
+   pfn_t pfn;
+   struct extent_map *orig = em;
+   void *daddr;
+   sector_t dstart;
+   size_t maplen;
+   struct extent_changeset *data_reserved = NULL;
+   struct extent_state *cached_state = NULL;
+
+   ret = btrfs_delalloc_reserve_space(inode, &data_reserved, pos, 
PAGE_SIZE);
+   if (ret < 0)
+   return ret;
+   refcount_inc(&em->refs);
+   lock_extent_bits(&BTRFS_I(inode)->io_tree, pos, pos + 
PAGE_SIZE, &cached_state);
+   /* Create an extent of page size */
+   ret = btrfs_get_extent_map_write(&em, NULL, inode, pos,
+   PAGE_SIZE);
+   if (ret < 0) {
+   free_extent_map(orig);
+   btrfs_delalloc_release_space(inode, data_reserved, pos,
+   PAGE_SIZE, true);
+   goto out;
+   }
+
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+   /* Calculate start address of destination extent */
+   dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+   maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+   1, &daddr, &pfn);
+
+   /* Copy the original contents into new destination */
+   copy_extent_page(orig, daddr, pos);
+   btrfs_update_ordered_extent(inode, pos, PAGE_SIZE, true);
+   dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
+   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+   free_extent_map(orig);
+   unlock_extent_cached(&BTRFS_I(inode)->io_tree, pos, pos + 
PAGE_SIZE, &cached_state);
+   extent_changeset_free(data_reserved);
+   btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, 
false);
} else {
sector_t sector;
if (em->block_start == EXTENT_MAP_HOLE) {
-- 
2.16.4



[PATCH 06/10] btrfs: dax write support

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

This is a combination of direct and buffered I/O. The similarity
with direct I/O is that it needs to allocate space before
writing. The similarity with buffered I/O is that when the data is not
page-aligned, it needs to copy parts of the previous extents. In
order to accomplish that, keep a reference to the first and last
extent (if required) and then perform the allocations. If "pos"
or "end" is not aligned, copy the data from the first and last extent
respectively.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |   1 +
 fs/btrfs/dax.c   | 121 +++
 fs/btrfs/file.c  |   4 +-
 3 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index a0d296b0d826..d91ff283a966 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3693,6 +3693,7 @@ int btree_readahead_hook(struct extent_buffer *eb, int 
err);
 #ifdef CONFIG_FS_DAX
 /* dax.c */
 ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from);
 #endif /* CONFIG_FS_DAX */
 
 static inline int is_fstree(u64 rootid)
diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 5a297674adec..4000259a426c 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -2,6 +2,7 @@
 #include 
 #include "ctree.h"
 #include "btrfs_inode.h"
+#include "extent_io.h"
 
 static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
u64 len, struct iov_iter *iter)
@@ -71,3 +72,123 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
iov_iter *to)
 return done ? done : ret;
 }
 
+static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
+{
+struct dax_device *dax_dev;
+   void *saddr;
+   sector_t start;
+   size_t len;
+
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   memset(daddr, 0, PAGE_SIZE);
+   } else {
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+   start = (get_start_sect(em->bdev) << 9) + (em->block_start + 
(pos - em->start));
+   len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr, 
NULL);
+   memcpy(daddr, saddr, PAGE_SIZE);
+   }
+   free_extent_map(em);
+
+   return 0;
+}
+
+
+ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
+{
+   ssize_t ret, done = 0, count = iov_iter_count(from);
+struct inode *inode = file_inode(iocb->ki_filp);
+   u64 pos = iocb->ki_pos;
+   u64 start = round_down(pos, PAGE_SIZE);
+   u64 end = round_up(pos + count, PAGE_SIZE);
+   struct extent_state *cached_state = NULL;
+   struct extent_changeset *data_reserved = NULL;
+   struct extent_map *first = NULL, *last = NULL;
+
+   ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, end - 
start);
+   if (ret < 0)
+   return ret;
+
+   /* Grab a reference of the first extent to copy data */
+   if (start < pos) {
+   first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - 
start, 0);
+   if (IS_ERR(first)) {
+   ret = PTR_ERR(first);
+   goto out2;
+   }
+   }
+
+   /* Grab a reference of the last extent to copy data */
+   if (pos + count < end) {
+   last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - 
PAGE_SIZE, PAGE_SIZE, 0);
+   if (IS_ERR(last)) {
+   ret = PTR_ERR(last);
+   goto out2;
+   }
+   }
+
+   lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
+   while (done < count) {
+   struct extent_map *em;
+   struct dax_device *dax_dev;
+   int offset = pos & (PAGE_SIZE - 1);
+   u64 estart = round_down(pos, PAGE_SIZE);
+   u64 elen = end - estart;
+   size_t len = count - done;
+   sector_t dstart;
+   void *daddr;
+   ssize_t maplen;
+
+   /* Read the current extent */
+em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, 
0);
+   if (IS_ERR(em)) {
+   ret = PTR_ERR(em);
+   goto out;
+   }
+
+   /* Get a new extent */
+   ret = btrfs_get_extent_map_write(&em, NULL, inode, estart, 
elen);
+   if (ret < 0)
+   goto out;
+
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+   /* Calculate start address start of destination extent */
+   dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
+   maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
+   PHYS_PFN(em->len), &daddr, NULL);
+
+   /* Copy front of extent page */
+   if (offset)
+   ret = copy_extent_page(first, daddr, estart);
+
+   

[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 25 -
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index 88017f8799d1..6d68d39cc5da 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -198,10 +198,13 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
pfn_t pfn;
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
XA_STATE(xas, &mapping->i_pages, vmf->pgoff);
+   unsigned long vaddr = vmf->address;
struct inode *inode = mapping->host;
loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
void *entry = NULL;
vm_fault_t ret = 0;
+   struct extent_map *em;
+   struct dax_device *dax_dev;
 
if (pos > i_size_read(inode)) {
ret = VM_FAULT_SIGBUS;
@@ -214,21 +217,33 @@ vm_fault_t btrfs_dax_fault(struct vm_fault *vmf)
goto out;
}
 
-   if (!vmf->cow_page) {
+em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 0);
+   if (em->block_start != EXTENT_MAP_HOLE)
+   dax_dev = fs_dax_get_by_bdev(em->bdev);
+
+   if (vmf->cow_page) {
+   sector_t sector;
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   clear_user_highpage(vmf->cow_page, vaddr);
+   goto out;
+   }
+   sector = (get_start_sect(em->bdev) << 9) + (em->block_start + 
(pos - em->start));
+   sector >>= 9;
+   ret = copy_user_dax(em->bdev, dax_dev, sector, PAGE_SIZE, 
vmf->cow_page, vaddr);
+   goto out;
+   } else {
sector_t sector;
-   struct extent_map *em;
-em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, PAGE_SIZE, 
0);
if (em->block_start == EXTENT_MAP_HOLE) {
ret = dax_load_hole(&xas, mapping, entry, vmf);
goto out;
}
sector = ((get_start_sect(em->bdev) << 9) +
  (em->block_start + (pos - em->start))) >> 9;
-   ret = dax_pfn(fs_dax_get_by_bdev(em->bdev), em->bdev, sector, 
PAGE_SIZE, &pfn);
+   ret = dax_pfn(dax_dev, em->bdev, sector, PAGE_SIZE, &pfn);
if (ret)
goto out;
dax_insert_entry(&xas, mapping, vmf, entry, pfn, 0, false);
-   ret = vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+   ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
}
 out:
if (entry)
-- 
2.16.4



[PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

These functions are required for btrfs dax support.

Signed-off-by: Goldwyn Rodrigues 
---
 fs/dax.c| 35 ---
 include/linux/dax.h | 16 
 2 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 9bcce89ea18e..4578640af631 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -244,7 +244,7 @@ static void put_unlocked_entry(struct xa_state *xas, void 
*entry)
  * dropped the xa_lock, so we know the xa_state is stale and must be reset
  * before use.
  */
-static void dax_unlock_entry(struct xa_state *xas, void *entry)
+void dax_unlock_entry(struct xa_state *xas, void *entry)
 {
void *old;
 
@@ -256,6 +256,7 @@ static void dax_unlock_entry(struct xa_state *xas, void 
*entry)
BUG_ON(!dax_is_locked(old));
dax_wake_entry(xas, entry, false);
 }
+EXPORT_SYMBOL(dax_unlock_entry);
 
 /*
  * Return: The entry stored at this location before it was locked.
@@ -448,7 +449,7 @@ void dax_unlock_mapping_entry(struct page *page)
  * a VM_FAULT code, encoded as an xarray internal entry.  The ERR_PTR values
  * overlap with xarray value entries.
  */
-static void *grab_mapping_entry(struct xa_state *xas,
+void *grab_mapping_entry(struct xa_state *xas,
struct address_space *mapping, unsigned long size_flag)
 {
unsigned long index = xas->xa_index;
@@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
xas_unlock_irq(xas);
return xa_mk_internal(VM_FAULT_FALLBACK);
 }
+EXPORT_SYMBOL(grab_mapping_entry);
 
 /**
  * dax_layout_busy_page - find first pinned page in @mapping
@@ -654,7 +656,7 @@ int dax_invalidate_mapping_entry_sync(struct address_space 
*mapping,
return __dax_invalidate_entry(mapping, index, false);
 }
 
-static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
+int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
sector_t sector, size_t size, struct page *to,
unsigned long vaddr)
 {
@@ -679,6 +681,7 @@ static int copy_user_dax(struct block_device *bdev, struct 
dax_device *dax_dev,
dax_read_unlock(id);
return 0;
 }
+EXPORT_SYMBOL(copy_user_dax);
 
 /*
  * By this point grab_mapping_entry() has ensured that we have a locked entry
@@ -687,7 +690,7 @@ static int copy_user_dax(struct block_device *bdev, struct 
dax_device *dax_dev,
  * already in the tree, we will skip the insertion and just dirty the PMD as
  * appropriate.
  */
-static void *dax_insert_entry(struct xa_state *xas,
+void *dax_insert_entry(struct xa_state *xas,
struct address_space *mapping, struct vm_fault *vmf,
void *entry, pfn_t pfn, unsigned long flags, bool dirty)
 {
@@ -736,6 +739,7 @@ static void *dax_insert_entry(struct xa_state *xas,
xas_unlock_irq(xas);
return entry;
 }
+EXPORT_SYMBOL(dax_insert_entry);
 
 static inline
 unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
@@ -962,19 +966,18 @@ static sector_t dax_iomap_sector(struct iomap *iomap, 
loff_t pos)
return (iomap->addr + (pos & PAGE_MASK) - iomap->offset) >> 9;
 }
 
-static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
-pfn_t *pfnp)
+int dax_pfn(struct dax_device *dax_dev, struct block_device *bdev,
+   const sector_t sector, size_t size, pfn_t *pfnp)
 {
-   const sector_t sector = dax_iomap_sector(iomap, pos);
pgoff_t pgoff;
int id, rc;
long length;
 
-   rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
+   rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
if (rc)
return rc;
id = dax_read_lock();
-   length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+   length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
   NULL, pfnp);
if (length < 0) {
rc = length;
@@ -993,6 +996,14 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
size_t size,
dax_read_unlock(id);
return rc;
 }
+EXPORT_SYMBOL(dax_pfn);
+
+static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
+pfn_t *pfnp)
+{
+   const sector_t sector = dax_iomap_sector(iomap, pos);
+   return dax_pfn(iomap->dax_dev, iomap->bdev, sector, size, pfnp);
+}
 
 /*
  * The user has performed a load from a hole in the file.  Allocating a new
@@ -1001,7 +1012,7 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, 
size_t size,
  * If this page is ever written to we will re-fault and change the mapping to
  * point to real DAX storage instead.
  */
-static vm_fault_t dax_load_hole(struct xa_state *xas,
+vm_fault_t dax_load_hole(struct xa_state *xas,
struct address_space *mapping, void **entry,
struct vm_fault *vmf)
 {
@@ -1017,6 +1028,7 @@ static vm_fault_t dax_load_hole(struct xa_state 

[PATCH 03/10] btrfs: dax: read zeros from holes

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/dax.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
index d614bf73bf8e..5a297674adec 100644
--- a/fs/btrfs/dax.c
+++ b/fs/btrfs/dax.c
@@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
iov_iter *to)
 
 BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
 
-ret = em_dax_rw(inode, em, pos, len, to);
+   if (em->block_start == EXTENT_MAP_HOLE) {
+   u64 zero_len = min(em->len - (em->start - pos), len);
+   ret = iov_iter_zero(zero_len, to);
+   } else {
+   ret = em_dax_rw(inode, em, pos, len, to);
+   }
 if (ret < 0)
 goto out;
 pos += ret;
-- 
2.16.4



[PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Goldwyn Rodrigues
This is support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs list. Even though I wish I could, I cannot get it correct the first
time around :/ Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because the btrfs kernel module does not get back control after
an mmap() write. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.

2. Currently, a user can continue writing to "old" extents of an mmapped file
after a snapshot has been created. How can we enforce writes to be directed
to new extents after snapshots have been created? Do we keep a list of
all mmap()s, and re-mmap them after a snapshot?

Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

 fs/btrfs/Makefile   |1 
 fs/btrfs/ctree.h|   17 ++
 fs/btrfs/dax.c  |  303 ++--
 fs/btrfs/file.c |   29 
 fs/btrfs/inode.c|   54 +
 fs/btrfs/ioctl.c|5 
 fs/btrfs/super.c|   15 ++
 fs/dax.c|   35 --
 include/linux/dax.h |   16 ++
 9 files changed, 430 insertions(+), 45 deletions(-)


-- 
Goldwyn



[PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Goldwyn Rodrigues
From: Goldwyn Rodrigues 

Also, set the inode->i_flags to S_DAX

Signed-off-by: Goldwyn Rodrigues 
---
 fs/btrfs/ctree.h |  1 +
 fs/btrfs/ioctl.c |  5 -
 fs/btrfs/super.c | 15 +++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 68f322f600a0..5cc470fa6a40 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
btrfs_fs_info *info)
 #define BTRFS_MOUNT_FREE_SPACE_TREE(1 << 26)
 #define BTRFS_MOUNT_NOLOGREPLAY(1 << 27)
 #define BTRFS_MOUNT_REF_VERIFY (1 << 28)
+#define BTRFS_MOUNT_DAX(1 << 29)
 
 #define BTRFS_DEFAULT_COMMIT_INTERVAL  (30)
 #define BTRFS_DEFAULT_MAX_INLINE   (2048)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 802a628e9f7d..e9146c157816 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode *inode)
if (binode->flags & BTRFS_INODE_DIRSYNC)
new_fl |= S_DIRSYNC;
 
+   if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && 
S_ISREG(inode->i_mode))
+   new_fl |= S_DAX;
+
set_mask_bits(&inode->i_flags,
- S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+ S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | 
S_DAX,
  new_fl);
 }
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 645fc81e2a94..035263b61cf5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -326,6 +326,7 @@ enum {
Opt_treelog, Opt_notreelog,
Opt_usebackuproot,
Opt_user_subvol_rm_allowed,
+   Opt_dax,
 
/* Deprecated options */
Opt_alloc_start,
@@ -393,6 +394,7 @@ static const match_table_t tokens = {
{Opt_notreelog, "notreelog"},
{Opt_usebackuproot, "usebackuproot"},
{Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
+   {Opt_dax, "dax"},
 
/* Deprecated options */
{Opt_alloc_start, "alloc_start=%s"},
@@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
*options,
case Opt_user_subvol_rm_allowed:
btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
break;
+#ifdef CONFIG_FS_DAX
+   case Opt_dax:
+   if (btrfs_super_num_devices(info->super_copy) > 1) {
+   btrfs_info(info,
+  "dax not supported for multi-device 
btrfs partition\n");
+   ret = -EOPNOTSUPP;
+   goto out;
+   }
+   btrfs_set_opt(info->mount_opt, DAX);
+   break;
+#endif
case Opt_enospc_debug:
btrfs_set_opt(info->mount_opt, ENOSPC_DEBUG);
break;
@@ -1329,6 +1342,8 @@ static int btrfs_show_options(struct seq_file *seq, 
struct dentry *dentry)
seq_puts(seq, ",clear_cache");
if (btrfs_test_opt(info, USER_SUBVOL_RM_ALLOWED))
seq_puts(seq, ",user_subvol_rm_allowed");
+   if (btrfs_test_opt(info, DAX))
+   seq_puts(seq, ",dax");
if (btrfs_test_opt(info, ENOSPC_DEBUG))
seq_puts(seq, ",enospc_debug");
if (btrfs_test_opt(info, AUTO_DEFRAG))
-- 
2.16.4


