Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-08 Thread David Sterba
On Sat, Dec 08, 2018 at 08:50:32AM +0800, Qu Wenruo wrote:
> > I've adapted a stress test that unpacks a large tarball, snapshots
> > every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> > files from the original subvolume, now enhanced with qgroups just for the
> > new snapshots inheriting the toplevel subvolume. Lockup.
> > 
> > It gets stuck in a snapshot call with the following stacktrace
> > 
> > [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> > [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
> 
> It looks like something is wrong in the original subtree tracing.

Yes, I ran the test on current master and it locked up too, so it's not
due to your patchset.

> Thanks for the report, I'll investigate it.

Thanks.


Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread Qu Wenruo


On 2018/12/8 8:47 AM, David Sterba wrote:
> On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
>>
>>
>> On 2018/12/7 3:35 AM, David Sterba wrote:
>>> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>>>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>>>> This patchset can be fetched from github:
>>>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>>>
>>>>> Which is based on v4.20-rc1.
>>>>
>>>> Thanks, I'll add it to for-next soon.
>>>
>>> The branch was there for some time but then not for at least a week (my
>>> mistake, I did not notice in time). I've rebased it on top of recent
>>> misc-next, but without the delayed refs patchset from Josef.
>>>
>>> At the moment I'm considering it for merge to 4.21, there's still some
>>> time to pull it out in case it shows up to be too problematic. I'm
>>> mostly worried about the unknown interactions with the enospc updates or
>>
>> For that part, I don't think it would cause any obvious problem for the
>> enospc updates.
>>
>> The main user-noticeable effect is the delay of reloc tree deletion.
>>
>> Apart from that, it's mostly transparent to extent allocation.
>>
>>> generally because of lack of qgroup and reloc code reviews.
>>
>> That's the biggest problem.
>>
>> However, most of the current qgroup + balance optimization is done inside
>> qgroup code (to skip certain qgroup records), so if we're going to hit a
>> problem, this patchset has the highest chance of exposing it.
>>
>> Later patches will mostly just keep tweaking qgroup code without
>> affecting any other parts.
>>
>> So I'm fine if you decide to pull it out for now.
> 
> I've adapted a stress test that unpacks a large tarball, snapshots
> every 20 seconds, deletes a random snapshot every 50 seconds, deletes
> files from the original subvolume, now enhanced with qgroups just for the
> new snapshots inheriting the toplevel subvolume. Lockup.
> 
> It gets stuck in a snapshot call with the following stacktrace
> 
> [<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
> [<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]

It looks like something is wrong in the original subtree tracing.

Thanks for the report, I'll investigate it.
Qu

> [<0>] do_walk_down+0x681/0xb20 [btrfs]
> [<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
> [<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
> [<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
> [<0>] cleaner_kthread+0xf8/0x170 [btrfs]
> [<0>] kthread+0x121/0x140
> [<0>] ret_from_fork+0x27/0x50
> 
> and that's around the 10th snapshot and ~3rd deletion. This is the qgroup show output:
> 
> qgroupid      rfer       excl  parent
> -------- ---------  ---------  ------
> 0/5      865.27MiB    1.66MiB  ---
> 0/257        0.00B      0.00B  ---
> 0/259        0.00B      0.00B  ---
> 0/260    806.58MiB  637.25MiB  ---
> 0/262        0.00B      0.00B  ---
> 0/263        0.00B      0.00B  ---
> 0/264        0.00B      0.00B  ---
> 0/265        0.00B      0.00B  ---
> 0/266        0.00B      0.00B  ---
> 0/267        0.00B      0.00B  ---
> 0/268        0.00B      0.00B  ---
> 0/269        0.00B      0.00B  ---
> 0/270    989.04MiB    1.22MiB  ---
> 0/271        0.00B      0.00B  ---
> 0/272    922.25MiB  416.00KiB  ---
> 0/273    931.02MiB    1.50MiB  ---
> 0/274    910.94MiB    1.52MiB  ---
> 1/1        1.64GiB    1.64GiB
> 0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274
> 
> No IO or CPU activity at this point; the stacktrace and the show output
> remain the same.
> 
> So, considering this, I'm not going to add the patchset to 4.21 but will
> keep it in for-next for testing; any fixups or updates will be applied.
> 





Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-07 Thread David Sterba
On Fri, Dec 07, 2018 at 06:51:21AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2018/12/7 3:35 AM, David Sterba wrote:
> > On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> >> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from github:
> >>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> >>>
> >>> Which is based on v4.20-rc1.
> >>
> >> Thanks, I'll add it to for-next soon.
> > 
> > The branch was there for some time but then not for at least a week (my
> > mistake, I did not notice in time). I've rebased it on top of recent
> > misc-next, but without the delayed refs patchset from Josef.
> > 
> > At the moment I'm considering it for merge to 4.21, there's still some
> > time to pull it out in case it shows up to be too problematic. I'm
> > mostly worried about the unknown interactions with the enospc updates or
> 
> For that part, I don't think it would cause any obvious problem for the
> enospc updates.
> 
> The main user-noticeable effect is the delay of reloc tree deletion.
> 
> Apart from that, it's mostly transparent to extent allocation.
> 
> > generally because of lack of qgroup and reloc code reviews.
> 
> That's the biggest problem.
> 
> However, most of the current qgroup + balance optimization is done inside
> qgroup code (to skip certain qgroup records), so if we're going to hit a
> problem, this patchset has the highest chance of exposing it.
> 
> Later patches will mostly just keep tweaking qgroup code without
> affecting any other parts.
> 
> So I'm fine if you decide to pull it out for now.

I've adapted a stress test that unpacks a large tarball, snapshots
every 20 seconds, deletes a random snapshot every 50 seconds, deletes
files from the original subvolume, now enhanced with qgroups just for the
new snapshots inheriting the toplevel subvolume. Lockup.

It gets stuck in a snapshot call with the following stacktrace

[<0>] btrfs_tree_read_lock+0xf3/0x150 [btrfs]
[<0>] btrfs_qgroup_trace_subtree+0x280/0x7b0 [btrfs]
[<0>] do_walk_down+0x681/0xb20 [btrfs]
[<0>] walk_down_tree+0xf5/0x1c0 [btrfs]
[<0>] btrfs_drop_snapshot+0x43b/0xb60 [btrfs]
[<0>] btrfs_clean_one_deleted_snapshot+0xc1/0x120 [btrfs]
[<0>] cleaner_kthread+0xf8/0x170 [btrfs]
[<0>] kthread+0x121/0x140
[<0>] ret_from_fork+0x27/0x50

and that's around the 10th snapshot and ~3rd deletion. This is the qgroup show output:

qgroupid      rfer       excl  parent
-------- ---------  ---------  ------
0/5      865.27MiB    1.66MiB  ---
0/257        0.00B      0.00B  ---
0/259        0.00B      0.00B  ---
0/260    806.58MiB  637.25MiB  ---
0/262        0.00B      0.00B  ---
0/263        0.00B      0.00B  ---
0/264        0.00B      0.00B  ---
0/265        0.00B      0.00B  ---
0/266        0.00B      0.00B  ---
0/267        0.00B      0.00B  ---
0/268        0.00B      0.00B  ---
0/269        0.00B      0.00B  ---
0/270    989.04MiB    1.22MiB  ---
0/271        0.00B      0.00B  ---
0/272    922.25MiB  416.00KiB  ---
0/273    931.02MiB    1.50MiB  ---
0/274    910.94MiB    1.52MiB  ---
1/1        1.64GiB    1.64GiB
0/5,0/257,0/259,0/260,0/262,0/263,0/264,0/265,0/266,0/267,0/268,0/269,0/270,0/271,0/272,0/273,0/274

No IO or CPU activity at this point; the stacktrace and the show output
remain the same.

So, considering this, I'm not going to add the patchset to 4.21 but will
keep it in for-next for testing; any fixups or updates will be applied.
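For readers who want to reproduce something similar, the stress test described above could be sketched roughly like this. This is a dry-run sketch, not the script actually used: the mount point, tarball path, snapshot naming, and the loop count are all assumptions, and the 20s/50s pacing is omitted.

```shell
#!/bin/bash
# Dry-run sketch of the stress test: unpack a tarball, snapshot the
# toplevel subvolume with qgroup inheritance, delete a random snapshot,
# delete files from the original subvolume. All paths are hypothetical.
MNT=${MNT:-/mnt/btrfs}              # assumed btrfs mount point
TARBALL=${TARBALL:-/tmp/linux.tar.xz}
RUN=${RUN:-0}                       # set RUN=1 to actually execute

do_cmd() {
    # Echo instead of executing unless RUN=1, so the sketch is harmless.
    if [ "$RUN" = 1 ]; then "$@"; else echo "would run: $*"; fi
}

do_cmd btrfs quota enable "$MNT"
do_cmd btrfs qgroup create 1/1 "$MNT"

i=0
while [ "$i" -lt 3 ]; do            # a real run loops until the lockup hits,
    i=$((i + 1))                    # with sleep 20 / sleep 50 pacing
    do_cmd tar -xf "$TARBALL" -C "$MNT"
    # snapshot the toplevel subvolume, inheriting qgroup 1/1
    do_cmd btrfs subvolume snapshot -i 1/1 "$MNT" "$MNT/snap-$i"
    # delete a random earlier snapshot
    do_cmd btrfs subvolume delete "$MNT/snap-$((RANDOM % i + 1))"
    # delete files from the original subvolume
    do_cmd rm -rf "$MNT/linux"
done
```

The `-i <qgroupid>` flag on `btrfs subvolume snapshot` is what makes each new snapshot inherit into the 1/1 qgroup shown in the qgroup show output above.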


Re: [PATCH] libbtrfsutil: fix unprivileged tests if kernel lacks support

2018-12-07 Thread David Sterba
On Thu, Dec 06, 2018 at 04:29:32PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> I apparently didn't test this on a pre-4.18 kernel.
> test_subvolume_info_unprivileged() checks for an ENOTTY, but this
> doesn't seem to work correctly with subTest().
> test_subvolume_iterator_unprivileged() doesn't have a check at all. Add
> an explicit check to both before doing the actual test.
> 
> Signed-off-by: Omar Sandoval 

Applied, thanks.


Re: [PATCH 2/8] btrfs: extent-tree: Open-code process_func in __btrfs_mod_ref

2018-12-07 Thread Nikolay Borisov



On 6.12.18, 8:58, Qu Wenruo wrote:
> The process_func is never a function hook used anywhere else.
> 
> Open code it to make later delayed ref refactor easier, so we can
> refactor btrfs_inc_extent_ref() and btrfs_free_extent() in different
> patches.
> 
> Signed-off-by: Qu Wenruo 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 33 ++---
>  1 file changed, 18 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index ea2c3d5220f0..ea68d288d761 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3220,10 +3220,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>   int i;
>   int level;
>   int ret = 0;
> - int (*process_func)(struct btrfs_trans_handle *,
> - struct btrfs_root *,
> - u64, u64, u64, u64, u64, u64, bool);
> -
>  
>   if (btrfs_is_testing(fs_info))
>   return 0;
> @@ -3235,11 +3231,6 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>   if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state) && level == 0)
>   return 0;
>  
> - if (inc)
> - process_func = btrfs_inc_extent_ref;
> - else
> - process_func = btrfs_free_extent;
> -
>   if (full_backref)
>   parent = buf->start;
>   else
> @@ -3261,17 +3252,29 @@ static int __btrfs_mod_ref(struct btrfs_trans_handle *trans,
>  
>   num_bytes = btrfs_file_extent_disk_num_bytes(buf, fi);
>   key.offset -= btrfs_file_extent_offset(buf, fi);
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, key.objectid,
> -key.offset, for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + key.objectid, key.offset,
> + for_reloc);
>   if (ret)
>   goto fail;
>   } else {
>   bytenr = btrfs_node_blockptr(buf, i);
>   num_bytes = fs_info->nodesize;
> - ret = process_func(trans, root, bytenr, num_bytes,
> -parent, ref_root, level - 1, 0,
> -for_reloc);
> + if (inc)
> + ret = btrfs_inc_extent_ref(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
> + else
> + ret = btrfs_free_extent(trans, root, bytenr,
> + num_bytes, parent, ref_root,
> + level - 1, 0, for_reloc);
>   if (ret)
>   goto fail;
>   }
> 


Re: System unable to mount partition after a power loss

2018-12-07 Thread Doni Crosby
I ran that command, but I cannot get the email to send properly to the
mailing list, as the attachment with the output is over 4.6M.


On 12/7/2018 11:49 AM, Doni Crosby wrote:

The output of the command is attached. These are the errors that showed
up on the system:
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
Ignoring transid failure
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
Ignoring transid failure
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
parent transid verify failed on 3563231412224 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
parent transid verify failed on 3563231461376 wanted 5184691 found 5183325
Ignoring transid failure
WARNING: eb corrupted: parent bytenr 31801344 slot 132 level 1 child
bytenr 3563231461376 level has 1 expect 0, skipping the slot
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
parent transid verify failed on 3563231494144 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
parent transid verify failed on 3563231526912 wanted 5184691 found 5183325
Ignoring transid failure
parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
parent transid verify failed on 3563229626368 wanted 5184691 found 5184689
parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
parent transid verify failed on 3563229937664 wanted 5184691 found 5184689
parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
parent transid verify failed on 3563226857472 wanted 5184691 found 5184689
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
parent transid verify failed on 3563230674944 wanted 5184691 found 5183325
Ignoring transid failure
On Fri, Dec 7, 2018 at 2:22 AM Qu Wenruo  wrote:




On 2018/12/7 1:24 PM, Doni Crosby wrote:

All,

I'm coming to you to see if there is a way to fix or at least recover
most of the data I have from a btrfs filesystem. The system went down
after both a breaker and the battery backup failed. I cannot currently
mount the system, with the following error from dmesg:

Note: vda1 is just the entire disk being passed from the VM host to the
VM; it's not an actual true virtual block device.

[ 499.704398] BTRFS info (device vda1): disk space caching is enabled
[  499.704401] BTRFS info (device vda1): has skinny extents
[  499.739522] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327


Transid mismatch normally means the fs is screwed up more or less.

And according to your mount failure, it looks like the fs got screwed up badly.

What's the kernel version used in the VM?
I don't really think the VM is always using the latest kernel.


[  499.740257] BTRFS error (device vda1): parent transid verify failed
on 3563231428608 wanted 5184691 found 5183327

Re: System unable to mount partition after a power loss

2018-12-07 Thread Doni Crosby
I just looked at the VM; it does not have a cache. That's the default
in Proxmox to improve performance.
On Fri, Dec 7, 2018 at 7:25 AM Austin S. Hemmelgarn
 wrote:
>
> On 2018-12-07 01:43, Doni Crosby wrote:
> >> This is qemu-kvm? What's the cache mode being used? It's possible the
> >> usual write guarantees are thwarted by VM caching.
> > Yes, it is a Proxmox host running the system, so it is a qemu VM; I'm
> > unsure about the caching situation.
> On the note of QEMU and the cache mode, the only cache mode I've seen to
> actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe',
> but that causes problems for most filesystems, so it's probably not the
> issue here.
>
> OTOH, I've seen issues with most of the cache modes other than
> 'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as
> the back-end storage on the host system, and most of the time such
> issues will manifest as both problems with the volume inside the VM
> _and_ the volume the disk images are being stored on.


Re: [PATCH 05/10] btrfs: introduce delayed_refs_rsv

2018-12-07 Thread Nikolay Borisov



On 3.12.18, 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> Traditionally we've had voodoo in btrfs to account for the space that
> delayed refs may take up by having a global_block_rsv.  This works most
> of the time, except when it doesn't.  We've had issues reported and seen
> in production where sometimes the global reserve is exhausted during
> transaction commit before we can run all of our delayed refs, resulting
> in an aborted transaction.  Because of this voodoo we have equally
> dubious flushing semantics around throttling delayed refs which we often
> get wrong.
> 
> So instead give them their own block_rsv.  This way we can always know
> exactly how much outstanding space we need for delayed refs.  This
> allows us to make sure we are constantly filling that reservation up
> with space, and allows us to put more precise pressure on the enospc
> system.  Instead of doing math to see if its a good time to throttle,
> the normal enospc code will be invoked if we have a lot of delayed refs
> pending, and they will be run via the normal flushing mechanism.
> 
> For now the delayed_refs_rsv will hold the reservations for the delayed
> refs, the block group updates, and deleting csums.  We could have a
> separate rsv for the block group updates, but the csum deletion stuff is
> still handled via the delayed_refs so that will stay there.


I see one difference in the way that the space is managed. Essentially,
for the delayed refs rsv you'll only ever be increasing ->size, and
->reserved only when you have to refill. This is opposite to the way
other metadata space is managed, i.e. by using use_block_rsv, which
subtracts ->reserved every time a block has to be CoW'ed. Why this
difference?


> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/ctree.h   |  14 +++-
>  fs/btrfs/delayed-ref.c |  43 --
>  fs/btrfs/disk-io.c |   4 +
>  fs/btrfs/extent-tree.c | 212 +
>  fs/btrfs/transaction.c |  37 -
>  5 files changed, 284 insertions(+), 26 deletions(-)
> 



> +/**
> + * btrfs_migrate_to_delayed_refs_rsv - transfer bytes to our delayed refs rsv.
> + * @fs_info - the fs info for our fs.
> + * @src - the source block rsv to transfer from.
> + * @num_bytes - the number of bytes to transfer.
> + *
> + * This transfers up to the num_bytes amount from the src rsv to the
> + * delayed_refs_rsv.  Any extra bytes are returned to the space info.
> + */
> +void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
> +struct btrfs_block_rsv *src,
> +u64 num_bytes)

This function is currently used only during transaction start, it seems
to be rather specific to the delayed refs so I'd suggest making it
private to transaction.c

> +{
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> + u64 to_free = 0;
> +
> + spin_lock(&src->lock);
> + src->reserved -= num_bytes;
> + src->size -= num_bytes;
> + spin_unlock(&src->lock);
> +
> + spin_lock(&delayed_refs_rsv->lock);
> + if (delayed_refs_rsv->size > delayed_refs_rsv->reserved) {
> + u64 delta = delayed_refs_rsv->size -
> + delayed_refs_rsv->reserved;
> + if (num_bytes > delta) {
> + to_free = num_bytes - delta;
> + num_bytes = delta;
> + }
> + } else {
> + to_free = num_bytes;
> + num_bytes = 0;
> + }
> +
> + if (num_bytes)
> + delayed_refs_rsv->reserved += num_bytes;
> + if (delayed_refs_rsv->reserved >= delayed_refs_rsv->size)
> + delayed_refs_rsv->full = 1;
> + spin_unlock(&delayed_refs_rsv->lock);
> +
> + if (num_bytes)
> + trace_btrfs_space_reservation(fs_info, "delayed_refs_rsv",
> +   0, num_bytes, 1);
> + if (to_free)
> + space_info_add_old_bytes(fs_info, delayed_refs_rsv->space_info,
> +  to_free);
> +}
> +
> +/**
> + * btrfs_delayed_refs_rsv_refill - refill based on our delayed refs usage.
> + * @fs_info - the fs_info for our fs.
> + * @flush - control how we can flush for this reservation.
> + *
> + * This will refill the delayed block_rsv up to 1 items size worth of space and
> + * will return -ENOSPC if we can't make the reservation.
> + */
> +int btrfs_delayed_refs_rsv_refill(struct btrfs_fs_info *fs_info,
> +   enum btrfs_reserve_flush_enum flush)
> +{
> + struct btrfs_block_rsv *block_rsv = &fs_info->delayed_refs_rsv;
> + u64 limit = btrfs_calc_trans_metadata_size(fs_info, 1);
> + u64 num_bytes = 0;
> + int ret = -ENOSPC;
> +
> + spin_lock(&block_rsv->lock);
> + if (block_rsv->reserved < block_rsv->size) {
> + num_bytes = block_rsv->size - block_rsv->reserved;
> + num_bytes = min(num_bytes, limit);
> + }
> +

Re: [PATCH 04/10] btrfs: only track ref_heads in delayed_ref_updates

2018-12-07 Thread Nikolay Borisov



On 3.12.18, 17:20, Josef Bacik wrote:
> From: Josef Bacik 
> 
> We use this number to figure out how many delayed refs to run, but
> __btrfs_run_delayed_refs really only checks every time we need a new
> delayed ref head, so we always run at least one ref head completely no
> matter the number of items on it.  Fix the accounting to only be
> adjusted when we add/remove a ref head.

David,

I think the changelog also warrants a forward-looking sentence stating
that the number is also going to be used to calculate the required
number of bytes in the delayed refs rsv. Something along the lines of:

In addition to using this number to limit the number of delayed refs
run, a future patch is also going to use it to calculate the amount of
space required for delayed refs space reservation.

> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/delayed-ref.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index b3e4c9fcb664..48725fa757a3 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -251,8 +251,6 @@ static inline void drop_delayed_ref(struct btrfs_trans_handle *trans,
>   ref->in_tree = 0;
>   btrfs_put_delayed_ref(ref);
>   atomic_dec(&delayed_refs->num_entries);
> - if (trans->delayed_ref_updates)
> - trans->delayed_ref_updates--;
>  }
>  
>  static bool merge_ref(struct btrfs_trans_handle *trans,
> @@ -467,7 +465,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle *trans,
>   if (ref->action == BTRFS_ADD_DELAYED_REF)
>   list_add_tail(&ref->add_list, &href->ref_add_list);
>   atomic_inc(&root->num_entries);
> - trans->delayed_ref_updates++;
>   spin_unlock(&href->lock);
>   return ret;
>  }
> 


Re: System unable to mount partition after a power loss

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-07 01:43, Doni Crosby wrote:

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.

Yes, it is a Proxmox host running the system, so it is a qemu VM; I'm
unsure about the caching situation.
On the note of QEMU and the cache mode, the only cache mode I've seen to 
actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe', 
but that causes problems for most filesystems, so it's probably not the 
issue here.


OTOH, I've seen issues with most of the cache modes other than 
'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as 
the back-end storage on the host system, and most of the time such 
issues will manifest as both problems with the volume inside the VM 
_and_ the volume the disk images are being stored on.
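For reference, the cache mode discussed above is set per-drive on the qemu command line. A minimal sketch follows; the disk path is hypothetical, and the mode names come from qemu's documented set (none, writeback, writethrough, directsync, unsafe). The script only assembles and prints the invocation rather than launching a VM.

```shell
#!/bin/bash
# Build a qemu invocation that pins the drive cache mode explicitly,
# instead of relying on the management layer's default. Path hypothetical.
CACHE=writethrough    # one of the modes Austin calls out as safe host-side
DRIVE="file=/var/lib/images/guest.img,format=raw,cache=$CACHE"
CMD="qemu-system-x86_64 -m 2048 -drive $DRIVE"
echo "$CMD"
```

In Proxmox the same knob is, to my recollection, exposed as the per-disk `cache=` option in the VM's configuration, which is what the "it does not have a cache" default above refers to.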


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-06 23:09, Andrei Borzenkov wrote:

On 06.12.2018 16:04, Austin S. Hemmelgarn wrote:


* On SCSI devices, a discard operation translates to a SCSI UNMAP
command.  As pointed out by Ronnie Sahlberg in his reply, this command
is purely advisory, may not result in any actual state change on the
target device, and is not guaranteed to wipe the data.  To actually wipe
things, you have to explicitly write bogus data to the given regions
(using either regular writes, or a WRITESAME command with the desired
pattern), and _then_ call UNMAP on them.


The WRITE SAME command has an UNMAP bit, and depending on the device and
kernel version the kernel may actually issue either UNMAP or WRITE SAME
with the UNMAP bit set when doing a discard.

Good to know.  I've not looked at the SCSI code much, and actually 
didn't know about the UNMAP bit for the WRITE SAME command, so I just 
assumed that the kernel only used the UNMAP command.
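On Linux, the "write zeroes, then discard" sequence described above can be approximated from userspace with util-linux's blkdiscard, which (if I recall its options correctly) has a `--zeroout` mode that zero-fills instead of discarding. The sketch below only assembles the command line, since running it against a real device would destroy data; the device path is hypothetical.

```shell
#!/bin/bash
# Sketch: zero-fill a device's blocks, then issue the discard, matching
# the wipe-then-UNMAP sequence discussed above. DESTRUCTIVE if executed.
DEV=/dev/sdX                      # hypothetical target device
CMDS="blkdiscard --zeroout $DEV && blkdiscard $DEV"
echo "$CMDS"                      # print instead of executing
```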


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-07 Thread Nikolay Borisov



On 7.12.18, 9:09, Nikolay Borisov wrote:
> 
> 
> On 6.12.18, 19:54, David Sterba wrote:
>> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>>
>>>
>>> On 3.12.18, 17:20, Josef Bacik wrote:
>>>> Now with the delayed_refs_rsv we can know exactly how much pending
>>>> delayed refs space we need.  This means we can drastically simplify
>>>
>>> IMO it will be helpful if there is a sentence here referring back to
>>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>>> But I guess this is something David can also do.
>>
>> I'll update the changelog, but I'm not sure what exactly you want to see
>> there, please post the replacement text. Thanks.
> 
> With the introduction of the delayed_refs_rsv infrastructure, namely
> btrfs_update_delayed_refs_rsv, we now know exactly how much pending
> delayed refs space is required.

To put things into context as to why I deem this change beneficial:
migrating a reservation from the transaction to the delayed refs rsv
modifies both size and reserved, so they will be equal. Calling
btrfs_update_delayed_refs_rsv actually increases ->size and doesn't
really decrement ->reserved. Also, we never do
btrfs_block_rsv_migrate/use_block_rsv on the delayed refs block rsv, so
managing the ->reserved value for the delayed refs rsv is different than
for the rest of the block rsvs.


> 
>>
>>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>>> know it's time to commit the transaction and stop any more delayed refs
>>>> from being generated.
>>>>
>>>> Signed-off-by: Josef Bacik 
>>
> 


Re: HELP unmountable partition after btrfs balance to RAID0

2018-12-07 Thread Duncan
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted:

> We wanted to convert a file system to a RAID0 with two partitions.
> Unfortunately we had to reboot the server during the balance operation
> before it could complete.
> 
> Now following happens:
> 
> A mount attempt of the array fails with the following error code:
> 
> btrfs recover yields roughly 1.6 out of 4 TB.

[Just another btrfs user and list regular, not a dev.  A dev may reply to 
your specific case, but meanwhile, for next time...]

That shouldn't be a problem.  Because with raid0 a failure of any of the 
components will take down the entire raid, making it less reliable than a 
single device, raid0 (in general, not just btrfs) is considered only 
useful for data of low enough value that its loss is no big deal, either 
because it's truly of little value (internet cache being a good example), 
or because backups are kept available and updated for whenever the raid0 
array fails.  Because with raid0, it's always a question of when it'll 
fail, not if.

So loss of a filesystem being converted to raid0 isn't a problem, because 
the data on it, by virtue of being in the process of conversion to raid0, 
is defined as of throw-away value in any case.  If it's of higher value 
than that, it's not going to be raid0 (or in the process of conversion to 
it) in the first place.

Of course that's simply an extension of the more general first sysadmin's 
rule of backups, that the true value of data is defined not by arbitrary 
claims, but by the number of backups of that data it's worth having.  
Because "things happen", whether it's fat-fingering, bad hardware, buggy 
software, or simply someone tripping over the power cable or running into 
the power pole outside at the wrong time.

So no backup is simply defining the data as worth less than the time/
trouble/resources necessary to make that backup.

Note that you ALWAYS save what was of most value to you, either the time/
trouble/resources to do the backup, if your actions defined that to be of 
more value than the data, or the data, if you had that backup, thereby 
defining the value of the data to be worth backing up.

Similarly, failure of the only backup isn't a problem because by virtue 
of there being only that one backup, the data is defined as not worth 
having more than one, and likewise, having an outdated backup isn't a 
problem, because that's simply the special case of defining the data in 
the delta between the backup time and the present as not (yet) worth the 
time/hassle/resources to make/refresh that backup.

(And FWIW, the second sysadmin's rule of backups is that it's not a 
backup until you've successfully tested it recoverable in the same sort 
of conditions you're likely to need to recover it in.  Because so many 
people have /thought/ they had backups, that turned out not to be, 
because they never tested that they could actually recover the data from 
them.  For instance, if the backup tools you'll need to recover the 
backup are on the backup itself, how do you get to them?  Can you create 
a filesystem for the new copy of the data and recover it from the backup 
with just the tools and documentation available from your emergency boot 
media?  Untested backup == no backup, or at best, backup still in 
process!)
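The "test your backup" rule above can be illustrated with a trivially small sketch: restore the backup somewhere else and diff it against the source. The paths here are throwaway temp dirs and the tar-based backup is just an illustration, not a recommendation of any particular tool.

```shell
#!/bin/bash
# Minimal illustration of "untested backup == no backup": take a backup,
# restore it to a separate location, verify the restore matches.
src=$(mktemp -d); backup=$(mktemp -d); restore=$(mktemp -d)
echo "important data" > "$src/file.txt"

tar -C "$src" -cf "$backup/backup.tar" .     # take the backup
tar -C "$restore" -xf "$backup/backup.tar"   # recover it elsewhere

# only after this check succeeds does it count as a backup
diff -r "$src" "$restore" && echo "backup verified"
```

A real-world version would of course restore with only the tools available from the emergency boot media, per the parenthetical above.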

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: System unable to mount partition after a power loss

2018-12-06 Thread Qu Wenruo


On 2018/12/7 1:24 PM, Doni Crosby wrote:
> All,
> 
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
> 
> Note: vda1 is just the entire disk being passed from the VM host to the
> VM; it's not an actual true virtual block device.
> 
> [ 499.704398] BTRFS info (device vda1): disk space caching is enabled
> [  499.704401] BTRFS info (device vda1): has skinny extents
> [  499.739522] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327

Transid mismatch normally means the fs is screwed up more or less.

And according to your mount failure, it looks like the fs got screwed up badly.

What's the kernel version used in the VM?
I don't really think the VM is always using the latest kernel.

> [  499.740257] BTRFS error (device vda1): parent transid verify failed
> on 3563231428608 wanted 5184691 found 5183327
> [  499.770847] BTRFS error (device vda1): open_ctree failed
> 
> I have tried running btrfsck:
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
> parent transid verify failed on 3563229593600 wanted 5184691 found 5184689

According to your later dump-super output, it looks quite possible that
the corrupted extents all belong to the extent tree.

So it's still possible that your fs tree and other essential trees are OK.

Please dump the following output (with its stderr) to further confirm
the damage.
# btrfs ins dump-tree -b 31801344 --follow /dev/vda1

If your objective is only to recover data, then you could start by trying
btrfs-restore.
It's pretty hard to fix such a heavily damaged extent tree.

Thanks,
Qu
> Ignoring transid failure
> Checking filesystem on /dev/vda1
> UUID: 7c76bb05-b3dc-4804-bf56-88d010a214c6
> checking extents
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563224842240 wanted 5184691 found 5184689
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563222974464 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563223121920 wanted 5184691 found 5184688
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> parent transid verify failed on 3563229970432 wanted 5184691 found 5184689
> Ignoring transid failure
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> parent transid verify failed on 3563231428608 wanted 5184691 found 5183327
> Ignoring transid failure
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> parent transid verify failed on 3563231444992 wanted 5184691 found 5183325
> Ignoring transid failure
> parent transid verify 

Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 6.12.18 19:54, David Sterba wrote:
> On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
>>
>>
>> On 3.12.18 17:20, Josef Bacik wrote:
>>> Now with the delayed_refs_rsv we can now know exactly how much pending
>>> delayed refs space we need.  This means we can drastically simplify
>>
>> IMO it will be helpful if there is a sentence here referring back to
>> btrfs_update_delayed_refs_rsv to put your first sentence into context.
>> But I guess this is something David can also do.
> 
> I'll update the changelog, but I'm not sure what exactly you want to see
> there, please post the replacement text. Thanks.

With the introduction of the delayed_refs_rsv infrastructure, namely
btrfs_update_delayed_refs_rsv, we now know exactly how much pending
delayed refs space is required.

> 
>>> btrfs_check_space_for_delayed_refs by simply checking how much space we
>>> have reserved for the global rsv (which acts as a spill over buffer) and
>>> the delayed refs rsv.  If our total size is beyond that amount then we
>>> know it's time to commit the transaction and stop any more delayed refs
>>> from being generated.
>>>
>>> Signed-off-by: Josef Bacik 
> 
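
Schematically, the check described in the quoted paragraph reduces to
comparing the space the two reserves want against what is actually reserved.
A rough Python sketch under that reading (names and numbers are illustrative,
not the kernel's):

```python
def check_space_for_delayed_refs(global_rsv_reserved,
                                 global_rsv_size,
                                 delayed_refs_rsv_size):
    """Return True while it is still safe to generate more delayed refs.

    Illustrative only: if the combined size the global rsv (the spill-over
    buffer) and the delayed refs rsv want exceeds what is actually held,
    it is time to commit the transaction.
    """
    wanted = global_rsv_size + delayed_refs_rsv_size
    return global_rsv_reserved >= wanted

MiB = 1 << 20
print(check_space_for_delayed_refs(64 * MiB, 16 * MiB, 8 * MiB))  # True
print(check_space_for_delayed_refs(16 * MiB, 16 * MiB, 8 * MiB))  # False
```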


Re: System unable to mount partition after a power loss

2018-12-06 Thread Doni Crosby
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.
Yes, it is a Proxmox host running the system, so it is a QEMU VM; I'm
unsure about the caching situation.

> Old version of progs, I suggest upgrading to 4.17.1 and run
I updated the progs to 4.17 and ran the following

btrfs insp dump-s -f /device/:
See attachment

btrfs rescue super -v /device/ (insp rescue super wasn't valid)
All Devices:
Device: id = 1, name = /dev/vda1

Before Recovering:
[All good supers]:
device name = /dev/vda1
superblock bytenr = 65536

device name = /dev/vda1
superblock bytenr = 67108864

device name = /dev/vda1
superblock bytenr = 274877906944

[All bad supers]:

All supers are valid, no need to recover

btrfs check --mode=lowmem /dev/vda1:
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563224121344 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563221630976 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563223138304 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563224072192 wanted 5184691 found 5184688
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563225268224 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563227398144 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
parent transid verify failed on 3563229593600 wanted 5184691 found 5184689
Ignoring transid failure
ERROR: child eb corrupted: parent bytenr=3563210342400 item=120 parent
level=1 child level=1
ERROR: cannot open file system

mount -o ro,norecovery,usebackuproot /dev/vda1 /mnt:
Same dmesg output as before.
On Fri, Dec 7, 2018 at 12:56 AM Chris Murphy  wrote:
>
> On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby  wrote:
> >
> > All,
> >
> > I'm coming to you to see if there is a way to fix or at least recover
> > most of the data I have from a btrfs filesystem. The system went down
> > after both a breaker and the battery backup failed. I cannot currently
> > mount the system, with the following error from dmesg:
> >
> > Note: The vda1 is just the entire disk being passed from the VM host
> > to the VM it's not an actual true virtual block device
>
> This is qemu-kvm? What's the cache mode being used? It's possible the
> usual write guarantees are thwarted by VM caching.
>
>
>
> > btrfs check --recover also ends in a segmentation fault
>
> I'm not familiar with --recover option, the --repair option is flagged
> with a warning in the man page.
>Warning
>Do not use --repair unless you are advised to do so by a
> developer or an experienced user,
>
>
> > btrfs --version:
> > btrfs-progs v4.7.3
>
> Old version of progs, I suggest upgrading to 4.17.1 and run
>
> btrfs insp dump-s -f /device/
> btrfs insp rescue super -v /device/
> btrfs check --mode=lowmem /device/
>
> These are all read only commands. Please post output to the list,
> hopefully a developer will get around to looking at it.
>
> It is safe to try:
>
> mount -o ro,norecovery,usebackuproot /device/ /mnt/
>
> If that works, I suggest updating your backup while it's still
> possible in the meantime.
>
>
> --
> Chris Murphy
superblock: bytenr=65536, device=/dev/vda1
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0xbfa6fd72 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    7c76bb05-b3dc-4804-bf56-88d010a214c6
label                   Array
generation              5184693
root                    31801344
sys_array_size          226
chunk_root_generation   5183734
root_level              1
chunk_root              20971520
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             32003947737088
bytes_used              6652776640512
sectorsize              4096
nodesize                16384
leafsize (deprecated)   16384
stripesize              4096
root_dir                6
num_devices             1
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x161
                        ( MIXED_BACKREF |

Re: System unable to mount partition after a power loss

2018-12-06 Thread Chris Murphy
On Thu, Dec 6, 2018 at 10:24 PM Doni Crosby  wrote:
>
> All,
>
> I'm coming to you to see if there is a way to fix or at least recover
> most of the data I have from a btrfs filesystem. The system went down
> after both a breaker and the battery backup failed. I cannot currently
> mount the system, with the following error from dmesg:
>
> Note: The vda1 is just the entire disk being passed from the VM host
> to the VM it's not an actual true virtual block device

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.



> btrfs check --recover also ends in a segmentation fault

I'm not familiar with --recover option, the --repair option is flagged
with a warning in the man page.
   Warning
   Do not use --repair unless you are advised to do so by a
developer or an experienced user,


> btrfs --version:
> btrfs-progs v4.7.3

Old version of progs, I suggest upgrading to 4.17.1 and run

btrfs insp dump-s -f /device/
btrfs insp rescue super -v /device/
btrfs check --mode=lowmem /device/

These are all read only commands. Please post output to the list,
hopefully a developer will get around to looking at it.

It is safe to try:

mount -o ro,norecovery,usebackuproot /device/ /mnt/

If that works, I suggest updating your backup while it's still
possible in the meantime.


-- 
Chris Murphy


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Andrei Borzenkov
On 06.12.2018 16:04, Austin S. Hemmelgarn wrote:
> 
> * On SCSI devices, a discard operation translates to a SCSI UNMAP
> command.  As pointed out by Ronnie Sahlberg in his reply, this command
> is purely advisory, may not result in any actual state change on the
> target device, and is not guaranteed to wipe the data.  To actually wipe
> things, you have to explicitly write bogus data to the given regions
> (using either regular writes, or a WRITESAME command with the desired
> pattern), and _then_ call UNMAP on them.

The WRITE SAME command has an UNMAP bit, and depending on the device and
kernel version, the kernel may actually issue either UNMAP or WRITE SAME
with the UNMAP bit set when doing a discard.
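
The ordering Austin describes — write the bogus data first, then issue the
advisory UNMAP — can be illustrated on a toy in-memory "device"; the point is
that only the explicit write carries a wiping guarantee (hypothetical helper,
not a kernel API):

```python
SECTOR = 512

def wipe_then_unmap(device: bytearray, offset: int, length: int) -> None:
    """Emulate 'write zeroes, then advisory UNMAP' on a toy device.

    The zero write is the only step with a wiping guarantee; a real SCSI
    UNMAP issued afterwards is purely advisory and may change nothing.
    """
    device[offset:offset + length] = bytes(length)  # the guaranteed wipe
    # ... UNMAP would be issued here; treat it as a possible no-op.

disk = bytearray(b"\xff" * (4 * SECTOR))
wipe_then_unmap(disk, SECTOR, SECTOR)
print(disk[SECTOR:SECTOR + 4], disk[:4])  # wiped region vs. untouched region
```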



Re: [PATCH RESEND 0/8] btrfs-progs: sub: Relax the privileges of "subvolume list/show"

2018-12-06 Thread Omar Sandoval
On Tue, Nov 27, 2018 at 02:24:41PM +0900, Misono Tomohiro wrote:
> Hello,
> 
> This is basically the resend of
>   "[PATCH v2 00/20] btrfs-progs: Rework of "subvolume list/show" and relax
>   the root privileges of them" [1]
> which I submitted in June. The aim of this series is to allow a
> non-privileged user to use basic subvolume functionality
> (create/list/snapshot/delete; this series allows "list").
> 
> They were once in the devel branch with some whitespace/comment
> modifications by David. I rebased them onto the current devel branch.
> 
> github: https://github.com/t-msn/btrfs-progs/tree/rework-sub-list
> 
> Basic logic/code is the same as before. Some differences are:
>  - Use latest libbtrfsutil from Omar [2] (thus drop the first part of the
>    patches). As a result, "sub list" cannot accept an ordinary directory
>    to be specified (which was allowed in the previous version)
>  - Drop patches which add new options to "sub list"
>  - Use 'nobody' as the non-privileged test user, just like the
>    libbtrfsutil tests
>  - Update comments
> 
> Importantly, in order to make output consistent for both root and
> non-privileged users, this changes the behavior of "subvolume list":
>  - (default) Only list subvolumes under the specified path.
>    The path needs to be a subvolume.
>  - The (-a) filter is dropped, i.e. its output is the same as the
>    default behavior of "sub list" in progs <= 4.19
> 
> Therefore, existing scripts may need to be updated to add the -a option
> (I believe nobody uses the current -a option).
> If anyone thinks this is not good, please let me know.

I think there are a few options in the case that the path isn't a
subvolume:

1. List all subvolumes in the filesystem with randomly mangled paths,
   which is what we currently do.
2. Error out, which is what this version of the series does.
3. List all subvolumes under the containing subvolume, which is what the
   previous version does.
4. List all subvolumes under the containing subvolume that are
   underneath the given path.

Option 1 won't work well for unprivileged users. Option 2 (this series)
is definitely going to break people's workflows/scripts. Option 3 is
unintuitive. In my opinion, option 4 is the nicest, but it may also
break scripts that expect all subvolumes to be printed.

There's also an option 5, which is to keep the behavior the same for
root (like what my previous patch [1] did) and implement option 4 for
unprivileged users.

I think 4 and 5 are the two main choices: do we want to preserve
backwards compatibility as carefully as possible (at the cost of
consistency), or do we want to risk it and improve the interface?

1: 
https://github.com/osandov/btrfs-progs/commit/fb61c21aeb998b12c1d02532639083d7f40c41e0
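
Mechanically, option 4 reduces to a path-prefix filter over the subvolume
paths: keep only subvolumes at or below the given path. A sketch of that rule
in plain path logic (hypothetical helper, not the libbtrfsutil API):

```python
import posixpath

def under_path(subvol_path, given_path):
    """True if subvol_path lies at or below given_path (POSIX path logic)."""
    subvol = posixpath.normpath(subvol_path)
    given = posixpath.normpath(given_path)
    return subvol == given or subvol.startswith(given + "/")

subvols = ["/data", "/data/snaps/a", "/data/snaps/b", "/other"]
print([s for s in subvols if under_path(s, "/data/snaps")])
# -> ['/data/snaps/a', '/data/snaps/b']
```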


Re: BTRFS RAID filesystem unmountable

2018-12-06 Thread Qu Wenruo


On 2018/12/7 7:15 AM, Michael Wade wrote:
> Hi Qu,
> 
> Me again! Having formatted the drives and rebuilt the RAID array I
> seem to be having the same problem as before (no power cut this
> time [I bought a UPS]).

But strangely, your super block shows it has a log tree, which means
either you hit a kernel panic/transaction abort, or an unexpected power
loss.

> The btrfs volume is broken on my ReadyNAS.
> 
> I have attached the results of some of the commands you asked me to
> run last time, and I am hoping you might be able to help me out.

This time, the problem is more serious: some chunk tree blocks are not
even inside the system chunk range, so it's no wonder it fails to mount.

To confirm it, you could run "btrfs ins dump-tree -b 17725903077376"
and paste the output.

But I don't have any clue. My guess is some kernel problem related to
new chunk allocation, or the chunk root node itself is already seriously
corrupted.
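
The "not inside system chunk range" condition amounts to an interval check:
every chunk tree block's logical address must fall inside one of the ranges
covered by the superblock's system chunk array. A schematic with made-up
ranges (not the real chunk mapping code):

```python
def inside_any_chunk(logical, chunks):
    """chunks: iterable of (start, length) logical ranges from the system
    chunk array; a chunk tree block outside all of them is bogus."""
    return any(start <= logical < start + length for start, length in chunks)

# Hypothetical system chunk: 32 MiB starting at the chunk root's address.
system_chunks = [(20971520, 32 * 1024 * 1024)]
print(inside_any_chunk(20971520, system_chunks))        # True
print(inside_any_chunk(17725903077376, system_chunks))  # False -> bogus block
```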

Considering how old your kernel is (4.4), it's not recommended to use
btrfs on such an old kernel, unless it has tons of btrfs fixes well
backported.

Thanks,
Qu

> 
> Kind regards
> Michael
> On Sat, 19 May 2018 at 12:43, Michael Wade  wrote:
>>
>> I have let the find-root command run for 14+ days; it's produced a
>> pretty huge log file (1.6 GB) but still hasn't completed. I think I will
>> start the process of reformatting my drives and starting over.
>>
>> Thanks for your help anyway.
>>
>> Kind regards
>> Michael
>>
>> On 5 May 2018 at 01:43, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018/05/05 00:18, Michael Wade wrote:
 Hi Qu,

 The tool is still running and the log file is now ~300mb. I guess it
 shouldn't normally take this long.. Is there anything else worth
 trying?
>>>
>>> I'm afraid not much.
>>>
>>> Although there is a possibility of modifying btrfs-find-root to do a much
>>> faster but more limited search.
>>>
>>> But from the result, it looks like underlying device corruption, and
>>> there's not much we can do right now.
>>>
>>> Thanks,
>>> Qu
>>>

 Kind regards
 Michael

 On 2 May 2018 at 06:29, Michael Wade  wrote:
> Thanks Qu,
>
> I actually aborted the run with the old btrfs tools once I saw its
> output. The new btrfs tools is still running and has produced a log
> file of ~85mb filled with that content so far.
>
> Kind regards
> Michael
>
> On 2 May 2018 at 02:31, Qu Wenruo  wrote:
>>
>>
>> On 2018/05/01 23:50, Michael Wade wrote:
>>> Hi Qu,
>>>
>>> Oh dear that is not good news!
>>>
>>> I have been running the find root command since yesterday but it only
>>> seems to be only be outputting the following message:
>>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>
>> It's mostly fine, as find-root will go through all tree blocks and try
>> to read them as tree blocks.
>> Although btrfs-find-root suppresses csum error output, such basic
>> tree validation checks are not suppressed, thus you get such messages.
>>
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>> ERROR: tree block bytenr 0 is not aligned to sectorsize 4096
>>>
>>> I tried with the latest btrfs tools compiled from source and the ones
>>> I have installed with the same result. Is there a CLI utility I could
>>> use to determine if the log contains any other content?
>>
>> Did it report any useful info at the end?
>>
>> Thanks,
>> Qu
>>
>>>
>>> Kind regards
>>> Michael
>>>
>>>
>>> On 30 April 2018 at 04:02, Qu Wenruo  wrote:


 On 2018/04/29 22:08, Michael Wade wrote:
> Hi Qu,
>
> Got this error message:
>
> ./btrfs inspect dump-tree -b 20800943685632 /dev/md127
> btrfs-progs v4.16.1
> bytenr mismatch, want=20800943685632, have=3118598835113619663
> ERROR: cannot read chunk root
> ERROR: unable to open /dev/md127
>
> I have attached the dumps for:
>
> dd if=/dev/md127 of=/tmp/chunk_root.copy1 bs=1 count=32K 
> skip=266325721088
> dd if=/dev/md127 of=/tmp/chunk_root.copy2 bs=1 count=32K 
> skip=266359275520

 Unfortunately, both dumps are corrupted and contain mostly garbage.
 I think the underlying stack (mdraid) has something wrong or failed
 to recover its data.

 This means your last 

Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread Qu Wenruo


On 2018/12/7 3:35 AM, David Sterba wrote:
> On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
>> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
>>> This patchset can be fetched from github:
>>> https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
>>>
>>> Which is based on v4.20-rc1.
>>
>> Thanks, I'll add it to for-next soon.
> 
> The branch was there for some time but not for at least a week (my
> mistake I did not notice in time). I've rebased it on top of recent
> misc-next, but without the delayed refs patchset from Josef.
> 
> At the moment I'm considering it for merge to 4.21, there's still some
> time to pull it out in case it shows up to be too problematic. I'm
> mostly worried about the unknown interactions with the enospc updates or

For that part, I don't think it would cause any obvious problem for the
enospc updates.

The only user-noticeable effect is the delay of reloc tree deletion.

Apart from that, it's mostly transparent to extent allocation.

> generally because of lack of qgroup and reloc code reviews.

That's the biggest problem.

However, most of the current qgroup + balance optimization is done inside
the qgroup code (to skip certain qgroup records), so if we're going to hit
some problem, this patchset has the highest possibility of hitting it.

Later patches will just keep tweaking qgroup, mostly without affecting any
other parts.

So I'm fine if you decide to pull it out for now.

Thanks,
Qu

> 
> I'm going to do some testing of the rebased branch before I add it to
> for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
> please check if everything is still ok there. Thanks.
> 





Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-06 Thread Julia Lawall



On Thu, 6 Dec 2018, Johannes Thumshirn wrote:

> On 05/12/2018 15:46, Julia Lawall wrote:
> [...]
> >> +@r_patch depends on !context && patch && !org && !report@
> >> +expression E;
> >> +type T;
> >> +@@
> >> +
> >> +(
> >> +- E & ~PAGE_MASK
> >> ++ offset_in_page(E)
> >> +|
> >> +- E & (PAGE_SIZE - 1)
> >> ++ offset_in_page(E)
> >
> > The two lines above should be subsumed by the two lines below.  When there
> > is a type metavariable that has no other dependencies, an isomorphism will
> > consider that it is either present or absent.
>
> Oh OK, I'm sorry, I'm not really into Coccinelle so I guessed it might
> take some iterations.
>
> Do you have an example for this?

Explanation 1:

Coccinelle has a file standard.iso that shows the isomorphisms (rewrite
rules) that may be applied to semantic patches.  One of the rules is:

Expression
@ not_ptr1 @
expression *X;
@@
 !X => X == NULL

So if you have a pointer typed expression X and you write a transformation
on !X, it will also apply to occurrences of X == NULL in the source code.
In this way, you don't have to write so many variants.

Likewise there is an isomorphism:

Expression
@ drop_cast @
expression E;
pure type T;
@@

 (T)E => E

That is, if you have a semantic patch with (T)X, then it will also apply
to code that matches just X, without the cast.  The word pure means that
this isomorphism metavariable has to match a semantic patch term that is a
metavariable and this metavariable can't be used elsewhere.  If you wrote

- (char)x

Then you would probably not want that to apply without the (char) cast.
But if you have just

- (T)x

for some random unbound metavariable T, then perhaps you don't care about
the cast to T.  If you actually do, then you can put disable drop_cast in
the header of your rule.



Explanation 2:

To see what your semantic patch is really doing, you can run

spatch --parse-cocci sp.cocci

Here is what I get for your patch rule, with some annotations added:

@@
expression E;
type T;
@@


(
-E
  >>> offset_in_page(E)
 -& -~-PAGE_MASK
|
-~
  >>> offset_in_page(E)
-PAGE_MASK -& -E
|

// the following come from
// - E & (PAGE_SIZE - 1)
// + offset_in_page(E)

-E   // 1
  >>> offset_in_page(E)
 -& -(-PAGE_SIZE -- -1-)
|
-E   // 2
  >>> offset_in_page(E)
 -& -PAGE_SIZE -- -1
|
-(   // 3
>>> offset_in_page(E)
  -PAGE_SIZE -- -1-) -& -E
|
-PAGE_SIZE   // 4
  >>> offset_in_page(E)
 -- -1 -& -E
|

// the following come from:
// - E & ((T)PAGE_SIZE - 1)
// + offset_in_page(E)

-E
  >>> offset_in_page(E)
 -& -(-(-T -)-PAGE_SIZE -- -1-)
|
-E   // same as 1
  >>> offset_in_page(E)
 -& -(-PAGE_SIZE -- -1-)
|
-E
  >>> offset_in_page(E)
 -& -(-T -)-PAGE_SIZE -- -1
|
-E   // same as 2
  >>> offset_in_page(E)
 -& -PAGE_SIZE -- -1
|
-(
>>> offset_in_page(E)
  -(-T -)-PAGE_SIZE -- -1-) -& -E
|
-(   // same as 3
>>> offset_in_page(E)
  -PAGE_SIZE -- -1-) -& -E
|
-(
>>> offset_in_page(E)
  -T -)-PAGE_SIZE -- -1 -& -E
|
-PAGE_SIZE   // same as 4
  >>> offset_in_page(E)
 -- -1 -& -E
)

So all the transformation generated by

- E & (PAGE_SIZE - 1)
+ offset_in_page(E)

are also generated by

- E & ((T)PAGE_SIZE - 1)
+ offset_in_page(E)
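
The reason one pattern can subsume the other is the identity underneath them:
with PAGE_MASK defined as ~(PAGE_SIZE - 1), masking with ~PAGE_MASK and
masking with (PAGE_SIZE - 1) select the same low bits. A quick numeric check
(Python standing in for the C macros):

```python
PAGE_SIZE = 4096
PAGE_MASK = ~(PAGE_SIZE - 1)   # mirrors the kernel definition

def offset_in_page(addr):
    """The quantity both semantic-patch patterns compute."""
    return addr & ~PAGE_MASK

for addr in (0, 1, 4095, 4096, 123456789):
    assert addr & ~PAGE_MASK == addr & (PAGE_SIZE - 1) == offset_in_page(addr)
print(offset_in_page(123456789))  # 3349, i.e. 123456789 % 4096
```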

I hope that is helpful.

julia


Re: [PATCH v2 0/6] btrfs: qgroup: Delay subtree scan to reduce overhead

2018-12-06 Thread David Sterba
On Mon, Nov 12, 2018 at 10:33:33PM +0100, David Sterba wrote:
> On Thu, Nov 08, 2018 at 01:49:12PM +0800, Qu Wenruo wrote:
> > This patchset can be fetched from github:
> > https://github.com/adam900710/linux/tree/qgroup_delayed_subtree_rebased
> > 
> > Which is based on v4.20-rc1.
> 
> Thanks, I'll add it to for-next soon.

The branch was there for some time but not for at least a week (my
mistake I did not notice in time). I've rebased it on top of recent
misc-next, but without the delayed refs patchset from Josef.

At the moment I'm considering it for merge to 4.21, there's still some
time to pull it out in case it shows up to be too problematic. I'm
mostly worried about the unknown interactions with the enospc updates or
generally because of lack of qgroup and reloc code reviews.

I'm going to do some testing of the rebased branch before I add it to
for-next. The branch is ext/qu/qgroup-delay-scan in my devel repos,
please check if everything is still ok there. Thanks.


Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 12:12:21PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> 
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
> 
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0
> root 35964 owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
> 
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to
> do something.  This is a problem because if we modify the tree we could cow
> a block in our path, which frees our reference to that block.  Then once we
> get back to that shared block we'll free our reference to it again, and get
> ENOENT when trying to lookup our extent reference to that block in
> __btrfs_free_extent.
> 
> This is ultimately happening because we have delayed items left to be
> processed for our deleted snapshot _after_ all of the inodes are closed for
> the snapshot.  We only run the delayed inode item if we're deleting the
> inode, and even then we do not run the delayed insertions or delayed
> removals.  These can be run at any point after our final inode does its
> last iput, which is what triggers the snapshot deletion.  We can end up
> with the snapshot deletion happening and then have the delayed items run
> on that file system, resulting in the above problem.
> 
> This problem has existed forever, however my patches made it much easier
> to hit as I wake up the cleaner much more often to deal with delayed
> iputs, which made us more likely to start the snapshot dropping work
> before the transaction commits, which is when the delayed items would
> generally be run.  Before, generally speaking, we would run the delayed
> items, commit the transaction, and wake up the cleaner thread to start
> deleting snapshots, which means we were less likely to hit this problem.
> You could still hit it if you had multiple snapshots to be deleted and
> ended up with lots of delayed items, but it was definitely harder.
> 
> Fix for now by simply running all the delayed items before starting to
> drop the snapshot.  We could make this smarter in the future by making the
> delayed items per-root, and then simply drop any delayed items for roots
> that we are going to delete.  But for now just a quick and easy solution
> is the safest.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 
> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave I can reroll the series, but the second version of patch 1 is the same,
>   let me know what you want.

As this is a small update it's fine to send just that patch. You may
also use --in-reply-to so it threads to the original series. Resending
series makes most sense (to me) when there's a discussion and many
changes, so a fresh series makes it clear what's the current status.

Patch replaced in for-next topic branch, thanks.


Re: [PATCH 1/2] btrfs: catch cow on deleting snapshots

2018-12-06 Thread David Sterba
On Fri, Nov 30, 2018 at 12:19:18PM -0500, Josef Bacik wrote:
> On Fri, Nov 30, 2018 at 05:14:54PM +, Filipe Manana wrote:
> > On Fri, Nov 30, 2018 at 4:53 PM Josef Bacik  wrote:
> > >
> > > From: Josef Bacik 
> > >
> > > When debugging some weird extent reference bug I suspected that we were
> > > changing a snapshot while we were deleting it, which could explain my
> > > bug.  This was indeed what was happening, and this patch helped me
> > > verify my theory.  It is never correct to modify the snapshot once it's
> > > being deleted, so mark the root when we are deleting it and make sure we
> > > complain about it when it happens.
> > >
> > > Signed-off-by: Josef Bacik 
> > > ---
> > >  fs/btrfs/ctree.c   | 3 +++
> > >  fs/btrfs/ctree.h   | 1 +
> > >  fs/btrfs/extent-tree.c | 9 +
> > >  3 files changed, 13 insertions(+)
> > >
> > > diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
> > > index 5912a97b07a6..5f82f86085e8 100644
> > > --- a/fs/btrfs/ctree.c
> > > +++ b/fs/btrfs/ctree.c
> > > @@ -1440,6 +1440,9 @@ noinline int btrfs_cow_block(struct 
> > > btrfs_trans_handle *trans,
> > > u64 search_start;
> > > int ret;
> > >
> > > +   if (test_bit(BTRFS_ROOT_DELETING, &root->state))
> > > +   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats 
> > > being dropped\n");
> > 
> > Please use btrfs_warn(), it makes sure we use a consistent message
> > style, identifies the fs, etc.
> > Also, "thats" should be "that is" or "that's".
> > 
> 
> Ah yeah, I was following the other convention in there but we should probably
> convert all of those to btrfs_warn.  I'll fix the grammar thing as well,
> just a leftover from the much less code of conduct friendly message I
> originally had there.  Thanks,

Committed with the following fixup:

-   WARN(1, KERN_CRIT "cow'ing blocks on a fs root thats being 
dropped\n");
+   btrfs_error(fs_info,
+   "COW'ing blocks on a fs root that's being dropped");



Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread David Sterba
On Thu, Dec 06, 2018 at 06:52:21PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 17:20, Josef Bacik wrote:
> > Now with the delayed_refs_rsv we can now know exactly how much pending
> > delayed refs space we need.  This means we can drastically simplify
> 
> IMO it will be helpful if there is a sentence here referring back to
> btrfs_update_delayed_refs_rsv to put your first sentence into context.
> But I guess this is something David can also do.

I'll update the changelog, but I'm not sure what exactly you want to see
there, please post the replacement text. Thanks.

> > btrfs_check_space_for_delayed_refs by simply checking how much space we
> > have reserved for the global rsv (which acts as a spill over buffer) and
> > the delayed refs rsv.  If our total size is beyond that amount then we
> > know it's time to commit the transaction and stop any more delayed refs
> > from being generated.
> > 
> > Signed-off-by: Josef Bacik 


Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-06 Thread Johannes Thumshirn
On 05/12/2018 15:46, Julia Lawall wrote:
[...]
>> +@r_patch depends on !context && patch && !org && !report@
>> +expression E;
>> +type T;
>> +@@
>> +
>> +(
>> +- E & ~PAGE_MASK
>> ++ offset_in_page(E)
>> +|
>> +- E & (PAGE_SIZE - 1)
>> ++ offset_in_page(E)
> 
> The two lines above should be subsumed by the two lines below.  When there
> is a type metavariable that has no other dependencies, an isomorphism will
> consider that it is either present or absent.

Oh OK, I'm sorry, I'm not really into Coccinelle so I guessed it might
take some iterations.

Do you have an example for this?

> Why not include the cast case for the context and org cases?

This is an oversight actually as I couldn't test it due to a bug in
openSUSE's coccinelle rpm.

Thanks,
Johannes
-- 
Johannes Thumshirn  SUSE Labs Filesystems
jthumsh...@suse.de  +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 08/10] btrfs: rework btrfs_check_space_for_delayed_refs

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Now with the delayed_refs_rsv we can now know exactly how much pending
> delayed refs space we need.  This means we can drastically simplify

IMO it will be helpful if there is a sentence here referring back to
btrfs_update_delayed_refs_rsv to put your first sentence into context.
But I guess this is something David can also do.

> btrfs_check_space_for_delayed_refs by simply checking how much space we
> have reserved for the global rsv (which acts as a spill over buffer) and
> the delayed refs rsv.  If our total size is beyond that amount then we
> know it's time to commit the transaction and stop any more delayed refs
> from being generated.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/ctree.h   |  2 +-
>  fs/btrfs/extent-tree.c | 48 ++--
>  fs/btrfs/inode.c   |  4 ++--
>  fs/btrfs/transaction.c |  2 +-
>  4 files changed, 22 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 2eba398c722b..30da075c042e 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -2631,7 +2631,7 @@ static inline u64 btrfs_calc_trunc_metadata_size(struct 
> btrfs_fs_info *fs_info,
>  }
>  
>  int btrfs_should_throttle_delayed_refs(struct btrfs_trans_handle *trans);
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans);
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info);
>  void btrfs_dec_block_group_reservations(struct btrfs_fs_info *fs_info,
>const u64 start);
>  void btrfs_wait_block_group_reservations(struct btrfs_block_group_cache *bg);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 5a2d0b061f57..07ef1b8087f7 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2839,40 +2839,28 @@ u64 btrfs_csum_bytes_to_leaves(struct btrfs_fs_info 
> *fs_info, u64 csum_bytes)
>   return num_csums;
>  }
>  
> -int btrfs_check_space_for_delayed_refs(struct btrfs_trans_handle *trans)
> +bool btrfs_check_space_for_delayed_refs(struct btrfs_fs_info *fs_info)
>  {
> - struct btrfs_fs_info *fs_info = trans->fs_info;
> - struct btrfs_block_rsv *global_rsv;
> - u64 num_heads = trans->transaction->delayed_refs.num_heads_ready;
> - u64 csum_bytes = trans->transaction->delayed_refs.pending_csums;
> - unsigned int num_dirty_bgs = trans->transaction->num_dirty_bgs;
> - u64 num_bytes, num_dirty_bgs_bytes;
> - int ret = 0;
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
> + struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
> + bool ret = false;
> + u64 reserved;
>  
> - num_bytes = btrfs_calc_trans_metadata_size(fs_info, 1);
> - num_heads = heads_to_leaves(fs_info, num_heads);
> - if (num_heads > 1)
> - num_bytes += (num_heads - 1) * fs_info->nodesize;
> - num_bytes <<= 1;
> - num_bytes += btrfs_csum_bytes_to_leaves(fs_info, csum_bytes) *
> - fs_info->nodesize;
> - num_dirty_bgs_bytes = btrfs_calc_trans_metadata_size(fs_info,
> -  num_dirty_bgs);
> - global_rsv = &fs_info->global_block_rsv;
> + spin_lock(&global_rsv->lock);
> + reserved = global_rsv->reserved;
> + spin_unlock(&global_rsv->lock);
>  
>   /*
> -  * If we can't allocate any more chunks lets make sure we have _lots_ of
> -  * wiggle room since running delayed refs can create more delayed refs.
> +  * Since the global reserve is just kind of magic we don't really want
> +  * to rely on it to save our bacon, so if our size is more than the
> +  * delayed_refs_rsv and the global rsv then it's time to think about
> +  * bailing.
>*/
> - if (global_rsv->space_info->full) {
> - num_dirty_bgs_bytes <<= 1;
> - num_bytes <<= 1;
> - }
> -
> - spin_lock(&global_rsv->lock);
> - if (global_rsv->reserved <= num_bytes + num_dirty_bgs_bytes)
> - ret = 1;
> - spin_unlock(&global_rsv->lock);
> + spin_lock(&delayed_refs_rsv->lock);
> + reserved += delayed_refs_rsv->reserved;
> + if (delayed_refs_rsv->size >= reserved)
> + ret = true;
> + spin_unlock(&delayed_refs_rsv->lock);
>   return ret;
>  }
>  
> @@ -2891,7 +2879,7 @@ int btrfs_should_throttle_delayed_refs(struct 
> btrfs_trans_handle *trans)
>   if (val >= NSEC_PER_SEC / 2)
>   return 2;
>  
> - return btrfs_check_space_for_delayed_refs(trans);
> + return btrfs_check_space_for_delayed_refs(trans->fs_info);
>  }
>  
>  struct async_delayed_refs {
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index a097f5fde31d..8532a2eb56d1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -5326,8 +5326,8 @@ static struct btrfs_trans_handle 
> *evict_refill_and_join(struct btrfs_root *root,
> 
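To make the reworked check concrete, here is a rough Python model of the logic btrfs_check_space_for_delayed_refs implements after this patch. This is only a sketch with illustrative byte counts, not kernel code; the real function reads the values under the rsv spinlocks.

```python
# Illustrative model of the reworked check: it's time to commit when the
# delayed refs rsv's desired size exceeds what is actually reserved in
# it plus the global reserve (which acts as a spill-over buffer).
# Numbers below are arbitrary; not kernel API.

def check_space_for_delayed_refs(global_reserved, delayed_refs_reserved,
                                 delayed_refs_size):
    """Return True when pending delayed refs outgrow the reserves."""
    reserved = global_reserved + delayed_refs_reserved
    return delayed_refs_size >= reserved

# Plenty of reserved space: no need to commit yet.
print(check_space_for_delayed_refs(16384, 8192, 4096))   # False
# Pending delayed refs outgrow both reserves: time to commit.
print(check_space_for_delayed_refs(4096, 4096, 16384))   # True
```

The point of the rework is visible here: the decision no longer needs the per-transaction head/csum/dirty-bg estimates the old code computed, only the two reserve counters.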

Re: [PATCH 09/10] btrfs: don't run delayed refs in the end transaction logic

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Over the years we have built up a lot of infrastructure to keep delayed
> refs in check, mostly by running them at btrfs_end_transaction() time.
> We have a lot of different maths we do to figure out how much, if we
> should do it inline or async, etc.  This existed because we had no
> feedback mechanism to force the flushing of delayed refs when they
> became a problem.  However with the enospc flushing infrastructure in
> place for flushing delayed refs when they put too much pressure on the
> enospc system we have this problem solved.  Rip out all of this code as
> it is no longer needed.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 38 --
>  1 file changed, 38 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 2d8401bf8df9..01f39401619a 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -798,22 +798,12 @@ static int should_end_transaction(struct 
> btrfs_trans_handle *trans)
>  int btrfs_should_end_transaction(struct btrfs_trans_handle *trans)
>  {
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - int updates;
> - int err;
>  
>   smp_mb();
>   if (cur_trans->state >= TRANS_STATE_BLOCKED ||
>   cur_trans->delayed_refs.flushing)
>   return 1;
>  
> - updates = trans->delayed_ref_updates;
> - trans->delayed_ref_updates = 0;
> - if (updates) {
> - err = btrfs_run_delayed_refs(trans, updates * 2);
> - if (err) /* Error code will also eval true */
> - return err;
> - }
> -
>   return should_end_transaction(trans);
>  }
>  
> @@ -843,11 +833,8 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>  {
>   struct btrfs_fs_info *info = trans->fs_info;
>   struct btrfs_transaction *cur_trans = trans->transaction;
> - u64 transid = trans->transid;
> - unsigned long cur = trans->delayed_ref_updates;
>   int lock = (trans->type != TRANS_JOIN_NOLOCK);
>   int err = 0;
> - int must_run_delayed_refs = 0;
>  
>   if (refcount_read(&trans->use_count) > 1) {
>   refcount_dec(&trans->use_count);
> @@ -858,27 +845,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   btrfs_trans_release_metadata(trans);
>   trans->block_rsv = NULL;
>  
> - if (!list_empty(&trans->new_bgs))
> - btrfs_create_pending_block_groups(trans);

Is this being deleted because in delayed_refs_rsv you account also for
new block groups?

> -
> - trans->delayed_ref_updates = 0;
> - if (!trans->sync) {
> - must_run_delayed_refs =
> - btrfs_should_throttle_delayed_refs(trans);
> - cur = max_t(unsigned long, cur, 32);
> -
> - /*
> -  * don't make the caller wait if they are from a NOLOCK
> -  * or ATTACH transaction, it will deadlock with commit
> -  */
> - if (must_run_delayed_refs == 1 &&
> - (trans->type & (__TRANS_JOIN_NOLOCK | __TRANS_ATTACH)))
> - must_run_delayed_refs = 2;
> - }
> -
> - btrfs_trans_release_metadata(trans);
> - trans->block_rsv = NULL;

Why remove those 2 lines as well ?

> -
>   if (!list_empty(&trans->new_bgs))
>   btrfs_create_pending_block_groups(trans);
>  
> @@ -923,10 +889,6 @@ static int __btrfs_end_transaction(struct 
> btrfs_trans_handle *trans,
>   }
>  
>   kmem_cache_free(btrfs_trans_handle_cachep, trans);
> - if (must_run_delayed_refs) {
> - btrfs_async_run_delayed_refs(info, cur, transid,
> -  must_run_delayed_refs == 1);
> - }
>   return err;
>  }
>  
> 


Re: [PATCH 00/10][V2] Delayed refs rsv

2018-12-06 Thread David Sterba
On Mon, Dec 03, 2018 at 10:20:28AM -0500, Josef Bacik wrote:
> v1->v2:
> - addressed the comments from the various reviewers.
> - split "introduce delayed_refs_rsv" into 5 patches.  The patches are the same
>   together as they were, just split out more logically.  They can't really be
>   bisected across in that you will likely have fun enospc failures, but they
>   compile properly.  This was done to make it easier for review.

Thanks, that was helpful. The bisectability is IMO not hurt, as it
compiles and runs despite the potential ENOSPC problems. I guess the
point is to be able to reach any commit & compile & run and not be hit
by unrelated bugs. If somebody is bisecting an ENOSPC problem and lands
in the middle of the series, there's a hint in the changelogs that
something is going on.

> -- Original message --
> 
> This patchset changes how we do space reservations for delayed refs.  We were
> hitting probably 20-40 enospc abort's per day in production while running
> delayed refs at transaction commit time.  This means we ran out of space in 
> the
> global reserve and couldn't easily get more space in use_block_rsv().
> 
> The global reserve has grown to cover everything we don't reserve space
> explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if
> we're running short on space and when it's time to force a commit.  A failure
> rate of 20-40 file systems when we run hundreds of thousands of them isn't 
> super
> high, but cleaning up this code will make things less ugly and more 
> predictable.
> 
> Thus the delayed refs rsv.  We always know how many delayed refs we have
> outstanding, and although running them generates more we can use the global
> reserve for that spill over, which fits better into it's desired use than a 
> full
> blown reservation.  This first approach is to simply take how many times we're
> reserving space for and multiply that by 2 in order to save enough space for 
> the
> delayed refs that could be generated.  This is a naive approach and will
> probably evolve, but for now it works.
> 
> With this patchset we've gone down to 2-8 failures per week.  It's not 
> perfect,
> there are some corner cases that still need to be addressed, but is
> significantly better than what we had.  Thanks,

I've merged the series to misc-next, the amount of reviews and testing
is sufficient. Though there are some more comments or suggestions for
improvements, they can be done as followups.

The other patchsets from you are namely fixes, that can be applied at
the -rc time, for that reason I'm glad the main infrastructure change
can be merged before the 4.21 code freeze.
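The sizing approach the cover letter describes (reserve roughly twice the per-update metadata size for each tracked ref, since running refs can generate more refs) can be sketched as a toy model. Everything here is illustrative: the constant stands in for the kernel's per-update metadata reservation, which actually depends on nodesize and tree height.

```python
# Toy model of the delayed refs rsv sizing described above: each tracked
# ref update grows the rsv's target size by 2x one reservation unit, and
# running a ref shrinks the target and releases the excess back.
# PER_UPDATE_BYTES is an arbitrary stand-in, not a kernel value.

PER_UPDATE_BYTES = 16384

class DelayedRefsRsv:
    def __init__(self):
        self.size = 0      # how much we want reserved
        self.reserved = 0  # how much is actually reserved

    def track_ref(self):
        """A new delayed ref head is queued: grow the target size."""
        self.size += 2 * PER_UPDATE_BYTES

    def run_ref(self):
        """A delayed ref ran: shrink the target, release the excess."""
        self.size -= 2 * PER_UPDATE_BYTES
        released = max(0, self.reserved - self.size)
        self.reserved -= released
        return released

rsv = DelayedRefsRsv()
rsv.track_ref()
rsv.track_ref()
rsv.reserved = rsv.size   # assume both reservations succeeded
print(rsv.run_ref())      # releases one ref's worth: 32768
```

The 2x factor is exactly the "save enough space for the delayed refs that could be generated" headroom from the cover letter; anything it still can't cover spills into the global reserve.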


Re: [PATCH 0/3] btrfs: use offset_in_page and PAGE_ALIGNED

2018-12-06 Thread David Sterba
On Wed, Dec 05, 2018 at 03:23:02PM +0100, Johannes Thumshirn wrote:
> Use the offset_in_page() and PAGE_ALIGNED() macros instead of open-coding them
> throughout btrfs.
> 
> This series also includes a patch for 'make coccicheck' which is marked as an
> RFC and I've CCed Julia in the hoping to get input from her.
> 
> Johannes Thumshirn (3):
>   btrfs: use offset_in_page instead of open-coding it
>   btrfs: use PAGE_ALIGNED instead of open-coding it

Added to misc-next, thanks.


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Austin S. Hemmelgarn

On 2018-12-06 01:11, Robert White wrote:
> (1) Automatic and selective wiping of unused and previously used disk
> blocks is a good security measure, particularly when there is an
> encryption layer beneath the file system.

> (2) USB attached devices _never_ support TRIM and they are the most
> likely to fall into strangers' hands.

Not true on the first count.  Some really nice UAS devices do support
SCSI UNMAP and WRITESAME commands.


> (3) I vaguely recall that some flash chips will take bulk writes of
> full sectors of 0x00 or 0xFF (I don't remember which) which were
> second-best to TRIM for letting the flash controllers defragment their
> internals.

> So it would be dog-slow, but it would be neat if BTRFS had a mount
> option to convert any TRIM command from above into the write of a zero,
> 0xFF, or trash block to the device below if that device doesn't support
> TRIM. Real TRIM support would override the block write.

> Obviously doing an fstrim would involve a lot of slow device writes but
> only for people likely to do that sort of thing.

> For testing purposes the destruction of unused pages in this manner
> might catch file system failures or coding errors.

> (The other layer where this might be most appropriate is in cryptsetup
> et al, where it could lie about TRIM support, but that sort of stealth
> lag might be bad for filesystem-level operations. Doing it there would
> also lose the simpler USB use cases.)

> ...Just a thought...
First off, TRIM is an ATA command, not the kernel term.  `fstrim` 
inherited the ATA name, but in kernel it's called a discard operation, 
and it's kind of important to understand here that a discard operation 
can result in a number of different behaviors.


In particular, you have at least the following implementations:

* On SCSI devices, a discard operation translates to a SCSI UNMAP 
command.  As pointed out by Ronnie Sahlberg in his reply, this command 
is purely advisory, may not result in any actual state change on the 
target device, and is not guaranteed to wipe the data.  To actually wipe 
things, you have to explicitly write bogus data to the given regions 
(using either regular writes, or a WRITESAME command with the desired 
pattern), and _then_ call UNMAP on them.
* On dm-thinp devices, a discard operation results in simply unmapping 
the blocks in the region it covers.  The underlying blocks themselves 
are not wiped until they get reallocated (which may not happen when you 
write to that region of the dm-thinp device again), and may not even be 
wiped then (depending on how the dm-thinp device is configured).  Thus, 
the same behavior as for SCSI is required here.
* On SD/MMC devices, a discard operation results in an SD ERASE command 
being issued.  This one is non-advisory (that is, it's guaranteed to 
happen), and is supposed to guarantee an overwrite of the region with 
zeroes or ones.
* eMMC devices additionally define a discard operation independent of 
the SD ERASE command which unmaps the region in the translation layer, 
but does not wipe the blocks either on issuing the command or on 
re-allocating the low-level blocks.  Essentially, it's just a hint for 
the wear-leveling algorithm.
* NVMe provides two different discard operations, and I'm not sure which 
the kernel uses for NVMe block emulation.  They correspond almost 
exactly to the SCSI UNMAP and SD ERASE commands in terms of behavior.
* For ATA devices, a discard operation translates to an ATA TRIM 
command.  This command doesn't even require that the data read back from 
a region the command has been issued against be consistent between 
reads, let alone that it actually returns zeroes, and it is completely 
silent on how the device should actually implement the operation.  In 
practice, most drives that implement it actually behave like dm-thinp 
devices, unmapping the low-level blocks in the region and only clearing 
them when they get reallocated, while returning any data they want on 
subsequent reads to that logical region until a write happens.
* The MTD subsystem has support for discard operations in the various 
FTL's, and they appear from a cursory look at the code to behave like a 
non-advisory version of the SCSI UNMAP command (FWIW, MTD's are what the 
concept of a discard operation was originally implemented in Linux for).


Notice that the only implementations that are actually guaranteed to 
clear out the low-level physical blocks are the SD ERASE and one of the 
two NVMe options, and all others require you to manually wipe the data 
before issuing the discard operation to guarantee that no data is retained.


Given this, I don't think this should be done as a mechanism of 
intercepting or translating discard operations, but as something else 
entirely.  Perhaps as a block-layer that wipes the region then issues a 
discard for it to the lower level device if the device supports it?
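The "wipe first, then discard" sequencing that follows from the survey above can be sketched as pure decision logic. The capability names below are made up for illustration; only SD ERASE (and the zero-on-deallocate NVMe variant) guarantees the physical blocks are cleared, so every other case gets an explicit overwrite before the discard.

```python
# Sketch of the wipe-then-discard policy discussed above. Capability
# strings are illustrative labels, not kernel identifiers.

GUARANTEED_WIPE = {"sd_erase", "nvme_deallocate_zeroes"}

def secure_discard_plan(capability):
    """Return the ordered operations needed to securely free a region.

    capability: which discard-like command the device offers, or None
    if it offers none at all.
    """
    if capability in GUARANTEED_WIPE:
        return ["discard"]        # the discard itself wipes the blocks
    ops = ["write_zeroes"]        # wipe the data explicitly first
    if capability is not None:
        ops.append("discard")     # then hint the FTL / thin layer
    return ops

print(secure_discard_plan("sd_erase"))   # ['discard']
print(secure_discard_plan("ata_trim"))   # ['write_zeroes', 'discard']
print(secure_discard_plan(None))         # ['write_zeroes']
```

This matches the closing suggestion: rather than intercepting discards, a block layer could perform the wipe and then pass the discard down only when the lower device supports it.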


Re: [PATCH 06/10] btrfs: update may_commit_transaction to use the delayed refs rsv

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> Any space used in the delayed_refs_rsv will be freed up by a transaction
> commit, so instead of just counting the pinned space we also need to
> account for any space in the delayed_refs_rsv when deciding if it will
> make a different to commit the transaction to satisfy our space
> reservation.  If we have enough bytes to satisfy our reservation ticket
> then we are good to go, otherwise subtract out what space we would gain
> back by committing the transaction and compare that against the pinned
> space to make our decision.
> 
> Signed-off-by: Josef Bacik 

Reviewed-by: Nikolay Borisov 

However, look below for one suggestion: 

> ---
>  fs/btrfs/extent-tree.c | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index aa0a638d0263..63ff9d832867 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4843,8 +4843,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>  {
>   struct reserve_ticket *ticket = NULL;
>   struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
> + struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
>   struct btrfs_trans_handle *trans;
> - u64 bytes;
> + u64 bytes_needed;
> + u64 reclaim_bytes = 0;
>  
>   trans = (struct btrfs_trans_handle *)current->journal_info;
>   if (trans)
> @@ -4857,15 +4859,15 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   else if (!list_empty(&space_info->tickets))
>   ticket = list_first_entry(&space_info->tickets,
>     struct reserve_ticket, list);
> - bytes = (ticket) ? ticket->bytes : 0;
> + bytes_needed = (ticket) ? ticket->bytes : 0;
>   spin_unlock(&space_info->lock);
>  
> - if (!bytes)
> + if (!bytes_needed)
>   return 0;
>  
>   /* See if there is enough pinned space to make this reservation */
>   if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
>   goto commit;
>  
> @@ -4877,14 +4879,18 @@ static int may_commit_transaction(struct 
> btrfs_fs_info *fs_info,
>   return -ENOSPC;

If we remove this check:

 if (space_info != delayed_rsv->space_info)
 return -ENOSPC;

can't we move the reclaim_bytes calculation above the
__percpu_counter_compare and be left with just a single invocation of
the percpu compare?  The diff should look something along the lines of:

@@ -4828,19 +4827,6 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
if (!bytes)
return 0;
 
-   /* See if there is enough pinned space to make this reservation */
-   if (__percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) >= 0)
-   goto commit;
-
-   /*
-* See if there is some space in the delayed insertion reservation for
-* this reservation.
-*/
-   if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
-
spin_lock(&delayed_rsv->lock);
if (delayed_rsv->size > bytes)
bytes = 0;
@@ -4850,9 +4836,8 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
if (__percpu_counter_compare(&space_info->total_bytes_pinned,
   bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
+  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
return -ENOSPC;
-   }
 
 commit:
trans = btrfs_join_transaction(fs_info->extent_root);


>  
>   spin_lock(&delayed_rsv->lock);
> - if (delayed_rsv->size > bytes)
> - bytes = 0;
> - else
> - bytes -= delayed_rsv->size;
> + reclaim_bytes += delayed_rsv->reserved;
>   spin_unlock(&delayed_rsv->lock);
>  
> + spin_lock(&delayed_refs_rsv->lock);
> + reclaim_bytes += delayed_refs_rsv->reserved;
> + spin_unlock(&delayed_refs_rsv->lock);
> + if (reclaim_bytes >= bytes_needed)
> + goto commit;
> + bytes_needed -= reclaim_bytes;
> +
> + if (__percpu_counter_compare(&space_info->total_bytes_pinned,
> -bytes,
> +bytes_needed,
>  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
>   return -ENOSPC;
>   }
> 
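The decision this patch implements can be modeled outside the kernel roughly as follows. This is a sketch with illustrative byte counts, not kernel API; in the kernel the reserve values are read under their spinlocks and the pinned comparison uses a batched percpu counter.

```python
# Rough model of may_commit_transaction after this patch: a commit frees
# both the delayed rsv and the delayed refs rsv, so subtract what a
# commit would reclaim from the ticket before comparing against pinned
# bytes.

def commit_would_help(bytes_needed, delayed_reserved,
                      delayed_refs_reserved, total_bytes_pinned):
    """Return True if committing the transaction can satisfy the ticket."""
    reclaim = delayed_reserved + delayed_refs_reserved
    if reclaim >= bytes_needed:
        return True   # the two rsvs alone cover the reservation
    # Otherwise the remainder has to come out of pinned space.
    return total_bytes_pinned >= bytes_needed - reclaim

print(commit_would_help(8192, 4096, 4096, 0))      # True: rsvs cover it
print(commit_would_help(16384, 4096, 4096, 4096))  # False: still short
```

The difference from the pre-patch code is the second counted reserve: previously only delayed_rsv->size was subtracted, so space sitting in the new delayed refs rsv was invisible to this decision.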


Re: [PATCH 02/10] btrfs: add cleanup_ref_head_accounting helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We were missing some quota cleanups in check_ref_cleanup, so break the
> ref head accounting cleanup into a helper and call that from both
> check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> we don't screw up accounting in the future for other things that we add.
> 
> Reviewed-by: Omar Sandoval 
> Reviewed-by: Liu Bo 
> Signed-off-by: Josef Bacik 

Doesn't this also need a stable tag? Furthermore, doesn't the missing
code dealing with total_bytes_pinned in check_ref_cleanup mean that
every time the last reference for a block was freed we were leaking
bytes in total_bytes_pinned? Shouldn't this have lead to eventually
total_bytes_pinned dominating the usage in a space_info ?

Codewise lgtm:

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/extent-tree.c | 67 
> +-
>  1 file changed, 39 insertions(+), 28 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index c36b3a42f2bb..e3ed3507018d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
> *trans,
>   return ret ? ret : 1;
>  }
>  
> +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> + struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_fs_info *fs_info = trans->fs_info;
> + struct btrfs_delayed_ref_root *delayed_refs =
> + &trans->transaction->delayed_refs;
> +
> + if (head->total_ref_mod < 0) {
> + struct btrfs_space_info *space_info;
> + u64 flags;
> +
> + if (head->is_data)
> + flags = BTRFS_BLOCK_GROUP_DATA;
> + else if (head->is_system)
> + flags = BTRFS_BLOCK_GROUP_SYSTEM;
> + else
> + flags = BTRFS_BLOCK_GROUP_METADATA;
> + space_info = __find_space_info(fs_info, flags);
> + ASSERT(space_info);
> + percpu_counter_add_batch(&space_info->total_bytes_pinned,
> +-head->num_bytes,
> +BTRFS_TOTAL_BYTES_PINNED_BATCH);
> +
> + if (head->is_data) {
> + spin_lock(&delayed_refs->lock);
> + delayed_refs->pending_csums -= head->num_bytes;
> + spin_unlock(&delayed_refs->lock);
> + }
> + }
> +
> + /* Also free its reserved qgroup space */
> + btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> +   head->qgroup_reserved);
> +}
> +
>  static int cleanup_ref_head(struct btrfs_trans_handle *trans,
>   struct btrfs_delayed_ref_head *head)
>  {
> @@ -2478,31 +2513,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> - trace_run_delayed_ref_head(fs_info, head, 0);
> -
> - if (head->total_ref_mod < 0) {
> - struct btrfs_space_info *space_info;
> - u64 flags;
> -
> - if (head->is_data)
> - flags = BTRFS_BLOCK_GROUP_DATA;
> - else if (head->is_system)
> - flags = BTRFS_BLOCK_GROUP_SYSTEM;
> - else
> - flags = BTRFS_BLOCK_GROUP_METADATA;
> - space_info = __find_space_info(fs_info, flags);
> - ASSERT(space_info);
> - percpu_counter_add_batch(&space_info->total_bytes_pinned,
> --head->num_bytes,
> -BTRFS_TOTAL_BYTES_PINNED_BATCH);
> -
> - if (head->is_data) {
> - spin_lock(&delayed_refs->lock);
> - delayed_refs->pending_csums -= head->num_bytes;
> - spin_unlock(&delayed_refs->lock);
> - }
> - }
> -
>   if (head->must_insert_reserved) {
>   btrfs_pin_extent(fs_info, head->bytenr,
>head->num_bytes, 1);
> @@ -2512,9 +2522,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   }
>   }
>  
> - /* Also free its reserved qgroup space */
> - btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> -   head->qgroup_reserved);
> + cleanup_ref_head_accounting(trans, head);
> +
> + trace_run_delayed_ref_head(fs_info, head, 0);
>   btrfs_delayed_ref_unlock(head);
>   btrfs_put_delayed_ref_head(head);
>   return 0;
> @@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (head->must_insert_reserved)
>   ret = 1;
>  
> + cleanup_ref_head_accounting(trans, head);
>   mutex_unlock(>mutex);
>   btrfs_put_delayed_ref_head(head);
>   return ret;
> 


Re: [PATCH 01/10] btrfs: add btrfs_delete_ref_head helper

2018-12-06 Thread Nikolay Borisov



On 3.12.18 г. 17:20 ч., Josef Bacik wrote:
> From: Josef Bacik 
> 
> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> into a helper and cleanup the calling functions.
> 
> Signed-off-by: Josef Bacik 
> Reviewed-by: Omar Sandoval 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/delayed-ref.c | 14 ++
>  fs/btrfs/delayed-ref.h |  3 ++-
>  fs/btrfs/extent-tree.c | 22 +++---
>  3 files changed, 19 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index 9301b3ad9217..b3e4c9fcb664 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   return head;
>  }
>  
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head)
> +{
> + lockdep_assert_held(&delayed_refs->lock);
> + lockdep_assert_held(&head->lock);
> +
> + rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> + RB_CLEAR_NODE(&head->href_node);
> + atomic_dec(&delayed_refs->num_entries);
> + delayed_refs->num_heads--;
> + if (head->processing == 0)
> + delayed_refs->num_heads_ready--;
> +}
> +
>  /*
>   * Helper to insert the ref_node to the tail or merge with tail.
>   *
> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> index 8e20c5cb5404..d2af974f68a1 100644
> --- a/fs/btrfs/delayed-ref.h
> +++ b/fs/btrfs/delayed-ref.h
> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
> btrfs_delayed_ref_head *head)
>  {
  mutex_unlock(&head->mutex);
>  }
> -
> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> +struct btrfs_delayed_ref_head *head);
>  
>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
>   struct btrfs_delayed_ref_root *delayed_refs);
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index d242a1174e50..c36b3a42f2bb 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
> *trans,
>   spin_unlock(&delayed_refs->lock);
>   return 1;
>   }
> - delayed_refs->num_heads--;
> - rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> + btrfs_delete_ref_head(delayed_refs, head);
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
> - atomic_dec(&delayed_refs->num_entries);
>  
>   trace_run_delayed_ref_head(fs_info, head, 0);
>  
> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
> btrfs_trans_handle *trans,
>   if (!mutex_trylock(&head->mutex))
>   goto out;
>  
> - /*
> -  * at this point we have a head with no other entries.  Go
> -  * ahead and process it.
> -  */
> - rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> - RB_CLEAR_NODE(&head->href_node);
> - atomic_dec(&delayed_refs->num_entries);
> -
> - /*
> -  * we don't take a ref on the node because we're removing it from the
> -  * tree, so we just steal the ref the tree was holding.
> -  */
> - delayed_refs->num_heads--;
> - if (head->processing == 0)
> - delayed_refs->num_heads_ready--;
> + btrfs_delete_ref_head(delayed_refs, head);
>   head->processing = 0;
> +
>   spin_unlock(&head->lock);
>   spin_unlock(&delayed_refs->lock);
>  
> 


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Goldwyn Rodrigues
On 11:07 06/12, Johannes Thumshirn wrote:
> On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> > This is a support for DAX in btrfs. I understand there have been
> > previous attempts at it. However, I wanted to make sure copy-on-write
> > (COW) works on dax as well.
> > 
> > Before I present this to the FS folks I wanted to run this through the
> > btrfs list. Much as I wish otherwise, I cannot get it correct the first
> > time around :/. Here are some questions for which I need suggestions:
> 
> Hi Goldwyn,
> 
> I've thrown your patches (from your git tree) onto one of my pmem test
> machines with this pmem config:

Thanks. I will check on this. Ordered extents have been a pain to deal
with for me (though mainly because of my incorrect usage).

> 
> mayhem:~/:[0]# ndctl list
> [
>   {
> "dev":"namespace1.0",
> "mode":"fsdax",
> "map":"dev",
> "size":792721358848,
> "uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
> "raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
> "sector_size":512,
> "blockdev":"pmem1",

Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-06 Thread Goldwyn Rodrigues
On  6:52 05/12, Christoph Hellwig wrote:
> If you want to export these at all they have to be EXPORT_SYMBOL_GPL.
> 

Understood.

> But I'd really like to avoid seeing another duplicate DAX I/O path.
> Please try to adopt the existing iomap-based infrastructure for your
> needs first.

This is not worthwhile for btrfs. With non-page-aligned I/O on btrfs, we
need to copy the first/last page of the extents for CoW, so we
would end up using the exported functions anyway. Believe me, I
spent some time trying to make btrfs iomap-compatible before giving up.
The problem is that btrfs needs to carry a lot of information across
iomap_begin and iomap_end. While the added private variable helps with
this, it also needs hooks in the bio_submit() functions for crc
calculations during direct writes.

-- 
Goldwyn


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-06 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:

Hi Goldwyn,

I've thrown your patches (from your git tree) onto one of my pmem test
machines with this pmem config:

mayhem:~/:[0]# ndctl list
[
  {
"dev":"namespace1.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"3fd4ab18-5145-4675-85a0-e05e6f9bcee4",
"raw_uuid":"49264743-2351-41c5-9db9-38534813df61",
"sector_size":512,
"blockdev":"pmem1",
"numa_node":1
  },
  {
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":792721358848,
"uuid":"dd0aec3c-7721-4621-8898-e50684a371b5",
"raw_uuid":"84ff5463-f76e-4ddf-a248-85122541e909",
"sector_size":4096,
"blockdev":"pmem0",
"numa_node":0
  }
]

Unfortunately I hit a btrfs_panic() with btrfs/002.
export TEST_DEV=/dev/pmem0
export SCRATCH_DEV=/dev/pmem1
export MOUNT_OPTIONS="-o dax"
./check
[...]
[  178.173113] run fstests btrfs/002 at 2018-12-06 10:55:43
[  178.357044] BTRFS info (device pmem0): disk space caching is enabled
[  178.357047] BTRFS info (device pmem0): has skinny extents
[  178.360042] BTRFS info (device pmem0): enabling ssd optimizations
[  178.475918] BTRFS: device fsid ee888255-7f4a-4bf7-af65-e8a6a354aca8
devid 1 transid 3 /dev/pmem1
[  178.505717] BTRFS info (device pmem1): disk space caching is enabled
[  178.513593] BTRFS info (device pmem1): has skinny extents
[  178.520384] BTRFS info (device pmem1): flagging fs with big metadata
feature
[  178.530997] BTRFS info (device pmem1): enabling ssd optimizations
[  178.538331] BTRFS info (device pmem1): creating UUID tree
[  178.587200] BTRFS critical (device pmem1): panic in
ordered_data_tree_panic:57: Inconsistency in ordered tree at offset 0
(errno=-17 Object already exists)
[  178.603129] [ cut here ]
[  178.608667] kernel BUG at fs/btrfs/ordered-data.c:57!
[  178.614333] invalid opcode:  [#1] SMP PTI
[  178.619295] CPU: 87 PID: 8225 Comm: dd Kdump: loaded Tainted: G
  E 4.20.0-rc5-default-btrfs-dax #920
[  178.630090] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
SE5C620.86B.0D.01.0010.072020182008 07/20/2018
[  178.640626] RIP: 0010:__btrfs_add_ordered_extent+0x325/0x410 [btrfs]
[  178.647404] Code: 28 4d 89 f1 49 c7 c0 90 9c 57 c0 b9 ef ff ff ff ba
39 00 00 00 48 c7 c6 10 fe 56 c0 48 8b b8 d8 03 00 00 31 c0 e8 e2 99 06
00 <0f> 0b 65 8b 05 d2 e4 b0 3f 89 c0 48 0f a3 05 78 5e cf c2 0f 92 c0
[  178.667019] RSP: 0018:a3e3674c7ba8 EFLAGS: 00010096
[  178.672684] RAX: 008f RBX: 9770c2ac5748 RCX:

[  178.680254] RDX: 97711f9dee80 RSI: 97711f9d6868 RDI:
97711f9d6868
[  178.687831] RBP: 97711d523000 R08:  R09:
065a
[  178.695411] R10: 03ff R11: 0001 R12:
97710d66da70
[  178.702993] R13: 9770c2ac5600 R14:  R15:
97710d66d9c0
[  178.710573] FS:  7fe11ef90700() GS:97711f9c()
knlGS:
[  178.719122] CS:  0010 DS:  ES:  CR0: 80050033
[  178.725380] CR2: 0156a000 CR3: 00eb30dfc006 CR4:
007606e0
[  178.732999] DR0:  DR1:  DR2:

[  178.740574] DR3:  DR6: fffe0ff0 DR7:
0400
[  178.748147] PKRU: 5554
[  178.751297] Call Trace:
[  178.754230]  btrfs_add_ordered_extent_dio+0x1d/0x30 [btrfs]
[  178.760269]  btrfs_create_dio_extent+0x79/0xe0 [btrfs]
[  178.765930]  btrfs_get_extent_map_write+0x1a9/0x2b0 [btrfs]
[  178.771959]  btrfs_file_dax_write+0x1f8/0x4f0 [btrfs]
[  178.777508]  ? current_time+0x3f/0x70
[  178.781672]  btrfs_file_write_iter+0x384/0x580 [btrfs]
[  178.787265]  ? pipe_read+0x243/0x2a0
[  178.791298]  __vfs_write+0xee/0x170
[  178.795241]  vfs_write+0xad/0x1a0
[  178.799008]  ? vfs_read+0x111/0x130
[  178.802949]  ksys_write+0x42/0x90
[  178.806712]  do_syscall_64+0x5b/0x180
[  178.810829]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  178.816334] RIP: 0033:0x7fe11eabb3d0
[  178.820364] Code: 73 01 c3 48 8b 0d b8 ea 2b 00 f7 d8 64 89 01 48 83
c8 ff c3 66 0f 1f 44 00 00 83 3d b9 43 2c 00 00 75 10 b8 01 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 2e 90 01 00 48 89 04 24
[  178.840052] RSP: 002b:7ffec969d978 EFLAGS: 0246 ORIG_RAX:
0001
[  178.848100] RAX: ffda RBX:  RCX:
7fe11eabb3d0
[  178.855715] RDX: 0400 RSI: 0156a000 RDI:
0001
[  178.863326] RBP: 0400 R08: 0003 R09:
7fe11ed7a698
[  178.870928] R10: 10a8b550 R11: 0246 

Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread ronnie sahlberg
Hi,

I am more of a SCSI guy than ATA so forgive where I am ignorant.

The SCSI equivalent to TRIM is called UNMAP.
UNMAP is unfortunately only a "hint" to the device, so if the device
is busy for any reason, it can just do a NO-OP, leave the data as is
and still return status SUCCESS.
That is not good if you want to wipe data for confidentiality reasons :-)

As UNMAP and TRIM are related, make sure that TRIM actually provides a
guarantee to wipe the data. I do not know ATA well enough to know
whether it does or not.

In SCSI, instead of using UNMAP/TRIM you can use WRITESAME10/16, which
does provide an overwrite/wipe guarantee. Maybe ATA has something
equivalent to WRITESAME10/16.

I just want to say: be careful, sometimes these commands do not
provide a guarantee that they will actually make the data
overwritten/unretrievable.

ronnie sahlberg


On Thu, Dec 6, 2018 at 4:24 PM Robert White  wrote:
>
> (1) Automatic and selective wiping of unused and previously used disk
> blocks is a good security measure, particularly when there is an
> encryption layer beneath the file system.
>
> (2) USB attached devices _never_ support TRIM and they are the most
> likely to fall into strangers hands.
>
> (3) I vaguely recall that some flash chips will take bulk writes of
> full sectors of 0x00 or 0xFF (I don't remember which) as second-best
> to TRIM for letting the flash controllers defragment their internals.
>
> So it would be dog-slow, but it would be neat if BTRFS had a mount
> option to convert any TRIM command from above into the write of a zero,
> 0xFF, or trash block to the device below if that device doesn't support
> TRIM. Real TRIM support would override the block write.
>
> Obviously doing an fstrim would involve a lot of slow device writes but
> only for people likely to do that sort of thing.
>
> For testing purposes the destruction of unused pages in this manner
> might catch file system failures or coding errors.
>
> (The other layer where this might be most appropriate is in cryptsetup
> et al, where it could lie about TRIM support, but that sort of stealth
> lag might be bad for filesystem-level operations. Doing it there would
> also loose the simpler USB use cases.)
>
> ...Just a thought...
>
> --Rob White.
>
>


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Robert White

On 12/5/18 9:37 PM, Jeff Mahoney wrote:
The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.


This is a potentially wild idea, so "grain of salt" and all that. I may 
misuse the exact wording.


So the essential problem of DAX is basically the opposite of 
data-deduplication. Instead of merging two duplicate data regions, you 
want to mark regions as at-risk while keeping the original content 
intact if there are snapshots in conflict.


So suppose you _require_ data checksums and data mode of "dup" or mirror 
or one of the other fault tolerant layouts.


By definition any block that gets written with content that it didn't 
have before will now have a bad checksum.


If the inode is flagged for direct IO that's an indication that the 
block has been updated.


At this point you really just need to do the opposite of deduplication, 
as in find/recover the original contents and assign/leave assigned those 
to the old/other snapshots, then compute the new checksum on the 
"original block" and assign it to the active subvolume.


So when a region is mapped for direct IO, and its refcount is greater 
than one, and you get to a sync or close event, you "recover" the old 
contents into a new location and assign those to "all the other users". 
Now that original storage region has only one user, so on sync or close 
you fix its checksums on the cheap.


Instead of the new data being a small rock sitting over a large rug to 
make a lump, the new data is like a rock being slid under the rug to 
make a lump.


So the first write to an extent creates a burdensome copy to retain the 
old contents, but second and subsequent writes to the same extent only 
have the cost of an _eventual_ checksum of the original block list.


Maybe If the data isn't already duplicated then the write mapping or the 
DAX open or the setting of the S_DUP flag could force the file into an 
extent block that _is_ duplicated.


The mental leap required is that the new blocks don't need to belong to 
the new state being created. The new blocks can be associated to the 
snapshots since data copy is idempotent.


The side note is that it only ever matters if the usage count is greater 
than one, so at worst taking a snapshot, which is already a _little_ 
racy anyway, would/could trigger a semi-lightweight copy of any S_DAX files:


If S_DAX :
  If checksum invalid :
copy data as-is and checksum, store in snapshot
  else : look for duplicate checksum
if duplicate found :
  assign that extent to the snapshot
else :
  If file opened for writing and has any mmaps for write :
copy extent and assign to new snapshot.
  else :
increment usage count and assign current block to snapshot

Anyway, I only know enough of the internals to be dangerous.

Since the real goal of mmap is speed during actual update, this idea is 
basically about amortizing the copy costs into the task of maintaining 
the snapshots instead of leaving them in the immediate hands of the 
time-critical updater.


The flush, unmmap, or close by the user, or a system-wide sync event, 
are also good points to expense the bookeeping time.


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-05 Thread Roman Mamedov
On Thu, 6 Dec 2018 06:11:46 +
Robert White  wrote:

> So it would be dog-slow, but it would be neat if BTRFS had a mount 
> option to convert any TRIM command from above into the write of a zero, 
> 0xFF, or trash block to the device below if that device doesn't support 
> TRIM. Real TRIM support would override the block write.

There is such a project:

  "dm-linear like target which provides discard, but replaces it with write of
  random data to a discarded region. Thus, discarded data is securely deleted."

  https://github.com/vt-alt/dm-secdel

-- 
With respect,
Roman


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 7:28 AM, Goldwyn Rodrigues wrote:

This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


Yep.  It has to be nodatasum, at least within the confines of datasum 
today.  DAX mmap writes are essentially in the same situation as with 
direct i/o when another thread modifies the buffer being submitted. 
Except rather than it being a race, it happens every time.  An 
alternative here could be to add the ability to mark a crc as unreliable 
and then go back and update them once the last DAX mmap reference is 
dropped on a range.  There's no reason to make this a requirement of the 
initial implementation, though.



2. Currently, a user can continue writing on "old" extents of an mmaped file
after a snapshot has been created. How can we enforce writes to be directed
to new extents after snapshots have been created? Do we keep a list of
all mmap()s, and re-mmap them after a snapshot?


It's the second question that's the hard part.  As Adam describes later, 
setting each pfn read-only will ensure page faults cause the remapping.


The high level idea that Jan Kara and I came up with in our conversation 
at Labs conf is pretty expensive.  We'd need to set a flag that pauses 
new page faults, set the WP bit on affected ranges, do the snapshot, 
commit, clear the flag, and wake up the waiting threads.  Neither of us 
had any concrete idea of how well that would perform and it still 
depends on finding a good way to resolve all open mmap ranges on a 
subvolume.  Perhaps using the address_space->private_list anchored on 
each root would work.


-Jeff


Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
command line parameter.


[PATCH 01/10] btrfs: create a mount option for dax
[PATCH 02/10] btrfs: basic dax read
[PATCH 03/10] btrfs: dax: read zeros from holes
[PATCH 04/10] Rename __endio_write_update_ordered() to
[PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
[PATCH 06/10] btrfs: dax write support
[PATCH 07/10] dax: export functions for use with btrfs
[PATCH 08/10] btrfs: dax add read mmap path
[PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
[PATCH 10/10] btrfs: dax mmap write

  fs/btrfs/Makefile   |1
  fs/btrfs/ctree.h|   17 ++
  fs/btrfs/dax.c  |  303 ++--
  fs/btrfs/file.c |   29 
  fs/btrfs/inode.c|   54 +
  fs/btrfs/ioctl.c|5
  fs/btrfs/super.c|   15 ++
  fs/dax.c|   35 --
  include/linux/dax.h |   16 ++
  9 files changed, 430 insertions(+), 45 deletions(-)




--
Jeff Mahoney
SUSE Labs



Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Jeff Mahoney

On 12/5/18 8:03 AM, Qu Wenruo wrote:



On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:

This is a support for DAX in btrfs. I understand there have been
previous attempts at it. However, I wanted to make sure copy-on-write
(COW) works on dax as well.

Before I present this to the FS folks I wanted to run this through the
btrfs. Even though I wish, I cannot get it correct the first time
around :/.. Here are some questions for which I need suggestions:

Questions:
1. I have been unable to do checksumming for DAX devices. While
checksumming can be done for reads and writes, it is a problem when mmap
is involved because btrfs kernel module does not get back control after
an mmap() writes. Any ideas are appreciated, or we would have to set
nodatasum when dax is enabled.


I'm not familar with DAX, so it's completely possible I'm talking like
an idiot.


The general idea is:

1) there is no page cache involved. read() and write() are like direct 
i/o writes in concept.  The user buffer is written directly (via what is 
essentially a specialized memcpy) to the NVDIMM.
2) for mmap, once the mapping is established and mapped, the file system 
is not involved.  The application writes directly to the memory as it 
would a normal mmap, except it's persistent.  All that's required to 
ensure persistence is a CPU cache flush.  The only way the file system 
is involved again is if some operation has occurred to reset the WP bit.



If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.


It can't, because it is only invoked on the page fault path and we want 
to try to limit those as much as possible.



Forcing page fault for every mmap() read/write (completely disable page
cache like DIO).
So that we could get some control since we're informed to read the page
and do some hacks there.

There's no way to force a page fault for every mmap read/write.  Even if 
there was, we wouldn't want that.  No user would turn that on when they 
can just make similar guarantees in their app (which are typically apps 
that do this already) and not pay any performance penalty.   The idea 
with DAX mmap is that the file system manages the namespace, space 
allocation, and permissions.  Otherwise we stay out of the way.


-Jeff
--
Jeff Mahoney
SUSE Labs



Re: [PATCH][v2] btrfs: run delayed items before dropping the snapshot

2018-12-05 Thread Filipe Manana
On Wed, Dec 5, 2018 at 5:14 PM Josef Bacik  wrote:
>
> From: Josef Bacik 
>
> With my delayed refs patches in place we started seeing a large amount
> of aborts in __btrfs_free_extent
>
> BTRFS error (device sdb1): unable to find ref byte nr 91947008 parent 0 root 
> 35964  owner 1 offset 0
> Call Trace:
>  ? btrfs_merge_delayed_refs+0xaf/0x340
>  __btrfs_run_delayed_refs+0x6ea/0xfc0
>  ? btrfs_set_path_blocking+0x31/0x60
>  btrfs_run_delayed_refs+0xeb/0x180
>  btrfs_commit_transaction+0x179/0x7f0
>  ? btrfs_check_space_for_delayed_refs+0x30/0x50
>  ? should_end_transaction.isra.19+0xe/0x40
>  btrfs_drop_snapshot+0x41c/0x7c0
>  btrfs_clean_one_deleted_snapshot+0xb5/0xd0
>  cleaner_kthread+0xf6/0x120
>  kthread+0xf8/0x130
>  ? btree_invalidatepage+0x90/0x90
>  ? kthread_bind+0x10/0x10
>  ret_from_fork+0x35/0x40
>
> This was because btrfs_drop_snapshot depends on the root not being modified
> while it's dropping the snapshot.  It will unlock the root node (and really
> every node) as it walks down the tree, only to re-lock it when it needs to do
> something.  This is a problem because if we modify the tree we could cow a 
> block
> in our path, which free's our reference to that block.  Then once we get back 
> to
> that shared block we'll free our reference to it again, and get ENOENT when
> trying to lookup our extent reference to that block in __btrfs_free_extent.
>
> This is ultimately happening because we have delayed items left to be 
> processed
> for our deleted snapshot _after_ all of the inodes are closed for the 
> snapshot.
> We only run the delayed inode item if we're deleting the inode, and even then 
> we
> do not run the delayed insertions or delayed removals.  These can be run at 
> any
> point after our final inode does it's last iput, which is what triggers the
> snapshot deletion.  We can end up with the snapshot deletion happening and 
> then
> have the delayed items run on that file system, resulting in the above 
> problem.
>
> This problem has existed forever, however my patches made it much easier to 
> hit
> as I wake up the cleaner much more often to deal with delayed iputs, which 
> made
> us more likely to start the snapshot dropping work before the transaction
> commits, which is when the delayed items would generally be run.  Before,
> generally speaking, we would run the delayed items, commit the transaction, 
> and
> wakeup the cleaner thread to start deleting snapshots, which means we were 
> less
> likely to hit this problem.  You could still hit it if you had multiple
> snapshots to be deleted and ended up with lots of delayed items, but it was
> definitely harder.
>
> Fix for now by simply running all the delayed items before starting to drop 
> the
> snapshot.  We could make this smarter in the future by making the delayed 
> items
> per-root, and then simply drop any delayed items for roots that we are going 
> to
> delete.  But for now just a quick and easy solution is the safest.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Josef Bacik 

Reviewed-by: Filipe Manana 

Looks good now. Thanks.

> ---
> v1->v2:
> - check for errors from btrfs_run_delayed_items.
> - Dave I can reroll the series, but the second version of patch 1 is the same,
>   let me know what you want.
>
>  fs/btrfs/extent-tree.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index dcb699dd57f3..473084eb7a2d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -9330,6 +9330,10 @@ int btrfs_drop_snapshot(struct btrfs_root *root,
> goto out_free;
> }
>
> +   err = btrfs_run_delayed_items(trans);
> +   if (err)
> +   goto out_end_trans;
> +
> if (block_rsv)
> trans->block_rsv = block_rsv;
>
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: btrfs progs always assume devid 1?

2018-12-05 Thread Austin S. Hemmelgarn

On 2018-12-05 14:50, Roman Mamedov wrote:

Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
Total devices 1 FS bytes used 36.63GiB
devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such operation), then count
that as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for hard-coded devid 1.



I've been meaning to try and write up a patch to special-case this for a 
while now, but have not gotten around to it yet.


FWIW, this is one of multiple reasons that it's highly recommended to 
use `btrfs replace` instead of adding a new device and deleting the old 
one when replacing a device.  Other benefits include:


* It doesn't have to run in the foreground (and doesn't by default).
* It usually takes less time.
* Replace operations can be queried while running to get a nice 
indication of the completion percentage.


The only disadvantage is that the new device has to be at least as large 
as the old one (though you can get around this to a limited degree by 
shrinking the old device), and it needs the old and new device to be 
plugged in at the same time (add/delete doesn't, if you flip the order 
of the add and delete commands).


Re: Linux-next regression?

2018-12-05 Thread Chris Mason
On 5 Dec 2018, at 5:59, Andrea Gelmini wrote:

> On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote:
>> I think (hope) this is:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>
>> Which was just nailed down to a blkmq bug.  It triggers when you have
>> scsi devices using elevator=none over blkmq.
>
> Thanks a lot Chris. Really.
> Good news: I confirm I recompiled and used blkmq and no-op (at that 
> time).
> Also, the massive write of btrfs defrag can explain the massive 
> trigger of
> the bug, and next corruption.

Sorry this happened, but glad you were able to confirm that it explains 
the trouble you hit.  Thanks for the report, I did end up using this as 
a datapoint to convince myself the bugzilla above wasn't ext4 specific.

-chris


Re: [RFC PATCH] btrfs: Remove __extent_readpages

2018-12-05 Thread Josef Bacik
On Mon, Dec 03, 2018 at 12:25:32PM +0200, Nikolay Borisov wrote:
> When extent_readpages is called from the generic readahead code it first
> builds a batch of 16 pages (which might or might not be consecutive,
> depending on whether add_to_page_cache_lru failed) and submits them to
> __extent_readpages. The latter ensures that the range of pages (in the
> batch of 16) that is passed to __do_contiguous_readpages is consecutive.
> 
> If add_to_page_cache_lru doesn't fail then __extent_readpages will call
> __do_contiguous_readpages only once with the whole batch of 16.
> Alternatively, if add_to_page_cache_lru fails once on the 8th page (as
> an example) then the contiguous page read code will be called twice.
> 
> All of this can be simplified by exploiting the fact that all pages passed
> to extent_readpages are consecutive, thus when the batch is built in
> that function it is already consecutive (barring add_to_page_cache_lru
> failures) so are ready to be submitted directly to __do_contiguous_readpages.
> Also simplify the name of the function to contiguous_readpages. 
> 
> Signed-off-by: Nikolay Borisov 
> ---
> 
> So this patch looks like a very nice cleanup, however when doing performance 
> measurements with fio I was shocked to see that it actually is detrimental to 
> performance. Here are the results: 
> 
> The command line used for fio: 
> fio --name=/media/scratch/seqread --rw=read --direct=0 --ioengine=sync --bs=4k
>  --numjobs=1 --size=1G --runtime=600  --group_reporting --loop 10
> 
> This was tested on a vm with 4g of ram so the size of the test is smaller 
> than 
> the memory, so pages should have been nicely readahead. 
> 

What this patch changes is now we aren't reading all of the pages we are passed
by the readahead, so now we fall back to per-page reading when we go to read
those pages because the readahead window has already moved past them.  This
isn't great behavior, what we have is nice in that it tries to group the entire
range together as much as possible.  What your patch changes is that as soon as
we stop having a contiguous range we just bail and submit what we have.  Testing
it in isolation is likely to show very little change, but having recently
touched the fault-in code, where we definitely do not count major faults in
some cases, I'd suspect that we're not reflecting this higher fault rate in the
performance counters properly.  We should preserve the existing behavior, what
hurts a little bit on a lightly loaded box is going to hurt a whole lot more on
a heavily loaded box.  Thanks,

Josef


Re: [PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-12-05 Thread David Sterba
On Wed, Nov 21, 2018 at 05:10:52PM +0200, Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>btrfs_balance+0xbeb/0x1740 [btrfs]
>btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This happens because during page writeback it's valid for
> writepage_delalloc to instantiate a delalloc range which doesn't
> belong to the page currently being written back.
> 
> The reason this case is valid is due to find_lock_delalloc_range
> returning any available range after the passed delalloc_start and
> ignoring whether the page under writeback is within that range.
> In turn ordered extents (OE) are always created for the returned range
> from find_lock_delalloc_range. If, however, a failure occurs while OE
> are being created then the clean up code in btrfs_cleanup_ordered_extents
> will be called.
> 
> Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
> the case of such 'foreign' range being processed and instead it always
> assumes that the range OE are created for belongs to the page. This
> leads to the first page of such foreign range to not be cleaned up since
> it's deliberately skipped by the current cleaning up code.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Reviewed-by: Josef Bacik 

Added to misc-next, thanks.


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 02:43:03PM +0200, Nikolay Borisov wrote:
> One question below though .
> 
> > +++ b/fs/btrfs/super.c
> > @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> > char *options,
> > case Opt_user_subvol_rm_allowed:
> > btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
> > break;
> > +#ifdef CONFIG_FS_DAX
> > +   case Opt_dax:
> > +   if (btrfs_super_num_devices(info->super_copy) > 1) {
> > +   btrfs_info(info,
> > +  "dax not supported for multi-device 
> > btrfs partition\n");
> 
> What prevents supporting dax for multiple devices so long as all devices
> are dax?

As I mentioned in a separate mail, most profiles are either redundant
(RAID0), require hardware support (RAID1, DUP) or are impossible (RAID5,
RAID6).

But, "single" profile multi-device would be useful and actually provide
something other dax-supporting filesystems don't have: combining multiple
devices into one logical piece.

On the other hand, DUP profiles need to be banned.  In particular, the
filesystem you mount might have existing DUP block groups.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Christoph Hellwig
If you want to export these at all they have to be EXPORT_SYMBOL_GPL.

But I'd really like to avoid seeing another duplicate DAX I/O path.
Please try to adopt the existing iomap-based infrastructure for your
needs first.


Re: [RFC PATCH 3/3] coccinelle: api: add offset_in_page.cocci

2018-12-05 Thread Julia Lawall



On Wed, 5 Dec 2018, Johannes Thumshirn wrote:

> Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can be
> replaced by the offset_in_page() macro instead of open-coding it.
>
> Add a coccinelle semantic patch to ease detection and conversion of these.
>
> This unfortunately doesn't account for the case when we want PAGE_ALIGNED()
> instead of offset_in_page() yet.
>
> Cc: Julia Lawall 
> Signed-off-by: Johannes Thumshirn 
> ---
>  scripts/coccinelle/api/offset_in_page.cocci | 81 
> +
>  1 file changed, 81 insertions(+)
>  create mode 100644 scripts/coccinelle/api/offset_in_page.cocci
>
> diff --git a/scripts/coccinelle/api/offset_in_page.cocci 
> b/scripts/coccinelle/api/offset_in_page.cocci
> new file mode 100644
> index ..ea5b3a8e0390
> --- /dev/null
> +++ b/scripts/coccinelle/api/offset_in_page.cocci
> @@ -0,0 +1,81 @@
> +// SPDX-License-Identifier: GPL-2.0
> +///
> +/// Use offset_in_page macro on address instead of explicit computation.
> +///
> +//  Confidence: High
> +//  Keywords: offset_in_page
> +//  Comment: Based on vma_pages.cocci
> +
> +virtual context
> +virtual patch
> +virtual org
> +virtual report
> +
> +
> +//--
> +//  For context mode
> +//--
> +
> +@r_context depends on context && !patch && !org && !report@
> +expression E;
> +@@
> +
> +(
> +* E & ~PAGE_MASK
> +|
> +* E & (PAGE_SIZE - 1)
> +)
> +
> +
> +//--
> +//  For patch mode
> +//--
> +
> +@r_patch depends on !context && patch && !org && !report@
> +expression E;
> +type T;
> +@@
> +
> +(
> +- E & ~PAGE_MASK
> ++ offset_in_page(E)
> +|
> +- E & (PAGE_SIZE - 1)
> ++ offset_in_page(E)

The two lines above should be subsumed by the two lines below.  When there
is a type metavariable that has no other dependencies, an isomorphism will
consider that it is either present or absent.

Why not include the cast case for the context and org cases?

Masahiro will ultimately commit this.  I have added him to CC.

Thanks for the contribution.

julia


> +|
> +- E & ((T)PAGE_SIZE - 1)
> ++ offset_in_page(E)
> +)
> +
> +//--
> +//  For org mode
> +//--
> +
> +@r_org depends on !context && !patch && (org || report)@
> +expression E;
> +position p;
> +@@
> +
> +  (
> +  * E@p & ~PAGE_MASK
> +  |
> +  * E@p & (PAGE_SIZE - 1)
> +  )
> +
> +@script:python depends on report@
> +p << r_org.p;
> +x << r_org.E;
> +@@
> +
> +msg="WARNING: Consider using offset_in_page helper on %s" % (x)
> +coccilib.report.print_report(p[0], msg)
> +
> +@script:python depends on org@
> +p << r_org.p;
> +x << r_org.E;
> +@@
> +
> +msg="WARNING: Consider using offset_in_page helper on %s" % (x)
> +msg_safe=msg.replace("[","@(").replace("]",")")
> +coccilib.org.print_todo(p[0], msg_safe)
> +
> --
> 2.16.4
>
>


Re: [PATCH 1/3] btrfs: use offset_in_page instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 16:23, Johannes Thumshirn wrote:
> Constructs like 'var & (PAGE_SIZE - 1)' or 'var & ~PAGE_MASK' can denote an
> offset into a page.
> 
> So replace them by the offset_in_page() macro instead of open-coding it if
> they're not used as an alignment check.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/check-integrity.c | 12 +--
>  fs/btrfs/compression.c |  2 +-
>  fs/btrfs/extent_io.c   | 53 
> +-
>  fs/btrfs/file.c|  4 ++--
>  fs/btrfs/inode.c   |  7 +++---
>  fs/btrfs/send.c|  2 +-
>  fs/btrfs/volumes.c |  2 +-
>  7 files changed, 38 insertions(+), 44 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index 781cae168d2a..d319c3020c09 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1202,24 +1202,24 @@ static void btrfsic_read_from_block_data(
>   void *dstv, u32 offset, size_t len)
>  {
>   size_t cur;
> - size_t offset_in_page;
> + size_t pgoff;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = block_ctx->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(block_ctx->start);
>   unsigned long i = (start_offset + offset) >> PAGE_SHIFT;
>  
>   WARN_ON(offset + len > block_ctx->len);
> - offset_in_page = (start_offset + offset) & (PAGE_SIZE - 1);
> + pgoff = offset_in_page(start_offset + offset);
>  
>   while (len > 0) {
> - cur = min(len, ((size_t)PAGE_SIZE - offset_in_page));
> + cur = min(len, ((size_t)PAGE_SIZE - pgoff));
>   BUG_ON(i >= DIV_ROUND_UP(block_ctx->len, PAGE_SIZE));
>   kaddr = block_ctx->datav[i];
> - memcpy(dst, kaddr + offset_in_page, cur);
> + memcpy(dst, kaddr + pgoff, cur);
>  
>   dst += cur;
>   len -= cur;
> - offset_in_page = 0;
> + pgoff = 0;
>   i++;
>   }
>  }
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index dba59ae914b8..2ab5591449f2 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -477,7 +477,7 @@ static noinline int add_ra_bio_pages(struct inode *inode,
>  
>   if (page->index == end_index) {
>   char *userpage;
> - size_t zero_offset = isize & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(isize);
>  
>   if (zero_offset) {
>   int zeros;
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index b2769e92b556..e365c5272e6b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2585,7 +2585,7 @@ static void end_bio_extent_readpage(struct bio *bio)
>   unsigned off;
>  
>   /* Zero out the end if this page straddles i_size */
> - off = i_size & (PAGE_SIZE-1);
> + off = offset_in_page(i_size);
>   if (page->index == end_index && off)
>   zero_user_segment(page, off, PAGE_SIZE);
>   SetPageUptodate(page);
> @@ -2888,7 +2888,7 @@ static int __do_readpage(struct extent_io_tree *tree,
>  
>   if (page->index == last_byte >> PAGE_SHIFT) {
>   char *userpage;
> - size_t zero_offset = last_byte & (PAGE_SIZE - 1);
> + size_t zero_offset = offset_in_page(last_byte);
>  
>   if (zero_offset) {
>   iosize = PAGE_SIZE - zero_offset;
> @@ -3432,7 +3432,7 @@ static int __extent_writepage(struct page *page, struct 
> writeback_control *wbc,
>  
>   ClearPageError(page);
>  
> - pg_offset = i_size & (PAGE_SIZE - 1);
> + pg_offset = offset_in_page(i_size);
>   if (page->index > end_index ||
>  (page->index == end_index && !pg_offset)) {
>   page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
> @@ -5307,7 +5307,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
> void *dstv,
>   struct page *page;
>   char *kaddr;
>   char *dst = (char *)dstv;
> - size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
> + size_t start_offset = offset_in_page(eb->start);
>   unsigned long i = (start_offset + start) >> PAGE_SHIFT;
>  
>   if (start + len > eb->len) {
> @@ -5317,7 +5317,7 @@ void read_extent_buffer(const struct extent_buffer *eb, 
> void *dstv,
>   return;
>   }
>  
> - offset = (start_offset + start) & (PAGE_SIZE - 1);
> + offset = offset_in_page(start_offset + start);
>  
>   while (len > 0) {
>   page = eb->pages[i];
> @@ -5342,14 +5342,14 @@ int read_extent_buffer_to_user(const struct 
> extent_buffer *eb,
>   struct page *page;
>   char *kaddr;
>   char __user *dst = (char __user *)dstv;
> 

Re: [PATCH 2/3] btrfs: use PAGE_ALIGNED instead of open-coding it

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 16:23, Johannes Thumshirn wrote:
> When using a 'var & (PAGE_SIZE - 1)' construct one is checking for a page
> alignment and thus should use the PAGE_ALIGNED() macro instead of
> open-coding it.
> 
> Convert all open-coded occurrences of PAGE_ALIGNED().
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/check-integrity.c | 8 
>  fs/btrfs/compression.c | 2 +-
>  fs/btrfs/inode.c   | 2 +-
>  3 files changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
> index d319c3020c09..84e9729badaa 100644
> --- a/fs/btrfs/check-integrity.c
> +++ b/fs/btrfs/check-integrity.c
> @@ -1601,7 +1601,7 @@ static int btrfsic_read_block(struct btrfsic_state 
> *state,
>   BUG_ON(block_ctx->datav);
>   BUG_ON(block_ctx->pagev);
>   BUG_ON(block_ctx->mem_to_free);
> - if (block_ctx->dev_bytenr & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(block_ctx->dev_bytenr)) {
>   pr_info("btrfsic: read_block() with unaligned bytenr %llu\n",
>  block_ctx->dev_bytenr);
>   return -1;
> @@ -1778,7 +1778,7 @@ static void btrfsic_process_written_block(struct 
> btrfsic_dev_state *dev_state,
>   return;
>   }
>   is_metadata = 1;
> - BUG_ON(BTRFS_SUPER_INFO_SIZE & (PAGE_SIZE - 1));
> + BUG_ON(!PAGE_ALIGNED(BTRFS_SUPER_INFO_SIZE));
>   processed_len = BTRFS_SUPER_INFO_SIZE;
>   if (state->print_mask &
>   BTRFSIC_PRINT_MASK_TREE_BEFORE_SB_WRITE) {
> @@ -2892,12 +2892,12 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info,
>   struct list_head *dev_head = &fs_devices->devices;
>   struct btrfs_device *device;
>  
> - if (fs_info->nodesize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->nodesize)) {
>   pr_info("btrfsic: cannot handle nodesize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->nodesize, PAGE_SIZE);
>   return -1;
>   }
> - if (fs_info->sectorsize & ((u64)PAGE_SIZE - 1)) {
> + if (!PAGE_ALIGNED(fs_info->sectorsize)) {
>   pr_info("btrfsic: cannot handle sectorsize %d not being a 
> multiple of PAGE_SIZE %ld!\n",
>  fs_info->sectorsize, PAGE_SIZE);
>   return -1;
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 2ab5591449f2..d5381f39a9e8 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -301,7 +301,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode 
> *inode, u64 start,
>   blk_status_t ret;
>   int skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
>  
> - WARN_ON(start & ((u64)PAGE_SIZE - 1));
> + WARN_ON(!PAGE_ALIGNED(start));
>   cb = kmalloc(compressed_bio_size(fs_info, compressed_len), GFP_NOFS);
>   if (!cb)
>   return BLK_STS_RESOURCE;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index bc0564c384de..5c52e91b01e8 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2027,7 +2027,7 @@ int btrfs_set_extent_delalloc(struct inode *inode, u64 
> start, u64 end,
> unsigned int extra_bits,
> struct extent_state **cached_state, int dedupe)
>  {
> - WARN_ON((end & (PAGE_SIZE - 1)) == 0);
> + WARN_ON(PAGE_ALIGNED(end));
>   return set_extent_delalloc(&BTRFS_I(inode)->io_tree, start, end,
>  extra_bits, cached_state);
>  }
> 


Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-05 Thread Qu Wenruo


On 2018/12/5 at 9:40 PM, David Sterba wrote:
> On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
>> GCC 8.2.1 will report the following warning with "make W=1":
>>
>>   ctree.c: In function 'btrfs_next_sibling_tree_block':
>>   ctree.c:2990:21: warning: 'slot' may be used uninitialized in this 
>> function [-Wmaybe-uninitialized]
>> path->slots[level] = slot;
>> ~~~^~
>>
>> The culprit is the following code:
>>
>>  int slot;   << Not initialized
>>  int level = path->lowest_level + 1;
>>  BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>  while(level < BTRFS_MAX_LEVEL) {
>>  slot = path->slots[level] + 1;
>>  ^^ but we initialize @slot here.
>>  ...
>>  }
>>  path->slots[level] = slot;
>>
>> It's possible that compiler doesn't get enough hint for BUG_ON() on
>> lowest_level + 1 >= BTRFS_MAX_LEVEL case.
>>
>> Fix it by using a do {} while() loop other than while() {} loop, to
>> ensure we will run the loop at least once.
> 
> I was hoping that we can actually add the hint to BUG_ON so the code
> does not continue if the condition is true.
> 
I checked that method, but I'm not that confident about things like:

bugon_trace()
{
if (!val)
return;
__bugon_trace();
}

__attribute__ ((noreturn))
static inline void __bugon_trace();

This is as simple as just one extra function call, but the original
problem is just one more function call before hitting abort().

So I just gave up on tweaking things I'm not comfortable enough with.

The current do {} while() loop is the most direct solution; if gcc one
day still gives such a warning, I'll have some harsh words for it then.

Thanks,
Qu





Re: [PATCH 07/10] dax: export functions for use with btrfs

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:

[...]

> -static void *grab_mapping_entry(struct xa_state *xas,
> +void *grab_mapping_entry(struct xa_state *xas,
>   struct address_space *mapping, unsigned long size_flag)
>  {
>   unsigned long index = xas->xa_index;
> @@ -531,6 +532,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
>   xas_unlock_irq(xas);
>   return xa_mk_internal(VM_FAULT_FALLBACK);
>  }
> +EXPORT_SYMBOL(grab_mapping_entry);

dax_grab_mapping_entry() please.



-- 
Johannes Thumshirn                        SUSE Labs Filesystems
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel, you
mmap the file then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and having the processor+memory persist what
you wrote.

CoW via page faults are fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

Only way I see would be to CoW once then pretend the page is nodatacow until
the next commit, when we checksum it, add to the metadata trees, and mark
for CoWing on the next write.  Lots of complexity, and you still need to
copy the whole thing every commit (so no gain).

Ie, we're in nodatacow land.  CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (ie, "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, it's not
in the danger area though.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 

What would make sense is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and full set of btrfs features for the rest.  Write dax is dangerous
for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 06/10] btrfs: dax write support

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
[...]

> +static int copy_extent_page(struct extent_map *em, void *daddr, u64 pos)
> +{
> +struct dax_device *dax_dev;

^ space instead of tabs?

> + void *saddr;
> + sector_t start;
> + size_t len;
> +
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + memset(daddr, 0, PAGE_SIZE);
> + } else {
> + dax_dev = fs_dax_get_by_bdev(em->bdev);
> + start = (get_start_sect(em->bdev) << 9) + (em->block_start + 
> (pos - em->start));
> + len = dax_direct_access(dax_dev, PHYS_PFN(start), 1, &saddr, 
> > NULL);
> + memcpy(daddr, saddr, PAGE_SIZE);
> + }
> + free_extent_map(em);
> +
> + return 0;
> +}
> +
> +

copy_extent_page() always returns 0, why not make it void?
Plus a nit: double newline.

> +ssize_t btrfs_file_dax_write(struct kiocb *iocb, struct iov_iter *from)
> +{
> + ssize_t ret, done = 0, count = iov_iter_count(from);
> +struct inode *inode = file_inode(iocb->ki_filp);
^ again spaces vs tabs.

> + u64 pos = iocb->ki_pos;
> + u64 start = round_down(pos, PAGE_SIZE);
> + u64 end = round_up(pos + count, PAGE_SIZE);
> + struct extent_state *cached_state = NULL;
> + struct extent_changeset *data_reserved = NULL;
> + struct extent_map *first = NULL, *last = NULL;
> +
> + ret = btrfs_delalloc_reserve_space(inode, &data_reserved, start, end - 
> start);
> + if (ret < 0)
> + return ret;
> +
> + /* Grab a reference of the first extent to copy data */
> + if (start < pos) {
> + first = btrfs_get_extent(BTRFS_I(inode), NULL, 0, start, end - 
> start, 0);
> + if (IS_ERR(first)) {
> + ret = PTR_ERR(first);
> + goto out2;
> + }
> + }

You're using 'end - start' at least twice here, maybe you could move
'len' out of the loop and use it for btrfs_delalloc_reserve_space() and
btrfs_get_extent() as well.

> +
> + /* Grab a reference of the last extent to copy data */
> + if (pos + count < end) {
> + last = btrfs_get_extent(BTRFS_I(inode), NULL, 0, end - 
> PAGE_SIZE, PAGE_SIZE, 0);
> + if (IS_ERR(last)) {
> + ret = PTR_ERR(last);
> + goto out2;
> + }
> + }
> +
> + lock_extent_bits(&BTRFS_I(inode)->io_tree, start, end, &cached_state);
> + while (done < count) {
> + struct extent_map *em;
> + struct dax_device *dax_dev;
> + int offset = pos & (PAGE_SIZE - 1);
> + u64 estart = round_down(pos, PAGE_SIZE);
> + u64 elen = end - estart;
> + size_t len = count - done;
> + sector_t dstart;
> + void *daddr;
> + ssize_t maplen;
> +
> + /* Read the current extent */
> +em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, estart, elen, 
> 0);

Space again.

> + if (IS_ERR(em)) {
> + ret = PTR_ERR(em);
> + goto out;
> + }
> +
> + /* Get a new extent */
> + ret = btrfs_get_extent_map_write(&em, NULL, inode, estart, 
> elen);
> + if (ret < 0)
> + goto out;
> +
> + dax_dev = fs_dax_get_by_bdev(em->bdev);
> + /* Calculate start address start of destination extent */
> + dstart = (get_start_sect(em->bdev) << 9) + em->block_start;
> + maplen = dax_direct_access(dax_dev, PHYS_PFN(dstart),
> + PHYS_PFN(em->len), &daddr, NULL);
> +
> + /* Copy front of extent page */
> + if (offset)
> + ret = copy_extent_page(first, daddr, estart);
> +
> + /* Copy end of extent page */
> + if ((pos + len > estart + PAGE_SIZE) && (pos + len < em->start 
> + em->len))
> + ret = copy_extent_page(last, daddr + em->len - 
> PAGE_SIZE, em->start + em->len - PAGE_SIZE);
> +
> + /* Copy the data from the iter */
> + maplen = PFN_PHYS(maplen);
> + maplen -= offset;
> + ret = dax_copy_from_iter(dax_dev, dstart, daddr + offset, 
> maplen, from);
> + if (ret < 0)
> + goto out;
> + pos += ret;
> + done += ret;
> + }
> +out:

out_unlock?

> + unlock_extent_cached(&BTRFS_I(inode)->io_tree, start, end, 
> > &cached_state);
> + if (done) {
> + btrfs_update_ordered_extent(inode, start,
> + end - start, true);
> + iocb->ki_pos += done;
> + if (iocb->ki_pos > i_size_read(inode))
> + i_size_write(inode, iocb->ki_pos);
> + }
> +
> + btrfs_delalloc_release_extents(BTRFS_I(inode), count, false);
> +out2:

out?

> + if (count - done > 0)
> + btrfs_delalloc_release_space(inode, data_reserved, pos,
> + count 

Re: [PATCH v2 07/13] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-05 Thread David Sterba
On Wed, Dec 05, 2018 at 02:40:12PM +0800, Qu Wenruo wrote:
> GCC 8.2.1 will report the following warning with "make W=1":
> 
>   ctree.c: In function 'btrfs_next_sibling_tree_block':
>   ctree.c:2990:21: warning: 'slot' may be used uninitialized in this function 
> [-Wmaybe-uninitialized]
> path->slots[level] = slot;
> ~~~^~
> 
> The culprit is the following code:
> 
>   int slot;   << Not initialized
>   int level = path->lowest_level + 1;
>   BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>   while(level < BTRFS_MAX_LEVEL) {
>   slot = path->slots[level] + 1;
>   ^^ but we initialize @slot here.
>   ...
>   }
>   path->slots[level] = slot;
> 
> It's possible that compiler doesn't get enough hint for BUG_ON() on
> lowest_level + 1 >= BTRFS_MAX_LEVEL case.
> 
> Fix it by using a do {} while() loop other than while() {} loop, to
> ensure we will run the loop at least once.

I was hoping that we can actually add the hint to BUG_ON so the code
does not continue if the condition is true.


Re: [PATCH 04/10] Rename __endio_write_update_ordered() to btrfs_update_ordered_extent()

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Since we will be using it in another part of the code, use a
> better name and declare it non-static
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  7 +--
>  fs/btrfs/inode.c | 14 +-
>  2 files changed, 10 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 038d64ecebe5..5144d28216b0 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3170,8 +3170,11 @@ struct inode *btrfs_iget_path(struct super_block *s, 
> struct btrfs_key *location,
>  struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
>struct btrfs_root *root, int *was_new);
>  struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
> - struct page *page, size_t pg_offset,
> - u64 start, u64 end, int create);
> + struct page *page, size_t pg_offset,
> + u64 start, u64 end, int create);
> +void btrfs_update_ordered_extent(struct inode *inode,
> + const u64 offset, const u64 bytes,
> + const bool uptodate);
>  int btrfs_update_inode(struct btrfs_trans_handle *trans,
> struct btrfs_root *root,
> struct inode *inode);
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 9ea4c6f0352f..96e9fe9e4150 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -97,10 +97,6 @@ static struct extent_map *create_io_em(struct inode 
> *inode, u64 start, u64 len,
>  u64 ram_bytes, int compress_type,
>  int type);
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> -  const u64 offset, const u64 bytes,
> -  const bool uptodate);
> -
>  /*
>   * Cleanup all submitted ordered extents in specified range to handle errors
>   * from the fill_dellaloc() callback.
> @@ -130,7 +126,7 @@ static inline void btrfs_cleanup_ordered_extents(struct 
> inode *inode,
>   ClearPagePrivate2(page);
>   put_page(page);
>   }
> - return __endio_write_update_ordered(inode, offset + PAGE_SIZE,
> + return btrfs_update_ordered_extent(inode, offset + PAGE_SIZE,
>   bytes - PAGE_SIZE, false);
>  }
>  
> @@ -8059,7 +8055,7 @@ static void btrfs_endio_direct_read(struct bio *bio)
>   bio_put(bio);
>  }
>  
> -static void __endio_write_update_ordered(struct inode *inode,
> +void btrfs_update_ordered_extent(struct inode *inode,
>const u64 offset, const u64 bytes,
>const bool uptodate)

Since you are exporting the function I'd suggest using the occasion to
introduce proper kernel-doc. The primary help will be if you document
the context under which the function is called - when writes are
finished and it's used to update the respective portion of the ordered extent.

>  {
> @@ -8112,7 +8108,7 @@ static void btrfs_endio_direct_write(struct bio *bio)
>   struct btrfs_dio_private *dip = bio->bi_private;
>   struct bio *dio_bio = dip->dio_bio;
>  
> - __endio_write_update_ordered(dip->inode, dip->logical_offset,
> + btrfs_update_ordered_extent(dip->inode, dip->logical_offset,
>dip->bytes, !bio->bi_status);
>  
>   kfree(dip);
> @@ -8432,7 +8428,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
> struct inode *inode,
>   bio = NULL;
>   } else {
>   if (write)
> - __endio_write_update_ordered(inode,
> + btrfs_update_ordered_extent(inode,
>   file_offset,
>   dio_bio->bi_iter.bi_size,
>   false);
> @@ -8572,7 +8568,7 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, 
> struct iov_iter *iter)
>*/
>   if (dio_data.unsubmitted_oe_range_start <
>   dio_data.unsubmitted_oe_range_end)
> - __endio_write_update_ordered(inode,
> + btrfs_update_ordered_extent(inode,
>   dio_data.unsubmitted_oe_range_start,
>   dio_data.unsubmitted_oe_range_end -
>   dio_data.unsubmitted_oe_range_start,
> 


Re: [PATCH 03/10] btrfs: dax: read zeros from holes

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/dax.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> index d614bf73bf8e..5a297674adec 100644
> --- a/fs/btrfs/dax.c
> +++ b/fs/btrfs/dax.c
> @@ -54,7 +54,12 @@ ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct 
> iov_iter *to)

nit: I think it's better if you rename the iterator variable to "iter".

>  
>  BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);
>  
> -ret = em_dax_rw(inode, em, pos, len, to);
> + if (em->block_start == EXTENT_MAP_HOLE) {
> + u64 zero_len = min(em->len - (em->start - pos), len);

Shouldn't this be em->len - (pos - em->start) since this gives the
remaining length of the extent? Isn't pos guaranteed to be >= em->start ?

> + ret = iov_iter_zero(zero_len, to);
> + } else {
> + ret = em_dax_rw(inode, em, pos, len, to);
> + }
>  if (ret < 0)
>  goto out;
>  pos += ret;
> 


Re: [PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
Can you explain why we can't use the dax_iomap_rw() interface like XFS or
EXT4?

[...]

> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> + u64 len, struct iov_iter *iter)
> +{
> +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +ssize_t map_len;
> +pgoff_t blk_pg;
> +void *kaddr;
> +sector_t blk_start;
> +unsigned offset = pos & (PAGE_SIZE - 1);

Nit: unsigned offset = offset_in_page(pos);




-- 
Johannes Thumshirn                        SUSE Labs Filesystems
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 02/10] btrfs: basic dax read

2018-12-05 Thread Nikolay Borisov



On 5.12.18 at 14:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/Makefile |  1 +
>  fs/btrfs/ctree.h  |  5 
>  fs/btrfs/dax.c| 68 
> +++
>  fs/btrfs/file.c   | 13 ++-
>  4 files changed, 86 insertions(+), 1 deletion(-)
>  create mode 100644 fs/btrfs/dax.c
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index ca693dd554e9..1fa77b875ae9 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -12,6 +12,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
> root-tree.o dir-item.o \
>  reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
>  uuid-tree.o props.o free-space-tree.o tree-checker.o
>  
> +btrfs-$(CONFIG_FS_DAX) += dax.o
>  btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
>  btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
>  btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 5cc470fa6a40..038d64ecebe5 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3685,6 +3685,11 @@ int btrfs_reada_wait(void *handle);
>  void btrfs_reada_detach(void *handle);
>  int btree_readahead_hook(struct extent_buffer *eb, int err);
>  
> +#ifdef CONFIG_FS_DAX
> +/* dax.c */
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to);
> +#endif /* CONFIG_FS_DAX */
> +
>  static inline int is_fstree(u64 rootid)
>  {
>   if (rootid == BTRFS_FS_TREE_OBJECTID ||
> diff --git a/fs/btrfs/dax.c b/fs/btrfs/dax.c
> new file mode 100644
> index ..d614bf73bf8e
> --- /dev/null
> +++ b/fs/btrfs/dax.c
> @@ -0,0 +1,68 @@
> +#include 
> +#include 
> +#include "ctree.h"
> +#include "btrfs_inode.h"
> +
> +static ssize_t em_dax_rw(struct inode *inode, struct extent_map *em, u64 pos,
> + u64 len, struct iov_iter *iter)
> +{
> +struct dax_device *dax_dev = fs_dax_get_by_bdev(em->bdev);
> +ssize_t map_len;
> +pgoff_t blk_pg;
> +void *kaddr;
> +sector_t blk_start;
> +unsigned offset = pos & (PAGE_SIZE - 1);

offset = offset_in_page(pos)

> +
> +len = min(len + offset, em->len - (pos - em->start));
> +len = ALIGN(len, PAGE_SIZE);

len = PAGE_ALIGN(len);

> +blk_start = (get_start_sect(em->bdev) << 9) + (em->block_start + 
> (pos - em->start));
> +blk_pg = blk_start - offset;
> +map_len = dax_direct_access(dax_dev, PHYS_PFN(blk_pg), 
> +PHYS_PFN(len), &kaddr, NULL);
> +map_len = PFN_PHYS(map_len);
> +kaddr += offset;
> +map_len -= offset;
> +if (map_len > len)
> +map_len = len;

map_len = min(map_len, len);

> +if (iov_iter_rw(iter) == WRITE)
> +return dax_copy_from_iter(dax_dev, blk_pg, kaddr, map_len, 
> iter);
> +else
> +return dax_copy_to_iter(dax_dev, blk_pg, kaddr, map_len, 
> iter);

Have you looked at the implementation of dax_iomap_actor where they have
pretty similar code. In case of either of those returning 0 they set ret
to EFAULT, should the same be done in btrfs_file_dax_read? IMO it will
be good if you can follow dax_iomap_actor's logic as much as possible
since this code has been used for quite some time and is deemed robust.

> +}
> +
> +ssize_t btrfs_file_dax_read(struct kiocb *iocb, struct iov_iter *to)
> +{
> +size_t ret = 0, done = 0, count = iov_iter_count(to);
> +struct extent_map *em;
> +u64 pos = iocb->ki_pos;
> +u64 end = pos + count;
> +struct inode *inode = file_inode(iocb->ki_filp);
> +
> +if (!count)
> +return 0;
> +
> +end = i_size_read(inode) < end ? i_size_read(inode) : end;

end = min(i_size_read(inode), end)

> +
> +while (pos < end) {
> +u64 len = end - pos;
> +
> +em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, pos, len, 0);
> +if (IS_ERR(em)) {
> +if (!ret)
> +ret = PTR_ERR(em);
> +goto out;
> +}
> +
> +BUG_ON(em->flags & EXTENT_FLAG_FS_MAPPING);

I think this can never trigger, because EXTENT_FLAG_FS_MAPPING is set
for extents that map chunks, and those are housed in the chunk tree at
fs_info->mapping_tree. Since this callback is only ever called for
file inodes, I'd say this BUG_ON can be eliminated. Did you manage to
trigger it during development?


> +
> +ret = em_dax_rw(inode, em, pos, len, to);
> +if (ret < 0)
> +goto out;
> +pos += ret;
> +done += ret;
> +}
> +
> +out:
> +iocb->ki_pos += done;
> +return done ? done : ret;
> +}
> +
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 58e93bce3036..ef6ed93f44d1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ 

Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Qu Wenruo


On 2018/12/5 下午8:28, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs. I understand there have been
> previous attempts at it. However, I wanted to make sure copy-on-write
> (COW) works on dax as well.
> 
> Before I present this to the FS folks I wanted to run this through the
> btrfs list. Much as I wish, I cannot get it correct the first time
> around :/. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because the btrfs kernel module does not get back control after
> an mmap() write. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

I'm not familiar with DAX, so it's completely possible I'm talking like
an idiot.

If btrfs_page_mkwrite() can't provide enough control, then I have a
crazy idea.

Force a page fault for every mmap() read/write (completely disabling the
page cache, like DIO).
So that we could get some control since we're informed to read the page
and do some hacks there.

Thanks,
Qu
> 
> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?
> 
> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.
> 
> 
> [PATCH 01/10] btrfs: create a mount option for dax
> [PATCH 02/10] btrfs: basic dax read
> [PATCH 03/10] btrfs: dax: read zeros from holes
> [PATCH 04/10] Rename __endio_write_update_ordered() to
> [PATCH 05/10] btrfs: Carve out btrfs_get_extent_map_write() out of
> [PATCH 06/10] btrfs: dax write support
> [PATCH 07/10] dax: export functions for use with btrfs
> [PATCH 08/10] btrfs: dax add read mmap path
> [PATCH 09/10] btrfs: dax support for cow_page/mmap_private and shared
> [PATCH 10/10] btrfs: dax mmap write
> 
>  fs/btrfs/Makefile   |1 
>  fs/btrfs/ctree.h|   17 ++
>  fs/btrfs/dax.c  |  303 ++--
>  fs/btrfs/file.c |   29 
>  fs/btrfs/inode.c|   54 +
>  fs/btrfs/ioctl.c|5 
>  fs/btrfs/super.c|   15 ++
>  fs/dax.c|   35 --
>  include/linux/dax.h |   16 ++
>  9 files changed, 430 insertions(+), 45 deletions(-)
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Nikolay Borisov



On 5.12.18 г. 14:28 ч., Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues 

Reviewed-by: Nikolay Borisov 

One question below though .

> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 -
>  fs/btrfs/super.c | 15 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)
>  
>  #define BTRFS_DEFAULT_COMMIT_INTERVAL(30)
>  #define BTRFS_DEFAULT_MAX_INLINE (2048)
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 802a628e9f7d..e9146c157816 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -149,8 +149,11 @@ void btrfs_sync_inode_flags_to_i_flags(struct inode 
> *inode)
>   if (binode->flags & BTRFS_INODE_DIRSYNC)
>   new_fl |= S_DIRSYNC;
>  
> + if ((btrfs_test_opt(btrfs_sb(inode->i_sb), DAX)) && S_ISREG(inode->i_mode))
> + new_fl |= S_DAX;
> +
>   set_mask_bits(&inode->i_flags,
> -   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
> +   S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC | S_DAX,
> new_fl);
>  }
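For readers unfamiliar with the helper, the effect of the set_mask_bits() call above can be sketched in userspace: every bit in the mask is cleared and then the new flags are set. This is an illustrative stand-in (the real kernel helper performs the update atomically with a cmpxchg loop, and the flag values here are made up):

```c
#include <assert.h>

/* Hedged sketch of set_mask_bits() semantics: clear all bits in @mask,
 * then set the bits in @bits. Flag values are illustrative only. */
#define S_SYNC    (1u << 0)
#define S_NOATIME (1u << 1)
#define S_DAX     (1u << 2)

static void set_mask_bits(unsigned int *flags,
                          unsigned int mask, unsigned int bits)
{
    *flags = (*flags & ~mask) | bits;
}
```

So a flags word holding S_SYNC | S_NOATIME, updated with mask S_SYNC | S_DAX and new bits S_DAX, ends up as S_NOATIME | S_DAX: bits inside the mask are fully replaced, bits outside it are preserved.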
>  
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 645fc81e2a94..035263b61cf5 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -326,6 +326,7 @@ enum {
>   Opt_treelog, Opt_notreelog,
>   Opt_usebackuproot,
>   Opt_user_subvol_rm_allowed,
> + Opt_dax,
>  
>   /* Deprecated options */
>   Opt_alloc_start,
> @@ -393,6 +394,7 @@ static const match_table_t tokens = {
>   {Opt_notreelog, "notreelog"},
>   {Opt_usebackuproot, "usebackuproot"},
>   {Opt_user_subvol_rm_allowed, "user_subvol_rm_allowed"},
> + {Opt_dax, "dax"},
>  
>   /* Deprecated options */
>   {Opt_alloc_start, "alloc_start=%s"},
> @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char 
> *options,
>   case Opt_user_subvol_rm_allowed:
>   btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
>   break;
> +#ifdef CONFIG_FS_DAX
> + case Opt_dax:
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> +"dax not supported for multi-device btrfs partition\n");

What prevents supporting dax for multiple devices so long as all devices
are dax?



> 


Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Johannes Thumshirn
On 05/12/2018 13:28, Goldwyn Rodrigues wrote:
> From: Goldwyn Rodrigues 
> 
> Also, set the inode->i_flags to S_DAX
> 
> Signed-off-by: Goldwyn Rodrigues 
> ---
>  fs/btrfs/ctree.h |  1 +
>  fs/btrfs/ioctl.c |  5 -
>  fs/btrfs/super.c | 15 +++
>  3 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 68f322f600a0..5cc470fa6a40 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1353,6 +1353,7 @@ static inline u32 BTRFS_MAX_XATTR_SIZE(const struct 
> btrfs_fs_info *info)
>  #define BTRFS_MOUNT_FREE_SPACE_TREE  (1 << 26)
>  #define BTRFS_MOUNT_NOLOGREPLAY  (1 << 27)
>  #define BTRFS_MOUNT_REF_VERIFY   (1 << 28)
> +#define BTRFS_MOUNT_DAX  (1 << 29)

Just as a heads up, this will collide with the patch called '[RFC PATCH
02/17] btrfs: add mount definition BTRFS_MOUNT_PRIORITY_USAGE' from Su Yue.

[...]

> +#ifdef CONFIG_FS_DAX
> + case Opt_dax:
> + if (btrfs_super_num_devices(info->super_copy) > 1) {
> + btrfs_info(info,
> +"dax not supported for multi-device btrfs partition\n");
> + ret = -EOPNOTSUPP;
> + goto out;
> + }
> + btrfs_set_opt(info->mount_opt, DAX);
> + break;
> +#endif

Can you please explain why we can't enable DAX on a multi device FS in
the changelog? It's (for me at least) not obvious.

Thanks,
Johannes
-- 
Johannes ThumshirnSUSE Labs Filesystems
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: Linux-next regression?

2018-12-05 Thread Andrea Gelmini
On Tue, Dec 04, 2018 at 10:29:49PM +, Chris Mason wrote:
> I think (hope) this is:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=201685
> 
> Which was just nailed down to a blkmq bug.  It triggers when you have 
> scsi devices using elevator=none over blkmq.

Thanks a lot Chris. Really.
Good news: I confirm I recompiled and used blkmq and no-op (at that time).
Also, the massive writes of btrfs defrag can explain the massive triggering
of the bug, and the subsequent corruption.

Thanks again,
Andrea


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 22:14 ч., Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful of
>>> drives to result in a filesystem that takes around a minute to mount.
>>> As a result of this, I suspect this will become an increasing problem
>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>> not a contributor so I have no room to do so -- just shedding some light
>>> on a problem that may deserve attention as filesystem sizes continue to
>>> grow.
>> Would it be possible to provide perf traces of the longer-running mount
>> time? Everyone seems to be fixated on reading block groups (which is
>> likely to be the culprit) but before pointing finger I'd like concrete
>> evidence pointed at the offender.
> 
> I am glad to collect such traces -- please advise with commands that 
> would achieve that.  If you just mean block traces, I can do that, but I 
> suspect you mean something more BTRFS-specific.

A command that would be good is :

perf record --all-kernel -g mount /dev/vdc /media/scratch/

of course replace device/mount path appropriately. This will result in a
perf.data file which contains stacktraces of the hottest paths executed
during invocation of mount. If you could send this file to the mailing
list or upload it somewhere for interested people (me and perhaps Qu) to
inspect, that would be appreciated.

If the file turns out to be way too big you can use

perf report --stdio to create a text output and you could send that as
well.

> 
> Best,
> 
> ellis
> 


Re: Ran into "invalid block group size" bug, unclear how to proceed.

2018-12-04 Thread Mike Javorski
Will do, thanks!

- mike
On Tue, Dec 4, 2018 at 9:24 PM Qu Wenruo  wrote:
>
>
>
> On 2018/12/5 上午6:33, Mike Javorski wrote:
> > On Tue, Dec 4, 2018 at 2:18 AM Qu Wenruo  wrote:
> >>
> >>
> >>
> >> On 2018/12/4 上午11:32, Mike Javorski wrote:
> >>> Need a bit of advice here ladies / gents. I am running into an issue
> >>> which Qu Wenruo seems to have posted a patch for several weeks ago
> >>> (see https://patchwork.kernel.org/patch/10694997/).
> >>>
> >>> Here is the relevant dmesg output which led me to Qu's patch.
> >>> 
> >>> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> >>> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> >>> invalid block group size, have 10804527104 expect (0, 10737418240]
> >>> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> >>> [   10.053365] BTRFS error (device sdb): open_ctree failed
> >>> 
> >>
> >> Exactly the same symptom.
> >>
> >>>
> >>> This server has a 16 disk btrfs filesystem (RAID6) which I boot
> >>> periodically to btrfs-send snapshots to. This machine is running
> >>> ArchLinux and I had just updated  to their latest 4.19.4 kernel
> >>> package (from 4.18.10 which was working fine). I've tried updating to
> >>> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> >>> the issue. From what I can see on kernel.org, the patch above is not
> >>> pushed to stable or to Linus' tree.
> >>>
> >>> At this point the question is what to do. Is my FS toast?
> >>
> >> If there is no other problem at all, your fs is just fine.
> >> It's my original patch too sensitive (the excuse for not checking chunk
> >> allocator carefully enough).
> >>
> >> But since you have the down time, it's never a bad idea to run a btrfs
> >> check --readonly to see if your fs is really OK.
> >>
> >
> > After running for 4 hours...
> >
> > UUID: 25b16375-b90b-408e-b592-fb07ed116d58
> > [1/7] checking root items
> > [2/7] checking extents
> > [3/7] checking free space cache
> > [4/7] checking fs roots
> > [5/7] checking only csums items (without verifying data)
> > [6/7] checking root refs
> > [7/7] checking quota groups
> > found 24939616169984 bytes used, no error found
> > total csum bytes: 24321980768
> > total tree bytes: 41129721856
> > total fs tree bytes: 9854648320
> > total extent tree bytes: 737804288
> > btree space waste bytes: 7483785005
> > file data blocks allocated: 212883520618496
> >  referenced 212876546314240
> >
> > So things appear good to go. I will keep an eye out for the patch to
> > land before upgrading the kernel again.
> >
> >>> Could I
> >>> revert to the 4.18.10 kernel and boot safely?
> >>
> >> If your btrfs check --readonly doesn't report any problem, then you're
> >> completely fine to do so.
> >> Although I still recommend to go RAID10 other than RAID5/6.
> >
> > I understand the risk, but don't have the funds to buy sufficient
> > disks to operate in RAID10.
>
> Then my advice would be, for any powerloss, please run a full-disk scrub
> (and of course ensure there is not another powerloss during scrubbing).
>
> I know this sounds silly and slow, but at least it should workaround the
> write hole problem.
>
> Thanks,
> Qu
>
> > The data is mostly large files and
> > activity is predominantly reads, so risk is currently acceptable given
> > the backup server. All super critical data is backed up to (very slow)
> > cloud storage.
> >
> >>
> >> Thanks,
> >> Qu
> >>
> >>> I don't know if the 4.19
> >>> boot process may have flipped some bits which would make reverting
> >>> problematic.
> >>>
> >>> Thanks much,
> >>>
> >>> - mike
> >>>
> >>
>
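The check discussed in this thread can be sketched as follows. This is an illustrative userspace stand-in, not the kernel's tree-checker code; the 10737418240 (10 GiB) bound is the limit quoted in the dmesg output above, which the referenced patch relaxes:

```c
#include <assert.h>

/* Illustrative stand-in for the tree-checker test behind the
 * "invalid block group size" message: a block group length must lie
 * in (0, 10 GiB]. The bound matches the value printed in dmesg. */
#define MAX_BG_LEN 10737418240ULL

static int bg_len_valid(unsigned long long len)
{
    return len > 0 && len <= MAX_BG_LEN;
}
```

The reported length of 10804527104 bytes exceeds the 10 GiB bound by about 64 MiB, which is why mount was rejected even though the filesystem itself was fine.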


Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Qu Wenruo


On 2018/12/4 下午8:17, David Sterba wrote:
> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
>> The only location is the following code:
>>
>>  int level = path->lowest_level + 1;
>>  BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>  while(level < BTRFS_MAX_LEVEL) {
>>  slot = path->slots[level] + 1;
>>  ...
>>  }
>>  path->slots[level] = slot;
>>
>> Again, it's the stupid compiler needs some hint for the fact that
>> we will always enter the while loop for at least once, thus @slot should
>> always be initialized.
> 
> Harsh words for the compiler, and I say not deserved. The same code
> pasted to kernel a built with the same version does not report the
> warning, so it's apparently a missing annotation of BUG_ON in
> btrfs-progs that does not give the right hint.
> 
Well, in fact after the recent gcc8 updates (god knows how many versions
gcc8 has gone through in Arch since the patchset), it doesn't report this
warning anymore.

But your idea on the BUG_ON() lacking noreturn attribute makes sense.

I'll just add some hint for kerncompat.h.

Thanks,
Qu
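The noreturn hint under discussion can be demonstrated with a small userspace analogue. The BUG_ON and the loop below are simplified stand-ins for the btrfs-progs code quoted earlier in the thread, not the real implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Marking the failure path noreturn lets the compiler prove the while
 * loop runs at least once, so @slot cannot be read uninitialized. */
static void __attribute__((noreturn)) bug(void)
{
    abort();
}
#define BUG_ON(cond) do { if (cond) bug(); } while (0)

#define MAX_LEVEL 8

static int last_slot(int lowest_level, const int slots[MAX_LEVEL])
{
    int level = lowest_level + 1;
    int slot;

    BUG_ON(level >= MAX_LEVEL); /* guarantees the loop body executes */
    while (level < MAX_LEVEL) {
        slot = slots[level] + 1;
        level++;
    }
    return slot; /* without the noreturn hint, gcc may warn here */
}
```

If bug() lacked the noreturn attribute, the compiler could not rule out the path where the loop is skipped entirely, which is exactly the -Wmaybe-uninitialized report the patch addresses.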





Re: Ran into "invalid block group size" bug, unclear how to proceed.

2018-12-04 Thread Qu Wenruo


On 2018/12/5 上午6:33, Mike Javorski wrote:
> On Tue, Dec 4, 2018 at 2:18 AM Qu Wenruo  wrote:
>>
>>
>>
>> On 2018/12/4 上午11:32, Mike Javorski wrote:
>>> Need a bit of advice here ladies / gents. I am running into an issue
>>> which Qu Wenruo seems to have posted a patch for several weeks ago
>>> (see https://patchwork.kernel.org/patch/10694997/).
>>>
>>> Here is the relevant dmesg output which led me to Qu's patch.
>>> 
>>> [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
>>> block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
>>> invalid block group size, have 10804527104 expect (0, 10737418240]
>>> [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
>>> [   10.053365] BTRFS error (device sdb): open_ctree failed
>>> 
>>
>> Exactly the same symptom.
>>
>>>
>>> This server has a 16 disk btrfs filesystem (RAID6) which I boot
>>> periodically to btrfs-send snapshots to. This machine is running
>>> ArchLinux and I had just updated  to their latest 4.19.4 kernel
>>> package (from 4.18.10 which was working fine). I've tried updating to
>>> the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
>>> the issue. From what I can see on kernel.org, the patch above is not
>>> pushed to stable or to Linus' tree.
>>>
>>> At this point the question is what to do. Is my FS toast?
>>
>> If there is no other problem at all, your fs is just fine.
>> It's my original patch too sensitive (the excuse for not checking chunk
>> allocator carefully enough).
>>
>> But since you have the down time, it's never a bad idea to run a btrfs
>> check --readonly to see if your fs is really OK.
>>
> 
> After running for 4 hours...
> 
> UUID: 25b16375-b90b-408e-b592-fb07ed116d58
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups
> found 24939616169984 bytes used, no error found
> total csum bytes: 24321980768
> total tree bytes: 41129721856
> total fs tree bytes: 9854648320
> total extent tree bytes: 737804288
> btree space waste bytes: 7483785005
> file data blocks allocated: 212883520618496
>  referenced 212876546314240
> 
> So things appear good to go. I will keep an eye out for the patch to
> land before upgrading the kernel again.
> 
>>> Could I
>>> revert to the 4.18.10 kernel and boot safely?
>>
>> If your btrfs check --readonly doesn't report any problem, then you're
>> completely fine to do so.
>> Although I still recommend to go RAID10 other than RAID5/6.
> 
> I understand the risk, but don't have the funds to buy sufficient
> disks to operate in RAID10.

Then my advice would be, for any powerloss, please run a full-disk scrub
(and of course ensure there is not another powerloss during scrubbing).

I know this sounds silly and slow, but at least it should workaround the
write hole problem.

Thanks,
Qu

> The data is mostly large files and
> activity is predominantly reads, so risk is currently acceptable given
> the backup server. All super critical data is backed up to (very slow)
> cloud storage.
> 
>>
>> Thanks,
>> Qu
>>
>>> I don't know if the 4.19
>>> boot process may have flipped some bits which would make reverting
>>> problematic.
>>>
>>> Thanks much,
>>>
>>> - mike
>>>
>>





Re: [PATCH 2/2] btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex

2018-12-04 Thread David Sterba
On Tue, Dec 04, 2018 at 05:22:19PM +0200, Nikolay Borisov wrote:
> > @@ -3874,16 +3882,9 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, 
> > u64 devid, u64 start,
> > if (ret) {
> > mutex_unlock(_info->scrub_lock);
> > mutex_unlock(_info->fs_devices->device_list_mutex);
> > -   return ret;
> > +   goto out_free_ctx;
> 
> Don't we suffer the same issue when calling scrub_workers_get since in
> it we do btrfs_alloc_workqueue which also calls kzalloc with GFP_KERNEL?

Yes, that's right. I instrumented only the allocations in scrub.c to see
if the nofs and lock_not_held assertions work at all so this one did not
get caught directly.

As scrub_workers_get still needs the scrub_lock, fixing it by moving
does not work and would need more restructuring.
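The shape of the fix being discussed, performing any blocking allocation before taking the lock so that reclaim can never wait on a lock already held, can be sketched in userspace. Names here are illustrative stand-ins, not the btrfs functions:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hedged sketch: the allocation (which in the kernel may block and
 * enter reclaim) is done lock-free; the lock guards only the short
 * lookup/check section. lock()/unlock() stand in for mutex_lock() on
 * fs_devices->device_list_mutex. */
static int locked;
static void lock(void)   { locked = 1; }
static void unlock(void) { locked = 0; }

static int scrub_dev_sketch(void **ctx_out)
{
    void *sctx = malloc(128); /* "GFP_KERNEL" work, done before locking */

    if (!sctx)
        return -ENOMEM;

    lock();
    /* short critical section: device lookup and state checks only */
    unlock();

    *ctx_out = sctx;
    return 0;
}
```

The point of the ordering is that the error paths after the lock is taken must now free the context (the goto out_free_ctx labels in the patch), but no allocation ever happens with the mutex held.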


Re: Ran into "invalid block group size" bug, unclear how to proceed.

2018-12-04 Thread Mike Javorski
On Tue, Dec 4, 2018 at 2:18 AM Qu Wenruo  wrote:
>
>
>
> On 2018/12/4 上午11:32, Mike Javorski wrote:
> > Need a bit of advice here ladies / gents. I am running into an issue
> > which Qu Wenruo seems to have posted a patch for several weeks ago
> > (see https://patchwork.kernel.org/patch/10694997/).
> >
> > Here is the relevant dmesg output which led me to Qu's patch.
> > 
> > [   10.032475] BTRFS critical (device sdb): corrupt leaf: root=2
> > block=24655027060736 slot=20 bg_start=13188988928 bg_len=10804527104,
> > invalid block group size, have 10804527104 expect (0, 10737418240]
> > [   10.032493] BTRFS error (device sdb): failed to read block groups: -5
> > [   10.053365] BTRFS error (device sdb): open_ctree failed
> > 
>
> Exactly the same symptom.
>
> >
> > This server has a 16 disk btrfs filesystem (RAID6) which I boot
> > periodically to btrfs-send snapshots to. This machine is running
> > ArchLinux and I had just updated  to their latest 4.19.4 kernel
> > package (from 4.18.10 which was working fine). I've tried updating to
> > the 4.19.6 kernel that is in testing, but that doesn't seem to resolve
> > the issue. From what I can see on kernel.org, the patch above is not
> > pushed to stable or to Linus' tree.
> >
> > At this point the question is what to do. Is my FS toast?
>
> If there is no other problem at all, your fs is just fine.
> It's my original patch too sensitive (the excuse for not checking chunk
> allocator carefully enough).
>
> But since you have the down time, it's never a bad idea to run a btrfs
> check --readonly to see if your fs is really OK.
>

After running for 4 hours...

UUID: 25b16375-b90b-408e-b592-fb07ed116d58
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups
found 24939616169984 bytes used, no error found
total csum bytes: 24321980768
total tree bytes: 41129721856
total fs tree bytes: 9854648320
total extent tree bytes: 737804288
btree space waste bytes: 7483785005
file data blocks allocated: 212883520618496
 referenced 212876546314240

So things appear good to go. I will keep an eye out for the patch to
land before upgrading the kernel again.

> > Could I
> > revert to the 4.18.10 kernel and boot safely?
>
> If your btrfs check --readonly doesn't report any problem, then you're
> completely fine to do so.
> Although I still recommend to go RAID10 other than RAID5/6.

I understand the risk, but don't have the funds to buy sufficient
disks to operate in RAID10. The data is mostly large files and
activity is predominantly reads, so risk is currently acceptable given
the backup server. All super critical data is backed up to (very slow)
cloud storage.

>
> Thanks,
> Qu
>
> > I don't know if the 4.19
> > boot process may have flipped some bits which would make reverting
> > problematic.
> >
> > Thanks much,
> >
> > - mike
> >
>


Re: Linux-next regression?

2018-12-04 Thread Chris Mason
On 28 Nov 2018, at 11:05, Andrea Gelmini wrote:

> On Tue, Nov 27, 2018 at 10:16:52PM +0800, Qu Wenruo wrote:
>>
>> But it's less a concerning problem since it doesn't reach latest RC, 
>> so
>> if you could reproduce it stably, I'd recommend to do a bisect.
>
> No problem to bisect, usually.
> But right now it's not possible for me, I explain further.
> Anyway, here the rest of the story.
>
> So, in the end I:
> a) booted with 4.20.0-rc4
> b) updated backup
> c) did the btrfs check --read-only
> d) seven steps, everything is perfect
> e) no complains on screen or in logs (never had)
> f) so, started to compile linux-next 20181128 (on another partition)
> e) without using (reading or writing) on /home, I started
> f) btrfs filesystem defrag -v -r -t 128M /home
> g) it worked without complain (in screen or logs)
> h) then, reboot with kernel tag 20181128
> i) and no way to mount:

I think (hope) this is:

https://bugzilla.kernel.org/show_bug.cgi?id=201685

Which was just nailed down to a blkmq bug.  It triggers when you have 
scsi devices using elevator=none over blkmq.

-chris


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Wilson, Ellis
On 12/4/18 8:07 AM, Nikolay Borisov wrote:
> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>> With 14TB drives available today, it doesn't take more than a handful of
>> drives to result in a filesystem that takes around a minute to mount.
>> As a result of this, I suspect this will become an increasing problem
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>> not a contributor so I have no room to do so -- just shedding some light
>> on a problem that may deserve attention as filesystem sizes continue to
>> grow.
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing finger I'd like concrete
> evidence pointed at the offender.

I am glad to collect such traces -- please advise with commands that 
would achieve that.  If you just mean block traces, I can do that, but I 
suspect you mean something more BTRFS-specific.

Best,

ellis



Re: Need help with potential ~45TB dataloss

2018-12-04 Thread Chris Murphy
On Tue, Dec 4, 2018 at 3:09 AM Patrick Dijkgraaf
 wrote:
>
> Hi Chris,
>
> See the output below. Any suggestions based on it?

If they're SATA drives, they may not support SCT ERC; and if they're
SAS, depending on what controller they're behind, smartctl might need
a hint to properly ask the drive for SCT ERC status. Simplest way to
know is do 'smartctl -x' on one drive, assuming they're all the same
basic make/model other than size.


-- 
Chris Murphy


Re: [PATCH 3/3] btrfs: replace cleaner_delayed_iput_mutex with a waitqueue

2018-12-04 Thread Josef Bacik
On Tue, Dec 04, 2018 at 01:46:58PM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The throttle path doesn't take cleaner_delayed_iput_mutex, which means
> > we could think we're done flushing iputs in the data space reservation
> > path when we could have a throttler doing an iput.  There's no real
> > reason to serialize the delayed iput flushing, so instead of taking the
> > cleaner_delayed_iput_mutex whenever we flush the delayed iputs just
> > replace it with an atomic counter and a waitqueue.  This removes the
> > short (or long depending on how big the inode is) window where we think
> > there are no more pending iputs when there really are some.
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ctree.h   |  4 +++-
> >  fs/btrfs/disk-io.c |  5 ++---
> >  fs/btrfs/extent-tree.c | 13 -
> >  fs/btrfs/inode.c   | 21 +
> >  4 files changed, 34 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> > index dc56a4d940c3..20af5d6d81f1 100644
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -915,7 +915,8 @@ struct btrfs_fs_info {
> >  
> > spinlock_t delayed_iput_lock;
> > struct list_head delayed_iputs;
> > -   struct mutex cleaner_delayed_iput_mutex;
> > +   atomic_t nr_delayed_iputs;
> > +   wait_queue_head_t delayed_iputs_wait;
> >  
> 
> Have you considered whether the same could be achieved with a completion
> rather than an open-coded waitqueue? I tried prototyping it and it could
> be done, but it becomes messy regarding when the completion should be
> initialised, i.e. only when it's not in btrfs_add_delayed_iput.
> 

Yeah a waitqueue makes more sense here than a completion since it's not a
one-off.

> 
> 
> 
> > @@ -4958,9 +4962,8 @@ static void flush_space(struct btrfs_fs_info *fs_info,
> >  * bunch of pinned space, so make sure we run the iputs before
> >  * we do our pinned bytes check below.
> >  */
> > -   mutex_lock(_info->cleaner_delayed_iput_mutex);
> > btrfs_run_delayed_iputs(fs_info);
> > -   mutex_unlock(_info->cleaner_delayed_iput_mutex);
> > +   btrfs_wait_on_delayed_iputs(fs_info);
> 
> Waiting on delayed iputs here is pointless since they are run
> synchronously form this context.
> 

Unless there are other threads (the cleaner thread) running iputs as well.  We
could be running an iput that is evicting the inode in another thread and we
really want that space, so we need to wait here to make sure everybody is truly
done.  Thanks,

Josef
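The counter-plus-waitqueue pattern the patch describes can be sketched in userspace C11. This is a hedged stand-in: the "waitqueue" is a simple poll loop, whereas the kernel would use wait_event()/wake_up(); names are illustrative:

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of replacing a flush-serializing mutex with an atomic pending
 * counter plus a wait-for-zero primitive. */
static atomic_int nr_delayed_iputs;

static void add_delayed_iput(void)
{
    atomic_fetch_add(&nr_delayed_iputs, 1);
}

static void run_one_delayed_iput(void)
{
    /* in the kernel, the last finisher would wake_up(&delayed_iputs_wait) */
    atomic_fetch_sub(&nr_delayed_iputs, 1);
}

static void wait_on_delayed_iputs(void)
{
    while (atomic_load(&nr_delayed_iputs) > 0)
        ; /* kernel: wait_event(delayed_iputs_wait, counter == 0) */
}
```

The point Josef makes above falls out of this shape: any number of threads may be decrementing the counter concurrently, so a flusher waits for the count to drain to zero rather than assuming that holding a mutex while running iputs itself means nobody else is mid-iput.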


Re: [PATCH 2/3] btrfs: wakeup cleaner thread when adding delayed iput

2018-12-04 Thread Josef Bacik
On Tue, Dec 04, 2018 at 11:21:14AM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 18:06 ч., Josef Bacik wrote:
> > The cleaner thread usually takes care of delayed iputs, with the
> > exception of the btrfs_end_transaction_throttle path.  The cleaner
> > thread only gets woken up every 30 seconds, so instead wake it up to do
> > its work so that we can free up that space as quickly as possible.
> 
> This description misses any rationale whatsoever about why the cleaner
> needs to be woken up more frequently than 30 seconds (and IMO this is
> the most important question that needs answering).
> 

Yeah I'll add that.

> Also have you done any measurements of the number of processed delayed
> inodes with this change. Given the behavior you so desire why not just
> make delayed iputs to be run via schedule_work on the global workqueue
> and be done with it? I'm sure the latency will be better than the
> current 30 seconds one :)

We already have the cleaner thread to do this work, and it sets up for the
snapshot drop stuff to be run as well.  We could probably add another delayed
work thing, but I would rather do that in a different patch.  Thanks,

Josef


Re: [PATCHv3] btrfs: Fix error handling in btrfs_cleanup_ordered_extents

2018-12-04 Thread Nikolay Borisov



On 21.11.18 г. 17:10 ч., Nikolay Borisov wrote:
> Running btrfs/124 in a loop hung up on me sporadically with the
> following call trace:
>   btrfs   D0  5760   5324 0x
>   Call Trace:
>? __schedule+0x243/0x800
>schedule+0x33/0x90
>btrfs_start_ordered_extent+0x10c/0x1b0 [btrfs]
>? wait_woken+0xa0/0xa0
>btrfs_wait_ordered_range+0xbb/0x100 [btrfs]
>btrfs_relocate_block_group+0x1ff/0x230 [btrfs]
>btrfs_relocate_chunk+0x49/0x100 [btrfs]
>btrfs_balance+0xbeb/0x1740 [btrfs]
>btrfs_ioctl_balance+0x2ee/0x380 [btrfs]
>btrfs_ioctl+0x1691/0x3110 [btrfs]
>? lockdep_hardirqs_on+0xed/0x180
>? __handle_mm_fault+0x8e7/0xfb0
>? _raw_spin_unlock+0x24/0x30
>? __handle_mm_fault+0x8e7/0xfb0
>? do_vfs_ioctl+0xa5/0x6e0
>? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
>do_vfs_ioctl+0xa5/0x6e0
>? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>ksys_ioctl+0x3a/0x70
>__x64_sys_ioctl+0x16/0x20
>do_syscall_64+0x60/0x1b0
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This happens because during page writeback it's valid for
> writepage_delalloc to instantiate a delalloc range which doesn't
> belong to the page currently being written back.
> 
> The reason this case is valid is due to find_lock_delalloc_range
> returning any available range after the passed delalloc_start and
> ignoring whether the page under writeback is within that range.
> In turn ordered extents (OE) are always created for the returned range
> from find_lock_delalloc_range. If, however, a failure occurs while OE
> are being created then the clean up code in btrfs_cleanup_ordered_extents
> will be called.
> 
> Unfortunately the code in btrfs_cleanup_ordered_extents doesn't consider
> the case of such a 'foreign' range being processed and instead it always
> assumes that the range OE are created for belongs to the page. This
> leads to the first page of such a foreign range not being cleaned up, since
> it's deliberately skipped by the current cleaning up code.
> 
> Fix this by correctly checking whether the current page belongs to the
> range being instantiated and if so adjust the range parameters
> passed for cleaning up. If it doesn't, then just clean the whole OE
> range directly.
> 
> Signed-off-by: Nikolay Borisov 
> Reviewed-by: Josef Bacik 
> ---
> V3:
>  * Re-worded the commit for easier comprehension
>  * Added RB tag from Josef
> 
> V2:
>  * Fix compilation failure due to missing parentheses
>  * Fixed the "Fixes" tag.
>  fs/btrfs/inode.c | 29 -
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 

Ping,

Also this patch needs:

Fixes: 524272607e88 ("btrfs: Handle delalloc error correctly to avoid
ordered extent hang") and it needs to be applied to the stable releases 4.14





Re: [PATCH 2/2] btrfs: scrub: move scrub_setup_ctx allocation out of device_list_mutex

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 17:11 ч., David Sterba wrote:
> The scrub context is allocated with GFP_KERNEL and called from
> btrfs_scrub_dev under the fs_info::device_list_mutex. This is not safe
> regarding reclaim that could try to flush filesystem data in order to
> get the memory. And the device_list_mutex is held during superblock
> commit, so this would cause a lockup.
> 
> Move the allocation and initialization before any changes that require
> the mutex.
> 
> Signed-off-by: David Sterba 
> ---
>  fs/btrfs/scrub.c | 30 ++
>  1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index ffcab263e057..051d14c9f013 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3834,13 +3834,18 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, 
> u64 devid, u64 start,
>   return -EINVAL;
>   }
>  
> + /* Allocate outside of device_list_mutex */
> + sctx = scrub_setup_ctx(fs_info, is_dev_replace);
> + if (IS_ERR(sctx))
> + return PTR_ERR(sctx);
>  
>   mutex_lock(&fs_info->fs_devices->device_list_mutex);
>   dev = btrfs_find_device(fs_info, devid, NULL, NULL);
>   if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, &dev->dev_state) &&
>!is_dev_replace)) {
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return -ENODEV;
> + ret = -ENODEV;
> + goto out_free_ctx;
>   }
>  
>   if (!is_dev_replace && !readonly &&
> @@ -3848,7 +3853,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
>   btrfs_err_in_rcu(fs_info, "scrub: device %s is not writable",
>   rcu_str_deref(dev->name));
> - return -EROFS;
> + ret = -EROFS;
> + goto out_free_ctx;
>   }
>  
>   mutex_lock(&fs_info->scrub_lock);
> @@ -3856,7 +3862,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   test_bit(BTRFS_DEV_STATE_REPLACE_TGT, &dev->dev_state)) {
>   mutex_unlock(&fs_info->scrub_lock);
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return -EIO;
> + ret = -EIO;
> + goto out_free_ctx;
>   }
>  
>   btrfs_dev_replace_read_lock(&fs_info->dev_replace);
> @@ -3866,7 +3873,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
>   mutex_unlock(&fs_info->scrub_lock);
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return -EINPROGRESS;
> + ret = -EINPROGRESS;
> + goto out_free_ctx;
>   }
>   btrfs_dev_replace_read_unlock(&fs_info->dev_replace);
>  
> @@ -3874,16 +3882,9 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   if (ret) {
>   mutex_unlock(&fs_info->scrub_lock);
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - return ret;
> + goto out_free_ctx;

Don't we suffer the same issue when calling scrub_workers_get since in
it we do btrfs_alloc_workqueue which also calls kzalloc with GFP_KERNEL?


>   }
>  
> - sctx = scrub_setup_ctx(fs_info, is_dev_replace);
> - if (IS_ERR(sctx)) {
> - mutex_unlock(&fs_info->scrub_lock);
> - mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> - scrub_workers_put(fs_info);
> - return PTR_ERR(sctx);
> - }
>   sctx->readonly = readonly;
>   dev->scrub_ctx = sctx;
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> @@ -3936,6 +3937,11 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>  
>   scrub_put_ctx(sctx);
>  
> + return ret;
> +
> +out_free_ctx:
> + scrub_free_ctx(sctx);
> +
>   return ret;
>  }
>  
> 


Re: [PATCH 1/2] btrfs: scrub: pass fs_info to scrub_setup_ctx

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 17:11 ч., David Sterba wrote:
> We can pass fs_info directly as this is the only member of btrfs_device
> that's being used inside scrub_setup_ctx.
> 
> Signed-off-by: David Sterba 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/scrub.c | 9 -
>  1 file changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index bbd1b36f4918..ffcab263e057 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -578,12 +578,11 @@ static void scrub_put_ctx(struct scrub_ctx *sctx)
>   scrub_free_ctx(sctx);
>  }
>  
> -static noinline_for_stack
> -struct scrub_ctx *scrub_setup_ctx(struct btrfs_device *dev, int 
> is_dev_replace)
> +static noinline_for_stack struct scrub_ctx *scrub_setup_ctx(
> + struct btrfs_fs_info *fs_info, int is_dev_replace)
>  {
>   struct scrub_ctx *sctx;
>   int i;
> - struct btrfs_fs_info *fs_info = dev->fs_info;
>  
>   sctx = kzalloc(sizeof(*sctx), GFP_KERNEL);
>   if (!sctx)
> @@ -592,7 +591,7 @@ struct scrub_ctx *scrub_setup_ctx(struct btrfs_device 
> *dev, int is_dev_replace)
>   sctx->is_dev_replace = is_dev_replace;
>   sctx->pages_per_rd_bio = SCRUB_PAGES_PER_RD_BIO;
>   sctx->curr = -1;
> - sctx->fs_info = dev->fs_info;
> + sctx->fs_info = fs_info;
>   for (i = 0; i < SCRUB_BIOS_PER_SCTX; ++i) {
>   struct scrub_bio *sbio;
>  
> @@ -3878,7 +3877,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 
> devid, u64 start,
>   return ret;
>   }
>  
> - sctx = scrub_setup_ctx(dev, is_dev_replace);
> + sctx = scrub_setup_ctx(fs_info, is_dev_replace);
>   if (IS_ERR(sctx)) {
>   mutex_unlock(&fs_info->scrub_lock);
>   mutex_unlock(&fs_info->fs_devices->device_list_mutex);
> 


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Lionel Bouton
Le 04/12/2018 à 03:52, Chris Murphy a écrit :
> On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
>  wrote:
>> Le 03/12/2018 à 20:56, Lionel Bouton a écrit :
>>> [...]
>>> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
>>> tuning of the io queue (switching between classic io-schedulers and
>>> blk-mq ones in the virtual machines) and BTRFS mount options
>>> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
>>> in mount time (I managed to reduce the mount of IO requests
>> Sent too quickly: I meant to write "managed to reduce by half the number
>> of IO write requests for the same amount of data written"
>>
>>>  by half on
>>> one server in production though although more tests are needed to
>>> isolate the cause).
> Interesting. I wonder if it's ssd_spread or space_cache=v2 that
> reduces the writes by half, or by how much for each? That's a major
> reduction in writes, and suggests it might be possible for further
> optimization, to help mitigate the wandering trees impact.

Note, the other major changes were :
- 4.9 upgrade to 4.14,
- using multi-queue aware bfq instead of noop.

If BTRFS IO patterns in our case allow bfq to merge io-requests, this
could be another explanation.

Lionel



Re: [PATCH v4] Btrfs: fix deadlock with memory reclaim during scrub

2018-12-04 Thread David Sterba
On Wed, Nov 28, 2018 at 02:40:07PM +, Filipe Manana wrote:
> > > Well, the worker tasks can also not use gfp_kernel, since the scrub
> > > task waits for them to complete before pausing.
> > > I missed this, and 2 reviewers as well, so perhaps it wasn't that
> > > trivial and I shouldn't feel that I miserably failed to identify all
> > > cases for something rather trivial. V5 sent.
> >
> > You can say that you left it there intentionally, such cookies are a
> > good drill for reviewers to stay sharp.
> 
> Unfortunately for me, it was not on purpose.
> 
> However there's the small drill of, for the workers only, potentially
> moving the memalloc_nofs_save() and
> restore to scrub_handle_errored_block(), since the only two possible
> gfp_kernel allocations for workers
> are during the case where a repair is needed:
> 
> scrub_bio_end_io_worker()
>   scrub_block_complete()
> scrub_handle_errored_block()
>   lock_full_stripe()
> insert_full_stripe_lock()
>   -> kmalloc with GFP_KERNEL
> 
> 
> scrub_bio_end_io_worker()
>scrub_block_complete()
>  scrub_handle_errored_block()
>scrub_write_page_to_dev_replace()
>  scrub_add_page_to_wr_bio()
>-> kzalloc with GFP_KERNEL
> 
> Just to avoid some duplication.

Sounds like a nice cleanup and more in line with the idea of scoped
GFP_NOFS, the markers should be at a higher level. Starting at the
bottom of the callstack is fine as it's obvious where we want the nofs
protection, and then push it up the call chain. That way it's easier to
review. Please send a followup patch, thanks.


Re: [PATCH v2 1/3] btrfs: remove always true if branch in find_delalloc_range

2018-12-04 Thread David Sterba
On Thu, Nov 29, 2018 at 11:33:38AM +0800, Lu Fengqi wrote:
> The @found is always false when it comes to the if branch. Besides, the
> bool type is more suitable for @found. Change the return value of the
> function and its caller to bool as well.
> 
> Signed-off-by: Lu Fengqi 

Added to misc-next, thanks.


Re: [PATCH] btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow

2018-12-04 Thread David Sterba
On Thu, Nov 29, 2018 at 05:31:32PM +0800, Lu Fengqi wrote:
> The btrfs/001 with inode_cache mount option will encounter the following
> warning:
> 
> WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 
> cow_file_range.isra.19+0x32b/0x430 [btrfs]
> CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: GW  O  
> 4.20.0-rc4-custom+ #30
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
> Call Trace:
>  ? free_extent_buffer+0x46/0x90 [btrfs]
>  run_delalloc_nocow+0x455/0x900 [btrfs]
>  btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
>  writepage_delalloc+0xf9/0x150 [btrfs]
>  __extent_writepage+0x125/0x3e0 [btrfs]
>  extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
>  ? __wake_up_common_lock+0x63/0xc0
>  extent_writepages+0x50/0x80 [btrfs]
>  do_writepages+0x41/0xd0
>  ? __filemap_fdatawrite_range+0x9e/0xf0
>  __filemap_fdatawrite_range+0xbe/0xf0
>  btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
>  __btrfs_write_out_cache+0x42c/0x480 [btrfs]
>  btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
>  btrfs_save_ino_cache+0x551/0x660 [btrfs]
>  commit_fs_roots+0xc5/0x190 [btrfs]
>  btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
>  btrfs_mksubvol+0x48d/0x4d0 [btrfs]
>  btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
>  btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
>  btrfs_ioctl+0x123f/0x3030 [btrfs]
> 
> The file extent generation of the free space inode is equal to the last
> snapshot of the file root, so the inode will be passed to cow_file_range.
> But the inode was created and its extents were preallocated in
> btrfs_save_ino_cache, there are no cow copies on disk.
> 
> The preallocated extents aren't present on disk, and the
> btrfs_cross_ref_exist will ignore the -ENOENT returned by
> check_committed_ref, so we can directly write the inode to the disk.
> 
> Fixes: 78d4295b1eee ("btrfs: lift some btrfs_cross_ref_exist checks in nocow 
> path")
> Signed-off-by: Lu Fengqi 

With updated changelog and reviewed-by added to misc-next, thanks.


Re: [PATCH v2] Btrfs: fix fsync of files with multiple hard links in new directories

2018-12-04 Thread David Sterba
On Wed, Nov 28, 2018 at 02:54:28PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> The log tree has a long standing problem that when a file is fsync'ed we
> only check for new ancestors, created in the current transaction, by
> following only the hard link for which the fsync was issued. We follow the
> ancestors using the VFS' dget_parent() API. This means that if we create a
> new link for a file in a directory that is new (or in an any other new
> ancestor directory) and then fsync the file using an old hard link, we end
> up not logging the new ancestor, and on log replay that new hard link and
> ancestor do not exist. In some cases, involving renames, the file will not
> exist at all.
> 
> Example:
> 
>   mkfs.btrfs -f /dev/sdb
>   mount /dev/sdb /mnt
> 
>   mkdir /mnt/A
>   touch /mnt/foo
>   ln /mnt/foo /mnt/A/bar
>   xfs_io -c fsync /mnt/foo
> 
>   
> 
> In this example after log replay only the hard link named 'foo' exists
> and directory A does not exist, which is unexpected. In other major linux
> filesystems, such as ext4, xfs and f2fs for example, both hard links exist
> and so does directory A after mounting again the filesystem.
> 
> Checking if any new ancestors are new and need to be logged was added in
> 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"),
> however only for the ancestors of the hard link (dentry) for which the
> fsync was issued, instead of checking for all ancestors for all of the
> inode's hard links.
> 
> So fix this by tracking the id of the last transaction where a hard link
> was created for an inode and then on fsync fallback to a full transaction
> commit when an inode has more than one hard link and at least one new hard
> link was created in the current transaction. This is the simplest solution
> since this is not a common use case (adding frequently hard links for
> which there's an ancestor created in the current transaction and then
> fsync the file). In case it ever becomes a common use case, a solution
> that consists of iterating the fs/subvol btree for each hard link and
> check if any ancestor is new, could be implemented.
> 
> This solves many unexpected scenarios reported by Jayashree Mohan and
> Vijay Chidambaram, and for which there is a new test case for fstests
> under review.
> 
> Reported-by: Vijay Chidambaram 
> Reported-by: Jayashree Mohan 
> Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes")
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.


Re: [PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable

2018-12-04 Thread Qu Wenruo


On 2018/12/4 下午9:52, David Sterba wrote:
> On Tue, Dec 04, 2018 at 06:15:13PM +0800, Qu Wenruo wrote:
>> Gentle ping.
>>
>> Please put this patch into current release as the new block group size
>> limit check introduced in v4.19 is causing at least 2 reports in mail list.
> 
> BTW, if there's an urgent fix or patch that should be considered for
> current devel cycle, please note that in the subject like
> 
>   [PATCH for 4.20-rc] btrfs: ...
> 
> to make it more visible.
> 

Great thanks for this hint!

Just forgot we have such tag.

Thanks,
Qu





Re: experiences running btrfs on external USB disks?

2018-12-04 Thread Austin S. Hemmelgarn

On 2018-12-04 08:37, Graham Cobb wrote:

On 04/12/2018 12:38, Austin S. Hemmelgarn wrote:

In short, USB is _crap_ for fixed storage, don't use it like that, even
if you are using filesystems which don't appear to complain.


That's useful advice, thanks.

Do you (or anyone else) have any experience of using btrfs over iSCSI? I
was thinking about this for three different use cases:

1) Giving my workstation a data disk that is actually a partition on a
server -- keeping all the data on the big disks on the server and
reducing power consumption (just a small boot SSD in the workstation).

2) Splitting a btrfs RAID1 between a local disk and a remote iSCSI
mirror to provide  redundancy without putting more disks in the local
system. Of course, this would mean that one of the RAID1 copies would
have higher latency than the other.

3) Like case 1 but actually exposing an LVM logical volume from the
server using iSCSI, rather than a simple disk partition. I would then
put both encryption and RAID running on the server below that logical
volume.

NBD could also be an alternative to iSCSI in these cases as well.

Any thoughts?
I've not run it over iSCSI (I tend to avoid that overly-complicated 
mess), but I have done it over NBD and ATAoE, as well as some more 
exotic arrangements, and it's really not too bad.  The important part is 
making sure your block layer and all the stuff under it are reliable, 
and USB is not.


Re: [PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable

2018-12-04 Thread David Sterba
On Tue, Dec 04, 2018 at 06:15:13PM +0800, Qu Wenruo wrote:
> Gentle ping.
> 
> Please put this patch into current release as the new block group size
> limit check introduced in v4.19 is causing at least 2 reports in mail list.

BTW, if there's an urgent fix or patch that should be considered for
current devel cycle, please note that in the subject like

  [PATCH for 4.20-rc] btrfs: ...

to make it more visible.


Re: experiences running btrfs on external USB disks?

2018-12-04 Thread Graham Cobb
On 04/12/2018 12:38, Austin S. Hemmelgarn wrote:
> In short, USB is _crap_ for fixed storage, don't use it like that, even
> if you are using filesystems which don't appear to complain.

That's useful advice, thanks.

Do you (or anyone else) have any experience of using btrfs over iSCSI? I
was thinking about this for three different use cases:

1) Giving my workstation a data disk that is actually a partition on a
server -- keeping all the data on the big disks on the server and
reducing power consumption (just a small boot SSD in the workstation).

2) Splitting a btrfs RAID1 between a local disk and a remote iSCSI
mirror to provide  redundancy without putting more disks in the local
system. Of course, this would mean that one of the RAID1 copies would
have higher latency than the other.

3) Like case 1 but actually exposing an LVM logical volume from the
server using iSCSI, rather than a simple disk partition. I would then
put both encryption and RAID running on the server below that logical
volume.

NBD could also be an alternative to iSCSI in these cases as well.

Any thoughts?

Graham


Re: [PATCH] btrfs: tree-checker: Don't check max block group size as current max chunk size limit is unreliable

2018-12-04 Thread David Sterba
On Tue, Dec 04, 2018 at 06:15:13PM +0800, Qu Wenruo wrote:
> Gentle ping.
> 
> Please put this patch into current release as the new block group size
> limit check introduced in v4.19 is causing at least 2 reports in mail list.

I see, on the way to 4.20-rc with stable tags. Thanks.


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Qu Wenruo



On 2018/12/4 下午9:07, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
>> Hi all,
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
>>
>> With 14TB drives available today, it doesn't take more than a handful of 
>> drives to result in a filesystem that takes around a minute to mount. 
>> As a result of this, I suspect this will become an increasingly problem 
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
>> not a contributor so I have no room to do so -- just shedding some light 
>> on a problem that may deserve attention as filesystem sizes continue to 
>> grow.
> 
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing finger I'd like concrete
> evidence pointed at the offender.

IIRC I submitted such an analysis years ago.

Nowadays it may change due to chunk <-> bg <-> dev_extents cross checking.
So yes, it would be a good idea to show such percentage.

Thanks,
Qu

> 
>>
>> Best,
>>
>> ellis
>>


Re: btrfs development - question about crypto api integration

2018-12-04 Thread David Sterba
On Fri, Nov 30, 2018 at 06:27:58PM +0200, Nikolay Borisov wrote:
> 
> 
> On 30.11.18 г. 17:22 ч., Chris Mason wrote:
> > On 29 Nov 2018, at 12:37, Nikolay Borisov wrote:
> > 
> >> On 29.11.18 г. 18:43 ч., Jean Fobe wrote:
> >>> Hi all,
> >>> I've been studying LZ4 and other compression algorithms on the
> >>> kernel, and seen other projects such as zram and ubifs using the
> >>> crypto api. Is there a technical reason for not using the crypto api
> >>> for compression (and possibly for encryption) in btrfs?
> >>> I did not find any design/technical implementation choices in
> >>> btrfs development in the developer's FAQ on the wiki. If I completely
> >>> missed it, could someone point me in the right direction?
> >>> Lastly, if there is no technical reason for this, would it be
> >>> something interesting to have implemented?
> >>
> >> I personally think it would be better if btrfs' exploited the generic
> >> framework. And in fact when you look at zstd, btrfs does use the
> >> generic, low-level ZSTD routines but not the crypto library wrappers. 
> >> If
> >> I were I'd try and convert zstd (since it's the most recently added
> >> algorithm) to using the crypto layer to see if there are any lurking
> >> problems.
> > 
> > Back when I first added the zlib support, the zlib API was both easier 
> > to use and a better fit for our async worker threads.  That doesn't mean 
> > we shouldn't switch, it's just how we got to step one ;)
> 
> And what about zstd? WHy is it also using the low level api and not the
> crypto wrappers?

I think because it just copied the way things already were.

Here's fs/ubifs/compress.c as an example use of the API in a filesystem:

ubifs_compress
  crypto_comp_compress
crypto_comp_crt -- address of, dereference
->cot_compress -- takes another address of something, indirect
   function call
  with value 'crypto_compress'
-- this does 2 pointer dereferences and indirect call to
   coa_compress
   coa_compress = lzo_compress
 lzo_compress -- pointer dereferences for crypto context
   lzo1x_1_compress -- through static __lzo_compress (no overhead)

while in btrfs:

btrfs_compress_pages
  ->compress_pages
lzo1x_1_compress

The crypto API adds several pointer and function call indirectinos, I'm
not sure I want to get rid of the direct calls just yet.


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 3.12.18 г. 20:20 ч., Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasingly problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

Would it be possible to provide perf traces of the longer-running mount
time? Everyone seems to be fixated on reading block groups (which is
likely to be the culprit) but before pointing finger I'd like concrete
evidence pointed at the offender.

> 
> Best,
> 
> ellis
> 


Re: [PATCH 2/5] btrfs: Refactor btrfs_can_relocate

2018-12-04 Thread David Sterba
On Tue, Dec 04, 2018 at 08:34:15AM +0200, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 г. 19:25 ч., David Sterba wrote:
> > On Sat, Nov 17, 2018 at 09:29:27AM +0800, Anand Jain wrote:
> >>> - ret = find_free_dev_extent(trans, device, min_free,
> >>> -_offset, NULL);
> >>> - if (!ret)
> >>> + if (!find_free_dev_extent(trans, device, min_free,
> >>> +_offset, NULL))
> >>
> >>   This can return -ENOMEM;
> >>
> >>> @@ -2856,8 +2856,7 @@ static int btrfs_relocate_chunk(struct 
> >>> btrfs_fs_info *fs_info, u64 chunk_offset)
> >>>*/
> >>>   lockdep_assert_held(_info->delete_unused_bgs_mutex);
> >>>   
> >>> - ret = btrfs_can_relocate(fs_info, chunk_offset);
> >>> - if (ret)
> >>> + if (!btrfs_can_relocate(fs_info, chunk_offset))
> >>>   return -ENOSPC;
> >>
> >>   And ends up converting -ENOMEM to -ENOSPC.
> >>
> >>   Its better to propagate the accurate error.
> > 
> > Right, converting to bool is obscuring the reason why the functions
> > fail. Making the code simpler at this cost does not look like a good
> > idea to me. I'll remove the patch from misc-next for now.
> 
> The patch itself does not make the code more obscure than it currently is,
> because even if ENOMEM is returned it's still converted to ENOSPC in
> btrfs_relocate_chunk.

Sorry, but this can be hardly used as an excuse to keep the code
obscure. btrfs_can_relocate & co are not very often changed so there's
cruft accumulated. The error value propagation was probably not a hot
topic in 2009, btrfs_can_relocate needs the cleanups but please let's do
that properly.


Re: [PATCH v1.1 9/9] btrfs-progs: Cleanup warning reported by -Wmissing-prototypes

2018-12-04 Thread David Sterba
On Tue, Dec 04, 2018 at 08:24:38PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/12/4 下午8:22, David Sterba wrote:
> > On Fri, Nov 16, 2018 at 04:04:51PM +0800, Qu Wenruo wrote:
> >> The following missing prototypes will be fixed:
> >> 1)  btrfs.c::handle_special_globals()
> >> 2)  check/mode-lowmem.c::repair_ternary_lowmem()
> >> 3)  extent-tree.c::btrfs_search_overlap_extent()
> >> Above 3 can be fixed by making them static
> >>
> >> 4)  utils.c::btrfs_check_nodesize()
> >> Fixed by moving it to fsfeatures.c
> >>
> >> 5)  chunk-recover.c::btrfs_recover_chunk_tree()
> >> 6)  super-recover.c::btrfs_recover_superblocks()
> >> Fixed by moving the declaration from cmds-rescue.c to rescue.h
> >>
> >> 7)  utils-lib.c::arg_strtou64()
> >> 8)  utils-lib.c::lookup_path_rootid()
> >> Fixed by include "utils.h"
> >>
> >> 9)  free-space-tree.c::set_free_space_tree_thresholds()
> >> Fixed by deleting it, as there is no user.
> >>
> >> 10) free-space-tree.c::convert_free_space_to_bitmaps()
> >> 11) free-space-tree.c::convert_free_space_to_extents()
> >> 12) free-space-tree.c::__remove_from_free_space_tree()
> >> 13) free-space-tree.c::__add_to_free_space_tree()
> >> 14) free-space-tree.c::btrfs_create_tree()
> >> Fixed by making them static.
> > 
> > Please split this to more patches grouped by the type of change.
> 
> No problem, just as the numbering is already grouped.

Note that 1-3 and 10-14 are the same group.


Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Nikolay Borisov



On 4.12.18 г. 14:22 ч., Adam Borowski wrote:
> On Tue, Dec 04, 2018 at 01:17:04PM +0100, David Sterba wrote:
>> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
>>> The only location is the following code:
>>>
>>> int level = path->lowest_level + 1;
>>> BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
>>> while(level < BTRFS_MAX_LEVEL) {
>>> slot = path->slots[level] + 1;
>>> ...
>>> }
>>> path->slots[level] = slot;
>>>
>>> Again, it's the stupid compiler that needs some hint for the fact that
>>> we will always enter the while loop for at least once, thus @slot should
>>> always be initialized.
>>
>> Harsh words for the compiler, and I say not deserved. The same code
>> pasted to kernel a built with the same version does not report the
>> warning, so it's apparently a missing annotation of BUG_ON in
>> btrfs-progs that does not give the right hint.
> 
> It'd be nice if the C language provided a kind of a while loop that executes
> at least once...

But it does, it's called:

do {

} while()

> 


Re: experiences running btrfs on external USB disks?

2018-12-04 Thread Austin S. Hemmelgarn

On 2018-12-04 00:37, Tomasz Chmielewski wrote:

I'm trying to use btrfs on an external USB drive, without much success.

When the drive is connected for 2-3+ days, the filesystem gets remounted 
readonly, with BTRFS saying "IO failure":


[77760.444607] BTRFS error (device sdb1): bad tree block start, want 
378372096 have 0
[77760.550933] BTRFS error (device sdb1): bad tree block start, want 
378372096 have 0
[77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: 
errno=-5 IO failure

[77760.550979] BTRFS info (device sdb1): forced readonly
[77760.551003] BTRFS: error (device sdb1) in 
btrfs_run_delayed_refs:2935: errno=-5 IO failure

[77760.553223] BTRFS error (device sdb1): pending csums is 4096


Note that there are no other kernel messages (i.e. that would indicate a 
problem with disk, cable disconnection etc.).


The load on the drive itself can be quite heavy at times (i.e. 100% IO 
for 1-2 h and more) - can it contribute to the problem (i.e. btrfs 
thinks there is some timeout somewhere)?


Running 4.19.6 right now, but was experiencing the issue also with 4.18 
kernels.




# btrfs device stats /data
[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs 0
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0
It looks to me like the typical USB issues that are present with almost 
all filesystems but only seem to be noticed by BTRFS because it does 
more rigorous checking of data.


In short, USB is _crap_ for fixed storage, don't use it like that, even 
if you are using filesystems which don't appear to complain.


Re: [PATCH v1.1 9/9] btrfs-progs: Cleanup warning reported by -Wmissing-prototypes

2018-12-04 Thread Qu Wenruo


On 2018/12/4 下午8:22, David Sterba wrote:
> On Fri, Nov 16, 2018 at 04:04:51PM +0800, Qu Wenruo wrote:
>> The following missing prototypes will be fixed:
>> 1)  btrfs.c::handle_special_globals()
>> 2)  check/mode-lowmem.c::repair_ternary_lowmem()
>> 3)  extent-tree.c::btrfs_search_overlap_extent()
>> Above 3 can be fixed by making them static
>>
>> 4)  utils.c::btrfs_check_nodesize()
>> Fixed by moving it to fsfeatures.c
>>
>> 5)  chunk-recover.c::btrfs_recover_chunk_tree()
>> 6)  super-recover.c::btrfs_recover_superblocks()
>> Fixed by moving the declaration from cmds-rescue.c to rescue.h
>>
>> 7)  utils-lib.c::arg_strtou64()
>> 8)  utils-lib.c::lookup_path_rootid()
>> Fixed by including "utils.h"
>>
>> 9)  free-space-tree.c::set_free_space_tree_thresholds()
>> Fixed by deleting it, as there is no user.
>>
>> 10) free-space-tree.c::convert_free_space_to_bitmaps()
>> 11) free-space-tree.c::convert_free_space_to_extents()
>> 12) free-space-tree.c::__remove_from_free_space_tree()
>> 13) free-space-tree.c::__add_to_free_space_tree()
>> 14) free-space-tree.c::btrfs_create_tree()
>> Fixed by making them static.
> 
> Please split this to more patches grouped by the type of change.

No problem; the numbering above already groups them by type.

Thanks,
Qu
> 
>> --- /dev/null
>> +++ b/rescue.h
>> @@ -0,0 +1,14 @@
>> +/*
>> + * Copyright (C) 2018 SUSE.  All rights reserved.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public
>> + * License v2 as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
>> + * General Public License for more details.
>> + */
> 
> Missing ifdef to prevent multiple inclusion
> 
>> +int btrfs_recover_superblocks(const char *path, int verbose, int yes);
>> +int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);



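For illustration, here is a minimal standalone sketch of the `static` fix
used for items 1)-3) above; the function name and body are hypothetical,
not actual btrfs-progs code:

```c
/*
 * With -Wmissing-prototypes, a function with external linkage that is
 * defined without a prior declaration in any header triggers
 * "no previous prototype for 'overlap'".  When the function is only
 * called from its own file, marking it static gives it internal
 * linkage, so no prototype is expected and the warning goes away.
 */
static int overlap(unsigned long a_start, unsigned long a_len,
		   unsigned long b_start, unsigned long b_len)
{
	return a_start < b_start + b_len && b_start < a_start + a_len;
}
```

Moving the declaration into a shared header (as done with rescue.h for
items 5) and 6)) is the alternative when the function does have external
callers.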


Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Adam Borowski
On Tue, Dec 04, 2018 at 01:17:04PM +0100, David Sterba wrote:
> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
> > The only location is the following code:
> > 
> > int level = path->lowest_level + 1;
> > BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
> > while(level < BTRFS_MAX_LEVEL) {
> > slot = path->slots[level] + 1;
> > ...
> > }
> > path->slots[level] = slot;
> > 
> > Again, it's the stupid compiler that needs a hint about the fact that
> > we will always enter the while loop at least once, thus @slot will
> > always be initialized.
> 
> Harsh words for the compiler, and I say not deserved. The same code
> pasted into the kernel and built with the same compiler version does not
> report the warning, so it's apparently a missing annotation of BUG_ON in
> btrfs-progs that does not give the compiler the right hint.

It'd be nice if the C language provided a kind of a while loop that executes
at least once...

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.
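A standalone sketch of the pattern under discussion (simplified from the
quoted ctree.c code; the function name and return value are illustrative):
rewriting the guaranteed-to-run loop as do-while makes the "at least once"
invariant visible to the compiler, so @slot can no longer look
maybe-uninitialized.

```c
#define BTRFS_MAX_LEVEL 8

/*
 * Simplified sketch of the quoted pattern: the caller guarantees
 * lowest_level + 1 < BTRFS_MAX_LEVEL (the BUG_ON in the original),
 * so the loop body always runs at least once.  A plain while loop
 * hides that guarantee from the compiler, making `slot` look
 * maybe-uninitialized after the loop; a do-while encodes the
 * invariant directly.
 */
static int last_slot(const int *slots, int lowest_level)
{
	int level = lowest_level + 1;
	int slot;

	do {
		slot = slots[level] + 1;
		level++;
	} while (level < BTRFS_MAX_LEVEL);

	return slot;	/* always initialized: the body ran at least once */
}
```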


Re: [PATCH v1.1 9/9] btrfs-progs: Cleanup warning reported by -Wmissing-prototypes

2018-12-04 Thread David Sterba
On Fri, Nov 16, 2018 at 04:04:51PM +0800, Qu Wenruo wrote:
> The following missing prototypes will be fixed:
> 1)  btrfs.c::handle_special_globals()
> 2)  check/mode-lowmem.c::repair_ternary_lowmem()
> 3)  extent-tree.c::btrfs_search_overlap_extent()
> Above 3 can be fixed by making them static
> 
> 4)  utils.c::btrfs_check_nodesize()
> Fixed by moving it to fsfeatures.c
> 
> 5)  chunk-recover.c::btrfs_recover_chunk_tree()
> 6)  super-recover.c::btrfs_recover_superblocks()
> Fixed by moving the declaration from cmds-rescue.c to rescue.h
> 
> 7)  utils-lib.c::arg_strtou64()
> 8)  utils-lib.c::lookup_path_rootid()
> Fixed by including "utils.h"
> 
> 9)  free-space-tree.c::set_free_space_tree_thresholds()
> Fixed by deleting it, as there is no user.
> 
> 10) free-space-tree.c::convert_free_space_to_bitmaps()
> 11) free-space-tree.c::convert_free_space_to_extents()
> 12) free-space-tree.c::__remove_from_free_space_tree()
> 13) free-space-tree.c::__add_to_free_space_tree()
> 14) free-space-tree.c::btrfs_create_tree()
> Fixed by making them static.

Please split this to more patches grouped by the type of change.

> --- /dev/null
> +++ b/rescue.h
> @@ -0,0 +1,14 @@
> +/*
> + * Copyright (C) 2018 SUSE.  All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public
> + * License v2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + */

Missing ifdef to prevent multiple inclusion

> +int btrfs_recover_superblocks(const char *path, int verbose, int yes);
> +int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);
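
For reference, an include guard of the kind requested could look like this
(the guard macro name is illustrative, not necessarily what btrfs-progs
would use):

```c
/* Guard so that including rescue.h twice does not redeclare anything. */
#ifndef __BTRFS_RESCUE_H__
#define __BTRFS_RESCUE_H__

int btrfs_recover_superblocks(const char *path, int verbose, int yes);
int btrfs_recover_chunk_tree(const char *path, int verbose, int yes);

#endif /* __BTRFS_RESCUE_H__ */
```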

