[PATCH v4] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-23 Thread fdmanana
From: Filipe Manana 

When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk getting into a deadlock with reclaim.

Checking for scrub pause requests is done early at the beginning of the
while loop of scrub_stripe() and later in the loop, scrub_extent() and
scrub_raid56_parity() are called, which in turn call scrub_pages() and
scrub_pages_for_parity() respectively. These last two functions do memory
allocations using GFP_KERNEL. Same problem could happen while scrubbing
the super blocks, since it calls scrub_pages().

So make sure GFP_NOFS is used for the memory allocations because at any
time a scrub pause request can happen from another task that started to
commit a transaction.

Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
Signed-off-by: Filipe Manana 
---

V2: Make using GFP_NOFS unconditional. Previous version was racy, as pausing
requests might happen just after we checked for them.

V3: Use memalloc_nofs_save() just like V1 did.

V4: Similar problem happened for raid56, which was previously missed, so
deal with it as well as the case for scrub_supers().

 fs/btrfs/scrub.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3be1456b5116..e08b7502d1f0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3779,6 +3779,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
struct scrub_ctx *sctx;
int ret;
struct btrfs_device *dev;
+   unsigned int nofs_flag;
 
if (btrfs_fs_closing(fs_info))
return -EINVAL;
@@ -3882,6 +3883,16 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
atomic_inc(&fs_info->scrubs_running);
mutex_unlock(&fs_info->scrub_lock);
 
+   /*
+* In order to avoid deadlock with reclaim when there is a transaction
+* trying to pause scrub, make sure we use GFP_NOFS for all the
+* allocations done at scrub_pages() and scrub_pages_for_parity()
+* invoked by our callees. The pausing request is done when the
+* transaction commit starts, and it blocks the transaction until scrub
+* is paused (done at specific points at scrub_stripe() or right above
+* before incrementing fs_info->scrubs_running).
+*/
+   nofs_flag = memalloc_nofs_save();
if (!is_dev_replace) {
/*
 * by holding device list mutex, we can
@@ -3895,6 +3906,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start,
if (!ret)
ret = scrub_enumerate_chunks(sctx, dev, start, end,
 is_dev_replace);
+   memalloc_nofs_restore(nofs_flag);
 
wait_event(sctx->list_wait, atomic_read(&sctx->bios_in_flight) == 0);
atomic_dec(&fs_info->scrubs_running);
-- 
2.11.0



Re: Btrfs and fanotify filesystem watches

2018-11-23 Thread Amir Goldstein
On Fri, Nov 23, 2018 at 3:34 PM Amir Goldstein  wrote:
>
> On Fri, Nov 23, 2018 at 2:52 PM Jan Kara  wrote:
> >
> > Changed subject to better match what we discuss and added btrfs list to CC.
> >
> > On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> > > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara  wrote:
> > > >
> > > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is 
> > > > > > > associated with
> > > > > > > the single super block struct, so all dentries in all subvolumes 
> > > > > > > will
> > > > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > > > >
> > > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > > > >
> > > > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ 
> > > > > > be32_to_cpu(fsid[2]);
> > > > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ 
> > > > > > be32_to_cpu(fsid[3]);
> > > > > > /* Mask in the root object ID too, to disambiguate subvols 
> > > > > > */
> > > > > > buf->f_fsid.val[0] ^=
> > > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid 
> > > > > > >> 32;
> > > > > > buf->f_fsid.val[1] ^=
> > > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > > > >
> > > > > > So subvolume root is xored into the FSID. Thus dentries from 
> > > > > > different
> > > > > > subvolumes are going to return different fsids...
> > > > > >
> > > > >
> > > > > Right... how could I have missed that :-/
> > > > >
> > > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > > > Instead, I will use:
> > > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,...
> > > >
> > > > So what about my proposal to store fsid in the notification mark when 
> > > > it is
> > > > created and then use it when that mark results in event being generated?
> > > > When mark is created, we have full path available, so getting proper 
> > > > fsid
> > > > is trivial. Furthermore if the behavior is documented like that, it is
> > > > fairly easy for userspace to track fsids it should care about and 
> > > > translate
> > > > them to proper file descriptors for open_by_handle().
> > > >
> > > > This would get rid of statfs() on every event creation (which I don't 
> > > > like
> > > > much) and also avoids these problems "how to get fsid for inode". What 
> > > > do
> > > > you think?
> > > >
> > >
> > > That's interesting. I like the simplicity.
> > > What happens when there are 2 btrfs subvols /subvol1 /subvol2
> > > and the application obviously doesn't know about this and does:
> > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> > > statfs("/subvol1",...);
> > > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> > > statfs("/subvol2",...);
> > >
> > > Now the second fanotify_mark() call just updates the existing super block
> > > mark mask, but does not change the fsid on the mark, so all events will 
> > > have
> > > fsid of subvol1 that was stored when first creating the mark.
> >
> > Yes.
> >
> > > fanotify-watch application (for example) hashes the watches (paths) under
> > > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> > > "watch" (i.e. path).
> >
> > I agree this can be confusing... but with btrfs fanotify-watch will be
> > confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
> > on /subvol1 (with fsid A) is going to return also events on inodes from
> > /subvol2 (with fsid B). So your current code will return event with fsid B
> > which fanotify-watch has no way to match back and can get confused.
> >
> > So currently application can get events with fsid it has never seen, with
> > the code as I suggest it can get "wrong" fsid. That is confusing but still
> > somewhat better?
>
> It's not better IMO, because the purpose of the fid in the event is a unique
> key that the application can use to match a path it already indexed.
> wrong fsid => no match.
> Using open_by_handle_at() to resolve unknown fid is a limited solution
> and I don't think it is going to work in this case (see below).
>
> >
> > The core of the problem is that btrfs does not have "the superblock
> > identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
> > that we could use.
> >
> > > And when trying to open_by_handle fid with fhandle from /subvol2
> > > using mount_fd of /subvol1, I suppose we can either get ESTALE
> > > or a disconnected dentry, because an object from /subvol2 cannot
> > > have a path inside /subvol1.
> >
> > So open_by_handle() should work fine even if we get mount_fd of /subvol1
> > and handle for inode on /subvol2. mount_fd in open_by_handle() is really
> > only used to get the superblock and that is the same.
>
> I don't think it will work fine.
> do_handle_to_path() will compose a path from /subvol1 mnt and dentry 

Re: [PATCH] btrfs: only run delayed refs if we're committing

2018-11-23 Thread Filipe Manana
On Thu, Nov 22, 2018 at 12:35 AM Josef Bacik  wrote:
>
> I noticed in a giant dbench run that we spent a lot of time on lock
> contention while running transaction commit.  This is because dbench
> results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
> they all run the delayed refs first thing, so they all contend with
> each other.  This leads to seconds of 0 throughput.  Change this to only
> run the delayed refs if we're the ones committing the transaction.  This
> makes the latency go away and we get no more lock contention.

Can you share the following in the changelog please?

1) How did you run dbench (parameters, config)?

2) What results did you get before and after this change, so that we all
get an idea of how good the impact is.

While the reduced contention makes all sense and seems valid, I'm not
sure this is always a win.
It certainly is when multiple tasks are calling
btrfs_commit_transaction() simultaneously, but what about when only one
does it?

By running all delayed references inside the critical section of the
transaction commit (state == TRANS_STATE_COMMIT_START), instead of running
most of them outside/before, we will be blocking for a longer time other
tasks calling btrfs_start_transaction() (which is used a lot - creating
files, unlinking files, adding links, etc, and even fsync).

Won't there be other types of workloads and tests, other than dbench,
that can get increased latency and/or smaller throughput?

I find that sort of information useful to have in the changelog. If you
verified that, or you think it's irrelevant to measure/consider, it would
be great to have it mentioned (and explained) in the changelog.

Thanks.

>
> Reviewed-by: Omar Sandoval 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/transaction.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 3c1be9db897c..41cc96cc59a3 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> btrfs_trans_release_metadata(trans);
> trans->block_rsv = NULL;
>
> -   /* make a pass through all the delayed refs we have so far
> -* any runnings procs may add more while we are here
> -*/
> -   ret = btrfs_run_delayed_refs(trans, 0);
> -   if (ret) {
> -   btrfs_end_transaction(trans);
> -   return ret;
> -   }
> -
> cur_trans = trans->transaction;
>
> /*
> @@ -1938,12 +1929,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
>
> btrfs_create_pending_block_groups(trans);
>
> -   ret = btrfs_run_delayed_refs(trans, 0);
> -   if (ret) {
> -   btrfs_end_transaction(trans);
> -   return ret;
> -   }
> -
> if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, &cur_trans->flags)) {
> int run_it = 0;
>
> @@ -2014,6 +1999,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans)
> spin_unlock(&fs_info->trans_lock);
> }
>
> +   /*
> +* We are now the only one in the commit area, we can run delayed refs
> +* without hitting a bunch of lock contention from a lot of people
> +* trying to commit the transaction at once.
> +*/
> +   ret = btrfs_run_delayed_refs(trans, 0);
> +   if (ret)
> +   goto cleanup_transaction;
> +
> extwriter_counter_dec(cur_trans, trans->type);
>
> ret = btrfs_start_delalloc_flush(fs_info);
> --
> 2.14.3
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH v3] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-23 Thread fdmanana
From: Filipe Manana 

When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL while scrub
is running, we must use GFP_NOFS to avoid deadlock with reclaim. Checking
for pause requests is done early in the while loop of scrub_stripe(), and
later in the loop, scrub_extent() is called, which in turn calls
scrub_pages(), which does memory allocations using GFP_KERNEL. So use
GFP_NOFS for the memory allocations because at any time a scrub pause
request can happen from another task that started to commit a transaction.

Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
Signed-off-by: Filipe Manana 
---

V2: Make using GFP_NOFS unconditional. Previous version was racy, as pausing
requests might happen just after we checked for them.

V3: Use memalloc_nofs_save() just like V1 did.

 fs/btrfs/scrub.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3be1456b5116..8e9ead5073ec 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2204,13 +2204,24 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 {
struct scrub_block *sblock;
int index;
+   unsigned int nofs_flag;
+   int ret = 0;
+
+   /*
+* In order to avoid deadlock with reclaim when there is a transaction
+* trying to pause scrub, use GFP_NOFS. The pausing request is done when
+* the transaction commit starts, and it blocks the transaction until
+* scrub is paused (done at specific points at scrub_stripe()).
+*/
+   nofs_flag = memalloc_nofs_save();
 
sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
if (!sblock) {
spin_lock(&sctx->stat_lock);
sctx->stat.malloc_errors++;
spin_unlock(&sctx->stat_lock);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out;
}
 
/* one ref inside this function, plus one for each page added to
@@ -2230,7 +2241,8 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
sctx->stat.malloc_errors++;
spin_unlock(&sctx->stat_lock);
scrub_block_put(sblock);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out;
}
BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
scrub_page_get(spage);
@@ -2269,12 +2281,11 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
} else {
for (index = 0; index < sblock->page_count; index++) {
struct scrub_page *spage = sblock->pagev[index];
-   int ret;
 
ret = scrub_add_page_to_rd_bio(sctx, spage);
if (ret) {
scrub_block_put(sblock);
-   return ret;
+   goto out;
}
}
 
@@ -2284,7 +2295,9 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 
/* last one frees, either here or in bio completion for last page */
scrub_block_put(sblock);
-   return 0;
+out:
+   memalloc_nofs_restore(nofs_flag);
+   return ret;
 }
 
 static void scrub_bio_end_io(struct bio *bio)
-- 
2.11.0



Re: [PATCH v2] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-23 Thread Nikolay Borisov



On 23.11.18 г. 18:05 ч., fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When a transaction commit starts, it attempts to pause scrub and it blocks
> until the scrub is paused. So while the transaction is blocked waiting for
> scrub to pause, we can not do memory allocation with GFP_KERNEL while scrub
> is running, we must use GFP_NOFS to avoid deadlock with reclaim. Checking
> for pause requests is done early in the while loop of scrub_stripe(), and
> later in the loop, scrub_extent() is called, which in turn calls
> scrub_pages(), which does memory allocations using GFP_KERNEL. So use
> GFP_NOFS for the memory allocations because at any time a scrub pause
> request can happen from another task that started to commit a transaction.
> 
> Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
> Signed-off-by: Filipe Manana 
> ---
> 
> V2: Make using GFP_NOFS unconditional. Previous version was racy, as pausing
> requests might happen just after we checked for them.
> 
>  fs/btrfs/scrub.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 3be1456b5116..0630ea0881bc 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -2205,7 +2205,13 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
>   struct scrub_block *sblock;
>   int index;
>  
> - sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
> + /*
> +  * In order to avoid deadlock with reclaim when there is a transaction
> +  * trying to pause scrub, use GFP_NOFS. The pausing request is done when
> +  * the transaction commit starts, and it blocks the transaction until
> +  * scrub is paused (done at specific points at scrub_stripe()).
> +  */
> + sblock = kzalloc(sizeof(*sblock), GFP_NOFS);

Newer code shouldn't use GFP_NOFS; rather, leave GFP_KERNEL as is and
instead use memalloc_nofs_save()/memalloc_nofs_restore(). For background
information refer to: Documentation/core-api/gfp_mask-from-fs-io.rst

>   if (!sblock) {
>   spin_lock(&sctx->stat_lock);
>   sctx->stat.malloc_errors++;
> @@ -2223,7 +2229,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
>   struct scrub_page *spage;
>   u64 l = min_t(u64, len, PAGE_SIZE);
>  
> - spage = kzalloc(sizeof(*spage), GFP_KERNEL);
> + spage = kzalloc(sizeof(*spage), GFP_NOFS);
>   if (!spage) {
>  leave_nomem:
>   spin_lock(&sctx->stat_lock);
> @@ -2250,7 +2256,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
>   spage->have_csum = 0;
>   }
>   sblock->page_count++;
> - spage->page = alloc_page(GFP_KERNEL);
> + spage->page = alloc_page(GFP_NOFS);
>   if (!spage->page)
>   goto leave_nomem;
>   len -= l;
> 


[PATCH v2] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-23 Thread fdmanana
From: Filipe Manana 

When a transaction commit starts, it attempts to pause scrub and it blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL while scrub
is running, we must use GFP_NOFS to avoid deadlock with reclaim. Checking
for pause requests is done early in the while loop of scrub_stripe(), and
later in the loop, scrub_extent() is called, which in turn calls
scrub_pages(), which does memory allocations using GFP_KERNEL. So use
GFP_NOFS for the memory allocations because at any time a scrub pause
request can happen from another task that started to commit a transaction.

Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
Signed-off-by: Filipe Manana 
---

V2: Make using GFP_NOFS unconditional. Previous version was racy, as pausing
requests might happen just after we checked for them.

 fs/btrfs/scrub.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3be1456b5116..0630ea0881bc 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2205,7 +2205,13 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
struct scrub_block *sblock;
int index;
 
-   sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
+   /*
+* In order to avoid deadlock with reclaim when there is a transaction
+* trying to pause scrub, use GFP_NOFS. The pausing request is done when
+* the transaction commit starts, and it blocks the transaction until
+* scrub is paused (done at specific points at scrub_stripe()).
+*/
+   sblock = kzalloc(sizeof(*sblock), GFP_NOFS);
if (!sblock) {
spin_lock(&sctx->stat_lock);
sctx->stat.malloc_errors++;
@@ -2223,7 +2229,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
struct scrub_page *spage;
u64 l = min_t(u64, len, PAGE_SIZE);
 
-   spage = kzalloc(sizeof(*spage), GFP_KERNEL);
+   spage = kzalloc(sizeof(*spage), GFP_NOFS);
if (!spage) {
 leave_nomem:
spin_lock(&sctx->stat_lock);
@@ -2250,7 +2256,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
spage->have_csum = 0;
}
sblock->page_count++;
-   spage->page = alloc_page(GFP_KERNEL);
+   spage->page = alloc_page(GFP_NOFS);
if (!spage->page)
goto leave_nomem;
len -= l;
-- 
2.11.0



Re: [PATCH 0/6] Delayed refs rsv

2018-11-23 Thread David Sterba
On Wed, Nov 21, 2018 at 01:59:06PM -0500, Josef Bacik wrote:
> This patchset changes how we do space reservations for delayed refs.  We were
> hitting probably 20-40 enospc abort's per day in production while running
> delayed refs at transaction commit time.  This means we ran out of space in
> the global reserve and couldn't easily get more space in use_block_rsv().
> 
> The global reserve has grown to cover everything we don't reserve space
> explicitly for, and we've grown a lot of weird ad-hoc heuristics to know if
> we're running short on space and when it's time to force a commit.  A failure
> rate of 20-40 file systems when we run hundreds of thousands of them isn't
> super high, but cleaning up this code will make things less ugly and more
> predictable.
> 
> Thus the delayed refs rsv.  We always know how many delayed refs we have
> outstanding, and although running them generates more we can use the global
> reserve for that spill over, which fits better into its desired use than a
> full blown reservation.  This first approach is to simply take how many times
> we're reserving space for and multiply that by 2 in order to save enough
> space for the delayed refs that could be generated.  This is a naive
> approach and will probably evolve, but for now it works.
> 
> With this patchset we've gone down to 2-8 failures per week.  It's not
> perfect, there are some corner cases that still need to be addressed, but it
> is significantly better than what we had.  Thanks,

Parts of this cover letter could go to the main patch 5/6 to extend the
background why the change is made.

As the other patchsets build on top of this, I'll merge this to
misc-next soon. The testing so far looks ok, some patches might need
fixups but nothing big that I can't do at commit time.

The patch 5/6 still seems too big, some parts could be split as
preparatory work, otherwise the main idea where to wire in the special
block reserve seems clear. If anybody wants to do a review there, please
do, I did only a high-level pass and did not check all the calculations etc.


Re: [PATCH 3/6] btrfs: cleanup extent_op handling

2018-11-23 Thread David Sterba
On Wed, Nov 21, 2018 at 01:59:09PM -0500, Josef Bacik wrote:
> From: Josef Bacik 
> 
> The cleanup_extent_op function actually would run the extent_op if it
> needed running, which made the name sort of a misnomer.  Change it to
> run_and_cleanup_extent_op, and move the actual cleanup work to
> cleanup_extent_op so it can be used by check_ref_cleanup() in order to
> unify the extent op handling.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 36 +++++++++++++++++++++++-------------
>  1 file changed, 23 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index e3ed3507018d..8a776dc9cb38 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -2424,19 +2424,33 @@ static void unselect_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_ref
>   btrfs_delayed_ref_unlock(head);
>  }
>  
> -static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> -  struct btrfs_delayed_ref_head *head)
> +static struct btrfs_delayed_extent_op *
> +cleanup_extent_op(struct btrfs_trans_handle *trans,
> +   struct btrfs_delayed_ref_head *head)

Please keep the type and function name on one line, if the arguments
would overflow, then move it on the next line and un-indent as
necessary. I fix such things when merging, no need to resend.

>  {
>   struct btrfs_delayed_extent_op *extent_op = head->extent_op;
> - int ret;
>  
>   if (!extent_op)
> - return 0;
> - head->extent_op = NULL;
> + return NULL;
> +
>   if (head->must_insert_reserved) {
> + head->extent_op = NULL;
>   btrfs_free_delayed_extent_op(extent_op);
> - return 0;
> + return NULL;
>   }
> + return extent_op;
> +}
> +
> +static int run_and_cleanup_extent_op(struct btrfs_trans_handle *trans,
> +  struct btrfs_delayed_ref_head *head)
> +{
> + struct btrfs_delayed_extent_op *extent_op =
> + cleanup_extent_op(trans, head);

I prefer to do non-trivial initializations in the statement block; it's
easy to overlook them among the declarations. Simple variable init,
pointer dereferences or using simple helpers is fine, but cleanup_extent_op
does not seem to fit that. It's in other patches too, so I can update
that unless there are other things that would need a resend.


Re: [PATCH 2/6] btrfs: add cleanup_ref_head_accounting helper

2018-11-23 Thread David Sterba
On Thu, Nov 22, 2018 at 09:06:30AM +0800, Qu Wenruo wrote:
> 
> 
> On 2018/11/22 上午2:59, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > We were missing some quota cleanups in check_ref_cleanup, so break the
> > ref head accounting cleanup into a helper and call that from both
> > check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
> > we don't screw up accounting in the future for other things that we add.
> > 
> > Reviewed-by: Omar Sandoval 
> > Reviewed-by: Liu Bo 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/extent-tree.c | 67 +++++++++++++++++++++++++++++++++++----------------------------
> >  1 file changed, 39 insertions(+), 28 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index c36b3a42f2bb..e3ed3507018d 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -2443,6 +2443,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle *trans,
> > return ret ? ret : 1;
> >  }
> >  
> > +static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
> > +   struct btrfs_delayed_ref_head *head)
> > +{
> > +   struct btrfs_fs_info *fs_info = trans->fs_info;
> > +   struct btrfs_delayed_ref_root *delayed_refs =
> > +   &trans->transaction->delayed_refs;
> > +
> > +   if (head->total_ref_mod < 0) {
> > +   struct btrfs_space_info *space_info;
> > +   u64 flags;
> > +
> > +   if (head->is_data)
> > +   flags = BTRFS_BLOCK_GROUP_DATA;
> > +   else if (head->is_system)
> > +   flags = BTRFS_BLOCK_GROUP_SYSTEM;
> > +   else
> > +   flags = BTRFS_BLOCK_GROUP_METADATA;
> > +   space_info = __find_space_info(fs_info, flags);
> > +   ASSERT(space_info);
> > +   percpu_counter_add_batch(&space_info->total_bytes_pinned,
> > +  -head->num_bytes,
> > +  BTRFS_TOTAL_BYTES_PINNED_BATCH);
> > +
> > +   if (head->is_data) {
> > +   spin_lock(&delayed_refs->lock);
> > +   delayed_refs->pending_csums -= head->num_bytes;
> > +   spin_unlock(&delayed_refs->lock);
> > +   }
> > +   }
> > +
> > +   /* Also free its reserved qgroup space */
> > +   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
> > + head->qgroup_reserved);
> 
> This part will be removed in patch "[PATCH] btrfs: qgroup: Move reserved
> data account from btrfs_delayed_ref_head to btrfs_qgroup_extent_record".

We need to decide the order of merging if the patches touch the same
code. Your patch looks easier to refresh on top of this series, so
please hold on until this gets merged.


Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-23 Thread Nikolay Borisov



On 23.11.18 г. 15:45 ч., David Sterba wrote:
> On Thu, Nov 22, 2018 at 11:42:28AM +0200, Nikolay Borisov wrote:
>>
>>
>> On 22.11.18 г. 11:12 ч., Nikolay Borisov wrote:
>>>
>>>
>>> On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
 From: Josef Bacik 

 We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
 into a helper and cleanup the calling functions.

 Signed-off-by: Josef Bacik 
 Reviewed-by: Omar Sandoval 
 ---
  fs/btrfs/delayed-ref.c | 14 ++
  fs/btrfs/delayed-ref.h |  3 ++-
  fs/btrfs/extent-tree.c | 22 +++---
  3 files changed, 19 insertions(+), 20 deletions(-)

 diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
 index 9301b3ad9217..b3e4c9fcb664 100644
 --- a/fs/btrfs/delayed-ref.c
 +++ b/fs/btrfs/delayed-ref.c
 @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
return head;
  }
  
 +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
 + struct btrfs_delayed_ref_head *head)
 +{
 +  lockdep_assert_held(&delayed_refs->lock);
 +  lockdep_assert_held(&head->lock);
 +
 +  rb_erase_cached(&head->href_node, &delayed_refs->href_root);
 +  RB_CLEAR_NODE(&head->href_node);
 +  atomic_dec(&delayed_refs->num_entries);
 +  delayed_refs->num_heads--;
 +  if (head->processing == 0)
 +  delayed_refs->num_heads_ready--;
>>>
>>> The num_heads_ready decrement will never execute in cleanup_ref_head, since
>>> processing == 0 only when the ref head is unselected. Perhaps those 2
>>> lines shouldn't be in this function? I find it a bit confusing that if
>>> processing is 0 we decrement num_heads_ready in check_ref_cleanup, but
>>> in unselect_delayed_ref_head we set it to 0 and increment it.
>>>
 +}
 +
  /*
   * Helper to insert the ref_node to the tail or merge with tail.
   *
 diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
 index 8e20c5cb5404..d2af974f68a1 100644
 --- a/fs/btrfs/delayed-ref.h
 +++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
  {
mutex_unlock(&head->mutex);
  }
 -
 +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
 + struct btrfs_delayed_ref_head *head);
  
  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
struct btrfs_delayed_ref_root *delayed_refs);
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index d242a1174e50..c36b3a42f2bb 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
@@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
spin_unlock(&delayed_refs->lock);
return 1;
}
 -  delayed_refs->num_heads--;
 -  rb_erase_cached(&head->href_node, &delayed_refs->href_root);
 -  RB_CLEAR_NODE(&head->href_node);
 +  btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(&head->lock);
spin_unlock(&delayed_refs->lock);
 -  atomic_dec(&delayed_refs->num_entries);
  
trace_run_delayed_ref_head(fs_info, head, 0);
  
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
if (!mutex_trylock(&head->mutex))
goto out;
  
 -  /*
 -   * at this point we have a head with no other entries.  Go
 -   * ahead and process it.
 -   */
 -  rb_erase_cached(&head->href_node, &delayed_refs->href_root);
 -  RB_CLEAR_NODE(&head->href_node);
 -  atomic_dec(&delayed_refs->num_entries);
 -
 -  /*
 -   * we don't take a ref on the node because we're removing it from the
 -   * tree, so we just steal the ref the tree was holding.
 -   */
 -  delayed_refs->num_heads--;
 -  if (head->processing == 0)
 -  delayed_refs->num_heads_ready--;
 +  btrfs_delete_ref_head(delayed_refs, head);
head->processing = 0;
>>
>> On a closer inspection I think here we can do:
>>
>> ASSERT(head->processing == 0) because at that point we've taken the
>> head->lock spinlock which is held during ordinary delayed refs
>> processing (in __btrfs_run_delayed_refs) when the head is selected (and
>> processing is 1). So head->processing == 0 here I think is a hard
>> invariant of the code. The decrement here should pair with the increment
>> when the head was initially added to the tree.
>>
>> In cleanup_ref_head we don't need to ever worry about num_heads_ready
>> since it has already been decremented by btrfs_select_ref_head.
>>
>> As a matter of fact this counter is not used anywhere, so we might as well
>> just remove it.
> 
> The logic does not use it, there's only a WARN_ON in
> btrfs_select_ref_head, that's more like a debugging or assertion that
> everything is fine. So the question is whether to keep it as a
> consistency check (and add comments) or remove it and simplify the 

Re: [PATCH 1/6] btrfs: add btrfs_delete_ref_head helper

2018-11-23 Thread David Sterba
On Thu, Nov 22, 2018 at 11:42:28AM +0200, Nikolay Borisov wrote:
> 
> 
> On 22.11.18 г. 11:12 ч., Nikolay Borisov wrote:
> > 
> > 
> > On 21.11.18 г. 20:59 ч., Josef Bacik wrote:
> >> From: Josef Bacik 
> >>
> >> We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
> >> into a helper and cleanup the calling functions.
> >>
> >> Signed-off-by: Josef Bacik 
> >> Reviewed-by: Omar Sandoval 
> >> ---
> >>  fs/btrfs/delayed-ref.c | 14 ++
> >>  fs/btrfs/delayed-ref.h |  3 ++-
> >>  fs/btrfs/extent-tree.c | 22 +++---
> >>  3 files changed, 19 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> >> index 9301b3ad9217..b3e4c9fcb664 100644
> >> --- a/fs/btrfs/delayed-ref.c
> >> +++ b/fs/btrfs/delayed-ref.c
> >> @@ -400,6 +400,20 @@ struct btrfs_delayed_ref_head *btrfs_select_ref_head(
> >>return head;
> >>  }
> >>  
> >> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> >> + struct btrfs_delayed_ref_head *head)
> >> +{
> >> +  lockdep_assert_held(&delayed_refs->lock);
> >> +  lockdep_assert_held(&head->lock);
> >> +
> >> +  rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> >> +  RB_CLEAR_NODE(&head->href_node);
> >> +  atomic_dec(&delayed_refs->num_entries);
> >> +  delayed_refs->num_heads--;
> >> +  if (head->processing == 0)
> >> +  delayed_refs->num_heads_ready--;
> > 
> > num_heads_ready will never execute in cleanup_ref_head, since
> > processing == 0 only when the ref head is unselected. Perhaps those 2
> > lines shouldn't be in this function? I find it a bit confusing that if
> > processing is 0 we decrement num_heads_ready in check_ref_cleanup, but
> > in unselect_delayed_ref_head we set it to 0 and increment it.
> > 
> >> +}
> >> +
> >>  /*
> >>   * Helper to insert the ref_node to the tail or merge with tail.
> >>   *
> >> diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
> >> index 8e20c5cb5404..d2af974f68a1 100644
> >> --- a/fs/btrfs/delayed-ref.h
> >> +++ b/fs/btrfs/delayed-ref.h
> >> @@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
> >>  {
> >>mutex_unlock(&head->mutex);
> >>  }
> >> -
> >> +void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
> >> + struct btrfs_delayed_ref_head *head);
> >>  
> >>  struct btrfs_delayed_ref_head *btrfs_select_ref_head(
> >>struct btrfs_delayed_ref_root *delayed_refs);
> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> >> index d242a1174e50..c36b3a42f2bb 100644
> >> --- a/fs/btrfs/extent-tree.c
> >> +++ b/fs/btrfs/extent-tree.c
> >> @@ -2474,12 +2474,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle *trans,
> >>spin_unlock(&delayed_refs->lock);
> >>return 1;
> >>}
> >> -  delayed_refs->num_heads--;
> >> -  rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> >> -  RB_CLEAR_NODE(&head->href_node);
> >> +  btrfs_delete_ref_head(delayed_refs, head);
> >>spin_unlock(&head->lock);
> >>spin_unlock(&delayed_refs->lock);
> >> -  atomic_dec(&delayed_refs->num_entries);
> >>  
> >>trace_run_delayed_ref_head(fs_info, head, 0);
> >>  
> >> @@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct btrfs_trans_handle *trans,
> >>if (!mutex_trylock(&head->mutex))
> >>goto out;
> >>  
> >> -  /*
> >> -   * at this point we have a head with no other entries.  Go
> >> -   * ahead and process it.
> >> -   */
> >> -  rb_erase_cached(&head->href_node, &delayed_refs->href_root);
> >> -  RB_CLEAR_NODE(&head->href_node);
> >> -  atomic_dec(&delayed_refs->num_entries);
> >> -
> >> -  /*
> >> -   * we don't take a ref on the node because we're removing it from the
> >> -   * tree, so we just steal the ref the tree was holding.
> >> -   */
> >> -  delayed_refs->num_heads--;
> >> -  if (head->processing == 0)
> >> -  delayed_refs->num_heads_ready--;
> >> +  btrfs_delete_ref_head(delayed_refs, head);
> >>head->processing = 0;
> 
> On a closer inspection I think here we can do:
> 
> ASSERT(head->processing == 0) because at that point we've taken the
> head->lock spinlock which is held during ordinary delayed refs
> processing (in __btrfs_run_delayed_refs) when the head is selected (and
> processing is 1). So head->processing == 0 here I think is a hard
> invariant of the code. The decrement here should pair with the increment
> when the head was initially added to the tree.
> 
> In cleanup_ref_head we don't need to ever worry about num_heads_ready
> since it has already been decremented by btrfs_select_ref_head.
> 
> As a matter fact this counter is not used anywhere so we might as well
> just remove it.

The logic does not use it; there's only a WARN_ON in
btrfs_select_ref_head, which is more of a debugging check or assertion
that everything is fine. So the question is whether to keep it as a
consistency check (and add comments) or remove it and simplify the code.


[PATCH] Btrfs: fix deadlock with memory reclaim during scrub

2018-11-23 Thread fdmanana
From: Filipe Manana 

When a transaction commit starts, it attempts to pause scrub and blocks
until the scrub is paused. So while the transaction is blocked waiting for
scrub to pause, we can not do memory allocation with GFP_KERNEL from scrub,
otherwise we risk a deadlock with reclaim; we must use GFP_NOFS instead.
Checking for pause requests is done early in the while loop of
scrub_stripe(), and later in the loop scrub_extent() is called, which in
turn calls scrub_pages(), which does memory allocations using GFP_KERNEL.
So use GFP_NOFS for the memory allocations if there are any scrub pause
requests.

Fixes: 58c4e173847a ("btrfs: scrub: use GFP_KERNEL on the submission path")
Signed-off-by: Filipe Manana 
---
 fs/btrfs/scrub.c | 26 +-
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 3be1456b5116..5fcb9d1eb983 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2204,13 +2204,26 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 {
	struct scrub_block *sblock;
	int index;
+   bool pause_req = (atomic_read(&sctx->fs_info->scrub_pause_req) != 0);
+   unsigned int nofs_flag;
+   int ret = 0;
+
+   /*
+* In order to avoid deadlock with reclaim when there is a transaction
+* trying to pause scrub, use GFP_NOFS. The pausing request is done when
+* the transaction commit starts, and it blocks the transaction until
+* scrub is paused (done at specific points at scrub_stripe()).
+*/
+   if (pause_req)
+   nofs_flag = memalloc_nofs_save();
 
sblock = kzalloc(sizeof(*sblock), GFP_KERNEL);
if (!sblock) {
spin_lock(&sctx->stat_lock);
sctx->stat.malloc_errors++;
spin_unlock(&sctx->stat_lock);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out;
}
 
/* one ref inside this function, plus one for each page added to
@@ -2230,7 +2243,8 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
sctx->stat.malloc_errors++;
spin_unlock(&sctx->stat_lock);
scrub_block_put(sblock);
-   return -ENOMEM;
+   ret = -ENOMEM;
+   goto out;
}
BUG_ON(index >= SCRUB_MAX_PAGES_PER_BLOCK);
scrub_page_get(spage);
@@ -2269,12 +2283,11 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
} else {
for (index = 0; index < sblock->page_count; index++) {
struct scrub_page *spage = sblock->pagev[index];
-   int ret;
 
ret = scrub_add_page_to_rd_bio(sctx, spage);
if (ret) {
scrub_block_put(sblock);
-   return ret;
+   goto out;
}
}
 
@@ -2284,7 +2297,10 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len,
 
/* last one frees, either here or in bio completion for last page */
scrub_block_put(sblock);
-   return 0;
+ out:
+   if (pause_req)
+   memalloc_nofs_restore(nofs_flag);
+   return ret;
 }
 
 static void scrub_bio_end_io(struct bio *bio)
-- 
2.11.0



Re: Btrfs and fanotify filesystem watches

2018-11-23 Thread Amir Goldstein
On Fri, Nov 23, 2018 at 2:52 PM Jan Kara  wrote:
>
> Changed subject to better match what we discuss and added btrfs list to CC.
>
> On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> > On Thu, Nov 22, 2018 at 3:26 PM Jan Kara  wrote:
> > >
> > > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is 
> > > > > > associated with
> > > > > > the single super block struct, so all dentries in all subvolumes 
> > > > > > will
> > > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > > >
> > > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > > >
> > > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > > > > /* Mask in the root object ID too, to disambiguate subvols */
> > > > > buf->f_fsid.val[0] ^=
> > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > > > > buf->f_fsid.val[1] ^=
> > > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > > >
> > > > > So subvolume root is xored into the FSID. Thus dentries from different
> > > > > subvolumes are going to return different fsids...
> > > > >
> > > >
> > > > Right... how could I have missed that :-/
> > > >
> > > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > > Instead, I will use:
> > > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,...
> > >
> > > So what about my proposal to store fsid in the notification mark when it 
> > > is
> > > created and then use it when that mark results in event being generated?
> > > When mark is created, we have full path available, so getting proper fsid
> > > is trivial. Furthermore if the behavior is documented like that, it is
> > > fairly easy for userspace to track fsids it should care about and 
> > > translate
> > > them to proper file descriptors for open_by_handle().
> > >
> > > This would get rid of statfs() on every event creation (which I don't like
> > > much) and also avoids these problems "how to get fsid for inode". What do
> > > you think?
> > >
> >
> > That's interesting. I like the simplicity.
> > What happens when there are 2 btrfs subvols /subvol1 /subvol2
> > and the application obviously doesn't know about this and does:
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> > statfs("/subvol1",...);
> > fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> > statfs("/subvol2",...);
> >
> > Now the second fanotify_mark() call just updates the existing super block
> > mark mask, but does not change the fsid on the mark, so all events will have
> > fsid of subvol1 that was stored when first creating the mark.
>
> Yes.
>
> > fanotify-watch application (for example) hashes the watches (paths) under
> > /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> > "watch" (i.e. path).
>
> I agree this can be confusing... but with btrfs fanotify-watch will be
> confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
> on /subvol1 (with fsid A) is going to return also events on inodes from
> /subvol2 (with fsid B). So your current code will return event with fsid B
> which fanotify-watch has no way to match back and can get confused.
>
> So currently application can get events with fsid it has never seen, with
> the code as I suggest it can get "wrong" fsid. That is confusing but still
> somewhat better?

It's not better IMO because the purpose of the fid in event is a unique key
that application can use to match a path it already indexed.
wrong fsid => no match.
Using open_by_handle_at() to resolve unknown fid is a limited solution
and I don't think it is going to work in this case (see below).

>
> The core of the problem is that btrfs does not have "the superblock
> identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
> that we could use.
>
> > And when trying to open_by_handle fid with fhandle from /subvol2
> > using mount_fd of /subvol1, I suppose we can either get ESTALE
> > or a disconnected dentry, because object from /subvol2 cannot
> > have a path inside /subvol1
>
> So open_by_handle() should work fine even if we get mount_fd of /subvol1
> and handle for inode on /subvol2. mount_fd in open_by_handle() is really
> only used to get the superblock and that is the same.

I don't think it will work fine.
do_handle_to_path() will compose a path from the /subvol1 mnt and a dentry
from /subvol2. This may resolve to a full path that does not really exist,
so the application cannot match it to anything it can use
name_to_handle_at() to identify.

The whole thing just sounds too messy. At the very least we need more time
to communicate with btrfs developers and figure this out, so I am going to
go 

Btrfs and fanotify filesystem watches

2018-11-23 Thread Jan Kara
Changed subject to better match what we discuss and added btrfs list to CC.

On Thu 22-11-18 17:18:25, Amir Goldstein wrote:
> On Thu, Nov 22, 2018 at 3:26 PM Jan Kara  wrote:
> >
> > On Thu 22-11-18 14:36:35, Amir Goldstein wrote:
> > > > > Regardless, IIUC, btrfs_statfs() returns an fsid which is associated 
> > > > > with
> > > > > the single super block struct, so all dentries in all subvolumes will
> > > > > return the same fsid: btrfs_sb(dentry->d_sb)->fsid.
> > > >
> > > > This is not true AFAICT. Looking at btrfs_statfs() I can see:
> > > >
> > > > buf->f_fsid.val[0] = be32_to_cpu(fsid[0]) ^ be32_to_cpu(fsid[2]);
> > > > buf->f_fsid.val[1] = be32_to_cpu(fsid[1]) ^ be32_to_cpu(fsid[3]);
> > > > /* Mask in the root object ID too, to disambiguate subvols */
> > > > buf->f_fsid.val[0] ^=
> > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid >> 32;
> > > > buf->f_fsid.val[1] ^=
> > > > BTRFS_I(d_inode(dentry))->root->root_key.objectid;
> > > >
> > > > So subvolume root is xored into the FSID. Thus dentries from different
> > > > subvolumes are going to return different fsids...
> > > >
> > >
> > > Right... how could I have missed that :-/
> > >
> > > I do not want to bring back FSNOTIFY_EVENT_DENTRY just for that
> > > and I saw how many flaws you pointed to in $SUBJECT patch.
> > > Instead, I will use:
> > > statfs_by_dentry(d_find_any_alias(inode) ?: inode->i_sb->sb_root,...
> >
> > So what about my proposal to store fsid in the notification mark when it is
> > created and then use it when that mark results in event being generated?
> > When mark is created, we have full path available, so getting proper fsid
> > is trivial. Furthermore if the behavior is documented like that, it is
> > fairly easy for userspace to track fsids it should care about and translate
> > them to proper file descriptors for open_by_handle().
> >
> > This would get rid of statfs() on every event creation (which I don't like
> > much) and also avoids these problems "how to get fsid for inode". What do
> > you think?
> >
> 
> That's interesting. I like the simplicity.
> What happens when there are 2 btrfs subvols /subvol1 /subvol2
> and the application obviously doesn't know about this and does:
> fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol1);
> statfs("/subvol1",...);
> fanotify_mark(fd, FAN_MARK_ADD|FAN_MARK_FILESYSTEM, ... /subvol2);
> statfs("/subvol2",...);
> 
> Now the second fanotify_mark() call just updates the existing super block
> mark mask, but does not change the fsid on the mark, so all events will have
> fsid of subvol1 that was stored when first creating the mark.

Yes.

> fanotify-watch application (for example) hashes the watches (paths) under
> /subvol2 by fid with fsid of subvol2, so events cannot get matched back to
> "watch" (i.e. path).

I agree this can be confusing... but with btrfs fanotify-watch will be
confused even with your current code, won't it? Because FAN_MARK_FILESYSTEM
on /subvol1 (with fsid A) is going to return also events on inodes from
/subvol2 (with fsid B). So your current code will return event with fsid B
which fanotify-watch has no way to match back and can get confused.

So currently application can get events with fsid it has never seen, with
the code as I suggest it can get "wrong" fsid. That is confusing but still
somewhat better?

The core of the problem is that btrfs does not have "the superblock
identifier" that would correspond to FAN_MARK_FILESYSTEM scope of events
that we could use.

> And when trying to open_by_handle fid with fhandle from /subvol2
> using mount_fd of /subvol1, I suppose we can either get ESTALE
> or a disconnected dentry, because object from /subvol2 cannot
> have a path inside /subvol1

So open_by_handle() should work fine even if we get mount_fd of /subvol1
and handle for inode on /subvol2. mount_fd in open_by_handle() is really
only used to get the superblock and that is the same.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] btrfs: remove btrfs_bio_end_io_t

2018-11-23 Thread David Sterba
On Fri, Nov 23, 2018 at 09:42:27AM +0100, Johannes Thumshirn wrote:
> The btrfs_bio_end_io_t typedef was introduced with commit
> a1d3c4786a4b ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
> but never used anywhere. This commit also introduced a forward declaration
> of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.
> 
> Remove both as they're not needed anywhere.
> 
> Signed-off-by: Johannes Thumshirn 

Added to misc-next, thanks.


Re: [[Missing subject]]

2018-11-23 Thread Qu Wenruo


On 2018/11/23 2:41 PM, Andy Leadbetter wrote:
> I have a failing 2TB disk that is part of a 4 disk RAID 6 system.  I
> have added a new 2TB disk to the computer, and started a BTRFS replace
> for the old and new disk.  The process starts correctly however some
> hours into the job, there is an error and kernel oops. relevant log
> below.
> 
> The disks are configured on top of bcache, in 5 arrays with a small
> 128GB SSD cache shared.  The system in this configuration has worked
> perfectly for 3 years, until 2 weeks ago csum errors started
> appearing.  I have a crashplan backup of all files on the disk, so I
> am not concerned about data loss, but I would like to avoid rebuild
> the system.
> 
> btrfs dev stats shows
> [/dev/bcache0].write_io_errs0
> [/dev/bcache0].read_io_errs 0
> [/dev/bcache0].flush_io_errs0
> [/dev/bcache0].corruption_errs  0
> [/dev/bcache0].generation_errs  0
> [/dev/bcache1].write_io_errs0
> [/dev/bcache1].read_io_errs 20
> [/dev/bcache1].flush_io_errs0
> [/dev/bcache1].corruption_errs  0
> [/dev/bcache1].generation_errs  14

Unfortunately, this is not just a sign of a degrading disk; something
really went wrong and screwed up some metadata.

For such a case, it's recommended to run "btrfs check --readonly" to
show how serious the problem is.

It could be some subvolume corruption, or some non-essential tree, but
either way the generation mismatch is a problem for which neither the
kernel nor btrfs-progs has a really good solution.

So at least please consider rebuilding the fs.

Regardless, please also provide the versions of all the kernels that
have run on this fs, along with the mount options used.

We have had some similar reports of such generation mismatches, but we
still don't have a convincing cause: candidates range from old kernels
to space cache corruption to power loss plus space cache corruption.

> [/dev/bcache3].write_io_errs0
> [/dev/bcache3].read_io_errs 0
> [/dev/bcache3].flush_io_errs0
> [/dev/bcache3].corruption_errs  0
> [/dev/bcache3].generation_errs  19
> [/dev/bcache2].write_io_errs0
> [/dev/bcache2].read_io_errs 0
> [/dev/bcache2].flush_io_errs0
> [/dev/bcache2].corruption_errs  0
> [/dev/bcache2].generation_errs  2
> 
> and a smart test of the backing disk /dev/bcache1 shows a high read
> error rate, and lot of reallocated sectors.  The disk is 10 years old,
> and has clearly started to fail.
> 
> I've tried the latest kernel, and the latest tools, but nothing will
> allow me to replace, or delete the failed disk.
> 
>   884.171025] BTRFS info (device bcache0): dev_replace from
> /dev/bcache1 (devid 2) to /dev/bcache4 started
> [ 3301.101958] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907
> [ 3301.241214] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907
> [ 3301.241398] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907
> [ 3301.241513] BTRFS error (device bcache0): parent transid verify
> failed on 8251260944384 wanted 640926 found 640907

If btrfs check --readonly only reports this problem, it may be possible
for us to fix it.

Please also do a tree block dump on this block by:
# btrfs ins dump-tree -b 8251260944384 /dev/bcache0

If btrfs check --readonly reports a lot of problems, then it's strongly
recommended to rebuild the filesystem.

Thanks,
Qu

> [ 3302.381094] BTRFS error (device bcache0):
> btrfs_scrub_dev(/dev/bcache1, 2, /dev/bcache4) failed -5
> [ 3302.394612] WARNING: CPU: 0 PID: 5936 at
> /build/linux-5s7Xkn/linux-4.15.0/fs/btrfs/dev-replace.c:413
> btrfs_dev_replace_start+0x281/0x320 [btrfs]
> [ 3302.394613] Modules linked in: btrfs zstd_compress xor raid6_pq
> bcache intel_rapl x86_pkg_temp_thermal intel_powerclamp
> snd_hda_codec_hdmi coretemp kvm_intel snd_hda_codec_realtek kvm
> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hwdep
> snd_pcm pcbc snd_seq_midi aesni_intel snd_seq_midi_event joydev
> input_leds aes_x86_64 snd_rawmidi crypto_simd glue_helper snd_seq
> eeepc_wmi cryptd asus_wmi snd_seq_device snd_timer wmi_bmof
> sparse_keymap snd intel_cstate intel_rapl_perf soundcore mei_me mei
> shpchp mac_hid sch_fq_codel acpi_pad parport_pc ppdev lp parport
> ip_tables x_tables autofs4 overlay nls_iso8859_1 dm_mirror
> dm_region_hash dm_log hid_generic usbhid hid uas usb_storage i915
> i2c_algo_bit drm_kms_helper syscopyarea sysfillrect
> [ 3302.394640]  sysimgblt fb_sys_fops r8169 mxm_wmi mii drm ahci
> libahci wmi video
> [ 3302.394646] CPU: 0 PID: 5936 Comm: btrfs Not tainted
> 4.15.0-20-generic #21-Ubuntu
> [ 3302.394646] Hardware name: System manufacturer System Product
> Name/H110M-R, BIOS 3404 10/10/2017
> [ 3302.394658] RIP: 0010:btrfs_dev_replace_start+0x281/0x320 [btrfs]
> [ 3302.394659] RSP: 0018:a8b582b5fd18 EFLAGS: 00010282
> [ 3302.394660] RAX: fffb RBX: 

Re:

2018-11-23 Thread Andy Leadbetter
Will capture all of that this evening, and try it with the latest
kernel and tools.  Thanks for the input on what info is relevant, will
gather it asap.
On Fri, 23 Nov 2018 at 07:53, Chris Murphy  wrote:
>
> On Thu, Nov 22, 2018 at 11:41 PM Andy Leadbetter
>  wrote:
> >
> > I have a failing 2TB disk that is part of a 4 disk RAID 6 system.  I
> > have added a new 2TB disk to the computer, and started a BTRFS replace
> > for the old and new disk.  The process starts correctly however some
> > hours into the job, there is an error and kernel oops. relevant log
> > below.
>
> The relevant log is the entire dmesg, not a snippet. It's decently
> likely there's more than one thing going on here. We also need full
> output of 'smartctl -x' for all four drives, and also 'smartctl -l
> scterc' for all four drives, and also 'cat
> /sys/block/sda/device/timeout' for all four drives. And which bcache
> mode you're using.
>
> The call trace provided is from kernel 4.15 which is sufficiently long
> ago I think any dev working on raid56 might want to see where it's
> getting tripped up on something a lot newer, and this is why:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/raid56.c?id=v4.19.3=v4.15.1
>
> That's a lot of changes in just the raid56 code between 4.15 and 4.19.
> And then in you call trace, btrfs_dev_replace_start is found in
> dev-replace.c which likewise has a lot of changes. But then also, I
> think 4.15 might still be in the era where it was not recommended to
> use 'btrfs dev replace' for raid56, only non-raid56. I'm not sure if
> the problems with device replace were fixed, and if they were fixed
> kernel or progs side. Anyway, the latest I recall, it was recommended
> on raid56 to 'btrfs dev add' then 'btrfs dev remove'.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/diff/fs/btrfs/dev-replace.c?id=v4.19.3=v4.15.1
>
> And that's only a few hundred changes for each. Check out inode.c -
> there are over 2000 changes.
>
>
> > The disks are configured on top of bcache, in 5 arrays with a small
> > 128GB SSD cache shared.  The system in this configuration has worked
> > perfectly for 3 years, until 2 weeks ago csum errors started
> > appearing.  I have a crashplan backup of all files on the disk, so I
> > am not concerned about data loss, but I would like to avoid rebuild
> > the system.
>
> btrfs-progs 4.17 still considers raid56 experimental, not for
> production use. And three years ago, the current upstream kernel
> released was 4.3 so I'm gonna guess the kernel history of this file
> system goes back older than that, very close to raid56 code birth. And
> then adding bcache to this mix just makes it all the more complicated.
>
>
>
> >
> > btrfs dev stats shows
> > [/dev/bcache0].write_io_errs0
> > [/dev/bcache0].read_io_errs 0
> > [/dev/bcache0].flush_io_errs0
> > [/dev/bcache0].corruption_errs  0
> > [/dev/bcache0].generation_errs  0
> > [/dev/bcache1].write_io_errs0
> > [/dev/bcache1].read_io_errs 20
> > [/dev/bcache1].flush_io_errs0
> > [/dev/bcache1].corruption_errs  0
> > [/dev/bcache1].generation_errs  14
> > [/dev/bcache3].write_io_errs0
> > [/dev/bcache3].read_io_errs 0
> > [/dev/bcache3].flush_io_errs0
> > [/dev/bcache3].corruption_errs  0
> > [/dev/bcache3].generation_errs  19
> > [/dev/bcache2].write_io_errs0
> > [/dev/bcache2].read_io_errs 0
> > [/dev/bcache2].flush_io_errs0
> > [/dev/bcache2].corruption_errs  0
> > [/dev/bcache2].generation_errs  2
>
>
> 3 of 4 drives have at least one generation error. While there are no
> corruptions reported, generation errors can be really tricky to
> recover from at all. If only one device had only read errors, this
> would be a lot less difficult.
>
>
> > I've tried the latest kernel, and the latest tools, but nothing will
> > allow me to replace, or delete the failed disk.
>
> If the file system is mounted, I would try to make a local backup ASAP
> before you lose the whole volume. Whether it's LVM pool of two drives
> (linear/concat) with XFS, or if you go with Btrfs -dsingle -mraid1
> (also basically a concat) doesn't really matter, but I'd get whatever
> you can off the drive. I expect avoiding a rebuild in some form or
> another is very wishful thinking and not very likely.
>
> The more changes are made to the file system, repair attempts or
> otherwise writing to it, decreases the chance of recovery.
>
> --
> Chris Murphy


Re: [PATCH] btrfs: remove btrfs_bio_end_io_t

2018-11-23 Thread Nikolay Borisov



On 23.11.18, 10:42, Johannes Thumshirn wrote:
> The btrfs_bio_end_io_t typedef was introduced with commit
> a1d3c4786a4b ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
> but never used anywhere. This commit also introduced a forward declaration
> of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.
> 
> Remove both as they're not needed anywhere.
> 
> Signed-off-by: Johannes Thumshirn 

Reviewed-by: Nikolay Borisov 

> ---
>  fs/btrfs/volumes.h | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 8b092bb1e2ee..c93097b0b469 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -295,9 +295,6 @@ struct btrfs_bio_stripe {
>   u64 length; /* only used for discard mappings */
>  };
>  
> -struct btrfs_bio;
> -typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
> -
>  struct btrfs_bio {
>   refcount_t refs;
>   atomic_t stripes_pending;
> 


[PATCH v2] btrfs: improve error handling of btrfs_add_link()

2018-11-23 Thread Johannes Thumshirn
err holds the return value of either btrfs_del_root_ref() or
btrfs_del_inode_ref(), but it hasn't been checked since its
introduction with commit fe66a05a0679 ("Btrfs: improve error handling
for btrfs_insert_dir_item callers") in 2012.

Instead of silently ignoring the return values, print a message so the user
knows what kind of error was encountered.

Link: https://lore.kernel.org/linux-btrfs/20181119141323.gc24...@twin.jikos.cz
Signed-off-by: Johannes Thumshirn 
---

Changes to v1:
* Only print an error message and let the callers abort the transaction (Dave)
---
 fs/btrfs/inode.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 9becf8543489..8ca2f82b35a3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6351,6 +6351,7 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
struct btrfs_root *root = parent_inode->root;
u64 ino = btrfs_ino(inode);
u64 parent_ino = btrfs_ino(parent_inode);
+   int err;
 
if (unlikely(ino == BTRFS_FIRST_FREE_OBJECTID)) {
memcpy(&key, &inode->root->root_key, sizeof(key));
@@ -6395,17 +6396,25 @@ int btrfs_add_link(struct btrfs_trans_handle *trans,
 fail_dir_item:
if (unlikely(ino == BTRFS_FIRST_FREE_OBJECTID)) {
u64 local_index;
-   int err;
+
err = btrfs_del_root_ref(trans, key.objectid,
 root->root_key.objectid, parent_ino,
&local_index, name, name_len);
+   if (err)
+   btrfs_info(trans->fs_info,
+  "failed to delete reference to %.*s, root %llu ref %llu",
+  name_len, name, key.objectid,
+  root->root_key.objectid);
 
} else if (add_backref) {
u64 local_index;
-   int err;
 
err = btrfs_del_inode_ref(trans, root, name, name_len,
ino, parent_ino, &local_index);
+   if (err)
+   btrfs_info(trans->fs_info,
+  "failed to delete reference to %.*s, inode %llu parent %llu",
+  name_len, name, ino, parent_ino);
}
return ret;
 }
-- 
2.16.4



[PATCH] btrfs: remove btrfs_bio_end_io_t

2018-11-23 Thread Johannes Thumshirn
The btrfs_bio_end_io_t typedef was introduced with commit
a1d3c4786a4b ("btrfs: btrfs_multi_bio replaced with btrfs_bio")
but never used anywhere. This commit also introduced a forward declaration
of 'struct btrfs_bio' which is only needed for btrfs_bio_end_io_t.

Remove both as they're not needed anywhere.

Signed-off-by: Johannes Thumshirn 
---
 fs/btrfs/volumes.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 8b092bb1e2ee..c93097b0b469 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -295,9 +295,6 @@ struct btrfs_bio_stripe {
u64 length; /* only used for discard mappings */
 };
 
-struct btrfs_bio;
-typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
-
 struct btrfs_bio {
refcount_t refs;
atomic_t stripes_pending;
-- 
2.16.4



Re: [PATCH 1/4] btrfs: merge btrfs_submit_bio_done to its caller

2018-11-23 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 
-- 
Johannes Thumshirn                        SUSE Labs
jthumsh...@suse.de                        +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 2/4] btrfs: replace async_cow::root with fs_info

2018-11-23 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 


Re: [PATCH 3/4] btrfs: remove redundant csum buffer in btrfs_io_bio

2018-11-23 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 


Re: [PATCH 4/4] btrfs: replace btrfs_io_bio::end_io with a simple helper

2018-11-23 Thread Johannes Thumshirn
Looks good,
Reviewed-by: Johannes Thumshirn 