Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-04 Thread Bob Peterson
Hi,

- Original Message -
> > Hi Steve,
> >
> > Another possibility is to add a new inode work_queue which does the work of
> > gfs2_clear_inode(). Its only function would be to clear the inode, and
> > therefore, there should not be any competing work, as there is in the case
> > of
> > a glock. The gfs2_clear_inode() function would then queue the work and
> > return
> > to its caller, and therefore it wouldn't block. The work that's been queued
> > might block on DLM, but it shouldn't matter, in theory, since the shrinker
> > will complete, which will free up the fence, which will free up dlm, which
> > will
> > free up everything. What do you think?
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> 
> Well that is a possibility. There is another dependency cycle there
> however, which is the one relating to the inode lifetime, and we need to
> be careful that we don't break that in attempting to fix the memory
> allocation at fence time issue. I know it isn't easy, but we need to be
> very careful to avoid introducing any new races, as they are likely to
> be very tricky to debug in this area.
> 
> One thing we could do, is to be more careful about when we take this
> delayed path. We could, for example, add a check to see if we are
> running in the context of the shrinker, and only delay the eviction in
> that specific case. I was wondering whether we could move the decision
> further up the chain too, and avoid evicting inodes with zero link count
> under memory pressure in the first place. The issue of needing to
> allocate memory to evict inodes is likely to be something that might
> affect other filesystems too, so it might be useful as a generic feature.
> 
> I'd prefer to avoid adding a new inode workqueue, however there is no
> reason why we could not add a new feature to an existing workqueue by
> adding a new inode or glock flag as appropriate (which will not expand
> the size of the inode) to deal with this case,
> 
> Steve.
> 
> 

I tried adding a work queue for the inodes, but it was a disaster and led
to other hangs.

Then I decided it made more sense to just add a new delayed work function
for the final put of the inode glock. That also led to problems, because
glocks may be reused for a different inode with the same block address,
since the glocks often outlive the inodes. What happened was that, given
enough memory pressure, the delayed work became backlogged, and the second
put couldn't be queued because a previous work item (queued with delay==0)
was still pending.
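
To illustrate the failure mode (this helper is hypothetical; glock_workqueue
and gl_work are the existing glock workqueue and per-glock work item):

#include <linux/printk.h>
#include <linux/workqueue.h>

/* queue_delayed_work() returns false when the work item is still pending
 * from an earlier request, so a second final-put request is silently
 * dropped and that glock reference is never released. */
static void queue_final_put(struct gfs2_glock *gl)
{
	if (!queue_delayed_work(glock_workqueue, &gl->gl_work, 0))
		pr_warn("GFS2: final put lost, work already queued\n");
}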

In the end, I decided that you're right: the safest approach is to
isolate this to the specific problem I'm trying to solve. We do our
business the same way, but when it comes to the final close, we check if
it's from the fenced process, and if so, queue the final put. That should
mitigate the problem.
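
To give a rough idea of what I mean (this is only a sketch, not the actual
prototype; the current->comm test, the reuse of gfs2_delete_workqueue, and
the gl_put_work field are placeholders for illustration):

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/string.h>
#include <linux/workqueue.h>

static void glock_final_put_work(struct work_struct *work)
{
	struct gfs2_glock *gl = container_of(work, struct gfs2_glock,
					     gl_put_work);

	gfs2_glock_put(gl);	/* the possibly-blocking dlm call happens
				 * here, in workqueue context */
}

static void gfs2_glock_put_noblock(struct gfs2_glock *gl)
{
	/* Only the fence daemon gets the deferred put; everyone else takes
	 * the normal path.  If the work is already queued, queue_work()
	 * does nothing and a put is lost, which is part of why this is
	 * only a stopgap. */
	if (strcmp(current->comm, "fenced") == 0)
		queue_work(gfs2_delete_workqueue, &gl->gl_put_work);
	else
		gfs2_glock_put(gl);
}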

Unfortunately, it's not good enough to do these things only when the link
count is zero: closing any file at any time could recreate the problem,
provided there was a pending fence action.

I'd also prefer it if there were a way to tell whether we were called from
the slab shrinker, but unfortunately, there's no good way to know without
wasting a lot of time traversing the call stack and such.
I did a fair amount of investigation here: there are a couple of per-task
flags I was hoping to leverage. PF_FSTRANS is used by NFS to detect
situations where it mustn't do memory allocations, but I don't think we can
use it. There's also PF_MEMALLOC, but that's set for the global shrinker, not
the slab shrinker, so I don't think we can use that either.
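
For illustration, the kind of test I was hoping for would have looked
something like this (both are flags on current->flags; the helper itself is
hypothetical):

#include <linux/sched.h>

/* Neither flag reliably identifies the slab shrinker path, which is why
 * this approach doesn't work for us. */
static bool called_from_reclaim(void)
{
	return (current->flags & (PF_MEMALLOC | PF_FSTRANS)) != 0;
}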

I investigated how ocfs2 handles this situation: turns out, it doesn't,
so they probably have the same vulnerability.

In doing all this investigation, I learned of another, similar problem
discovered in ocfs2: if the shrinker tries to ditch an inode from memory,
the file system calls into DLM, but DLM needs to do some comm work, and
that comm work needs to allocate memory for its buffers. In that case, the
use of those task flags (as above) was discouraged and NACKed in favor of
using a different set of memory-allocation flags, such as GFP_NOFS, etc.
I never did determine exactly how that ended up.
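
As I understand that discussion, the point of GFP_NOFS is that the
allocation itself cannot recurse back into filesystem reclaim, which breaks
the shrinker -> fs -> dlm -> allocation cycle at the allocation step. A
trivial sketch (not the actual dlm code; the helper name is made up):

#include <linux/slab.h>

static void *alloc_comm_buffer(size_t len)
{
	/* GFP_NOFS: the allocator may reclaim memory, but must not call
	 * back into filesystem code to do so. */
	return kmalloc(len, GFP_NOFS);
}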

In the meantime, I prototyped the following patch, and it works, but
it's a hack, and I'm not too happy with it. So I'm going to see if I can
find a solution that would work for both GFS2 and ocfs2.

Regards,

Bob Peterson
Red Hat File Systems
---
commit 3142a311fa848d44b15e9f86fc3599b7d9b8b3f5
Author: Bob Peterson 
Date:   Mon Nov 30 12:04:09 2015 -0600

GFS2: Make gfs2_clear_inode() not block on final glock put

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
function that avoids a possible deadlock that would occur should
it call dlm during a fence operation: dlm waits for a fence
operation, the fence operation waits for memory, the shrinker
waits for gfs2 to free an inode, and gfs2 is blocked waiting on dlm.

Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-04 Thread David Teigland
On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> it's from the fenced process, and if so, queue the final put. That should
> mitigate the problem.

Bob, I'm perplexed by the focus on fencing; this issue is broader than
fencing, as I mentioned in bz 1255872.  Over the years that I've reported
these issues, rarely if ever have they involved fencing.  Any userland
process, not just the fencing process, can allocate memory, fall into the
general shrinking path, get into gfs2 and dlm, and end up blocked for some
undefined time.  That can cause problems in any number of ways.

The specific problem you're focused on may be one of the easier ways of
demonstrating the problem -- making the original userland process one of
the cluster-related processes that gfs2/dlm depend on, combined with
recovery when those processes do an especially large amount of work that
gfs2/dlm require.  But problems could occur if any process is forced to
unwittingly do this dlm work, not just a cluster-related process, and it
would not need to involve recovery (or fencing which is one small part of
it).

I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
for shrinking its cache and doing the dlm work, and would not do anything
from the generic shrinking paths because of this issue.  I don't think
it's reasonable to expect random, unsuspecting processes on the system to
perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
unpredictable.  I think gfs2 needs to do that kind of heavy lifting from
its own dedicated contexts, or from processes that are explicitly choosing
to use gfs2.



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-04 Thread Bob Peterson
- Original Message -
> On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> > it's from the fenced process, and if so, queue the final put. That should
> > mitigate the problem.
> 
> Bob, I'm perplexed by the focus on fencing; this issue is broader than
> fencing, as I mentioned in bz 1255872.  Over the years that I've reported
> these issues, rarely if ever have they involved fencing.  Any userland
> process, not just the fencing process, can allocate memory, fall into the
> general shrinking path, get into gfs2 and dlm, and end up blocked for some
> undefined time.  That can cause problems in any number of ways.
> 
> The specific problem you're focused on may be one of the easier ways of
> demonstrating the problem -- making the original userland process one of
> the cluster-related processes that gfs2/dlm depend on, combined with
> recovery when those processes do an especially large amount of work that
> gfs2/dlm require.  But problems could occur if any process is forced to
> unwittingly do this dlm work, not just a cluster-related process, and it
> would not need to involve recovery (or fencing which is one small part of
> it).
> 
> I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
> for shrinking its cache and doing the dlm work, and would not do anything
> from the generic shrinking paths because of this issue.  I don't think
> it's reasonable to expect random, unsuspecting processes on the system to
> perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
> unpredictable.  I think gfs2 needs to do that kind of heavy lifting from
> its own dedicated contexts, or from processes that are explicitly choosing
> to use gfs2.
> 
> 
Hi Dave,

Thanks for your input.
You're right, of course. The problem can hit any process that causes
the shrinker to run, not just fenced.

Yes, the original GFS2 did not have this problem. When gfs2_clear_inode
was called, it called gfs2_glock_schedule_for_reclaim, which queued the inode
glock onto a special reclaim queue handled by the gfs2_glockd daemon. The
glock's reference count was bumped to prevent gfs2_clear_inode() from being
the last one out.
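
In outline, that old scheme looked something like this (reconstructed from
memory, so the names and locking details are approximate, not the historical
code verbatim):

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

static LIST_HEAD(reclaim_list);
static DEFINE_SPINLOCK(reclaim_lock);
static DECLARE_WAIT_QUEUE_HEAD(reclaim_wait);

void gfs2_glock_schedule_for_reclaim(struct gfs2_glock *gl)
{
	gfs2_glock_hold(gl);	/* keep the glock alive past clear_inode */
	spin_lock(&reclaim_lock);
	list_add_tail(&gl->gl_reclaim, &reclaim_list);
	spin_unlock(&reclaim_lock);
	wake_up(&reclaim_wait);	/* gfs2_glockd wakes up, drains the list and
				 * does the blocking puts itself, so callers
				 * never block on dlm */
}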

The problem with that approach is that it's a centralized list, which means
there's no parallelism, and a backlog of glocks waiting to be reclaimed can
build up. If one of those glocks needs to be reacquired because its block
was reused for a different inode, we could end up reaping glocks that are
still being (re)used.

We could still do that, but either we reintroduce the gfs2_glockd daemon,
or give our existing daemon, gfs2_quotad, more responsibilities, which would
make it a bit uglier and more complex than it is today.

My previous attempts to solve this involved using a work queue and queuing
the final gfs2_glock_put(). That fixed the problem in all cases, except when
there was already work queued for that glock.

When you think about it, using delayed work accomplishes the same thing,
but with parallelism, when it works. Perhaps I just need to focus on a way
to allow the work to be queued multiple times (in the ideal case) or,
alternatively, on an atomic counter that tracks the number of times the
work should be executed. Or something similar.
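
Something along these lines, perhaps (only a sketch of the counter idea;
gl_put_count and gl_put_work are hypothetical fields, and reusing
gfs2_delete_workqueue is an assumption):

#include <linux/atomic.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

static void glock_put_work(struct work_struct *work)
{
	struct gfs2_glock *gl = container_of(work, struct gfs2_glock,
					     gl_put_work);
	int n = atomic_xchg(&gl->gl_put_count, 0);

	while (n--)
		gfs2_glock_put(gl);	/* may block on dlm, but that's fine
					 * in workqueue context */
}

static void gfs2_glock_put_deferred(struct gfs2_glock *gl)
{
	/* Only the request that takes the count from 0 to 1 queues the
	 * work; later requests just bump the count, and the queued (or
	 * re-queued) work drains it in one pass, so no put is lost even
	 * if queue_work() would have refused a second queueing. */
	if (atomic_inc_return(&gl->gl_put_count) == 1)
		queue_work(gfs2_delete_workqueue, &gl->gl_put_work);
}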

Regards,

Bob Peterson
Red Hat File Systems



[Cluster-devel] [GFS2 PATCH] GFS2: Release iopen glock in gfs2_create_inode error cases

2015-12-04 Thread Bob Peterson
Hi,

Some error cases in gfs2_create_inode were not unlocking the iopen
glock, which threw the reference count off. This patch adds the proper
unlock. The error logic in function gfs2_create_inode was also convoluted,
so this patch simplifies it. It also takes care of a bug in which
gfs2_qa_delete() was not called in an error case.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson 
---
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index a8ce2e9..1cc6d8e 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -593,7 +593,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
struct gfs2_inode *dip = GFS2_I(dir), *ip;
struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode);
struct gfs2_glock *io_gl;
-   int error, free_vfs_inode = 0;
+   int error, free_vfs_inode = 1;
u32 aflags = 0;
unsigned blocks = 1;
struct gfs2_diradd da = { .bh = NULL, .save_loc = 1, };
@@ -650,7 +650,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 
error = posix_acl_create(dir, &mode, &default_acl, &acl);
if (error)
-   goto fail_free_vfs_inode;
+   goto fail_gunlock;
 
ip = GFS2_I(inode);
error = gfs2_rsqa_alloc(ip);
@@ -738,6 +738,9 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
gfs2_set_iop(inode);
insert_inode_hash(inode);
 
+   free_vfs_inode = 0; /* After this point, the inode is no longer
+  considered free. Any failures need to undo
+  the gfs2 structures. */
if (default_acl) {
error = gfs2_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
posix_acl_release(default_acl);
@@ -771,11 +774,8 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
return error;
 
 fail_gunlock3:
-   gfs2_glock_dq_uninit(ghs + 1);
-   if (ip->i_gl)
-   gfs2_glock_put(ip->i_gl);
-   goto fail_gunlock;
-
+   gfs2_glock_dq_uninit(&ip->i_iopen_gh);
+   gfs2_glock_put(io_gl);
 fail_gunlock2:
gfs2_glock_dq_uninit(ghs + 1);
 fail_free_inode:
@@ -787,8 +787,6 @@ fail_free_acls:
posix_acl_release(default_acl);
if (acl)
posix_acl_release(acl);
-fail_free_vfs_inode:
-   free_vfs_inode = 1;
 fail_gunlock:
gfs2_dir_no_add(&da);
gfs2_glock_dq_uninit(ghs);



[Cluster-devel] [GFS2 PATCH] GFS2: Always use iopen glock for gl_deletes

2015-12-04 Thread Bob Peterson
Hi,

Before this patch, when function try_rgrp_unlink queued a glock for
delete_work to reclaim the space, it used the inode glock to do so.
That's different from the iopen callback, which uses the iopen glock
for the same purpose. We should be consistent and always use the
iopen glock. This may also spare us reference-counting problems with
the inode glock, since clear_glock does an extra glock_put() for the
inode glock.

Regards,

Bob Peterson
Red Hat File Systems
---
Signed-off-by: Bob Peterson 

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 42ac3bb..00b2f22 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1789,7 +1789,7 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
continue;
*last_unlinked = block;
 
-   error = gfs2_glock_get(sdp, block, &gfs2_inode_glops, CREATE, &gl);
+   error = gfs2_glock_get(sdp, block, &gfs2_iopen_glops, CREATE, &gl);
if (error)
continue;
 



[Cluster-devel] [GFS2 PATCH] GFS2: Don't do glock put when inode creation fails

2015-12-04 Thread Bob Peterson
Hi,

Currently, the error path of function gfs2_inode_lookup calls function
gfs2_glock_put corresponding to an earlier call to gfs2_glock_get for
the inode glock. That's wrong, because the error path also calls
iget_failed(), which eventually calls iput, which eventually calls
gfs2_evict_inode, which does another gfs2_glock_put. This double put
can throw the glock reference count off.

Regards,

Bob Peterson
Red Hat File Systems
---
Signed-off-by: Bob Peterson 

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 1cc6d8e..148a4d5 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -197,7 +197,6 @@ fail_iopen:
gfs2_glock_put(io_gl);
 fail_put:
ip->i_gl->gl_object = NULL;
-   gfs2_glock_put(ip->i_gl);
 fail:
iget_failed(inode);
return ERR_PTR(error);