Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-08 Thread Steven Whitehouse

Hi,

On 08/12/15 07:57, Dave Chinner wrote:

On Wed, Dec 02, 2015 at 11:42:13AM -0500, Bob Peterson wrote:

- Original Message -
(snip)

Please take a look at this
again and figure out what the problematic cycle of events is, and then
work out how to avoid that happening in the first place. There is no
point in replacing one problem with another one, particularly one which
would likely be very tricky to debug,

Steve.

Rhe problematic cycle of events is well known:
gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
but if it's the very last put, it calls into dlm, which can block,
and that's where we get into trouble.

The livelock goes like this:

1. A fence operation needs memory, so it blocks on memory allocation.
2. Memory allocation blocks on slab shrinker.
3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.



7. dlm blocks on a pending fence operation. Goto 1.

Therefore, the fence operation should be doing GFP_NOFS allocations
to prevent re-entry into the DLM via the filesystem via the shrinker

Cheers,

Dave.


Which would be ideal, but how do you do that from user space?

Steve.



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-07 Thread Dave Chinner
On Wed, Dec 02, 2015 at 11:42:13AM -0500, Bob Peterson wrote:
> - Original Message -
> (snip)
> > Please take a look at this
> > again and figure out what the problematic cycle of events is, and then
> > work out how to avoid that happening in the first place. There is no
> > point in replacing one problem with another one, particularly one which
> > would likely be very tricky to debug,
> > 
> > Steve.
> 
> Rhe problematic cycle of events is well known:
> gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
> but if it's the very last put, it calls into dlm, which can block,
> and that's where we get into trouble.
> 
> The livelock goes like this:
> 
> 1. A fence operation needs memory, so it blocks on memory allocation.
> 2. Memory allocation blocks on slab shrinker.
> 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.

> 7. dlm blocks on a pending fence operation. Goto 1.

Therefore, the fence operation should be doing GFP_NOFS allocations
to prevent re-entry into the DLM via the filesystem via the shrinker

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-04 Thread Bob Peterson
Hi,

- Original Message -
> > Hi Steve,
> >
> > Another possibility is to add a new inode work_queue which does the work of
> > gfs2_clear_inode(). Its only function would be to clear the inode, and
> > therefore, there should not be any competing work, as there is in the case
> > of
> > a glock. The gfs2_clear_inode() function would then queue the work and
> > return
> > to its caller, and therefore it wouldn't block. The work that's been queued
> > might block on DLM, but it shouldn't matter, in theory, since the shrinker
> > will complete, which will free up the fence, which will free up dlm, which
> > will
> > free up everything. What do you think?
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> 
> Well that is a possibility. There is another dependency cycle there
> however, which is the one relating to the inode lifetime, and we need to
> be careful that we don't break that in attempting to fix the memory
> allocation at fence time issue. I know it isn't easy, but we need to be
> very careful to avoid introducing any new races, as they are likely to
> be very tricky to debug in this area.
> 
> One thing we could do, is to be more careful about when we take this
> delayed path. We could, for example, add a check to see if we are
> running in the context of the shrinker, and only delay the eviction in
> that specific case. I was wondering whether we could move the decision
> further up the chain too, and avoid evicting inodes with zero link count
> under memory pressure in the first place. The issue of needing to
> allocate memory to evict inodes is likely to be something that might
> affect other filesystems too, so it might be useful as a generic feature.
> 
> I'd prefer to avoid adding a new inode workqueue, however there is no
> reason why we could not add a new feature to an existing workqueue by
> adding a new inode or glock flag as appropriate (which will not expand
> the size of the inode) to deal with this case,
> 
> Steve.
> 
> 

I tried adding a work queue for the inodes, but it was a disaster and led
to other hangs.

Then I decided it made more sense to just add a new delayed work function
for the final put of the inode glock. That also led to problems because
glocks may be reused for a different inode with the same block address,
due to the fact that the glocks often outlive the inodes. What happened
was that, given enough pressure, the delayed work became backlogged and
the second put couldn't be queued due to a previous queue (with delay==0).

In the end, I decided that you're right: The safest approach is to
isolate this to the specific problem I'm trying to solve: to do our
business the same way, but when it comes to the final close, check if
it's from the fenced process, and if so, queue the final put. That should
mitigate the problem.

Unfortunately, it's not good enough to only do these things when the link
count is zero: Closing any file at any time could recreate the problem,
provided there was a pending fence action.

I'd also prefer if there was a way we could tell whether we were called
from the slab shrinker, but unfortunately, there's no good way to know
without wasting a lot of time traversing the call stack and such.
I did a fair amount of investigating this: there are a couple of bits
in the superblock that I was hoping to leverage: PF_FSTRANS is used by
NFS to detect situations where it mustn't do memory allocations, but
I don't think we can use it.
There's also PF_MEMALLOC, but that's set for the global shrinker, not the
slab shrinker, so I don't think we can use that either.

I investigated how ocfs2 handles this situation: turns out, it doesn't,
so they probably have the same vulnerability.

In doing all this investigation, I learned of another similar possible
problem discovered in ocfs2: If the shrinker tried to ditch an inode from
memory, then the file system calls into DLM, but DLM needs to do some
COMM stuff, and the comm stuff needs to allocate memory for comm buffers.
In that case, the use of superblock flags (as above) was discouraged and
NACKed in favor of using a different set of memory flags, such as GFP_NOFS,
etc. I never did determine exactly how this ended up.

In the meantime, I prototyped the following patch, and it works, but
it's a hack, and I'm not too happy with it. So I'm going to see if I can
find a solution that would work for both GFS2 and ocfs2.

Regards,

Bob Peterson
Red Hat File Systems
---
commit 3142a311fa848d44b15e9f86fc3599b7d9b8b3f5
Author: Bob Peterson 
Date:   Mon Nov 30 12:04:09 2015 -0600

GFS2: Make gfs2_clear_inode() not block on final glock put

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
function that avoids a possible deadlock that would occur should
it call dlm during a fence operation: dlm waits for a fence
operation, the fence operation waits for memory, the shrinker
waits for gfs2 to free an 

Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-04 Thread David Teigland
On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> it's from the fenced process, and if so, queue the final put. That should
> mitigate the problem.

Bob, I'm perplexed by the focus on fencing; this issue is broader than
fencing as I mentioned in bz 1255872.  Over the years that I've reported
these issues, rarely if ever have they involving fencing.  Any userland
process, not just the fencing process, can allocate memory, fall into the
general shrinking path, get into gfs2 and dlm, and end up blocked for some
undefined time.  That can cause problems in any number of ways.

The specific problem you're focused on may be one of the easier ways of
demonstrating the problem -- making the original userland process one of
the cluster-related processes that gfs2/dlm depend on, combined with
recovery when those processes do an especially large amount of work that
gfs2/dlm require.  But problems could occur if any process is forced to
unwittingly do this dlm work, not just a cluster-related process, and it
would not need to involve recovery (or fencing which is one small part of
it).

I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
for shrinking its cache and doing the dlm work, and would not do anything
from the generic shrinking paths because of this issue.  I don't think
it's reasonable to expect random, unsuspecting processes on the system to
perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
unpredictable.  I think gfs2 needs to do that kind of heavy lifting from
its own dedicated contexts, or from processes that are explicitly choosing
to use gfs2.



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-04 Thread Bob Peterson
- Original Message -
> On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> > it's from the fenced process, and if so, queue the final put. That should
> > mitigate the problem.
> 
> Bob, I'm perplexed by the focus on fencing; this issue is broader than
> fencing as I mentioned in bz 1255872.  Over the years that I've reported
> these issues, rarely if ever have they involving fencing.  Any userland
> process, not just the fencing process, can allocate memory, fall into the
> general shrinking path, get into gfs2 and dlm, and end up blocked for some
> undefined time.  That can cause problems in any number of ways.
> 
> The specific problem you're focused on may be one of the easier ways of
> demonstrating the problem -- making the original userland process one of
> the cluster-related processes that gfs2/dlm depend on, combined with
> recovery when those processes do an especially large amount of work that
> gfs2/dlm require.  But problems could occur if any process is forced to
> unwittingly do this dlm work, not just a cluster-related process, and it
> would not need to involve recovery (or fencing which is one small part of
> it).
> 
> I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
> for shrinking its cache and doing the dlm work, and would not do anything
> from the generic shrinking paths because of this issue.  I don't think
> it's reasonable to expect random, unsuspecting processes on the system to
> perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
> unpredictable.  I think gfs2 needs to do that kind of heavy lifting from
> its own dedicated contexts, or from processes that are explicitly choosing
> to use gfs2.
> 
> 
Hi Dave,

Thanks for your input.
You're right, of course. The problem can occur to other processes that cause
the shrinker to run, not just from fenced.

Yes, the original GFS2 did not have this problem. When gfs2_clear_inode
was called, it called gfs2_glock_schedule_for_reclaim, which queued the inode
glock to a special queue which was handled by the gfs2_glockd daemon. The
reference count was bumped to prevent the gfs2_clear_inode() from being the
last guy out.

The problem with that approach is that it's a centralized list, which means
there's no parallelism and there can get to be a backlog of glocks waiting
to be reclaimed. If that glock needs to be reacquired because the block was
reused for a different inode, we could end up reaping glocks that are still
being (re)used.

We could still do that, but either we reintroduce the gfs2_glockd daemon,
or give our existing daemon, gfs2_quotad, more responsibilities, which would
make it a bit uglier and more complex than it is today.

My previous attempts to solve this involved using a work queue and
queuing the final gfs2_glock_put() and that fixed the problem for all
cases, unless there was already work queued.

When you think about it, using delayed work accomplishes the same thing, but
with the parallelism. When it works. Perhaps I just need to focus on a way to
allow work to be queued multiple times (in an ideal case) or, alternatively,
an atomic counter that corresponds to the number of times the work should be
executed. Or something similar.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-03 Thread Steven Whitehouse

Hi,

On 02/12/15 17:41, Bob Peterson wrote:

- Original Message -

- Original Message -
(snip)

Please take a look at this
again and figure out what the problematic cycle of events is, and then
work out how to avoid that happening in the first place. There is no
point in replacing one problem with another one, particularly one which
would likely be very tricky to debug,

Steve.

Rhe problematic cycle of events is well known:
gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
but if it's the very last put, it calls into dlm, which can block,
and that's where we get into trouble.

The livelock goes like this:

1. A fence operation needs memory, so it blocks on memory allocation.
2. Memory allocation blocks on slab shrinker.
3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode.
5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's
glock.
6. gfs2_glock_put calls dlm unlock to unlock the glock.
7. dlm blocks on a pending fence operation. Goto 1.

So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody
needs to do the final put and tell dlm to unlock the inode's glock, which is
why
I've been trying to queue it to the delayed work queue.

If I can't do that, we're left with few alternatives: Perhaps a new
function of the quota daemon: to run the lru list and call dlm to unlock
any that have a special bit set, but that just seems ugly.

I've thought of some other alternatives, but they seem a lot uglier and
harder to manage. I'll give it some more thought.

Regards,

Bob Peterson
Red Hat File Systems


Hi Steve,

Another possibility is to add a new inode work_queue which does the work of
gfs2_clear_inode(). Its only function would be to clear the inode, and
therefore, there should not be any competing work, as there is in the case of
a glock. The gfs2_clear_inode() function would then queue the work and return
to its caller, and therefore it wouldn't block. The work that's been queued
might block on DLM, but it shouldn't matter, in theory, since the shrinker
will complete, which will free up the fence, which will free up dlm, which will
free up everything. What do you think?

Regards,

Bob Peterson
Red Hat File Systems


Well that is a possibility. There is another dependency cycle there 
however, which is the one relating to the inode lifetime, and we need to 
be careful that we don't break that in attempting to fix the memory 
allocation at fence time issue. I know it isn't easy, but we need to be 
very careful to avoid introducing any new races, as they are likely to 
be very tricky to debug in this area.


One thing we could do, is to be more careful about when we take this 
delayed path. We could, for example, add a check to see if we are 
running in the context of the shrinker, and only delay the eviction in 
that specific case. I was wondering whether we could move the decision 
further up the chain too, and avoid evicting inodes with zero link count 
under memory pressure in the first place. The issue of needing to 
allocate memory to evict inodes is likely to be something that might 
affect other filesystems too, so it might be useful as a generic feature.


I'd prefer to avoid adding a new inode workqueue, however there is no 
reason why we could not add a new feature to an existing workqueue by 
adding a new inode or glock flag as appropriate (which will not expand 
the size of the inode) to deal with this case,


Steve.



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-02 Thread Bob Peterson
- Original Message -
> - Original Message -
> (snip)
> > Please take a look at this
> > again and figure out what the problematic cycle of events is, and then
> > work out how to avoid that happening in the first place. There is no
> > point in replacing one problem with another one, particularly one which
> > would likely be very tricky to debug,
> > 
> > Steve.
> 
> Rhe problematic cycle of events is well known:
> gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
> but if it's the very last put, it calls into dlm, which can block,
> and that's where we get into trouble.
> 
> The livelock goes like this:
> 
> 1. A fence operation needs memory, so it blocks on memory allocation.
> 2. Memory allocation blocks on slab shrinker.
> 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
> 4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode.
> 5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's
> glock.
> 6. gfs2_glock_put calls dlm unlock to unlock the glock.
> 7. dlm blocks on a pending fence operation. Goto 1.
> 
> So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody
> needs to do the final put and tell dlm to unlock the inode's glock, which is
> why
> I've been trying to queue it to the delayed work queue.
> 
> If I can't do that, we're left with few alternatives: Perhaps a new
> function of the quota daemon: to run the lru list and call dlm to unlock
> any that have a special bit set, but that just seems ugly.
> 
> I've thought of some other alternatives, but they seem a lot uglier and
> harder to manage. I'll give it some more thought.
> 
> Regards,
> 
> Bob Peterson
> Red Hat File Systems
> 

Hi Steve,

Another possibility is to add a new inode work_queue which does the work of
gfs2_clear_inode(). Its only function would be to clear the inode, and
therefore, there should not be any competing work, as there is in the case of
a glock. The gfs2_clear_inode() function would then queue the work and return
to its caller, and therefore it wouldn't block. The work that's been queued
might block on DLM, but it shouldn't matter, in theory, since the shrinker
will complete, which will free up the fence, which will free up dlm, which will
free up everything. What do you think?

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-02 Thread Bob Peterson
- Original Message -
(snip)
> Please take a look at this
> again and figure out what the problematic cycle of events is, and then
> work out how to avoid that happening in the first place. There is no
> point in replacing one problem with another one, particularly one which
> would likely be very tricky to debug,
> 
> Steve.

Rhe problematic cycle of events is well known:
gfs2_clear_inode calls gfs2_glock_put() for the inode's glock,
but if it's the very last put, it calls into dlm, which can block,
and that's where we get into trouble.

The livelock goes like this:

1. A fence operation needs memory, so it blocks on memory allocation.
2. Memory allocation blocks on slab shrinker.
3. Slab shrinker calls into vfs inode shrinker to free inodes from memory.
4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode.
5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's glock.
6. gfs2_glock_put calls dlm unlock to unlock the glock.
7. dlm blocks on a pending fence operation. Goto 1.

So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody
needs to do the final put and tell dlm to unlock the inode's glock, which is why
I've been trying to queue it to the delayed work queue.

If I can't do that, we're left with few alternatives: Perhaps a new
function of the quota daemon: to run the lru list and call dlm to unlock
any that have a special bit set, but that just seems ugly.

I've thought of some other alternatives, but they seem a lot uglier and
harder to manage. I'll give it some more thought.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-02 Thread Steven Whitehouse

HI,

On 01/12/15 15:42, Bob Peterson wrote:

- Original Message -

Hi,

On 25/11/15 14:22, Bob Peterson wrote:

- Original Message -

Hi,

On 19/11/15 18:42, Bob Peterson wrote:

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put directly() most of the time, it queues
the glock to the delayed work queue. That avoids a possible
deadlock where it calls dlm during a fence operation:
dlm waits for a fence operation, the fence operation waits for
memory, the shrinker waits for gfs2 to free an inode from memory,
but gfs2 waits for dlm.

Signed-off-by: Bob Peterson 
---
fs/gfs2/glock.c | 34 +-
fs/gfs2/glock.h |  1 +
fs/gfs2/super.c |  5 -
3 files changed, 22 insertions(+), 18 deletions(-)

[snip]
Most of the patch seems to just rename the workqueue which makes it
tricky to spot the other changes. However, the below code seems to be
the new bit..


diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 9d5c3f7..46e5004 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -24,6 +24,7 @@
#include 
#include 
#include 
+#include 
#include 
#include 
#include 
@@ -1614,7 +1615,9 @@ out:
ip->i_gl->gl_object = NULL;
flush_delayed_work(>i_gl->gl_work);
gfs2_glock_add_to_lru(ip->i_gl);
-   gfs2_glock_put(ip->i_gl);
+   if (queue_delayed_work(gfs2_glock_workqueue,
+  >i_gl->gl_work, 0) == 0)
+   gfs2_glock_put(ip->i_gl);
ip->i_gl = NULL;
if (ip->i_iopen_gh.gh_gl) {
ip->i_iopen_gh.gh_gl->gl_object = NULL;

which replaces a put with a queue & put if the queue fails (due to it
being already on the queue) which doesn't look quite right to be since
if calling gfs2_glock_put() was not safe before, then calling it
conditionally like this is still no safer I think?

Steve.

Hi,

The call to gfs2_glock_put() in this case should be safe.

If queuing the delayed work fails, it means the glock reference count is
greater than 1, to be decremented when the glock state machine runs.
Which means this can't be the final glock_put().
Which means we can't possibly call into DLM, which means we can't block.
Which means it's safe.

Regards,

Bob Peterson
Red Hat File Systems

There is no reason that this cannot be the final glock put, since there
is no synchronization with the work that has been queued, so it might
well have run and decremented the ref count before we return from the
queuing function. It is unlikely that will be the case, but it is still
possible,

Steve.


Hi Steve,

It's kind of an ugly hack, but can we do something like the patch below instead?

Regards,

Bob Peterson
Red Hat File Systems
---
commit 1949050b4b13c1b32ea45987fbf2936ae779609e
Author: Bob Peterson 
Date:   Thu Nov 19 12:06:31 2015 -0600

GFS2: Make gfs2_clear_inode() not block on final glock put

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
function that avoids a possible deadlock that would occur should
it call dlm during a fence operation: dlm waits for a fence
operation, the fence operation waits for memory, the shrinker
waits for gfs2 to free an inode from memory, but gfs2 waits for
dlm. The new non-blocking glock_put does this:

1. It acquires the lockref to ensure no one else is messing with it.
2. If the lockref is put (not locked) it can safely return because
it is not the last reference to the glock.
3. If this is the last reference, it tries to queue delayed work for
the glock.
4. If it was able to queue the delayed work, it's safe to return
because the glock_work_func will run in another process, so
this one cannot block.
5. If it was unable to queue the delayed work, it needs to schedule
and start the whole process again.

Signed-off-by: Bob Peterson 

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index a4ff7b5..22870c6 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -178,6 +178,27 @@ void gfs2_glock_put(struct gfs2_glock *gl)
  }
  
  /**

+ * gfs2_glock_put_noblock() - Decrement reference count on glock
+ * @gl: The glock to put
+ *
+ * This is the same as gfs2_glock_put() but it's not allowed to block
+ */
+
+void gfs2_glock_put_noblock(struct gfs2_glock *gl)
+{
+   while (1) {
+   if (lockref_put_or_lock(>gl_lockref))
+   break;
+
+   spin_unlock(>gl_lockref.lock);

That just drops the ref count without doing anything.


+   if (queue_delayed_work(glock_workqueue, >gl_work, 0) != 0)
+   break;
You can't call queue_delayed_work on a glock for which you don't have a 
ref count - it might not exist any more. Please take a look at this 
again and figure out what the problematic cycle of events is, and then 
work out how to avoid that happening in the first place. There is 

Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-12-01 Thread Bob Peterson
- Original Message -
> Hi,
> 
> On 25/11/15 14:22, Bob Peterson wrote:
> > - Original Message -
> >> Hi,
> >>
> >> On 19/11/15 18:42, Bob Peterson wrote:
> >>> This patch changes function gfs2_clear_inode() so that instead
> >>> of calling gfs2_glock_put directly() most of the time, it queues
> >>> the glock to the delayed work queue. That avoids a possible
> >>> deadlock where it calls dlm during a fence operation:
> >>> dlm waits for a fence operation, the fence operation waits for
> >>> memory, the shrinker waits for gfs2 to free an inode from memory,
> >>> but gfs2 waits for dlm.
> >>>
> >>> Signed-off-by: Bob Peterson 
> >>> ---
> >>>fs/gfs2/glock.c | 34 +-
> >>>fs/gfs2/glock.h |  1 +
> >>>fs/gfs2/super.c |  5 -
> >>>3 files changed, 22 insertions(+), 18 deletions(-)
> >> [snip]
> >> Most of the patch seems to just rename the workqueue which makes it
> >> tricky to spot the other changes. However, the below code seems to be
> >> the new bit..
> >>
> >>> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> >>> index 9d5c3f7..46e5004 100644
> >>> --- a/fs/gfs2/super.c
> >>> +++ b/fs/gfs2/super.c
> >>> @@ -24,6 +24,7 @@
> >>>#include 
> >>>#include 
> >>>#include 
> >>> +#include 
> >>>#include 
> >>>#include 
> >>>#include 
> >>> @@ -1614,7 +1615,9 @@ out:
> >>>   ip->i_gl->gl_object = NULL;
> >>>   flush_delayed_work(>i_gl->gl_work);
> >>>   gfs2_glock_add_to_lru(ip->i_gl);
> >>> - gfs2_glock_put(ip->i_gl);
> >>> + if (queue_delayed_work(gfs2_glock_workqueue,
> >>> +>i_gl->gl_work, 0) == 0)
> >>> + gfs2_glock_put(ip->i_gl);
> >>>   ip->i_gl = NULL;
> >>>   if (ip->i_iopen_gh.gh_gl) {
> >>>   ip->i_iopen_gh.gh_gl->gl_object = NULL;
> >> which replaces a put with a queue & put if the queue fails (due to it
> >> being already on the queue) which doesn't look quite right to be since
> >> if calling gfs2_glock_put() was not safe before, then calling it
> >> conditionally like this is still no safer I think?
> >>
> >> Steve.
> > Hi,
> >
> > The call to gfs2_glock_put() in this case should be safe.
> >
> > If queuing the delayed work fails, it means the glock reference count is
> > greater than 1, to be decremented when the glock state machine runs.
> > Which means this can't be the final glock_put().
> > Which means we can't possibly call into DLM, which means we can't block.
> > Which means it's safe.
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
> 
> There is no reason that this cannot be the final glock put, since there
> is no synchronization with the work that has been queued, so it might
> well have run and decremented the ref count before we return from the
> queuing function. It is unlikely that will be the case, but it is still
> possible,
> 
> Steve.
> 

Hi Steve,

It's kind of an ugly hack, but can we do something like the patch below instead?

Regards,

Bob Peterson
Red Hat File Systems
---
commit 1949050b4b13c1b32ea45987fbf2936ae779609e
Author: Bob Peterson 
Date:   Thu Nov 19 12:06:31 2015 -0600

GFS2: Make gfs2_clear_inode() not block on final glock put

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock
function that avoids a possible deadlock that would occur should
it call dlm during a fence operation: dlm waits for a fence
operation, the fence operation waits for memory, the shrinker
waits for gfs2 to free an inode from memory, but gfs2 waits for
dlm. The new non-blocking glock_put does this:

1. It acquires the lockref to ensure no one else is messing with it.
2. If the lockref is put (not locked) it can safely return because
   it is not the last reference to the glock.
3. If this is the last reference, it tries to queue delayed work for
   the glock.
4. If it was able to queue the delayed work, it's safe to return
   because the glock_work_func will run in another process, so
   this one cannot block.
5. If it was unable to queue the delayed work, it needs to schedule
   and start the whole process again.

Signed-off-by: Bob Peterson 

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index a4ff7b5..22870c6 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -178,6 +178,27 @@ void gfs2_glock_put(struct gfs2_glock *gl)
 }
 
 /**
+ * gfs2_glock_put_noblock() - Decrement reference count on glock
+ * @gl: The glock to put
+ *
+ * This is the same as gfs2_glock_put() but it's not allowed to block
+ */
+
+void gfs2_glock_put_noblock(struct gfs2_glock *gl)
+{
+   while (1) {
+   if (lockref_put_or_lock(>gl_lockref))
+   break;
+
+   spin_unlock(>gl_lockref.lock);
+   if (queue_delayed_work(glock_workqueue, >gl_work, 0) != 0)
+   break;
+
+   cond_resched();
+   }
+}
+
+/**

Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-11-25 Thread Steven Whitehouse

Hi,

On 25/11/15 14:22, Bob Peterson wrote:

- Original Message -

Hi,

On 19/11/15 18:42, Bob Peterson wrote:

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put directly() most of the time, it queues
the glock to the delayed work queue. That avoids a possible
deadlock where it calls dlm during a fence operation:
dlm waits for a fence operation, the fence operation waits for
memory, the shrinker waits for gfs2 to free an inode from memory,
but gfs2 waits for dlm.

Signed-off-by: Bob Peterson 
---
   fs/gfs2/glock.c | 34 +-
   fs/gfs2/glock.h |  1 +
   fs/gfs2/super.c |  5 -
   3 files changed, 22 insertions(+), 18 deletions(-)

[snip]
Most of the patch seems to just rename the workqueue which makes it
tricky to spot the other changes. However, the below code seems to be
the new bit..


diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 9d5c3f7..46e5004 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -24,6 +24,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
   #include 
@@ -1614,7 +1615,9 @@ out:
ip->i_gl->gl_object = NULL;
flush_delayed_work(>i_gl->gl_work);
gfs2_glock_add_to_lru(ip->i_gl);
-   gfs2_glock_put(ip->i_gl);
+   if (queue_delayed_work(gfs2_glock_workqueue,
+  >i_gl->gl_work, 0) == 0)
+   gfs2_glock_put(ip->i_gl);
ip->i_gl = NULL;
if (ip->i_iopen_gh.gh_gl) {
ip->i_iopen_gh.gh_gl->gl_object = NULL;

which replaces a put with a queue & put if the queue fails (due to it
being already on the queue) which doesn't look quite right to be since
if calling gfs2_glock_put() was not safe before, then calling it
conditionally like this is still no safer I think?

Steve.

Hi,

The call to gfs2_glock_put() in this case should be safe.

If queuing the delayed work fails, it means the glock reference count is
greater than 1, to be decremented when the glock state machine runs.
Which means this can't be the final glock_put().
Which means we can't possibly call into DLM, which means we can't block.
Which means it's safe.

Regards,

Bob Peterson
Red Hat File Systems


There is no reason that this cannot be the final glock put, since there 
is no synchronization with the work that has been queued, so it might 
well have run and decremented the ref count before we return from the 
queuing function. It is unlikely that will be the case, but it is still 
possible,


Steve.



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-11-25 Thread Bob Peterson
- Original Message -
> Hi,
> 
> On 19/11/15 18:42, Bob Peterson wrote:
> > This patch changes function gfs2_clear_inode() so that instead
> > of calling gfs2_glock_put directly() most of the time, it queues
> > the glock to the delayed work queue. That avoids a possible
> > deadlock where it calls dlm during a fence operation:
> > dlm waits for a fence operation, the fence operation waits for
> > memory, the shrinker waits for gfs2 to free an inode from memory,
> > but gfs2 waits for dlm.
> >
> > Signed-off-by: Bob Peterson 
> > ---
> >   fs/gfs2/glock.c | 34 +-
> >   fs/gfs2/glock.h |  1 +
> >   fs/gfs2/super.c |  5 -
> >   3 files changed, 22 insertions(+), 18 deletions(-)
> [snip]
> Most of the patch seems to just rename the workqueue which makes it
> tricky to spot the other changes. However, the below code seems to be
> the new bit..
> 
> > diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> > index 9d5c3f7..46e5004 100644
> > --- a/fs/gfs2/super.c
> > +++ b/fs/gfs2/super.c
> > @@ -24,6 +24,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   #include 
> >   #include 
> > @@ -1614,7 +1615,9 @@ out:
> > ip->i_gl->gl_object = NULL;
> > flush_delayed_work(>i_gl->gl_work);
> > gfs2_glock_add_to_lru(ip->i_gl);
> > -   gfs2_glock_put(ip->i_gl);
> > +   if (queue_delayed_work(gfs2_glock_workqueue,
> > +  >i_gl->gl_work, 0) == 0)
> > +   gfs2_glock_put(ip->i_gl);
> > ip->i_gl = NULL;
> > if (ip->i_iopen_gh.gh_gl) {
> > ip->i_iopen_gh.gh_gl->gl_object = NULL;
> 
> which replaces a put with a queue & put if the queue fails (due to it
> being already on the queue) which doesn't look quite right to be since
> if calling gfs2_glock_put() was not safe before, then calling it
> conditionally like this is still no safer I think?
> 
> Steve.

Hi,

The call to gfs2_glock_put() in this case should be safe.

If queuing the delayed work fails, it means the glock reference count is
greater than 1, to be decremented when the glock state machine runs.
Which means this can't be the final glock_put().
Which means we can't possibly call into DLM, which means we can't block.
Which means it's safe.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-11-20 Thread Steven Whitehouse

Hi,

On 19/11/15 18:42, Bob Peterson wrote:

This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put directly() most of the time, it queues
the glock to the delayed work queue. That avoids a possible
deadlock where it calls dlm during a fence operation:
dlm waits for a fence operation, the fence operation waits for
memory, the shrinker waits for gfs2 to free an inode from memory,
but gfs2 waits for dlm.

Signed-off-by: Bob Peterson 
---
  fs/gfs2/glock.c | 34 +-
  fs/gfs2/glock.h |  1 +
  fs/gfs2/super.c |  5 -
  3 files changed, 22 insertions(+), 18 deletions(-)

[snip]
Most of the patch seems to just rename the workqueue which makes it 
tricky to spot the other changes. However, the below code seems to be 
the new bit..



diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 9d5c3f7..46e5004 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -24,6 +24,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -1614,7 +1615,9 @@ out:
ip->i_gl->gl_object = NULL;
flush_delayed_work(>i_gl->gl_work);
gfs2_glock_add_to_lru(ip->i_gl);
-   gfs2_glock_put(ip->i_gl);
+   if (queue_delayed_work(gfs2_glock_workqueue,
+  >i_gl->gl_work, 0) == 0)
+   gfs2_glock_put(ip->i_gl);
ip->i_gl = NULL;
if (ip->i_iopen_gh.gh_gl) {
ip->i_iopen_gh.gh_gl->gl_object = NULL;


which replaces a put with a queue & put if the queue fails (due to it 
being already on the queue) which doesn't look quite right to be since 
if calling gfs2_glock_put() was not safe before, then calling it 
conditionally like this is still no safer I think?


Steve.



[Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put

2015-11-19 Thread Bob Peterson
This patch changes function gfs2_clear_inode() so that instead
of calling gfs2_glock_put directly() most of the time, it queues
the glock to the delayed work queue. That avoids a possible
deadlock where it calls dlm during a fence operation:
dlm waits for a fence operation, the fence operation waits for
memory, the shrinker waits for gfs2 to free an inode from memory,
but gfs2 waits for dlm.

Signed-off-by: Bob Peterson 
---
 fs/gfs2/glock.c | 34 +-
 fs/gfs2/glock.h |  1 +
 fs/gfs2/super.c |  5 -
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 68484ef..53fedbb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -62,7 +62,7 @@ typedef void (*glock_examiner) (struct gfs2_glock * gl);
 static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned 
int target);
 
 static struct dentry *gfs2_root;
-static struct workqueue_struct *glock_workqueue;
+struct workqueue_struct *gfs2_glock_workqueue;
 struct workqueue_struct *gfs2_delete_workqueue;
 static LIST_HEAD(lru_list);
 static atomic_t lru_count = ATOMIC_INIT(0);
@@ -481,7 +481,7 @@ __acquires(>gl_lockref.lock)
}
} else { /* lock_nolock */
finish_xmote(gl, target);
-   if (queue_delayed_work(glock_workqueue, >gl_work, 0) == 0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, 0) 
== 0)
gfs2_glock_put(gl);
}
 
@@ -554,7 +554,7 @@ out_sched:
clear_bit(GLF_LOCK, >gl_flags);
smp_mb__after_atomic();
gl->gl_lockref.count++;
-   if (queue_delayed_work(glock_workqueue, >gl_work, 0) == 0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, 0) == 0)
gl->gl_lockref.count--;
return;
 
@@ -618,7 +618,7 @@ static void glock_work_func(struct work_struct *work)
else {
if (gl->gl_name.ln_type != LM_TYPE_INODE)
delay = 0;
-   if (queue_delayed_work(glock_workqueue, >gl_work, delay) == 
0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, 
delay) == 0)
gfs2_glock_put(gl);
}
if (drop_ref)
@@ -973,7 +973,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 test_and_clear_bit(GLF_FROZEN, >gl_flags))) {
set_bit(GLF_REPLY_PENDING, >gl_flags);
gl->gl_lockref.count++;
-   if (queue_delayed_work(glock_workqueue, >gl_work, 0) == 0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, 0) 
== 0)
gl->gl_lockref.count--;
}
run_queue(gl, 1);
@@ -1042,7 +1042,7 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
!test_bit(GLF_DEMOTE, >gl_flags) &&
gl->gl_name.ln_type == LM_TYPE_INODE)
delay = gl->gl_hold_time;
-   if (queue_delayed_work(glock_workqueue, >gl_work, delay) == 0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, delay) == 0)
gfs2_glock_put(gl);
 }
 
@@ -1220,7 +1220,7 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int 
state)
spin_lock(>gl_lockref.lock);
handle_callback(gl, state, delay, true);
spin_unlock(>gl_lockref.lock);
-   if (queue_delayed_work(glock_workqueue, >gl_work, delay) == 0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, delay) == 0)
gfs2_glock_put(gl);
 }
 
@@ -1282,7 +1282,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
set_bit(GLF_REPLY_PENDING, >gl_flags);
spin_unlock(>gl_lockref.lock);
 
-   if (queue_delayed_work(glock_workqueue, >gl_work, 0) == 0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, 0) == 0)
gfs2_glock_put(gl);
 }
 
@@ -1341,7 +1341,7 @@ add_back_to_lru:
if (demote_ok(gl))
handle_callback(gl, LM_ST_UNLOCKED, 0, false);
WARN_ON(!test_and_clear_bit(GLF_LOCK, >gl_flags));
-   if (queue_delayed_work(glock_workqueue, >gl_work, 0) == 0)
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, 0) 
== 0)
gl->gl_lockref.count--;
spin_unlock(>gl_lockref.lock);
cond_resched_lock(_lock);
@@ -1445,7 +1445,7 @@ static void thaw_glock(struct gfs2_glock *gl)
if (!test_and_clear_bit(GLF_FROZEN, >gl_flags))
goto out;
set_bit(GLF_REPLY_PENDING, >gl_flags);
-   if (queue_delayed_work(glock_workqueue, >gl_work, 0) == 0) {
+   if (queue_delayed_work(gfs2_glock_workqueue, >gl_work, 0) == 0) {
 out:
gfs2_glock_put(gl);
}
@@ -1465,7 +1465,7 @@ static void clear_glock(struct gfs2_glock *gl)
if (gl->gl_state != LM_ST_UNLOCKED)
handle_callback(gl, LM_ST_UNLOCKED, 0, false);
spin_unlock(>gl_lockref.lock);
-   if