Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
Hi, On 08/12/15 07:57, Dave Chinner wrote: On Wed, Dec 02, 2015 at 11:42:13AM -0500, Bob Peterson wrote: - Original Message - (snip) Please take a look at this again and figure out what the problematic cycle of events is, and then work out how to avoid that happening in the first place. There is no point in replacing one problem with another one, particularly one which would likely be very tricky to debug, Steve. Rhe problematic cycle of events is well known: gfs2_clear_inode calls gfs2_glock_put() for the inode's glock, but if it's the very last put, it calls into dlm, which can block, and that's where we get into trouble. The livelock goes like this: 1. A fence operation needs memory, so it blocks on memory allocation. 2. Memory allocation blocks on slab shrinker. 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory. 7. dlm blocks on a pending fence operation. Goto 1. Therefore, the fence operation should be doing GFP_NOFS allocations to prevent re-entry into the DLM via the filesystem via the shrinker Cheers, Dave. Which would be ideal, but how do you do that from user space? Steve.
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
On Wed, Dec 02, 2015 at 11:42:13AM -0500, Bob Peterson wrote: > - Original Message - > (snip) > > Please take a look at this > > again and figure out what the problematic cycle of events is, and then > > work out how to avoid that happening in the first place. There is no > > point in replacing one problem with another one, particularly one which > > would likely be very tricky to debug, > > > > Steve. > > Rhe problematic cycle of events is well known: > gfs2_clear_inode calls gfs2_glock_put() for the inode's glock, > but if it's the very last put, it calls into dlm, which can block, > and that's where we get into trouble. > > The livelock goes like this: > > 1. A fence operation needs memory, so it blocks on memory allocation. > 2. Memory allocation blocks on slab shrinker. > 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory. > 7. dlm blocks on a pending fence operation. Goto 1. Therefore, the fence operation should be doing GFP_NOFS allocations to prevent re-entry into the DLM via the filesystem via the shrinker Cheers, Dave. -- Dave Chinner dchin...@redhat.com
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
- Original Message - > On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote: > > it's from the fenced process, and if so, queue the final put. That should > > mitigate the problem. > > Bob, I'm perplexed by the focus on fencing; this issue is broader than > fencing as I mentioned in bz 1255872. Over the years that I've reported > these issues, rarely if ever have they involving fencing. Any userland > process, not just the fencing process, can allocate memory, fall into the > general shrinking path, get into gfs2 and dlm, and end up blocked for some > undefined time. That can cause problems in any number of ways. > > The specific problem you're focused on may be one of the easier ways of > demonstrating the problem -- making the original userland process one of > the cluster-related processes that gfs2/dlm depend on, combined with > recovery when those processes do an especially large amount of work that > gfs2/dlm require. But problems could occur if any process is forced to > unwittingly do this dlm work, not just a cluster-related process, and it > would not need to involve recovery (or fencing which is one small part of > it). > > I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads > for shrinking its cache and doing the dlm work, and would not do anything > from the generic shrinking paths because of this issue. I don't think > it's reasonable to expect random, unsuspecting processes on the system to > perform gfs2/dlm operations that are often remote, lengthy, indefinite, or > unpredictable. I think gfs2 needs to do that kind of heavy lifting from > its own dedicated contexts, or from processes that are explicitly choosing > to use gfs2. > > Hi Dave, Thanks for your input. You're right, of course. The problem can occur to other processes that cause the shrinker to run, not just from fenced. Yes, the original GFS2 did not have this problem. When gfs2_clear_inode was called, it called gfs2_glock_schedule_for_reclaim, which queued the inode glock to a special queue which was handled by the gfs2_glockd daemon. The reference count was bumped to prevent the gfs2_clear_inode() from being the last guy out. The problem with that approach is that it's a centralized list, which means there's no parallelism and there can get to be a backlog of glocks waiting to be reclaimed. If that glock needs to be reacquired because the block was reused for a different inode, we could end up reaping glocks that are still being (re)used. We could still do that, but either we reintroduce the gfs2_glockd daemon, or give our existing daemon, gfs2_quotad, more responsibilities, which would make it a bit uglier and more complex than it is today. My previous attempts to solve this involved using a work queue and queuing the final gfs2_glock_put() and that fixed the problem for all cases, unless there was already work queued. When you think about it, using delayed work accomplishes the same thing, but with the parallelism. When it works. Perhaps I just need to focus on a way to allow work to be queued multiple times (in an ideal case) or, alternatively, an atomic counter that corresponds to the number of times the work should be executed. Or something similar. Regards, Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote: > it's from the fenced process, and if so, queue the final put. That should > mitigate the problem. Bob, I'm perplexed by the focus on fencing; this issue is broader than fencing as I mentioned in bz 1255872. Over the years that I've reported these issues, rarely if ever have they involving fencing. Any userland process, not just the fencing process, can allocate memory, fall into the general shrinking path, get into gfs2 and dlm, and end up blocked for some undefined time. That can cause problems in any number of ways. The specific problem you're focused on may be one of the easier ways of demonstrating the problem -- making the original userland process one of the cluster-related processes that gfs2/dlm depend on, combined with recovery when those processes do an especially large amount of work that gfs2/dlm require. But problems could occur if any process is forced to unwittingly do this dlm work, not just a cluster-related process, and it would not need to involve recovery (or fencing which is one small part of it). I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads for shrinking its cache and doing the dlm work, and would not do anything from the generic shrinking paths because of this issue. I don't think it's reasonable to expect random, unsuspecting processes on the system to perform gfs2/dlm operations that are often remote, lengthy, indefinite, or unpredictable. I think gfs2 needs to do that kind of heavy lifting from its own dedicated contexts, or from processes that are explicitly choosing to use gfs2.
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
Hi, - Original Message - > > Hi Steve, > > > > Another possibility is to add a new inode work_queue which does the work of > > gfs2_clear_inode(). Its only function would be to clear the inode, and > > therefore, there should not be any competing work, as there is in the case > > of > > a glock. The gfs2_clear_inode() function would then queue the work and > > return > > to its caller, and therefore it wouldn't block. The work that's been queued > > might block on DLM, but it shouldn't matter, in theory, since the shrinker > > will complete, which will free up the fence, which will free up dlm, which > > will > > free up everything. What do you think? > > > > Regards, > > > > Bob Peterson > > Red Hat File Systems > > Well that is a possibility. There is another dependency cycle there > however, which is the one relating to the inode lifetime, and we need to > be careful that we don't break that in attempting to fix the memory > allocation at fence time issue. I know it isn't easy, but we need to be > very careful to avoid introducing any new races, as they are likely to > be very tricky to debug in this area. > > One thing we could do, is to be more careful about when we take this > delayed path. We could, for example, add a check to see if we are > running in the context of the shrinker, and only delay the eviction in > that specific case. I was wondering whether we could move the decision > further up the chain too, and avoid evicting inodes with zero link count > under memory pressure in the first place. The issue of needing to > allocate memory to evict inodes is likely to be something that might > affect other filesystems too, so it might be useful as a generic feature. > > I'd prefer to avoid adding a new inode workqueue, however there is no > reason why we could not add a new feature to an existing workqueue by > adding a new inode or glock flag as appropriate (which will not expand > the size of the inode) to deal with this case, > > Steve. > > I tried adding a work queue for the inodes, but it was a disaster and led to other hangs. Then I decided it made more sense to just add a new delayed work function for the final put of the inode glock. That also led to problems because glocks may be reused for a different inode with the same block address, due to the fact that the glocks often outlive the inodes. What happened was that, given enough pressure, the delayed work became backlogged and the second put couldn't be queued due to a previous queue (with delay==0). In the end, I decided that you're right: The safest approach is to isolate this to the specific problem I'm trying to solve: to do our business the same way, but when it comes to the final close, check if it's from the fenced process, and if so, queue the final put. That should mitigate the problem. Unfortunately, it's not good enough to only do these things when the link count is zero: Closing any file at any time could recreate the problem, provided there was a pending fence action. I'd also prefer if there was a way we could tell whether we were called from the slab shrinker, but unfortunately, there's no good way to know without wasting a lot of time traversing the call stack and such. I did a fair amount of investigating this: there are a couple of bits in the superblock that I was hoping to leverage: PF_FSTRANS is used by NFS to detect situations where it mustn't do memory allocations, but I don't think we can use it. There's also PF_MEMALLOC, but that's set for the global shrinker, not the slab shrinker, so I don't think we can use that either. I investigated how ocfs2 handles this situation: turns out, it doesn't, so they probably have the same vulnerability. In doing all this investigation, I learned of another similar possible problem discovered in ocfs2: If the shrinker tried to ditch an inode from memory, then the file system calls into DLM, but DLM needs to do some COMM stuff, and the comm stuff needs to allocate memory for comm buffers. In that case, the use of superblock flags (as above) was discouraged and NACKed in favor of using a different set of memory flags, such as GFP_NOFS, etc. I never did determine exactly how this ended up. In the meantime, I prototyped the following patch, and it works, but it's a hack, and I'm not too happy with it. So I'm going to see if I can find a solution that would work for both GFS2 and ocfs2. Regards, Bob Peterson Red Hat File Systems --- commit 3142a311fa848d44b15e9f86fc3599b7d9b8b3f5 Author: Bob Peterson Date: Mon Nov 30 12:04:09 2015 -0600 GFS2: Make gfs2_clear_inode() not block on final glock put This patch changes function gfs2_clear_inode() so that instead of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock function that avoids a possible deadlock that would occur should it call dlm during a fence operation: dlm waits for a fence operation, the fence operation waits for memory, the shrinker waits for gfs2 to free an inode from memory, but gf
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
Hi, On 02/12/15 17:41, Bob Peterson wrote: - Original Message - - Original Message - (snip) Please take a look at this again and figure out what the problematic cycle of events is, and then work out how to avoid that happening in the first place. There is no point in replacing one problem with another one, particularly one which would likely be very tricky to debug, Steve. Rhe problematic cycle of events is well known: gfs2_clear_inode calls gfs2_glock_put() for the inode's glock, but if it's the very last put, it calls into dlm, which can block, and that's where we get into trouble. The livelock goes like this: 1. A fence operation needs memory, so it blocks on memory allocation. 2. Memory allocation blocks on slab shrinker. 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory. 4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode. 5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's glock. 6. gfs2_glock_put calls dlm unlock to unlock the glock. 7. dlm blocks on a pending fence operation. Goto 1. So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody needs to do the final put and tell dlm to unlock the inode's glock, which is why I've been trying to queue it to the delayed work queue. If I can't do that, we're left with few alternatives: Perhaps a new function of the quota daemon: to run the lru list and call dlm to unlock any that have a special bit set, but that just seems ugly. I've thought of some other alternatives, but they seem a lot uglier and harder to manage. I'll give it some more thought. Regards, Bob Peterson Red Hat File Systems Hi Steve, Another possibility is to add a new inode work_queue which does the work of gfs2_clear_inode(). Its only function would be to clear the inode, and therefore, there should not be any competing work, as there is in the case of a glock. The gfs2_clear_inode() function would then queue the work and return to its caller, and therefore it wouldn't block. The work that's been queued might block on DLM, but it shouldn't matter, in theory, since the shrinker will complete, which will free up the fence, which will free up dlm, which will free up everything. What do you think? Regards, Bob Peterson Red Hat File Systems Well that is a possibility. There is another dependency cycle there however, which is the one relating to the inode lifetime, and we need to be careful that we don't break that in attempting to fix the memory allocation at fence time issue. I know it isn't easy, but we need to be very careful to avoid introducing any new races, as they are likely to be very tricky to debug in this area. One thing we could do, is to be more careful about when we take this delayed path. We could, for example, add a check to see if we are running in the context of the shrinker, and only delay the eviction in that specific case. I was wondering whether we could move the decision further up the chain too, and avoid evicting inodes with zero link count under memory pressure in the first place. The issue of needing to allocate memory to evict inodes is likely to be something that might affect other filesystems too, so it might be useful as a generic feature. I'd prefer to avoid adding a new inode workqueue, however there is no reason why we could not add a new feature to an existing workqueue by adding a new inode or glock flag as appropriate (which will not expand the size of the inode) to deal with this case, Steve.
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
- Original Message - > - Original Message - > (snip) > > Please take a look at this > > again and figure out what the problematic cycle of events is, and then > > work out how to avoid that happening in the first place. There is no > > point in replacing one problem with another one, particularly one which > > would likely be very tricky to debug, > > > > Steve. > > Rhe problematic cycle of events is well known: > gfs2_clear_inode calls gfs2_glock_put() for the inode's glock, > but if it's the very last put, it calls into dlm, which can block, > and that's where we get into trouble. > > The livelock goes like this: > > 1. A fence operation needs memory, so it blocks on memory allocation. > 2. Memory allocation blocks on slab shrinker. > 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory. > 4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode. > 5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's > glock. > 6. gfs2_glock_put calls dlm unlock to unlock the glock. > 7. dlm blocks on a pending fence operation. Goto 1. > > So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody > needs to do the final put and tell dlm to unlock the inode's glock, which is > why > I've been trying to queue it to the delayed work queue. > > If I can't do that, we're left with few alternatives: Perhaps a new > function of the quota daemon: to run the lru list and call dlm to unlock > any that have a special bit set, but that just seems ugly. > > I've thought of some other alternatives, but they seem a lot uglier and > harder to manage. I'll give it some more thought. > > Regards, > > Bob Peterson > Red Hat File Systems > Hi Steve, Another possibility is to add a new inode work_queue which does the work of gfs2_clear_inode(). Its only function would be to clear the inode, and therefore, there should not be any competing work, as there is in the case of a glock. The gfs2_clear_inode() function would then queue the work and return to its caller, and therefore it wouldn't block. The work that's been queued might block on DLM, but it shouldn't matter, in theory, since the shrinker will complete, which will free up the fence, which will free up dlm, which will free up everything. What do you think? Regards, Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
- Original Message - (snip) > Please take a look at this > again and figure out what the problematic cycle of events is, and then > work out how to avoid that happening in the first place. There is no > point in replacing one problem with another one, particularly one which > would likely be very tricky to debug, > > Steve. Rhe problematic cycle of events is well known: gfs2_clear_inode calls gfs2_glock_put() for the inode's glock, but if it's the very last put, it calls into dlm, which can block, and that's where we get into trouble. The livelock goes like this: 1. A fence operation needs memory, so it blocks on memory allocation. 2. Memory allocation blocks on slab shrinker. 3. Slab shrinker calls into vfs inode shrinker to free inodes from memory. 4. vfs inode shrinker eventually calls gfs2_clear_inode to free an inode. 5. gfs2_clear_inode calls the final gfs2_glock_put to unlock the inode's glock. 6. gfs2_glock_put calls dlm unlock to unlock the glock. 7. dlm blocks on a pending fence operation. Goto 1. So we need to prevent gfs2_clear_inode from calling into DLM. Still, somebody needs to do the final put and tell dlm to unlock the inode's glock, which is why I've been trying to queue it to the delayed work queue. If I can't do that, we're left with few alternatives: Perhaps a new function of the quota daemon: to run the lru list and call dlm to unlock any that have a special bit set, but that just seems ugly. I've thought of some other alternatives, but they seem a lot uglier and harder to manage. I'll give it some more thought. Regards, Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
HI, On 01/12/15 15:42, Bob Peterson wrote: - Original Message - Hi, On 25/11/15 14:22, Bob Peterson wrote: - Original Message - Hi, On 19/11/15 18:42, Bob Peterson wrote: This patch changes function gfs2_clear_inode() so that instead of calling gfs2_glock_put directly() most of the time, it queues the glock to the delayed work queue. That avoids a possible deadlock where it calls dlm during a fence operation: dlm waits for a fence operation, the fence operation waits for memory, the shrinker waits for gfs2 to free an inode from memory, but gfs2 waits for dlm. Signed-off-by: Bob Peterson --- fs/gfs2/glock.c | 34 +- fs/gfs2/glock.h | 1 + fs/gfs2/super.c | 5 - 3 files changed, 22 insertions(+), 18 deletions(-) [snip] Most of the patch seems to just rename the workqueue which makes it tricky to spot the other changes. However, the below code seems to be the new bit.. diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c index 9d5c3f7..46e5004 100644 --- a/fs/gfs2/super.c +++ b/fs/gfs2/super.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -1614,7 +1615,9 @@ out: ip->i_gl->gl_object = NULL; flush_delayed_work(&ip->i_gl->gl_work); gfs2_glock_add_to_lru(ip->i_gl); - gfs2_glock_put(ip->i_gl); + if (queue_delayed_work(gfs2_glock_workqueue, + &ip->i_gl->gl_work, 0) == 0) + gfs2_glock_put(ip->i_gl); ip->i_gl = NULL; if (ip->i_iopen_gh.gh_gl) { ip->i_iopen_gh.gh_gl->gl_object = NULL; which replaces a put with a queue & put if the queue fails (due to it being already on the queue) which doesn't look quite right to be since if calling gfs2_glock_put() was not safe before, then calling it conditionally like this is still no safer I think? Steve. Hi, The call to gfs2_glock_put() in this case should be safe. If queuing the delayed work fails, it means the glock reference count is greater than 1, to be decremented when the glock state machine runs. Which means this can't be the final glock_put(). Which means we can't possibly call into DLM, which means we can't block. Which means it's safe. Regards, Bob Peterson Red Hat File Systems There is no reason that this cannot be the final glock put, since there is no synchronization with the work that has been queued, so it might well have run and decremented the ref count before we return from the queuing function. It is unlikely that will be the case, but it is still possible, Steve. Hi Steve, It's kind of an ugly hack, but can we do something like the patch below instead? Regards, Bob Peterson Red Hat File Systems --- commit 1949050b4b13c1b32ea45987fbf2936ae779609e Author: Bob Peterson Date: Thu Nov 19 12:06:31 2015 -0600 GFS2: Make gfs2_clear_inode() not block on final glock put This patch changes function gfs2_clear_inode() so that instead of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock function that avoids a possible deadlock that would occur should it call dlm during a fence operation: dlm waits for a fence operation, the fence operation waits for memory, the shrinker waits for gfs2 to free an inode from memory, but gfs2 waits for dlm. The new non-blocking glock_put does this: 1. It acquires the lockref to ensure no one else is messing with it. 2. If the lockref is put (not locked) it can safely return because it is not the last reference to the glock. 3. If this is the last reference, it tries to queue delayed work for the glock. 4. If it was able to queue the delayed work, it's safe to return because the glock_work_func will run in another process, so this one cannot block. 5. If it was unable to queue the delayed work, it needs to schedule and start the whole process again. Signed-off-by: Bob Peterson diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index a4ff7b5..22870c6 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -178,6 +178,27 @@ void gfs2_glock_put(struct gfs2_glock *gl) } /** + * gfs2_glock_put_noblock() - Decrement reference count on glock + * @gl: The glock to put + * + * This is the same as gfs2_glock_put() but it's not allowed to block + */ + +void gfs2_glock_put_noblock(struct gfs2_glock *gl) +{ + while (1) { + if (lockref_put_or_lock(&gl->gl_lockref)) + break; + + spin_unlock(&gl->gl_lockref.lock); That just drops the ref count without doing anything. + if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) != 0) + break; You can't call queue_delayed_work on a glock for which you don't have a ref count - it might not exist any more. Please take a look at this again and figure out what the problematic cycle of events is, and then work out how to avoid that happening in the first place. There is no point in replacing one problem with ano
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
- Original Message - > Hi, > > On 25/11/15 14:22, Bob Peterson wrote: > > - Original Message - > >> Hi, > >> > >> On 19/11/15 18:42, Bob Peterson wrote: > >>> This patch changes function gfs2_clear_inode() so that instead > >>> of calling gfs2_glock_put directly() most of the time, it queues > >>> the glock to the delayed work queue. That avoids a possible > >>> deadlock where it calls dlm during a fence operation: > >>> dlm waits for a fence operation, the fence operation waits for > >>> memory, the shrinker waits for gfs2 to free an inode from memory, > >>> but gfs2 waits for dlm. > >>> > >>> Signed-off-by: Bob Peterson > >>> --- > >>>fs/gfs2/glock.c | 34 +- > >>>fs/gfs2/glock.h | 1 + > >>>fs/gfs2/super.c | 5 - > >>>3 files changed, 22 insertions(+), 18 deletions(-) > >> [snip] > >> Most of the patch seems to just rename the workqueue which makes it > >> tricky to spot the other changes. However, the below code seems to be > >> the new bit.. > >> > >>> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c > >>> index 9d5c3f7..46e5004 100644 > >>> --- a/fs/gfs2/super.c > >>> +++ b/fs/gfs2/super.c > >>> @@ -24,6 +24,7 @@ > >>>#include > >>>#include > >>>#include > >>> +#include > >>>#include > >>>#include > >>>#include > >>> @@ -1614,7 +1615,9 @@ out: > >>> ip->i_gl->gl_object = NULL; > >>> flush_delayed_work(&ip->i_gl->gl_work); > >>> gfs2_glock_add_to_lru(ip->i_gl); > >>> - gfs2_glock_put(ip->i_gl); > >>> + if (queue_delayed_work(gfs2_glock_workqueue, > >>> +&ip->i_gl->gl_work, 0) == 0) > >>> + gfs2_glock_put(ip->i_gl); > >>> ip->i_gl = NULL; > >>> if (ip->i_iopen_gh.gh_gl) { > >>> ip->i_iopen_gh.gh_gl->gl_object = NULL; > >> which replaces a put with a queue & put if the queue fails (due to it > >> being already on the queue) which doesn't look quite right to be since > >> if calling gfs2_glock_put() was not safe before, then calling it > >> conditionally like this is still no safer I think? > >> > >> Steve. > > Hi, > > > > The call to gfs2_glock_put() in this case should be safe. > > > > If queuing the delayed work fails, it means the glock reference count is > > greater than 1, to be decremented when the glock state machine runs. > > Which means this can't be the final glock_put(). > > Which means we can't possibly call into DLM, which means we can't block. > > Which means it's safe. > > > > Regards, > > > > Bob Peterson > > Red Hat File Systems > > There is no reason that this cannot be the final glock put, since there > is no synchronization with the work that has been queued, so it might > well have run and decremented the ref count before we return from the > queuing function. It is unlikely that will be the case, but it is still > possible, > > Steve. > Hi Steve, It's kind of an ugly hack, but can we do something like the patch below instead? Regards, Bob Peterson Red Hat File Systems --- commit 1949050b4b13c1b32ea45987fbf2936ae779609e Author: Bob Peterson Date: Thu Nov 19 12:06:31 2015 -0600 GFS2: Make gfs2_clear_inode() not block on final glock put This patch changes function gfs2_clear_inode() so that instead of calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock function that avoids a possible deadlock that would occur should it call dlm during a fence operation: dlm waits for a fence operation, the fence operation waits for memory, the shrinker waits for gfs2 to free an inode from memory, but gfs2 waits for dlm. The new non-blocking glock_put does this: 1. It acquires the lockref to ensure no one else is messing with it. 2. If the lockref is put (not locked) it can safely return because it is not the last reference to the glock. 3. If this is the last reference, it tries to queue delayed work for the glock. 4. If it was able to queue the delayed work, it's safe to return because the glock_work_func will run in another process, so this one cannot block. 5. If it was unable to queue the delayed work, it needs to schedule and start the whole process again. Signed-off-by: Bob Peterson diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index a4ff7b5..22870c6 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -178,6 +178,27 @@ void gfs2_glock_put(struct gfs2_glock *gl) } /** + * gfs2_glock_put_noblock() - Decrement reference count on glock + * @gl: The glock to put + * + * This is the same as gfs2_glock_put() but it's not allowed to block + */ + +void gfs2_glock_put_noblock(struct gfs2_glock *gl) +{ + while (1) { + if (lockref_put_or_lock(&gl->gl_lockref)) + break; + + spin_unlock(&gl->gl_lockref.lock); + if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) != 0) + break; + + cond_resched(); + } +} + +/** * may_grant - check if its ok to grant a
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
Hi, On 25/11/15 14:22, Bob Peterson wrote: - Original Message - Hi, On 19/11/15 18:42, Bob Peterson wrote: This patch changes function gfs2_clear_inode() so that instead of calling gfs2_glock_put directly() most of the time, it queues the glock to the delayed work queue. That avoids a possible deadlock where it calls dlm during a fence operation: dlm waits for a fence operation, the fence operation waits for memory, the shrinker waits for gfs2 to free an inode from memory, but gfs2 waits for dlm. Signed-off-by: Bob Peterson --- fs/gfs2/glock.c | 34 +- fs/gfs2/glock.h | 1 + fs/gfs2/super.c | 5 - 3 files changed, 22 insertions(+), 18 deletions(-) [snip] Most of the patch seems to just rename the workqueue which makes it tricky to spot the other changes. However, the below code seems to be the new bit.. diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c index 9d5c3f7..46e5004 100644 --- a/fs/gfs2/super.c +++ b/fs/gfs2/super.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -1614,7 +1615,9 @@ out: ip->i_gl->gl_object = NULL; flush_delayed_work(&ip->i_gl->gl_work); gfs2_glock_add_to_lru(ip->i_gl); - gfs2_glock_put(ip->i_gl); + if (queue_delayed_work(gfs2_glock_workqueue, + &ip->i_gl->gl_work, 0) == 0) + gfs2_glock_put(ip->i_gl); ip->i_gl = NULL; if (ip->i_iopen_gh.gh_gl) { ip->i_iopen_gh.gh_gl->gl_object = NULL; which replaces a put with a queue & put if the queue fails (due to it being already on the queue) which doesn't look quite right to be since if calling gfs2_glock_put() was not safe before, then calling it conditionally like this is still no safer I think? Steve. Hi, The call to gfs2_glock_put() in this case should be safe. If queuing the delayed work fails, it means the glock reference count is greater than 1, to be decremented when the glock state machine runs. Which means this can't be the final glock_put(). Which means we can't possibly call into DLM, which means we can't block. Which means it's safe. Regards, Bob Peterson Red Hat File Systems There is no reason that this cannot be the final glock put, since there is no synchronization with the work that has been queued, so it might well have run and decremented the ref count before we return from the queuing function. It is unlikely that will be the case, but it is still possible, Steve.
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
- Original Message - > Hi, > > On 19/11/15 18:42, Bob Peterson wrote: > > This patch changes function gfs2_clear_inode() so that instead > > of calling gfs2_glock_put directly() most of the time, it queues > > the glock to the delayed work queue. That avoids a possible > > deadlock where it calls dlm during a fence operation: > > dlm waits for a fence operation, the fence operation waits for > > memory, the shrinker waits for gfs2 to free an inode from memory, > > but gfs2 waits for dlm. > > > > Signed-off-by: Bob Peterson > > --- > > fs/gfs2/glock.c | 34 +- > > fs/gfs2/glock.h | 1 + > > fs/gfs2/super.c | 5 - > > 3 files changed, 22 insertions(+), 18 deletions(-) > [snip] > Most of the patch seems to just rename the workqueue which makes it > tricky to spot the other changes. However, the below code seems to be > the new bit.. > > > diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c > > index 9d5c3f7..46e5004 100644 > > --- a/fs/gfs2/super.c > > +++ b/fs/gfs2/super.c > > @@ -24,6 +24,7 @@ > > #include > > #include > > #include > > +#include > > #include > > #include > > #include > > @@ -1614,7 +1615,9 @@ out: > > ip->i_gl->gl_object = NULL; > > flush_delayed_work(&ip->i_gl->gl_work); > > gfs2_glock_add_to_lru(ip->i_gl); > > - gfs2_glock_put(ip->i_gl); > > + if (queue_delayed_work(gfs2_glock_workqueue, > > + &ip->i_gl->gl_work, 0) == 0) > > + gfs2_glock_put(ip->i_gl); > > ip->i_gl = NULL; > > if (ip->i_iopen_gh.gh_gl) { > > ip->i_iopen_gh.gh_gl->gl_object = NULL; > > which replaces a put with a queue & put if the queue fails (due to it > being already on the queue) which doesn't look quite right to be since > if calling gfs2_glock_put() was not safe before, then calling it > conditionally like this is still no safer I think? > > Steve. Hi, The call to gfs2_glock_put() in this case should be safe. If queuing the delayed work fails, it means the glock reference count is greater than 1, to be decremented when the glock state machine runs. Which means this can't be the final glock_put(). Which means we can't possibly call into DLM, which means we can't block. Which means it's safe. Regards, Bob Peterson Red Hat File Systems
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
Hi, On 19/11/15 18:42, Bob Peterson wrote: This patch changes function gfs2_clear_inode() so that instead of calling gfs2_glock_put directly() most of the time, it queues the glock to the delayed work queue. That avoids a possible deadlock where it calls dlm during a fence operation: dlm waits for a fence operation, the fence operation waits for memory, the shrinker waits for gfs2 to free an inode from memory, but gfs2 waits for dlm. Signed-off-by: Bob Peterson --- fs/gfs2/glock.c | 34 +- fs/gfs2/glock.h | 1 + fs/gfs2/super.c | 5 - 3 files changed, 22 insertions(+), 18 deletions(-) [snip] Most of the patch seems to just rename the workqueue which makes it tricky to spot the other changes. However, the below code seems to be the new bit.. diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c index 9d5c3f7..46e5004 100644 --- a/fs/gfs2/super.c +++ b/fs/gfs2/super.c @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -1614,7 +1615,9 @@ out: ip->i_gl->gl_object = NULL; flush_delayed_work(&ip->i_gl->gl_work); gfs2_glock_add_to_lru(ip->i_gl); - gfs2_glock_put(ip->i_gl); + if (queue_delayed_work(gfs2_glock_workqueue, + &ip->i_gl->gl_work, 0) == 0) + gfs2_glock_put(ip->i_gl); ip->i_gl = NULL; if (ip->i_iopen_gh.gh_gl) { ip->i_iopen_gh.gh_gl->gl_object = NULL; which replaces a put with a queue & put if the queue fails (due to it being already on the queue) which doesn't look quite right to be since if calling gfs2_glock_put() was not safe before, then calling it conditionally like this is still no safer I think? Steve.
[Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
This patch changes function gfs2_clear_inode() so that instead of calling gfs2_glock_put directly() most of the time, it queues the glock to the delayed work queue. That avoids a possible deadlock where it calls dlm during a fence operation: dlm waits for a fence operation, the fence operation waits for memory, the shrinker waits for gfs2 to free an inode from memory, but gfs2 waits for dlm. Signed-off-by: Bob Peterson --- fs/gfs2/glock.c | 34 +- fs/gfs2/glock.h | 1 + fs/gfs2/super.c | 5 - 3 files changed, 22 insertions(+), 18 deletions(-) diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c index 68484ef..53fedbb 100644 --- a/fs/gfs2/glock.c +++ b/fs/gfs2/glock.c @@ -62,7 +62,7 @@ typedef void (*glock_examiner) (struct gfs2_glock * gl); static void do_xmote(struct gfs2_glock *gl, struct gfs2_holder *gh, unsigned int target); static struct dentry *gfs2_root; -static struct workqueue_struct *glock_workqueue; +struct workqueue_struct *gfs2_glock_workqueue; struct workqueue_struct *gfs2_delete_workqueue; static LIST_HEAD(lru_list); static atomic_t lru_count = ATOMIC_INIT(0); @@ -481,7 +481,7 @@ __acquires(&gl->gl_lockref.lock) } } else { /* lock_nolock */ finish_xmote(gl, target); - if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0) gfs2_glock_put(gl); } @@ -554,7 +554,7 @@ out_sched: clear_bit(GLF_LOCK, &gl->gl_flags); smp_mb__after_atomic(); gl->gl_lockref.count++; - if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0) gl->gl_lockref.count--; return; @@ -618,7 +618,7 @@ static void glock_work_func(struct work_struct *work) else { if (gl->gl_name.ln_type != LM_TYPE_INODE) delay = 0; - if (queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, delay) == 0) gfs2_glock_put(gl); } if (drop_ref) @@ -973,7 +973,7 @@ int gfs2_glock_nq(struct gfs2_holder *gh) test_and_clear_bit(GLF_FROZEN, &gl->gl_flags))) { set_bit(GLF_REPLY_PENDING, &gl->gl_flags); gl->gl_lockref.count++; - if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0) gl->gl_lockref.count--; } run_queue(gl, 1); @@ -1042,7 +1042,7 @@ void gfs2_glock_dq(struct gfs2_holder *gh) !test_bit(GLF_DEMOTE, &gl->gl_flags) && gl->gl_name.ln_type == LM_TYPE_INODE) delay = gl->gl_hold_time; - if (queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, delay) == 0) gfs2_glock_put(gl); } @@ -1220,7 +1220,7 @@ void gfs2_glock_cb(struct gfs2_glock *gl, unsigned int state) spin_lock(&gl->gl_lockref.lock); handle_callback(gl, state, delay, true); spin_unlock(&gl->gl_lockref.lock); - if (queue_delayed_work(glock_workqueue, &gl->gl_work, delay) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, delay) == 0) gfs2_glock_put(gl); } @@ -1282,7 +1282,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret) set_bit(GLF_REPLY_PENDING, &gl->gl_flags); spin_unlock(&gl->gl_lockref.lock); - if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0) gfs2_glock_put(gl); } @@ -1341,7 +1341,7 @@ add_back_to_lru: if (demote_ok(gl)) handle_callback(gl, LM_ST_UNLOCKED, 0, false); WARN_ON(!test_and_clear_bit(GLF_LOCK, &gl->gl_flags)); - if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0) + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0) gl->gl_lockref.count--; spin_unlock(&gl->gl_lockref.lock); cond_resched_lock(&lru_lock); @@ -1445,7 +1445,7 @@ static void thaw_glock(struct gfs2_glock *gl) if (!test_and_clear_bit(GLF_FROZEN, &gl->gl_flags)) goto out; set_bit(GLF_REPLY_PENDING, &gl->gl_flags); - if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0) { + if (queue_delayed_work(gfs2_glock_workqueue, &gl->gl_work, 0) == 0) { out: gfs2_glock_put(gl); } @@ -1465,7 +1465,7 @@ static void clear_glock(struct gfs2_glock *gl) if (gl->gl_state != LM_ST_UNLOCKED)