[Cluster-devel] [GFS2 PATCH] GFS2: Don't do glock put when inode creation fails
Hi,

Currently the error path of function gfs2_inode_lookup calls gfs2_glock_put
corresponding to an earlier call to gfs2_glock_get for the inode glock.
That's wrong, because the error path also calls iget_failed(), which
eventually calls iput, which eventually calls gfs2_evict_inode, which does
another gfs2_glock_put. This double put can throw the glock reference count
off.

Regards,

Bob Peterson
Red Hat File Systems
---
Signed-off-by: Bob Peterson

diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 1cc6d8e..148a4d5 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -197,7 +197,6 @@
 fail_iopen:
 	gfs2_glock_put(io_gl);
 fail_put:
 	ip->i_gl->gl_object = NULL;
-	gfs2_glock_put(ip->i_gl);
 fail:
 	iget_failed(inode);
 	return ERR_PTR(error);
[Cluster-devel] [GFS2 PATCH] GFS2: Always use iopen glock for gl_deletes
Hi,

Before this patch, when function try_rgrp_unlink queued a glock for
delete_work to reclaim the space, it used the inode glock to do so. That's
different from the iopen callback, which uses the iopen glock for the same
purpose. We should be consistent and always use the iopen glock. This may
also save us reference counting problems with the inode glock, since
clear_glock does an extra glock_put() for the inode glock.

Regards,

Bob Peterson
Red Hat File Systems
---
Signed-off-by: Bob Peterson

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index 42ac3bb..00b2f22 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -1789,7 +1789,7 @@ static void try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked, u64 skip
 			continue;
 		*last_unlinked = block;
 
-		error = gfs2_glock_get(sdp, block, &gfs2_inode_glops, CREATE, &gl);
+		error = gfs2_glock_get(sdp, block, &gfs2_iopen_glops, CREATE, &gl);
 		if (error)
 			continue;
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
----- Original Message -----
> On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> > it's from the fenced process, and if so, queue the final put. That should
> > mitigate the problem.
>
> Bob, I'm perplexed by the focus on fencing; this issue is broader than
> fencing, as I mentioned in bz 1255872. Over the years that I've reported
> these issues, rarely if ever have they involved fencing. Any userland
> process, not just the fencing process, can allocate memory, fall into the
> general shrinking path, get into gfs2 and dlm, and end up blocked for some
> undefined time. That can cause problems in any number of ways.
>
> The specific problem you're focused on may be one of the easier ways of
> demonstrating the problem -- making the original userland process one of
> the cluster-related processes that gfs2/dlm depend on, combined with
> recovery when those processes do an especially large amount of work that
> gfs2/dlm require. But problems could occur if any process is forced to
> unwittingly do this dlm work, not just a cluster-related process, and it
> would not need to involve recovery (or fencing, which is one small part of
> it).
>
> I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
> for shrinking its cache and doing the dlm work, and would not do anything
> from the generic shrinking paths because of this issue. I don't think
> it's reasonable to expect random, unsuspecting processes on the system to
> perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
> unpredictable. I think gfs2 needs to do that kind of heavy lifting from
> its own dedicated contexts, or from processes that are explicitly choosing
> to use gfs2.

Hi Dave,

Thanks for your input. You're right, of course. The problem can occur to
other processes that cause the shrinker to run, not just fenced. Yes, the
original GFS2 did not have this problem.
When gfs2_clear_inode was called, it called gfs2_glock_schedule_for_reclaim,
which queued the inode glock onto a special queue handled by the gfs2_glockd
daemon. The reference count was bumped to prevent gfs2_clear_inode() from
being the last one out. The problem with that approach is that it's a
centralized list, which means there's no parallelism, and a backlog of
glocks waiting to be reclaimed can build up. If a glock needs to be
reacquired because its block was reused for a different inode, we could end
up reaping glocks that are still being (re)used.

We could still do that, but either we reintroduce the gfs2_glockd daemon or
we give our existing daemon, gfs2_quotad, more responsibilities, which would
make it uglier and more complex than it is today.

My previous attempts to solve this involved using a work queue and queuing
the final gfs2_glock_put(), and that fixed the problem in all cases except
when work was already queued. When you think about it, using delayed work
accomplishes the same thing, but with parallelism, when it works. Perhaps I
just need to focus on a way to allow work to be queued multiple times (in an
ideal case) or, alternatively, an atomic counter that corresponds to the
number of times the work should be executed. Or something similar.

Regards,

Bob Peterson
Red Hat File Systems
[Cluster-devel] [GFS2 PATCH] GFS2: Release iopen glock in gfs2_create_inode error cases
Hi,

Some error cases in gfs2_create_inode were not unlocking the iopen glock,
which threw the reference count off. This patch adds the proper unlock. The
error logic in function gfs2_create_inode was also convoluted, so this
patch simplifies it. It also takes care of a bug in which gfs2_qa_delete()
was not called in an error case.

Regards,

Bob Peterson
Red Hat File Systems

Signed-off-by: Bob Peterson
---
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index a8ce2e9..1cc6d8e 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -593,7 +593,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	struct gfs2_inode *dip = GFS2_I(dir), *ip;
 	struct gfs2_sbd *sdp = GFS2_SB(&dip->i_inode);
 	struct gfs2_glock *io_gl;
-	int error, free_vfs_inode = 0;
+	int error, free_vfs_inode = 1;
 	u32 aflags = 0;
 	unsigned blocks = 1;
 	struct gfs2_diradd da = { .bh = NULL, .save_loc = 1, };
@@ -650,7 +650,7 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 
 	error = posix_acl_create(dir, &mode, &default_acl, &acl);
 	if (error)
-		goto fail_free_vfs_inode;
+		goto fail_gunlock;
 
 	ip = GFS2_I(inode);
 	error = gfs2_rsqa_alloc(ip);
@@ -738,6 +738,9 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	gfs2_set_iop(inode);
 	insert_inode_hash(inode);
 
+	free_vfs_inode = 0; /* After this point, the inode is no longer
+			       considered free. Any failures need to undo
+			       the gfs2 structures. */
 	if (default_acl) {
 		error = gfs2_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
 		posix_acl_release(default_acl);
@@ -771,11 +774,8 @@ static int gfs2_create_inode(struct inode *dir, struct dentry *dentry,
 	return error;
 
 fail_gunlock3:
-	gfs2_glock_dq_uninit(ghs + 1);
-	if (ip->i_gl)
-		gfs2_glock_put(ip->i_gl);
-	goto fail_gunlock;
-
+	gfs2_glock_dq_uninit(&ip->i_iopen_gh);
+	gfs2_glock_put(io_gl);
 fail_gunlock2:
 	gfs2_glock_dq_uninit(ghs + 1);
 fail_free_inode:
@@ -787,8 +787,6 @@ fail_free_acls:
 	posix_acl_release(default_acl);
 	if (acl)
 		posix_acl_release(acl);
-fail_free_vfs_inode:
-	free_vfs_inode = 1;
 fail_gunlock:
 	gfs2_dir_no_add(&da);
 	gfs2_glock_dq_uninit(ghs);
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> it's from the fenced process, and if so, queue the final put. That should
> mitigate the problem.

Bob, I'm perplexed by the focus on fencing; this issue is broader than
fencing, as I mentioned in bz 1255872. Over the years that I've reported
these issues, rarely if ever have they involved fencing. Any userland
process, not just the fencing process, can allocate memory, fall into the
general shrinking path, get into gfs2 and dlm, and end up blocked for some
undefined time. That can cause problems in any number of ways.

The specific problem you're focused on may be one of the easier ways of
demonstrating the problem -- making the original userland process one of
the cluster-related processes that gfs2/dlm depend on, combined with
recovery when those processes do an especially large amount of work that
gfs2/dlm require. But problems could occur if any process is forced to
unwittingly do this dlm work, not just a cluster-related process, and it
would not need to involve recovery (or fencing, which is one small part of
it).

I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads
for shrinking its cache and doing the dlm work, and would not do anything
from the generic shrinking paths because of this issue. I don't think it's
reasonable to expect random, unsuspecting processes on the system to
perform gfs2/dlm operations that are often remote, lengthy, indefinite, or
unpredictable. I think gfs2 needs to do that kind of heavy lifting from its
own dedicated contexts, or from processes that are explicitly choosing to
use gfs2.
Re: [Cluster-devel] [GFS2 PATCH 1/2] GFS2: Make gfs2_clear_inode() queue the final put
Hi,

----- Original Message -----
> > Hi Steve,
> >
> > Another possibility is to add a new inode work_queue which does the
> > work of gfs2_clear_inode(). Its only function would be to clear the
> > inode, and therefore there should not be any competing work, as there
> > is in the case of a glock. The gfs2_clear_inode() function would then
> > queue the work and return to its caller, and therefore it wouldn't
> > block. The work that's been queued might block on DLM, but it shouldn't
> > matter, in theory, since the shrinker will complete, which will free up
> > the fence, which will free up dlm, which will free up everything. What
> > do you think?
> >
> > Regards,
> >
> > Bob Peterson
> > Red Hat File Systems
>
> Well, that is a possibility. There is another dependency cycle there,
> however, which is the one relating to the inode lifetime, and we need to
> be careful that we don't break that in attempting to fix the memory
> allocation at fence time issue. I know it isn't easy, but we need to be
> very careful to avoid introducing any new races, as they are likely to
> be very tricky to debug in this area.
>
> One thing we could do is to be more careful about when we take this
> delayed path. We could, for example, add a check to see if we are
> running in the context of the shrinker, and only delay the eviction in
> that specific case. I was wondering whether we could move the decision
> further up the chain too, and avoid evicting inodes with zero link count
> under memory pressure in the first place. The issue of needing to
> allocate memory to evict inodes is likely to be something that might
> affect other filesystems too, so it might be useful as a generic feature.
>
> I'd prefer to avoid adding a new inode workqueue; however, there is no
> reason why we could not add a new feature to an existing workqueue by
> adding a new inode or glock flag as appropriate (which will not expand
> the size of the inode) to deal with this case.
>
> Steve.
I tried adding a work queue for the inodes, but it was a disaster and led
to other hangs. Then I decided it made more sense to just add a new delayed
work function for the final put of the inode glock. That also led to
problems, because glocks may be reused for a different inode with the same
block address, since the glocks often outlive the inodes. What happened was
that, given enough pressure, the delayed work became backlogged and the
second put couldn't be queued due to a previous queue (with delay==0).

In the end, I decided that you're right: the safest approach is to isolate
this to the specific problem I'm trying to solve: do our business the same
way, but when it comes to the final close, check whether it's from the
fenced process, and if so, queue the final put. That should mitigate the
problem. Unfortunately, it's not good enough to only do these things when
the link count is zero: closing any file at any time could recreate the
problem, provided there was a pending fence action.

I'd also prefer it if there were a way we could tell whether we were called
from the slab shrinker, but unfortunately there's no good way to know
without wasting a lot of time traversing the call stack and such. I did a
fair amount of investigating this: there are a couple of process flags that
I was hoping to leverage: PF_FSTRANS is used by NFS to detect situations
where it mustn't do memory allocations, but I don't think we can use it.
There's also PF_MEMALLOC, but that's set for the global shrinker, not the
slab shrinker, so I don't think we can use that either. I investigated how
ocfs2 handles this situation: it turns out it doesn't, so they probably
have the same vulnerability.
In doing all this investigation, I learned of another similar possible
problem discovered in ocfs2: if the shrinker tries to ditch an inode from
memory, the file system calls into DLM, but DLM needs to do some comm
stuff, and the comm stuff needs to allocate memory for comm buffers. In
that case, the use of superblock flags (as above) was discouraged and
NACKed in favor of using a different set of memory flags, such as GFP_NOFS,
etc. I never did determine exactly how this ended up.

In the meantime, I prototyped the following patch, and it works, but it's a
hack and I'm not too happy with it. So I'm going to see if I can find a
solution that would work for both GFS2 and ocfs2.

Regards,

Bob Peterson
Red Hat File Systems
---
commit 3142a311fa848d44b15e9f86fc3599b7d9b8b3f5
Author: Bob Peterson
Date:   Mon Nov 30 12:04:09 2015 -0600

    GFS2: Make gfs2_clear_inode() not block on final glock put

    This patch changes function gfs2_clear_inode() so that instead of
    calling gfs2_glock_put, it calls a new gfs2_glock_put_noblock function
    that avoids a possible deadlock that would occur should it call dlm
    during a fence operation: dlm waits for a fence operation, the fence
    operation waits for memory, the shrinker waits for gfs2 to free an
    inode from memory, but gf