Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
> This new method of managing recovery is an alternative to
> the previous approach of using the userland gfs_controld.
>
> - use dlm slot numbers to assign journal id's
> - use dlm recovery callbacks to initiate journal recovery
> - use a dlm lock to determine the first node to mount fs
> - use a dlm lock to track journals that need recovery
>
> Signed-off-by: David Teigland <teigl...@redhat.com>
> ---
>  fs/gfs2/glock.c             |    2 +-
>  fs/gfs2/glock.h             |    7 +-
>  fs/gfs2/incore.h            |   58 +++-
>  fs/gfs2/lock_dlm.c          |  993 ++-
>  fs/gfs2/main.c              |   10 +
>  fs/gfs2/ops_fstype.c        |   29 +-
>  fs/gfs2/recovery.c          |    4 +
>  fs/gfs2/sys.c               |   33 +-
>  fs/gfs2/sys.h               |    2 +
>  include/linux/gfs2_ondisk.h |    2 +
>  10 files changed, 1098 insertions(+), 42 deletions(-)

I've just been looking at this again, and a question springs to mind...
how does this deal with nodes which are read-only or spectator mounts?
In the old system we used to propagate that information to gfs_controld,
but I've not spotted anything similar in the patch so far, so I'm
wondering whether it needs to know that information or not,

Steve.
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Mon, Jan 09, 2012 at 04:36:30PM +0000, Steven Whitehouse wrote:
> On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
> > This new method of managing recovery is an alternative to
> > the previous approach of using the userland gfs_controld.
> >
> > - use dlm slot numbers to assign journal id's
> > - use dlm recovery callbacks to initiate journal recovery
> > - use a dlm lock to determine the first node to mount fs
> > - use a dlm lock to track journals that need recovery
>
> I've just been looking at this again, and a question springs to mind...
> how does this deal with nodes which are read-only or spectator mounts?
> In the old system we used to propagate that information to gfs_controld,
> but I've not spotted anything similar in the patch so far, so I'm
> wondering whether it needs to know that information or not,

The dlm allocates a slot for all lockspace members, so spectator mounts
(like readonly mounts) would be given a slot/jid.  In gfs_controld,
spectator mounts are not given a jid (that came from the time when adding
a journal required extending the device+fs.)  These days, there's
probably no meaningful difference between spectator and readonly mounts.
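For illustration, a minimal sketch of that slot-to-jid mapping in a dlm
recover_done callback. This is not the posted patch itself; the function
name is made up, and it assumes 1-based dlm slots mapping onto 0-based
journal ids:

	/* Illustrative only: derive a journal id from the dlm slot number,
	 * assuming slots are 1-based and jids are 0-based. */
	static void example_recover_done(void *arg, struct dlm_slot *slots,
					 int num_slots, int our_slot,
					 uint32_t generation)
	{
		struct gfs2_sbd *sdp = arg;
		struct lm_lockstruct *ls = &sdp->sd_lockstruct;

		spin_lock(&ls->ls_recover_spin);
		ls->ls_recover_start = generation;
		if (ls->ls_jid < 0)
			ls->ls_jid = our_slot - 1;	/* slot N -> jid N-1 */
		spin_unlock(&ls->ls_recover_spin);
	}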
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Mon, Jan 09, 2012 at 11:46:26AM -0500, David Teigland wrote:
> On Mon, Jan 09, 2012 at 04:36:30PM +0000, Steven Whitehouse wrote:
> > On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
> > > This new method of managing recovery is an alternative to
> > > the previous approach of using the userland gfs_controld.
> > >
> > > - use dlm slot numbers to assign journal id's
> > > - use dlm recovery callbacks to initiate journal recovery
> > > - use a dlm lock to determine the first node to mount fs
> > > - use a dlm lock to track journals that need recovery
> >
> > I've just been looking at this again, and a question springs to mind...
> > how does this deal with nodes which are read-only or spectator mounts?
> > In the old system we used to propagate that information to gfs_controld,
> > but I've not spotted anything similar in the patch so far, so I'm
> > wondering whether it needs to know that information or not,
>
> The dlm allocates a slot for all lockspace members, so spectator mounts
> (like readonly mounts) would be given a slot/jid.  In gfs_controld,
> spectator mounts are not given a jid (that came from the time when adding
> a journal required extending the device+fs.)  These days, there's
> probably no meaningful difference between spectator and readonly mounts.

There's one other part, and that's whether a readonly or spectator node
should attempt to recover the journal of a failed node.  In cluster3 this
decision was always a bit mixed up, with some logic in gfs_controld and
some in gfs2.  We should make a clear decision now and include it in this
patch.

I think gfs2_recover_func() should return GAVEUP right at the start for
any of the cases where you don't want it doing recovery.  What cases
would you prefer?
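A minimal sketch of that early return, assuming the policy lands on
"spectators never recover, and read-only nodes only recover their own
journal". The guard below is hypothetical, not the posted patch:

	/* Hypothetical guard at the top of gfs2_recover_func(); the exact
	 * conditions are what is being discussed in this thread. */
	void gfs2_recover_func(struct work_struct *work)
	{
		struct gfs2_jdesc *jd = container_of(work, struct gfs2_jdesc, jd_work);
		struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);

		if (sdp->sd_args.ar_spectator ||
		    (jd->jd_jid != sdp->sd_lockstruct.ls_jid &&
		     (sdp->sd_vfs->s_flags & MS_RDONLY))) {
			/* let another node recover this journal */
			gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
			return;
		}

		/* ... normal journal recovery continues here ... */
	}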
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Mon, 2012-01-09 at 11:46 -0500, David Teigland wrote:
> On Mon, Jan 09, 2012 at 04:36:30PM +0000, Steven Whitehouse wrote:
> > On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
> > > This new method of managing recovery is an alternative to
> > > the previous approach of using the userland gfs_controld.
> > >
> > > - use dlm slot numbers to assign journal id's
> > > - use dlm recovery callbacks to initiate journal recovery
> > > - use a dlm lock to determine the first node to mount fs
> > > - use a dlm lock to track journals that need recovery
> >
> > I've just been looking at this again, and a question springs to mind...
> > how does this deal with nodes which are read-only or spectator mounts?
> > In the old system we used to propagate that information to gfs_controld,
> > but I've not spotted anything similar in the patch so far, so I'm
> > wondering whether it needs to know that information or not,
>
> The dlm allocates a slot for all lockspace members, so spectator mounts
> (like readonly mounts) would be given a slot/jid.  In gfs_controld,
> spectator mounts are not given a jid (that came from the time when adding
> a journal required extending the device+fs.)  These days, there's
> probably no meaningful difference between spectator and readonly mounts.

The issue is more related to recovery time, though.

For spectator mounts, we don't have to care about recovery at all, and if
one fails, it does not need to be recovered at the fs level.  Spectator
mounts can check the journals, but cannot recover any, so as the first
mounter of the filesystem, they must fail to mount if any journals are
left dirty.

For read-only, it is less clear... the first read-only mounter of the fs
must recover all journals.  After that, currently, read-only nodes do not
perform recovery, although we could change that - it isn't clear what the
correct path is here, so we need to pick one and stick with it.  If a
read-only node fails, we do not need to recover it, since there is
nothing to do (as per spectator).

What I want to avoid is having a cluster of read-only mounted nodes, have
one fail, and then the rest of the cluster is stuck because it's trying
to recover the journal for the failed node and there are no nodes which
are able to perform that recovery left in the cluster.

If assigning a slot to spectator mounts means that spectator mounts now
have (effectively) a journal id assigned to them, even if they never
touch it, then that is a change we need to be careful to document in case
someone has a small filesystem with a number of spectator mounters
sharing it, and is thus unable to create more journals when they upgrade
to the new system.

Steve.
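A sketch of the first-mounter policy Steve describes, using hypothetical
helpers: example_journal_is_clean() is not a real gfs2 function, the
gfs2_recover_journal() call is simplified to a synchronous form, and the
error codes are illustrative:

	/* Illustrative first-mounter pass over all journals: a spectator can
	 * only verify journals are clean and must fail the mount otherwise,
	 * while a read-only (or read-write) first mounter replays them all. */
	static int example_first_mount_recovery(struct gfs2_sbd *sdp)
	{
		unsigned int jid;
		int error;

		for (jid = 0; jid < sdp->sd_journals; jid++) {
			struct gfs2_jdesc *jd = gfs2_jdesc_find(sdp, jid);

			if (sdp->sd_args.ar_spectator) {
				/* cannot replay anything: dirty journal => fail */
				if (!example_journal_is_clean(jd))
					return -EIO;
			} else {
				error = gfs2_recover_journal(jd);
				if (error)
					return error;
			}
		}
		return 0;
	}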
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Mon, 2012-01-09 at 12:00 -0500, David Teigland wrote:
> On Mon, Jan 09, 2012 at 11:46:26AM -0500, David Teigland wrote:
> > On Mon, Jan 09, 2012 at 04:36:30PM +0000, Steven Whitehouse wrote:
> > > On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
> > > > This new method of managing recovery is an alternative to
> > > > the previous approach of using the userland gfs_controld.
> > > >
> > > > - use dlm slot numbers to assign journal id's
> > > > - use dlm recovery callbacks to initiate journal recovery
> > > > - use a dlm lock to determine the first node to mount fs
> > > > - use a dlm lock to track journals that need recovery
> > >
> > > I've just been looking at this again, and a question springs to
> > > mind... how does this deal with nodes which are read-only or
> > > spectator mounts?  In the old system we used to propagate that
> > > information to gfs_controld, but I've not spotted anything similar
> > > in the patch so far, so I'm wondering whether it needs to know that
> > > information or not,
> >
> > The dlm allocates a slot for all lockspace members, so spectator mounts
> > (like readonly mounts) would be given a slot/jid.  In gfs_controld,
> > spectator mounts are not given a jid (that came from the time when
> > adding a journal required extending the device+fs.)  These days,
> > there's probably no meaningful difference between spectator and
> > readonly mounts.
>
> There's one other part, and that's whether a readonly or spectator node
> should attempt to recover the journal of a failed node.  In cluster3
> this decision was always a bit mixed up, with some logic in gfs_controld
> and some in gfs2.  We should make a clear decision now and include it in
> this patch.
>
> I think gfs2_recover_func() should return GAVEUP right at the start for
> any of the cases where you don't want it doing recovery.  What cases
> would you prefer?

Yes, if it can't recover, then that's a good idea.  We also need to
ensure that we are not trying to recover nodes which don't need recovery
though (see my earlier email),

Steve.
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
----- Original Message -----
| This new method of managing recovery is an alternative to
| the previous approach of using the userland gfs_controld.
|
| - use dlm slot numbers to assign journal id's
| - use dlm recovery callbacks to initiate journal recovery
| - use a dlm lock to determine the first node to mount fs
| - use a dlm lock to track journals that need recovery
|
| Signed-off-by: David Teigland <teigl...@redhat.com>
| ---
| --- a/fs/gfs2/lock_dlm.c
| +++ b/fs/gfs2/lock_dlm.c
(snip)
| +#include <linux/gfs2_ondisk.h>
|  #include <linux/gfs2_ondisk.h>

Hi Dave,

Are you going to post a replacement patch or addendum patch that
addresses Steve's concerns, such as the above?  I'd like to review this,
but I want to review the latest/greatest.

Regards,

Bob Peterson
Red Hat File Systems
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Thu, Jan 05, 2012 at 10:08:15AM -0500, Bob Peterson wrote:
> ----- Original Message -----
> | This new method of managing recovery is an alternative to
> | the previous approach of using the userland gfs_controld.
> |
> | - use dlm slot numbers to assign journal id's
> | - use dlm recovery callbacks to initiate journal recovery
> | - use a dlm lock to determine the first node to mount fs
> | - use a dlm lock to track journals that need recovery
> |
> | Signed-off-by: David Teigland <teigl...@redhat.com>
> | ---
> | --- a/fs/gfs2/lock_dlm.c
> | +++ b/fs/gfs2/lock_dlm.c
> (snip)
> | +#include <linux/gfs2_ondisk.h>
> |  #include <linux/gfs2_ondisk.h>
>
> Hi Dave,
>
> Are you going to post a replacement patch or addendum patch that
> addresses Steve's concerns, such as the above?  I'd like to review this,
> but I want to review the latest/greatest.

I haven't resent the patches after making the changes (which were fairly
minor.)  I'll resend them shortly for another check before a pull
request.

Dave
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Thu, 2012-01-05 at 10:21 -0500, David Teigland wrote:
> On Thu, Jan 05, 2012 at 10:08:15AM -0500, Bob Peterson wrote:
> > ----- Original Message -----
> > | This new method of managing recovery is an alternative to
> > | the previous approach of using the userland gfs_controld.
> > |
> > | - use dlm slot numbers to assign journal id's
> > | - use dlm recovery callbacks to initiate journal recovery
> > | - use a dlm lock to determine the first node to mount fs
> > | - use a dlm lock to track journals that need recovery
> > |
> > | Signed-off-by: David Teigland <teigl...@redhat.com>
> > | ---
> > | --- a/fs/gfs2/lock_dlm.c
> > | +++ b/fs/gfs2/lock_dlm.c
> > (snip)
> > | +#include <linux/gfs2_ondisk.h>
> > |  #include <linux/gfs2_ondisk.h>
> >
> > Hi Dave,
> >
> > Are you going to post a replacement patch or addendum patch that
> > addresses Steve's concerns, such as the above?  I'd like to review
> > this, but I want to review the latest/greatest.
>
> I haven't resent the patches after making the changes (which were fairly
> minor.)  I'll resend them shortly for another check before a pull
> request.
>
> Dave

I think it would be a good plan to not send this last patch for the
current merge window and let it settle for a bit longer.  Running things
so fine with the timing makes me nervous bearing in mind the number of
changes, and that three issues have been caught in the last few days.

Let's try and resolve the remaining points and then we can have something
really solid ready for the next window.  I don't think there is any
particular rush to get it in at the moment.  I know it's taken a bit
longer than is ideal to get through the review, but we've had a major
holiday in the way which hasn't helped,

Steve.
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
----- Original Message -----
| This new method of managing recovery is an alternative to
| the previous approach of using the userland gfs_controld.
|
| - use dlm slot numbers to assign journal id's
| - use dlm recovery callbacks to initiate journal recovery
| - use a dlm lock to determine the first node to mount fs
| - use a dlm lock to track journals that need recovery
|
| Signed-off-by: David Teigland <teigl...@redhat.com>
| ---

Hi,

(snip)
| +		if (!test_bit_le(i, lvb_bits+JID_BITMAP_OFFSET))

Nit, but I'd prefer " + " rather than "+" with no spaces.

| +			continue;
| +
| +		__clear_bit_le(i, lvb_bits+JID_BITMAP_OFFSET);

Nit, same.

(snip)
| +		__set_bit_le(i, lvb_bits+JID_BITMAP_OFFSET);

Again, please globally change: s/+/ + / except in instances of var++.

(snip)

Other than these nits, I have no concerns (other than what Steve pointed
out).

Regards,

Bob Peterson
Red Hat File Systems
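For reference, the spacing Bob is asking for looks like this (same code,
just with spaces around the binary '+'):

	if (!test_bit_le(i, lvb_bits + JID_BITMAP_OFFSET))
		continue;

	__clear_bit_le(i, lvb_bits + JID_BITMAP_OFFSET);
	...
	__set_bit_le(i, lvb_bits + JID_BITMAP_OFFSET);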
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Thu, Jan 05, 2012 at 04:58:22PM +0000, Steven Whitehouse wrote:
> > +	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
> > +	smp_mb__after_clear_bit();
> > +	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
> > +	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
> > +	return 0;
> > +
> This bit of code, which was correct last time you posted this patch,
> appears to have reverted to its previous incorrect state.  ls_first must

Thanks, I'll move it back; I removed ls_first and put it back in the
wrong place.  I keep forgetting about it because...

> be set before SDF_NOJOURNALID is cleared, otherwise the uninitialised
> value may be read,

in this case there can be no other reader, so it doesn't matter.

Dave
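For clarity, the corrected ordering being requested looks like this (a
fragment, assuming the waiter only reads ls_first after it observes
SDF_NOJOURNALID clear):

	/* publish ls_first before anyone waiting on SDF_NOJOURNALID can run */
	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
	smp_mb__after_clear_bit();	/* order the clear before the wakeup */
	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
	return 0;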
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Thu, 2011-12-22 at 16:23 -0500, David Teigland wrote:
> On Mon, Dec 19, 2011 at 12:47:38PM -0500, David Teigland wrote:
> > On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> > > >  struct lm_lockstruct {
> > > >  	int ls_jid;
> > > >  	unsigned int ls_first;
> > > > -	unsigned int ls_first_done;
> > > >  	unsigned int ls_nodir;
> > > Since ls_flags and ls_first are also only boolean flags, they could
> > > potentially be moved into the flags, though we can always do that
> > > later.
> > yes, I can use a flag in place of ls_first.
>
> I went back to ls_first after finding the flag broke the old code path
> for gfs_controld, and making it work involved changing more of the old
> code than I wanted to in this patch.  We can go back and reorganize how
> some of that old code works (and remove ls_first), in a subsequent
> patch.
>
> > > > +	/*
> > > > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > > > +	 * before the recover_done and control_lock will be ready for us below.
> > > > +	 * A delay here is not required but often avoids having to retry.
> > > > +	 */
> > > > +
> > > > +	msleep(500);
>
> This is now msleep_interruptible(500), I couldn't get around adding some
> sort of delay here.  If we think of another way to delay this I'll be
> happy to try it.
>
> I've finished and tested the rest of the changes.
> https://github.com/teigland/linux-dlm/commits/devel11
> http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next

Ok. I think we are going very much in the right direction.  Let me think
about this over Christmas and let's catch up again in the New Year,

Steve.
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Mon, Dec 19, 2011 at 12:47:38PM -0500, David Teigland wrote:
> On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> > >  struct lm_lockstruct {
> > >  	int ls_jid;
> > >  	unsigned int ls_first;
> > > -	unsigned int ls_first_done;
> > >  	unsigned int ls_nodir;
> > Since ls_flags and ls_first are also only boolean flags, they could
> > potentially be moved into the flags, though we can always do that
> > later.
> yes, I can use a flag in place of ls_first.

I went back to ls_first after finding the flag broke the old code path
for gfs_controld, and making it work involved changing more of the old
code than I wanted to in this patch.  We can go back and reorganize how
some of that old code works (and remove ls_first), in a subsequent patch.

> > > +	/*
> > > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > > +	 * before the recover_done and control_lock will be ready for us below.
> > > +	 * A delay here is not required but often avoids having to retry.
> > > +	 */
> > > +
> > > +	msleep(500);

This is now msleep_interruptible(500), I couldn't get around adding some
sort of delay here.  If we think of another way to delay this I'll be
happy to try it.

I've finished and tested the rest of the changes.
https://github.com/teigland/linux-dlm/commits/devel11
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Tue, 2011-12-20 at 16:04 -0500, David Teigland wrote:
> On Tue, Dec 20, 2011 at 02:16:43PM -0500, David Teigland wrote:
> > On Tue, Dec 20, 2011 at 10:39:08AM +0000, Steven Whitehouse wrote:
> > > > I dislike arbitrary delays also, so I'm hesitant to add them.
> > > > The choices here are:
> > > > - removing NOQUEUE from the requests below, but with NOQUEUE you
> > > >   have a much better chance of killing a mount command, which is a
> > > >   fairly nice feature, I think.
> > > > - removing the delay, which results in nodes often doing
> > > >   fast+repeated lock attempts, which could get rather excessive.
> > > >   I'd be worried about having that kind of unlimited loop sitting
> > > >   there.
> > > > - using some kind of delay.
> > > > While I don't like the look of the delay, I like the other options
> > > > less.  Do you have a preference, or any other ideas?
> > > Well, I'd prefer to just remove the NOQUEUE command in that case, so
> > > that we don't spin here.  The dlm request is async anyway, so we
> > > should be able to wait for it in an interruptible manner and send a
> > > cancel if required.
> > I won't do async+cancel here, that would make the code unnecessarily
> > ugly and complicated.  There's really no reason to be so dogmatic
> > about delays, but since you refuse I'll just make it block, assuming
> > I don't find any new problems with that.
> Now that I look at it, waiting vs blocking on the lock requests is not
> the main issue; removing NOQUEUE doesn't really do anything.  We're
> waiting for the other nodes to finish their work and update the state
> in the lvb.  The only option is to periodically check the lvb, so the
> only choices are to do that as fast as possible (not nice), or
> introduce a delay.

I don't think I understand what's going on in that case.  What I thought
should be happening was this:

- Try to get mounter lock in EX
- If successful, then we are the first mounter, so recover all journals
- Write info into LVB
- Drop mounter lock to PR so other nodes can mount
- If failed to get mounter lock in EX, then wait for lock in PR state
- This will block until the EX lock is dropped to PR
- Read info from LVB

So a node with the mounter lock in EX knows that it is always the first
mounter and will recover all journals before demoting the mounter lock
to PR.  A node with the mounter lock in PR may only recover its own
journal (at mount time).  That makes this the exact equivalent of what we
currently do with the first mounter flag from gfs_controld.

So I guess what I can't quite figure out is how it is possible for the
LVB to be out of sync with the lock state of the mounter lock,

Steve.
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Wed, Dec 21, 2011 at 10:45:21AM +0000, Steven Whitehouse wrote:
> I don't think I understand what's going on in that case.  What I
> thought should be happening was this:
>
> - Try to get mounter lock in EX
> - If successful, then we are the first mounter, so recover all journals
> - Write info into LVB
> - Drop mounter lock to PR so other nodes can mount
> - If failed to get mounter lock in EX, then wait for lock in PR state
> - This will block until the EX lock is dropped to PR
> - Read info from LVB
>
> So a node with the mounter lock in EX knows that it is always the first
> mounter and will recover all journals before demoting the mounter lock
> to PR.  A node with the mounter lock in PR may only recover its own
> journal (at mount time).

I previously used one lock similar to that, but had to change it a bit.
I had to split it across two separate locks, called control_lock and
mounted_lock.  There need to be two because of two conflicting
requirements.

The control_lock lvb is used to communicate the generation number and
jid bits.  Writing the lvb requires an EX lock, and EX prevents others
from continually holding a PR lock.  Without mounted nodes continually
holding a PR lock we can't use EX to indicate first mounter.

So, the mounted_lock (no lvb) is used to indicate the first mounter.
Here all mounted nodes continually hold a PR lock, and a mounting node
attempts to get an EX lock, so any node to get an EX lock is the first
mounter.  (I previously used control_lock with a zero lvb to indicate
first mounter, but there are some fairly common cases where the lvb may
not be zero when we need a first mounter.)

Now back to the reason why we need to retry lock requests and can't just
block.  It's not related to the first mounter case.  When a node mounts,
it needs to wait for other (previously mounted) nodes to update the
control_lock lvb with the latest generation number, and then it also
needs to wait for any bits set in the lvb to be cleared, i.e. it needs
to wait for any unrecovered journals to be recovered before it finishes
mounting.  To do this, it needs to wait in a loop reading the
control_lock lvb.  The question is whether we want to add some sort of
delay to that loop or not, and how: msleep_interruptible(),
schedule_timeout_interruptible(), something else?

Dave
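A minimal sketch of that mounted_lock convention; example_lock() is a
hypothetical synchronous wrapper around dlm_lock(), not the patch's
actual helper:

	/* Illustrative first-mounter check using the mounted_lock (no lvb).
	 * Every mounted node keeps a PR lock, so an EX grant with NOQUEUE
	 * can only happen on the first mounter. */
	static int example_first_mounter(struct lm_lockstruct *ls)
	{
		int error;

		error = example_lock(ls, &ls->ls_mounted_lksb, DLM_LOCK_EX,
				     DLM_LKF_NOQUEUE);
		if (!error)
			return 1;		/* EX granted: we mounted first */
		if (error != -EAGAIN)
			return error;		/* real failure */

		/* someone already mounted: hold PR for the life of the mount */
		return example_lock(ls, &ls->ls_mounted_lksb, DLM_LOCK_PR, 0);
	}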
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Mon, 2011-12-19 at 12:47 -0500, David Teigland wrote:
> On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> > >  struct lm_lockstruct {
> > >  	int ls_jid;
> > >  	unsigned int ls_first;
> > > -	unsigned int ls_first_done;
> > >  	unsigned int ls_nodir;
> > Since ls_flags and ls_first are also only boolean flags, they could
> > potentially be moved into the flags, though we can always do that
> > later.
> yes, I can use a flag in place of ls_first.
>
> > > +	int ls_recover_jid_done;   /* read by gfs_controld */
> > > +	int ls_recover_jid_status; /* read by gfs_controld */
> > ^^^ this isn't actually true any more.  All recent gfs_controld
> > versions take their cue from the uevents, so this is here only for
> > backwards compatibility reasons and these two will be removed at some
> > future date.
> I'll add a longer comment saying something like that.
>
> > > +	/*
> > > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > > +	 * before the recover_done and control_lock will be ready for us below.
> > > +	 * A delay here is not required but often avoids having to retry.
> > > +	 */
> > > +
> > > +	msleep(500);
> > Can we get rid of this then?  I'd rather just wait for the lock,
> > rather than adding delays of arbitrary time periods into the code.
> I dislike arbitrary delays also, so I'm hesitant to add them.
> The choices here are:
> - removing NOQUEUE from the requests below, but with NOQUEUE you have a
>   much better chance of killing a mount command, which is a fairly nice
>   feature, I think.
> - removing the delay, which results in nodes often doing fast+repeated
>   lock attempts, which could get rather excessive.  I'd be worried
>   about having that kind of unlimited loop sitting there.
> - using some kind of delay.
> While I don't like the look of the delay, I like the other options
> less.  Do you have a preference, or any other ideas?

Well, I'd prefer to just remove the NOQUEUE command in that case, so that
we don't spin here.  The dlm request is async anyway, so we should be
able to wait for it in an interruptible manner and send a cancel if
required.

> > > +static int control_first_done(struct gfs2_sbd *sdp)
> > > +{
> > > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > > +	char lvb_bits[GDLM_LVB_SIZE];
> > > +	uint32_t start_gen, block_gen;
> > > +	int error;
> > > +
> > > +restart:
> > > +	spin_lock(&ls->ls_recover_spin);
> > > +	start_gen = ls->ls_recover_start;
> > > +	block_gen = ls->ls_recover_block;
> > > +
> > > +	if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > > +	    !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > > +	    !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > > +		/* sanity check, should not happen */
> > > +		fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > > +		       start_gen, block_gen, ls->ls_recover_flags);
> > > +		spin_unlock(&ls->ls_recover_spin);
> > > +		control_unlock(sdp);
> > > +		return -1;
> > > +	}
> > > +
> > > +	if (start_gen == block_gen) {
> > > +		/*
> > > +		 * Wait for the end of a dlm recovery cycle to switch from
> > > +		 * first mounter recovery.  We can ignore any recover_slot
> > > +		 * callbacks between the recover_prep and next recover_done
> > > +		 * because we are still the first mounter and any failed nodes
> > > +		 * have not fully mounted, so they don't need recovery.
> > > +		 */
> > > +		spin_unlock(&ls->ls_recover_spin);
> > > +		fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > > +		msleep(500);
> > Again - I don't want to add arbitrary delays into the code.  Why is
> > this waiting for half a second?  Why not some other length of time?
> > We should figure out how to wait for the end of the first mounter
> > recovery some other way if that is what is required.
> This msleep slows down a rare loop to wake up a couple times vs once
> with a proper wait mechanism.  It's waiting for the next recover_done()
> callback, which the dlm will call when it is done with recovery.  We do
> have the option here of using a standard wait mechanism, wait_on_bit()
> or something.  I'll see if any of those would work here without adding
> too much to the code.

Ok. That would be a better option I think.

> > > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > > +				 unsigned int result)
> > > +{
> > > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > > +
> > > +	/* don't care about the recovery of own journal during mount */
> > > +	if (jid == ls->ls_jid)
> > > +		return;
> > > +
> > > +	/* another node is recovering the journal, give it a chance to
> > > +	   finish before trying again */
> > > +	if (result == LM_RD_GAVEUP)
> > > +		msleep(1000);
> > Again, let's put in a proper wait for this condition.  If the issue is
> > one of races between cluster nodes (thundering herd type problem),
> > then we might need some kind of back off, but in that case, it should
> > probably be for a random time period.
> In this case, while one node is recovering a journal, the other nodes
> will all try to recover the same journal (and fail), as quickly as they
> can.  I looked at using queue_delayed_work here, but couldn't tell if
> that was ok with zero delay... I now see others use 0, so I'll try it.
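A sketch of the requeue being discussed; queue_delayed_work() with a zero
delay behaves like queue_work().  gfs2_control_wq is the workqueue named
in the patch, but the sd_control_work field and the one-second backoff
are assumptions for illustration:

	/* illustrative retry policy: back off if another node reported
	 * GAVEUP, otherwise requeue the control work immediately */
	if (result == LM_RD_GAVEUP)
		queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, HZ);
	else
		queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);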
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Tue, Dec 20, 2011 at 10:39:08AM +0000, Steven Whitehouse wrote:
> > I dislike arbitrary delays also, so I'm hesitant to add them.
> > The choices here are:
> > - removing NOQUEUE from the requests below, but with NOQUEUE you have
> >   a much better chance of killing a mount command, which is a fairly
> >   nice feature, I think.
> > - removing the delay, which results in nodes often doing fast+repeated
> >   lock attempts, which could get rather excessive.  I'd be worried
> >   about having that kind of unlimited loop sitting there.
> > - using some kind of delay.
> > While I don't like the look of the delay, I like the other options
> > less.  Do you have a preference, or any other ideas?
> Well, I'd prefer to just remove the NOQUEUE command in that case, so
> that we don't spin here.  The dlm request is async anyway, so we should
> be able to wait for it in an interruptible manner and send a cancel if
> required.

I won't do async+cancel here, that would make the code unnecessarily ugly
and complicated.  There's really no reason to be so dogmatic about
delays, but since you refuse I'll just make it block, assuming I don't
find any new problems with that.

> > > Again - I don't want to add arbitrary delays into the code.  Why is
> > > this waiting for half a second?  Why not some other length of time?
> > > We should figure out how to wait for the end of the first mounter
> > > recovery some other way if that is what is required.
> > This msleep slows down a rare loop to wake up a couple times vs once
> > with a proper wait mechanism.  It's waiting for the next
> > recover_done() callback, which the dlm will call when it is done with
> > recovery.  We do have the option here of using a standard wait
> > mechanism, wait_on_bit() or something.  I'll see if any of those
> > would work here without adding too much to the code.
> Ok. That would be a better option I think.

Only if it doesn't make things more (unnecessarily) complex.

Dave
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Tue, Dec 20, 2011 at 02:16:43PM -0500, David Teigland wrote:
> On Tue, Dec 20, 2011 at 10:39:08AM +0000, Steven Whitehouse wrote:
> > > I dislike arbitrary delays also, so I'm hesitant to add them.
> > > The choices here are:
> > > - removing NOQUEUE from the requests below, but with NOQUEUE you
> > >   have a much better chance of killing a mount command, which is a
> > >   fairly nice feature, I think.
> > > - removing the delay, which results in nodes often doing
> > >   fast+repeated lock attempts, which could get rather excessive.
> > >   I'd be worried about having that kind of unlimited loop sitting
> > >   there.
> > > - using some kind of delay.
> > > While I don't like the look of the delay, I like the other options
> > > less.  Do you have a preference, or any other ideas?
> > Well, I'd prefer to just remove the NOQUEUE command in that case, so
> > that we don't spin here.  The dlm request is async anyway, so we
> > should be able to wait for it in an interruptible manner and send a
> > cancel if required.
> I won't do async+cancel here, that would make the code unnecessarily
> ugly and complicated.  There's really no reason to be so dogmatic about
> delays, but since you refuse I'll just make it block, assuming I don't
> find any new problems with that.

Now that I look at it, waiting vs blocking on the lock requests is not
the main issue; removing NOQUEUE doesn't really do anything.  We're
waiting for the other nodes to finish their work and update the state in
the lvb.  The only option is to periodically check the lvb, so the only
choices are to do that as fast as possible (not nice), or introduce a
delay.
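A sketch of that polling loop with hypothetical helpers;
example_control_lock_read() and example_no_jid_bits_set() are
illustrative stand-ins, not functions from the patch:

	/* Illustrative mount-time wait: re-read the control_lock lvb until
	 * the generation has caught up and no journal (jid) bits remain set,
	 * sleeping interruptibly between attempts so the mount can be
	 * killed. */
	static int example_wait_for_recoveries(struct gfs2_sbd *sdp,
					       uint32_t mount_gen)
	{
		char lvb_bits[GDLM_LVB_SIZE];
		uint32_t lvb_gen;

		for (;;) {
			if (example_control_lock_read(sdp, lvb_bits, &lvb_gen))
				return -EIO;	/* hypothetical read helper failed */

			if (lvb_gen >= mount_gen &&
			    example_no_jid_bits_set(lvb_bits))
				return 0;	/* all journals recovered */

			if (msleep_interruptible(500))
				return -EINTR;	/* mount was interrupted */
		}
	}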
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Fri, 2011-12-16 at 16:03 -0600, David Teigland wrote:
> This new method of managing recovery is an alternative to
> the previous approach of using the userland gfs_controld.
>
> - use dlm slot numbers to assign journal id's
> - use dlm recovery callbacks to initiate journal recovery
> - use a dlm lock to determine the first node to mount fs
> - use a dlm lock to track journals that need recovery
>
> Signed-off-by: David Teigland <teigl...@redhat.com>
> ---
>  fs/gfs2/glock.c             |    2 +-
>  fs/gfs2/glock.h             |    7 +-
>  fs/gfs2/incore.h            |   51 ++-
>  fs/gfs2/lock_dlm.c          |  979 ++-
>  fs/gfs2/main.c              |   10 +
>  fs/gfs2/ops_fstype.c        |   29 +-
>  fs/gfs2/recovery.c          |    4 +
>  fs/gfs2/sys.c               |   29 +-
>  fs/gfs2/sys.h               |    2 +
>  include/linux/gfs2_ondisk.h |    2 +
>  10 files changed, 1075 insertions(+), 40 deletions(-)
>
> diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
> index 88e8a23..376816f 100644
> --- a/fs/gfs2/glock.c
> +++ b/fs/gfs2/glock.c
> @@ -1353,7 +1353,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
>  	spin_lock(&gl->gl_spin);
>  	gl->gl_reply = ret;
> -	if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_flags))) {
> +	if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags))) {
>  		if (gfs2_should_freeze(gl)) {
>  			set_bit(GLF_FROZEN, &gl->gl_flags);
>  			spin_unlock(&gl->gl_spin);
> diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
> index 6670711..5b548b07 100644
> --- a/fs/gfs2/glock.h
> +++ b/fs/gfs2/glock.h
> @@ -121,8 +121,11 @@ enum {
>  struct lm_lockops {
>  	const char *lm_proto_name;
> -	int (*lm_mount) (struct gfs2_sbd *sdp, const char *fsname);
> -	void (*lm_unmount) (struct gfs2_sbd *sdp);
> +	int (*lm_mount) (struct gfs2_sbd *sdp, const char *table);
> +	void (*lm_first_done) (struct gfs2_sbd *sdp);
> +	void (*lm_recovery_result) (struct gfs2_sbd *sdp, unsigned int jid,
> +				    unsigned int result);
> +	void (*lm_unmount) (struct gfs2_sbd *sdp);
>  	void (*lm_withdraw) (struct gfs2_sbd *sdp);
>  	void (*lm_put_lock) (struct gfs2_glock *gl);
>  	int (*lm_lock) (struct gfs2_glock *gl, unsigned int req_state,
> diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
> index 892ac37..059e462 100644
> --- a/fs/gfs2/incore.h
> +++ b/fs/gfs2/incore.h
> @@ -139,8 +139,38 @@ struct gfs2_bufdata {
>  #define GDLM_STRNAME_BYTES	25
>  #define GDLM_LVB_SIZE		32
>
> +/*
> + * ls_recover_flags:
> + *
> + * DFL_BLOCK_LOCKS: dlm is in recovery and will grant locks that had been
> + * held by failed nodes whose journals need recovery.  Those locks should
> + * only be used for journal recovery until the journal recovery is done.
> + * This is set by the dlm recover_prep callback and cleared by the
> + * gfs2_control thread when journal recovery is complete.  To avoid
> + * races between recover_prep setting and gfs2_control clearing,
> + * recover_spin is held while changing this bit and reading/writing
> + * recover_block and recover_start.
> + *
> + * DFL_FIRST_MOUNT: this node is the first to mount this fs and is doing
> + * recovery of all journals before allowing other nodes to mount the fs.
> + * This is cleared when FIRST_MOUNT_DONE is set.
> + *
> + * DFL_FIRST_MOUNT_DONE: this node was the first mounter, and has finished
> + * recovery of all journals, and now allows other nodes to mount the fs.
> + *
> + * DFL_MOUNT_DONE: gdlm_mount has completed successfully and cleared
> + * BLOCK_LOCKS for the first time.  The gfs2_control thread should now
> + * control clearing BLOCK_LOCKS for further recoveries.
> + *
> + * DFL_UNMOUNT: gdlm_unmount sets to keep sdp off gfs2_control_wq.
> + */
> +
>  enum {
>  	DFL_BLOCK_LOCKS		= 0,
> +	DFL_FIRST_MOUNT		= 1,
> +	DFL_FIRST_MOUNT_DONE	= 2,
> +	DFL_MOUNT_DONE		= 3,
> +	DFL_UNMOUNT		= 4,
>  };
>
>  struct lm_lockname {
> @@ -504,14 +534,26 @@ struct gfs2_sb_host {
>  struct lm_lockstruct {
>  	int ls_jid;
>  	unsigned int ls_first;
> -	unsigned int ls_first_done;
>  	unsigned int ls_nodir;

Since ls_flags and ls_first are also only boolean flags, they could
potentially be moved into the flags, though we can always do that later.

>  	const struct lm_lockops *ls_ops;
> -	unsigned long ls_flags;
>  	dlm_lockspace_t *ls_dlm;
> -	int ls_recover_jid_done;
> -	int ls_recover_jid_status;
> +	int ls_recover_jid_done;   /* read by gfs_controld */
> +	int ls_recover_jid_status; /* read by gfs_controld */

^^^ this isn't actually true any more.  All recent gfs_controld versions
take their cue from the uevents, so this is here only for backwards
compatibility reasons and these two will be removed at some future date.

> +
> +	struct dlm_lksb ls_mounted_lksb; /* mounted_lock */
> +	struct dlm_lksb ls_control_lksb; /* control_lock */
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
Hi,

On Fri, 2011-12-16 at 16:03 -0600, David Teigland wrote:
> This new method of managing recovery is an alternative to
> the previous approach of using the userland gfs_controld.
>
> - use dlm slot numbers to assign journal id's
> - use dlm recovery callbacks to initiate journal recovery
> - use a dlm lock to determine the first node to mount fs
> - use a dlm lock to track journals that need recovery
>
> Signed-off-by: David Teigland <teigl...@redhat.com>
> ---
>  fs/gfs2/glock.c             |    2 +-
>  fs/gfs2/glock.h             |    7 +-
>  fs/gfs2/incore.h            |   51 ++-
>  fs/gfs2/lock_dlm.c          |  979 ++-
>  fs/gfs2/main.c              |   10 +
>  fs/gfs2/ops_fstype.c        |   29 +-
>  fs/gfs2/recovery.c          |    4 +
>  fs/gfs2/sys.c               |   29 +-
>  fs/gfs2/sys.h               |    2 +
>  include/linux/gfs2_ondisk.h |    2 +
>  10 files changed, 1075 insertions(+), 40 deletions(-)
[snip]
> diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
> index 20f63b0..bacb7af 100644
> --- a/fs/gfs2/lock_dlm.c
> +++ b/fs/gfs2/lock_dlm.c
> @@ -1,6 +1,6 @@
>  /*
>   * Copyright (C) Sistina Software, Inc. 1997-2003 All rights reserved.
> - * Copyright (C) 2004-2009 Red Hat, Inc. All rights reserved.
> + * Copyright (C) 2004-2011 Red Hat, Inc. All rights reserved.
>   *
>   * This copyrighted material is made available to anyone wishing to use,
>   * modify, copy, or redistribute it subject to the terms and conditions
> @@ -11,12 +11,16 @@
>  #include <linux/dlm.h>
>  #include <linux/slab.h>
>  #include <linux/types.h>
> +#include <linux/delay.h>
> +#include <linux/gfs2_ondisk.h>
>  #include <linux/gfs2_ondisk.h>

Also, just spotted that we only need one copy of gfs2_ondisk.h

Steve.
Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination
On Mon, Dec 19, 2011 at 01:07:38PM +0000, Steven Whitehouse wrote:
> >  struct lm_lockstruct {
> >  	int ls_jid;
> >  	unsigned int ls_first;
> > -	unsigned int ls_first_done;
> >  	unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.

yes, I can use a flag in place of ls_first.

> > +	int ls_recover_jid_done;   /* read by gfs_controld */
> > +	int ls_recover_jid_status; /* read by gfs_controld */
> ^^^ this isn't actually true any more.  All recent gfs_controld versions
> take their cue from the uevents, so this is here only for backwards
> compatibility reasons and these two will be removed at some future date.

I'll add a longer comment saying something like that.

> > +	/*
> > +	 * Other nodes need to do some work in dlm recovery and gfs2_control
> > +	 * before the recover_done and control_lock will be ready for us below.
> > +	 * A delay here is not required but often avoids having to retry.
> > +	 */
> > +
> > +	msleep(500);
> Can we get rid of this then?  I'd rather just wait for the lock, rather
> than adding delays of arbitrary time periods into the code.

I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:

- removing NOQUEUE from the requests below, but with NOQUEUE you have a
  much better chance of killing a mount command, which is a fairly nice
  feature, I think.

- removing the delay, which results in nodes often doing fast+repeated
  lock attempts, which could get rather excessive.  I'd be worried about
  having that kind of unlimited loop sitting there.

- using some kind of delay.

While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?

> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +	char lvb_bits[GDLM_LVB_SIZE];
> > +	uint32_t start_gen, block_gen;
> > +	int error;
> > +
> > +restart:
> > +	spin_lock(&ls->ls_recover_spin);
> > +	start_gen = ls->ls_recover_start;
> > +	block_gen = ls->ls_recover_block;
> > +
> > +	if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +	    !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +		/* sanity check, should not happen */
> > +		fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +		       start_gen, block_gen, ls->ls_recover_flags);
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		control_unlock(sdp);
> > +		return -1;
> > +	}
> > +
> > +	if (start_gen == block_gen) {
> > +		/*
> > +		 * Wait for the end of a dlm recovery cycle to switch from
> > +		 * first mounter recovery.  We can ignore any recover_slot
> > +		 * callbacks between the recover_prep and next recover_done
> > +		 * because we are still the first mounter and any failed nodes
> > +		 * have not fully mounted, so they don't need recovery.
> > +		 */
> > +		spin_unlock(&ls->ls_recover_spin);
> > +		fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +		msleep(500);
> Again - I don't want to add arbitrary delays into the code.  Why is this
> waiting for half a second?  Why not some other length of time?  We
> should figure out how to wait for the end of the first mounter recovery
> some other way if that is what is required.

This msleep slows down a rare loop to wake up a couple times vs once with
a proper wait mechanism.

It's waiting for the next recover_done() callback, which the dlm will
call when it is done with recovery.  We do have the option here of using
a standard wait mechanism, wait_on_bit() or something.  I'll see if any
of those would work here without adding too much to the code.

> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +				 unsigned int result)
> > +{
> > +	struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +	/* don't care about the recovery of own journal during mount */
> > +	if (jid == ls->ls_jid)
> > +		return;
> > +
> > +	/* another node is recovering the journal, give it a chance to
> > +	   finish before trying again */
> > +	if (result == LM_RD_GAVEUP)
> > +		msleep(1000);
> Again, let's put in a proper wait for this condition.  If the issue is
> one of races between cluster nodes (thundering herd type problem), then
> we might need some kind of back off, but in that case, it should
> probably be for a random time period.

In this case, while one node is recovering a journal, the other nodes
will all try to recover the same journal (and fail), as quickly as they
can.  I looked at using queue_delayed_work here, but couldn't tell if
that was ok with zero delay... I now see others use 0, so I'll try it.

> > +	error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
> > +