Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-09 Thread Steven Whitehouse
Hi,

On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
 This new method of managing recovery is an alternative to
 the previous approach of using the userland gfs_controld.
 
 - use dlm slot numbers to assign journal id's
 - use dlm recovery callbacks to initiate journal recovery
 - use a dlm lock to determine the first node to mount fs
 - use a dlm lock to track journals that need recovery
 
 Signed-off-by: David Teigland teigl...@redhat.com
 ---
  fs/gfs2/glock.c |2 +-
  fs/gfs2/glock.h |7 +-
  fs/gfs2/incore.h|   58 +++-
  fs/gfs2/lock_dlm.c  |  993 ++-
  fs/gfs2/main.c  |   10 +
  fs/gfs2/ops_fstype.c|   29 +-
  fs/gfs2/recovery.c  |4 +
  fs/gfs2/sys.c   |   33 +-
  fs/gfs2/sys.h   |2 +
  include/linux/gfs2_ondisk.h |2 +
  10 files changed, 1098 insertions(+), 42 deletions(-)
 

I've just been looking at this again, and a question springs to mind...
how does this deal with nodes which are read-only or spectator mounts?
In the old system we used to propagate that information to gfs_controld
but I've not spotted anything similar in the patch so far, so I'm
wondering whether it needs to know that information or not,

Steve.




Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-09 Thread David Teigland
On Mon, Jan 09, 2012 at 04:36:30PM +, Steven Whitehouse wrote:
 On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
  This new method of managing recovery is an alternative to
  the previous approach of using the userland gfs_controld.
  
  - use dlm slot numbers to assign journal id's
  - use dlm recovery callbacks to initiate journal recovery
  - use a dlm lock to determine the first node to mount fs
  - use a dlm lock to track journals that need recovery
 
 I've just been looking at this again, and a question springs to mind...
 how does this deal with nodes which are read-only or spectator mounts?
 In the old system we used to propagate that information to gfs_controld
 but I've not spotted anything similar in the patch so far, so I'm
 wondering whether it needs to know that information or not,

The dlm allocates a slot for all lockspace members, so spectator mounts
(like readonly mounts) would be given a slot/jid.  In gfs_controld,
spectator mounts are not given a jid (that came from the time when
adding a journal required extending the device+fs.)  These days, there's
probably no meaningful difference between spectator and readonly mounts.
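
For illustration, the slot-to-jid mapping described here could sit in the
dlm recover_done callback roughly as follows.  This is a sketch only: the
callback signature comes from the dlm slots patches earlier in the series,
but the placement and the "jid not yet assigned" check are assumptions,
not quotes of the posted patch.

    static void gdlm_recover_done(void *arg, struct dlm_slot *slots, int num_slots,
                                  int our_slot, uint32_t generation)
    {
            struct gfs2_sbd *sdp = arg;
            struct lm_lockstruct *ls = &sdp->sd_lockstruct;

            spin_lock(&ls->ls_recover_spin);
            /* dlm slots are 1-based, gfs2 journal ids are 0-based; every
               lockspace member, spectator or not, ends up with a jid */
            if (ls->ls_jid < 0)
                    ls->ls_jid = our_slot - 1;
            ls->ls_recover_start = generation;
            spin_unlock(&ls->ls_recover_spin);
    }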



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-09 Thread David Teigland
On Mon, Jan 09, 2012 at 11:46:26AM -0500, David Teigland wrote:
 On Mon, Jan 09, 2012 at 04:36:30PM +, Steven Whitehouse wrote:
  On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
   This new method of managing recovery is an alternative to
   the previous approach of using the userland gfs_controld.
   
   - use dlm slot numbers to assign journal id's
   - use dlm recovery callbacks to initiate journal recovery
   - use a dlm lock to determine the first node to mount fs
   - use a dlm lock to track journals that need recovery
  
  I've just been looking at this again, and a question springs to mind...
  how does this deal with nodes which are read-only or spectator mounts?
  In the old system we used to propagate that information to gfs_controld
  but I've not spotted anything similar in the patch so far, so I'm
  wondering whether it needs to know that information or not,
 
 The dlm allocates a slot for all lockspace members, so spectator mounts
 (like readonly mounts) would be given a slot/jid.  In gfs_controld,
 spectator mounts are not given a jid (that came from the time when
 adding a journal required extending the device+fs.)  These days, there's
 probably no meaningful difference between spectator and readonly mounts.

There's one other part, and that's whether a readonly or spectator node
should attempt to recover the journal of a failed node.  In cluster3 this
decision was always a bit mixed up, with some logic in gfs_controld and
some in gfs2.

We should make a clear decision now and include it in this patch.
I think gfs2_recover_func() should return GAVEUP right at the start
for any of the cases where you don't want it doing recovery.  What
cases would you prefer?
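
To make the suggestion concrete, the early bail-out could look something
like the sketch below.  It is not the patch itself: the jd_work field name
and the read-only test are assumptions, and which mount modes belong in
the check is exactly the open question.

    void gfs2_recover_func(struct work_struct *work)
    {
            struct gfs2_jdesc *jd = container_of(work, struct gfs2_jdesc, jd_work);
            struct gfs2_sbd *sdp = GFS2_SB(jd->jd_inode);

            /* spectator mounts, and read-only mounts asked to replay some
               other node's journal, give up at once so another node can try */
            if (sdp->sd_args.ar_spectator ||
                (jd->jd_jid != sdp->sd_lockstruct.ls_jid &&
                 (sdp->sd_vfs->s_flags & MS_RDONLY))) {
                    gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
                    return;
            }

            /* ... normal journal replay continues here ... */
    }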



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-09 Thread Steven Whitehouse
Hi,

On Mon, 2012-01-09 at 11:46 -0500, David Teigland wrote:
 On Mon, Jan 09, 2012 at 04:36:30PM +, Steven Whitehouse wrote:
  On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
   This new method of managing recovery is an alternative to
   the previous approach of using the userland gfs_controld.
   
   - use dlm slot numbers to assign journal id's
   - use dlm recovery callbacks to initiate journal recovery
   - use a dlm lock to determine the first node to mount fs
   - use a dlm lock to track journals that need recovery
  
  I've just been looking at this again, and a question springs to mind...
  how does this deal with nodes which are read-only or spectator mounts?
  In the old system we used to propagate that information to gfs_controld
  but I've not spotted anything similar in the patch so far, so I'm
  wondering whether it needs to know that information or not,
 
 The dlm allocates a slot for all lockspace members, so spectator mounts
 (like readonly mounts) would be given a slot/jid.  In gfs_controld,
 spectator mounts are not given a jid (that came from the time when
 adding a journal required extending the device+fs.)  These days, there's
 probably no meaningful difference between spectator and readonly mounts.
 

The issue is more related to recovery time, though. For spectator
mounts, we don't have to care about recovery at all, and if one fails,
it does not need to be recovered at the fs level. Spectator mounts can
check the journals, but cannot recover any, so as the first mounter of
the filesystem, they must fail to mount if any journals are left dirty.

For read-only mounts, it is less clear... the first read-only mounter of
the fs must recover all journals. After that, currently, read-only nodes
do not perform recovery, although we could change that - it isn't clear
what the correct path is here, so we need to pick one and stick with it.
If a read-only node fails, we do not need to recover it, since there is
nothing to do (as with spectator mounts).
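
Spelled out as code, that policy could look something like the
first-mounter loop below.  This is purely illustrative:
gfs2_journal_clean() and recover_one_journal() are hypothetical helpers
standing in for whatever the real ops_fstype.c/recovery.c paths end up
doing.

    static int first_mounter_recover_all(struct gfs2_sbd *sdp)
    {
            struct gfs2_jdesc *jd;
            int error;

            list_for_each_entry(jd, &sdp->sd_jindex_list, jd_list) {
                    if (sdp->sd_args.ar_spectator) {
                            /* spectator: may only check, never replay */
                            if (!gfs2_journal_clean(jd))
                                    return -EPERM;  /* dirty journal, fail the mount */
                            continue;
                    }
                    /* a read-only or read-write first mounter replays everything */
                    error = recover_one_journal(jd);
                    if (error)
                            return error;
            }
            return 0;
    }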

What I want to avoid is having a cluster of read-only mounted nodes,
having one fail, and then having the rest of the cluster stuck because
it's trying to recover the journal for the failed node and there are no
nodes left in the cluster which are able to perform that recovery.

If assigning a slot to spectator mounts means that spectator mounts now
have (effectively) a journal id assigned to them, even if they never
touch it, then that is a change we need to be careful to document, in
case someone has a small filesystem shared by a number of spectator
mounters and is thus unable to create more journals when they upgrade
to the new system.

Steve.




Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-09 Thread Steven Whitehouse
Hi,

On Mon, 2012-01-09 at 12:00 -0500, David Teigland wrote:
 On Mon, Jan 09, 2012 at 11:46:26AM -0500, David Teigland wrote:
  On Mon, Jan 09, 2012 at 04:36:30PM +, Steven Whitehouse wrote:
   On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
This new method of managing recovery is an alternative to
the previous approach of using the userland gfs_controld.

- use dlm slot numbers to assign journal id's
- use dlm recovery callbacks to initiate journal recovery
- use a dlm lock to determine the first node to mount fs
- use a dlm lock to track journals that need recovery
   
   I've just been looking at this again, and a question springs to mind...
   how does this deal with nodes which are read-only or spectator mounts?
   In the old system we used to propagate that information to gfs_controld
   but I've not spotted anything similar in the patch so far, so I'm
   wondering whether it needs to know that information or not,
  
  The dlm allocates a slot for all lockspace members, so spectator mounts
  (like readonly mounts) would be given a slot/jid.  In gfs_controld,
  spectator mounts are not given a jid (that came from the time when
  adding a journal required extending the device+fs.)  These days, there's
  probably no meaningful difference between spectator and readonly mounts.
 
 There's one other part, and that's whether a readonly or spectator node
 should attempt to recover the journal of a failed node.  In cluster3 this
 decision was always a bit mixed up, with some logic in gfs_controld and
 some in gfs2.
 
 We should make a clear decision now and include it in this patch.
 I think gfs2_recover_func() should return GAVEUP right at the start
 for any of the cases where you don't want it doing recovery.  What
 cases would you prefer?
 

Yes, if it can't recover, then that's a good idea. We also need to ensure
that we are not trying to recover nodes which don't need recovery, though
(see my earlier email).

Steve.




Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread Bob Peterson
- Original Message -
| This new method of managing recovery is an alternative to
| the previous approach of using the userland gfs_controld.
| 
| - use dlm slot numbers to assign journal id's
| - use dlm recovery callbacks to initiate journal recovery
| - use a dlm lock to determine the first node to mount fs
| - use a dlm lock to track journals that need recovery
| 
| Signed-off-by: David Teigland teigl...@redhat.com
| ---
| --- a/fs/gfs2/lock_dlm.c
| +++ b/fs/gfs2/lock_dlm.c
(snip)
| +#include <linux/gfs2_ondisk.h>
|  #include <linux/gfs2_ondisk.h>

Hi,

Dave, are you going to post a replacement patch or addendum patch
that addresses Steve's concerns, such as the above?
I'd like to review this, but I want to review the latest/greatest.

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread David Teigland
On Thu, Jan 05, 2012 at 10:08:15AM -0500, Bob Peterson wrote:
 - Original Message -
 | This new method of managing recovery is an alternative to
 | the previous approach of using the userland gfs_controld.
 | 
 | - use dlm slot numbers to assign journal id's
 | - use dlm recovery callbacks to initiate journal recovery
 | - use a dlm lock to determine the first node to mount fs
 | - use a dlm lock to track journals that need recovery
 | 
 | Signed-off-by: David Teigland teigl...@redhat.com
 | ---
 | --- a/fs/gfs2/lock_dlm.c
 | +++ b/fs/gfs2/lock_dlm.c
 (snip)
 | +#include <linux/gfs2_ondisk.h>
 |  #include <linux/gfs2_ondisk.h>
 
 Hi,
 
 Dave, are you going to post a replacement patch or addendum patch
 that addresses Steve's concerns, such as the above?
 I'd like to review this, but I want to review the latest/greatest.

I haven't resent the patches after making the changes (which were fairly
minor.)  I'll resend them shortly for another check before a pull request.

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread Steven Whitehouse
Hi,

On Thu, 2012-01-05 at 10:21 -0500, David Teigland wrote:
 On Thu, Jan 05, 2012 at 10:08:15AM -0500, Bob Peterson wrote:
  - Original Message -
  | This new method of managing recovery is an alternative to
  | the previous approach of using the userland gfs_controld.
  | 
  | - use dlm slot numbers to assign journal id's
  | - use dlm recovery callbacks to initiate journal recovery
  | - use a dlm lock to determine the first node to mount fs
  | - use a dlm lock to track journals that need recovery
  | 
  | Signed-off-by: David Teigland teigl...@redhat.com
  | ---
  | --- a/fs/gfs2/lock_dlm.c
  | +++ b/fs/gfs2/lock_dlm.c
  (snip)
  | +#include <linux/gfs2_ondisk.h>
  |  #include <linux/gfs2_ondisk.h>
  
  Hi,
  
  Dave, are you going to post a replacement patch or addendum patch
  that addresses Steve's concerns, such as the above?
  I'd like to review this, but I want to review the latest/greatest.
 
 I haven't resent the patches after making the changes (which were fairly
 minor.)  I'll resend them shortly for another check before a pull request.
 
 Dave
 

I think it would be a good plan to not send this last patch for the
current merge window and to let it settle for a bit longer. Running the
timing this fine makes me nervous, bearing in mind the number of changes
and that three issues have been caught in the last few days.

Let's try to resolve the remaining points and then we can have something
really solid ready for the next window. I don't think there is any
particular rush to get it in at the moment.

I know it's taken a bit longer than is ideal to get through the review,
but we've had a major holiday in the way, which hasn't helped,

Steve.




Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread Bob Peterson
- Original Message -
| This new method of managing recovery is an alternative to
| the previous approach of using the userland gfs_controld.
| 
| - use dlm slot numbers to assign journal id's
| - use dlm recovery callbacks to initiate journal recovery
| - use a dlm lock to determine the first node to mount fs
| - use a dlm lock to track journals that need recovery
| 
| Signed-off-by: David Teigland teigl...@redhat.com
| ---
Hi,

(snip)
| + if (!test_bit_le(i, lvb_bits+JID_BITMAP_OFFSET))
Nit, but I'd prefer " + " rather than "+" with no spaces.
| + continue;
| +
| + __clear_bit_le(i, lvb_bits+JID_BITMAP_OFFSET);
Nit, same.
(snip)
| + __set_bit_le(i, lvb_bits+JID_BITMAP_OFFSET);
Again, please globally change: s/+/ + / except in instances of var++.
(snip)

Other than these nits, I have no concerns (other than what Steve pointed out).

Regards,

Bob Peterson
Red Hat File Systems



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread David Teigland
On Thu, Jan 05, 2012 at 04:58:22PM +, Steven Whitehouse wrote:
  +   clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
  +   smp_mb__after_clear_bit();
  +   wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
  +   ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
  +   return 0;
  +
 
 This bit of code, which was correct last time you posted this patch,
 appears to have reverted to its previous incorrect state. ls_first must

Thanks, I'll move it back, I removed ls_first and put it back in the wrong
place.  I keep forgetting about it because...

 be set before SDF_NOJOURNALID is cleared, otherwise the uninitialised
 value may be read,

in this case there can be no other reader, so it doesn't matter.
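
For the record, the ordering being asked for only needs the ls_first
assignment moved above the wake-up, i.e. (sketch of the reordered hunk):

      +   ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
      +   clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
      +   smp_mb__after_clear_bit();
      +   wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
      +   return 0;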

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-23 Thread Steven Whitehouse
Hi,

On Thu, 2011-12-22 at 16:23 -0500, David Teigland wrote:
 On Mon, Dec 19, 2011 at 12:47:38PM -0500, David Teigland wrote:
  On Mon, Dec 19, 2011 at 01:07:38PM +, Steven Whitehouse wrote:
 struct lm_lockstruct {
int ls_jid;
unsigned int ls_first;
-   unsigned int ls_first_done;
unsigned int ls_nodir;
   Since ls_flags and ls_first are also only boolean flags, they could
   potentially be moved into the flags, though we can always do that later.
  
  yes, I can use a flag in place of ls_first.
 
 I went back to ls_first after finding the flag broke the old code path for
 gfs_controld, and making it work involved changing more of the old code
 than I wanted to in this patch.  We can go back and reorganize how some of
 that old code works (and remove ls_first), in a subsequent patch.
 
+   /*
+    * Other nodes need to do some work in dlm recovery and gfs2_control
+    * before the recover_done and control_lock will be ready for us below.
+    * A delay here is not required but often avoids having to retry.
+    */
+
+   msleep(500);
 
 This is now msleep_interruptible(500); I couldn't get around adding some
 sort of delay here.  If we think of another way to delay this I'll be
 happy to try it.
 
 I've finished and tested the rest of the changes.
 https://github.com/teigland/linux-dlm/commits/devel11
 http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next
 

Ok. I think we are going very much in the right direction. Let me think
about this over Christmas and let's catch up again in the New Year,

Steve.




Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-22 Thread David Teigland
On Mon, Dec 19, 2011 at 12:47:38PM -0500, David Teigland wrote:
 On Mon, Dec 19, 2011 at 01:07:38PM +, Steven Whitehouse wrote:
struct lm_lockstruct {
 int ls_jid;
 unsigned int ls_first;
   - unsigned int ls_first_done;
 unsigned int ls_nodir;
  Since ls_flags and ls_first are also only boolean flags, they could
  potentially be moved into the flags, though we can always do that later.
 
 yes, I can use a flag in place of ls_first.

I went back to ls_first after finding the flag broke the old code path for
gfs_controld, and making it work involved changing more of the old code
than I wanted to in this patch.  We can go back and reorganize how some of
that old code works (and remove ls_first), in a subsequent patch.

   + /*
   +  * Other nodes need to do some work in dlm recovery and gfs2_control
   +  * before the recover_done and control_lock will be ready for us below.
   +  * A delay here is not required but often avoids having to retry.
   +  */
   +
   + msleep(500);

This is now msleep_interruptible(500); I couldn't get around adding some
sort of delay here.  If we think of another way to delay this I'll be
happy to try it.

I've finished and tested the rest of the changes.
https://github.com/teigland/linux-dlm/commits/devel11
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-21 Thread Steven Whitehouse
Hi,

On Tue, 2011-12-20 at 16:04 -0500, David Teigland wrote:
 On Tue, Dec 20, 2011 at 02:16:43PM -0500, David Teigland wrote:
  On Tue, Dec 20, 2011 at 10:39:08AM +, Steven Whitehouse wrote:
I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below, but with NOQUEUE you have a
  much better chance of killing a mount command, which is a fairly nice
  feature, I think.
- removing the delay, which results in nodes often doing fast+repeated
  lock attempts, which could get rather excessive.  I'd be worried about
  having that kind of unlimited loop sitting there.
- using some kind of delay.

While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?

   Well, I'd prefer to just remove the NOQUEUE command in that case, so
   that we don't spin here. The dlm request is async anyway, so we should
   be able to wait for it in an interruptible manner and send a cancel if
   required.
  
  I won't do async+cancel here, that would make the code unnecessarily ugly
  and complicated.  There's really no reason to be so dogmatic about delays,
  but since you refuse I'll just make it block, assuming I don't find any
  new problems with that.
 
 Now that I look at it, waiting vs blocking on the lock requests is not the
 main issue; removing NOQUEUE doesn't really do anything.  We're waiting
 for the other nodes to finish their work and update the state in the lvb.
 The only option is to periodically check the lvb, so the only choices are
 to do that as fast as possible (not nice), or introduce a delay.
 

I don't think I understand what's going on in that case. What I thought
should be happening was this:

 - Try to get mounter lock in EX
   - If successful, then we are the first mounter so recover all
journals
   - Write info into LVB
   - Drop mounter lock to PR so other nodes can mount

 - If failed to get mounter lock in EX, then wait for lock in PR state
   - This will block until the EX lock is dropped to PR
   - Read info from LVB

So a node with the mounter lock in EX knows that it is always the first
mounter and will recover all journals before demoting the mounter lock
to PR. A node with the mounter lock in PR may only recover its own
journal (at mount time).

That makes this the exact equivalent of what we currently do with the
first mounter flag from gfs_controld.

So I guess what I can't quite figure out is how it is possible for the
LVB to be out of sync with the lock state of the mounter lock,

Steve.





Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-21 Thread David Teigland
On Wed, Dec 21, 2011 at 10:45:21AM +, Steven Whitehouse wrote:
 I don't think I understand what's going on in that case. What I thought
 should be happening was this:
 
  - Try to get mounter lock in EX
- If successful, then we are the first mounter so recover all
  journals
- Write info into LVB
- Drop mounter lock to PR so other nodes can mount
 
  - If failed to get mounter lock in EX, then wait for lock in PR state
- This will block until the EX lock is dropped to PR
- Read info from LVB
 
 So a node with the mounter lock in EX knows that it is always the first
 mounter and will recover all journals before demoting the mounter lock
 to PR. A node with the mounter lock in PR may only recover its own
 journal (at mount time).

I previously used one lock similar to that, but had to change it a bit.
I had to split it across two separate locks, called control_lock and
mounted_lock.  There need to be two because of two conflicting requirements.

The control_lock lvb is used to communicate the generation number and jid
bits.  Writing the lvb requires an EX lock, and EX prevents others from
continually holding a PR lock.  Without mounted nodes continually holding
a PR lock we can't use EX to indicate first mounter.

So, the mounted_lock (no lvb) is used to indicate the first mounter.
Here all mounted nodes continually hold a PR lock, and a mounting node
attempts to get an EX lock, so any node to get an EX lock is the first
mounter.

(I previously used control_lock with zero lvb to indicate first mounter,
but there are some fairly common cases where the lvb may not be zero when
we need a first mounter.)
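
As a rough sketch of the mounted_lock scheme just described (simplified:
mounted_lock() stands for the synchronous lock wrapper in the patch, and
the error handling and initial lock acquisition details are omitted):

    static int try_first_mounter(struct gfs2_sbd *sdp)
    {
            int error;

            /* try to become the first mounter: EX, without queueing behind
               the PR locks held by already-mounted nodes */
            error = mounted_lock(sdp, DLM_LOCK_EX, DLM_LKF_NOQUEUE);
            if (!error)
                    return 1;       /* we hold EX: first mounter */

            /* not first: join the other mounted nodes holding PR */
            error = mounted_lock(sdp, DLM_LOCK_PR, 0);
            if (error)
                    return error;
            return 0;
    }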

Now back to the reason why we need to retry lock requests and can't just
block.  It's not related to the first mounter case.  When a node mounts,
it needs to wait for other (previously mounted) nodes to update the
control_lock lvb with the latest generation number, and then it also needs
to wait for any bits set in the lvb to be cleared.  i.e. it needs to wait
for any unrecovered journals to be recovered before it finishes mounting.

To do this, it needs to wait in a loop reading the control_lock lvb.  The
question is whether we want to add some sort of delay to that loop or not,
and how.  msleep_interruptible(), schedule_timeout_interruptible(),
something else?
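
For concreteness, the loop in question has roughly this shape.  This is
illustrative only: control_lock()/control_unlock() and the lvb layout are
from the patch, but the lock mode and flags, control_lvb_read(),
all_jid_bits_clear() and the surrounding function are stand-ins.

    static int control_wait_others(struct gfs2_sbd *sdp, uint32_t mount_gen)
    {
            char lvb_bits[GDLM_LVB_SIZE];
            uint32_t lvb_gen;
            int error;

            for (;;) {
                    /* take control_lock just long enough to read the lvb */
                    error = control_lock(sdp, DLM_LOCK_CR, DLM_LKF_VALBLK);
                    if (error)
                            return error;
                    control_lvb_read(&sdp->sd_lockstruct, &lvb_gen, lvb_bits);
                    control_unlock(sdp);

                    /* done once previously mounted nodes have bumped the
                       generation and no journals are flagged for recovery */
                    if (lvb_gen > mount_gen && all_jid_bits_clear(lvb_bits))
                            return 0;

                    /* the delay under discussion */
                    msleep_interruptible(500);
            }
    }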

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-20 Thread Steven Whitehouse
Hi,

On Mon, 2011-12-19 at 12:47 -0500, David Teigland wrote:
 On Mon, Dec 19, 2011 at 01:07:38PM +, Steven Whitehouse wrote:
struct lm_lockstruct {
 int ls_jid;
 unsigned int ls_first;
   - unsigned int ls_first_done;
 unsigned int ls_nodir;
  Since ls_flags and ls_first are also only boolean flags, they could
  potentially be moved into the flags, though we can always do that later.
 
 yes, I can use a flag in place of ls_first.
 
   + int ls_recover_jid_done; /* read by gfs_controld */
   + int ls_recover_jid_status; /* read by gfs_controld */
^^^ this isn't
  actually true any more. All recent gfs_controld versions take their cue
  from the uevents, so this is here only for backwards compatibility
  reasons and these two will be removed at some future date.
 
 I'll add a longer comment saying something like that.
 
   + /*
   +  * Other nodes need to do some work in dlm recovery and gfs2_control
   +  * before the recover_done and control_lock will be ready for us below.
   +  * A delay here is not required but often avoids having to retry.
   +  */
   +
   + msleep(500);
  Can we get rid of this then? I'd rather just wait for the lock, rather
  than adding delays of arbitrary time periods into the code.
 
 I dislike arbitrary delays also, so I'm hesitant to add them.
 The choices here are:
 - removing NOQUEUE from the requests below, but with NOQUEUE you have a
   much better chance of killing a mount command, which is a fairly nice
   feature, I think.
 - removing the delay, which results in nodes often doing fast+repeated
   lock attempts, which could get rather excessive.  I'd be worried about
   having that kind of unlimited loop sitting there.
 - using some kind of delay.
 
 While I don't like the look of the delay, I like the other options less.
 Do you have a preference, or any other ideas?
 
Well, I'd prefer to just remove the NOQUEUE command in that case, so
that we don't spin here. The dlm request is async anyway, so we should
be able to wait for it in an interruptible manner and send a cancel if
required.

 
   +static int control_first_done(struct gfs2_sbd *sdp)
   +{
   + struct lm_lockstruct *ls = &sdp->sd_lockstruct;
   + char lvb_bits[GDLM_LVB_SIZE];
   + uint32_t start_gen, block_gen;
   + int error;
   +
   +restart:
   + spin_lock(&ls->ls_recover_spin);
   + start_gen = ls->ls_recover_start;
   + block_gen = ls->ls_recover_block;
   +
   + if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
   +     !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
   +     !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
   +         /* sanity check, should not happen */
   +         fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
   +                start_gen, block_gen, ls->ls_recover_flags);
   +         spin_unlock(&ls->ls_recover_spin);
   +         control_unlock(sdp);
   +         return -1;
   + }
   +
   + if (start_gen == block_gen) {
   +         /*
   +          * Wait for the end of a dlm recovery cycle to switch from
   +          * first mounter recovery.  We can ignore any recover_slot
   +          * callbacks between the recover_prep and next recover_done
   +          * because we are still the first mounter and any failed nodes
   +          * have not fully mounted, so they don't need recovery.
   +          */
   +         spin_unlock(&ls->ls_recover_spin);
   +         fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
   +         msleep(500);
  Again - I don't want to add arbitrary delays into the code. Why is this
  waiting for half a second? Why not some other length of time? We should
  figure out how to wait for the end of the first mounter recovery some
  other way if that is what is required.
 
 This msleep slows down a rare loop to wake up a couple times vs once with
 a proper wait mechanism.  It's waiting for the next recover_done()
 callback, which the dlm will call when it is done with recovery.  We do
 have the option here of using a standard wait mechanism, wait_on_bit() or
 something.  I'll see if any of those would work here without adding too
 much to the code.
 
Ok. That would be a better option I think.

 
   +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
   +  unsigned int result)
   +{
   + struct lm_lockstruct *ls = &sdp->sd_lockstruct;
   +
   + /* don't care about the recovery of own journal during mount */
   + if (jid == ls->ls_jid)
   +         return;
   +
   + /* another node is recovering the journal, give it a chance to
   +finish before trying again */
   + if (result == LM_RD_GAVEUP)
   + msleep(1000);
  Again, lets put in a proper wait for this condition. If the issue is one
  of races between cluster nodes (thundering herd type problem), then we
  might need some kind of back off, but in that case, it should probably
  be for a random time period.
 
 In this case, while one node is recovering a 

Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-20 Thread David Teigland
On Tue, Dec 20, 2011 at 10:39:08AM +, Steven Whitehouse wrote:
  I dislike arbitrary delays also, so I'm hesitant to add them.
  The choices here are:
  - removing NOQUEUE from the requests below, but with NOQUEUE you have a
much better chance of killing a mount command, which is a fairly nice
feature, I think.
  - removing the delay, which results in nodes often doing fast+repeated
lock attempts, which could get rather excessive.  I'd be worried about
having that kind of unlimited loop sitting there.
  - using some kind of delay.
  
  While I don't like the look of the delay, I like the other options less.
  Do you have a preference, or any other ideas?
  
 Well, I'd prefer to just remove the NOQUEUE command in that case, so
 that we don't spin here. The dlm request is async anyway, so we should
 be able to wait for it in an interruptible manner and send a cancel if
 required.

I won't do async+cancel here, that would make the code unnecessarily ugly
and complicated.  There's really no reason to be so dogmatic about delays,
but since you refuse I'll just make it block, assuming I don't find any
new problems with that.

   Again - I don't want to add arbitrary delays into the code. Why is this
   waiting for half a second? Why not some other length of time? We should
   figure out how to wait for the end of the first mounter recovery some
   other way if that is what is required.
  
  This msleep slows down a rare loop to wake up a couple times vs once with
  a proper wait mechanism.  It's waiting for the next recover_done()
  callback, which the dlm will call when it is done with recovery.  We do
  have the option here of using a standard wait mechanism, wait_on_bit() or
  something.  I'll see if any of those would work here without adding too
  much to the code.
  
 Ok. That would be a better option I think.

Only if it doesn't make things more (unnecessarily) complex.

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-20 Thread David Teigland
On Tue, Dec 20, 2011 at 02:16:43PM -0500, David Teigland wrote:
 On Tue, Dec 20, 2011 at 10:39:08AM +, Steven Whitehouse wrote:
   I dislike arbitrary delays also, so I'm hesitant to add them.
   The choices here are:
   - removing NOQUEUE from the requests below, but with NOQUEUE you have a
 much better chance of killing a mount command, which is a fairly nice
 feature, I think.
   - removing the delay, which results in nodes often doing fast+repeated
 lock attempts, which could get rather excessive.  I'd be worried about
 having that kind of unlimited loop sitting there.
   - using some kind of delay.
   
   While I don't like the look of the delay, I like the other options less.
   Do you have a preference, or any other ideas?
   
  Well, I'd prefer to just remove the NOQUEUE command in that case, so
  that we don't spin here. The dlm request is async anyway, so we should
  be able to wait for it in an interruptible manner and send a cancel if
  required.
 
 I won't do async+cancel here, that would make the code unnecessarily ugly
 and complicated.  There's really no reason to be so dogmatic about delays,
 but since you refuse I'll just make it block, assuming I don't find any
 new problems with that.

Now that I look at it, waiting vs blocking on the lock requests is not the
main issue; removing NOQUEUE doesn't really do anything.  We're waiting
for the other nodes to finish their work and update the state in the lvb.
The only option is to periodically check the lvb, so the only choices are
to do that as fast as possible (not nice), or introduce a delay.



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-19 Thread Steven Whitehouse
On Fri, 2011-12-16 at 16:03 -0600, David Teigland wrote:
 This new method of managing recovery is an alternative to
 the previous approach of using the userland gfs_controld.
 
 - use dlm slot numbers to assign journal id's
 - use dlm recovery callbacks to initiate journal recovery
 - use a dlm lock to determine the first node to mount fs
 - use a dlm lock to track journals that need recovery
 
 Signed-off-by: David Teigland teigl...@redhat.com
 ---
  fs/gfs2/glock.c |2 +-
  fs/gfs2/glock.h |7 +-
  fs/gfs2/incore.h|   51 ++-
  fs/gfs2/lock_dlm.c  |  979 ++-
  fs/gfs2/main.c  |   10 +
  fs/gfs2/ops_fstype.c|   29 +-
  fs/gfs2/recovery.c  |4 +
  fs/gfs2/sys.c   |   29 +-
  fs/gfs2/sys.h   |2 +
  include/linux/gfs2_ondisk.h |2 +
  10 files changed, 1075 insertions(+), 40 deletions(-)
 
 diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
 index 88e8a23..376816f 100644
 --- a/fs/gfs2/glock.c
 +++ b/fs/gfs2/glock.c
 @@ -1353,7 +1353,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
    spin_lock(&gl->gl_spin);
    gl->gl_reply = ret;
   
  - if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_flags))) {
  + if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags))) {
    if (gfs2_should_freeze(gl)) {
            set_bit(GLF_FROZEN, &gl->gl_flags);
            spin_unlock(&gl->gl_spin);
 diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
 index 6670711..5b548b07 100644
 --- a/fs/gfs2/glock.h
 +++ b/fs/gfs2/glock.h
 @@ -121,8 +121,11 @@ enum {
  
  struct lm_lockops {
   const char *lm_proto_name;
 - int (*lm_mount) (struct gfs2_sbd *sdp, const char *fsname);
 - void (*lm_unmount) (struct gfs2_sbd *sdp);
 + int (*lm_mount) (struct gfs2_sbd *sdp, const char *table);
 + void (*lm_first_done) (struct gfs2_sbd *sdp);
 + void (*lm_recovery_result) (struct gfs2_sbd *sdp, unsigned int jid,
 + unsigned int result);
 + void (*lm_unmount) (struct gfs2_sbd *sdp);
   void (*lm_withdraw) (struct gfs2_sbd *sdp);
   void (*lm_put_lock) (struct gfs2_glock *gl);
   int (*lm_lock) (struct gfs2_glock *gl, unsigned int req_state,
 diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
 index 892ac37..059e462 100644
 --- a/fs/gfs2/incore.h
 +++ b/fs/gfs2/incore.h
 @@ -139,8 +139,38 @@ struct gfs2_bufdata {
  #define GDLM_STRNAME_BYTES   25
  #define GDLM_LVB_SIZE        32
  
 +/*
 + * ls_recover_flags:
 + *
 + * DFL_BLOCK_LOCKS: dlm is in recovery and will grant locks that had been
 + * held by failed nodes whose journals need recovery.  Those locks should
 + * only be used for journal recovery until the journal recovery is done.
 + * This is set by the dlm recover_prep callback and cleared by the
 + * gfs2_control thread when journal recovery is complete.  To avoid
 + * races between recover_prep setting and gfs2_control clearing, recover_spin
 + * is held while changing this bit and reading/writing recover_block
 + * and recover_start.
 + *
 + * DFL_FIRST_MOUNT: this node is the first to mount this fs and is doing
 + * recovery of all journals before allowing other nodes to mount the fs.
 + * This is cleared when FIRST_MOUNT_DONE is set.
 + *
 + * DFL_FIRST_MOUNT_DONE: this node was the first mounter, and has finished
 + * recovery of all journals, and now allows other nodes to mount the fs.
 + *
 + * DFL_MOUNT_DONE: gdlm_mount has completed successfully and cleared
 + * BLOCK_LOCKS for the first time.  The gfs2_control thread should now
 + * control clearing BLOCK_LOCKS for further recoveries.
 + *
 + * DFL_UNMOUNT: gdlm_unmount sets to keep sdp off gfs2_control_wq.
 + */
 +
  enum {
   DFL_BLOCK_LOCKS = 0,
 + DFL_FIRST_MOUNT = 1,
 + DFL_FIRST_MOUNT_DONE= 2,
 + DFL_MOUNT_DONE  = 3,
 + DFL_UNMOUNT = 4,
  };
  
  struct lm_lockname {
 @@ -504,14 +534,26 @@ struct gfs2_sb_host {
  struct lm_lockstruct {
   int ls_jid;
   unsigned int ls_first;
 - unsigned int ls_first_done;
   unsigned int ls_nodir;
Since ls_flags and ls_first are also only boolean flags, they could
potentially be moved into the flags, though we can always do that later.

   const struct lm_lockops *ls_ops;
 - unsigned long ls_flags;
   dlm_lockspace_t *ls_dlm;
  
 - int ls_recover_jid_done;
 - int ls_recover_jid_status;
 + int ls_recover_jid_done; /* read by gfs_controld */
 + int ls_recover_jid_status; /* read by gfs_controld */
  ^^^ this isn't
actually true any more. All recent gfs_controld versions take their cue
from the uevents, so this is here only for backwards compatibility
reasons and these two will be removed at some future date.

 +
 + struct dlm_lksb ls_mounted_lksb; /* mounted_lock */
 + struct dlm_lksb ls_control_lksb; /* control_lock */

Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-19 Thread Steven Whitehouse
Hi,

On Fri, 2011-12-16 at 16:03 -0600, David Teigland wrote:
 This new method of managing recovery is an alternative to
 the previous approach of using the userland gfs_controld.
 
 - use dlm slot numbers to assign journal id's
 - use dlm recovery callbacks to initiate journal recovery
 - use a dlm lock to determine the first node to mount fs
 - use a dlm lock to track journals that need recovery
 
 Signed-off-by: David Teigland teigl...@redhat.com
 ---
  fs/gfs2/glock.c |2 +-
  fs/gfs2/glock.h |7 +-
  fs/gfs2/incore.h|   51 ++-
  fs/gfs2/lock_dlm.c  |  979 ++-
  fs/gfs2/main.c  |   10 +
  fs/gfs2/ops_fstype.c|   29 +-
  fs/gfs2/recovery.c  |4 +
  fs/gfs2/sys.c   |   29 +-
  fs/gfs2/sys.h   |2 +
  include/linux/gfs2_ondisk.h |2 +
  10 files changed, 1075 insertions(+), 40 deletions(-)
[snip]
 diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
 index 20f63b0..bacb7af 100644
 --- a/fs/gfs2/lock_dlm.c
 +++ b/fs/gfs2/lock_dlm.c
 @@ -1,6 +1,6 @@
  /*
   * Copyright (C) Sistina Software, Inc.  1997-2003 All rights reserved.
 - * Copyright (C) 2004-2009 Red Hat, Inc.  All rights reserved.
 + * Copyright (C) 2004-2011 Red Hat, Inc.  All rights reserved.
   *
   * This copyrighted material is made available to anyone wishing to use,
   * modify, copy, or redistribute it subject to the terms and conditions
 @@ -11,12 +11,16 @@
  #include <linux/dlm.h>
  #include <linux/slab.h>
  #include <linux/types.h>
 +#include <linux/delay.h>
 +#include <linux/gfs2_ondisk.h>
  #include <linux/gfs2_ondisk.h>
  
Also, just spotted that we only need one copy of gfs2_ondisk.h

Steve.




Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-19 Thread David Teigland
On Mon, Dec 19, 2011 at 01:07:38PM +, Steven Whitehouse wrote:
   struct lm_lockstruct {
  int ls_jid;
  unsigned int ls_first;
  -   unsigned int ls_first_done;
  unsigned int ls_nodir;
 Since ls_flags and ls_first are also only boolean flags, they could
 potentially be moved into the flags, though we can always do that later.

yes, I can use a flag in place of ls_first.

  +   int ls_recover_jid_done; /* read by gfs_controld */
  +   int ls_recover_jid_status; /* read by gfs_controld */
   ^^^ this isn't
 actually true any more. All recent gfs_controld versions take their cue
 from the uevents, so this is here only for backwards compatibility
 reasons and these two will be removed at some future date.

I'll add a longer comment saying something like that.

  +   /*
  +* Other nodes need to do some work in dlm recovery and gfs2_control
  +* before the recover_done and control_lock will be ready for us below.
  +* A delay here is not required but often avoids having to retry.
  +*/
  +
  +   msleep(500);
 Can we get rid of this then? I'd rather just wait for the lock, rather
 than adding delays of arbitrary time periods into the code.

I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below, but with NOQUEUE you have a
  much better chance of killing a mount command, which is a fairly nice
  feature, I think.
- removing the delay, which results in nodes often doing fast+repeated
  lock attempts, which could get rather excessive.  I'd be worried about
  having that kind of unlimited loop sitting there.
- using some kind of delay.

While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?


  +static int control_first_done(struct gfs2_sbd *sdp)
  +{
  +   struct lm_lockstruct *ls = &sdp->sd_lockstruct;
  +   char lvb_bits[GDLM_LVB_SIZE];
  +   uint32_t start_gen, block_gen;
  +   int error;
  +
  +restart:
  +   spin_lock(&ls->ls_recover_spin);
  +   start_gen = ls->ls_recover_start;
  +   block_gen = ls->ls_recover_block;
  +
  +   if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
  +       !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
  +       !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
  +           /* sanity check, should not happen */
  +           fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
  +                  start_gen, block_gen, ls->ls_recover_flags);
  +           spin_unlock(&ls->ls_recover_spin);
  +           control_unlock(sdp);
  +           return -1;
  +   }
  +
  +   if (start_gen == block_gen) {
  +           /*
  +            * Wait for the end of a dlm recovery cycle to switch from
  +            * first mounter recovery.  We can ignore any recover_slot
  +            * callbacks between the recover_prep and next recover_done
  +            * because we are still the first mounter and any failed nodes
  +            * have not fully mounted, so they don't need recovery.
  +            */
  +           spin_unlock(&ls->ls_recover_spin);
  +           fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
  +           msleep(500);
 Again - I don't want to add arbitrary delays into the code. Why is this
 waiting for half a second? Why not some other length of time? We should
 figure out how to wait for the end of the first mounter recovery some
 other way if that is what is required.

This msleep slows down a rare loop to wake up a couple times vs once with
a proper wait mechanism.  It's waiting for the next recover_done()
callback, which the dlm will call when it is done with recovery.  We do
have the option here of using a standard wait mechanism, wait_on_bit() or
something.  I'll see if any of those would work here without adding too
much to the code.
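
One possible shape for such a wait, sketched only to show how little code
it would need: ls_recover_wait is an invented waitqueue that would have to
be added to lm_lockstruct and initialised at mount time; it is not in the
posted patch.

    /* in the dlm recover_done callback, after ls_recover_start is updated: */
            wake_up(&ls->ls_recover_wait);

    /* in control_first_done(), replacing the msleep(500) before the retry
       (a lockless read of ls_recover_start is fine for a wake-up test): */
            error = wait_event_interruptible(ls->ls_recover_wait,
                                             ls->ls_recover_start != start_gen);
            if (error)
                    return error;
            goto restart;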


  +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
  +unsigned int result)
  +{
  +   struct lm_lockstruct *ls = &sdp->sd_lockstruct;
  +
  +   /* don't care about the recovery of own journal during mount */
  +   if (jid == ls->ls_jid)
  +           return;
  +
  +   /* another node is recovering the journal, give it a chance to
  +  finish before trying again */
  +   if (result == LM_RD_GAVEUP)
  +   msleep(1000);
 Again, lets put in a proper wait for this condition. If the issue is one
 of races between cluster nodes (thundering herd type problem), then we
 might need some kind of back off, but in that case, it should probably
 be for a random time period.

In this case, while one node is recovering a journal, the other nodes will
all try to recover the same journal (and fail), as quickly as they can.  I
looked at using queue_delayed_work here, but couldn't tell if that was ok
with zero delay... I now see others use 0, so I'll try it.
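
For reference, the requeue described would look roughly like this inside
gdlm_recovery_result().  gfs2_control_wq is the workqueue from the patch;
sd_control_work as the delayed-work item is an assumed name for the
wiring, and whether the delay should be 0 or a short timeout is exactly
the open question.

            if (result == LM_RD_GAVEUP) {
                    /* another node is already replaying this journal; requeue
                       the control work instead of sleeping in this context */
                    queue_delayed_work(gfs2_control_wq, &sdp->sd_control_work, 0);
                    return;
            }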


  +   error = dlm_new_lockspace(fsname, cluster, flags, GDLM_LVB_SIZE,
  +