Re: [Cluster-devel] [PATCH] fenced: send dbus signal when node is fenced

2011-02-03 Thread David Teigland
> +void dbus_init (void)

No space before (

Also, it would be a good idea to put a fenced-specific prefix before
fenced's own dbus functions, e.g. fd_dbus_init(), because dbus_ is the
dbus lib's namespace and open to symbol collisions.

> +{
> +#ifdef DBUS
> +
> +if (!(bus = dbus_bus_get_private (DBUS_BUS_SYSTEM, NULL))) {
> + log_error ("failed to get dbus connection");
> +} else {
> + log_debug ("connected to dbus %s", dbus_bus_get_unique_name (bus));
> +}
> +
> +#endif
> +
> +return;
> +}

It may be neater to put all of these functions under a single #ifdef DBUS
and then add empty stubs in an #else.
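
For instance, a minimal sketch combining both suggestions (the fd_ prefix
and a single #ifdef with stubs in the #else); the bodies here are
illustrative, not the actual patch:

#ifdef DBUS

static DBusConnection *bus;

void fd_dbus_init(void)
{
	if (!(bus = dbus_bus_get_private(DBUS_BUS_SYSTEM, NULL)))
		log_error("failed to get dbus connection");
	else
		log_debug("connected to dbus %s",
			  dbus_bus_get_unique_name(bus));
}

#else

void fd_dbus_init(void) { }

#endif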

> +#ifdef DBUS
> +#include <dbus/dbus.h>
> +#endif

could that go in dbus.c?

> + if (!optd_disable_dbus) {
> + dbus_init();

In this case, the optd/cfgd pair is redundant, but technically this should
be testing the cfgd_disable_dbus value (the optd just indicates if the
option has been set on the command line).
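
For illustration, a sketch following the convention described above (not
the actual code):

	/* optd_disable_dbus only records that the option was given on
	   the command line; cfgd_disable_dbus holds the effective
	   config value, so that is the one to test */
	if (!cfgd_disable_dbus)
		dbus_init();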



Re: [Cluster-devel] [PATCH] fenced: send dbus signal when node is fenced

2011-02-03 Thread David Teigland
On Thu, Feb 03, 2011 at 01:26:07PM -0600, Ryan O'Hara wrote:
> This patch adds the ability to send a dbus signal when a node is fenced.
> This code can reestablish a connection with dbus if necessary.

ACK



Re: [Cluster-devel] [PATCH] dlm: Reset fs_notified when check_fs_done

2011-02-22 Thread David Teigland
On Tue, Feb 22, 2011 at 04:35:42PM +0800, Jiaju Zhang wrote:
> On Tue, Nov 9, 2010 at 6:06 AM, David Teigland  wrote:
> > On Mon, Nov 08, 2010 at 11:05:49PM +0800, Jiaju Zhang wrote:
> >> Luckily, things have changed now. One user hit this issue two months
> >> ago and was also kind enough to test the patch. The result is that
> >> the patch really works.
> >>
> >> Attached is the log from before they applied the patch. This time the
> >> log already includes the debugging messages added by commit
> >> 27b09badd40a2d1500500fa6945aeb532f75bd13, so we can see what really
> >> happens at the user's site.
> >> (The log is a bit large uncompressed, because the spinning printed
> >> many messages to the log.)
> >>
> >> I have rebased the patch against the current upstream code. Thank you
> >> in advance for your review :)
> >
> > Thanks, the patch looks good, I'll push this out.
> 
> Hi David, I haven't found this patch in
> git://git.fedorahosted.org/dlm.git; was it missed?

Sorry, it's in the cluster.git STABLE31 branch, but I forgot dlm.git, I've
just pushed it there too.
Dave



[Cluster-devel] fenced: don't ignore victim_done messages for reduced victims

2011-02-22 Thread David Teigland

Needs ACK for RHEL6.


When a victim is "reduced" (i.e. fenced skips fencing it because it
rejoins the cluster cleanly before fenced fences it), it is immediately
removed from the list of victims, before the "victim_done" message is
sent for it.  The victim_done message updates the time of the last
successful fencing operation for a failed node.

The code that processes received victim_done messages was ignoring the
message for the reduced victim because the node couldn't be found in
the victims list.  This caused the latest fencing information to not be
recorded for the node, causing dlm_controld to wait indefinitely for
fencing to complete for the reduced victim.

The fix is to simply record the information from a victim_done message
even if the node is not in the victims list.

bz 678704

Signed-off-by: David Teigland 
---
 fence/fenced/cpg.c |   18 --
 1 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/fence/fenced/cpg.c b/fence/fenced/cpg.c
index a8629b9..99e16a0 100644
--- a/fence/fenced/cpg.c
+++ b/fence/fenced/cpg.c
@@ -652,9 +652,9 @@ static void receive_victim_done(struct fd *fd, struct fd_header *hd, int len)
 
node = get_node_victim(fd, id->nodeid);
if (!node) {
+   /* see comment below about no node */
log_debug("receive_victim_done %d:%u no victim nodeid %d",
  hd->nodeid, seq, id->nodeid);
-   return;
}
 
log_debug("receive_victim_done %d:%u remove victim %d time %llu how %d",
@@ -670,9 +670,11 @@ static void receive_victim_done(struct fd *fd, struct fd_header *hd, int len)
if (hd->nodeid == our_nodeid) {
/* sanity check, I don't think this should happen;
   see comment in fence_victims() */
-   if (!node->local_victim_done)
-   log_error("expect local_victim_done");
-   node->local_victim_done = 0;
+   if (node) {
+   if (!node->local_victim_done)
+   log_error("expect local_victim_done");
+   node->local_victim_done = 0;
+   }
} else {
/* save details of fencing operation from master, which
   master saves at the time it completes it */
@@ -680,8 +682,12 @@ static void receive_victim_done(struct fd *fd, struct fd_header *hd, int len)
   id->fence_how, id->fence_time);
}
 
-   list_del(&node->list);
-   free(node);
+   /* we can have no node when reduce_victims() removes it, bz 678704 */
+
+   if (node) {
+   list_del(&node->list);
+   free(node);
+   }
 }
 
 /* we know that the quorum value here is consistent with the cpg events
-- 
1.7.1.1



Re: [Cluster-devel] RFC: generic improvement to fence agents api

2011-03-21 Thread David Teigland
On Sat, Mar 19, 2011 at 07:34:55AM +0100, Fabio M. Di Nitto wrote:
> My suggestion would be to allow to specify a list of ports instead.

This comes up now and then.  The current rule of one action per agent
execution is a tried and true, fundamental property of the agent api.
It should not be changed IMO.  I'll need some time to come up with the
various specific reasons against it, but at least one of them (a big
one) is partial failure/success.
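
For illustration, a hypothetical sketch of why a port list is a problem
(not real agent code):

	/* hypothetical agent acting on a list of ports in one run */
	for (i = 0; i < num_ports; i++) {
		rv = fence_port(ports[i]);
		if (rv)
			break;	/* ports 0..i-1 are fenced, port i is
				   not: a single exit status cannot
				   express this partial result, so the
				   caller cannot tell what state the
				   victims are in */
	}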

Dave



Re: [Cluster-devel] [PATCH] dlm: Remove superfluous call to recalc_sigpending()

2011-03-28 Thread David Teigland
On Thu, Mar 24, 2011 at 01:56:47PM +, Matt Fleming wrote:
> From: Matt Fleming 
> 
> recalc_sigpending() is called within sigprocmask(), so there is no
> need to call it again after sigprocmask() has returned.

Thanks, pushed to dlm.git next.
Dave



[Cluster-devel] dlm_controld: clear waiting plocks for closed files

2011-05-23 Thread David Teigland
The new CLOSE flag is set in unlock operations that are
generated by the vfs removing locks that were not unlocked
by the process, when the process closes the file or exits.

The kernel does not take a reply for these unlock-close
operations.

plock requests can now be interrupted in the kernel when the
process is killed.  So the unlock-close also needs to clear
any waiting plocks that were abandoned by the killed process.

The corresponding kernel patch:
https://lkml.org/lkml/2011/5/23/237

Signed-off-by: David Teigland 
---
 group/dlm_controld/plock.c |   28 
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/group/dlm_controld/plock.c b/group/dlm_controld/plock.c
index f27001f..6d5dea8 100644
--- a/group/dlm_controld/plock.c
+++ b/group/dlm_controld/plock.c
@@ -631,6 +631,27 @@ static int unlock_internal(struct lockspace *ls, struct resource *r,
return rv;
 }
 
+static void clear_waiters(struct lockspace *ls, struct resource *r,
+ struct dlm_plock_info *in)
+{
+   struct lock_waiter *w, *safe;
+
+   list_for_each_entry_safe(w, safe, &r->waiters, list) {
+   if (w->info.nodeid != in->nodeid || w->info.owner != in->owner)
+   continue;
+
+   list_del(&w->list);
+
+   log_plock_error(ls, "clear waiter %llx %llx-%llx %d/%u/%llx",
+   (unsigned long long)in->number,
+   (unsigned long long)in->start,
+   (unsigned long long)in->end,
+   in->nodeid, in->pid,
+   (unsigned long long)in->owner);
+   free(w);
+   }
+}
+
 static int add_waiter(struct lockspace *ls, struct resource *r,
  struct dlm_plock_info *in)
 
@@ -716,9 +737,16 @@ static void do_unlock(struct lockspace *ls, struct dlm_plock_info *in,
 
rv = unlock_internal(ls, r, in);
 
+   if (in->flags & DLM_PLOCK_FL_CLOSE) {
+   clear_waiters(ls, r, in);
+   /* no replies for unlock-close ops */
+   goto skip_result;
+   }
+
if (in->nodeid == our_nodeid)
write_result(ls, in, rv);
 
+ skip_result:
do_waiters(ls, r);
put_resource(r);
 }
-- 
1.7.1.1



Re: [Cluster-devel] [PATCH 26/34] dlm: Drop __TIME__ usage

2011-05-25 Thread David Teigland
On Wed, May 25, 2011 at 10:47:16PM +0200, Michal Marek wrote:
> Dne 5.4.2011 16:59, Michal Marek napsal(a):
> > The kernel already prints its build timestamp during boot, no need to
> > repeat it in random drivers and produce different object files each
> > time.
> > 
> > Cc: Christine Caulfield 
> > Cc: David Teigland 
> > Cc: cluster-devel@redhat.com
> > Signed-off-by: Michal Marek 
> > ---
> >  fs/dlm/main.c |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> Hi,
> 
> I don't see this patch in today's linux-next. Any objection against me
> applying it to the kbuild-2.6.git repository?

No, please do.
Thanks,
Dave



Re: [Cluster-devel] [PATCH 0/3] dlm_controld: Improvement on plock searching efficiency

2011-05-27 Thread David Teigland
On Fri, May 27, 2011 at 07:44:03AM +0800, Jiaju Zhang wrote:
> This series introduces an RB tree for improving plock resource search
> efficiency. We hit this performance issue when running Samba on top of
> a cluster filesystem; profiling during nbench runs with num-progs=500,
> the dlm_controld process could often be observed consuming an entire
> CPU core.

Thanks, the patches look good, I'll push them out to dlm.git.
I'm curious how many resources (files) you had before you saw
significant benefit from this patch?
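
For reference, the lookup the series replaces is a linear list walk,
O(n) per request; an rb-tree makes it O(log n).  A minimal sketch of
such a lookup using the kernel-style rbtree API, with made-up
struct/field names, not the actual patches:

	static struct resource *search_resource(struct rb_root *root,
						uint64_t number)
	{
		struct rb_node *n = root->rb_node;
		struct resource *r;

		while (n) {
			r = rb_entry(n, struct resource, rb_node);
			if (number < r->number)
				n = n->rb_left;
			else if (number > r->number)
				n = n->rb_right;
			else
				return r;
		}
		return NULL;
	}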

Dave



Re: [Cluster-devel] [PATCH] fs, dlm: Don't leak, don't do pointless NULL checks and use kzalloc

2011-06-29 Thread David Teigland
On Wed, Jun 29, 2011 at 11:09:27PM +0200, Jesper Juhl wrote:
> In fs/dlm/lock.c in the dlm_scan_waiters() function there are 3 small
> issues:
> 
> 1) first time through the loop we allocate memory for 'warned', if we
> then (in the loop) don't take the "if (!warned)" path and loop again,
> the second time through the loop we'll allocate memory again and store
> it to 'warned' without freeing the previous allocation - this leaks
> memory.

I don't think so; num_nodes won't be set to zero.

> 2) There's no need to test the return value of the allocation and do a
> memset if it succeeds. Just use kzalloc() to obtain zeroed memory.

fine

> 3) Since kfree() handles NULL pointers gracefully, the test of
> 'warned' against NULL before the kfree() after the loop is completely
> pointless. Remove it.

fine

ack if you want to push those two out yourself.
Dave



Re: [Cluster-devel] [PATCH] fs, dlm: Don't leak, don't do pointless NULL checks and use kzalloc

2011-06-29 Thread David Teigland
On Wed, Jun 29, 2011 at 11:51:00PM +0200, Jesper Juhl wrote:
> > I don't think so; num_nodes won't be set to zero.
> 
> Hmm. How so?  Maybe I'm missing something obvious, but;
> num_nodes is initialized to zero at the beginning of the function, which 
> means that we'll definitely do the first allocation in the loop.

Zero is meant to mean "first time through the loop".
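
The pattern in question is roughly this (a paraphrase of
dlm_scan_waiters(), shown with the suggested kzalloc, not the exact
code):

	int num_nodes = 0;
	int *warned = NULL;

	list_for_each_entry(lkb, &ls->ls_waiters, lkb_wait_reply) {
		/* ... */
		if (!num_nodes) {
			/* first time through the loop; ls_num_nodes is
			   never zero for a usable lockspace, so this
			   allocation happens at most once */
			num_nodes = ls->ls_num_nodes;
			warned = kzalloc(num_nodes * sizeof(int),
					 GFP_KERNEL);
		}
		/* ... */
	}
	kfree(warned);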

> We then set num_nodes equal to ls->ls_num_nodes - what guarantees that 
> this will not be zero so we won't do a second allocation (and leak) the 
> second time through the loop?

That's just the nature of a lockspace, I guess -- it doesn't make sense or
exist without nodes in it.  I doubt any of the dlm code would work if that
weren't true.

Dave



[Cluster-devel] [RFT] dlm: replace lkb hash table with idr

2011-07-06 Thread David Teigland
Request for testing

I'm looking at possible improvements to the dlm hash tables.  This patch
keeps lkbs in an idr instead of a hash table.  Before pushing this patch
further, I'd like to know if it makes any difference in environments using
millions of locks on each node.


From: David Teigland 
Date: Tue, 5 Jul 2011 17:58:26 -0500
Subject: [PATCH] dlm: keep lkbs in idr

Instead of strange hash table.

Signed-off-by: David Teigland 
---
 fs/dlm/config.c   |7 ---
 fs/dlm/config.h   |1 -
 fs/dlm/dlm_internal.h |   14 ++
 fs/dlm/lock.c |   69 ++---
 fs/dlm/lockspace.c|  114 +++--
 5 files changed, 81 insertions(+), 124 deletions(-)

diff --git a/fs/dlm/config.c b/fs/dlm/config.c
index 9b026ea..00376d8 100644
--- a/fs/dlm/config.c
+++ b/fs/dlm/config.c
@@ -92,7 +92,6 @@ struct dlm_cluster {
unsigned int cl_tcp_port;
unsigned int cl_buffer_size;
unsigned int cl_rsbtbl_size;
-   unsigned int cl_lkbtbl_size;
unsigned int cl_dirtbl_size;
unsigned int cl_recover_timer;
unsigned int cl_toss_secs;
@@ -107,7 +106,6 @@ enum {
CLUSTER_ATTR_TCP_PORT = 0,
CLUSTER_ATTR_BUFFER_SIZE,
CLUSTER_ATTR_RSBTBL_SIZE,
-   CLUSTER_ATTR_LKBTBL_SIZE,
CLUSTER_ATTR_DIRTBL_SIZE,
CLUSTER_ATTR_RECOVER_TIMER,
CLUSTER_ATTR_TOSS_SECS,
@@ -160,7 +158,6 @@ __CONFIGFS_ATTR(name, 0644, name##_read, name##_write)
 CLUSTER_ATTR(tcp_port, 1);
 CLUSTER_ATTR(buffer_size, 1);
 CLUSTER_ATTR(rsbtbl_size, 1);
-CLUSTER_ATTR(lkbtbl_size, 1);
 CLUSTER_ATTR(dirtbl_size, 1);
 CLUSTER_ATTR(recover_timer, 1);
 CLUSTER_ATTR(toss_secs, 1);
@@ -174,7 +171,6 @@ static struct configfs_attribute *cluster_attrs[] = {
[CLUSTER_ATTR_TCP_PORT] = &cluster_attr_tcp_port.attr,
[CLUSTER_ATTR_BUFFER_SIZE] = &cluster_attr_buffer_size.attr,
[CLUSTER_ATTR_RSBTBL_SIZE] = &cluster_attr_rsbtbl_size.attr,
-   [CLUSTER_ATTR_LKBTBL_SIZE] = &cluster_attr_lkbtbl_size.attr,
[CLUSTER_ATTR_DIRTBL_SIZE] = &cluster_attr_dirtbl_size.attr,
[CLUSTER_ATTR_RECOVER_TIMER] = &cluster_attr_recover_timer.attr,
[CLUSTER_ATTR_TOSS_SECS] = &cluster_attr_toss_secs.attr,
@@ -435,7 +431,6 @@ static struct config_group *make_cluster(struct config_group *g,
cl->cl_tcp_port = dlm_config.ci_tcp_port;
cl->cl_buffer_size = dlm_config.ci_buffer_size;
cl->cl_rsbtbl_size = dlm_config.ci_rsbtbl_size;
-   cl->cl_lkbtbl_size = dlm_config.ci_lkbtbl_size;
cl->cl_dirtbl_size = dlm_config.ci_dirtbl_size;
cl->cl_recover_timer = dlm_config.ci_recover_timer;
cl->cl_toss_secs = dlm_config.ci_toss_secs;
@@ -983,7 +978,6 @@ int dlm_our_addr(struct sockaddr_storage *addr, int num)
 #define DEFAULT_TCP_PORT   21064
 #define DEFAULT_BUFFER_SIZE 4096
 #define DEFAULT_RSBTBL_SIZE 1024
-#define DEFAULT_LKBTBL_SIZE 1024
 #define DEFAULT_DIRTBL_SIZE 1024
 #define DEFAULT_RECOVER_TIMER  5
 #define DEFAULT_TOSS_SECS 10
@@ -997,7 +991,6 @@ struct dlm_config_info dlm_config = {
.ci_tcp_port = DEFAULT_TCP_PORT,
.ci_buffer_size = DEFAULT_BUFFER_SIZE,
.ci_rsbtbl_size = DEFAULT_RSBTBL_SIZE,
-   .ci_lkbtbl_size = DEFAULT_LKBTBL_SIZE,
.ci_dirtbl_size = DEFAULT_DIRTBL_SIZE,
.ci_recover_timer = DEFAULT_RECOVER_TIMER,
.ci_toss_secs = DEFAULT_TOSS_SECS,
diff --git a/fs/dlm/config.h b/fs/dlm/config.h
index dd0ce24..2605744 100644
--- a/fs/dlm/config.h
+++ b/fs/dlm/config.h
@@ -20,7 +20,6 @@ struct dlm_config_info {
int ci_tcp_port;
int ci_buffer_size;
int ci_rsbtbl_size;
-   int ci_lkbtbl_size;
int ci_dirtbl_size;
int ci_recover_timer;
int ci_toss_secs;
diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index 0262451..23a234b 100644
--- a/fs/dlm/dlm_internal.h
+++ b/fs/dlm/dlm_internal.h
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -52,7 +53,6 @@ struct dlm_ls;
 struct dlm_lkb;
 struct dlm_rsb;
 struct dlm_member;
-struct dlm_lkbtable;
 struct dlm_rsbtable;
 struct dlm_dirtable;
 struct dlm_direntry;
@@ -108,11 +108,6 @@ struct dlm_rsbtable {
spinlock_t  lock;
 };
 
-struct dlm_lkbtable {
-   struct list_headlist;
-   rwlock_tlock;
-   uint16_tcounter;
-};
 
 /*
  * Lockspace member (per node in a ls)
@@ -248,7 +243,6 @@ struct dlm_lkb {
int8_t  lkb_wait_count;
int lkb_wait_nodeid; /* for debugging */
 
-   struct list_headlkb_idtbl_list; /* lockspace lkbtbl */
struct list_headlkb_statequeue; /* rsb g/c/w list */
struct list_headlkb_rsb_lookup; /* waiting for rsb lookup */
struct list_headlkb_wait_reply; /* wait

Re: [Cluster-devel] [RFT] dlm: replace lkb hash table with idr

2011-07-08 Thread David Teigland
On Wed, Jul 06, 2011 at 12:14:26PM -0400, David Teigland wrote:
> Request for testing
> 
> I'm looking at possible improvements to the dlm hash tables.

I've pushed this and another patch related to hash table performance to
the tmp-testing branch,
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm.git

Dave



Re: [Cluster-devel] [PATCH v3] fs, dlm: don't do pointless NULL check, use kzalloc and fix order of arguments

2011-07-11 Thread David Teigland
On Sun, Jul 10, 2011 at 10:54:31PM +0200, Jesper Juhl wrote:
> In fs/dlm/lock.c in the dlm_scan_waiters() function there are 3 small
> issues:
> 
> 1) There's no need to test the return value of the allocation and do a
> memset if it succeeds. Just use kzalloc() to obtain zeroed memory.
> 
> 2) Since kfree() handles NULL pointers gracefully, the test of
> 'warned' against NULL before the kfree() after the loop is completely
> pointless. Remove it.
> 
> 3) The arguments to kmalloc() (now kzalloc()) were swapped. Thanks to
> Dr. David Alan Gilbert for pointing this out.
> 
> Signed-off-by: Jesper Juhl 

Thanks, I pushed this and a patch for the other swapped kmalloc to the
next branch.
Dave



[Cluster-devel] [PATCH] dlm_controld: fix plock dev_write no op

2011-08-18 Thread David Teigland
When a plock unlock is received due to the file
being closed (the CLOSE flag is set), we should
not write an unlock result back to the kernel.
If we do, the kernel, which does not expect a
reply, will report the error "dev_write no op".

In cases where dlm_controld encounters an error
handling the unlock operation, it was writing
the error result back to the kernel, even though
the unlock was flagged with CLOSE.  The fix is
to check for the CLOSE flag and skip writing
the error result, as we do with normal results.

This problem is especially visible when using
flocks (not plocks).  This is because the kernel
generates extraneous plock unlock requests
when files are closed with flocks.  Because
dlm_controld finds no plocks on the files,
it replies to the kernel with an error, rather
than skipping the reply due to CLOSE.

bz 731775

Signed-off-by: David Teigland 
---
 group/dlm_controld/plock.c |6 --
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/group/dlm_controld/plock.c b/group/dlm_controld/plock.c
index 6d5dea8..556993b 100644
--- a/group/dlm_controld/plock.c
+++ b/group/dlm_controld/plock.c
@@ -1583,8 +1583,10 @@ void process_plocks(int ci)
return;
 
  fail:
-   info.rv = rv;
-   rv = write(plock_device_fd, &info, sizeof(info));
+   if (!(info.flags & DLM_PLOCK_FL_CLOSE)) {
+   info.rv = rv;
+   rv = write(plock_device_fd, &info, sizeof(info));
+   }
 }
 
 void process_saved_plocks(struct lockspace *ls)
-- 
1.7.6



Re: [Cluster-devel] [PATCH 4/5] gfs_controld: Remove dead code from loop()

2011-09-06 Thread David Teigland
On Tue, Sep 06, 2011 at 01:00:16PM +0100, Andrew Price wrote:
> This patch removes an if statement where the true branch is never taken.
> At this point in the code, poll_timeout could only be 500 or -1.
> 
> Signed-off-by: Andrew Price 
> ---
>  group/gfs_controld/main.c |3 ---
>  1 files changed, 0 insertions(+), 3 deletions(-)
> 
> diff --git a/group/gfs_controld/main.c b/group/gfs_controld/main.c
> index 252965a..aa38839 100644
> --- a/group/gfs_controld/main.c
> +++ b/group/gfs_controld/main.c
> @@ -1286,9 +1286,6 @@ static void loop(void)
>   if (dmsetup_wait) {
>   if (poll_timeout == -1)
>   poll_timeout = 1000;
> - } else {
> - if (poll_timeout == 1000)
> - poll_timeout = -1;

ok

>   }
>   }
>   query_unlock();
> -- 
> 1.7.6



Re: [Cluster-devel] DLM + SCTP bug (was Re: [DRBD-user] kernel panic with DRBD: solved)

2011-09-12 Thread David Teigland
> >> When node A starts back up, the SCTP protocol notices this (as it's
> >> supposed to), and delivers an SCTP_ASSOC_CHANGE / SCTP_RESTART
> >> notification to the SCTP socket, telling the socket owner (the dlm_recv
> >> thread) that the other node has restarted.  DLM responds by telling SCTP
> >> to create a clone of the master socket, for use in communicating with
> >> the newly restarted node.  (This is an SCTP_SOCKOPT_PEELOFF request.) 
> >> And this is where things go wrong: the SCTP_SOCKOPT_PEELOFF request is
> >> designed to be called from user space, not from a kernel thread, and so
> >> it /does/ allocate a file descriptor for the new socket.  Since DLM is
> >> calling it from a kernel thread, the kernel thread now has an open file
> >> descriptor (#0) to that socket.  And since kernel threads share the same
> >> file descriptor table, every kernel thread on the system has this open
> >> descriptor.  So defect #1 is that DLM is calling an SCTP user-space
> >> interface from a kernel thread, which results in pollution of the kernel
> >> thread file descriptor table.

Thanks for that analysis.  As you point out, SCTP is only ever really used
or tested from user space, not from the kernel like the dlm does.  So I'm
not surprised to hear about problems like this.  I don't know how
difficult it might be to fix that.  I'd also expect to find other problems
like it with dlm+sctp.  Some experienced time and attention is probably
needed to move the dlm's sctp support beyond experimental.
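
For reference, the user-space interface in question looks like this in
lksctp-tools; note that it returns a new file descriptor, which is
exactly what a kernel thread should not be acquiring:

	#include <netinet/sctp.h>

	/* branch one association off a one-to-many SCTP socket;
	   returns a new fd for the peeled-off association */
	int assoc_fd = sctp_peeloff(sock_fd, assoc_id);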

Dave



Re: [Cluster-devel] dlm: master - dlm_controld: new plock state transfer

2011-09-26 Thread David Teigland
On Sat, Sep 24, 2011 at 07:13:34AM +0200, Fabio M. Di Nitto wrote:
> Quick question.. deadlock.c/netlink.c have been dropped from the build
> and not referenced anywhere for distribution. Is the plan to kill them
> completely, or do they need porting?

I'm going to leave the files there as artifacts in case someone ever wants
to look at developing it again.



[Cluster-devel] [PATCH 2/3] dlm_controld: full check for member changes

2011-09-28 Thread David Teigland
cman members are queried in response to a callback,
and members sometimes leave and rejoin between queries
(e.g. when they leave and rejoin before corosync
detects they left.)

This means that simply checking if a node is a member
in consecutive queries sometimes misses events.  We
need to compare the incarnation numbers of members
from consecutive queries to avoid this.

bz 663397

Signed-off-by: David Teigland 
---
 group/dlm_controld/member_cman.c |   79 --
 1 files changed, 75 insertions(+), 4 deletions(-)

diff --git a/group/dlm_controld/member_cman.c b/group/dlm_controld/member_cman.c
index c6b7cc7..2b115d5 100644
--- a/group/dlm_controld/member_cman.c
+++ b/group/dlm_controld/member_cman.c
@@ -9,6 +9,7 @@ static cman_node_t  old_nodes[MAX_NODES];
 static int  old_node_count;
 static cman_node_t  cman_nodes[MAX_NODES];
 static int  cman_node_count;
+static uint32_t cluster_ringid_seq;
 
 void kick_node_from_cluster(int nodeid)
 {
@@ -22,6 +23,17 @@ void kick_node_from_cluster(int nodeid)
}
 }
 
+static cman_node_t *get_node(cman_node_t *node_list, int count, int nodeid)
+{
+   int i;
+
+   for (i = 0; i < count; i++) {
+   if (node_list[i].cn_nodeid == nodeid)
+   return &node_list[i];
+   }
+   return NULL;
+}
+
 static int is_member(cman_node_t *node_list, int count, int nodeid)
 {
int i;
@@ -69,11 +81,21 @@ char *nodeid2name(int nodeid)
 
 static void statechange(void)
 {
+   cman_cluster_t info;
+   cman_node_t *old;
int i, j, rv;
struct cman_node_address addrs[MAX_NODE_ADDRESSES];
int num_addrs;
struct cman_node_address *addrptr = addrs;
 
+   rv = cman_get_cluster(ch, &info);
+   if (rv < 0) {
+   log_error("cman_get_cluster error %d %d", rv, errno);
+   /* keep going, this is just informational */
+   memset(&info, 0, sizeof(info));
+   }
+   cluster_ringid_seq = info.ci_generation;
+
cluster_quorate = cman_is_quorate(ch);
 
old_node_count = cman_node_count;
@@ -99,8 +121,8 @@ static void statechange(void)
if (old_nodes[i].cn_member &&
!is_cluster_member(old_nodes[i].cn_nodeid)) {
 
-   log_debug("cluster node %d removed",
- old_nodes[i].cn_nodeid);
+   log_debug("cluster node %d removed seq %u",
+ old_nodes[i].cn_nodeid, cluster_ringid_seq);
 
node_history_cluster_remove(old_nodes[i].cn_nodeid);
 
@@ -112,6 +134,9 @@ static void statechange(void)
if (cman_nodes[i].cn_member &&
!is_old_member(cman_nodes[i].cn_nodeid)) {
 
+   log_debug("cluster node %d added seq %u",
+ cman_nodes[i].cn_nodeid, cluster_ringid_seq);
+
rv = cman_get_node_addrs(ch, cman_nodes[i].cn_nodeid,
 MAX_NODE_ADDRESSES,
 &num_addrs, addrs);
@@ -121,8 +146,54 @@ static void statechange(void)
addrptr = &cman_nodes[i].cn_address;
}
 
-   log_debug("cluster node %d added",
- cman_nodes[i].cn_nodeid);
+   node_history_cluster_add(cman_nodes[i].cn_nodeid);
+
+   for (j = 0; j < num_addrs; j++) {
+   add_configfs_node(cman_nodes[i].cn_nodeid,
+ addrptr[j].cna_address,
+ addrptr[j].cna_addrlen,
+ (cman_nodes[i].cn_nodeid ==
+  our_nodeid));
+   }
+   } else {
+   /* look for any nodes that were members of both
+* old and new but have a new incarnation number
+* from old to new, indicating they left and rejoined
+* in between */
+
+   old = get_node(old_nodes, old_node_count, cman_nodes[i].cn_nodeid);
+
+   if (!old)
+   continue;
+   if (cman_nodes[i].cn_incarnation == old->cn_incarnation)
+   continue;
+
+   log_debug("cluster node %d removed and added seq %u "
+ "old %u new %u",
+ cman_nodes[i].cn_nodeid, cluster_ringid_seq,
+ old->cn_incarnation,
+

[Cluster-devel] [PATCH 1/3] fenced: full check for member changes

2011-09-28 Thread David Teigland
cman members are queried in response to a callback,
and members sometimes leave and rejoin between queries
(e.g. when they leave and rejoin before corosync
detects they left.)

This means that simply checking if a node is a member
in consecutive queries sometimes misses events.  We
need to compare the incarnation numbers of members
from consecutive queries to avoid this.

bz 663397

Signed-off-by: David Teigland 
---
 fence/fenced/member_cman.c |   36 
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/fence/fenced/member_cman.c b/fence/fenced/member_cman.c
index 0919b8e..ee879be 100644
--- a/fence/fenced/member_cman.c
+++ b/fence/fenced/member_cman.c
@@ -33,6 +33,17 @@ void kick_node_from_cluster(int nodeid)
}
 }
 
+static cman_node_t *get_node(cman_node_t *node_list, int count, int nodeid)
+{
+   int i;
+
+   for (i = 0; i < count; i++) {
+   if (node_list[i].cn_nodeid == nodeid)
+   return &node_list[i];
+   }
+   return NULL;
+}
+
 static int is_member(cman_node_t *node_list, int count, int nodeid)
 {
int i;
@@ -149,6 +160,7 @@ int name_to_nodeid(char *name)
 static void update_cluster(void)
 {
cman_cluster_t info;
+   cman_node_t *old;
int quorate = cluster_quorate;
int removed = 0, added = 0;
int i, rv;
@@ -197,6 +209,30 @@ static void update_cluster(void)
 
node_history_cluster_add(cman_nodes[i].cn_nodeid);
added++;
+   } else {
+   /* look for any nodes that were members of both
+* old and new but have a new incarnation number
+* from old to new, indicating they left and rejoined
+* in between */
+
+   old = get_node(old_nodes, old_node_count, cman_nodes[i].cn_nodeid);
+
+   if (!old)
+   continue;
+   if (cman_nodes[i].cn_incarnation == old->cn_incarnation)
+   continue;
+
+   log_debug("cluster node %d removed and added seq %u "
+ "old %u new %u",
+ cman_nodes[i].cn_nodeid, cluster_ringid_seq,
+ old->cn_incarnation,
+ cman_nodes[i].cn_incarnation);
+
+   node_history_cluster_remove(cman_nodes[i].cn_nodeid);
+   removed++;
+
+   node_history_cluster_add(cman_nodes[i].cn_nodeid);
+   added++;
}
}
 
-- 
1.7.6



[Cluster-devel] [PATCH 3/3] gfs_controld: full check for member changes

2011-09-28 Thread David Teigland
cman members are queried in response to a callback,
and members sometimes leave and rejoin between queries
(e.g. when they leave and rejoin before corosync
detects they left.)

This means that simply checking if a node is a member
in consecutive queries sometimes misses events.  We
need to compare the incarnation numbers of members
from consecutive queries to avoid this.

bz 663397

Signed-off-by: David Teigland 
---
 group/gfs_controld/member_cman.c |   51 +++---
 1 files changed, 47 insertions(+), 4 deletions(-)

diff --git a/group/gfs_controld/member_cman.c b/group/gfs_controld/member_cman.c
index 277a11e..7f354b1 100644
--- a/group/gfs_controld/member_cman.c
+++ b/group/gfs_controld/member_cman.c
@@ -9,6 +9,7 @@ static cman_node_t  old_nodes[MAX_NODES];
 static int  old_node_count;
 static cman_node_t  cman_nodes[MAX_NODES];
 static int  cman_node_count;
+static uint32_t cluster_ringid_seq;
 
 void kick_node_from_cluster(int nodeid)
 {
@@ -22,6 +23,17 @@ void kick_node_from_cluster(int nodeid)
}
 }
 
+static cman_node_t *get_node(cman_node_t *node_list, int count, int nodeid)
+{
+   int i;
+
+   for (i = 0; i < count; i++) {
+   if (node_list[i].cn_nodeid == nodeid)
+   return &node_list[i];
+   }
+   return NULL;
+}
+
 static int is_member(cman_node_t *node_list, int count, int nodeid)
 {
int i;
@@ -45,8 +57,18 @@ static int is_cluster_member(int nodeid)
 
 static void statechange(void)
 {
+   cman_cluster_t info;
+   cman_node_t *old;
int i, rv;
 
+   rv = cman_get_cluster(ch, &info);
+   if (rv < 0) {
+   log_error("cman_get_cluster error %d %d", rv, errno);
+   /* keep going, this is just informational */
+   memset(&info, 0, sizeof(info));
+   }
+   cluster_ringid_seq = info.ci_generation;
+
old_node_count = cman_node_count;
memcpy(&old_nodes, &cman_nodes, sizeof(old_nodes));
 
@@ -70,8 +92,8 @@ static void statechange(void)
if (old_nodes[i].cn_member &&
!is_cluster_member(old_nodes[i].cn_nodeid)) {
 
-   log_debug("cluster node %d removed",
- old_nodes[i].cn_nodeid);
+   log_debug("cluster node %d removed seq %u",
+ old_nodes[i].cn_nodeid, cluster_ringid_seq);
 
node_history_cluster_remove(old_nodes[i].cn_nodeid);
}
@@ -81,10 +103,31 @@ static void statechange(void)
if (cman_nodes[i].cn_member &&
!is_old_member(cman_nodes[i].cn_nodeid)) {
 
-   log_debug("cluster node %d added",
- cman_nodes[i].cn_nodeid);
+   log_debug("cluster node %d added seq %u",
+ cman_nodes[i].cn_nodeid, cluster_ringid_seq);
 
node_history_cluster_add(cman_nodes[i].cn_nodeid);
+   } else {
+   /* look for any nodes that were members of both
+* old and new but have a new incarnation number
+* from old to new, indicating they left and rejoined
+* in between */
+
+   old = get_node(old_nodes, old_node_count, cman_nodes[i].cn_nodeid);
+
+   if (!old)
+   continue;
+   if (cman_nodes[i].cn_incarnation == old->cn_incarnation)
+   continue;
+
+   log_debug("cluster node %d removed and added seq %u "
+ "old %u new %u",
+  cman_nodes[i].cn_nodeid, cluster_ringid_seq,
+  old->cn_incarnation,
+  cman_nodes[i].cn_incarnation);
+
+   node_history_cluster_remove(cman_nodes[i].cn_nodeid);
+   node_history_cluster_add(cman_nodes[i].cn_nodeid);
}
}
 }
-- 
1.7.6



Re: [Cluster-devel] dlm: master - dlm_controld: remove ccs and new Makefile

2011-09-30 Thread David Teigland
On Fri, Sep 30, 2011 at 01:07:01PM +0200, Fabio M. Di Nitto wrote:
> On 09/30/2011 12:02 AM, David Teigland wrote:
> 
> > add a normal, sane Makefile
> 
> If you plan to drop autoconf+autotool, that is your call (I disagree for
> several reasons, but it's your project and I am not going to argue), but
> you need to do it all over the tree, 

yep, I'm getting there



Re: [Cluster-devel] dlm: master - dlm: clear out old stuff and build system

2011-10-03 Thread David Teigland
>  dlm/libdlm/libdlm.pc.in   |   11 -
>  dlm/libdlm/libdlm_lt.pc.in|   11 -
> 
> dropping the .pc file is going to break dlm users.
> 
> pc files are used by different build systems (not just
> autotools/autoconf) to detect libdlm and link against it correctly.
> 
> Similar to what you use for libxml2 in your new build system, they
> provide pkg-config information.

ok, I've pushed this out, including this which I suspect is not quite
correct:

DESTDIR=
PREFIX=/usr
LIBDIR=/usr/lib64
HDRDIR=/usr/include 
MANDIR=/usr/share/man
PKGDIR=/usr/lib64/pkgconfig
UDEVDIR=/etc/udev/rules.d

Should all references to /usr be $(PREFIX) instead?

> Even if they are not in this exact form, you will need COPYING.app|libs
> and a COPYRIGHT file. Alternatively you need to add those info to each
> file in the project.

yep, getting there



Re: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree

2011-10-05 Thread David Teigland
On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> Hi,
> 
> This upstream patch changes the way DLM keeps track of RSBs.
> Before, they were in a linked list off a hash table.  Now,
> they're an rb_tree off the same hash table.  This speeds up
> DLM lookups greatly.
> 
> Today's DLM is faster than older DLMs for many file systems,
> (e.g. in RHEL5) due to the larger hash table size.  However,
> this rb_tree implementation scales much better.  For my
> 1000-directories-with-1000-files test, the patch doesn't
> show much of an improvement.  But when I scale the file system
> to 4000 directories with 4000 files (16 million files), it
> helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> 42.01 hours to 23.68 hours.

How many hash table buckets were you using in that test?
If it was the default (1024), I'd be interested to know how
16k compares.

> With this patch I believe we could also reduce the size of
> the hash table again or eliminate it completely, but we can
> evaluate and do that later.
> 
> NOTE: Today's upstream DLM code has special code to
> pre-allocate RSB structures for faster lookup.  This patch
> eliminates that step, since it doesn't have a resource name
> at that time for inserting new entries in the rb_tree.

We need to keep that; why do you say there's no resource name?
pre_rsb_struct() and get_rsb_struct() are specially designed to work
as they do because:

> @@ -367,28 +336,16 @@ static int get_rsb_struct(struct dlm_ls *ls, char *name, int len,
> struct dlm_rsb **r_ret)
> + r = dlm_allocate_rsb(ls);
> + if (!r)
> + return -ENOMEM;

That's not allowed here because a spinlock is held:

>   spin_lock(&ls->ls_rsbtbl[bucket].lock);
>  
>   error = _search_rsb(ls, name, namelen, bucket, flags, &r);
> @@ -508,10 +492,6 @@ static int find_rsb(struct dlm_ls *ls, char *name, int namelen,
>   goto out_unlock;
>  
>   error = get_rsb_struct(ls, name, namelen, &r);
> - if (error == -EAGAIN) {
> - spin_unlock(&ls->ls_rsbtbl[bucket].lock);
> - goto retry;
> - }
>   if (error)
>   goto out_unlock;

If you try to fix the problem above by releasing the spinlock between the
search and the malloc, then you have to repeat the search.  Eliminating
the repeated search is the main reason for pre_rsb/get_rsb.
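
The shape of the current code is roughly this (a paraphrase for
illustration, not the exact source):

 retry:
	/* preallocate before taking the spinlock; this may sleep */
	error = pre_rsb_struct(ls);
	if (error < 0)
		goto out;

	spin_lock(&ls->ls_rsbtbl[bucket].lock);

	error = _search_rsb(ls, name, namelen, bucket, flags, &r);
	if (!error)
		goto out_unlock;	/* found, no allocation needed */

	/* not found: take a preallocated rsb, no sleeping allocation
	   under the lock; -EAGAIN means the reserve was used up, so
	   drop the lock, refill, and redo the search */
	error = get_rsb_struct(ls, name, namelen, &r);
	if (error == -EAGAIN) {
		spin_unlock(&ls->ls_rsbtbl[bucket].lock);
		goto retry;
	}
 out_unlock:
	spin_unlock(&ls->ls_rsbtbl[bucket].lock);
 out:
	return error;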



Re: [Cluster-devel] dlm: master - add license/copyright headers

2011-10-06 Thread David Teigland
On Thu, Oct 06, 2011 at 08:02:10PM +0200, Fabio M. Di Nitto wrote:
> Hi David,
> 
> this is going to need another quick pass.
> 
> The libdlm headers are fine, but for the daemon/tool, we had GPLv2+ in
> STABLE31 and current header only reflects GPLv2.

I'm defaulting to plain v2 unless there's a reason to do otherwise.

> In theory you need to add a similar header to the Makefile's too.

I've never seen the point.



Re: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree

2011-10-10 Thread David Teigland
On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> - Original Message -
> | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> | > Hi,
> | > 
> | > This upstream patch changes the way DLM keeps track of RSBs.
> | > Before, they were in a linked list off a hash table.  Now,
> | > they're an rb_tree off the same hash table.  This speeds up
> | > DLM lookups greatly.
> | > 
> | > Today's DLM is faster than older DLMs for many file systems,
> | > (e.g. in RHEL5) due to the larger hash table size.  However,
> | > this rb_tree implementation scales much better.  For my
> | > 1000-directories-with-1000-files test, the patch doesn't
> | > show much of an improvement.  But when I scale the file system
> | > to 4000 directories with 4000 files (16 million files), it
> | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> | > 42.01 hours to 23.68 hours.
> | 
> | How many hash table buckets were you using in that test?
> | If it was the default (1024), I'd be interested to know how
> | 16k compares.
> 
> Hi,
> 
> Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> and 16K hash buckets, the time was virtually the same as
> with my patch: 1405m46.519s (23.43 hours). So perhaps we
> should re-evaluate whether we should use the rb_tree
> implementation or just increase the hash buckets as needed.
> I guess the question is now mainly related to scaling and
> memory usage for all those hash tables at this point.

I'm still interested in possibly using an rbtree with fewer hash buckets.

At the same time, I think the bigger problem may be why gfs2 is caching so
many locks in the first place, especially for millions of unlinked files
whose locks will never benefit you again.



Re: [Cluster-devel] [coverity] liblogthread

2011-10-10 Thread David Teigland
On Mon, Oct 10, 2011 at 10:45:17AM +0200, Fabio M. Di Nitto wrote:
> This is the first patchset to address some issues spotted by Coverity scan.

look fine



Re: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree

2011-10-10 Thread David Teigland
On Mon, Oct 10, 2011 at 04:51:01PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Mon, 2011-10-10 at 10:43 -0400, David Teigland wrote:
> > On Sat, Oct 08, 2011 at 06:13:52AM -0400, Bob Peterson wrote:
> > > - Original Message -
> > > | On Wed, Oct 05, 2011 at 03:25:39PM -0400, Bob Peterson wrote:
> > > | > Hi,
> > > | > 
> > > | > This upstream patch changes the way DLM keeps track of RSBs.
> > > | > Before, they were in a linked list off a hash table.  Now,
> > > | > they're an rb_tree off the same hash table.  This speeds up
> > > | > DLM lookups greatly.
> > > | > 
> > > | > Today's DLM is faster than older DLMs for many file systems,
> > > | > (e.g. in RHEL5) due to the larger hash table size.  However,
> > > | > this rb_tree implementation scales much better.  For my
> > > | > 1000-directories-with-1000-files test, the patch doesn't
> > > | > show much of an improvement.  But when I scale the file system
> > > | > to 4000 directories with 4000 files (16 million files), it
> > > | > helps greatly. The time to do rm -fR /mnt/gfs2/* drops from
> > > | > 42.01 hours to 23.68 hours.
> > > | 
> > > | How many hash table buckets were you using in that test?
> > > | If it was the default (1024), I'd be interested to know how
> > > | 16k compares.
> > > 
> > > Hi,
> > > 
> > > Interestingly, on the stock 2.6.32-206.el6.x86_64 kernel
> > > and 16K hash buckets, the time was virtually the same as
> > > with my patch: 1405m46.519s (23.43 hours). So perhaps we
> > > should re-evaluate whether we should use the rb_tree
> > > implementation or just increase the hash buckets as needed.
> > > I guess the question is now mainly related to scaling and
> > > memory usage for all those hash tables at this point.
> > 
> > I'm still interested in possibly using an rbtree with fewer hash buckets.
> > 
> > At the same time, I think the bigger problem may be why gfs2 is caching so
> > many locks in the first place, especially for millions of unlinked files
> > whose locks will never benefit you again.
> > 
> 
> I doubt that it is caching locks for unlinked files for any great period
> of time if there is any memory pressure. It is memory pressure which
> instigates the dropping of glocks relating to inodes usually. If the
> glock is unlocked then simply dropping the ref count on the glock will
> cause the deallocation (dlm lock drop) to occur.
>
> The reason why this tends to be seen at unlink time is just that
> unlinking small files is a good way to iterate over a lot of inodes in a
> relatively short time period, and thus locking can start to dominate the
> timings if it isn't very low latency. I suspect there are lots of other
> workloads which would see this (find perhaps?) as well. The faster the
> array, the more noticeable the effect is likely to be.
> 
> Also, we do get some benefit if glocks are cached after inodes have been
> unlinked. If an inode is then recreated in the same disk block, we'll
> have all the glocks ready for it, without having to wait for them to be
> set up from scratch.
> 
> I am slightly concerned that we ought to be testing with much greater
> numbers of inodes for these kinds of tests. Considering the memory sizes
> of modern servers, having 1m or even 10m files in cache is really not a
> lot these days.

The fact remains that caching "as much as possible" tends to be harmful,
and some careful limiting would be a good investment.

> There is a second consideration in selecting the data structure for the
> dlm lock/resource etc. tables, which is the locking. Using a simple hash
> table does make it much easier to use RCU based locking which is a great
> advantage if there are a lot of look up operations compared with the
> number of insert/remove operations. That becomes a lot more difficult
> (and requires a much greater lookup to insert/remove ratio) when trees
> are involved.
> 
> We have run into the same kind of issues with GFS2's glock hash table.
> We firstly reduced the size of the hash buckets to a single pointer so
> that we could have a lot more hash heads for the given amount of memory.
> Currently we have 2^16 hash table entries.
> 
> Also, we are now using RCU in order to reduce the locking overhead as
> much as possible. We still have a spinlock for the lru list, but that is
> not taken that much by comparison, and there is no locking in the lookup
> path now except to disable preempt through the small critical section.
>

Re: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree

2011-10-10 Thread David Teigland
On Mon, Oct 10, 2011 at 08:00:07PM +0100, Steven Whitehouse wrote:
> > The fact remains that caching "as much as possible" tends to be harmful,
> > and some careful limiting would be a good investment.
> > 
> There is a limit. The point is that the limit is dynamic and depends on
> memory pressure. 

the "as possible" part of "as much as possible".

> The VM requests a reduction in the number of cached
> objects when it wants to reduce the size of what we have cached. So the
> larger the amount of RAM, the more inodes may be potentially cached.
> 
> I don't agree that caching as much as possible (given the constraints
> just mentioned) is bad. 

the "tends to" part meant that it can be good or bad, depending.

> The more we cache, the more disk I/O is avoided so the better the
> performance will be.

You don't need to argue that more caching can be good.  My point is it's
not universally true in practice, as Bob's tests show.



[Cluster-devel] cluster4 gfs_controld

2011-10-13 Thread David Teigland
Here's the outline of my plan to remove/replace the essential bits of
gfs_controld in cluster4.  I expect it'll go away entirely, but there
could be one or two minor things it would still handle on the side.

kernel dlm/gfs2 will continue to be operable with either
. cluster3 dlm_controld/gfs_controld combination, or
. cluster4 dlm_controld only

Two main things from gfs_controld need replacing:

1. jid allocation, first mounter

cluster3
. both from gfs_controld

cluster4
. jid from dlm-kernel "slots" which will be assigned similarly
. first mounter using a dlm lock in lock_dlm

2. recovery coordination, failure notification

cluster3
. coordination of dlm-kernel/gfs-kernel recovery is done
  indirectly in userspace between dlm_controld/gfs_controld,
  which then toggle sysfs files.
. write("sysfs block", 0) -> block_store(1)
  write("sysfs recover", jid) -> recover_store(jid)
  write("sysfs block", 1) -> block_store(0)

cluster4
. coordination of dlm-kernel/gfs-kernel recovery is done
  directly in kernel using callbacks from dlm-kernel to gfs-kernel.
. gdlm_mount(struct gfs2_sbd *sdp, const char *table, int *first, int *jid)
  calls dlm_recover_register(dlm, &jid, &recover_callbacks)
. gdlm_recover_prep() -> block_store(1)
  gdlm_recover_slot(jid) -> recover_store(jid)
  gdlm_recover_done() -> block_store(0)

cluster3 dlm/gfs recovery
. dlm_controld sees nodedown  (libcpg)
. gfs_controld sees nodedown  (libcpg)
. dlm_controld stops dlm-kernel   (sysfs control 0)
. gfs_controld stops gfs-kernel   (sysfs block 1)
. dlm_controld waits for gfs_controld kernel stop (libdlmcontrol)
. gfs_controld waits for dlm_controld kernel stop (libdlmcontrol)
. dlm_controld syncs state among all nodes(libcpg)
. gfs_controld syncs state among all nodes(libcpg)
. dlm_controld starts dlm-kernel recovery (sysfs control 1)
. gfs_controld starts gfs-kernel recovery (sysfs recover jid)
. gfs_controld starts gfs-kernel  (sysfs block 0)

cluster4 dlm/gfs recovery
. dlm_controld sees nodedown  (libcpg)
. dlm_controld stops dlm-kernel   (sysfs control 0)
. dlm-kernel stops gfs-kernel (callback block 1)
. dlm_controld syncs state among all nodes(libcpg)
. dlm_controld starts dlm-kernel recovery (sysfs control 1)
. dlm-kernel starts gfs-kernel recovery   (callback recover jid)
. dlm-kernel starts gfs-kernel(callback block 0)
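
A rough sketch of the registration interface outlined above; the names
follow this outline, and the final merged API may well differ:

struct dlm_recover_callbacks {
	void (*recover_prep)(void *arg);	  /* -> block_store(1) */
	void (*recover_slot)(void *arg, int jid); /* -> recover_store(jid) */
	void (*recover_done)(void *arg);	  /* -> block_store(0) */
};

/* in gdlm_mount(): an error return from registration means a cluster3
   dlm_controld is running, so fall back to the gfs_controld path */
error = dlm_recover_register(dlm, &jid, &gdlm_recover_callbacks);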



Re: [Cluster-devel] cluster4 gfs_controld

2011-10-13 Thread David Teigland
On Fri, Oct 14, 2011 at 12:02:27AM +0900, Masatake YAMATO wrote:
> Just a question.
> I'd be happy if you could give me a hint.
> 
> > ...
> > cluster3 dlm/gfs recovery
> > . dlm_controld sees nodedown  (libcpg)
> > . gfs_controld sees nodedown  (libcpg)
> > . dlm_controld stops dlm-kernel   (sysfs control 0)
> > . gfs_controld stops gfs-kernel   (sysfs block 1)
> > . dlm_controld waits for gfs_controld kernel stop (libdlmcontrol)
> > ...
> 
>   "dlm_controld waits for gfs_controld kernel stop (libdlmcontrol)"
> 
> Is this true?
> I'd like to know which source file of which package this is
> written in. Which should I inspect, dlm_controld or libdlmcontrol?

The function is check_fs_done()

http://git.fedorahosted.org/git?p=cluster.git;a=blob;f=group/dlm_controld/cpg.c;h=9b0d22333be540a733f2f74db4acc577c82b6026;hb=RHEL6#l636



Re: [Cluster-devel] cluster4 gfs_controld

2011-10-13 Thread David Teigland
On Thu, Oct 13, 2011 at 03:41:31PM +0100, Steven Whitehouse wrote:
> > cluster4
> > . jid from dlm-kernel "slots" which will be assigned similarly
> What is the actual algorithm used to assign these slots?

The same as picking jids: lowest unused id starting with 0.  As for
implementation, I'll add it to the current dlm recovery messages.
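
In other words, something like this sketch (illustrative only):

	/* lowest id not used by any current member; jids start at 0 */
	static int lowest_unused_jid(int *used, int num_used)
	{
		int id, i;

		for (id = 0; ; id++) {
			int found = 0;
			for (i = 0; i < num_used; i++) {
				if (used[i] == id) {
					found = 1;
					break;
				}
			}
			if (!found)
				return id;
		}
	}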

(Frankly, I'd really like to just set jid to nodeid-1.  Any support for
that?  It would obviously add a slight requirement to picking nodeid's,
which 99.9% of people already do.)

> > . first mounter using a dlm lock in lock_dlm
> > 
> That sounds good to me. The thing we need to resolve is how do we get
> from one to the other. We may have to introduce a new name for the lock
> protocol to avoid people accidentally using both schemes in the same
> cluster.

Compatibility rests on the fact that the new dlm kernel features will only
work when the cluster4 dlm_controld is used.

dlm_controld.v3 running: dlm_recover_register() returns an error, and
everything falls back to working as it does now, with gfs_controld.v3 etc.

dlm_controld.v4 running: dlm_recover_register() works, lock_dlm sets jid
and first.  (gfs_controld.v3 will fail to even run with dlm_controld.v4,
and other combinations of v3/v4 daemons will also fail to run together.)

> > cluster4
> > . coordination of dlm-kernel/gfs-kernel recovery is done
> >   directly in kernel using callbacks from dlm-kernel to gfs-kernel.
> > . gdlm_mount(struct gfs2_sbd *sdp, const char *table, int *first, int *jid)
> >   calls dlm_recover_register(dlm, &jid, &recover_callbacks)
> Can we not just pass the extra functions to dlm_create_lockspace? That
> seems a bit simpler than adding an extra function just to register the
> callbacks,

Yes we could; I may do that.  Returning the error mentioned above becomes
less direct.  I'd have to overload the jid arg, or add another to indicate
the callbacks are enabled.



Re: [Cluster-devel] cluster4 gfs_controld

2011-10-13 Thread David Teigland
On Thu, Oct 13, 2011 at 05:16:29PM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Thu, 2011-10-13 at 11:30 -0400, David Teigland wrote:
> > On Thu, Oct 13, 2011 at 03:41:31PM +0100, Steven Whitehouse wrote:
> > > > cluster4
> > > > . jid from dlm-kernel "slots" which will be assigned similarly
> > > What is the actual algorithm used to assign these slots?
> > 
> > The same as picking jids: lowest unused id starting with 0.  As for
> > implementation, I'll add it to the current dlm recovery messages.
> > 
> Yes, but the current implementation uses corosync to enforce ordering of
> events, so I'm wondering how the dlm will do that after the change.

One node picks it and then tells everyone.  In this case it is the low
nodeid coordinating recovery, and the new data is added to the status
messages.
We still rely on agreed ordering of different recovery events, which is
still based on libcpg.

In cluster3 everyone picks the same values independently based on the
ordered events/messages, just because that's how gfs_controld decides
everything.


> > (Frankly, I'd really like to just set jid to nodeid-1.  Any support for
> > that?  It would obviously add a slight requirement to picking nodeid's,
> > which 99.9% of people already do.)
> > 
> The problem is that if you have a cluster with lots of nodes, but where
> each fs is only mounted by a small number of them, we'd have to insist
> on always creating as many journals as there are nodes in the cluster.

yeah

> > Yes we could; I may do that.  Returning the error mentioned above becomes
> > less direct.  I'd have to overload the jid arg, or add another to indicate
> > the callbacks are enabled.
> > 
> Another alternative is just to add a member of the recover_callbacks
> structure which would be a function taking the first and jid as
> arguments, and the dlm can call that to pass the info to gfs2.
> 
> That way dlm users who don't care about that would just leave those
> functions NULL, for example,

The simplest variation should become evident once I start writing it;
it's hard to predict.



Re: [Cluster-devel] [Upstream patch] DLM: Convert rsb data from linked list to rb_tree

2011-10-25 Thread David Teigland
Hi Bob, I've made a few minor/cosmetic changes and attached my current
version (not tested yet).

>  static int shrink_bucket(struct dlm_ls *ls, int b)
>  {
> + struct rb_node *n = NULL;
>   struct dlm_rsb *r;
>   int count = 0, found;
>  
>   for (;;) {
>   found = 0;
>   spin_lock(&ls->ls_rsbtbl[b].lock);
> - list_for_each_entry_reverse(r, &ls->ls_rsbtbl[b].toss,
> - res_hashchain) {
> + for (n = rb_first(&ls->ls_rsbtbl[b].toss); n;
> +  n = rb_next(&r->res_hashnode)) {
> + r = rb_entry(n, struct dlm_rsb, res_hashnode);
> + if (unlikely(&r->res_hashnode == n)) {
> + spin_unlock(&ls->ls_rsbtbl[b].lock);
> + return 0;
> + }

Isn't that unlikely statement actually always true?

>   if (!time_after_eq(jiffies, r->res_toss_time +
>  dlm_config.ci_toss_secs * HZ))
>   continue;
> @@ -1108,7 +1167,7 @@ static int shrink_bucket(struct dlm_ls *ls, int b)
>   }
>  
>   if (kref_put(&r->res_ref, kill_rsb)) {
> - list_del(&r->res_hashchain);
> + rb_erase(&r->res_hashnode, &ls->ls_rsbtbl[b].toss);
>   spin_unlock(&ls->ls_rsbtbl[b].lock);
>  
>   if (is_master(r))


diff --git a/fs/dlm/debug_fs.c b/fs/dlm/debug_fs.c
index 5977923..a834f6b 100644
--- a/fs/dlm/debug_fs.c
+++ b/fs/dlm/debug_fs.c
@@ -393,6 +393,7 @@ static const struct seq_operations format3_seq_ops;
 
 static void *table_seq_start(struct seq_file *seq, loff_t *pos)
 {
+   struct rb_node *node;
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri;
struct dlm_rsb *r;
@@ -418,9 +419,10 @@ static void *table_seq_start(struct seq_file *seq, loff_t *pos)
ri->format = 3;
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   list_for_each_entry(r, &ls->ls_rsbtbl[bucket].list,
-   res_hashchain) {
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   for (node = rb_first(&ls->ls_rsbtbl[bucket].keep); node;
+node = rb_next(node)) {
+   r = rb_entry(node, struct dlm_rsb, res_hashnode);
if (!entry--) {
dlm_hold_rsb(r);
ri->rsb = r;
@@ -449,9 +451,8 @@ static void *table_seq_start(struct seq_file *seq, loff_t *pos)
}
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
-struct dlm_rsb, res_hashchain);
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   r = rb_entry(node, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
@@ -467,7 +468,7 @@ static void *table_seq_next(struct seq_file *seq, void *iter_ptr, loff_t *pos)
 {
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri = iter_ptr;
-   struct list_head *next;
+   struct rb_node *next;
struct dlm_rsb *r, *rp;
loff_t n = *pos;
unsigned bucket;
@@ -480,10 +481,10 @@ static void *table_seq_next(struct seq_file *seq, void *iter_ptr, loff_t *pos)
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
rp = ri->rsb;
-   next = rp->res_hashchain.next;
+   next = rb_next(&rp->res_hashnode);
 
-   if (next != &ls->ls_rsbtbl[bucket].list) {
-   r = list_entry(next, struct dlm_rsb, res_hashchain);
+   if (next) {
+   r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
spin_unlock(&ls->ls_rsbtbl[bucket].lock);
@@ -511,9 +512,9 @@ static void *table_seq_next(struct seq_file *seq, void *iter_ptr, loff_t *pos)
}
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
-struct dlm_rsb, res_hashchain);
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   next = rb_first(&ls->ls_rsbtbl[bucket].keep);
+   r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index fe2860c..5685a9a 100644

Re: [Cluster-devel] GFS2: glock statistics gathering (RFC)

2011-11-04 Thread David Teigland
On Fri, Nov 04, 2011 at 03:19:49PM +, Steven Whitehouse wrote:
> The three pairs of mean/variance measure the following
> things:
> 
>  1. DLM lock time (non-blocking requests)

You don't need to track and save this value, because all results will be
one of three values which can gather once:

short: the dir node and master node are local: 0 network round trip
medium: one is local, one is remote: 1 network round trip
long: both are remote: 2 network round trips

Once you've measured values for short/med/long, then you're done.
The distribution will depend on the usage pattern.

>  2. DLM lock time (blocking requests)

I think what you want to quantify is how much contention a given lock is
under.  A time measurement is probably not a great way to get that since
it's a combination of: the value above, how long gfs2 takes to release the
lock (itself a combination of things, including the the tunable itself),
and how many nodes are competing for the lock (depends on workload).

>  3. Inter-request time (again to the DLM)

Time between gfs2 requesting the same lock?  That sounds like it might
work ok for measuring contention.

> 1. To be able to better set the glock "min hold time"

Less for a lock with high contention?

> 2. To spot performance issues more easily

Apart from contention, I'm not sure there are many perf issues that dlm
measurements would help with.

> 3. To improve the algorithm for selecting resource groups for
> allocation (to base it on lock wait time, rather than blindly
> using a "try lock")

Don't you grab an rg lock and keep it cached?  How would lock times help?

Also, ocfs2 keeps quite a lot of locking stats you might look at.

Dave



Re: [Cluster-devel] GFS2: glock statistics gathering (RFC)

2011-11-04 Thread David Teigland
On Fri, Nov 04, 2011 at 04:57:31PM +, Steven Whitehouse wrote:
> Hi,
> 
> On Fri, 2011-11-04 at 12:31 -0400, David Teigland wrote:
> > On Fri, Nov 04, 2011 at 03:19:49PM +, Steven Whitehouse wrote:
> > > The three pairs of mean/variance measure the following
> > > things:
> > > 
> > >  1. DLM lock time (non-blocking requests)
> > 
> > You don't need to track and save this value, because all results will be
> > one of three values which can gather once:
> > 
> > short: the dir node and master node are local: 0 network round trip
> > medium: one is local, one is remote: 1 network round trip
> > long: both are remote: 2 network round trips
> > 
> > Once you've measured values for short/med/long, then you're done.
> > The distribution will depend on the usage pattern.
> > 
> The reason for tracking this is to be able to compare it with the
> blocking request value to (I hope) get a rough idea of the difference
> between the two, which may indicate contention on the lock. So this
> is really a "baseline" measurement.
> 
> Plus we do need to measure it, since it will vary according to a
> number of things, such as what hardware is in use.

Right, but the baseline shouldn't change once you have it.

> > > 2. To spot performance issues more easily
> > 
> > Apart from contention, I'm not sure there are many perf issues that dlm
> > measurements would help with.
> > 
> That is the #1 cause of reported performance issues, so top of our list
> to work on. The goal is to make it easier to track down the source of
> these kinds of problems.

I still think that time averages and computations sound like a difficult
and indirect way of measuring contention... but see how it works.

Dave



[Cluster-devel] dlm patches

2011-12-16 Thread David Teigland
This is the current series of dlm patches from
https://github.com/teigland/linux-dlm/tree/devel9

The first is already pushed to linux-next for the next merge cycle.
The others, which allow gfs2 to be used without gfs_controld, are
still being tested, and may be ready for the next merge cycle,
depending on how that goes.

Dave



[Cluster-devel] [PATCH 3/5] dlm: add node slots and generation

2011-12-16 Thread David Teigland
Slot numbers are assigned to nodes when they join the lockspace.
The slot number chosen is the minimum unused value starting at 1.
Once a node is assigned a slot, that slot number will not change
while the node remains a lockspace member.  If the node leaves
and rejoins it can be assigned a new slot number.

A new generation number is also added to a lockspace.  It is
set and incremented during each recovery along with the slot
collection/assignment.

The slot numbers will be passed to gfs2 which will use them as
journal id's.
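
A sketch of the "minimum unused value" policy (simplified, not the
member.c code; 'used' is a hypothetical bitmap of assigned slots):

	/* pick the lowest slot number >= 1 not already assigned */
	static int choose_slot(unsigned long *used, int max_slots)
	{
		int slot;

		for (slot = 1; slot <= max_slots; slot++) {
			if (!test_bit(slot, used))
				return slot;
		}
		return -1;	/* table full; caller grows it */
	}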

Signed-off-by: David Teigland 
---
 fs/dlm/dlm_internal.h |   48 -
 fs/dlm/lockspace.c|5 +
 fs/dlm/member.c   |  274 -
 fs/dlm/member.h   |7 ++
 fs/dlm/rcom.c |   99 +++---
 fs/dlm/rcom.h |2 +-
 fs/dlm/recover.c  |   64 ++--
 7 files changed, 470 insertions(+), 29 deletions(-)

diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index 5685a9a..f4d132c 100644
--- a/fs/dlm/dlm_internal.h
+++ b/fs/dlm/dlm_internal.h
@@ -117,6 +117,18 @@ struct dlm_member {
struct list_headlist;
int nodeid;
int weight;
+   int slot;
+   int slot_prev;
+   uint32_tgeneration;
+};
+
+/*
+ * low nodeid saves array of these in ls_slots
+ */
+
+struct dlm_slot {
+   int nodeid;
+   int slot;
 };
 
 /*
@@ -337,7 +349,9 @@ static inline int rsb_flag(struct dlm_rsb *r, enum 
rsb_flags flag)
 /* dlm_header is first element of all structs sent between nodes */
 
#define DLM_HEADER_MAJOR   0x00030000
-#define DLM_HEADER_MINOR   0x00000000
+#define DLM_HEADER_MINOR   0x00000001
+
+#define DLM_HEADER_SLOTS   0x00000001
 
#define DLM_MSG    1
 #define DLM_RCOM   2
@@ -425,10 +439,34 @@ union dlm_packet {
struct dlm_rcom rcom;
 };
 
+#define DLM_RSF_NEED_SLOTS 0x00000001
+
+/* RCOM_STATUS data */
+struct rcom_status {
+   __le32  rs_flags;
+   __le32  rs_unused1;
+   __le64  rs_unused2;
+};
+
+/* RCOM_STATUS_REPLY data */
 struct rcom_config {
__le32  rf_lvblen;
__le32  rf_lsflags;
-   __le64  rf_unused;
+
+   /* DLM_HEADER_SLOTS adds: */
+   __le32  rf_flags;
+   __le16  rf_our_slot;
+   __le16  rf_num_slots;
+   __le32  rf_generation;
+   __le32  rf_unused1;
+   __le64  rf_unused2;
+};
+
+struct rcom_slot {
+   __le32  ro_nodeid;
+   __le16  ro_slot;
+   __le16  ro_unused1;
+   __le64  ro_unused2;
 };
 
 struct rcom_lock {
@@ -455,6 +493,7 @@ struct dlm_ls {
struct list_headls_list;/* list of lockspaces */
dlm_lockspace_t *ls_local_handle;
uint32_tls_global_id;   /* global unique lockspace ID */
+   uint32_tls_generation;
uint32_tls_exflags;
int ls_lvblen;
int ls_count;   /* refcount of processes in
@@ -493,6 +532,11 @@ struct dlm_ls {
int ls_total_weight;
int *ls_node_array;
 
+   int ls_slot;
+   int ls_num_slots;
+   int ls_slots_size;
+   struct dlm_slot *ls_slots;
+
struct dlm_rsb  ls_stub_rsb;/* for returning errors */
struct dlm_lkb  ls_stub_lkb;/* for returning errors */
struct dlm_message  ls_stub_ms; /* for faking a reply */
diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index 1d16a23..1441f04 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -525,6 +525,11 @@ static int new_lockspace(const char *name, int namelen, 
void **lockspace,
if (!ls->ls_recover_buf)
goto out_dirfree;
 
+   ls->ls_slot = 0;
+   ls->ls_num_slots = 0;
+   ls->ls_slots_size = 0;
+   ls->ls_slots = NULL;
+
INIT_LIST_HEAD(&ls->ls_recover_list);
spin_lock_init(&ls->ls_recover_list_lock);
ls->ls_recover_list_count = 0;
diff --git a/fs/dlm/member.c b/fs/dlm/member.c
index 5ebd1df..fbbcda9 100644
--- a/fs/dlm/member.c
+++ b/fs/dlm/member.c
@@ -19,6 +19,270 @@
 #include "config.h"
 #include "lowcomms.h"
 
+int dlm_slots_version(struct dlm_header *h)
+{
+   if ((h->h_version & 0x0000ffff) < DLM_HEADER_SLOTS)
+   return 0;
+   return 1;
+}
+
+void dlm_slot_save(struct dlm_ls *ls, struct dlm_rcom *rc,
+

[Cluster-devel] [PATCH 1/5] dlm: convert rsb list to rb_tree

2011-12-16 Thread David Teigland
From: Bob Peterson 

Change the linked lists to rb_tree's in the rsb
hash table to speed up searches.  Slow rsb searches
were having a large impact on gfs2 performance due
to the large number of dlm locks gfs2 uses.
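
The point of the conversion is replacing a linear list walk with an
ordered tree search; in sketch form the lookup becomes something like
this (simplified, not the lock.c code):

	/* find an rsb by name in one bucket's keep tree,
	 * ordered by memcmp() on the resource name */
	static struct dlm_rsb *search_keep_tree(struct rb_root *root,
						const char *name, int len)
	{
		struct rb_node *node = root->rb_node;
		struct dlm_rsb *r;
		int rc;

		while (node) {
			r = rb_entry(node, struct dlm_rsb, res_hashnode);
			rc = memcmp(name, r->res_name, len);
			if (rc < 0)
				node = node->rb_left;
			else if (rc > 0)
				node = node->rb_right;
			else
				return r;
		}
		return NULL;
	}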

Signed-off-by: Bob Peterson 
Signed-off-by: David Teigland 
---
 fs/dlm/debug_fs.c |   28 ---
 fs/dlm/dlm_internal.h |9 +++--
 fs/dlm/lock.c |   87 +++-
 fs/dlm/lockspace.c|   23 +
 fs/dlm/recover.c  |   21 +++
 5 files changed, 113 insertions(+), 55 deletions(-)

diff --git a/fs/dlm/debug_fs.c b/fs/dlm/debug_fs.c
index 5977923..3dca2b3 100644
--- a/fs/dlm/debug_fs.c
+++ b/fs/dlm/debug_fs.c
@@ -393,6 +393,7 @@ static const struct seq_operations format3_seq_ops;
 
 static void *table_seq_start(struct seq_file *seq, loff_t *pos)
 {
+   struct rb_node *node;
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri;
struct dlm_rsb *r;
@@ -418,9 +419,10 @@ static void *table_seq_start(struct seq_file *seq, loff_t 
*pos)
ri->format = 3;
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   list_for_each_entry(r, &ls->ls_rsbtbl[bucket].list,
-   res_hashchain) {
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   for (node = rb_first(&ls->ls_rsbtbl[bucket].keep); node;
+node = rb_next(node)) {
+   r = rb_entry(node, struct dlm_rsb, res_hashnode);
if (!entry--) {
dlm_hold_rsb(r);
ri->rsb = r;
@@ -449,9 +451,9 @@ static void *table_seq_start(struct seq_file *seq, loff_t 
*pos)
}
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
-struct dlm_rsb, res_hashchain);
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   node = rb_first(&ls->ls_rsbtbl[bucket].keep);
+   r = rb_entry(node, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
@@ -467,7 +469,7 @@ static void *table_seq_next(struct seq_file *seq, void 
*iter_ptr, loff_t *pos)
 {
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri = iter_ptr;
-   struct list_head *next;
+   struct rb_node *next;
struct dlm_rsb *r, *rp;
loff_t n = *pos;
unsigned bucket;
@@ -480,10 +482,10 @@ static void *table_seq_next(struct seq_file *seq, void 
*iter_ptr, loff_t *pos)
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
rp = ri->rsb;
-   next = rp->res_hashchain.next;
+   next = rb_next(&rp->res_hashnode);
 
-   if (next != &ls->ls_rsbtbl[bucket].list) {
-   r = list_entry(next, struct dlm_rsb, res_hashchain);
+   if (next) {
+   r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
spin_unlock(&ls->ls_rsbtbl[bucket].lock);
@@ -511,9 +513,9 @@ static void *table_seq_next(struct seq_file *seq, void 
*iter_ptr, loff_t *pos)
}
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
-struct dlm_rsb, res_hashchain);
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   next = rb_first(&ls->ls_rsbtbl[bucket].keep);
+   r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index fe2860c..5685a9a 100644
--- a/fs/dlm/dlm_internal.h
+++ b/fs/dlm/dlm_internal.h
@@ -103,8 +103,8 @@ struct dlm_dirtable {
 };
 
 struct dlm_rsbtable {
-   struct list_headlist;
-   struct list_headtoss;
+   struct rb_root  keep;
+   struct rb_root  toss;
spinlock_t  lock;
 };
 
@@ -285,7 +285,10 @@ struct dlm_rsb {
unsigned long   res_toss_time;
uint32_tres_first_lkid;
struct list_headres_lookup; /* lkbs waiting on first */
-   struct list_headres_hashchain;  /* rsbtbl */
+   union {
+   struct list_headres_hashchain;
+

[Cluster-devel] [PATCH 2/5] dlm: move recovery barrier calls

2011-12-16 Thread David Teigland
Put all the calls to recovery barriers in the same function
to clarify where they each happen.  Should not change any behavior.
Also modify some recovery debug lines to make them consistent.

Signed-off-by: David Teigland 
---
 fs/dlm/dir.c  |1 -
 fs/dlm/member.c   |7 +--
 fs/dlm/recover.c  |2 --
 fs/dlm/recoverd.c |   45 +++--
 4 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/fs/dlm/dir.c b/fs/dlm/dir.c
index 7b84c1d..8364157 100644
--- a/fs/dlm/dir.c
+++ b/fs/dlm/dir.c
@@ -290,7 +290,6 @@ int dlm_recover_directory(struct dlm_ls *ls)
 
  out_status:
error = 0;
-   dlm_set_recover_status(ls, DLM_RS_DIR);
log_debug(ls, "dlm_recover_directory %d entries", count);
  out_free:
kfree(last_name);
diff --git a/fs/dlm/member.c b/fs/dlm/member.c
index b12532e..5ebd1df 100644
--- a/fs/dlm/member.c
+++ b/fs/dlm/member.c
@@ -251,7 +251,6 @@ int dlm_recover_members(struct dlm_ls *ls, struct 
dlm_recover *rv, int *neg_out)
ls->ls_low_nodeid = low;
 
make_member_array(ls);
-   dlm_set_recover_status(ls, DLM_RS_NODES);
*neg_out = neg;
 
error = ping_members(ls);
@@ -261,12 +260,8 @@ int dlm_recover_members(struct dlm_ls *ls, struct 
dlm_recover *rv, int *neg_out)
ls->ls_members_result = error;
complete(&ls->ls_members_done);
}
-   if (error)
-   goto out;
 
-   error = dlm_recover_members_wait(ls);
- out:
-   log_debug(ls, "total members %d error %d", ls->ls_num_nodes, error);
+   log_debug(ls, "dlm_recover_members %d nodes", ls->ls_num_nodes);
return error;
 }
 
diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index 50467ce..81b2393 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -542,8 +542,6 @@ int dlm_recover_locks(struct dlm_ls *ls)
  out:
if (error)
recover_list_clear(ls);
-   else
-   dlm_set_recover_status(ls, DLM_RS_LOCKS);
return error;
 }
 
diff --git a/fs/dlm/recoverd.c b/fs/dlm/recoverd.c
index 774da3c..5a9e1a4 100644
--- a/fs/dlm/recoverd.c
+++ b/fs/dlm/recoverd.c
@@ -54,7 +54,7 @@ static int ls_recover(struct dlm_ls *ls, struct dlm_recover 
*rv)
unsigned long start;
int error, neg = 0;
 
-   log_debug(ls, "recover %llx", (unsigned long long)rv->seq);
+   log_debug(ls, "dlm_recover %llx", (unsigned long long)rv->seq);
 
mutex_lock(&ls->ls_recoverd_active);
 
@@ -76,14 +76,22 @@ static int ls_recover(struct dlm_ls *ls, struct dlm_recover 
*rv)
 
/*
 * Add or remove nodes from the lockspace's ls_nodes list.
-* Also waits for all nodes to complete dlm_recover_members.
 */
 
error = dlm_recover_members(ls, rv, &neg);
if (error) {
-   log_debug(ls, "recover_members failed %d", error);
+   log_debug(ls, "dlm_recover_members error %d", error);
goto fail;
}
+
+   dlm_set_recover_status(ls, DLM_RS_NODES);
+
+   error = dlm_recover_members_wait(ls);
+   if (error) {
+   log_debug(ls, "dlm_recover_members_wait error %d", error);
+   goto fail;
+   }
+
start = jiffies;
 
/*
@@ -93,17 +101,15 @@ static int ls_recover(struct dlm_ls *ls, struct 
dlm_recover *rv)
 
error = dlm_recover_directory(ls);
if (error) {
-   log_debug(ls, "recover_directory failed %d", error);
+   log_debug(ls, "dlm_recover_directory error %d", error);
goto fail;
}
 
-   /*
-* Wait for all nodes to complete directory rebuild.
-*/
+   dlm_set_recover_status(ls, DLM_RS_DIR);
 
error = dlm_recover_directory_wait(ls);
if (error) {
-   log_debug(ls, "recover_directory_wait failed %d", error);
+   log_debug(ls, "dlm_recover_directory_wait error %d", error);
goto fail;
}
 
@@ -133,7 +139,7 @@ static int ls_recover(struct dlm_ls *ls, struct dlm_recover 
*rv)
 
error = dlm_recover_masters(ls);
if (error) {
-   log_debug(ls, "recover_masters failed %d", error);
+   log_debug(ls, "dlm_recover_masters error %d", error);
goto fail;
}
 
@@ -143,13 +149,15 @@ static int ls_recover(struct dlm_ls *ls, struct 
dlm_recover *rv)
 
error = dlm_recover_locks(ls);
if (error) {
-   log_debug(ls, "recover_locks failed %d", error);
+   log_debug(ls, "dlm_recover_locks error %d", error);
goto fail;
}
 
+   dlm_set_recover_status(ls, DLM_RS_LOCK

[Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-16 Thread David Teigland
This new method of managing recovery is an alternative to
the previous approach of using the userland gfs_controld.

- use dlm slot numbers to assign journal id's
- use dlm recovery callbacks to initiate journal recovery
- use a dlm lock to determine the first node to mount fs
- use a dlm lock to track journals that need recovery

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c |2 +-
 fs/gfs2/glock.h |7 +-
 fs/gfs2/incore.h|   51 ++-
 fs/gfs2/lock_dlm.c  |  979 ++-
 fs/gfs2/main.c  |   10 +
 fs/gfs2/ops_fstype.c|   29 +-
 fs/gfs2/recovery.c  |4 +
 fs/gfs2/sys.c   |   29 +-
 fs/gfs2/sys.h   |2 +
 include/linux/gfs2_ondisk.h |2 +
 10 files changed, 1075 insertions(+), 40 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 88e8a23..376816f 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1353,7 +1353,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
spin_lock(&gl->gl_spin);
gl->gl_reply = ret;
 
-   if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_flags))) {
+   if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags))) {
if (gfs2_should_freeze(gl)) {
set_bit(GLF_FROZEN, &gl->gl_flags);
spin_unlock(&gl->gl_spin);
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 6670711..5b548b07 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -121,8 +121,11 @@ enum {
 
 struct lm_lockops {
const char *lm_proto_name;
-   int (*lm_mount) (struct gfs2_sbd *sdp, const char *fsname);
-   void (*lm_unmount) (struct gfs2_sbd *sdp);
+   int (*lm_mount) (struct gfs2_sbd *sdp, const char *table);
+   void (*lm_first_done) (struct gfs2_sbd *sdp);
+   void (*lm_recovery_result) (struct gfs2_sbd *sdp, unsigned int jid,
+   unsigned int result);
+   void (*lm_unmount) (struct gfs2_sbd *sdp);
void (*lm_withdraw) (struct gfs2_sbd *sdp);
void (*lm_put_lock) (struct gfs2_glock *gl);
int (*lm_lock) (struct gfs2_glock *gl, unsigned int req_state,
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 892ac37..059e462 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -139,8 +139,38 @@ struct gfs2_bufdata {
 #define GDLM_STRNAME_BYTES 25
 #define GDLM_LVB_SIZE  32
 
+/*
+ * ls_recover_flags:
+ *
+ * DFL_BLOCK_LOCKS: dlm is in recovery and will grant locks that had been
+ * held by failed nodes whose journals need recovery.  Those locks should
+ * only be used for journal recovery until the journal recovery is done.
+ * This is set by the dlm recover_prep callback and cleared by the
+ * gfs2_control thread when journal recovery is complete.  To avoid
+ * races between recover_prep setting and gfs2_control clearing, recover_spin
+ * is held while changing this bit and reading/writing recover_block
+ * and recover_start.
+ *
+ * DFL_FIRST_MOUNT: this node is the first to mount this fs and is doing
+ * recovery of all journals before allowing other nodes to mount the fs.
+ * This is cleared when FIRST_MOUNT_DONE is set.
+ *
+ * DFL_FIRST_MOUNT_DONE: this node was the first mounter, and has finished
+ * recovery of all journals, and now allows other nodes to mount the fs.
+ *
+ * DFL_MOUNT_DONE: gdlm_mount has completed successfully and cleared
+ * BLOCK_LOCKS for the first time.  The gfs2_control thread should now
+ * control clearing BLOCK_LOCKS for further recoveries.
+ *
+ * DFL_UNMOUNT: gdlm_unmount sets to keep sdp off gfs2_control_wq.
+ */
+
 enum {
DFL_BLOCK_LOCKS = 0,
+   DFL_FIRST_MOUNT = 1,
+   DFL_FIRST_MOUNT_DONE= 2,
+   DFL_MOUNT_DONE  = 3,
+   DFL_UNMOUNT = 4,
 };
 
 struct lm_lockname {
@@ -504,14 +534,26 @@ struct gfs2_sb_host {
 struct lm_lockstruct {
int ls_jid;
unsigned int ls_first;
-   unsigned int ls_first_done;
unsigned int ls_nodir;
const struct lm_lockops *ls_ops;
-   unsigned long ls_flags;
dlm_lockspace_t *ls_dlm;
 
-   int ls_recover_jid_done;
-   int ls_recover_jid_status;
+   int ls_recover_jid_done; /* read by gfs_controld */
+   int ls_recover_jid_status; /* read by gfs_controld */
+
+   struct dlm_lksb ls_mounted_lksb; /* mounted_lock */
+   struct dlm_lksb ls_control_lksb; /* control_lock */
+   char ls_control_lvb[GDLM_LVB_SIZE]; /* control_lock lvb */
+   struct completion ls_sync_wait; /* {control,mounted}_{lock,unlock} */
+
+   spinlock_t ls_recover_spin; /* protects following fields */
+   unsigned long ls_recover_flags; /* DFL_ */
+   uint32_t ls_recover_mount; /* gen in first recover_done cb */
+   uint32_t ls_recover_start; /* gen in last recover_done cb */
+   uint32_t ls_recover_block; /* copy recover_sta

[Cluster-devel] [PATCH 4/5] dlm: add recovery callbacks

2011-12-16 Thread David Teigland
These new callbacks notify the dlm user about lock recovery.
GFS2, and possibly others, need to be aware of when the dlm
will be doing lock recovery for a failed lockspace member.

In the past, this coordination has been done between dlm and
file system daemons in userspace, which then direct their
kernel counterparts.  These callbacks allow the same
coordination directly, and more simply.

Signed-off-by: David Teigland 
---
 fs/dlm/config.c   |  130 ++--
 fs/dlm/config.h   |   17 +++-
 fs/dlm/dlm_internal.h |   20 ++
 fs/dlm/lockspace.c|   37 +++--
 fs/dlm/member.c   |  197 +++--
 fs/dlm/member.h   |3 +-
 fs/dlm/recoverd.c |   10 +-
 fs/dlm/user.c |4 +-
 fs/gfs2/lock_dlm.c|4 +-
 fs/ocfs2/stack_user.c |4 +-
 include/linux/dlm.h   |   65 +++-
 11 files changed, 319 insertions(+), 172 deletions(-)

diff --git a/fs/dlm/config.c b/fs/dlm/config.c
index 6cf72fc..e7e327d 100644
--- a/fs/dlm/config.c
+++ b/fs/dlm/config.c
@@ -2,7 +2,7 @@
 ***
 **
 **  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
-**  Copyright (C) 2004-2008 Red Hat, Inc.  All rights reserved.
+**  Copyright (C) 2004-2011 Red Hat, Inc.  All rights reserved.
 **
 **  This copyrighted material is made available to anyone wishing to use,
 **  modify, copy, or redistribute it subject to the terms and conditions
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -36,6 +37,7 @@
 static struct config_group *space_list;
 static struct config_group *comm_list;
 static struct dlm_comm *local_comm;
+static uint32_t dlm_comm_count;
 
 struct dlm_clusters;
 struct dlm_cluster;
@@ -103,6 +105,8 @@ struct dlm_cluster {
unsigned int cl_timewarn_cs;
unsigned int cl_waitwarn_us;
unsigned int cl_new_rsb_count;
+   unsigned int cl_recover_callbacks;
+   char cl_cluster_name[DLM_LOCKSPACE_LEN];
 };
 
 enum {
@@ -118,6 +122,8 @@ enum {
CLUSTER_ATTR_TIMEWARN_CS,
CLUSTER_ATTR_WAITWARN_US,
CLUSTER_ATTR_NEW_RSB_COUNT,
+   CLUSTER_ATTR_RECOVER_CALLBACKS,
+   CLUSTER_ATTR_CLUSTER_NAME,
 };
 
 struct cluster_attribute {
@@ -126,6 +132,27 @@ struct cluster_attribute {
ssize_t (*store)(struct dlm_cluster *, const char *, size_t);
 };
 
+static ssize_t cluster_cluster_name_read(struct dlm_cluster *cl, char *buf)
+{
+   return sprintf(buf, "%s\n", cl->cl_cluster_name);
+}
+
+static ssize_t cluster_cluster_name_write(struct dlm_cluster *cl,
+ const char *buf, size_t len)
+{
+   strncpy(dlm_config.ci_cluster_name, buf, DLM_LOCKSPACE_LEN);
+   strncpy(cl->cl_cluster_name, buf, DLM_LOCKSPACE_LEN);
+   return len;
+}
+
+static struct cluster_attribute cluster_attr_cluster_name = {
+   .attr   = { .ca_owner = THIS_MODULE,
+.ca_name = "cluster_name",
+.ca_mode = S_IRUGO | S_IWUSR },
+   .show   = cluster_cluster_name_read,
+   .store  = cluster_cluster_name_write,
+};
+
 static ssize_t cluster_set(struct dlm_cluster *cl, unsigned int *cl_field,
   int *info_field, int check_zero,
   const char *buf, size_t len)
@@ -171,6 +198,7 @@ CLUSTER_ATTR(protocol, 0);
 CLUSTER_ATTR(timewarn_cs, 1);
 CLUSTER_ATTR(waitwarn_us, 0);
 CLUSTER_ATTR(new_rsb_count, 0);
+CLUSTER_ATTR(recover_callbacks, 0);
 
 static struct configfs_attribute *cluster_attrs[] = {
[CLUSTER_ATTR_TCP_PORT] = &cluster_attr_tcp_port.attr,
@@ -185,6 +213,8 @@ static struct configfs_attribute *cluster_attrs[] = {
[CLUSTER_ATTR_TIMEWARN_CS] = &cluster_attr_timewarn_cs.attr,
[CLUSTER_ATTR_WAITWARN_US] = &cluster_attr_waitwarn_us.attr,
[CLUSTER_ATTR_NEW_RSB_COUNT] = &cluster_attr_new_rsb_count.attr,
+   [CLUSTER_ATTR_RECOVER_CALLBACKS] = &cluster_attr_recover_callbacks.attr,
+   [CLUSTER_ATTR_CLUSTER_NAME] = &cluster_attr_cluster_name.attr,
NULL,
 };
 
@@ -293,6 +323,7 @@ struct dlm_comms {
 
 struct dlm_comm {
struct config_item item;
+   int seq;
int nodeid;
int local;
int addr_count;
@@ -309,6 +340,7 @@ struct dlm_node {
int nodeid;
int weight;
int new;
+   int comm_seq; /* copy of cm->seq when nd->nodeid is set */
 };
 
 static struct configfs_group_operations clusters_ops = {
@@ -455,6 +487,9 @@ static struct config_group *make_cluster(struct 
config_group *g,
cl->cl_timewarn_cs = dlm_config.ci_timewarn_cs;
cl->cl_waitwarn_us = dlm_config.ci_waitwarn_us;
cl->cl_new_rsb_count = dlm_config.ci_new_rsb_count;
+   cl->cl_recover_callbacks = dlm_config.ci_recover_callbacks;
+   memcpy(cl->cl_cluster_name, dlm_config.c

Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-19 Thread David Teigland
On Mon, Dec 19, 2011 at 01:07:38PM +, Steven Whitehouse wrote:
> >  struct lm_lockstruct {
> > int ls_jid;
> > unsigned int ls_first;
> > -   unsigned int ls_first_done;
> > unsigned int ls_nodir;
> Since ls_flags and ls_first are also only boolean flags, they could
> potentially be moved into the flags, though we can always do that later.

yes, I can use a flag in place of ls_first.

> > +   int ls_recover_jid_done; /* read by gfs_controld */
> > +   int ls_recover_jid_status; /* read by gfs_controld */
>   ^^^ this isn't
> actually true any more. All recent gfs_controld versions take their cue
> from the uevents, so this is here only for backwards compatibility
> reasons and these two will be removed at some future date.

I'll add a longer comment saying something like that.

> > +   /*
> > +* Other nodes need to do some work in dlm recovery and gfs2_control
> > +* before the recover_done and control_lock will be ready for us below.
> > +* A delay here is not required but often avoids having to retry.
> > +*/
> > +
> > +   msleep(500);
> Can we get rid of this then? I'd rather just wait for the lock, rather
> than adding delays of arbitrary time periods into the code.

I dislike arbitrary delays also, so I'm hesitant to add them.
The choices here are:
- removing NOQUEUE from the requests below, but with NOQUEUE you have a
  much better chance of killing a mount command, which is a fairly nice
  feature, I think.
- removing the delay, which results in nodes often doing fast+repeated
  lock attempts, which could get rather excessive.  I'd be worried about
  having that kind of unlimited loop sitting there.
- using some kind of delay.

While I don't like the look of the delay, I like the other options less.
Do you have a preference, or any other ideas?


> > +static int control_first_done(struct gfs2_sbd *sdp)
> > +{
> > +   struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +   char lvb_bits[GDLM_LVB_SIZE];
> > +   uint32_t start_gen, block_gen;
> > +   int error;
> > +
> > +restart:
> > +   spin_lock(&ls->ls_recover_spin);
> > +   start_gen = ls->ls_recover_start;
> > +   block_gen = ls->ls_recover_block;
> > +
> > +   if (test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags) ||
> > +   !test_bit(DFL_MOUNT_DONE, &ls->ls_recover_flags) ||
> > +   !test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags)) {
> > +   /* sanity check, should not happen */
> > +   fs_err(sdp, "control_first_done start %u block %u flags %lx\n",
> > +  start_gen, block_gen, ls->ls_recover_flags);
> > +   spin_unlock(&ls->ls_recover_spin);
> > +   control_unlock(sdp);
> > +   return -1;
> > +   }
> > +
> > +   if (start_gen == block_gen) {
> > +   /*
> > +* Wait for the end of a dlm recovery cycle to switch from
> > +* first mounter recovery.  We can ignore any recover_slot
> > +* callbacks between the recover_prep and next recover_done
> > +* because we are still the first mounter and any failed nodes
> > +* have not fully mounted, so they don't need recovery.
> > +*/
> > +   spin_unlock(&ls->ls_recover_spin);
> > +   fs_info(sdp, "control_first_done wait gen %u\n", start_gen);
> > +   msleep(500);
> Again - I don't want to add arbitrary delays into the code. Why is this
> waiting for half a second? Why not some other length of time? We should
> figure out how to wait for the end of the first mounter recovery some
> other way if that is what is required.

This msleep slows down a rare loop to wake up a couple times vs once with
a proper wait mechanism.  It's waiting for the next recover_done()
callback, which the dlm will call when it is done with recovery.  We do
have the option here of using a standard wait mechanism, wait_on_bit() or
something.  I'll see if any of those would work here without adding too
much to the code.


> > +static void gdlm_recovery_result(struct gfs2_sbd *sdp, unsigned int jid,
> > +unsigned int result)
> > +{
> > +   struct lm_lockstruct *ls = &sdp->sd_lockstruct;
> > +
> > +   /* don't care about the recovery of own journal during mount */
> > +   if (jid == ls->ls_jid)
> > +   return;
> > +
> > +   /* another node is recovering the journal, give it a chance to
> > +  finish before trying again */
> > +   if (result == LM_RD_GAVEUP)
> > +   msleep(1000);
> Again, lets put in a proper wait for this condition. If the issue is one
> of races between cluster nodes (thundering herd type problem), then we
> might need some kind of back off, but in that case, it should probably
> be for a random time period.

In this case, while one node is recovering a journal, the other nodes will
all try to recover the same journal (and fail), as quickly as they can.  I
looked at using queue_delayed_work here, but couldn't tell if that was ok

Re: [Cluster-devel] [PATCH 3/5] dlm: add node slots and generation

2011-12-19 Thread David Teigland
> Nit, but this should have some spaces, iow, "i + 1;"

>  -error = check_config(ls, rc, nodeid);
>  +error = check_rcom_config(ls, rc, nodeid);

yeah, I'll change those, thanks



Re: [Cluster-devel] [PATCH 4/5] dlm: add recovery callbacks

2011-12-19 Thread David Teigland
On Mon, Dec 19, 2011 at 12:36:57PM +, Steven Whitehouse wrote:
> > +   struct dlm_lockspace_ops ls_ops;
> ^^ I'd suggest just keeping a pointer to
> this, see below.


> > +static int new_lockspace(const char *name, const char *cluster, uint32_t 
> > flags,
> > +int lvblen, struct dlm_lockspace_ops *ops,
> this should be const


> > +   if (ops)
> > +   memcpy(&ls->ls_ops, ops, sizeof(struct dlm_lockspace_ops));
> > +
> Why not just keep a pointer to the ops? There is no need to copy them.
> 
> Also - since the ops are specific to a user of the lockspace, this
> implies that it is no longer possible to have multiple openers of the
> same lockspace on the same node. I think that needs documenting
> somewhere at least. The other possibility would be to introduce a
> structure to represent a "user of a lockspace" I suppose... 

> > +int dlm_new_lockspace(const char *name, const char *cluster, uint32_t 
> > flags,
> > + int lvblen, struct dlm_lockspace_ops *ops,
>  likewise this can also be const

> Please don't mix the callback arg and the functions in the same
> structure. The functions will be identical for all filesystems, where as
> the callback arg will be unique to each filesystem, so it would be
> better to just add an extra cb_arg to the new lockspace function.

I'll change all those, thanks



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-20 Thread David Teigland
On Tue, Dec 20, 2011 at 10:39:08AM +, Steven Whitehouse wrote:
> > I dislike arbitrary delays also, so I'm hesitant to add them.
> > The choices here are:
> > - removing NOQUEUE from the requests below, but with NOQUEUE you have a
> >   much better chance of killing a mount command, which is a fairly nice
> >   feature, I think.
> > - removing the delay, which results in nodes often doing fast+repeated
> >   lock attempts, which could get rather excessive.  I'd be worried about
> >   having that kind of unlimited loop sitting there.
> > - using some kind of delay.
> > 
> > While I don't like the look of the delay, I like the other options less.
> > Do you have a preference, or any other ideas?
> > 
> Well, I'd prefer to just remove the NOQUEUE command in that case, so
> that we don't spin here. The dlm request is async anyway, so we should
> be able to wait for it in an interruptible manner and send a cancel if
> required.

I won't do async+cancel here, that would make the code unnecessarily ugly
and complicated.  There's really no reason to be so dogmatic about delays,
but since you refuse I'll just make it block, assuming I don't find any
new problems with that.

> > > Again - I don't want to add arbitrary delays into the code. Why is this
> > > waiting for half a second? Why not some other length of time? We should
> > > figure out how to wait for the end of the first mounter recovery some
> > > other way if that is what is required.
> > 
> > This msleep slows down a rare loop to wake up a couple times vs once with
> > a proper wait mechanism.  It's waiting for the next recover_done()
> > callback, which the dlm will call when it is done with recovery.  We do
> > have the option here of using a standard wait mechanism, wait_on_bit() or
> > something.  I'll see if any of those would work here without adding too
> > much to the code.
> > 
> Ok. That would be a better option I think.

Only if it doesn't make things more (unnecessarily) complex.

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-20 Thread David Teigland
On Tue, Dec 20, 2011 at 02:16:43PM -0500, David Teigland wrote:
> On Tue, Dec 20, 2011 at 10:39:08AM +, Steven Whitehouse wrote:
> > > I dislike arbitrary delays also, so I'm hesitant to add them.
> > > The choices here are:
> > > - removing NOQUEUE from the requests below, but with NOQUEUE you have a
> > >   much better chance of killing a mount command, which is a fairly nice
> > >   feature, I think.
> > > - removing the delay, which results in nodes often doing fast+repeated
> > >   lock attempts, which could get rather excessive.  I'd be worried about
> > >   having that kind of unlimited loop sitting there.
> > > - using some kind of delay.
> > > 
> > > While I don't like the look of the delay, I like the other options less.
> > > Do you have a preference, or any other ideas?
> > > 
> > Well, I'd prefer to just remove the NOQUEUE command in that case, so
> > that we don't spin here. The dlm request is async anyway, so we should
> > be able to wait for it in an interruptible manner and send a cancel if
> > required.
> 
> I won't do async+cancel here, that would make the code unnecessarily ugly
> and complicated.  There's really no reason to be so dogmatic about delays,
> but since you refuse I'll just make it block, assuming I don't find any
> new problems with that.

Now that I look at it, waiting vs blocking on the lock requests is not the
main issue; removing NOQUEUE doesn't really do anything.  We're waiting
for the other nodes to finish their work and update the state in the lvb.
The only option is to periodically check the lvb, so the only choices are
to do that as fast as possible (not nice), or introduce a delay.



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-21 Thread David Teigland
On Wed, Dec 21, 2011 at 10:45:21AM +, Steven Whitehouse wrote:
> I don't think I understand what's going on in that case. What I thought
> should be happening was this:
> 
>  - Try to get mounter lock in EX
>- If successful, then we are the first mounter so recover all
>  journals
>- Write info into LVB
>- Drop mounter lock to PR so other nodes can mount
> 
>  - If failed to get mounter lock in EX, then wait for lock in PR state
>- This will block until the EX lock is dropped to PR
>- Read info from LVB
> 
> So a node with the mounter lock in EX knows that it is always the first
> mounter and will recover all journals before demoting the mounter lock
> to PR. A node with the mounter lock in PR may only recover its own
> journal (at mount time).

I previously used one lock similar to that, but had to change it a bit.
I had to split it across two separate locks, called control_lock and
mounted_lock.  There need to be two because of two conflicting requirements.

The control_lock lvb is used to communicate the generation number and jid
bits.  Writing the lvb requires an EX lock, and EX prevents others from
continually holding a PR lock.  Without mounted nodes continually holding
a PR lock we can't use EX to indicate first mounter.

So, the mounted_lock (no lvb) is used to indicate the first mounter.
Here all mounted nodes continually hold a PR lock, and a mounting node
attempts to get an EX lock, so any node to get an EX lock is the first
mounter.

(I previously used control_lock with "zero lvb" to indicate first mounter,
but there are some fairly common cases where the lvb may not be zero when
we need a first mounter.)
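
In sketch form, the mounted_lock check looks something like this
(simplified; dlm_lock() is async, so sync_ast/wait_ast here are
hypothetical helpers that wait for the completion ast and return
lksb.sb_status):

	/* whoever is granted EX on mounted_lock is the first mounter;
	 * every mounted node otherwise holds PR for the life of the mount */
	int status, first_mounter = 0;

	dlm_lock(ls, DLM_LOCK_EX, &mounted_lksb, DLM_LKF_NOQUEUE,
		 "mounted_lock", 12, 0, sync_ast, sdp, NULL);
	status = wait_ast(&mounted_lksb);

	if (!status) {
		first_mounter = 1;			/* got EX */
	} else if (status == -EAGAIN) {
		/* someone else is mounted; join the PR holders */
		dlm_lock(ls, DLM_LOCK_PR, &mounted_lksb, 0,
			 "mounted_lock", 12, 0, sync_ast, sdp, NULL);
		wait_ast(&mounted_lksb);
	}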

Now back to the reason why we need to retry lock requests and can't just
block.  It's not related to the first mounter case.  When a node mounts,
it needs to wait for other (previously mounted) nodes to update the
control_lock lvb with the latest generation number, and then it also needs
to wait for any bits set in the lvb to be cleared.  i.e. it needs to wait
for any unrecovered journals to be recovered before it finishes mounting.

To do this, it needs to wait in a loop reading the control_lock lvb.  The
question is whether we want to add some sort of delay to that loop or not,
and how.  msleep_interruptible(), schedule_timeout_interruptible(),
something else?
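
i.e. something with the shape of the following (a sketch of the loop in
question, not the final lock_dlm.c code; read_control_lvb() and mount_gen
are hypothetical):

	/* poll the control_lock lvb until previously mounted nodes have
	 * published the current generation and all jid bits are clear;
	 * read_control_lvb() would requeue the lock with DLM_LKF_VALBLK
	 * to refresh the local lvb copy */
	uint32_t gen;
	DECLARE_BITMAP(jid_bits, 64);

	for (;;) {
		read_control_lvb(sdp, &gen, jid_bits);

		if (gen >= mount_gen && bitmap_empty(jid_bits, 64))
			break;		/* all journals recovered */

		msleep_interruptible(500);	/* the delay in question */
	}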

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2011-12-22 Thread David Teigland
On Mon, Dec 19, 2011 at 12:47:38PM -0500, David Teigland wrote:
> On Mon, Dec 19, 2011 at 01:07:38PM +, Steven Whitehouse wrote:
> > >  struct lm_lockstruct {
> > >   int ls_jid;
> > >   unsigned int ls_first;
> > > - unsigned int ls_first_done;
> > >   unsigned int ls_nodir;
> > Since ls_flags and ls_first are also only boolean flags, they could
> > potentially be moved into the flags, though we can always do that later.
> 
> yes, I can use a flag in place of ls_first.

I went back to ls_first after finding the flag broke the old code path for
gfs_controld, and making it work involved changing more of the old code
that I wanted to in this patch.  We can go back and reorganize how some of
that old code works (and remove ls_first), in a subsequent patch.

> > > + /*
> > > +  * Other nodes need to do some work in dlm recovery and gfs2_control
> > > +  * before the recover_done and control_lock will be ready for us below.
> > > +  * A delay here is not required but often avoids having to retry.
> > > +  */
> > > +
> > > + msleep(500);

This is now msleep_interruptible(500); I couldn't get around adding some
sort of delay here.  If we think of another way to delay this I'll be
happy to try it.

I've finished and tested the rest of the changes.
https://github.com/teigland/linux-dlm/commits/devel11
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next



Re: [Cluster-devel] [patch] gfs2: make some sizes unsigned in set_recover_size()

2012-01-04 Thread David Teigland
> [patch] dlm: le32 vs le16
> gfs2: make some sizes unsigned in set_recover_size()

Thanks, I've folded in both of those.
Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread David Teigland
On Thu, Jan 05, 2012 at 10:08:15AM -0500, Bob Peterson wrote:
> - Original Message -
> | This new method of managing recovery is an alternative to
> | the previous approach of using the userland gfs_controld.
> | 
> | - use dlm slot numbers to assign journal id's
> | - use dlm recovery callbacks to initiate journal recovery
> | - use a dlm lock to determine the first node to mount fs
> | - use a dlm lock to track journals that need recovery
> | 
> | Signed-off-by: David Teigland 
> | ---
> | --- a/fs/gfs2/lock_dlm.c
> | +++ b/fs/gfs2/lock_dlm.c
> (snip)
> | +#include 
> |  #include 
> 
> Hi,
> 
> Dave, are you going to post a replacement patch or addendum patch
> that addresses Steve's concerns, such as the above?
> I'd like to review this, but I want to review the latest/greatest.

I haven't resent the patches after making the changes (which were fairly
minor.)  I'll resend them shortly for another check before a pull request.

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread David Teigland
On Thu, Jan 05, 2012 at 03:40:09PM +, Steven Whitehouse wrote:
> I think it would be a good plan to not send this last patch for the
> current merge window and let it settle for a bit longer. Running things
> so fine with the timing makes me nervous bearing in mind the number of
> changes,

To allay your fears, keep in mind that the big new feature here does not
impact the current mode in the slightest.  Everything should continue to
work just as it has before given the current versions of dlm_controld and
gfs_controld.  So, there should be no reason for worry.  A version of
dlm_controld under development will enable the new feature.

> and that three issues have been caught in the last few days.

The issues that arose from -next were completely inconsequential in
practice, so that's a moot point.

> Lets try and resolve the remaining points and then we can have something
> really solid ready for the next window.

I don't know of any remaining points.

> I don't think there is any particular rush to get it in at the moment.

I think people will want to start using the new cluster4 components (which
require these patches) before a 3.4 kernel would be in a Fedora release.
So the motivation is to avoid making everyone build their own kernels and
allow cluster4 versions of things to be put in Fedora.

Dave



[Cluster-devel] [PATCH 1/5] dlm: convert rsb list to rb_tree

2012-01-05 Thread David Teigland
From: Bob Peterson 

Change the linked lists to rb_tree's in the rsb
hash table to speed up searches.  Slow rsb searches
were having a large impact on gfs2 performance due
to the large number of dlm locks gfs2 uses.

Signed-off-by: Bob Peterson 
Signed-off-by: David Teigland 
---
 fs/dlm/debug_fs.c |   28 ---
 fs/dlm/dlm_internal.h |9 +++--
 fs/dlm/lock.c |   87 +++-
 fs/dlm/lockspace.c|   23 +
 fs/dlm/recover.c  |   21 +++
 5 files changed, 113 insertions(+), 55 deletions(-)

diff --git a/fs/dlm/debug_fs.c b/fs/dlm/debug_fs.c
index 5977923..3dca2b3 100644
--- a/fs/dlm/debug_fs.c
+++ b/fs/dlm/debug_fs.c
@@ -393,6 +393,7 @@ static const struct seq_operations format3_seq_ops;
 
 static void *table_seq_start(struct seq_file *seq, loff_t *pos)
 {
+   struct rb_node *node;
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri;
struct dlm_rsb *r;
@@ -418,9 +419,10 @@ static void *table_seq_start(struct seq_file *seq, loff_t 
*pos)
ri->format = 3;
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   list_for_each_entry(r, &ls->ls_rsbtbl[bucket].list,
-   res_hashchain) {
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   for (node = rb_first(&ls->ls_rsbtbl[bucket].keep); node;
+node = rb_next(node)) {
+   r = rb_entry(node, struct dlm_rsb, res_hashnode);
if (!entry--) {
dlm_hold_rsb(r);
ri->rsb = r;
@@ -449,9 +451,9 @@ static void *table_seq_start(struct seq_file *seq, loff_t 
*pos)
}
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
-struct dlm_rsb, res_hashchain);
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   node = rb_first(&ls->ls_rsbtbl[bucket].keep);
+   r = rb_entry(node, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
@@ -467,7 +469,7 @@ static void *table_seq_next(struct seq_file *seq, void 
*iter_ptr, loff_t *pos)
 {
struct dlm_ls *ls = seq->private;
struct rsbtbl_iter *ri = iter_ptr;
-   struct list_head *next;
+   struct rb_node *next;
struct dlm_rsb *r, *rp;
loff_t n = *pos;
unsigned bucket;
@@ -480,10 +482,10 @@ static void *table_seq_next(struct seq_file *seq, void 
*iter_ptr, loff_t *pos)
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
rp = ri->rsb;
-   next = rp->res_hashchain.next;
+   next = rb_next(&rp->res_hashnode);
 
-   if (next != &ls->ls_rsbtbl[bucket].list) {
-   r = list_entry(next, struct dlm_rsb, res_hashchain);
+   if (next) {
+   r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
spin_unlock(&ls->ls_rsbtbl[bucket].lock);
@@ -511,9 +513,9 @@ static void *table_seq_next(struct seq_file *seq, void 
*iter_ptr, loff_t *pos)
}
 
spin_lock(&ls->ls_rsbtbl[bucket].lock);
-   if (!list_empty(&ls->ls_rsbtbl[bucket].list)) {
-   r = list_first_entry(&ls->ls_rsbtbl[bucket].list,
-struct dlm_rsb, res_hashchain);
+   if (!RB_EMPTY_ROOT(&ls->ls_rsbtbl[bucket].keep)) {
+   next = rb_first(&ls->ls_rsbtbl[bucket].keep);
+   r = rb_entry(next, struct dlm_rsb, res_hashnode);
dlm_hold_rsb(r);
ri->rsb = r;
ri->bucket = bucket;
diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index fe2860c..5685a9a 100644
--- a/fs/dlm/dlm_internal.h
+++ b/fs/dlm/dlm_internal.h
@@ -103,8 +103,8 @@ struct dlm_dirtable {
 };
 
 struct dlm_rsbtable {
-   struct list_headlist;
-   struct list_headtoss;
+   struct rb_root  keep;
+   struct rb_root  toss;
spinlock_t  lock;
 };
 
@@ -285,7 +285,10 @@ struct dlm_rsb {
unsigned long   res_toss_time;
uint32_tres_first_lkid;
struct list_headres_lookup; /* lkbs waiting on first */
-   struct list_headres_hashchain;  /* rsbtbl */
+   union {
+   struct list_headres_hashchain;
+

[Cluster-devel] [PATCH 0/5] dlm and gfs2 patches for 3.3

2012-01-05 Thread David Teigland
These are dlm and gfs2 patches from -next proposed for 3.3.  The gfs2 part
still needs an ack from Steve W, so it could be dropped.  The patches add
a major feature to the dlm and gfs2 that allows gfs2 mounters and recovery
to be managed directly within the kernel, via the dlm, rather than in
userland.  This new feature is not used by current dlm_controld and
gfs_controld daemons, but will be enabled by a new dlm_controld version
under development.


Bob Peterson (1):
  dlm: convert rsb list to rb_tree

David Teigland (4):
  dlm: move recovery barrier calls
  dlm: add node slots and generation
  dlm: add recovery callbacks
  gfs2: dlm based recovery coordination


 fs/dlm/config.c |  130 ---
 fs/dlm/config.h |   17 +-
 fs/dlm/debug_fs.c   |   28 +-
 fs/dlm/dir.c|1 -
 fs/dlm/dlm_internal.h   |   60 +++-
 fs/dlm/lock.c   |   87 +++-
 fs/dlm/lockspace.c  |   71 +++-
 fs/dlm/member.c |  486 ++
 fs/dlm/member.h |   10 +-
 fs/dlm/rcom.c   |   99 -
 fs/dlm/rcom.h   |2 +-
 fs/dlm/recover.c|   87 +++-
 fs/dlm/recoverd.c   |   53 ++-
 fs/dlm/user.c   |5 +-
 fs/gfs2/glock.c |2 +-
 fs/gfs2/glock.h |7 +-
 fs/gfs2/incore.h|   58 +++-
 fs/gfs2/lock_dlm.c  |  993 ++-
 fs/gfs2/main.c  |   10 +
 fs/gfs2/ops_fstype.c|   29 +-
 fs/gfs2/recovery.c  |4 +
 fs/gfs2/sys.c   |   33 +-
 fs/gfs2/sys.h   |2 +
 fs/ocfs2/stack_user.c   |4 +-
 include/linux/dlm.h |   71 +++-
 include/linux/gfs2_ondisk.h |2 +
 26 files changed, 2039 insertions(+), 312 deletions(-)


commit 68557b104bcf95f2e17aa591409eaecf2a08e55c
Author: David Teigland 
Date:   Tue Dec 20 17:03:04 2011 -0600

gfs2: dlm based recovery coordination

This new method of managing recovery is an alternative to
the previous approach of using the userland gfs_controld.

- use dlm slot numbers to assign journal id's
- use dlm recovery callbacks to initiate journal recovery
- use a dlm lock to determine the first node to mount fs
- use a dlm lock to track journals that need recovery

Signed-off-by: David Teigland 

commit 60f98d1839376d30e13f3e452dce2433fad3060e
Author: David Teigland 
Date:   Wed Nov 2 14:30:58 2011 -0500

dlm: add recovery callbacks

These new callbacks notify the dlm user about lock recovery.
GFS2, and possibly others, need to be aware of when the dlm
will be doing lock recovery for a failed lockspace member.

In the past, this coordination has been done between dlm and
file system daemons in userspace, which then direct their
kernel counterparts.  These callbacks allow the same
coordination directly, and more simply.

Signed-off-by: David Teigland 

commit 757a42719635495779462514458bbfbf12a37dac
Author: David Teigland 
Date:   Thu Oct 20 13:26:28 2011 -0500

dlm: add node slots and generation

Slot numbers are assigned to nodes when they join the lockspace.
The slot number chosen is the minimum unused value starting at 1.
Once a node is assigned a slot, that slot number will not change
while the node remains a lockspace member.  If the node leaves
and rejoins it can be assigned a new slot number.

A new generation number is also added to a lockspace.  It is
set and incremented during each recovery along with the slot
collection/assignment.

The slot numbers will be passed to gfs2 which will use them as
journal id's.

Signed-off-by: David Teigland 

commit f95a34c66554235b70a681fcd9feebc195f7ec0e
Author: David Teigland 
Date:   Fri Oct 14 12:34:58 2011 -0500

dlm: move recovery barrier calls

Put all the calls to recovery barriers in the same function
to clarify where they each happen.  Should not change any behavior.
Also modify some recovery debug lines to make them consistent.

Signed-off-by: David Teigland 

commit 9beb3bf5a92bb8fc6503f844bf0772df29f14a02
Author: Bob Peterson 
Date:   Wed Oct 26 15:24:55 2011 -0500

dlm: convert rsb list to rb_tree

Change the linked lists to rb_tree's in the rsb
hash table to speed up searches.  Slow rsb searches
were having a large impact on gfs2 performance due
to the large number of dlm locks gfs2 uses.

Signed-off-by: Bob Peterson 
Signed-off-by: David Teigland 



[Cluster-devel] [PATCH 3/5] dlm: add node slots and generation

2012-01-05 Thread David Teigland
Slot numbers are assigned to nodes when they join the lockspace.
The slot number chosen is the minimum unused value starting at 1.
Once a node is assigned a slot, that slot number will not change
while the node remains a lockspace member.  If the node leaves
and rejoins it can be assigned a new slot number.

A new generation number is also added to a lockspace.  It is
set and incremented during each recovery along with the slot
collection/assignment.

The slot numbers will be passed to gfs2 which will use them as
journal id's.

Signed-off-by: David Teigland 
---
 fs/dlm/dlm_internal.h |   48 -
 fs/dlm/lockspace.c|5 +
 fs/dlm/member.c   |  284 -
 fs/dlm/member.h   |7 ++
 fs/dlm/rcom.c |   99 ++---
 fs/dlm/rcom.h |2 +-
 fs/dlm/recover.c  |   64 ++--
 7 files changed, 480 insertions(+), 29 deletions(-)

diff --git a/fs/dlm/dlm_internal.h b/fs/dlm/dlm_internal.h
index 5685a9a..f4d132c 100644
--- a/fs/dlm/dlm_internal.h
+++ b/fs/dlm/dlm_internal.h
@@ -117,6 +117,18 @@ struct dlm_member {
struct list_headlist;
int nodeid;
int weight;
+   int slot;
+   int slot_prev;
+   uint32_tgeneration;
+};
+
+/*
+ * low nodeid saves array of these in ls_slots
+ */
+
+struct dlm_slot {
+   int nodeid;
+   int slot;
 };
 
 /*
@@ -337,7 +349,9 @@ static inline int rsb_flag(struct dlm_rsb *r, enum 
rsb_flags flag)
 /* dlm_header is first element of all structs sent between nodes */
 
#define DLM_HEADER_MAJOR   0x00030000
-#define DLM_HEADER_MINOR   0x00000000
+#define DLM_HEADER_MINOR   0x00000001
+
+#define DLM_HEADER_SLOTS   0x00000001
 
#define DLM_MSG    1
 #define DLM_RCOM   2
@@ -425,10 +439,34 @@ union dlm_packet {
struct dlm_rcom rcom;
 };
 
+#define DLM_RSF_NEED_SLOTS 0x00000001
+
+/* RCOM_STATUS data */
+struct rcom_status {
+   __le32  rs_flags;
+   __le32  rs_unused1;
+   __le64  rs_unused2;
+};
+
+/* RCOM_STATUS_REPLY data */
 struct rcom_config {
__le32  rf_lvblen;
__le32  rf_lsflags;
-   __le64  rf_unused;
+
+   /* DLM_HEADER_SLOTS adds: */
+   __le32  rf_flags;
+   __le16  rf_our_slot;
+   __le16  rf_num_slots;
+   __le32  rf_generation;
+   __le32  rf_unused1;
+   __le64  rf_unused2;
+};
+
+struct rcom_slot {
+   __le32  ro_nodeid;
+   __le16  ro_slot;
+   __le16  ro_unused1;
+   __le64  ro_unused2;
 };
 
 struct rcom_lock {
@@ -455,6 +493,7 @@ struct dlm_ls {
struct list_headls_list;/* list of lockspaces */
dlm_lockspace_t *ls_local_handle;
uint32_tls_global_id;   /* global unique lockspace ID */
+   uint32_tls_generation;
uint32_tls_exflags;
int ls_lvblen;
int ls_count;   /* refcount of processes in
@@ -493,6 +532,11 @@ struct dlm_ls {
int ls_total_weight;
int *ls_node_array;
 
+   int ls_slot;
+   int ls_num_slots;
+   int ls_slots_size;
+   struct dlm_slot *ls_slots;
+
struct dlm_rsb  ls_stub_rsb;/* for returning errors */
struct dlm_lkb  ls_stub_lkb;/* for returning errors */
struct dlm_message  ls_stub_ms; /* for faking a reply */
diff --git a/fs/dlm/lockspace.c b/fs/dlm/lockspace.c
index 1d16a23..1441f04 100644
--- a/fs/dlm/lockspace.c
+++ b/fs/dlm/lockspace.c
@@ -525,6 +525,11 @@ static int new_lockspace(const char *name, int namelen, 
void **lockspace,
if (!ls->ls_recover_buf)
goto out_dirfree;
 
+   ls->ls_slot = 0;
+   ls->ls_num_slots = 0;
+   ls->ls_slots_size = 0;
+   ls->ls_slots = NULL;
+
INIT_LIST_HEAD(&ls->ls_recover_list);
spin_lock_init(&ls->ls_recover_list_lock);
ls->ls_recover_list_count = 0;
diff --git a/fs/dlm/member.c b/fs/dlm/member.c
index 5ebd1df..eebc52a 100644
--- a/fs/dlm/member.c
+++ b/fs/dlm/member.c
@@ -19,6 +19,280 @@
 #include "config.h"
 #include "lowcomms.h"
 
+int dlm_slots_version(struct dlm_header *h)
+{
+   if ((h->h_version & 0x0000ffff) < DLM_HEADER_SLOTS)
+   return 0;
+   return 1;
+}
+
+void dlm_slot_save(struct dlm_ls *ls, struct dlm_rcom *rc,
+ s

[Cluster-devel] [PATCH 2/5] dlm: move recovery barrier calls

2012-01-05 Thread David Teigland
Put all the calls to recovery barriers in the same function
to clarify where they each happen.  Should not change any behavior.
Also modify some recovery debug lines to make them consistent.

Signed-off-by: David Teigland 
---
 fs/dlm/dir.c  |1 -
 fs/dlm/member.c   |7 +--
 fs/dlm/recover.c  |2 --
 fs/dlm/recoverd.c |   45 +++--
 4 files changed, 28 insertions(+), 27 deletions(-)

diff --git a/fs/dlm/dir.c b/fs/dlm/dir.c
index 7b84c1d..8364157 100644
--- a/fs/dlm/dir.c
+++ b/fs/dlm/dir.c
@@ -290,7 +290,6 @@ int dlm_recover_directory(struct dlm_ls *ls)
 
  out_status:
error = 0;
-   dlm_set_recover_status(ls, DLM_RS_DIR);
log_debug(ls, "dlm_recover_directory %d entries", count);
  out_free:
kfree(last_name);
diff --git a/fs/dlm/member.c b/fs/dlm/member.c
index b12532e..5ebd1df 100644
--- a/fs/dlm/member.c
+++ b/fs/dlm/member.c
@@ -251,7 +251,6 @@ int dlm_recover_members(struct dlm_ls *ls, struct 
dlm_recover *rv, int *neg_out)
ls->ls_low_nodeid = low;
 
make_member_array(ls);
-   dlm_set_recover_status(ls, DLM_RS_NODES);
*neg_out = neg;
 
error = ping_members(ls);
@@ -261,12 +260,8 @@ int dlm_recover_members(struct dlm_ls *ls, struct 
dlm_recover *rv, int *neg_out)
ls->ls_members_result = error;
complete(&ls->ls_members_done);
}
-   if (error)
-   goto out;
 
-   error = dlm_recover_members_wait(ls);
- out:
-   log_debug(ls, "total members %d error %d", ls->ls_num_nodes, error);
+   log_debug(ls, "dlm_recover_members %d nodes", ls->ls_num_nodes);
return error;
 }
 
diff --git a/fs/dlm/recover.c b/fs/dlm/recover.c
index 50467ce..81b2393 100644
--- a/fs/dlm/recover.c
+++ b/fs/dlm/recover.c
@@ -542,8 +542,6 @@ int dlm_recover_locks(struct dlm_ls *ls)
  out:
if (error)
recover_list_clear(ls);
-   else
-   dlm_set_recover_status(ls, DLM_RS_LOCKS);
return error;
 }
 
diff --git a/fs/dlm/recoverd.c b/fs/dlm/recoverd.c
index 774da3c..5a9e1a4 100644
--- a/fs/dlm/recoverd.c
+++ b/fs/dlm/recoverd.c
@@ -54,7 +54,7 @@ static int ls_recover(struct dlm_ls *ls, struct dlm_recover 
*rv)
unsigned long start;
int error, neg = 0;
 
-   log_debug(ls, "recover %llx", (unsigned long long)rv->seq);
+   log_debug(ls, "dlm_recover %llx", (unsigned long long)rv->seq);
 
mutex_lock(&ls->ls_recoverd_active);
 
@@ -76,14 +76,22 @@ static int ls_recover(struct dlm_ls *ls, struct dlm_recover 
*rv)
 
/*
 * Add or remove nodes from the lockspace's ls_nodes list.
-* Also waits for all nodes to complete dlm_recover_members.
 */
 
error = dlm_recover_members(ls, rv, &neg);
if (error) {
-   log_debug(ls, "recover_members failed %d", error);
+   log_debug(ls, "dlm_recover_members error %d", error);
goto fail;
}
+
+   dlm_set_recover_status(ls, DLM_RS_NODES);
+
+   error = dlm_recover_members_wait(ls);
+   if (error) {
+   log_debug(ls, "dlm_recover_members_wait error %d", error);
+   goto fail;
+   }
+
start = jiffies;
 
/*
@@ -93,17 +101,15 @@ static int ls_recover(struct dlm_ls *ls, struct 
dlm_recover *rv)
 
error = dlm_recover_directory(ls);
if (error) {
-   log_debug(ls, "recover_directory failed %d", error);
+   log_debug(ls, "dlm_recover_directory error %d", error);
goto fail;
}
 
-   /*
-* Wait for all nodes to complete directory rebuild.
-*/
+   dlm_set_recover_status(ls, DLM_RS_DIR);
 
error = dlm_recover_directory_wait(ls);
if (error) {
-   log_debug(ls, "recover_directory_wait failed %d", error);
+   log_debug(ls, "dlm_recover_directory_wait error %d", error);
goto fail;
}
 
@@ -133,7 +139,7 @@ static int ls_recover(struct dlm_ls *ls, struct dlm_recover 
*rv)
 
error = dlm_recover_masters(ls);
if (error) {
-   log_debug(ls, "recover_masters failed %d", error);
+   log_debug(ls, "dlm_recover_masters error %d", error);
goto fail;
}
 
@@ -143,13 +149,15 @@ static int ls_recover(struct dlm_ls *ls, struct 
dlm_recover *rv)
 
error = dlm_recover_locks(ls);
if (error) {
-   log_debug(ls, "recover_locks failed %d", error);
+   log_debug(ls, "dlm_recover_locks error %d", error);
goto fail;
}
 
+   dlm_set_recover_status(ls, DLM_RS_LOCK

[Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread David Teigland
This new method of managing recovery is an alternative to
the previous approach of using the userland gfs_controld.

- use dlm slot numbers to assign journal id's
- use dlm recovery callbacks to initiate journal recovery
- use a dlm lock to determine the first node to mount fs
- use a dlm lock to track journals that need recovery

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c |2 +-
 fs/gfs2/glock.h |7 +-
 fs/gfs2/incore.h|   58 +++-
 fs/gfs2/lock_dlm.c  |  993 ++-
 fs/gfs2/main.c  |   10 +
 fs/gfs2/ops_fstype.c|   29 +-
 fs/gfs2/recovery.c  |4 +
 fs/gfs2/sys.c   |   33 +-
 fs/gfs2/sys.h   |2 +
 include/linux/gfs2_ondisk.h |2 +
 10 files changed, 1098 insertions(+), 42 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 88e8a23..376816f 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1353,7 +1353,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
spin_lock(&gl->gl_spin);
gl->gl_reply = ret;
 
-   if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_flags))) {
+   if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags))) {
if (gfs2_should_freeze(gl)) {
set_bit(GLF_FROZEN, &gl->gl_flags);
spin_unlock(&gl->gl_spin);
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 6670711..5b548b07 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -121,8 +121,11 @@ enum {
 
 struct lm_lockops {
const char *lm_proto_name;
-   int (*lm_mount) (struct gfs2_sbd *sdp, const char *fsname);
-   void (*lm_unmount) (struct gfs2_sbd *sdp);
+   int (*lm_mount) (struct gfs2_sbd *sdp, const char *table);
+   void (*lm_first_done) (struct gfs2_sbd *sdp);
+   void (*lm_recovery_result) (struct gfs2_sbd *sdp, unsigned int jid,
+   unsigned int result);
+   void (*lm_unmount) (struct gfs2_sbd *sdp);
void (*lm_withdraw) (struct gfs2_sbd *sdp);
void (*lm_put_lock) (struct gfs2_glock *gl);
int (*lm_lock) (struct gfs2_glock *gl, unsigned int req_state,
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 892ac37..9182a87 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -139,8 +139,45 @@ struct gfs2_bufdata {
 #define GDLM_STRNAME_BYTES 25
 #define GDLM_LVB_SIZE  32
 
+/*
+ * ls_recover_flags:
+ *
+ * DFL_BLOCK_LOCKS: dlm is in recovery and will grant locks that had been
+ * held by failed nodes whose journals need recovery.  Those locks should
+ * only be used for journal recovery until the journal recovery is done.
+ * This is set by the dlm recover_prep callback and cleared by the
+ * gfs2_control thread when journal recovery is complete.  To avoid
+ * races between recover_prep setting and gfs2_control clearing, recover_spin
+ * is held while changing this bit and reading/writing recover_block
+ * and recover_start.
+ *
+ * DFL_NO_DLM_OPS: dlm lockspace ops/callbacks are not being used.
+ *
+ * DFL_FIRST_MOUNT: this node is the first to mount this fs and is doing
+ * recovery of all journals before allowing other nodes to mount the fs.
+ * This is cleared when FIRST_MOUNT_DONE is set.
+ *
+ * DFL_FIRST_MOUNT_DONE: this node was the first mounter, and has finished
+ * recovery of all journals, and now allows other nodes to mount the fs.
+ *
+ * DFL_MOUNT_DONE: gdlm_mount has completed successfully and cleared
+ * BLOCK_LOCKS for the first time.  The gfs2_control thread should now
+ * control clearing BLOCK_LOCKS for further recoveries.
+ *
+ * DFL_UNMOUNT: gdlm_unmount sets to keep sdp off gfs2_control_wq.
+ *
+ * DFL_DLM_RECOVERY: set while dlm is in recovery, between recover_prep()
+ * and recover_done(), i.e. set while recover_block == recover_start.
+ */
+
 enum {
DFL_BLOCK_LOCKS = 0,
+   DFL_NO_DLM_OPS  = 1,
+   DFL_FIRST_MOUNT = 2,
+   DFL_FIRST_MOUNT_DONE= 3,
+   DFL_MOUNT_DONE  = 4,
+   DFL_UNMOUNT = 5,
+   DFL_DLM_RECOVERY= 6,
 };
 
 struct lm_lockname {
@@ -504,14 +541,26 @@ struct gfs2_sb_host {
 struct lm_lockstruct {
int ls_jid;
unsigned int ls_first;
-   unsigned int ls_first_done;
unsigned int ls_nodir;
const struct lm_lockops *ls_ops;
-   unsigned long ls_flags;
dlm_lockspace_t *ls_dlm;
 
-   int ls_recover_jid_done;
-   int ls_recover_jid_status;
+   int ls_recover_jid_done;   /* These two are deprecated, */
+   int ls_recover_jid_status; /* used previously by gfs_controld */
+
+   struct dlm_lksb ls_mounted_lksb; /* mounted_lock */
+   struct dlm_lksb ls_control_lksb; /* control_lock */
+   char ls_control_lvb[GDLM_LVB_SIZE]; /* control_lock lvb */
+   struct completion ls_sync_wait; /* {control,mounted}_{lock,unlock} */
+
+   

[Cluster-devel] [PATCH 4/5] dlm: add recovery callbacks

2012-01-05 Thread David Teigland
These new callbacks notify the dlm user about lock recovery.
GFS2, and possibly others, need to be aware of when the dlm
will be doing lock recovery for a failed lockspace member.

In the past, this coordination has been done between dlm and
file system daemons in userspace, which then direct their
kernel counterparts.  These callbacks allow the same
coordination directly, and more simply.

Signed-off-by: David Teigland 
---
 fs/dlm/config.c   |  130 ++--
 fs/dlm/config.h   |   17 +++-
 fs/dlm/dlm_internal.h |   21 ++
 fs/dlm/lockspace.c|   43 +--
 fs/dlm/member.c   |  197 +++--
 fs/dlm/member.h   |3 +-
 fs/dlm/recoverd.c |   10 +-
 fs/dlm/user.c |5 +-
 fs/gfs2/lock_dlm.c|4 +-
 fs/ocfs2/stack_user.c |4 +-
 include/linux/dlm.h   |   71 -
 11 files changed, 333 insertions(+), 172 deletions(-)

diff --git a/fs/dlm/config.c b/fs/dlm/config.c
index 6cf72fc..e7e327d 100644
--- a/fs/dlm/config.c
+++ b/fs/dlm/config.c
@@ -2,7 +2,7 @@
 ***
 **
 **  Copyright (C) Sistina Software, Inc.  1997-2003  All rights reserved.
-**  Copyright (C) 2004-2008 Red Hat, Inc.  All rights reserved.
+**  Copyright (C) 2004-2011 Red Hat, Inc.  All rights reserved.
 **
 **  This copyrighted material is made available to anyone wishing to use,
 **  modify, copy, or redistribute it subject to the terms and conditions
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -36,6 +37,7 @@
 static struct config_group *space_list;
 static struct config_group *comm_list;
 static struct dlm_comm *local_comm;
+static uint32_t dlm_comm_count;
 
 struct dlm_clusters;
 struct dlm_cluster;
@@ -103,6 +105,8 @@ struct dlm_cluster {
unsigned int cl_timewarn_cs;
unsigned int cl_waitwarn_us;
unsigned int cl_new_rsb_count;
+   unsigned int cl_recover_callbacks;
+   char cl_cluster_name[DLM_LOCKSPACE_LEN];
 };
 
 enum {
@@ -118,6 +122,8 @@ enum {
CLUSTER_ATTR_TIMEWARN_CS,
CLUSTER_ATTR_WAITWARN_US,
CLUSTER_ATTR_NEW_RSB_COUNT,
+   CLUSTER_ATTR_RECOVER_CALLBACKS,
+   CLUSTER_ATTR_CLUSTER_NAME,
 };
 
 struct cluster_attribute {
@@ -126,6 +132,27 @@ struct cluster_attribute {
ssize_t (*store)(struct dlm_cluster *, const char *, size_t);
 };
 
+static ssize_t cluster_cluster_name_read(struct dlm_cluster *cl, char *buf)
+{
+   return sprintf(buf, "%s\n", cl->cl_cluster_name);
+}
+
+static ssize_t cluster_cluster_name_write(struct dlm_cluster *cl,
+ const char *buf, size_t len)
+{
+   strncpy(dlm_config.ci_cluster_name, buf, DLM_LOCKSPACE_LEN);
+   strncpy(cl->cl_cluster_name, buf, DLM_LOCKSPACE_LEN);
+   return len;
+}
+
+static struct cluster_attribute cluster_attr_cluster_name = {
+   .attr   = { .ca_owner = THIS_MODULE,
+.ca_name = "cluster_name",
+.ca_mode = S_IRUGO | S_IWUSR },
+   .show   = cluster_cluster_name_read,
+   .store  = cluster_cluster_name_write,
+};
+
 static ssize_t cluster_set(struct dlm_cluster *cl, unsigned int *cl_field,
   int *info_field, int check_zero,
   const char *buf, size_t len)
@@ -171,6 +198,7 @@ CLUSTER_ATTR(protocol, 0);
 CLUSTER_ATTR(timewarn_cs, 1);
 CLUSTER_ATTR(waitwarn_us, 0);
 CLUSTER_ATTR(new_rsb_count, 0);
+CLUSTER_ATTR(recover_callbacks, 0);
 
 static struct configfs_attribute *cluster_attrs[] = {
[CLUSTER_ATTR_TCP_PORT] = &cluster_attr_tcp_port.attr,
@@ -185,6 +213,8 @@ static struct configfs_attribute *cluster_attrs[] = {
[CLUSTER_ATTR_TIMEWARN_CS] = &cluster_attr_timewarn_cs.attr,
[CLUSTER_ATTR_WAITWARN_US] = &cluster_attr_waitwarn_us.attr,
[CLUSTER_ATTR_NEW_RSB_COUNT] = &cluster_attr_new_rsb_count.attr,
+   [CLUSTER_ATTR_RECOVER_CALLBACKS] = &cluster_attr_recover_callbacks.attr,
+   [CLUSTER_ATTR_CLUSTER_NAME] = &cluster_attr_cluster_name.attr,
NULL,
 };
 
@@ -293,6 +323,7 @@ struct dlm_comms {
 
 struct dlm_comm {
struct config_item item;
+   int seq;
int nodeid;
int local;
int addr_count;
@@ -309,6 +340,7 @@ struct dlm_node {
int nodeid;
int weight;
int new;
+   int comm_seq; /* copy of cm->seq when nd->nodeid is set */
 };
 
 static struct configfs_group_operations clusters_ops = {
@@ -455,6 +487,9 @@ static struct config_group *make_cluster(struct 
config_group *g,
cl->cl_timewarn_cs = dlm_config.ci_timewarn_cs;
cl->cl_waitwarn_us = dlm_config.ci_waitwarn_us;
cl->cl_new_rsb_count = dlm_config.ci_new_rsb_count;
+   cl->cl_recover_callbacks = dlm_config.ci_recover_callbacks;
+   memcpy(cl->cl_cluster_name, dlm_config.c

Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-05 Thread David Teigland
On Thu, Jan 05, 2012 at 04:58:22PM +, Steven Whitehouse wrote:
> > +   clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
> > +   smp_mb__after_clear_bit();
> > +   wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
> > +   ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
> > +   return 0;
> > +
> 
> This bit of code, which was correct last time you posted this patch
> appears to have reverted to its previous incorrect state. ls_first must

Thanks, I'll move it back; I removed ls_first and put it back in the wrong
place.  I keep forgetting about it because...

> be set before SDF_NOJOURNALID is cleared, otherwise the uninitialised
> value may be read,

in this case there can be no other reader, so it doesn't matter.
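
For reference, the ordering being asked for is just a reshuffle of the
quoted lines (sketch):

	ls->ls_first = !!test_bit(DFL_FIRST_MOUNT, &ls->ls_recover_flags);
	clear_bit(SDF_NOJOURNALID, &sdp->sd_flags);
	smp_mb__after_clear_bit();
	wake_up_bit(&sdp->sd_flags, SDF_NOJOURNALID);
	return 0;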

Dave



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-09 Thread David Teigland
On Mon, Jan 09, 2012 at 04:36:30PM +, Steven Whitehouse wrote:
> On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
> > This new method of managing recovery is an alternative to
> > the previous approach of using the userland gfs_controld.
> > 
> > - use dlm slot numbers to assign journal id's
> > - use dlm recovery callbacks to initiate journal recovery
> > - use a dlm lock to determine the first node to mount fs
> > - use a dlm lock to track journals that need recovery
> 
> I've just been looking at this again, and a question springs to mind...
> how does this deal with nodes which are read-only or spectator mounts?
> In the old system we used to propagate that information to gfs_controld
> but I've not spotted anything similar in the patch so far, so I'm
> wondering whether it needs to know that information or not,

The dlm allocates a "slot" for all lockspace members, so spectator mounts
(like readonly mounts) would be given a slot/jid.  In gfs_controld,
spectator mounts are not given a jid (that requirement dates from the time
when adding a journal required extending the device+fs).  These days, there's
probably no meaningful difference between spectator and readonly mounts.



Re: [Cluster-devel] [PATCH 5/5] gfs2: dlm based recovery coordination

2012-01-09 Thread David Teigland
On Mon, Jan 09, 2012 at 11:46:26AM -0500, David Teigland wrote:
> On Mon, Jan 09, 2012 at 04:36:30PM +, Steven Whitehouse wrote:
> > On Thu, 2012-01-05 at 10:46 -0600, David Teigland wrote:
> > > This new method of managing recovery is an alternative to
> > > the previous approach of using the userland gfs_controld.
> > > 
> > > - use dlm slot numbers to assign journal id's
> > > - use dlm recovery callbacks to initiate journal recovery
> > > - use a dlm lock to determine the first node to mount fs
> > > - use a dlm lock to track journals that need recovery
> > 
> > I've just been looking at this again, and a question springs to mind...
> > how does this deal with nodes which are read-only or spectator mounts?
> > In the old system we used to propagate that information to gfs_controld
> > but I've not spotted anything similar in the patch so far, so I'm
> > wondering whether it needs to know that information or not,
> 
> The dlm allocates a "slot" for all lockspace members, so spectator mounts
> (like readonly mounts) would be given a slot/jid.  In gfs_controld,
> spectator mounts are not given a jid (that requirement dates from the time
> when adding a journal required extending the device+fs).  These days, there's
> probably no meaningful difference between spectator and readonly mounts.

There's one other part, and that's whether a readonly or spectator node
should attempt to recover the journal of a failed node.  In cluster3 this
decision was always a bit mixed up, with some logic in gfs_controld and
some in gfs2.

We should make a clear decision now and include it in this patch.
I think gfs2_recover_func() should return GAVEUP right at the start
for any of the cases where you don't want it doing recovery.  What
cases would you prefer?
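
Concretely, the early-out would sit at the top of gfs2_recover_func()
and look something like this (a sketch; recovery_not_wanted() is a
made-up placeholder for whatever set of cases we settle on):

	if (recovery_not_wanted(sdp)) {
		gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
		goto done;	/* existing label that clears JDF_RECOVERY */
	}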



[Cluster-devel] gfs2: let spectator mount do read only recovery

2012-01-09 Thread David Teigland
Previously, a spectator mount would not even attempt to do
journal recovery for a failed node.  This meant that if all
mounted nodes were spectators, everyone would be stuck after
a node failed, all waiting for recovery to be performed.
This is unnecessary since the failed node had a clean journal.

Instead, allow a spectator mount to do a partial "read only"
recovery, which means it will check if the failed journal is
clean, and if so, report a successful recovery.  If the failed
journal is not clean, it reports that journal recovery failed.
This makes it work the same as a read only mount on a read only
block device.

Signed-off-by: David Teigland 
---
 fs/gfs2/incore.h |1 +
 fs/gfs2/ops_fstype.c |2 +-
 fs/gfs2/recovery.c   |4 +++-
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 9182a87..59114c5 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -503,6 +503,7 @@ enum {
SDF_NORECOVERY  = 4,
SDF_DEMOTE  = 5,
SDF_NOJOURNALID = 6,
+   SDF_RORECOVERY  = 7, /* read only recovery */
 };
 
 #define GFS2_FSNAME_LEN256
diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index 0df89da..ef097b2 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1081,7 +1081,7 @@ static int fill_super(struct super_block *sb, struct 
gfs2_args *args, int silent
 
if (sdp->sd_args.ar_spectator) {
 sb->s_flags |= MS_RDONLY;
-   set_bit(SDF_NORECOVERY, &sdp->sd_flags);
+   set_bit(SDF_RORECOVERY, &sdp->sd_flags);
}
if (sdp->sd_args.ar_posix_acl)
sb->s_flags |= MS_POSIXACL;
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index af49e8f..80701d1 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -516,7 +516,9 @@ void gfs2_recover_func(struct work_struct *work)
if (error)
goto fail_gunlock_ji;
 
-   if (test_bit(SDF_JOURNAL_CHECKED, &sdp->sd_flags)) {
+   if (test_bit(SDF_RORECOVERY, &sdp->sd_flags)) {
+   ro = 1;
+   } else if (test_bit(SDF_JOURNAL_CHECKED, &sdp->sd_flags)) {
if (!test_bit(SDF_JOURNAL_LIVE, &sdp->sd_flags))
ro = 1;
} else {
-- 
1.7.6



[Cluster-devel] gfs2: fail mount if journal recovery fails

2012-01-09 Thread David Teigland
If the first mounter fails to recover one of the journals
during mount, the mount should fail.

Signed-off-by: David Teigland 
---
 fs/gfs2/incore.h   |1 +
 fs/gfs2/recovery.c |3 ++-
 2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 59114c5..ad2d05e 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -434,6 +434,7 @@ struct gfs2_jdesc {
 #define JDF_RECOVERY 1
unsigned int jd_jid;
unsigned int jd_blocks;
+   int jd_recover_error;
 };
 
 struct gfs2_statfs_change_host {
diff --git a/fs/gfs2/recovery.c b/fs/gfs2/recovery.c
index 80701d1..963b2d7 100644
--- a/fs/gfs2/recovery.c
+++ b/fs/gfs2/recovery.c
@@ -583,6 +583,7 @@ fail_gunlock_j:
 
fs_info(sdp, "jid=%u: %s\n", jd->jd_jid, (error) ? "Failed" : "Done");
 fail:
+   jd->jd_recover_error = error;
gfs2_recovery_done(sdp, jd->jd_jid, LM_RD_GAVEUP);
 done:
clear_bit(JDF_RECOVERY, &jd->jd_flags);
@@ -611,6 +612,6 @@ int gfs2_recover_journal(struct gfs2_jdesc *jd, bool wait)
wait_on_bit(&jd->jd_flags, JDF_RECOVERY, gfs2_recovery_wait,
TASK_UNINTERRUPTIBLE);
 
-   return 0;
+   return wait ? jd->jd_recover_error : 0;
 }
 
-- 
1.7.6



[Cluster-devel] gfs2: dlm based recovery coordination

2012-01-09 Thread David Teigland
Steve, attached is the latest version of this patch, which includes
changes for all the suggestions I've seen.  I've sent a pull request
for all the dlm patches preceding this.  Would you like to take this into
your tree once the dlm patches are pulled?  As I've mentioned, I think
the current merge cycle would be good, but you can send it off whenever
you think is right.

Dave

>From 0fb2d7726b570c6a5eb289bac237fb384b9c6f0b Mon Sep 17 00:00:00 2001
From: David Teigland 
Date: Tue, 20 Dec 2011 17:03:04 -0600
Subject: [PATCH] gfs2: dlm based recovery coordination

This new method of managing recovery is an alternative to
the previous approach of using the userland gfs_controld.

- use dlm slot numbers to assign journal id's
- use dlm recovery callbacks to initiate journal recovery
- use a dlm lock to determine the first node to mount fs
- use a dlm lock to track journals that need recovery

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c |2 +-
 fs/gfs2/glock.h |7 +-
 fs/gfs2/incore.h|   58 +++-
 fs/gfs2/lock_dlm.c  |  993 ++-
 fs/gfs2/main.c  |   10 +
 fs/gfs2/ops_fstype.c|   29 +-
 fs/gfs2/recovery.c  |4 +
 fs/gfs2/sys.c   |   33 +-
 fs/gfs2/sys.h   |2 +
 include/linux/gfs2_ondisk.h |2 +
 10 files changed, 1098 insertions(+), 42 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index 88e8a23..376816f 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1353,7 +1353,7 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
spin_lock(&gl->gl_spin);
gl->gl_reply = ret;
 
-   if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_flags))) {
+   if (unlikely(test_bit(DFL_BLOCK_LOCKS, &ls->ls_recover_flags))) {
if (gfs2_should_freeze(gl)) {
set_bit(GLF_FROZEN, &gl->gl_flags);
spin_unlock(&gl->gl_spin);
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 6670711..5b548b07 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -121,8 +121,11 @@ enum {
 
 struct lm_lockops {
const char *lm_proto_name;
-   int (*lm_mount) (struct gfs2_sbd *sdp, const char *fsname);
-   void (*lm_unmount) (struct gfs2_sbd *sdp);
+   int (*lm_mount) (struct gfs2_sbd *sdp, const char *table);
+   void (*lm_first_done) (struct gfs2_sbd *sdp);
+   void (*lm_recovery_result) (struct gfs2_sbd *sdp, unsigned int jid,
+   unsigned int result);
+   void (*lm_unmount) (struct gfs2_sbd *sdp);
void (*lm_withdraw) (struct gfs2_sbd *sdp);
void (*lm_put_lock) (struct gfs2_glock *gl);
int (*lm_lock) (struct gfs2_glock *gl, unsigned int req_state,
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 892ac37..9182a87 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -139,8 +139,45 @@ struct gfs2_bufdata {
 #define GDLM_STRNAME_BYTES 25
 #define GDLM_LVB_SIZE  32
 
+/*
+ * ls_recover_flags:
+ *
+ * DFL_BLOCK_LOCKS: dlm is in recovery and will grant locks that had been
+ * held by failed nodes whose journals need recovery.  Those locks should
+ * only be used for journal recovery until the journal recovery is done.
+ * This is set by the dlm recover_prep callback and cleared by the
+ * gfs2_control thread when journal recovery is complete.  To avoid
+ * races between recover_prep setting and gfs2_control clearing, recover_spin
+ * is held while changing this bit and reading/writing recover_block
+ * and recover_start.
+ *
+ * DFL_NO_DLM_OPS: dlm lockspace ops/callbacks are not being used.
+ *
+ * DFL_FIRST_MOUNT: this node is the first to mount this fs and is doing
+ * recovery of all journals before allowing other nodes to mount the fs.
+ * This is cleared when FIRST_MOUNT_DONE is set.
+ *
+ * DFL_FIRST_MOUNT_DONE: this node was the first mounter, and has finished
+ * recovery of all journals, and now allows other nodes to mount the fs.
+ *
+ * DFL_MOUNT_DONE: gdlm_mount has completed successfully and cleared
+ * BLOCK_LOCKS for the first time.  The gfs2_control thread should now
+ * control clearing BLOCK_LOCKS for further recoveries.
+ *
+ * DFL_UNMOUNT: gdlm_unmount sets to keep sdp off gfs2_control_wq.
+ *
+ * DFL_DLM_RECOVERY: set while dlm is in recovery, between recover_prep()
+ * and recover_done(), i.e. set while recover_block == recover_start.
+ */
+
 enum {
DFL_BLOCK_LOCKS = 0,
+   DFL_NO_DLM_OPS  = 1,
+   DFL_FIRST_MOUNT = 2,
+   DFL_FIRST_MOUNT_DONE= 3,
+   DFL_MOUNT_DONE  = 4,
+   DFL_UNMOUNT = 5,
+   DFL_DLM_RECOVERY= 6,
 };
 
 struct lm_lockname {
@@ -504,14 +541,26 @@ struct gfs2_sb_host {
 struct lm_lockstruct {
int ls_jid;
unsigned int ls_first;
-   unsigned int ls_first_done;
unsigned int ls_nod

Re: [Cluster-devel] [PATCH] fs/dlm/rcom.c: included member.h twice

2012-02-16 Thread David Teigland
On Thu, Feb 16, 2012 at 02:55:21PM +0100, Danny Kukawka wrote:
> fs/dlm/rcom.c included 'member.h' twice, remove the duplicate.

I'll fold this into the current patch I'm working on.

> 
> Signed-off-by: Danny Kukawka 
> ---
>  fs/dlm/rcom.c |1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/dlm/rcom.c b/fs/dlm/rcom.c
> index ac5c616..3464d6c 100644
> --- a/fs/dlm/rcom.c
> +++ b/fs/dlm/rcom.c
> @@ -23,7 +23,6 @@
>  #include "memory.h"
>  #include "lock.h"
>  #include "util.h"
> -#include "member.h"
>  
>  
>  static int rcom_response(struct dlm_ls *ls)
> -- 
> 1.7.8.3



Re: [Cluster-devel] last element of dlm_local_addr[] never used?

2012-03-21 Thread David Teigland
On Wed, Mar 21, 2012 at 12:24:35PM +0300, Dan Carpenter wrote:
> In fs/dlm/lowcomms.c we declare the dlm_local_addr[] array like
> this:
> static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
> 
> But it looks like the last element of the array is never used:
> 
>   1072  /* Get local addresses */
>   1073  static void init_local(void)
>   1074  {
>   1075  struct sockaddr_storage sas, *addr;
>   1076  int i;
>   1077  
>   1078  dlm_local_count = 0;
>   1079  for (i = 0; i < DLM_MAX_ADDR_COUNT - 1; i++) {
> ^^
>   1080  if (dlm_our_addr(&sas, i))
>   1081  break;
>   1082  
>   1083  addr = kmalloc(sizeof(*addr), GFP_NOFS);
>   1084  if (!addr)
>   1085  break;
>   1086  memcpy(addr, &sas, sizeof(*addr));
>   1087  dlm_local_addr[dlm_local_count++] = addr;
>   1088  }
>   1089  }
> 
> There isn't anywhere else we use it either.  It's not harmful to
> leave the last element unused but I wondered if it was intentional.

config.c properly supports the max, so we should just remove the -1.
I'll add a patch.
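
The patch would presumably amount to no more than this (sketch against
the loop quoted above):

-	for (i = 0; i < DLM_MAX_ADDR_COUNT - 1; i++) {
+	for (i = 0; i < DLM_MAX_ADDR_COUNT; i++) {
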
Thanks,
Dave



Re: [Cluster-devel] GFS2: Pre-pull patch posting (merge window)

2012-03-23 Thread David Teigland

> on i386:
>
> ERROR: "sctp_do_peeloff" [fs/dlm/dlm.ko] undefined!
>
>
> GFS2_FS selects DLM (if GFS2_FS_LOCKING_DLM, which is enabled).
> GFS2_FS selects IP_SCTP if DLM_SCTP, which is not enabled and not
> used anywhere else in the kernel tree AFAICT.
> DLM just always selects IP_SCTP.

Here's what we have now:

config GFS2_FS
tristate "GFS2 file system support"
depends on (64BIT || LBDAF)
select DLM if GFS2_FS_LOCKING_DLM
select CONFIGFS_FS if GFS2_FS_LOCKING_DLM
select SYSFS if GFS2_FS_LOCKING_DLM
select IP_SCTP if DLM_SCTP
select FS_POSIX_ACL
select CRC32
select QUOTACTL

menuconfig DLM
tristate "Distributed Lock Manager (DLM)"
depends on EXPERIMENTAL && INET
depends on SYSFS && CONFIGFS_FS && (IPV6 || IPV6=n)
select IP_SCTP

Why does gfs2 Kconfig bother with SCTP at all?  It seems that line should
just be removed.  I'll also remove EXPERIMENTAL.  I don't understand the
vagaries of Kconfig, so a dumb question: how could sctp_do_peeloff
possibly be undefined if we're selecting SCTP?



Re: [Cluster-devel] GFS2: Pre-pull patch posting (merge window)

2012-03-23 Thread David Teigland
On Fri, Mar 23, 2012 at 01:06:05PM -0700, Randy Dunlap wrote:
> >> GFS2_FS selects DLM (if GFS2_FS_LOCKING_DLM, which is enabled).
> >> GFS2_FS selects IP_SCTP if DLM_SCTP, which is not enabled and not
> >> used anywhere else in the kernel tree AFAICT.
> >> DLM just always selects IP_SCTP.
> > 
> > Here's what we have now:
> > 
> > config GFS2_FS
> > tristate "GFS2 file system support"
> > depends on (64BIT || LBDAF)
> > select DLM if GFS2_FS_LOCKING_DLM
> > select CONFIGFS_FS if GFS2_FS_LOCKING_DLM
> > select SYSFS if GFS2_FS_LOCKING_DLM
> > select IP_SCTP if DLM_SCTP
> > select FS_POSIX_ACL
> > select CRC32
> > select QUOTACTL
> > 
> > menuconfig DLM
> > tristate "Distributed Lock Manager (DLM)"
> > depends on EXPERIMENTAL && INET
> > depends on SYSFS && CONFIGFS_FS && (IPV6 || IPV6=n)
> > select IP_SCTP
> > 
> > Why does gfs2 Kconfig bother with SCTP at all?  It seems that line should
> > just be removed.  I'll also remove EXPERIMENTAL.  I don't understand the
> > vagaries of Kconfig, so a dumb question: how could sctp_do_peeloff
> > possibly be undefined if we're selecting SCTP?
> 
> What is selecting SCTP?  DLM?  so GFS2 selects DLM, but selects
> don't follow dependency chains.  Also, the "select IP_SCTP if DLM_SCTP"
> in GFS2 is meaningless since there is no DLM_SCTP.

https://lkml.org/lkml/2012/3/8/222 seems to have caused this by adding
the new dependency on the sctp module without any Kconfig changes.

Should that patch have added depends IP_SCTP to the dlm and gfs2?
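
(For reference, the removal suggested earlier in this thread would just
be the one-line change below; whether dlm itself should also grow an
explicit depends is the open question.)

	config GFS2_FS
		...
	-	select IP_SCTP if DLM_SCTP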



Re: [Cluster-devel] GPF in dlm_lowcomms_stop

2012-03-30 Thread David Teigland
On Wed, Mar 21, 2012 at 07:59:13PM -0600, dann frazier wrote:
> However... we've dropped the connections_lock, so its possible that a
> new connection gets created on line 9. This connection structure would
> have pointers to the workqueues that we're about to destroy. Sometime
> later on we get data on this new connection, we try to add work to the
> now-obliterated workqueues, and things blow up.

Hi Dan, I'm not very familiar with this code either, but I've talked with
Chrissie and she suggested we try something like this:

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..a3c431e 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {
 
 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;
 
 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
struct connection *newcon;
struct connection *addcon;
 
+   mutex_lock(&connections_lock);
+   if (!dlm_allow_conn) {
+   mutex_unlock(&connections_lock);
+   return -1;
+   }
+   mutex_unlock(&connections_lock);
+
memset(&peeraddr, 0, sizeof(peeraddr));
result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
   socket activity.
*/
mutex_lock(&connections_lock);
+   dlm_allow_conn = 0;
foreach_conn(stop_conn);
mutex_unlock(&connections_lock);
 
@@ -1540,6 +1549,8 @@ int dlm_lowcomms_start(void)
if (!con_cache)
goto out;
 
+   dlm_allow_conn = 1;
+
/* Start listening */
if (dlm_config.ci_protocol == 0)
error = tcp_listen_for_all();
@@ -1555,6 +1566,7 @@ int dlm_lowcomms_start(void)
return 0;
 
 fail_unlisten:
+   dlm_allow_conn = 0;
con = nodeid2con(0,0);
if (con) {
close_connection(con, false);



Re: [Cluster-devel] GPF in dlm_lowcomms_stop

2012-03-30 Thread David Teigland
On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> Hi Dan, I'm not very familiar with this code either, but I've talked with
> Chrissie and she suggested we try something like this:

A second version that addresses a potentially similar problem in start.

diff --git a/fs/dlm/lowcomms.c b/fs/dlm/lowcomms.c
index 133ef6d..5c1b0e3 100644
--- a/fs/dlm/lowcomms.c
+++ b/fs/dlm/lowcomms.c
@@ -142,6 +142,7 @@ struct writequeue_entry {
 
 static struct sockaddr_storage *dlm_local_addr[DLM_MAX_ADDR_COUNT];
 static int dlm_local_count;
+static int dlm_allow_conn;
 
 /* Work queues */
 static struct workqueue_struct *recv_workqueue;
@@ -710,6 +711,13 @@ static int tcp_accept_from_sock(struct connection *con)
struct connection *newcon;
struct connection *addcon;
 
+   mutex_lock(&connections_lock);
+   if (!dlm_allow_conn) {
+   mutex_unlock(&connections_lock);
+   return -1;
+   }
+   mutex_unlock(&connections_lock);
+
memset(&peeraddr, 0, sizeof(peeraddr));
result = sock_create_kern(dlm_local_addr[0]->ss_family, SOCK_STREAM,
  IPPROTO_TCP, &newsock);
@@ -1503,6 +1511,7 @@ void dlm_lowcomms_stop(void)
   socket activity.
*/
mutex_lock(&connections_lock);
+   dlm_allow_conn = 0;
foreach_conn(stop_conn);
mutex_unlock(&connections_lock);
 
@@ -1530,7 +1539,7 @@ int dlm_lowcomms_start(void)
if (!dlm_local_count) {
error = -ENOTCONN;
log_print("no local IP address has been set");
-   goto out;
+   goto fail;
}
 
error = -ENOMEM;
@@ -1538,7 +1547,13 @@ int dlm_lowcomms_start(void)
  __alignof__(struct connection), 0,
  NULL);
if (!con_cache)
-   goto out;
+   goto fail;
+
+   error = work_start();
+   if (error)
+   goto fail_destroy;
+
+   dlm_allow_conn = 1;
 
/* Start listening */
if (dlm_config.ci_protocol == 0)
@@ -1548,20 +1563,17 @@ int dlm_lowcomms_start(void)
if (error)
goto fail_unlisten;
 
-   error = work_start();
-   if (error)
-   goto fail_unlisten;
-
return 0;
 
 fail_unlisten:
+   dlm_allow_conn = 0;
con = nodeid2con(0,0);
if (con) {
close_connection(con, false);
kmem_cache_free(con_cache, con);
}
+fail_destroy:
kmem_cache_destroy(con_cache);
-
-out:
+fail:
return error;
 }



Re: [Cluster-devel] [GFS2 PATCH] GFS2: Instruct DLM to avoid queue convert slowdowns

2012-04-10 Thread David Teigland
On Tue, Apr 10, 2012 at 10:12:28AM +0100, Steven Whitehouse wrote:
> Hi,
> 
> On Thu, 2012-04-05 at 12:11 -0400, Bob Peterson wrote:
> > Hi,
> > 
> > Here's another patch (explanation below). This patch replies upon
> > a DLM patch that hasn't fully gone upstream yet, so perhaps it
> > shouldn't be added to the nmw tree until it is. This greatly
> > improves the performance of gfs2_grow in a clustered gfs2 file system.
> > 
> > Regards,
> > 
> I'm still not very keen on dragging in this bit of dlm. If it is really
> needed, then we should use the copy in the dlm itself and not add our
> own copy of it.

The table is equivalent to:
(rq_mode > gr_mode) || (gr_mode == PR && rq_mode == CW)
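
As a C predicate that would read roughly as follows (a sketch;
requires_quecvt() is a made-up name, and the mode constants are the
usual ones from linux/dlmconstants.h, where DLM_LOCK_CW (2) sorts
below DLM_LOCK_PR (3)):

	static int requires_quecvt(int gr_mode, int rq_mode)
	{
		/* upward conversions, plus the PR -> CW special case */
		return (rq_mode > gr_mode) ||
		       (gr_mode == DLM_LOCK_PR && rq_mode == DLM_LOCK_CW);
	}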

> When you say that this relies upon this dlm patch, what does that mean?
> What are the consequences of having this patch but not the dlm one? I'm
> wondering whether I should hold off on this at least until the dlm one
> has been finalised and applied.

Yeah, using QUECVT everywhere will make things worse until it's fixed.



Re: [Cluster-devel] GPF in dlm_lowcomms_stop

2012-05-04 Thread David Teigland
On Fri, May 04, 2012 at 11:33:17AM -0600, dann frazier wrote:
> On Fri, Mar 30, 2012 at 11:17:56AM -0600, dann frazier wrote:
> > On Fri, Mar 30, 2012 at 12:42:40PM -0400, David Teigland wrote:
> > > On Fri, Mar 30, 2012 at 11:42:56AM -0400, David Teigland wrote:
> > > > Hi Dan, I'm not very familiar with this code either, but I've talked 
> > > > with
> > > > Chrissie and she suggested we try something like this:
> > 
> > Yeah, that's the mechanism I was thinking of as well.
> > 
> > > A second version that addresses a potentially similar problem in
> > > start.
> > 
> > Oh, good catch! I hadn't considered that path.
> > 
> > Reviewed-by: dann frazier 
> 
> fyi, we haven't uncovered any problems when testing with this
> patch.

Great, thanks.

> Are there plans to submit it for the next merge cycle?

Yes, it's here
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/teigland/linux-dlm.git;a=shortlog;h=refs/heads/next




Re: [Cluster-devel] [patch] dlm: NULL dereference on failure in kmem_cache_create()

2012-05-15 Thread David Teigland
On Tue, May 15, 2012 at 11:58:12AM +0300, Dan Carpenter wrote:
> We aren't allowed to pass NULL pointers to kmem_cache_destroy() so if
> both allocations fail, it leads to a NULL dereference.

thanks, added that to next branch.
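
The fix being described is the usual NULL guard on the teardown path;
a sketch, assuming the two allocations are dlm's lkb and rsb caches:

	if (lkb_cache)
		kmem_cache_destroy(lkb_cache);
	if (rsb_cache)
		kmem_cache_destroy(rsb_cache);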



Re: [Cluster-devel] [patch] dlm: remove stray unlock

2012-05-21 Thread David Teigland
On Mon, May 21, 2012 at 05:35:26PM +0300, Dan Carpenter wrote:
> Smatch complains that we unlock this twice.  It looks accidental
> to me.

Thanks, will fix that.



Re: [Cluster-devel] [PATCH] cman init: allow dlm hash table sizes to be tunable at startup

2012-07-25 Thread David Teigland
On Wed, Jul 25, 2012 at 07:32:28AM +0200, Fabio M. Di Nitto wrote:
> From: "Fabio M. Di Nitto" 
> 
> Resolves: rhbz#842370

looks good, thanks

> +# DLM_LKBTBL_SIZE - DLM_RSBTBL_SIZE - DLM_DIRTBL_SIZE
> +# Allow tuning of DLM kernel hash table sizes.
> +# do NOT change unless instructed to do so.

Maybe something like "only change these according to documented
instructions"?  Because I believe there are instructions somewhere
(RH customer portal?) about when/how to change these.



Re: [Cluster-devel] tasks of dlm_recoverd?

2012-08-27 Thread David Teigland
On Mon, Aug 27, 2012 at 01:43:22PM +0200, Heiko Nardmann wrote:
> Hi together!
> 
> During the shutdown of my second cluster node (two node cluster) I
> have seen a process 'dlm_recoverd' running with 100% CPU usage for
> about 6 minutes.
> 
> It's just that I have no idea what the task of this process is, since
> I haven't been able to find the binary or a man page for it. Is
> this one coming from the kernel or from one of the cluster
> components?
> 
> Can anyone point me into the right direction? Googling hasn't been
> helpful either ...

It's a kernel thread, and it's doing lock recovery.  There have been some
recent improvements to speed up recovery for millions of locks, e.g.
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=7210cb7a72a22303cdb225bd1aea28697a17bbae



Re: [Cluster-devel] [PATCH] dlm: check the maximum size of a request from user

2012-09-10 Thread David Teigland
On Sun, Sep 09, 2012 at 04:16:58PM +0200, Sasha Levin wrote:
> device_write only checks whether the request size is big enough, but it 
> doesn't
> check if the size is too big.
> 
> At that point, it also tries to allocate as much memory as the user has 
> requested
> even if it's too much. This can lead to OOM killer kicking in, or memory 
> corruption
> if (count + 1) overflows.

thanks, pushed to next
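
The check being described caps count against the largest valid request
before anything is allocated; a sketch (not necessarily the exact
merged hunk):

	if (count > sizeof(struct dlm_write_request) + DLM_RESNAME_MAXLEN)
		return -EINVAL;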



Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 09:25:08AM +, Dietmar Maurer wrote:
> So the observed behavior is expected? 

Yes, it's a stateful partition merge, and I think /var/log/messages should
have mentioned something about that.  When a node is partitioned from the
others (e.g. network disconnected), it has to be cleanly reset before it's
allowed back.  "cleanly reset" typically means rebooted.  If it comes back
without being reset (e.g. network reconnected), then the others ignore it,
which is what you saw.



Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 04:12:10PM +, Dietmar Maurer wrote:
> > Yes, it's a stateful partition merge, and I think /var/log/messages should 
> > have
> > mentioned something about that.  When a node is partitioned from the
> > others (e.g. network disconnected), it has to be cleanly reset before it's
> > allowed back.  "cleanly reset" typically means rebooted.  If it comes back
> > without being reset (e.g. network reconnected), then the others ignore it,
> > which is what you saw.

> What message should I look for?

I was wrong; I was thinking about the "daemon node %d stateful merge"
messages which are debug, but should probably be changed to error.

> I don't really understand why 'dlm_controld' initiates fencing, although
> the node does not have quorum?
> 
> I thought 'dlm_controld' should wait until the cluster is quorate before
> starting fence actions?

I guess you're talking about the dlm_tool ls output?  The "fencing" there
means it is waiting for fenced to finish fencing before it starts dlm
recovery.  fenced waits for quorum.

hp2:~# dlm_tool ls
dlm lockspaces
name  rgmanager
id0x5231f3eb
flags 0x0004 kern_stop
changemember 3 joined 1 remove 0 failed 0 seq 2,2
members   2 3 4
new changemember 2 joined 0 remove 1 failed 1 seq 3,3
new statuswait_messages 0 wait_condition 1 fencing
new members   3 4





Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 04:26:35PM +, Dietmar Maurer wrote:
> > I guess you're talking about the dlm_tool ls output? 
> 
> Yes.
> 
> > The "fencing" there
> > means it is waiting for fenced to finish fencing before it starts dlm 
> > recovery.
> > fenced waits for quorum.
> 
> So who actually starts fencing when cluster is not quorate? rgmanager?

fenced always starts fencing, but it waits for quorum first.  In other
words, if your cluster loses quorum, nothing happens, not even fencing.

The intention of that is to prevent an inquorate node/partition from
killing a quorate group of nodes that are running normally.  e.g. if a 5
node cluster is partitioned into 2/3 or 1/4.  You don't want the 2 or 1
node group to fence the 3 or 4 nodes that are fine.

The difficult cases, which I think you're seeing, are partitions where no
group has quorum, e.g. 2/2.  In this case we do nothing, and the user has
to resolve it by resetting some of the nodes.  You might be able to assign
different numbers of votes to reduce the likelihood of everyone losing
quorum.
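
As a rough illustration: in a four-node cluster, giving one node two
votes makes the expected votes five and quorum three, so an even 2/2
split still leaves the side holding the two-vote node quorate.  In
cluster.conf that is just the votes attribute (sketch):

	<clusternode name="node1" nodeid="1" votes="2"/>
	<clusternode name="node2" nodeid="2" votes="1"/>
	<clusternode name="node3" nodeid="3" votes="1"/>
	<clusternode name="node4" nodeid="4" votes="1"/>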



Re: [Cluster-devel] fence daemon problems

2012-10-03 Thread David Teigland
On Wed, Oct 03, 2012 at 04:55:55PM +, Dietmar Maurer wrote:
> > The difficult cases, which I think you're seeing, are partitions where
> > no group has quorum, e.g. 2/2.  In this case we do nothing, and the
> > user has to resolve it by resetting some of the nodes
> 
> The problem with that is that those 'difficult' cases are very likely.
> For example a switch reboot results in that state if you do not have
> redundant network (yes, I know that this setup is simply wrong).
> 
> And things get worse, because it is not possible to reboot such nodes,
> because rgmanager shutdown simply hangs. Is there any way to avoid that,
> so that it is at least possible to reboot those nodes?

Fabio's checkquorum script will reboot nodes that lose quorum.



Re: [Cluster-devel] cluster4 dlm dlm_stonith - should it really fence by turning node off?

2012-11-05 Thread David Teigland
On Sat, Nov 03, 2012 at 03:58:28PM +0100, Jacek Konieczny wrote:
> Hello,
> 
> The dlm_stonith fencing helper is really convenient when Pacemaker is in
> use. Though, it doesn't quite work as I would expect: when fencing
> is needed it requests a node to be turned off instead of rebooting. And
> it doesn't handle unfencing, so automatic recovery is not possible
> (a rebooted node could join the cluster cleanly later, provided quorum
> handling is properly configured in the cluster stack).
> 
> Preferably this behaviour should be configurable. I have hacked a
> work-around by (ab)using argv[0]: when 'dlm_stonith' is called as
> 'dlm_stonith_reboot' the node is rebooted instead of halted;
> this works for me well enough, but I don't think this is the right
> solution.

Could you send the patch?  Do you think the patch is not right or reboot
is not right?  If the latter, what do you think is wrong with reboot?

> Any ideas how to solve that properly? An argument for the helper to be
> included in the config file? Or, maybe, just change the default
> behaviour?
> 
> Greets,
> Jacek



Re: [Cluster-devel] cluster4 dlm: startup notification for systemd

2012-11-05 Thread David Teigland
On Sat, Nov 03, 2012 at 04:27:54PM +0100, Jacek Konieczny wrote:
> Hello,
> 
> The two patches:
> 
>[PATCH 1/2] --foreground option added to dlm_controld
>[PATCH 2/2] Startup notification by sd_notify()
> 
> add startup notification for the systemd service unit. This way startup
> of services depending on DLM can be properly serialized.

Thanks, I just got my first systemd system set up, so I'll give this a
try.  I've also traced this error back to systemd:
"could not set SCHED_RR priority 99 err 1" which google seems to confirm,
along with the suggestion of using "ControlGroup=cpu:/" ... I don't know
if that advice is still current.
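
For reference, the unit file such a notification scheme implies would
presumably use Type=notify; a sketch (paths and options assumed):

	[Service]
	Type=notify
	ExecStart=/usr/sbin/dlm_controld --foreground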



Re: [Cluster-devel] [PATCH] dlm_stonith_{off, reboot} aliases for fence helper

2012-11-05 Thread David Teigland
On Mon, Nov 05, 2012 at 07:05:22PM +0100, Jacek Konieczny wrote:
> - rv = stonith_api_kick_helper(nodeid, 300, 1);
> + rv = stonith_api_kick_helper(nodeid, 300, turn_off);

I'd like it to be "reboot", but seeing the arg as "bool off" I figured the
opposite would be "on" ... if you're saying that the opposite of off is
actually reboot, then I'll just make it 0.




[Cluster-devel] gfs2: skip dlm_unlock calls in unmount

2012-11-07 Thread David Teigland
When unmounting, gfs2 does a full dlm_unlock operation on every
cached lock.  This can create a very large amount of work and can
take a long time to complete.  However, the vast majority of these
dlm unlock operations are unnecessary because after all the unlocks
are done, gfs2 leaves the dlm lockspace, which automatically clears
the locks of the leaving node, without unlocking each one individually.
So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
remove the locks implicitly.  The one exception is when the lock's lvb is
being used.  In this case, dlm_unlock is called because it may update the
lvb of the resource.

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c|1 +
 fs/gfs2/incore.h   |1 +
 fs/gfs2/lock_dlm.c |6 ++
 3 files changed, 8 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e6c2fd5..f3a5edb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1528,6 +1528,7 @@ static void dump_glock_func(struct gfs2_glock *gl)
 
 void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
 {
+   set_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags);
glock_hash_walk(clear_glock, sdp);
flush_workqueue(glock_workqueue);
wait_event(sdp->sd_glock_wait, atomic_read(&sdp->sd_glock_disposal) == 
0);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 3d469d3..67a39cf 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -539,6 +539,7 @@ enum {
SDF_DEMOTE  = 5,
SDF_NOJOURNALID = 6,
SDF_RORECOVERY  = 7, /* read only recovery */
+   SDF_SKIP_DLM_UNLOCK = 8,
 };
 
 #define GFS2_FSNAME_LEN256
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 0fb6539..efd0fb6 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -289,6 +289,12 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
gfs2_glstats_inc(gl, GFS2_LKS_DCOUNT);
gfs2_sbstats_inc(gl, GFS2_LKS_DCOUNT);
gfs2_update_request_times(gl);
+
+   if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) && !gl->gl_lvb[0]) {
+   gfs2_glock_free(gl);
+   return;
+   }
+
error = dlm_unlock(ls->ls_dlm, gl->gl_lksb.sb_lkid, DLM_LKF_VALBLK,
   NULL, gl);
if (error) {
-- 
1.7.10.1.362.g242cab3



Re: [Cluster-devel] gfs2: skip dlm_unlock calls in unmount

2012-11-08 Thread David Teigland
On Thu, Nov 08, 2012 at 10:26:53AM +, Steven Whitehouse wrote:
> Hi,
> 
> On Wed, 2012-11-07 at 14:14 -0500, David Teigland wrote:
> > When unmounting, gfs2 does a full dlm_unlock operation on every
> > cached lock.  This can create a very large amount of work and can
> > take a long time to complete.  However, the vast majority of these
> > dlm unlock operations are unnecessary because after all the unlocks
> > are done, gfs2 leaves the dlm lockspace, which automatically clears
> > the locks of the leaving node, without unlocking each one individually.
> > So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
> > remove the locks implicitly.  The one exception is when the lock's lvb is
> > being used.  In this case, dlm_unlock is called because it may update the
> > lvb of the resource.
> > 
> 
> I'm wondering just how much we are likely to gain from this we
> currently use LVBs for both quota (and more recently) rgrp too. If we
> were to start using the LVBs for inodes and/or iopen locks eventually
> then that would seem to rather reduce the benefits of this.

Considering what you say below, after you've converted to NL, there's no
more lvb to consider, so the lvb is not an issue in that case. The lvb is
only written if you're unlocking from PW or EX, so there's bound to always
be many unlocks that could be skipped.  I'll adjust the patch to skip
unlock unless there's an lvb and the mode is PW or EX.

> The other question is what the cost of conversion to NL vs unlock of an
> NL lock is. Even with the patch we are still iterating over each lock to
> do a conversion to NL in any case where the lock is not already in NL.
> So all we are saving is the final NL -> unlocked change.

yeah, I'd forgotten about that.

> One thought is whether it would not be better to do a direct "whatever"
> -> unlocked change in the first place, rather than splitting the
> operation into two parts.

Converting to NL would actually be less expensive than unlock because the
NL convert does not involve a reply message, but unlock does.

So skipping the unlocks is a first step that gives us a big benefit very
simply.  To benefit even further, we could later look into skipping the
"convert to NL" step also, and just abandoning the dlm locks in whatever
mode they're in; but that's probably not as simple a change.



Re: [Cluster-devel] gfs2: skip dlm_unlock calls in unmount

2012-11-08 Thread David Teigland
On Thu, Nov 08, 2012 at 06:48:19PM +, Steven Whitehouse wrote:
> > Converting to NL would actually be less expensive than unlock because the
> > NL convert does not involve a reply message, but unlock does.
> > 
> I'm not entirely sure I follow... at least from the filesystem point of
> view (and without your proposed change) both conversions and unlocks
> result in a reply. Is this a dlm internal reply perhaps?

Right, I was refering to the internal dlm reply over the network.

> > So skipping the unlocks is a first step that gives us a big benefit very
> > simply.  To benefit even further, we could later look into skipping the
> > "convert to NL" step also, and just abandoning the dlm locks in whatever
> > mode they're in; but that's probably not as simple a change.
> > 
> 
> Yes, thats true... the issue is that the glock state machine treats all
> glocks on an individual basis, and the demotion to NL also deals with
> any writing back and invalidating of the cache thats required at the
> same time. So that makes it tricky to separate from the requests to the
> dlm.
> 
> That said, I'd like to be able to move towards dealing with batches of
> glocks in the future, since that means we can provide a more favourable
> ordering of i/o requests. That is not an easy thing to do though.
> 
> In addition to the benefit for umount, I'm also wondering whether, if
> these unlocks are relatively slow, we should look at what happens during
> normal operation, where we do, from time to time, send unlock requests.
> Those are mostly (though indirectly) in response to memory pressure. Is
> there anything we can do there to speed things up, I wonder?

The main thing would be to not use a completion callback for dlm_unlock
(either make dlm not send one, or ignore it in gfs2).  This would let you
free the glock memory right away.

But, removing unlock completions can create new problems, because you'd
need to handle new errors from dlm_lock() when it ran up against an
incomplete unlock.  Dealing with that complication may negate any benefit
from ignoring unlock completions.  Unless, of course, you knew you
wouldn't be making any more dlm_lock calls on that lock, e.g. during
unmount.



[Cluster-devel] [PATCH] gfs2: skip dlm_unlock calls in unmount

2012-11-08 Thread David Teigland
When unmounting, gfs2 does a full dlm_unlock operation on every
cached lock.  This can create a very large amount of work and can
take a long time to complete.  However, the vast majority of these
dlm unlock operations are unnecessary because after all the unlocks
are done, gfs2 leaves the dlm lockspace, which automatically clears
the locks of the leaving node, without unlocking each one individually.
So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
remove the locks implicitly.  The one exception is when the lock's lvb is
being used.  In this case, dlm_unlock is called because it may update the
lvb of the resource.

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c|1 +
 fs/gfs2/incore.h   |1 +
 fs/gfs2/lock_dlm.c |8 
 3 files changed, 10 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e6c2fd5..f3a5edb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1528,6 +1528,7 @@ static void dump_glock_func(struct gfs2_glock *gl)
 
 void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
 {
+   set_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags);
glock_hash_walk(clear_glock, sdp);
flush_workqueue(glock_workqueue);
wait_event(sdp->sd_glock_wait, atomic_read(&sdp->sd_glock_disposal) == 
0);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 3d469d3..67a39cf 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -539,6 +539,7 @@ enum {
SDF_DEMOTE  = 5,
SDF_NOJOURNALID = 6,
SDF_RORECOVERY  = 7, /* read only recovery */
+   SDF_SKIP_DLM_UNLOCK = 8,
 };
 
 #define GFS2_FSNAME_LEN256
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 0fb6539..806a639 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -289,6 +289,14 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
gfs2_glstats_inc(gl, GFS2_LKS_DCOUNT);
gfs2_sbstats_inc(gl, GFS2_LKS_DCOUNT);
gfs2_update_request_times(gl);
+
+   /* don't want to skip dlm_unlock writing the lvb when lock is ex */
+   if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) &&
+   (!gl->gl_lvb[0] || gl->gl_state != LM_ST_EXCLUSIVE)) {
+   gfs2_glock_free(gl);
+   return;
+   }
+
error = dlm_unlock(ls->ls_dlm, gl->gl_lksb.sb_lkid, DLM_LKF_VALBLK,
   NULL, gl);
if (error) {
-- 
1.7.10.1.362.g242cab3



Re: [Cluster-devel] [PATCH] gfs2: skip dlm_unlock calls in unmount

2012-11-09 Thread David Teigland
On Fri, Nov 09, 2012 at 09:45:17AM +, Steven Whitehouse wrote:
> > +   if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) &&
> > +   (!gl->gl_lvb[0] || gl->gl_state != LM_ST_EXCLUSIVE)) {
> I'm still not happy with using !gl->gl_lvb[0] to determine whether the
> LVB is in use or not. I think we need a better test, or alternatively
> just test the lock state, since most locks will be NL anyway before they
> get to this point in time,

Yeah, a glock flag indicating the lvb is used would be best; I'll just
test the lock state.

This actually brings up another improvement you could make.  Right now
gfs2 enables the lvb on all locks, even though it only uses it on a small
minority.  Limiting the lvb to locks that need it would:

- save 64 bytes of memory for every local lock
  (32 in gfs2_glock, 32 in dlm_rsb)

- save 96 bytes of memory for every remote lock
  (32 in gfs2_glock, 32 in local dlm_rsb, 32 in remote dlm_lkb)

- save 32 bytes of network message size in many dlm messages

- save a lot of memcpying of zeros

- save some recovery time
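
As a rough illustration, a node caching a million glocks, half of them
mastered remotely, works out to about (500,000 x 64) + (500,000 x 96)
bytes, i.e. roughly 80MB, before counting the smaller messages.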




Re: [Cluster-devel] [PATCH] gfs2: skip dlm_unlock calls in unmount

2012-11-12 Thread David Teigland
On Mon, Nov 12, 2012 at 10:44:36AM +, Steven Whitehouse wrote:
> > - save 64 bytes of memory for every local lock
> >   (32 in gfs2_glock, 32 in dlm_rsb)
> > 
> > - save 96 bytes of memory for every remote lock
> >   (32 in gfs2_glock, 32 in local dlm_rsb, 32 in remote dlm_lkb)
> > 
> > - save 32 bytes of network message size in many dlm messages
> > 
> > - save a lot of memcpying of zeros
> > 
> > - save some recovery time
> > 
> > 
> 
> Yes, although we did consider what the best thing to do was back at the
> start of GFS2 development wrt LVBs. The actual overhead didn't seem too
> much really. The previous implementation had the LVB hanging off the
> glock from a pointer, so on 64 bit, that pointer alone was 8 bytes. We
> also saved another 4 bytes (plus a further 4 for alignment) by not
> requiring the atomic counter. So it seemed not unreasonable to just
> inline the LVB into the glock.

I still think you'll save around 64-80 bytes per lock on average.

> Another argument for having it on all glocks was that if we did want to start
> making use of it on different glock types in the future, we could do so
> without having to worry about whether its value would be preserved or
> not. Also, it removed some tests from the fast path of acquiring and
> dropping locks.

Keep in mind that the dlm does not inline them, so using an lvb when it's
not needed creates extra work in the dlm.  This extra work probably
exceeds the extra work gfs2 would have to do with non-inlined lvbs.

> Trying to reduce the size of the lock requests makes sense if that is
> becoming a limiting factor in performance (is it? I'm not sure) so maybe
> we should revisit this.

I think it's worth a try, it's probably no less helpful than a lot of the
other optimizations we've added, which do add up together.




[Cluster-devel] [PATCH] gfs2: skip dlm_unlock calls in unmount

2012-11-13 Thread David Teigland
When unmounting, gfs2 does a full dlm_unlock operation on every
cached lock.  This can create a very large amount of work and can
take a long time to complete.  However, the vast majority of these
dlm unlock operations are unnecessary because after all the unlocks
are done, gfs2 leaves the dlm lockspace, which automatically clears
the locks of the leaving node, without unlocking each one individually.
So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
remove the locks implicitly.  The one exception is when the lock's lvb is
being used.  In this case, dlm_unlock is called because it may update the
lvb of the resource.

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c|1 +
 fs/gfs2/incore.h   |1 +
 fs/gfs2/lock_dlm.c |8 
 3 files changed, 10 insertions(+)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e6c2fd5..f3a5edb 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1528,6 +1528,7 @@ static void dump_glock_func(struct gfs2_glock *gl)
 
 void gfs2_gl_hash_clear(struct gfs2_sbd *sdp)
 {
+   set_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags);
glock_hash_walk(clear_glock, sdp);
flush_workqueue(glock_workqueue);
wait_event(sdp->sd_glock_wait, atomic_read(&sdp->sd_glock_disposal) == 
0);
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 3d469d3..67a39cf 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -539,6 +539,7 @@ enum {
SDF_DEMOTE  = 5,
SDF_NOJOURNALID = 6,
SDF_RORECOVERY  = 7, /* read only recovery */
+   SDF_SKIP_DLM_UNLOCK = 8,
 };
 
 #define GFS2_FSNAME_LEN256
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index 0fb6539..f6504d3 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -289,6 +289,14 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
gfs2_glstats_inc(gl, GFS2_LKS_DCOUNT);
gfs2_sbstats_inc(gl, GFS2_LKS_DCOUNT);
gfs2_update_request_times(gl);
+
+   /* don't want to skip dlm_unlock writing the lvb when lock is ex */
+   if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) &&
+   gl->gl_state != LM_ST_EXCLUSIVE) {
+   gfs2_glock_free(gl);
+   return;
+   }
+
error = dlm_unlock(ls->ls_dlm, gl->gl_lksb.sb_lkid, DLM_LKF_VALBLK,
   NULL, gl);
if (error) {
-- 
1.7.10.1.362.g242cab3



[Cluster-devel] [PATCH 2/2] gfs2: remove redundant lvb pointer

2012-11-14 Thread David Teigland
The lksb struct already contains a pointer to the lvb,
so another directly from the glock struct is not needed.

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c|   10 --
 fs/gfs2/incore.h   |1 -
 fs/gfs2/lock_dlm.c |8 
 fs/gfs2/quota.c|6 +++---
 fs/gfs2/rgrp.c |2 +-
 5 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index dbe6e71..279c0af 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -110,7 +110,7 @@ static void gfs2_glock_dealloc(struct rcu_head *rcu)
if (gl->gl_ops->go_flags & GLOF_ASPACE) {
kmem_cache_free(gfs2_glock_aspace_cachep, gl);
} else {
-   kfree(gl->gl_lvb);
+   kfree(gl->gl_lksb.sb_lvbptr);
kmem_cache_free(gfs2_glock_cachep, gl);
}
 }
@@ -742,15 +742,13 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
return -ENOMEM;
 
memset(&gl->gl_lksb, 0, sizeof(struct dlm_lksb));
-   gl->gl_lvb = NULL;
 
if (glops->go_flags & GLOF_LVB) {
-   gl->gl_lvb = kzalloc(GFS2_MIN_LVB_SIZE, GFP_KERNEL);
-   if (!gl->gl_lvb) {
+   gl->gl_lksb.sb_lvbptr = kzalloc(GFS2_MIN_LVB_SIZE, GFP_KERNEL);
+   if (!gl->gl_lksb.sb_lvbptr) {
kmem_cache_free(cachep, gl);
return -ENOMEM;
}
-   gl->gl_lksb.sb_lvbptr = gl->gl_lvb;
}
 
atomic_inc(&sdp->sd_glock_disposal);
@@ -791,7 +789,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
tmp = search_bucket(hash, sdp, &name);
if (tmp) {
spin_unlock_bucket(hash);
-   kfree(gl->gl_lvb);
+   kfree(gl->gl_lksb.sb_lvbptr);
kmem_cache_free(cachep, gl);
atomic_dec(&sdp->sd_glock_disposal);
gl = tmp;
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index a37baa5..3ecf9bd 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -322,7 +322,6 @@ struct gfs2_glock {
ktime_t gl_dstamp;
struct gfs2_lkstats gl_stats;
struct dlm_lksb gl_lksb;
-   char *gl_lvb;
unsigned long gl_tchange;
void *gl_object;
 
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index d28ae37..8dad6b0 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -120,8 +120,8 @@ static void gdlm_ast(void *arg)
gfs2_update_reply_times(gl);
BUG_ON(gl->gl_lksb.sb_flags & DLM_SBF_DEMOTED);
 
-   if (gl->gl_lksb.sb_flags & DLM_SBF_VALNOTVALID && gl->gl_lvb)
-   memset(gl->gl_lvb, 0, GDLM_LVB_SIZE);
+   if ((gl->gl_lksb.sb_flags & DLM_SBF_VALNOTVALID) && 
gl->gl_lksb.sb_lvbptr)
+   memset(gl->gl_lksb.sb_lvbptr, 0, GDLM_LVB_SIZE);
 
switch (gl->gl_lksb.sb_status) {
case -DLM_EUNLOCK: /* Unlocked, so glock can be freed */
@@ -205,7 +205,7 @@ static u32 make_flags(struct gfs2_glock *gl, const unsigned 
int gfs_flags,
 {
u32 lkf = 0;
 
-   if (gl->gl_lvb)
+   if (gl->gl_lksb.sb_lvbptr)
lkf |= DLM_LKF_VALBLK;
 
if (gfs_flags & LM_FLAG_TRY)
@@ -294,7 +294,7 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
 
/* don't want to skip dlm_unlock writing the lvb when lock is ex */
if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) &&
-   gl->gl_lvb && gl->gl_state != LM_ST_EXCLUSIVE) {
+   gl->gl_lksb.sb_lvbptr && (gl->gl_state != LM_ST_EXCLUSIVE)) {
gfs2_glock_free(gl);
return;
}
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index c5af8e1..6baf370 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -869,7 +869,7 @@ static int update_qd(struct gfs2_sbd *sdp, struct 
gfs2_quota_data *qd)
if (error < 0)
return error;
 
-   qlvb = (struct gfs2_quota_lvb *)qd->qd_gl->gl_lvb;
+   qlvb = (struct gfs2_quota_lvb *)qd->qd_gl->gl_lksb.sb_lvbptr;
qlvb->qb_magic = cpu_to_be32(GFS2_MAGIC);
qlvb->__pad = 0;
qlvb->qb_limit = q.qu_limit;
@@ -893,7 +893,7 @@ restart:
if (error)
return error;
 
-   qd->qd_qb = *(struct gfs2_quota_lvb *)qd->qd_gl->gl_lvb;
+   qd->qd_qb = *(struct gfs2_quota_lvb *)qd->qd_gl->gl_lksb.sb_lvbptr;
 
if (force_refresh || qd->qd_qb.qb_magic != cpu_to_be32(GFS2_MAGIC)) {
gfs2_glock_dq_uninit(q_gh);
@@ -1506,7 +1506,7 @@ static int gfs2_get_dqblk(struct super_block *sb, struct kqid qid,
if (error)
goto out;
 
-   qlvb = (struct gfs2_quota_lvb *)qd->qd_gl->gl_lvb;
+   qlvb = (struct gfs2_quota_lvb *)qd->qd_gl->gl_lksb.sb_lvbptr;
fdq->d_version = FS_DQUOT_VE

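A side note on the gfs2_glock_dealloc() hunk at the top of this diff: the LVB
is freed only in the non-GLOF_ASPACE branch, while the 1/2 patch below gives
the rgrp glops both GLOF_ASPACE and GLOF_LVB, so an rgrp glock's LVB is never
kfree()d on this path.  Later kernels moved the kfree() ahead of the branch;
a sketch of that shape (not part of this patch):

    static void gfs2_glock_dealloc(struct rcu_head *rcu)
    {
            struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);

            /* kfree(NULL) is a no-op, so this is safe for glocks with no LVB */
            kfree(gl->gl_lksb.sb_lvbptr);
            if (gl->gl_ops->go_flags & GLOF_ASPACE)
                    kmem_cache_free(gfs2_glock_aspace_cachep, gl);
            else
                    kmem_cache_free(gfs2_glock_cachep, gl);
    }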
[Cluster-devel] [PATCH 1/2] gfs2: only use lvb on glocks that need it

2012-11-14 Thread David Teigland
Save the effort of allocating, reading and writing
the lvb for most glocks that do not use it.

Signed-off-by: David Teigland 
---
 fs/gfs2/glock.c|   27 +--
 fs/gfs2/glops.c|3 ++-
 fs/gfs2/incore.h   |3 ++-
 fs/gfs2/lock_dlm.c |   12 +++-
 4 files changed, 32 insertions(+), 13 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index f3a5edb..dbe6e71 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -107,10 +107,12 @@ static void gfs2_glock_dealloc(struct rcu_head *rcu)
 {
struct gfs2_glock *gl = container_of(rcu, struct gfs2_glock, gl_rcu);
 
-   if (gl->gl_ops->go_flags & GLOF_ASPACE)
+   if (gl->gl_ops->go_flags & GLOF_ASPACE) {
kmem_cache_free(gfs2_glock_aspace_cachep, gl);
-   else
+   } else {
+   kfree(gl->gl_lvb);
kmem_cache_free(gfs2_glock_cachep, gl);
+   }
 }
 
 void gfs2_glock_free(struct gfs2_glock *gl)
@@ -547,7 +549,10 @@ __acquires(&gl->gl_spin)
if (sdp->sd_lockstruct.ls_ops->lm_lock) {
/* lock_dlm */
ret = sdp->sd_lockstruct.ls_ops->lm_lock(gl, target, lck_flags);
-   GLOCK_BUG_ON(gl, ret);
+   if (ret) {
+   printk(KERN_ERR "GFS2: lm_lock ret %d\n", ret);
+   GLOCK_BUG_ON(gl, 1);
+   }
} else { /* lock_nolock */
finish_xmote(gl, target);
if (queue_delayed_work(glock_workqueue, &gl->gl_work, 0) == 0)
@@ -736,6 +741,18 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
if (!gl)
return -ENOMEM;
 
+   memset(&gl->gl_lksb, 0, sizeof(struct dlm_lksb));
+   gl->gl_lvb = NULL;
+
+   if (glops->go_flags & GLOF_LVB) {
+   gl->gl_lvb = kzalloc(GFS2_MIN_LVB_SIZE, GFP_KERNEL);
+   if (!gl->gl_lvb) {
+   kmem_cache_free(cachep, gl);
+   return -ENOMEM;
+   }
+   gl->gl_lksb.sb_lvbptr = gl->gl_lvb;
+   }
+
atomic_inc(&sdp->sd_glock_disposal);
gl->gl_sbd = sdp;
gl->gl_flags = 0;
@@ -753,9 +770,6 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
preempt_enable();
gl->gl_stats.stats[GFS2_LKS_DCOUNT] = 0;
gl->gl_stats.stats[GFS2_LKS_QCOUNT] = 0;
-   memset(&gl->gl_lksb, 0, sizeof(struct dlm_lksb));
-   memset(gl->gl_lvb, 0, 32 * sizeof(char));
-   gl->gl_lksb.sb_lvbptr = gl->gl_lvb;
gl->gl_tchange = jiffies;
gl->gl_object = NULL;
gl->gl_hold_time = GL_GLOCK_DFT_HOLD;
@@ -777,6 +791,7 @@ int gfs2_glock_get(struct gfs2_sbd *sdp, u64 number,
tmp = search_bucket(hash, sdp, &name);
if (tmp) {
spin_unlock_bucket(hash);
+   kfree(gl->gl_lvb);
kmem_cache_free(cachep, gl);
atomic_dec(&sdp->sd_glock_disposal);
gl = tmp;
diff --git a/fs/gfs2/glops.c b/fs/gfs2/glops.c
index 32cc4fd..635bd03 100644
--- a/fs/gfs2/glops.c
+++ b/fs/gfs2/glops.c
@@ -552,7 +552,7 @@ const struct gfs2_glock_operations gfs2_rgrp_glops = {
.go_unlock = gfs2_rgrp_go_unlock,
.go_dump = gfs2_rgrp_dump,
.go_type = LM_TYPE_RGRP,
-   .go_flags = GLOF_ASPACE,
+   .go_flags = GLOF_ASPACE | GLOF_LVB,
 };
 
 const struct gfs2_glock_operations gfs2_trans_glops = {
@@ -577,6 +577,7 @@ const struct gfs2_glock_operations gfs2_nondisk_glops = {
 
 const struct gfs2_glock_operations gfs2_quota_glops = {
.go_type = LM_TYPE_QUOTA,
+   .go_flags = GLOF_LVB,
 };
 
 const struct gfs2_glock_operations gfs2_journal_glops = {
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index 67a39cf..a37baa5 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -216,6 +216,7 @@ struct gfs2_glock_operations {
const int go_type;
const unsigned long go_flags;
 #define GLOF_ASPACE 1
+#define GLOF_LVB 2
 };
 
 enum {
@@ -321,7 +322,7 @@ struct gfs2_glock {
ktime_t gl_dstamp;
struct gfs2_lkstats gl_stats;
struct dlm_lksb gl_lksb;
-   char gl_lvb[32];
+   char *gl_lvb;
unsigned long gl_tchange;
void *gl_object;
 
diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index f6504d3..d28ae37 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -120,7 +120,7 @@ static void gdlm_ast(void *arg)
gfs2_update_reply_times(gl);
BUG_ON(gl->gl_lksb.sb_flags & DLM_SBF_DEMOTED);
 
-   if (gl->gl_lksb.sb_flags & DLM_SBF_VALNOTVALID)
+   if (gl->gl_lksb.sb_flags & DLM_SBF_VALNOTVALID && gl->gl_lvb)
memset(gl->gl_lvb, 0, GDLM_LVB_SIZE);
 
switch (gl->gl_lksb.sb_status) {
@@ -203,8 +203,10 @@ static int make_mode(const unsigned int lmstate)
 static u32 make_flags(st

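The make_flags() hunk is truncated above, but its effect is visible in the
2/2 diff earlier in this archive, which later switches the test to
gl_lksb.sb_lvbptr: DLM_LKF_VALBLK is only set for glocks that actually
allocated an LVB, so the DLM stops transferring a value block on every lock
request for the common glock types.  A sketch of the test this hunk
introduces, with the rest of the flag translation omitted:

    static u32 make_flags(struct gfs2_glock *gl, const unsigned int gfs_flags,
                          const int req)
    {
            u32 lkf = 0;

            /* only ask the DLM to read/write an LVB when one exists */
            if (gl->gl_lvb)
                    lkf |= DLM_LKF_VALBLK;

            /* ... remaining gfs_flags-to-lkf translation unchanged ... */
            return lkf;
    }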
[Cluster-devel] gfs2: fix skip unlock condition

2013-01-03 Thread David Teigland
The recent commit fb6791d100d1bba20b5cdbc4912e1f7086ec60f8
included the wrong logic.  The lvbptr check was incorrectly
added after the patch was tested.

Signed-off-by: David Teigland 
---
 fs/gfs2/lock_dlm.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/lock_dlm.c b/fs/gfs2/lock_dlm.c
index b906ed1..9802de0 100644
--- a/fs/gfs2/lock_dlm.c
+++ b/fs/gfs2/lock_dlm.c
@@ -281,6 +281,7 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
 {
struct gfs2_sbd *sdp = gl->gl_sbd;
struct lm_lockstruct *ls = &sdp->sd_lockstruct;
+   int lvb_needs_unlock = 0;
int error;
 
if (gl->gl_lksb.sb_lkid == 0) {
@@ -294,8 +295,12 @@ static void gdlm_put_lock(struct gfs2_glock *gl)
gfs2_update_request_times(gl);
 
/* don't want to skip dlm_unlock writing the lvb when lock is ex */
+
+   if (gl->gl_lksb.sb_lvbptr && (gl->gl_state == LM_ST_EXCLUSIVE))
+   lvb_needs_unlock = 1;
+
if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) &&
-   gl->gl_lksb.sb_lvbptr && (gl->gl_state != LM_ST_EXCLUSIVE)) {
+   !lvb_needs_unlock) {
gfs2_glock_free(gl);
return;
}
-- 
1.8.1.rc1.5.g7e0651a
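The resulting rule: the dlm_unlock() call may be skipped only when
SDF_SKIP_DLM_UNLOCK is set and no LVB write-back can be pending; an EX holder
may have modified the LVB, and that modification only reaches the DLM when
the lock is unlocked or demoted.  The fixed condition, folded into one
expression as a sketch (not a further patch):

    /* an EX lock with an LVB must go through dlm_unlock to write it back */
    int lvb_needs_unlock = gl->gl_lksb.sb_lvbptr &&
                           (gl->gl_state == LM_ST_EXCLUSIVE);

    if (test_bit(SDF_SKIP_DLM_UNLOCK, &sdp->sd_flags) && !lvb_needs_unlock) {
            gfs2_glock_free(gl);
            return;
    }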



Re: [Cluster-devel] [PATCH 09/14] dlm: use idr_for_each_entry() in recover_idr_clear() error path

2013-01-28 Thread David Teigland
On Fri, Jan 25, 2013 at 05:31:07PM -0800, Tejun Heo wrote:
> Convert recover_idr_clear() to use idr_for_each_entry() instead of
> idr_for_each().  It's somewhat less efficient this way but it
> shouldn't matter in an error path.  This is to help with deprecation
> of idr_remove_all().
> 
> Only compile tested.
> 
> Signed-off-by: Tejun Heo 
> Cc: Christine Caulfield 
> Cc: David Teigland 
> Cc: cluster-devel@redhat.com
> ---
> This patch depends on an earlier idr patch and I think it would be
> best to route these together through -mm.  Christine, David, can you
> please ack this?

Ack



Re: [Cluster-devel] [PATCH 10/14] dlm: don't use idr_remove_all()

2013-01-28 Thread David Teigland
On Fri, Jan 25, 2013 at 05:31:08PM -0800, Tejun Heo wrote:
> idr_destroy() can destroy idr by itself and idr_remove_all() is being
> deprecated.
> 
> The conversion isn't completely trivial for recover_idr_clear() as
> it's the only place in kernel which makes legitimate use of
> idr_remove_all() w/o idr_destroy().  Replace it with idr_remove() call
> inside idr_for_each_entry() loop.  It goes on top so that it matches
> the operation order in recover_idr_del().
> 
> Only compile tested.
> 
> Signed-off-by: Tejun Heo 
> Cc: Christine Caulfield 
> Cc: David Teigland 
> Cc: cluster-devel@redhat.com
> ---
> This patch depends on an earlier idr patch and given the trivial
> nature of the patch, I think it would be best to route these together
> through -mm.  Please holler if there's any objection.

Yes, that's good for me.  I'll grab the set and test the dlm bits.


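The conversion being ACKed looks roughly like this (a sketch using the lock
and field names visible elsewhere in this thread, not the final committed
code):

    static void recover_idr_clear(struct dlm_ls *ls)
    {
            struct dlm_rsb *r;
            int id;

            spin_lock(&ls->ls_recover_idr_lock);

            idr_for_each_entry(&ls->ls_recover_idr, r, id) {
                    /* remove first, matching the order in recover_idr_del() */
                    idr_remove(&ls->ls_recover_idr, id);
                    r->res_id = 0;
                    ls->ls_recover_list_count--;
                    dlm_put_rsb(r);
            }

            spin_unlock(&ls->ls_recover_idr_lock);
    }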

Re: [Cluster-devel] [PATCH 10/14] dlm: don't use idr_remove_all()

2013-01-29 Thread David Teigland
On Mon, Jan 28, 2013 at 10:57:23AM -0500, David Teigland wrote:
> On Fri, Jan 25, 2013 at 05:31:08PM -0800, Tejun Heo wrote:
> > idr_destroy() can destroy idr by itself and idr_remove_all() is being
> > deprecated.
> > 
> > The conversion isn't completely trivial for recover_idr_clear() as
> > it's the only place in kernel which makes legitimate use of
> > idr_remove_all() w/o idr_destroy().  Replace it with idr_remove() call
> > inside idr_for_each_entry() loop.  It goes on top so that it matches
> > the operation order in recover_idr_del().
> > 
> > Only compile tested.
> > 
> > Signed-off-by: Tejun Heo 
> > Cc: Christine Caulfield 
> > Cc: David Teigland 
> > Cc: cluster-devel@redhat.com
> > ---
> > This patch depends on an earlier idr patch and given the trivial
> > nature of the patch, I think it would be best to route these together
> > through -mm.  Please holler if there's any objection.
> 
> Yes, that's good for me.  I'll grab the set and test the dlm bits.

Hi Tejun,
Unfortunately, the idr_for_each_entry loop doesn't seem to be clearing
everything.  I've seen "warning: recover_list_count 39" at the end of that
function.
Dave



Re: [Cluster-devel] [PATCH 10/14] dlm: don't use idr_remove_all()

2013-01-30 Thread David Teigland
On Tue, Jan 29, 2013 at 10:13:17AM -0500, David Teigland wrote:
> On Mon, Jan 28, 2013 at 10:57:23AM -0500, David Teigland wrote:
> > On Fri, Jan 25, 2013 at 05:31:08PM -0800, Tejun Heo wrote:
> > > idr_destroy() can destroy idr by itself and idr_remove_all() is being
> > > deprecated.
> > > 
> > > The conversion isn't completely trivial for recover_idr_clear() as
> > > it's the only place in kernel which makes legitimate use of
> > > idr_remove_all() w/o idr_destroy().  Replace it with idr_remove() call
> > > inside idr_for_each_entry() loop.  It goes on top so that it matches
> > > the operation order in recover_idr_del().
> > > 
> > > Only compile tested.
> > > 
> > > Signed-off-by: Tejun Heo 
> > > Cc: Christine Caulfield 
> > > Cc: David Teigland 
> > > Cc: cluster-devel@redhat.com
> > > ---
> > > This patch depends on an earlier idr patch and given the trivial
> > > nature of the patch, I think it would be best to route these together
> > > through -mm.  Please holler if there's any objection.
> > 
> > Yes, that's good for me.  I'll grab the set and test the dlm bits.
> 
> Hi Tejun,
> Unfortunately, the idr_for_each_entry loop doesn't seem to be clearing
> everything.  I've seen "warning: recover_list_count 39" at the end of that
> function.

I don't want to pretend to understand the internals of this idr code, but
it's not clear that idr_for_each is equivalent to idr_for_each_entry when
iterating through all id values.  The "++id" in idr_for_each_entry looks
like it could lead to some missed entries.  The comment about idr_get_next
returning the "next number to given id" makes it sound like an entry whose
id is exactly "++id" would be missed.

Dave
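For reference, the macro under discussion expands to roughly the following
(simplified from the idr.h of that era):

    /*
     * Iterate over every populated id.  Correctness depends on idr_get_next()
     * returning the first entry whose id is >= *id and updating *id to match.
     */
    #define idr_for_each_entry(idp, entry, id) \
            for (id = 0; ((entry) = idr_get_next(idp, &(id))) != NULL; ++id)

So the ++id is safe as long as idr_get_next() honors that contract; as the
rest of the thread shows, the skipping turned out to be a bug inside
idr_get_next() itself.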



Re: [Cluster-devel] [PATCH 10/14] dlm: don't use idr_remove_all()

2013-02-01 Thread David Teigland
On Thu, Jan 31, 2013 at 04:18:41PM -0800, Tejun Heo wrote:
> It looks a bit weird to me that ls->ls_recover_list_count is also
> incremented by recover_list_add().  The two code paths don't seem to
> be interlocke at least upon my very shallow glance.  Is it that only
> either the list or idr is in use?

Yes, that's correct.

> Anyways, can you please apply the following patch and see which IDs
> are leaking from the iteration?  The patch too is only compile tested
> so I might have done something stupid but it hopefully shouldn't be
> too difficult to make it work.

I'm trying your patch now; I don't have a test optimized to hit this code,
so it may take a while.

>  static void recover_idr_clear(struct dlm_ls *ls)
>  {
>   struct dlm_rsb *r;
> @@ -358,7 +364,9 @@ static void recover_idr_clear(struct dlm
>  
>   spin_lock(&ls->ls_recover_idr_lock);
>  
> + pr_info("XXX clearing:");
>   idr_for_each_entry(&ls->ls_recover_idr, r, id) {
> + pr_cont(" %d", id);

It will often be clearing hundreds of thousands of entries, so this will
probably be excessive.

>   if (ls->ls_recover_list_count != 0) {
>   log_error(ls, "warning: recover_list_count %d",
> ls->ls_recover_list_count);
> + pr_info("XXX leftovers: ");
> + idr_for_each(&ls->ls_recover_idr, dlm_dump_idr, NULL);
> + pr_cont("\n");

I already tried my own version of this, but used idr_for_each_entry a
second time.  Unfortunately, the number it found and printed did not match
recover_list_count.

warning: recover_list_count 566

It printed 304 ids:

41218 41222 41223 41224 41226 41228 41229 41230 41231 41232 41234 41235
41236 41237 41239 41241 41242 41243 41244 41245 41246 41249 41252 41253
41254 41255 41256 41257 41259 41260 41261 41263 41264 41266 41271 41272
41273 41274 41277 41278 41475 41480 41483 41524 41525 41526 41655 41731
41741 41745 41749 41767 41768 41769 41772 41773 41782 42113 42114 42115
42121 42122 42124 42128 42132 42136 42138 42139 42141 42165 42375 42381
42385 42388 42390 42392 42399 42401 42404 42407 42409 42411 42416 42422
42694 42699 42712 42717 42727 42866 43009 43042 43044 43046 43049 43051
43058 43059 43064 43065 43066 43067 43330 43332 43337 43338 43339 43343
43348 43349 43351 43354 43355 43356 43361 43362 43368 43369 43370 43375
43376 43377 43378 43379 43381 43575 43576 43577 43677 43678 43680 43683
43684 43685 43689 43690 43819 43820 43823 43824 43825 43826 43827 43828
43829 43831 43905 43907 43908 43912 43929 43930 43955 43956 43960 43962
43965 44288 44289 44291 44296 44298 44300 44310 44311 44313 44314 44316
44318 44321 44323 44325 44454 44456 44457 44458 44544 44547 44548 44550
44555 44557 44560 44562 44564 44567 44573 44575 44576 44578 44579 44581
44582 44583 44584 44585 44589 44592 44595 44596 44726 44728 44729 44732
44734 44866 44867 44873 44876 44878 44879 44912 44914 44916 44920 44923
44924 45053 45186 45189 45190 45195 45197 45199 45200 45201 45203 45204
45208 45209 45212 45213 45216 45220 45223 45224 45225 45227 45228 45231
45234 45440 45441 45444 45448 45450 45452 45454 45456 45457 45458 45459
45460 45461 45464 45466 45467 45472 45475 45477 45484 45485 45488 45492
45494 45495 45496 45497 45498 45499 45628 45630 45698 45699 45700 45703
45707 45708 45710 45713 45715 45717 45720 45722 45723 45724 45725 45727
45729 45730 45731 45733 45734 45737 45739 45741 45742 45746 45748 45750
45755 47292 47293 47294



Re: [Cluster-devel] [PATCH] idr: fix a subtle bug in idr_get_next()

2013-02-05 Thread David Teigland
On Sat, Feb 02, 2013 at 03:11:35PM -0800, Tejun Heo wrote:
> On Sat, Feb 02, 2013 at 03:10:48PM -0800, Tejun Heo wrote:
> > Fix it by ensuring proceeding to the next slot doesn't carry over the
> > unaligned offset - ie. use round_up(id + 1, slot_distance) instead of
> > id += slot_distance.
> > 
> > Signed-off-by: Tejun Heo 
> > Reported-by: David Teigland 
> > Cc: KAMEZAWA Hiroyuki 
> 
> David, can you please test whether the patch makes the skipped
> deletion bug go away?

Yes, I've tested, and it works fine now.
Thanks,
Dave
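To make the fix concrete: when the scan reaches the end of a subtree at an
unaligned id, the old code advanced with id += slot_distance, which carries
the unaligned offset into the next subtree and skips the ids below that
offset; round_up(id + 1, slot_distance) resumes at the subtree boundary
instead.  A self-contained demo of just that arithmetic, assuming a slot
distance of 64 ids for illustration (the real distance depends on IDR_BITS
and the layer):

    #include <stdio.h>

    /* same result as the kernel's round_up() for these operands */
    #define ROUND_UP(x, y)  ((((x) + (y) - 1) / (y)) * (y))

    int main(void)
    {
            unsigned long id = 41218;     /* an id from the log above */
            unsigned long dist = 64;      /* assumed slot distance */

            /* old advance keeps the offset: resumes at 41282,
               so entries at 41280 and 41281 are never visited */
            printf("buggy: %lu\n", id + dist);
            /* fixed advance resumes at the slot boundary, 41280 */
            printf("fixed: %lu\n", ROUND_UP(id + 1, dist));
            return 0;
    }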


