Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-06 Thread Kadlecsik Jozsef
On Mon, 6 Apr 2009, Kadlecsik Jozsef wrote: On Sun, 5 Apr 2009, Wendy Cheng wrote: Based on code reading ... 1. iput() gets inode_lock (a spin lock) 2. iput() calls iput_final() 3. iput_final() calls gfs_drop_inode() that calls generic_drop_inode() 4. generic_drop_inode()

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-05 Thread Wendy Cheng
Then don't remove it yet. The ramifications need more thought ... That generic_drop_inode() can *not* be removed. Not sure whether my head is clear enough this time. Based on code reading ... 1. iput() gets inode_lock (a spin lock) 2. iput() calls iput_final() 3. iput_final() calls

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-03 Thread Kadlecsik Jozsef
On Thu, 2 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing .put_inode. .put_inode was called without holding a lock, but .drop_inode is called under inode_lock held.

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-03 Thread Wendy Cheng
Kadlecsik Jozsef wrote: On Thu, 2 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing .put_inode. .put_inode was called without holding a lock, but .drop_inode is called under

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-03 Thread Kadlecsik Jozsef
On Fri, 3 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: On Thu, 2 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing .put_inode. .put_inode was called

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-03 Thread Wendy Cheng
Kadlecsik Jozsef wrote: On Fri, 3 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: On Thu, 2 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Kadlecsik Jozsef
On Tue, 31 Mar 2009, Kadlecsik Jozsef wrote: I'll restore the kernel on a not so critical node and will try to find out how to trigger the bug without mailman. If that succeeds then I'll remove the patch in question and re-run the test. It'll need a few days, surely, but I'll report the

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng
Kadlecsik Jozsef wrote: If you have any idea what to do next, please write it. Do you have your kernel source somewhere (in tar ball format) so people can look into it ? -- Wendy

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Kadlecsik Jozsef
Hi, On Thu, 2 Apr 2009, Wendy Cheng wrote: If you have any idea what to do next, please write it. Do you have your kernel source somewhere (in tar ball format) so people can look into it ? I have created the tarballs, you can find them at http://www.kfki.hu/~kadlec/gfs/: - Kernel is

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Kadlecsik Jozsef
On Thu, 2 Apr 2009, Kadlecsik Jozsef wrote: If you have any idea what to do next, please write it. Spent again some time looking through the git commits and that triggered some wild guessing: - commit ddebb0c3dc7d0b87c402ba17731ad41abdd43f2d ? It is a temporary fix for 2.6.26, which is

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng
Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing .put_inode. .put_inode was called without holding a lock, but .drop_inode is called under inode_lock held. Might it be a problem? I was planning to take a look over the

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Kadlecsik Jozsef
On Thu, 2 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing .put_inode. .put_inode was called without holding a lock, but .drop_inode is called under inode_lock held. Might it be a problem?
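The question raised here, whether it matters that .drop_inode runs with inode_lock held while .put_inode ran without it, can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the boolean flag stands in for the kernel's spin lock, `hook_that_may_sleep` is a hypothetical hook body (the thread does not establish what gfs_drop_inode actually does), and the two `iput_*` functions model only the ordering of hook and lock, not the real VFS code.

```c
#include <stdbool.h>

/* Toy stand-in for the kernel's inode_lock; NOT a real spin lock. */
bool inode_lock_held = false;

/* Counts how often the hook ran while the lock was held. */
int ran_under_lock_count = 0;

/* Hypothetical hook body: models work that might block, which is
 * only a hazard if the caller holds a spin lock. */
void hook_that_may_sleep(void)
{
    if (inode_lock_held)
        ran_under_lock_count++;
}

/* Old ordering (assumed): the .put_inode hook runs before the lock is taken. */
void iput_with_put_inode(void)
{
    hook_that_may_sleep();
    inode_lock_held = true;
    /* ... final-reference handling would happen here ... */
    inode_lock_held = false;
}

/* New ordering (as described in the thread): the .drop_inode hook
 * runs with the lock already held. */
void iput_with_drop_inode(void)
{
    inode_lock_held = true;
    hook_that_may_sleep();
    inode_lock_held = false;
}
```

Calling `iput_with_put_inode()` leaves the counter untouched, while `iput_with_drop_inode()` increments it. In a real kernel, any blocking operation reached from the second path would be a candidate for the kind of hang discussed in this thread.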

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng
Kadlecsik Jozsef wrote: On Thu, 2 Apr 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing .put_inode. .put_inode was called without holding a lock, but .drop_inode is called under inode_lock

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-04-02 Thread Wendy Cheng
Kadlecsik Jozsef wrote: - commit 82d176ba485f2ef049fd303b9e41868667cebbdb gfs_drop_inode as .drop_inode replacing .put_inode. .put_inode was called without holding a lock, but .drop_inode is called under inode_lock held. Might it be a problem? Based on code reading ... 1.

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-31 Thread Kadlecsik Jozsef
Hi, On Mon, 30 Mar 2009, Abhijith Das wrote: Could you remove the patch associated with bz 466645 and see if you can hit the hang again? I've looked at the patch and I can't spot anything obvious. If this patch is causing your problems, I'll work on reproducing the problem on my setup here

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-31 Thread David Teigland
On Tue, Mar 31, 2009 at 11:18:51AM +0200, Kadlecsik Jozsef wrote: On Mon, 30 Mar 2009, David Teigland wrote: On Fri, Mar 27, 2009 at 06:19:50PM +0100, Kadlecsik Jozsef wrote: Combing through the log files I found the following: Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-30 Thread Kadlecsik Jozsef
On Mon, 30 Mar 2009, Kadlecsik Jozsef wrote: On Sun, 29 Mar 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: There are three different netconsole log recordings at http://www.kfki.hu/~kadlec/gfs/ One of the new console logs has a good catch (netconsole0.txt): you *do* have a

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-30 Thread Kadlecsik Jozsef
Hi, On Sun, 29 Mar 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: There are three different netconsole log recordings at http://www.kfki.hu/~kadlec/gfs/ One of the new console logs has a good catch (netconsole0.txt): you *do* have a deadlock as the CPUs are spinning waiting for spin

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-30 Thread Wendy Cheng
Kadlecsik Jozsef wrote: You mean the part of the patch @@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, gh); if (!error) { generic_fillattr(inode, stat); +

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-30 Thread David Teigland
On Thu, Mar 26, 2009 at 11:47:00PM +0100, Kadlecsik Jozsef wrote: Hi, Freshly built cluster-2.03.11 reproducibly freezes when mailman is started. The versions are: linux-2.6.27.21 cluster-2.03.11 openais from svn, subrev 1152 version 0.80 So, in summary: - nodes 1-5 are correctly forming a

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-30 Thread Abhijith Das
Wendy Cheng wrote: Kadlecsik Jozsef wrote: You mean the part of the patch @@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, gh); if (!error) {

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-29 Thread Kadlecsik Jozsef
Hi, On Sat, 28 Mar 2009, Wendy Cheng wrote: Kadlecsik Jozsef wrote: I don't see strong evidence of a deadlock (though there could be one) in the thread backtraces. However, assuming the cluster worked before, you could have overloaded the e1000 driver in this case. There are suspicious page

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-28 Thread Kadlecsik Jozsef
Hi, On Fri, 27 Mar 2009, Wendy Cheng wrote: I should get some sleep - but can't it be that I hit the potential deadlock mentioned here: commit 4787e11dc7831f42228b89ba7726fd6f6901a1e3 gfs-kmod: workaround for potential deadlock. Prefault user pages [...] file.

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-28 Thread Wendy Cheng
Kadlecsik Jozsef wrote: I don't see strong evidence of a deadlock (though there could be one) in the thread backtraces. However, assuming the cluster worked before, you could have overloaded the e1000 driver in this case. There are suspicious page faults but memory is fine. So one possibility is that GFS

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-28 Thread Wendy Cheng
Wendy Cheng wrote: ... [snip] ... There are many footprints of spin_lock - that's worrisome. Next time you have a hang, hit sysrq-w a couple of times in addition to sysrq-t. This should give traces of the threads that are actively on CPUs at that time. Also check your kernel change log (to see
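On a node without a usable console keyboard, the same sysrq commands can be requested by writing the command letter to /proc/sysrq-trigger as root. The sketch below shows one way to do that from C. The helper name `write_sysrq` is an assumption made for illustration, and the path is passed as a parameter so the function can be tried against an ordinary file first. What each letter dumps depends on the kernel version, so check the sysrq documentation that ships with the kernel in use.

```c
#include <stdio.h>

/* Hypothetical helper: write one SysRq command letter to a trigger file.
 * On a live system the path would be /proc/sysrq-trigger (root only).
 * Returns 0 on success, -1 on failure. */
int write_sysrq(const char *trigger_path, char command)
{
    FILE *f = fopen(trigger_path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputc(command, f) != EOF);

    /* fclose() flushes the buffer; a failed flush means the write
     * did not reach the file. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as `write_sysrq("/proc/sysrq-trigger", 'w')` followed by `write_sysrq("/proc/sysrq-trigger", 't')` would request both dumps. The resulting traces go to the kernel log, so they can be captured the same way as the netconsole recordings mentioned elsewhere in the thread.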

RE: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Kadlecsik Jozsef
On Fri, 27 Mar 2009, Ben Yarwood wrote: Replaying a journal as below usually indicates, I believe, that a node has withdrawn from that file system. You should grep messages on all nodes for 'GFS'; if any node is reporting errors with this fs then it will need rebooting/fencing before access to that

RE: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Kadlecsik Jozsef
On Fri, 27 Mar 2009, Kadlecsik Jozsef wrote: The failing node is fenced off. Here are the steps to reproduce the freeze of the node: - all nodes are running and are members of the cluster - start the mailman queue manager: the node freezes - the frozen node is fenced off by a member of the

RE: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Nick Lunt
...@redhat.com] On Behalf Of Kadlecsik Jozsef Sent: 27 March 2009 12:27 To: linux clustering Subject: RE: [Linux-cluster] Freeze with cluster-2.03.11 On Fri, 27 Mar 2009, Kadlecsik Jozsef wrote: In an attempt to trigger the freeze without mailman (if it is due to a corrupt fs) I umounted

RE: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Kadlecsik Jozsef
On Fri, 27 Mar 2009, Nick Lunt wrote: I have to suggest purchasing solaris, hp-ux or AIX for running enterprise clusters. Thanks, but currently - and in the foreseeable future - that's not an option for us. And at the moment it'd be impractical as the existing cluster must be fixed. Best

RE: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Robert Hurst
I thought this list was for technical feedback that addresses the issue at hand, however, since one opinion was offered, I will express one of my own. On Fri, 2009-03-27 at 15:11 +, Nick Lunt wrote: I have to suggest purchasing solaris, hp-ux or AIX for running enterprise clusters.

RE: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Kadlecsik Jozsef
Hi, Combing through the log files I found the following: Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member after 0 sec post_fail_delay Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node web1-gfs Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node e1÷?e1÷? Mar

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Wendy Cheng
... [snip] ... Sigh. The pressure is mounting to fix the cluster at any cost, and nothing remained but to downgrade to cluster-2.01.00/openais-0.80.3 which would be just ridiculous. I have doubts that GFS (i.e. GFS1) is tuned and well-maintained on newer versions of RHCS (as well as 2.6

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Bob Peterson
- Kadlecsik Jozsef kad...@mail.kfki.hu wrote: | Hi, | | Combing through the log files I found the following: | | Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member | after 0 sec post_fail_delay | Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node web1-gfs | Mar 27 13:31:56

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Bob Peterson
- Kadlecsik Jozsef kad...@mail.kfki.hu wrote: | Yes. Probably it's worth to summarize what's happening here: | | - Full, healthy-looking cluster with all of the five nodes joined | runs smoothly. | - One node freezes out of the blue; it can reliably be triggered | anytime by starting

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Kadlecsik Jozsef
On Fri, 27 Mar 2009, Bob Peterson wrote: | Combing through the log files I found the following: | | Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member | after 0 sec post_fail_delay | Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node web1-gfs | Mar 27 13:31:56 lxserv0

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Kadlecsik Jozsef
On Sat, 28 Mar 2009, Kadlecsik Jozsef wrote: On Fri, 27 Mar 2009, Bob Peterson wrote: Perhaps you should change your post_fail_delay to some very high number, recreate the problem, and when it freezes force a sysrq-trigger to get call traces for all the processes. Then also you can

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-27 Thread Wendy Cheng
I should get some sleep - but can't it be that I hit the potential deadlock mentioned here: Please take my observation with a grain of salt (as I don't have Linux source code in front of me to check the exact locking sequence, nor can I afford spending time on this) ... I don't see a strong

[Linux-cluster] Freeze with cluster-2.03.11

2009-03-26 Thread Kadlecsik Jozsef
Hi, Freshly built cluster-2.03.11 reproducibly freezes when mailman is started. The versions are: linux-2.6.27.21 cluster-2.03.11 openais from svn, subrev 1152 version 0.80 LVM2.2.02.44 This is a five node cluster which was just upgraded from cluster-2.01.00, node by node. All nodes went fine

Re: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-26 Thread Kadlecsik Jozsef
On Thu, 26 Mar 2009, Kadlecsik Jozsef wrote: Freshly built cluster-2.03.11 reproducibly freezes when mailman is started. [...] Of course all the mailman data is over GFS: list config files, locks, queues, archives, etc. When the system is frozen, nothing can be obtained by the magic sysrq keys,

RE: [Linux-cluster] Freeze with cluster-2.03.11

2009-03-26 Thread Ben Yarwood
Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Kadlecsik Jozsef Sent: 26 March 2009 22:47 To: linux clustering Subject: [Linux-cluster] Freeze with cluster-2.03.11 Hi, Freshly built cluster-2.03.11 reproducibly freezes when mailman is started