On Mon, 6 Apr 2009, Kadlecsik Jozsef wrote:
On Sun, 5 Apr 2009, Wendy Cheng wrote:
Based on code reading ...
1. iput() gets inode_lock (a spin lock)
2. iput() calls iput_final()
3. iput_final() calls gfs_drop_inode(), which calls generic_drop_inode()
4. generic_drop_inode()
Then don't remove it yet. The ramifications need more thought ...
That generic_drop_inode() can *not* be removed.
On Tue, 31 Mar 2009, Kadlecsik Jozsef wrote:
I'll restore the kernel on a not so critical node and will try to find out
how to trigger the bug without mailman. If that succeeds then I'll remove
the patch in question and re-run the test. It'll need a few days, surely,
but I'll report the
Kadlecsik Jozsef wrote:
If you have any idea what to do next, please write it.
Do you have your kernel source somewhere (in tarball format) so people can look into it?
-- Wendy
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster
Hi,
On Thu, 2 Apr 2009, Wendy Cheng wrote:
If you have any idea what to do next, please write it.
Do you have your kernel source somewhere (in tarball format) so people can look into it?
I have created the tarballs, you can find them at
http://www.kfki.hu/~kadlec/gfs/:
- Kernel is
On Thu, 2 Apr 2009, Kadlecsik Jozsef wrote:
If you have any idea what to do next, please write it.
I spent some more time looking through the git commits, and that triggered some wild guessing:
- commit ddebb0c3dc7d0b87c402ba17731ad41abdd43f2d ?
It is a temporary fix for 2.6.26, which is
Kadlecsik Jozsef wrote:
- commit 82d176ba485f2ef049fd303b9e41868667cebbdb
gfs_drop_inode as .drop_inode replacing .put_inode.
.put_inode was called without holding a lock, but .drop_inode
is called under inode_lock held. Might it be a problem?
I was planning to take a look over the
Hi,
On Mon, 30 Mar 2009, Abhijith Das wrote:
Could you remove the patch associated with bz 466645 and see if you can
hit the hang again? I've looked at the patch and I can't spot anything
obvious. If this patch is causing your problems, I'll work on
reproducing the problem on my setup here
On Tue, Mar 31, 2009 at 11:18:51AM +0200, Kadlecsik Jozsef wrote:
On Mon, 30 Mar 2009, David Teigland wrote:
On Fri, Mar 27, 2009 at 06:19:50PM +0100, Kadlecsik Jozsef wrote:
Combing through the log files I found the following:
Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not
Hi,
On Sun, 29 Mar 2009, Wendy Cheng wrote:
Kadlecsik Jozsef wrote:
There are three different netconsole log recordings at
http://www.kfki.hu/~kadlec/gfs/
One of the new console logs has a good catch (netconsole0.txt): you *do* have a deadlock as the CPUs are spinning waiting for spin
Kadlecsik Jozsef wrote:
You mean the part of the patch
@@ -1503,6 +1503,15 @@ gfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct
	error = gfs_glock_nq_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_ANY, &gh);
	if (!error) {
		generic_fillattr(inode, stat);
+
On Thu, Mar 26, 2009 at 11:47:00PM +0100, Kadlecsik Jozsef wrote:
Hi,
Freshly built cluster-2.03.11 reproducibly freezes as soon as mailman is started.
The versions are:
linux-2.6.27.21
cluster-2.03.11
openais from svn, subrev 1152 version 0.80
So, in summary:
- nodes 1-5 are correctly forming a
Hi,
On Fri, 27 Mar 2009, Wendy Cheng wrote:
I should get some sleep - but could it be that I hit the potential
deadlock mentioned here:
commit 4787e11dc7831f42228b89ba7726fd6f6901a1e3
gfs-kmod: workaround for potential deadlock. Prefault user pages
[...]
file.
Kadlecsik Jozsef wrote:
I don't see a strong evidence of deadlock (but it could) from the thread
backtraces However, assuming the cluster worked before, you could have
overloaded the e1000 driver in this case. There are suspicious page faults
but memory is very ok. So one possibility is that GFS
Wendy Cheng wrote:
... [snip] ... There are many footprints of spin_lock - that's
worrisome. Hit a couple of sysrq-w next time when you have hangs,
in addition to sysrq-t. This should give traces of the threads that are
actively on CPUs at that time. Also check your kernel change log (to
see
On Fri, 27 Mar 2009, Ben Yarwood wrote:
Replaying a journal as below usually indicates that a node has withdrawn from that
file system, I believe. You should grep messages on all nodes for 'GFS'; if
any node is reporting errors with this fs, then it will need rebooting/fencing
before access to that
On Fri, 27 Mar 2009, Kadlecsik Jozsef wrote:
The failing node is fenced off. Here are the steps to reproduce the
freeze of the node:
- all nodes are running and are members of the cluster
- start the mailman queue manager: the node freezes
- the frozen node is fenced off by a member of the
...@redhat.com] On Behalf Of Kadlecsik Jozsef
Sent: 27 March 2009 12:27
To: linux clustering
Subject: RE: [Linux-cluster] Freeze with cluster-2.03.11
On Fri, 27 Mar 2009, Kadlecsik Jozsef wrote:
In an attempt to trigger the freeze without mailman (if it is due to a corrupt fs) I umounted
On Fri, 27 Mar 2009, Nick Lunt wrote:
I have to suggest purchasing solaris, hp-ux or AIX for running enterprise
clusters.
Thanks, but currently - and in the foreseeable future - that's not an option
for us. And at the moment it would be impractical anyway, as the existing
cluster must be fixed.
Best
I thought this list was for technical feedback that addresses the issue
at hand; however, since one opinion was offered, I will express one of
my own.
On Fri, 2009-03-27 at 15:11 +, Nick Lunt wrote:
I have to suggest purchasing solaris, hp-ux or AIX for running enterprise
clusters.
Hi,
Combing through the log files I found the following:
Mar 27 13:31:56 lxserv0 fenced[3833]: web1-gfs not a cluster member after 0 sec
post_fail_delay
Mar 27 13:31:56 lxserv0 fenced[3833]: fencing node web1-gfs
Mar 27 13:31:56 lxserv0 fenced[3833]: can't get node number for node e1÷?e1÷?
Mar
... [snip] ...
Sigh. The pressure is mounting to fix the cluster at any cost, and nothing
remains but to downgrade to cluster-2.01.00/openais-0.80.3, which would be
just ridiculous.
I have doubts that GFS (i.e. GFS1) is tuned and well-maintained on newer
versions of RHCS (as well as 2.6
- Kadlecsik Jozsef kad...@mail.kfki.hu wrote:
| Yes. It's probably worth summarizing what's happening here:
|
| - Full, healthy-looking cluster with all of the five nodes joined
| runs smoothly.
| - One node freezes out of the blue; it can reliably be triggered
| anytime by starting
On Sat, 28 Mar 2009, Kadlecsik Jozsef wrote:
On Fri, 27 Mar 2009, Bob Peterson wrote:
Perhaps you should change your post_fail_delay to some very high
number, recreate the problem, and when it freezes force a
sysrq-trigger to get call traces for all the processes.
Then also you can
I should get some sleep - but could it be that I hit the potential
deadlock mentioned here:
Please take my observation with a grain of salt (as I don't have Linux
source code in front of me to check the exact locking sequence, nor can I
afford spending time on this) ...
I don't see a strong
Hi,
Freshly built cluster-2.03.11 reproducibly freezes as soon as mailman is started.
The versions are:
linux-2.6.27.21
cluster-2.03.11
openais from svn, subrev 1152 version 0.80
LVM2.2.02.44
This is a five-node cluster which was just upgraded from cluster-2.01.00,
node by node. All nodes went fine
On Thu, 26 Mar 2009, Kadlecsik Jozsef wrote:
Freshly built cluster-2.03.11 reproducibly freezes as soon as mailman is started.
[...]
Of course all the mailman data is on GFS: list config files, locks,
queues, archives, etc. When the system is frozen, nothing can be obtained
via the magic sysrq keys,
-----Original Message-----
From: linux-cluster-boun...@redhat.com
[mailto:linux-cluster-boun...@redhat.com] On Behalf Of Kadlecsik Jozsef
Sent: 26 March 2009 22:47
To: linux clustering
Subject: [Linux-cluster] Freeze with cluster-2.03.11
Hi,
Freshly built cluster-2.03.11 reproducibly freezes as soon as mailman is started