Sorry for the missing info, I should have known better :)

All nodes run Debian, with the following software installed:

Kernel: 2.6.26-1-amd64 x86_64
modinfo ocfs2:
  version:     1.5.0
  description: OCFS2 1.5.0
  srcversion:  B19D847BA86E871E41B7A64
  vermagic:    2.6.26-1-amd64 SMP mod_unload modversions

ocfs2-tools:
  Version: 1.4.1-1

Tia,
Kees Hoekzema

> -----Original Message-----
> From: Sunil Mushran [mailto:sunil.mush...@oracle.com]
> Sent: woensdag 27 mei 2009 20:03
> To: Kees Hoekzema
> Cc: ocfs2-users@oss.oracle.com
> Subject: Re: [Ocfs2-users] Cluster lockup when one node fails
>
> kernel version, ocfs2 version?
>
> $ uname -a
> $ modinfo ocfs2
> $ rpm -qa | grep ocfs2
>
> Kees Hoekzema wrote:
> > Hello List,
> >
> > At the moment I'm running a 7-node OCFS2 cluster on a Dell MD3000i
> > (iSCSI) NAS. The cluster has run fine for well over a year now, but
> > recently one of the older, more unstable servers in the cluster has
> > started to fail occasionally.
> >
> > While it is not a big problem that this particular server reboots, it
> > is a problem that when it does, the whole cluster becomes unusable
> > until that node reboots and returns.
> >
> > Today we had another crash on that server. The other nodes logged it
> > like this in dmesg:
> >
> > May 27 16:45:03 aphaea kernel:
> > o2net: connection to node achelois (num 5) at 10.0.1.24:7777 has been
> > idle for 10.0 seconds, shutting it down.
> > (0,3):o2net_idle_timer:1468 here are some times that might help debug
> > the situation: (tmr 1243435493.522086 now 1243435503.520354 dr
> > 1243435493.522080 adv 1243435493.522090:1243435493.522091 func
> > (6169a8d1:502) 1243435148.2972:1243435148.2999)
> > o2net: no longer connected to node achelois (num 5) at 10.0.1.24:7777
> > (3762,1):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (3762,1):dlm_get_lock_resource:912 ERROR: status = -112
> > (5196,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (5196,3):dlm_get_lock_resource:912 ERROR: status = -107
> > (735,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (735,3):dlm_get_lock_resource:912 ERROR: status = -107
> > (21573,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (21573,3):dlm_get_lock_resource:912 ERROR: status = -107
> > (2825,3):o2net_connect_expired:1629 ERROR: no connection established
> > with node 5 after 10.0 seconds, giving up and returning errors.
> > (1916,3):dlm_do_master_request:1335 ERROR: link to 5 went down!
> > (1916,3):dlm_get_lock_resource:912 ERROR: status = -107
> > ..
> > [and a lot more similar errors]
> > ..
> > May 27 17:14:45 aphaea kernel: (2825,3):o2dlm_eviction_cb:258 o2dlm
> > has evicted node 5 from group 20AB0E216A25479A986F8FDFE574C640
> >
> > The failed node was completely frozen, so it most likely never even
> > got the kernel panic from OCFS2 that would have made it reboot.
> >
> > After we rebooted the node, the cluster became available again. Even
> > so, the failure kept the other six servers locked out of the shared
> > storage for almost 30 minutes.
> >
> > Is there a way to 'evict' a failed node faster and continue normal
> > read/write operations without it?
> > Or is it at least possible to let read operations continue without
> > being blocked as well?
> >
> > Tia,
> > Kees Hoekzema
> >
> > _______________________________________________
> > Ocfs2-users mailing list
> > Ocfs2-users@oss.oracle.com
> > http://oss.oracle.com/mailman/listinfo/ocfs2-users
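
P.S. For anyone who finds this thread in the archives: the "10.0 seconds" in
the log above is the o2net idle timeout, and how quickly a dead node is
fenced is governed by the o2cb heartbeat and network timeouts. On our Debian
nodes these live in /etc/default/o2cb. The excerpt below is a sketch from
memory of ocfs2-tools 1.4; variable names and defaults may differ between
versions, so verify with 'dpkg-reconfigure ocfs2-tools' before relying on it:

  # /etc/default/o2cb (excerpt, assumed ocfs2-tools 1.4 defaults)
  O2CB_ENABLED=true
  O2CB_BOOTCLUSTER=ocfs2
  # Disk heartbeat: a node is declared dead after (threshold - 1) * 2
  # seconds, i.e. 31 gives 60 seconds.
  O2CB_HEARTBEAT_THRESHOLD=31
  # Network idle timeout in milliseconds; the "idle for 10.0 seconds" in
  # the log suggests our cluster runs with 10000 here rather than 30000.
  O2CB_IDLE_TIMEOUT_MS=30000
  O2CB_KEEPALIVE_DELAY_MS=2000
  O2CB_RECONNECT_DELAY_MS=2000

Lowering these should make eviction faster, but values that are too low risk
fencing healthy nodes during transient network or storage hiccups, and all
nodes in the cluster have to use the same settings.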
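
P.P.S. To make a node that does hit a kernel panic reboot on its own instead
of sitting at the console, the standard Linux sysctls below can be set; they
are not OCFS2-specific, and a hard freeze like today's would still need a
hardware watchdog to recover:

  # /etc/sysctl.conf (excerpt)
  # Reboot 30 seconds after a kernel panic instead of hanging forever.
  kernel.panic = 30
  # Treat a kernel oops as a panic so the setting above applies.
  kernel.panic_on_oops = 1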