On 07/06/2015 07:04 PM, Dejan Muhamedagic wrote:
On Mon, Jul 06, 2015 at 03:14:34PM +0500, Muhammad Sharfuddin wrote:
On 07/06/2015 02:50 PM, Dejan Muhamedagic wrote:
Hi,
On Sun, Jul 05, 2015 at 09:13:56PM +0500, Muhammad Sharfuddin wrote:
SLES 11 SP3 + online updates (pacemaker-1.1.11-0.8.11.70,
openais-1.1.4-5.22.1.7)
It's a dual-primary DRBD cluster which mounts a file system resource
on both cluster nodes simultaneously (the file system type is OCFS2).
Whenever one of the nodes goes down, the file system (/sharedata)
becomes inaccessible for exactly 35 seconds on the other
(surviving/online) node, and then becomes available again on the
online node.
Please help me understand why the node which survives or remains
online is unable to access the file system resource (/sharedata) for
35 seconds, and how I can fix the cluster so that the file system
remains accessible on the surviving node without any
interruption/delay (in my case, about 35 seconds).
By inaccessible, I mean that running "ls -l /sharedata" and
"df /sharedata" returns no output and does not give the prompt back
on the online node for exactly 35 seconds once the other node goes
offline.
E.g. "node1" went offline at around 01:37:15, and the /sharedata
file system was then inaccessible between 01:37:35 and 01:38:18 on
the online node, i.e. "node2".
Before the failing node gets fenced, you won't be able to use the
OCFS2 filesystem. In this case, the fencing operation takes 40
seconds:
So it's expected.
[...]
Jul 5 01:37:35 node2 sbd: [6197]: info: Writing reset to node slot node1
Jul 5 01:37:35 node2 sbd: [6197]: info: Messaging delay: 40
Jul 5 01:38:15 node2 sbd: [6197]: info: reset successfully
delivered to node1
Jul 5 01:38:15 node2 sbd: [6196]: info: Message successfully delivered.
[...]
You may want to reduce that sbd timeout.
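For reference, the timeouts sbd is actually using can be read from the header of the sbd device; a minimal sketch, where /dev/disk/by-id/my-sbd-device is a placeholder for the real sbd device:

```shell
# Dump the sbd device header, which includes the configured
# watchdog and msgwait timeouts (the device path is a placeholder).
sbd -d /dev/disk/by-id/my-sbd-device dump
```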
Ok, so would reducing the sbd timeout (or msgwait) provide
uninterrupted access to the OCFS2 file system on the
surviving/online node, or would it just minimize the downtime?
Only the latter. But note that it is important that once sbd
reports success, the target node is really down. sbd is
timeout-based, i.e. it doesn't test whether the node actually
left. Hence this timeout shouldn't be too short.
Hmm, by the way, for the watchdog and msgwait timeout values I have
always blindly followed the values suggested at
https://www.novell.com/support/kb/doc.php?id=7011346
and the suggested values there are 20 for watchdog and 40 for msgwait.
I'll check the setup after reducing the watchdog timeout to 10 and
msgwait to 20.
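A sketch of how those reduced timeouts could be applied, assuming /dev/disk/by-id/my-sbd-device stands in for the actual sbd device. Re-creating the header wipes the slot metadata, so the cluster should be stopped on all nodes first:

```shell
# Re-initialize the sbd device with the reduced timeouts:
# -1 sets the watchdog timeout, -4 sets msgwait
# (device path is a placeholder for the real sbd device).
sbd -d /dev/disk/by-id/my-sbd-device -1 10 -4 20 create

# Verify the new values in the header.
sbd -d /dev/disk/by-id/my-sbd-device dump
```

Note that Pacemaker's stonith-timeout should remain larger than msgwait (msgwait plus some margin), otherwise a fencing operation may be reported as timed out even though it eventually succeeds.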
Thanks,
Dejan
_______________________________________________
Linux-HA mailing list is closing down.
Please subscribe to [email protected] instead.
http://clusterlabs.org/mailman/listinfo/users
_______________________________________________
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
--
Regards,
Muhammad Sharfuddin
_______________________________________________
Users mailing list: [email protected]
http://clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org