Re: [Veritas-ha] Private communication problem
Hi Shashi,

Why didn't you configure the second required cluster interconnect?

Regards
Manuel

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Shashi Kanth Boddula
Sent: Thursday, 10 April 2008 14:50
To: veritas-ha@mailman.eng.auburn.edu
Subject: [Veritas-ha] Private communication problem

I have a 2-node VCS 3.5 cluster running on HP-UX 11.11. There is a problem with the private communication, and I cannot work out exactly what it is or how to solve it.

Each node has 2 LAN cards, and I am using LAN1 for private communication.

Node 1
# cat /etc/llttab
set-node 0
set-cluster 9
link lan1 /dev/lan:1 - ether - -

# cat /etc/llthosts
0 solstice
1 express

# lltstat -n
LLT node information:
    Node            State     Links
  * 0 solstice      OPEN      2

Node 2
# cat /etc/llttab
set-node 1
set-cluster 9
link lan1 /dev/lan:1 - ether - -

# cat /etc/llthosts
0 solstice
1 express

# lltstat -n
LLT node information:
    Node            State     Links
    0 solstice      CONNWAIT  1
  * 1 express       OPEN      2

To check whether there is a network problem, I assigned a private IP address to LAN1 on each node.

Node 1
# ifconfig lan1
lan1: flags=843<UP,BROADCAST,RUNNING,MULTICAST>
        inet 172.31.0.1 netmask ff00 broadcast 172.31.0.255
# ping 172.31.0.2
PING 172.31.0.2: 64 byte packets
64 bytes from 172.31.0.2: icmp_seq=0. time=0. ms

Node 2
# ifconfig lan1
lan1: flags=1843<UP,BROADCAST,RUNNING,MULTICAST,CKO>
        inet 172.31.0.2 netmask ff00 broadcast 172.31.0.255
# ping 172.31.0.1
PING 172.31.0.1: 64 byte packets
64 bytes from 172.31.0.1: icmp_seq=0. time=0. ms

So the above indicates that there is no network issue.

Node 1
# lltstat -nvv
LLT node information:
    Node            State     Link  Status  Address
  * 0 solstice      OPEN
                              lan1  UP      00:30:6E:37:09:85
    1 express       CONNWAIT
                              lan1  DOWN

Node 2
# lltstat -nvv
LLT node information:
    Node            State     Link  Status  Address
    0 solstice      CONNWAIT
                              lan1  UP      00:30:6E:37:09:85
  * 1 express       OPEN
                              lan1  UP      00:30:6E:49:D6:97

One more point: on the first node the LAN1 card is Fast Ethernet, while on the second node it is 1000Base-T. Could this cause a problem?
___
Veritas-ha maillist - Veritas-ha@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-ha
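Manuel's question about the missing second interconnect can be turned into a mechanical check. The sketch below is a hypothetical helper (parse_llttab and interconnect_warnings are names introduced here, not part of VCS) that reads only the subset of /etc/llttab syntax shown above: set-node, set-cluster, and link lines with six fields after the keyword. It assumes lines starting with '#' are comments and silently ignores every other directive.

```python
from typing import List, NamedTuple


class LltLink(NamedTuple):
    """One 'link' directive: tag, device, node range, link type, SAP, MTU."""
    tag: str
    device: str
    node_range: str
    link_type: str
    sap: str
    mtu: str


def parse_llttab(text: str) -> dict:
    """Parse the subset of /etc/llttab directives shown in this thread."""
    cfg: dict = {"node": None, "cluster": None, "links": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # blank or comment line (assumed syntax)
            continue
        fields = line.split()
        if fields[0] == "set-node" and len(fields) > 1:
            cfg["node"] = fields[1]
        elif fields[0] == "set-cluster" and len(fields) > 1:
            cfg["cluster"] = fields[1]
        elif fields[0] == "link" and len(fields) == 7:
            cfg["links"].append(LltLink(*fields[1:]))
    return cfg


def interconnect_warnings(cfg: dict, min_links: int = 2) -> List[str]:
    """Return human-readable warnings; an empty list means the check passed."""
    warnings = []
    links = cfg["links"]
    if len(links) < min_links:
        warnings.append(
            f"only {len(links)} LLT link(s) configured; expected at least {min_links}"
        )
    devices = [link.device for link in links]
    if len(set(devices)) != len(devices):
        warnings.append("the same device is used for more than one link")
    return warnings


node1 = """set-node 0
set-cluster 9
link lan1 /dev/lan:1 - ether - -
"""
print(interconnect_warnings(parse_llttab(node1)))
# ['only 1 LLT link(s) configured; expected at least 2']
```

Appending a second line such as `link lan2 /dev/lan:2 - ether - -` (a hypothetical second private NIC) makes the warning list empty.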
[Veritas-ha] Tweaking monitors
All,

The cluster keeps 'spawning' (Oracle) listeners under high load, because it detects the listener as offline, which is not true:

2008/04/10 09:59:31 VCS INFO V-16-2-13026 (pluto) Resource(base_listener) - monitor procedure finished successfully after failing to complete within the expected time for (4) consecutive times.

I'd like it not to declare the resource faulted until X monitor failures, and also to increase the timeout, but I don't remember where to change those settings. Relevant configs are attached. Could someone point me in the right direction, or name a parameter off the top of their head to look into?

# hares -display base_listener
#Resource      Attribute         System  Value
base_listener  Group             global  pluto_gp
base_listener  Type              global  Netlsnr
base_listener  AutoStart         global  1
base_listener  Critical          global  1
base_listener  Enabled           global  1
base_listener  LastOnline        global  pluto
base_listener  MonitorOnly       global  0
base_listener  ResourceOwner     global  unknown
base_listener  TriggerEvent      global  0
base_listener  ArgListValues     mars    oracle /u81/app/oracle/product/102 /u81/app/oracle/product/102/network/admin base_listener ./bin/Netlsnr/LsnrTest.pl 0
base_listener  ArgListValues     pluto   oracle /u81/app/oracle/product/102 /u81/app/oracle/product/102/network/admin base_listener ./bin/Netlsnr/LsnrTest.pl 0
base_listener  ArgListValues     sun     oracle /u81/app/oracle/product/102 /u81/app/oracle/product/102/network/admin base_listener ./bin/Netlsnr/LsnrTest.pl 0
base_listener  ConfidenceLevel   mars    0
base_listener  ConfidenceLevel   pluto   100
base_listener  ConfidenceLevel   sun     0
base_listener  Flags             mars
base_listener  Flags             pluto
base_listener  Flags             sun
base_listener  IState            mars    not waiting
base_listener  IState            pluto   not waiting
base_listener  IState            sun     not waiting
base_listener  Probed            mars    1
base_listener  Probed            pluto   1
base_listener  Probed            sun     1
base_listener  Start             mars    0
base_listener  Start             pluto   1
base_listener  Start             sun     0
base_listener  State             mars    OFFLINE
base_listener  State             pluto   ONLINE
base_listener  State             sun     OFFLINE
base_listener  AgentDebug        global  0
base_listener  ComputeStats      global  0
base_listener  Encoding          global
base_listener  EnvFile           global
base_listener  Home              global  /u81/app/oracle/product/102
base_listener  Listener          global  base_listener
base_listener  LsnrPwd           global
base_listener  MonScript         global  ./bin/Netlsnr/LsnrTest.pl
base_listener  Owner             global  oracle
base_listener  ResourceInfo      global  State Stale Msg TS
base_listener  TnsAdmin          global  /u81/app/oracle/product/102/network/admin
base_listener  MonitorTimeStats  mars    Avg 0 TS
base_listener  MonitorTimeStats  pluto   Avg 0 TS
base_listener  MonitorTimeStats  sun     Avg 0 TS

OracleTypes.cf

type Netlsnr (
        static keylist SupportedActions = { VRTS_GetInstanceName, VRTS_GetRunningServices }
        static int OnlineRetryLimit = 2
        static int RestartLimit = 2
        static str ArgList[] = { Owner, Home, TnsAdmin, Listener, EnvFile, MonScript, LsnrPwd, AgentDebug, Encoding }
        str Owner
        str Home
        str TnsAdmin
        str Listener
        str EnvFile
        str MonScript = ./bin/Netlsnr/LsnrTest.pl
        str LsnrPwd
        boolean AgentDebug = 0
        str Encoding
)
Re: [Veritas-ha] Private communication problem
Did you connect the two different cards with a crossover cable?

Regards
Manuel

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Manuel Braun
Sent: Thursday, 10 April 2008 16:13
To: Shashi Kanth Boddula; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] Private communication problem

Hi Shashi,

Why didn't you configure the second required cluster interconnect?

Regards
Manuel
Re: [Veritas-ha] Tweaking monitors
Hi Andrey,

Are you looking for ToleranceLimit? Here is an excerpt from the VCS User's Guide:

About the ToleranceLimit attribute
The ToleranceLimit attribute defines the number of times the Monitor routine should return an offline status before declaring a resource offline. This attribute is typically used when a resource is busy and appears to be offline. Setting the attribute to a non-zero value instructs VCS to allow multiple failing monitor cycles with the expectation that the resource will eventually respond. Setting a non-zero ToleranceLimit also extends the time required to respond to an actual fault.

Regards
Manuel

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andrey Dmitriev
Sent: Thursday, 10 April 2008 18:21
To: veritas-ha@mailman.eng.auburn.edu
Subject: [Veritas-ha] Tweaking monitors

I'd like it not to declare the resource faulted until X monitor failures, and also to increase the timeout, but I don't remember where to change those settings.
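To make the trade-off in the excerpt concrete, here is a small model of how a ToleranceLimit counter might behave. It is a sketch under stated assumptions, not the agent's actual code: monitor results are plain "ONLINE"/"OFFLINE" strings, an ONLINE result resets the count, and the resource is declared faulted once the number of consecutive OFFLINE results exceeds ToleranceLimit. The function name fault_cycle is hypothetical.

```python
from typing import Iterable, Optional


def fault_cycle(results: Iterable[str], tolerance_limit: int) -> Optional[int]:
    """Return the 1-based monitor cycle at which the resource would be
    declared faulted, or None if that never happens.

    Assumed semantics for this sketch: an ONLINE result resets the count,
    and the fault is declared once consecutive OFFLINE results exceed
    tolerance_limit.
    """
    offline_streak = 0
    for cycle, result in enumerate(results, start=1):
        if result == "OFFLINE":
            offline_streak += 1
            if offline_streak > tolerance_limit:
                return cycle
        else:
            offline_streak = 0  # resource answered again; start counting afresh
    return None


busy = ["OFFLINE", "OFFLINE", "ONLINE", "OFFLINE", "OFFLINE", "OFFLINE"]
print(fault_cycle(busy, 0))  # 1
print(fault_cycle(busy, 2))  # 6
print(fault_cycle(busy, 3))  # None
```

Raising ToleranceLimit from 0 to 2 moves the fault from the first monitor cycle to the sixth in this sequence, which is the "extends the time required to respond to an actual fault" cost the guide mentions. This counter concerns OFFLINE results; a monitor that hangs is governed by the timeout attributes discussed in the replies that follow.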
Re: [Veritas-ha] Tweaking monitors
FaultOnMonitorTimeouts is obviously set to 4, and you are getting more than 4 timeouts. That is 4 minutes without monitoring (4 × 60 seconds with the default monitor settings). You could increase FaultOnMonitorTimeouts, the monitor interval, or both.

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Manuel Braun
Sent: Thursday, April 10, 2008 1:19 PM
To: Andrey Dmitriev; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] Tweaking monitors

Hi Andrey, are you looking for ToleranceLimit?
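The arithmetic behind "4 minutes", and the effect of raising the limit, can be sketched as a simple consecutive-timeout counter. This assumes the usual defaults (MonitorInterval 60 s, MonitorTimeout 60 s, FaultOnMonitorTimeouts 4) and one monitor cycle per MonitorInterval; MonitorPolicy, unmonitored_window and simulate are hypothetical names, and the real agent's scheduling is more involved than this model.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class MonitorPolicy:
    # Field names are hypothetical; defaults are the assumed VCS defaults.
    monitor_interval: int = 60          # MonitorInterval (seconds)
    monitor_timeout: int = 60           # MonitorTimeout (seconds)
    fault_on_monitor_timeouts: int = 4  # FaultOnMonitorTimeouts


def unmonitored_window(policy: MonitorPolicy) -> int:
    """Rough seconds of consecutive timeouts before the fault,
    assuming one monitor cycle per MonitorInterval."""
    return policy.fault_on_monitor_timeouts * policy.monitor_interval


def simulate(run_times: Iterable[float], policy: MonitorPolicy) -> Tuple[str, int, int]:
    """Feed successive monitor run times (seconds) through the timeout counter.

    Returns (outcome, cycles, timeouts):
      "FAULTED"    - FaultOnMonitorTimeouts consecutive runs exceeded MonitorTimeout
      "RECOVERED"  - a run completed after one or more consecutive timeouts
      "OK"         - no timeouts at all
      "TIMING_OUT" - input ended while timeouts were still accumulating
    """
    timeouts = 0
    cycles = 0
    for run_time in run_times:
        cycles += 1
        if run_time > policy.monitor_timeout:
            timeouts += 1  # agent kills the hung monitor and counts a timeout
            if timeouts >= policy.fault_on_monitor_timeouts:
                return "FAULTED", cycles, timeouts  # agent would call clean here
        elif timeouts:
            return "RECOVERED", cycles, timeouts
    return ("TIMING_OUT" if timeouts else "OK"), cycles, timeouts


slow = [90, 90, 90, 90, 5]  # four hung monitors, then a fast one
print(unmonitored_window(MonitorPolicy()))                         # 240
print(simulate(slow, MonitorPolicy()))                             # ('FAULTED', 4, 4)
print(simulate(slow, MonitorPolicy(fault_on_monitor_timeouts=6)))  # ('RECOVERED', 5, 4)
```

The last call reproduces the pattern in Andrey's log (four timeouts, then a successful run); raising monitor_timeout above the slow run times instead makes the same input report "OK".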
Re: [Veritas-ha] Tweaking monitors
Hi Andrey,

Apart from what Manuel has mentioned, you might also want to look at the 'FaultOnMonitorTimeouts' and 'MonitorTimeout' attributes. These are type-level attributes (hatype -display Oracle).

MonitorTimeout is the number of seconds the monitor is allowed to run (or hang) before the agent kills it. If 'FaultOnMonitorTimeouts' monitors time out consecutively, the agent calls clean and forcefully faults the resource.

The message you are seeing (2-13026) means that the monitor timed out (ran for too long) 4 consecutive times, and on the 5th attempt it ran and completed successfully before 'MonitorTimeout' seconds elapsed.

More in the User's Guide.

HTH,
Anand

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Manuel Braun
Sent: Thursday, April 10, 2008 10:19 AM
To: Andrey Dmitriev; veritas-ha@mailman.eng.auburn.edu
Subject: Re: [Veritas-ha] Tweaking monitors

Hi Andrey, are you looking for ToleranceLimit?
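Because these are type-level attributes, a change applies to every resource of that type. The listener in this thread is of type Netlsnr (see the hares output earlier in the thread), so the sketch below builds, without running, the usual open/modify/save command sequence for that type. haconf -makerw, hatype -modify and haconf -dump -makero are standard VCS commands; type_tuning_commands and the values 120 and 6 are hypothetical, so check them against the VCS 3.5 documentation before use.

```python
import shlex
from typing import Dict, List


def type_tuning_commands(type_name: str, changes: Dict[str, int]) -> List[str]:
    """Build (but do not run) the commands for changing type-level attributes.

    The sequence opens the configuration for writing, modifies each attribute
    on the resource type (affecting every resource of that type), and then
    saves the configuration and makes it read-only again.
    """
    commands = ["haconf -makerw"]
    for attr, value in changes.items():
        argv = ["hatype", "-modify", type_name, attr, str(value)]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    commands.append("haconf -dump -makero")
    return commands


# Hypothetical values: allow longer monitor runs and tolerate more timeouts.
for cmd in type_tuning_commands("Netlsnr", {"MonitorTimeout": 120, "FaultOnMonitorTimeouts": 6}):
    print(cmd)
```

Returning strings rather than calling subprocess keeps this a dry run that can be reviewed before the commands are typed on a cluster node.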