[Linux-cluster] IP Resource behavior with Red Hat Cluster
Hi all,

I am using Red Hat cluster 6.2.0 (the version shown by "cman_tool version") on Red Hat 5.5. The host has multiple network interfaces, and all (or some) of them may be active when I try to bring my IP resource up. My cluster configuration is simple: it has only 2 nodes, and the service consists of only an IP resource. I chose a random private IP address (192.168.x.x) for test/debugging purposes.

When I tried to start the service it failed with the message:

clurgmgrd: [31853]: debug 192.168.25.135 is not configured

I then made this virtual IP available on the host manually and started the service again, and it worked:

clurgmgrd: [31853]: debug 192.168.25.135 already configured

My question is: is it a prerequisite that the IP address be manually added to the host before it can be protected as an IP resource by the cluster?

Thanks
Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
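For anyone hitting the same messages: as the rest of the thread explains, the IP resource agent picks an interface by looking for an existing address in the same subnet as the virtual IP, so it helps to see what the host already has configured. A minimal check from the shell (the address is the one from this thread; assumes the standard iproute tools and root access):

# Is the virtual IP already present on this host?
ip -o -4 addr show | grep -F '192.168.25.135'

# Does any interface carry an address in the virtual IP's subnet?
ip -o -4 addr show | grep -F '192.168.25.'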
Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster
Hi Rajagopal,

Thank you for your response. I created a cluster configuration by adding an IP resource with the value 192.168.25.135 and created a service that has just that IP resource in it. I have set all the requisite configuration: two nodes, node names, failover domain, fencing, etc.

When I tried to start (enable) the service, it failed:

clurgmgrd: [31853]: debug 192.168.25.135 is not configured

I then added this IP to the host manually:

ifconfig eth0:1 192.168.25.135

After that the service could start, but it gave the message:

clurgmgrd: [31853]: debug 192.168.25.135 already configured

So do I have to add the virtual interface manually (as above, or by some other method) before I can start a service with an IP resource under it?

Thanks
Parvez

On Fri, Dec 24, 2010 at 11:30 AM, Rajagopal Swaminathan raju.rajs...@gmail.com wrote: Greetings, On Fri, Dec 24, 2010 at 5:33 AM, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi all, I manually made this virtual IP available on host and then started service it worked - Can you please elaborate? Did you try to assign the IP to the ethX devices and then ping? clurgmgrd: [31853]: debug 192.168.25.135 already configured My question is - Is it prerequisite for IP resource to be manually added before it can be protected via cluster? Every resource/service has to be added to the cluster. And they cannot be used by anything else. Regards, Rajagopal

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster
Thanks a ton Jakov. It has clarified my doubts.

Yours gratefully,
Parvez

On Sat, Dec 25, 2010 at 6:34 AM, Jakov Sosic jakov.so...@srce.hr wrote:

On 12/24/2010 05:46 PM, Parvez Shaikh wrote: Hi Jakov, Thank you for your response. My two hosts have multiple network interfaces or ethernet cards. I understood from your email that the IP corresponding to the cluster node name, on both hosts, should be in the same subnet before the cluster can bring the virtual IP up.

No... you misunderstood me. I meant that if the virtual address is 192.168.25.X, then you have to have an interface on each node that is set up with an IP address from the same subnet. That interface does not need to correspond to the cluster node name. For example:

node1 - eth0 - 192.168.1.11 (netmask 255.255.255.0)
node2 - eth0 - 192.168.1.12 (netmask 255.255.255.0)
IP resource - 192.168.25.100

Now, how do you expect the cluster to know what to do with the IP resource? On which interface can the cluster glue 192.168.25.100? eth0? But why eth0? And what is the netmask? What about routes? So you need to have, for example, eth1 on both machines set up in the same subnet, so that the cluster can glue the IP address from the IP resource to that exact interface (which is set up statically). So you also have to have, for example:

node1 - eth1 - 192.168.25.47 (netmask 255.255.255.0)
node2 - eth1 - 192.168.25.48 (netmask 255.255.255.0)

Now rgmanager will know where to activate the IP resource, because 192.168.25.100 belongs to the 192.168.25.0/24 subnet, which is active on node1/eth1 and node2/eth1. If you were to have another IP resource, for example 192.168.240.44, you would need another interface with another set of static IP addresses on every host you intend to run that IP resource on... I hope you get it correctly now.

-- Jakov Sosic www.srce.hr

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
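A quick way to confirm Jakov's point on each node is to ask the kernel which interface would carry the virtual address. A rough sketch using the example addresses above (assumes the iproute tools):

# Should print something like "192.168.25.100 dev eth1 src 192.168.25.47"
# when a directly connected interface exists; a route via a gateway (or an
# error) means no interface is set up for the IP resource's subnet.
ip route get 192.168.25.100

# The directly connected subnet should also appear in the routing table:
ip -4 route show | grep '192.168.25.0/24'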
Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster
Hi I chose my IP resource as 192.168.13.15, I had eth3 configured on 192.168.13.1 but it still failed with error - Dec 27 17:35:32 datablade1 clurgmgrd[31853]: err Error storing ip: Duplicate Dec 27 17:36:55 datablade1 clurgmgrd[31853]: notice Starting disabled service service:service1 Dec 27 17:36:55 datablade1 clurgmgrd[31853]: notice start on ip 192.168.13.15/24 returned 1 (generic error) Dec 27 17:36:55 datablade1 clurgmgrd[31853]: warning #68: Failed to start service:service1; return value: 1 Below is set of interfaces - eth0 Link encap:Ethernet HWaddr 00:10:18:66:15:70 inet addr:192.168.10.1 Bcast:192.168.10.255 Mask:255.255.255.0 inet6 addr: fe80::210:18ff:fe66:1570/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:125 errors:0 dropped:0 overruns:0 frame:0 TX packets:305 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:32679 (31.9 KiB) TX bytes:42477 (41.4 KiB) Interrupt:177 Memory:9800-98012800 eth1 Link encap:Ethernet HWaddr 00:10:18:66:15:72 inet addr:192.168.11.1 Bcast:192.168.11.255 Mask:255.255.255.0 inet6 addr: fe80::210:18ff:fe66:1572/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:1237019 errors:0 dropped:0 overruns:0 frame:0 TX packets:1919245 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:183885611 (175.3 MiB) TX bytes:337885336 (322.2 MiB) Interrupt:154 Memory:9a00-9a012800 eth2 Link encap:Ethernet HWaddr 00:10:18:66:15:74 inet addr:192.168.12.1 Bcast:192.168.12.255 Mask:255.255.255.0 inet6 addr: fe80::210:18ff:fe66:1574/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:419008 errors:0 dropped:0 overruns:0 frame:0 TX packets:29 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:26822898 (25.5 MiB) TX bytes:5992 (5.8 KiB) Interrupt:185 Memory:9400-94012800 eth3 Link encap:Ethernet HWaddr 00:10:18:66:15:76 inet addr:192.168.13.1 Bcast:192.168.13.255 Mask:255.255.255.0 UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Interrupt:162 Memory:9600-96012800 On Sat, Dec 25, 2010 at 6:34 AM, Jakov Sosic jakov.so...@srce.hr wrote: On 12/24/2010 05:46 PM, Parvez Shaikh wrote: Hi Jakov Thank you for your response. My two hosts have multiple network interfaces or ethernet cards. I understood from your email, that the IP corresponding to cluster node name for both hosts, should be in the same subnet before a cluster could bring virtual IP up. No... you misunderstood me. I meant that if the virtual address is 192.168.25.X, than you have to have interface on each node that is set up with the ip address from the same subnet. That interface does not need to correspond to the cluster node name. For example: node1 - eth0 - 192.168.1.11 (netmask 255.255.255.0) node2 - eth0 - 192.168.1.12 (netmask 255.255.255.0) IP resource - 192.168.25.100 Now, how do you expect the cluster to know what to do with IP resource? On which interface can cluster glue 192.168.25.100? eth0? But why eth0? And what is the netmask? What about routes? So, you need to have for example eth1 on both machines set up in the same subnet, so that cluster can glue IP address from IP resource to that exact interface (which is set up statically). 
So you also have to have for example: node1 - eth1 - 192.168.25.47 (netmask 255.255.255.0) node2 - eth1 - 192.168.25.48 (netmask 255.255.255.0) Now, rgmanager will know where to activate IP resource, because 192.168.25.100 belongs to 192.168.25.0/24 subnet, which is active on node1/eth1 and node2/eth2. If you were to have another IP resource, for example 192.168.240.44, you would need another interface with another set of static ip addresses on every host you intend to run IP resource on... I hope you get it correctly now. -- Jakov Sosic www.srce.hr -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] IP Resource behavior with Red Hat Cluster
Hi all,

The issue has been resolved. After debugging a bit I found that no link was detected on the interface:

ethtool ethX | grep "Link detected:" | awk '{print $3}'

Output - no

After fixing the link, I could get my IP resource up. Thank you for your kind suggestions and interest in this problem.

Gratefully yours

On Mon, Dec 27, 2010 at 12:18 PM, Rajagopal Swaminathan raju.rajs...@gmail.com wrote: Greetings, On Mon, Dec 27, 2010 at 9:51 AM, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi Dec 27 17:35:32 datablade1 clurgmgrd[31853]: err Error storing ip: Duplicate Dec 27 17:36:55 datablade1 clurgmgrd[31853]: notice Starting disabled service service:service1 Dec 27 17:36:55 datablade1 clurgmgrd[31853]: notice start on ip 192.168.13.15/24 returned 1 (generic error) Dec 27 17:36:55 datablade1 clurgmgrd[31853]: warning #68: Failed to start service:service1; return value: 1 Below is set of interfaces - What does the ip addr show command say? Regards, Rajagopal

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
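For reference, a small loop that performs the same link check across all ethN interfaces can save time before digging into the cluster configuration (a sketch; assumes ethtool and root):

#!/bin/sh
# Report link status for every ethN interface
for dev in $(ls /sys/class/net | grep '^eth'); do
    link=$(ethtool "$dev" 2>/dev/null | awk -F': ' '/Link detected/ {print $2}')
    echo "$dev: link detected = ${link:-unknown}"
done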
Re: [Linux-cluster] Determining red hat cluster version
Hi Fabio This produces output - cman-2.0.115-29.el5 So does it indicate 2.0.115-29 is version? On Thu, Jan 6, 2011 at 12:34 PM, Fabio M. Di Nitto fdini...@redhat.com wrote: On 1/6/2011 6:24 AM, Parvez Shaikh wrote: Hi all, Is there any command which states Red Hat cluster version? I tried cman_tool version, and ccs_tool -V both produce different results, most likely reporting version of their own (not of Cluster suite) rpm -q -f $(which cman_tool) is one option, otherwise you need to parse cman_tool protocol version manually. Fabio -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Determining red hat cluster version
Thanks Fabio Is this version same as what can be referred as version of Red Hat Cluster Suite? The reason I am asking is, as a part of RHCS there are various components (Cluster_Administration-en-US, cluster-cim, cluster-snmp, cman, rgmanager, luci, ricci etc etc) and each of which shows its own version - cman has version as below, rgmanager as version 2.0.52. cluster-cim and cluster-snmp,modcluster has version 0.12.1, system-config-cluster has 1.0.57 version. Is there one version number referring to Cluster Suite which would have encompassed entire set of components (with their own versions may be) Gratefully yours On Thu, Jan 6, 2011 at 1:14 PM, Fabio M. Di Nitto fdini...@redhat.com wrote: On 1/6/2011 8:28 AM, Parvez Shaikh wrote: Hi Fabio This produces output - cman-2.0.115-29.el5 So does it indicate 2.0.115-29 is version? yes Fabio -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
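There is no single package that carries a "Cluster Suite" version; in practice you record the versions of the individual components. A quick way to capture them all at once (package names are the ones mentioned above and may vary between releases):

for pkg in cman rgmanager ricci luci modcluster cluster-cim cluster-snmp system-config-cluster; do
    rpm -q "$pkg"        # prints "package ... is not installed" for absent ones
done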
[Linux-cluster] configuring bladecenter fence device
Hi all,

From the RHCS documentation, I can see that bladecenter is one of the supported fence devices - http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/ap-fence-device-param-CA.html

Table B.9. IBM Blade Center
Field - Description
Name - A name for the IBM BladeCenter device connected to the cluster.
IP Address - The IP address assigned to the device.
Login - The login name used to access the device.
Password - The password used to authenticate the connection to the device.
Password Script (optional) - The script that supplies a password for access to the fence device. Using this supersedes the Password parameter.
Blade - The blade of the device.
Use SSH (RHEL 5.4 and later) - Indicates that the system will use SSH to access the device.

As per my understanding, "IP Address" is the IP address of the management module of the IBM BladeCenter, and login/password are the credentials used to access it. However, I did not understand the 'Blade' parameter. What role does it play in fencing?

In a situation where there are two blades - Blade-1 and Blade-2 - and Blade-1 goes down (hardware node failure), Blade-2 should fence out Blade-1. In that situation fenced on Blade-2 should power off(?) Blade-1 using fence_bladecenter, so how should the snippet of the cluster.conf file below look?

<clusternodes>
  <clusternode name="blade1" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device blade="?" name="BLADECENTER"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="blade2" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device blade="?" name="BLADECENTER"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>

In which situation would fence_bladecenter be used to power a blade on?

Yours gratefully

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] configuring bladecenter fence device
Hi Ben,

Thanks a ton for the information below. But I have a doubt about this cluster.conf snippet:

<clusternode name="node1" votes="1">
  <fence>
    <method name="1">
      <device blade="2" name="chassis_fence"/>
    </method>
  </fence>
</clusternode>

Here, for node1, the device blade is 2. Does it mean node1 is blade[2] from the AMM perspective? So in order to fence out node1, fence_bladecenter would turn off blade[2]?

Thanks

On Thu, Jan 6, 2011 at 9:36 PM, Ben Turner btur...@redhat.com wrote:

To address: "As per my understanding, IP address is IP address of management module of IBM blade center, login/password represent credentials to access the same." Correct.

"However did not get the parameter 'Blade'. How does it play role in fencing?" If I recall correctly the blade= is the identifier used to identify the blade in the AMM. I can't remember if it is the number of a slot or a user-defined name. It corresponds to:

# fence_bladecenter -h
-n, --plug=id    Physical plug number on device or name of virtual machine

In the fencing code:

port : { getopt : "n:", longopt : "plug", help : "-n, --plug=id  Physical plug number on device or name of virtual machine", required : "1", shortdesc : "Physical plug number or name of virtual machine", order : 1 },

To test this try running:

/sbin/fence_bladecenter -a <ip or hostname of bladecenter> -l <login> -p <passwd> -n <blade number of the blade you want to fence> -o status -v

An example cluster.conf looks like:

<clusternode name="node1" votes="1">
  <fence>
    <method name="1">
      <device blade="2" name="chassis_fence"/>
    </method>
  </fence>
</clusternode>
<clusternode name="node2" votes="1">
  <fence>
    <method name="1">
      <device blade="3" name="chassis_fence"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice agent="fence_bladecenter" ipaddr="XXX.XXX.1.143" login="rchs_fence" name="chassis_fence" passwd="XXX"/>
</fencedevices>

-Ben

- Original Message - Hi all, From RHCS documentation, I could see that bladecenter is one of the fence devices - http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/ap-fence-device-param-CA.html [...] In which situation fence_bladecenter would be used to power on the blade? Yours gratefully

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] configuring bladecenter fence device
Thanks Hugo Your gratefully On Fri, Jan 7, 2011 at 11:09 AM, Hugo Lombard h...@elizium.za.net wrote: On Fri, Jan 07, 2011 at 10:12:16AM +0530, Parvez Shaikh wrote: clusternode name=node1 votes=1 fence method name=1 device blade=2 name=chassis_fence/ /method /fence /clusternode Here for node1 device blade is 2. Does it mean node1 is blade[2] from AMM perspective? So in order to fence out node1 fence_bladecenter would turn off blade[2]? Hi Parvez We use BladeCenters for our clusters, and I can confirm that the 'blade=2' parameter will translate to 'blade[2]' on the AMM. IOW, the '2' is the slot number that the blade is in. Two more things that might be of help: - The user specified in the 'login' parameter under the fencedevice should be a 'Blade Administrator' for the slots in question. - If you're running with SELinux enabled, check that the 'fenced_can_network_connect' boolean is set to 'on'. Regards -- Hugo Lombard -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
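Putting Ben's and Hugo's suggestions together, a sensible pre-flight check before trusting the fence device looks roughly like this (a sketch; the angle-bracket values are placeholders, and the SELinux boolean only matters when SELinux is enforcing):

# Confirm the AMM credentials and the blade/slot mapping work end to end
/sbin/fence_bladecenter -a <AMM address> -l <login> -p <password> -n <slot number> -o status -v

# If SELinux is enforcing, allow fenced to open network connections
getsebool fenced_can_network_connect
setsebool -P fenced_can_network_connect on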
[Linux-cluster] Error while manual fencing and output of clustat
Dear experts,

I have a two node cluster (node1 and node2), and manual fencing is configured. Service S2 is running on node2. To make a failover happen, I shut down node2. I see the following message in /var/log/messages:

agent "fence_manual" reports: failed: fence_manual no node name

"fence_ack_manual -n node2" doesn't work; it says there is no FIFO in /tmp. "fence_ack_manual -n node2 -e" does work, and then service S2 fails over to node1.

I am trying to find out why fence_manual is reporting this error. node2 is a pingable hostname and its entry is in /etc/hosts on node1 (and vice versa).

I also see that after failover, when I run "clustat -x", I get the cluster status (in XML format) with:

<?xml version="1.0"?>
<clustat version="4.1.1">
  <groups>
    <group name="service:S" state="111" state_str="starting" flags="0" flags_str="" owner="node1" last_owner="node1" restarts="0" last_transition="1294676678" last_transition_str="xx"/>
  </groups>
</clustat>

I was expecting last_owner to correspond to node2 (because that is the node which was running service S and has failed), which would indicate that the service is failing over FROM node2.

Is there a way that a node in the cluster (the node onto which a service is failing over) could determine from which node the given service is failing over? Any inputs would be greatly appreciated.

Thanks
Yours gratefully

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Error while manual fencing and output of clustat
Thanks Xavier. It resolved the error on fencing.

However, I am still grappling with the issue of finding the name of the failed cluster node, on the node to which the service from the failed node has failed over. I was using the output of "clustat -x -S <service name>" and parsing the XML to obtain the value of the last_owner field.

Any input on how to find out the name of the failed node, on the other cluster node on which the services from the failed node are starting?

Thanks

On Mon, Jan 10, 2011 at 6:58 PM, Xavier Montagutelli xavier.montagute...@unilim.fr wrote:

Hello Parvez, On Monday 10 January 2011 09:51:14 Parvez Shaikh wrote: Dear experts, I have two node cluster(node1 and node2), and manual fencing is configured. [...] agent fence_manual reports: failed: fence_manual no node name

I am not an expert, but could you show us your cluster.conf file? You need to give a nodename attribute to the fence_manual agent somewhere; the error message makes me think it's missing. For example:

<fencedevices>
  <fencedevice agent="fence_manual" name="my_fence_manual"/>
</fencedevices>
...
<clusternode name="node2" ...>
  <fence>
    <method name="1">
      <device name="my_fence_manual" nodename="node2"/>
    </method>
  </fence>
</clusternode>

-- Xavier Montagutelli Tel : +33 (0)5 55 45 77 20 Service Commun Informatique Fax : +33 (0)5 55 45 75 95 Universite de Limoges 123, avenue Albert Thomas 87060 Limoges cedex

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] Determining failed node on another node of cluster during failover
Hi all,

Taking this question from another thread, here is the challenge I am facing. The cluster configuration is simple: node1, node2, node3 and node4 are part of the cluster, in an unrestricted, unordered failover domain with an active-active NxN configuration. So node2 can receive services from node1, node3 or node4 when any of those nodes fails (e.g. power failure).

In that event I want to find out which node has failed over to node2. I was invoking "clustat -x -S <service name>" on node2 in my custom agent and parsing the last_owner field to obtain the name of the node on which the service was previously running. This does not seem to work if I shut down a node (but it works if I migrate the service from one node to another using clusvcadm).

Is there any way I can find out which node has failed, during failover of a service onto a standby node? Is there any tool I might have missed, or some command I can send to ccsd to get this information?

Thanks

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Determining failed node on another node of clusterduring failover
Hi Is monitoring package part of RHCS? What is name of this component? Is there any other mechanism which doesn't require to parse log/messages to determine which node has left the cluster on stand by node before failover is complee? Thanks On Wed, Jan 12, 2011 at 2:58 PM, Kit Gerrits kitgerr...@gmail.com wrote: Hello, If you want to find out which cluster node has failed, you could either check /var/log/messages and see which member has left the cluster, or you can set up monitoring to check if your servers are all in good shape. If you are running a cluster, I would suggest also setting up monitoring. The monitoring package can then notify you if any cluster member fails. Regards, Kit -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Parvez Shaikh Sent: woensdag 12 januari 2011 7:04 To: linux clustering Subject: [Linux-cluster] Determining failed node on another node of clusterduring failover Hi all, Taking this question from another thread, here is a challenge that I am facing - Following is simple cluster configuration - Node 1, node 2, node 3, and node4 are part of cluster, its unrestricted unordered fail-over domain with active - active nxn configuration So a node 2 can get services from node1, node3 or node4 when any of these(1,3,4) node fails(e.g. power failure). In that event I want to find out which of the node has failed over node2, I was invoking clustat -x -S service name on node2 in my custom agent and was parsing for last_owner field to obtain name of node on which service was previously running. This however doesn't seem to be working in case if I shutdown node(but works if I migrate service from one node to another using clusvcadm) Is there anyway that I can find out which node has failed during failover of service on a standby node? Any tool which I might have missed or some command which I can send to ccsd to get this information Thanks -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Determining failed node on another node of clusterduring failover
Hi, I have been using clustat command. clustat -x -s servicename to get following XML file - ?xml version=1.0? clustat version=4.1.1 groups group name=service:service_on_node1 state=112 state_str=started flags=0 flags_str= owner=node1 last_owner=none restarts=0 last_transition=1294752663 last_transition_str=Tue Jan 11 19:01:03 2011/ /groups /clustat I was under impression that last_owner field in the above XML file should give me node name where service was last running. I was parsing this XML file to obtain this information. Note that, this holds true if you migrate or relocate service from one node to another using clusvcadm or from conga or system-config-luster BUT if node is shutdown and service relocate to another node, last_owner is either 'none' or same as current node on which service is relocated. Parsing var/messages/log is easy but not optimal solution, it will need greping entire log file for some specific message where failed node name is appearing in clumgr messages. On Thu, Jan 13, 2011 at 4:35 AM, Kit Gerrits kitgerr...@gmail.com wrote: Hello, The Clustering software itself monitors nodes and devices in use by cluster services, but logs to /var/log/messages. A quick overview is presented by the 'clustat' command. Monitoring tools are freely available for any platform. Basic monitoring in Linux is available with Big Brother, Cacti, OpenNMS or Nagios (in order of increasing complexity). If you're bound to windows, maybe try ServersCheck . Parsing logs can be trivial, once you know how. What do you want to know and when do you want to know it? Have you looked at 'clustat' and 'cman_tool'? Regards, Kit -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Parvez Shaikh Sent: woensdag 12 januari 2011 11:01 To: linux clustering Subject: Re: [Linux-cluster] Determining failed node on another node of clusterduring failover Hi Is monitoring package part of RHCS? What is name of this component? Is there any other mechanism which doesn't require to parse log/messages to determine which node has left the cluster on stand by node before failover is complee? Thanks On Wed, Jan 12, 2011 at 2:58 PM, Kit Gerrits kitgerr...@gmail.com wrote: Hello, If you want to find out which cluster node has failed, you could either check /var/log/messages and see which member has left the cluster, or you can set up monitoring to check if your servers are all in good shape. If you are running a cluster, I would suggest also setting up monitoring. The monitoring package can then notify you if any cluster member fails. Regards, Kit -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Parvez Shaikh Sent: woensdag 12 januari 2011 7:04 To: linux clustering Subject: [Linux-cluster] Determining failed node on another node of clusterduring failover Hi all, Taking this question from another thread, here is a challenge that I am facing - Following is simple cluster configuration - Node 1, node 2, node 3, and node4 are part of cluster, its unrestricted unordered fail-over domain with active - active nxn configuration So a node 2 can get services from node1, node3 or node4 when any of these(1,3,4) node fails(e.g. power failure). In that event I want to find out which of the node has failed over node2, I was invoking clustat -x -S service name on node2 in my custom agent and was parsing for last_owner field to obtain name of node on which service was previously running. 
This however doesn't seem to be working in case if I shutdown node(but works if I migrate service from one node to another using clusvcadm) Is there anyway that I can find out which node has failed during failover of service on a standby node? Any tool which I might have missed or some command which I can send to ccsd to get this information Thanks -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Determining failed node on another node of clusterduring failover
Hi Any idea on how to get name of failed node using available cluster tools or commands? I have tried clustat but it seems to be producing unexpected output. I will have to obtain this information on target host/node; to which service is relocating as a part of failover. Thanks in advance Gratefully yours On 1/13/11, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi, I have been using clustat command. clustat -x -s servicename to get following XML file - ?xml version=1.0? clustat version=4.1.1 groups group name=service:service_on_node1 state=112 state_str=started flags=0 flags_str= owner=node1 last_owner=none restarts=0 last_transition=1294752663 last_transition_str=Tue Jan 11 19:01:03 2011/ /groups /clustat I was under impression that last_owner field in the above XML file should give me node name where service was last running. I was parsing this XML file to obtain this information. Note that, this holds true if you migrate or relocate service from one node to another using clusvcadm or from conga or system-config-luster BUT if node is shutdown and service relocate to another node, last_owner is either 'none' or same as current node on which service is relocated. Parsing var/messages/log is easy but not optimal solution, it will need greping entire log file for some specific message where failed node name is appearing in clumgr messages. On Thu, Jan 13, 2011 at 4:35 AM, Kit Gerrits kitgerr...@gmail.com wrote: Hello, The Clustering software itself monitors nodes and devices in use by cluster services, but logs to /var/log/messages. A quick overview is presented by the 'clustat' command. Monitoring tools are freely available for any platform. Basic monitoring in Linux is available with Big Brother, Cacti, OpenNMS or Nagios (in order of increasing complexity). If you're bound to windows, maybe try ServersCheck . Parsing logs can be trivial, once you know how. What do you want to know and when do you want to know it? Have you looked at 'clustat' and 'cman_tool'? Regards, Kit -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Parvez Shaikh Sent: woensdag 12 januari 2011 11:01 To: linux clustering Subject: Re: [Linux-cluster] Determining failed node on another node of clusterduring failover Hi Is monitoring package part of RHCS? What is name of this component? Is there any other mechanism which doesn't require to parse log/messages to determine which node has left the cluster on stand by node before failover is complee? Thanks On Wed, Jan 12, 2011 at 2:58 PM, Kit Gerrits kitgerr...@gmail.com wrote: Hello, If you want to find out which cluster node has failed, you could either check /var/log/messages and see which member has left the cluster, or you can set up monitoring to check if your servers are all in good shape. If you are running a cluster, I would suggest also setting up monitoring. The monitoring package can then notify you if any cluster member fails. 
Regards, Kit -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Parvez Shaikh Sent: woensdag 12 januari 2011 7:04 To: linux clustering Subject: [Linux-cluster] Determining failed node on another node of clusterduring failover Hi all, Taking this question from another thread, here is a challenge that I am facing - Following is simple cluster configuration - Node 1, node 2, node 3, and node4 are part of cluster, its unrestricted unordered fail-over domain with active - active nxn configuration So a node 2 can get services from node1, node3 or node4 when any of these(1,3,4) node fails(e.g. power failure). In that event I want to find out which of the node has failed over node2, I was invoking clustat -x -S service name on node2 in my custom agent and was parsing for last_owner field to obtain name of node on which service was previously running. This however doesn't seem to be working in case if I shutdown node(but works if I migrate service from one node to another using clusvcadm) Is there anyway that I can find out which node has failed during failover of service on a standby node? Any tool which I might have missed or some command which I can send to ccsd to get this information Thanks -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
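One alternative to parsing the clustat XML (whose last_owner value is evidently unreliable after a crash) is to ask cman directly which members it has lost. A sketch, with the caveat that the exact column layout of the output can differ slightly between releases:

# "cman_tool nodes" lists every configured node; the status column shows
# M for a live member and X for a node that has left or failed.
cman_tool nodes

# Print just the names of nodes cman no longer sees as members:
cman_tool nodes | awk '$2 == "X" {print $NF}'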
[Linux-cluster] Questions related to cluster quorum and fencing
Hi all, *Quorum - * The questions are bit theoretical, I have gone through documentation and man pages and have understood that, a cluster is quorate if a cluster or its partition has nodes, with votes equal to or more than expected_votes in cman section of cluster.conf file (with no requirement mandating use of quorum disk) So how does cluster being quorate or non-quorate affects functioning of a cluster or services? If cluster is non-quorate, does it indicate an alarming situation and why? If a cluster is composed of resource groups which including only IP resource and script resource monitoring my application server listening on IP resource(no shared disk or shared resource between cluster nodes), then is cluster being quorate (or non quorate) important for services and/or cluster? *Fencing - * Is fencing and cluster being quorate or non-quorate related? I tried one experiment, wherein I removed fencing for cluster nodes and shutdown one of the nodes in cluster. And I got message in /var/log/messages indicating fencing failed for node, and service was not failed over from that node. So is fencing mandatory even if there is no shared disk between two cluster nodes? Also is a cluster non-quorate in a time window when a node has failed and has not been fenced successfully? Yours gratefully -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
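As background for the questions above: rgmanager only starts, stops or relocates services while its partition is quorate, so losing quorum effectively stops service recovery even when no shared storage is involved. The current quorum arithmetic can be inspected at any time with the standard cman tooling:

# Votes, expected votes and whether the cluster is currently quorate
cman_tool status | egrep 'Nodes|Expected votes|Total votes|Quorum'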
[Linux-cluster] Running cluster tools using non-root user
Hi all Is it possible to run cluster tools like clustat or clusvcadm etc. using non-root user? If yes, to which groups this user should belong to? Otherwise can this be done using sudo(and sudoers) file. As of now I get following error on clustat - Could not connect to CMAN: Permission denied Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Running cluster tools using non-root user
I believe Pacemaker is not the same as RHCS - or do they share code? If they do, in which version of RHCS will this feature be available?

I need to enable a service, disable a service, and get status. I am using the CLI tools, and any scripting trick that helps me run clusvcadm and/or clustat would do. "su -c clusvcadm" requires entering a password; can this also be eliminated using sudoers?

Thanks

On Wed, Jan 26, 2011 at 3:22 PM, Andrew Beekhof and...@beekhof.net wrote: [Shameless plug] The next version of Pacemaker (1.1.6) will have this feature :-) The patches were merged from our devel branch about a week ago. [/Shameless plug] On Tue, Jan 25, 2011 at 10:39 AM, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi all Is it possible to run cluster tools like clustat or clusvcadm etc. using non-root user? If yes, to which groups this user should belong to? Otherwise can this be done using sudo(and sudoers) file. As of now I get following error on clustat - Could not connect to CMAN: Permission denied Thanks, Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
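On RHEL 5 the usual workaround is sudo rather than group membership, since clustat and clusvcadm talk to cman/rgmanager in a way that normally only root may do. A minimal sudoers sketch (edit with visudo; the "clusterop" user name is illustrative):

# /etc/sudoers
clusterop ALL = (root) NOPASSWD: /usr/sbin/clustat, /usr/sbin/clusvcadm

# then, as that user:
sudo /usr/sbin/clustat
sudo /usr/sbin/clusvcadm -e <service name>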
[Linux-cluster] Tuning red hat cluster
Hi, As per my understanding rgmanager invokes 'status' on resource groups periodically to determine if these resources are up or down. I observed that this period is of around 30 seconds. Is it possible to tune or adjust this period for individual services or resource groups? Thanks -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
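The roughly 30-second default comes from the status action interval declared in each resource agent's metadata. If I recall correctly, later RHEL 5 rgmanager releases let you override it per resource by nesting an action element in cluster.conf; a hedged sketch (the 10-second interval and the resource shown are illustrative):

<service autostart="1" name="service1">
  <ip address="192.168.13.15" monitor_link="1">
    <!-- check this resource every 10 seconds instead of the agent default -->
    <action name="status" depth="*" interval="10"/>
  </ip>
</service>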
[Linux-cluster] SNMP support with IBM Blade Center Fence Agent
Hi all,

I have a question related to fence agents and SNMP alarms. A fence agent can fail to fence the failed node for various reasons; e.g. with my bladecenter fencing agent, I sometimes get a message saying bladecenter fencing failed because of a timeout, or because the fence device IP address/user credentials are incorrect. In such a situation, is it possible to generate an SNMP trap?

My cluster config file looks like the snippet below. In my case, if bladecenter fencing fails, manual fencing kicks in and requires a user to run fence_ack_manual; for this the user must at least be notified via SNMP (or some other mechanism?) to intervene.

<clusternodes>
  <clusternode name="blade2" nodeid="2" votes="1">
    <fence>
      <method name="1">
        <device blade="2" name="BladeCenterFencing"/>
      </method>
      <method name="2">
        <device name="ManualFencing" nodename="blade2"/>
      </method>
    </fence>
  </clusternode>
  <clusternode name="blade1" nodeid="1" votes="1">
    <fence>
      <method name="1">
        <device blade="1" name="BladeCenterFencing"/>
      </method>
      <method name="2">
        <device name="ManualFencing" nodename="blade1"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
  <fencedevice agent="fence_bladecenter" ipaddr="blade-mm.com" login="USERID" name="BladeCenterFencing" passwd="PASSW0RD"/>
  <fencedevice agent="fence_manual" name="ManualFencing"/>
</fencedevices>

Thanks,
Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] SNMP support with IBM Blade Center Fence Agent
Hi Ryan, Thank you for response. Does it mean there is no way to intimate administrator about failure of fencing as of now? Let me give more information about my cluster - I have set of nodes in cluster with only IP resource being protected. I have two levels of fencing, first bladecenter fencing and second one is manual fencing. At times if machine is already down(either power failure or turned off abrupty); blade center fencing timesout and manual fencing happens. At this time, administrator is expected to run fence_ack_manual. Clearly this is not something which is desirable, as downtime of services is as long as administrator runs fence_ack_manual. What is recommended method to deal with blade center fencing failure in this situation? Do I have to add another level of fencing(between blade center and manual) which can fence automatically(not requiring manual interference)? Thanks On Mon, Feb 28, 2011 at 9:44 PM, Ryan O'Hara roh...@redhat.com wrote: On Mon, Feb 28, 2011 at 12:43:10PM +0530, Parvez Shaikh wrote: Hi all, I have a question related to fence agents and SNMP alarms. Fence Agent can fail to fence the failed node for various reason; e.g. with my bladecenter fencing agent, I sometimes get message saying bladecenter fencing failed because of timeout or fence device IP address/user credentials are incorrect. In such a situation is it possible to generate SNMP trap? This feature will be in RHEL6.1. There is a new project called 'foghorn' that creates SNMPv2 traps from dbus signals. git://git.fedorahosted.org/foghorn.git In RHEL6.1 (and the latest upstream release), certain cluster components will emit dbus signals when certain events occurs. This includes fencing. So when a node is fenced a dbus signal is generated by fenced. The foghorn service catches this signal and generated SNMPv2 trap. Note that foghorn runs as an AgentX subagent, so snmpd must be running as the master agentx. Ryan My cluster config file looks like below and in my case if bladecenter fencing fails, manual fencing kicks in and requires user to do fence_ack_manual, for this user must at least be notified via SNMP (or any other mechanism?) to intervene - clusternodes clusternode name=blade2 nodeid=2 votes=1 fence method name=1 device blade=2 name=BladeCenterFencing/ /method method name=2 device name=ManualFencing nodename=blade2/ /method /fence /clusternode clusternode name=blade1 nodeid=1 votes=1 fence method name=1 device blade=1 name=BladeCenterFencing/ /method method name=2 device name=ManualFencing nodename=blade1/ /method /fence /clusternode /clusternodes cman expected_votes=1 two_node=1/ fencedevices fencedevice agent=fence_bladecenter ipaddr=blade-mm.com login=USERID name=BladeCenterFencing passwd=PASSW0RD/ fencedevice agent=fence_manual name=ManualFencing/ /fencedevices Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] SNMP support with IBM Blade Center Fence Agent
Hi Lon, Thank you for reply. What I gathered from your response is to remove manual fencing at once. This will cause fence daemon to retry fence_bladecenter until the node is fenced. More likely the fenced will succeed in fencing the failed node(provided IP, user name and password for bladecenter management module are right); even if it times out for the first time. Am I right? I will try removing manual fencing and see how things go. If fencing is failing (permanently), you can still run: fence_ack_manual -e -n nodename By the way as per my understanding fence_ack_manual -n node name can be executed to acknowledge only manually fenced node(and not bladecenter fenced node), correct me if this understanding is wrong. So God forbid, if fence_bladecenter fails for some reason; we still have option to run fence_manual and then fence_ack_manual, so cluster is back to working. Thanks again and have great weekend ahead Yours truly, Parvez On Fri, Mar 4, 2011 at 10:45 PM, Lon Hohberger l...@redhat.com wrote: On Tue, Mar 01, 2011 at 06:50:18PM +0530, Parvez Shaikh wrote: Hi Ryan, Thank you for response. Does it mean there is no way to intimate administrator about failure of fencing as of now? Let me give more information about my cluster - I have set of nodes in cluster with only IP resource being protected. I have two levels of fencing, first bladecenter fencing and second one is manual fencing. If the problem you have with fence_bladecenter is intermittent - for example, if it fails 1/2 the time, fence_manual is going to *detract* from your cluster's ability to recover automatically. Ordinarily, if a fencing action fails, fenced will automatically retry the operation. When you configure fence_manual as a backup, this retry will *never* occur, meaning your cluster hangs. At times if machine is already down(either power failure or turned off abrupty); blade center fencing timesout and manual fencing happens. At this time, administrator is expected to run fence_ack_manual. Clearly this is not something which is desirable, as downtime of services is as long as administrator runs fence_ack_manual. What is recommended method to deal with blade center fencing failure in this situation? Do I have to add another level of fencing(between blade center and manual) which can fence automatically(not requiring manual interference)? Start with removing fence_manual. If fencing is failing (permanently), you can still run: fence_ack_manual -e -n nodename my bladecenter fencing agent, I sometimes get message saying bladecenter fencing failed because of timeout or fence device IP address/user credentials are incorrect. ^^ This is why I think fence_manual is, in your specific case, very likely hurting your availability. -- Lon Hohberger - Red Hat, Inc. -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] Two node cluster - a potential problem of node fencing each other?
Hi all,

I have a question pertaining to a two node cluster. I have RHEL 5.5 and the cluster software that ships with it; the cluster has exactly two nodes.

Consider a situation where both nodes of the cluster are up and have a reliable connection to the fencing device (e.g. a power switch or any other power fencing device), but the heartbeat link between the two nodes goes down. Each node finds the other node "down" (because its heartbeat IP becomes unreachable) and tries to fence it. Is this situation possible? If so, can the two nodes end up fencing (i.e. shutting down or rebooting) each other? Is there any way out of this situation?

Thanks
Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Two node cluster - a potential problem of node fencing each other?
redundant network link - i trust you were referring to ethernet bonding. On Sun, Mar 13, 2011 at 1:19 PM, Ian Hayes cthulhucall...@gmail.com wrote: On Sat, Mar 12, 2011 at 11:19 PM, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi all, I have a question pertaining to two node cluster, I have RHEL 5.5 and cluster along with it which at least should have two nodes. In a situation where both nodes of the cluster are up, and have reliable connection to fencing device (e.g. power switch OR any other power fencing device) and heartbeat link between two nodes goes down. Each node finds another node is down (because heartbeat IP becomes unreachable) and tries to fence each other. Is this situation possible? If so, can two nodes possibly fence (in short shutdown or reboot) each other? Is there anyway out of this situation? This is a fairly common problem called split brain. The two nodes will go into a shootout, fencing each other. There are a few ways to prevent this, such as redundant network links and the use of quorum disks. -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
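For the heartbeat path specifically, bonding two NICs is the usual RHEL 5 approach, so that a single cable or switch port failure does not start a fence race; quorum disks address the same problem from a different angle. A sketch of the relevant files (the addresses reuse the earlier example, and the bonding mode shown is just one common choice):

# /etc/modprobe.conf
alias bond0 bonding

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.11
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0  (and similarly ifcfg-eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none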
[Linux-cluster] Clustat exit code for service status
Hi all,

The command "clustat -s <service name>" gives the status of a service. If the service is started (i.e. running on some node), the exit code of this command is 0; if the service is not running, its exit code is non-zero (I found it to be 119). Is this correct, and will it stay that way in subsequent cluster versions as well?

The reason I am asking is that I want to use this command in a shell script to report the status of a service:

clustat -s "service name"
if [ $? -eq 0 ]; then
    echo "service is up"
else
    echo "service is not up"
fi

Thanks
Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] Node without fencing method, is it possible to failover from such a node?
Hi all,

I have a Red Hat cluster on an IBM BladeCenter, with the blades being my cluster nodes and fence_bladecenter as the fencing agent. I have a couple of resources: an IP resource that activates or deactivates a floating IP, and a script resource that starts my server listening on this floating IP. It is a stateless server with no shared storage requirements or any other shared resources that would require me to use a fancy fencing device.

Everything was working fine: when I disable the NIC carrying the heartbeat IP or the floating IP, or pull the power plug, or reboot/shutdown/halt one node, the IP floats to the other node and the script starts my server, which happily listens on this IP.

Life was good until now, when I am required to support a cluster of nodes which are not hosted in a BladeCenter but are plain vanilla servers. Everything else stays the same, but bladecenter fencing can't be used, and as per my understanding Red Hat cluster requires me to use some fence method. My first choice is power fencing, which is the only kind of fencing that suits my application's needs.

But is there any way (I know it is not the best or recommended approach, but I can live with it) to get away without fencing and let the service fail over even when no fence devices are configured for the nodes?

Thanks,
Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Node without fencing method, is it possible to failover from such a node?
Guys, thanks a lot for your input. I have a doubt related to IPMI fencing. In IPMI fencing, we specify network address of IPMI controller. This is out of band network address as well as IPMI board must have power supply different from cluster node. Am I right? Thanks in advance for your help. Gratefully, Parvez On Thu, Mar 17, 2011 at 10:19 PM, Rajagopal Swaminathan raju.rajs...@gmail.com wrote: Greetings, On 3/17/11, Digimer li...@alteeve.com wrote: On 03/17/2011 01:25 AM, Parvez Shaikh wrote: Hi all, Life was good until I am now required to support cluster of nodes which are not hosted in bladecenter but any vanilla nodes. Suggestions from somebody who stupidly yapped I will support manual fencing and burnt his finger (Who? Oh! that was me): 1. Don't commit support for manual fencing 2. Don't support manual fencing. If you are in India, APC Fence PDU is available for around 30-35K INR (about a year back or so). If someone is ready to invest say 500K INR for HA hardware such as two servers etc., they might as well add 35k. OTOH, if those nodes are rack mounted servers (Unlike entry level server which does not have management port), the cost of the Powerfence strip will be a different issue when it comes to justifying, etc. within a corporate/Enterprise environment. Too much paperwork, I agree. But It will give a more robust infrastructure which will help us in using various tools like Zabbix, Spacewalk, snmp (I think fence strips have some SNMP - please check) etc. in the future. Life will be good then. With warm regards, Rajagopal -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed
Hi all,

I am using RHCS on an IBM BladeCenter with bladecenter fencing. I pulled a blade out of its BladeCenter chassis slot and was hoping that failover would occur. However, when I did so, I got the following messages:

fenced[10240]: agent "fence_bladecenter" reports: Failed: Unable to obtain correct plug status or plug is not available
fenced[10240]: fence blade1 failed

Is it supported that, if I pull a blade out of its slot, failover occurs without manual intervention? If so, which fencing must I use?

Thanks,
Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon
Hi Sufyan Does your status function return 0 or 1 if database is up or down respectively (i.e. have you tested it works outside script_db.sh) when run as root? On Thu, May 12, 2011 at 12:52 PM, Sufyan Khan sufyan.k...@its.ws wrote: First of all thanks for you quick response. Secondly please note: the working cluster.conf file is attached here, the previous file was not correct. Yes the orainfra is the user name. Any othere clue please. sufyan -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Jankowski, Chris Sent: Thursday, May 12, 2011 9:44 AM To: linux clustering Subject: Re: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Sufyan, What username does the instance of Oracle DB run as? Is this orainfra or some other username? The scripts assume a user named orainfra. If you use a different username then you need to modify the scripts accordingly. Regards, Chris Jankowski -Original Message- From: linux-cluster-boun...@redhat.com [mailto:linux-cluster-boun...@redhat.com] On Behalf Of Sufyan Khan Sent: Thursday, 12 May 2011 16:27 To: 'linux clustering' Subject: [Linux-cluster] oracle DB is not failing over on killin PMON deamon Dear All I need to setup HA cluster for mu oracle dabase. I have setup two node cluster using System-Config-Cluster .. on RHEL 5.5 I created RG a shared file system /emc01 ext3 , shared IP and DB script to monitor the DB. My cluster starts perfectly and fail over on shutting down primary node, also stopping shared IP fails node to failover node. But on kill PMON , or LSNR process the node does not fails and keep showing the status services running on primary node. I JUST NEED TO KNOW WHERE IS THE PROBLEM. ATTACHED IS DB scripts and cluster.conf file. Thanks in advance for help. Sufyan -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
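For context, rgmanager only recovers a script resource when the script's own status action returns non-zero, so the check has to look at the database processes themselves. A minimal sketch of such a status check (the orainfra user comes from this thread; the ORCL SID is a placeholder, adjust to the real environment):

#!/bin/sh
# status check: exit 0 if the instance's PMON process is running, 1 otherwise
ORACLE_SID=ORCL
if ps -u orainfra -o args= | grep -q "^ora_pmon_${ORACLE_SID}"; then
    exit 0      # instance looks up
else
    exit 1      # PMON gone -> rgmanager should recover/relocate the service
fi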
Re: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed
Hi,

Has anyone used missing_as_off in the cluster.conf file? Any help on where to put this option in cluster.conf would be greatly appreciated.

Thanks,
Parvez

On Mon, May 2, 2011 at 6:49 PM, Parvez Shaikh parvez.h.sha...@gmail.com wrote:

Hi Marek, I tried the option missing_as_off=1 and now I get another error -

fenced[18433]: fence node5.sscdomain failed
fenced[18433]: fencing node node5.sscdomain

The snippet of the cluster.conf file is:

  <clusternode name="node5" nodeid="5" votes="1">
    <fence>
      <method name="1">
        <device blade="5" name="BladeCenterFencing" missing_as_off="1"/>
      </method>
    </fence>
  </clusternode>
</clusternodes>
<fencedevices>
  <fencedevice agent="fence_bladecenter" ipaddr="blade-mm-1" login="USERID" name="BladeCenterFencing" passwd="PASSW0RD"/>
</fencedevices>

Did I miss something? Thanks Parvez

On Mon, May 2, 2011 at 1:03 PM, Marek Grac mg...@redhat.com wrote: Hi, On 04/29/2011 10:15 AM, Parvez Shaikh wrote: Hi Marek, Can we give this option in cluster.conf file for the bladecenter fencing device or method? For cluster.conf you should add ... missing_as_off=1 ... to the fence configuration. For IPMI fencing, is there a similar option? There is no such method for IPMI. m,

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Plugged out blade from bladecenter chassis - fence_bladecenter failed
Hi, Thanks Dominic. Does fence_bladecenter always reboot the blade as part of fencing? I have seen it turn the blade off by default. Though 'fence_bladecenter --missing-as-off .. -o off' returns a correct result when run from the command line, fencing fails through fenced. I am using RHEL 5.5 ES and the fence_bladecenter version reports the following -

fence_bladecenter -V
2.0.115 (built Tue Dec 22 10:05:55 EST 2009)
Copyright (C) Red Hat, Inc. 2004 All rights reserved.

Anyway, thanks for the bugzilla reference. Regards

On Sun, Jun 19, 2011 at 10:14 PM, dOminic share2...@gmail.com wrote: There is a bug related to missing_as_off - https://bugzilla.redhat.com/show_bug.cgi?id=689851 - the fix is expected in rhel5u7. regards,

On Wed, Apr 27, 2011 at 1:59 PM, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi all, I am using RHCS on IBM BladeCenter with bladecenter fencing. I pulled a blade out of its bladecenter chassis slot and was hoping that failover would occur. However when I did so, I got the following messages -

fenced[10240]: agent fence_bladecenter reports: Failed: Unable to obtain correct plug status or plug is not available
fenced[10240]: fence blade1 failed

Is it supported that if I pull a blade out of its slot, failover occurs without manual intervention? If so, which fencing must I use? Thanks, Parvez

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
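For readers following along, a command-line sketch of how the fence action can be exercised outside fenced, reusing the management module address, login and blade number quoted earlier in this thread; the exact option spelling should be checked against 'fence_bladecenter -h' on the installed version:

    # Query the plug status of blade 5 through the management module
    fence_bladecenter -a blade-mm-1 -l USERID -p PASSW0RD -n 5 -o status

    # Power the blade off, treating a physically missing blade as already off
    fence_bladecenter -a blade-mm-1 -l USERID -p PASSW0RD -n 5 --missing-as-off -o off

If the same operation succeeds by hand but fails when fenced runs it, comparing the attributes fenced passes from cluster.conf against the working command line is usually the quickest way to spot the difference.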
[Linux-cluster] fence_ipmilan fails to reboot
Hi all, I am on RHEL 5.5, and I have two rack-mounted servers with IPMI configured. When I run the command from the prompt to reboot the server through fence_ipmilan, it shuts down the server fine but fails to power it back on:

# fence_ipmilan -a <IPMI IP Address> -l admin -p password -o reboot
Rebooting machine @ IPMI:<IPMI IP Address>...Failed

But I can power it on or off just fine:

# fence_ipmilan -a <IPMI IP Address> -l admin -p password -o on
Powering on machine @ IPMI:<IPMI IP Address>...Done

Due to this my fencing is failing and failover is not happening. I have questions around this -
1. Can we provide an action (off or reboot) in cluster.conf for IPMI lan fencing?
2. Is there anything wrong in my configuration? The cluster.conf file is pasted below.
3. Is this a known issue which is fixed in newer versions?

Here is how my cluster.conf looks -

<?xml version="1.0"?>
<cluster config_version="4" name="Cluster">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="blade1.domain" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device lanplus="" name="IPMI_1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="blade2.domain" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device lanplus="" name="IPMI_2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" auth="none" ipaddr="IPMI 1 IP Address" login="admin" name="IPMI_1" passwd="password"/>
    <fencedevice agent="fence_ipmilan" auth="none" ipaddr="IPMI 2 IP Address" login="admin" name="IPMI_2" passwd="password"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="FailoveDomain" ordered="1" restricted="1">
        <failoverdomainnode name="blade1.domain" priority="2"/>
        <failoverdomainnode name="blade2.domain" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources/>
    <service autostart="1" name="service" recovery="relocate"/>
  </rm>
</cluster>

Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
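When a composite reboot fails, a quick way to narrow it down is to exercise the two halves separately and cross-check what the BMC itself reports. A sketch, assuming the same credentials as above:

    # Check the current power state via IPMI
    fence_ipmilan -a <IPMI IP Address> -l admin -p password -o status

    # Exercise the two halves of a reboot separately
    fence_ipmilan -a <IPMI IP Address> -l admin -p password -o off
    fence_ipmilan -a <IPMI IP Address> -l admin -p password -o on

    # Cross-check with ipmitool, if installed (drop -I lanplus if the BMC only speaks IPMI v1.5)
    ipmitool -I lanplus -H <IPMI IP Address> -U admin -P password chassis power status

If "off" works but the machine never comes back on during a reboot, a delay between the two operations (the power_wait attribute suggested in the follow-up thread) is the usual first thing to try.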
Re: [Linux-cluster] fence_ipmilan fails to reboot - SOLVED
Hi all, Thanks for your responses. After providing auth="password", fencing succeeded:

<fencedevice agent="fence_ipmilan" auth="password" ipaddr="IP" login="admin" name="IPMI_1" passwd="password"/>

Thanks, Parvez

On Fri, Jul 1, 2011 at 2:33 PM, שלום קלמר skle...@gmail.com wrote: Hi. I think you need to add power_wait=10 and lanplus=1. Try this line:

<fencedevice agent="fence_ipmilan" power_wait="10" ipaddr="xx.xx.xx.xx" lanplus="1" login="xxxt" name="node1_ilo" passwd="yyy"/>

Regards Shalom.

On Thu, Jun 30, 2011 at 1:03 PM, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi all, I am on RHEL 5.5, and I have two rack-mounted servers with IPMI configured. When I run the command from the prompt to reboot the server through fence_ipmilan, it shuts down the server fine but fails to power it back on:

# fence_ipmilan -a <IPMI IP Address> -l admin -p password -o reboot
Rebooting machine @ IPMI:<IPMI IP Address>...Failed

But I can power it on or off just fine:

# fence_ipmilan -a <IPMI IP Address> -l admin -p password -o on
Powering on machine @ IPMI:<IPMI IP Address>...Done

Due to this my fencing is failing and failover is not happening. I have questions around this -
1. Can we provide an action (off or reboot) in cluster.conf for IPMI lan fencing?
2. Is there anything wrong in my configuration? The cluster.conf file is pasted below.
3. Is this a known issue which is fixed in newer versions?

Here is how my cluster.conf looks -

<?xml version="1.0"?>
<cluster config_version="4" name="Cluster">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="blade1.domain" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device lanplus="" name="IPMI_1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="blade2.domain" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device lanplus="" name="IPMI_2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" auth="none" ipaddr="IPMI 1 IP Address" login="admin" name="IPMI_1" passwd="password"/>
    <fencedevice agent="fence_ipmilan" auth="none" ipaddr="IPMI 2 IP Address" login="admin" name="IPMI_2" passwd="password"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="FailoveDomain" ordered="1" restricted="1">
        <failoverdomainnode name="blade1.domain" priority="2"/>
        <failoverdomainnode name="blade2.domain" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources/>
    <service autostart="1" name="service" recovery="relocate"/>
  </rm>
</cluster>

Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
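Putting the two suggestions from this thread together, a combined device definition might look roughly like the following; this is a sketch only, and support for auth, lanplus and power_wait should be verified against the fence_ipmilan shipped with the installed release:

    <fencedevice agent="fence_ipmilan" name="IPMI_1" ipaddr="IPMI 1 IP Address"
                 login="admin" passwd="password" auth="password"
                 lanplus="1" power_wait="10"/>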
[Linux-cluster] Configuring failover time with Red Hat Cluster
Hi all, I was trying to find out how much time it takes for RHCS to detect a failure and recover from it. I found the link - http://www.redhat.com/whitepapers/rha/RHA_ClusterSuiteWPPDF.pdf It says that the network polling interval is 2 seconds and 6 retries are attempted before declaring a node as failed. I want to know whether we can tune or configure this, say instead of 6 retries I want only 3 retries. Can we also reduce the network polling time from 2 seconds to, say, 1 second (can it be less than 1 second, although I think that would consume more CPU)? Also I have a script resource and I see it invoked with the status argument every 30 seconds; can we configure that as well? Failover also involves fencing; any pointers on how we can control / configure fencing time would also be useful. I use bladecenter fencing, IPMI fencing as well as UCS fencing. Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Configuring failover time with Red Hat Cluster
Hello Christine, Thanks for the link listing the various documents. I have RHCS running on RHEL 5.5 and it has been working fine. However I would greatly appreciate some document or pointers that would help me estimate the failover time or adjust it, if that is possible. I have been through the Administration Guide and could not find how I can adjust it. Thanks, Parvez

On Tue, Jul 5, 2011 at 5:58 PM, Christine Caulfield ccaul...@redhat.com wrote: That's a *very* old document. It's from 2003 and refers to RHEL 2.1 .. which I sincerely hope you weren't planning to implement. Before you do anything more I recommend you read the documentation for the actual version of clustering you are going to install: https://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/ Chrissie

On 05/07/11 12:32, Parvez Shaikh wrote: Hi all, I was trying to find out how much time it takes for RHCS to detect a failure and recover from it. I found the link - http://www.redhat.com/whitepapers/rha/RHA_ClusterSuiteWPPDF.pdf It says that the network polling interval is 2 seconds and 6 retries are attempted before declaring a node as failed. I want to know whether we can tune or configure this, say instead of 6 retries I want only 3 retries. Can we also reduce the network polling time from 2 seconds to, say, 1 second (can it be less than 1 second, although I think that would consume more CPU)? Also I have a script resource and I see it invoked with the status argument every 30 seconds; can we configure that as well? Failover also involves fencing; any pointers on how we can control / configure fencing time would also be useful. I use bladecenter fencing, IPMI fencing as well as UCS fencing. Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
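For readers looking for the actual knobs, the tunables most often mentioned for this on cman/rgmanager clusters are the totem token timeout (node failure detection) and per-resource status intervals (how often rgmanager polls a resource). A rough, hedged sketch of what they look like in cluster.conf; attribute support varies between releases, so the cluster documentation for the installed version should be checked first:

    <!-- Time (in ms) the membership layer waits before declaring a node dead -->
    <totem token="21000"/>

    <!-- Override the default 30s status poll for one resource instance -->
    <service autostart="1" domain="mydomain" name="mgmt" recovery="relocate">
      <script ref="myHaAgent">
        <action name="status" interval="10"/>
      </script>
    </service>

Fencing time itself is mostly a property of the fence hardware (how long the management module or BMC takes to confirm the power action), so it is usually measured rather than configured.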
[Linux-cluster] $OCF_ERR_CONFIGURED - recovers service on another cluster node
Hi guys, I am using the Red Hat Cluster Suite which comes with RHEL 5.5 - cman_tool version 6.2.0 config xxx. Now I have a script resource in which I return $OCF_ERR_CONFIGURED in case of a fatal, irrecoverable error, hoping that my service would not start on another cluster node. But I see that the cluster relocates it to another cluster node and attempts to start it. I referred to the error code documentation at http://www.linux-ha.org/doc/dev-guides/_return_codes.html Is there any return code that makes RHCS give up on recovering the service? Thanks -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] $OCF_ERR_CONFIGURED - recovers service on another cluster node
Hi, The requirement is to fail the service, not to fail it over to another node, in case of certain issues. These would be detected by my service (automatically/programmatically) while it starts; if it doesn't find a prerequisite it should fail. To do so, which error code should I use in my start function? Thanks

On Fri, Jan 27, 2012 at 3:18 PM, emmanuel segura emi2f...@gmail.com wrote: The first thing you can do is stop your cluster service, go to the node where you found the problem and use rg_test:

rg_test test /etc/cluster/cluster.conf start service put_the_name_of_the_service

That way you can see what is wrong.

2012/1/27 Parvez Shaikh parvez.h.sha...@gmail.com Hi guys, I am using the Red Hat Cluster Suite which comes with RHEL 5.5 - cman_tool version 6.2.0 config xxx. Now I have a script resource in which I return $OCF_ERR_CONFIGURED in case of a fatal, irrecoverable error, hoping that my service would not start on another cluster node. But I see that the cluster relocates it to another cluster node and attempts to start it. I referred to the error code documentation at http://www.linux-ha.org/doc/dev-guides/_return_codes.html Is there any return code that makes RHCS give up on recovering the service? Thanks -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster

-- esta es mi vida e me la vivo hasta que dios quiera -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
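A hedged sketch of the start-path idea being discussed; the exact code rgmanager treats as "do not try elsewhere" differs between rgmanager versions, so this only illustrates the structure, not a guaranteed behaviour:

    #!/bin/sh
    # Hypothetical start handler for a script resource.
    # OCF_ERR_INSTALLED (5) and OCF_ERR_CONFIGURED (6) are the usual
    # "fatal, not worth retrying elsewhere" codes in OCF terms, but how
    # rgmanager reacts to them should be verified on the installed release.
    OCF_ERR_INSTALLED=5
    OCF_ERR_CONFIGURED=6

    case "$1" in
      start)
        if [ ! -f /etc/myapp/prereq.conf ]; then   # hypothetical prerequisite
          echo "prerequisite missing, refusing to start" >&2
          exit $OCF_ERR_CONFIGURED
        fi
        # ... normal startup ...
        exit 0
        ;;
    esac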
[Linux-cluster] [TOTEM] The consensus timeout expired.
Hi all, I have a cluster with two blades in an IBM BladeCenter. The following error appears when I start the cman service, and it keeps repeating in /var/log/messages -

openais[10770]: [TOTEM] The consensus timeout expired.
openais[10770]: [TOTEM] entering GATHER state from 3.

The heartbeat IP is available on the blade and the link to blade2 is also fine. The cluster on blade2 is not running. Services like iptables and portmap are also down. Has anyone encountered such an error and resolved it? I am using RHEL 5.5, cman_tool version 6.2.0 config 1. Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
[Linux-cluster] Multicast address by CMAN
Hi all, As per my understanding, CMAN uses the cluster name in my cluster.conf to internally generate the multicast address. Having two clusters with the same name on a given network leads to issues and is undesirable. I want to know whether there is any way to find out if a multicast address is already in use by some other cluster, so as to avoid using a name that generates the same multicast IP, or, for that matter, configuring the same multicast IP in cluster.conf. Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
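Two hedged pointers that fit this question (the exact output format varies with the cman release): the address cman derived from the cluster name can be read back from a running node, and an explicit multicast address can be pinned in cluster.conf so the name no longer matters.

    # Show the multicast address the running cluster is actually using
    cman_tool status | grep -i multicast

    <!-- Or pin an explicit address in cluster.conf so two clusters cannot collide -->
    <cman expected_votes="1" two_node="1">
      <multicast addr="239.192.100.1"/>
    </cman>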
[Linux-cluster] clurgmgrd : notice relocating a service to better node
Hi, When I start or enable a service (that was previously disabled) on a cluster node, I see a message saying clurgmgrd is relocating the service to a better node. I do not understand why. I can relocate the service back to the node where I see the above message and it runs fine there. What could "better node" mean? Better in what sense, as the hardware and software configurations of both cluster nodes are the same? What situation could possibly trigger this? Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] clurgmgrd : notice relocating a service to better node
Hi Digimer, cman_tool version: 6.2.0 config 3. RPM versions - cman-2.0.115-34.el5, rgmanager-2.0.52-6.el5. I am on RHEL 5.5.

The configuration is like this - a cluster of 2 nodes, each node an IBM blade hosted in a chassis. The private network within the chassis is used for heartbeat across the cluster nodes; the cluster service consists of an IP resource and my own server, which listens on this IP resource.

cluster.conf file -

<?xml version="1.0"?>
<cluster alias="PCluster" config_version="3" name="PCluster">
  <clusternodes>
    <clusternode name="my_blade2.my_domain" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device blade="2" missing_as_off="1" name="BladeCenterFencing"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="my_blade1.my_domain" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device blade="1" missing_as_off="1" name="BladeCenterFencing"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
  <fencedevices>
    <fencedevice agent="fence_bladecenter" ipaddr="X" login="USERID" name="BladeCenterFencing" passwd="X"/>
  </fencedevices>
  <rm>
    <resources>
      <script file="/localhome/parvez/my_ha" name="my_HaAgent"/>
      <ip address="192.168.11.171" monitor_link="1"/>
      <ip address="192.168.11.175" monitor_link="1"/>
      <ip address="192.168.11.176" monitor_link="1"/>
    </resources>
    <failoverdomains>
      <failoverdomain name="my_domain" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="my_blade2.my_domain" priority="2"/>
        <failoverdomainnode name="my_blade1.my_domain" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <service autostart="0" domain="my_domain" name="my_proc" recovery="relocate">
      <script ref="my_HaAgent"/>
      <ip ref="192.168.11.175"/>
    </service>
  </rm>
  <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="0"/>
</cluster>

On Wed, Apr 11, 2012 at 11:51 AM, Digimer li...@alteeve.ca wrote: On 04/11/2012 02:14 AM, Parvez Shaikh wrote: Hi, When I start or enable a service (that was previously disabled) on a cluster node, I see a message saying clurgmgrd is relocating the service to a better node. I do not understand why. I can relocate the service back to the node where I see the above message and it runs fine there. What could "better node" mean? Better in what sense, as the hardware and software configurations of both cluster nodes are the same? What situation could possibly trigger this? Thanks, Parvez

What version of the cluster software are you using? What is the configuration? To get help, you need to share more details. :) -- Digimer Papers and Projects: https://alteeve.com -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
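One observation worth checking against this configuration: the failover domain is ordered with my_blade1 at priority 1, so enabling the service while it currently sits on the lower-priority member gives rgmanager a "better" (higher-priority) member to move it to. The target member can also be made explicit when enabling or relocating, for example:

    # Enable the service on a specific member instead of letting rgmanager pick
    clusvcadm -e my_proc -m my_blade2.my_domain

    # Or relocate it explicitly later
    clusvcadm -r my_proc -m my_blade1.my_domain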
Re: [Linux-cluster] How to add shell script to cluster.conf
From this link - https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/4/html/Cluster_Administration/s1-config-service-dev-CA.html

Script
  Name - Enter a name for the custom user script.
  File (with path) - Enter the path where this custom script is located (for example, /etc/init.d/userscript)

On Sun, Sep 16, 2012 at 4:16 PM, Ben .T.George bentech4...@gmail.com wrote: Hi, I have an NFS HA setup. How can I add a custom shell script to that resource group? The NFS HA services are working well. I am working with the cluster.conf file directly; please help me to add this. I want to touch some information after exporting this filesystem. Regards, Ben

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
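A sketch of the same idea directly in cluster.conf, assuming a hypothetical /etc/init.d/userscript that follows the usual start/stop/status convention; the resource and service names here are placeholders, not from Ben's actual setup:

    <rm>
      <resources>
        <script file="/etc/init.d/userscript" name="userscript"/>
      </resources>
      <service autostart="1" name="nfs-ha">
        <!-- the existing fs / ip / nfsexport resources of the NFS service sit here -->
        <script ref="userscript"/>
      </service>
    </rm>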
Re: [Linux-cluster] 2 node cluster showing strange behaviour
I had similar issues, although I was using RHEL 5.5. Please refer to - https://access.redhat.com/knowledge/solutions/18542

On Mon, Sep 17, 2012 at 9:22 PM, Ben .T.George bentech4...@gmail.com wrote: Hi, I have just started building a 2-node cluster. I installed all the packages of the Red Hat Cluster Suite by mounting the RHEL 6 DVD. I joined the cluster using LUCI. After that my clustat shows this:

on node1:
Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 cgceccprd1.combinedgroup.net        1 Online, Local
 cgceccprd2.combinedgroup.net        2 Offline

on node2:
Cluster Status for eccprd @ Mon Sep 17 18:43:31 2012
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 cgceccprd1.combinedgroup.net        1 Offline
 cgceccprd2.combinedgroup.net        2 Online, Local

Both nodes show a different status. I restarted many times, and deleted and re-created the cluster many times; the result is the same. Please help me solve this. Regards, Ben

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
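When each node sees only itself online like this, cluster traffic between the nodes is usually being blocked (multicast or firewall). A hedged set of checks for RHEL 6; the port list is taken from the Red Hat cluster documentation as I recall it, so verify it against the Cluster Administration guide for the installed release:

    # Is multicast traffic flowing between the nodes?
    # (run omping on both nodes at the same time; plain ping only tests unicast;
    #  omping ships with newer cluster releases)
    omping cgceccprd1.combinedgroup.net cgceccprd2.combinedgroup.net

    # Open the ports the cluster stack needs, or stop iptables entirely while testing
    iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT   # corosync
    iptables -I INPUT -p tcp --dport 21064 -j ACCEPT       # dlm
    iptables -I INPUT -p tcp --dport 11111 -j ACCEPT       # ricci
    iptables -I INPUT -p tcp --dport 16851 -j ACCEPT       # modclusterd
    service iptables save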
Re: [Linux-cluster] linux-cluster
What kind of cluster is this - an academic project or a production-quality solution? If it's the former - go for manual fencing. You won't need a fence device, but failover won't be automatic. If it's the latter - yes, you'll need a fence device.

On Mon, Oct 1, 2012 at 10:15 PM, Rajagopal Swaminathan raju.rajs...@gmail.com wrote: Greetings, Hitesh, Please follow list guidelines.

On Mon, Oct 1, 2012 at 11:49 AM, Digimer li...@alteeve.ca wrote: You don't seem to be reading what I am typing. Please go back over the various replies and read again what I said. Follow the links and read what they say. And please don't reply only to me. Click Reply All and include the mailing list.

I have a constraint of using Linux on bare machines for my 2 desktops. Can you please let me know how I can proceed... Do I have to purchase some sort of hardware?

Yes. You will need to buy a power fencing device -- basically a power strip with an ethernet port. I would strongly suggest you have two network ports on each system. What do you want to do with a cluster?

One more thing... till now I have used this setup: Windows Vista OS --- VirtualBox --- Red Hat installed.

You have to be kidding. You are using Vista on bare metal for your HA?

If I download Xen or KVM can I use the same setup instead of VirtualBox? Windows Vista OS --- Xen or KVM --- Red Hat installed http://www.youtube.com/watch?v=oKI-tD0L18A

[root@hitesh12 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="1" name="dhoni">

There needs to be a two_node directive somewhere there. Read up. Better yet, get help from some local technical person who knows what HA is. It is a lot more than a simple desktop install. Or you need to invest quite a bit of time in learning and money in getting some extra hardware (fence devices, switches, NICs, external storage -- if required). And don't commit to or do that in production without knowing what you are getting into. If you can post the objective of using a cluster more descriptively, perhaps you will get more specific information. Digimer's excellent tutorial covers more or less all that you need to know about clusters. I wish I had that when I started playing around with this way back in 2007. -- Regards, Rajagopal

-- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
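For context, on the RHEL 5 generation the "manual fencing" being referred to looked roughly like the sketch below; fence_manual was later removed, and the next reply in this thread explains why it should not be relied on, so treat this purely as historical illustration with placeholder names:

    <fencedevices>
      <fencedevice agent="fence_manual" name="human"/>
    </fencedevices>
    ...
    <fence>
      <method name="1">
        <device name="human" nodename="node1"/>
      </method>
    </fence>

    # After physically verifying the node is really down, the admin acknowledged with:
    fence_ack_manual -n node1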
Re: [Linux-cluster] linux-cluster
Hi Digimer, Could you please give me references / case studies of the problems that led to manual fencing being dropped, and how automated fencing fixes them? Thanks, Parvez

On Tue, Oct 2, 2012 at 7:08 PM, Digimer li...@alteeve.ca wrote: On 10/02/2012 04:00 AM, Parvez Shaikh wrote: What kind of cluster is this - an academic project or a production-quality solution? If it's the former - go for manual fencing. You won't need a fence device, but failover won't be automatic.

*Please* don't do this. Manual fencing support was dropped for a reason. It's *far* too easy to mess things up when an admin uses it before identifying a problem.

If it's the latter - yes, you'll need a fence device.

This is the only sane option; academic or production. Fencing is an integral part of the cluster and you do yourself no favour by not learning it in an academic setup. -- Digimer Papers and Projects: https://alteeve.ca Hydrogen is just a colourless, odourless gas which, if left alone in sufficient quantities for long periods of time, begins to think about itself. -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Hi
A curious observation: there is a sudden surge of emails sent to private addresses rather than to the mailing list. Please send your doubts / questions to the mailing list linux-cluster@redhat.com instead of addressing people personally. Regarding the configuration for manual fencing - I don't have it with me; it was available with RHEL 5.5. Check in the system-config-cluster tool whether you can add manual fencing. Thanks, Parvez

On Wed, Oct 3, 2012 at 10:46 AM, Renchu Mathew rench...@gmail.com wrote: Hi Parvez, I am trying to set up a test cluster environment, but I haven't done fencing. Please find the error messages below. Some time after the nodes restart, the other node goes down. Can you please send me the configuration for manual fencing? Please find attached my cluster setup. It is not stable and /var/log/messages shows the errors below.

Sep 11 08:49:10 node1 corosync[1814]: [QUORUM] Members[2]: 1 2
Sep 11 08:49:10 node1 corosync[1814]: [QUORUM] Members[2]: 1 2
Sep 11 08:49:10 node1 corosync[1814]: [CPG ] chosen downlist: sender r(0) ip(192.168.1.251) ; members(old:2 left:1)
Sep 11 08:49:10 node1 corosync[1814]: [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 08:49:11 node1 corosync[1814]: cman killed by node 2 because we were killed by cman_tool or other application
Sep 11 08:49:11 node1 fenced[1875]: telling cman to remove nodeid 2 from cluster
Sep 11 08:49:11 node1 fenced[1875]: cluster is down, exiting
Sep 11 08:49:11 node1 gfs_controld[1950]: cluster is down, exiting
Sep 11 08:49:11 node1 gfs_controld[1950]: daemon cpg_dispatch error 2
Sep 11 08:49:11 node1 gfs_controld[1950]: cpg_dispatch error 2
Sep 11 08:49:11 node1 dlm_controld[1889]: cluster is down, exiting
Sep 11 08:49:11 node1 dlm_controld[1889]: daemon cpg_dispatch error 2
Sep 11 08:49:11 node1 dlm_controld[1889]: cpg_dispatch error 2
Sep 11 08:49:11 node1 dlm_controld[1889]: cpg_dispatch error 2
Sep 11 08:49:11 node1 dlm_controld[1889]: cpg_dispatch error 2
Sep 11 08:49:11 node1 fenced[1875]: daemon cpg_dispatch error 2
Sep 11 08:49:11 node1 rgmanager[2409]: #67: Shutting down uncleanly
Sep 11 08:49:11 node1 rgmanager[17059]: [clusterfs] unmounting /Data
Sep 11 08:49:11 node1 rgmanager[17068]: [clusterfs] Sending SIGTERM to processes on /Data
Sep 11 08:49:16 node1 rgmanager[17104]: [clusterfs] unmounting /Data
Sep 11 08:49:16 node1 rgmanager[17113]: [clusterfs] Sending SIGKILL to processes on /Data
Sep 11 08:49:19 node1 kernel: dlm: closing connection to node 2
Sep 11 08:49:19 node1 kernel: dlm: closing connection to node 1
Sep 11 08:49:19 node1 kernel: dlm: gfs2: no userland control daemon, stopping lockspace
Sep 11 08:49:22 node1 rgmanager[17149]: [clusterfs] unmounting /Data
Sep 11 08:49:22 node1 rgmanager[17158]: [clusterfs] Sending SIGKILL to processes on /Data

Also, when I try to restart the cman service, the error below comes up.

Starting cluster:
   Checking if cluster has been disabled at boot...   [  OK  ]
   Checking Network Manager...                        [  OK  ]
   Global setup...                                    [  OK  ]
   Loading kernel modules...                          [  OK  ]
   Mounting configfs...                               [  OK  ]
   Starting cman...                                   [  OK  ]
   Waiting for quorum...                              [  OK  ]
   Starting fenced...                                 [  OK  ]
   Starting dlm_controld...                           [  OK  ]
   Starting gfs_controld...                           [  OK  ]
   Unfencing self... fence_node: cannot connect to cman   [FAILED]
Stopping cluster:
   Leaving fence domain...                            [  OK  ]
   Stopping gfs_controld...                           [  OK  ]
   Stopping dlm_controld...                           [  OK  ]
   Stopping fenced...                                 [  OK  ]
   Stopping cman...                                   [  OK  ]
   Unloading kernel modules...                        [  OK  ]
   Unmounting configfs...                             [  OK  ]

Thanks again.
Renchu Mathew On Tue, Sep 11, 2012 at 9:10 PM, Arun Eapen CISSP, RHCA a...@redhat.com wrote: Put the fenced in debug mode and copy the error messages, for me to debug On Tue, 2012-09-11 at 11:52 +0400, Renchu Mathew wrote: Hi Arun, I have done the RH436 course in conducted by you at Redhat b'lore. How r u? I have configured a 2 node failover cluster setup (almost same like our RH436 lab setup in b'lore) It is almost ok except fencing. If I pull the active node
[Linux-cluster] Not restarting max_restart times before relocating failed service
Hi experts, I have defined a service as follows in cluster.conf -

<service autostart="0" domain="mydomain" exclusive="0" max_restarts="5" name="mgmt" recovery="restart">
  <script ref="myHaAgent"/>
  <ip ref="192.168.51.51"/>
</service>

I mentioned max_restarts=5 hoping that if the cluster fails to start the service 5 times, it will relocate it to another cluster node in the failover domain. To check this, I took down the NIC hosting the service's floating IP and got the following logs -

Oct 30 14:11:49 clurgmgrd: [10753]: warning Link for eth1: Not detected
Oct 30 14:11:49 clurgmgrd: [10753]: warning No link on eth1...
Oct 30 14:11:49 clurgmgrd: [10753]: warning No link on eth1...
Oct 30 14:11:49 clurgmgrd[10753]: notice status on ip 192.168.51.51 returned 1 (generic error)
Oct 30 14:11:49 clurgmgrd[10753]: notice Stopping service service:mgmt
Oct 30 14:12:00 clurgmgrd[10753]: notice Service service:mgmt is recovering
Oct 30 14:12:00 clurgmgrd[10753]: notice Recovering failed service service:mgmt
Oct 30 14:12:00 clurgmgrd[10753]: notice start on ip 192.168.51.51 returned 1 (generic error)
Oct 30 14:12:00 clurgmgrd[10753]: warning #68: Failed to start service:mgmt; return value: 1
Oct 30 14:12:00 clurgmgrd[10753]: notice Stopping service service:mgmt
Oct 30 14:12:00 clurgmgrd[10753]: notice Service service:mgmt is recovering
Oct 30 14:12:00 clurgmgrd[10753]: warning #71: Relocating failed service service:mgmt
Oct 30 14:12:01 clurgmgrd[10753]: notice Service service:mgmt is stopped
Oct 30 14:12:01 clurgmgrd[10753]: notice Service service:mgmt is stopped

But from the log it appears that the cluster tried to restart the service only ONCE before relocating. I was expecting the cluster to retry starting this service five times on the same node before relocating. Can anybody correct my understanding? Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
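One hedged observation on this: in rgmanager of that era, max_restarts is normally paired with restart_expire_time, and whether the pair is honoured at the service level (rather than only on independent subtrees) depends on the exact rgmanager version. The sketch below only shows the shape of the configuration being aimed for, not a confirmed fix:

    <service autostart="0" domain="mydomain" exclusive="0" name="mgmt"
             recovery="restart" max_restarts="5" restart_expire_time="600">
      <script ref="myHaAgent"/>
      <ip ref="192.168.51.51"/>
    </service>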
[Linux-cluster] Monitoring Frequency - can it be changed?
Hi experts, Can we change the frequency at which resources are monitored by the cluster? I observed 30 seconds as the monitoring frequency. Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Not restarting max_restart times before relocating failed service
Hi Digimer, cman_tool version gives following - 6.2.0 config 22 Cluster.conf - ?xml version=1.0? cluster alias=PARVEZ config_version=22 name=PARVEZ clusternodes clusternode name=myblade2 nodeid=2 votes=1 fence method name=1 device blade=2 missing_as_off=1 name=BladeCenterFencing-1/ /method /fence /clusternode clusternode name=myblade1 nodeid=1 votes=1 fence method name=1 device blade=1 missing_as_off=1 name=BladeCenterFencing-1/ /method /fence /clusternode /clusternodes cman expected_votes=1 two_node=1/ fencedevices fencedevice agent=fence_bladecenter ipaddr= mm-1.mydomain.com login= name=BladeCenterFencing-1 passwd=X shell_timeout=10/ /fencedevices rm resources script file=/localhome/my/my_ha name=myHaAgent/ ip address=192.168.51.51 monitor_link=1/ /resources failoverdomains failoverdomain name=mydomain nofailback=1 ordered=1 restricted=1 failoverdomainnode name=myblade2 priority=2/ failoverdomainnode name=myblade1 priority=1/ /failoverdomain /failoverdomains service autostart=0 domain=mydomain exclusive=0 max_restarts=5 name=mgmt recovery=restart script ref=myHaAgent/ ip ref=192.168.51.51/ /service /rm fence_daemon clean_start=1 post_fail_delay=0 post_join_delay=0/ /cluster Thanks, Parvez On Tue, Oct 30, 2012 at 9:25 PM, Digimer li...@alteeve.ca wrote: On 10/30/2012 01:54 AM, Parvez Shaikh wrote: Hi experts, I have defined a service as follows in cluster.conf - service autostart=0 domain=mydomain exclusive=0 max_restarts=5 name=mgmt recovery=restart script ref=myHaAgent/ ip ref=192.168.51.51/ /service I mentioned max_restarts=5 hoping that if cluster fails to start service 5 times, then it will relocate to another cluster node in failover domain. To check this, I turned down NIC hosting service's floating IP and got following logs - Oct 30 14:11:49 clurgmgrd: [10753]: warning Link for eth1: Not detected Oct 30 14:11:49 clurgmgrd: [10753]: warning No link on eth1... Oct 30 14:11:49 clurgmgrd: [10753]: warning No link on eth1... Oct 30 14:11:49 clurgmgrd[10753]: notice status on ip 192.168.51.51 returned 1 (generic error) Oct 30 14:11:49 clurgmgrd[10753]: notice Stopping service service:mgmt *Oct 30 14:12:00 clurgmgrd[10753]: notice Service service:mgmt is recovering* Oct 30 14:12:00 clurgmgrd[10753]: notice Recovering failed service service:mgmt Oct 30 14:12:00 clurgmgrd[10753]: notice start on ip 192.168.51.51 returned 1 (generic error) Oct 30 14:12:00 clurgmgrd[10753]: warning #68: Failed to start service:mgmt; return value: 1 Oct 30 14:12:00 clurgmgrd[10753]: notice Stopping service service:mgmt *Oct 30 14:12:00 clurgmgrd[10753]: notice Service service:mgmt is recovering Oct 30 14:12:00 clurgmgrd[10753]: warning #71: Relocating failed service service:mgmt* Oct 30 14:12:01 clurgmgrd[10753]: notice Service service:mgmt is stopped Oct 30 14:12:01 clurgmgrd[10753]: notice Service service:mgmt is stopped But from the log it appears that cluster tried to restart service only ONCE before relocating. I was expecting cluster to retry starting this service five times on the same node before relocating Can anybody correct my understanding? Thanks, Parvez What version? Please paste your full cluster.conf. -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
Re: [Linux-cluster] Not restarting max_restart times before relocating failed service
Hi, I am using recovery=restart as evident from earlier attached cluster.conf Thanks, Parvez On Wed, Oct 31, 2012 at 2:53 PM, emmanuel segura emi2f...@gmail.com wrote: Hello Maybe you missing recovery=restart in your services 2012/10/31 Parvez Shaikh parvez.h.sha...@gmail.com Hi Digimer, cman_tool version gives following - 6.2.0 config 22 Cluster.conf - ?xml version=1.0? cluster alias=PARVEZ config_version=22 name=PARVEZ clusternodes clusternode name=myblade2 nodeid=2 votes=1 fence method name=1 device blade=2 missing_as_off=1 name=BladeCenterFencing-1/ /method /fence /clusternode clusternode name=myblade1 nodeid=1 votes=1 fence method name=1 device blade=1 missing_as_off=1 name=BladeCenterFencing-1/ /method /fence /clusternode /clusternodes cman expected_votes=1 two_node=1/ fencedevices fencedevice agent=fence_bladecenter ipaddr= mm-1.mydomain.com login= name=BladeCenterFencing-1 passwd=X shell_timeout=10/ /fencedevices rm resources script file=/localhome/my/my_ha name=myHaAgent/ ip address=192.168.51.51 monitor_link=1/ /resources failoverdomains failoverdomain name=mydomain nofailback=1 ordered=1 restricted=1 failoverdomainnode name=myblade2 priority=2/ failoverdomainnode name=myblade1 priority=1/ /failoverdomain /failoverdomains service autostart=0 domain=mydomain exclusive=0 max_restarts=5 name=mgmt recovery=restart script ref=myHaAgent/ ip ref=192.168.51.51/ /service /rm fence_daemon clean_start=1 post_fail_delay=0 post_join_delay=0/ /cluster Thanks, Parvez On Tue, Oct 30, 2012 at 9:25 PM, Digimer li...@alteeve.ca wrote: On 10/30/2012 01:54 AM, Parvez Shaikh wrote: Hi experts, I have defined a service as follows in cluster.conf - service autostart=0 domain=mydomain exclusive=0 max_restarts=5 name=mgmt recovery=restart script ref=myHaAgent/ ip ref=192.168.51.51/ /service I mentioned max_restarts=5 hoping that if cluster fails to start service 5 times, then it will relocate to another cluster node in failover domain. To check this, I turned down NIC hosting service's floating IP and got following logs - Oct 30 14:11:49 clurgmgrd: [10753]: warning Link for eth1: Not detected Oct 30 14:11:49 clurgmgrd: [10753]: warning No link on eth1... Oct 30 14:11:49 clurgmgrd: [10753]: warning No link on eth1... Oct 30 14:11:49 clurgmgrd[10753]: notice status on ip 192.168.51.51 returned 1 (generic error) Oct 30 14:11:49 clurgmgrd[10753]: notice Stopping service service:mgmt *Oct 30 14:12:00 clurgmgrd[10753]: notice Service service:mgmt is recovering* Oct 30 14:12:00 clurgmgrd[10753]: notice Recovering failed service service:mgmt Oct 30 14:12:00 clurgmgrd[10753]: notice start on ip 192.168.51.51 returned 1 (generic error) Oct 30 14:12:00 clurgmgrd[10753]: warning #68: Failed to start service:mgmt; return value: 1 Oct 30 14:12:00 clurgmgrd[10753]: notice Stopping service service:mgmt *Oct 30 14:12:00 clurgmgrd[10753]: notice Service service:mgmt is recovering Oct 30 14:12:00 clurgmgrd[10753]: warning #71: Relocating failed service service:mgmt* Oct 30 14:12:01 clurgmgrd[10753]: notice Service service:mgmt is stopped Oct 30 14:12:01 clurgmgrd[10753]: notice Service service:mgmt is stopped But from the log it appears that cluster tried to restart service only ONCE before relocating. I was expecting cluster to retry starting this service five times on the same node before relocating Can anybody correct my understanding? Thanks, Parvez What version? Please paste your full cluster.conf. 
-- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education? -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- esta es mi vida e me la vivo hasta que dios quiera -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing
[Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished?
Hi experts, I am using the Red Hat Cluster available on RHEL 5.5, and it doesn't have any inbuilt mechanism to generate SNMP traps on failures of resources or failover of services from one node to another. I have a script agent which starts, stops and checks the status of my application. Is it possible, in a script resource, to distinguish between a normal startup of the service / resource and a startup of the service / resource in response to failover / failure handling? Doing so would help me write code to generate alarms if the startup of the service / resource (in my case a process) is due to failover (not a normal startup). Further, is it possible to get information such as the cause of the failure (leading to failover), and the previous cluster node on which the service / resource was running (prior to failover)? This would help to provide as much information as possible in the traps. Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster
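A hedged sketch of one way such a trap could be emitted from the script agent's start path, using net-snmp's snmptrap and a last-owner marker file on shared storage. The OID, community string, manager host and marker path are all placeholders, and rgmanager does not tell the script why it is starting, so this only approximates "failover vs. normal start":

    #!/bin/sh
    # Inside the start) branch of the script resource (illustrative only).
    # nms.example.com, the community, the enterprise OID and the marker
    # path below are placeholders, not real values from this thread.
    MARKER=/shared/myapp/last_owner
    ME=$(hostname)

    if [ -f "$MARKER" ] && [ "$(cat "$MARKER")" != "$ME" ]; then
      # The previous start happened on another node: treat this as a failover start.
      snmptrap -v 2c -c public nms.example.com '' \
        .1.3.6.1.4.1.99999.1 \
        .1.3.6.1.4.1.99999.1.1 s "failover start on $ME (was $(cat "$MARKER"))"
    fi
    echo "$ME" > "$MARKER"
    # ... continue with normal application startup ...

clustat output also shows the last owner of a service, but a marker file like this keeps the logic inside the agent itself.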
Re: [Linux-cluster] Normal startup vs startup due to failover on cluster node - can they be distinguished?
Kind reminder on this. Any inputs would be of great help. Basically I intend to have SNMP traps generated to notify failures and failover while using RHCS. Thanks, Parvez On Fri, Nov 23, 2012 at 2:54 PM, satya suresh kolapalli kolapallisatya...@gmail.com wrote: Hi, send the script which you have On 23 November 2012 10:55, Parvez Shaikh parvez.h.sha...@gmail.com wrote: Hi experts, I am using Red Hat Cluster available on RHEL 5.5. And it doesn't have any inbuilt mechanism to generate SNMP traps in failures of resources or failover of services from one node to another. I have a script agent, which starts, stops and checks status of my application. Is it possible that in a script resource - to distinguish between normal startup of service / resource vs startup of service/resource in response to failover / failure handling? Doing so would help me write code to generate alarms if startup of service / resource (in my case a process) is due to failover (not normal startup). Further is it possible to get information such as cause of failure(leading to failover), and previous cluster node on which service / resource was running(prior to failover)? This would help to provide as much information as possible in traps Thanks, Parvez -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Regards, SatyaSuresh Kolapalli Mob: 7702430892 -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster -- Linux-cluster mailing list Linux-cluster@redhat.com https://www.redhat.com/mailman/listinfo/linux-cluster