Re: [Pacemaker] AP9606 fencing device
Hi,

On Tue, Nov 16, 2010 at 08:15:10PM -0700, Devin Reade wrote:
> --On Wednesday, October 27, 2010 09:47:14 AM +0200 Pavlos Parissis
> pavlos.paris...@gmail.com wrote:
>
> > I have a APC AP9606 PDU and I am trying to find a stonith agent which
> > works with that PDU.
>
> I know that this is an old thread, but I'll reply anyway.
>
> I have one cluster that uses an old APC AP9606 for which I've not been
> able to obtain a flash update. In particular, it is:
>
>     hardware revision: J13
>     APP version 2.2.0
>     AOS version 3.0.3
>
> It is running just fine (see caveat below) with the following
> configuration, and I can attest that it has properly stonith'd nodes
> many times.
>
>     primitive msw stonith:apcmastersnmp \
>         operations $id=msw-operations \
>         op monitor interval=15 timeout=15 start-delay=15 \
>         params ipaddr=IPADDR port=161 community=COMMUNITY
>     clone msw-clone msw \
>         meta clone-max=2 target-role=started
>
> (yeah, that monitor interval is probably a little quick ...)
>
> That particular cluster is getting long in the tooth:
>
>     pacemaker-1.0.5-4.6.x86_64
>     openais-0.80.5-15.1.x86_64
>
> The caveat is that this PDU used to work with the default implementation,
> however at some point someone updated the OIDs in apcmastersnmp to match
> newer firmware. Therefore, I had to reverse patch that RA:

Yes, looking at the repository, that happened sometime in 2007. Though
the log message claimed that the OIDs would work with both older and
newer PDUs. Apparently not.

Then there was an effort by Philip Gwyn to implement a new plugin which
would support both and it was almost finished. At the time we had a
somewhat more stringent contribution policy and Philip didn't do what
was necessary in the end. It's a shame that that contribution didn't
make it to the project at the time. I can see that the code is still
available at http://www.awale.qc.ca/ha-linux/apc-snmp/

If Philip's still listening or somebody else wants to push this, we can
take a look at it again.

Thanks,

Dejan

> ===
> --- apcmastersnmp.c.orig 2009-09-26 16:12:27.0 -0600
> +++ apcmastersnmp.c 2009-09-28 16:46:17.0 -0600
> @@ -137,12 +137,12 @@
>  #define OUTLET_NO_CMD_PEND 2
>  /* oids */
> -#define OID_IDENT .1.3.6.1.4.1.318.1.1.12.1.5.0
> -#define OID_NUM_OUTLETS .1.3.6.1.4.1.318.1.1.12.1.8.0
> -#define OID_OUTLET_NAMES .1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i
> -#define OID_OUTLET_STATE .1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i
> -#define OID_OUTLET_COMMAND_PENDING .1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i
> -#define OID_OUTLET_REBOOT_DURATION .1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i
> +#define OID_IDENT .1.3.6.1.4.1.318.1.1.4.1.4.0
> +#define OID_NUM_OUTLETS .1.3.6.1.4.1.318.1.1.4.4.1.0
> +#define OID_OUTLET_NAMES .1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i
> +#define OID_OUTLET_STATE .1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i
> +#define OID_OUTLET_COMMAND_PENDING .1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i
> +#define OID_OUTLET_REBOOT_DURATION .1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i
>  /* snmpset -c private -v1 172.16.0.32:161
> ===

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] AP9606 fencing device
--On Wednesday, October 27, 2010 09:47:14 AM +0200 Pavlos Parissis
pavlos.paris...@gmail.com wrote:

> I have a APC AP9606 PDU and I am trying to find a stonith agent which
> works with that PDU.

I know that this is an old thread, but I'll reply anyway.

I have one cluster that uses an old APC AP9606 for which I've not been
able to obtain a flash update. In particular, it is:

    hardware revision: J13
    APP version 2.2.0
    AOS version 3.0.3

It is running just fine (see caveat below) with the following
configuration, and I can attest that it has properly stonith'd nodes
many times.

    primitive msw stonith:apcmastersnmp \
        operations $id=msw-operations \
        op monitor interval=15 timeout=15 start-delay=15 \
        params ipaddr=IPADDR port=161 community=COMMUNITY
    clone msw-clone msw \
        meta clone-max=2 target-role=started

(yeah, that monitor interval is probably a little quick ...)

That particular cluster is getting long in the tooth:

    pacemaker-1.0.5-4.6.x86_64
    openais-0.80.5-15.1.x86_64

The caveat is that this PDU used to work with the default implementation,
however at some point someone updated the OIDs in apcmastersnmp to match
newer firmware. Therefore, I had to reverse patch that RA:

===
--- apcmastersnmp.c.orig 2009-09-26 16:12:27.0 -0600
+++ apcmastersnmp.c 2009-09-28 16:46:17.0 -0600
@@ -137,12 +137,12 @@
 #define OUTLET_NO_CMD_PEND 2
 /* oids */
-#define OID_IDENT .1.3.6.1.4.1.318.1.1.12.1.5.0
-#define OID_NUM_OUTLETS .1.3.6.1.4.1.318.1.1.12.1.8.0
-#define OID_OUTLET_NAMES .1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i
-#define OID_OUTLET_STATE .1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i
-#define OID_OUTLET_COMMAND_PENDING .1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i
-#define OID_OUTLET_REBOOT_DURATION .1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i
+#define OID_IDENT .1.3.6.1.4.1.318.1.1.4.1.4.0
+#define OID_NUM_OUTLETS .1.3.6.1.4.1.318.1.1.4.4.1.0
+#define OID_OUTLET_NAMES .1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i
+#define OID_OUTLET_STATE .1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i
+#define OID_OUTLET_COMMAND_PENDING .1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i
+#define OID_OUTLET_REBOOT_DURATION .1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i
 /* snmpset -c private -v1 172.16.0.32:161
===
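The two OID sets in the patch can be kept side by side instead of patched into the plugin. The following is only a rough sketch: the OID values are copied from the diff above, while the function name `apc_oid` and the family labels `legacy` (the 318.1.1.4 tree that the AP9606 answers on) and `current` (the 318.1.1.12 tree the stock plugin uses) are made up for illustration and are not part of apcmastersnmp.

```shell
#!/bin/sh
# Sketch only. OID values are copied from the diff above; the function name
# and the family labels "legacy"/"current" are invented for this example.
apc_oid() {
    # $1 = OID family, $2 = symbolic name, $3 = outlet number (per-outlet OIDs)
    case "$1:$2" in
        legacy:ident)            echo ".1.3.6.1.4.1.318.1.1.4.1.4.0" ;;
        legacy:num_outlets)      echo ".1.3.6.1.4.1.318.1.1.4.4.1.0" ;;
        legacy:outlet_name)      echo ".1.3.6.1.4.1.318.1.1.4.5.2.1.3.$3" ;;
        legacy:outlet_state)     echo ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.$3" ;;
        legacy:command_pending)  echo ".1.3.6.1.4.1.318.1.1.4.4.2.1.2.$3" ;;
        legacy:reboot_duration)  echo ".1.3.6.1.4.1.318.1.1.4.5.2.1.5.$3" ;;
        current:ident)           echo ".1.3.6.1.4.1.318.1.1.12.1.5.0" ;;
        current:num_outlets)     echo ".1.3.6.1.4.1.318.1.1.12.1.8.0" ;;
        current:outlet_name)     echo ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.$3" ;;
        current:outlet_state)    echo ".1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.$3" ;;
        current:command_pending) echo ".1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.$3" ;;
        current:reboot_duration) echo ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.$3" ;;
        *) echo "apc_oid: unknown OID '$1:$2'" >&2; return 1 ;;
    esac
}

# Example: the OID for the state of outlet 3 on an old AP9606.
apc_oid legacy outlet_state 3
```

The result could then be handed to net-snmp by hand, for instance `snmpget -v1 -c COMMUNITY IPADDR "$(apc_oid legacy num_outlets)"`, to check which tree a given PDU actually answers on before touching the plugin source.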
Re: [Pacemaker] AP9606 fencing device
On 17 November 2010 04:15, Devin Reade g...@gno.org wrote:
> --On Wednesday, October 27, 2010 09:47:14 AM +0200 Pavlos Parissis
> pavlos.paris...@gmail.com wrote:
>
> > I have a APC AP9606 PDU and I am trying to find a stonith agent which
> > works with that PDU.
>
> [...snip...]
>
> The caveat is that this PDU used to work with the default implementation,
> however at some point someone updated the OIDs in apcmastersnmp to match
> newer firmware. Therefore, I had to reverse patch that RA:
>
> [...snip...]

I faced the same problem and, because I didn't want to modify the code of
the apcmastersnmp RA, I used the rackpdu RA, where I could set the OIDs in
the parameters. This RA worked perfectly until the PDU died!

I suggest using the rackpdu RA, because if you upgrade your cluster
software your modification will be gone.

Cheers,
Pavlos
Re: [Pacemaker] AP9606 fencing device
Hi,

On Wed, Oct 27, 2010 at 08:15:09PM +0200, Pavlos Parissis wrote:
> On 27 October 2010 19:46, Pavlos Parissis pavlos.paris...@gmail.com wrote:
> > I did more testing using the clone type of fencing and worked as I
> > expected.
> >
> > test1
> >   hack init script to return 1 on stop and run a crm resource move on
> >   that resource
> > result
> >   node it was fenced and resource was started on the other node
> >
> > test2
> >   using firewall to break the heartbeat links on node with resource
> > result
> >   node it was fenced and resource was started on the other node
> >
> > As Dejan suggested I am going to run the same type of tests when 1
> > fence resource is used. In this test I will try to cause a fencing on
> > the node which has fencing resource running on it and see if pacemaker
> > moves the resource before it fences the node.
>
> I did the same tests without cloning and pacemaker moves fencing resource
> before triggers a reboot on the node where fencing resource was running.
>
> So, cloning fencing resource and having just one fence resource have the
> same behaviour! at least for these 2 tests.
> now I don't know which configuration solution I should choose!

Whichever you feel more comfortable with, providing that the device
really can support multiple connections simultaneously. I'd opt for
non-cloned version. It's simpler, it avoids possible device contention.

Thanks,

Dejan

> Cheers,
> Pavlos
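For what it's worth, a non-cloned setup along the lines suggested here might look roughly like the definition printed below. This is a sketch under assumptions: the parameter names and values are the ones posted elsewhere in this thread for the AP9606, but the resource id `pdu-fence` and the monitor interval and timeout are invented, and the shell wrapper exists only so the definition can be printed and inspected.

```shell
#!/bin/sh
# Sketch only. Parameter names come from the posts in this thread; the
# resource id "pdu-fence" and the monitor interval/timeout are made up.
single_fence_config() {
    cat <<'EOF'
primitive pdu-fence stonith:external/rackpdu \
    params pduip=192.168.100.100 community=private \
        hostlist=node-01,node-02,node-03 \
        oid=.1.3.6.1.4.1.318.1.1.4.4.2.1.3 \
        outlet_config=/tmp/outlet_config \
    op monitor interval=60 timeout=30
EOF
}

# Print the definition; no clone statement follows, so only one instance runs.
single_fence_config
```

The printed text is meant to be pasted at the `crm configure` prompt. Leaving out the `clone` line is the whole point: a single instance means only one node talks to the PDU at a time.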
[Pacemaker] AP9606 fencing device
Hi,

I have an APC AP9606 PDU and I am trying to find a stonith agent which
works with that PDU. The apcmaster and apcmastersnmp agents don't work, as
you can see below.

I managed to get rackpdu working by setting the outlet config (the oid
for snmpwalk fails) and also setting the command OID. Here is the long
command:

stonith -t external/rackpdu hostlist=node-01,node-02,node-03 pduip=192.168.100.100 oid=.1.3.6.1.4.1.318.1.1.4.4.2.1.3 community=private outlet_config=/tmp/outlet_config -T on node-01

Does anyone know any other PDU which works out of box with the supplied
stonith agents?

Regards,
Pavlos

[r...@node-01 ~]# stonith -t apcmastersnmp ipaddr=192.168.100.100 port=161 community=private -S
** (process:3887): CRITICAL **: APC_read: error in response packet, reason 2 [(noSuchName) There is no such variable name in this MIB.].
** (process:3887): CRITICAL **: apcmastersnmp_set_config: cannot read number of outlets.
Invalid config info for apcmastersnmp device
Valid config names are:
    ipaddr
    port
    community

[r...@node-01 ~]# stonith -t apcmaster ipaddr=192.168.100.100 login=stonith password=stonith -S
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
** (process:4215): CRITICAL **: Received [\xff\xfb\u0001\xff\xfb\u0003\xff\xfd\u0003 \u000dUser Name : ]
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
** (process:4215): CRITICAL **: Received []
connect() failed: Connection reset by peer
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
** (process:4215): CRITICAL **: Received []
connect() failed: Connection reset by peer
connect() failed: Connection reset by peer
** (process:4215): CRITICAL **: Did not find string Escape character is '^]'. from APC MasterSwitch.
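To make the outlet_config workaround a bit more concrete, here is a rough sketch of how a node name could be turned into a per-outlet OID. It rests on two assumptions that are not confirmed in this thread: that the outlet_config file holds one `node_name=outlet_number` pair per line, and that the outlet number is simply appended to the OID passed as `oid=`. The function name `outlet_for_node` and the sample mapping are hypothetical.

```shell
#!/bin/sh
# Sketch only. ASSUMPTION: outlet_config holds one node_name=outlet_number
# pair per line; check the rackpdu plugin's metadata before relying on this.
outlet_for_node() {
    # $1 = path to the outlet_config file, $2 = node name
    while IFS='=' read -r node outlet; do
        if [ "$node" = "$2" ]; then
            echo "$outlet"
            return 0
        fi
    done < "$1"
    return 1
}

# Hypothetical mapping for the three nodes named in the post.
cfg=$(mktemp)
printf '%s\n' 'node-01=1' 'node-02=2' 'node-03=3' > "$cfg"

# ASSUMPTION: per-outlet OID = OID given as oid= plus the outlet number.
base=.1.3.6.1.4.1.318.1.1.4.4.2.1.3
echo "$base.$(outlet_for_node "$cfg" node-02)"
```

A static mapping like this sidesteps the failing snmpwalk for outlet names, at the cost of having to keep the file in sync with the cabling by hand.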
Re: [Pacemaker] AP9606 fencing device
On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote:
> Does anyone know any other PDU which works out of box with the supplied
> stonith agents?

I use APC AP7901, works like a charm:

primitive pdu stonith:external/rackpdu \
    params pduip=10.6.6.6 community=pdu-6 hostlist=AUTO
clone fencing pdu

Vadym
Re: [Pacemaker] AP9606 fencing device
On Oct 27, 2010, at 7:27 AM, Pavlos Parissis wrote:
> On 27 October 2010 13:12, Vadym Chepkov vchep...@gmail.com wrote:
> > On Oct 27, 2010, at 3:47 AM, Pavlos Parissis wrote:
> > > Does anyone know any other PDU which works out of box with the
> > > supplied stonith agents?
> >
> > I use APC AP7901, works like a charm:
> >
> > primitive pdu stonith:external/rackpdu \
> >     params pduip=10.6.6.6 community=pdu-6 hostlist=AUTO
> > clone fencing pdu
> >
> > Vadym
>
> Then most likely the defaults OIDs of the rackpdu agents matches the OIDs
> of the AP7901. In my case I have to use OID for the device itself
> 1.3.6.1.4.1.318.1.1.4.4.2.1.3 and OID for retrieving (snmpwalk) the
> outlet list .1.3.6.1.4.1.318.1.1.4.4.2.1.4 .
>
> Hold on a sec, are you using clone on AP7901? Does it support multiple
> connections? Mine it doesn't.

Then it's useless regardless clone or not, you have to have multiple
instances, because server can't reliable fence itself, right?

Vadym
Re: [Pacemaker] AP9606 fencing device
Hi,

On Wed, Oct 27, 2010 at 01:58:20PM +0200, Pavlos Parissis wrote:
> On 27 October 2010 13:43, Vadym Chepkov vchep...@gmail.com wrote:
> [...snip...]
> > Then it's useless regardless clone or not, you have to have multiple
> > instances, because server can't reliable fence itself, right?
>
> My understanding is/was that I need to have one resource running on 1 of
> the 3 nodes in the cluster and if a fence event has to be triggered then
> pacemaker will send to it to the one stonith resource. I am planning to
> test that the coming days.[1]
> Am I right? if not then I have to buy a different PDU! :-(

Yes. In case a node which is currently running the stonith resource is
to be fenced, then the stonith resource would move elsewhere first.

But, yes, you should test this just like anything else. Make sure to
test both the node gone event (failed links) and a critical action
failing (such as stop).

Thanks,

Dejan

> Cheers,
> Pavlos
>
> [1] by testing I mean kill the heartbeat links on 1 node and DC node
> should fence that node.
Re: [Pacemaker] AP9606 fencing device
On Oct 27, 2010, at 7:58 AM, Pavlos Parissis wrote:
> On 27 October 2010 13:43, Vadym Chepkov vchep...@gmail.com wrote:
> [...snip...]
> > Then it's useless regardless clone or not, you have to have multiple
> > instances, because server can't reliable fence itself, right?
>
> My understanding is/was that I need to have one resource running on 1 of
> the 3 nodes in the cluster and if a fence event has to be triggered then
> pacemaker will send to it to the one stonith resource. I am planning to
> test that the coming days.[1]
> Am I right? if not then I have to buy a different PDU! :-(

My understanding is you have to have a fencing device for each of your
hosts. Are you sure one connection limitation applies for SNMP? Most
likely it's only for tcp sessions - ssh/http ?

If you look into rackpdu log you will see this:

Oct 19 12:39:00 xen-11 stonithd: [8606]: debug: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/rackpdu gethosts'
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_run_cmd: '/usr/lib64/stonith/plugins/external/rackpdu gethosts' output: xen-11 xen-12 Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: running 'rackpdu gethosts' returned 0
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host xen-11
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host xen-12
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_3
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_4
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_5
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_6
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_7
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_8
Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: remove us (xen-11) from the host list for pdu:0

check the last line - the agent is smart enough to know it can't fence
itself.

Vadym
Re: [Pacemaker] AP9606 fencing device
On Oct 27, 2010, at 8:11 AM, Dejan Muhamedagic wrote:
> Hi,
>
> On Wed, Oct 27, 2010 at 01:58:20PM +0200, Pavlos Parissis wrote:
> [...snip...]
> > My understanding is/was that I need to have one resource running on 1
> > of the 3 nodes in the cluster and if a fence event has to be triggered
> > then pacemaker will send to it to the one stonith resource. I am
> > planning to test that the coming days.[1]
> > Am I right? if not then I have to buy a different PDU! :-(
>
> Yes. In case a node which is currently running the stonith resource is
> to be fenced, then the stonith resource would move elsewhere first.
>
> But, yes, you should test this just like anything else. Make sure to
> test both the node gone event (failed links) and a critical action
> failing (such as stop).
>
> Thanks,
>
> Dejan

rackpdu stonith agent seems to explicitly remove node itself from list of
hosts it can fence. so I assume if you have just one instance running,
cluster would not see any stonith device capable to fence server where
agent started initially. Would pacemaker move such resource anyway? Since
it reported it can't fence server in trouble?

Vadym
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 14:09, Vadym Chepkov vchep...@gmail.com wrote:
> [...snip...]
> > > > Hold on a sec, are you using clone on AP7901? Does it support
> > > > multiple connections? Mine it doesn't.
> > >
> > > Then it's useless regardless clone or not, you have to have multiple
> > > instances, because server can't reliable fence itself, right?
> >
> > My understanding is/was that I need to have one resource running on 1
> > of the 3 nodes in the cluster and if a fence event has to be triggered
> > then pacemaker will send to it to the one stonith resource. I am
> > planning to test that the coming days.[1]
> > Am I right? if not then I have to buy a different PDU! :-(
>
> My understanding is you have to have a fencing device for each of your
> hosts. Are you sure one connection limitation applies for SNMP? Most
> likely it's only for tcp sessions - ssh/http ?

Valid point Vadym, SNMP runs over UDP, so it is connectionless
communication. I am wondering how I can test this - whether cloning works
on this PDU.

> If you look into rackpdu log you will see this:
>
> Oct 19 12:39:00 xen-11 stonithd: [8606]: debug: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/rackpdu gethosts'
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_run_cmd: '/usr/lib64/stonith/plugins/external/rackpdu gethosts' output: xen-11 xen-12 Outlet_3 Outlet_4 Outlet_5 Outlet_6 Outlet_7 Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: running 'rackpdu gethosts' returned 0
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host xen-11
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host xen-12
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_3
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_4
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_5
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_6
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_7
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: external_hostlist: rackpdu host Outlet_8
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: remove us (xen-11) from the host list for pdu:0
>
> check the last line - the agent is smart enough to know it can't fence
> itself.
>
> Vadym
Re: [Pacemaker] AP9606 fencing device
Hi,

On Wed, Oct 27, 2010 at 08:34:19AM -0400, Vadym Chepkov wrote:
> On Oct 27, 2010, at 8:11 AM, Dejan Muhamedagic wrote:
> [...snip...]
> > Yes. In case a node which is currently running the stonith resource is
> > to be fenced, then the stonith resource would move elsewhere first.
> >
> > But, yes, you should test this just like anything else. Make sure to
> > test both the node gone event (failed links) and a critical action
> > failing (such as stop).
> >
> > Thanks,
> >
> > Dejan
>
> rackpdu stonith agent seems to explicitly remove node itself from list of
> hosts it can fence. so I assume if you have just one instance running,
> cluster would not see any stonith device capable to fence server where
> agent started initially. Would pacemaker move such resource anyway?

Yes, it should.

> Since it reported it can't fence server in trouble?

The node on which the resource is running is removed from the
hostlist, but once the resource moves elsewhere, the node will
reappear in the list.

Thanks,

Dejan

> Vadym
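The host-list behaviour described here can be modelled in a few lines. This is only a toy model of what the thread describes, not the actual stonithd code; the function name `fencable_hosts` is invented, and the node names are taken from the log excerpt quoted earlier in the thread.

```shell
#!/bin/sh
# Sketch only: models the behaviour described in the thread, not the actual
# stonithd code. The function name is invented for this example.
fencable_hosts() {
    # $1 = node the stonith instance runs on; remaining args = hosts on the PDU
    self=$1
    shift
    for host in "$@"; do
        [ "$host" = "$self" ] || echo "$host"
    done
}

fencable_hosts xen-11 xen-11 xen-12   # instance on xen-11: only xen-12 listed
fencable_hosts xen-12 xen-11 xen-12   # after moving to xen-12: xen-11 listed
```

Seen this way, a single instance is enough as long as the cluster can relocate it: the node that needs fencing is only missing from the list while the instance is still running on that node.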
Re: [Pacemaker] AP9606 fencing device
Hi,

I quickly tested cloning on this fencing and it worked. I used iptables
to break the heartbeat link on node-01 and it was fenced by the other
node - the DC.

In the coming days I will test without cloning fencing device.

Cheers,
Pavlos
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 14:09, Vadym Chepkov vchep...@gmail.com wrote:
> [...snip...]
> If you look into rackpdu log you will see this:
>
> [...snip...]
> Oct 19 12:39:01 xen-11 stonithd: [8606]: debug: remove us (xen-11) from the host list for pdu:0
>
> check the last line - the agent is smart enough to know it can't fence
> itself.

do you enable debug by setting debug 1 on ha.cf?
do you see that WARN on your system?

stonith-ng: [3369]: WARN: parse_host_line: Could not parse (0 42): /usr/lib/stonith/plugins/external/rackpdu: line 125: local: can only be used in a function

Cheers,
Pavlos
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 19:23, Vadym Chepkov vchep...@gmail.com wrote:
> On Oct 27, 2010, at 1:18 PM, Pavlos Parissis wrote:
> > ok, i have done the same hack but i will remove it.
> > I think 1.1.4 will be out before we go on production and hopefully this
> > will be fixed in 1.1.4.
>
> This is part of cluster-glue, not pacemaker and it's 1.0.6 now

Yep, you are right and I am wrong.
Re: [Pacemaker] AP9606 fencing device
I did more testing using the clone type of fencing and it worked as I
expected.

test1
  hack the init script to return 1 on stop and run a crm resource move on
  that resource
result
  the node was fenced and the resource was started on the other node

test2
  use the firewall to break the heartbeat links on the node with the
  resource
result
  the node was fenced and the resource was started on the other node

As Dejan suggested, I am going to run the same type of tests when 1 fence
resource is used. In this test I will try to cause a fencing on the node
which has the fencing resource running on it and see if pacemaker moves
the resource before it fences the node.

Cheers,
Pavlos
Re: [Pacemaker] AP9606 fencing device
On 27 October 2010 19:46, Pavlos Parissis pavlos.paris...@gmail.com wrote:
> I did more testing using the clone type of fencing and worked as I
> expected.
>
> [...snip...]
>
> As Dejan suggested I am going to run the same type of tests when 1 fence
> resource is used. In this test I will try to cause a fencing on the node
> which has fencing resource running on it and see if pacemaker moves the
> resource before it fences the node.

I did the same tests without cloning, and pacemaker moves the fencing
resource before it triggers a reboot on the node where the fencing
resource was running.

So, cloning the fencing resource and having just one fence resource give
the same behaviour, at least for these 2 tests.
Now I don't know which configuration I should choose!

Cheers,
Pavlos