[Linux-ha-dev] Re: [Linux-HA] APC SNMP STONITH
Hello,

Philip Gwyn wrote:
> As discussed earlier, I'm writing a new SNMP STONITH plugin. The goal is
> for it to seamlessly work with the new and old MIBs (AP9606 vs AP7900).

Ok, the old apcmastersnmp needed work, right.

> Instead of fixing the current apcmastersnmp.c, I started over from
> scratch, very roughly basing it on the net-snmp tutorial.

One thing that bothered me in the old apcmastersnmp was that one could not
configure the OIDs; they were hardcoded as #defines. Would it be possible
to change that? (I know, configuration files end in .c)

> So far, I have a small library that will
> - query the PDU
> - detect which MIB to use
> - find the necessary outlet
> - turn the outlet on (or off)
> - query the PDU until the outlet goes to that state (or timeout)
>
> http://www.awale.qc.ca/ha-linux/apc-snmp/
>
> Tomorrow I'm going to go over apcmastersnmp.c again to see if there are
> some gotchas that I might have missed. However, it does a reset (not
> turn off) so I don't know how useful that is.

Why don't you use the reset as well? That is a feature of the PDU that
allows you to configure the delay between off and on. I really think you
should stick to that. It has worked like this for some years now, and
there was no problem at all with it. Your argument that a server which
needs to be reset because there is a problem with it should therefore not
start automatically, I cannot follow. If the server boots after a reset,
what harm can it do? And you can change the behaviour in the BIOS.

If the heartbeat project thinks about replacing the current apcmastersnmp
with yours, it should be as compatible as possible. (Is Heartbeat going
to replace the plugin with this one?)

What firmware did you test your plugin with? Please have a look at this
thread:

http://lists.community.tummy.com/pipermail/linux-ha-dev/2007-April/014240.html

Make sure your OIDs are valid for versions 2.x and 3.x.
Regards,

Peter
___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so
Hello all,

why did the included patch fail the requirements for inclusion into
heartbeat? The message below is about 4 months old.

Thanks,

Peter

Peter Kruse wrote:
> Hi!
>
> Alan Robertson wrote:
>> Could you make a patch-format file for this, and send it to the list
>> as an ASCII attachment?
>
> Attached. BTW, the error message you get when you try to stonith with
> the wrong apcmastersnmp.so is somewhat misleading:
>
> # stonith -t apcmastersnmp -p "apc-1 161 write-community" outlet1
> stonith: Invalid config info for apcmastersnmp device
> stonith: Config info syntax: hostname/ip-address port community
>
> The hostname/IP-address, SNMP port and community string are
> white-space delimited.
>
> Peter
>
> --- apcmastersnmp.c.orig  2007-04-04 09:03:58.0 +0200
> +++ apcmastersnmp.c       2007-04-04 09:05:24.0 +0200
> @@ -137,12 +137,12 @@
>  #define OUTLET_NO_CMD_PEND          2
>
>  /* oids */
> -#define OID_IDENT                   ".1.3.6.1.4.1.318.1.1.4.1.4.0"
> -#define OID_NUM_OUTLETS             ".1.3.6.1.4.1.318.1.1.4.4.1.0"
> -#define OID_OUTLET_NAMES            ".1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i"
> -#define OID_OUTLET_STATE            ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i"
> -#define OID_OUTLET_COMMAND_PENDING  ".1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i"
> -#define OID_OUTLET_REBOOT_DURATION  ".1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i"
> +#define OID_IDENT                   ".1.3.6.1.4.1.318.1.1.12.1.5.0"
> +#define OID_NUM_OUTLETS             ".1.3.6.1.4.1.318.1.1.12.1.8.0"
> +#define OID_OUTLET_NAMES            ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i"
> +#define OID_OUTLET_STATE            ".1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i"
> +#define OID_OUTLET_COMMAND_PENDING  ".1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i"
> +#define OID_OUTLET_REBOOT_DURATION  ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i"
>
>  /* own defines */
>  #define MAX_STRING                  128
Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so
Hello,

Alan Robertson wrote:
> Dave Blaschke wrote:
>> Also, is there some way to determine what firmware is on the APC and
>> then pass the appropriate OID_ constant? This plugin must work for
>> some folks (at least the original author anyway ;-) so these changes
>> would probably break folks who are happy with their v1 APC, or is
>> that not an issue?

Hm, I don't know about v1, but as I said, the OIDs I posted are
compatible with v2, which somehow indicates that the original OIDs in
apcmastersnmp.c were wrong in the first place... Let me stress this
again: the OIDs I posted work for v2 and v3, so I don't think there is
any need to check the firmware version, except for version 1, which I
cannot test...

> I'm sure there is a way to read it via SNMP.

Yes, use ...318.1.1.12.1.3.0, example:

.318.1.1.12.1.3.0 = "v2.7.4"

and also for v3:

.318.1.1.12.1.3.0 = "v3.3.3"

cheers,

Peter
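Reading that firmware OID from a script could look roughly like the sketch below. The host name and community are placeholders, and the net-snmp output format is an assumption, so the value parsing is demonstrated against a canned reply; this lets the sketch run without a live PDU.

```shell
# Hedged sketch: extract the firmware version string from snmpget output.
# Assumes net-snmp style output of the form:
#   SNMPv2-SMI::enterprises.318.1.1.12.1.3.0 = STRING: "v2.7.4"
parse_fw_version() {
    sed -n 's/.* = STRING: "\(.*\)"/\1/p'
}

# Against real hardware one would pipe snmpget into the parser:
#   snmpget -v1 -c public apc-1 .1.3.6.1.4.1.318.1.1.12.1.3.0 | parse_fw_version
# Here a canned reply stands in for the PDU:
echo 'SNMPv2-SMI::enterprises.318.1.1.12.1.3.0 = STRING: "v2.7.4"' | parse_fw_version
```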
Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so
Hi Dave,

Dave Blaschke wrote:
> I cannot find the "Config info syntax:" message in the latest or any of
> the most recent 2.0.x code - what version of heartbeat are you using?

Oops, yes, that was an old version, but that doesn't make a difference
concerning the OIDs.

> Regardless, you should get a more meaningful message by parsing the
> logs - or try adding the -d option.

Any config file error results in "Invalid config info..." including being
unable to establish an SNMP session... That was my point.

Peter
Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so
Hi!

Alan Robertson wrote:
> Could you make a patch-format file for this, and send it to the list as
> an ASCII attachment?

Attached. BTW, the error message you get when you try to stonith with the
wrong apcmastersnmp.so is somewhat misleading:

# stonith -t apcmastersnmp -p "apc-1 161 write-community" outlet1
stonith: Invalid config info for apcmastersnmp device
stonith: Config info syntax: hostname/ip-address port community

The hostname/IP-address, SNMP port and community string are white-space
delimited.

Peter

--- apcmastersnmp.c.orig  2007-04-04 09:03:58.0 +0200
+++ apcmastersnmp.c       2007-04-04 09:05:24.0 +0200
@@ -137,12 +137,12 @@
 #define OUTLET_NO_CMD_PEND          2

 /* oids */
-#define OID_IDENT                   ".1.3.6.1.4.1.318.1.1.4.1.4.0"
-#define OID_NUM_OUTLETS             ".1.3.6.1.4.1.318.1.1.4.4.1.0"
-#define OID_OUTLET_NAMES            ".1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i"
-#define OID_OUTLET_STATE            ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i"
-#define OID_OUTLET_COMMAND_PENDING  ".1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i"
-#define OID_OUTLET_REBOOT_DURATION  ".1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i"
+#define OID_IDENT                   ".1.3.6.1.4.1.318.1.1.12.1.5.0"
+#define OID_NUM_OUTLETS             ".1.3.6.1.4.1.318.1.1.12.1.8.0"
+#define OID_OUTLET_NAMES            ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i"
+#define OID_OUTLET_STATE            ".1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i"
+#define OID_OUTLET_COMMAND_PENDING  ".1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i"
+#define OID_OUTLET_REBOOT_DURATION  ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i"

 /* own defines */
 #define MAX_STRING                  128
[Linux-ha-dev] new apc firmware breaks apcmastersnmp.so
Hello,

with the v3 firmware of APC's PDUs (models AP7920 and AP7921 at least)
the apcmastersnmp.so plugin to stonith does not work anymore. In
apcmastersnmp.c there is:

#define OID_IDENT                   ".1.3.6.1.4.1.318.1.1.4.1.4.0"
#define OID_NUM_OUTLETS             ".1.3.6.1.4.1.318.1.1.4.4.1.0"
#define OID_OUTLET_NAMES            ".1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i"
#define OID_OUTLET_STATE            ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i"
#define OID_OUTLET_COMMAND_PENDING  ".1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i"
#define OID_OUTLET_REBOOT_DURATION  ".1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i"

which should be replaced by:

#define OID_IDENT                   ".1.3.6.1.4.1.318.1.1.12.1.5.0"
#define OID_NUM_OUTLETS             ".1.3.6.1.4.1.318.1.1.12.1.8.0"
#define OID_OUTLET_NAMES            ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i"
#define OID_OUTLET_STATE            ".1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i"
#define OID_OUTLET_COMMAND_PENDING  ".1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i"
#define OID_OUTLET_REBOOT_DURATION  ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i"

My tests have shown that these OIDs are also backward compatible with v2
firmware (aos 2.7.1 / rpdu 2.7.4).

Regards,

Peter
Re: [Linux-ha-dev] What happened to rsc_state?
Hi,

Andrew Beekhof wrote:
>> I ran ptest and it wants to start fence1:1 and fence2:1
>
> the CRM probably just needs a little poke to rerun the PE. try:
> crm_attribute -n last_cleanup -v "`date -r`"

Ah! That did the trick, but I had to use "`date -R`" ;)

> i cleaned this up for 2.0.6 earlier this week... the problem is that
> -C results in a delete in the status section which is problematic to
> detect reliably (you'll get *way* more false positives than true
> hits). so in .6 crm_resource does the equivalent of the above command
> automatically.

Very good, I will add it to my script then.

Best regards,

Peter
Re: [Linux-ha-dev] What happened to rsc_state?
Hi,

Andrew Beekhof wrote:
> On 5/9/06, Peter Kruse <[EMAIL PROTECTED]> wrote:
>> although cibadmin -Ql -o status does not show the failed resource
>> anymore. How can I recover from this situation?
>
> cib contents?

Oh, thanks for reminding me (I should know by now...) Attached is the
output of "cibadmin -Q" before I ran the commands and after I ran the
commands (also attached). crm_mon still reports this:

Clone Set: DoFencing_fence1
    fence1:0  (stonith:external/apc):  Started ha-test-2
    fence1:1  (stonith:external/apc):  Stopped
Clone Set: DoFencing_fence2
    fence2:0  (stonith:external/apc):  Started ha-test-2
    fence2:1  (stonith:external/apc):  Stopped

although the status should have been cleared.

Regards,

Peter

cibadmin-Q.before.gz
Description: GNU Zip compressed data

cibadmin-Q.after.gz
Description: GNU Zip compressed data

crm_resource -C -r rg2:IPaddr2 -t primitive -H ha-test-1
crm_resource -C -r rg2:IPaddr2 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence1:fence1:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence1:fence1:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence2:fence2:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence2:fence2:1 -t primitive -H ha-test-1
crm_resource -C -r rg1:IPaddr3 -t primitive -H ha-test-1
Re: [Linux-ha-dev] What happened to rsc_state?
Hi,

Andrew Beekhof wrote:
> if you want a list of failed resources: crm_mon -1 | grep failed
>
> if you just want the lrm_rsc_op's that failed, look for rc_code != 0 &&
> rc_code != 7 (where 7 is LSB for "Safely Stopped") in the result of
> cibadmin -Ql -o status

Is that also true for fencing resources? If I disconnect the network from
one node where the powerswitch is attached, crm_mon -1 prints:

Clone Set: DoFencing_fence1
    fence1:0  (stonith:external/apc):  Started ha-test-2
    fence1:1  (stonith:external/apc):  Stopped
Clone Set: DoFencing_fence2
    fence2:0  (stonith:external/apc):  Started ha-test-2
    fence2:1  (stonith:external/apc):  Stopped

but with these commands I cannot recover:

crm_resource -C -r DoFencing_fence1:fence1:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence2:fence2:1 -t primitive -H ha-test-1

although cibadmin -Ql -o status does not show the failed resource
anymore. How can I recover from this situation?

Peter
[Linux-ha-dev] What happened to rsc_state?
Hello,

it seems that in 2.0.5 the attribute rsc_state to lrm_rsc_op has
disappeared, and has been replaced by rc_code and op_status. But it is
not the same. In order to remove errors in the cib, so that resources are
started again, or nodes can take over again, I used to do something like
this: Search in "
[Linux-ha-dev] Error in debian package build
Hello,

while trying to dpkg-buildpackage this error appears:

dh_movefiles: debian/tmp/usr/lib/libcib.so.1.0.0 not found (supposed to put it in heartbeat-2)
dh_movefiles: debian/tmp/usr/lib/libcrmcommon.so.1.0.0 not found (supposed to put it in heartbeat-2)
dh_movefiles: debian/tmp/usr/lib/libpengine.so.1.0.0 not found (supposed to put it in heartbeat-2)
make: *** [install-stamp] Error 1

The files in question really do not exist. This happened after a fresh
checkout and a "cvs up -r STABLE_2_0_4" on a sarge system. Should they be
removed from debian/heartbeat-2.files?

Peter
Re: [Linux-ha-dev] File descriptor left open
Hello,

Alan Robertson wrote:
> Do you have any idea where this message is coming from?

Hm, no; they are from lrmd? When I started v2.0.3 yesterday these
messages came up:

Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 3 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 4 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 5 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 6 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 7 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 8 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 9 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 10 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 12 left open

So it's from a start action on the raid agent. And on start there is no
process left when the action is done. In fact what it does is some "mdadm
--assemble", and this process terminates. In theory, where could these
messages come from? Maybe I missed some guideline for writing the RAs.

Peter
Re: [Linux-ha-dev] File descriptor left open
Hello all,

I'm still getting these messages in my syslog in v2.0.3. Maybe I missed
something, but I'm quite lost what to do about this. I mean, the only way
for a script to leave a file descriptor open is by having started a
process in the background and not redirecting its output/stderr/input to
/dev/null. In other words, how can a file descriptor be left open if
there is no process attached to it? Or, the other way around: if the
script finishes and all processes started by it have also terminated,
there cannot be any fds left open??? Could all this be related to Bug
#756 which is still open?

Peter

Peter Kruse wrote:
> Hello,
>
> In my logs I get messages like this:
>
> Feb 7 18:23:57 ha-test-1 lrmd: [2000]: info: RA output:
> (rg1:fpbs1:start:stderr) File descriptor 3 left open File descriptor 4
> left open File descriptor 5 left open File descriptor 6 left open File
> descriptor 7 left open File descriptor 8 left open File descriptor 10
> left open
>
> Now I have two questions:
>
> 1. The message indicates that I have to make sure that all open files
> are closed. Would it be enough to do this in the bash scripts:
>
> exec < /dev/null > /dev/null 2>&1
>
> Or would it be okay, when starting a process, to just do:
>
> process < /dev/null > /dev/null 2>&1 &
>
> 2. If I don't make sure all file descriptors are closed, will the open
> files then persist until there are "too many open files"? Could it be
> that this message is a result:
>
> crm_attribute: [32702]: ERROR: socket_client_channel_new: socket: Too many open files
[Linux-ha-dev] File descriptor left open
Hello,

In my logs I get messages like this:

Feb 7 18:23:57 ha-test-1 lrmd: [2000]: info: RA output: (rg1:fpbs1:start:stderr) File descriptor 3 left open
File descriptor 4 left open
File descriptor 5 left open
File descriptor 6 left open
File descriptor 7 left open
File descriptor 8 left open
File descriptor 10 left open

Now I have two questions:

1. The message indicates that I have to make sure that all open files are
closed. Would it be enough to do this in the bash scripts:

exec < /dev/null > /dev/null 2>&1

Or would it be okay, when starting a process, to just do:

process < /dev/null > /dev/null 2>&1 &

2. If I don't make sure all file descriptors are closed, will the open
files then persist until there are "too many open files"? Could it be
that this message is a result:

crm_attribute: [32702]: ERROR: socket_client_channel_new: socket: Too many open files

TIA,

Peter
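The two options from question 1 can be sketched side by side. Note that a plain "exec </dev/null >/dev/null 2>&1" only covers fds 0-2, so any higher inherited descriptor has to be closed explicitly as well. "worker" below is a made-up stand-in for the real background process, not anything from heartbeat.

```shell
# Hedged sketch: detach a background process from the RA's descriptors.
worker() { sleep 0.2; }

exec 3</dev/null                       # simulate an inherited extra fd

# Option 2 from the question: redirect just the background process's
# stdio, and close the extra fd for it too so nothing stays attached.
worker </dev/null >/dev/null 2>&1 3<&- &
bgpid=$!

exec 3<&-                              # close it in the script as well
wait "$bgpid"
echo "background process finished, rc=$?"
```

The per-process form (option 2) is the more targeted fix, since it detaches only the long-lived child without touching the RA's own stdio, which lrmd may still want to read.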
Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from
Good Morning,

Huang Zhen wrote:
> It looks that the code deems the HA_CCMUID as group id and HA_APIGID
> as user id.

Right, I just stumbled across that problem, too. The error message is:

ERROR: mask(io.c:readCibXmlFile): /var/lib/heartbeat/crm/cib.xml must be owned and read/writeable by user 17, or owned and read/writable by group 65

But there is neither a user id 17 nor a group id 65 on this system...
Even doing chmod -R a+w * on /var/lib/heartbeat/crm doesn't help.

Regards,

Peter
Re: [Linux-ha-dev] Tracking 2.0.3 release
Good morning,

Lars Marowsky-Bree wrote:
> On 2006-01-20T12:37:10, Peter Kruse <[EMAIL PROTECTED]> wrote:
>
> OK, this we'll eventually provide again. (ipfail)

Except that ipfail relies on an external address, but I don't understand
why the failure of an external address should cause a failover, even if
you use multiple addresses to ping.

> However, that's pretty close to how we eventually want to support this.
> If you already have the ifmonitord written, it'd be a small step for
> you to actually feed this into the CIB as dampened node attributes,
> right (instead of doing it within the resource agent)? And then we
> could handle this internally, and you claim to have contributed a major
> feature to heartbeat 2.0.x! ;-)

Sure, I would love that. But it's written in bash, and uses our own
scripting library and ... you know ... "it works for us" ... meaning we
probably won't have the resources to support it. If you want to have a
look at it however, I can send it to you; there are some ideas we took
from the FailSafe agents that you will recognize.

> Uhm, that is already supposed to exist within the CRM, if you set a
> resource to unmanaged. We probably need an in-between state of "not
> monitored" (or monitor failures ignored) instead of completely
> unmanaged though.

That's what I thought, too. If you set a resource group to unmanaged, the
monitor actions are still called and failures are still recognized. But
I'm not sure if it causes a failover.

>> 3. you can set the maximum number of restarts before a real
>> failover occurs, this is also stored in the cib.
>
> This _definitely_ belongs into a generic feature within the CRM.
> Handling it within the RA is not the right place. We have an AI for it,
> ETA is 2.0.3 or 2.0.4 (Andrew?).
> _If_ you're handling it within the RA, there's no point in storing it
> within the CIB. That's a waste, because the CIB sync is pretty
> expensive.
>
> Set an instance parameter (which you'll then get within your
> environment of course) and keep track of the number of local restarts
> within a file under ${HA_RSCTMP} (that gets cleaned out on reboots).

Hm, ... yes, that's an idea; I don't know why I thought it had to be
stored in the cluster database. I will probably change that, thanks.

Peter
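The ${HA_RSCTMP} suggestion above could be sketched roughly like this. The resource name, file name, and restart threshold are all illustrative, not actual heartbeat code; the point is only that the counter lives in a tmpfs-style directory that is cleared on reboot, instead of being synced through the CIB.

```shell
# Hedged sketch: count local restarts in a file under ${HA_RSCTMP}.
HA_RSCTMP=${HA_RSCTMP:-/tmp}           # real RAs get this from heartbeat
max_restarts=3                         # would come from an instance param
countfile="$HA_RSCTMP/demo_resource.restarts"

n=$(cat "$countfile" 2>/dev/null || echo 0)
n=$((n + 1))
echo "$n" >"$countfile"

if [ "$n" -gt "$max_restarts" ]; then
    echo "local restart limit reached after $n attempts"
else
    echo "local restart $n of $max_restarts"
fi
rm -f "$countfile"                     # clean up again for this demo
```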
Re: [Linux-ha-dev] Tracking 2.0.3 release
Hello,

Lars Marowsky-Bree wrote:
> On 2006-01-20T10:03:53, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
>>> Woah, what are you calling crm_attribute for all the time?
>>
>> Its either an ipfail replacement or his way of getting resources to
>> run on the node where they've failed the least... I forget which.

There are three usages:

1. There is an ifmonitord that monitors all network interfaces in the
cluster and writes the current status to the cib, so it is available to
all nodes. When a network interface fails (link goes down), before I
return an error and cause a failover, I check if the other node has a
link status of "up" for the specified interface. This is obviously
necessary before it can take over. If the status is not "up", no error is
returned. This is for clusters that are not fully redundant, to minimize
the risk of a false alarm.

2. You can set a resource in maintenance mode, which prevents the monitor
action from returning an error. This variable is also stored in the cib,
so the RAs have to check it every monitor interval.

3. You can set the maximum number of restarts before a real failover
occurs; this is also stored in the cib.

Regards,

Peter

> We definitely need to include both features ourselves into 2.0.4.
> Despite bugfixing and some RAs, these would be about the only real new
> stuff I'd like to see there... (And the good thing is that it's
> probably much the same mechanism.)
>
> If 2.0.3 is delayed more, feel free to start writing a design & coding
> it up already ;-)
>
>>> Yeah, we know, logging needs tuning. This one probably needs to be
>>> tuned down.
>>
>> Nod. Not logging read-only CIB calls wouldn't affect me too much.
>
> Yeah, it's this kind of feedback we need to really understand what we
> need to log, so it's all well.
>
> A regression test which just pounds the CIB with queries from several
> clients in parallel however seems a good idea. Andrew, if you're bored,
> how about such a testcase? (We could add it to BSC, or at least run it
> on demand there.)
>
>> Except it takes 24 hours of such pounding to trigger it... not really
>> feasible for CTS.
>
> Right, which is why I suggested a stand-alone CIB pounding which we can
> leave running somewhere for a couple of hours-days. I expect that if we
> really pound it from say 2-8 clients at once continuously/randomly the
> bug might surface faster than 24h ;-)
Re: [Linux-ha-dev] Tracking 2.0.3 release
Hello,

Andrew Beekhof wrote:
> On Jan 19, 2006, at 12:54 PM, Lars Marowsky-Bree wrote:
>
>> - #1037: lrmd reports TIMEOUT althogh RA was never called
>>
>> This looks fairly obscure. I've asked for a clarification with CVS
>> HEAD, because we've seen so many changes it's hard to say whether it's
>> still an issue.
>>
>> I'm ignoring everything below critical for now; those are, by
>> definition, not release critical, even though they may be major
>> annoyances. But I think we need to roll this out _now_. I think we
>> should, if we decide to give it a thumbs up, be able to roll this out
>> after a weekend of test cycles.
>
> I'd second that. I've been running a few thousand tests in the last
> week and its been pretty damn stable.

The problem I observe only manifested after the resources had been online
for about 24h, with one or two resource groups with some resources
defined in them. So I'm not sure that the tests you run really are "real
life", so to say. The resource agents really put some stress on the cib,
as they run crm_attribute on every monitor action; that's about 10 RAs
calling crm_attribute every 30 seconds. This results in the message
"Processing cib_query operation from ..." occurring in syslog about every
second. I have two installations running a CVS revision from 18.1.2005,
running until now without problems - knock on wood.

Please let me stress this further: your tests are important to see if
your code is reliable. Unfortunately they don't seem to be enough. I
don't want to get into the discussion of "you cannot test everything";
that is granted. But it seems it would be good to run tests with more
real-life examples - and those for a longer period of time. If you have
the resources to set up a cluster with two physical machines and define
resource groups with - well, why not - all possible resources (nfs,
samba, drbd, ...) please do so.

> The only possible reason (from the CRM-side) to delay a release is if
> we can find a root to Peter's CIB problems.

Yes, please. From my own experience I rather prefer not to consider
problems I cannot reproduce, sure. And I don't expect you to take
responsibility for resource agents not written by you. But believe me, on
_every_ installation we made so far we had the same problem - that is,
lrmd reporting a timeout on one resource agent and heartbeat not being
able to recover, which is ... well - bad. So far I have tracked down the
problem to one of the crm_attribute calls taking too much time at one
point. As I'm not a coder, it's not easy for me to understand the details
of heartbeat, but I'm willing to, and going to, help make heartbeat _the_
opensource HA software available.

Thanks,

Peter
Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)
Hello,

> Anyway, I have not tested it yet, so I am not sure if it's really the
> fix for your issue. Could you please test it and post the result to the
> mailing list? TIA!

Yes, the problem is gone; there are no more messages like that in syslog.
Great!

Peter
Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)
Hi,

Francis Montagnac wrote:
> I think it would be better to only reset SIGPIPE to SIG_DFL (perhaps
> also other signals) in the LRM just before exec'ing any external (ie:
> not pertaining to heartbeat itself) commands like the RAs.

Is that hard to do? Or has somebody already done so? Should I create a
bug report?

Peter
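For background on why the reset has to happen in the LRM itself rather than in the RA: a non-interactive shell cannot un-ignore a signal that was already ignored when it started ("trap - PIPE" is silently a no-op in that case), so an RA script has no way to restore SIGPIPE on its own. A small self-contained sketch (GNU userland assumed for the exact "Broken pipe" error text):

```shell
# Sketch: once SIGPIPE is ignored in a parent, a child shell cannot
# reset it, so a pipeline writer reports EPIPE instead of dying
# silently. Hence the SIG_DFL reset must happen before exec'ing the RA.
trap '' PIPE                       # parent ignores SIGPIPE
msg=$(bash -c 'trap - PIPE; yes | grep -q y' 2>&1)
case "$msg" in
    *"Broken pipe"*) echo "writer saw EPIPE despite trap - PIPE" ;;
    *)               echo "unexpected: $msg" ;;
esac
```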
Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)
Hello again,

Peter Kruse wrote:
> Xun Sun wrote:
>> On 1/12/06, Peter Kruse <[EMAIL PROTECTED]> wrote:
>>> ( exportfs ; cat /proc/fs/nfs/exports ) | grep -q "^${export_dir}[ ]"
>>
>> I guess it's a shell specific behavior. If you are using Bash, I would
>> suggest removing the subshell construct.
>
> But using a pipe already creates a subshell, doesn't it? But if I
> replace the above with this:
>
> { ${exportfs} ; cat /proc/fs/nfs/exports ; } | grep -q "^${export_dir}[ ]"
>
> I still get the same error. Only if I do it like this:
>
> cat /proc/fs/nfs/exports | grep -q "^${export_dir}[ ]"
>
> does the error go away.

The same error with these:

grep -q "^${export_dir}[ ]" < <( $exportfs ; cat /proc/fs/nfs/exports )

and

show_exports()
{
    $exportfs
    cat /proc/fs/nfs/exports
}

show_exports | grep -q "^${export_dir}[ ]"

why?
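A plausible explanation (not stated in the thread itself): grep -q exits as soon as it finds a match, so whatever is still writing into the pipe receives SIGPIPE. With SIGPIPE at its default disposition the writer just dies silently, but if the RA inherited an ignored SIGPIPE from its parent, the writer instead sees a plain EPIPE write error, which is exactly the "cat: write error: Broken pipe" in the logs. A minimal reproduction that needs no NFS setup:

```shell
# Sketch: grep -q exits on the first match, so a writer that keeps
# writing into the pipe gets EPIPE once grep is gone.

# With SIGPIPE at its default, the writer dies silently:
seq 1 1000000 | grep -q 1
echo "default disposition: pipeline rc=$?"

# With SIGPIPE ignored (as lrmd-spawned processes may inherit it),
# the writer reports a write error instead, as seen in the logs:
trap '' PIPE
msg=$( (seq 1 1000000 | grep -q 1) 2>&1 )
echo "ignored disposition: $msg"
```

This also fits the observation above that only the multi-command variants fail: with `cat file | grep -q ...` alone, cat usually finishes before grep exits, so there is nothing left to write into the closed pipe.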
Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)
Hello,

Xun Sun wrote:
>> Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) cat:
>> Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) write error
>> Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) : Broken pipe
>
> BTW, aren't the three log messages more reasonable as a single message?
> i.e. "cat: write error: Broken pipe"

Yes, I don't know why or what splits it...
Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)
Hi,

Xun Sun wrote:
> On 1/12/06, Peter Kruse <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> In one of my RAs there is a line like this:
>>
>> ( exportfs ; cat /proc/fs/nfs/exports ) | grep -q "^${export_dir}[ ]"
>
> I guess it's a shell specific behavior. If you are using Bash, I would
> suggest removing the subshell construct.

But using a pipe already creates a subshell, doesn't it? But if I replace
the above with this:

{ ${exportfs} ; cat /proc/fs/nfs/exports ; } | grep -q "^${export_dir}[ ]"

I still get the same error. Only if I do it like this:

cat /proc/fs/nfs/exports | grep -q "^${export_dir}[ ]"

does the error go away. Are there known issues with bash concerning this?

Peter
[Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)
Hello,

In one of my RAs there is a line like this:

( exportfs ; cat /proc/fs/nfs/exports ) | grep -q "^${export_dir}[ ]"

This line apparently produces these errors:

Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) cat:
Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) write error
Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) : Broken pipe

Can anybody give me a hint what I am doing wrong with the above?

Thanks,

Peter