[Linux-ha-dev] Re: [Linux-HA] APC SNMP STONITH

2007-09-18 Thread Peter Kruse

Hello,

Philip Gwyn wrote:

As discussed earlier, I'm writing a new SNMP STONITH plugin.  The goal is for
it to seamlessly work with the new and old MIBs (AP9606 vs AP7900).


Ok, the old apcmastersnmp needed work, right.



Instead of fixing the current apcmastersnmp.c, I started over from stratch,
very roughly basing it on the net-snmp tutorial.


One thing that bothered me in the old apcmastersnmp was that
the OIDs could not be configured; they were hardcoded
as #defines.  Would it be possible to change that?
(I know, configuration files end in .c.)



So far, I have a small library that will:
  - query the PDU
  - detect which MIB to use
  - find the necessary outlet
  - turn the outlet on (or off)
  - query the PDU until the outlet reaches that state (or timeout)

   http://www.awale.qc.ca/ha-linux/apc-snmp/
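For illustration, the MIB-detection step might look like this with the
net-snmp command-line tools (a sketch only; host and community are
placeholders, and the two OID_IDENT values are the ones quoted later in
this thread):

#!/bin/sh
# Probe the PDU to decide which APC MIB to use (sketch).
PDU=apc-1; COMMUNITY=public
NEW_IDENT=.1.3.6.1.4.1.318.1.1.12.1.5.0   # OID_IDENT, v2/v3 firmware MIB
OLD_IDENT=.1.3.6.1.4.1.318.1.1.4.1.4.0    # OID_IDENT, old AP9606-era MIB

if snmpget -v1 -c "$COMMUNITY" "$PDU" "$NEW_IDENT" >/dev/null 2>&1; then
    echo "new MIB"
elif snmpget -v1 -c "$COMMUNITY" "$PDU" "$OLD_IDENT" >/dev/null 2>&1; then
    echo "old MIB"
else
    echo "no known APC MIB answered" >&2; exit 1
fi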

Tomorrow I'm going to go over apcmastersnmp.c again to see if there are some
gotchas that I might have missed.  However, it does a reset (not a turn-off),
so I don't know how useful that is.


Why don't you use the reset as well?  That is a feature of the PDU
that lets you configure the delay between off and on.  I really
think you should stick to that.  It has worked like this for
some years now, and there was no problem at all with it.
I cannot follow your argument that a server which needs to be reset
because there is a problem with it should therefore not start
automatically.  If the server boots after
a reset, what harm can it do?  And you can change that
behaviour in the BIOS.  If the Heartbeat project
thinks about replacing the current apcmastersnmp with
yours, it should be as compatible as possible.
(Is Heartbeat going to replace the plugin with this one?)

What firmware did you test your plugin with?
Please have a look at this thread:
http://lists.community.tummy.com/pipermail/linux-ha-dev/2007-April/014240.html

Make sure your OIDs are valid for firmware versions 2.x and 3.x.

Regards,

Peter


Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-08-20 Thread Peter Kruse

Hello all,

Why did the included patch fail the requirements for inclusion in
heartbeat?

The message below is about 4 months old.

Thanks,

Peter

Peter Kruse wrote:

Hi!

Alan Robertson wrote:


Could you make a patch-format file for this, and send it to the list as
an ASCII attachment?



attached.
BTW, the error message you get when you try to stonith with the wrong
apcmastersnmp.so is somewhat misleading:

# stonith -t apcmastersnmp -p "apc-1 161 write-community" outlet1
stonith: Invalid config info for apcmastersnmp device
stonith: Config info syntax: hostname/ip-address port community
The hostname/IP-address, SNMP port and community string are white-space 
delimited.


Peter





--- apcmastersnmp.c.orig	2007-04-04 09:03:58.000000000 +0200
+++ apcmastersnmp.c	2007-04-04 09:05:24.000000000 +0200
@@ -137,12 +137,12 @@
 #define OUTLET_NO_CMD_PEND 2
 
 /* oids */
-#define OID_IDENT                  ".1.3.6.1.4.1.318.1.1.4.1.4.0"
-#define OID_NUM_OUTLETS            ".1.3.6.1.4.1.318.1.1.4.4.1.0"
-#define OID_OUTLET_NAMES           ".1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i"
-#define OID_OUTLET_STATE           ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i"
-#define OID_OUTLET_COMMAND_PENDING ".1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i"
-#define OID_OUTLET_REBOOT_DURATION ".1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i"
+#define OID_IDENT                  ".1.3.6.1.4.1.318.1.1.12.1.5.0"
+#define OID_NUM_OUTLETS            ".1.3.6.1.4.1.318.1.1.12.1.8.0"
+#define OID_OUTLET_NAMES           ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i"
+#define OID_OUTLET_STATE           ".1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i"
+#define OID_OUTLET_COMMAND_PENDING ".1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i"
+#define OID_OUTLET_REBOOT_DURATION ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i"
 
 /* own defines */
 #define MAX_STRING 128





Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-04-05 Thread Peter Kruse

Hello,

Alan Robertson wrote:

Dave Blaschke wrote:

Also, is there some way to determine what firmware is on the APC and
then pass the appropriate OID_ constant?  This plugin must work for some
folks (at least the original author anyway ;-) so these changes would
probably break folks who are happy with their v1 APC, or is that not an
issue?


Hm, I don't know about v1, but as I said, the OIDs I posted are compatible
with v2, which somehow indicates that the original OIDs in
apcmastersnmp.c were wrong in the first place...
Let me stress this again: the OIDs I posted work for v2 and v3,
so I don't think there is any need to check the firmware version,
except for version 1, which I cannot test...



I'm sure there is a way to read it via SNMP.



Yes, use ...318.1.1.12.1.3.0; for example:

.318.1.1.12.1.3.0 = "v2.7.4"

and also for v3:

.318.1.1.12.1.3.0 = "v3.3.3"
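For reference, reading that value with the net-snmp tools would look
something like this (a sketch; host and community are placeholders):

# Read the APC firmware revision (value-only output):
snmpget -v1 -c public -Oqv apc-1 .1.3.6.1.4.1.318.1.1.12.1.3.0
# expected output along the lines of:  v2.7.4  or  v3.3.3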

cheers,

Peter


Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-04-04 Thread Peter Kruse

Hi Dave,

Dave Blaschke wrote:
I cannot find the "Config info syntax:" message in the latest or any of 
the most recent 2.0.x code - what version of heartbeat are you using?  


Oops, yes, that was an old version, but that doesn't make a difference
concerning the OIDs.

Regardless, you should get a more meaningful message by parsing the logs 
- or try adding the -d option.  Any config file error results in 
"Invalid config info..." including being unable to establish an SNMP 
session...


That was my point.

Peter


Re: [Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-04-04 Thread Peter Kruse

Hi!

Alan Robertson wrote:


Could you make a patch-format file for this, and send it to the list as
an ASCII attachment?



attached.
BTW, the error message you get when you try to stonith with the wrong
apcmastersnmp.so is somewhat misleading:

# stonith -t apcmastersnmp -p "apc-1 161 write-community" outlet1
stonith: Invalid config info for apcmastersnmp device
stonith: Config info syntax: hostname/ip-address port community
The hostname/IP-address, SNMP port and community string are white-space 
delimited.


Peter


--- apcmastersnmp.c.orig	2007-04-04 09:03:58.000000000 +0200
+++ apcmastersnmp.c	2007-04-04 09:05:24.000000000 +0200
@@ -137,12 +137,12 @@
 #define OUTLET_NO_CMD_PEND 2
 
 /* oids */
-#define OID_IDENT                  ".1.3.6.1.4.1.318.1.1.4.1.4.0"
-#define OID_NUM_OUTLETS            ".1.3.6.1.4.1.318.1.1.4.4.1.0"
-#define OID_OUTLET_NAMES           ".1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i"
-#define OID_OUTLET_STATE           ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i"
-#define OID_OUTLET_COMMAND_PENDING ".1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i"
-#define OID_OUTLET_REBOOT_DURATION ".1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i"
+#define OID_IDENT                  ".1.3.6.1.4.1.318.1.1.12.1.5.0"
+#define OID_NUM_OUTLETS            ".1.3.6.1.4.1.318.1.1.12.1.8.0"
+#define OID_OUTLET_NAMES           ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i"
+#define OID_OUTLET_STATE           ".1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i"
+#define OID_OUTLET_COMMAND_PENDING ".1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i"
+#define OID_OUTLET_REBOOT_DURATION ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i"
 
 /* own defines */
 #define MAX_STRING 128


[Linux-ha-dev] new apc firmware breaks apcmastersnmp.so

2007-04-03 Thread Peter Kruse

Hello,

with the v3 firmware of APC's PDUs (models AP7920 and AP7921 at least),
the apcmastersnmp.so plugin to stonith does not work anymore.
In apcmastersnmp.c there is:

#define OID_IDENT                  ".1.3.6.1.4.1.318.1.1.4.1.4.0"
#define OID_NUM_OUTLETS            ".1.3.6.1.4.1.318.1.1.4.4.1.0"
#define OID_OUTLET_NAMES           ".1.3.6.1.4.1.318.1.1.4.5.2.1.3.%i"
#define OID_OUTLET_STATE           ".1.3.6.1.4.1.318.1.1.4.4.2.1.3.%i"
#define OID_OUTLET_COMMAND_PENDING ".1.3.6.1.4.1.318.1.1.4.4.2.1.2.%i"
#define OID_OUTLET_REBOOT_DURATION ".1.3.6.1.4.1.318.1.1.4.5.2.1.5.%i"

which should be replaced by:

#define OID_IDENT                  ".1.3.6.1.4.1.318.1.1.12.1.5.0"
#define OID_NUM_OUTLETS            ".1.3.6.1.4.1.318.1.1.12.1.8.0"
#define OID_OUTLET_NAMES           ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.2.%i"
#define OID_OUTLET_STATE           ".1.3.6.1.4.1.318.1.1.12.3.3.1.1.4.%i"
#define OID_OUTLET_COMMAND_PENDING ".1.3.6.1.4.1.318.1.1.12.3.5.1.1.5.%i"
#define OID_OUTLET_REBOOT_DURATION ".1.3.6.1.4.1.318.1.1.12.3.4.1.1.6.%i"

My tests have shown that these OIDs are also backward compatible with
v2 firmware (AOS 2.7.1 / RPDU 2.7.4).
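If you want to confirm the replacement OIDs against your own firmware
before rebuilding the plugin, checks along these lines should answer
(a sketch; host and community are placeholders):

snmpget  -v1 -c public apc-1 .1.3.6.1.4.1.318.1.1.12.1.5.0      # OID_IDENT
snmpget  -v1 -c public apc-1 .1.3.6.1.4.1.318.1.1.12.1.8.0      # OID_NUM_OUTLETS
snmpwalk -v1 -c public apc-1 .1.3.6.1.4.1.318.1.1.12.3.4.1.1.2  # outlet names
snmpwalk -v1 -c public apc-1 .1.3.6.1.4.1.318.1.1.12.3.3.1.1.4  # outlet states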

Regards,

Peter



Re: [Linux-ha-dev] What happened to rsc_state?

2006-05-12 Thread Peter Kruse

Hi,

Andrew Beekhof wrote:


I ran ptest and it wants to start fence1:1 and fence2:1

the CRM probably just needs a little poke to rerun the PE.
try: crm_attribute -n last_cleanup -v "`date -r`"


ah!  that did the trick, but I had to use "`date -R`" ;)



I cleaned this up for 2.0.6 earlier this week... the problem is that
-C results in a delete in the status section, which is problematic to
detect reliably (you'll get *way* more false positives than true
hits).

So in .6 crm_resource does the equivalent of the above command
automatically.


Very good, I will add it to my script then.
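Something like this, as a sketch built only from the commands in this
thread (the date flag is the corrected -R from above):

#!/bin/sh
# Clear the failed clone instances, then poke the PE so it reruns.
for rsc in DoFencing_fence1:fence1:1 DoFencing_fence2:fence2:1; do
    crm_resource -C -r "$rsc" -t primitive -H ha-test-1
done
crm_attribute -n last_cleanup -v "$(date -R)"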

Best regards,

Peter


Re: [Linux-ha-dev] What happened to rsc_state?

2006-05-10 Thread Peter Kruse

Hi,

Andrew Beekhof wrote:

On 5/9/06, Peter Kruse <[EMAIL PROTECTED]> wrote:


although cibadmin -Ql -o status does not show the failed resource
anymore.  How can I recover from this situation?



cib contents?


Oh, thanks for reminding me (I should know by now...)
Attached is the output of "cibadmin -Q" from before I ran the commands
and from after I ran the commands (also attached).  crm_mon still
reports this:

Clone Set: DoFencing_fence1
    fence1:0 (stonith:external/apc): Started ha-test-2
    fence1:1 (stonith:external/apc): Stopped
Clone Set: DoFencing_fence2
    fence2:0 (stonith:external/apc): Started ha-test-2
    fence2:1 (stonith:external/apc): Stopped

Although the status should have been cleared.

Regards,

Peter


cibadmin-Q.before.gz
Description: GNU Zip compressed data


cibadmin-Q.after.gz
Description: GNU Zip compressed data
crm_resource -C -r rg2:IPaddr2 -t primitive -H ha-test-1
crm_resource -C -r rg2:IPaddr2 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence1:fence1:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence1:fence1:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence2:fence2:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence2:fence2:1 -t primitive -H ha-test-1
crm_resource -C -r rg1:IPaddr3 -t primitive -H ha-test-1


Re: [Linux-ha-dev] What happened to rsc_state?

2006-05-09 Thread Peter Kruse

Hi,

Andrew Beekhof wrote:


if you want a list of failed resources: crm_mon -1 | grep failed

if you just want the lrm_rsc_op's that failed, look for rc_code != 0
&& rc_code != 7 (where 7 is LSB for "Safely Stopped") in the result of
cibadmin -Ql -o status
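As a rough sketch of that second approach (the attribute names are the
ones quoted above; the exact XML layout may differ between releases):

# List lrm_rsc_op entries whose rc_code is neither 0 nor 7:
cibadmin -Ql -o status \
  | grep '<lrm_rsc_op ' \
  | grep -v 'rc_code="0"' \
  | grep -v 'rc_code="7"'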


Is that also true for fencing resources?  If I disconnect the network
from one node where the powerswitch is attached, crm_mon -1 prints:

Clone Set: DoFencing_fence1
    fence1:0 (stonith:external/apc): Started ha-test-2
    fence1:1 (stonith:external/apc): Stopped
Clone Set: DoFencing_fence2
    fence2:0 (stonith:external/apc): Started ha-test-2
    fence2:1 (stonith:external/apc): Stopped


but with these commands I cannot recover:

crm_resource -C -r DoFencing_fence1:fence1:1 -t primitive -H ha-test-1
crm_resource -C -r DoFencing_fence2:fence2:1 -t primitive -H ha-test-1

although cibadmin -Ql -o status does not show the failed resource
anymore.  How can I recover from this situation?

Peter



[Linux-ha-dev] What happened to rsc_state?

2006-05-09 Thread Peter Kruse

Hello,

it seems that in 2.0.5 the attribute rsc_state of lrm_rsc_op has
disappeared and has been replaced by rc_code and op_status.
But it is not the same.  In order to remove errors in the
cib, so that resources are started again, or nodes can take over
again, I used to do something like this:
Search in "


[Linux-ha-dev] Error in debian package build

2006-03-02 Thread Peter Kruse

Hello,

while trying to dpkg-buildpackage this error appears:

dh_movefiles: debian/tmp/usr/lib/libcib.so.1.0.0 not found (supposed to put it in heartbeat-2)
dh_movefiles: debian/tmp/usr/lib/libcrmcommon.so.1.0.0 not found (supposed to put it in heartbeat-2)
dh_movefiles: debian/tmp/usr/lib/libpengine.so.1.0.0 not found (supposed to put it in heartbeat-2)

make: *** [install-stamp] Error 1

The files in question really do not exist.  This happened
after a fresh checkout and a "cvs up -r STABLE_2_0_4" on
a sarge system.  Should they be removed from debian/heartbeat-2.files?
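If so, a one-liner along these lines would drop them (a sketch; the
paths are the ones from the error output):

# Remove the no-longer-built libraries from the package file list:
sed -i -e '/libcib\.so/d' -e '/libcrmcommon\.so/d' -e '/libpengine\.so/d' \
    debian/heartbeat-2.files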

Peter


Re: [Linux-ha-dev] File descriptor left open

2006-02-14 Thread Peter Kruse

Hello,

Alan Robertson wrote:


Do you have any idea where this message is coming from?



Hm, no - they are from lrmd?  When I started v2.0.3 yesterday,
these messages appeared:

Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 3 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 4 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 5 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 6 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 7 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 8 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 9 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 10 left open
Feb 13 17:08:02 ha-test-1 lrmd: [5296]: info: RA output: (rg1:fraid0:start:stderr) File descriptor 12 left open


So it's from a start action on the raid agent.  And after the start,
there is no process left when the action is done.  In fact, what it
does is some "mdadm --assemble", and this process terminates.  Where,
in theory, could these messages come from?  Maybe I missed some
guideline for writing the RAs.

Peter


Re: [Linux-ha-dev] File descriptor left open

2006-02-13 Thread Peter Kruse

Hello all,

I'm still getting these messages in my syslog in v2.0.3.
Maybe I missed something, but I'm quite lost as to what to do
about this.  I mean, the only way for a script to leave
a file descriptor open is by having started a process
in the background without redirecting its stdin/stdout/stderr
to /dev/null.  In other words, how can a file descriptor
be left open if there is no process attached to it?
Or, the other way around: if the script finishes
and all processes started by it have also terminated,
there cannot be any fds left open, can there?

Could all this be related to Bug #756,
which is still open?

   Peter

Peter Kruse wrote:

Hello,

In my logs I get messages like this:

Feb  7 18:23:57 ha-test-1 lrmd: [2000]: info: RA output: (rg1:fpbs1:start:stderr) Filedescriptor 3 left open File descriptor 4 left open File descriptor 5 left open File descriptor 6 left open File descriptor 7 left open File descriptor 8 left open File descriptor 10 left open


Now I have two questions:

1. The message indicates that I have to make sure that all open files
   are closed.  Would it be enough to do this in the bash scripts:
     exec < /dev/null > /dev/null 2>&1
   Or would it be okay, when starting a process, to just do:
     process < /dev/null > /dev/null 2>&1 &
2. If I don't make sure all file descriptors are closed, will the
   open files then persist until there are "too many open files"?
   Could this message be a result of that:
   crm_attribute: [32702]: ERROR: socket_client_channel_new: socket: Too
   many open files





[Linux-ha-dev] File descriptor left open

2006-02-08 Thread Peter Kruse

Hello,

In my logs I get messages like this:

Feb  7 18:23:57 ha-test-1 lrmd: [2000]: info: RA output: (rg1:fpbs1:start:stderr) Filedescriptor 3 left open File descriptor 4 left open File descriptor 5 left open File descriptor 6 left open File descriptor 7 left open File descriptor 8 left open File descriptor 10 left open


Now I have two questions:

1. The message indicates that I have to make sure that all open files
   are closed.  Would it be enough to do this in the bash scripts:
     exec < /dev/null > /dev/null 2>&1
   Or would it be okay, when starting a process, to just do:
     process < /dev/null > /dev/null 2>&1 &
   (A sketch of this follows below.)
2. If I don't make sure all file descriptors are closed, will the
   open files then persist until there are "too many open files"?
   Could this message be a result of that:
   crm_attribute: [32702]: ERROR: socket_client_channel_new: socket: Too
   many open files
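For illustration, a sketch of what question 1 amounts to in an RA start
action (the mdadm call is the one mentioned elsewhere in this thread,
with an illustrative device name; whether the RA or lrmd is the right
place to close the inherited descriptors is exactly the open question):

#!/bin/bash
# Close the high-numbered descriptors inherited from the parent
# (the ones lrmd complains about), then start the work detached.
for fd in 3 4 5 6 7 8 9 10 11 12; do
    eval "exec $fd>&-"
done
mdadm --assemble /dev/md0 < /dev/null > /dev/null 2>&1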


TIA,

Peter


Re: [Linux-ha-dev] Re: [Linux-ha-cvs] Linux-HA CVS: crm by andrew from

2006-02-05 Thread Peter Kruse
Good Morning,

Huang Zhen wrote:

> It looks like the code treats HA_CCMUID as the group id and HA_APIGID
> as the user id.

Right, I just stumbled across that problem, too.  The error message is:
ERROR: mask(io.c:readCibXmlFile): /var/lib/heartbeat/crm/cib.xml must be
owned and read/writeable by user 17, or owned and read/writable by group 65

But there is neither a user id 17 nor a group id 65 on this system...
Even doing chmod -R a+w * on /var/lib/heartbeat/crm doesn't help.

Regards,

Peter



Re: [Linux-ha-dev] Tracking 2.0.3 release

2006-01-21 Thread Peter Kruse
Good morning,

Lars Marowsky-Bree wrote:

> On 2006-01-20T12:37:10, Peter Kruse <[EMAIL PROTECTED]> wrote:
>
> OK, this we'll eventually provide again. (ipfail)

Except that ipfail relies on an external address, and I don't
understand why the failure of an external address should cause a
failover, even if you use multiple addresses to ping.

> However, that's pretty close to how we eventually want to support this.
> If you already have the ifmonitord written, it'd be a small step for you
> to actually feed this into the CIB as dampened node attributes, right
> (instead of doing it within the resource agent)? And then we could
> handle this internally, and you claim to have contributed a major
> feature to heartbeat 2.0.x! ;-)

Sure, I would love that.  But it's written in bash,
and uses our own scripting library and ... you know ...
"it works for us" ... meaning we probably won't have
the resources to support it.  If you want to have a look
at it, however, I can send it to you; there are some
ideas we took from the Failsafe agents that you will
recognize.

> Uhm, that is already supposed to exist within the CRM, if you set a
> resource to unmanaged. We probably need an in-between state of "not
> monitored" (or monitor failures ignored) instead of completely unmanaged
> though.

That's what I thought, too.  If you set a resource group
to unmanaged, the monitor actions are still called
and failures are still recognized.  But I'm not sure
whether that causes a failover.

>> 3. you can set the maximum number of restarts before a real
>> failover occurs, this is also stored in the cib.
>
> This _definitely_ belongs into a generic feature within the CRM.
> Handling it within the RA is not the right place. We have an AI for it,
> ETA is 2.0.3 or 2.0.4 (Andrew?).
> _If_ you're handling it within the RA, there's no point in storing it
> within the CIB. That's a waste, because the CIB sync is pretty
> expensive.
>
> Set an instance parameter (which you'll then get within your environment
> of course) and keep track of the number of local restarts within a file
> under ${HA_RSCTMP} (that gets cleaned out on reboots).

Hm ... yes, that's an idea; I don't know why I thought it
had to be stored in the cluster database.  That I will
probably change, thanks.
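A minimal sketch of Lars's suggestion - restart count kept in a file
under ${HA_RSCTMP} instead of the CIB (the parameter, file name, and
restart helper are invented; the OCF variable names are assumptions):

# Inside the RA's monitor action, after detecting a local failure (sketch):
max_restarts=${OCF_RESKEY_max_restarts:-3}                  # instance parameter
count_file="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.restarts"

count=$(cat "$count_file" 2>/dev/null || echo 0)
if [ "$count" -ge "$max_restarts" ]; then
    rm -f "$count_file"
    exit 1                       # too many local restarts: report the failure
fi
echo $((count + 1)) > "$count_file"
restart_service                  # hypothetical local restart helper
exit 0                           # swallow the failure this time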

Peter


Re: [Linux-ha-dev] Tracking 2.0.3 release

2006-01-20 Thread Peter Kruse

Hello,

Lars Marowsky-Bree wrote:

On 2006-01-20T10:03:53, Andrew Beekhof <[EMAIL PROTECTED]> wrote:

Woah, what are you calling crm_attribute for all the time?

It's either an ipfail replacement or his way of getting resources to
run on the node where they've failed the least... I forget which.

There are three usages (a sketch of the first follows below):
1. There is an ifmonitord that monitors all network interfaces in the
cluster and writes the current status to the cib, so it is available to
all nodes.  When a network interface fails (link goes down), before I
return an error and cause a failover, I check whether the other node
has a link status of "up" for the specified interface.  This is
obviously necessary before it can take over.  If the status is not
"up", no error is returned.  This is for clusters that are not fully
redundant, to minimize the risk of a false alarm.
2. You can set a resource into maintenance mode, which prevents
the monitor action from returning an error.  This variable is also
stored in the cib, so the RAs have to check it on every monitor
interval.
3. You can set the maximum number of restarts before a real
failover occurs; this is also stored in the cib.
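As a rough illustration of usage 1, the check before returning an error
might look like this (the attribute name and node name are invented,
and the exact crm_attribute options in 2.0.x are an assumption):

# Ask the CIB whether the peer's interface is up before escalating:
peer=$(crm_attribute -t status -U ha-test-2 -n link_eth0 -G -q 2>/dev/null)
if [ "$peer" != "up" ]; then
    exit 0    # peer could not take over either; don't raise a false alarm
fi
exit 1        # peer link is up: report the failure and let the CRM fail over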

Regards,

   Peter


We definitely need to include both features ourselves into 2.0.4.

Despite bugfixing and some RAs, these would be about the only real new
stuff I'd like to see there... (And the good thing is that it's probably
much the same mechanism.)

If 2.0.3 is delayed more, feel free to start writing a design & coding
it up already ;-)

Yeah, we know, logging needs tuning. This one probably needs to be
tuned down.

Nod.  Not logging read-only CIB calls wouldn't affect me too much.

Yeah, it's this kind of feedback we need to really understand what we
need to log, so it's all well.

A regression test which just pounds the CIB with queries from several
clients in parallel however seems a good idea. Andrew, if you're
bored, how about such a testcase? (We could add it to BSC, or at
least run it on demand there.)

Except it takes 24 hours of such pounding to trigger it... not really
feasible for CTS.

Right, which is why I suggested a stand-alone CIB pounding which we can
leave running somewhere for a couple of hours or days. I expect that if
we really pound it from, say, 2-8 clients at once continuously/randomly
the bug might surface faster than 24h ;-)

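A throwaway pounder in that spirit might look like this (a sketch,
purely illustrative):

#!/bin/sh
# Hammer the CIB with parallel read-only queries until interrupted.
clients=${1:-8}
for i in $(seq "$clients"); do
    ( while :; do
          cibadmin -Ql >/dev/null 2>&1 || echo "client $i: query failed"
      done ) &
done
wait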


Re: [Linux-ha-dev] Tracking 2.0.3 release

2006-01-19 Thread Peter Kruse
Hello,

Andrew Beekhof wrote:

>
> On Jan 19, 2006, at 12:54 PM, Lars Marowsky-Bree wrote:
>
>> - #1037: lrmd reports TIMEOUT although RA was never called
>>
>> This looks fairly obscure. I've asked for a clarification with CVS 
>> HEAD,
>> because we've seen so many changes it's hard to say whether it's still
>> an issue.
>>
>>
>> I'm ignoring everything below critical for now; those are, by
>> definition, not release critical, even though they may be major
>> annoyances. But I think we need to roll this out _now_. I think we
>> should, if we decide to give it a thumbs up, be able to roll this out
>> after a weekend of test cycles.
>
>
> I'd second that.  I've been running a few thousand tests in the last
> week and it's been pretty damn stable.


The problem I observe only manifested after the resources had been
online for about 24h, with one or two resource groups and some
resources defined in them.  So I'm not sure that the tests you run
really are "real life", so to say.  The resource agents really put
some stress on the cib, as they run crm_attribute on every monitor
action; that's about 10 RAs calling crm_attribute every 30 seconds.
This results in the message "Processing cib_query operation from ..."
occurring in syslog about every second.
I have two installations running a CVS revision from 18.1.2005 until
now without problems - knock on wood.
Please let me stress this further: your tests are important to see
whether your code is reliable.  Unfortunately they don't seem to be
enough.  I don't want to get into the discussion of "you cannot test
everything".  That is granted.
But it seems it would be good to run tests with more "real-life
examples" - and those for a longer period of time.  If you have the
resources to set up a cluster with two physical machines and define
resource groups with - well, why not - all possible resources (nfs,
samba, drbd, ...), please do so.

> The only possible reason (from the CRM side) to delay a release is if
> we can find a root cause for Peter's CIB problems.

Yes, please.  From my own experience, I too would rather not consider
problems I cannot reproduce, sure.  And I don't expect you to take
responsibility for resource agents not written by you.  But believe
me, on _every_ installation we have made so far we had the same
problem - that is, lrmd reporting a timeout on one resource agent and
heartbeat not being able to recover, which is ... well - bad.
So far I have tracked the problem down to one of the crm_attribute
calls taking too much time at one point.

As I'm not a coder, it's not easy for me to understand the details
of heartbeat, but I'm willing to, and going to, help make heartbeat
_the_ open-source HA software.

Thanks,

Peter


Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-16 Thread Peter Kruse

Hello,

Anyway, I have not tested it yet, so I am not sure whether it really
fixes your issue.  Could you please test it and post the result to the
mailing list?  TIA!




Yes, the problem is gone; there are no more messages like that
in syslog.

Great!

Peter


Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-16 Thread Peter Kruse

Hi,

Francis Montagnac wrote:


I think it would be better to only reset SIGPIPE to SIG_DFL (perhaps
also other signals) in the LRM just before exec'ing any external (i.e.
not pertaining to heartbeat itself) commands, like the RAs.



Is that hard to do?  Or has somebody already done so?
Should I create a bug report?

Peter



Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-12 Thread Peter Kruse

Hello again,

Peter Kruse wrote:


Xun Sun wrote:


On 1/12/06, Peter Kruse <[EMAIL PROTECTED]> wrote:



( exportfs ; cat /proc/fs/nfs/exports ) | grep -q "^${export_dir}[   ]"



I guess it's a shell specific behavior. If you are using Bash, I would
suggest removing the subshell construct.




but using a pipe already creates a subshell, doesn't it?
But if I replace above with this:

{ ${exportfs} ; cat /proc/fs/nfs/exports ; } | grep -q "^${export_dir}[ ]"

I still get the same error.  Only if I do it like this:

cat /proc/fs/nfs/exports | grep -q "^${export_dir}[ ]"

the error goes away.



I get the same error with these:

grep -q "^${export_dir}[ ]" < <( $exportfs ; cat /proc/fs/nfs/exports )

and

show_exports() {
    $exportfs
    cat /proc/fs/nfs/exports
}
show_exports | grep -q "^${export_dir}[ ]"

Why?
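A plausible explanation, consistent with Francis Montagnac's SIGPIPE
suggestion elsewhere in this thread: grep -q exits as soon as it finds
a match, so whatever is still writing into the pipe gets EPIPE.  With
the default disposition, SIGPIPE kills the writer silently; but if the
daemon that spawned the RA ignores SIGPIPE, the children inherit that,
and cat survives long enough to print the write error.  A standalone
demo of the difference (independent of heartbeat; the exact error text
varies by implementation):

# SIGPIPE ignored, as it would be when inherited from the spawning daemon:
bash -c 'trap "" PIPE; yes | head -n 1 >/dev/null'
# -> yes: standard output: Broken pipe

# Default disposition: SIGPIPE kills the writer silently:
bash -c 'yes | head -n 1 >/dev/null'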


Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-12 Thread Peter Kruse

Hello,

Xun Sun wrote:


Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) cat:
Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) write error
Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) : Broken pipe



BTW, wouldn't the three log messages more reasonably be a single
message?  I.e. "cat: write error: Broken pipe"



Yes; I don't know why or what splits it...


Re: [Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-12 Thread Peter Kruse

Hi,

Xun Sun wrote:

On 1/12/06, Peter Kruse <[EMAIL PROTECTED]> wrote:


Hello,

In one of my RAs there is a line like this:

( exportfs ; cat /proc/fs/nfs/exports ) | grep -q "^${export_dir}[   ]"



I guess it's a shell specific behavior. If you are using Bash, I would
suggest removing the subshell construct.




but using a pipe already creates a subshell, doesn't it?
But if I replace above with this:

{ ${exportfs} ; cat /proc/fs/nfs/exports ; } | grep -q "^${export_dir}[ ]"

I still get the same error.  Only if I do it like this:

cat /proc/fs/nfs/exports | grep -q "^${export_dir}[ ]"

the error goes away.

Are there known issues with bash concerning this?

Peter


[Linux-ha-dev] problem with some RA (output: cat: write error: Broken pipe)

2006-01-12 Thread Peter Kruse

Hello,

In one of my RAs there is a line like this:

( exportfs ; cat /proc/fs/nfs/exports ) | grep -q "^${export_dir}[   ]"

This line apparently produces these errors:

Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) cat:
Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) write error
Jan 12 13:40:08 ha-test-1 lrmd: [16217]: info: RA output: (rg1:nfs1:monitor:stderr) : Broken pipe


Can anybody give me a hint about what I'm doing wrong with the above?

Thanks,

Peter