[Linux-HA] Antw: Re: Q: How does crm locate RAs?

2011-07-07 Thread Ulrich Windl
 Florian Haas florian.h...@linbit.com wrote on 06.07.2011 at 21:51 in
message 4e14bca8.7070...@linbit.com:
 (Your MUA seems to have injected = characters toward the end of most
 lines. You may want to have a look at fixing that.)
 
 On 07/06/2011 05:40 PM, Ulrich Windl wrote:
  Hi!
  
  As I've written my first OCF RA, I want to install it to test it in the
  real life.
 
 Wonderful. I might add it would be extraordinarily useful if you posted
 it somewhere so people can actually look at the code.

It's version 0.1. Once I have it tested in a cluster and I think it works 
well enough, I'll provide the code. But I have unanswered questions on how to 
install the RA.

Is it true that ocf is detected automatically, and that the provider is the 
first-level subdirectory? Then a question: is it allowed to have a subdirectory 
per RA, like .../provider/agent/agent* ? I implemented my metadata as a 
separate XML file, and now I wonder if the framework will get confused if the 
provider directory contains a non-executable agent.xml file.

 
  I wonder: How does crm locate the existing RAs (crm ra
  classes, crm ra info ocf:IPaddr2)? I found that most RAs are installed
  in /usr/lib/ocf/resource.d/heartbeat, very few in
  /usr/lib/ocf/resource.d/pacemaker/. So where should my RA go?
 
 Into /usr/lib/ocf/resource.d/provider, where provider is essentially
 a directory name of your choosing. The RA may be a custom one that you
 never intend to share; in that case /usr/lib/ocf/resource.d/ulrichwindl/
 would be entirely appropriate.

OK, see question above.

 
 Or you submit the OCF RA to an upstream project, then
 /usr/lib/ocf/resource.d/project/ is fine. That is what the OCF RA for
 the RabbitMQ server does; it installs into
 /usr/lib/ocf/resource.d/rabbitmq/.
 
 Or you choose to submit the RA to the linux-ha project, then (for
 historical reasons) the preferred location would be
 /usr/lib/ocf/resource.d/heartbeat/, at least for the time being. We
 will have a unified namespace for the Linux-HA and Red Hat agents some
 stretch down the road.
 
 This, btw, is explained in
 http://linux-ha.org/doc/dev-guides/_installing_resource_agents.html.

Yes, for most parts.

 
  I'm not good in reading Python code, but it seems crm uses lrmadmin to
  build the list. Unfortunately the manual page for lrmadmin is very poor:
 
 I am sure Dejan will be thrilled to accept patches for lrmadmin man
 page. I for my part find it quite sufficient, and the RA dev guide is
 there for reference purposes about where to install resource agents.

Naturally you could patch (i.e. write) the man page once you know how the 
program works. Without a man page you'll have to read the sources, I'm afraid. 
Having to read the sources of a program to be able to use it is not the best 
choice IMHO.

 
  Also for updates, should the RA be versioned?
 
 Yes, that is why the metadata has a version field. Thanks for
 highlighting the fact that this is not immediately evident from the dev
 guide; I'll fix that.

I meant this: maybe you have an RA found in a subdirectory named RA-0.1 
(talking about filenames), and maybe you have a symlink RA -> RA-0.1. Now you 
could configure RA and rely on the fact that all future versions will be 
compatible, or you could configure RA-0.1 to use a specific version. Now when 
there is an update to RA-0.2, that update might also change the symlink to 
RA -> RA-0.2.

Is something like that supported, or even recommended?

Regards,
Ulrich


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith-ng reboot returned 1

2011-07-07 Thread Lars Marowsky-Bree
On 2011-07-06T15:06:01, Craig Lesle craig.le...@bruden.com wrote:

 Interesting that st_timeout does not show 75 seconds on any try and looks 
 rather random, like it's calculated.

... right. I hadn't noticed that before.

So what's happening is that, in pacemaker's fencing/remote.c, the
stonith-timeout specified is divided up into 10% for _querying_ the list
of nodes a given stonith device can retrieve, and 90% for then
performing the actual operation. (Compare initiate_remote_stonith_op()
and call_remote_stonith())

I think this is counter-intuitive, to say the least.

In your initial case, it just so happens that 100s * 90% obviously
exactly matches your sbd msgwait, so an increase of +10s just wasn't
enough.
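
To illustrate with round numbers (figures assumed for illustration only):

# stonith-timeout = 100s, sbd msgwait = 90s
#   query phase:     100s * 10% = 10s
#   operation phase: 100s * 90% = 90s   (exactly equals msgwait, no headroom)
# bumping stonith-timeout by 10s only adds 9s to the operation phase:
#   110s * 90% = 99s, which evidently still left too little margin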



Regards,
Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Q: How does crm locate RAs?

2011-07-07 Thread Florian Haas
On 2011-07-07 08:23, Ulrich Windl wrote:
 Florian Haas florian.h...@linbit.com wrote on 06.07.2011 at 21:51 in
 message 4e14bca8.7070...@linbit.com:
 (Your MUA seems to have injected = characters toward the end of most
 lines. You may want to have a look at fixing that.)

 On 07/06/2011 05:40 PM, Ulrich Windl wrote:
 Hi!

 As I've written my first OCF RA, I want to install it to test it in the
 real life.

 Wonderful. I might add it would be extraordinarily useful if you posted
 it somewhere so people can actually look at the code.
 
 It's version 0.1. Once I have it tested in a cluster and I think it works 
 well enough, I'll provide the code.

Just as a general suggestion, it may be wise to rethink that approach.
It's usually a good idea to float your resource agent early and solicit
feedback and ideas for improvement.

 Is it true that ocf is detected automatically, the provider is the 
 first-level subdirectory?

Well, yes. ocf is the resource agent class.

 Then a question: Is it allowed to have a subdirectory per RA, like 
 .../provider/agent/agent* ? I implemented my metadata as a separate XML file, 
 and now I wonder if the framework will get confused if the provider directory 
 contains a non-executable agent.xml file.

No it will not; all non-executable files in that directory are ignored.
You can put your metadata into a separate file and then simply cat
that file from the RA's meta-data action, and this is the approach that
the Red Hat Cluster agents usually take, but within the Linux-HA crowd
that approach hasn't been used, and I personally am not very fond of it.
Still, it's essentially up to you -- there is no best practice
forbidding separate metadata files.
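
If you do go that route, the meta-data action boils down to something
like this (a minimal sketch; the agent name myagent and its metadata
file myagent.xml are made up):

#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_UNIMPLEMENTED=3

case "$1" in
meta-data)
        # emit the metadata kept in the separate, non-executable XML file
        cat "$(dirname "$0")/myagent.xml"
        exit $OCF_SUCCESS
        ;;
start|stop|monitor|validate-all)
        # real actions omitted from this sketch
        exit $OCF_SUCCESS
        ;;
*)
        exit $OCF_ERR_UNIMPLEMENTED
        ;;
esac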

I see no benefit at all, however, in having a separate provider
directory for each and every resource agent.
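
The conventional layout is simply one executable per agent directly in
the provider directory. A rough sketch (agent name myagent and provider
ulrichwindl are made up):

install -d /usr/lib/ocf/resource.d/ulrichwindl
install -m 0755 myagent /usr/lib/ocf/resource.d/ulrichwindl/myagent
install -m 0644 myagent.xml /usr/lib/ocf/resource.d/ulrichwindl/myagent.xml
# verify that the shell picks it up
crm ra list ocf ulrichwindl
crm ra info ocf:ulrichwindl:myagent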

 Or you submit the OCF RA to an upstream project, then
 /usr/lib/ocf/resource.d/project/ is fine. That is what the OCF RA for
 the RabbitMQ server does; it installs into
 /usr/lib/ocf/resource.d/rabbitmq/.

 Or you choose to submit the RA to the linux-ha project, then (for
 historical reasons) the preferred location would be
 /usr/lib/ocf/resource.d/heartbeat/, at least for the time being. We
 will have a unified namespace for the Linux-HA and Red Hat agents some
 stretch down the road.

 This, btw, is explained in
 http://linux-ha.org/doc/dev-guides/_installing_resource_agents.html.
 
 Yes, for most parts.

Suggestions for improvement, and/or documentation patches, are most welcome.


 I'm not good in reading Python code, but it seems crm uses lrmadmin to
 build the list. Unfortunately the manual page for lrmadmin is very poor:

 I am sure Dejan will be thrilled to accept patches for lrmadmin man
 page. I for my part find it quite sufficient, and the RA dev guide is
 there for reference purposes about where to install resource agents.
 
 Naturally you could patch (i.e. write) the man page once you know how the 
 program works. Without a man page you'll have to read the sources, I'm 
 afraid. Having to read the sources of a program to be able to use it is not 
 the best choice IMHO.

Note I wasn't suggesting that you read the source. Reading the
aforementioned section in the dev guide should be enough.

 Also for updates, should the RA be versioned?

 Yes, that is why the metadata has a version field. Thanks for
 highlighting the fact that this is not immediately evident from the dev
 guide; I'll fix that.
 
 I meant this: maybe you have an RA found in a subdirectory named RA-0.1 
 (talking about filenames), and maybe you have a symlink RA -> RA-0.1. Now 
 you could configure RA and rely on the fact that all future versions will 
 be compatible, or you could configure RA-0.1 to use a specific version. Now 
 when there is an update to RA-0.2, that update might also change the 
 symlink to RA -> RA-0.2.

So now you are talking about not having just one subdirectory per RA,
but one per RA and version? Again, I see zero merit in that.

If you are worried about rolling upgrades and changes to your agent,
that problem is essentially moot. Users are generally expected to either
move resources temporarily away from a cluster node which is in the
process of being upgraded, or (in cluster managers where this is
available, such as Pacemaker) place the cluster in maintenance mode when
making software updates.
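
With the crm shell that is roughly (a sketch):

crm configure property maintenance-mode=true
# ... update the resource agent files/packages on each node ...
crm configure property maintenance-mode=false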

Cheers,
Florian



signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

[Linux-HA] iscsi not configured

2011-07-07 Thread spamvoll
hi..

I'm new and setting up my first HA cluster. Corosync and Pacemaker are
running fine and all IPs get moved; drbd works, but iscsi drives me
crazy.

my config:
...
primitive iscsiLUN ocf:heartbeat:iSCSILogicalUnit \
params path=/dev/drbd0
target_iqn=iqn.2011-06.de.my-domain.viki:drbd0 lun=0
primitive iscsiTarget ocf:heartbeat:iSCSITarget \
params iqn=iqn.2011-06.de.my-domain.viki:drbd0
property $id=cib-bootstrap-options \
dc-version=1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
cluster-infrastructure=openais \
expected-quorum-votes=2 \
stonith-enabled=false \
no-quorum-policy=ignore
rsc_defaults $id=rsc-options \
resource-stickiness=100

my /etc/tgt/tgt.conf
<target iqn.2011-06.de.my-domain.viki:drbd0>
backing-store /dev/drbd0
</target>

crm_mon shows:
iscsiLUN_monitor_0 (node=viki-2.my-domain.de, call=13, rc=6,
status=complete): not configured
iscsiTarget_monitor_0 (node=viki-2.my-domain.de, call=12, rc=6,
status=complete): not configured

/var/log/messages
Jul  7 11:44:48 viki-2 pengine: [22330]: WARN: unpack_rsc_op:
Processing failed op iscsi-tgtd_start_0 on viki-2.my-domain.de: not
running (7)
Jul  7 11:44:48 viki-2 attrd: [22329]: info: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-iscsi-tgtd (INFINITY)
Jul  7 11:44:49 viki-2 pengine: [22330]: WARN:
common_apply_stickiness: Forcing iscsi-tgtd away from
viki-1.my-domain.de after 100 failures (max=100)
Jul  7 11:44:49 viki-2 pengine: [22330]: WARN:
common_apply_stickiness: Forcing iscsi-tgtd away from
viki-2.my-domain.de after 100 failures (max=100)
Jul  7 11:44:59 viki-2 attrd: [22329]: info: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-iscsi-tgtd
(1310030469)

any ideas or hints?

thx
Hans Peter
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Forkbomb not initiating failover

2011-07-07 Thread James Smith
Hi,

Summary: Two node cluster running DRBD, IET with a floating IP and stonith 
enabled.

All this works well, I can kernel panic the machine, kill individual PIDs (for 
example IET)
which then invoke failover.  However, when I forkbomb the master, nothing 
happens.
The box is dead, the services stop responding etc, but pacemaker does not 
recognise
this and therefore failover does not occur.

Very occasionally it will fence and invoke failover after several minutes or 
even longer,
which is no good at all.

To me, it seems extremely odd pacemaker itself does not automatically 
incorporate system
health checks that can detect such a scenario.  I've raised this a couple of 
times, but the
suggestion is to run watchdog or create an RA to do resource checking.  
Watchdog certainly
does its job and is easy to configure, but this seems flawed to me.

Regards,
James
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] iscsi not configured

2011-07-07 Thread Michael Schwartzkopff
 hi..
 
 im new and setting up my first HA Cluster. Corosync and Pacemaker
 running fine and all IPs get moved, drbd works but iscsi drives me
 crazy
 
 my config:
 ...
 primitive iscsiLUN ocf:heartbeat:iSCSILogicalUnit \
   params path=/dev/drbd0
 target_iqn=iqn.2011-06.de.my-domain.viki:drbd0 lun=0
 primitive iscsiTarget ocf:heartbeat:iSCSITarget \
   params iqn=iqn.2011-06.de.my-domain.viki:drbd0
 property $id=cib-bootstrap-options \
   dc-version=1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
   cluster-infrastructure=openais \
   expected-quorum-votes=2 \
   stonith-enabled=false \
   no-quorum-policy=ignore
 rsc_defaults $id=rsc-options \
   resource-stickiness=100
 
 my /etc/tgt/tgt.conf
 <target iqn.2011-06.de.my-domain.viki:drbd0>
 backing-store /dev/drbd0
 </target>
 
 crm_mon shows:
 iscsiLUN_monitor_0 (node=viki-2.my-domain.de, call=13, rc=6,
 status=complete): not configured
 iscsiTarget_monitor_0 (node=viki-2.my-domain.de, call=12, rc=6,
 status=complete): not configured
 
 /var/log/messages
 Jul  7 11:44:48 viki-2 pengine: [22330]: WARN: unpack_rsc_op:
 Processing failed op iscsi-tgtd_start_0 on viki-2.my-domain.de: not
 running (7)
 Jul  7 11:44:48 viki-2 attrd: [22329]: info: attrd_trigger_update:
 Sending flush op to all hosts for: fail-count-iscsi-tgtd (INFINITY)
 Jul  7 11:44:49 viki-2 pengine: [22330]: WARN:
 common_apply_stickiness: Forcing iscsi-tgtd away from
 viki-1.my-domain.de after 100 failures (max=100)
 Jul  7 11:44:49 viki-2 pengine: [22330]: WARN:
 common_apply_stickiness: Forcing iscsi-tgtd away from
 viki-2.my-domain.de after 100 failures (max=100)
 Jul  7 11:44:59 viki-2 attrd: [22329]: info: attrd_trigger_update:
 Sending flush op to all hosts for: last-failure-iscsi-tgtd
 (1310030469)

Hi,

nice logs, but the wrong ones. We need to know why the resource agent returns 
the code 6, which means not configured. Please check your tgt config again.

Can you start tgt manually?
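
For example (a rough sketch; the tool and service names are from
scsi-target-utils and may differ on your distribution):

service tgtd start
tgtadm --lld iscsi --mode target --op show   # should list your target and LUNs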

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98


signature.asc
Description: This is a digitally signed message part.
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Forkbomb not initiating failover

2011-07-07 Thread Florian Haas
On 2011-07-07 11:59, James Smith wrote:
 Hi,
 
 Summary: Two node cluster running DRBD, IET with a floating IP and stonith 
 enabled.
 
 All this works well, I can kernel panic the machine, kill individual PIDs 
 (for example IET)
 which then invoke failover.  However, when I forkbomb the master, nothing 
 happens.
 The box is dead, the services stop responding etc, but pacemaker does not 
 recognise
 this and therefore failover does not occur.
 
 Very occasionally it will fence and invoke failover after several minutes or 
 even longer,
 which is no good at all.
 
 To me, it seems extremely odd pacemaker itself does not automatically 
 incorporate system
 health checks that can detect such a scenario.  I've raised this a couple of 
 times, but the
 suggestion is to run watchdog or create an RA to do resource checking.  
 Watchdog certainly
 does its job and is easy to configure, but this seems flawed to me.

Please refer to:

http://www.gossamer-threads.com/lists/linuxha/pacemaker/70081#70081

Cheers,
Florian



signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] iscsi not configured

2011-07-07 Thread Florian Haas
On 2011-07-07 11:48, spamv...@googlemail.com wrote:
 hi..
 
 im new and setting up my first HA Cluster. Corosync and Pacemaker
 running fine and all IPs get moved, drbd works but iscsi drives me
 crazy
 
 my config:
 ...
 primitive iscsiLUN ocf:heartbeat:iSCSILogicalUnit \
   params path=/dev/drbd0
 target_iqn=iqn.2011-06.de.my-domain.viki:drbd0 lun=0
 primitive iscsiTarget ocf:heartbeat:iSCSITarget \
   params iqn=iqn.2011-06.de.my-domain.viki:drbd0
 property $id=cib-bootstrap-options \
   dc-version=1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f \
   cluster-infrastructure=openais \
   expected-quorum-votes=2 \
   stonith-enabled=false \
   no-quorum-policy=ignore
 rsc_defaults $id=rsc-options \
   resource-stickiness=100

You are missing order/colo constraints (or a group). And your iscsiLUN
is misconfigured, you forgot to escape a newline.
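
A corrected fragment might look roughly like this (ms_drbd stands in for
your DRBD master/slave resource, which is not shown in your excerpt;
note the backslash continuations):

primitive iscsiTarget ocf:heartbeat:iSCSITarget \
        params iqn="iqn.2011-06.de.my-domain.viki:drbd0"
primitive iscsiLUN ocf:heartbeat:iSCSILogicalUnit \
        params target_iqn="iqn.2011-06.de.my-domain.viki:drbd0" \
               lun="0" path="/dev/drbd0"
group g_iscsi iscsiTarget iscsiLUN
colocation col_iscsi_on_drbd inf: g_iscsi ms_drbd:Master
order ord_drbd_before_iscsi inf: ms_drbd:promote g_iscsi:start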

Please refer to:

http://www.linbit.com/en/education/tech-guides/highly-available-iscsi-with-drbd-and-pacemaker/

Cheers,
Florian



signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] stonith-ng reboot returned 1

2011-07-07 Thread Craig Lesle

 ... right. I hadn't noticed that before.

 So what's happening is that, in pacemaker's fencing/remote.c, the
 stonith-timeout specified is divided up in 10% for _querying_ the list
 of nodes a given stonith device can retrieve, and 90% for then
 performing an actual operation. (Compare initiate_remote_stonith_op()
 and call_remote_stonith())

 I think this is counter-intuitive, to say the least.

 In your initial case, it just so happens that 100s * 90% obviously
 exactly matches your sbd msgwait, so an increase of +10s just wasn't
 enough.



 Regards,
  Lars

Thanks Lars.

Interesting. It would seem more intuitive for remote.c to add 10% to the 
specified value in order to get its querying overhead accounted for.

Now that I know about the query tax, will verify stonith-timeout is 
set to a value > (sbd msgwait * 110%).
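
For example (numbers purely illustrative):

# e.g. with sbd msgwait=90s: 90s * 110% = 99s, so round up generously
crm configure property stonith-timeout=120s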

Hopefully that little tidbit will make it into the sbd wiki at some point.

Take care,
Craig

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Forkbomb not initiating failover

2011-07-07 Thread James Smith
Hi,

I appreciate that, but it doesn't answer the question.

What I'm getting at is that there are multiple scenarios where a system 
can fail, but in my test scenario I was forcing high load.  My application 
wouldn't, in a working scenario, ever cause this type of load unless there 
was a very serious issue that would warrant failover.  So in this scenario I 
want pacemaker to be able to handle this accordingly, without the 
need to configure additional services entirely separate from the workings of 
pacemaker.

For example, it's easy to assume the monitor operations on the RAs can 
handle this already.  The slave should be initiating a monitor operation 
against the master to see if its services are still responding.  But it seems 
only the master does this, and of course the master is foobared so it never 
responds, so failover never occurs.  Surely I'm not the only one who sees this 
as rather flawed?

Regards,
James

-Original Message-
From: linux-ha-boun...@lists.linux-ha.org 
[mailto:linux-ha-boun...@lists.linux-ha.org] On Behalf Of Florian Haas
Sent: 07 July 2011 11:59
To: General Linux-HA mailing list
Subject: Re: [Linux-HA] Forkbomb not initiating failover

On 2011-07-07 11:59, James Smith wrote:
 Hi,
 
 Summary: Two node cluster running DRBD, IET with a floating IP and stonith 
 enabled.
 
 All this works well, I can kernel panic the machine, kill individual 
 PIDs (for example IET) which then invoke failover.  However, when I forkbomb 
 the master, nothing happens.
 The box is dead, the services stop responding etc, but pacemaker does 
 not recognise this and therefore failover does not occur.
 
 Very occasionally it will fence and invoke failover after several 
 minutes or even longer, which is no good at all.
 
 To me, it seems extremely odd pacemaker itself does not automatically 
 incorporate system health checks that can detect such a scenario.  
 I've raised this a couple of times, but the suggestion is to run 
 watchdog or create an RA to do resource checking.  Watchdog certainly does 
 its job and is easy to configure, but this seems flawed to me.

Please refer to:

http://www.gossamer-threads.com/lists/linuxha/pacemaker/70081#70081

Cheers,
Florian

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Forkbomb not initiating failover

2011-07-07 Thread Florian Haas
On 2011-07-07 13:52, James Smith wrote:
 Hi,
 
 I appreciate that, but it doesn't answer the question.

Then maybe I misunderstood the question. I had interpreted it to mean
why doesn't my cluster automatically fail over under high load? --
perhaps you can rephrase to clarify.

 What I'm getting at, is there are multiple scenarios where a system 
 can fail but in my test scenario I was forcing high load.  My application 
 wouldn't, in a working scenario, ever cause this type of load unless there 
 was a very serious issue that would warrant failover.

Er, how can you be so sure? What if you just had a ton of users (or
client services) hammering your application? That would cause high
load, but it would clearly _not_ warrant failover -- since after you
fail over, the other node would be hammered just as much.

 So in this scenario I 
 want pacemaker to be able to handle this accordingly without the 
 need to configure additional services entirely separate to the working of 
 pacemaker.

Now please define how exactly Pacemaker would be handling this
accordingly.

 For example, it's easy to assume the monitor operations on the RA's can 
 handle this already.  The slave should be initiating a monitor operation 
 against 
 the master to see if it's services are still responding.

I'm afraid you're missing the fact that in Pacemaker a slave does not
initiate a monitor operation against the master; what makes you think
that it does? Monitor operations are always run locally. It is only very
few resource agents that are configurable as master/slave sets. _Some_
of those can be configured to have a slave contact a master during
monitoring (like ocf:heartbeat:mysql), some never do (like ocf:linbit:drbd).
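
For what it's worth, per-role monitor operations look roughly like this
(resource names made up; the intervals must differ between the roles):

primitive p_mysql ocf:heartbeat:mysql \
        op monitor interval="30s" role="Slave" \
        op monitor interval="20s" role="Master"
ms ms_mysql p_mysql \
        meta master-max="1" clone-max="2" notify="true"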

 But it seems only the 
 master does this,

No. All nodes do.

 but of course the master is foobared so never responds, 
 so failover never occurs.  Surely I'm not the only one that sees this as 
 rather 
 flawed?

So what would your preferred behavior be? Pacemaker failing over in case
load is high? That's a possibility and could be done via the system
health feature and an appropriate resource agent, but even if that
happens, you stand a pretty good chance -- even though I realize you
don't believe this -- that it is your application that causes this high
load, and then failover makes matters worse, not better.
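
If you want to experiment with that anyway, a sketch could look like
this (assuming the ocf:pacemaker:HealthCPU agent is available in your
build; the limits are idle-CPU percentages and purely illustrative):

primitive p_health_cpu ocf:pacemaker:HealthCPU \
        params yellow_limit="20" red_limit="5" \
        op monitor interval="60s"
clone cl_health_cpu p_health_cpu
property node-health-strategy="migrate-on-red"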

Cheers,
Florian



signature.asc
Description: OpenPGP digital signature
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] stonith-ng reboot returned 1

2011-07-07 Thread Lars Marowsky-Bree
On 2011-07-07T05:40:23, Craig Lesle craig.le...@bruden.com wrote:

 Interesting. It would seem more intuitive for remote.c to add 10% to the 
 specified value in order to get it's querying overhead accounted for.
 
 Now that I know about the query tax, will verify stonith-timeout is 
 set to a value > (sbd-msgwait*110%).
 
 Hopefully that little tidbit will make it in to the sbd wiki at some point.

I'm actually hoping that it'll get fixed, not documented.


Regards,
Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith-ng reboot returned 1

2011-07-07 Thread Andrew Beekhof
On Thu, Jul 7, 2011 at 5:40 PM, Lars Marowsky-Bree l...@suse.de wrote:
 On 2011-07-06T15:06:01, Craig Lesle craig.le...@bruden.com wrote:

 Interesting that st_timeout does not show 75 seconds on any try and looks 
 rather random, like it's calculated.

 ... right. I hadn't noticed that before.

 So what's happening is that, in pacemaker's fencing/remote.c, the
 stonith-timeout specified is divided up in 10% for _querying_ the list
 of nodes a given stonith device can retrieve, and 90% for then
 performing an actual operation. (Compare initiate_remote_stonith_op()
 and call_remote_stonith())

 I think this is counter-intuitive, to say the least.

Probably right. If you want to change it to add 10%, go ahead :-)


 In your initial case, it just so happens that 100s * 90% obviously
 exactly matches your sbd msgwait, so an increase of +10s just wasn't
 enough.



 Regards,
    Lars

 --
 Architect Storage/HA, OPS Engineering, Novell, Inc.
 SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
 HRB 21284 (AG Nürnberg)
 Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ERROR: glib: ucast: error binding socket. Retrying: Address already in use

2011-07-07 Thread Andrew Beekhof
There is some way to tell the system not to hand out 696 for use by
other daemons.
It's been a long time since I did it, though, so I forget the details
(even who is handing it out, possibly rpc).
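
If it is indeed rpc, two approaches are commonly used for this
(distribution-dependent; a sketch, not a verified recipe):

# 1) if your glibc honours /etc/bindresvport.blacklist, list the port
#    there so bindresvport()-based services such as rpc.statd skip it
echo 696 >> /etc/bindresvport.blacklist

# 2) or pin rpc.statd to a port of its own (RHEL/CentOS-style
#    /etc/sysconfig/nfs) and restart it so it releases 696
echo 'STATD_PORT=4001' >> /etc/sysconfig/nfs
service nfslock restart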

On Thu, Jul 7, 2011 at 10:16 AM, Hai Tao taoh...@hotmail.com wrote:

 I got this error (ERROR: glib: ucast: error binding socket. Retrying: Address 
 already in use), and I know UDP 696 has been used by rpc.statd. I need a 
 solution.

 # lsof -i:696
 COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
 rpc.statd 2216 root    6u  IPv4   5306       UDP *:rushd


 what can I do?

 Thanks.

 Hai T.

 ___
 Linux-HA mailing list
 Linux-HA@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 See also: http://linux-ha.org/ReportingProblems

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems