On Mon, 2014-10-13 at 12:51 +1100, Andrew Beekhof wrote:
Even the same address can be a problem. That brief window where things were
getting renewed can screw up corosync.
But as I proved, there was no renewal at all during the period of this
entire pacemaker run, so the use of DHCP here is
On Wed, 2014-10-08 at 12:39 +1100, Andrew Beekhof wrote:
On 8 Oct 2014, at 2:09 am, Brian J. Murrell (brian)
brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
and node6 I saw an unknown third node being added
Given a 2 node pacemaker-1.1.10-14.el6_5.3 cluster with nodes node5
and node6 I saw an unknown third node being added to the cluster,
but only on node5:
Sep 18 22:52:16 node5 corosync[17321]: [pcmk ] notice: pcmk_peer_update:
Transitional membership event on ring 12: memb=2, new=0, lost=0
Sep
Hi,
As was previously discussed there is a bug in the handling of a STONITH
if a node reboots too quickly. I had a different kind of failure that I
suspect is the same kind of problem, just a different symptom.
The situation is a two node cluster with two resources plus a fencing
resource. Each
On Thu, 2014-04-10 at 10:04 +1000, Andrew Beekhof wrote:
Brian: the detective work above is highly appreciated
NP. I feel like I am getting better at reading these logs and can
provide some more detailed dissection of them. And am happy to do so to
help get to the bottom of things. :-)
On Tue, 2014-04-08 at 17:29 -0400, Digimer wrote:
Looks like your fencing (stonith) failed.
Where? If I'm reading the logs correctly, it looks like stonith worked.
Here's the stonith:
Apr 8 09:53:21 lotus-4vm6 stonith-ng[2492]: notice: log_operation: Operation
'reboot' [3306] (call 2 from
On Wed, 2014-01-08 at 13:30 +1100, Andrew Beekhof wrote:
What version of pacemaker?
Most recently I have been seeing this in 1.1.10 as shipped by RHEL6.5.
On 10 Dec 2013, at 4:40 am, Brian J. Murrell
brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
I didn't seem to get a response
On Thu, 2014-02-06 at 10:42 -0500, Brian J. Murrell (brian) wrote:
On Wed, 2014-01-08 at 13:30 +1100, Andrew Beekhof wrote:
What version of pacemaker?
Most recently I have been seeing this in 1.1.10 as shipped by RHEL6.5.
Doh! Somebody did a test run that had not been updated to use
On Wed, 2014-01-15 at 17:11 +1100, Andrew Beekhof wrote:
Consider any long running action, such as starting a database.
We do not update the CIB until after actions have completed, so there can and
will be times when the status section is out of date to one degree or another.
But that is
On Thu, 2014-01-16 at 08:35 +1100, Andrew Beekhof wrote:
I know, I was giving you another example of when the cib is not completely
up-to-date with reality.
Yeah, I understood that. I was just countering with why that example is
actually more acceptable.
It may very well be partially
On Tue, 2014-01-14 at 16:01 +1100, Andrew Beekhof wrote:
On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
The local cib hasn't caught up yet by the looks of it.
I should have asked in my previous message: is this entirely an artifact
of having just restarted or are there any
Hi,
I found a situation using pacemaker 1.1.10 on RHEL6.5 where the output
of crm_resource -L is not trustworthy shortly after a node is booted.
Here is the output from crm_resource -L on one of the nodes in a two
node cluster (the one that was not rebooted):
st-fencing
On Tue, 2014-01-14 at 08:09 +1100, Andrew Beekhof wrote:
The local cib hasn't caught up yet by the looks of it.
Should crm_resource actually be [mis-]reporting as if it were
knowledgeable when it's not, though? In other words, is this expected
behaviour or should it be considered a bug? Should I open a
On Tue, 2013-12-17 at 16:33 +0100, Florian Crouzat wrote:
Is it possible that lotus-5vm8 (from DNS) and lotus-5vm8-ring1 (from
/etc/hosts) resolve to the same IP (10.128.0.206), which could maybe
confuse cman and make it decide that there is only one ring?
No, they do resolve to two
So, trying to create a cluster on a given node with ccs:
# ccs -p xxx -h $(hostname) --createcluster foo2
Validation Failure, unable to modify configuration file (use -i to ignore this
error).
But there shouldn't be any configuration here yet. I've not done
anything with this node:
# ccs -p
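For the record, a sketch of the workaround the error message itself points
at, assuming the stale /etc/cluster/cluster.conf on this fresh node really
is disposable:
# ccs -p xxx -h $(hostname) -i --createcluster foo2
As I understand it, -i just tells ccs to ignore the validation failure on
whatever configuration file is already present before writing the new one.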
So, I was reading:
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/s2-rrp-ccs-CA.html
about adding a second ring to one's CMAN configuration. I tried to add
a second ring to my configuration without success.
Given the example:
# ccs -h
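In case it helps anyone hitting the same wall, the syntax I believe the
documentation is describing for a second (alternate-name) ring looks
roughly like this, assuming two nodes with -ring1 hostnames on the second
interface:
# ccs -h node1 --addalt node1 node1-ring1
# ccs -h node1 --addalt node2 node2-ring1
i.e. --addalt takes the existing node name followed by the alternate name
to use for ring 1.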
On Tue, 2013-12-10 at 10:27 +, Christine Caulfield wrote:
Sadly you're not wrong.
That's what I was afraid of.
But it's actually no worse than updating
corosync.conf manually,
I think it is...
in fact it's pretty much the same thing,
Not really. Updating corosync.conf on any
On Mon, 2013-12-09 at 09:28 +0100, Jan Friesse wrote:
Error 6 means try again. This happens either if corosync is
overloaded or it is creating a new membership. Please take a look at
/var/log/cluster/corosync.log to see if there is something strange there
(and make sure you have the newest corosync).
So, I'm trying to wrap my head around this need to migrate to pacemaker
+CMAN. I've been looking at
http://clusterlabs.org/quickstart-redhat.html and
https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Cluster_Administration/
It seems ccs is the tool to configure
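For anyone following along, the ClusterLabs quickstart page boils down to
roughly the following ccs sequence (reproduced from memory, so treat it as
a sketch rather than gospel; cluster and node names are examples):
# ccs -f /etc/cluster/cluster.conf --createcluster mycluster
# ccs -f /etc/cluster/cluster.conf --addnode node1
# ccs -f /etc/cluster/cluster.conf --addnode node2
# ccs -f /etc/cluster/cluster.conf --addfencedev pcmk agent=fence_pcmk
# ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node1
# ccs -f /etc/cluster/cluster.conf --addmethod pcmk-redirect node2
# ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node1 pcmk-redirect port=node1
# ccs -f /etc/cluster/cluster.conf --addfenceinst pcmk node2 pcmk-redirect port=node2
after which CMAN handles membership/quorum and all fencing is redirected
back into Pacemaker via fence_pcmk.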
[ Hopefully this doesn't cause a duplicate post but my first attempt
returned an error. ]
Using pacemaker 1.1.10 (but I think this issue is more general than that
release), I want to enforce a policy that once a node fails, no
resources can be started/run on it until the user permits it.
I have
I seem to have another instance where pacemaker fails to exit at the end
of a shutdown. Here's the log from the start of the service pacemaker
stop:
Dec 3 13:00:39 wtm-60vm8 crmd[14076]: notice: do_state_transition: State
transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
On Tue, 2013-12-03 at 18:26 -0500, David Vossel wrote:
We did away with all of the policy engine logic involved with trying to move
fencing devices off of the target node before executing the fencing action.
Behind the scenes all fencing devices are now essentially clones. If the
So, I'm migrating my working pacemaker configuration from 1.1.7 to
1.1.10 and am finding what appears to be a new behavior in 1.1.10.
If a given node is running a fencing resource and that node goes AWOL,
it needs to be fenced (of course). But any other node trying to take
over the fencing
On 13-07-08 03:48 AM, Andreas Mock wrote:
Hi all,
I'm just wondering what the best way is to
let an admin know that the cluster (the rest of
the cluster) has stonithed some other nodes?
You could modify or even just wrap the stonith agent. They are usually
just Python or shell scripts anyway.
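To make that concrete, a minimal sketch of such a wrapper (paths, names and
the admin address are all hypothetical; fence agents receive their
parameters on stdin, so the wrapper only has to pass that through):
#!/bin/sh
# Hypothetical wrapper around fence_virsh: forward everything to the real
# agent, then notify the admin that a fencing operation was attempted.
input=$(cat)
printf '%s\n' "$input" | /usr/sbin/fence_virsh.real "$@"
rc=$?
printf 'fence_virsh ran (rc=%s) with:\n%s\n' "$rc" "$input" \
    | mail -s "STONITH event on $(hostname)" admin@example.com
exit $rc
You would install the real agent as fence_virsh.real and drop the wrapper
in its place, or point the stonith resource at the wrapper directly.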
On 13-05-22 07:05 PM, Andrew Beekhof wrote:
Also, 1.1.8-7 was not tested with the plugin _at_all_ (and neither will
future RHEL builds).
Was 1.1.7-* in EL 6.3 tested with the plugin? Is staying with the most
recent EL 6.3 pacemaker-1.1.7 release really the more stable option for
people not
Using pacemaker 1.1.8-7 on EL6, I got the following series of events
trying to shut down pacemaker and then corosync. The corosync shutdown
(service corosync stop) ended up spinning/hanging indefinitely (~7hrs
now). The events, including a:
May 21 23:47:18 node1 crmd[17598]: error: do_exit:
Using Pacemaker 1.1.8 on EL6.4 with the pacemaker plugin, I'm finding
strange behavior with stonith_admin -B node2. It seems to shut the
node down but not start it back up, and ends up reporting a timer
expired:
# stonith_admin -B node2
Command failed: Timer expired
The pacemaker log for the
On 13-05-09 09:53 PM, Andrew Beekhof wrote:
May 7 02:36:16 node1 crmd[16836]: info: delete_resource: Removing
resource testfs-resource1 for 18002_crm_resource (internal) on node1
May 7 02:36:16 node1 lrmd: [16833]: info: flush_op: process for operation
monitor[8] on
Using Pacemaker 1.1.7 on EL6.3, I'm getting an intermittent recurrence
of a situation where I add a resource and start it, and it seems to
start but then fail right away, i.e.:
# clean up resource before trying to start, just to make sure we start with a
clean slate
# crm resource cleanup
I'm using pacemaker 1.1.8 and I don't see stonith resources moving away
from AWOL hosts like I thought I did with 1.1.7. So I guess the first
thing to do is clear up what is supposed to happen.
If I have a single stonith resource for a cluster and it's running on
node A and then node A goes
On 13-04-30 11:13 AM, Lars Marowsky-Bree wrote:
Pacemaker 1.1.8's stonith/fencing subsystem directly ties into the CIB,
and will complete the fencing request even if the fencing/stonith
resource is not instantiated on the node yet.
But clearly that's not happening here.
(There's a bug in
Using 1.1.8 on EL6.4, I am seeing this sort of thing:
pengine[1590]: warning: unpack_rsc_op: Processing failed op monitor for
my_resource on node1: unknown error (1)
The full log from the point of adding the resource until the errors:
Apr 30 11:46:30 node1 cibadmin[3380]: notice:
On 13-04-24 01:16 AM, Andrew Beekhof wrote:
Almost certainly you are hitting:
https://bugzilla.redhat.com/show_bug.cgi?id=951340
Yup. The patch posted there fixed it.
I am doing my best to convince the people who make decisions that this is
worthy of an update before 6.5.
I've added
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I just killed
(-KILL) corosync on a peer node. Pacemaker seemed to take a long time
to transition to stonithing it though after noticing it was AWOL:
Apr 23 19:05:20 node2 corosync[1324]: [TOTEM ] A processor failed, forming
new
Given:
host1# crm node attribute host1 show foo
scope=nodes name=foo value=bar
Why doesn't this return anything:
host1# crm_attribute --node host1 --name foo --query
host1# echo $?
0
cibadmin -Q confirms the presence of the attribute:
node id=host1 uname=host1
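One thing worth trying, sketched here on the assumption that the attribute
really lives in the permanent nodes section (as the cibadmin output shows),
is to scope the query explicitly, since crm_attribute can also look in
crm_config or the transient status section:
host1# crm_attribute --type nodes --node host1 --name foo --query
Whether it should need that hint is of course exactly the question.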
On 13-04-11 06:00 PM, Andrew Beekhof wrote:
Actually, I think the semantics of -C are first-write-wins and any subsequent
attempts fail with -EEXIST
Indeed, you are correct. I think my point though was that it didn't
matter in my case which writer wins since they should all be trying to
On 13-04-10 07:02 PM, Andrew Beekhof wrote:
On 11/04/2013, at 6:33 AM, Brian J. Murrell
brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
Does crm_resource suffer from this problem
no
Excellent.
I was unable to find any comprehensive documentation on just how to
implement
On 13-04-10 04:33 PM, Brian J. Murrell wrote:
Does crm_resource suffer from this problem or does it properly only send
exactly the update to the CIB for the operation it's trying to achieve?
In exploring all options, how about pcs? Does pcs' resource create
... for example have the same read
On 13-04-11 07:37 AM, Brian J. Murrell wrote:
In exploring all options, how about pcs? Does pcs' resource create
... for example have the same read+modify+replace problem as crm
configure or does pcs resource create also only send proper fragments to
update just the part of the CIB it's
On 13-02-21 07:48 PM, Andrew Beekhof wrote:
On Fri, Feb 22, 2013 at 5:18 AM, Brian J. Murrell
brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
I wonder what happens in the case of two racing crm commands that want
to update the CIB (with non-overlapping/conflicting data). Is there any
On 13-03-25 03:50 PM, Jacek Konieczny wrote:
The first node to notice that the other is unreachable will fence (kill)
the other, making sure it is the only one operating on the shared data.
Right. But in a typical two-node cluster that ignores no-quorum, precisely
because quorum is being ignored, as soon
On 13-02-25 10:30 AM, Dejan Muhamedagic wrote:
Before doing replace, crmsh queries the CIB and checks if the
epoch was modified in the meantime.
But doesn't take out a lock of any sort to prevent an update in the
meanwhile, right?
Those operations are not
atomic, though.
Indeed.
Perhaps
I seem to have found a situation where pacemaker (pacemaker-1.1.7-6.el6.x86_64)
refuses to stop (i.e. service pacemaker stop) on EL6.
The status of the 2 node cluster was that the node being asked to stop
(node2) was continually trying to stonith another node (node1) in the
cluster which was not
I wonder what happens in the case of two racing crm commands that want
to update the CIB (with non-overlapping/conflicting data). Is there any
locking to ensure that one crm cannot overwrite the other's change?
(i.e. second one to get there has to read in the new CIB before being
able to apply
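As a point of comparison, the narrow way to add something without the
read-modify-replace cycle is to hand cibadmin just the fragment you care
about; the resource definition below is only an illustration:
# cat > /tmp/dummy1.xml <<EOF
<primitive id="dummy1" class="ocf" provider="pacemaker" type="Dummy"/>
EOF
# cibadmin -o resources -C -x /tmp/dummy1.xml
Because only the new object is sent, two such commands touching different
objects should not be able to clobber each other the way two whole-CIB
replaces can.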
Is there a way to return an individual property (or all properties)
and/or a rsc_default (or all) back to default values, using crm, or
otherwise?
Cheers,
b.
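One approach worth trying, sketched with an example property name: delete
the explicit setting so the built-in default applies again, e.g.
# crm_attribute --type crm_config --name stonith-timeout --delete
For a rsc_defaults entry, removing the line with crm configure edit should
have the same effect, since an absent value means the compiled-in default.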
I'm experimenting with asymmetric clusters and resource location
constraints.
My cluster has some resources which have to be restricted to certain
nodes and other resources which can run on any node. Given that, an
opt-in cluster seems the most manageable. That is, it seems easier to
create
On 13-01-23 03:32 AM, Dan Frincu wrote:
Hi,
Hi,
I usually put the node in standby, which means it can no longer run
any resources. Both Pacemaker and Corosync continue to run, and the node
still provides quorum.
But a node in standby will still be STONITHed if it goes AWOL. I put a
node in standby
OK. So you have a corosync cluster of nodes with pacemaker managing
resources on them, including (of course) STONITH.
What's the best/proper way to shut down a node, say, for maintenance,
such that pacemaker doesn't try to fix that situation by STONITHing the
node to bring it back up,
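The sequence suggested elsewhere in this thread, sketched with example
names for a RHEL 6 CMAN+Pacemaker stack, is standby first and then a clean
stop of the stack:
# crm node standby node1
# service pacemaker stop
# service cman stop
...do the maintenance, then reverse it...
# service cman start
# service pacemaker start
# crm node online node1
Standby keeps the node a quorum-carrying member until the daemons are
actually stopped, and a clean stop (unlike going AWOL) should not trigger
fencing.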
On 12-07-04 02:12 AM, Andrew Beekhof wrote:
On Wed, Jul 4, 2012 at 10:06 AM, Brian J. Murrell
brian-squohqy54cvwr29bmmi...@public.gmane.org wrote:
Just because I reduced the number of nodes doesn't mean that I reduced
the parallelism any.
Yes. You did. You reduced the number of check
On 12-07-04 04:27 AM, Andreas Kurz wrote:
beside increasing the batch limit to a higher value ... did you also
tune corosync totem timings?
Not yet.
But a closer look at the logs reveals a bunch of these:
Jun 28 14:56:56 node-2 corosync[30497]: [pcmk ] ERROR: send_cluster_msg_raw:
Child
On 12-06-27 11:30 PM, Andrew Beekhof wrote:
The updates from you aren't the problem. Its the number of resource
operations (that need to be stored in the CIB) that result from your
changes that might be causing the problem.
Just to follow this up for anyone currently following or anyone
On 12-07-03 06:17 PM, Andrew Beekhof wrote:
Even adding passive nodes multiplies the number of probe operations
that need to be performed and loaded into the cib.
So it seems. I just would not have thought they'd be such a load since,
from a simplistic perspective, they are not trying to
On 12-07-03 04:26 PM, David Vossel wrote:
This is not a definite. Perhaps you are experiencing this given the
pacemaker version you are running
Yes, that is absolutely possible and it certainly has been under
consideration throughout this process. I did also recognize however,
that I am
So, I have an 18 node cluster here (so a small haystack, indeed, but
still a haystack in which to try to find a needle) where a certain
set of (yet unknown, figuring that out is part of this process)
operations are pooching pacemaker. The symptom is that on one or
more nodes I get the following
In my cluster configuration, each resource can be run on one of two nodes,
and I designate a primary and a secondary using location constraints
such as:
location FOO-primary FOO 20: bar1
location FOO-secondary FOO 10: bar2
And I also set a default stickiness to prevent auto-fail-back (i.e. to
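To make the stickiness part concrete, a sketch of the accompanying default
with an illustrative value:
rsc_defaults resource-stickiness=100
Any value larger than the difference between the two location scores
(20 - 10 = 10 here) is enough: after a failover to bar2 the resource scores
10 + 100 there versus 20 on a recovered bar1, so it stays put.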
On 12-03-30 02:35 PM, Florian Haas wrote:
crm configure rsc_defaults resource-stickiness=0
... and then when resources have moved back, set it to 1000 again.
It's really that simple. :)
That sounds racy. I am changing a parameter which has the potential to
affect the stickiness of all
We seem to have occasions where crm_resource reports that a resource is
running on more nodes than it actually is (usually all of them!) when we
query right after adding it:
# crm_resource --resource chalkfs-OST_3 --locate
resource chalkfs-OST_3 is running on: chalk02
resource chalkfs-OST_3 is running
On 12-03-28 10:39 AM, Florian Haas wrote:
Probably because your resource agent reports OCF_SUCCESS on a probe
operation
To be clear, is this the status $OP in the agent?
Cheers,
b.
I want to be able to run a resource on any node in an asymmetric
cluster so I tried creating a rule to run it on any node not named
foo since there are no nodes named foo in my cluster:
# cat /tmp/foo.xml
rsc_location id=run-bar-anywhere rsc=bar
rule id=run-bar-anywhere-rule score=100
On 11-10-26 10:19 AM, Brian J. Murrell wrote:
# cat /tmp/foo.xml
rsc_location id=run-bar-anywhere rsc=bar
rule id=run-bar-anywhere-rule score=100
^^^
I figured it out. This integer has to be quoted. I'm thinking too
much like a programmer
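For completeness, once the score is written as score="100" the fragment
loads cleanly as its own constraints object, e.g. one way to load it:
# cibadmin -o constraints -C -x /tmp/foo.xml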
I want to create a stonith primitive and clone it for each node in my
cluster. I'm using the fence-agents virsh agent as my stonith
primitive. Currently for a single node it looks like:
primitive st-pm-node1 stonith:fence_virsh \
params ipaddr=192.168.122.1 login=xxx passwd=xxx
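For comparison, a commonly described alternative to cloning is one
fence_virsh primitive per node to be fenced, each naming its victim with
port= and pinned away from that node; a sketch with illustrative names:
primitive st-pm-node1 stonith:fence_virsh \
        params ipaddr=192.168.122.1 login=xxx passwd=xxx port=node1
location st-pm-node1-not-on-node1 st-pm-node1 -inf: node1
and the same again for node2.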
I have a pacemaker 1.0.10 installation on RHEL5 but I can't seem to
get a working stonith configuration. I have tested my stonith
device manually using the stonith command and it works fine. What
doesn't seem to be happening is pacemaker/stonithd actually asking for a
stonith. In my
On 11-10-18 09:40 AM, Andreas Kurz wrote:
Hello,
Hi,
I'd expect this to be the problem ... if you insist on using an
unsymmetric cluster you must add a location score for each resource you
want to be up on a node ... so add a location constraint for the fencing
clone for each node ... or
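A sketch of what that suggestion translates to in crm shell, with example
clone and node names:
location loc-st-clone-node1 st-clone 100: node1
location loc-st-clone-node2 st-clone 100: node2
i.e. in an opt-in cluster the fencing clone, like everything else, only
runs where it has been given a positive score.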
So, in another thread there was a discussion of using cibadmin to
mitigate possible concurrency issues with the crm shell. I have written a
test program to test that theory and unfortunately cibadmin also falls down
in the face of heavy concurrency, with errors such as:
Signon to CIB failed: connection
On 11-09-28 10:20 AM, Dejan Muhamedagic wrote:
Hi,
Hi,
I'm really not sure. Need to investigate this area more.
Well, I am experimenting with cibadmin. It's certainly not as nice and
shiny as crm shell though. :-)
cibadmin talks to the cib (the process) and cib should allow
only one
On 11-09-26 03:44 AM, Tim Serong wrote:
Because:
1) You need to run cibadmin -o resources -C -x test.xml to create the
resource (-C creates, -U updates an existing resource).
That's what I thought/wondered but the EXAMPLES section in the manpage
is quite clear that it's asking one to
On 11-09-25 09:21 PM, Andrew Beekhof wrote:
As the error says, the resource R_10.10.10.101 doesn't exist yet.
Put it in a resources tag or use -C instead of -U
Thanks much. I already replied to Tim, but the summary is that the
manpage is incorrect in two places. One is specifying the
Using pacemaker-1.0.10-1.4.el5 I am trying to add the R_10.10.10.101
IPaddr2 example resource:
<primitive id="R_10.10.10.101" class="ocf" type="IPaddr2" provider="heartbeat">
  <instance_attributes id="RA_R_10.10.10.101">
    <attributes>
      <nvpair id="R_ip_P_ip" name="ip" value="10.10.10.101"/>
      <nvpair id="R_ip_P_nic"
I've seen both approaches: setting a default-resource-stickiness property (e.g.
http://www.howtoforge.com/installation-and-setup-guide-for-drbd-openais-pacemaker-xen-on-opensuse-11.1)
and a rsc_defaults option with resource-stickiness
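Side by side, the two forms look roughly like this in crm shell, with an
illustrative value:
# crm configure property default-resource-stickiness=1000
# crm configure rsc_defaults resource-stickiness=1000
My understanding is that the cluster property is the older, deprecated
spelling and the rsc_defaults form is the one current Pacemaker
documentation favours.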