Re: [Pacemaker] [ha-wg] [RFC] Organizing HA Summit 2015

2014-12-01 Thread Tim Serong
On 11/25/2014 02:14 AM, Digimer wrote:
 On 24/11/14 10:12 AM, Lars Marowsky-Bree wrote:
 Beijing, the US, Tasmania (OK, one crazy guy), various countries in
 
 Oh, bring him! crazy++

What, you want to bring the guy who's boldly maintaining the outpost on
the southern frontier? ;)

*cough*

Barring a miracle or a sudden huge advance in matter transporter
technology I'm rather unlikely to make it, I'm afraid.  But I'll add my
voice to what Lars said in another email - go all physical (with good
minutes/notes/etherpads for others to review - which I assume is what's
going to happen this time), or all virtual.  Mixing the two is
exceedingly difficult to do well, IMO.

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] hawk session timeout?

2014-12-01 Thread Tim Serong
On 12/02/2014 07:43 AM, Schaefer, Diane E wrote:
 Hi,
 
  I am running hawk 0.6.1-0.11.1 on SLES SP3.  How do I configure HAWK
 so my web session times out?  My users are concerned since it never
 times out by default.

It actually will eventually time out if you don't log out manually, but
it'll take ten days...  This was put in so that if you're using the
dashboard function to view multiple clusters, you wouldn't have to keep
logging in to them if the sessions timed out.

A quick workaround is to edit this file:

/srv/www/hawk/config/initializers/session_store.rb

You want to change :expire_after to a smaller value (expressed in
seconds), then restart hawk.
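For reference, the edited setting might look something like this (a sketch only, assuming the Rails-style initializer Hawk 0.6.x uses -- the exact key names in your file may differ, so check before editing):

```ruby
# /srv/www/hawk/config/initializers/session_store.rb (illustrative sketch;
# only :expire_after is the point here -- other keys are placeholders)
ActionController::Base.session = {
  :key          => '_hawk_session',
  :expire_after => 30 * 60   # 30 minutes, in seconds, instead of ten days
}
```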

Please feel free to file a bug for this (to either set it lower by
default, or break the setting out into a config file, or both) on the
SUSE bugzilla (assuming you're using SLE HA), or the github issue
tracker (https://github.com/ClusterLabs/hawk) if not.

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Hawk session ends after start or stop action

2014-03-05 Thread Tim Serong
On 03/05/2014 12:59 AM, Schaefer, Diane E wrote:
 Hi Lars,
 
   I am running pacemaker on SLES 11 SP3 and have applied the update
 package released in December.  The hawk level is 0.6.1-0.11.1 and
 lighttpd is 1.4.20-2.52.1 . When I log into hawk using firefox, google
 chrome or IE 9 all with the hacluster userid.  I can view my cluster
 definition, but I cannot perform any actions without my web session
 ending.  The action does get submitted OK.  One of my systems is running
 hawk 0.6.1-0.7.11 and lighttpd-1.4.20-2.46.10  and I don't seem to have
 the issue.
 
This is a bit strange, but it's possible that you are hitting a real bug
 in hawk somewhere.
 
Can you please take this log, and a hb_report from the offending
 cluster, and open a support call? Then we can investigate properly.
 
 I rebooted the system and then my hawk sessions no longer close.  I had
 originally configured my cluster without hawk support and then started
 it via /etc/init.d/hawk start.  I also turned on the chkconfig bit at
 this time.  I suspect not everything that is needed was started before
 my reboot or I’m not starting hawk correctly?  I have many clusters up
 in my test lab,  is there some process to check to see if it’s running
 before I reboot?
 

This sounds like the sort of problem that could happen if something was
wrong with the session cookie Hawk sets in your browser.  Hawk
0.6.1-0.11.1 included an update which changed the session key, so it's
possible that the login check was confused by the update but has
resolved itself since you rebooted the system (and thus restarted Hawk).

If that *was* the problem, just restarting hawk (service hawk restart)
and possibly logging out and back in in your web browser should have
been enough to resolve it.

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



[Pacemaker] Announce: Hawk (HA Web Konsole) 0.6.2

2013-12-06 Thread Tim Serong
Greetings,

This is to announce version 0.6.2 of Hawk, a web-based GUI for managing
and monitoring Pacemaker High-Availability clusters.

Notable features include:

- View cluster status (summary and detailed views).
- Examine potential failure scenarios via simulator mode.
- History explorer for analysis of cluster behaviour and prior failures.
- Perform regular management tasks (start, stop, move resources, put
  nodes on standby/maintenance, etc.)
- Configure resources, constraints, general cluster properties.
- Setup wizard for common configurations (currently web server and
  OCFS2 filesystem).

Packages for various openSUSE releases, as well as Fedora 18 and 19 are
available from the Open Build Service:


http://software.opensuse.org/download?project=network:ha-clustering:Stable&package=hawk

More information is available in the README in the source tree:

  https://github.com/ClusterLabs/hawk

Some important notes:

- The latest versions of Hawk require pacemaker >= 1.1.8.
- Hawk uses the crm shell[1] internally to provide much of its
  functionality, so you'll need that installed too.
- The history explorer requires hb_report, which is presently available
  in cluster-glue[2].  If you don't have that installed, you'll miss
  that piece of functionality, but everything else should work just fine.
- Hawk has long been used and tested on SLES and openSUSE.  I suspect
  (but have no actual way of knowing) that it has been rather less
  widely deployed on other distros.  Accordingly there may be some rough
  edges.  Please tell me about them!

More detailed usage documentation is available in the SUSE Linux
Enterprise High Availability Extension book:


https://www.suse.com/documentation/sle_ha/book_sleha/data/cha_ha_configuration_hawk.html

Please direct comments, feedback, questions, etc. to myself and/or
(preferably) the Pacemaker mailing list.

Happy clustering,

Tim

[1]
http://software.opensuse.org/download?project=network:ha-clustering:Stable&package=crmsh
[2]
http://software.opensuse.org/download?project=network:ha-clustering:Stable&package=cluster-glue

-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



[Pacemaker] Announce: opensuse-ha mailing list

2013-09-17 Thread Tim Serong
Greetings,

There is now an opensuse-ha mailing list.  This list is for discussion
of high availability clustering on openSUSE.  This includes:

- The base cluster stack, i.e. corosync and pacemaker
- Management tools such as crmsh and hawk
- Clustered filesystems (e.g.: ocfs2)
- Replicated storage (drbd)
- Basically, anything in network:ha-clustering:* on OBS is
  on topic :)

If you'd like to subscribe, just send an email to:

  opensuse-ha+subscr...@opensuse.org

Please also see the wiki page at:

  https://en.opensuse.org/openSUSE:High_Availability

Happy clustering!

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] reorg of network:ha-clustering repo on build.opensuse.org

2013-08-09 Thread Tim Serong
On 07/26/2013 09:58 PM, Tim Serong wrote:
 On 07/25/2013 03:59 PM, Tim Serong wrote:
 Hi All,

 This is just a quick heads-up.  We're in the process of reorganising the
 network:ha-clustering repository on build.opensuse.org.  If you don't
 use any of the software from this repo feel free to stop reading now :)

 Currently we have:

 - network:ha-clustering (stable builds for various distros)
 - network:ha-clustering:Factory (devel project for openSUSE:Factory)

 This is going to change to:

 - network:ha-clustering:Stable (stable builds for various distros)
 - network:ha-clustering:Unstable (unstable/dev, various distros)
 - network:ha-clustering:Factory (devel project for openSUSE:Factory)

 This means that if you're currently using packages from
 network:ha-clustering, you'll need to point to
 network:ha-clustering:Stable instead (once we've finished shuffling
 everything around).

 I'll send another email out when this is done.
 
 network:ha-clustering:Stable has now been populated.  There is some
 documentation of the new repository configuration at:
 
   https://en.opensuse.org/openSUSE:High_Availability
 
 The old packages in the base network:ha-clustering repo will be purged
 soon, but not before 2013-08-05.

The old packages in the base network:ha-clustering repo have now been
purged.  For HA clustering fun, as mentioned above, please use one of:

- network:ha-clustering:Stable (stable builds for various distros)
- network:ha-clustering:Unstable (unstable/dev, various distros)
- network:ha-clustering:Factory (devel project for openSUSE:Factory)

For those of you not using openSUSE, network:ha-clustering:Stable
notably includes:

- crmsh, cluster-glue, pssh for CentOS 6, FC 18, FC 19
- hawk for FC 18

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] reorg of network:ha-clustering repo on build.opensuse.org

2013-07-26 Thread Tim Serong
On 07/25/2013 03:59 PM, Tim Serong wrote:
 Hi All,
 
 This is just a quick heads-up.  We're in the process of reorganising the
 network:ha-clustering repository on build.opensuse.org.  If you don't
 use any of the software from this repo feel free to stop reading now :)
 
 Currently we have:
 
 - network:ha-clustering (stable builds for various distros)
 - network:ha-clustering:Factory (devel project for openSUSE:Factory)
 
 This is going to change to:
 
 - network:ha-clustering:Stable (stable builds for various distros)
 - network:ha-clustering:Unstable (unstable/dev, various distros)
 - network:ha-clustering:Factory (devel project for openSUSE:Factory)
 
 This means that if you're currently using packages from
 network:ha-clustering, you'll need to point to
 network:ha-clustering:Stable instead (once we've finished shuffling
 everything around).
 
 I'll send another email out when this is done.

network:ha-clustering:Stable has now been populated.  There is some
documentation of the new repository configuration at:

  https://en.opensuse.org/openSUSE:High_Availability

The old packages in the base network:ha-clustering repo will be purged
soon, but not before 2013-08-05.

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



[Pacemaker] reorg of network:ha-clustering repo on build.opensuse.org

2013-07-25 Thread Tim Serong
Hi All,

This is just a quick heads-up.  We're in the process of reorganising the
network:ha-clustering repository on build.opensuse.org.  If you don't
use any of the software from this repo feel free to stop reading now :)

Currently we have:

- network:ha-clustering (stable builds for various distros)
- network:ha-clustering:Factory (devel project for openSUSE:Factory)

This is going to change to:

- network:ha-clustering:Stable (stable builds for various distros)
- network:ha-clustering:Unstable (unstable/dev, various distros)
- network:ha-clustering:Factory (devel project for openSUSE:Factory)

This means that if you're currently using packages from
network:ha-clustering, you'll need to point to
network:ha-clustering:Stable instead (once we've finished shuffling
everything around).

I'll send another email out when this is done.

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Is crm_gui available under RHEL6?

2013-02-17 Thread Tim Serong
On 02/15/2013 01:53 AM, Dejan Muhamedagic wrote:
 On Thu, Feb 14, 2013 at 10:46:40AM +0100, Rasto Levrinc wrote:
 On Thu, Feb 14, 2013 at 12:20 AM, Ron Kerry rke...@sgi.com wrote:
 I am not sure if this is an appropriate question for a community forum since
 it is a RHEL specific question. However, I cannot think of a better forum to
 use (as someone coming from a heavy SLES background), so I will ask it
 anyway. Feel free to shoot me down or point me in a different direction.

 I do not find the pacemaker GUI in any of the RHEL6 HA distribution rpms. I
 have tried to think of all of its various names crm_gui, hb_gui,
 mgmt/haclient etc, but I have not found it. A simple Google search also was
 not helpful - perhaps due to me not being sufficiently skilled at search
 techniques. Is it available somewhere in the RHEL6 HA distribution and I am
 just not finding it? Or do I need to build it from source or pull some
 community-built rpm off the web?

 I am also not aware of any crm_gui packages for rhel6 not even community
 build. But you should be able to compile it on rhel6 from here

 https://github.com/ClusterLabs/pacemaker-mgmt

 Luckily there are many alternative GUIs, but only 1 or 2 really usable.

 In theory you can get crmsh package from here

 http://download.opensuse.org/repositories/network:/ha-clustering/
 
 In practice too :) Every new version of crmsh is going to be
 available there for the selected platforms. Along with
 resource-agents, cluster-glue, etc.
 
 I don't see HAWK package there, so probably it's still not compatible with
 the rhel 6 Ruby version at this moment.
 
 Right, hawk is not built. Tim should be able to tell why.

Yeah, the hawk build in network:ha-clustering is against rails 2, which
precludes building on recent Fedora (and presumably RHEL) versions (FC
18 ships rails 3.2).

I do have a reasonable rails 3.2 port which I'll make available soon,
but I still have some work in progress, bugs to fix, things to clean up,
etc. etc. before announcing a release.

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] killproc not found? o2cb shutdown via resource agent

2012-11-08 Thread Tim Serong
On 11/08/2012 07:56 PM, Andrew Beekhof wrote:
 On Thu, Nov 8, 2012 at 5:16 PM, Tim Serong tser...@suse.com wrote:
 On 11/08/2012 12:11 PM, Andrew Beekhof wrote:
 On Thu, Nov 8, 2012 at 9:59 AM, Matthew O'Connor m...@ecsorl.com wrote:
 Follow-up and additional info:

 System is Ubuntu 12.04.  Not sure where killproc is supposed to be derived
 from, or if there is an assumption for it to be a standalone binary or
 script.  I did find it defined in /lib/lsb/init-functions.  Adding a .
 /lib/lsb/init-functions to the start of the
 /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs file makes the
 process-kill work, but I suspect this is not the most desirable solution.

 I think thats as good a solution as any.
 I wonder where other distros are getting it from.

 SLES 11 SP2:

 # rpm -qf /sbin/killproc
 sysvinit-2.86-210.1

 openSUSE 12.2:

 # rpm -qf /sbin/killproc
 sysvinit-tools-2.88+-77.3.1.x86_64

 Can't speak for any others offhand...
 
 Definitely not on fedora or its derivatives

Hrm.  Well, I just had a quick skim of the ocfs2-tools source, and I'd
be willing to bet the o2cb RA was based on the upstream o2cb init
script, which uses killproc, but also sources /lib/lsb/init-functions.
Does Fedora have killproc buried somewhere in there maybe?

On SUSE, /lib/lsb/init-functions defines start_daemon(), killproc(), and
pidofproc() but these just wrap binaries of the same name in /sbin
(which would explain why o2cb works fine on SUSE, as those missing
things are presumably in $PATH anyway).

I don't know about sourcing /lib/lsb/init-functions in .ocf-shellfuncs -
might be a bit broad?  Presumably couldn't hurt to source it in the o2cb
RA though, unless there's some other cleaner solution...
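As a sketch of that idea (a hypothetical guard, not the actual o2cb RA code): probe for killproc first, source the LSB helpers only if it's missing, and fall back to a minimal stand-in so the agent still works on distros that provide neither:

```shell
#!/bin/sh
# Hypothetical portability guard for a resource agent that calls killproc.
# 1. If killproc is already a command or shell function, use it as-is.
# 2. Otherwise source the LSB helper library, which may define it.
# 3. Otherwise define a crude pkill-based stand-in.
have_killproc() {
    command -v killproc >/dev/null 2>&1 || type killproc >/dev/null 2>&1
}

if ! have_killproc && [ -r /lib/lsb/init-functions ]; then
    . /lib/lsb/init-functions
fi

if ! have_killproc; then
    killproc() {
        # Minimal fallback: signal all processes matching the daemon's name
        pkill -x "$(basename "$1")"
    }
fi

echo "killproc shim ready"
```

This keeps the RA self-contained instead of making every agent depend on /lib/lsb/init-functions via .ocf-shellfuncs.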

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] killproc not found? o2cb shutdown via resource agent

2012-11-07 Thread Tim Serong
On 11/08/2012 12:11 PM, Andrew Beekhof wrote:
 On Thu, Nov 8, 2012 at 9:59 AM, Matthew O'Connor m...@ecsorl.com wrote:
 Follow-up and additional info:

 System is Ubuntu 12.04.  Not sure where killproc is supposed to be derived
 from, or if there is an assumption for it to be a standalone binary or
 script.  I did find it defined in /lib/lsb/init-functions.  Adding a .
 /lib/lsb/init-functions to the start of the
 /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs file makes the
 process-kill work, but I suspect this is not the most desirable solution.
 
 I think thats as good a solution as any.
 I wonder where other distros are getting it from.

SLES 11 SP2:

# rpm -qf /sbin/killproc
sysvinit-2.86-210.1

openSUSE 12.2:

# rpm -qf /sbin/killproc
sysvinit-tools-2.88+-77.3.1.x86_64

Can't speak for any others offhand...

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



[Pacemaker] Fwd: Re: How can I make the secondary machine elect itself owner of the floating IP address?

2012-09-23 Thread Tim Serong
Forwarding to the list for posterity (i.e. google) - I believe my reply
did solve the problem, BTW.

The crm config in question is:

node scc-bak
node scc-pri
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.1.1.180" cidr_netmask="24" \
        op monitor interval="30s"
primitive drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op monitor interval="15" role="Master" \
        op monitor interval="30" role="Slave"
primitive fs_r0 ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/home/scc" fstype="ext3" \
        op monitor interval="10s"
primitive scc-stonith stonith:meatware \
        operations $id="scc-stonith-operations" \
        op monitor interval="3600" timeout="20" start-delay="15" \
        params hostlist="10.1.1.32 10.1.1.31"
group r0 fs_r0 ClusterIP
ms ms_drbd_r0 drbd_r0 \
        meta master-max="1" master-node-max="1" clone-max="2" \
        clone-node-max="1" notify="true"
colocation r0_on_drbd inf: r0 ms_drbd_r0:Master
order r0_after_drbd inf: ms_drbd_r0:promote r0:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.6-b988976485d15cb702c9307df55512d323831a5e" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="200"

I probably should have noted that scc-pri and scc-bak aren't really
the best choice of names, because "pri" and "bak" are kind of
meaningless assuming identical nodes (and the nomenclature gets
confusing when you start talking about masters and slaves on top of that).

Anyway...

 Original Message 
Subject: Re: How can I make the secondary machine elect itself owner of
the floating IP address?
Date: Thu, 20 Sep 2012 12:36:03 +1000
From: Tim Serong
To: Epps, Josh

Hi Josh,

On 09/20/2012 10:47 AM, Epps, Josh wrote:
 Hi Tim,
 
 I saw one of your Gossamer threads and I really need some help.
 
 I have a two-node cluster running on SLES 11 SP2 with Pacemaker and DRBD.
 When I shut down the primary with shutdown -h now, the
 ocf:heartbeat:IPaddr2 resource transfers nicely to the backup server.
 But when I simulate a failure on the primary node by killing the power,
 neither the floating IP address nor the mount transfers to the secondary
 machine.

What's probably happening is:

- When you do a clean shutdown of one node, the surviving node knows the
first has gone away, and it can safely take over those resources.
- When you cut power, the surviving node doesn't know what state the
first node is in, so will do nothing until the first node is fenced.
- You're using the meatware STONITH plugin (which probably doesn't need
a monitor op, BTW), which means you should see a CRIT message in syslog
on the surviving node, telling you it expects the first node to be fenced.

 
 How can I make the secondary machine elect itself owner of the floating
 IP address?

Assuming the first machine is really down :) you should be able to tell
the cluster this is so by running "meatclient -c scc-pri" on the
surviving node (but do check syslog to see if you're really getting
warnings about a node needing to be fenced).

 SUSE support today said that it can’t be done with just two nodes, but we
 just require a one-way failover.

Two-node clusters should work fine, they're just more annoying than
three-node - see for example "STONITH Deathmatch Explained" at
http://ourobengr.com/ha/

If the above doesn't solve it for you, do you mind if we take this to
the linux-ha or pacemaker public mailing list?  More eyes on a problem
never hurts, and then a solution becomes googlable :)

Regards,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com





Re: [Pacemaker] [corosync] Ideas on merging #linux-ha and #linux-cluster on freenode

2012-05-28 Thread Tim Serong
On 05/28/2012 10:51 AM, Andrew Beekhof wrote:
 On Mon, May 28, 2012 at 8:02 AM, Digimer li...@alteeve.ca wrote:
 I'm not sure if this has come up before, but I thought it might be worth
 discussing.

 With the cluster stacks merging, it strikes me that having two separate
 channels for effectively the same topic splits up folks. I know that
 #linux-ha technically still supports Heartbeat, but other than that, I
 see little difference between the two channels.

 I suppose a similar argument could me made for the myriad of mailing
 lists, too. I don't know if any of the lists really have significant
 enough load to cause a problem if the lists were merged. Could
 Linux-Cluster, Corosync and Pacemaker be merged?

 Thoughts?

 Digimer, hoping a hornets nest wasn't just opened. :)

 
 I think the only thing you missed was proposing a meta-project to rule
 them all :-)

  ...One Totem Ring to rule them all, one Totem Ring to find them...

If only Sauron had implemented RRP during the Second Age, things might
have turned out differently for Middle Earth.

SCNR,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] [Openais] Help on mysql-proxy resource

2012-03-30 Thread Tim Serong
Hi Carlos,

You'll have most luck with crm configuration questions on the Pacemaker
list (CC'd):

  pacemaker@oss.clusterlabs.org

I don't actually know anything about the mysql-proxy RA, but you might
have a typo.

On 03/30/2012 12:52 PM, Carlos xavier wrote:
 Hi.
 
 I have mysql-proxy running on my system and I want to agregate it to the
 cluster configuration.
 When it is started by the system I got this as result of ps auwwwx:
 
 root 29644  0.0  0.0  22844   844 ?S22:37   0:00
 /usr/sbin/mysql-proxy --pid-file /var/run/mysql-proxy.pid --daemon
 --proxy-lua-script

Note this is --proxy-lua-script (singular)

 /usr/share/doc/packages/mysql-proxy/examples/tutorial-basic.lua
 --proxy-backend-addresses=10.10.10.5:3306 --proxy-address=172.31.0.192:3306
 
 So I created the following configuration at the CRM:
 
 primitive mysql-proxy ocf:heartbeat:mysql-proxy \
 params binary=/usr/sbin/mysql-proxy
 pidfile=/var/run/mysql-proxy.pid proxy_backend_addresses=10.10.10.5:3306
 proxy_address=172.31.0.191:3306 parameters=--proxy-lua-scripts
 /usr/share/doc/packages/mysql-proxy/examples/tutorial-basic.lua \

This is --proxy-lua-scripts (plural).  I'm guessing maybe that's the
problem.
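For comparison, a corrected version of the primitive might look like this (a sketch only -- the singular flag and quoting are the changes, and I haven't tested this against the mysql-proxy RA):

```
primitive mysql-proxy ocf:heartbeat:mysql-proxy \
        params binary="/usr/sbin/mysql-proxy" \
        pidfile="/var/run/mysql-proxy.pid" \
        proxy_backend_addresses="10.10.10.5:3306" \
        proxy_address="172.31.0.191:3306" \
        parameters="--proxy-lua-script /usr/share/doc/packages/mysql-proxy/examples/tutorial-basic.lua"
```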

HTH,

Tim
-- 
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] pacemaker - corosync with not automatic failover

2012-02-06 Thread Tim Serong

On 02/07/2012 02:26 AM, Dimokritos Stamatakis wrote:

Hello,
regarding my previous issue with pacemaker and heartbeat there was a
problem with the version that apt-get used to retrieve. I now use
pacemaker with corosync and it works fine.

In our setup we need to have the ability to decide which node shall get
the failover IP resource and force them to do so.
In the default corosync-heartbeat configuration the cluster nodes decide
which one shall get the failover IP resource. I want a way to stop the
nodes from auto-assigning the failover IP resource after a node failure.
I tried with monitoring disabled, but nothing happened. If I kill the
node that owns the failover IP resource, then they elect another node as
the new failover IP owner.
I want to stop that, and be able to assign the failover IP to a specific
node via the "crm resource migrate failover-IP node_x" command whenever
I want, and corosync not to assign by itself!

Is there a way to do that?


Well...  If you run "crm resource migrate failover-IP node_x" as
mentioned above, failover-IP will stay on node_x forever, until you
migrate it somewhere else (or unmigrate it, in which case it'll have the 
default behaviour of running on some node) :)
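In other words (commands shown as a sketch -- substitute your own resource and node names):

```
crm resource migrate failover-IP node_x    # pins failover-IP to node_x indefinitely
crm resource unmigrate failover-IP         # removes the pin; normal placement resumes
```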


But you probably want to look at setting up some non-infinity location 
constraints, e.g.:


  location ip-prefer-node_0 failover-IP 100: node_0
  location ip-maybe-node_1  failover-IP  50: node_1
  ...

failover-IP would be placed with preference on node_0 (score 100), or
node_1 (score 50), or on some other node if neither node_0 nor node_1 is
available (and assuming you have more than two nodes).

HTH,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] OCFS2 problems when connectivity lost

2011-12-21 Thread Tim Serong

On 12/21/2011 09:47 PM, Ivan Savčić | Epix wrote:

Hello,


We are having a problem with a 3-node cluster based on
Pacemaker/Corosync with 2 primary DRBD+OCFS2 nodes and a quorum node.

Nodes run on Debian Squeeze, all packages are from the stable branch
except for Corosync (which is from backports for udpu functionality).
Each node has a single network card.

When the network is up, everything works without any problems, graceful
shutdown of resources on any node works as intended and doesn't reflect
on the remaining cluster partition.

When the network is down on one OCFS2 node, Pacemaker
(no-quorum-policy=stop) tries to shut the resources down on that node,
but fails to stop the OCFS2 filesystem resource stating that it is in
use.

*Both* OCFS2 nodes (ie. the one with the network down and the one which
is still up in the partition with quorum) hang with dmesg reporting that
events, ocfs2rec and ocfs2_wq are blocked for more than 120 seconds.


My guess would be:

The filesystem can't stop on the non-quorate node, because the network 
connection is down, so DLM can't do its thing.


The filesystem is probably frozen on the quorate node, because of loss 
of DLM comms.


If STONITH is configured, the non-quorate node should be killed after a 
failed (or timed out) stop, and the quorate node should resume behaving 
normally.
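If STONITH isn't configured yet, a minimal sketch using an IPMI-based agent might look like the following (hostnames, addresses and credentials are placeholders -- pick the fencing agent that matches your hardware):

```
primitive fence-node1 stonith:external/ipmi \
        params hostname="node1" ipaddr="192.168.0.101" \
        userid="admin" passwd="secret" interface="lan"
location fence-node1-placement fence-node1 -inf: node1
property stonith-enabled="true"
```

The -INFINITY location constraint keeps the fencing device off the node it is meant to fence.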


HTH,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Doc: Resource templates

2011-12-13 Thread Tim Serong

On 12/14/2011 02:57 AM, Dejan Muhamedagic wrote:

On Tue, Dec 13, 2011 at 04:18:33PM +0800, Gao,Yan wrote:

On 12/13/11 04:25, Andrew Beekhof wrote:

On Mon, Dec 12, 2011 at 9:20 PM, Gao,Yany...@suse.com  wrote:

On 12/12/11 17:52, Florian Haas wrote:

On Mon, Dec 12, 2011 at 10:36 AM, Gao,Yany...@suse.com  wrote:

On 12/12/11 17:16, Florian Haas wrote:

On Mon, Dec 12, 2011 at 10:04 AM, Gao,Yany...@suse.com  wrote:

On 12/12/11 15:55, Gao,Yan wrote:

Hi,
As some people have noticed, we've provided a new feature Resource
templates since pacemaker-1.1.6. I made a document about it which is
meant to be included in Pacemaker_Explained. I borrowed the materials
from Tanja Roth and Thomas Schraitle (the documentation specialists
from SUSE) and Dejan Muhamedagic. Thanks to them!

Attaching it here first. If you are interested, please help review it.
And if anyone would like to help convert it into DocBook and make a
patch, I would much appreciate it. :-)

I can tell people would like to see a crm shell version of it as well.
I'll sort it out and post it here soon.

Attached the crm shell version of the document.


As much as I appreciate the new feature, was it really necessary that
you re-used a term that already has a defined meaning in the shell?

http://www.clusterlabs.org/doc/crm_cli.html#_templates

Couldn't you have called them resource prototypes instead? We've
already confused users enough in the past.

Since Dejan adopted the object name rsc_template in crm shell, and
calls it Resource template in the help, I'm not inclined to use
another term in the document. Opinion, Dejan?


I didn't mean to suggest to use a term in the documentation that's
different from the one the shell uses. I am suggesting to rename the
feature altogether. Granted, it may be a bit late to have a naming
discussion now, but I haven't seen this feature discussed on the list
at all, so there wasn't really a chance to voice these concerns
sooner.

Actually there were discussions on the pcmk-devel mailing list. Given that
it has been included in the pacemaker-1.2 schema and released with
pacemaker-1.1.6, it seems too late for us to change it on the CIB side
now.


Technically it's not yet in the 1.2 area; that change was pending on
this documentation update.

OK then. I would like to hear more voices about that, since Dejan and
Tim have been working on this for quite some time too.


Well, I believe that we already discussed the name. And there
were no better ideas heard. But it could as well be that my
memory fails me.


I don't recall any better naming ideas floating past either (although, 
now that Florian mentions prototype, hmm...)


Anyway, IMO, overloading the word "template" isn't /too/ bad.  It could 
be qualified if necessary as "resource template" (the new feature we're 
talking about here) and "configuration template" (existing shell feature)...
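
For anyone skimming the archive, here is a minimal crm shell sketch of the
feature being named (the template and primitive names below are invented
for illustration, not taken from the thread):

```
# Define a resource template, then reference it from primitives with "@".
rsc_template vm-base ocf:heartbeat:VirtualDomain \
    params hypervisor="qemu:///system" \
    op monitor interval="30s"
primitive vm1 @vm-base params config="/etc/libvirt/qemu/vm1.xml"
primitive vm2 @vm-base params config="/etc/libvirt/qemu/vm2.xml"
```

Each primitive referencing @vm-base inherits the template's parameters and
operations, and may override or extend them with its own.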


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] ACL setup

2011-12-11 Thread Tim Serong

On 12/10/2011 10:35 AM, Larry Brigman wrote:

On Fri, Dec 9, 2011 at 3:19 PM, Andreas Kurz <andr...@hastexo.com> wrote:

Hello Larry,

On 12/09/2011 11:15 PM, Larry Brigman wrote:
  I have installed pacemaker 1.1.5 and configured ACLs based on the
info from
  http://www.clusterlabs.org/doc/acls.html
 
  It looks like the user still does not have read access.
 
  Here is the acl section of the config:
  <acls>
    <acl_role id="monitor">
      <read id="monitor-read" xpath="/cib"/>
    </acl_role>
    <acl_user id="nvs">
      <role_ref id="monitor"/>
    </acl_user>
    <acl_user id="acm">
      <role_ref id="monitor"/>
    </acl_user>
  </acls>
 
  Here is what the user is getting:
  [nvs@sweng0057 ~]$ crm node show
  Signon to CIB failed: connection failed
  Init failed, could not perform requested operations
  ERROR: cannot parse xml: no element found: line 1, column 0
  [nvs@sweng0057 ~]$ crm status
 
  Connection to cluster failed: connection failed
 
 
  Any ideas as to why this wouldn't work and what to fix?

If you really followed the guide exactly ... did you check that user nvs
is already in the haclient group?

Thought of that.

Adding the user to the haclient group removes any restrictions as I was
able to
write to the config without error.


Did you set "crm configure property enable-acl=true"?  Without this, all 
users in the haclient group have full access.
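
A sketch of the minimal sequence, assuming the user name from the original
post (run the usermod on every node):

```
# ACLs only apply to users who are in the haclient group
usermod -a -G haclient nvs
# Without this property, haclient membership grants full access
crm configure property enable-acl=true
```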


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] How to stop a failed resource?

2011-11-07 Thread Tim Serong

On 11/07/2011 08:27 PM, Tim Ward wrote:

From: Andreas Kurz [mailto:andr...@hastexo.com]

and of course you did:

crm resource cleanup TestResource42


That works, thanks.

However I found no mention of it in either Clusters from Scratch or
Pacemaker Explained ... so which document(s) have I missed, please?


http://clusterlabs.org/doc/crm_cli.html

Also, just run "crm"; it has tab completion, online help, etc.

Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Language bindings, again (was Re: Newcomer's question - API?)

2011-11-02 Thread Tim Serong

On 11/02/2011 06:35 PM, Florian Haas wrote:

On 2011-11-02 04:33, Tim Serong wrote:

<ianal>I vaguely recall reading the FSF considered headers generally
exempt from GPL provisions, provided they're "boring", i.e. just
structs, function definitions etc.  If they're a whole lotta inline
code, things are a bit different.</ianal>


Really?


Here's a rough citation:

http://linux.slashdot.org/story/11/03/20/1529238/rms-on-header-files-and-derivative-works

(No, I didn't read the source material or any of the comments)


Anyway.  Roughly speaking, if we *did* have other language bindings for
libcib/libpengine, the story would be something like this (Andrew can
correct me if I'm wrong):

libcib would let you manipulate the CIB programmatically, with much the
same ability you have when running cibadmin, i.e. you're just
manipulating chunks of XML.  Unless I'm not paying attention, there's no,
e.g., "create resource" API; your program would have to construct the
correct XML resource definition then give it to libcib to inject into
the cluster configuration. Likewise, to stop and start a resource,
you'll be writing code to set the target-role meta attribute of that
resource.


I hate to handwave, as due to my practically non-existent C and C++-fu
this is something I can't tackle myself. But let me float this idea here
again.

Coming from Python, what's usually available there is a thin, low-level
wrapper around the C API, plus a high-level object-oriented API that is
the only thing callers ever actually use.

To make this portable to multiple languages, one possible option that's
been suggested to me before is to create an OO C++ wrapper around the
libcib/libpengine C APIs, and then SWIGify that (I do understand Andrew
cringes at that, which I'll just accept for a moment). Such that,
eventually, you might end up with something like

cib = cib.connect()
r = cib.resources.add("p_mysql", "ocf:heartbeat:mysql",
  binary="/usr/bin/mysqld")
cib.commit()
r.start()

Extrapolate for Perl, Java, PHP, Ruby, or anything else that SWIG supports.


No objection to that in principle - the major part of the work there is 
(or should be) the wrapper layer though, not the SWIG bits.


By contrast, SWIGing what we have now would only give the thin, 
low-level wrapper you referred to above.  And anyone using that thin 
wrapper would probably need to go read crm_resource.c or crm_mon.c to 
figure out how to use it :)



So you may as well just invoke cibadmin, crm_resource,
crm_attribute directly.  I think it's safe to assume those interfaces
are stable.  At a higher level, crm configure ... should also be
considered pretty stable; we know people use it in scripts so we try not
to break it (and BTW, I use all this stuff in Hawk[1]).


Where I do seem to recall you conceded at one point that firing off a
binary every time you need to get a resource status doesn't exactly
scale to scores of resources in, say, a 16-node cluster, and a Ruby
library interface would be much more useful. Or am I mis-remembering?


No, you're not misremembering, but my previous email maybe could have 
been clearer...  For creating/modifying resources, IMO there's minimal 
overhead in invoking crm, or cibadmin or whatever, because you usually 
only have one-ish invocation(s) per create/edit/delete.


Getting status is the annoying thing.  The only way I know to do it 
comprehensively that doesn't involve multiple invocations of some CLI 
tool is to run "cibadmin -Q", then interpret the "status" section, which 
is what I do in Hawk.  This means I now have a few hundred lines of 
fairly hairy Ruby code which reimplements a few of Pacemaker's pengine 
status unpack functions.  Which works, BTW.  But it doesn't really help 
anyone else, and TBH SWIG bindings would serve Hawk better here anyway, 
because then status calculation would only happen in one place 
(pengine), which would mean zero possibility of 
drift/misinterpretation/confusion.  Blah.


I do actually want to do the SWIG bindings at some point (it still 
hasn't filtered to the top of my list, and I wouldn't complain if 
someone beat me to it), but I want to make sure that whatever we do 
here, we get it right, because once it's there, we'll have to support it.


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] [Ocfs2-users] Error building ocfs2-tools

2011-11-02 Thread Tim Serong

Hi Nick,

It might not be obvious, but IMO this probably belongs back on the 
Pacemaker list (CC'd).


On 11/03/2011 02:40 AM, Nick Khamis wrote:

Hello Sunil and Tim,

Thank you so much for your responses. I have applied the patch, and
recompiled ocfs2-tools. When spinning
the pcmk stack, I am recieving the following error from ocfs_conrtold.pcmk

ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: crm_abort:
send_ais_text: Triggered assert at ais.c:346 : dest != crm_msg_ais
Sending message 0 via cpg: FAILED (rc=22): Message error: Success (0)
ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: send_ais_text:
Sending message 0 via cpg: FAILED (rc=22): Message error: Success (0)
ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: crm_abort:
send_ais_text: Triggered assert at ais.c:346 : dest != crm_msg_ais
Sending message 1 via cpg: FAILED (rc=22): Message error: Success (0)
ocfs2_controld[14698]: 2011/11/02_11:32:19 ERROR: send_ais_text:
Sending message 1 via cpg: FAILED (rc=22): Message error: Success (0)
1320247939 setup_stack@170: Cluster connection established.  Local node id: 1
1320247939 setup_stack@174: Added Pacemaker as client 1 with fd -1


When in doubt, use the source...

ocfs2-tools' ocfs2_controld/pacemaker.c:165[1] says:

  send_ais_text(crm_class_notify, true, TRUE, NULL, crm_msg_ais);

pacemaker's lib/common/ais.c:327[2] says:

  switch(cluster_type) {
case pcmk_cluster_classic_ais:
  ...
  break;
case pcmk_cluster_corosync:
case pcmk_cluster_cman:
  transport = cpg;
  CRM_CHECK(dest != crm_msg_ais, rc = CS_ERR_MESSAGE_ERROR;
goto bail);

So you're hitting that assert, because Pacemaker sees cluster_type as 
either pcmk_cluster_corosync or pcmk_cluster_cman.


If Pacemaker saw cluster_type as pcmk_cluster_classic_ais, it would 
work fine.


From memory, you're running Pacemaker under CMAN, somehow. 
Unfortunately I have no idea what you need to do to reconfigure it so 
that ocfs2_controld works, or even if it will work in that environment, 
but the above code is the source of your trouble.


Regards,

Tim

[1] 
http://oss.oracle.com/git/?p=ocfs2-tools.git;a=blob;f=ocfs2_controld/pacemaker.c;h=822cf41c4c64cd3e5cb4373c339c2e575c4a5efd;hb=d45856e4a75348c1e3b44dc510c6b7f07b88a36f#l165
[2] 
http://hg.clusterlabs.org/pacemaker/1.1/file/9971ebba4494/lib/common/ais.c#l327 
but note ais.c moved to corosync.c in newer source tree on github


--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] [Linux-HA] pcmk + corosync + cman for dlm support?

2011-11-02 Thread Tim Serong

On 11/03/2011 04:11 PM, Vladislav Bogdanov wrote:

02.11.2011 16:36, Nick Khamis wrote:

Vladislav,

Thank you so much for your response. Just to make sure, all I need is to:
* Apply the three patches to cman, found here:
http://www.gossamer-threads.com/lists/linuxha/pacemaker/75164?do=post_view_threaded
* Recompile CMAN
* Do I have to recompile PCMK again?

I also want to mention that fencing is not important right now, and I
would like to disable fencing JUST for the prototype,
and untill things are going. I am almost there (cman-pcmk+ocfs2) with


Those patches are for fencing only.
I have no idea about what goes wrong with your ocfs2_controld, I gave up
on trying ocfs2 because it hangs the whole cluster for me.


That reminds me...  Nick, if you disable fencing (even for your 
prototype), and you experience (or try to test) any kind of split brain, 
or you kill one node (ungracefully), the clustered filesystem on all the 
other (surviving) nodes will freeze/lock up, because the cluster is 
unable to fence the failed node.  Even if you choose something like 
meatware (see http://clusterlabs.org/doc/crm_fencing.html), you should 
still configure *some* means of fencing for any prototype system that's 
going to need fencing when put into production :)
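
If nothing else is available for a prototype, even a manual-confirmation
agent is better than no fencing at all.  A sketch only (the meatware
plugin ships with cluster-glue; agent availability and parameter names may
vary by version, and the node names are invented):

```
primitive st-manual stonith:meatware \
    params hostlist="node1 node2"
clone fencing-clone st-manual
```

With meatware, a human must run meatclient to confirm the failed node is
really down before the cluster proceeds.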


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Newcomer's question - API?

2011-11-01 Thread Tim Serong

On 11/02/2011 08:34 AM, Florian Haas wrote:

On 2011-11-01 21:30, Andrew Beekhof wrote:

On Wed, Nov 2, 2011 at 7:04 AM, Florian Haas <flor...@hastexo.com> wrote:

On 2011-11-01 17:52, Tim Ward wrote:

You can try tooking at LCMC as that is a Java-based GUI that
should at least get you going.


I did find some Java code but we can't use it because it's GPL, and I
didn't want to study it in case I accidentally copied some of it in
recreating it.


Well if you can't use anything that's under GPL, I presume anything
derived from cib.h is off limits to you anyway, as _that_ is under GPL.


LGPL iirc



From include/crm/cib.h:


/*
  * Copyright (C) 2004 Andrew Beekhof <and...@beekhof.net>
  *
  * This program is free software; you can redistribute it and/or
  * modify it under the terms of the GNU General Public
  * License as published by the Free Software Foundation; either
  * version 2 of the License, or (at your option) any later version.

Doesn't say much about LGPL afaics.


<ianal>I vaguely recall reading the FSF considered headers generally 
exempt from GPL provisions, provided they're "boring", i.e. just 
structs, function definitions etc.  If they're a whole lotta inline 
code, things are a bit different.</ianal>


Anyway.  Roughly speaking, if we *did* have other language bindings for 
libcib/libpengine, the story would be something like this (Andrew can 
correct me if I'm wrong):


libcib would let you manipulate the CIB programmatically, with much the 
same ability you have when running cibadmin, i.e. you're just 
manipulating chunks of XML.  Unless I'm not paying attention, there's no, 
e.g., "create resource" API; your program would have to construct the 
correct XML resource definition then give it to libcib to inject into 
the cluster configuration.  Likewise, to stop and start a resource, 
you'll be writing code to set the target-role meta attribute of that 
resource.  So you may as well just invoke cibadmin, crm_resource, 
crm_attribute directly.  I think it's safe to assume those interfaces 
are stable.  At a higher level, crm configure ... should also be 
considered pretty stable; we know people use it in scripts so we try not 
to break it (and BTW, I use all this stuff in Hawk[1]).
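
As a concrete (hypothetical) illustration of the above: the XML a program
would construct itself before handing it to libcib or cibadmin, including
the target-role trick for stopping a resource. The resource name and agent
are invented for this sketch, not taken from the thread.

```python
import xml.etree.ElementTree as ET

def make_primitive(rsc_id, ra_class, provider, ra_type, target_role=None):
    # Build <primitive id=... class=... provider=... type=...>
    prim = ET.Element("primitive", {"id": rsc_id, "class": ra_class,
                                    "provider": provider, "type": ra_type})
    if target_role is not None:
        # There is no "stop resource" API call: stopping/starting is done
        # by setting the target-role meta attribute on the resource.
        meta = ET.SubElement(prim, "meta_attributes",
                             {"id": rsc_id + "-meta"})
        ET.SubElement(meta, "nvpair",
                      {"id": rsc_id + "-meta-target-role",
                       "name": "target-role", "value": target_role})
    return prim

xml_text = ET.tostring(
    make_primitive("p_dummy", "ocf", "heartbeat", "Dummy",
                   target_role="Stopped"),
    encoding="unicode")
print(xml_text)
```

The resulting string is what would be injected into the configuration
(e.g. via "cibadmin -C -o resources").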


libpengine is more interesting.  That would give you reliable 
information about resource status.  The alternative (given no other 
language bindings) is generally either:


  - various invocations of crm_mon and crm_resource (maybe lots of
invocations, depending on what information you want to extract),
which can suck on large clusters, or,

  - one invocation of "cibadmin -Q" to get the CIB status section,
then process this yourself to determine resource status, using the
Dragon Page[2] as a guide.  If you do a good job of this and/or
you care about op history (not just current status), you will
end up reimplementing parts of the determine_online_status() and
unpack_rsc_op() functions from Pacemaker's lib/pengine/unpack.c in
$other_language_of_your_choice.
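
To make the second alternative concrete, here is a deliberately tiny
sketch of that idea.  It handles only a fabricated status fragment with
one completed op per resource; it is nothing like the full
determine_online_status()/unpack_rsc_op() logic (no op history, no
pending ops, no failure handling policy).

```python
import xml.etree.ElementTree as ET

# Fabricated <status> fragment, loosely shaped like CIB status output
STATUS_XML = """\
<status>
  <node_state id="1" uname="node1">
    <lrm><lrm_resources>
      <lrm_resource id="p_ip">
        <lrm_rsc_op id="p_ip_start_0" operation="start" rc-code="0"/>
      </lrm_resource>
      <lrm_resource id="p_fs">
        <lrm_rsc_op id="p_fs_monitor_0" operation="monitor" rc-code="7"/>
      </lrm_resource>
    </lrm_resources></lrm>
  </node_state>
</status>
"""

def resource_states(status_xml):
    """Map (node, resource) -> Started/Stopped/Failed from the last op."""
    states = {}
    for node in ET.fromstring(status_xml).iter("node_state"):
        for rsc in node.iter("lrm_resource"):
            op = rsc.find("lrm_rsc_op")  # toy: assume exactly one op each
            rc = op.get("rc-code")
            # OCF rc 0 = success, 7 = not running; anything else = failure
            states[(node.get("uname"), rsc.get("id"))] = (
                "Started" if rc == "0" else
                "Stopped" if rc == "7" else "Failed")
    return states

print(resource_states(STATUS_XML))
# -> {('node1', 'p_ip'): 'Started', ('node1', 'p_fs'): 'Stopped'}
```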

Regards,

Tim

[1] http://clusterlabs.org/wiki/Hawk
[2] 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch-status.html


--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Trouble with KVM Resource

2011-10-31 Thread Tim Serong

On 11/01/2011 02:23 PM, Cliff Massey wrote:


  I am having a problem with my kvm resource. It was working until I decided to 
re-install the kvm machine. The libvirt xml file and the pacemaker 
configuration did not change. I can start the kvm outside of pacemaker just 
fine. When I check the libvirt log, it shows no attempt to start the kvm 
machine from pacemaker.

crm_mon -1 shows:

Online: [ admin01 admin02 ]

  convirt-kvm   (ocf::heartbeat:VirtualDomain): Started admin01 (unmanaged) 
FAILED
  Master/Slave Set: ms-convirt [convirt-drbd]
  Masters: [ admin02 ]
  Slaves: [ admin01 ]
  sitescope-kvm (ocf::heartbeat:VirtualDomain): Started admin02
  Master/Slave Set: ms-sitescope [sitescope-drbd]
  Masters: [ admin02 ]
  Slaves: [ admin01 ]

Failed actions:
 convirt-kvm_monitor_0 (node=admin01, call=2, rc=1, status=complete): 
unknown error
 convirt-kvm_stop_0 (node=admin01, call=6, rc=1, status=complete): unknown 
error

My other kvm machine with the same config works just fine.


I can't tell you why it doesn't work anymore, but...



my logs are at:   http://pastebin.com/peFw5KKp


The relevant bit of that log is (pardon the formatting):

Nov  1 03:14:37 admin01 crmd: [15349]: info: te_rsc_command: Initiating 
action 4: monitor convirt-kvm_monitor_0 on admin01 (local)

...
Nov  1 03:14:38 admin01 VirtualDomain[15370]: ERROR: 
/var/run/heartbeat/rsctmp/VirtualDomain-convirt-kvm.state is empty. This 
is unexpected. Cannot determine domain name.

...
Nov  1 03:14:38 admin01 lrmd: [15346]: WARN: Managed convirt-kvm:monitor 
process 15370 exited with return code 1.

...
Nov  1 03:14:38 admin01 crmd: [15349]: info: process_lrm_event: LRM 
operation convirt-kvm_monitor_0 (call=2, rc=1, cib-update=29, 
confirmed=true) unknown error


So the probe (and presumably subsequent stop) for that resource failed, 
hence no attempt to start it.  As for how the state file is empty, I'm 
not sure.  Look at VirtualDomain_Define() in 
/usr/lib/ocf/resource.d/heartbeat/VirtualDomain (line ~200 onwards), by 
my reading it shouldn't be possible for that state file to be empty. 
Unless, somehow (wild guess), permissions on the state file or some 
parent directory prohibit writing?


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] cloning primatives with differing params

2011-10-25 Thread Tim Serong

On 26/10/11 05:45, Brian J. Murrell wrote:

I want to create a stonith primitive and clone it for each node in my
cluster.  I'm using the fence-agents virsh agent as my stonith
primitive.  Currently for a single node it looks like:

primitive st-pm-node1 stonith:fence_virsh \
    params ipaddr="192.168.122.1" login="xxx" passwd="xxx" port="node1" \
        action="reboot" pcmk_host_list="node1" pcmk_host_check="static-list" \
        pcmk_host_map="" secure="true"

But of course that only works for one node and I want to create a
clonable primitive that will apply to all nodes as they are added
to the cluster.  What is stumping me though is the required port
parameter which is the node to stonith.  I've not seen an example
of how a clone resource can be created that can substitute values
in for each clone.  Is that even possible?


OCF resource agents can be aware they're running as clones, and do 
interesting things as a result.  E.g. IPaddr2, when cloned with the 
unique_clone_address parameter set, will add the clone ID to the IP 
address, giving you a whole bunch of IP addresses.


Unfortunately I don't know offhand if the same trick can work with 
STONITH agents (they'd have to be told by pacemaker they were cloned, 
and then each would have to be instrumented to support it).




On a pretty un-related question... given an asymmetric cluster, is there
a way to specify that a resource can run on any node without having
to add a location constraint for each node as they are added?


You could try one constraint per resource, covering all nodes, something 
like:


  location some-res-on-all-nodes some-resource \
rule 0: #uname eq node1 or #uname eq node2 or #uname eq node3 ...

Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] 4 servers; different resources on different servers?

2011-10-03 Thread Tim Serong

On 04/10/11 04:06, Nick Khamis wrote:

I forgot to ask: for creating an asymmetric cluster, do the services
(mysql, apache, etc.) have to
be installed on all the nodes?


Probably.  Pacemaker will still try to probe resources on all nodes, to 
ensure they're not running; the RA will then return "not installed" if 
the software isn't installed, and you'll see an error message on that 
node.  The error might not matter, but you might not like to see it :)



And finally, is asymmetric the same as active/active?


Asymmetric means resources will never run at all by default, unless you 
specifically create location constraints to make them run on some node.


Active/active generally means something like "some set of resources is 
running on at least two nodes"[1].  There is no reason you can't do this 
with an asymmetric cluster.  It just depends what location constraints 
you configure.
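
For example (a sketch; the resource and node names are invented), an
asymmetric cluster that runs a cloned resource "active/active" on exactly
two nodes could be expressed as:

```
property symmetric-cluster="false"
location web-on-node1 web-clone 100: node1
location web-on-node2 web-clone 100: node2
```

With no constraint naming any other node, the clone instances stay off
those nodes.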


HTH,

Tim

[1] Depending on your definition, it might also mean the exact same 
resource is running on at least two nodes, e.g.: a clustered filesystem.


--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Call cib_modify failed (-22): The object/attribute does not exist

2011-09-26 Thread Tim Serong

On 25/09/11 01:16, Brian J. Murrell wrote:

Using pacemaker-1.0.10-1.4.el5 I am trying to add the R_10.10.10.101
IPaddr2 example resource:

<primitive id="R_10.10.10.101" class="ocf" type="IPaddr2"
 provider="heartbeat">
  <instance_attributes id="RA_R_10.10.10.101">
    <attributes>
      <nvpair id="R_ip_P_ip" name="ip" value="10.10.10.101"/>
      <nvpair id="R_ip_P_nic" name="nic" value="eth0"/>
    </attributes>
  </instance_attributes>
</primitive>

from the cibadmin manpage under EXAMPLES and getting:

# cibadmin -o resources -U -x test.xml
Call cib_modify failed (-22): The object/attribute does not exist
null

Any ideas why?


Because:

1) You need to run "cibadmin -o resources -C -x test.xml" to create the
   resource (-C creates, -U updates an existing resource).

2) Even if you use -C, it will probably still fail due to a schema
   violation, because the <attributes> element is bogus (apparently the
   cibadmin man page needs tweaking).  Try:

  <primitive id="R_10.10.10.101" class="ocf" type="IPaddr2"
   provider="heartbeat">
    <instance_attributes id="RA_R_10.10.10.101">
      <nvpair id="R_ip_P_ip" name="ip" value="10.10.10.101"/>
      <nvpair id="R_ip_P_nic" name="nic" value="eth0"/>
    </instance_attributes>
  </primitive>

Better yet, use the crm shell instead of cibadmin, and you can forget 
about the XML :)
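
For comparison, the crm shell one-liner that should produce an equivalent
resource (element IDs are then generated for you):

```
crm configure primitive R_10.10.10.101 ocf:heartbeat:IPaddr2 \
    params ip="10.10.10.101" nic="eth0"
```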


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] crm_mon -n -1 : Command output format

2011-09-08 Thread Tim Serong

On 09/09/11 01:28, manish.gu...@ionidea.com wrote:

Hi,

   I am using the "crm_mon -n -1" command to parse resource status. Sometimes
the format changes, and because of that I am getting unexpected output in my
backend programme.

   Please can anybody help me to know all the expected possible output
formats of the "crm_mon -n -1" command.

   General format of output:

   =
   Node NodeName: NodeStatus
ResourceName ResourceAgentType Status
   =
   But for a clone resource I am getting this, when the cluster is in unmanaged status:

   Node NodeName: NodeStatus
ResourceName ResourceAgentType (ORPHANED) Status

   Due to "(ORPHANED)" the resource status is shifted, and I am getting the wrong result.

  Please can you help me with all the possible output scenarios?
  Or please can you share the source code of the crm_mon command?


I don't think all possible output scenarios are documented anywhere, 
given crm_mon is generally more for human consumption.  If it helps 
though, the source is at:


  http://hg.clusterlabs.org/pacemaker/1.1/file/tip/tools/crm_mon.c

You might also like to experiment with "crm_resource -O", although I 
can't say offhand what that does with orphans.
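
If you do keep parsing the text output, one way to cope with the shifting
field is to anchor on known tokens rather than column positions.  A toy
sketch (the sample below is fabricated from the two line shapes quoted in
this thread; real crm_mon output has more variations):

```python
import re

# Fabricated sample: a plain resource line, and one with "(ORPHANED)"
SAMPLE = """\
Node node01: online
    p_fs_wwwdata:0 (ocf::heartbeat:Filesystem) Started
Node node02: online
    c_dummy:1 (ocf::pacemaker:Dummy) (ORPHANED) Started
"""

NODE_RE = re.compile(r"Node (\S+): (\S+)")
RSC_RE = re.compile(r"\s+(\S+)\s+\((\S+)\)\s+(?:\(ORPHANED\)\s+)?(\S+)")

def parse(text):
    result, node = {}, None
    for line in text.splitlines():
        m = NODE_RE.match(line)
        if m:
            node = m.group(1)
            result[node] = {}
            continue
        m = RSC_RE.match(line)
        if m and node:
            result[node][m.group(1)] = m.group(3)  # resource -> status
    return result

print(parse(SAMPLE))
```

Here the optional "(ORPHANED)" token is skipped explicitly, so the status
field lands in the same capture group either way.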


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] crm resource status and HAWK display differ after manually mounting filesystem resource

2011-08-30 Thread Tim Serong

On 29/08/11 13:24, Tim Serong wrote:

On 28/08/11 21:43, Sebastian Kaps wrote:

Hi,

on our two-node cluster (SLES11-SP1+HAE; corosync 1.3.1, pacemaker
1.1.5) we have defined the following FS resource and its corresponding
clone:

primitive p_fs_wwwdata ocf:heartbeat:Filesystem \
params device=/dev/drbd1 \
directory=/mnt/wwwdata fstype=ocfs2 \
options=rw,noatime,noacl,nouser_xattr,commit=30,data=writeback \
op start interval=0 timeout=90s \
op stop interval=0 timeout=300s

clone c_fs_wwwdata p_fs_wwwdata \
params master-max=2 clone-max=2 \
meta target-role=Started is-managed=true

one of the nodes (node01) went down last night and I started it with
the cluster put into maintenance-mode.
After checking everything else, I mounted the ocfs2-resource manually,
did some crm resource reprobe/cleanup to make the cluster aware of
this and finally turned off the maintenance-mode.

Looking at the output of crm_mon, everything looks good again:

Clone Set: c_fs_wwwdata [p_fs_wwwdata]
Started: [ node01 node02 ]

alternatively looking at crm_mon -n:

Node node02: online
p_fs_wwwdata:1 (ocf::heartbeat:Filesystem) Started

Node node01: online
p_fs_wwwdata:0 (ocf::heartbeat:Filesystem) Started

but the HAWK web interface (version 0.3.6 coming with SLES11SP1-HAE)
displays this:

Clone Set: c_fs_wwwdata
- p_fs_wwwdata:0: Started: node01, node02
- p_fs_wwwdata:1: Stopped

Does anybody know why there is a difference?
Did I make a mistake when manually mounting the FS while it was
unmanaged?
Or is this only a cosmetical issue with HAWK?

When these resources are started by pacemaker, HAWK shows exactly
what's expected: two started resources, one per node.

Thanks in advance!



It's almost certainly a cosmetic issue in Hawk. I have fixed one or two
bugs along these lines since version 0.3.6. If you'd like to try a newer
(not-officially-supported-by-SUSE-but-best-effort-support-by-me) build,
you can try hawk-0.4.1 from:

http://software.opensuse.org/search?q=Hawk&baseproject=SUSE%3ASLE-11%3ASP1&lang=en


Alternately, if you can reproduce the issue then send me the output of
cibadmin -Q (offlist is fine), I can verify/fix it.


Just for the record, it was a cosmetic issue in Hawk, now fixed in hg:

  http://hg.clusterlabs.org/pacemaker/hawk/rev/3266874ef3fe

Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] crm resource status and HAWK display differ after manually mounting filesystem resource

2011-08-28 Thread Tim Serong

On 28/08/11 21:43, Sebastian Kaps wrote:

Hi,

on our two-node cluster (SLES11-SP1+HAE; corosync 1.3.1, pacemaker 1.1.5) we 
have defined the following FS resource and its corresponding clone:

primitive p_fs_wwwdata ocf:heartbeat:Filesystem \
 params device=/dev/drbd1 \
directory=/mnt/wwwdata fstype=ocfs2 \
options=rw,noatime,noacl,nouser_xattr,commit=30,data=writeback \
 op start interval=0 timeout=90s \
 op stop interval=0 timeout=300s

clone c_fs_wwwdata p_fs_wwwdata \
 params master-max=2 clone-max=2 \
 meta target-role=Started is-managed=true

one of the nodes (node01) went down last night and I started it with the 
cluster put into maintenance-mode.
After checking everything else, I mounted the ocfs2-resource manually, did some crm 
resource reprobe/cleanup to make the cluster aware of this and finally turned off 
the maintenance-mode.

Looking at the output of crm_mon, everything looks good again:

  Clone Set: c_fs_wwwdata [p_fs_wwwdata]
  Started: [ node01 node02 ]

alternatively looking at crm_mon -n:

Node node02: online
 p_fs_wwwdata:1  (ocf::heartbeat:Filesystem) Started

Node node01: online
 p_fs_wwwdata:0  (ocf::heartbeat:Filesystem) Started

but the HAWK web interface (version 0.3.6 coming with SLES11SP1-HAE) displays 
this:

Clone Set: c_fs_wwwdata
   - p_fs_wwwdata:0: Started: node01, node02
   - p_fs_wwwdata:1: Stopped

Does anybody know why there is a difference?
Did I make a mistake when manually mounting the FS while it was unmanaged?
Or is this only a cosmetical issue with HAWK?

When these resources are started by pacemaker, HAWK shows exactly what's 
expected: two started resources, one per node.

Thanks in advance!



It's almost certainly a cosmetic issue in Hawk.  I have fixed one or two 
bugs along these lines since version 0.3.6.  If you'd like to try a 
newer (not-officially-supported-by-SUSE-but-best-effort-support-by-me) 
build, you can try hawk-0.4.1 from:


http://software.opensuse.org/search?q=Hawk&baseproject=SUSE%3ASLE-11%3ASP1&lang=en

Alternately, if you can reproduce the issue then send me the output of 
cibadmin -Q (offlist is fine), I can verify/fix it.


Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] DLM and Control instances for OCFS2

2011-08-22 Thread Tim Serong

On 19/08/11 13:12, Prakash Velayutham wrote:

Hi,

I am using

pacemaker - 1.1.5-5.5.5
corosync - 1.3.0-5.6.1
ocfs2 - 1.4.3-0.16.7

I will be using 2 OCFS2 volumes for different purposes. Is it enough to have 
just one instance of

ocf:pacemaker:controld
and
ocf:ocfs2:o2cb

or do I need a separate instance of the above for each OCFS2 volume being 
managed by Corosync/Pacemaker cluster?


Nope, just the one.
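For illustration only -- a minimal crm-shell sketch of the usual pattern this implies (hypothetical resource IDs, not from this thread): a single cloned controld/o2cb base group, with each OCFS2 Filesystem resource then ordered and colocated after that one clone:

```
primitive p_controld ocf:pacemaker:controld
primitive p_o2cb ocf:ocfs2:o2cb
group g_o2cb_base p_controld p_o2cb
clone cl_o2cb_base g_o2cb_base meta interleave=true
# each OCFS2 Filesystem clone then depends on the same base clone, e.g.:
#   order o_fs1 inf: cl_o2cb_base cl_fs1
#   colocation c_fs1 inf: cl_fs1 cl_o2cb_base
```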

Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Announce: Hawk 4.1 (Pacemaker GUI) packages for Debian Squeeze

2011-08-22 Thread Tim Serong

On 22/08/11 07:44, Joerg Sauer wrote:

On Aug 20, 2011, at 6:14 PM, Joerg Sauer lists_pacema...@dizopsin.net
wrote:

This version should also install and run on Ubuntu 10.04 (only minimally
tested).

On Sun August 21 2011 06:07:26 Cotton Tenney wrote:

Awesome, I'll be trying this out next week. Thanks!


Uhm, that statement about Ubuntu 10.04 was actually wildly incorrect. The
Squeeze package will not work on Lucid, so I created a separate one. It does
not use the Ruby libs provided by Ubuntu, though (has frozen gems just like
upstream).

There is also an APT repository with both packages now.

More information:
http://www.dizopsin.net/debian-and-ubuntu-packages-for-clusterlabs-ha

Best regards,
Joerg


Many thanks for your work!

Regards,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Extracting resource state information from the XML

2011-08-11 Thread Tim Serong

On 11/08/11 21:51, pskrap wrote:


Hi,

I have a setup with tens of resources over several nodes. The interface that is
used to administer the system has a page showing all resources, their state and
which node they are running on.

I can get the information of one resource using 'crm_resource -W -r rsc' but
running this command over and over again for that many resources is far too slow
for my needs. The crm_mon-produced web page is not enough as I need the data in a
customized format. I figured the best way to do this efficiently is to query the
XML using cibadmin -Q, parse it, and get the state of all resources from there in
one go.

Unfortunately I am not familiar with the status part of the XML. Is anyone able
to tell me how I can find the following information in the XML:

- resource state (running, stopped, failed)
- which node the resource is currently running on


You probably want to read "Chapter 12. Status - Here be dragons" of 
Pacemaker Explained:


http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch-status.html

In particular, the Complex Resource History Example:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch12s03s02.html

Very roughly speaking, for each node_state, you have to look at each 
lrm_resource_op for each lrm_resource, and based on the specific op 
(start, stop, monitor, promote, demote, etc.) and its return code, you 
determine the state of the resource on that node.  e.g.: if the last op 
was a successful (rc=0) start, or a successful monitor, the resource is 
running on that node.


If you're in a hurry, you might find it less painful to parse the output 
of something like crm_mon -o -1 or crm_mon -n -1.


Or, if you'd like to examine some hairy Ruby code for interpreting the 
CIB status section, have a look at:


http://hg.clusterlabs.org/pacemaker/hawk/file/tip/hawk/app/models/cib.rb#l300

Note though that this looks at all the ops, to record a list of what's 
failed (it's a loose transliteration of Pacemaker's C code that does the 
same thing).  If you only care about state, you probably only care about 
the *last* op.
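To make that walk concrete, here is a hedged sketch (not from the thread) of interpreting `cibadmin -Q` output along those lines. Element and attribute names follow the Pacemaker 1.x CIB schema (`node_state`, `lrm_resource`, `lrm_rsc_op`, `rc-code`, `call-id`); real CIBs have more cases (pending ops, migration, failed probes) than this handles:

```python
# Classify each resource per node by its *last* LRM operation, as described
# above: a successful start/monitor/promote means running, a successful stop
# (or a monitor returning "not running") means stopped, anything else failed.
import xml.etree.ElementTree as ET

def resource_states(cib_xml):
    """Map (node-uname, resource-id) -> 'running' | 'stopped' | 'failed' | 'unknown'."""
    states = {}
    for node in ET.fromstring(cib_xml).iter("node_state"):
        for rsc in node.iter("lrm_resource"):
            # Order ops by call-id so the last list element is the latest op.
            ops = sorted(rsc.iter("lrm_rsc_op"),
                         key=lambda op: int(op.get("call-id", "0")))
            if not ops:
                continue
            last = ops[-1]
            name, rc = last.get("operation"), last.get("rc-code")
            if name == "monitor" and rc == "7":
                state = "stopped"            # probe found resource not running
            elif rc not in ("0", "8"):       # 8 = running as Master
                state = "failed"
            elif name == "stop":
                state = "stopped"
            elif name in ("start", "monitor", "promote"):
                state = "running"
            else:
                state = "unknown"
            states[(node.get("uname"), rsc.get("id"))] = state
    return states
```

In practice you would feed it the output of `subprocess.check_output(["cibadmin", "-Q"])` and render the resulting dict however your admin interface needs.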


I should also take the opportunity to plug Hawk, if you need a web based 
thing for managing Pacemaker clusters:


http://www.clusterlabs.org/wiki/Hawk

HTH,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] Dependency Loop Errors in Log

2011-08-09 Thread Tim Serong

On 09/08/11 02:36, Bobbie Lind wrote:

I have 6 servers with three sets of 2 failover pairs. So 2 servers for
one pair, 2 servers for another pair etc.  I am trying to configure this
under one pacemaker instance.

I changed from using Resource groups because the resources are not
dependent on each other, just located together.

I have 4 dummy resources that are used to help with colocation.

The following configuration works as designed when I first start up
pacemaker but when I try and run failover tests that's when things get
screwy.

Here is the relevant snippet of my configuration that shows the location
and colocation set up.  As well as what I *think* I am asking it to do.

[...snip...]

** Ensuring that the resources from one failover node do not start up on
the other nodes giving -500 points.
** failover pairs are MDSgroup, OSS1/OSS3, and OSS2/OSS4
colocation colocMDSOSS1 -500: anchorOSS1 MDSgroup
colocation colocMDSOSS2 -500: anchorOSS2 MDSgroup
colocation colocMDSOSS3 -500: anchorOSS3 MDSgroup
colocation colocMDSOSS4 -500: anchorOSS4 MDSgroup
colocation colocOSS1MDS -500: MDSgroup anchorOSS1
colocation colocOSS2MDS -500: MDSgroup anchorOSS2
colocation colocOSS3MDS -500: MDSgroup anchorOSS3
colocation colocOSS4MDS -500: MDSgroup anchorOSS4
colocation colocOSS2OSS1 -500: anchorOSS1 anchorOSS2
colocation colocOSS4OSS1 -500: anchorOSS1 anchorOSS4
colocation colocOSS1OSS2 -500: anchorOSS2 anchorOSS1
colocation colocOSS3OSS2 -500: anchorOSS2 anchorOSS3
colocation colocOSS2OSS3 -500: anchorOSS3 anchorOSS2
colocation colocOSS4OSS3 -500: anchorOSS3 anchorOSS4
colocation colocOSS1OSS4 -500: anchorOSS4 anchorOSS1
colocation colocOSS3OSS4 -500: anchorOSS4 anchorOSS3

[...snip...]

One of the issues I am running into is the logs are giving me dependency
loop errors.  Here is a snippet but it does this for all the
anchor/dummy resources and the LVM resource(from MDSgroup)

Aug 08 11:05:56 s02ns070 pengine: [32677]: info: rsc_merge_weights:
anchorOSS1: Breaking dependency loop at MDSgroup
[...snip...]

I think these dependency loops are what's causing the other quirky
behavior I have of resources failing to the wrong server.

I'm not sure where the dependency loop is coming from, but I'm sure it
has something to do with my configuration and score setup.

Any help deciphering this would be greatly appreciated.


You can't have bidirectional colocation, i.e. either specify 
"colocation colocMDSOSS1 -500: anchorOSS1 MDSgroup" or "colocation 
colocOSS1MDS -500: MDSgroup anchorOSS1", but not both.  The dependency 
loop error means Pacemaker is tossing one of these away.  For some more 
detail, check the Resource Constraints chapter of Pacemaker Explained 
(http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/) 
or the mailing list archives (this has come up a few times in recent 
memory).
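To make that concrete, a hedged sketch of the one-directional form (reusing the resource names and scores from the config above):

```
# keep only one direction per pair, e.g.:
colocation colocMDSOSS1 -500: anchorOSS1 MDSgroup
colocation colocOSS2OSS1 -500: anchorOSS1 anchorOSS2
# ...and drop the mirrored definitions (colocOSS1MDS, colocOSS1OSS2, etc.);
# a single negative-score constraint already pushes the pair apart.
```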


HTH,

Tim
--
Tim Serong
Senior Clustering Engineer
SUSE
tser...@suse.com



Re: [Pacemaker] wiping out cluster config

2011-07-07 Thread Tim Serong

On 07/07/11 06:23, Jean-Francois Malouin wrote:

Hi,

I want to wipe out my existing cluster config and start afresh, with a
pristine/empty config without actually starting pacemaker -- cluster
is down right now. Is it enough to just remove files in
/var/lib/heartbeat/crm and /var/lib/pengine ?


That always worked for me.  Just make sure you do it on all nodes before 
you start any of them.  And if you break it, you get to keep both pieces :)


Regards,

Tim



This is on Debian/Squeeze with pacemaker 1.0.9.1.

thanks!
jf





--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



Re: [Pacemaker] Pacemaker+Corosync from OBS

2011-06-23 Thread Tim Serong

On 22/06/11 22:14, Ciro Iriarte wrote:

2011/6/21 Tim Serong tser...@novell.com:

On 22/06/11 08:57, Ciro Iriarte wrote:


Hi, I'm trying pacemaker from OBS and I don't see any init script for
corosync or pacemaker, am I overlooking something obvious?

Name: pacemakerRelocations: (not relocatable)
Version : 1.1.5 Vendor: openSUSE Build
Service
Release : 1.1   Build Date: Thu Apr 14
04:25:55 2011

Name: corosync Relocations: (not relocatable)
Version : 1.3.0 Vendor: openSUSE Build
Service
Release : 1.1   Build Date: Thu Apr 14
04:08:04 2011

Regards,



Install openais as well - it includes /etc/init.d/openais which starts
corosync.

Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.


Thanks, I thought corosync replaced openais... I was expecting a
corosync init script :)


Understandable :)  The openais init script is a holdover from prior to 
the corosync/openais split.  Keeping it made upgrading systems from 
openais 0.8.x to corosync 1.x + openais 1.x a bit nicer, but we should 
probably do something about an actual corosync init script.



Also, I've read that it's better to start corosync and pacemaker
independently (service ver: 1); that's not currently possible
with the OBS build then, am I right?


Correct, not yet possible (although, FWIW, AFAIK, the problems people 
experienced with service ver: 0 generally didn't manifest on SUSE).
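For reference, the "ver" mode under discussion is selected in the Pacemaker service stanza of corosync.conf -- a sketch of the standard fragment (everything else in corosync.conf is unchanged between the two modes):

```
service {
    # load the Pacemaker plugin
    name: pacemaker
    # 0 = corosync spawns Pacemaker itself;
    # 1 = pacemakerd is started separately (MCP mode)
    ver: 1
}
```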


I believe adding support for service ver: 1 (MCP) is mostly a matter of 
tweaking the spec file to include the init script and a couple of other 
things, then test it.  See lines 196-198 at:


https://build.opensuse.org/package/view_file?file=pacemaker.spec&package=pacemaker&project=network%3Aha-clustering&srcmd5=a2aa81b9e6b8f3e4fcd7a5bbb6b25e8a

Patches (or, given it's OBS, submitreqs) gladly accepted :)

Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



Re: [Pacemaker] Pacemaker+Corosync from OBS

2011-06-21 Thread Tim Serong

On 22/06/11 08:57, Ciro Iriarte wrote:

Hi, I'm trying pacemaker from OBS and I don't see any init script for
corosync or pacemaker, am I overlooking something obvious?

Name: pacemakerRelocations: (not relocatable)
Version : 1.1.5 Vendor: openSUSE Build Service
Release : 1.1   Build Date: Thu Apr 14 04:25:55 2011

Name: corosync Relocations: (not relocatable)
Version : 1.3.0 Vendor: openSUSE Build Service
Release : 1.1   Build Date: Thu Apr 14 04:08:04 2011

Regards,



Install openais as well - it includes /etc/init.d/openais which starts 
corosync.


Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



Re: [Pacemaker] Permission denied using HAWK

2011-06-19 Thread Tim Serong

On 18/06/11 22:02, Michael Schwartzkopff wrote:

Hi,

Creating a resource in HAWK works like a charm. Very nice. Now I want to start
or stop the resource and the pop-up window tells me:

Error: Permission Denied

Any idea what might be wrong?

System:
- OpenSUSE 11.4
- pacemaker 1.1.5
- hawk 0.4.1 from OBS

Editing the resource works.


Rails 2.3.11 introduced a fix for a cross site request forgery exploit, 
which broke Hawk's start/stop/etc. functionality on the status screen. 
I've just updated Hawk on OBS to work correctly in this case.  Please 
try upgrading to the latest version (hawk-0.4.1-2.1.$ARCH.rpm).


Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.1

2011-05-20 Thread Tim Serong

On 19/05/11 00:43, Tim Serong wrote:

Hi Everybody,

This is to announce version 0.4.1 of Hawk, a web-based GUI for managing
and monitoring Pacemaker High-Availability clusters.

[...]

Building an RPM for Fedora/Red Hat is still just as easy as last time:

# hg clone http://hg.clusterlabs.org/pacemaker/hawk
# cd hawk
# hg update hawk-0.4.1
# make rpm


*ahem*

It /would/ still be just as easy if I had said "hg update tip", or, in 
this specific instance, "hg update 398ae27386e" (the Makefile grabs the 
last tag from hg to use as a version number, which is one commit *after* 
the actual tagged commit).


Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.1

2011-05-20 Thread Tim Serong

On 19/05/11 20:49, Lars Marowsky-Bree wrote:

On 2011-05-18T18:54:27, Daugherity, Andrew W adaugher...@tamu.edu  wrote:


This is to announce version 0.4.1 of Hawk, a web-based GUI for managing
and monitoring Pacemaker High-Availability clusters.
...
As before, packages for various SUSE-based distros can be obtained from
the network:ha-clustering and network:ha-clustering:Factory repos on
OBS, or you can just search for Hawk on software.opensuse.org:

   http://software.opensuse.org/search?baseproject=ALL&q=Hawk


Are there any plans to push this out to the SLE11 HAE SP1 update channel?


Yes. But that may take a bit longer. Tim's announcing the open
source/upstream/community release here ;-)


I guess I could always just grab the hawk RPM from the OBS repo and
upgrade hawk... I'd rather not add the repo and risk mixing
corosync/pacemaker/etc. packages between repos on a production
cluster.


That's understandable, and I'd not advise that you do that. Perhaps Tim
can investigate publishing hawk packages that are build against the
latest maintenance updates for SLE HA.


I'll see what I can do.  In the meantime, the hawk RPM from OBS does 
install and run on top of SLE HA (i.e. works for me), but obviously any 
RPMs that aren't in the official update channel aren't officially 
supported by SUSE.  All us clustering types being paranoid by nature, if 
in doubt, I'd suggest trying it out on a test cluster first :)



Also, does anyone know why hawk and crm_gui are case-sensitive for usernames when nothing else is?  
(Yes, I know mixed-case usernames are bad -- I didn't set up the central auth.)  Everything else 
using LDAP auth (e.g. pam_ldap, apache mod_authnz_ldap, LDAP plugins to various CMSes/Wikis/issue 
trackers, etc.) is fine with both adaugherity and ADaugherity but 
hawk/crm_gui require the mixed-case version.


They go via the PAM backends too, so this is surprising ... Thanks for
pointing this out.


Noted.  I'm not sure what's going on there yet...

Regards,

Tim
--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



[Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.1

2011-05-18 Thread Tim Serong

Hi Everybody,

This is to announce version 0.4.1 of Hawk, a web-based GUI for managing 
and monitoring Pacemaker High-Availability clusters.


You can use Hawk to:

  - Monitor your cluster.
  - Perform basic operator tasks (start/stop/migrate etc).
  - Create, edit and delete resources.
  - Edit crm_config properties.
  - Create, edit and delete location, colocation, and ordering
constraints (new in 0.4.1)

The constraint editor is accessible from the popup menu on the resources 
panel on the main status screen.  Ordering and colocation constraint 
chains are drawn with arrows between resources indicating dependencies, 
much as you see in the constraints chapter of Pacemaker Explained[1]. 
That is to say, to start A then B, you have an order constraint:

  [A] -> [B]

...and to colocate B with A, you have a colocation constraint:

  [B] -> [A]
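In crm-shell syntax those two example constraints would read something like (hypothetical constraint IDs):

```
order o_A_before_B inf: A B        # start A, then B
colocation c_B_with_A inf: B A     # place B on the node where A runs
```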

Location constraints can be edited in simple form (just a resource, node 
and a score), or with a rule editor (if you need to specify roles or 
complex expressions).  Note that date expressions and some explanatory 
text are still to come here.  Any questions in the meantime, feel free 
to ask (I am particularly interested in feedback from people with large 
and/or complex sets of constraints).


As before, packages for various SUSE-based distros can be obtained from 
the network:ha-clustering and network:ha-clustering:Factory repos on 
OBS, or you can just search for Hawk on software.opensuse.org:


  http://software.opensuse.org/search?baseproject=ALL&q=Hawk

Building an RPM for Fedora/Red Hat is still just as easy as last time:

  # hg clone http://hg.clusterlabs.org/pacemaker/hawk
  # cd hawk
  # hg update hawk-0.4.1
  # make rpm

(My apologies continue for all the non-RPM-based distro users.)

Further information is available at:

  http://www.clusterlabs.org/wiki/Hawk

Please direct comments, feedback, questions, etc. to myself and/or 
(preferably) the Pacemaker mailing list.


Happy clustering,

Tim

[1] 
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch-constraints.html



--
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



Re: [Pacemaker] Failover when storage fails

2011-05-14 Thread Tim Serong

On 13/05/11 18:54, Max Williams wrote:

Well this is not what I am seeing here. Perhaps a bug?
I also tried adding op stop interval=0 timeout=10 to the LVM
resources but still when the storage disappears the cluster just
stops where it is and those log entries (below) just get printed
in a loop.
Cheers,
Max


OK, that's just weird (unless I'm missing something - anyone else seen 
this?).  Do you mind sending me an hb_report tarball (offlist)?  I'd 
suggest starting everything up cleanly, knocking the storage over, 
waiting a few minutes, then getting the hb_report for that entire time 
period.


Regards,

Tim



-Original Message-
From: Tim Serong [mailto:tser...@novell.com]
Sent: 13 May 2011 04:22
To: The Pacemaker cluster resource manager (pacemaker@oss.clusterlabs.org)
Subject: Re: [Pacemaker] Failover when storage fails

On 5/12/2011 at 02:28 AM, Max Williams max.willi...@betfair.com  wrote:

After further testing even with stonith enabled the cluster still gets
stuck in this state, presumably waiting on IO. I can get around it by
setting on-fail=fence on the LVM resources but shouldn't Pacemaker
be smart enough to realise the host is effectively offline?


If you've got STONITH enabled, nodes should just get fenced when this occurs, 
without your having to specify on-fail=fence for the monitor op.
What *should* happen is, the monitor fails or times out, then pacemaker will 
try to stop the resource.  If the stop also fails or times out, the node will 
be fenced.  See:

http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-operations.html
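As a hedged illustration of that escalation path (hypothetical resource; with stonith-enabled=true no explicit on-fail=fence is needed, since a failed or timed-out stop escalates to fencing by default):

```
primitive p_example ocf:heartbeat:Dummy \
        op monitor interval=10 timeout=20 \
        op stop interval=0 timeout=60
# monitor fails -> stop is attempted; stop fails/times out -> node is fenced
```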

Also, http://ourobengr.com/ha#causes is relevant here.

Regards,

Tim


Or am I missing some timeout
value that would fix this situation?

pacemaker-1.1.2-7.el6.x86_64
corosync-1.2.3-21.el6.x86_64
RHEL 6.0

Config:

node host001.domain \
 attributes standby=off
node host002.domain \
 attributes standby=off
primitive MyApp_IP ocf:heartbeat:IPaddr \
 params ip=192.168.104.26 \
 op monitor interval=10s
primitive MyApp_fs_graph ocf:heartbeat:Filesystem \
 params device=/dev/VolGroupB00/AppLV2 directory=/naab1 fstype=ext4 \
 op monitor interval=10 timeout=10
primitive MyApp_fs_landing ocf:heartbeat:Filesystem \
 params device=/dev/VolGroupB01/AppLV1 directory=/naab2 fstype=ext4 \
 op monitor interval=10 timeout=10
primitive MyApp_lvm_graph ocf:heartbeat:LVM \
 params volgrpname=VolGroupB00 exclusive=yes \
 op monitor interval=10 timeout=10 on-fail=fence depth=0
primitive MyApp_lvm_landing ocf:heartbeat:LVM \
 params volgrpname=VolGroupB01 exclusive=yes \
 op monitor interval=10 timeout=10 on-fail=fence depth=0
primitive MyApp_scsi_reservation ocf:heartbeat:sg_persist \
 params sg_persist_resource=scsi_reservation0 devs="/dev/dm-6 /dev/dm-7" \
 required_devs_nof=2 reservation_type=1
primitive MyApp_init_script lsb:AppInitScript \
 op monitor interval=10 timeout=10
primitive fence_host001.domain stonith:fence_ipmilan \
 params ipaddr=192.168.16.148 passwd=password login=root \
 pcmk_host_list=host001.domain pcmk_host_check=static-list \
 meta target-role=Started
primitive fence_host002.domain stonith:fence_ipmilan \
 params ipaddr=192.168.16.149 passwd=password login=root \
 pcmk_host_list=host002.domain pcmk_host_check=static-list \
 meta target-role=Started
group MyApp_group MyApp_lvm_graph MyApp_lvm_landing MyApp_fs_graph \
 MyApp_fs_landing MyApp_IP MyApp_init_script \
 meta target-role=Started migration-threshold=2 on-fail=restart failure-timeout=300s
ms ms_MyApp_scsi_reservation MyApp_scsi_reservation \
 meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
colocation MyApp_group_on_scsi_reservation inf: MyApp_group ms_MyApp_scsi_reservation:Master
order MyApp_group_after_scsi_reservation inf: ms_MyApp_scsi_reservation:promote MyApp_group:start
property $id=cib-bootstrap-options \
 dc-version=1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 no-quorum-policy=ignore \
 stonith-enabled=true \
 last-lrm-refresh=1305129673
rsc_defaults $id=rsc-options \
 resource-stickiness=1





From: Max Williams [mailto:max.willi...@betfair.com]
Sent: 11 May 2011 13:55
To: The Pacemaker cluster resource manager
(pacemaker@oss.clusterlabs.org)
Subject: [Pacemaker] Failover when storage fails

Hi,
I want to configure pacemaker to failover a group of resources and
sg_persist (master/slave) when there is a problem with the storage but
when I cause the iSCSI LUN to disappear simulating a failure, the
cluster always gets stuck in this state:

Last updated: Wed May 11 10:52:43 2011
Stack: openais
Current DC: host001.domain - partition with quorum
Version: 1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe
2 Nodes configured, 2 expected votes
4 Resources configured

Re: [Pacemaker] addendum: problems with node membership

2011-05-11 Thread Tim Serong
On 5/11/2011 at 10:00 PM, Thomas thomascasp...@t-online.de wrote: 
 p.s. 1 cluster nodes failed to respond to the join offer can be found in my 
 corosync log. Google was of no use with that message, I haven't found a  
 solution 
 yet. Cannot be that difficult I think, I just need the freshly installed 
 condition of pacemaker without reinstalling the complete package, because a 
 fresh node joins without problems...how can this be done? 

I'd suggest double-checking the corosync config and network settings
(IP addresses and preferably disable any firewalls) on all nodes.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] [PATCH]Bug 2567 - crm resource migrate should support an optional role parameter

2011-05-10 Thread Tim Serong
On 5/10/2011 at 08:22 PM, Holger Teutsch holger.teut...@web.de wrote: 
 On Tue, 2011-05-10 at 08:24 +0200, Andrew Beekhof wrote: 
  On Mon, May 9, 2011 at 8:44 PM, Holger Teutsch holger.teut...@web.de 
  wrote: 
   On Wed, 2011-04-27 at 13:25 +0200, Andrew Beekhof wrote: 
   On Sun, Apr 24, 2011 at 4:31 PM, Holger Teutsch holger.teut...@web.de  
 wrote: 
   ... 
Remaining diffs seem to be not related to my changes. 
   
   Unlikely I'm afraid.  We run the regression tests after every commit 
   and complain loudly if they fail. 
   What is the regression test output? 
   
   That's the output of tools/regression.sh of pacemaker-devel *without* my 
   patches: 
   Version: parent: 10731:bf7b957f4cbe tip 
   
   see attachment 
   
  There seems to be something not quite right with your environment. 
  Had you built the tools directory before running the test? 
 Yes, + install 
  
  In a clean chroot it passes onboth opensuse and fedora: 
   
 http://build.clusterlabs.org:8010/builders/opensuse-11.3-i386-devel/builds/ 
 48/steps/cli_test/logs/stdio 
  and 
   
 http://build.clusterlabs.org:8010/builders/fedora-13-x86_64-devel/builds/48 
 /steps/cli_test/logs/stdio 
   
  What distro are you on? 
   
 Opensuse 11.4 

Works for me on openSUSE 11.4 with a clean checkout of devel tip, so
presumably isn't endemic (not that this really helps you, sorry, but
I had to test).

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml

2011-05-04 Thread Tim Serong
On 5/4/2011 at 08:49 PM, Andrew Beekhof and...@beekhof.net wrote: 
 Tick tock.  I'm going to push this soon unless someone raises an objection  
 RSN. 

This is going into 1.1, right?

Do existing CIBs automagically get updated to this syntax, or does the
admin have to force this?  (Sorry, I forget if that was covered already).

Thanks,

Tim

  
 On Fri, Apr 15, 2011 at 4:55 PM, Andrew Beekhof and...@beekhof.net wrote: 
  On Fri, Apr 15, 2011 at 3:00 PM, Lars Marowsky-Bree l...@novell.com 
  wrote: 
  On 2011-04-13T08:37:12, Andrew Beekhof and...@beekhof.net wrote: 
  
   Before: 
   
  <rsc_colocation id="coloc-set" score="INFINITY">
    <resource_set id="coloc-set-0">
      <resource_ref id="dummy2"/>
      <resource_ref id="dummy3"/>
    </resource_set>
    <resource_set id="coloc-set-1" sequential="false" role="Master">
      <resource_ref id="dummy0"/>
      <resource_ref id="dummy1"/>
    </resource_set>
  </rsc_colocation>
  <rsc_order id="order-set" score="INFINITY">
    <resource_set id="order-set-0" role="Master">
      <resource_ref id="dummy0"/>
      <resource_ref id="dummy1"/>
    </resource_set>
    <resource_set id="order-set-1" sequential="false">
      <resource_ref id="dummy2"/>
      <resource_ref id="dummy3"/>
    </resource_set>
  </rsc_order>
   
   
   
   After: 
  
  So I am understanding this properly - we're getting rid of the 
  sequential attribute, yes? 
  
  Absolutely. 
  
  If so, three cheers. ;-) 
  
  Can you share the proposed schema and XSLT, if you already have some? 
  
  Attached 
  
  
  <rsc_colocation id="coloc-set" score="INFINITY">
    <colocation_set id="coloc-set-1" internal-colocation="0">
      <resource_ref id="dummy0" role="Master"/>
      <resource_ref id="dummy1" role="Master"/>
    </colocation_set>
    <colocation_set id="coloc-set-0" internal-colocation="INFINITY">
      <resource_ref id="dummy2"/>
      <resource_ref id="dummy3"/>
    </colocation_set>
  </rsc_colocation>
  <rsc_order id="order-set" kind="Mandatory">
    <ordering_set id="order-set-0" internal-ordering="Mandatory">
  
  So we have (score|kind) on the outside, and 
  internal-(ordering|colocation) on the inner elements. Is there a 
  particular reason to use a different name on the inner ones? 
  
  The language didn't feel right tbh - the inner ones felt like they 
  needed more context/clarification. 
  We can change the outer name too if you like. 
  
  Also, rsc_order has either score or kind; are you doing away with that 
  here? 
  
  Yes. Standardizing on kind.  Score never made sense for ordering :-( 
  
  
  lifetime would only apply to the entire element, right? 
  
  right 
  
  And, just to be fully annoying - is there a real benefit to having 
  ordering_set and colocation_set? 
  
  Very much so.  Because kind makes no sense for a colocation - and 
  vice-versa for score. 
  Using different element names means the rng can be stricter. 
  
  
  
    <ordering_set id="order-set-1" internal-ordering="Optional">
      <resource_ref id="dummy2"/>
  
  While we're messing with sets anyway, I'd like to re-hash the idea I 
  brought up on pcmk-devel. To make configuration more compact, I'd like 
  to have automatic sets - i.e., the set of all resources that match 
  something. 
  
  Imagine: 
  
  <resource_list type="Xen" provider="heartbeat" class="ocf" />
  
  and suddenly all your Xen guests are correctly ordered and collocated. 
  The savings in administrative complexity and CIB size are huge. 
  
  Or would you rather do this via templates? 
  
  You mean something like? 
  
  <ordering_set id="order-set-0" internal-ordering="Mandatory">
    <resource_pattern type="" provider="" />
  </ordering_set>
  
  Might make sense.  But doesn't strictly need to be bundled with this  
 change. 
  I'd be wary about allowing pattern matching on the name, I can imagine 
  resources ending up in multiple sets (loops!) very easily. 
  
  
  




-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] Multi-site support in pacemaker (tokens, deadman, CTR)

2011-04-28 Thread Tim Serong
On 4/28/2011 at 11:06 PM, Florian Haas florian.h...@linbit.com wrote: 
 On 2011-04-27 20:55, Lars Marowsky-Bree wrote: 
  On 2011-04-26T23:34:16, Yan Gao y...@novell.com wrote: 
  And the cibs between different sites would still be synchronized? 
   
  The idea is that there would be - perhaps as part of the CTR daemon - a 
  process that would replicate (manually triggered, periodically, or 
  automatically) the configuration details of resources associated with a 
  given ticket (which are easily determined since they depend on it) to 
  the other sites that are eligible for the ticket. 
   
  Initially, I'd be quite happy if there was a "replicate now" button to 
  push or script to call - admins may actually have good reasons not to 
  immediately replicate everywhere, anyway. 
   
  It's conceivable that there would need to be some mangling as 
  configuration is replicated; e.g., path names and IP addresses may be 
  different. We _could_ express this using our CIB syntax already 
  (instance attribute sets take rules, and it'd probably be easy enough 
  to extend this matching to select on ticket ownership), and perhaps that 
  is good enough, since I'd imagine there would actually be quite little 
  to modify. 
   
  (Having many differences would make the configuration very complex to 
  manage and understand; hence, we want a syntax that makes it easy to 
  have a few different values, and annoying to have many ;-) 
  
 As I understood it we had essentially reached consensus in Boston that 
 CIB replication would best be achieved by a pair of complementary 
 resource agents. I don't think we had a name then, but I'll call them 
 Publisher and Subscriber for the purposes of this discussion. 
  
 The idea would be that Publisher exposes the <configuration/> section of 
 the CIB via a network daemon, preferably one that uses encryption. 
 Suppose this is something like lighttpd with SSL/TLS support. This would 
 be a simple primitive running exactly once in the Pacemaker cluster, and 
 only if that cluster holds the ticket. 

Hawk just about does that (exposes bits of the CIB via HTTPS), although
admittedly it'd be overkill for just exposing the configuration section
for machine processing.

A stunningly trivial implementation is, simply, in lighttpd.conf:

  cgi.assign = ( "/config" => "" )

Then, create a shell script called config in lighttpd's document root
directory, containing:

  #!/bin/sh
  
  echo "Content-type: text/xml"
  echo
  /usr/sbin/cibadmin -Q --scope constraints

Not so much with the security, but it works...

  
 Subscriber, by contrast, subscribes to this stream and will usually 
 mangle configuration in some shape or form, preferably configurable 
 through an RA parameter. What was discussed in Boston is that in an 
 initial step, Subscriber could simply take an XSLT script, apply it to 
 the CIB stream with xsltproc, and then update its local CIB with the result. 
  
 Subscriber would be the only resource (besides STONITH resources and 
 Slaves of master/slave sets) that can be active in a cluster that does 
 not hold the ticket. 
  
 Comments? 
  
 Cheers, 
 Florian 
  



-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.4.0

2011-04-23 Thread Tim Serong
On 4/22/2011 at 10:14 PM, Nikita Michalko michalko.sys...@a-i-p.com wrote: 
 Am Dienstag 19 April 2011 12:59:35 schrieb Tim Serong: 
  Greetings All, 
   
  This is to announce version 0.4.0 of Hawk, a web-based GUI for 
  managing and monitoring Pacemaker High-Availability clusters. 
   
  You can use Hawk 0.4.0 to: 
   
- Monitor your cluster, with much the same functionality as 
  crm_mon (displays node and resource status, failed ops). 
   
- Perform basic operator tasks: 
  - Node: standby, online, fence 
  - Resource: start, stop, migrate, unmigrate, clean up. 
   
- Create, edit and delete primitives, groups, clones, m/s 
  resources. 
   
- Edit crm_config properties. 
   
  Hawk is intended to run on each node in your cluster, and is 
  accessible via HTTPS on port 7630.  You can then access it by 
  pointing your web browser at the IP address of any cluster node, 
  or the address of any IPaddr(2) resource you may have configured. 
   
  You will need to configure a user account to log in as.  The 
  same rules apply as for the python GUI; you need to log in as 
  a user in the haclient group. 
   
  Packages for various SUSE-based distros can be obtained from the 
  network:ha-clustering and network:ha-clustering:Factory repos 
  on OBS, or you can just search for Hawk on software.opensuse.org: 
   
http://software.opensuse.org/search?baseproject=ALL&q=Hawk 
  
  
 - just tried to download HAWK, but I don't know the password required by YaST 
  
  
 - which one? 

Did you download an RPM, or click the 1-Click Install link?
If the latter, it'll try to just install it on the system you're
downloading on, in which case YaST is asking for your root password
in order to install it.  This may or may not be what you want.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] operative tasks for a pacemaker cluster

2011-04-13 Thread Tim Serong
On 4/13/2011 at 02:04 AM, mark - pacemaker list m+pacema...@nerdish.us wrote:
 Hello, 
  
 On Mon, Apr 11, 2011 at 11:11 AM, Andrew Beekhof and...@beekhof.net wrote: 
  On Mon, Apr 11, 2011 at 2:48 PM, Klaus Darilion 
  klaus.mailingli...@pernau.at wrote: 
  
  Recently I got hit by running out of inodes due to too many files in 
  /var/lib/pengine. 
  
  man pengine 
  
  look for -series-max 
  
 There is no pengine man page in the packages (pacemaker, heartbeat, or 
 corosync) from the EPEL repo, nor online with the other online 
 manpages at clusterlabs.  Am I missing it someplace?  I want to read 
 about this as I have just under 7000 files in /var/lib/pengine on a 
 node that has 7 days of uptime.  Will this grow unchecked, or do older 
 files eventually get cleaned up? 

Not sure what's up with the EPEL packaging, sorry.  The relevant bit of
that manpage is:

   pe-error-series-max = integer [-1]
   The number of PE inputs resulting in ERRORs to save

   Zero to disable, -1 to store unlimited.

   pe-warn-series-max = integer [-1]
   The number of PE inputs resulting in WARNINGs to save

   Zero to disable, -1 to store unlimited.

   pe-input-series-max = integer [-1]
   The number of other PE inputs to save

   Zero to disable, -1 to store unlimited.

So, yeah, by default unless you specifically limit it, it'll just keep
saving 'em.  They're invaluable for debugging failures, BTW.

Were those 7000 pe-inputs all created over that 7 day period?  Because
that's a transition every 1.44 minutes.  Is it just me, or does that
sound like a rather busy cluster?

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml

2011-04-11 Thread Tim Serong
On 3/21/2011 at 08:20 PM, Andrew Beekhof and...@beekhof.net wrote: 
  
 Small improvement to: 
 + The only thing that matters is that in order for any member of a set 
 to be active, all the members of the previous set must also be active 
 (and naturally on the same node). When a set has 
 <literal>sequential="true"</literal>, then in order for any member to 
 be active, the previous members must also be active. 
  
 + The only thing that matters is that in order for any member of a set 
 to be active, all the members of the previous set<footnote><para>as 
 determined by the display order in the configuration</para></footnote> 
 must also be active (and naturally on the same node). 
 + When a set has <literal>sequential="true"</literal>, then in order 
 for any member to be active, the previous members must also be active. 

This isn't quite correct.

For members within a set (sequential=true), it is true that for a given
member to be active, the previous members must also be active.

Between sets however, it's the other way around - a given set depends on
the subsequent set.

The example colocation chain in Pacemaker Explained right now should thus
be changed as follows in order to match the diagram:

  <constraints>
    <rsc_colocation id="coloc-1" score="INFINITY">
      <resource_set id="collocated-set-1" sequential="true">
-       <resource_ref id="A"/>
-       <resource_ref id="B"/>
+       <resource_ref id="F"/>
+       <resource_ref id="G"/>
      </resource_set>
      <resource_set id="collocated-set-2" sequential="false">
        <resource_ref id="C"/>
        <resource_ref id="D"/>
        <resource_ref id="E"/>
      </resource_set>
      <resource_set id="collocated-set-2" sequential="true" role="Master">
-       <resource_ref id="F"/>
-       <resource_ref id="G"/>
+       <resource_ref id="A"/>
+       <resource_ref id="B"/>
      </resource_set>
    </rsc_colocation>
  </constraints>

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml

2011-04-11 Thread Tim Serong
On 4/11/2011 at 09:37 PM, Andrew Beekhof and...@beekhof.net wrote: 
 On Mon, Apr 11, 2011 at 12:57 PM, Tim Serong tser...@novell.com wrote: 
  On 3/21/2011 at 08:20 PM, Andrew Beekhof and...@beekhof.net wrote: 
  
  Small improvement to: 
  + The only thing that matters is that in order for any member of a set 
  to be active, all the members of the previous set must also be active 
  (and naturally on the same node). When a set has 
  <literal>sequential="true"</literal>, then in order for any member to 
  be active, the previous members must also be active. 
  
  + The only thing that matters is that in order for any member of a set 
  to be active, all the members of the previous set<footnote><para>as 
  determined by the display order in the configuration</para></footnote> 
  must also be active (and naturally on the same node). 
  + When a set has <literal>sequential="true"</literal>, then in order 
  for any member to be active, the previous members must also be active. 
  
  This isn't quite correct. 
  
  For members within a set (sequential=true), it is true that for a given 
  member to be active, the previous members must also be active. 
  
  Between sets however, it's the other way around - a given set depends on 
  the subsequent set. 
  
 Did I really write it like that? You tested it?

Yup.  Well, I tested it (pcmk 1.1.5), so I assume you wrote it like that :)

We want (pardon the ASCII art):

           /--- C ---\
  G -- F --+--- D ---+-- B -- A
           \--- E ---/

Test is:

  # crm configure colocation c inf: F G ( C D E ) A B
  # crm resource stop F
 (stops F and G)
  # crm resource start F
  # crm resource stop D
 (stops D, F and G)
  # crm resource start D
  # crm resource stop B
 (stops everything except A)

That shell colocation constraint maps exactly to the (new) XML shown below
(verified just in case it turned out to be a shell oddity).

 If so, thats just retarded and needs an overhaul. 

It is a little... confusing.

Regards,

Tim

  
  The example colocation chain in Pacemaker Explained right now should thus 
  be changed as follows in order to match the diagram: 
  
   <constraints> 
     <rsc_colocation id="coloc-1" score="INFINITY"> 
       <resource_set id="collocated-set-1" sequential="true"> 
 -       <resource_ref id="A"/> 
 -       <resource_ref id="B"/> 
 +       <resource_ref id="F"/> 
 +       <resource_ref id="G"/> 
       </resource_set> 
       <resource_set id="collocated-set-2" sequential="false"> 
         <resource_ref id="C"/> 
         <resource_ref id="D"/> 
         <resource_ref id="E"/> 
       </resource_set> 
       <resource_set id="collocated-set-2" sequential="true" role="Master"> 
 -       <resource_ref id="F"/> 
 -       <resource_ref id="G"/> 
 +       <resource_ref id="A"/> 
 +       <resource_ref id="B"/> 
       </resource_set> 
     </rsc_colocation> 
   </constraints> 
  
  Regards, 
  
  Tim 
  
  
  -- 
  Tim Serong tser...@novell.com 
  Senior Clustering Engineer, OPS Engineering, Novell Inc. 
  
  
  
  
  
  




-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.





Re: [Pacemaker] [pacemaker][patch 3/4] Simple changes for Pacemaker Explained, Chapter 6 CH_Constraints.xml

2011-04-11 Thread Tim Serong
  On 4/11/2011 at 10:23 PM, Andrew Beekhof and...@beekhof.net wrote: 
 On Mon, Apr 11, 2011 at 2:18 PM, Tim Serong tser...@novell.com wrote: 
  On 4/11/2011 at 09:37 PM, Andrew Beekhof and...@beekhof.net wrote: 
  On Mon, Apr 11, 2011 at 12:57 PM, Tim Serong tser...@novell.com wrote: 
   On 3/21/2011 at 08:20 PM, Andrew Beekhof and...@beekhof.net wrote: 
   
   Small improvement to: 
   + The only thing that matters is that in order for any member of a 
   set 
   to be active, all the members of the previous set must also be active 
   (and naturally on the same node). When a set has 
   <literal>sequential="true"</literal>, then in order for any member to 
   be active, the previous members must also be active. 
    
   + The only thing that matters is that in order for any member of a 
   set 
   to be active, all the members of the previous set<footnote><para>as 
   determined by the display order in the configuration</para></footnote> 
   must also be active (and naturally on the same node). 
   + When a set has <literal>sequential="true"</literal>, then in order 
   for any member to be active, the previous members must also be active. 
   
   This isn't quite correct. 
   
   For members within a set (sequential=true), it is true that for a given 
   member to be active, the previous members must also be active. 
   
   Between sets however, it's the other way around - a given set depends on 
   the subsequent set. 
  
  Did I really write it like that? You tested it? 
  
  Yup.  Well, I tested it (pcmk 1.1.5), so I assume you wrote it like that :) 
  
  We want (pardon the ASCII art): 
  
             /--- C ---\ 
    G -- F --+--- D ---+-- B -- A 
             \--- E ---/ 
  
  Test is: 
  
   # crm configure colocation c inf: F G ( C D E ) A B 
  
 Given the well discussed issues with the shell syntax, I'd prefer to 
 see the raw xml actually. 

<constraints>
  <rsc_colocation id="c" score="INFINITY">
    <resource_set id="c-0">
      <resource_ref id="F"/>
      <resource_ref id="G"/>
    </resource_set>
    <resource_set id="c-1" sequential="false">
      <resource_ref id="C"/>
      <resource_ref id="D"/>
      <resource_ref id="E"/>
    </resource_set>
    <resource_set id="c-2">
      <resource_ref id="A"/>
      <resource_ref id="B"/>
    </resource_set>
  </rsc_colocation>
</constraints>

   # crm resource stop F 
  (stops F and G) 
   # crm resource start F 
   # crm resource stop D 
  (stops D, F and G) 
   # crm resource start D 
   # crm resource stop B 
  (stops everything except A) 
  
  That shell colocation constraint maps exactly to the (new) XML shown below 
  (verified just in case it turned out to be a shell oddity). 
  
  If so, thats just retarded and needs an overhaul. 
  
  It is a little... confusing. 
  
  Regards, 
  
  Tim 
  
   
   The example colocation chain in Pacemaker Explained right now should 
   thus 
   be changed as follows in order to match the diagram: 
   
 <constraints> 
   <rsc_colocation id="coloc-1" score="INFINITY"> 
     <resource_set id="collocated-set-1" sequential="true"> 
-      <resource_ref id="A"/> 
-      <resource_ref id="B"/> 
+      <resource_ref id="F"/> 
+      <resource_ref id="G"/> 
     </resource_set> 
     <resource_set id="collocated-set-2" sequential="false"> 
       <resource_ref id="C"/> 
       <resource_ref id="D"/> 
       <resource_ref id="E"/> 
     </resource_set> 
     <resource_set id="collocated-set-2" sequential="true" role="Master"> 
-      <resource_ref id="F"/> 
-      <resource_ref id="G"/> 
+      <resource_ref id="A"/> 
+      <resource_ref id="B"/> 
     </resource_set> 
   </rsc_colocation> 
 </constraints> 
   
   Regards, 
   
   Tim 
   
   
   -- 
   Tim Serong tser...@novell.com 
   Senior Clustering Engineer, OPS Engineering, Novell Inc. 
   
   
   
   
   
  
  
  
  
  
  -- 
  Tim Serong tser...@novell.com 
  Senior Clustering Engineer, OPS Engineering, Novell Inc. 
  
  
  
  
  




-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




Re: [Pacemaker] emulate crm_mon output by xsltproc'essing cibadmin -Ql

2011-03-09 Thread Tim Serong
  On 3/9/2011 at 07:51 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote: 
 On Wed, Mar 09, 2011 at 09:42:49AM +0100, Andrew Beekhof wrote: 
  I had http://hg.clusterlabs.org/pacemaker/1.1/raw-file/tip/xml/crm.xsl 
  doing something similar. 
  Agree its an interesting capability, haven't found much practical use 
  for it yet though. 
   
  Happy to put it in the extras directory though :-) 
  
 Fine with me. 
 Then at least it does not get lost. 
  
 How to figure out from the CIB 
 Pacemaker's idea of the current status (and location) of a resource? 
 Look at the most recent lrm_rsc_op, and its result? 
 
Pretty much.  For all the gory details, read unpack_rsc_op() in
pacemaker/lib/pengine/unpack.c.  But it (more or less) comes down
to:

  - For each node, sort the ops in order of descending call ID.
  - The most recent op and rc on each node (highest call ID) tells
you the state of the resource on that node.
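For illustration, here's a rough Python sketch of that call-ID logic, run
against a hand-written toy status fragment (not real CIB output).  This is
only a crude approximation of unpack_rsc_op(), which handles many more cases
(failures, pending ops, migrations, and so on):

```python
import xml.etree.ElementTree as ET

# Hand-written toy CIB status fragment -- real CIBs carry many more attributes.
STATUS = """
<status>
  <node_state uname="node1">
    <lrm_resource id="resX">
      <lrm_rsc_op operation="start" call-id="10" rc-code="0"/>
      <lrm_rsc_op operation="monitor" call-id="12" rc-code="0"/>
    </lrm_resource>
  </node_state>
  <node_state uname="node2">
    <lrm_resource id="resX">
      <lrm_rsc_op operation="stop" call-id="7" rc-code="0"/>
    </lrm_resource>
  </node_state>
</status>
"""

def latest_ops(status_xml, rsc_id):
    """Return {node: (operation, rc-code)} for the op with the highest
    call-id on each node -- a crude proxy for current resource state."""
    result = {}
    for node in ET.fromstring(status_xml).iter("node_state"):
        for rsc in node.iter("lrm_resource"):
            if rsc.get("id") != rsc_id:
                continue
            # Sort ops in order of descending call ID; the first entry is
            # the most recent op on this node.
            ops = sorted(rsc.iter("lrm_rsc_op"),
                         key=lambda op: int(op.get("call-id")),
                         reverse=True)
            if ops:
                result[node.get("uname")] = (ops[0].get("operation"),
                                             int(ops[0].get("rc-code")))
    return result

print(latest_ops(STATUS, "resX"))
# {'node1': ('monitor', 0), 'node2': ('stop', 0)}
```

So resX last completed a successful monitor on node1 (i.e. it's running there)
and a successful stop on node2.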

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] version confusion

2011-03-02 Thread Tim Serong
On 3/3/2011 at 09:17 AM, Klaus Darilion klaus.mailingli...@pernau.at wrote: 
 Hi! 
  
 I just updated by Debian box to wheezy to test Pacemaker 1.0.10. dpkg 
 reports version 1.0.10 but crm_mon reports version 1.0.9. So, which 
 version is really running? Is really 1.0.9 running or is this due to the 
 previously used 1.0.9 version? 
  
 # dpkg -l|grep pacem 
 ii  pacemaker1.0.10-5HA cluster resource manager 
  
  
 # crm_mon -1 
  
 Last updated: Wed Mar  2 23:14:23 2011 
 Stack: openais 
 Current DC: bulgari - partition WITHOUT quorum 
 Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3 

Note the hash after the version number.  If you search for that hash
at http://hg.clusterlabs.org/pacemaker/stable-1.0/ and poke around
a bit you'll find it's the commit two commits *before* the one that
actually updated the version number to 1.0.10.  So, yes, you do have
version 1.0.10.  Try to think of it as an unfortunate typo :)

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] [Linux-HA] Solved: SLES 11 HAE SP1 Signon to CIB Failed

2011-02-14 Thread Tim Serong
On 2/9/2011 at 09:49 PM, darren.mans...@opengi.co.uk wrote: 
  So I compared the /etc/ais/openais.conf in non-sp1 with  
  /etc/corosync/corosync.conf from sp1 and found this bit missing which  
  could be quite useful... 

  service {  
  # Load the Pacemaker Cluster Resource Manager  
  ver:   0  
  name:  pacemaker  
  use_mgmtd: yes  
  use_logd:  yes 
  } 

  Added it and it works. Doh.  

  It seems the example corosync.conf that is shipped won't start  
  pacemaker, I'm not sure if that's on purpose or not, but I found it a  
  bit confusing after being used to it 'just working' previously. 
  
 Ah.  Understandably confusing.  That got fixed post-SP1, in a 
 maintenance update that went out in September or thereabouts. 
  
 Regards, 
  
 Tim 
  
  
 -- 
 Tim Serong tser...@novell.com 
 Senior Clustering Engineer, OPS Engineering, Novell Inc. 
  
  
 --- 
  
 Thanks Tim. 
  
 Although the media that can be downloaded *now* from Novell downloads 
 still has this issue, so any new clusters will fall foul of it. 
 Generally with a test build you won't perform updates as it burns a 
 licence you would need for the production system. Should the 
 downloadable media have the issue fixed? 

With the disclaimer that I haven't tried this myself lately... :)
On this page:

  http://www.novell.com/products/highavailability/eval.html

It says:

  Please note: Once you login, your evaluation software will
  automatically be registered to you. You will be able to
  immediately access free maintenance patches and updates online
  for a 60-day period following your registration date.

So apparently new users should be able to get the latest maintenance
updates.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Solved: [Linux-HA] SLES 11 HAE SP1 Signon to CIB Failed

2011-02-03 Thread Tim Serong
On 2/3/2011 at 08:47 PM, darren.mans...@opengi.co.uk wrote: 
 On Fri, Jan 28, 2011 at 1:06 PM,  darren.mans...@opengi.co.uk wrote: 
  Hi all, this seems like it should be an easy one to fix, I'll raise a  
  support call with Novell if required. 
  
  
  Base install of SLES 11 32 bit SP1 with HAE SP1 and crm_mon gives  
  'signon to CIB failed'. Same thing with the CRM shell etc. 
  
 Too many open file descriptors? 
 lsof might show something interesting 
  
  
  
 --- 
  
  
 Unfortunately not. 
  
 It seems that corosync doesn't spawn anything else, which is causing 
 this issue: 
  
 [...]
  
 So I compared the /etc/ais/openais.conf in non-sp1 with 
 /etc/corosync/corosync.conf from sp1 and found this bit missing which 
 could be quite useful... 
  
 service { 
 # Load the Pacemaker Cluster Resource Manager 
 ver:   0 
 name:  pacemaker 
 use_mgmtd: yes 
 use_logd:  yes 
 } 
  
 Added it and it works. Doh. 
  
 It seems the example corosync.conf that is shipped won't start 
 pacemaker, I'm not sure if that's on purpose or not, but I found it a 
 bit confusing after being used to it 'just working' previously. 

Ah.  Understandably confusing.  That got fixed post-SP1, in a
maintenance update that went out in September or thereabouts.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] STONITH external/ssh missing on RHEL 5.5 EPEL 5.4 + ClusterLabs Repo RPM Build?

2010-12-20 Thread Tim Serong
On 12/20/2010 at 10:09 PM, Pavlos Parissis pavlos.paris...@gmail.com wrote: 
 On 17 December 2010 20:41, Eliot Gable ega...@broadvox.com wrote: 
  
   I just did an install of Pacemaker on my CentOS 5.5 system using EPEL 5.4 
   and the ClusterLabs repo.  It seems the RPMs do not include the STONITH 
   plugin external/ssh.  Is it in some package that I missed, or is it really 
   not provided?  Is there any way to get it? 
  
  Thanks. 
  
  
 the following line from cluster-glue-fedora.spec could the reason 
  
 %exclude %{_libdir}/stonith/plugins/external/ssh 

That's intentional, see:

  http://hg.linux-ha.org/glue/rev/5ef3f9370458

You really don't want to rely on SSH STONITH in a production environment.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Resources not migrating on node failure?

2010-12-01 Thread Tim Serong
On 12/1/2010 at 05:11 AM, Anton Altaparmakov ai...@cam.ac.uk wrote: 
 Hi, 
  
 I have set up a three node cluster (running Ubuntu 10.04 LTS server with 
 Corosync 1.2.0, Pacemaker 1.0.8, drbd 8.3.7), where one node is only present 
 to provide quorum to the other two nodes in case one node fails, but it 
 itself cannot run any resources.  The other two nodes are running drbd in 
 master/slave to provide replicated storage, then an XFS file system on top 
 of the drbd storage on the master, together with an NFS server on top of the 
 XFS mount, and a service IP address on which the NFS export is shared.  This 
 is all working brilliantly and I can cause the resources to move to the 
 slave node by running "crm_standby -U cerberus -v on", where cerberus is the 
 master node, and everything then migrates to the slave node minotaur. 
 
 My problem is if I pull the power plug on the master node cerberus.  Then 
 nothing happens!  minotaur continues to run as slave and it never takes 
 over.  And I don't get why.  )-: 

Probably because STONITH is disabled.  It can't take over the resources unless
it knows they're stopped, and without a clean shutdown, there's no way to
guarantee they're stopped without STONITH.

 Also, a second question, possibly related to the first problem, is do I need  
 to define monitor actions for each resource or is that done automatically?

No, you need to define them.
 
 If I need to do it specifically, how do I do that now that I have it all up  
 and running without defining monitor actions? 

Run crm configure edit and add whichever monitor ops you need.
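For example (resource name and parameter values here are invented for
illustration, not taken from your configuration), a monitor op on a service
IP primitive looks like this in crm syntax:

```
primitive p_ip ocf:heartbeat:IPaddr2 \
    params ip="192.168.0.10" \
    op monitor interval="10s" timeout="20s"
```

Each resource needs its own monitor op; without one, Pacemaker only learns
about failures when it happens to run some other operation on the resource.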

Have a look at Clusters from Scratch at:

  http://www.clusterlabs.org/wiki/Documentation

HTH,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Extending CTS with other tests

2010-11-30 Thread Tim Serong
On 11/30/2010 at 09:21 PM, Andrew Beekhof and...@beekhof.net wrote: 
 On Thu, Nov 25, 2010 at 1:36 PM, Vit Pelcak vpel...@suse.cz wrote: 
  Hello everyone. 
  
  I ran into problem. 
  
  I cannot format ocfs2 partition with pcmk until primitive o2cb 
  ocf:ocfs2:o2cb is running. Right? 
  
 Probably 

To my intense amazement, you can do this:

  mkfs.ocfs2 --cluster-stack=pcmk --cluster-name=pacemaker /dev/foo

This works when the cluster is not running.  These parameters are
not mentioned anywhere at all in the mkfs.ocfs2 manpage.

*sigh*

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] colocation that doesn't

2010-11-29 Thread Tim Serong
On 11/30/2010 at 10:11 AM, Alan Jones falanclus...@gmail.com wrote: 
 On Thu, Nov 25, 2010 at 6:32 AM, Tim Serong tser...@novell.com wrote: 
  Can you elaborate on why you want this particular behaviour?  Maybe 
  there's some other way to approach the problem? 
  
 I have explained the issue as clearly as I know how.  The problem is 
 fundamental to the design of the policy engine in Pacemaker.  It performs 
 only two passes to resolve constraints, when what is required for general 
 purpose constraint resolution is an iterative model.  These problems have 
 been addressed in the literature for decades. 

What I meant by maybe there's some other way to approach the problem is
maybe there's some other way we can figure out how to get something *like*
the behaviour you desire, given the fact that Pacemaker's colocation
constraints behave the way they do.

If you have:

  primitive resX ocf:pacemaker:Dummy
  primitive resY ocf:pacemaker:Dummy
  location resX-loc resX 1: nodeA.acme.com
  location resY-loc resY 1: nodeB.acme.com
  colocation resX-resY -2: resX resY

And you have -inf constraints coming from an external source, as you
said before, can you change the external source so that it generates
different constraints?

e.g., instead of generating either of:

  location resX-nodeA resX -inf: nodeA.acme.com
  location resY-nodeB resY -inf: nodeB.acme.com

(where only the second one works, because of the dependency inherent
in the colocation constraint) can your external source specify these
constraints only in terms of resY, which is the one that's capable of
dragging resX around the place?  e.g.:

  location resX-nodeA resY inf: nodeA.acme.com
  location resY-nodeB resY -inf: nodeB.acme.com

Or, if that sounds completely deranged, how about this:

On the assumption your external source will only ever inject one
-inf rule, for one resource, why not make it change the colocation
constraint as well?  e.g.: generate either of:

  location resX-nodeA resX -inf: nodeA.acme.com
  colocation resY-resX -2: resY resX
  (and delete resX-resY if present)

  -- or --

  location resY-nodeB resY -inf: nodeB.acme.com
  colocation resX-resY -2: resX resY
  (and delete resY-resX if present)

Are there any more details about your application you can share?

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] colocation that doesn't

2010-11-25 Thread Tim Serong
On 11/25/2010 at 10:33 AM, Alan Jones falanclus...@gmail.com wrote: 
 Instead of: 
  
   colocation resX-resY -2: resX resY 
  
  Try: 
  
   colocation resX-resY -2: resY resX 
  
  
 That works fine, as you describe, for placing resY when resX is 
 limited by the -inf rule; but not the reverse. 
 In my configuration the -inf constraints come from an external source 
 and I wish place resX and resY in a symmetric way. 
 Start with resX and resY which can run on either nodeA or nodeB. 
 Give each a preferred node respectively; a weak preference. 
 Now request that, if possible, they should run on different nodes; 
 potentially overriding the weak node preference. 
 Now add external constraints that prohibit one or other from running 
 on one or the other node. 
 For example, if any one of the resources is prevented from running on 
 its preferred node, it should run on the non-preferred node and push 
 the other resource onto its non-preferred node. 
 I have not figured out how to express this in pacemaker. 

Ah, OK.  I'm not seeing it either.

Can you elaborate on why you want this particular behaviour?  Maybe
there's some other way to approach the problem?  (Or maybe someone else
can think of a way to express this...)

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] colocation that doesn't

2010-11-23 Thread Tim Serong
On 11/24/2010 at 12:32 PM, Alan Jones falanclus...@gmail.com wrote: 
 On Sat, Nov 20, 2010 at 1:05 AM, Andrew Beekhof and...@beekhof.net wrote: 
  Then -2 obviously isn't big enough is it. 
  
 I need a value between and not including -inf and -2 that will work. 
 All the values I've tried don't, so I'm open to suggestions. 
  
  Please read and understand: 
 
 http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ 
 s-resource-colocation.html 
  
 The way I read the with-rsc description it is in direct conflict with 
 your comment from Nov. 11th (below), 
 so I'm truly confused. 
  
  colocation X-Y -2: X Y 
  colocation Y-X -2: Y X 
 the second one is implied by the first and is therefore redundant 
  
  For how colocation constraints actually work instead of inventing your own  
 rules. 
  
 I'm interested in inventing rules, I'm trying to express the 
 constraints of my application. 
 So far, I have not been able to do so. 

Instead of:

  colocation resX-resY -2: resX resY

Try:

  colocation resX-resY -2: resY resX

Because:

The cluster decides where to put with-rsc (the second one), then decides
where to put rsc (the first one).  You have:

  location resX-nodeA resX -inf: nodeA.acme.com
  location resY-loc resY 1: nodeB.acme.com

If it decides where to put resY first, it puts resY on nodeB.  Then it tries
to place resX, wants to place it where resY is not (nodeA), but can't, due to
the -inf score for resX on nodeA.  So in this case, resX lands on nodeB as
well.

If it decides where to put resX first, it puts resX on nodeB because of the
-inf score for nodeA.  Then it puts resY on nodeA, because of the -2 score
for the colocation constraint.
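If it's unclear which of those two orderings the cluster will pick in a given state, the allocation scores can be inspected directly. A sketch, from memory of the Pacemaker 1.0/1.1 tooling (check ptest --help for the exact flags on your version):

```
  # Show the allocation scores the policy engine computed for the live CIB:
  ptest -sL
```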

HTH,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Help understanding why a failover occurred.

2010-10-17 Thread Tim Serong
On 10/16/2010 at 09:45 AM, Jai away...@gmail.com wrote: 
 I have setup a DRBD-Xen failover cluster. Last night at around 02:50 it 
 failed the resources from server bravo to alpha. I'm trying to find out 
 what caused the failover of resources. I don't see anything in the logs 
 that indicate the cause but I don't really know what to look for. If 
 someone could help me understand these logs and what I'm looking for 
 would be great. I'm not even sure how far back I need to go. 

I reckon it's this:

Oct 16 02:46:04 bravo attrd: [25098]: info: attrd_perform_update: Sent update 
161: pingval=0

Which suggests bravo lost connectivity to 12.12.12.1 around that time, causing
the failover.

For reference, if you're looking at pengine logs...  A few lines above where
it says info: process_pe_message: Transition NNN: PEngine Input stored in:
/var/lib/pengine/pe-input-MMM.bz2, you'll see what it's about to do to your
resources.  If this is just: Leave resource FOO (Started/Master/Slave etc.)
that transition is probably boring.  If it says Start FOO (...) or
Promote/Demote/Stop FOO (...), it means something has changed.  Scroll up
a bit, to above where pengine is saying unpack_config, determine_node_status
etc. and you should see a message suggesting the cause for the change (failed
op, timeout, ping attribute modified, etc.)  It might be a bit inscrutable
sometimes, but it'll be there somewhere...
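To make that concrete, here's a rough illustration of the filtering step (the pengine log lines below are fabricated for the example; real messages vary between versions):

```shell
# Sample pengine log (fabricated for illustration only):
log=$(mktemp)
cat > "$log" <<'EOF'
pengine: [1234]: notice: LogActions: Leave resource resX (Started nodeA)
pengine: [1234]: notice: LogActions: Start resY (nodeB)
pengine: [1234]: info: process_pe_message: Transition 7: PEngine Input stored in: /var/lib/pengine/pe-input-7.bz2
EOF
# Keep only the actions that actually change resource state, i.e.
# drop the boring "Leave resource ..." lines:
changed=$(grep 'LogActions' "$log" | grep -v 'Leave')
echo "$changed"
rm -f "$log"
```

Transitions whose LogActions all say "Leave" can then be skipped entirely.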

HTH

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Cluster failure with mod_security using rotatelogs

2010-10-10 Thread Tim Serong
On 10/11/2010 at 10:17 AM, Markus Schlup mar...@qbik.ch wrote: 
 Hi all
  
 I'm running a cluster-based Apache reverse proxy with the mod_security  
 module. I would like to rotate the logfiles with rotatelogs as follows: 
  
 CustomLog |/usr/sbin/rotatelogs -l /var/log/httpd/access_log.%Y-%m-%d  
 86400 common 
  
 And especially the mod_security log with 
  
 SecAuditLog  |/usr/sbin/rotatelogs -l  
 /var/log/httpd/modsec_audit_log.%Y-%m-%d 86400 
  
 As soon as I change the mod_security log to this (instead of just using  
 SecAuditLog /var/log/httpd/modsec_audit_log) the resource does not  
 start anymore. 
  
 When trying to debug and start the apache resource by hand with 
  
 OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/httpd/conf/httpd.conf  
 OCF_RESKEY_statusurl=http://localhost:80/server-status sh -x  
 /usr/lib/ocf/resource.d/heartbeat/apache start 
  
 it stops after 
  
 ... 
 + for p in '$PORT' '$Port' 80 
 + CheckPort 80 
 + ocf_is_decimal 80 
 + case $1 in 
 + true 
 + '[' 80 -gt 0 ']' 
 + PORT=80 
 + break 
 + echo 127.0.0.1:80 
 + grep : 
 + '[' Xhttp://localhost:80/server-status = X ']' 
 + test /etc/httpd/run/httpd.pid 
 + : OK 
 + case $COMMAND in 
 + start_apache 
 + silent_status 
 + '[' -f /etc/httpd/run/httpd.pid ']' 
 + : No pid file 
 + false 
 + ocf_run /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf 
 ++ /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf 
  
 The resource is in fact started but the command does not finish - so I  
 guess that's the reason why the cluster fails in this setup ... strange  
 enough using the rotatelogs directives for the Apache error and access  
 logs is not an issue and works as expected. 
  
 Does someone know how to fix that problem? 

I've not seen that before, but, just to rule out one possibility...  What
happens if you just run:

  /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf

Does that ever return?  If no, I'd suggest apache is broken.  If yes,
I'd start pointing my finger towards ocf_run or the RA.
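If waiting around for it interactively is awkward, coreutils 'timeout' gives a bounded version of that check; here's a generic sketch with 'sleep' standing in for the httpd command (substitute the real start command when testing):

```shell
# 'timeout' kills the command once the limit expires and exits with
# status 124, so a start command that never returns is easy to detect.
# 'sleep 5' stands in for:
#   /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf
status=0
timeout 1 sleep 5 || status=$?
echo "exit status: $status"   # 124 means the command never returned in time
```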

HTH,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] crm_gui login failure

2010-09-28 Thread Tim Serong
On 9/28/2010 at 04:11 PM, Yan Gao y...@novell.com wrote: 
 On 09/28/10 01:25, Phil Armstrong wrote: 
  I'm running pacemaker-1.1.2-0.6.1 on sles11sp1. I was only able to 
  successfully login to the crm_gui from one of my nodes in spite of the 
  fact that the login parameters appeared to be identical. I traced the 
  problem to a zero length /etc/pam.d/hbmgmt file on the node that 
  exhibited the login failure. This file is part of 
  pacemaker-mgmt-2.0.0-0.3.10, which I have installed on both nodes. 
   
  I had no previous knowledge of this file and so I am quite sure it 
  wasn't anything I did consciously to zero out the file, or to 
  consciously populate it with the contents of the working node: 
   
  #%PAM-1.0 
  auth include common-auth 
  account include common-account 
   
   
  Can anyone tell me how this file is created or modified ? 
 The file is extracted from pacemaker-mgmt on package installation. 
 The back-end of the GUI (mgmtd) reads it for user authentication. 
 No one is supposed or needs to modify the file for any reason. 
  
 So that's strange it was zeroed out. You might need to check the 
 modification time to recall what was happening. 

Wild guess - was your system STONITH'd or otherwise forcibly reset,
immediately after installing pacemaker-mgmt, and are you using XFS
for your root filesystem?

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] /etc/hosts

2010-09-28 Thread Tim Serong
On 9/28/2010 at 07:29 PM, Andrew Beekhof and...@beekhof.net wrote: 
 On Tue, Sep 28, 2010 at 6:05 AM, Mark Horton m...@nostromo.net wrote: 
  Hello, 
  I was wondering what side effects occur if you don't add all the 
  cluster nodes to the /etc/hosts file on each node? 
  
  I'd also be interested in hearing how others keep the hosts file in 
  sync.  For example, lets say you have 3 nodes, and 1 node is currently 
  down.  Then you add a 4th node, but you can't update the hosts file of 
  the down node.  So you must remember to do it when it comes back up. 
  I was trying to see if there was an automated way to keep them in sync 
  in case we forget to update the hosts file on the down node. 
  
 Pacemaker doesn't care, but your messaging layer (corosync or heartbeat)  
 might. 
 If the node that is down has no other way to find out the address of 
 the new node, and the cluster is configured to start automatically 
 when the machine boots, then you might have a problem. 

You might find csync2[1] useful.  You can use this to synchronize config
files across a cluster.  Assuming you've configured it to sync /etc/hosts,
any time you edit /etc/hosts on one node, run csync2 -x and it will
magically sync the changes out to the other nodes in your cluster.  It's
a smart manual push mechanism, not something that runs continuously in
the background, but it's a hell of a lot better than scp and having to
remember where to copy what to, and when :)
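For reference, a minimal /etc/csync2/csync2.cfg for that setup might look like this (host names and key path are placeholders, not taken from the original mail):

```
group ha_hosts
{
    host nodeA nodeB nodeC;
    key /etc/csync2/key_hagroup;
    include /etc/hosts;
}
```

After editing /etc/hosts on any one node, "csync2 -x" pushes it to the others.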

<shameless-plug>
There's a little section on csync2 in the SLE HAE Guide under
Transferring the Configuration to All Nodes at:
http://www.novell.com/documentation/sle_ha/book_sleha/?page=/documentation/sle_ha/book_sleha/data/sec_ha_installation_setup.html
</shameless-plug>

HTH

Tim

[1] http://oss.linbit.com/csync2/


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] clmvd hangs on node1 if node2 is fenced

2010-08-27 Thread Tim Serong
On 8/27/2010 at 03:37 PM, Michael Smith msm...@cbnco.com wrote: 
 On Thu, 26 Aug 2010, Tim Serong wrote: 
  
   for now I have stonith-enabled=false in   
   my CIB. Is there a way to make clvmd/dlm respect it?  
   
  No.  At least, I don't think so, and/or I hope not :) 
  
 I think I'd consider it a bug: I've disabled stonith, so dlm shouldn't  
 wait forever for a fence operation that isn't going to happen. 
  
 CLVM is just making the metadata cluster-aware, so the only way I can  
 imagine screwing things up without fencing would be if I ran something  
 like lvresize on two nodes at the same time, during a split brain. 

So I dug around a little:

  # dlm_controld.pcmk -h 
  Usage:
dlm_controld [options]
  Options:
...
-f num  Enable (1) or disable (0) fencing recovery dependency
  Default is 1
-q num  Enable (1) or disable (0) quorum recovery dependency
  Default is 0

I reckon if you set the args parameter of your ocf:pacemaker:controld
resource to -f 0 -q 0, you'll have DLM ignoring fencing.  At this
point (lest someone reading the archives later thinks I am advocating
this) it would be irresponsible of me not to mention this story about
Why You Need STONITH:

  http://advogato.org/person/lmb/diary/105.html

There is also an accompanying comic:

  http://ourobengr.com/stonith-story

If DLM is ignoring fencing, everything that uses DLM is also going to
ignore fencing, so if you've got (say) an OCFS2 filesystem on top of
your CLVM volume, your filesystem will potentially be toast in a
split-brain situation.
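For completeness, the change suggested above would look something like the following in the crm shell (the resource name "dlm" is assumed; and to repeat, this disables the fencing recovery dependency, so think twice before using it on data you care about):

```
  crm configure primitive dlm ocf:pacemaker:controld \
      params args="-f 0 -q 0"
```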

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] clmvd hangs on node1 if node2 is fenced

2010-08-26 Thread Tim Serong
On 8/27/2010 at 08:50 AM, Michael Smith msm...@cbnco.com wrote: 
  Xinwei Hu hxin...@... writes:
   
That sounds worrying actually. 
I think this is logged as bug 585419 on SLES' bugzilla. 
If you can reproduce this issue, it worths to reopen it I think. 
  
 I've got a pair of fully patched SLES11 SP1 nodes and they're showing  
 what I guess is the same behaviour: if I hard-poweroff node2, operations  
 like vgdisplay -v hang on node1 for quite some time. Sometimes a  
 minute, sometimes two, sometimes forever. They get stuck here: 
  
 Aug 26 18:31:42 xen-test1 clvmd[8906]: doing PRE command LOCK_VG  
 'V_vm_store' at 
 1 (client=0x7f2714000b40) 
 Aug 26 18:31:42 xen-test1 clvmd[8906]: lock_resource 'V_vm_store',  
 flags=0, mode=3 
  
  
 After a few seconds, corosync & dlm notice the node is gone, but  
 vg_display and 
 friends still hang while trying to lock the VG. 
  
 Aug 26 18:31:44 xen-test1 corosync[8476]:  [TOTEM ] A processor failed,  
 forming new configuration. 
 Aug 26 18:31:50 xen-test1 cluster-dlm[8870]: update_cluster: Processing 
 membership 1260 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Skipped  
 active node 219878572: born-on=1256, last-seen=1260, this-event=1260,  
 last-event=1256 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: del_configfs_node:  
 del_configfs_node rmdir /sys/kernel/config/dlm/cluster/comms/236655788 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Removed  
 inactive node 236655788: born-on=1252, last-seen=1256, this-event=1260,  
 last-event=1256 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:controld  
 conf 1 0 1 memb 219878572 join left 236655788 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:ls:clvmd  
 conf 1 0 1 memb 219878572 join left 236655788 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd  
 add_change cg 3 remove nodeid 236655788 reason 3 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd  
 add_change cg 3 counts member 1 joined 0 remove 1 failed 1 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: stop_kernel: clvmd  
 stop_kernel cg 3 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: do_sysfs: write 0 to 
 /sys/kernel/dlm/clvmd/control 
 Aug 26 18:31:51 xen-test1 kernel: [  365.267802] dlm: closing connection  
 to node 236655788 
 Aug 26 18:31:51 xen-test1 clvmd[8906]: confchg callback. 0 joined, 1  
 left, 1 members 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: fence_node_time: Node 
 236655788/xen-test2 has not been shot yet 
 Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: check_fencing_done: clvmd 
 check_fencing 23665578 not fenced add 1282861615 fence 0 
 Aug 26 18:31:51 xen-test1 crmd: [8489]: info: ais_dispatch: Membership  
 1260: quorum still lost 
 Aug 26 18:31:51 xen-test1 cluster-dlm: [8870]: info: ais_dispatch:  
 Membership 1260: quorum still lost 

Do you have STONITH configured?  Note that it says xen-test2 has not
been shot yet and clvmd ... not fenced.  It's just going to sit there
until the down node is successfully fenced - this is intentional, as it's
not safe to keep running until you *know* the dead node is dead.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






[Pacemaker] Updated openSUSE packages in network:ha-clustering repo

2010-08-23 Thread Tim Serong
Hi All,

Just a quick note for openSUSE users - there's updated packages now
in the network:ha-clustering and network:ha-clustering:Factory repos,
build for SLE 11, SLE 11 SP1, openSUSE 11.2, openSUSE 11.3 and Factory:

  http://download.opensuse.org/repositories/network:/ha-clustering/
  http://download.opensuse.org/repositories/network:/ha-clustering:/Factory/

This includes:

  - cluster-glue 1.0.6
  - corosync 1.2.7
  - csync2 1.34
  - hawk 0.3.5
  - ldirectord 1.0.3
  - libdlm 3.00.01
  - ocfs2-tools 1.4.3
  - openais 1.1.3
  - pacemaker 1.1.2.1
  - pacemaker-mgmt 2.0.0
  - resource-agents 1.0.3

Happy clustering,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] CFP: Linux Plumbers Mini-Conf on High-Availability/Clustering

2010-08-14 Thread Tim Serong
On 8/13/2010 at 11:21 PM, Florian Haas florian.h...@linbit.com wrote: 
 On 08/11/2010 01:53 PM, Florian Haas wrote: 
  On 08/10/2010 07:48 PM, Lars Marowsky-Bree wrote: 
  On 2010-08-04T15:59:27, Lars Marowsky-Bree l...@novell.com wrote: 
  
  Hi all, 
  
  there will (hopefully!) be a mini-conference on HA/Clustering at this 
  year's LPC in Cambridge, MA, Nov 3-5th. 
  
  Just a quick reminder, there've not been many proposals submitted yet. 
  If the trend continues, the mini-conf slot might instead be allocated to 
  another topic ... 
  
  Please, do consider to submit a talk to this soon - I know it's to a 
  large degree my fault for sending out the request so late. 
   
  I have a couple of proposals queued, but you caught me between leave and 
  Linuxcon. :) I'll submit them as soon as I can. 
  
 OK, I've submitted 3 proposals. But I'm a bit baffled to see just one 
 other proposal besides that. Red Hat folks, NTT people, please! We need 
 you! This is likely the only chance we get to collaborate in one place 
 this whole year. 

I actually can't see the original CFP email in the linux-cluster archives.
On the bold assumption that *this* email somehow magically makes it to that
list, here's the URL to submit proposals:

  http://www.linuxplumbersconf.org/2010/ocw/events/LPC2010MC/proposals/new

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Opensuse 11.3

2010-07-26 Thread Tim Serong
On 7/26/2010 at 11:51 PM, Andrew Beekhof and...@beekhof.net wrote: 
 On Mon, Jul 26, 2010 at 7:56 AM, Andrew Beekhof and...@beekhof.net wrote: 
  Probably this week. 
  
  On Wed, Jul 21, 2010 at 11:59 PM, Roberto Giordani r.giord...@libero.it  
 wrote: 
  Hello Andrew, 
  do you now when the clusterlabs rpm for Opnsuse 11.3 will be available? 
  
 It doesn't look to be possible I'm afraid. 
 SUSE isn't including the repodata directory at 
 http://download.opensuse.org/distribution/11.3/repo/oss/ which means 
 yum can't use it and I can't build packages for 11.3. 

I don't know what's up with that.

 Perhaps they want to encourage people to use their build service. 
 Any volunteers? 

FWIW, openSUSE 11.3 includes reasonably current versions of Pacemaker
(1.1.2.1), corosync (1.2.1), openais (1.1.2), cluster-glue (1.0.5) and
resource-agents (1.0.3).  Heartbeat is a bit out of date (2.99.3).
There's one problem I'm aware of (can't start openais/corosync on x86_64)
but this can be worked around by creating a few symlinks, see the bug
report for details:

  https://bugzilla.novell.com/show_bug.cgi?id=623427

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] RFC: cluster-wide attributes

2010-07-05 Thread Tim Serong
On 7/5/2010 at 04:54 PM, Andrew Beekhof and...@beekhof.net wrote: 
 On Mon, Jul 5, 2010 at 6:21 AM, Tim Serong tser...@novell.com wrote: 
  On 6/30/2010 at 09:42 PM, Andrew Beekhof and...@beekhof.net wrote: 
  On Thu, Jun 24, 2010 at 5:41 PM, Lars Marowsky-Bree l...@novell.com 
  wrote: 
   Hi, 
   
   another idea that goes along with the previous post are cluster-wide 
   attributes. Similar to per-node attributes, but basically a special 
   section in configuration: 
   
  <optional> 
  <element name="cluster_attributes"> 
    <zeroOrMore> 
  <element name="attributes"> 
    <externalRef href="nvset.rng"/> 
  </element> 
    </zeroOrMore> 
  </element> 
  </optional> 
  
  Do we need a new section? Or can they go in with cluster-infrastructure 
  etc? 
  
   These then would also be referencable in the various dependencies like 
   node attributes, just globally. 
   
   Question - 
   
   1. Do we want to treat them like true node attributes, i.e., per-node 
   attributes would override the cluster-wide settings - or as indeed a 
   completely separate class? I lean towards the latter, but would solicit 
   some more opinions. 
  
  Not sure it really gives you anything by making them a separate class. 
  does 
  it? 
  Just means you have to look twice right? 
  
  Just for the record, a use case of this came up on IRC last week: 
  you could specify cluster-wide standby=on, so new nodes joining the 
  cluster would automatically join in standby mode, with the admin 
  activating them later (per-node standby=off thus overriding cluster- 
  wide attribute). 
  
 That doesn't necessarily mean they need to be a separate class though. 

No, not at all.  I'm just adding to the conversation in an unnecessarily
confusing fashion :)

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] RFC: cluster-wide attributes

2010-07-04 Thread Tim Serong
On 6/30/2010 at 09:42 PM, Andrew Beekhof and...@beekhof.net wrote: 
 On Thu, Jun 24, 2010 at 5:41 PM, Lars Marowsky-Bree l...@novell.com wrote: 
  Hi, 
  
  another idea that goes along with the previous post are cluster-wide 
  attributes. Similar to per-node attributes, but basically a special 
  section in configuration: 
  
 <optional> 
 <element name="cluster_attributes"> 
   <zeroOrMore> 
 <element name="attributes"> 
   <externalRef href="nvset.rng"/> 
 </element> 
   </zeroOrMore> 
 </element> 
 </optional> 
  
 Do we need a new section? Or can they go in with cluster-infrastructure etc? 
  
  These then would also be referencable in the various dependencies like 
  node attributes, just globally. 
  
  Question - 
  
  1. Do we want to treat them like true node attributes, i.e., per-node 
  attributes would override the cluster-wide settings - or as indeed a 
  completely separate class? I lean towards the latter, but would solicit 
  some more opinions. 
  
 Not sure it really gives you anything by making them a separate class. does  
 it? 
 Just means you have to look twice right? 

Just for the record, a use case of this came up on IRC last week:
you could specify cluster-wide standby=on, so new nodes joining the
cluster would automatically join in standby mode, with the admin
activating them later (per-node standby=off thus overriding cluster-
wide attribute).

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Pacemaker cant start CTDB

2010-07-02 Thread Tim Serong
On 7/2/2010 at 02:02 AM, Justin Shafer justinsha...@gmail.com wrote: 
 Hello all 
  
 I have noticed that corosync can't start CTDB in Fedora and Ubuntu. It will 
 work in SLES11 after installing samba-winbind. Going through the logs, 
 sometimes it can't get a recovery lock (filesystem related, I know).. but 
 other times I have tried it can get a recovery lock.

Possibly the CTDB RA is hitting its start timeout before CTDB has
stabilized (which includes some recovery lock fiddling).  Try increasing
the timeout (crm configure ... op start timeout=...) for your CTDB
resource.  If that doesn't work, have a look at the CTDB RA itself, about
line 359: change seq 30 to something higher (probably we need to make
this configurable).
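For example, assuming a CTDB primitive named "ctdb" (both the name and the 180s value are illustrative, not from your configuration):

```
  # Open the resource definition in an editor:
  crm configure edit ctdb
  # ...then add or raise the start timeout on the primitive, e.g.:
  #   op start timeout="180s"
```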

 and once it does it stops the monitoring and stops winbind and shuts down.

Does it say why?  You probably want /var/log/ctdb/log.ctdb and
/var/log/samba/log.{smbd,winbindd}...

 It was doing this with SLES 11 until I added samba-winbind, so I am just 
 guessing it cant find smb, nmb and winbind on Ubuntu and Fedora but its just 
 a guess..

Hard to say without seeing logs, but I'm guessing the CTDB RA is setting
CTDB_SERVICE_SMB, CTDB_SERVICE_NMB etc. incorrectly on those distros.
Please file a bug for this:

  http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Linux-HA

 In Suse it would start and then stop and never really say why in 
 ctdb.log until I added winbind and then the logs showed it trying to start 
 samba, etc. Too bad not all distros are the same in regards to smb, smbd, 
 samba. I configured /etc/default/ctdb in ubunti and /etc/sysconfig/ctdb in 
 fedora but no dice. Also I noticed that corosync doesn't rip out 
 /etc/default/ctdb and replace it with its own like in SLES11.. at least 
 Ubuntu isn't. 

Curious.  It's *meant* to replace that file.  Anything interesting that you
can specify in that file should be specified using RA instance parameters.
For some notes on this, see:

  http://linux-ha.org/wiki/CTDB_%28resource_agent%29

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.






Re: [Pacemaker] Set order on two clone set, but apply on each node

2010-06-06 Thread Tim Serong
On 6/6/2010 at 12:58 AM, Comet / 余尚哲 comet...@gmail.com wrote: 
 I have 10 nodes and want to run httpd & mysqld on all nodes, i plan to do 
 load balance later, 
 so i set two clone set on apache and mysql: 
  
 clone cl-apache apache 
 clone cl-mysqld mysqld 
  
 each apache will connect to mysqld locally, so if mysqld crash, i must 
 turnoff apache running on the same node to avoid error if apache try to load 
 data from mysql, 
 but i think the order constraint can't do that for me if my setting is like 
 this: 
  
 order mysqld-before-apache inf: cl-mysqld cl-apache 
  
 If i want to apply this rule to each node, what setting should i configure? 

Try cloning a group, something like:

  group mysqld-with-apache mysqld apache
  clone cl-mysqld-with-apache mysqld-with-apache

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.





Re: [Pacemaker] Issues with constraints - working for start/stop, being ignored on failures

2010-06-06 Thread Tim Serong
On 6/2/2010 at 11:10 AM, Cnut Jansen w...@cnutjansen.eu wrote: 
 Am 31.05.2010 05:47, schrieb Tim Serong: 
  On 5/31/2010 at 12:57 PM, Cnut Jansenw...@cnutjansen.eu  wrote: 
  
  Current constraints: 
  colocation TEST_colocO2cb inf: cloneO2cb cloneDlm 
  colocation colocGrpMysql inf: grpMysql cloneMountMysql 
  colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master 
  colocation colocMountMysql_o2cb inf: cloneMountMysql cloneO2cb 
  colocation colocMountOpencms_drbd inf: cloneMountOpencms  
 msDrbdOpencms:Master 
  colocation colocMountOpencms_o2cb inf: cloneMountOpencms cloneO2cb 
  colocation colocTomcat inf: cloneTomcat cloneMountOpencms:Started 
  order TEST_orderO2cb 0: cloneDlm cloneO2cb 
  order orderGrpMysql 0: cloneMountMysql:start grpMysql 
  order orderMountMysql_drbd 0: msDrbdMysql:promote cloneMountMysql:start 
  order orderMountMysql_o2cb 0: cloneO2cb cloneMountMysql 
  order orderMountOpencms_drbd 0: msDrbdOpencms:promote  
 cloneMountOpencms:start 
  order orderMountOpencms_o2cb 0: cloneO2cb cloneMountOpencms 
  order orderTomcat 0: cloneMountOpencms:start cloneTomcat 
  
  Try specifying inf for those ordering scores rather than zero. 
  Ordering constraints with score=0 are considered optional and only 
  have an effect when both resources are starting or stopping.  You 
  should also be able to leave out the :start specifiers as this is 
  implicit. 
  
 About those :start specifiers on the mount-resources' order
 constraints you're of course right, and I also already knew about that.
 They're just remnants of some tests (probably a search for other
 workarounds) which, given their (to my knowledge) harmless redundancy,
 I always forgot to remove when making other, more relevant/important
 changes. You know, since the crm shell (which I currently use for
 editing my configuration) cancels all resource monitor operations on
 the node it is started on, I prefer to avoid starting it as much as
 possible, so I don't always have to make sure afterwards that all
 monitor operations run again (i.e. switch the cluster's
 maintenance-mode on/off or switch the node to standby and back online).

Say what?  The CRM shell shouldn't be canceling ops...

 About those 0-scores, unfortunately they're necessary, since they're
 - afaik - the official workaround to prevent instances of clone
 resources being restarted on nodes where it's unnecessary to do so.
 With scores set to inf instead, when I for example put one node
 into standby and/or back online, most clone resources would also be
 restarted on the other node. That's not acceptable for production.
 This behaviour was, according to what I remember reading, only changed
 in Pacemaker 1.0.7, which isn't shipped with SLES 11 yet. I'm hoping
 for SLES 11 SP1 to change that, but haven't found any reliable
 information about its version of Pacemaker yet.

SLES 11 SP1 and the SLE High Availability Extension 11 SP1 are now
available for download from http://download.novell.com/ - this includes
Pacemaker 1.1.2.

  Constraints added to work around at least the DRBD-resources left in 
  state started (unmanaged) failed: 
  order GNAH_orderDrbdMysql_stop 0: cloneMountMysql:stop msDrbdMysql:stop 
  order GNAH_orderDrbdOpencms_stop 0: cloneMountOpencms:stop 
  msDrbdOpencms:stop 
  (Also tried similar constraints for msDrbd*:demote and cloneDlm:stop,
  but neither seemed to have an effect) 
  
  Those shouldn't be necessary (I never tried putting ordering 
  constraints on stop ops before...) 
  
 They shouldn't, right; that's also what I had expected. But as I
 reported in my post above, they - for whatever reason - actually DO
 have an effect! I simply don't know why yet, and hope others may have
 a clue. Anyway, so far they're the most acceptable workaround I know
 of for those strange constraint issues that made me write here.
 (Another workaround is start-delays on stop operations, but those are
 - due to their dependency on individual nodes' system and resource
 performance - not acceptable for production.)
 I just still don't know whether it's a case of misconfiguration and/or
 lack of knowledge/experience on my side, or really a bug in Pacemaker;
 maybe even one already fixed in versions more recent than SLES 11's
 Pacemaker 1.0.6.

Curious...  I'd suggest seeing if you can reproduce on SLE 11 SP1.

Regards,

Tim

 In case someone would like to have a look at it, I attached the
 complete cluster configuration, with and without the workaround, both
 as XML and as output of crm configure show.
 (Please don't wonder about some quite high monitor operation intervals,
 which were just set that way when dumping the config; the tests done
 and configs dumped when posting in Novell's support forum used timings
 at 1/100 of these values and it made no difference.)
  
  
 Here

Re: [Pacemaker] Openais OCF Script Question

2010-05-30 Thread Tim Serong
On 5/30/2010 at 11:13 AM, Emil Popov epo...@postpath.com wrote: 
 Hi 
 I'm trying to use an OCF script in my OpenAIS cluster.
 For the most part it works. From time to time, though, Pacemaker
 executes the resource's original LSB init script instead of the correct
 OCF one,
 therefore not passing the correct parameters to the resource.


 When I stop the resource and start it again, it executes the correct
 OCF script the second time around.

 This usually happens when the resource fails over to another node and
 initially runs the LSB script instead of the OCF one. Very strange.

 Any advice is greatly appreciated.

 Below is the error from /var/log/messages. It insists on using the LSB
 script in the /etc/init.d directory. I had renamed the /etc/init.d/ppsd
 script, but that causes the below error and Stonith reboots the node.
   
   
   
 May 29 05:01:40 gpp0099pun018 crmd: [10927]: info: do_lrm_rsc_op: Performing  
 key=186:20891:0:977e982d-1345-4d4f-b69f-9bf0de010aa3 op=ppsd-6_start_0 ) 
 May 29 05:01:40 gpp0099pun018 lrmd: [10924]: info: rsc:ppsd-6: start 
 May 29 05:01:40 gpp0099pun018 lrmd: [7387]: WARN: For LSB init script, no  
 additional parameters are needed. 
 May 29 05:01:40 gpp0099pun018 lrmd: [7387]: ERROR: (raexeclsb.c:execra:266)  
 execv failed for /etc/init.d/ppsd: No such file or directory 
 May 29 05:01:40 gpp0099pun018 lrmd: [10924]: ERROR: Failed to open lsb RA  
 ppsd. No meta-data gotten. 
 May 29 05:01:40 gpp0099pun018 lrmd: [10924]: WARN: on_msg_get_metadata:  
 empty metadata for lsb::heartbeat::ppsd. 
 May 29 05:01:40 gpp0099pun018 crmd: [10927]: ERROR:  
 lrm_get_rsc_type_metadata(575): got a return code HA_FAIL from a reply  
 message of rmetadata with function g 
 et_ret_from_msg. 
 May 29 05:01:40 gpp0099pun018 crmd: [10927]: WARN: get_rsc_metadata: No  
 metadata found for ppsd::lsb:heartbeat 
 May 29 05:01:40 gpp0099pun018 crmd: [10927]: ERROR: string2xml: Can't parse  
 NULL input 
 May 29 05:01:40 gpp0099pun018 crmd: [10927]: ERROR: get_rsc_restart_list:  
 Metadata for (null)::lsb:ppsd is not valid XML 
 May 29 05:01:40 gpp0099pun018 crmd: [10927]: info: process_lrm_event: LRM  
 operation ppsd-6_start_0 (call=103, rc=254, cib-update=239, confirmed=true)  
 complete 
 unknown 
   
   
 Here is the resource configuration that I have in Pacemaker. It's
 supposed to use an OCF script named ppsd, located at
 /usr/lib/ocf/resource.d/custom/ppsd
   
   
 primitive ppsd-0 ocf:custom:ppsd \ 
 params externalip=192.168.0.50 \ 
 op monitor interval=10s timeout=90s \ 
 op start interval=0 timeout=1800s \ 
 op stop interval=0 timeout=180s \ 
 meta target-role=Started is-managed=true 
   
 Using Openais 0.80.5 
 Pacemaker 1.0.4 

Do you also have an LSB primitive defined called ppsd-6?  Because that's
what those logs say LRMD is trying to start...

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Issues with constraints - working for start/stop, being ignored on failures

2010-05-30 Thread Tim Serong
On 5/31/2010 at 12:57 PM, Cnut Jansen w...@cnutjansen.eu wrote: 
 Hi, 
  
 I'm not sure whether it's really some kind of bug (maybe already widely
 known and even already fixed in more recent versions) or simply
 misconfiguration and a lack of knowledge and experience on my part
 (since I'm still quite new to HA computing), but I have issues with the
 order constraints I defined in Pacemaker: I can't get rid of them and
 can only partially work around them. And those workarounds don't really
 seem intended/designed to me...
  
 The problem: upon starting / switching-to-online and stopping /
 switching-to-standby the nodes / cluster, all constraint chains work as
 they should - and they even do so upon directly stopping the troubling
 fundamental resources, the DRBD and DLM resources, which are the bases
 of my constraint chains - but failures behave differently. When e.g. a
 failure occurs in the DRBD resource for MySQL's DataDir, the cluster
 should first stop the MySQL resource group (MySQL + IP address), then
 stop the MySQL mount resource, then demote and finally stop the DRBD
 resource. But when trying to test the cluster's behaviour upon such a
 failure via crm_resource -F -r drbdMysql:0 -H nde28, the cluster first
 tries to demote the DRBD resource, then already stop it too, then the
 MySQL IP, the MySQL mount and only finally MySQL.
 The result of such a test - given the failing demote and stop of the
 DRBD resource - isn't hard to guess: the DRBD resource is left in
 started (unmanaged) failed, and the rest of the involved resources
 are stopped.
  
 I'm running Pacemaker 1.0.6, delivered with and running on SLES 11 with
 the HAE, both kept up-to-date with the official update repositories
 (due to company directives).
 In a few days SLES 11 SP1 shall be released, where I also hope for more
 recent versions of Pacemaker, DRBD (I still have to run 8.2.7) and
 other HA-cluster-related stuff.
  
 I also already posted about these issues in Novell's support forum,
 with lots more details:
 http://forums.novell.com/novell-product-support-forums/suse-linux-enterprise-server-sles/sles-configure-administer/411152-constraint-issues-upon-failure-drbd-resource-suse-linux-enterprise-hae-11-a.html
  
 So I'm wondering:
 1) Aren't constraint chains, upon being defined, also implicitly
 defined exactly inverted for stopping the resources too?

Yes, but see below for a note on scores.

 2) After my testing for workarounds: why do order constraints on the
 MS resources' stop action (seem to) have an effect when the fundamental
 resources fail, but neither those on the MS resources' demote action,
 nor those on the (primitives'/?) clones' stop action? Or is that just
 because the MS resources' stop action is only the second command
 anyway, and it merely follows my additional constraint?!

I'm not sure about that.

 Current constraints: 
 colocation TEST_colocO2cb inf: cloneO2cb cloneDlm 
 colocation colocGrpMysql inf: grpMysql cloneMountMysql 
 colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master 
 colocation colocMountMysql_o2cb inf: cloneMountMysql cloneO2cb 
 colocation colocMountOpencms_drbd inf: cloneMountOpencms  
 msDrbdOpencms:Master 
 colocation colocMountOpencms_o2cb inf: cloneMountOpencms cloneO2cb 
 colocation colocTomcat inf: cloneTomcat cloneMountOpencms:Started 
 order TEST_orderO2cb 0: cloneDlm cloneO2cb 
 order orderGrpMysql 0: cloneMountMysql:start grpMysql 
 order orderMountMysql_drbd 0: msDrbdMysql:promote cloneMountMysql:start 
 order orderMountMysql_o2cb 0: cloneO2cb cloneMountMysql 
 order orderMountOpencms_drbd 0: msDrbdOpencms:promote  
 cloneMountOpencms:start 
 order orderMountOpencms_o2cb 0: cloneO2cb cloneMountOpencms 
 order orderTomcat 0: cloneMountOpencms:start cloneTomcat 

Try specifying inf for those ordering scores rather than zero.
Ordering constraints with score=0 are considered optional and only
have an effect when both resources are starting or stopping.  You
should also be able to leave out the :start specifiers as this is
implicit.
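For example, taking one of your existing constraints, the mandatory form
would be:

  order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql

With score inf, the implicit inverse ordering also becomes mandatory,
i.e. cloneMountMysql must stop before msDrbdMysql can be demoted or
stopped.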

 Constraints added to work around at least the DRBD-resources left in  
 state started (unmanaged) failed: 
 order GNAH_orderDrbdMysql_stop 0: cloneMountMysql:stop msDrbdMysql:stop 
 order GNAH_orderDrbdOpencms_stop 0: cloneMountOpencms:stop  
 msDrbdOpencms:stop 
 (Also tried similar constraints for msDrbd*:demote and cloneDlm:stop,
 but neither seemed to have an effect) 

Those shouldn't be necessary (I never tried putting ordering
constraints on stop ops before...)

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] How SuSEfirewall2 affects on openais startup?

2010-05-13 Thread Tim Serong
Hi,

On 5/13/2010 at 03:56 PM, Aleksey Zholdak alek...@zholdak.com wrote: 
  The firewall should let through the UDP multicast traffic on 
  ports mcastport and mcastport+1. 
  
 As I wrote above: all interfaces in SuSEfirewall2 are set to the
 Internal zone. So how can I open these ports if they are already open?


Just to double check, I assume Internal zone does not have any
firewall rules applied to it?  If you go to Allowed Services in the
YaST2 firewall config app, it should show everything greyed-out or
allowed for Internal Zone.
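(For reference, if the ports ever did need opening in a protected zone,
something like the following in /etc/sysconfig/SuSEfirewall2 should do
it - this is a sketch from memory, assuming the default mcastport of
5405, so 5405 and 5406 must pass:

  FW_SERVICES_EXT_UDP="5405 5406"

followed by an rcSuSEfirewall2 restart.)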

(Disclaimer: my major experience with SuSEfirewall2 is opening the ssh
port on a system I care about, and turning the firewall off completely
on my test cluster systems, because they're inside networks I trust)

You said earlier that openais starts OK if you have the firewall on,
but resources do not run.  What does the output of crm_mon -r1 show
in this case?

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] How SuSEfirewall2 affects on openais startup?

2010-05-13 Thread Tim Serong
On 5/13/2010 at 07:22 PM, Aleksey Zholdak alek...@zholdak.com wrote: 
 firewall should let through the UDP multicast traffic on  
 ports mcastport and mcastport+1.  

  As I wrote above: all interfaces in SuSEfirewall2 is set to Internal   
  zone. So, how can I open these ports if it already opened?  
  
   
  Just to double check, I assume Internal zone does not have any 
  firewall rules applied to it?  If you go to Allowed Services in the 
  YaST2 firewall config app, it should show everything greyed-out or 
  allowed for Internal Zone. 
  
 Yes, exactly, everything greyed-out and allowed for Internal Zone. 
 Internal zone is unprotected. All ports are open. 

OK, that sounds fine.

  You said earlier that openais starts OK if you have the firewall on, 
  but resources do not run.  What does the output of crm_mon -r1 show 
  in this case? 
  
 sles2:~ # crm_mon -r1 
  
 Last updated: Thu May 13 12:21:21 2010 
 Stack: openais 
 Current DC: NONE 
 2 Nodes configured, 2 expected votes 
 10 Resources configured. 
  
  
 Node sles2: UNCLEAN (offline) 
 Node sles1: UNCLEAN (offline) 

The above is normal while the cluster is starting up.  This may sound
a little silly, but I would have expected everything to come online if
you just wait a few minutes.  You can watch status changes (if any) as
they occur, with crm_mon -r.  It's worth checking /var/log/messages etc.
on each node too, to see if anything is obviously screaming in pain.

 Full list of resources: 
  
   Clone Set: sbd-clone 
   Stopped: [ sbd_fense:0 sbd_fense:1 ] 

Don't clone the SBD stonith resource, you only need a single primitive
here (not that this should be causing your startup trouble).

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] How SuSEfirewall2 affects on openais startup?

2010-05-13 Thread Tim Serong
On 5/13/2010 at 11:48 PM, Aleksey Zholdak alek...@zholdak.com wrote: 
 firewall should let through the UDP multicast traffic on 
 ports mcastport and mcastport+1. 
 
 As I wrote above: all interfaces in SuSEfirewall2 is set to Internal 
 zone. So, how can I open these ports if it already opened? 
 
 Just to double check, I assume Internal zone does not have any 
 firewall rules applied to it?  If you go to Allowed Services in the 
 YaST2 firewall config app, it should show everything greyed-out or 
 allowed for Internal Zone. 
  
  Yes, exactly, everything greyed-out and allowed for Internal Zone. 
  Internal zone is unprotected. All ports are open. 
  
  OK, that sounds fine. 
  
  You said earlier that openais starts OK if you have the firewall on, 
  but resources do not run.  What does the output of crm_mon -r1 show 
  in this case? 
  
  sles2:~ # crm_mon -r1 
   
  Last updated: Thu May 13 12:21:21 2010 
  Stack: openais 
  Current DC: NONE 
  2 Nodes configured, 2 expected votes 
  10 Resources configured. 
   
  
  Node sles2: UNCLEAN (offline) 
  Node sles1: UNCLEAN (offline) 
  
  The above is normal while the cluster is starting up.  This may sound 
  a little silly, but I would have expected everything to come online if 
  you just wait a few minutes.  You can watch status changes (if any) as 
  they occur, with crm_mon -r.  It's worth checking /var/log/messages etc. 
  on each node too, to see if anything is obviously screaming in pain. 
  
 In such a state the nodes remain unchanged for hours.

OK, I had to ask.

 Analysis of logs in this situation does not say anything ... 

If the firewall is blocking anything, it'll be making noise in
/var/log/firewall and/or dmesg.  Another thing to try is set debug: on
in the openais/corosync config file, then look at /var/log/messages.
This should give you more log info...
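The logging section would look something like this (a sketch; the exact
file is /etc/ais/openais.conf or /etc/corosync/corosync.conf depending
on which stack version you're running):

  logging {
          debug: on
          to_syslog: yes
          timestamp: on
  }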

  
 I must remind you that we are talking about running only one node of
 the two.  The second node is turned off (burned, stolen, etc.)
  
 Clone Set: sbd-clone 
 Stopped: [ sbd_fense:0 sbd_fense:1 ] 
  
  Don't clone the SBD stonith resource, you only need a single primitive 
  here (not that this should be causing your startup trouble). 
  
 sbd fence must be on each node. 

The sbd daemon needs to be running on both nodes (the openais init script
should take care of that on SLES), but there only needs to be one sbd
primitive, it does not need to be cloned.  Pacemaker will make sure it
is running somewhere, which is enough.
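I.e. something like this single primitive (the device path is just a
placeholder for your actual shared SBD partition):

  primitive stonith-sbd stonith:external/sbd \
          params sbd_device="/dev/disk/by-id/your-sbd-partition"

rather than a clone wrapped around it.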

 When the firewall is off or run both of nodes - no problem. 

So, one node running, with the firewall off, is OK?

Two nodes running, with the firewall on, is OK?

I think I'm becoming confused...

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Making a resource slightly sticky?

2010-05-13 Thread Tim Serong
On 5/14/2010 at 07:39 AM, Paul Graydon p...@ehawaii.gov wrote: 
 Hi, 
  
 One of my nodes decided to throw a wobbly this morning and locked up
 its network card for about a minute.  Pacemaker came to the rescue and
 merrily transferred everything over to the other node; however, when
 the original node came back it transferred the functions back across.

 Is it possible at all to make resources sticky?  I.e. resources start
 on node 1.  Node 1 fails, resources migrate to node 2.  Node 1
 recovers, but resources stay on node 2 until node 2 fails, at which
 point they'd migrate back to node 1.

Yes, you want the resource-stickiness property.  Using crm configure,
per resource:

  # primitive foo  \
  meta resource-stickiness=1

Or, to make everything a bit sticky:

  # rsc_defaults resource-stickiness=1
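Note that stickiness competes with any location preferences you have
configured; the resource stays put only if its stickiness outweighs the
preference score.  For example (scores hypothetical):

  location prefer-node1 foo 50: node1
  rsc_defaults resource-stickiness=100

Here 100 > 50, so foo stays on node 2 after node 1 recovers.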

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)

2010-04-13 Thread Tim Serong
On 4/13/2010 at 08:13 PM, Dejan Muhamedagic deja...@fastmail.fm wrote: 
 Hi, 
  
 On Mon, Apr 12, 2010 at 10:56:30PM +0200, Roberto Giordani wrote: 
  Hi Tim 
  it's working! 
  Thanks 
  the only simple error was /root/.crm_help_index that should be owned 
  by hacluster:haclient 
  
 Why should it be owned by another user if it's for root? Does 
 hawk use the crm shell? 

For performing management ops, yes.  It invokes:

  /usr/sbin/crm resource (start|stop|migrate|unmigrate|cleanup) [...]

It should be effectively run as hacluster:haclient though, not as root...

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)

2010-04-13 Thread Tim Serong
On 4/14/2010 at 01:59 AM, Dejan Muhamedagic deja...@fastmail.fm wrote: 
 On Tue, Apr 13, 2010 at 05:45:02AM -0600, Tim Serong wrote: 
  On 4/13/2010 at 08:13 PM, Dejan Muhamedagic deja...@fastmail.fm wrote:  
   Hi,  
 
   On Mon, Apr 12, 2010 at 10:56:30PM +0200, Roberto Giordani wrote:  
Hi Tim  
it's working!  
Thanks  
the only simple error was /root/.crm_help_index that should be owned  
by hacluster:haclient  
 
   Why should it be owned by another user if it's for root? Does  
   hawk use the crm shell?  
   
  For performing management ops, yes.  It invokes: 
   
/usr/sbin/crm resource (start|stop|migrate|unmigrate|cleanup) [...] 
   
  It should be effectively run as hacluster:haclient though, not as root... 
  
 Makes me wonder how it ended up in /root. 

I'd have expected it to appear in /var/lib/heartbeat/cores/hacluster
if it was going to appear anywhere...  But actually, I didn't think
the help index was created unless you tried to access the help (which
Hawk doesn't do)?

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


[Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)

2010-04-12 Thread Tim Serong
Greetings All,

This is to announce version 0.3.3 of Hawk, a web-based GUI for
Pacemaker HA clusters.  Noticeable changes from version 0.3.1
include:

- Port number is now 7630 (registered with IANA)
- Shows Master/Slave sets (but currently just shows children as
  started)
- Added confirmation prompt for node ops (bnc#593003) and resource
  ops.
- Allow resource mgmt ops on groups (in addition to group children)
- Added ability to migrate resources (bnc#593005)
- Invoke crm for resource ops, report invocation errors in UI
  (bnc#583605)
- Add mgmt buttons for new resources that appear via JSON update
  (bnc#590037)

SLES/openSUSE packages can be obtained from the openSUSE Build
Service:

  http://software.opensuse.org/search?baseproject=ALL&p=1&q=hawk

Finally, the wiki page at http://clusterlabs.org/wiki/Hawk has been
updated slightly to reflect the current project status as outlined
in this email.

As before, please direct comments, feedback, questions etc.
to tser...@novell.com and/or the Pacemaker mailing list.

Happy clustering,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


Re: [Pacemaker] Announce: HA Web Konsole (Hawk 0.3.3)

2010-04-12 Thread Tim Serong
On 4/13/2010 at 01:08 AM, Roberto Giordani r.giord...@libero.it wrote: 
 Hello,
 where can I find sysvinit >= 2.86-215.2 for openSUSE 11.2 x86_64?

 This is the dependency:

 rpm -ivh hawk-0.3.3-1.1.x86_64.rpm
 warning: hawk-0.3.3-1.1.x86_64.rpm: Header V3 DSA signature: NOKEY, key
 ID 45bd6ae1
 error: Failed dependencies:
  sysvinit >= 2.86-215.2 is needed by hawk-0.3.3-1.1.x86_64

There's one in the YaST:Web/openSUSE_11.2 repo...  But actually, if
that's the only missing dependency, you could just install with --nodeps.
You'll only get into trouble if you're running a separate instance of
lighttpd for some other purpose (in which case startproc etc. may get
confused about which lighttpd it's meant to be dealing with).

/me makes a note to do something more friendly about this dependency.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf


[Pacemaker] [PATCH] Low: tools: crm_simulate - fix small xpath memory leak in inject_node_state()

2010-03-30 Thread Tim Serong
# HG changeset patch
# User Tim Serong tser...@novell.com
# Date 1269931000 -39600
# Node ID 37312dd57d64ef67d829b3dbb868c659438dc495
# Parent  8b867b37c8007042877943b0c74601528db24d0f
Low: tools: crm_simulate - fix small xpath memory leak in inject_node_state()

diff -r 8b867b37c800 -r 37312dd57d64 tools/crm_inject.c
--- a/tools/crm_inject.cMon Mar 29 16:45:22 2010 +0200
+++ b/tools/crm_inject.cTue Mar 30 17:36:40 2010 +1100
@@ -92,6 +92,7 @@
rc = cib_conn-cmds-query(cib_conn, xpath, cib_object, 
cib_xpath|cib_sync_call|cib_scope_local);
 }
 
+crm_free(xpath);
 CRM_ASSERT(rc == cib_ok);
 return cib_object;
 }

___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] ERROR: unpack_rsc_op: Hard error

2010-03-09 Thread Tim Serong
On 3/10/2010 at 09:48 AM, Werner wkuba...@co.sanmateo.ca.us wrote: 
 I have a problem with setting up a simple two node cluster with an IP address 
 that should fail over. 
  
 The two systems run SLES 11 with HAE.  I have done this configuration
 with two virtual machines and it works just fine in that environment.
 However, when I do the exact same configuration on the real (physical)
 systems it fails.  This is what I get:
  
 r...@imsrcdbdgrid2:~# crm configure show 
 node imsrcdbdgrid1 
 node imsrcdbdgrid2 
 primitive ClusterIP ocf:heartbeat:IPadd2 \

   ^IPaddr2

Looks like a typo - if your configuration is missing that 'r' character,
that'll be the source of your problem (although, if the crm shell let you
create a primitive using an RA that doesn't exist, that sounds like a bug).
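I.e. the primitive should read something like the following (the IP
parameters here are made up, as the original params were cut off in
your mail):

  primitive ClusterIP ocf:heartbeat:IPaddr2 \
          params ip="192.0.2.10" cidr_netmask="24" \
          op monitor interval="10s"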

 r...@imsrcdbdgrid2:~# crm_verify -L -V 
 crm_verify[26812]: 2010/03/09_14:42:44 ERROR: unpack_rsc_op: Hard error - 
 ClusterIP_monitor_0 failed with rc=5: Preventing ClusterIP from re-starting  
 on 
 imsrcdbdgrid1 

rc=5 means not installed, which you'll get if the RA explicitly returns
that error code, or if the RA itself doesn't exist.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] Announce: HA Web Konsole (Hawk) 0.3.1

2010-03-04 Thread Tim Serong
Greetings All,

This is to announce version 0.3.1 of Hawk, a web-based GUI for
Pacemaker HA clusters.  This version introduces the ability
to perform some basic management tasks (node standby/online/fence
and resource start/stop/cleanup).  It also now includes a login
screen, so random passersby can't break your cluster.  The rule
here is the same as for the python GUI - you need to log in as
a user who is a member of the haclient group.

SLES/openSUSE packages can be obtained from the openSUSE Build
Service:

http://software.opensuse.org/search?baseproject=ALL&p=1&q=hawk

The wiki page at http://clusterlabs.org/wiki/Hawk has also been
updated to reflect the current project status as outlined in this
email.

As before, please direct comments, feedback, questions etc.
to tser...@novell.com and/or the Pacemaker mailing list.

Enjoy,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] DRBD Management Console 0.6.0

2010-03-02 Thread Tim Serong
On 3/2/2010 at 11:12 PM, Rasto Levrinc rasto.levr...@linbit.com wrote: 
 On Tue, March 2, 2010 11:41 am, Lars Marowsky-Bree wrote: 
  On 2010-02-28T12:24:26, Rasto Levrinc rasto.levr...@linbit.com wrote: 
  
  cool stuff. It's sad that we end up with a competing thingy ... Maybe we 
  could keep Tim's pure web-ui for the monitoring and most simple bits and 
  have drbd-mc replace the python UI. 
  
 Thanks lmb. I see a place for Hawk as a lightweight tool to quickly
 make some changes, and I could even somehow integrate it into DRBD-MC.
 I can still speed DRBD-MC up quite a bit - I haven't even tried to
 optimize it yet - but it will never be very fast.

Yeah, from my perspective I think Hawk and DRBD-MC will each have different
strengths.  For example, pointing a web browser at a cluster node to see
status is easy/quick/lightweight, whereas visualizing complex dependency
relationships between resources is more straightforward to implement and
interact with in a regular non-web app (although no doubt HTML5 advocates
will disagree with me here :))

  How does it interact with the CRM shell? Does it issue XML changes 
  directly? What kind of network connection is required between the UI client 
  and the servers? 
  
 DRBD-MC connects itself via SSH and uses mostly cibadmin and crm_resource 
 commands on the host. It could simply use crm shell commands instead, but 
 it doesn't at the moment, mostly to be compatible with older Heartbeats 
 and there was no reason to change it. 

By comparison, Hawk doesn't need SSH, as it's running on the cluster nodes.
Internally it also uses cibadmin, a couple of crm_* commands and the crm shell,
so currently only works with Pacemaker 1.x.  It /reads/ XML from cibadmin,
but I wasn't planning on having it change the XML directly, rather any changes
are (and will be) made through existing CLI tools.

Side point: I have it in the back of my mind that I may end up wanting to
communicate directly with libcib if the CLI tools ever become a performance
bottleneck, but this isn't a problem yet (earlier, Hawk was running
crm_resource to get the status of each resource, so the more resources,
the more execs: yuck.  Now it just figures everything out from a single
run of cibadmin).
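That single-pass idea is easy to sketch outside of Hawk. The following is a hypothetical illustration only (the inline XML, resource names and grep-based extraction are stand-ins, not Hawk's actual code): one query over the whole CIB yields every resource id, instead of one exec per resource.

```shell
# Hypothetical sketch: the inline XML stands in for `cibadmin -Q` output.
cib='<cib><configuration><resources>
  <primitive id="vip" class="ocf" type="IPaddr2"/>
  <primitive id="fs0" class="ocf" type="Filesystem"/>
</resources></configuration></cib>'
# One pass over the XML yields every resource id (prints "vip", "fs0"):
printf '%s\n' "$cib" | grep -o '<primitive id="[^"]*"' | cut -d'"' -f2
```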

  
  Is there a chance to share more code between the various UIs that, I 
  think, we are going to keep going forward? (I'm pretty sure the crm shell, 
  the web-ui and yours are going to remain actively maintained.) 
  
  
 Yes, I like that. 


I'm also keen on duplicating as little as possible, but I think there's more
scope for sharing of underlying tools (crm shell etc.), or perhaps developing
new scaffolding as necessary, than for sharing pieces of higher-level GUI
implementation.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] DRBD Management Console 0.6.0

2010-03-01 Thread Tim Serong
On 3/1/2010 at 11:16 PM, Rasto Levrinc rasto.levr...@linbit.com wrote: 
 On Mon, March 1, 2010 12:10 pm, Cristian Mammoli - Apra Sistemi wrote: 
  Hi again... 
  
  
  I tried adding a resource with DMC. My script needs 2 mandatory 
  parameters: vmxpath and vimshbin 
  
  
  In the gui i filled the field for vmxpath while vimshbin was already 
  present because the resource agent has: 
  
  shortdesc lang=envmware-vim-cmd path/shortdesc 
  content type=string default=/usr/bin/vmware-vim-cmd/ 
  
 The question is, what the default here means. It is something that RA 
 would use if nothing is specified or it is suggestion for GUI, what to 
 offer as a default value. Obviously the DRBD-MC assumes the former and the 
 vmware RA the latter. 

IMO it's both :)  If the parameter is optional, default is the value the
RA should use internally if no value is explicitly specified.  If the
parameter is mandatory, default is what the management tools should populate
that field with initially.
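For what it's worth, the "former" interpretation usually amounts to a single line in the agent. Here's a hypothetical sketch using the vimshbin parameter from above (the OCF_RESKEY_* naming is the standard OCF environment convention, but this is not the vmware RA's actual code):

```shell
# Hypothetical RA fragment (not the actual vmware agent): apply the
# metadata default when the CRM supplied no value for the parameter.
OCF_RESKEY_vimshbin_default="/usr/bin/vmware-vim-cmd"
: "${OCF_RESKEY_vimshbin:=$OCF_RESKEY_vimshbin_default}"
echo "$OCF_RESKEY_vimshbin"   # prints the default if nothing was passed in
```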

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, OPS Engineering, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Frustrating fun with Pacemaker / CentOS / Apache

2010-02-18 Thread Tim Serong
On 2/19/2010 at 08:40 AM, Paul Graydon p...@ehawaii.gov wrote: 
  I started looking into this today to find out whether it was 
  possible to use another URL for testing.  According to the heartbeat 
  script you can specify the parameter statusurl and as long as it has 
  a body and html tag on the page you test it should work. 

  You can also set your own testregex which should match the 
  output of statusurl. Since resource agents release 1.0.2, 
  apache can also do more thorough tests (see crm ra info apache 
  or ocf_heartbeat_apache(7)). 
  
  
  So I thought I'd give it a try, but it failed.  Initially I assumed 
  it was because I hadn't selected a page with the body and html tag 
  (having not noticed that was a necessity) but even when against a 
  page that has them it still failed.  Trying to execute the command 
  it runs came up with a failure for me too, but it appears to be how 
  all the arguments are presented to wget courtesy of sh -c. 
  
  It's looking for a positive return from: 
  
  sh -c "wget -O- -q -L http://whatever.url.youprovided | tr '\012' ' ' 
  | grep -Ei '</ *body *>[[:space:]]*</ *html *>'" 
  
  Problem is if you cut it down to just that first section: 
  sh -c wget -O- -q -L http://whatever.url.youprovided 
  
  it pops back and tells you 
  wget: missing URL 
  Usage: wget [OPTION]... [URL]... 
  
  Try `wget --help' for more options. 
  
  If you execute wget without using sh -c in front of it it sees the 
  URL and parses it successfully. 
  
  Surrounding the wget string with ' marks, as in: 
  sh -c 'wget -O- -q -L http://whatever.url.youprovided ' 
  
  I'm trying to figure out what other options are available.  Adding 
  in ' marks on line 406 of the ocf heartbeat apache script breaks it! 

  I really don't think there is a need to change anything there. 
  Otherwise, apache would never be able to work. If you think you 
  found a problem, you can try to wrap the critical part in 
  set -x/+x and we'll see what the output looks like. 
  
  Thanks, 
  
  Dejan 
  
 I've looked into this with fresh eyes this morning and managed to track  
 down the problem to this addition to the related to meta  
 target-role=Started 
  
 Not sure quite where I picked that up from, presumably one of the  
 configurations I used as a template?  Without setting it as an attribute  
 it works fine, tested and retested with and without that addition. 
  
 This works: 
 primitive failover-apache ocf:heartbeat:apache \ 
  params configfile="/etc/httpd/conf/httpd.conf" \ 
  httpd="/usr/sbin/httpd" port="80" \ 
  op monitor interval="5s" timeout="20s" \ 
  statusurl="https://valid.test.url/index.html" 
  
 This doesn't: 
 primitive failover-apache ocf:heartbeat:apache \ 
  params configfile="/etc/httpd/conf/httpd.conf" \ 
  httpd="/usr/sbin/httpd" port="80" \ 
  op monitor interval="5s" timeout="20s" \ 
  statusurl="https://valid.test.url/index.html" \ 
  meta target-role="Started" 

That's weird.  That attribute shouldn't make any difference in this case -
it's just telling the cluster that it should try to start the resource,
which is the default anyway.

 My understanding of the meta bits is a little weak, and I can't find an  
 explanation as to what target-role is actually trying to do. 

It specifies the state the resource is meant to be in[1], i.e. stopped,
started, or a master or slave (the latter of which you would use for an 
active/passive DRBD clone pair, for example).

Ignoring master/slave resources, this attribute is set if you use
crm resource stop or crm resource start to force a resource to stop
or start.

Regards,

Tim

[1] Yes, I know I should use the word role here, not state :)
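As an aside, the status-check pipeline quoted at the top of this thread can be exercised standalone by substituting a canned page for the wget output. A small sketch (the page content is made up; the regex is the one from the quoted sh -c line):

```shell
# Reproduce the RA's body/html check without a web server: a canned
# page stands in for the wget output quoted earlier in the thread.
page='<html><head></head><body>apache ok</body></html>'
if printf '%s' "$page" | tr '\012' ' ' | grep -Eiq '</ *body *>[[:space:]]*</ *html *>'
then
    echo "status check passed"   # prints: status check passed
fi
```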


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0

2010-02-09 Thread Tim Serong
On 2/9/2010 at 11:05 PM, darren.mans...@opengi.co.uk wrote: 
 It's pacemaker-1.0.3-4.1 
  
 No output for cluster-infrastructure. 
  
 But the HTML source does contain information, just display: none hides 
 it: 
  
 <div id="summary" style="display: none"> 
   <table> 
 <tr><th>Stack:</th> <td><span 
 id="summary::stack"></span></td></tr> 
 ...
   </table> 
 </div> 

It was keying the display off the cluster-infrastructure parameter,
which first appeared in Pacemaker 1.0.4.  I've since fixed this, and
OBS packages for hawk-0.2.1 have been built.  They should thus appear
in the repos on download.opensuse.org in the fullness of time.  Once
said time has elapsed, please install the new packages and let me know
how you go.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0

2010-02-09 Thread Tim Serong
On 2/10/2010 at 03:40 AM, darren.mans...@opengi.co.uk wrote: 
 On Tue, 2010-02-09 at 06:44 -0700, Tim Serong wrote: 
  On 2/9/2010 at 11:05 PM, darren.mans...@opengi.co.uk wrote: 
   It's pacemaker-1.0.3-4.1 
   
   No output for cluster-infrastructure. 
   
   But the HTML source does contain information, just display: none hides 
   it: 
   
   <div id="summary" style="display: none"> 
 <table> 
   <tr><th>Stack:</th> <td><span 
   id="summary::stack"></span></td></tr> 
   ... 
 </table> 
   </div> 
  
  It was keying the display off the cluster-infrastructure parameter, 
  which first appeared in Pacemaker 1.0.4.  I've since fixed this, and 
  OBS packages for hawk-0.2.1 have been built.  They should thus appear 
  in the repos on download.opensuse.org in the fullness of time.  Once 
  said time has elapsed, please install the new packages and let me know 
  how you go. 
  
  Regards, 
  
  Tim 
  
  
 This is great thanks. The only problem now is that in FF 3.5 and Google 
 Chrome in Linux it displays for about 5 seconds then the screen goes 
 blank. Is it this bit of JS? 
  
 Event.observe(window, 'load', function() { do_update(); }); 
  

So, by "fixed" I clearly meant "fixed in only one of the two places that
require fixing".  Please try the following change (the relevant file will
be /srv/www/hawk/public/javascripts/application.js):

diff -r ed8bf3b8be26 hawk/public/javascripts/application.js
--- a/hawk/public/javascripts/application.js    Tue Feb 09 23:27:49 2010 +1100
+++ b/hawk/public/javascripts/application.js    Wed Feb 10 10:35:23 2010 +1100
@@ -35,7 +35,7 @@
 }
 
 function update_summary(summary) {
-  if (summary.stack) {
+  if (summary.version) {
 for (var e in summary) {
   $("summary::" + e).update(summary[e]);
 }
@@ -101,7 +101,7 @@
 update_errors(request.responseJSON.errors);
 update_summary(request.responseJSON.summary);
 
-if (request.responseJSON.summary.stack) {
+if (request.responseJSON.summary.version) {
   $("nodelist").show();
   if (update_panel(request.responseJSON.nodes)) {
 if ($("nodelist::children").hasClassName("closed")) {


Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] Announce: Hawk (HA Web Konsole) 0.2.0

2010-02-08 Thread Tim Serong
Greetings All,

This is to announce version 0.2.0 of Hawk, a web-based GUI for
Pacemaker HA clusters.  The major item of note for this version
is that we now have reasonable feature parity with crm_mon, and
there are SLES/openSUSE packages available from the openSUSE
Build Service:

http://software.opensuse.org/search?baseproject=ALL&p=1&q=hawk

There is also a wiki page up at http://clusterlabs.org/wiki/Hawk
that gives a brief overview of the project, and tells you how
to get the source from Mercurial, if you don't want to (or can't)
use the above packages.

As before, please direct comments, feedback, questions etc.
to tser...@novell.com and/or the Pacemaker mailing list.

Thanks for listening,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


[Pacemaker] Announce: Hawk (HA Web Konsole)

2010-01-16 Thread Tim Serong
Greetings All,

This is to announce the development of the Hawk project,
a web-based GUI for Pacemaker HA clusters.

So, why another management tool, given that we already have
the crm shell, the Python GUI, and DRBD MC?  In order:

1) We have the usual rationale for a GUI over (or in addition
   to) a CLI tool; it is (or should be) easier to use, for
   a wider audience.

2) The Python GUI is not always easily installable/runnable
   (think: sysadmins with Windows desktops and/or people who
   don't want to, or can't, forward X).
   
3) Believe it or not, there are a number of cases where,
   citing security reasons, site policy prohibits ssh access
   to servers (which is what DRBD MC uses internally).

There are also some differing goals; Hawk is not intended
to expose absolutely everything.  There will be a point somewhere
where you have to say "and now you must learn to use a shell".

Likewise, Hawk is not intended to install the base cluster
stack for you (whereas DRBD MC does a good job of this).

It's early days yet (no downloadable packages), but you can
get the current source as follows:

  # hg clone http://hg.clusterlabs.org/pacemaker/hawk
  # cd hawk
  # hg update tip

This will give you a web-based GUI with a display roughly
analogous to crm_mon, in terms of status of cluster resources.
It will show you running/dead/standby nodes, and the resources
(clones, groups & primitives) running on those nodes.

It does not yet provide information about failed resources or
nodes, other than the fact that they are not running.

Display of nodes  resources is collapsible (collapsed by
default), but if something breaks while you are looking at it,
the display will expand to show the broken nodes and/or
resources.

Hawk is intended to run on each node in your cluster.  You
can then access it by pointing your web browser at the IP
address of any cluster node, or the address of any IPaddr(2)
resource you may have configured.

Minimally, to see it in action, you will need the following
packages and their dependencies (names per openSUSE/SLES):

  - ruby
  - rubygem-rails-2_3
  - rubygem-gettext_rails

Once you've got those installed, run the following command:

  # hawk/script/server

Then, point your browser at http://your-server:3000/ to see
the status of your cluster.

Ultimately, hawk is intended to be installed and run as a
regular system service via /etc/init.d/hawk.  To do this,
you will need the following additional packages:

  - lighttpd
  - lighttpd-mod_magnet
  - ruby-fcgi
  - rubygem-rake

Then, try the following, but READ THE MAKEFILE FIRST!
make install (and the rest of the build system for that
matter) is frightfully primitive at the moment:

  # make
  # sudo make install
  # /etc/init.d/hawk start

Then, point your browser at http://your-server:/ to see
the status of your cluster.

Assuming you've read this far, what next?

- In the very near future (but probably not next week,
  because I'll be busy at linux.conf.au) you can expect to
  see further documentation and roadmap info up on the
  clusterlabs.org wiki.
  
- Immediate goal is to obtain feature parity with crm_mon
  (completing status display, adding error/failure messages).

- Various pieces of scaffolding need to be put in place (login
  page, access via HTTPS, clean up build/packaging, theming,
  etc.)

- After status display, the following major areas of
  functionality are:
  - Basic operator tasks (stop/start/migrate resource,
standby/online node, etc.)
  - Explore failure scenarios (shadow CIB magic to see
what would happen if a node/resource failed).
  - Ability to actually configure resources and nodes.

Please direct comments, feedback, questions, etc. to
tser...@novell.com and/or the Pacemaker mailing list.

Thank you for your attention.

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Wrong stack o2cb

2009-12-16 Thread Tim Serong
On 12/16/2009 at 01:41 AM, Поляченко Владимир
Владимировичstrafer.ad...@gmail.com wrote: 
 Hi all (sorry for my english, i can read and understand, but not write 
 in english) 
 Configure cluster in Fedora 12(base manual Cluster from Scratch 
 Apache in Fedora 11) 
  
 package from fedora repo 
  
 [r...@server1 /]# rpm -q pacemaker ocfs2-tools ocfs2-tools-pcmk 
 dlm-pcmk heartbeat corosync resource-agents drbd  
  
 pacemaker-1.0.5-4.fc12.i686  
 ocfs2-tools-1.4.3-3.fc12.i686 
 ocfs2-tools-pcmk-1.4.3-3.fc12.i686 
 dlm-pcmk-3.0.6-1.fc12.i686 
 heartbeat-3.0.0-0.5.0daab7da36a8.hg.fc12.i686 
 corosync-1.2.0-1.fc12.i686 
 resource-agents-3.0.6-1.fc12.i686 
 drbd-8.3.6-2.fc12.i686 
  
 Configuration Active/Active, next problem (/var/log/messages) 
  
 Dec 15 16:07:21 server1 crmd: [1189]: info: te_rsc_command: Initiating  
 action 4: monitor o2cb:0_monitor_0 on server1 (local) 
 Dec 15 16:07:21 server1 crmd: [1189]: info: do_lrm_rsc_op: Performing  
 key=4:91:7:78a6a7b0-ef15-434f-8aaf-e00cd0f9d6ef op=o2cb:0_monitor_0 ) 
 Dec 15 16:07:21 server1 lrmd: [1186]: info: rsc:o2cb:0:101: monitor 
 Dec 15 16:07:21 server1 o2cb[20999]: ERROR: Wrong stack o2cb 
 Dec 15 16:07:21 server1 lrmd: [1186]: info: RA output:  
 (o2cb:0:monitor:stderr) 2009/12/15_16:07:21 ERROR: Wrong stack o2cb 
 Dec 15 16:07:21 server1 crmd: [1189]: info: process_lrm_event: LRM operation  
 o2cb:0_monitor_0 (call=101, rc=5, cib-update=430, confirmed=true) not  
 installed 
 Dec 15 16:07:21 server1 crmd: [1189]: WARN: status_from_rc: Action 4  
 (o2cb:0_monitor_0) on server1 failed (target: 7 vs. rc: 5): Error 
 Dec 15 16:07:21 server1 crmd: [1189]: info: abort_transition_graph:  
 match_graph_event:272 - Triggered transition abort (complete=0,  
 tag=lrm_rsc_op, id=o2cb:0 
 _monitor_0, magic=0:5;4:91:7:78a6a7b0-ef15-434f-8aaf-e00cd0f9d6ef, 
 cib=0.329.2)  
 : Event failed 
 Dec 15 16:07:21 server1 crmd: [1189]: info: update_abort_priority: Abort  
 priority upgraded from 0 to 1 
 Dec 15 16:07:21 server1 crmd: [1189]: info: update_abort_priority: Abort  
 action done superceeded by restart 
 Dec 15 16:07:21 server1 crmd: [1189]: info: match_graph_event: Action  
 o2cb:0_monitor_0 (4) confirmed on server1 (rc=4) 
 Dec 15 16:07:21 server1 crmd: [1189]: info: te_rsc_command: Initiatingaction  
 3: probe_complete  
 probe_complete on server1 (local) - no waiting 
  
 but resource /dev/drbd1 mounted without problem (nodes online, mount 
 does not start, I mount manually) 

You don't want to be mounting it manually, the cluster needs to do it
for you.

 crm config (only need rows) 
 - 
 primitive DataFS ocf:heartbeat:Filesystem \ 
 params device=/dev/drbd/by-res/data directory=/opt fstype=ocfs2 \ 
 meta target-role=Started 
 primitive ServerData ocf:linbit:drbd \ 
 params drbd_resource=data 
 primitive dlm ocf:pacemaker:controld \ 
 op monitor interval=120s 
 primitive o2cb ocf:ocfs2:o2cb \ 
 op monitor interval=120s 
 ms ServerDataClone ServerData \ 
 meta master-max=2 master-node-max=1 clone-max=2 
 clone-node-max=1 notify=true 
 clone dlm-clone dlm \ 
 meta interleave=true 
 clone o2cb-clone o2cb \ 
 meta interleave=true 
 colocation o2cb-with-dlm inf: o2cb-clone dlm-clone 
 order start-o2cb-after-dlm inf: dlm-clone o2cb-clone 
 - 
 I create /etc/ocfs2/cluser.conf 
 - 
 node: 
 name = server1 
 cluster = ocfs2 
 number = 0 
 ip_address = 10.10.10.1 
 ip_port =  
  
 node: 
 name = server2 
 cluster = ocfs2 
 number = 1 
 ip_address = 10.10.10.2 
 ip_port =  
  
 cluster: 
 name = ocfs2 
 node_count = 2 
 - 
 How resolve this problem? 

You shouldn't need /etc/ocfs2/cluster.conf.  AFAIK this is only used
in non-Pacemaker environments, when o2cb is managing the cluster.
Did you create your filesystem with oc2b running, or the Pacemaker
cluster?  If the former, I'd suggest:

- Make sure o2cb is chkconfig'd off.
- Make sure your pacemaker cluster is running, and that dlm and ocfs2 
  are up.
- run tunefs.ocfs2 --update-cluster-stack (or use mkfs to recreate
  your clustered filesystem).  One cluster stack can't
  mount a filesystem created with a different cluster stack.
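As a rough sketch of those steps (the /dev/drbd1 device is taken from your earlier mail; adjust as needed, and only run tunefs.ocfs2 against an unmounted filesystem with the Pacemaker stack, dlm and o2cb clones up):

```shell
# Hypothetical command sequence for the steps above; verify the
# device path before running anything destructive.
chkconfig o2cb off
tunefs.ocfs2 --update-cluster-stack /dev/drbd1
```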

HTH,

Tim


-- 
Tim Serong

Re: [Pacemaker] How to delete a resource

2009-12-07 Thread Tim Serong
On 12/7/2009 at 08:53 PM, Colin colin@gmail.com wrote: 
 Hi, 
  
 when trying to delete a resource, either by replacing the whole 
 resources-part of the CIB with cibadmin with a new version where 
 some resources are missing, or by using a crm_resource -t primitive 
 --resource <name> --delete, I get the following error: 
  
 Error performing operation: Update does not conform to the configured  
 schema/DTD 
  
 Now since the error doesn't tell me where the problem is, I can only 
 guess that the problem is that other, dynamic parts of the CIB still 
 reference the resource, and the schema prevents dangling 
 references. So if these methods don't work, and the crm-shell 
 doesn't have a delete for resources, is there an official and simple 
 way to delete a resource? 

This should do it:

  # crm configure delete <resource-id>
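If the resource is still running, a typical full sequence looks like this (the resource id is a placeholder; the shell will generally complain about deleting a running resource, so stop it first):

```shell
# Stop the resource, then remove its definition from the CIB.
crm resource stop myresource
crm configure delete myresource
```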

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Node crash when 'ifdown eth0'

2009-11-30 Thread Tim Serong
On 12/1/2009 at 11:05 AM, hj lee kerd...@gmail.com wrote: 
 On Fri, Nov 27, 2009 at 3:05 PM, Steven Dake sd...@redhat.com wrote: 
  
  On Fri, 2009-11-27 at 11:32 -0200, Mark Horton wrote: 
   I'm using pacemaker 1.0.6 and corosync 1.1.2 (not using openais) with 
   centos 5.4.  The packages are from here: 
   http://www.clusterlabs.org/rpm/epel-5/ 
   
   Mark 
   
   On Fri, Nov 27, 2009 at 9:01 AM, Oscar Remí-rez de Ganuza Satrústegui 
   oscar...@unav.es wrote: 
Good morning, 

We are testing a cluster configuration on RHEL5 (x86_64) with pacemaker 
1.0.5 and openais (0.80.5). 
Two node cluster, active-passive, with the following resources: 
Mysql service resource and a NFS filesystem resource (shared storage in 
  a 
SAN). 

In our tests, when we bring down the network interface (ifdown eth0), 
  the 
  
  What is the use case for ifdown eth0 (ie what are you trying to verify)? 
  
  
 I have the same test case. In my case, when the two-node cluster is disconnected, 
 I want to see split-brain, and then I want to see the split-brain handler 
 reset one of the nodes. What I want to verify is that the cluster will recover 
 from network disconnection and a split-brain situation. 

Try this, on one node:

  # iptables -A INPUT -s ip.of.other.node -j DROP
  # iptables -A OUTPUT -d ip.of.other.node -j DROP
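To heal the simulated split-brain afterwards, delete the same rules (iptables -D takes the identical rule specification that -A added):

```shell
# Remove the DROP rules to restore connectivity between the nodes.
iptables -D INPUT -s ip.of.other.node -j DROP
iptables -D OUTPUT -d ip.of.other.node -j DROP
```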

HTH,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.



___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] Fwd: virtual IP

2009-11-20 Thread Tim Serong
On 11/21/2009 at 06:25 AM, Shravan Mishra shravan.mis...@gmail.com wrote: 
 This is my exact output: 
  
  
 Last updated: Fri Nov 20 18:20:51 2009 
 Stack: openais 
 Current DC: node1.itactics.com - partition with quorum 
 Version: 1.0.5-9e9faaab40f3f97e3c0d623e4a4c47ed83fa1601 
 2 Nodes configured, 2 expected votes 
 4 Resources configured. 
  
  
 Online: [ node1.itactics.com node2.itactics.com ] 
  
 Master/Slave Set: ms-drbd 
 Masters: [ node1.itactics.com ] 
 Slaves: [ node2.itactics.com ] 
 node1.itactics.com-stonith  (stonith:external/safe/ipmi): 
 Started node2.itactics.com 
 node2.itactics.com-stonith  (stonith:external/safe/ipmi): 
 Started node1.itactics.com 
 Resource Group: svcs_grp 
 fs0 (ocf::heartbeat:Filesystem):Started node1.itactics.com 
 safe_svcs   (ocf::itactics:safe):   Started node1.itactics.com 
 vip (ocf::heartbeat:IPaddr2):   Stopped 
  
 Failed actions: 
 vip_monitor_0 (node=node1.itactics.com, call=7, rc=2, 
 status=complete): invalid parameter 
 vip_monitor_0 (node=node2.itactics.com, call=7, rc=2, 
 status=complete): invalid parameter 
  
  
 The config this time I tried was 
  
  
 <primitive id="vip" class="ocf" type="IPaddr2" provider="heartbeat"> 
   <operations> 
 <op id="op-vip-1" name="monitor" timeout="30s" interval="10s"/> 
   </operations> 
   <instance_attributes id="ia-vip"> 
 <nvpair id="vip-addr" name="ip" value="172.30.0.17" /> 
 <nvpair id="vip-intf" name="nic" value="eth1"/> 
 <nvpair id="vip-bcast" name="broadcast" value="172.30.255.255"/> 
 <nvpair id="vip-cidr_netmask" name="cidr_netmask" value="16"/> 
   </instance_attributes> 
 </primitive> 
  
  
  
 Can somebody help me what's the problem here. 

You're probably suffering from 
https://bugzilla.novell.com/show_bug.cgi?id=553753 which is fixed by 
http://hg.linux-ha.org/agents/rev/5d341d5dc96a

Try explicitly adding the parameter clusterip_hash=sourceip-sourceport to the 
IP address.  This will add something like the following to the 
instance_attributes:

  <nvpair id="vip-clusterip_hash" name="clusterip_hash" 
value="sourceip-sourceport"/> 

Regards,

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker


Re: [Pacemaker] RFC: Compacting constraints

2009-11-05 Thread Tim Serong
On 11/6/2009 at 05:13 AM, Andrew Beekhof and...@beekhof.net wrote: 
 On Thu, Nov 5, 2009 at 4:57 PM, Dejan Muhamedagic deja...@fastmail.fm 
 wrote: 
 
  conjoin sounds sort of funny to me (as a non-native speaker). 
  
 Equally so to me, and Australian is kinda like english. 

How about coordinate?

Tim


-- 
Tim Serong tser...@novell.com
Senior Clustering Engineer, Novell Inc.




___
Pacemaker mailing list
Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

