Re: [Pacemaker] [Problem]The monitor that start-delay is long does not stop.

2010-10-12 Thread renayama19661014
Hi Andrew,

  Funnily enough I was just looking at that message and saw that the
  code relevant to this one looked wrong too.
  
  I believe this should fix the issue:
 http://hg.clusterlabs.org/pacemaker/1.1/rev/e06810256413
  
  
   I registered log and more with Bugzilla.
  
   * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2505
  
  Oops, I didn't see that. I should have included the bug number in the 
  commit :-(
 
 ok.
 I confirm your revision.

Sorry for the late confirmation.

I confirmed that your revision solves the problem.

In addition, I applied an equivalent change to 1.0 and confirmed that the
problem no longer occurs there.

I added a comment to Bugzilla.

Best Regards,
Hideo Yamauchi.


--- renayama19661...@ybb.ne.jp wrote:

 Hi Andrew,
 
 Thank you for comment.
 
  Funnily enough I was just looking at that message and saw that the
  code relevant to this one looked wrong too.
  
  I believe this should fix the issue:
 http://hg.clusterlabs.org/pacemaker/1.1/rev/e06810256413
  
  
   I registered log and more with Bugzilla.
  
   * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2505
  
  Oops, I didn't see that. I should have included the bug number in the 
  commit :-(
 
 ok.
 I confirm your revision.
 
 Best Regards,
 Hideo Yamauchi.
 
 --- Andrew Beekhof and...@beekhof.net wrote:
 
  On Thu, Oct 7, 2010 at 8:39 AM,  renayama19661...@ybb.ne.jp wrote:
   Hi,
  
   I ran the following steps to confirm the problem reported in this mailing list post.
  
   * http://www.gossamer-threads.com/lists/linuxha/pacemaker/66939
  
  
   Step 1) I prepared a cib.xml whose monitor operations set start-delay to more
   than five minutes.
   Step 2) I started two nodes and loaded the CIB.
  
   
   Last updated: Thu Oct  7 14:58:09 2010
   Stack: Heartbeat
   Current DC: srv02 (1f8dd092-d82b-47eb-86c4-e011a2cd11b3) - partition WITHOUT quorum
   Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
   2 Nodes configured, unknown expected votes
   1 Resources configured.
  
   Online: [ srv01 srv02 ]
  
    Resource Group: grpDummy
        prmFsPostgreSQLDB1-3 (ocf::heartbeat:Dummy): Started srv01
        prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv01
  
   Step 3) I caused monitor errors on the resources in succession.
  
   
   Last updated: Thu Oct  7 15:20:01 2010
   Stack: Heartbeat
   Current DC: srv02 (d3fe8b08-20d9-4990-aebb-56a0675af5bd) - partition WITHOUT quorum
   Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
   2 Nodes configured, unknown expected votes
   1 Resources configured.
  
   Online: [ srv01 srv02 ]
  
    Resource Group: grpDummy
        prmFsPostgreSQLDB1-3 (ocf::heartbeat:Dummy): Started srv02
        prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02
  
   Migration summary:
   * Node srv02:
   * Node srv01:
      prmIpPostgreSQLDB2: migration-threshold=1 fail-count=1
      prmFsPostgreSQLDB1-3: migration-threshold=1 fail-count=1
  
   Failed actions:
       prmIpPostgreSQLDB2_monitor_6 (node=srv01, call=7, rc=7, status=complete): not running
       prmFsPostgreSQLDB1-3_monitor_3 (node=srv01, call=5, rc=7, status=complete): not running
  
   Step 4) The resources fail over to node srv02, but the monitors on srv01 do
   not stop.
  
   [r...@srv01 ~]# !tail
   tail -f /var/log/ha-log
   Oct  7 15:27:27 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
   Oct  7 15:27:27 srv01 Dummy[16572]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
   Oct  7 15:27:58 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
   Oct  7 15:27:58 srv01 Dummy[16594]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
   Oct  7 15:27:59 srv01 lrmd: [15792]: debug: rsc:prmIpPostgreSQLDB2:8: monitor
   Oct  7 15:27:59 srv01 Dummy[16601]: DEBUG: prmIpPostgreSQLDB2 monitor : 7
   Oct  7 15:27:59 srv01 lrmd: [15792]: debug: rsc:prmIpPostgreSQLDB2:7: monitor
   Oct  7 15:27:59 srv01 Dummy[16608]: DEBUG: prmIpPostgreSQLDB2 monitor : 7
   Oct  7 15:28:28 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
   Oct  7 15:28:28 srv01 Dummy[16628]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
  
   Step 5) Afterwards the fail-count keeps increasing unexpectedly.
  
   
   Last updated: Thu Oct  7 15:31:21 2010
   Stack: Heartbeat
   Current DC: srv02 (d3fe8b08-20d9-4990-aebb-56a0675af5bd) - partition WITHOUT quorum
   Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
   2 Nodes configured, unknown expected votes
   1 Resources configured.
  
   Online: [ srv01 srv02 ]
  
    Resource Group: grpDummy
        prmFsPostgreSQLDB1-3
   

Re: [Pacemaker] stonith pacemaker problem

2010-10-12 Thread Vladislav Bogdanov
12.10.2010 07:25, Andrew Beekhof wrote:
 On Mon, Oct 11, 2010 at 9:51 PM, Vladislav Bogdanov
 bub...@hoster-ok.com wrote:
 11.10.2010 09:14, Andrew Beekhof wrote:
 strictly speaking you don't.
 but at least on fedora, the policy is that $x-libs always requires $x
 so just building against heartbeat-libs means that yum will suck in
 the main heartbeat package :-(

 And this seem to be a bit incorrect statement btw:
 
 no, you're wrong sorry.
 
 From: http://fedoraproject.org/wiki/Packaging/ReviewGuidelines
 
 SHOULD: Usually, subpackages other than devel should require the base
 package using a fully versioned dependency.
 
 http://fedoraproject.org/wiki/Packaging/Guidelines#RequiringBasePackage
 
 In any case, its not something Pacemaker has control over.

I think this is where packaging guidelines seem to be incomplete.
Frankly speaking, -libs is not a subpackage in a general meaning,
but rather a superpackage. Base package almost always requires -libs.

That means that -libs is a kinda special case. Guidelines are valid
for subpackages like modules, -data, -docs, -servers, -clients,
whatever else.
You can run (one line):

rpm -qa | grep -- -libs | grep -v -- -devel | while read rpm ; do echo "$rpm:"; rpm -q --requires "$rpm" ; done | grep -Ev '^(lib|rpmlib|rtld|/)'

And you'll see that only a very small number of -libs packages actually
require the base package.
So the majority of Fedora packagers either do not follow the guidelines or
realize that -libs is a different story.

Actually, what is the point of splitting a package into base and
-libs if -libs depends on base?

The main idea of such a split is to provide a way to have the shared
libraries installed where the main package (together with all its
dependencies) is not needed.

And I agree, this is for another mailing list anyways.
Falling silent...

 
 usually an application
 (binary) requires some libraries, and some of those libraries are
 provided by the -libs package which is built together with the binary. But
 the libraries themselves very rarely require something from the main
 package. Those rare cases are configuration files which are read from
 inside the libraries without a direct request from the application. And
 even in that case those configuration files are (should be) provided by a
 -common subpackage (which -libs can depend on).
 The only point in such requirements is the licenses, which are usually
 included in main packages. But from my point of view nothing prevents a
 packager from including the license file in the %doc stanza for -libs too, so
 any 'reverse' dependencies could be easily avoided, leaving only
 'straight' ones - what the libraries actually depend on.
 This is what surprises me about corosync, openais and pacemaker - I need
 to install the corosync and openais packages on a development host only
 because I need the corresponding -libs and -devel packages. This is actually
 not usual for Fedora, and it is really not needed. The main idea of
 -libs is to provide DSOs which can be used by other applications
 without the need to install the 'main' package (together with all daemons,
 initscripts and dependencies on other libs). The same goes for -devel - it
 really needs -libs because it provides the .so symlinks to libs for ld, but
 it shouldn't depend on the main application.

 Best,
 Vladislav


 glad you found a path forward though

  understand that /usr/lib/ocf/resource.d/heartbeat has ocf scripts
 provided by heartbeat but that can be part of the Reusable cluster
 agents subsystem.

 Frankly I thought the way I had installed the system by erasing and
 installing the fresh packages it should have worked.

 But all said and done I learned a lot of cluster code by gdbing it.
 I'll be having a peaceful thanksgiving.

 Thanks and happy thanks giving.
 Shravan









 On Sun, Oct 10, 2010 at 2:46 PM, Andrew Beekhof and...@beekhof.net wrote:
 Not enough information.
 We'd need more than just the lrmd's logs, they only show what happened 
 not why.

 On Thu, Oct 7, 2010 at 11:02 PM, Shravan Mishra
 shravan.mis...@gmail.com wrote:
 Hi,

 Description of my environment:
   corosync=1.2.8
   pacemaker=1.1.3
   Linux= 2.6.29.6-0.6.smp.gcc4.1.x86_64 #1 SMP


 We are having a problem with our pacemaker which is continuously
 canceling the monitoring operation of our stonith devices.

 We ran:

 stonith -d -t external/safe/ipmi hostname=ha2.itactics.com
 ipaddr=192.168.2.7 userid=hellouser passwd=hello interface=lanplus -S

 Its output is attached as stonith.output.

 We have been trying to debug this issue for a few days now with no
 success.
 We are hoping that someone can help us, as we are under immense
 pressure to move to RCS unless we can solve this issue in a day or two,
 which I personally don't want to do because we like the product.

 Any help will be greatly appreciated.


 Here is an excerpt from the /var/log/messages:
 =
 Oct  7 16:58:29 ha1 lrmd: [3581]: info:
 rsc:ha2.itactics.com-stonith:11155: start
 Oct  7 16:58:29 ha1 lrmd: [3581]: info:
 

Re: [Pacemaker] About behavior in Action Lost.

2010-10-12 Thread Keisuke MORI
2010/10/7 Andrew Beekhof and...@beekhof.net:
 On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI keisuke.mori...@gmail.com 
 wrote:
 Andrew,

 2010/9/23 Andrew Beekhof and...@beekhof.net:
 Pushed as:
   http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18

 Not sure about applying to 1.0 though, it's a dramatic change in behavior.

 I would like to backport this to 1.0.
 Would you agree with this?

 I would prefer not to, but if it is important to you then I will agree.


Thank you for your ACK. It's now in 1.0.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/146e405c1afa

-- 
Keisuke MORI

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip

2010-10-12 Thread Keisuke MORI
2010/10/9 Andrew Beekhof and...@beekhof.net:
 On Fri, Oct 8, 2010 at 4:17 PM, Lars Ellenberg
 lars.ellenb...@linbit.com wrote:
 On Wed, Oct 06, 2010 at 06:29:12PM +0900, Keisuke MORI wrote:
 2010/10/6 Andrew Beekhof and...@beekhof.net:
  Are there more changesets
  that need to be backported regarding this issue?
 
  There is now that Andreas brought the problem to my attention :-)
    http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe
 
  If not, I think that the Andreas' patch should be applied to 1.0.
  It seems to me that the patch is sane as it would restore the old
  behavior for the stop operation with having the resource attributes as
  the first patch intended.
 
  See the comment in the above patch. Andreas' original patch wouldn't
  have worked if the resource definition changed.

 I see, I will backport this to 1.0 too.

Done.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0d019d9e9c61



 May I take the opportunity to point you to
 http://hg.clusterlabs.org/pacemaker/1.1/rev/3f8df3dfb328

 ACK, no objection to this being backported :-)

Also done, along with a minor compilation fix.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/70438ddd4351
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0a40fd0cb9f2

-- 
Keisuke MORI



Re: [Pacemaker] [GUI]Compatibility issues of Python.

2010-10-12 Thread Yan Gao
Hi Hideo,

On 10/12/10 11:01, renayama19661...@ybb.ne.jp wrote:
 Hi Yan,
 
 I checked the Japanese display of the GUI. (Pacemaker-Python-GUI-16a7d8a5d3eb)
 
 There was no problem with the Japanese display or translation.
 
 However, the name of the msgbox function seems to be wrong.
 I attached a patch for haclient.py.in.
They have existed for a long time...Thanks for finding them!

 
 I attach the ja.po file which I checked.
 
 It is the same ja.po file that was attached to the following email.
 
  * http://www.gossamer-threads.com/lists/linuxha/pacemaker/67046
Appreciate your good work!

Pushed them:
http://hg.clusterlabs.org/pacemaker/pygui/rev/9920f30d364c
http://hg.clusterlabs.org/pacemaker/pygui/rev/af237f362f13

Regards,
  Yan
-- 
Yan Gao y...@novell.com
Software Engineer
China Server Team, OPS Engineering, Novell, Inc.



Re: [Pacemaker] [GUI]Compatibility issues of Python.

2010-10-12 Thread renayama19661014
Hi Yan,

   * http://www.gossamer-threads.com/lists/linuxha/pacemaker/67046
 Appreciate your good work!

Thanks!

 Pushed them:
 http://hg.clusterlabs.org/pacemaker/pygui/rev/9920f30d364c
 http://hg.clusterlabs.org/pacemaker/pygui/rev/af237f362f13

Thank you for updating the GUI repository.

Best Regards,
Hideo Yamauchi.




Re: [Pacemaker] Cluster failure with mod_security using rotatelogs

2010-10-12 Thread Markus Schlup
-- Your mail regarding  Re: [Pacemaker] Cluster failure with mod_security 
using rotatelogs 

 On 10/11/2010 at 10:17 AM, Markus Schlup mar...@qbik.ch wrote: 
  Hi all
   
  I'm running a cluster-based Apache reverse proxy with the mod_security  
  module. I would like to rotate the logfiles with rotatelogs as follows: 
   
  CustomLog "|/usr/sbin/rotatelogs -l /var/log/httpd/access_log.%Y-%m-%d 86400" common
   
  And especially the mod_security log with 
   
  SecAuditLog "|/usr/sbin/rotatelogs -l /var/log/httpd/modsec_audit_log.%Y-%m-%d 86400"
   
  As soon as I change the mod_security log to this (instead of just using  
  SecAuditLog /var/log/httpd/modsec_audit_log) the resource does not  
  start anymore. 
   
  When trying to debug and start the apache resource by hand with 
   
  OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/httpd/conf/httpd.conf  
  OCF_RESKEY_statusurl=http://localhost:80/server-status sh -x  
  /usr/lib/ocf/resource.d/heartbeat/apache start 
   
  it stops after 
   
  ... 
  + for p in '$PORT' '$Port' 80 
  + CheckPort 80 
  + ocf_is_decimal 80 
  + case $1 in 
  + true 
  + '[' 80 -gt 0 ']' 
  + PORT=80 
  + break 
  + echo 127.0.0.1:80 
  + grep : 
  + '[' Xhttp://localhost:80/server-status = X ']' 
  + test /etc/httpd/run/httpd.pid 
  + : OK 
  + case $COMMAND in 
  + start_apache 
  + silent_status 
  + '[' -f /etc/httpd/run/httpd.pid ']' 
  + : No pid file 
  + false 
  + ocf_run /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf 
  ++ /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf 
   
  The resource is in fact started, but the command does not finish, so I
  guess that's the reason why the cluster fails in this setup. Strangely
  enough, using the rotatelogs directives for the Apache error and access
  logs is not an issue and works as expected.
   
  Does someone know how to fix that problem?
 
 I've not seen that before, but, just to rule out one possibility...  What
 happens if you just run:
 
   /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf
 
 Does that ever return?  If no, I'd suggest apache is broken.  If yes,
 I'd start pointing my finger towards ocf_run or the RA.
 
 HTH,
 
 Tim
 

Apache returns as expected.

Regards
Markus



Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip

2010-10-12 Thread Andrew Beekhof
On Tue, Oct 12, 2010 at 9:34 AM, Keisuke MORI keisuke.mori...@gmail.com wrote:
 2010/10/9 Andrew Beekhof and...@beekhof.net:
 On Fri, Oct 8, 2010 at 4:17 PM, Lars Ellenberg
 lars.ellenb...@linbit.com wrote:
 On Wed, Oct 06, 2010 at 06:29:12PM +0900, Keisuke MORI wrote:
 2010/10/6 Andrew Beekhof and...@beekhof.net:
  Are there more changesets
  that need to be backported regarding this issue?
 
  There is now that Andreas brought the problem to my attention :-)
    http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe
 
  If not, I think that the Andreas' patch should be applied to 1.0.
  It seems to me that the patch is sane as it would restore the old
  behavior for the stop operation with having the resource attributes as
  the first patch intended.
 
  See the comment in the above patch. Andreas' original patch wouldn't
  have worked if the resource definition changed.

 I see, I will backport this to 1.0 too.

 Done.
 http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0d019d9e9c61



 May I take the opportunity to point you to
 http://hg.clusterlabs.org/pacemaker/1.1/rev/3f8df3dfb328

 ACK, no objection to this being backported :-)

 Also done, along with a minor compilation fix.
 http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/70438ddd4351
 http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0a40fd0cb9f2

Great work!  I hope to be able to start 1.0 testing later this week.



[Pacemaker] 1st monitor is too fast after the start

2010-10-12 Thread Pavlos Parissis
Hi,

I noticed a race condition while I was integrating an application with
Pacemaker and thought I would share it with you.

The init script of the application is LSB-compliant and passes the
tests mentioned in the Pacemaker documentation. Moreover, the init
script uses the functions supplied by the system[1] for starting, stopping
and checking the application.

I observed a few times that the monitor action failed after the
startup of the cluster or the movement of the resource group.
Because it did not always happen and a manual start/status always
worked, it was quite tricky to find the root cause of the failure.
After a few hours of troubleshooting, I found out that the 1st monitor
action after the start action was executed too soon for the
application to have created its pid file. As a result the monitor action
received an error.

I know it sounds a bit strange but it happened on my systems. The fact
that my systems are basically vmware images on a laptop could have a
relation with the issue.

Nevertheless, I would like to ask if you are thinking of implementing an
init_wait on the 1st monitor action. It could be useful.

To solve my issue I put a sleep after the start of the application in
the init script. This gives the application enough time to create
its pid file, so the 1st monitor doesn't fail.
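
A fixed sleep could also be replaced by polling for the pid file with an upper bound, so the start action returns as soon as the file appears. The following is only a sketch under assumptions: wait_for_pidfile is a hypothetical helper, not part of the CentOS init functions or of Pacemaker, and the pid file path in the usage note is made up.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not an existing init function):
# poll for a pid file instead of sleeping for a fixed time.
# Usage: wait_for_pidfile PIDFILE [TIMEOUT_SECONDS]
# Returns 0 as soon as PIDFILE exists and is non-empty, 1 on timeout.
wait_for_pidfile() {
    pidfile=$1
    timeout=${2:-5}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if [ -s "$pidfile" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    # One last check after the final sleep.
    [ -s "$pidfile" ]
}
```

At the end of the init script's start function this might be called as "wait_for_pidfile /var/run/myapp.pid 5 || RETVAL=1", so that start only reports success once the pid file the monitor relies on is really there.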


Cheers,
Pavlos


[1] CentOS 5.4



[Pacemaker] reboot -v 100 constantly showing up in the 'standby' /var/log/cluster/corosync.log file

2010-10-12 Thread Mike A Meyer
We are running an Active/Passive two-node cluster with Pacemaker/Corosync,
and on the 'standby' node we are constantly seeing this message in the
/var/log/cluster/corosync.log file.

Oct 12 09:11:30 qa-magdb2 crm_attribute: [12846]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
Oct 12 09:11:45 qa-magdb2 crm_attribute: [12874]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
Oct 12 09:12:00 qa-magdb2 crm_attribute: [12917]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
Oct 12 09:12:15 qa-magdb2 crm_attribute: [12949]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100

We are wondering what all of this means, and in particular what
'reboot -v 100' means. We are not using STONITH (I have seen references
to STONITH setups and 'reboot'), and we don't remember setting any value
for reboot or '100'. Everything seems to be working fine and fails over
when we need it to; we are just curious what the messages above mean.
Any help would be greatly appreciated.
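
For readers puzzling over the same log line: going by crm_attribute's option help, -N names the node, -n the attribute, -l the attribute's lifetime and -v the value to set. So 'reboot' here is most likely an attribute lifetime (the value is transient and not kept across a restart of the node) rather than a reboot request, and 100 is simply the value written. The sketch below only splits the logged arguments with getopts; describe_crm_attribute is a hypothetical helper written for illustration and never contacts the cluster.

```shell
#!/bin/sh
# Hypothetical helper (an assumption for illustration): label the options
# of a logged crm_attribute call. It only parses its arguments.
describe_crm_attribute() {
    OPTIND=1
    node= name= lifetime= value=
    while getopts "N:n:l:v:" opt; do
        case $opt in
            N) node=$OPTARG ;;      # node the attribute belongs to
            n) name=$OPTARG ;;      # attribute name
            l) lifetime=$OPTARG ;;  # attribute lifetime
            v) value=$OPTARG ;;     # value being set
            *) return 1 ;;
        esac
    done
    echo "node=$node attribute=$name lifetime=$lifetime value=$value"
}

describe_crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
```

An attribute named master-mysql_drbd:1 that is refreshed every 15 seconds would be consistent with the drbd resource agent recording its promotion preference on each monitor run (the mysql_drbd monitor interval in the configuration is 15s), but that reading is an inference from the log, not something the thread confirms.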

Our configuration file is below.

node qa-magdb1
node qa-magdb2
primitive email_notify ocf:heartbeat:MailTo \
        params email="test...@test.com" subject="DRBD/Pacemaker FAILOVER!!!" \
        op monitor interval="10" timeout="10" depth="0"
primitive mysql_drbd ocf:linbit:drbd \
        params drbd_resource="mysql" \
        op monitor interval="15s"
primitive mysql_fs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/mnt/mysql/" fstype="ext3"
primitive mysql_service ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/mnt/mysql/my.cnf" datadir="/mnt/mysql/data" pid="/var/run/mysql/mysqld.pid" socket="/var/run/mysql/mysql.sock" test_passwd="testingit" test_table="mysql.user" test_user="root" \
        op monitor interval="20s" timeout="10s" \
        meta migration-threshold="10" target-role="Started"
primitive mysql_vip ocf:heartbeat:IPaddr2 \
        params ip="172.26.76.100" nic="eth0"
group mysql mysql_fs mysql_vip mysql_service email_notify
ms ms_mysql_drbd mysql_drbd \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location primary_mysql mysql 10: qa-magdb1
location primary_mysql_drbd ms_mysql_drbd 10: qa-magdb1
location standby_mysql mysql 5: qa-magdb2
location standby_mysql_drbd ms_mysql_drbd 5: qa-magdb2
colocation mysql_on_drbd inf: mysql ms_mysql_drbd:Master
order mysql_after_drbd inf: ms_mysql_drbd:promote mysql:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        cluster-recheck-interval="0"

Thanks,
Mike



[Pacemaker] sshd under cluster

2010-10-12 Thread Pavlos Parissis
Hi,

I was asked to place the sshd daemon under cluster control and, because I
faced a few challenges, I thought I would share them with you.

The 1st challenge was to clone the sshd daemon, its init script and its
configuration. The procedure is at the bottom of this mail.

The 2nd challenge was the init script of sshd in CentOS. It has 2
issues. The 1st issue was that it failed test 6 mentioned here
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html.

The 2nd issue was that during shutdown or reboot of the cluster node,
the stop action on the resource received return code 143 from the init
script and the whole shutdown/reboot process was stuck for a few minutes.
The root cause was the killall command called by the init script. The
init script calls killall, only on shutdown or reboot, to close any open
connections. But that call was also killing the script itself! Because of
that the cluster got an error on the stop action, and the lock file of
sshd was not removed either. You can imagine the consequences.

For both issues I filed a bug report and hacked the init script in
order to have a short-term resolution.

The last challenge was related to a mail I sent a few hours ago. The 1st
monitor action after the start action ran too soon and sshd didn't
have enough time to create its pid file. As a result the monitor
thought that sshd was down when it wasn't.
A sleep 1 after the start function in the init script solved the issue.

Cheers,
Pavlos

Clone SSH for pbx_0N
Prerequisite: the default sshd must listen only on the node's IP and not on all IPs.

cp -p /etc/init.d/sshd /etc/init.d/sshd-pbx_02

cp -p /etc/pam.d/sshd /etc/pam.d/sshd-pbx_02 # optional: only needed if UsePAM is enabled (true by default on RH)

ln -s /usr/sbin/sshd /usr/sbin/sshd-pbx_02

touch /etc/sysconfig/sshd-pbx_02
echo 'OPTIONS="-f /etc/ssh/sshd_config-pbx_02"' > /etc/sysconfig/sshd-pbx_02

cp -p /etc/ssh/sshd_config /etc/ssh/sshd_config-pbx_02

[r...@node-02 ~]# diff -wu /etc/init.d/sshd /etc/init.d/sshd-pbx_02
--- /etc/init.d/sshd2009-09-03 20:12:38.0 +0200
+++ /etc/init.d/sshd-pbx_02 2010-10-12 12:25:50.0 +0200
@@ -1,33 +1,33 @@
-#!/bin/bash
+#!/bin/bash -x
 #
-# Init file for OpenSSH server daemon
+# Init file for OpenSSH server daemon used by pbx_02
 #
 # chkconfig: 2345 55 25
-# description: OpenSSH server daemon
+# description: OpenSSH server daemon for pbx_02
 #
-# processname: sshd
-# config: /etc/ssh/ssh_host_key
-# config: /etc/ssh/ssh_host_key.pub
+# processname: sshd-pbx_02
+# config: /etc/ssh/ssh_host_key-pbx_02
+# config: /etc/ssh/ssh_host_key-pbx_02.pub
 # config: /etc/ssh/ssh_random_seed
-# config: /etc/ssh/sshd_config
-# pidfile: /var/run/sshd.pid
+# config: /etc/ssh/sshd_config-pbx_02
+# pidfile: /var/run/sshd-pbx_02.pid

 # source function library
 . /etc/rc.d/init.d/functions

 # pull in sysconfig settings
-[ -f /etc/sysconfig/sshd ] && . /etc/sysconfig/sshd
+[ -f /etc/sysconfig/sshd-pbx_02 ] && . /etc/sysconfig/sshd-pbx_02

 RETVAL=0
-prog=sshd
+prog=sshd-pbx_02

 # Some functions to make the below more readable
 KEYGEN=/usr/bin/ssh-keygen
-SSHD=/usr/sbin/sshd
-RSA1_KEY=/etc/ssh/ssh_host_key
-RSA_KEY=/etc/ssh/ssh_host_rsa_key
-DSA_KEY=/etc/ssh/ssh_host_dsa_key
-PID_FILE=/var/run/sshd.pid
+SSHD=/usr/sbin/sshd-pbx_02
+RSA1_KEY=/etc/ssh/ssh_host_key-pbx_02
+RSA_KEY=/etc/ssh/ssh_host_rsa_key-pbx_02
+DSA_KEY=/etc/ssh/ssh_host_dsa_key-pbx_02
+PID_FILE=/var/run/sshd-pbx_02.pid

 runlevel=$(set -- $(runlevel); eval echo \$$# )

@@ -110,7 +110,11 @@
        echo -n $"Starting $prog: "
        $SSHD $OPTIONS && success || failure
        RETVAL=$?
-       [ $RETVAL = 0 ] && touch /var/lock/subsys/sshd
+       [ $RETVAL = 0 ] && touch /var/lock/subsys/sshd-pbx_02
+# to avoid a race condition, 1st cluster monitor after start fails
+# because the pid file is not created yet. A delay of a few msecs in the
+# creation of the pid file is enough to cause issues.
+sleep 1
        echo
 }

@@ -119,16 +123,25 @@
        echo -n $"Stopping $prog: "
        if [ -n "`pidfileofproc $SSHD`" ] ; then
            killproc $SSHD
+       elif [ -z "`pidfileofproc $SSHD`" ] && [ ! -f /var/lock/subsys/sshd-pbx_02 ] ; then
+           success
+           RETVAL=0
        else
            failure $"Stopping $prog"
        fi
        RETVAL=$?
+
+### Added by Pavlos Parissis ###
+# Disable the below bit because killall kills the script itself.
+# This causes problems within the cluster, shutdown of a node fails.
+# Any open connections will be killed by /etc/init.d/halt anyways
+
        # if we are in halt or reboot runlevel kill all running sessions
        # so the TCP connections are closed cleanly
-       if [ x$runlevel = x0 -o x$runlevel = x6 ] ; then
-           killall $prog 2>/dev/null
-       fi
-       [ $RETVAL = 0 ] && rm -f /var/lock/subsys/sshd
+       #if [ x$runlevel = x0 -o x$runlevel = x6 ] ; then
+   #killall $prog 

Re: [Pacemaker] Migrate resources based on connectivity

2010-10-12 Thread Dan Frincu

Hi,

Lars Ellenberg wrote:

On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote:
  

Hi,

Dejan Muhamedagic wrote:


Hi,

On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:
  

Hi,

I have the following setup:
- order drbd0:promote drbd1:promote
- order drbd1:promote drbd2:promote
- order drbd2:promote all:start
- collocation all drbd2:Master
- all is a group of resources, drbd{0..3} are drbd ms resources.

I want to migrate the resources based on ping connectivity to a
default gateway. Based on 
http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks
and http://www.clusterlabs.org/wiki/Example_configurations I've
tried the following:
- primitive ping ocf:pacemaker:ping params host_list=1.2.3.4
multiplier=100 op monitor interval=5s timeout=5s
- clone ping_clone ping meta globally-unique=false
- location ping_nok all \
  rule $id=ping_nok-rule -inf: not_defined ping_clone or
ping_clone number:lte 0


Use pingd to reference the attribute in the location constraint.
  

Not to be disrespectful, but after 3 days being stuck on this issue,
I don't exactly understand how to do that. Could you please provide
an example.

Thank you in advance.



The example you reference lists:

primitive pingdnet1 ocf:pacemaker:pingd \
params host_list=192.168.23.1 \
name=pingdnet1
^^

clone cl-pingdnet1 pingdnet1
   ^

param name default is pingd,
and is the attribute name to be used in the location constraints.

You will need to reference pingd in your location constraint, or set an
explicit name in the primitive definition, and reference that.

Your ping primitive sets the default 'pingd' attribute,
but you reference some 'ping_clone' attribute,
which apparently no-one really references.

  
I've finally managed to finish the setup with the indications received 
above, and the behavior is the expected one. I also tried 
ocf:pacemaker:pingd: although it does the reachability tests 
properly, it fails to update the cib once connectivity is restored, and I 
had to manually run attrd_updater -R to get the resources to start 
again. Therefore I'm going with ocf:pacemaker:ping.
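
As a hedged illustration of the advice above (not necessarily the exact
final configuration), the location rule from the original post could
reference the default pingd attribute instead of ping_clone. The resource
names and the host address are taken from the original post; the quoting
style and the line breaks are assumptions about crm shell formatting:

```
# Sketch only: the rule tests the node attribute written by the ping agent
# ("pingd" unless the name parameter is set), not the name of the clone.
primitive ping ocf:pacemaker:ping \
        params host_list="1.2.3.4" multiplier="100" \
        op monitor interval="5s" timeout="5s"
clone ping_clone ping \
        meta globally-unique="false"
location ping_nok all \
        rule $id="ping_nok-rule" -inf: not_defined pingd or pingd number:lte 0
```

Setting an explicit name on the primitive and referencing that name in the
rule, as described above, should work the same way.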


Anyways, Dejan, Lars, Andrew, thank you all very much for your help.

Best regards,

Dan

--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker


Re: [Pacemaker] problem about move node from one cluster to another cluster

2010-10-12 Thread Dan Frincu

Hi,

Depending on the openais version (please mention it), this behavior can 
happen. I've seen it as well, on openais-0.8.0. What I did to fix it 
was to restart the openais process via /etc/init.d/openais restart, and 
then it worked. However, this was one of the reasons I updated the 
packages to the latest versions of corosync, pacemaker, etc. The tricky 
part was doing the migration procedure for upgrading production servers 
without service downtime, but that's another story.


Regards,

Dan

jiaju liu wrote:




Message: 2
Date: Tue, 12 Oct 2010 10:40:18 +0800 (CST)
From: jiaju liu liujiaj...@yahoo.com.cn
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] problem about move node from one cluster to
another cluster
Message-ID: 765547.4759...@web15704.mail.cnb.yahoo.com
Content-Type: text/plain; charset=iso-8859-1

Hi everybody,
I first run "service openais stop" to stop the openais service,
then run "rm -rf /var/lib/heartbeat/crm/*" to clear all
information. Then I change the multicast address and run "service
openais start" in the other cluster.
The problem is that sometimes it works and I can use the crm_mon command,
and sometimes it doesn't. When I check with "service openais status",
it shows Running, but I cannot use crm_mon to connect to the
cluster.
I thought the reason might be that the directory /var/lib/heartbeat/crm/ is
empty. Why? If I reboot, it works again. Why?

Now, even when the directory is not empty, it sometimes does not
work.
When I use crm_mon it shows
Attempting connection to the cluster..

When I use crm node list it shows

Signon to CIB failed: connection failed
Init failed, could not perform requested operations
ERROR: cannot parse output of cibadmin -Ql -o nodes: no element
found: line 1, column 0


--
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania



Re: [Pacemaker] Migrate resources based on connectivity

2010-10-12 Thread Pavlos Parissis
On 12 October 2010 20:00, Dan Frincu dfri...@streamwide.ro wrote:
 Hi,

 Lars Ellenberg wrote:

 On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote:


 Hi,

 Dejan Muhamedagic wrote:


 Hi,

 On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:


 Hi,

 I have the following setup:
 - order drbd0:promote drbd1:promote
 - order drbd1:promote drbd2:promote
 - order drbd2:promote all:start
 - collocation all drbd2:Master
 - all is a group of resources, drbd{0..3} are drbd ms resources.

 I want to migrate the resources based on ping connectivity to a
 default gateway. Based on
 http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks
 and http://www.clusterlabs.org/wiki/Example_configurations I've
 tried the following:
 - primitive ping ocf:pacemaker:ping params host_list=1.2.3.4
 multiplier=100 op monitor interval=5s timeout=5s
 - clone ping_clone ping meta globally-unique=false
 - location ping_nok all \
   rule $id=ping_nok-rule -inf: not_defined ping_clone or
 ping_clone number:lte 0


 Use pingd to reference the attribute in the location constraint.


 Not to be disrespectful, but after 3 days being stuck on this issue,
 I don't exactly understand how to do that. Could you please provide
 an example.

 Thank you in advance.


 The example you reference lists:

   primitive pingdnet1 ocf:pacemaker:pingd \
   params host_list=192.168.23.1 \
   name=pingdnet1
   ^^

   clone cl-pingdnet1 pingdnet1
  ^

 The name parameter defaults to pingd,
 and that is the attribute name to be used in the location constraints.

 You will need to reference pingd in your location constraint, or set an
 explicit name in the primitive definition, and reference that.

 Your ping primitive sets the default 'pingd' attribute,
 but you reference some 'ping_clone' attribute,
 which apparently no-one really references.



 I've finally managed to finish the setup with the indications received
 above, and the behavior is the expected one. I also tried
 ocf:pacemaker:pingd: although it does the reachability tests properly,
 it fails to update the cib once connectivity is restored, and I had to
 manually run attrd_updater -R to get the resources to start again.
 Therefore I'm going with ocf:pacemaker:ping.

It would be quite useful for everyone else if you posted your final,
working configuration.
Cheers,
Pavlos



[Pacemaker] what is the meaning of stand_alone_ping

2010-10-12 Thread jiaju liu
The screen says:
node1 pingd: [1927]: info: stand_alone_ping: Node 192.168.10.110 is unreachable (read)
and the node could not start.
 

