Re: [Pacemaker] [Problem]The monitor that start-delay is long does not stop.
Hi Andrew,

Funnily enough I was just looking at that message and saw that the code relevant to this one looked wrong too. I believe this should fix the issue:
http://hg.clusterlabs.org/pacemaker/1.1/rev/e06810256413

I registered log and more with Bugzilla.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2505

Oops, I didn't see that. I should have included the bug number in the commit :-(

ok. I confirm your revision.

Confirmation has been late. I confirmed that a problem was solved in your revision. In addition, I added a similar revision for 1.0 and confirmed that a problem was broken off. I added comment to Bugzilla.

Best Regards,
Hideo Yamauchi.

--- renayama19661...@ybb.ne.jp wrote:

Hi Andrew,

Thank you for comment.

Funnily enough I was just looking at that message and saw that the code relevant to this one looked wrong too. I believe this should fix the issue:
http://hg.clusterlabs.org/pacemaker/1.1/rev/e06810256413

I registered log and more with Bugzilla.
 * http://developerbugs.linux-foundation.org/show_bug.cgi?id=2505

Oops, I didn't see that. I should have included the bug number in the commit :-(

ok. I confirm your revision.

Best Regards,
Hideo Yamauchi.

--- Andrew Beekhof and...@beekhof.net wrote:

On Thu, Oct 7, 2010 at 8:39 AM, renayama19661...@ybb.ne.jp wrote:

Hi,

I operated the next to confirm the contribution of the mailing list.
 * http://www.gossamer-threads.com/lists/linuxha/pacemaker/66939

Step1) I prepare cib.xml having monitor which set start-delay than five minutes..
Step2) I start two nodes and send cib.

Last updated: Thu Oct  7 14:58:09 2010
Stack: Heartbeat
Current DC: srv02 (1f8dd092-d82b-47eb-86c4-e011a2cd11b3) - partition WITHOUT quorum
Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
2 Nodes configured, unknown expected votes
1 Resources configured.
Online: [ srv01 srv02 ]

 Resource Group: grpDummy
     prmFsPostgreSQLDB1-3       (ocf::heartbeat:Dummy): Started srv01
     prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv01

Step3) I cause monitor errors of the resources in succession.

Last updated: Thu Oct  7 15:20:01 2010
Stack: Heartbeat
Current DC: srv02 (d3fe8b08-20d9-4990-aebb-56a0675af5bd) - partition WITHOUT quorum
Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
2 Nodes configured, unknown expected votes
1 Resources configured.

Online: [ srv01 srv02 ]

 Resource Group: grpDummy
     prmFsPostgreSQLDB1-3       (ocf::heartbeat:Dummy): Started srv02
     prmIpPostgreSQLDB2 (ocf::heartbeat:Dummy): Started srv02

Migration summary:
* Node srv02:
* Node srv01:
   prmIpPostgreSQLDB2: migration-threshold=1 fail-count=1
   prmFsPostgreSQLDB1-3: migration-threshold=1 fail-count=1

Failed actions:
    prmIpPostgreSQLDB2_monitor_6 (node=srv01, call=7, rc=7, status=complete): not running
    prmFsPostgreSQLDB1-3_monitor_3 (node=srv01, call=5, rc=7, status=complete): not running

Step4) The resources fail over to node srv02, but the monitors on srv01 do not stop.
[r...@srv01 ~]# !tail
tail -f /var/log/ha-log
Oct  7 15:27:27 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
Oct  7 15:27:27 srv01 Dummy[16572]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
Oct  7 15:27:58 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
Oct  7 15:27:58 srv01 Dummy[16594]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7
Oct  7 15:27:59 srv01 lrmd: [15792]: debug: rsc:prmIpPostgreSQLDB2:8: monitor
Oct  7 15:27:59 srv01 Dummy[16601]: DEBUG: prmIpPostgreSQLDB2 monitor : 7
Oct  7 15:27:59 srv01 lrmd: [15792]: debug: rsc:prmIpPostgreSQLDB2:7: monitor
Oct  7 15:27:59 srv01 Dummy[16608]: DEBUG: prmIpPostgreSQLDB2 monitor : 7
Oct  7 15:28:28 srv01 lrmd: [15792]: debug: rsc:prmFsPostgreSQLDB1-3:5: monitor
Oct  7 15:28:28 srv01 Dummy[16628]: DEBUG: prmFsPostgreSQLDB1-3 monitor : 7

Step5) Afterwards the fail-count increases strangely.

Last updated: Thu Oct  7 15:31:21 2010
Stack: Heartbeat
Current DC: srv02 (d3fe8b08-20d9-4990-aebb-56a0675af5bd) - partition WITHOUT quorum
Version: 1.0.9-860b32388908c6a345786d4ecd2e2a3bec780dd2
2 Nodes configured, unknown expected votes
1 Resources configured.

Online: [ srv01 srv02 ]

 Resource Group: grpDummy
     prmFsPostgreSQLDB1-3
Re: [Pacemaker] stonith pacemaker problem
12.10.2010 07:25, Andrew Beekhof wrote:
On Mon, Oct 11, 2010 at 9:51 PM, Vladislav Bogdanov bub...@hoster-ok.com wrote:
11.10.2010 09:14, Andrew Beekhof wrote:

strictly speaking you don't. but at least on fedora, the policy is that $x-libs always requires $x so just building against heartbeat-libs means that yum will suck in the main heartbeat package :-(

And this seems to be a bit of an incorrect statement btw:

no, you're wrong sorry.

From: http://fedoraproject.org/wiki/Packaging/ReviewGuidelines
SHOULD: Usually, subpackages other than devel should require the base package using a fully versioned dependency.
http://fedoraproject.org/wiki/Packaging/Guidelines#RequiringBasePackage

In any case, it's not something Pacemaker has control over.

I think this is where the packaging guidelines seem to be incomplete. Frankly speaking, -libs is not a subpackage in the general meaning, but rather a superpackage. The base package almost always requires -libs. That means that -libs is kind of a special case. The guidelines are valid for subpackages like modules, -data, -docs, -servers, -clients, whatever else.

You can run (one line):

rpm -qa | grep -- -libs | grep -v -- -devel | while read rpm ; do echo "$rpm:"; rpm -q --requires "$rpm" ; done | grep -Ev '^(lib|rpmlib|rtld|/)'

and you'll see a very small number of -libs packages which actually require the base package. So the majority of Fedora packagers either do not follow the guidelines or realize that -libs is a different story.

Actually, what is the hidden meaning of splitting a package into base and -libs if -libs depends on base? The main idea of such a split is to provide a way to have shared libraries installed where the main package (together with all its dependencies) is not needed.

And I agree, this is for another mailing list anyways. Falling silent...

usually application (binary) requires some libraries, and some of those libraries are provided by the -libs package which is built together with the binary.
But libraries themselves very rarely require something from the main package. Those rare cases are configuration files which are read from inside the libraries without a direct request from an application. And even in that case those configuration files are (should be) provided by a -common subpackage (which -libs can depend on). The only point in such requirements is the licenses, which are usually included in main packages. But from my point of view nothing prevents a packager from including the license file in the %doc stanza for -libs too, so any 'reverse' dependencies could easily be avoided, leaving only the 'straight' ones: what the libraries actually depend on.

This is what surprises me about corosync, openais and pacemaker: I need to install the corosync and openais packages on a development host only because I need the corresponding -libs and -devel packages. This is actually not usual for Fedora, and it is really not needed. The main idea of -libs is to provide DSOs which can be used by other applications without the need to install the 'main' package (together with all daemons, initscripts and dependencies on other libs). The same goes for -devel: it really needs -libs because it provides the .so symlinks to the libs for ld, but it shouldn't depend on the main application.

Best,
Vladislav

glad you found a path forward though

understand that /usr/lib/ocf/resource.d/heartbeat has ocf scripts provided by heartbeat but that can be part of the Reusable cluster agents subsystem. Frankly I thought the way I had installed the system by erasing and installing the fresh packages it should have worked. But all said and done I learned a lot of cluster code by gdbing it. I'll be having a peaceful thanksgiving. Thanks and happy thanks giving.

Shravan

On Sun, Oct 10, 2010 at 2:46 PM, Andrew Beekhof and...@beekhof.net wrote:

Not enough information. We'd need more than just the lrmd's logs, they only show what happened not why.
On Thu, Oct 7, 2010 at 11:02 PM, Shravan Mishra shravan.mis...@gmail.com wrote:

Hi,

Description of my environment:
corosync=1.2.8
pacemaker=1.1.3
Linux= 2.6.29.6-0.6.smp.gcc4.1.x86_64 #1 SMP

We are having a problem with our pacemaker, which is continuously canceling the monitoring operation of our stonith devices.

We ran:

stonith -d -t external/safe/ipmi hostname=ha2.itactics.com ipaddr=192.168.2.7 userid=hellouser passwd=hello interface=lanplus -S

Its output is attached as stonith.output.

We have been trying to debug this issue for a few days now with no success. We are hoping that someone can help us, as we are under immense pressure to move to RCS unless we can solve this issue in a day or two, which I personally don't want to do because we like the product.

Any help will be greatly appreciated.

Here is an excerpt from the /var/log/messages:
=
Oct 7 16:58:29 ha1 lrmd: [3581]: info: rsc:ha2.itactics.com-stonith:11155: start
Oct 7 16:58:29 ha1 lrmd: [3581]: info:
Re: [Pacemaker] About behavior in Action Lost.
2010/10/7 Andrew Beekhof and...@beekhof.net: On Thu, Oct 7, 2010 at 11:48 AM, Keisuke MORI keisuke.mori...@gmail.com wrote: Andrew, 2010/9/23 Andrew Beekhof and...@beekhof.net: Pushed as: http://hg.clusterlabs.org/pacemaker/1.1/rev/8433015faf18 Not sure about applying to 1.0 though, its a dramatic change in behavior. I would like to backport this to 1.0. Would you agree with this? I would prefer not to, but if it is important to you then I will agree. Thank you for your ACK. It's now in 1.0. http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/146e405c1afa -- Keisuke MORI ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip
2010/10/9 Andrew Beekhof and...@beekhof.net:
On Fri, Oct 8, 2010 at 4:17 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
On Wed, Oct 06, 2010 at 06:29:12PM +0900, Keisuke MORI wrote:
2010/10/6 Andrew Beekhof and...@beekhof.net:

Are there more changesets that need to be backported regarding this issue?

There is now that Andreas brought the problem to my attention :-)
http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe

If not, I think that Andreas' patch should be applied to 1.0. It seems to me that the patch is sane, as it would restore the old behavior for the stop operation while keeping the resource attributes as the first patch intended.

See the comment in the above patch. Andreas' original patch wouldn't have worked if the resource definition changed.

I see, I will backport this to 1.0 too.

Done.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0d019d9e9c61

May I take the opportunity to point you to
http://hg.clusterlabs.org/pacemaker/1.1/rev/3f8df3dfb328

ACK, no objection to this being backported :-)

Also done, along with a minor compilation fix.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/70438ddd4351
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0a40fd0cb9f2

-- 
Keisuke MORI
Re: [Pacemaker] [GUI]Compatibility issues of Python.
Hi Hideo,

On 10/12/10 11:01, renayama19661...@ybb.ne.jp wrote:

Hi Yan,

I confirmed Japanese indication of GUI. (Pacemaker-Python-GUI-16a7d8a5d3eb)
There was not the problem for Japanese display and translation.
However, the name of the msgbox function seems to be wrong.
I attached a patch of haclient.py.in.

They have existed for a long time...Thanks for finding them!

I attach the ja.po file which I confirmed.
It is the same thing that the ja.po file attached it to the next email.
 * http://www.gossamer-threads.com/lists/linuxha/pacemaker/67046

Appreciate your good work! Pushed them:
http://hg.clusterlabs.org/pacemaker/pygui/rev/9920f30d364c
http://hg.clusterlabs.org/pacemaker/pygui/rev/af237f362f13

Regards,
Yan
-- 
Yan Gao y...@novell.com
Software Engineer
China Server Team, OPS Engineering, Novell, Inc.
Re: [Pacemaker] [GUI]Compatibility issues of Python.
Hi Yan,

 * http://www.gossamer-threads.com/lists/linuxha/pacemaker/67046

Appreciate your good work!

Thanks!

Pushed them:
http://hg.clusterlabs.org/pacemaker/pygui/rev/9920f30d364c
http://hg.clusterlabs.org/pacemaker/pygui/rev/af237f362f13

Thank you for revision of hg-GUI.

Best Regards,
Hideo Yamauchi.

--- Yan Gao y...@novell.com wrote:

Hi Hideo,

On 10/12/10 11:01, renayama19661...@ybb.ne.jp wrote:

Hi Yan,

I confirmed Japanese indication of GUI. (Pacemaker-Python-GUI-16a7d8a5d3eb)
There was not the problem for Japanese display and translation.
However, the name of the msgbox function seems to be wrong.
I attached a patch of haclient.py.in.

They have existed for a long time...Thanks for finding them!

I attach the ja.po file which I confirmed.
It is the same thing that the ja.po file attached it to the next email.
 * http://www.gossamer-threads.com/lists/linuxha/pacemaker/67046

Appreciate your good work! Pushed them:
http://hg.clusterlabs.org/pacemaker/pygui/rev/9920f30d364c
http://hg.clusterlabs.org/pacemaker/pygui/rev/af237f362f13

Regards,
Yan
-- 
Yan Gao y...@novell.com
Software Engineer
China Server Team, OPS Engineering, Novell, Inc.
Re: [Pacemaker] Cluster failure with mod_security using rotatelogs
-- Your mail regarding Re: [Pacemaker] Cluster failure with mod_security using rotatelogs

On 10/11/2010 at 10:17 AM, Markus Schlup mar...@qbik.ch wrote:

Hi all

I'm running a cluster-based Apache reverse proxy with the mod_security module. I would like to rotate the logfiles with rotatelogs as follows:

CustomLog "|/usr/sbin/rotatelogs -l /var/log/httpd/access_log.%Y-%m-%d 86400" common

And especially the mod_security log with

SecAuditLog "|/usr/sbin/rotatelogs -l /var/log/httpd/modsec_audit_log.%Y-%m-%d 86400"

As soon as I change the mod_security log to this (instead of just using SecAuditLog /var/log/httpd/modsec_audit_log) the resource does not start anymore. When trying to debug and start the apache resource by hand with

OCF_ROOT=/usr/lib/ocf OCF_RESKEY_configfile=/etc/httpd/conf/httpd.conf OCF_RESKEY_statusurl=http://localhost:80/server-status sh -x /usr/lib/ocf/resource.d/heartbeat/apache start

it stops after

...
+ for p in '$PORT' '$Port' 80
+ CheckPort 80
+ ocf_is_decimal 80
+ case $1 in
+ true
+ '[' 80 -gt 0 ']'
+ PORT=80
+ break
+ echo 127.0.0.1:80
+ grep :
+ '[' Xhttp://localhost:80/server-status = X ']'
+ test /etc/httpd/run/httpd.pid
+ : OK
+ case $COMMAND in
+ start_apache
+ silent_status
+ '[' -f /etc/httpd/run/httpd.pid ']'
+ : No pid file
+ false
+ ocf_run /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf
++ /usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf

The resource is in fact started but the command does not finish, so I guess that's the reason why the cluster fails in this setup. Strangely enough, using the rotatelogs directives for the Apache error and access logs is not an issue and works as expected. Does someone know how to fix that problem?

I've not seen that before, but, just to rule out one possibility... What happens if you just run:

/usr/sbin/httpd -DSTATUS -f /etc/httpd/conf/httpd.conf

Does that ever return? If no, I'd suggest apache is broken. If yes, I'd start pointing my finger towards ocf_run or the RA.
HTH,
Tim

Apache returns as expected.

Regards
Markus
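One explanation that would fit the trace above, though nobody in the thread confirms it: the `++` prefix in the `sh -x` output shows that httpd is being run inside a command substitution, and a command substitution only returns once every process holding the write end of its pipe has closed it. If the piped logger started for SecAuditLog were to inherit that descriptor (an assumption, not something verified here), the substitution would keep waiting even though httpd itself has returned. The sketch below reproduces only the generic shell behaviour, with `sleep` standing in for a long-lived logger child.

```shell
#!/bin/sh
# Generic demonstration, not the resource agent's code: "sleep 3" stands in
# for a long-lived child (for example a piped logger) started by a daemon.

# Case 1: the background child inherits the substitution's stdout, so $(...)
# cannot see end-of-file until the child exits.
start=$(date +%s)
out=$(sh -c 'sleep 3 & echo started')
held=$(( $(date +%s) - start ))

# Case 2: the child's stdout and stderr are redirected away, so $(...)
# returns as soon as the launching shell exits.
start=$(date +%s)
out2=$(sh -c 'sleep 3 >/dev/null 2>&1 & echo started')
released=$(( $(date +%s) - start ))

echo "inherited stdout: ${held}s, redirected stdout: ${released}s"
```

If this is indeed what happens, running httpd by hand from a terminal would return normally, as Markus reports, because there is no pipe for the logger to hold open.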
Re: [Pacemaker] resource stop timeout broken in 1.0 branch tip
On Tue, Oct 12, 2010 at 9:34 AM, Keisuke MORI keisuke.mori...@gmail.com wrote:
2010/10/9 Andrew Beekhof and...@beekhof.net:
On Fri, Oct 8, 2010 at 4:17 PM, Lars Ellenberg lars.ellenb...@linbit.com wrote:
On Wed, Oct 06, 2010 at 06:29:12PM +0900, Keisuke MORI wrote:
2010/10/6 Andrew Beekhof and...@beekhof.net:

Are there more changesets that need to be backported regarding this issue?

There is now that Andreas brought the problem to my attention :-)
http://hg.clusterlabs.org/pacemaker/1.1/rev/e097c70226fe

If not, I think that Andreas' patch should be applied to 1.0. It seems to me that the patch is sane, as it would restore the old behavior for the stop operation while keeping the resource attributes as the first patch intended.

See the comment in the above patch. Andreas' original patch wouldn't have worked if the resource definition changed.

I see, I will backport this to 1.0 too.

Done.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0d019d9e9c61

May I take the opportunity to point you to
http://hg.clusterlabs.org/pacemaker/1.1/rev/3f8df3dfb328

ACK, no objection to this being backported :-)

Also done, along with a minor compilation fix.
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/70438ddd4351
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/0a40fd0cb9f2

Great work! I hope to be able to start 1.0 testing later this week.
[Pacemaker] 1st monitor is too fast after the start
Hi,

I noticed a race condition while I was integrating an application with Pacemaker and thought I would share it with you.

The init script of the application is LSB-compliant and passes the tests mentioned in the Pacemaker documentation. Moreover, the init script uses the functions supplied by the system [1] for starting, stopping and checking the application.

I observed a few times that the monitor action failed after the startup of the cluster or after a move of the resource group. Because it did not always happen, and a manual start/status always worked, it was quite tricky to find the root cause of the failure. After a few hours of troubleshooting, I found out that the 1st monitor action after the start action was executed too quickly for the application to have created its pid file. As a result the monitor action received an error.

I know it sounds a bit strange, but it happened on my systems. The fact that my systems are basically VMware images on a laptop could be related to the issue.

Nevertheless, I would like to ask whether you are thinking of implementing an init_wait on the 1st monitor action. It could be useful.

To solve my issue I put a sleep after the start of the application in the init script. This gives the application enough time to create its pid file, and the 1st monitor doesn't fail.

Cheers,
Pavlos

[1] CentOS 5.4
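A variation on the fixed sleep described above is to poll for the pid file and return from start as soon as it appears. The following is only a sketch under assumptions: the helper name `wait_for_pidfile`, the 10-second default and the example path are invented for illustration and do not come from the post or from the distribution's init functions.

```shell
#!/bin/sh
# wait_for_pidfile FILE [TIMEOUT_SECONDS]
# Hypothetical helper: poll once per second until FILE exists and is
# non-empty. Returns 0 when the file shows up, 1 when the timeout expires.
wait_for_pidfile() {
    pidfile=$1
    timeout=${2:-10}
    waited=0
    while [ ! -s "$pidfile" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Possible use at the end of an init script's start function
# (the path is a placeholder):
#   wait_for_pidfile /var/run/myapp.pid 10 || RETVAL=1
```

Compared with an unconditional sleep, this returns immediately when the pid file already exists and turns a pid file that never appears into a start failure, which the cluster can then act on.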
[Pacemaker] reboot -v 100 constantly showing up in the 'standby' /var/log/cluster/corosync.log file
We are running an Active/Passive two-node cluster with Pacemaker/Corosync, and on the 'standby' node we constantly see this message in the /var/log/cluster/corosync.log file:

Oct 12 09:11:30 qa-magdb2 crm_attribute: [12846]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
Oct 12 09:11:45 qa-magdb2 crm_attribute: [12874]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
Oct 12 09:12:00 qa-magdb2 crm_attribute: [12917]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
Oct 12 09:12:15 qa-magdb2 crm_attribute: [12949]: info: Invoked: crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100

We are wondering what this all means, and what does the reboot -v 100 mean? We are not using Stonith; I have seen references to Stonith setup and 'reboot', but we don't remember setting up any value for reboot or '100'. Everything seems to be working fine and fails over when we need it to. We are just curious what the messages above mean. Any help would be greatly appreciated.

Our configuration file is below.

node qa-magdb1
node qa-magdb2
primitive email_notify ocf:heartbeat:MailTo \
        params email="test...@test.com" subject="DRBD/Pacemaker FAILOVER!!!" \
        op monitor interval="10" timeout="10" depth="0"
primitive mysql_drbd ocf:linbit:drbd \
        params drbd_resource="mysql" \
        op monitor interval="15s"
primitive mysql_fs ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/mnt/mysql/" fstype="ext3"
primitive mysql_service ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/mnt/mysql/my.cnf" datadir="/mnt/mysql/data" pid="/var/run/mysql/mysqld.pid" socket="/var/run/mysql/mysql.sock" test_passwd="testingit" test_table="mysql.user" test_user="root" \
        op monitor interval="20s" timeout="10s" \
        meta migration-threshold="10" target-role="Started"
primitive mysql_vip ocf:heartbeat:IPaddr2 \
        params ip="172.26.76.100" nic="eth0"
group mysql mysql_fs mysql_vip mysql_service email_notify
ms ms_mysql_drbd mysql_drbd \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
location primary_mysql mysql 10: qa-magdb1
location primary_mysql_drbd ms_mysql_drbd 10: qa-magdb1
location standby_mysql mysql 5: qa-magdb2
location standby_mysql_drbd ms_mysql_drbd 5: qa-magdb2
colocation mysql_on_drbd inf: mysql ms_mysql_drbd:Master
order mysql_after_drbd inf: ms_mysql_drbd:promote mysql:start
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-89bd754939df5150de7cd76835f98fe90851b677" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        cluster-recheck-interval="0"

Thanks,
Mike
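For readers with the same question, the logged command line can be read option by option: -N names the node, -n the attribute, -l the attribute's lifetime and -v the value to set. A lifetime of "reboot" marks the attribute as transient, meaning it is kept in the cluster status and does not survive a restart of the cluster on that node; it is not a request to reboot anything and has nothing to do with STONITH. Attributes named master-<resource> hold promotion scores. The 15-second spacing of the log entries matches the interval=15s monitor on mysql_drbd, so the drbd agent refreshing its score on every monitor is the likely source, although that is an inference from the configuration rather than something confirmed in this post. The sketch below only decodes such a line; the function name explain_crm_attribute is made up for illustration and is not part of Pacemaker.

```shell
#!/bin/sh
# explain_crm_attribute: hypothetical helper that splits the options of a
# logged crm_attribute invocation into labelled fields. It does not contact
# the cluster; it only parses its own arguments.
explain_crm_attribute() {
    OPTIND=1
    node= name= lifetime= value=
    while getopts "N:n:l:v:" opt "$@"; do
        case "$opt" in
            N) node=$OPTARG ;;      # node the attribute belongs to
            n) name=$OPTARG ;;      # attribute name
            l) lifetime=$OPTARG ;;  # "reboot" = transient, "forever" = permanent
            v) value=$OPTARG ;;     # value being written
            *) return 1 ;;
        esac
    done
    echo "node=$node attribute=$name lifetime=$lifetime value=$value"
}

# The arguments from the log excerpt above:
explain_crm_attribute -N qa-magdb2 -n master-mysql_drbd:1 -l reboot -v 100
```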
[Pacemaker] sshd under cluster
Hi,

I was asked to place the sshd daemon under cluster control, and because I faced a few challenges I thought I would share them with you.

The 1st challenge was to clone the sshd daemon, its init script and its configuration. The procedure is at the bottom of this mail.

The 2nd challenge was the init script of sshd in CentOS. It has 2 issues. The 1st issue was that it failed test 6 mentioned here:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/ap-lsb.html
The 2nd issue was that during shutdown or reboot of the cluster node, the stop action on the resource received return code 143 from the init script and the whole shutdown/reboot process was stuck for a few minutes. The root cause was the killall command called by the init script. The init script calls killall, only on shutdown or reboot, to close any open connections. But that call was also killing the script itself! Because of that the cluster got an error on the stop action, and the lock file of sshd was not removed either. You can imagine the consequences. For both issues I filed a bug report and hacked the init script in order to have a short-term resolution.

The last challenge was related to a mail I sent a few hours ago. The 1st monitor action after the start action was too fast and sshd didn't have enough time to create its pid file. As a result the monitor thought that sshd was down, but it wasn't. A sleep 1 after the start function in the init script solved the issue.

Cheers,
Pavlos

Clone SSH for pbx_0N

Prerequisite: the default sshd must listen only on the node's IP and not on all IPs.
cp -p /etc/init.d/sshd /etc/init.d/sshd-pbx_02
cp -p /etc/pam.d/sshd /etc/pam.d/sshd-pbx_02   # optional, needed only if UsePAM is true (on RH it is true by default)
ln -s /usr/sbin/sshd /usr/sbin/sshd-pbx_02
touch /etc/sysconfig/sshd-pbx_02
echo 'OPTIONS="-f /etc/ssh/sshd_config-pbx_02"' > /etc/sysconfig/sshd-pbx_02
cp -p /etc/ssh/sshd_config /etc/ssh/sshd_config-pbx_02

[r...@node-02 ~]# diff -wu /etc/init.d/sshd /etc/init.d/sshd-pbx_02
--- /etc/init.d/sshd    2009-09-03 20:12:38.0 +0200
+++ /etc/init.d/sshd-pbx_02     2010-10-12 12:25:50.0 +0200
@@ -1,33 +1,33 @@
-#!/bin/bash
+#!/bin/bash -x
 #
-# Init file for OpenSSH server daemon
+# Init file for OpenSSH server daemon used by pbx_02
 #
 # chkconfig: 2345 55 25
-# description: OpenSSH server daemon
+# description: OpenSSH server daemon for pbx_02
 #
-# processname: sshd
-# config: /etc/ssh/ssh_host_key
-# config: /etc/ssh/ssh_host_key.pub
+# processname: sshd-pbx_02
+# config: /etc/ssh/ssh_host_key-pbx_02
+# config: /etc/ssh/ssh_host_key-pbx_02.pub
 # config: /etc/ssh/ssh_random_seed
-# config: /etc/ssh/sshd_config
-# pidfile: /var/run/sshd.pid
+# config: /etc/ssh/sshd_config-pbx_02
+# pidfile: /var/run/sshd-pbx_02.pid
 
 # source function library
 . /etc/rc.d/init.d/functions
 
 # pull in sysconfig settings
-[ -f /etc/sysconfig/sshd ] && . /etc/sysconfig/sshd
+[ -f /etc/sysconfig/sshd-pbx_02 ] && . /etc/sysconfig/sshd-pbx_02
 
 RETVAL=0
-prog=sshd
+prog=sshd-pbx_02
 
 # Some functions to make the below more readable
 KEYGEN=/usr/bin/ssh-keygen
-SSHD=/usr/sbin/sshd
-RSA1_KEY=/etc/ssh/ssh_host_key
-RSA_KEY=/etc/ssh/ssh_host_rsa_key
-DSA_KEY=/etc/ssh/ssh_host_dsa_key
-PID_FILE=/var/run/sshd.pid
+SSHD=/usr/sbin/sshd-pbx_02
+RSA1_KEY=/etc/ssh/ssh_host_key-pbx_02
+RSA_KEY=/etc/ssh/ssh_host_rsa_key-pbx_02
+DSA_KEY=/etc/ssh/ssh_host_dsa_key-pbx_02
+PID_FILE=/var/run/sshd-pbx_02.pid
 
 runlevel=$(set -- $(runlevel); eval echo \$$# )
 
@@ -110,7 +110,11 @@
        echo -n $"Starting $prog: "
        $SSHD $OPTIONS && success || failure
        RETVAL=$?
-       [ $RETVAL = 0 ] && touch /var/lock/subsys/sshd
+       [ $RETVAL = 0 ] && touch /var/lock/subsys/sshd-pbx_02
+# to avoid a race condition, 1st cluster monitor after start fails
+# because the pid file is not created yet. Few msecs detail on the
+# creation of pid file is enough to cause issues.
+sleep 1
        echo
 }
 
@@ -119,16 +123,25 @@
        echo -n $"Stopping $prog: "
        if [ -n "`pidfileofproc $SSHD`" ] ; then
            killproc $SSHD
+       elif [ -z "`pidfileofproc $SSHD`" ] && [ ! -f /var/lock/subsys/sshd-pbx_02 ] ; then
+success
+RETVAL=0
        else
            failure $"Stopping $prog"
        fi
        RETVAL=$?
+
+### Added by Pavlos Parissis ###
+# Disable the below bit because killall kills the script itself.
+# This causes problems within the cluster, shutdown of a node fails.
+# Any open connections will be killed by /etc/init.d.halt anyways
+
        # if we are in halt or reboot runlevel kill all running sessions
        # so the TCP connections are closed cleanly
-       if [ x$runlevel = x0 -o x$runlevel = x6 ] ; then
-           killall $prog 2>/dev/null
-       fi
-       [ $RETVAL = 0 ] && rm -f /var/lock/subsys/sshd
+       #if [ x$runlevel = x0 -o x$runlevel = x6 ] ; then
+       #killall $prog
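Two details of the post can be illustrated in isolation. Return code 143 is 128 + 15, which is what a shell reports when a process was terminated by SIGTERM, and that fits a killall by name that also matches the init script, since the script and $prog share the name sshd-pbx_02. The sketch below shows a stop routine that targets only the pid recorded in the pid file and reports success when nothing is running, as the LSB tests referenced in the post expect. It is an illustration under assumptions, not the CentOS init script: the function name stop_by_pidfile and the 10-second limit are invented here.

```shell
#!/bin/sh
# stop_by_pidfile PIDFILE
# Hypothetical helper: stop the process recorded in PIDFILE, and nothing else.
# Returns 0 if the process was stopped or was not running, 1 on failure.
stop_by_pidfile() {
    pidfile=$1
    if [ ! -s "$pidfile" ]; then
        return 0                      # no pid file: already stopped
    fi
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid" || return 1       # SIGTERM to this one pid, not by name
        i=0
        while kill -0 "$pid" 2>/dev/null; do
            if [ "$i" -ge 10 ]; then
                return 1              # still alive after about 10 seconds
            fi
            sleep 1
            i=$((i + 1))
        done
    fi
    rm -f "$pidfile"                  # also clears a stale pid file
    return 0
}
```

Because the signal goes to a single pid, the script cannot terminate itself the way a name-based killall can, and a second stop on an already stopped service still returns 0.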
Re: [Pacemaker] Migrate resources based on connectivity
Hi,

Lars Ellenberg wrote:
On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote:

Hi,

Dejan Muhamedagic wrote:

Hi,

On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote:

Hi,

I have the following setup:
- order drbd0:promote drbd1:promote
- order drbd1:promote drbd2:promote
- order drbd2:promote all:start
- collocation all drbd2:Master
- all is a group of resources, drbd{0..3} are drbd ms resources.

I want to migrate the resources based on ping connectivity to a default gateway. Based on http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks and http://www.clusterlabs.org/wiki/Example_configurations I've tried the following:
- primitive ping ocf:pacemaker:ping params host_list=1.2.3.4 multiplier=100 op monitor interval=5s timeout=5s
- clone ping_clone ping meta globally-unique=false
- location ping_nok all \
    rule $id=ping_nok-rule -inf: not_defined ping_clone or ping_clone number:lte 0

Use pingd to reference the attribute in the location constraint.

Not to be disrespectful, but after 3 days being stuck on this issue, I don't exactly understand how to do that. Could you please provide an example. Thank you in advance.

The example you reference lists:

primitive pingdnet1 ocf:pacemaker:pingd \
    params host_list=192.168.23.1 \
    name=pingdnet1
    ^^
clone cl-pingdnet1 pingdnet1
^

param name default is pingd, and is the attribute name to be used in the location constraints. You will need to reference pingd in your location constraint, or set an explicit name in the primitive definition, and reference that. Your ping primitive sets the default 'pingd' attribute, but you reference some 'ping_clone' attribute, which apparently no-one really references.

I've finally managed to finish the setup with the indications received above, the behavior is the expected one.
Also, I've tried the ocf:pacemaker:pingd and even though it does the reachability tests properly, it fails to update the cib upon restoring the connectivity, I had to manually run attrd_updater -R to get the resources to start again, therefore I'm going with ocf:pacemaker:ping.

Anyways, Dejan, Lars, Andrew, thank you all very much for your help.

Best regards,
Dan

http://www.clusterlabs.org/wiki/Example_configurations

-- 
Dan FRINCU
Systems Engineer
CCNA, RHCE
Streamwide Romania
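Putting Lars's advice together with the configuration Dan posted earlier in the thread, the corrected constraint would look roughly like the fragment below. This is a sketch assembled from the thread, not the configuration Dan actually ended up with: the only change is that the rule tests the pingd attribute, which the ping primitive sets by default, instead of the name of the clone.

```
primitive ping ocf:pacemaker:ping \
    params host_list="1.2.3.4" multiplier="100" \
    op monitor interval="5s" timeout="5s"
clone ping_clone ping meta globally-unique="false"
location ping_nok all \
    rule $id="ping_nok-rule" -inf: not_defined pingd or pingd number:lte 0
```

The alternative Lars mentions is to add an explicit name parameter to the primitive and use that same name in both places in the rule.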
Re: [Pacemaker] problem about move node from one clusterto another cluster
Hi,

Depending on the openais version (please mention it) this behavior could happen; I've seen it as well, on openais-0.8.0. What I did to fix it was to restart the openais process via /etc/init.d/openais restart, and then it worked. However, this was one of the reasons I updated the packages to the latest versions of corosync, pacemaker, etc. The tricky part was the migration procedure for upgrading production servers without service downtime, but that's another story.

Regards,
Dan

jiaju liu wrote:

Message: 2
Date: Tue, 12 Oct 2010 10:40:18 +0800 (CST)
From: jiaju liu liujiaj...@yahoo.com.cn
To: pacemaker@oss.clusterlabs.org
Subject: [Pacemaker] problem about move node from one cluster to another cluster
Message-ID: 765547.4759...@web15704.mail.cnb.yahoo.com
Content-Type: text/plain; charset=iso-8859-1

hi everybody

I use the command service openais stop first to stop the openais service, then use rm -rf /var/lib/heartbeat/crm/* to clear all information, then change the multicast address, and then use service openais start to join the other cluster. The problem is that sometimes it works well and I can use the crm_mon command, and sometimes it doesn't work. I use service openais status to check; it shows Running, but I cannot use crm_mon to connect to the cluster. I found the reason may be that the directory /var/lib/heartbeat/crm/ is empty. Why? If I reboot, it works again. Why?

Now, even when the directory is not empty, it sometimes does not work. When I use crm_mon it shows

Attempting connection to the cluster..
when I use *crm node list *it shows Signon to CIB failed: connection failed Init failed, could not perform requested operations ERROR: cannot parse output of cibadmin -Ql -o nodes: no element found: line 1, column 0 -- next part -- An HTML attachment was scrubbed... URL: http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20101012/7ea78f33/attachment-0001.htm -- ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker -- Dan FRINCU Systems Engineer CCNA, RHCE Streamwide Romania ___ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
Re: [Pacemaker] Migrate resources based on connectivity
On 12 October 2010 20:00, Dan Frincu dfri...@streamwide.ro wrote: Hi,
Lars Ellenberg wrote: On Mon, Oct 11, 2010 at 03:50:01PM +0300, Dan Frincu wrote: Hi,
Dejan Muhamedagic wrote: Hi, On Sun, Oct 10, 2010 at 10:27:13PM +0300, Dan Frincu wrote: Hi, I have the following setup:
- order drbd0:promote drbd1:promote
- order drbd1:promote drbd2:promote
- order drbd2:promote all:start
- collocation all drbd2:Master
- all is a group of resources, drbd{0..3} are drbd ms resources.
I want to migrate the resources based on ping connectivity to a default gateway. Based on http://www.clusterlabs.org/wiki/Pingd_with_resources_on_different_networks and http://www.clusterlabs.org/wiki/Example_configurations I've tried the following:
- primitive ping ocf:pacemaker:ping params host_list=1.2.3.4 multiplier=100 op monitor interval=5s timeout=5s
- clone ping_clone ping meta globally-unique=false
- location ping_nok all \ rule $id=ping_nok-rule -inf: not_defined ping_clone or ping_clone number:lte 0
Use pingd to reference the attribute in the location constraint.
Not to be disrespectful, but after 3 days of being stuck on this issue, I don't exactly understand how to do that. Could you please provide an example? Thank you in advance.
The example you reference lists:
primitive pingdnet1 ocf:pacemaker:pingd \
params host_list=192.168.23.1 \
name=pingdnet1
clone cl-pingdnet1 pingdnet1
The name param defaults to pingd, and it is the attribute name to be used in the location constraints. You will need to reference pingd in your location constraint, or set an explicit name in the primitive definition and reference that. Your ping primitive sets the default 'pingd' attribute, but you reference some 'ping_clone' attribute, which apparently no-one really references.
I've finally managed to finish the setup with the indications received above; the behavior is the expected one.
Also, I've tried the ocf:pacemaker:pingd and even though it does the reachability tests properly, it fails to update the cib upon restoring the connectivity, I had to manually run attrd_updater -R to get the resources to start again, therefore I'm going with ocf:pacemaker:ping.
It would be quite useful for the rest of us if you posted your final, working configuration. Cheers, Pavlos
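Dan's final configuration is not included in this message. As a minimal sketch of what Lars describes, assuming the resource names from Dan's original attempt (ping, ping_clone, the group all, and the placeholder gateway 1.2.3.4) and the default attribute name pingd, the location rule would test pingd rather than ping_clone. This is an illustration in crm shell syntax, not the configuration Dan actually deployed:

```
# Sketch only: names and values are taken from Dan's first attempt.
primitive ping ocf:pacemaker:ping \
    params host_list="1.2.3.4" multiplier="100" \
    op monitor interval="5s" timeout="5s"
clone ping_clone ping \
    meta globally-unique="false"
# The rule tests the node attribute written by the ping resource
# (default name: pingd), not the id of the clone.
location ping_nok all \
    rule $id="ping_nok-rule" -inf: not_defined pingd or pingd number:lte 0
```

If the primitive set an explicit name parameter instead (for example name="ping_gw", a hypothetical value), the rule would have to test ping_gw in place of pingd.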
[Pacemaker] what is the meaning of stand_alone_ping
The screen says: node1 pingd: [1927]: info: stand_alone_ping: Node 192.168.10.110 is unreachable (read) and the node could not start.