Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Gabriele Bulfon
Thanks! I am using Corosync 2.3.6 and Pacemaker 1.1.4, built with
"--with-corosync".
How does Corosync determine its own version?

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon
--
From: Klaus Wenninger
To: users@clusterlabs.org
Date: 23 August 2016 4.54.44 CEST
Subject: Re: [ClusterLabs] pacemakerd quits after few seconds with some errors
On 08/23/2016 12:20 AM, Ken Gaillot wrote:
On 08/22/2016 12:17 PM, Gabriele Bulfon wrote:
Hi,
I built corosync/pacemaker for our XStreamOS/illumos: corosync starts
fine and logs correctly, but pacemakerd quits after a few seconds with the
attached log.
Any idea where the issue is?
Pacemaker is not able to communicate with corosync for some reason.
Aug 22 19:13:02 [1324] xstorage1 corosync notice  [MAIN  ] Corosync
Cluster Engine ('UNKNOWN'): started and ready to provide service.
'UNKNOWN' should show the corosync version. I'm wondering if maybe you
have an older corosync without configuring the pacemaker plugin. It
would be much better to use corosync 2 instead, if you can.
If corosync is not able to determine its own version, the
pacemaker build might not have been able to either. So it
might have made some weird decisions/assumptions,
e.g. not building the plugin at all, on the assumption that you
are not using corosync 2+ ...
Thanks,
Gabriele

*Sonicle S.r.l. *: http://www.sonicle.com
*Music: *http://www.gabrielebulfon.com
*Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] corosync.log is 5.1GB in a short period

2016-08-22 Thread 朱荣
Hello:
I has a problem about corosync log, my corosync log is increase to 5.1GB in a 
short time.
Then I check the corosync log, it’s show me the same message in short 
period,like the attachment.
What happened about corosync? Thank you!
my corosync and pacemaker is:corosync-2.3.4-7.el7.x86_64 
pacemaker-1.1.13-10.el7x86_64
  by zhu rong
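
For reference, how much corosync writes to its logfile is controlled in the
logging section of corosync.conf; a minimal sketch (option names per
corosync.conf(5), values here are only examples and may not match this
cluster). Whether this is relevant depends on what the repeated message in
the attachment actually is:

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    # debug output can easily blow up the logfile; keep it off unless actively troubleshooting
    debug: off
    timestamp: on
}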
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Which cluster HA package to choose

2016-08-22 Thread Digimer
On 22/08/16 11:40 AM, Ron Gilad wrote:
> Hi,
> 
> I have encountered several cluster High Availability packages:
> 
> -  Pacemaker + Corosync
> 
> -  Red Hat Enterprise Linux Cluster
> 
> Which package do you think is the best to choose?
> 
> Do you know if the latest version is stable?
> 
> And which companies are using it?
> 
>  
> 
> Thanks in advance,
> 
> Ron

Short answer is "Corosync v2 + Pacemaker 1.1.10+" (1.1.14+, ideally).

Long answer is here: https://alteeve.ca/w/History_of_HA_Clustering

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Klaus Wenninger
On 08/23/2016 12:20 AM, Ken Gaillot wrote:
> On 08/22/2016 12:17 PM, Gabriele Bulfon wrote:
>> Hi,
>>
>> I built corosync/pacemaker for our XStreamOS/illumos: corosync starts
>> fine and logs correctly, but pacemakerd quits after a few seconds with the
>> attached log.
>> Any idea where the issue is?
> Pacemaker is not able to communicate with corosync for some reason.
>
> Aug 22 19:13:02 [1324] xstorage1 corosync notice  [MAIN  ] Corosync
> Cluster Engine ('UNKNOWN'): started and ready to provide service.
>
> 'UNKNOWN' should show the corosync version. I'm wondering if maybe you
> have an older corosync without configuring the pacemaker plugin. It
> would be much better to use corosync 2 instead, if you can.
If corosync is not able to determine its own version, the
pacemaker build might not have been able to either. So it
might have made some weird decisions/assumptions,
e.g. not building the plugin at all, on the assumption that you
are not using corosync 2+ ...
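
A quick way to cross-check what both builds ended up with (just a sketch,
assuming the binaries are on the PATH; feature names vary by version):

# the version string corosync was built with (the one showing up as 'UNKNOWN' above)
corosync -v

# the cluster-stack support pacemaker was compiled with; look for corosync-related
# entries such as corosync-native in the feature list
pacemakerd --features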
>
>> Thanks,
>> Gabriele
>>
>> 
>> *Sonicle S.r.l. *: http://www.sonicle.com 
>> *Music: *http://www.gabrielebulfon.com 
>> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Which cluster HA package to choose

2016-08-22 Thread Ron Gilad
Hi,
I have encountered several cluster High Availability packages:

-  Pacemaker + Corosync

-  Red Hat Enterprise Linux Cluster
Which package do you think is the best to choose?
Do you know if the latest version is stable?
And which companies are using it?

Thanks in advance,
Ron
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Ken Gaillot
On 08/22/2016 12:17 PM, Gabriele Bulfon wrote:
> Hi,
> 
> I built corosync/pacemaker for our XStreamOS/illumos: corosync starts
> fine and logs correctly, but pacemakerd quits after a few seconds with the
> attached log.
> Any idea where the issue is?

Pacemaker is not able to communicate with corosync for some reason.

Aug 22 19:13:02 [1324] xstorage1 corosync notice  [MAIN  ] Corosync
Cluster Engine ('UNKNOWN'): started and ready to provide service.

'UNKNOWN' should show the corosync version. I'm wondering if maybe you
have an older corosync without configuring the pacemaker plugin. It
would be much better to use corosync 2 instead, if you can.
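
For reference, with a 1.x corosync the pacemaker plugin has to be declared
explicitly, typically with a service section along these lines (a sketch;
with corosync 2 no such section is needed, pacemaker talks to corosync
natively):

service {
    # load pacemaker support; ver: 1 means pacemakerd is started separately,
    # ver: 0 would run pacemaker as a corosync plugin
    name: pacemaker
    ver: 1
}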

> 
> Thanks,
> Gabriele
> 
> 
> *Sonicle S.r.l. *: http://www.sonicle.com 
> *Music: *http://www.gabrielebulfon.com 
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pacemakerd quits after few seconds with some errors

2016-08-22 Thread Gabriele Bulfon
Hi,
I built corosync/pacemaker for our XStreamOS/illumos: corosync starts fine and
logs correctly, but pacemakerd quits after a few seconds with the attached log.
Any idea where the issue is?
Thanks,
Gabriele

Sonicle S.r.l. : http://www.sonicle.com
Music: http://www.gabrielebulfon.com
Quantum Mechanics : http://www.cdbaby.com/cd/gabrielebulfon


corosync.log
Description: binary/octet-stream
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-22 Thread Ken Gaillot
On 08/22/2016 07:24 AM, Attila Megyeri wrote:
> Hi Andrei,
> 
> I waited several hours, and nothing happened. 

And actually, we can see from the configuration you provided that
cluster-recheck-interval is 2 minutes.

I don't see anything about stonith; is it enabled and tested? This looks
like a situation where stonith would come into play. I know that power
fencing can be rough on a MySQL database, but perhaps intelligent
switches with network fencing would be appropriate.
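
A quick way to check (just a sketch; adjust to whatever management shell is
in use):

# query the cluster-wide stonith-enabled property
crm_attribute -t crm_config -n stonith-enabled -G

# list anything stonith-related in the configuration (crmsh syntax)
crm configure show | grep -i stonith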

The "Corosync main process was not scheduled" message is the start of
the trouble. It means the system was overloaded and corosync didn't get
any CPU time, so it couldn't maintain cluster communication.

Probably the most useful thing would be to upgrade to a recent version
of corosync+pacemaker+resource-agents. Recent corosync versions run with
realtime priority, which makes this much less likely.
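
One way to verify whether the running corosync actually has a realtime
scheduling class (assuming util-linux's chrt is available):

# SCHED_RR or SCHED_FIFO means realtime; SCHED_OTHER means corosync can be starved under load
chrt -p $(pidof corosync)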

Other than that, figure out what the load issue was, and try to prevent
it from recurring.
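
For reference, the "Consider token timeout increase" hint from the log maps
to the token setting in the totem section of corosync.conf; a minimal sketch,
where the value is only an example and the right number depends on the
environment:

totem {
    version: 2
    # token timeout in milliseconds (corosync 2.x defaults to 1000)
    token: 10000
}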

I'm not familiar enough with the RA to comment on its behavior. If you
think it's suspect, check the logs during the incident for messages from
the RA.

> I assume that the RA does not treat this case properly. Mysql was running, 
> but the "show slave status" command returned something that the RA was not 
> prepared to parse, and instead of reporting a non-readable attribute, it 
> returned some generic error, that did not stop the server. 
> 
> Rgds,
> Attila
> 
> 
> -Original Message-
> From: Andrei Borzenkov [mailto:arvidj...@gmail.com] 
> Sent: Monday, August 22, 2016 11:42 AM
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> 
> Subject: Re: [ClusterLabs] Mysql slave did not start replication after 
> failure, and read-only IP also remained active on the much outdated slave
> 
> On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri
>  wrote:
>> Dear community,
>>
>>
>>
>> A few days ago we had an issue in our Mysql M/S replication cluster.
>>
>> We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed to be
>> running on the slave if it is not too much behind the master, and if any
>> error occurs, RO VIP is moved to the master.
>>
>>
>>
>> Something happened with the slave Mysql (some disk issue, still
>> investigating), but the problem is, that the slave VIP remained on the slave
>> device, even though the slave process was not running, and the server was
>> much outdated.
>>
>>
>>
>> During the issue the following log entries appeared (just an extract as it
>> would be too long):
>>
>>
>>
>>
>>
>> Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
>> not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token
>> timeout increase.
>>
>> Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming
>> new configuration.
>>
>> Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
>> not scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token
>> timeout increase.
>>
>> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720)
>> was formed. Members left: 168362243 168362281 168362282 168362301 168362302
>> 168362311 168362312 1
>>
>> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724)
>> was formed. Members
>>
>> ..
>>
>> Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service
>> synchronization, ready to provide service.
>>
>> ..
>>
>> Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
>> flush op to all hosts for: readable (1)
>>
>> …
>>
>> Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification
>> for ctdb1
>>
>> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok,
>> IP_CIP=
>>
>> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
>> db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
>>
>> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address
>> xxx/24 with broadcast address  to device eth0
>>
>> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device
>> eth0 up
>>
>> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO:
>> /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
>> /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
>>
>> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
>> db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
>>
>> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
>> db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false)
>> ok
>>
>> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
>> db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
>>
>> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
>> flush op to all hosts for: master-db-mysql (1)
>>

Re: [ClusterLabs] Entire Group stop on stopping of single Resource

2016-08-22 Thread Jan Pokorný
On 19/08/16 23:09 +0530, jaspal singla wrote:
> I have a resource group (ctm_service) comprised of various resources. Now
> the requirement is: when one of its resources stops for some time (10-20
> seconds), I want the entire group to be stopped.

Note that if the resource is stopped _just_ for this period (in seconds)
while the monitor interval is set to a bigger value (30 s), pacemaker may
miss the resource being intermittently stopped.
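
For example, tightening the monitor on one of the group members so short
outages are more likely to be noticed (pcs syntax matching the commands
quoted further down; the 10s value is only illustrative):

pcs -f cib.xml.geo resource update FSCheck op monitor interval=10s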

> Is it possible to achieve this in pacemaker. Please help!

Just for clarification, do you mean stopped completely within the
cluster and not just on the node the group was running on when one of
its resources stopped?

>  Resource Group: ctm_service
>  FSCheck
> (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FsCheckAgent.py):
>(target-role:Stopped) Stopped
>  NTW_IF (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/NtwIFAgent.py):
>  (target-role:Stopped) Stopped
>  CTM_RSYNC  (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/RsyncAgent.py):
>  (target-role:Stopped) Stopped
>  REPL_IF
> (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_IFAgent.py):
> (target-role:Stopped) Stopped
>  ORACLE_REPLICATOR
> (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_ReplicatorAgent.py):
> (target-role:Stopped) Stopped
>  CTM_SID
> (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/OracleAgent.py):
> (target-role:Stopped) Stopped
>  CTM_SRV(lsb:../../..//cisco/PrimeOpticalServer/HA/bin/CtmAgent.py):
>(target-role:Stopped) Stopped
>  CTM_APACHE 
> (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ApacheAgent.py):
> (target-role:Stopped) Stopped
> 
> _
> 
> 
> This is resource and resource group properties:
> 
> 
> ___
> 
> pcs -f cib.xml.geo resource create FSCheck lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/FsCheckAgent.py op monitor id=FSCheck-OP-monitor
> name=monitor interval=30s
> pcs -f cib.xml.geo resource create NTW_IF lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/NtwIFAgent.py op monitor id=NtwIFAgent-OP-monitor
> name=monitor interval=30s
> pcs -f cib.xml.geo resource create CTM_RSYNC lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/RsyncAgent.py op monitor id=CTM_RSYNC-OP-monitor
> name=monitor interval=30s on-fail=ignore stop id=CTM_RSYNC-OP-stop
> interval=0 on-fail=stop
> pcs -f cib.xml.geo resource create REPL_IF lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/ODG_IFAgent.py op monitor id=REPL_IF-OP-monitor
> name=monitor interval=30 on-fail=ignore stop id=REPL_IF-OP-stop interval=0
> on-fail=stop
> pcs -f cib.xml.geo resource create ORACLE_REPLICATOR lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/ODG_ReplicatorAgent.py op monitor
> id=ORACLE_REPLICATOR-OP-monitor name=monitor interval=30s on-fail=ignore
> stop id=ORACLE_REPLICATOR-OP-stop interval=0 on-fail=stop
> pcs -f cib.xml.geo resource create CTM_SID lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/OracleAgent.py op monitor id=CTM_SID-OP-monitor
> name=monitor interval=30s
> pcs -f cib.xml.geo resource create CTM_SRV lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/CtmAgent.py op monitor id=CTM_SRV-OP-monitor
> name=monitor interval=30s
> pcs -f cib.xml.geo resource create CTM_APACHE lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/ApacheAgent.py op monitor
> id=CTM_APACHE-OP-monitor name=monitor interval=30s
> pcs -f cib.xml.geo resource create CTM_HEARTBEAT lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/HeartBeat.py op monitor
> id=CTM_HEARTBEAT-OP-monitor name=monitor interval=30s
> pcs -f cib.xml.geo resource create FLASHBACK  lsb:../../..//cisco/
> PrimeOpticalServer/HA/bin/FlashBackMonitor.py op monitor
> id=FLASHBACK-OP-monitor name=monitor interval=30s
> 
> 
> pcs -f cib.xml.geo resource group add ctm_service FSCheck NTW_IF CTM_RSYNC
> REPL_IF ORACLE_REPLICATOR CTM_SID CTM_SRV CTM_APACHE
> 
> pcs -f cib.xml.geo resource meta ctm_service migration-threshold=1
> failure-timeout=10 target-role=stopped

Why do you have target-role=stopped (should preferably be title-cased
"Stopped") here? Is that only for testing purposes?  I ask as it may
interfere with any subsequent modifications.
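
If it is not intentional, the meta attribute can simply be corrected or
dropped, e.g. (pcs syntax, mirroring the commands above; just a sketch):

# use the canonical spelling
pcs -f cib.xml.geo resource meta ctm_service target-role=Stopped
# or remove it entirely so the group is allowed to start
pcs -f cib.xml.geo resource meta ctm_service target-role=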


P.S. The presented configuration resembles the output of clufter, so any
feedback that could be turned into improvements to it is welcome.

-- 
Jan (Poki)


pgpl7Hmkl8gdo.pgp
Description: PGP signature
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-22 Thread Attila Megyeri
Hi Andrei,

I waited several hours, and nothing happened. 

I assume that the RA does not handle this case properly. MySQL was running, but 
the "show slave status" command returned something that the RA was not prepared 
to parse, and instead of reporting a non-readable attribute, it returned some 
generic error that did not stop the server. 
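
For what it's worth, a manual check of the fields a mysql RA typically looks
at would be something like this (a sketch; credentials/socket options omitted,
and the exact fields the agent parses depend on its version):

mysql -e "SHOW SLAVE STATUS\G" | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Last_.*Error'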

Rgds,
Attila


-Original Message-
From: Andrei Borzenkov [mailto:arvidj...@gmail.com] 
Sent: Monday, August 22, 2016 11:42 AM
To: Cluster Labs - All topics related to open-source clustering welcomed 

Subject: Re: [ClusterLabs] Mysql slave did not start replication after failure, 
and read-only IP also remained active on the much outdated slave

On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri
 wrote:
> Dear community,
>
>
>
> A few days ago we had an issue in our Mysql M/S replication cluster.
>
> We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed to be
> running on the slave if it is not too much behind the master, and if any
> error occurs, RO VIP is moved to the master.
>
>
>
> Something happened with the slave Mysql (some disk issue, still
> investigating), but the problem is, that the slave VIP remained on the slave
> device, even though the slave process was not running, and the server was
> much outdated.
>
>
>
> During the issue the following log entries appeared (just an extract as it
> would be too long):
>
>
>
>
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token
> timeout increase.
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming
> new configuration.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token
> timeout increase.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720)
> was formed. Members left: 168362243 168362281 168362282 168362301 168362302
> 168362311 168362312 1
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724)
> was formed. Members
>
> ..
>
> Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
>
> ..
>
> Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: readable (1)
>
> …
>
> Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification
> for ctdb1
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok,
> IP_CIP=
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address
> xxx/24 with broadcast address  to device eth0
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device
> eth0 up
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO:
> /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
> /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false)
> ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: master-db-mysql (1)
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_perform_update: Sent
> update 1622: master-db-mysql=1
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote
> notification for my own demotion.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 . UDP 123
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123,
> interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
>
> Aug 20 02:13:40 

Re: [ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-22 Thread Andrei Borzenkov
On Mon, Aug 22, 2016 at 12:18 PM, Attila Megyeri
 wrote:
> Dear community,
>
>
>
> A few days ago we had an issue in our Mysql M/S replication cluster.
>
> We have a one R/W Master, and a one RO Slave setup. RO VIP is supposed to be
> running on the slave if it is not too much behind the master, and if any
> error occurs, RO VIP is moved to the master.
>
>
>
> Something happened with the slave Mysql (some disk issue, still
> investigating), but the problem is, that the slave VIP remained on the slave
> device, even though the slave process was not running, and the server was
> much outdated.
>
>
>
> During the issue the following log entries appeared (just an extract as it
> would be too long):
>
>
>
>
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token
> timeout increase.
>
> Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming
> new configuration.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was
> not scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token
> timeout increase.
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720)
> was formed. Members left: 168362243 168362281 168362282 168362301 168362302
> 168362311 168362312 1
>
> Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724)
> was formed. Members
>
> ..
>
> Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service
> synchronization, ready to provide service.
>
> ..
>
> Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: readable (1)
>
> …
>
> Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification
> for ctdb1
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok,
> IP_CIP=
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address
> xxx/24 with broadcast address  to device eth0
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device
> eth0 up
>
> Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO:
> /usr/lib/heartbeat/send_arp -i 200 -r 5 -p
> /usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false)
> ok
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending
> flush op to all hosts for: master-db-mysql (1)
>
> Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_perform_update: Sent
> update 1622: master-db-mysql=1
>
> Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote
> notification for my own demotion.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
>
> Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
> Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation
> db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 . UDP 123
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123,
> interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
>
> Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
>
> Aug 20 02:13:40 ctdb1 mysql(db-mysql)[11224]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
> Aug 20 02:13:47 ctdb1 mysql(db-mysql)[11263]: ERROR: check_slave invoked on
> an instance that is not a replication slave.
>
>
>
> And from this time, the last two lines repeat every 7 seconds (mysql
> monitoring interval)
>
>
>
>
>
> The expected behavior was that the slave (RO) VIP should have been moved to
> the master, as the secondary db was outdated.
>
> Unfortunately I cannot recall what crm_mon was showing when the issue was
> present, but I am sure that the RA did not handle the situation properly.
>
>
>
> Placing the slave node into standby and the online resolved the issue
> immediately (Slave 

[ClusterLabs] Mysql slave did not start replication after failure, and read-only IP also remained active on the much outdated slave

2016-08-22 Thread Attila Megyeri
Dear community,

A few days ago we had an issue in our Mysql M/S replication cluster.
We have a setup with one R/W master and one RO slave. The RO VIP is supposed to 
be running on the slave if it is not too far behind the master, and if any error 
occurs, the RO VIP is moved to the master.
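
For context, the read-only VIP placement is normally driven by the "readable"
node attribute that the mysql RA sets (it appears in the log below); a
crm-shell sketch of the kind of location rule typically used for this, not
necessarily our exact configuration:

location db-ip-slave-needs-readable db-ip-slave \
    rule -inf: readable eq 0 or not_defined readable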

Something happened with the slave MySQL (some disk issue, still investigating), 
but the problem is that the slave VIP remained on the slave node, even though 
the slave process was not running and the server was far behind.

During the issue the following log entries appeared (just an extract as it 
would be too long):


Aug 20 02:04:07 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was not 
scheduled for 14088.5488 ms (threshold is 4000. ms). Consider token timeout 
increase.
Aug 20 02:04:07 ctdb1 corosync[1056]:   [TOTEM ] A processor failed, forming 
new configuration.
Aug 20 02:04:34 ctdb1 corosync[1056]:   [MAIN  ] Corosync main process was not 
scheduled for 27065.2559 ms (threshold is 4000. ms). Consider token timeout 
increase.
Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6720) 
was formed. Members left: 168362243 168362281 168362282 168362301 168362302 
168362311 168362312 1
Aug 20 02:04:34 ctdb1 corosync[1056]:   [TOTEM ] A new membership (xxx:6724) 
was formed. Members
..
Aug 20 02:13:28 ctdb1 corosync[1056]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
..
Aug 20 02:13:29 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: readable (1)
...
Aug 20 02:13:32 ctdb1 mysql(db-mysql)[10492]: INFO: post-demote notification 
for ctdb1
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-master)[10490]: INFO: IP status = ok, 
IP_CIP=
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-ip-master_stop_0 (call=371, rc=0, cib-update=179, confirmed=true) ok
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Adding inet address 
xxx/24 with broadcast address  to device eth0
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: Bringing device eth0 up
Aug 20 02:13:32 ctdb1 IPaddr2(db-ip-slave)[10620]: INFO: 
/usr/lib/heartbeat/send_arp -i 200 -r 5 -p 
/usr/var/run/resource-agents/send_arp-xxx eth0 xxx auto not_used not_used
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-ip-slave_start_0 (call=377, rc=0, cib-update=180, confirmed=true) ok
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-ip-slave_monitor_2 (call=380, rc=0, cib-update=181, confirmed=false) ok
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_notify_0 (call=374, rc=0, cib-update=0, confirmed=true) ok
Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-db-mysql (1)
Aug 20 02:13:32 ctdb1 attrd[1584]:   notice: attrd_perform_update: Sent update 
1622: master-db-mysql=1
Aug 20 02:13:32 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_demote_0 (call=384, rc=0, cib-update=182, confirmed=true) ok
Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11160]: INFO: Ignoring post-demote 
notification for my own demotion.
Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_notify_0 (call=387, rc=0, cib-update=0, confirmed=true) ok
Aug 20 02:13:33 ctdb1 mysql(db-mysql)[11185]: ERROR: check_slave invoked on an 
instance that is not a replication slave.
Aug 20 02:13:33 ctdb1 crmd[1586]:   notice: process_lrm_event: LRM operation 
db-mysql_monitor_7000 (call=390, rc=0, cib-update=183, confirmed=false) ok
Aug 20 02:13:33 ctdb1 ntpd[1560]: Listen normally on 16 eth0 . UDP 123
Aug 20 02:13:33 ctdb1 ntpd[1560]: Deleting interface #12 eth0, xxx#123, 
interface stats: received=0, sent=0, dropped=0, active_time=2637334 secs
Aug 20 02:13:33 ctdb1 ntpd[1560]: peers refreshed
Aug 20 02:13:33 ctdb1 ntpd[1560]: new interface(s) found: waking up resolver
Aug 20 02:13:40 ctdb1 mysql(db-mysql)[11224]: ERROR: check_slave invoked on an 
instance that is not a replication slave.
Aug 20 02:13:47 ctdb1 mysql(db-mysql)[11263]: ERROR: check_slave invoked on an 
instance that is not a replication slave.

And from this time on, the last two lines repeat every 7 seconds (the mysql 
monitor interval).


The expected behavior was that the slave (RO) VIP should have been moved to the 
master, as the secondary db was outdated.
Unfortunately I cannot recall what crm_mon was showing when the issue was 
present, but I am sure that the RA did not handle the situation properly.

Placing the slave node into standby and then back online resolved the issue 
immediately (the slave started to sync, and in a few minutes it caught up with 
the master).
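
For the record, that amounted to something like the following (crm shell
shown; the node name is only illustrative):

crm node standby ctdb1
# after the slave caught up with the master:
crm node online ctdb1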


Here is the relevant part of the configuration:


primitive db-ip-master ocf:heartbeat:IPaddr2 \
params lvs_support="true" ip="XXX" cidr_netmask="24" 
broadcast="XXX" \
op start interval="0" timeout="20s"