Pacemaker 1.1.10
Corosync 2.3.3
This is a three-node cluster configured with three clone resources, each with an attached VIP resource of type IPaddr2:
>crm status
Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
 router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
 sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
 apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
 Clone Set: sdclient_rep [sdclient]
     Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
 Clone Set: router_rep [router]
     Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
 Clone Set: apigateway_rep [apigateway]
     Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
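For context, a configuration along these lines would produce the status above. This is a minimal sketch in crmsh syntax; only the resource names and agent types come from the output, while the IP address, netmask, intervals and timeouts are invented for illustration (the router and apigateway pairs would look the same):

crm configure primitive sdclient ocf:heartbeat:sdclient \
        op monitor interval=2s timeout=20s
crm configure clone sdclient_rep sdclient
crm configure primitive sdclient_vip ocf:heartbeat:IPaddr2 \
        params ip=192.168.0.11 cidr_netmask=24 \
        op monitor interval=10s
crm configure colocation sdclient_vip_with_rep inf: sdclient_vip sdclient_rep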
It is observed that sometimes a clone resource gets stuck at the monitor action when its service fails:
 router_vip (ocf::heartbeat:IPaddr2): Started paas-controller-1
 sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
 apigateway_vip (ocf::heartbeat:IPaddr2): Started paas-controller-3
 Clone Set: sdclient_rep [sdclient]
     Started: [ paas-controller-1 paas-controller-2 ]
     Stopped: [ paas-controller-3 ]
 Clone Set: router_rep [router]
     router (ocf::heartbeat:router): Started paas-controller-3 FAILED
     Started: [ paas-controller-1 paas-controller-2 ]
 Clone Set: apigateway_rep [apigateway]
     apigateway (ocf::heartbeat:apigateway): Started paas-controller-3 FAILED
     Started: [ paas-controller-1 paas-controller-2 ]
In the example above, sdclient_rep gets restarted on node 3, while the other two hang at the monitor action on node 3. Here are the OCF agent logs:
Abnormal (apigateway_rep):
2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main=== Starting health check.
2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main=== health check succeed.
2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main=== Starting health check.
2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main=== Failed: docker daemon is not running.
2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main=== Starting health check.
2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main=== Failed: docker daemon is not running.
2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main=== Starting health check.
2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main=== Failed: docker daemon is not running.
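In the abnormal case, each monitor run is a fresh process (note the changing PIDs) and it does detect the dead docker daemon, yet no stop ever follows. A minimal sketch of what such a monitor action presumably looks like; the function and message names are taken from the log markers above, and the actual health check is an assumption since the agent code is not shown here:

#!/bin/sh
# Pull in the OCF return-code definitions (OCF_SUCCESS, OCF_ERR_GENERIC, ...)
. ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

test_monitor() {
    print_log "Starting health check."    # print_log: the agent's own logger
    # Assumed check; the log only says "docker daemon is not running"
    if ! docker info >/dev/null 2>&1; then
        print_log "Failed: docker daemon is not running."
        return $OCF_ERR_GENERIC           # should trigger stop -> start recovery
    fi
    print_log "health check succeed."
    return $OCF_SUCCESS
}

With the default on-fail=restart for monitor operations, a non-success return code here should make Pacemaker schedule stop and then start on that node, which is exactly what the normal log below shows for sdclient_rep.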
Normal (sdclient_rep):
2017-02-13 18:27:52 [23507]===print_log sdclient_monitor run_func main=== health check succeed.
2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func main=== Starting health check.
2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func main=== Failed: docker daemon is not running.
2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main=== Starting stop the container.
2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main=== docker daemon lost, pretend stop succeed.
2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main=== Starting run the container.
2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main=== docker daemon lost, try again in 5 secs.
2017-02-13 18:28:00 [23763]===print_log sdclient_start run_func main=== docker daemon lost, try again in 5 secs.
2017-02-13 18:28:05 [23763]===print_log sdclient_start run_func main=== docker daemon lost, try again in 5 secs.
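Judging from the last three lines, the start action waits in a loop for the docker daemon to come back. Reconstructed as a sketch, this is an assumption based on the repeated message, not the actual agent code, and the container name is hypothetical:

sdclient_start() {
    print_log "Starting run the container."
    # Assumed retry loop; the log repeats this message every 5 seconds
    while ! docker info >/dev/null 2>&1; do
        print_log "docker daemon lost, try again in 5 secs."
        sleep 5
    done
    docker start sdclient || return $OCF_ERR_GENERIC    # container name assumed
    return $OCF_SUCCESS
}

If start really blocks like this, the operation cannot return until the daemon recovers or the start timeout expires, so the recovery transition for sdclient_rep stays pending for as long as docker is down.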
If I disable two of the clone resources (see the sketch after the status output below), the switchover test for the remaining clone resource works as expected: fail the service -> monitor fails -> stop -> start
Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
 sdclient_vip (ocf::heartbeat:IPaddr2): Started paas-controller-2
 Clone Set: sdclient_rep [sdclient]
     Started: [ paas-controller-1 paas-controller-2 ]
     Stopped: [ paas-controller-3 ]
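For reference, "disable" here means stopping the other two clones; with crmsh that would be done roughly as follows (resource names from the status output, assuming this is how the test was performed):

crm resource stop router_rep
crm resource stop apigateway_rep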
What is the reason behind this?