Re: [ClusterLabs] clone resource not get restarted on fail

2017-02-13 Thread Ken Gaillot
On 02/13/2017 07:57 AM, he.hailo...@zte.com.cn wrote:
> Pacemaker 1.1.10
> 
> Corosync 2.3.3
> 
> 
> This is a 3-node cluster configured with 3 clone resources, each
> attached with a VIP resource of IPaddr2:
> 
> 
> >crm status
> 
> 
> Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> 
> 
>  router_vip (ocf::heartbeat:IPaddr2):   Started paas-controller-1 
> 
>  sdclient_vip   (ocf::heartbeat:IPaddr2):   Started paas-controller-3 
> 
>  apigateway_vip (ocf::heartbeat:IPaddr2):   Started paas-controller-2 
> 
>  Clone Set: sdclient_rep [sdclient]
> 
>  Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> 
>  Clone Set: router_rep [router]
> 
>  Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> 
>  Clone Set: apigateway_rep [apigateway]
> 
>  Started: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> 
> 
> It is observed that sometimes a clone resource gets stuck at the
> monitor stage when its service fails:
> 
> 
>  router_vip (ocf::heartbeat:IPaddr2):   Started paas-controller-1 
> 
>  sdclient_vip   (ocf::heartbeat:IPaddr2):   Started paas-controller-2 
> 
>  apigateway_vip (ocf::heartbeat:IPaddr2):   Started paas-controller-3 
> 
>  Clone Set: sdclient_rep [sdclient]
> 
>  Started: [ paas-controller-1 paas-controller-2 ]
> 
>  Stopped: [ paas-controller-3 ]
> 
>  Clone Set: router_rep [router]
> 
>  router (ocf::heartbeat:router):    Started paas-controller-3 FAILED 
> 
>  Started: [ paas-controller-1 paas-controller-2 ]
> 
>  Clone Set: apigateway_rep [apigateway]
> 
>  apigateway (ocf::heartbeat:apigateway):    Started paas-controller-3 FAILED 
> 
>  Started: [ paas-controller-1 paas-controller-2 ]
> 
> 
> In the example above, sdclient_rep gets restarted on node 3, while
> the other two hang at monitoring on node 3. Here are the OCF logs:
> 
> 
> abnormal (apigateway_rep):
> 
> 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main===
> Starting health check.
> 
> 2017-02-13 18:27:53 [23586]===print_log test_monitor run_func main===
> health check succeed.
> 
> 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main===
> Starting health check.
> 
> 2017-02-13 18:27:55 [24010]===print_log test_monitor run_func main===
> Failed: docker daemon is not running.
> 
> 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main===
> Starting health check.
> 
> 2017-02-13 18:27:57 [24095]===print_log test_monitor run_func main===
> Failed: docker daemon is not running.
> 
> 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main===
> Starting health check.
> 
> 2017-02-13 18:27:59 [24159]===print_log test_monitor run_func main===
> Failed: docker daemon is not running.
> 
> 
> normal (sdclient_rep):
> 
> 2017-02-13 18:27:52 [23507]===print_log sdclient_monitor run_func
> main=== health check succeed.
> 
> 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func
> main=== Starting health check.
> 
> 2017-02-13 18:27:54 [23630]===print_log sdclient_monitor run_func
> main=== Failed: docker daemon is not running.
> 
> 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main===
> Starting stop the container.
> 
> 2017-02-13 18:27:55 [23710]===print_log sdclient_stop run_func main===
> docker daemon lost, pretend stop succeed.
> 
> 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main===
> Starting run the container.
> 
> 2017-02-13 18:27:55 [23763]===print_log sdclient_start run_func main===
> docker daemon lost, try again in 5 secs.
> 
> 2017-02-13 18:28:00 [23763]===print_log sdclient_start run_func main===
> docker daemon lost, try again in 5 secs.
> 
> 2017-02-13 18:28:05 [23763]===print_log sdclient_start run_func main===
> docker daemon lost, try again in 5 secs.
> 
> 
> If I disable two of the clone resources, the switchover test for the
> remaining clone resource works as expected: fail the service ->
> monitor fails -> stop -> start
> 
> 
> Online: [ paas-controller-1 paas-controller-2 paas-controller-3 ]
> 
> 
>  sdclient_vip   (ocf::heartbeat:IPaddr2):   Started paas-controller-2 
> 
>  Clone Set: sdclient_rep [sdclient]
> 
>  Started: [ paas-controller-1 paas-controller-2 ]
> 
>  Stopped: [ paas-controller-3 ]
> 
> 
> What's the reason behind this?
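The recovery sequence described above (monitor fails -> stop -> start) depends on the monitor action reporting the failure with the right OCF exit code. A minimal sketch of such a monitor action in POSIX shell, assuming a hypothetical docker-based agent and a container named sdclient (both are illustrative, not the poster's actual agent):

```shell
# OCF standard exit codes (assumed values per the OCF spec)
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

sdclient_monitor() {
    # If the docker daemon itself is down, the service cannot be running:
    # report "not running" so Pacemaker can schedule the restart.
    if ! docker info >/dev/null 2>&1; then
        return "$OCF_NOT_RUNNING"
    fi
    # Otherwise check the container state (container name is hypothetical).
    state=$(docker inspect -f '{{.State.Running}}' sdclient 2>/dev/null)
    if [ "$state" = "true" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}
```

A monitor that hangs or exits with an unexpected code instead of returning promptly is one way a resource can appear "stuck at monitoring" in crm status.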

Can you show the configuration of the three clones, their operations,
and any constraints?

Normally, the response is controlled by the monitor operation's on-fail
attribute (which defaults to restart).
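For reference, an explicit on-fail setting on the monitor operation would look something like this in crm shell syntax (resource and clone names are taken from the thread; the agent, intervals, and timeouts are placeholder assumptions):

```
primitive sdclient ocf:heartbeat:sdclient \
    op monitor interval=10s timeout=30s on-fail=restart \
    op start timeout=60s interval=0 \
    op stop timeout=60s interval=0
clone sdclient_rep sdclient \
    meta clone-max=3 clone-node-max=1
```

With on-fail=restart (the default), a failed monitor should trigger stop followed by start on the same node, which matches the behavior seen for sdclient_rep but not for the other two clones.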


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
