Re: [ClusterLabs] Failover When Host is Up, Out of Order Logs

2017-02-01 Thread Ken Gaillot
On 01/31/2017 11:44 AM, Corey Moullas wrote:
> I have been getting extremely strange behavior from a Corosync/Pacemaker
> install on OVH Public Cloud servers.
> 
>  
> 
> After hours of Googling, I thought I would try posting here to see if
> somebody knows what to do.
> 
>  
> 
> I see this in my logs very frequently:
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync warning [MAIN  ] Corosync main
> process was not scheduled for 24334.5645 ms (threshold is 2400. ms).
> Consider token timeout increase.
> 
> Jan 31 10:31:36 [581] fusion01-2 corosync notice  [TOTEM ] A processor
> failed, forming new configuration.
> 
>  
> 
> I have increased token time to 10s and this still occurs regularly even
> though both hosts are always up.

The "was not scheduled" message means the kernel is not giving corosync
enough CPU time to keep the token alive. The message indicates that it
didn't get scheduled for 24 seconds, which is why 10 seconds wouldn't
help -- but simply raising the timeout isn't a good idea at that point.
You need to figure out why you're starved for CPU. Public clouds don't
generally provide any guarantees, so it may be on their end.
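One way to check whether the hypervisor is the culprit is to look at "steal" time, the CPU time the host withheld from the guest. A rough sketch, reading the aggregate `cpu` line of Linux's /proc/stat (field order per proc(5); the sampling interval is an arbitrary choice, and this is illustrative code, not part of corosync):

```python
import time

def parse_cpu_line(line):
    """Return (steal_ticks, total_ticks) from an aggregate 'cpu' line."""
    fields = [int(v) for v in line.split()[1:]]
    steal = fields[7] if len(fields) > 7 else 0  # 8th value is steal time
    return steal, sum(fields)

def steal_percent(interval=30):
    """Sample /proc/stat twice and return steal time as a percentage."""
    with open("/proc/stat") as f:
        s1, t1 = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open("/proc/stat") as f:
        s2, t2 = parse_cpu_line(f.readline())
    return 100.0 * (s2 - s1) / max(t2 - t1, 1)
```

Calling `steal_percent(30)` during one of these episodes and seeing a large value would point at the provider's host, not your configuration. `vmstat 1` or `top` (the "st" column) shows the same number without any code.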

> There are also times when the floating IP script is fired, but corosync
> does not seem aware. When you run crm status it will show the IP being
> bound to fusion01-1 when in fact the script fired and moved it to
> fusion01-2.
> 
>  
> 
> Finally, the timing of the logs seems very odd. This is what the end of
> my corosync log file looks like. Notice the times appear out of order in
> the logs. I’m ripping my hair out with these issues. Anybody have a clue
> what may be going on here?

Both pacemaker and corosync write to the same log (which is probably a
bad idea and will be changed one day, but that's not of concern here).
Each is using its own time source. If pacemaker is getting scheduled
more frequently than corosync, it's certainly possible log messages will
be out of order, since corosync isn't able to write out buffers as often.
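The effect is easy to reproduce with two buffered writers appending to one file: each stamps the time when the message is *created*, but the line only lands in the file on flush, so a starved writer emits older timestamps after its neighbour's newer ones. A toy model (nothing here is pacemaker or corosync code; the timestamps are invented):

```python
import os, tempfile

# Two block-buffered writers share one log file in append mode.
path = os.path.join(tempfile.mkdtemp(), "cluster.log")
w1 = open(path, "a", buffering=8192)
w2 = open(path, "a", buffering=8192)

w2.write("10:31:36 corosync: token warning\n")  # created first, stays buffered
w1.write("10:31:47 cib: cib_perform_op\n")
w1.flush()                                       # w1's newer line hits the file first
w2.flush()                                       # w2's older line lands after it

w1.close(); w2.close()
print(open(path).read())  # timestamps appear out of order
```

The file now reads 10:31:47 before 10:31:36, even though every message carried the correct time when it was generated.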

> 
>  
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_process_request:   Completed cib_modify operation for
> section status: OK (rc=0, origin=fusion01-1/crmd/245, version=0.81.123)
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:Diff: --- 0.81.123 2
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:Diff: +++ 0.81.124 (null)
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:-- /cib/status/node_state[@id='1']/lrm[@id='1']
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:+  /cib:  @num_updates=124
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_process_request:   Completed cib_delete operation for
> section //node_state[@uname='fusion01-1']/lrm: OK (rc=0,
> origin=fusion01-1/crmd/246, version=0.81.124)
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:Diff: --- 0.81.124 2
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:Diff: +++ 0.81.125 (null)
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:+  /cib:  @num_updates=125
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:+  /cib/status/node_state[@id='1']: 
> @crm-debug-origin=do_lrm_query_internal
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:++ /cib/status/node_state[@id='1']:  
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:++
> 
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:++  
> 
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:++
>  operation_key="FloatIP_start_0" operation="start"
> crm-debug-origin="build_active_RAs" crm_feature_set="3.0.10"
> transition-key="5:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> transition-magic="0:0;5:31:0:08c3f481-ccde-4f75-b1a7-acf8168cd0c1"
> on_node="fusion01-1" call-id="17" rc-code="0" op-status="0" interval="0"
> last-run="1485859189" last-rc-change="1485859189" e
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:++
>   
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:++  
> 
> 
> Jan 31 10:31:47 [21062] fusion01-2cib: info:
> cib_perform_op:++
>  operation="monitor" crm-debug-origin="build_active_RAs"
> crm_feature_set="3.0.10"
> transition-key="4:1:7:1fe20aa3-b305-4282-99a3-b1f8190d3c2c"
> transition-magic="0:0;4:1:7:1fe20aa3-b305-4282-99a3-b1f8190d3c2c"
> on_node="fusion01-1" call-id="9" rc-code="0" 

[ClusterLabs] Failover When Host is Up, Out of Order Logs

2017-01-31 Thread Corey Moullas
I have been getting extremely strange behavior from a Corosync/Pacemaker 
install on OVH Public Cloud servers.

After hours of Googling, I thought I would try posting here to see if somebody 
knows what to do.

I see this in my logs very frequently:
Jan 31 10:31:36 [581] fusion01-2 corosync warning [MAIN  ] Corosync main 
process was not scheduled for 24334.5645 ms (threshold is 2400. ms). 
Consider token timeout increase.
Jan 31 10:31:36 [581] fusion01-2 corosync notice  [TOTEM ] A processor failed, 
forming new configuration.

I have increased token time to 10s and this still occurs regularly even though 
both hosts are always up.

There are also times when the floating IP script is fired, but corosync does 
not seem aware. When you run crm status it will show the IP being bound to 
fusion01-1 when in fact the script fired and moved it to fusion01-2.

Finally, the timing of the logs seems very odd. This is what the end of my 
corosync log file looks like. Notice the times appear out of order in the logs. 
I'm ripping my hair out with these issues. Anybody have a clue what may be 
going on here?

Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_process_request:   
Completed cib_modify operation for section status: OK (rc=0, 
origin=fusion01-1/crmd/245, version=0.81.123)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
Diff: --- 0.81.123 2
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
Diff: +++ 0.81.124 (null)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
-- /cib/status/node_state[@id='1']/lrm[@id='1']
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
+  /cib:  @num_updates=124
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_process_request:   
Completed cib_delete operation for section 
//node_state[@uname='fusion01-1']/lrm: OK (rc=0, origin=fusion01-1/crmd/246, 
version=0.81.124)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
Diff: --- 0.81.124 2
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
Diff: +++ 0.81.125 (null)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
+  /cib:  @num_updates=125
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
+  /cib/status/node_state[@id='1']:  @crm-debug-origin=do_lrm_query_internal
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++ /cib/status/node_state[@id='1']:  
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++ 
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++   
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++ 
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++   
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++ 
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++ 
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
++   
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_process_request:   
Completed cib_modify operation for section status: OK (rc=0, 
origin=fusion01-1/crmd/247, version=0.81.125)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_process_request:   
Completed cib_modify operation for section nodes: OK (rc=0, 
origin=fusion01-1/crmd/250, version=0.81.125)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
Diff: --- 0.81.125 2
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
Diff: +++ 0.81.126 (null)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
+  /cib:  @num_updates=126
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
+  /cib/status/node_state[@id='2']:  @crm-debug-origin=do_state_transition
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_perform_op:
+  /cib/status/node_state[@id='1']:  @crm-debug-origin=do_state_transition
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_process_request:   
Completed cib_modify operation for section status: OK (rc=0, 
origin=fusion01-1/crmd/251, version=0.81.126)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_process_request:   
Completed cib_modify operation for section cib: OK (rc=0, 
origin=fusion01-1/crmd/252, version=0.81.126)
Jan 31 10:31:47 [21062] fusion01-2cib: info: cib_process_request:   
Completed cib_modify