Re: [ClusterLabs] Strange monitor return code log for LSB resource

2019-07-08 Thread Jan Pokorný
On 05/07/19 03:50 +, Harvey Shepherd wrote:
> I was able to resolve an issue, which caused these logs to disappear.
> The problem was that the LSB script was named "logging" and the
> daemon that it controlled was also called "logging". The init script
> uses "start-stop-daemon --name" to start and stop the logging
> daemon, but "start-stop-daemon --stop --name logging" was causing
> both the init script and daemon to be terminated, and hence an error
> code would be returned for the stop operation.

Oh, that's nasty, indeed.  Incidentally, I was just recently reminding
folks on the developers' counterpart list that /proc-based traversals
are rather treacherous, for multiple reasons:
https://lists.clusterlabs.org/pipermail/developers/2019-July/002207.html

You've provided another example of how relatively easy it is for
people to shoot themselves in the foot when a process is specified
imprecisely, vaguely, or in alias-prone ways.  It's something you
want to avoid wherever practically feasible.
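
One way to sidestep name-based matching entirely is to stop the daemon
by a recorded PID, i.e. the equivalent of "start-stop-daemon --stop
--pidfile".  A minimal runnable sketch of the idea (paths and names are
hypothetical, not taken from the original init script):

```shell
#!/bin/sh
# Sketch (hypothetical paths): stop a daemon by its recorded PID instead of
# by name, so the controlling init script can never be matched and killed --
# the equivalent of "start-stop-daemon --stop --pidfile $PIDFILE".
PIDFILE=$(mktemp)

# stand-in "daemon": a background sleep whose PID we record at start time
sleep 300 &
echo $! > "$PIDFILE"

stop_daemon() {
    # signal exactly the recorded PID -- no name matching involved
    kill "$(cat "$PIDFILE")" 2>/dev/null
}

stop_daemon
wait >/dev/null 2>&1 || true   # reap the terminated child
rm -f "$PIDFILE"

# the controlling script is still alive to report the result
SURVIVED="yes"
echo "script survived: $SURVIVED"
```

With name-based matching ("--stop --name logging") the script above would
have killed itself along with the daemon; with the pidfile it survives to
return a proper exit code for the stop operation.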

> I have no idea why crm_resource logs were produced, or why the logs
> seemed to indicate that it was treating an LSB resource as OCF
> (expected return code of 7 - not running), but the logs were no
> longer generated once I renamed the init script.

This deserves a little correction here: "not running" is not exclusive
to the OCF class of resources -- it's defined also for LSB init scripts
as such, although admittedly not (officially) for the "status" operation
(e.g.
https://refspecs.linuxfoundation.org/LSB_3.1.0/LSB-Core-generic/LSB-Core-generic/iniscrptact.html
), so the fact that pacemaker tolerantly interprets it at all may be
solely to get along with buggy (non-compliant) LSB init scripts, such as
the one that "crm_resource" was trying to monitor in your logs :-p
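
For orientation, here is a simplified sketch of how an LSB "status" exit
code gets translated into the OCF monitor result seen in the logs.  The
mapping below is my own summary, not Pacemaker's actual source -- consult
the Pacemaker services code for the authoritative table:

```shell
# Simplified sketch: map an LSB "status" exit code onto the OCF monitor
# code that Pacemaker logs.  Assumed mapping, not Pacemaker's real code.
lsb_status_to_ocf() {
    case "$1" in
        0)     echo 0 ;;  # service running         -> OCF_SUCCESS (0)
        1|2|3) echo 7 ;;  # dead / not running      -> OCF_NOT_RUNNING (7)
        *)     echo 1 ;;  # anything else           -> generic error
    esac
}

lsb_status_to_ocf 3   # prints 7, i.e. "not running"
```

So a "not running" (7) in the logs is perfectly possible for an LSB
resource; it does not by itself mean the resource was treated as OCF.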

Anyway, it's good that you got rid of said phantom executions.
Should the problem recur, you can use the described trick to nail it down.

> I've not seen them since. There were definitely no external
> monitoring tools installed or in use (it's a very isolated system
> with very few resources installed), although I was using crm_monitor
> at the time. Does that run crm_resource under the hood?

AFAIK, there's no connection like that.

-- 
Jan (Poki)


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Strange monitor return code log for LSB resource

2019-07-04 Thread Harvey Shepherd
Thanks Jan for taking the time to provide such a detailed response. I was able 
to resolve an issue, which caused these logs to disappear. The problem was that 
the LSB script was named "logging" and the daemon that it controlled was also 
called "logging". The init script uses "start-stop-daemon --name" to start and 
stop the logging daemon, but "start-stop-daemon --stop --name logging" was 
causing both the init script and daemon to be terminated, and hence an error 
code would be returned for the stop operation.

I have no idea why crm_resource logs were produced, or why the logs seemed to 
indicate that it was treating an LSB resource as OCF (expected return code 
of 7 - not running), but the logs were no longer generated once I renamed the 
init script. I've not seen them since. There were definitely no external 
monitoring tools installed or in use (it's a very isolated system with very few 
resources installed), although I was using crm_monitor at the time. Does that 
run crm_resource under the hood?

Regards,
Harvey


From: Harvey Shepherd
Sent: Wednesday, 26 June 2019 9:26 a.m.
To: users@clusterlabs.org
Subject: Re: EXTERNAL: Re: [ClusterLabs] Strange monitor return code log for 
LSB resource

Thanks for your reply Andrei. There is no external monitoring software running. 
The logs that I posted are from the pacemaker log with debug enabled. The Linux 
hostname of the node is "ctr_qemu", but the Pacemaker name is "primary". It is 
actually the same node.

I've just run the same test again and interestingly this time the source of the 
monitor logs seems to be shared between "crm_resource" and 
"pacemaker-schedulerd":

Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (recurring_action_timer)
debug: Scheduling another invocation of logging_status_1
Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (operation_finished)
debug: logging_status_1:1181 - exited with rc=0
Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (operation_finished)
debug: logging_status_1:1181:stdout [ Remote syslog service is running ]
Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (log_finished)  debug: 
finished - rsc:logging action:monitor call_id:127 pid:1181 exit-code:0 
exec-time:0ms queue-time:0ms
Jun 25 21:18:35 ctr_qemu crm_resource[1268] (determine_op_status)   
debug: logging_monitor_1 on primary returned 'not running' (7) instead of 
the expected value: 'ok' (0)
Jun 25 21:18:35 ctr_qemu crm_resource[1268] (unpack_rsc_op_failure) 
warning: Processing failed monitor of logging on primary: not running | rc=7
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (determine_op_status)   
debug: logging_monitor_1 on primary returned 'not running' (7) instead of 
the expected value: 'ok' (0)
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (unpack_rsc_op_failure) 
warning: Processing failed monitor of logging on primary: not running | rc=7
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_print)  info: 
logging   (lsb:logging):  Started primary
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_apply_stickiness)   
debug: Resource logging: preferring current location (node=primary, weight=1)
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (native_assign_node)
debug: Assigning primary to logging
Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (LogActions)info: 
Leave   logging   (Started primary)

Regards,
Harvey


From: Users  on behalf of Andrei Borzenkov 

Sent: Wednesday, 26 June 2019 4:47 a.m.
To: users@clusterlabs.org
Subject: EXTERNAL: Re: [ClusterLabs] Strange monitor return code log for LSB 
resource

On 25.06.2019 16:53, Harvey Shepherd wrote:
> Hi All,
>
>
> I have a 2 node cluster running under Pacemaker 2.0.2, with around 20 
> resources configured, the majority of which are LSB resources, but there are 
> also a few OCF ones. One of the LSB resources is controlled via an init 
> script called "logging" and runs only on the master node. The CIB 
> configuration for it is as follows:
>
>
> [XML configuration stripped by the list archive]
>
>
> There is a global setting which sets the default timeout:
>
>
> [XML configuration stripped by the list archive]
>
>
> All of the other LSB resources are configured in the same way, and seem to 
> work correctly, but for some reason I see the following recurring logs for 
> the logging resource:
>
>
> Jun 25 13:16:22 ctr_qemu pacemaker-execd [1234] (recurring_action_timer)  
>   debug: Scheduling another invocation of logging_status_1
> Jun 25 13:16:22 ctr_qemu pacemaker-execd [1234] (operation_finished)  
>   debug: logging_status_1

Re: [ClusterLabs] Strange monitor return code log for LSB resource

2019-07-04 Thread Jan Pokorný
Harvey,

On 25/06/19 21:26 +, Harvey Shepherd wrote:
> Thanks for your reply Andrei. There is no external monitoring
> software running. The logs that I posted are from the pacemaker log
> with debug enabled.

that's exactly what's questionable here -- how can a crm_resource
invocation be correlated with the standard operation of the pacemaker
cluster at all, if, as you say, you have no clue what else (such as
external monitoring built upon crm_resource or, by extension, upon
crm/pcs) could be triggering it?  That's where Andrei was zeroing in.

Anyway, to spare you a possibly heavy-handed inspection of circumstances
in your cluster, I'd suggest you install systemtap and run this command
as root:

  stap /usr/share/systemtap/examples/process/pstrace_exec.stp /logging

where pstrace_exec.stp can also be obtained from
https://sourceware.org/systemtap/examples/process/pstrace_exec.stp 
if your installation (systemtap-client package in my case) doesn't
carry it.

While that runs, you'll likely observe occurrences where the "logging"
script is run directly by pacemaker-execd, plus some process chains
ending with crm_resource; the latter should suggest why the script is
being run at all (where it likely should not be -- see also the remark
on re-entrancy below).

> The Linux hostname of the node is "ctr_qemu", but the Pacemaker name
> is "primary". It is actually the same node.
> 
> I've just run the same test again and interestingly this time the
> source of the monitor logs seems to be shared between "crm_resource"
> and "pacemaker-schedulerd"
> 
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (recurring_action_timer)  
>   debug: Scheduling another invocation of logging_status_1
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (operation_finished)  
>   debug: logging_status_1:1181 - exited with rc=0
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (operation_finished)  
>   debug: logging_status_1:1181:stdout [ Remote syslog service is running ]
> Jun 25 21:18:31 ctr_qemu pacemaker-execd [1234] (log_finished)  
> debug: finished - rsc:logging action:monitor call_id:127 pid:1181 exit-code:0 
> exec-time:0ms queue-time:0ms
> Jun 25 21:18:35 ctr_qemu crm_resource[1268] (determine_op_status) 
>   debug: logging_monitor_1 on primary returned 'not running' (7) instead 
> of the expected value: 'ok' (0)
> Jun 25 21:18:35 ctr_qemu crm_resource[1268] (unpack_rsc_op_failure)   
>   warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (determine_op_status) 
>   debug: logging_monitor_1 on primary returned 'not running' (7) instead 
> of the expected value: 'ok' (0)
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (unpack_rsc_op_failure)   
>   warning: Processing failed monitor of logging on primary: not running | rc=7
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_print)  info: 
> logging   (lsb:logging):  Started primary
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (common_apply_stickiness) 
>   debug: Resource logging: preferring current location (node=primary, 
> weight=1)
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (native_assign_node)  
>   debug: Assigning primary to logging
> Jun 25 21:18:36 ctr_qemu pacemaker-schedulerd[1236] (LogActions)info: 
> Leave   logging   (Started primary)

another thing, actually:

> On 25.06.2019 16:53, Harvey Shepherd wrote:
>> Pacemaker is reporting failed resource actions, but fail-count is not 
>> incremented for the resource:
>> 
>> Migration Summary:
>> * Node primary:
>> * Node secondary:
>> 
>> Failed Resource Actions:
>> * logging_monitor_1 on primary 'not running' (7): call=119, 
>> status=complete, exitreason='',
>> last-rc-change='Tue Jun 25 13:13:12 2019', queued=0ms, exec=0ms

One has to wonder whether this can only be caused by what _exclusively_
pacemaker-execd observed (but that could be a result of interference
with a crm_resource-initiated operation ... with the --force-check
parameter, or what else?); sadly, the log didn't contain the referenced
time of "last-rc-change".

Finally, speaking of interferences, this topic has been brought up
recently, regarding re-entrancy in particular:

https://lists.clusterlabs.org/pipermail/users/2019-February/025454.html

Apparently, the axioms of agent/resource execution that hold with
pacemaker alone won't hold when there are multiple actors, as seems to
be your case (pacemaker vs. an isolated crm_resource-based trigger).


Hope this helps

-- 
Jan (Poki)



[ClusterLabs] Strange monitor return code log for LSB resource

2019-06-25 Thread Harvey Shepherd
Hi All,


I have a 2-node cluster running under Pacemaker 2.0.2, with around 20 resources 
configured, the majority of which are LSB resources, but there are also a few 
OCF ones. One of the LSB resources is controlled via an init script called 
"logging" and runs only on the master node. The CIB configuration for it is as 
follows:


[XML configuration stripped by the list archive]


There is a global setting which sets the default timeout:


[XML configuration stripped by the list archive]


All of the other LSB resources are configured in the same way, and seem to work 
correctly, but for some reason I see the following recurring logs for the 
logging resource:


Jun 25 13:16:22 ctr_qemu pacemaker-execd [1234] (recurring_action_timer)
debug: Scheduling another invocation of logging_status_1
Jun 25 13:16:22 ctr_qemu pacemaker-execd [1234] (operation_finished)
debug: logging_status_1:8423 - exited with rc=0
Jun 25 13:16:22 ctr_qemu pacemaker-execd [1234] (operation_finished)
debug: logging_status_1:8423:stdout [ Remote syslog service is running ]
Jun 25 13:16:22 ctr_qemu pacemaker-execd [1234] (log_finished)  debug: 
finished - rsc:logging action:monitor call_id:123 pid:8423 exit-code:0 
exec-time:0ms queue-time:0ms
Jun 25 13:16:23 ctr_qemu crm_resource[8436] (determine_op_status)   
debug: logging_monitor_1 on primary returned 'not running' (7) instead of 
the expected value: 'ok' (0)
Jun 25 13:16:23 ctr_qemu crm_resource[8436] (unpack_rsc_op_failure) 
warning: Processing failed monitor of logging on primary: not running | rc=7
Jun 25 13:16:30 ctr_qemu crm_resource[8571] (determine_op_status)   
debug: logging_monitor_1 on primary returned 'not running' (7) instead of 
the expected value: 'ok' (0)
Jun 25 13:16:30 ctr_qemu crm_resource[8571] (unpack_rsc_op_failure) 
warning: Processing failed monitor of logging on primary: not running | rc=7
Jun 25 13:16:32 ctr_qemu pacemaker-execd [1234] (recurring_action_timer)
debug: Scheduling another invocation of logging_status_1
Jun 25 13:16:32 ctr_qemu pacemaker-execd [1234] (operation_finished)
debug: logging_status_1:8670 - exited with rc=0
Jun 25 13:16:32 ctr_qemu pacemaker-execd [1234] (operation_finished)
debug: logging_status_1:8670:stdout [ Remote syslog service is running ]
Jun 25 13:16:32 ctr_qemu pacemaker-execd [1234] (log_finished)  debug: 
finished - rsc:logging action:monitor call_id:123 pid:8670 exit-code:0 
exec-time:0ms queue-time:0ms
Jun 25 13:16:33 ctr_qemu crm_resource[8683] (determine_op_status)   
debug: logging_monitor_1 on primary returned 'not running' (7) instead of 
the expected value: 'ok' (0)
Jun 25 13:16:33 ctr_qemu crm_resource[8683] (unpack_rsc_op_failure) 
warning: Processing failed monitor of logging on primary: not running | rc=7
Jun 25 13:16:40 ctr_qemu crm_resource[8818] (determine_op_status)   
debug: logging_monitor_1 on primary returned 'not running' (7) instead of 
the expected value: 'ok' (0)
Jun 25 13:16:40 ctr_qemu crm_resource[8818] (unpack_rsc_op_failure) 
warning: Processing failed monitor of logging on primary: not running | rc=7

Pacemaker is reporting failed resource actions, but fail-count is not 
incremented for the resource:


Migration Summary:
* Node primary:
* Node secondary:

Failed Resource Actions:
* logging_monitor_1 on primary 'not running' (7): call=119, 
status=complete, exitreason='',
last-rc-change='Tue Jun 25 13:13:12 2019', queued=0ms, exec=0ms


I have checked the operation of the LSB script manually, and it always 
correctly exits with a return code of 0, indicating that the resource is 
running. So my questions are:


1. Why does Pacemaker seem to be running a monitor operation in parallel with a 
status operation, with conflicting results? A monitor operation returning 7 
"not running" would only make sense for an OCF resource, but it is clearly 
defined as LSB in the CIB.

2. Why does the status operation always return 0 (running) while the monitor 
operation always returns 7 (not running)?

3. Why is fail-count not being incremented even though failures are being 
logged?


I would really appreciate any pointers that anyone could give me. Perhaps I've 
made an error in the configuration.



Thanks,

Harvey Shepherd