Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-02-05 Thread Bogdan Dobrelya
On 04.02.2016 15:43, Bogdan Dobrelya wrote:
> Hello.
> Regarding the original issue, the good news is that the resource-agents
> ocf-shellfuncs no longer causes fork bombs in the dummy OCF RA [0]
> after the fix [1]. The bad news is that the "self-forking" monitors
> issue seems to remain for the rabbitmq OCF RA [2], and I can reproduce
> it with another custom agent [3], so I'd guess it may be valid for
> other ones as well.
> 
> IIUC, the issue seems related to how lrmd forks monitor actions.
> I tried to debug both pacemaker 1.1.10 and 1.1.12 with gdb as follows:
> 
> # cat ./cmds
> set follow-fork-mode child
> set detach-on-fork off
> set follow-exec-mode new
> catch fork
> catch vfork
> cont
> # gdb -x cmds /usr/lib/pacemaker/lrmd `pgrep lrmd`
> 
> I can confirm it catches forked monitors and nested forks as well.
> But I have *many* debug symbols missing, bt is full of question marks
> and, honestly, I'm not a gdb guru and do not know what to check for in
> the reproduced cases.
>
> So any help with troubleshooting things further is very much appreciated!

I figured out this is expected behaviour. There are no fork bombs left,
just the usual fork & exec syscalls each time the OCF RA calls a shell
command or the ocf_run/ocf_log functions. Those false "self-forks" are
nothing more than a transient state between the fork and exec calls, when
the command line of the child process has yet to be updated... So I believe
the problem was completely solved by the aforementioned patch.
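
For the record, a forked child carries its parent's command line until it
execs something, which is easy to see with a throw-away script (a
hypothetical example, not part of the dummy RA):

# cat ./fork-caption-demo
#!/bin/sh
# The subshell below is a fork of this script that never execs an external
# binary, so ps lists it under this script's own command line -- it looks
# like a "self-fork" even though nothing recursive is going on.
( sleep 5; : ) &   # the trailing ':' keeps the shell from exec'ing sleep directly
ps -Hf -p $$ --ppid $$
wait
# chmod +x ./fork-caption-demo; ./fork-caption-demo

The subshell shows up as a second "/bin/sh ./fork-caption-demo" process,
exactly like the nested "/bin/sh .../dummy monitor" entries quoted below.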

> 
> [0] https://github.com/bogdando/dummy-ocf-ra
> [1] https://github.com/ClusterLabs/resource-agents/issues/734
> [2]
> https://github.com/rabbitmq/rabbitmq-server/blob/master/scripts/rabbitmq-server-ha.ocf
> [3]
> https://git.openstack.org/cgit/openstack/fuel-library/tree/files/fuel-ha-utils/ocf/ns_vrouter
> 
> On 04.01.2016 17:33, Bogdan Dobrelya wrote:
>> On 04.01.2016 17:14, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
 On 04.01.2016 16:36, Ken Gaillot wrote:
> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> [...]
>> Also note, that lrmd spawns *many* monitors like:
>> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
>> /usr/lib/pacemaker/lrmd
>> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
>> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> ...
>
> At first glance, that looks like your monitor action is calling itself
> recursively, but I don't see how in your code.

 Yes, it should be a bug in the ocf-shellfuncs's ocf_log().
>>>
>>> If you're sure about that, please open an issue at
>>> https://github.com/ClusterLabs/resource-agents/issues
>>
>> Submitted [0]. Thank you!
>> Note that it seems the very act of importing (sourcing) ocf-shellfuncs
>> causes the issue, not the ocf_run or ocf_log code itself.
>>
>> [0] https://github.com/ClusterLabs/resource-agents/issues/734
>>
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
>>
>>
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-02-04 Thread Bogdan Dobrelya
Hello.
Regarding the original issue, the good news is that the resource-agents
ocf-shellfuncs no longer causes fork bombs in the dummy OCF RA [0]
after the fix [1]. The bad news is that the "self-forking" monitors
issue seems to remain for the rabbitmq OCF RA [2], and I can reproduce
it with another custom agent [3], so I'd guess it may be valid for
other ones as well.

IIUC, the issue seems related to how lrmd forks monitor actions.
I tried to debug both pacemaker 1.1.10 and 1.1.12 with gdb as follows:

# cat ./cmds
set follow-fork-mode child
set detach-on-fork off
set follow-exec-mode new
catch fork
catch vfork
cont
# gdb -x cmds /usr/lib/pacemaker/lrmd `pgrep lrmd`

I can confirm it catches forked monitors and nested forks as well.
But I have *many* debug symbols missing, bt is full of question marks
and, honestly, I'm not a gdb guru and do not know what to check for in
the reproduced cases.

So any help with troubleshooting things further is very much appreciated!
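
For the record, a slightly extended gdb command file along these lines (my
own sketch; it assumes the matching debug symbol packages are installed,
otherwise the backtraces stay full of question marks) would print a
backtrace and the child's command line at every caught fork:

# cat ./cmds-bt
set pagination off
set follow-fork-mode child
set detach-on-fork off
catch fork
commands
  bt
  info proc cmdline
  cont
end
cont
# gdb -x cmds-bt /usr/lib/pacemaker/lrmd `pgrep lrmd`

The command line printed right after each fork shows what the child is at
that instant, which helps tell a genuinely recursive monitor apart from a
shell child that simply has not exec'd anything yet.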

[0] https://github.com/bogdando/dummy-ocf-ra
[1] https://github.com/ClusterLabs/resource-agents/issues/734
[2]
https://github.com/rabbitmq/rabbitmq-server/blob/master/scripts/rabbitmq-server-ha.ocf
[3]
https://git.openstack.org/cgit/openstack/fuel-library/tree/files/fuel-ha-utils/ocf/ns_vrouter

On 04.01.2016 17:33, Bogdan Dobrelya wrote:
> On 04.01.2016 17:14, Dejan Muhamedagic wrote:
>> Hi,
>>
>> On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
>>> On 04.01.2016 16:36, Ken Gaillot wrote:
 On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>> [...]
> Also note, that lrmd spawns *many* monitors like:
> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> /usr/lib/pacemaker/lrmd
> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> ...

 At first glance, that looks like your monitor action is calling itself
 recursively, but I don't see how in your code.
>>>
>>> Yes, it should be a bug in the ocf-shellfuncs's ocf_log().
>>
>> If you're sure about that, please open an issue at
>> https://github.com/ClusterLabs/resource-agents/issues
> 
> Submitted [0]. Thank you!
> Note that it seems the very act of importing (sourcing) ocf-shellfuncs
> causes the issue, not the ocf_run or ocf_log code itself.
> 
> [0] https://github.com/ClusterLabs/resource-agents/issues/734
> 
>>
>> Thanks,
>>
>> Dejan
>>
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-06 Thread Keisuke MORI
Hi,

2016-01-06 22:57 GMT+09:00 Jan Pokorný :
> Hello ,
>
> On 04/01/16 17:33 +0100, Bogdan Dobrelya wrote:

>> Note, that it seems the very import action causes the issue, not the
>> ocf_run or ocf_log code itself.
>>
>> [0] https://github.com/ClusterLabs/resource-agents/issues/734
>
> Have to wonder if there is any correlation with the issue discussed
> recently:
>
> http://oss.clusterlabs.org/pipermail/users/2015-November/001806.html
>
> Note that ocf:pacemaker:ClusterMon resource also sources
> ocf-shellfuncs set of helper functions.

I think that it's unlikely to be relevant.

In this case, /bin/dash is apparently the cause of the fork bomb, due to a
bug in handling the PS4 shell variable, judging from the results of testing
on the github issue 734.

In the ClusterMon case you mentioned, crm_mon is the one forking repeatedly,
not the ClusterMon shell script, and it is not invoked via /bin/dash either.
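
As a quick sanity check (my own suggestion, with the dummy RA path used as
an example), one can confirm which shell actually interprets an RA on a
Debian/Ubuntu box:

# head -1 /usr/lib/ocf/resource.d/dummy/dummy
# readlink -f /bin/sh

If the shebang is #!/bin/sh and the second command prints /bin/dash (the
Debian/Ubuntu default), the RA runs under dash and is exposed to the PS4
handling issue above.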

Regards,
-- 
Keisuke MORI

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Ken Gaillot
On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>> So far so bad.
>> I made a dummy OCF script [0] to simulate an example
>> promote/demote/notify failure mode for a multistate clone resource which
>> is very similar to the one I reported originally. And the test to
>> reproduce my case with the dummy is:
>> - install dummy resource ocf ra and create the dummy resource as README
>> [0] says
>> - just watch the a) OCF logs from the dummy and b) outputs for the
>> reoccurring commands:
>>
>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>> sleep 20; done&
>> # crm_resource --resource p_dummy --list-operations
>>
>> At some point I noticed:
>> - there are no more "OK" messages logged from the monitor actions,
>> although according to the trace_ra dumps' timestamps, all monitors are
>> still being invoked!
>>
>> - at some point I noticed very strange results reported by the:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>> 4 14:33:07 2016, exec=62107ms): Timed Out
>>   or
>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
>> 14:43:58 2016, exec=0ms): Timed Out
>>
>> - according to the trace_ra dumps reoccurring monitors are being invoked
>> by the intervals *much longer* than configured. For example, a 7 minutes
>> of "monitoring silence":
>> Mon Jan  4 14:47:46 UTC 2016
>> p_dummy.monitor.2016-01-04.14:40:52
>> Mon Jan  4 14:48:06 UTC 2016
>> p_dummy.monitor.2016-01-04.14:47:58
>>
>> Given that said, it is very likely there is some bug exist for
>> monitoring multi-state clones in pacemaker!
>>
>> [0] https://github.com/bogdando/dummy-ocf-ra
>>
> 
> Also note, that lrmd spawns *many* monitors like:
> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> /usr/lib/pacemaker/lrmd
> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> ...

At first glance, that looks like your monitor action is calling itself
recursively, but I don't see how in your code.

> At some point, there was  already. Then I unmanaged the p_dummy but
> it grew up to the 2403 after that. The number of running monitors may
> grow or decrease as well.
> Also, the /var/lib/heartbeat/trace_ra/dummy/ still have been populated
> by new p_dummy.monitor* files with recent timestamps. Why?..
> 
> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
> instantly :) Unless the node became unresponsive at some point. And
> after restarted by power off:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
> last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
> or
> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
> last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out
> 
> And then lrmd repeats all of the fun again.
> 
> 


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 04.01.2016 16:36, Ken Gaillot wrote:
> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> So far so bad.
>>> I made a dummy OCF script [0] to simulate an example
>>> promote/demote/notify failure mode for a multistate clone resource which
>>> is very similar to the one I reported originally. And the test to
>>> reproduce my case with the dummy is:
>>> - install dummy resource ocf ra and create the dummy resource as README
>>> [0] says
>>> - just watch the a) OCF logs from the dummy and b) outputs for the
>>> reoccurring commands:
>>>
>>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>>> sleep 20; done&
>>> # crm_resource --resource p_dummy --list-operations
>>>
>>> At some point I noticed:
>>> - there are no more "OK" messages logged from the monitor actions,
>>> although according to the trace_ra dumps' timestamps, all monitors are
>>> still being invoked!
>>>
>>> - at some point I noticed very strange results reported by the:
>>> # crm_resource --resource p_dummy --list-operations
>>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>>> 4 14:33:07 2016, exec=62107ms): Timed Out
>>>   or
>>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
>>> 14:43:58 2016, exec=0ms): Timed Out
>>>
>>> - according to the trace_ra dumps reoccurring monitors are being invoked
>>> by the intervals *much longer* than configured. For example, a 7 minutes
>>> of "monitoring silence":
>>> Mon Jan  4 14:47:46 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:40:52
>>> Mon Jan  4 14:48:06 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:47:58
>>>
>>> Given that said, it is very likely there is some bug exist for
>>> monitoring multi-state clones in pacemaker!
>>>
>>> [0] https://github.com/bogdando/dummy-ocf-ra
>>>
>>
>> Also note, that lrmd spawns *many* monitors like:
>> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
>> /usr/lib/pacemaker/lrmd
>> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
>> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> ...
> 
> At first glance, that looks like your monitor action is calling itself
> recursively, but I don't see how in your code.

Yes, it should be a bug in ocf-shellfuncs' ocf_log().

If I replace it in the dummy RA with the following:
#. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
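# minimal stand-in that bypasses ocf-shellfuncs and just forwards messages to syslog: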
ocf_log() {
  logger $HA_LOGFACILITY -t $HA_LOGTAG "$@"
}

there is no such issue anymore, and I see the "It's OK" log messages
as expected.
Note: I used resource-agents 3.9.5+git+a626847-1
from [0].

[0] http://ftp.de.debian.org/debian/ experimental/main amd64 Packages

> 
>> At some point, there was  already. Then I unmanaged the p_dummy but
>> it grew up to the 2403 after that. The number of running monitors may
>> grow or decrease as well.
>> Also, the /var/lib/heartbeat/trace_ra/dummy/ still have been populated
>> by new p_dummy.monitor* files with recent timestamps. Why?..
>>
>> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
>> instantly :) Unless the node became unresponsive at some point. And
>> after restarted by power off:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
>> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
>> last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
>> or
>> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
>> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
>> last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out
>>
>> And then lrmd repeats all of the fun again.
>>
>>
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 04.01.2016 15:50, Bogdan Dobrelya wrote:
> So far so bad.
> I made a dummy OCF script [0] to simulate an example
> promote/demote/notify failure mode for a multistate clone resource which
> is very similar to the one I reported originally. And the test to
> reproduce my case with the dummy is:
> - install dummy resource ocf ra and create the dummy resource as README
> [0] says
> - just watch a) the OCF logs from the dummy and b) the output of the
> recurring commands:
> 
> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
> sleep 20; done&
> # crm_resource --resource p_dummy --list-operations
> 
> At some point I noticed:
> - there are no more "OK" messages logged from the monitor actions,
> although according to the trace_ra dumps' timestamps, all monitors are
> still being invoked!
> 
> - at some point I noticed very strange results reported by the:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
> 4 14:33:07 2016, exec=62107ms): Timed Out
>   or
> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
> 14:43:58 2016, exec=0ms): Timed Out
> 
> - according to the trace_ra dumps, recurring monitors are being invoked
> at intervals *much longer* than configured. For example, 7 minutes
> of "monitoring silence":
> Mon Jan  4 14:47:46 UTC 2016
> p_dummy.monitor.2016-01-04.14:40:52
> Mon Jan  4 14:48:06 UTC 2016
> p_dummy.monitor.2016-01-04.14:47:58
> 
> Given all that, it is very likely that some bug exists in pacemaker's
> monitoring of multi-state clones!
> 
> [0] https://github.com/bogdando/dummy-ocf-ra
> 
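
(For reference -- a crmsh sketch of roughly how such a multi-state dummy
resource is set up, with illustrative timeouts and clone counts, not the
exact commands from the README [0]:)

# crm configure primitive p_dummy ocf:dummy:dummy \
    op monitor interval=30 timeout=60 role=Slave \
    op monitor interval=103 timeout=60 role=Master
# crm configure ms ms_dummy p_dummy \
    meta master-max=1 clone-max=3 notify=true interleave=true

That yields the role-specific recurring monitor operations (e.g.
p_dummy_monitor_103000) referred to in this thread.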

Also note that lrmd spawns *many* monitors, like:
root      6495  0.0  0.0  70268  1456 ?  Ss    2015  4:56  \_ /usr/lib/pacemaker/lrmd
root     31815  0.0  0.0   4440   780 ?  S    15:08  0:00  |   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root     31908  0.0  0.0   4440   388 ?  S    15:08  0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root     31910  0.0  0.0   4440   384 ?  S    15:08  0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root     31915  0.0  0.0   4440   392 ?  S    15:08  0:00  |       \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
...

At some point, there was  already. Then I unmanaged p_dummy, but
it grew to 2403 after that. The number of running monitors may
grow or decrease as well.
Also, /var/lib/heartbeat/trace_ra/dummy/ is still being populated
with new p_dummy.monitor* files with recent timestamps. Why?..
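
A simple way to keep an eye on the number of running monitors (my own
suggestion, in the same spirit as the watch loop above):

# while true; do date; pgrep -fc 'resource.d/dummy/dummy monitor'; sleep 5; done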

If I pkill -9 all the dummy monitors, lrmd spawns another ~2000 almost
instantly :), unless the node has become unresponsive at some point. And
after a restart by power-off:
# crm_resource --resource p_dummy --list-operations
p_dummy (ocf::dummy:dummy): Started (unmanaged) :
p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
or
p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out

And then lrmd repeats all of the fun again.


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Dejan Muhamedagic
Hi,

On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
> On 04.01.2016 16:36, Ken Gaillot wrote:
> > On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> >> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
[...]
> >> Also note, that lrmd spawns *many* monitors like:
> >> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> >> /usr/lib/pacemaker/lrmd
> >> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> >> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> ...
> > 
> > At first glance, that looks like your monitor action is calling itself
> > recursively, but I don't see how in your code.
> 
> Yes, it should be a bug in the ocf-shellfuncs's ocf_log().

If you're sure about that, please open an issue at
https://github.com/ClusterLabs/resource-agents/issues

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-01 Thread Vladislav Bogdanov
31.12.2015 15:33:45 CET, Bogdan Dobrelya  wrote:
>On 31.12.2015 14:48, Vladislav Bogdanov wrote:
>> blackbox tracing inside pacemaker, USR1, USR2 and TRAP signals iirc,
>quick google search should point you to Andrew's blog with all
>information about that feature.
>> Next, if you use ocf-shellfuncs in your RA, you could enable tracing
>for resource itself, just add 'trace_ra=1' to every operation config
>(start and monitor).
>
>Thank you, I will try to play with these things once I have the issue
>reproduced again. I cannot provide the CIB as I don't have the env now.
>
>But still, let me ask again: does anyone know of, or has anyone heard
>of, any known/fixed bugs where corosync with pacemaker stops running
>monitor actions for a resource at some point, while notifications are
>still logged?
>
>Here is an example:
>node-16 crmd:
>2015-12-29T13:16:49.113679+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_monitor_27000: unknown error (node=node-16.test.domain.local, call=254, rc=1, cib-update=1454, confirmed=false)
>node-17:
>2015-12-29T13:16:57.603834+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: unknown error (node=node-17.test.domain.local, call=181, rc=1, cib-update=297, confirmed=false)
>node-18:
>2015-12-29T13:20:16.870619+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-18.test.domain.local, call=187, rc=7, cib-update=306, confirmed=false)
>node-20:
>2015-12-29T13:20:51.486219+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_monitor_3: not running (node=node-20.test.domain.local, call=180, rc=7, cib-update=308, confirmed=false)
>
>After that point only notifications got logged for the affected nodes, like:
>Operation p_rabbitmq-server_notify_0: ok (node=node-20.test.domain.local, call=287, rc=0, cib-update=0, confirmed=true)
>
>While node-19 was not affected, and its monitor/stop/start/notify actions were logged OK all the time, like:
>2015-12-29T14:30:00.973561+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_monitor_3: not running (node=node-19.test.domain.local, call=423, rc=7, cib-update=438, confirmed=false)
>2015-12-29T14:30:01.631609+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-19.test.domain.local, call=424, rc=0, cib-update=0, confirmed=true)
>2015-12-29T14:31:19.084165+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_stop_0: ok (node=node-19.test.domain.local, call=427, rc=0, cib-update=439, confirmed=true)
>2015-12-29T14:32:53.120157+00:00 notice:notice: process_lrm_event: Operation p_rabbitmq-server_start_0: unknown error (node=node-19.test.domain.local, call=428, rc=1, cib-update=441, confirmed=true)

Well, not running and not logged are not the same thing. I do not have access
to the code right now, but I'm pretty sure that successful recurring monitors
are not logged after the first run. trace_ra for the monitor op should prove
that. If not, then it would be a bug. I recall something was fixed in that
area recently.
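
To check that (my own suggestion; the trace_ra path follows the pattern
shown for the dummy RA elsewhere in this thread, and the provider
subdirectory for the rabbitmq RA may differ):

# ls -lt /var/lib/heartbeat/trace_ra/*/ | head

Comparing the dump timestamps against the configured monitor interval shows
whether the recurring monitors keep firing even though syslog stays quiet.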

Best,
Vladislav


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2015-12-31 Thread Vladislav Bogdanov
31.12.2015 12:57:45 CET, Bogdan Dobrelya  wrote:
>Hello.
>I've been hopelessly fighting a bug [0] in the custom OCF agent of the
>Fuel for OpenStack project. It is related to a destructive test case
>where one node out of 3 or 5 total goes down and then comes back. The
>bug itself is tricky (rarely reproduced), tl;dr, and has many
>duplicates, so I only put here the latest comment.
>
>As it says, at some point, after the rabbit OCF monitor reported an
>error followed by several "not running" reports (see the crmd log
>snippet [1]), pacemaker starts "thinking" everything is fine with the
>resource and shows it as "running", while in fact it is completely dead
>and a manually triggered OCF monitor action confirms that (not
>running). But *why* does pacemaker show the resource as running and
>never call monitor actions again? I have no idea how to approach the
>root cause of such pacemaker behaviour.
>
>So I'm asking for guidance and any recommendations on how to debug and
>troubleshoot this strange situation, and for useful log patterns to
>look for (and where).
>Thank you in advance!
>
>PS. This is Pacemaker 1.1.12, Corosync 2.3.4, libqb0 0.17.0 from Ubuntu
>vivid. But the Corosync & Pacemaker cluster looks healthy and I can
>find no log records saying otherwise.
>
>[0] https://bugs.launchpad.net/fuel/+bug/1472230/comments/32
>[1] http://pastebin.com/0UuBvzzz

Hi.
First, could you paste your CIB, preferably not in xml but in crmsh format?
Just to check that everything is fine with the resource and fencing
configuration.
Then, you may enable blackbox tracing inside pacemaker (USR1, USR2 and TRAP
signals, iirc); a quick google search should point you to Andrew's blog with
all the information about that feature.
Next, if you use ocf-shellfuncs in your RA, you could enable tracing for the
resource itself: just add 'trace_ra=1' to every operation config (start and
monitor).

All that may give you some additional hints on what's going on.

Also, you may think about upgrading pacemaker to 1.1.14-rcX, together with
libqb 0.17.2 (and rebuilding corosync against that libqb).
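
To make the blackbox and trace_ra suggestions a bit more concrete, this is
roughly how they look on a node, as far as I remember (the blackbox dump
file name below is illustrative, and the /etc/default path and the crmsh
'resource trace' shortcut are my recollection, so double-check against the
blog post): USR1/USR2 raise and lower a daemon's log level, TRAP makes it
dump its blackbox, and qb-blackbox from libqb decodes the dump.

# echo 'PCMK_blackbox=yes' >> /etc/default/pacemaker   # then restart pacemaker
# kill -USR1 `pgrep crmd`
# kill -TRAP `pgrep crmd`
# qb-blackbox /var/lib/pacemaker/blackbox/crmd-12345.1
# crm resource trace p_rabbitmq-server monitor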

Best,
Vladislav



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org