Re: [ClusterLabs] Anyone successfully install Pacemaker/Corosync on Freebsd?

2016-01-04 Thread Jan Friesse

Christine Caulfield wrote:

On 21/12/15 16:12, Ken Gaillot wrote:

On 12/19/2015 04:56 PM, mike wrote:

Hi All,

just curious if anyone has had any luck at one point installing
Pacemaker and Corosync on FreeBSD. I have to install from source of
course and I've run into an issue when running ./configure while trying
to install Corosync. The process craps out at nss with this error:


FYI, Ruben Kerkhof has done some recent work to get the FreeBSD build
working. It will go into the next 1.1.14 release candidate. In the
meantime, make sure you have the very latest code from upstream's 1.1
branch.



I also strongly recommend using the latest (from git) version of libqb,
as it has some FreeBSD bugs fixed in it. We plan to do a proper release
of this in the new year.


The same applies to corosync. Use git and it should work (even with
clang).
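
(For reference, a rough sketch of building both from git; the repository
URLs and the autogen step here are from memory rather than from this thread,
so adjust as needed, and use gmake on FreeBSD if BSD make complains:)

  # libqb first, then corosync against it
  git clone https://github.com/ClusterLabs/libqb.git
  (cd libqb && ./autogen.sh && ./configure && gmake && gmake install)

  git clone https://github.com/corosync/corosync.git
  (cd corosync && ./autogen.sh && ./configure && gmake && gmake install)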


Honza



Chrissie


checking for nss... configure: error: in `/root/heartbeat/corosync-2.3.3':
configure: error: The pkg-config script could not be found or is too old.
Make sure it is in your PATH or set the PKG_CONFIG environment variable to
the full path to pkg-config.
Alternatively, you may set the environment variables nss_CFLAGS and nss_LIBS
to avoid the need to call pkg-config.
See the pkg-config man page for more details.

I've looked unsuccessfully for a package called pkg-config and nss
appears to be installed as you can see from this output:

root@wellesley:~/heartbeat/corosync-2.3.3 # pkg install nss
Updating FreeBSD repository catalogue...
FreeBSD repository is up-to-date.
All repositories are up-to-date.
Checking integrity... done (0 conflicting)
The most recent version of packages are already installed
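
(For what it is worth, on FreeBSD the pkg-config tool itself typically comes
from the pkgconf package rather than a package named pkg-config, so a sketch
of two ways past that configure check might look like this; the nss paths
below are illustrative guesses, not verified on this system:)

  # install the pkg-config replacement
  pkg install pkgconf

  # or bypass pkg-config entirely, as the configure error suggests,
  # by pointing configure at the nss headers/libraries directly
  nss_CFLAGS="-I/usr/local/include/nss/nss" \
  nss_LIBS="-L/usr/local/lib/nss -lnss3 -lnssutil3" \
  ./configure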

Anyway - just looking for any suggestions. Hoping that perhaps someone
has successfully done this.

thanks in advance
-mgb



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Asking for a new DLM release

2016-01-04 Thread Ferenc Wagner
Hi,

DLM 4.0.2 was released on 2013-07-31.  The Git repo has accumulated some
fixes since then, which would be nice to have in a proper release.  It
does not seem like any work is being carried out on the code base
currently (the HEAD commit was created on 2015-04-13), so I hereby ask
the developers to kindly cut a new release, unless there are
known issues with the code.
-- 
Thanks,
Feri.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: About globally unique resource instances distribution per node

2016-01-04 Thread Ferenc Wagner
Daniel Hernández  writes:

> [...]
> The cluster starts 3 instances of example1 on node1 and not 4 as I
> want. That happens when I have more than 1 resource to allocate. I also
> notice that I am not configuring the cluster to start 3 instances of
> example1 on node1. Is there any way to do it?

Please check how http://bugs.clusterlabs.org/show_bug.cgi?id=5221
applies to your case.  Comment on the issue if you've got something to
add.
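
(In case it helps in the meantime: the per-node cap for a globally-unique
clone is clone-node-max, so a sketch of the kind of setting being discussed
might be the following; the clone name and number are only illustrative,
and this caps how many instances may run on one node rather than forcing
that distribution, which is what the bug above is about:)

  # allow up to 4 instances of the clone on a single node
  pcs resource meta example1-clone clone-node-max=4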
-- 
Regards,
Feri.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 01.01.2016 11:34, Vladislav Bogdanov wrote:
> 31.12.2015 15:33:45 CET, Bogdan Dobrelya  wrote:
>> On 31.12.2015 14:48, Vladislav Bogdanov wrote:
>>> blackbox tracing inside pacemaker, USR1, USR2 and TRAP signals iirc,
>>> quick google search should point you to Andrew's blog with all
>>> information about that feature.
>>> Next, if you use ocf-shellfuncs in your RA, you could enable tracing
>>> for the resource itself, just add 'trace_ra=1' to every operation config
>>> (start and monitor).
>>
>> Thank you, I will try to play with these things once I have the issue
>> reproduced again. Cannot provide CIB as I don't have the env now.
>>
>> But still let me ask again: does anyone know of, or has heard of, any
>> known/fixed bugs where corosync with pacemaker stops running monitor
>> actions for a resource at some point, while notifications are still
>> logged?
>>
>> Here is an example:
>> node-16 crmd:
>> 2015-12-29T13:16:49.113679+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_27000: unknown error
>> (node=node-16.test.domain.local, call=254, rc=1, cib-update=1454, confirmed=false)
>> node-17:
>> 2015-12-29T13:16:57.603834+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_103000: unknown error
>> (node=node-17.test.domain.local, call=181, rc=1, cib-update=297, confirmed=false)
>> node-18:
>> 2015-12-29T13:20:16.870619+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_103000: not running
>> (node=node-18.test.domain.local, call=187, rc=7, cib-update=306, confirmed=false)
>> node-20:
>> 2015-12-29T13:20:51.486219+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_3: not running
>> (node=node-20.test.domain.local, call=180, rc=7, cib-update=308, confirmed=false)
>>
>> After that point only notifications got logged for the affected nodes, like:
>> Operation p_rabbitmq-server_notify_0: ok
>> (node=node-20.test.domain.local, call=287, rc=0, cib-update=0, confirmed=true)
>>
>> While node-19 was not affected, and its monitor/stop/start/notify actions
>> were logged OK all the time, like:
>> 2015-12-29T14:30:00.973561+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_monitor_3: not running
>> (node=node-19.test.domain.local, call=423, rc=7, cib-update=438, confirmed=false)
>> 2015-12-29T14:30:01.631609+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_notify_0: ok
>> (node=node-19.test.domain.local, call=424, rc=0, cib-update=0, confirmed=true)
>> 2015-12-29T14:31:19.084165+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_stop_0: ok (node=node-19.test.domain.local,
>> call=427, rc=0, cib-update=439, confirmed=true)
>> 2015-12-29T14:32:53.120157+00:00 notice:notice: process_lrm_event:
>> Operation p_rabbitmq-server_start_0: unknown error
>> (node=node-19.test.domain.local, call=428, rc=1, cib-update=441, confirmed=true)
> 
> Well, not running and not logged is not the same thing. I do not have access 
> to code right now, but I'm pretty sure that successful recurring monitors are 
> not logged after the first run. trace_ra for monitor op should prove that. If 
> not, then it should be a bug. I recall something was fixed in that area 
> recently.
> 

Is it http://bugs.clusterlabs.org/show_bug.cgi?id=5072 /
http://bugs.clusterlabs.org/show_bug.cgi?id=5063 ? I found nothing more
recent in the pacemaker commits and issues. It is not *exactly* my case,
though several promote and demote actions did take place during the test.

Btw, as I understood from the bug 5072/5063 comments, it remains unfixed
for some reported cases, am I right?
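
(For reference, enabling the RA tracing suggested above can look roughly
like this with crmsh, assuming the installed version has the 'resource
trace' subcommand; otherwise trace_ra=1 can be added to the op definitions
by hand, as mentioned earlier in the thread:)

  # trace only the monitor operation of the suspect resource
  crm resource trace p_rabbitmq-server monitor

  # trace dumps then show up under the per-agent directories in
  ls /var/lib/heartbeat/trace_ra/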

> Best,
> Vladislav
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker documentation license clarification

2016-01-04 Thread Ferenc Wagner
Ken Gaillot  writes:

> Currently, the brand is specified in each book's publican.cfg (which is
> generated by configure, and can be edited by "make www-cli"). It works,
> so realistically it's a low priority to improve it, given everything
> else on the plate.

Well, it's not pretty to say the least, but I don't think I have to
touch that part.

> You're welcome to submit a pull request to change it to use the local
> brand directory.

Done, it's part of https://github.com/ClusterLabs/pacemaker/pull/876.
That pull request contains three independent patches, feel free to
cherry pick only part of it if you find anything objectionable.

> Be sure to consider that each book comes in multiple formats (and
> potentially translations, though they're out of date at this point,
> which is a whole separate discussion worth raising at some point), and
> add anything generated to .gitignore.

I think this minimal change won't cause problems with other formats or
translations.  I forgot about gitignoring the xsl symlink though; I can
add that after the initial review.
-- 
Regards,
Feri.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: crm shell: "help migrate" is somewhat "thin"...

2016-01-04 Thread Kristoffer Grönlund
Ulrich Windl  writes:

>>> Ulrich Windl wrote on 28.12.2015 at 09:17 in message <5680F025.9B2 : 161 : 60728>:
>> Hi!
>> 
>> When trying to migrate a clone resource I got this (SLES11 SP4 with 
>> crmsh-2.1.2+git49.g2e3fa0e-1.32):
>> # crm resource migrate cln_ctdb PT5M
>> Resource 'cln_ctdb' not moved: active in 2 locations.
>> You can prevent 'cln_ctdb' from running on a specific location with:
>> --ban --host <hostname>
>> Error performing operation: Invalid argument
>
> An update: When I try "crm resource migrate --ban --host h02 
> cln_rksaps02_ctdb PT5M", I get an error message, also: "ERROR: 
> resource.migrate: Not our node: --host"
>

Hi,

The above error is reported by the pacemaker tools which take a
different set of arguments compared to crmsh itself.

The migrate <rsc> <node> form is not applicable to cloned
resources. I agree that the help text could be clearer about this.
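
A sketch of the ban variant for this case, using the low-level tool (option
names vary a little between pacemaker versions, e.g. newer ones prefer
--node over --host, so treat this as illustrative):

  # keep the clone off h02 for five minutes
  crm_resource --resource cln_ctdb --ban --host h02 --lifetime PT5M

The resulting location constraint (named along the lines of
cli-ban-cln_ctdb-on-h02) can be deleted again once it is no longer wanted.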

Cheers,
Kristoffer

>> 
>> So when I tried "help migrate" in interactive crm shell, I only get this:
>> ---
>> crm(live)resource# help migrate
>> Migrate a resource to another node
>> 
>> Migrate a resource to a different node. If node is left out, the
>> resource is migrated by creating a constraint which prevents it from
>> running on the current node. Additionally, you may specify a
>> lifetime for the constraint---once it expires, the location
>> constraint will no longer be active.
>> 
>> Usage:
>> 
>> migrate <rsc> [<node>] [<lifetime>] [force]
>> ---
>> 
>> My guess is that the help is incomplete!
>> Can you temporarily provide a corrected usage text?
>> 
>> Regards,
>> Ulrich
>> 
>> 

-- 
// Kristoffer Grönlund
// kgronl...@suse.com

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Asking for a new DLM release

2016-01-04 Thread Ferenc Wagner
Ferenc Wagner  writes:

> DLM 4.0.2 was released on 2013-07-31.  The Git repo accumulated some
> fixes since then, which would be nice to have in a proper release.

By the way I offer https://github.com/wferi/dlm/commits/upstream-patches
for merging or cherry-picking into upstream.

And if I'm hitting the wrong forum with this DLM topic, please advise me.
-- 
Thanks,
Feri.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Regarding IP tables and IP Address clone

2016-01-04 Thread Somanath Jeeva
Hi,

I checked with the IT team.

No, the Multicast MAC is not getting added to the ARP table of the switch.
I will try adding the entry to the ARP table manually and check.

Regards
Somanath Thilak J
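
(For reference, the kind of manual entry being discussed looks roughly like
this; the MAC address and interface below are placeholders, the real value
is the CLUSTERIP multicast MAC that IPaddr2 sets up for the clone:)

  # watch what the ARP cache does for the virtual IP on each node
  watch arp -vn

  # add a permanent neighbour entry for the virtual IP
  ip neigh add 10.61.150.55 lladdr 01:00:5e:7f:00:01 dev eth0 nud permanent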

From: Michael Schwartzkopff [mailto:m...@sys4.de]
Sent: Thursday, December 31, 2015 00:49
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Antw: Regarding IP tables and IP Address clone


On Wednesday, 30 December 2015 at 14:56:58, Somanath Jeeva wrote:

> > From: Michael Schwartzkopff [mailto:m...@sys4.de]
> > Sent: Wednesday, December 30, 2015 8:09 PM
> > To: Cluster Labs - All topics related to open-source clustering welcomed
> > Subject: Re: [ClusterLabs] Antw: Regarding IP tables and IP Address clone
>
> On Wednesday, 30 December 2015 at 13:54:40, Somanath Jeeva wrote:
> > > Somanath Jeeva <... at ericsson.com> wrote on 30.12.2015 at 11:34 in
> > > message <4F5E5141ED95FF45B3128F3C7B1B2A6721ABFE13 at eusaamb109.ericsson.se>:
> > >> On 12/22/2015 08:09 AM, Somanath Jeeva wrote:
> > >>> Hi
> > >>>
> > >>> I am trying to use ip loadbalancing using the cloning feature in
> > >>> pacemaker, but after 15 min the virtual ip becomes unreachable.
> > >>> Below is the pacemaker cluster config:
> > >>>
> > >>> # pcs status
> > >>> Cluster name: DES
> > >>> Last updated: Tue Dec 22 08:57:55 2015
> > >>> Last change: Tue Dec 22 08:10:22 2015
> > >>> Stack: cman
> > >>> Current DC: node-01 - partition with quorum
> > >>> Version: 1.1.11-97629de
> > >>> 2 Nodes configured
> > >>> 2 Resources configured
> > >>>
> > >>> Online: [ node-01 node-02 ]
> > >>>
> > >>> Full list of resources:
> > >>>  Clone Set: ClusterIP-clone [ClusterIP] (unique)
> > >>>      ClusterIP:0 (ocf::heartbeat:IPaddr2): Started node-01
> > >>>      ClusterIP:1 (ocf::heartbeat:IPaddr2): Started node-02
> > >>>
> > >>> # pcs config
> > >>> Cluster Name: DES
> > >>> Corosync Nodes:
> > >>>  node-01 node-02
> > >>> Pacemaker Nodes:
> > >>>  node-01 node-02
> > >>>
> > >>> Resources:
> > >>>  Clone: ClusterIP-clone
> > >>>   Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true
> > >>>   Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> > >>>    Attributes: ip=10.61.150.55 cidr_netmask=23 clusterip_hash=sourceip
> > >>>    Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
> > >>>                stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
> > >>>                monitor interval=5s (ClusterIP-monitor-interval-5s)
> > >>>
> > >>> Stonith Devices:
> > >>> Fencing Levels:
> > >>>
> > >>> Location Constraints:
> > >>> Ordering Constraints:
> > >>> Colocation Constraints:
> > >>>
> > >>> Cluster Properties:
> > >>>  cluster-infrastructure: cman
> > >>>  cluster-recheck-interval: 0
> > >>>  dc-version: 1.1.11-97629de
> > >>>  stonith-enabled: false
> > >>>
> > >>> Pacemaker and Corosync version:
> > >>> Pacemaker - 1.1.12-4
> > >>> Corosync - 1.4.7
> > >>>
> > >>> Is the issue due to a configuration error or a firewall issue?
> > >>>
> > >>> With Regards
> > >>> Somanath Thilak J
> > >>>
> > >> Hi Somanath,
> > >>
> > >> The configuration looks fine (aside from fencing not being configured),
> > >> so I'd suspect a network issue.
> > >>
> > >> The IPaddr2 cloning relies on multicast MAC addresses (at the Ethernet
> > >> level, not multicast IP), and many switches have issues with that. Make
> > >> sure your switch supports multicast MAC (and if necessary, has it
> > >> enabled on the relevant ports).
> > >>
> > >> Some people have found it necessary to add a static ARP entry for the
> > >> cluster IP/MAC in their firewall and/or switch.
> > >>
> > >> Hi,
> > >>
> > >> It seems that the switches have multicast support enabled. Any idea on
> > >> how to troubleshoot the issue? I also tried adding the multicast MAC
> > >> to the ip neigh tables. Still the virtual IP goes down in 15 min or so.
> > >
> > > Did you try a "watch arp -vn" on your nodes to watch for changes (if
> > > you only have a few connections)?
> >
> > I could not see my virtual IP in the arp -vn command output. Only if I
> > add the static ARP entry I can see the virtual IP in the comm
Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
So far so bad.
I made a dummy OCF script [0] to simulate an example
promote/demote/notify failure mode for a multistate clone resource which
is very similar to the one I reported originally. And the test to
reproduce my case with the dummy is:
- install dummy resource ocf ra and create the dummy resource as README
[0] says
- just watch the a) OCF logs from the dummy and b) outputs for the
reoccurring commands:

# while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
sleep 20; done&
# crm_resource --resource p_dummy --list-operations

At some point I noticed:
- there are no more "OK" messages logged from the monitor actions,
although according to the trace_ra dumps' timestamps, all monitors are
still being invoked!

- at some point I noticed very strange results reported by the:
# crm_resource --resource p_dummy --list-operations
p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
(node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
4 14:33:07 2016, exec=62107ms): Timed Out
  or
p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
(node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
14:43:58 2016, exec=0ms): Timed Out

- according to the trace_ra dumps reoccurring monitors are being invoked
by the intervals *much longer* than configured. For example, a 7 minutes
of "monitoring silence":
Mon Jan  4 14:47:46 UTC 2016
p_dummy.monitor.2016-01-04.14:40:52
Mon Jan  4 14:48:06 UTC 2016
p_dummy.monitor.2016-01-04.14:47:58

Given that said, it is very likely there is some bug exist for
monitoring multi-state clones in pacemaker!

[0] https://github.com/bogdando/dummy-ocf-ra

-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] [Q] Cluster failovers too soon

2016-01-04 Thread Sebish
<I am resending this mail because of the clusterlabs outage during the
weekend, a received error message, and my time limit until next week>



Hello guys,

happy new year to all of you!

I have a little (understanding?) problem regarding Heartbeat/Pacemaker
and deadtime/timeout.
I know that corosync is the way to go, but at the moment I have a heartbeat
cluster and need to adjust its time before a failover is initiated.

My cluster and resources completely ignore the heartbeat deadtime raise
and the timeout in the pacemaker resource agent definitions.
When I shut a node off, it gets shown as offline and the services are
failed over after 4-9 seconds. But I want 20 seconds.

What do I have to adjust to make the cluster fail over after +/- 20
seconds instead of 9? Am I missing a parameter apart from
deadtime (deadping) and timeout?

Every hint would be a great help!


Thank you very much
Sebish


Config:
--

/etc/heartbeat/ha.cf:

...
keepalive 2
warntime 6
deadtime 20
initdead 60
...

crm (pacemaker):

node $id="6acc2585-b49b-4b0f-8b2a-8561cceb8b83" nodec
node $id="891a8209-5e1a-40b6-8d72-8458a851bb9a" kamailioopenhab2
node $id="fd898711-4c76-4d00-941c-4528e174533c" kamailioopenhab1
primitive ClusterMon ocf:pacemaker:ClusterMon \
params user="root" update="30" extra_options="-E /usr/lib/ocf/resource.d/*myname*/*script*.sh" \
op monitor interval="10" timeout="40" on-fail="restart"
primitive FailoverIP ocf:heartbeat:IPaddr2 \
params ip="*ClusterIP*" cidr_netmask="18" \
op monitor interval="2s" timeout="20"
primitive Openhab lsb:openhab \
meta target-role="Started" \
op monitor interval="2s" timeout="20"
primitive Ping ocf:pacemaker:ping \
params host_list="*ClusterIP*" multiplier="100" \
op monitor interval="2s" timeout="20"
location ClusterMon_LocationA ClusterMon -inf: kamailioopenhab1
location ClusterMon_LocationB ClusterMon 10: kamailioopenhab2
location ClusterMon_LocationC ClusterMon inf: nodec
location FailoverIP_LocationA FailoverIP 20: kamailioopenhab1
location FailoverIP_LocationB FailoverIP 10: kamailioopenhab2
location FailoverIP_LocationC FailoverIP -inf: nodec
colocation Services_Colocation inf: FailoverIP Kamailio Openhab
property $id="cib-bootstrap-options" \
dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
cluster-infrastructure="Heartbeat" \
expected-quorum-votes="2" \
last-lrm-refresh="1451669632" \
stonith-enabled="false" \
no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
resource-stickiness="100"
--

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Ken Gaillot
On 01/04/2016 08:50 AM, Bogdan Dobrelya wrote:
> So far so bad.
> I made a dummy OCF script [0] to simulate an example
> promote/demote/notify failure mode for a multistate clone resource which
> is very similar to the one I reported originally. And the test to
> reproduce my case with the dummy is:
> - install dummy resource ocf ra and create the dummy resource as README
> [0] says
> - just watch the a) OCF logs from the dummy and b) outputs for the
> reoccurring commands:
> 
> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
> sleep 20; done&
> # crm_resource --resource p_dummy --list-operations
> 
> At some point I noticed:
> - there are no more "OK" messages logged from the monitor actions,
> although according to the trace_ra dumps' timestamps, all monitors are
> still being invoked!

Yes, that's to reduce log clutter / I/O (which especially matters when
you scale to hundreds of resources). As long as a recurring monitor is
OK, only the first OK is logged.

> - at some point I noticed very strange results reported by the:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
> 4 14:33:07 2016, exec=62107ms): Timed Out
>   or
> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
> 14:43:58 2016, exec=0ms): Timed Out

Note that these are on different nodes. When pacemaker starts a
resource, it first "probes" all nodes by running a one-time monitor
operation on them, to ensure the service is not already running
somewhere. So those are expected to "fail".

Your dummy RA always returns OCF_SUCCESS for status/monitor, which will
cause problems. Pacemaker will think it's already running everywhere,
and not try to start it.

A master/slave resource should use these return codes:

http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#_requirements_for_multi_state_resource_agents
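
In practice that usually means a monitor shaped something like this (a
minimal sh sketch, assuming ocf-shellfuncs is sourced so the OCF_* return
code variables exist; the OCF_RESKEY_state file is an assumption of the
sketch, not part of the posted dummy RA):

  dummy_monitor() {
      # no state file: the instance is cleanly stopped
      [ -f "$OCF_RESKEY_state" ] || return $OCF_NOT_RUNNING
      # promoted instance: report master, not plain success
      if grep -q master "$OCF_RESKEY_state"; then
          return $OCF_RUNNING_MASTER
      fi
      # running as slave
      return $OCF_SUCCESS
  }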

> - according to the trace_ra dumps reoccurring monitors are being invoked
> by the intervals *much longer* than configured. For example, a 7 minutes
> of "monitoring silence":
> Mon Jan  4 14:47:46 UTC 2016
> p_dummy.monitor.2016-01-04.14:40:52
> Mon Jan  4 14:48:06 UTC 2016
> p_dummy.monitor.2016-01-04.14:47:58
> 
> Given that said, it is very likely there is some bug exist for
> monitoring multi-state clones in pacemaker!
> 
> [0] https://github.com/bogdando/dummy-ocf-ra
> 


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 04.01.2016 15:50, Bogdan Dobrelya wrote:
> So far so bad.
> I made a dummy OCF script [0] to simulate an example
> promote/demote/notify failure mode for a multistate clone resource which
> is very similar to the one I reported originally. And the test to
> reproduce my case with the dummy is:
> - install dummy resource ocf ra and create the dummy resource as README
> [0] says
> - just watch the a) OCF logs from the dummy and b) outputs for the
> reoccurring commands:
> 
> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
> sleep 20; done&
> # crm_resource --resource p_dummy --list-operations
> 
> At some point I noticed:
> - there are no more "OK" messages logged from the monitor actions,
> although according to the trace_ra dumps' timestamps, all monitors are
> still being invoked!
> 
> - at some point I noticed very strange results reported by the:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
> 4 14:33:07 2016, exec=62107ms): Timed Out
>   or
> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
> 14:43:58 2016, exec=0ms): Timed Out
> 
> - according to the trace_ra dumps reoccurring monitors are being invoked
> by the intervals *much longer* than configured. For example, a 7 minutes
> of "monitoring silence":
> Mon Jan  4 14:47:46 UTC 2016
> p_dummy.monitor.2016-01-04.14:40:52
> Mon Jan  4 14:48:06 UTC 2016
> p_dummy.monitor.2016-01-04.14:47:58
> 
> Given that said, it is very likely there is some bug exist for
> monitoring multi-state clones in pacemaker!
> 
> [0] https://github.com/bogdando/dummy-ocf-ra
> 

Also note, that lrmd spawns *many* monitors like:
root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
/usr/lib/pacemaker/lrmd
root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
/bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
  \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
...

At some point, there was  already. Then I unmanaged p_dummy, but
it grew to 2403 after that. The number of running monitors may
grow or decrease as well.
Also, /var/lib/heartbeat/trace_ra/dummy/ still keeps being populated
with new p_dummy.monitor* files with recent timestamps. Why?..

If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
instantly :) unless the node becomes unresponsive at some point. And
after the node is restarted by power off & on:
# crm_resource --resource p_dummy --list-operations
p_dummy (ocf::dummy:dummy): Started (unmanaged) :
p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
or
p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out

And then lrmd repeats all of the fun again.


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Ken Gaillot
On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>> So far so bad.
>> I made a dummy OCF script [0] to simulate an example
>> promote/demote/notify failure mode for a multistate clone resource which
>> is very similar to the one I reported originally. And the test to
>> reproduce my case with the dummy is:
>> - install dummy resource ocf ra and create the dummy resource as README
>> [0] says
>> - just watch the a) OCF logs from the dummy and b) outputs for the
>> reoccurring commands:
>>
>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>> sleep 20; done&
>> # crm_resource --resource p_dummy --list-operations
>>
>> At some point I noticed:
>> - there are no more "OK" messages logged from the monitor actions,
>> although according to the trace_ra dumps' timestamps, all monitors are
>> still being invoked!
>>
>> - at some point I noticed very strange results reported by the:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>> 4 14:33:07 2016, exec=62107ms): Timed Out
>>   or
>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
>> 14:43:58 2016, exec=0ms): Timed Out
>>
>> - according to the trace_ra dumps reoccurring monitors are being invoked
>> by the intervals *much longer* than configured. For example, a 7 minutes
>> of "monitoring silence":
>> Mon Jan  4 14:47:46 UTC 2016
>> p_dummy.monitor.2016-01-04.14:40:52
>> Mon Jan  4 14:48:06 UTC 2016
>> p_dummy.monitor.2016-01-04.14:47:58
>>
>> Given that said, it is very likely there is some bug exist for
>> monitoring multi-state clones in pacemaker!
>>
>> [0] https://github.com/bogdando/dummy-ocf-ra
>>
> 
> Also note, that lrmd spawns *many* monitors like:
> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> /usr/lib/pacemaker/lrmd
> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> ...

At first glance, that looks like your monitor action is calling itself
recursively, but I don't see how in your code.

> At some point, there was  already. Then I unmanaged the p_dummy but
> it grew up to the 2403 after that. The number of running monitors may
> grow or decrease as well.
> Also, the /var/lib/heartbeat/trace_ra/dummy/ still have been populated
> by new p_dummy.monitor* files with recent timestamps. Why?..
> 
> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
> instantly :) Unless the node became unresponsive at some point. And
> after restarted by power off&on:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
> last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
> or
> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
> last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out
> 
> And then lrmd repeats all of the fun again.
> 
> 


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 04.01.2016 16:36, Ken Gaillot wrote:
> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> So far so bad.
>>> I made a dummy OCF script [0] to simulate an example
>>> promote/demote/notify failure mode for a multistate clone resource which
>>> is very similar to the one I reported originally. And the test to
>>> reproduce my case with the dummy is:
>>> - install dummy resource ocf ra and create the dummy resource as README
>>> [0] says
>>> - just watch the a) OCF logs from the dummy and b) outputs for the
>>> reoccurring commands:
>>>
>>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>>> sleep 20; done&
>>> # crm_resource --resource p_dummy --list-operations
>>>
>>> At some point I noticed:
>>> - there are no more "OK" messages logged from the monitor actions,
>>> although according to the trace_ra dumps' timestamps, all monitors are
>>> still being invoked!
>>>
>>> - at some point I noticed very strange results reported by the:
>>> # crm_resource --resource p_dummy --list-operations
>>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>>> 4 14:33:07 2016, exec=62107ms): Timed Out
>>>   or
>>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan  4
>>> 14:43:58 2016, exec=0ms): Timed Out
>>>
>>> - according to the trace_ra dumps reoccurring monitors are being invoked
>>> by the intervals *much longer* than configured. For example, a 7 minutes
>>> of "monitoring silence":
>>> Mon Jan  4 14:47:46 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:40:52
>>> Mon Jan  4 14:48:06 UTC 2016
>>> p_dummy.monitor.2016-01-04.14:47:58
>>>
>>> Given that said, it is very likely there is some bug exist for
>>> monitoring multi-state clones in pacemaker!
>>>
>>> [0] https://github.com/bogdando/dummy-ocf-ra
>>>
>>
>> Also note, that lrmd spawns *many* monitors like:
>> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
>> /usr/lib/pacemaker/lrmd
>> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
>> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>> ...
> 
> At first glance, that looks like your monitor action is calling itself
> recursively, but I don't see how in your code.

Yes, it should be a bug in ocf-shellfuncs' ocf_log().

If I replace it in the dummy RA with the following:
#. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs
ocf_log() {
  logger $HA_LOGFACILITY -t $HA_LOGTAG "$@"
}

there is no such issue anymore, and I see the "It's OK" log messages
as expected.
Note that I used resource-agents 3.9.5+git+a626847-1
from [0].

[0] http://ftp.de.debian.org/debian/ experimental/main amd64 Packages

> 
>> At some point, there was  already. Then I unmanaged the p_dummy but
>> it grew up to the 2403 after that. The number of running monitors may
>> grow or decrease as well.
>> Also, the /var/lib/heartbeat/trace_ra/dummy/ still have been populated
>> by new p_dummy.monitor* files with recent timestamps. Why?..
>>
>> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
>> instantly :) Unless the node became unresponsive at some point. And
>> after restarted by power off&on:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
>> p_dummy_monitor_3 (node=node-1.test.domain.local, call=679, rc=1,
>> last-rc-change=Mon Jan  4 15:04:25 2016, exec=66747ms): Timed Out
>> or
>> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
>> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
>> last-rc-change=Mon Jan  4 15:14:59 2016, exec=65237ms): Timed Out
>>
>> And then lrmd repeats all of the fun again.
>>
>>
> 
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Dejan Muhamedagic
Hi,

On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
> On 04.01.2016 16:36, Ken Gaillot wrote:
> > On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> >> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
[...]
> >> Also note, that lrmd spawns *many* monitors like:
> >> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> >> /usr/lib/pacemaker/lrmd
> >> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> >> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> ...
> > 
> > At first glance, that looks like your monitor action is calling itself
> > recursively, but I don't see how in your code.
> 
> Yes, it should be a bug in the ocf-shellfuncs's ocf_log().

If you're sure about that, please open an issue at
https://github.com/ClusterLabs/resource-agents/issues

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Bogdan Dobrelya
On 04.01.2016 17:14, Dejan Muhamedagic wrote:
> Hi,
> 
> On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
>> On 04.01.2016 16:36, Ken Gaillot wrote:
>>> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>>>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
> [...]
>>>> Also note, that lrmd spawns *many* monitors like:
>>>> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
>>>> /usr/lib/pacemaker/lrmd
>>>> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
>>>> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
>>>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
>>>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
>>>>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>> ...
>>>
>>> At first glance, that looks like your monitor action is calling itself
>>> recursively, but I don't see how in your code.
>>
>> Yes, it should be a bug in the ocf-shellfuncs's ocf_log().
> 
> If you're sure about that, please open an issue at
> https://github.com/ClusterLabs/resource-agents/issues

Submitted [0]. Thank you!
Note that it seems the mere act of importing (sourcing) ocf-shellfuncs
causes the issue, not the ocf_run or ocf_log code itself.

[0] https://github.com/ClusterLabs/resource-agents/issues/734

> 
> Thanks,
> 
> Dejan
> 


-- 
Best regards,
Bogdan Dobrelya,
Irc #bogdando

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] booth release schedule query

2016-01-04 Thread Jan Pokorný
In the same vein as the other question today, I'd like to know if
there is a booth release to be cut any time soon.  Currently, the latest
release is from Oct 2014.

I'd like to have it packaged in Fedora repositories (I know about the
OBS, but that may be inconvenient).

Thanks!

-- 
Jan (Poki)


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org