Re: [ClusterLabs] Antw: how to connect to the cluster from a docker container

2019-08-07 Thread Dejan Muhamedagic
On Tue, Aug 06, 2019 at 02:38:10PM +0200, Ulrich Windl wrote:
> >>> Dejan Muhamedagic wrote on 06.08.2019 at 10:37 in
> message <20190806083726.GA8262@capote>:
> > Hi,
> > 
> > Hawk runs in a docker container on one of the cluster nodes (the
> > nodes run Debian and apparently it's rather difficult to install
> > hawk on a non‑SUSE distribution, hence docker). Now, how to
> > connect to the cluster? Hawk uses the pacemaker command line
> > tools such as cibadmin. I have a vague recollection that there is
> > a way to connect over tcp/ip, but, if that is so, I cannot find
> > any documentation about it.
> 
> I always thought hawk has to run on one of the cluster nodes (natively).

Apparently so. hawk invokes all external commands through hawk_invoke,
which drops the entire environment apart from HOME and CIB_file.

One obvious possibility would be for hawk_invoke to either
obtain the connection settings itself or to preserve CIB_host and
the other connection variables.

Cheers,

Dejan

> > 
> > Cheers,
> > 
> > Dejan

Re: [ClusterLabs] Antw: how to connect to the cluster from a docker container

2019-08-07 Thread Dejan Muhamedagic
Hi,

On Wed, Aug 07, 2019 at 11:23:09AM +0200, Klaus Wenninger wrote:
> On 8/7/19 10:09 AM, Dejan Muhamedagic wrote:
> > Hi Ulrich,
> >
> > On Tue, Aug 06, 2019 at 02:38:10PM +0200, Ulrich Windl wrote:
> >>>>> Dejan Muhamedagic wrote on 06.08.2019 at 10:37 in
> >> message <20190806083726.GA8262@capote>:
> >>> Hi,
> >>>
> >>> Hawk runs in a docker container on one of the cluster nodes (the
> >>> nodes run Debian and apparently it's rather difficult to install
> >>> hawk on a non‑SUSE distribution, hence docker). Now, how to
> >>> connect to the cluster? Hawk uses the pacemaker command line
> >>> tools such as cibadmin. I have a vague recollection that there is
> >>> a way to connect over tcp/ip, but, if that is so, I cannot find
> >>> any documentation about it.
> >> I always thought hawk has to run on one of the cluster nodes (natively).
> > Well, let's see if that is the case. BTW, the Dockerfile is
> > available here:
> >
> > https://github.com/krig/docker-hawk
> >
> > Cheers,
> >
> > Dejan
> That container seems to be intended to act as a cluster node
> controlling docker containers on the same host.
> If the pacemaker version inside the container is close enough
> to the pacemaker version you are running on Debian, and
> if it has pacemaker-remote, you might be able to run the
> container as a guest node.
> No idea though whether the tooling hawk uses is going to be happy
> tunneling through pacemaker-remote.

hawk seems to be using only the standard pacemaker-cli-utils
(cibadmin etc).

> A little bit like hypervisors do it nowadays - running the
> admin interface in a VM ...
> Of course that is only useful if you can live with hawk not being
> available whenever the cluster is in a state where it doesn't start
> the guest node.

Interesting idea. Would then cibadmin et al work from this remote
node?
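
For what it's worth, such a guest-node setup might be sketched roughly
like this (crm shell syntax; the image name, run options and node name
are placeholders, and pacemaker-remote plus the CLI tools would have to
be installed inside the container):

```
# a sketch only: manage the hawk container as a pacemaker guest node
crm configure primitive hawk-container ocf:heartbeat:docker \
    params image="hawk-image" run_opts="-p 7630:7630" \
    meta remote-node=hawk-guest
```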

Cheers,

Dejan

> Klaus
> >
> >>> Cheers,
> >>>
> >>> Dejan

Re: [ClusterLabs] how to connect to the cluster from a docker container

2019-08-07 Thread Dejan Muhamedagic
Hi,

On Tue, Aug 06, 2019 at 01:36:49PM +0200, Jan Pokorný wrote:
> Hello Dejan,
> 
> nice to see you around,
> 
> On 06/08/19 10:37 +0200, Dejan Muhamedagic wrote:
> > Hawk runs in a docker container on one of the cluster nodes (the
> > nodes run Debian and apparently it's rather difficult to install
> > hawk on a non-SUSE distribution, hence docker). Now, how to
> > connect to the cluster? Hawk uses the pacemaker command line
> > tools such as cibadmin. I have a vague recollection that there is
> > a way to connect over tcp/ip, but, if that is so, I cannot find
> > any documentation about it.
> [...]
> 2. use modern enough libqb (v1.0.2+) and use
> 
>  touch /etc/libqb/force-filesystem-sockets
> 
>on both host and within the container (assuming those two locations
>are fully disjoint, i.e., not an overlay-based reuse), you should
>then be able to share the respective reified sockets simply by
>sharing the pertaining directory (normally /var/run it seems)
> 
>- if indeed a directory as generic as /var/run is involved,
>  it may also lead to unexpected interferences, so the more
>  minimalistic the container is, the better I think
>  (or you can recompile libqb and play with path mapping
>  in container configuration to achieve smoother plug-in)
> 
> Then, pacemaker utilities would hopefully work across the container
> boundaries just as if they were fully native, hence hawk shall as
> well.
> 
> Let us know how far you'll get and where we can collectively join you
> in your attempts, I don't think we had such experience disseminated
> here.  I know for sure I haven't ever tried this in practice, someone
> else here could have.  Also, there may be a lot of fun with various
> Linux Security Modules like SELinux.

pacemakerd is not happy with the filesystem sockets:

Aug 07 14:12:26 alpaca1-pc pacemakerd  [7606] (crm_ipc_connect) 
debug: Could not establish pacemakerd connection: No such file or directory (2)
Aug 07 14:12:26 alpaca1-pc pacemakerd  [7606] (qb_ipcc_disconnect)  
debug: qb_ipcc_disconnect()
Aug 07 14:12:26 alpaca1-pc pacemakerd  [7606] (mcp_read_config) 
info: cmap connection setup failed: CS_ERR_NOT_EXIST .  Retrying in 1s
...
Aug  7 14:12:41 alpaca1-pc pacemakerd[7606]: Could not connect to Cluster
Configuration Database API, error 12

Apparently, it fails to connect to corosync.
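
One thing that might be worth checking (a sketch, not a confirmed
diagnosis) is whether corosync itself actually switched to filesystem
sockets, since cmap is a corosync IPC service; abstract sockets show up
in ss output with a leading '@' and won't cross the container boundary:

```
# on the host: list the unix sockets belonging to corosync/pacemaker
ss -xlp | grep -E 'corosync|pacemaker'
```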

Any ideas?

Cheers,

Dejan


Re: [ClusterLabs] Antw: how to connect to the cluster from a docker container

2019-08-07 Thread Dejan Muhamedagic
Hi Ulrich,

On Tue, Aug 06, 2019 at 02:38:10PM +0200, Ulrich Windl wrote:
> >>> Dejan Muhamedagic wrote on 06.08.2019 at 10:37 in
> message <20190806083726.GA8262@capote>:
> > Hi,
> > 
> > Hawk runs in a docker container on one of the cluster nodes (the
> > nodes run Debian and apparently it's rather difficult to install
> > hawk on a non‑SUSE distribution, hence docker). Now, how to
> > connect to the cluster? Hawk uses the pacemaker command line
> > tools such as cibadmin. I have a vague recollection that there is
> > a way to connect over tcp/ip, but, if that is so, I cannot find
> > any documentation about it.
> 
> I always thought hawk has to run on one of the cluster nodes (natively).

Well, let's see if that is the case. BTW, the Dockerfile is
available here:

https://github.com/krig/docker-hawk

Cheers,

Dejan

> > 
> > Cheers,
> > 
> > Dejan

Re: [ClusterLabs] how to connect to the cluster from a docker container

2019-08-07 Thread Dejan Muhamedagic
Hi Ken,

On Tue, Aug 06, 2019 at 08:58:20AM -0500, Ken Gaillot wrote:
> On Tue, 2019-08-06 at 14:03 +0200, Jan Pokorný wrote:
> > On 06/08/19 13:36 +0200, Jan Pokorný wrote:
> > > On 06/08/19 10:37 +0200, Dejan Muhamedagic wrote:
> > > > Hawk runs in a docker container on one of the cluster nodes (the
> > > > nodes run Debian and apparently it's rather difficult to install
> > > > hawk on a non-SUSE distribution, hence docker). Now, how to
> > > > connect to the cluster? Hawk uses the pacemaker command line
> > > > tools such as cibadmin. I have a vague recollection that there is
> > > > a way to connect over tcp/ip, but, if that is so, I cannot find
> > > > any documentation about it.
> 
> I think one of the solutions Jan suggested would be best, but what
> you're likely remembering is remote-tls-port:
> 
> https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Administration/#s-remote-connection
> 
> However that only works for the CIB, so anything that needed to contact
> other daemons wouldn't work.

Right, that's what I couldn't recall! I'm not sure if hawk uses
anything other than the connection to the cib.
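
If it indeed only needs the CIB, the remote connection boils down to a
handful of environment variables on the client side (a sketch; host,
port and password are placeholders, and remote-tls-port has to be set
as a CIB property on the cluster first):

```
# a sketch: point the pacemaker CLI tools at a remote CIB over TCP
export CIB_host=cluster-node.example.com   # placeholder host name
export CIB_port=1234                       # must match remote-tls-port
export CIB_user=hacluster
export CIB_passwd=secret                   # placeholder password
export CIB_encrypted=true                  # false if remote-clear-port is used
cibadmin --query | head
```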

Cheers,

Dejan


> > > 
> > > I think that what you are after is one of:
> > > 
> > > 1. have docker runtime for the particular container have the
> > > abstract
> > >Unix sockets shared from the host (--network=host? don't
> > > remember
> > >exactly)
> > > 
> > >- apparently, this weak style of compartmentalization comes with
> > >  many drawbacks, so you may be facing hefty work of cutting any
> > >  other interferences stemming from pre-chrooting assumptions of
> > >  what is a singleton on the system, incl. sockets etc.
> > > 
> > > 2. use modern enough libqb (v1.0.2+) and use
> > > 
> > >  touch /etc/libqb/force-filesystem-sockets
> > > 
> > >on both host and within the container (assuming those two
> > > locations
> > >are fully disjoint, i.e., not an overlay-based reuse), you
> > > should
> > >then be able to share the respective reified sockets simply by
> > >sharing the pertaining directory (normally /var/run it seems)
> > > 
> > >- if indeed a directory as generic as /var/run is involved,
> > >  it may also lead to unexpected interferences, so the more
> > >  minimalistic the container is, the better I think
> > >  (or you can recompile libqb and play with path mapping
> > >  in container configuration to achieve smoother plug-in)
> > 
> > Oh, and there's additional prerequisite for both to at least
> > theoretically work -- 1:1 sharing of /dev/shm (which may also
> > be problematic in a sense).
> > 
> > > Then, pacemaker utilities would hopefully work across the container
> > > boundaries just as if they were fully native, hence hawk shall as
> > > well.
> > > 
> > > Let us know how far you'll get and where we can collectively join you
> > > in your attempts, I don't think we had such experience disseminated
> > > here.  I know for sure I haven't ever tried this in practice, someone
> > > else here could have.  Also, there may be a lot of fun with various
> > > Linux Security Modules like SELinux.
> > 
> -- 
> Ken Gaillot 
> 


Re: [ClusterLabs] how to connect to the cluster from a docker container

2019-08-07 Thread Dejan Muhamedagic
Hi Jan,

On Tue, Aug 06, 2019 at 01:36:49PM +0200, Jan Pokorný wrote:
> Hello Dejan,
> 
> nice to see you around,

Nice to see you too.

> On 06/08/19 10:37 +0200, Dejan Muhamedagic wrote:
> > Hawk runs in a docker container on one of the cluster nodes (the
> > nodes run Debian and apparently it's rather difficult to install
> > hawk on a non-SUSE distribution, hence docker). Now, how to
> > connect to the cluster? Hawk uses the pacemaker command line
> > tools such as cibadmin. I have a vague recollection that there is
> > a way to connect over tcp/ip, but, if that is so, I cannot find
> > any documentation about it.
> 
> I think that what you are after is one of:
> 
> 1. have docker runtime for the particular container have the abstract
>Unix sockets shared from the host (--network=host? don't remember
>exactly)
> 
>- apparently, this weak style of compartmentalization comes with
>  many drawbacks, so you may be facing hefty work of cutting any
>  other interferences stemming from pre-chrooting assumptions of
>  what is a singleton on the system, incl. sockets etc.
> 
> 2. use modern enough libqb (v1.0.2+) and use
> 
>  touch /etc/libqb/force-filesystem-sockets
> 
>on both host and within the container (assuming those two locations
>are fully disjoint, i.e., not an overlay-based reuse), you should
>then be able to share the respective reified sockets simply by
>sharing the pertaining directory (normally /var/run it seems)
> 
>- if indeed a directory as generic as /var/run is involved,
>  it may also lead to unexpected interferences, so the more
>  minimalistic the container is, the better I think
>  (or you can recompile libqb and play with path mapping
>  in container configuration to achieve smoother plug-in)
> 
> Then, pacemaker utilities would hopefully work across the container
> boundaries just as if they were fully native, hence hawk shall as
> well.
> 
> Let us know how far you'll get and where we can collectively join you
> in your attempts, I don't think we had such experience disseminated
> here.  I know for sure I haven't ever tried this in practice, someone
> else here could have.  Also, there may be a lot of fun with various
> Linux Security Modules like SELinux.

That system is out of sorts right now, but will give the
filesystem sockets a try.
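
For the record, that attempt would look roughly like this (a sketch
under the assumptions above; mount points and the image name are
illustrative, and the libqb flag is needed on both the host and in the
container image):

```
# on the host (and baked into the container image as well):
touch /etc/libqb/force-filesystem-sockets

# run the hawk container sharing the socket directories with the host
docker run -d --name hawk \
    -v /var/run:/var/run \
    -v /dev/shm:/dev/shm \
    -p 7630:7630 \
    hawk-image        # illustrative image name
```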

Many thanks!

Cheers,

Dejan

> -- 
> Jan (Poki)





[ClusterLabs] how to connect to the cluster from a docker container

2019-08-06 Thread Dejan Muhamedagic
Hi,

Hawk runs in a docker container on one of the cluster nodes (the
nodes run Debian and apparently it's rather difficult to install
hawk on a non-SUSE distribution, hence docker). Now, how to
connect to the cluster? Hawk uses the pacemaker command line
tools such as cibadmin. I have a vague recollection that there is
a way to connect over tcp/ip, but, if that is so, I cannot find
any documentation about it.

Cheers,

Dejan


Re: [ClusterLabs] Booth fail-over conditions

2018-04-17 Thread Dejan Muhamedagic
Hi,

On Mon, Apr 16, 2018 at 01:22:08PM +0200, Kristoffer Grönlund wrote:
> Zach Anderson  writes:
> 
> >  Hey all,
> >
> > new user to pacemaker/booth and I'm fumbling my way through my first proof
> > of concept. I have a 2 site configuration setup with local pacemaker
> > clusters at each site (running rabbitmq) and a booth arbitrator. I've
> > successfully validated the base failover when the "granted" site has
> > failed. My question is if there are any other ways to configure failover,
> > i.e. using resource health checks or the like?

You can take a look at "before-acquire-handler" (quite a mouthful
there). The main motivation was to add the ability to verify that
some other conditions at _the site_ are good, perhaps using
environment sensors, say to measure temperature, or to check whether
the air conditioning works, and so on.

Nothing stops you from doing a resource health check there,
but that could probably be deemed something on a rather
different "level".

> 
> Hi Zach,
> 
> Do you mean that a resource health check should trigger site failover?
> That's actually something I'm not sure comes built-in..

There's nothing really specific about a resource, because booth
knows nothing about resources. The tickets are the only way it
can describe the world ;-)

Cheers,

Dejan

> though making a
> resource agent which revokes a ticket on failure should be fairly
> straightforward. You could then group your resource with the ticket
> resource to enable this functionality.
> 
> The logic in the ticket resource ought to be something like "if monitor
> fails and the current site is granted, then revoke the ticket, else do
> nothing". You would probably want to handle probe monitor invocations
> differently. There is a ocf_is_probe function provided to help with
> this.
> 
> Cheers,
> Kristoffer
> 
> > Thanks!
> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com


Re: [ClusterLabs] pacemaker with sbd fails to start if node reboots too fast.

2017-12-05 Thread Dejan Muhamedagic
On Mon, Dec 04, 2017 at 09:55:46PM +0300, Andrei Borzenkov wrote:
> 04.12.2017 14:48, Gao,Yan пишет:
> > On 12/02/2017 07:19 PM, Andrei Borzenkov wrote:
> >> 30.11.2017 13:48, Gao,Yan пишет:
> >>> On 11/22/2017 08:01 PM, Andrei Borzenkov wrote:
>  SLES12 SP2 with pacemaker 1.1.15-21.1-e174ec8; two node cluster with
>  VM on VSphere using shared VMDK as SBD. During basic tests by killing
>  corosync and forcing STONITH pacemaker was not started after reboot.
>  In logs I see during boot
> 
>  Nov 22 16:04:56 sapprod01s crmd[3151]: crit: We were allegedly
>  just fenced by sapprod01p for sapprod01p
>  Nov 22 16:04:56 sapprod01s pacemakerd[3137]:  warning: The crmd
>  process (3151) can no longer be respawned,
>  Nov 22 16:04:56 sapprod01s pacemakerd[3137]:   notice: Shutting down
>  Pacemaker
> 
>  SBD timeouts are 60s for watchdog and 120s for msgwait. It seems that
>  stonith with SBD always takes msgwait (at least, visually host is not
>  declared as OFFLINE until 120s passed). But VM rebots lightning fast
>  and is up and running long before timeout expires.
> 
>  I think I have seen similar report already. Is it something that can
>  be fixed by SBD/pacemaker tuning?
> >>> SBD_DELAY_START=yes in /etc/sysconfig/sbd is the solution.
> >>>
> >>
> >> I tried it (on openSUSE Tumbleweed which is what I have at hand, it has
> >> SBD 1.3.0) and with SBD_DELAY_START=yes sbd does not appear to watch
> >> disk at all. 
> > It simply waits that long on startup before starting the rest of the
> > cluster stack to make sure the fencing that targeted it has returned. It
> > intentionally doesn't watch anything during this period of time.
> > 
> 
> Unfortunately it waits too long.
> 
> ha1:~ # systemctl status sbd.service
> ● sbd.service - Shared-storage based fencing daemon
>Loaded: loaded (/usr/lib/systemd/system/sbd.service; enabled; vendor
> preset: disabled)
>Active: failed (Result: timeout) since Mon 2017-12-04 21:47:03 MSK;
> 4min 16s ago
>   Process: 1861 ExecStop=/usr/bin/kill -TERM $MAINPID (code=exited,
> status=0/SUCCESS)
>   Process: 2058 ExecStart=/usr/sbin/sbd $SBD_OPTS -p /var/run/sbd.pid
> watch (code=killed, signa
>  Main PID: 1792 (code=exited, status=0/SUCCESS)
> 
> Dec 04 21:45:32 ha1 systemd[1]: Starting Shared-storage based fencing
> daemon...
> Dec 04 21:47:02 ha1 systemd[1]: sbd.service: Start operation timed out.
> Terminating.
> Dec 04 21:47:03 ha1 systemd[1]: Failed to start Shared-storage based
> fencing daemon.
> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Unit entered failed state.
> Dec 04 21:47:03 ha1 systemd[1]: sbd.service: Failed with result 'timeout'.
> 
> But the real problem is - in spite of SBD failed to start, the whole
> cluster stack continues to run; and because SBD blindly trusts in well
> behaving nodes, fencing appears to succeed after timeout ... without
> anyone taking any action on poison pill ...
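
One way to at least keep the unit from timing out during the delayed
start is a systemd drop-in (a workaround sketch only; the value has to
cover msgwait, 120s in the report above, plus some margin):

```
# sketch: give sbd.service enough start time to cover SBD_DELAY_START
mkdir -p /etc/systemd/system/sbd.service.d
cat > /etc/systemd/system/sbd.service.d/timeout.conf <<'EOF'
[Service]
TimeoutStartSec=180
EOF
systemctl daemon-reload
```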

That's something I always wondered about: if a node is capable of
reading a poison pill, then it could, before shutdown, also write an
"I'm leaving" message into its slot. Wouldn't that make sbd more
reliable? Any reason not to implement that?

Thanks,

Dejan



Re: [ClusterLabs] How much cluster-glue support is still needed in Pacemaker?

2017-11-17 Thread Dejan Muhamedagic
Hi,

On Fri, Nov 17, 2017 at 10:32:45AM +0100, Kristoffer Grönlund wrote:
> Ken Gaillot  writes:
> 
> > We're starting work on Pacemaker 2.0, which will remove support for the
> > heartbeat stack.
> >
> > cluster-glue was traditionally associated with heartbeat. Do current
> > distributions still ship it?
> >
> > Currently, Pacemaker uses cluster-glue's stonith/stonith.h to support
> > heartbeat-class stonith agents via the fence_legacy agent. If this is
> > still widely used, we can keep this support.
> >
> > Pacemaker also checks for heartbeat/glue_config.h and uses certain
> > configuration values there in favor of Pacemaker's own defaults (e.g.
> > the value of HA_COREDIR instead of /var/lib/pacemaker/cores). Does
> > anyone still use the cluster-glue configuration for such things? If
> > not, I'd prefer to drop this.
> 
> Hi Ken,
> 
> We're still shipping it, but mostly only for the legacy agents which we

I really wonder how the stonith agents earned the "legacy" label.
Though there's indeed a lot of duplication, which is of course
unfortunate.

Cheers,

Dejan

> still use - although we aim to phase them out in favor of fence-agents.
> 
> I would say that if you can keep the fence_legacy agent intact, dropping
> the rest is OK.
> 
> Cheers,
> Kristoffer
> 
> > -- 
> > Ken Gaillot 
> >
> >
> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com
> 


Re: [ClusterLabs] boothd-site/boothd-arbitrator: WARN: packet timestamp older than previous one

2017-11-11 Thread Dejan Muhamedagic
On Tue, Nov 07, 2017 at 04:26:40PM +0100, Nicolas Huillard wrote:
> Le mardi 07 novembre 2017 à 13:41 +0100, Dejan Muhamedagic a écrit :
> > Hi,
> > 
> > On Mon, Nov 06, 2017 at 10:52:12AM +0100, Nicolas Huillard wrote:
> > > Hello,
> > > 
> > > I have many of those above syslog messages from boothd (counting
> > > all servers, that's nearly one hundred per day).
> > > All sites are synchronized using NTP, but according to the source
> > > (https://github.com/ClusterLabs/booth/blob/master/src/transport.c),
> > > that specific message isn't even tied to maxtimeskew (which I
> > > forced to 120s, because I wondered if it wrongly defaulted to 0s).
> > > This message is output unless the new timestamp is larger (>) than
> > > the previous one from the same site.
> > 
> > Right.
> > 
> > > Adding debug on the arbitrator, it appears that every exchange
> > > between the servers is made of at least 2 messages in each
> > > direction. Could it be that two consecutive messages have the exact
> > > same timestamp, and thus trigger the warning?
> > 
> > The time resolution used should be sufficiently fine (i.e.
> > microseconds) so that the timestamps of two consecutive packets
> > are not the same. At least I'd expect that to be so. Now, what is
> > the actual time resolution on your platform? Maybe you should
> > check the clocksource? Still, if these are not some special kind
> > of computing platform, I suppose that there shouldn't be any
> > surprises.
> 
> My servers are regular Intel platforms ("Intel(R) Atom(TM) CPU C2750 @
> 2.40GHz" and "Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz"), using Debian
> stretch with kernel 4.9.51.
> 
> According to boot-time logs, TSC is selected as the clocksource:
> kernel: [0.00] clocksource: refined-jiffies: mask: 0x 
> max_cycles: 0x, max_idle_ns: 7645519600211568 ns
> kernel: [0.00] clocksource: hpet: mask: 0x max_cycles: 
> 0x, max_idle_ns: 133484882848 ns
> kernel: [1.512907] clocksource: jiffies: mask: 0x max_cycles: 
> 0x, max_idle_ns: 764504178510 ns
> kernel: [1.905780] clocksource: Switched to clocksource hpet
> kernel: [1.93] clocksource: acpi_pm: mask: 0xff max_cycles: 
> 0xff, max_idle_ns: 2085701024 ns
> kernel: [3.861842] tsc: Refined TSC clocksource calibration: 2099.998 MHz
> kernel: [3.861859] clocksource: tsc: mask: 0x max_cycles: 
> 0x1e452ea631d, max_idle_ns: 440795244572 ns
> kernel: [5.345808] clocksource: Switched to clocksource tsc
> 
> https://superuser.com/questions/393969/what-does-clocksource-tsc-unstable-mean#393978
> "In short, on modern systems, the TSC sucks for measuring time
> accurately" and "There is no promise that the timestamp counters of
> multiple CPUs on a single motherboard will be synchronized"
> Oops: this means that the timestamps sent may not be increasing,
> depending on which core boothd runs...

Hmm, there is supposed to be a guarantee that the clock is
monotonic (we use CLOCK_MONOTONIC). clock_getres(3) does say,
though, that it "is affected by the incremental adjustments
performed by adjtime(3) and NTP." But normally those
adjustments shouldn't exceed a certain threshold per second.
Anyway, maybe you can try with hpet as a clock source.

> The CPUs are currently underutilised, which can lead to increased
> discrepancy between cores TSCs.
> I can either:
> * switch to another clocksource (I don't yet know how to do that)

You can do that with sysfs: in
/sys/devices/system/clocksource/clocksource0/ see
available_clocksource and current_clocksource. To modify it on
boot there are certainly some kernel parameters.
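
Concretely, something like this (a sketch; whether hpet is available
depends on the platform):

```
# inspect and switch the clocksource at runtime
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# to make it persistent, boot with e.g. clocksource=hpet on the kernel command line
```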

> * lock boothd on a specific core (I don't know if I can do that)
> * ignore these messages altogether (the next one re. "timestamp older
> than skew" will still happen)
> 
> > How often are these messages logged, compared to the expire (or
> > renewal_freq) time?
> 
> 3 nodes, 5 min expire, 2 tickets = 1728 messages per day, vs. 108
> messages on all hosts in the last 24h = 6% failures.
> Apparently, the Xeon D fails more often than the Atom C.
> I wonder why this problem is not more widely experienced (not that it's
> a big problem).

No idea. If you feel like experimenting, you could add a bit more
information into the debug messages (i.e. time difference) and
log all time diffs. Otherwise, though it doesn't look so to me,
maybe we do have some issue in the code, one never knows ;-)

Thanks,

Dejan

> > Thanks,
> > 
> > Dejan
> 
> Thanks for your work!
> 
> -- 
> 

Re: [ClusterLabs] boothd-site/boothd-arbitrator: WARN: packet timestamp older than previous one

2017-11-07 Thread Dejan Muhamedagic
Hi,

On Mon, Nov 06, 2017 at 10:52:12AM +0100, Nicolas Huillard wrote:
> Hello,
> 
> I have many of those above syslog messages from boothd (counting all
> servers, that's nearly one hundred per day).
> All sites are synchronized using NTP, but according to the source
> (https://github.com/ClusterLabs/booth/blob/master/src/transport.c), that
> specific message isn't even tied to maxtimeskew (which I forced to
> 120s, because I wondered if it wrongly defaulted to 0s). This message
> is output unless the new timestamp is larger (>) than the previous one
> from the same site.

Right.

> Adding debug on the arbitrator, it appears that every exchange between
> the servers is made of at least 2 messages in each direction. Could it
> be that two consecutive messages have the exact same timestamp, and
> thus trigger the warning?

The time resolution used should be sufficiently fine (i.e.
microseconds) so that the timestamps of two consecutive packets
are not the same. At least I'd expect that to be so. Now, what is
the actual time resolution on your platform? Maybe you should
check the clocksource? Still, if these are not some special kind
of computing platform, I suppose that there shouldn't be any
surprises.

How often are these messages logged, compared to the expire (or
renewal_freq) time?

Thanks,

Dejan

> Maybe replacing ">" with ">=" could "fix" this ? (or adding a "no more
> than N identical timestamps" counter on the receiving side, or adding a
> message serial number on the sending side if consecutive timestamps are
> identical, or sending with "max(previous_ts + 1, current_ts)" if there
> are useless trailing zeros in timestamps).
> My alternative would be to just ignore those syslog mesages...
> 
> Debug info is not really useful here, because date/hour is in the exact
> same second each time. Here are 3 samples though (expire is set to
> 300):
> 
> 1) syslog on siteA vs. boothd debug on arbitrator:
> Nov  6 09:50:53 siteA boothd-site: [5459]: WARN: [arbitratorIP]: packet 
> timestamp older than previous one
> 
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: raft_answer:935: 
> ticketA (Fllw/4/148993): got HrtB from [siteAIP]
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: 
> answer_HEARTBEAT:331: ticketA (Fllw/4/148993): heartbeat from leader: 
> [siteAIP], have [siteAIP]; term 4 vs 4
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: 
> become_follower:120: ticketA (Fllw/4/148999): state transition: Fllw -> Fllw
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: 
> answer_HEARTBEAT:355: ticketA (Fllw/4/148999): ticket leader set to [siteAIP]
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: raft_answer:935: 
> ticketA (Fllw/4/148987): got UpdE from [siteAIP]
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: 
> process_UPDATE:377: ticketA (Fllw/4/148986): leader [siteAIP] wants to update 
> our ticket
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: 
> become_follower:120: ticketA (Fllw/4/298999): state transition: Fllw -> Fllw
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: 
> process_UPDATE:380: ticketA (Fllw/4/298999): ticket leader set to [siteAIP]
> Nov 06 09:50:53 arbitrator boothd-arbitrator: [9237]: debug: 
> log_next_wakeup:1182: ticketA (Fllw/4/298999): set ticket wakeup in 298.999
> 
> 2) syslog on siteB vs. boothd debug on arbitrator:
> Nov  6 09:52:21 siteB boothd-site: [16767]: WARN: [siteAIP]: packet timestamp 
> older than previous one
> 
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: raft_answer:935: 
> ticket-web (Fllw/3/148996): got HrtB from [siteBIP]
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: 
> answer_HEARTBEAT:331: ticket-web (Fllw/3/148996): heartbeat from leader: 
> [siteBIP], have [siteBIP]; term 3 vs 3
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: 
> become_follower:120: ticket-web (Fllw/3/148999): state transition: Fllw -> 
> Fllw
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: 
> answer_HEARTBEAT:355: ticket-web (Fllw/3/148999): ticket leader set to 
> [siteBIP]
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: raft_answer:935: 
> ticket-web (Fllw/3/148987): got UpdE from [siteBIP]
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: 
> process_UPDATE:377: ticket-web (Fllw/3/148987): leader [siteBIP] wants to 
> update our ticket
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: 
> become_follower:120: ticket-web (Fllw/3/298999): state transition: Fllw -> 
> Fllw
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: 
> process_UPDATE:380: ticket-web (Fllw/3/298999): ticket leader set to [siteBIP]
> Nov 06 09:52:21 arbitrator boothd-arbitrator: [9237]: debug: 
> log_next_wakeup:1182: ticket-web (Fllw/3/298999): set ticket wakeup in 298.999
> 
> 3) syslog on siteB vs. boothd debug on arbitrator:
> Nov  6 10:32:21 siteB boothd-site: [16767]: WARN: 

Re: [ClusterLabs] Configuring booth for multi-site cluster

2017-11-01 Thread Dejan Muhamedagic
On Tue, Oct 31, 2017 at 09:31:56AM +0100, Nicolas Huillard wrote:
> Le mardi 31 octobre 2017 à 08:25 +0100, Dejan Muhamedagic a écrit :
> > * is it a good idea to route the booth plain UDP/9929 traffic via
> > > Internet ? (the firewalls are configured to accept only traffic
> > > from/to
> > > the known public addresses, and the booth shared secret
> > > authentication
> > > remains secret)
> > 
> > There's nothing particularly interesting in booth traffic.
> 
> Maybe an injection could be bad, but it's apparently taken care of with
> timestamps. I'll try to use this simplest setup without IPsec.

IIRC, there's a description of the authentication process.

> > > * is it possible to use some kind of special syntax in booth.conf
> > > to
> > > declare both the NATted local and the public addresses, say
> > > arbitrator="192.168.1.1@81.12.34.56"
> > 
> > That never occurred as a possible setup/requirement and I'm not
> > sure if it'd be necessary. Shouldn't it be possible that the
> > arbitrator's internal address is also translated into the public
> > one? Or does booth at the arbitrator complain about it?
> 
> Yes, it does (sorry I forgot that info):
> booth: [536]: ERROR: Cannot find myself in the configuration.
> ...when using the external IP (81.12.34.56)
> This internal NATted IP is known to the arbitrator, but not to the other
> sites, whereas the external IP is reachable from the other sites, but
> not the arbitrator itself.
> Thus the above pseudo-syntax, resembling a bit the ipsec.conf details
> in a NATted setup.

Ah, right. Too bad.

> > > * is IPsec mandatory, and if so, what is the best setup ? (both
> > > sites
> > > have a DMZ and a cluster private network, both use PPPoE to reach
> > > the
> > > internet; each Pacemaker manages a virtual IP in the DMZ and
> > > another in
> > > the internal network, and spawns the pppd daemon which acts as a
> > > gateway to the Internet; there is an existing IPsec tunnel between
> > > the
> > > 2 sites' internal networks)
> > 
> > No, IPsec is not mandatory.
> 
> Great... or so. I don't know any other way to make the
> internal/external IPs match.
> I just tried using DNS names (resolving into different IPs depending on
> location), to no avail:
> booth: [5364]: ERROR: Address string "address.at.arbitrator.net" is bad

Only numerical addresses were supported, but in the meantime
one can also use names.

> It just occurred to me that I can also try NOT to have the exact same
> booth.conf in all the instances...

Well, in this case that could hopefully help. Otherwise, could
you please open an issue at github, maybe there is an easy way to
fix that.

> > > * with IPsec, should the booth.conf site= and arbitrator= IPs be
> > > the
> > > internal virtual IPs, or DMZ IPs, or something else entirely ?
> > 
> > Well, however the sites address each other ;-)
> 
> Both sites can address each other in a symmetric way (I'll choose the
> exact fashion in time then), but the arbitrator is an outlier with it's
> NAT (that I can't change for various other reasons).
> I understand that my setup is not high-end, as I try to take advantage
> of an existing well-managed home server.

The arbitrator is needed only, well, to arbitrate and by
definition cannot be a SPOF. But it should function reliably when
needed; for instance, you better have a not too flaky provider.
booth is being tested also in (simulated) networks of all kinds,
but it won't be of much use if there's no connection at all.

Cheers,

Dejan


> -- 
> Nicolas Huillard
> 


Re: [ClusterLabs] Configuring booth for multi-site cluster

2017-10-31 Thread Dejan Muhamedagic
Hi,

On Mon, Oct 30, 2017 at 07:03:28PM +0100, Nicolas Huillard wrote:
> Hello all,
> 
> I have 2 sites, each with an independent configured cluster
> (corosync+pacemaker), and an arbitrator server, which is behind a NAT
> connection to the Internet.
> I see in the booth.conf templates that each site/arbitrator is only
> designated by a single IP address, not taking into account the
> potential NAT, ie. the arbitrator identifies itself using its internal
> address, but is reached from the outside using the public address of
> the NAT device.
> IPsec is mentionned in https://www.suse.com/documentation/sle-ha-geo-12
> /singlehtml/art-ha-geo-quick-start/art-ha-geo-quick-start.html without
> much details.
> I'm using booth 1.0 from Debian/strech.
> 
> Questions:
> * is it a good idea to route the booth plain UDP/9929 traffic via
> Internet ? (the firewalls are configured to accept only traffic from/to
> the known public addresses, and the booth shared secret authentication
> remains secret)

There's nothing particularly interesting in booth traffic.

> * is it possible to use some kind of special syntax in booth.conf to
> declare both the NATted local and the public addresses, say
> arbitrator="192.168.1.1@81.12.34.56"

That never occurred as a possible setup/requirement and I'm not
sure if it'd be necessary. Shouldn't it be possible that the
arbitrator's internal address is also translated into the public
one? Or does booth at the arbitrator complain about it?

> * is IPsec mandatory, and if so, what is the best setup ? (both sites
> have a DMZ and a cluster private network, both use PPPoE to reach the
> internet; each Pacemaker manages a virtual IP in the DMZ and another in
> the internal network, and spawns the pppd daemon which acts as a
> gateway to the Internet; there is an existing IPsec tunnel between the
> 2 sites' internal networks)

No, IPsec is not mandatory.

> * with IPsec, should the booth.conf site= and arbitrator= IPs be the
> internal virtual IPs, or DMZ IPs, or something else entirely ?

Well, however the sites address each other ;-)

Thanks,

Dejan

> TIA,
> 
> -- 
> Nicolas Huillard
> 


Re: [ClusterLabs] Regression in Filesystem RA

2017-10-17 Thread Dejan Muhamedagic
Hi Lars,

On Mon, Oct 16, 2017 at 08:52:04PM +0200, Lars Ellenberg wrote:
> On Mon, Oct 16, 2017 at 08:09:21PM +0200, Dejan Muhamedagic wrote:
> > Hi,
> > 
> > On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:
> > > 
> > > Hello,
> > > 
> > > 2nd post in 10 years, let's see if this one gets an answer unlike the first
> > > one...
> 
> Do you want to make me check for the old one? ;-)
> 
> > > One of the main use cases for pacemaker here are DRBD replicated
> > > active/active mailbox servers (dovecot/exim) on Debian machines. 
> > > We've been doing this for a loong time, as evidenced by the oldest pair
> > > still running Wheezy with heartbeat and pacemaker 1.1.7.
> > > 
> > > The majority of cluster pairs is on Jessie with corosync and backported
> > > pacemaker 1.1.16.
> > > 
> > > Yesterday we had a hiccup, resulting in half the machines losing
> > > their upstream router for 50 seconds which in turn caused the pingd RA to
> > > trigger a fail-over of the DRBD RA and associated resource group
> > > (filesystem/IP) to the other node. 
> > > 
> > > The old cluster performed flawlessly, the newer clusters all wound up with
> > > DRBD and FS resource being BLOCKED as the processes holding open the
> > > filesystem didn't get killed fast enough.
> > > 
> > > Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> > > "signal_processes" routine.
> > > 
> > > So with the old Filesystem RA using fuser we get something like this and
> > > thousands of processes killed per second:
> > > ---
> > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> > > (res_Filesystem_mb07:stop:stdout)   3478  3593   ...
> > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> > > (res_Filesystem_mb07:stop:stderr) 
> > > cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> > > Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> > > (res_Filesystem_mb07:stop:stdout)   4032  4058   ...
> > > ---
> > > 
> > > Whereas the new RA (newer isn't better) that goes around killing processes
> > > individually with beautiful logging was a total fail at about 4 processes
> > > per second killed...
> > > ---
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > > sending signal TERM to: mail42264909  0 09:43 ?S  
> > > 0:00 dovecot/imap 
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > > sending signal TERM to: mail42294909  0 09:43 ?S  
> > > 0:00 dovecot/imap [idling]
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > > sending signal TERM to: mail42384909  0 09:43 ?S  
> > > 0:00 dovecot/imap 
> > > Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: 
> > > sending signal TERM to: mail42394909  0 09:43 ?S  
> > > 0:00 dovecot/imap 
> > > ---
> > > 
> > > So my questions are:
> > > 
> > > 1. Am I the only one with more than a handful of processes per FS who
> > > can't afford to wait hours for the new routine to finish?
> > 
> > The change was introduced about five years ago.
> 
> Also, usually there should be no process anymore,
> because whatever is using the Filesystem should have its own RA,
> which should have appropriate constraints,
> which means that should have been called and "stop"ped first,
> before the Filesystem stop and umount, and only the "accidental,
> stray, abandoned, idle since three weeks, operator shell session,
> that happened to cd into that file system" is supposed to be around
> *unexpectedly* and in need of killing, and not "thousands of service
> processes, expectedly".

Indeed, but obviously one can never tell ;-)

> So arguably your setup is broken,

Or the other RA didn't/couldn't stop the resource ...

> relying on a fall-back workaround
> which used to "perform" better.
> 
> The bug is not that this fall-back workaround now
> has pretty printing and is much slower (and eventually times out),
> the bug is that you don't properly kill the service first.
> [and that you don't have fencing].

... and didn't exit with an appropriate exit code (i.e. fail).

> > > 2. Can we have the old FUSER (kill) mode back?
> > 
> > Yes. I'll make a pull request.
> 
> Still, that's a sane thing to d

Re: [ClusterLabs] Regression in Filesystem RA

2017-10-16 Thread Dejan Muhamedagic
Hi,

On Thu, Oct 12, 2017 at 03:30:30PM +0900, Christian Balzer wrote:
> 
> Hello,
> 
> 2nd post in 10 years, let's see if this one gets an answer unlike the first
> one...
> 
> One of the main use cases for pacemaker here are DRBD replicated
> active/active mailbox servers (dovecot/exim) on Debian machines. 
> We've been doing this for a loong time, as evidenced by the oldest pair
> still running Wheezy with heartbeat and pacemaker 1.1.7.
> 
> The majority of cluster pairs is on Jessie with corosync and backported
> pacemaker 1.1.16.
> 
> Yesterday we had a hiccup, resulting in half the machines losing
> their upstream router for 50 seconds which in turn caused the pingd RA to
> trigger a fail-over of the DRBD RA and associated resource group
> (filesystem/IP) to the other node. 
> 
> The old cluster performed flawlessly, the newer clusters all wound up with
> DRBD and FS resource being BLOCKED as the processes holding open the
> filesystem didn't get killed fast enough.
> 
> Comparing the 2 RAs (no versioning T_T) reveals a large change in the
> "signal_processes" routine.
> 
> So with the old Filesystem RA using fuser we get something like this and
> thousands of processes killed per second:
> ---
> Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> (res_Filesystem_mb07:stop:stdout)   3478  3593  3597  3618  3654  3705  3708  
> 3716  3736  3781  3792  3804  3963  3964  3972  3974  3978  3980  3981  3982  
> 3985  3987  3991  3996  4002  4008  4013  4030
> Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> (res_Filesystem_mb07:stop:stderr) 
> cmccmccmccmcmcmcmcmccmccmcmcmcmcmcmcmcmcmcmcmcmccmcm
> Oct 11 15:06:35 mbx07 lrmd: [4731]: info: RA output: 
> (res_Filesystem_mb07:stop:stdout)   4032  4058  4086  4107  4199  4230  4320  
> 4336  4362  4420  4429  4432  4435  4450  4468  4470  4471  4498  4510  4519  
> 4584  4592  4604  4607  4632  4638  4640  4649  4676  4722  4765
> ---
> 
> Whereas the new RA (newer isn't better) that goes around killing processes
> individually with beautiful logging was a total fail at about 4 processes
> per second killed...
> ---
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending 
> signal TERM to: mail42264909  0 09:43 ?S  0:00 
> dovecot/imap 
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending 
> signal TERM to: mail42294909  0 09:43 ?S  0:00 
> dovecot/imap [idling]
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending 
> signal TERM to: mail42384909  0 09:43 ?S  0:00 
> dovecot/imap 
> Oct 11 15:06:46 mbx10 Filesystem(res_Filesystem_mb10)[288712]: INFO: sending 
> signal TERM to: mail42394909  0 09:43 ?S  0:00 
> dovecot/imap 
> ---
> 
> So my questions are:
> 
> 1. Am I the only one with more than a handful of processes per FS who
> can't afford to wait hours for the new routine to finish?

The change was introduced about five years ago.

> 2. Can we have the old FUSER (kill) mode back?

Yes. I'll make a pull request.
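
In essence, restoring the old behaviour means going back to one fuser
call over the whole mountpoint instead of per-process kills, roughly
(a sketch, not the exact RA code; the mountpoint is a placeholder):

```
# a sketch of the fuser-based bulk kill discussed above
MOUNTPOINT=/srv/mail                 # placeholder
fuser -TERM -kvm "$MOUNTPOINT"       # SIGTERM everything holding the fs open
sleep 1
fuser -kvm "$MOUNTPOINT"             # then SIGKILL whatever is left
```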

Sorry for the trouble.

Thanks,

Dejan



> Regards,
> 
> Christian
> -- 
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com Rakuten Communications
> 


Re: [ClusterLabs] ocf_take_lock is NOT actually safe to use

2017-06-26 Thread Dejan Muhamedagic
Hi,

On Wed, Jun 21, 2017 at 04:40:47PM +0200, Lars Ellenberg wrote:
> 
> Repost to a wider audience, to raise awareness for this.
> ocf_take_lock may or may not be better than nothing.
> 
> It at least "annotates" that the author would like to protect something
> that is considered a "critical region" of the resource agent.
> 
> At the same time, it does NOT deliver what the name seems to imply.
> 

Lars, many thanks for the analysis and bringing this up again.

I'm not going to take on the details below, just to say that
there's now a pull request for the issue:

https://github.com/ClusterLabs/resource-agents/pull/995

In short, it consists of reducing the race window size (by using
mkdir*), a double test for stale locks, and an improved random number
function. I ran numerous tests with and without stale locks and
it seems to hold quite well.

The comments there contain a detailed description of the
approach.
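
In rough terms, the mkdir part relies on directory creation being
atomic (a sketch only; the actual change layers stale-lock detection
and backoff on top, and the path is illustrative):

```
# a sketch of mkdir-based locking: only one caller can create the directory
lockdir=/run/my-ra.lock.d
until mkdir "$lockdir" 2>/dev/null; do
    sleep 0.5      # held by someone else; real code also checks for stale locks
done
trap 'rmdir "$lockdir"' EXIT
# ... critical region ...
```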

Please review and comment whoever finds time.

Cheers,

Dejan

*) Though the current implementation uses just a file and the
proposed one directories, the locks are short lived and there
shouldn't be problems on upgrades.

> I think I brought this up a few times over the years, but was not noisy
> enough about it, because it seemed not important enough: no-one was
> actually using this anyways.
> 
> But since new usage has been recently added with
> [ClusterLabs/resource-agents] targetcli lockfile (#917)
> here goes:
> 
> On Wed, Jun 07, 2017 at 02:49:41PM -0700, Dejan Muhamedagic wrote:
> > On Wed, Jun 07, 2017 at 05:52:33AM -0700, Lars Ellenberg wrote:
> > > Note: ocf_take_lock is NOT actually safe to use.
> > > 
> > > As implemented, it uses "echo $pid > lockfile" to create the lockfile,
> > > which means if several such "ocf_take_lock" happen at the same time,
> > > they all "succeed", only the last one will be the "visible" one to future 
> > > waiters.
> > 
> > Ugh.
> 
> Exactly.
> 
> Reproducer:
> #
> #!/bin/bash
> export OCF_ROOT=/usr/lib/ocf/ ;
> .  /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs ;
> 
> x() (
>   ocf_take_lock dummy-lock ;
>   ocf_release_lock_on_exit dummy-lock  ;
>   set -C;
>   echo x > protected && sleep 0.15 && rm -f protected || touch BROKEN;
> );
> 
> mkdir -p /run/ocf_take_lock_demo
> cd /run/ocf_take_lock_demo
> rm -f BROKEN; i=0;
> time while ! test -e BROKEN; do
>   x &  x &
>   wait;
>   i=$(( i+1 ));
> done ;
> test -e BROKEN && echo "reproduced race in $i iterations"
> #
> 
> x() above takes, and, because of the () subshell and
> ocf_release_lock_on_exit, releases the "dummy-lock",
> and within the protected region of code,
> creates and removes a file "protected".
> 
> If ocf_take_lock was good, there could never be two instances
> inside the lock, so echo x > protected should never fail.
> 
> With the current implementation of ocf_take_lock,
> it takes "just a few" iterations here to reproduce the race.
> (usually within a minute).
> 
> The races I see in ocf_take_lock:
> "creation race":
>   test -e $lock
>   # someone else may create it here
>   echo $$ > $lock
>   # but we override it with ours anyways
> 
> "still empty race":
>   test -e $lock   # maybe it already exists (open O_CREAT|O_TRUNC)
>   # but does not yet contain target pid,
>   pid=`cat $lock` # this one is empty,
>   kill -0 $pid# and this one fails
>   and thus a "just being created" one is considered stale
> 
> There are other problems around "stale pid file detection",
> but let's not go into that minefield right now.
> 
> > > Maybe we should change it to 
> > > ```
> > > while ! ( set -C; echo $pid > lockfile ); do
> > > if test -e lockfile ; then
> > > : error handling for existing lockfile, stale lockfile detection
> > > else
> > > : error handling for not being able to create lockfile
> > > fi
> > > done
> > > : only reached if lockfile was successfully created
> > > ```
> > > 
> > > (or use flock or other tools designed for that purpose)
> > 
> > flock would probably be the easiest. mkdir would do too, but for
> > upgrade issues.
> 
> and, being part of util-linux, flock should be available "everywhere".
> 
> but becaus
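
A minimal sketch of the flock(1) alternative mentioned above
(illustrative lock path; not the implementation that was merged):

```
#!/bin/bash
# serialize a critical region with flock(1) instead of "echo $$ > lockfile"
lockfile=/run/my-ra.lock          # illustrative path
(
    flock -x -w 10 9 || { echo "could not acquire lock" >&2; exit 1; }
    # ... critical region: at most one holder at a time ...
    sleep 0.15
) 9>"$lockfile"
```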

Re: [ClusterLabs] Rename option group resource id with pcs

2017-04-11 Thread Dejan Muhamedagic
Hi,

On Tue, Apr 11, 2017 at 10:50:56AM +0200, Tomas Jelinek wrote:
> Dne 11.4.2017 v 08:53 SAYED, MAJID ALI SYED AMJAD ALI napsal(a):
> >Hello,
> >
> >Is there any option in pcs to rename group resource id?
> >
> 
> Hi,
> 
> No, there is not.
> 
> Pacemaker doesn't really cover the concept of renaming a resource.

Perhaps you can check how crmsh does resource rename. It's not
impossible, but can be rather involved if there are other objects
(e.g. constraints) referencing the resource. Also, crmsh will
refuse to rename the resource if it's running.
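
With crmsh that is roughly the following (a sketch; the ids are
placeholders, and the resource has to be stopped first):

```
crm resource stop old-resource-id
crm configure rename old-resource-id new-resource-id
```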

Thanks,

Dejan

> From
> pacemaker's point of view one resource gets removed and another one gets
> created.
> 
> This has been discussed recently:
> http://lists.clusterlabs.org/pipermail/users/2017-April/005387.html
> 
> Regards,
> Tomas
> 
> >
> >
> >
> >
> >
> >
> >*/MAJID SAYED/*
> >
> >/HPC System Administrator./
> >
> >/King Abdullah International Medical Research Centre/
> >
> >/Phone:+9661801(Ext:40631)/
> >
> >/Email:sayed...@ngha.med.sa/
> >
> >
> >
> >
> >
> >This Email and any files transmitted may contain confidential and/or
> >privileged information and is intended solely for the addressee(s)
> >named. If you have received this information in error, or are being
> >posted by accident, please notify the sender by return Email, do not
> >redistribute this email message, delete it immediately and keep no
> >copies of it. All opinions and/or views expressed in this email are
> >solely those of the author and do not necessarily represent those of
> >NGHA. Any purchase order, purchase advice or legal commitment is only
> >valid once backed by the signed hardcopy by the authorized person from NGHA.
> >
> >
> >
> 


Re: [ClusterLabs] PSA Ubuntu 16.04 and OCSF2 corruption

2017-04-11 Thread Dejan Muhamedagic
Hi,

On Mon, Apr 10, 2017 at 04:33:23PM -0400, Kyle O'Donnell wrote:
> Hello,
> 
> Just opened what I think is a bug with the Ubuntu util-linux package:
> 
> https://bugs.launchpad.net/ubuntu/+source/util-linux/+bug/1681410

I can vaguely recall a cron script on SUSE running fstrim only on
certain filesystem types. There's probably something similar on
Ubuntu. BTW, isn't that only for SSD?
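
The type-restricted variant would look roughly like this (a sketch; the
list of types is illustrative and deliberately leaves out cluster
filesystems such as ocfs2):

```
# trim only selected local filesystem types, skipping shared/cluster ones
for mnt in $(findmnt -rno TARGET -t ext4,xfs,btrfs); do
    fstrim "$mnt"
done
```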

Thanks,

Dejan

> 
> TL;DR
> 
> The 'fstrim' command is run weekly on all filesystems.  If you're using ocfs2 
> and the same filesystem is mounted on multiple Ubuntu 16.04 servers, this 
> fstrim is run at the same time to the same device from all servers.  I'm 
> positive this is what's causing my filesystem corruption issues, which occurs 
> a minute or two after fstrim is scheduled to run.
> 
> -Kyle
> 


Re: [ClusterLabs] Adding a node to the cluster deployed with "Unicast configuration"

2017-04-11 Thread Dejan Muhamedagic
On Mon, Apr 10, 2017 at 03:24:12PM -0300, Alejandro Comisario wrote:
> Sorry, I left out the last command, where I took the node out of the CIB.
> 
> discard my previous question, thanks

If you're using crmsh, there's the "node remove" command. It
should take care of all details, IIRC.

Thanks,

Dejan
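
A rough sketch of the removal steps (exact crmsh subcommand names vary
between versions; node and file names are placeholders):

# on the node being removed
systemctl stop pacemaker corosync

# on any remaining node: drop the node {} entry from corosync.conf everywhere,
# reload corosync, then purge the node from the CIB
vi /etc/corosync/corosync.conf
corosync-cfgtool -R
crm node delete old-node-name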

> 
> On Mon, Apr 10, 2017 at 3:11 PM, Alejandro Comisario
>  wrote:
> > Hi, I'm reviving this thread because of the following.
> >
> > To add new nodes to the cluster, this procedure works like a charm.
> > But to take them out, I assumed the following, which didn't work.
> >
> > * stop corosync/pacemaker on the old node
> > * update corosync.conf on all cluster nodes and take the old node out of
> > the config
> > * corosync-cfgtool -R from one of the nodes
> >
> > But the old node never disappears; it just stays listed as OFFLINE.
> > How can I take a node out of the cluster and make it disappear
> > from the list of nodes in it?
> >
> > best.
> >
> > On Thu, Feb 23, 2017 at 1:23 PM, Alejandro Comisario
> >  wrote:
> >> Amazing, I will try to do that on Ubuntu also and get back to you!
> >> Thanks for the help!
> >>
> >> On Wed, Feb 22, 2017 at 5:08 PM, Ken Gaillot  wrote:
> >>> On 02/22/2017 09:55 AM, Jan Friesse wrote:
>  Alejandro Comisario napsal(a):
> > Hi everyone, I have a problem when scaling a corosync/pacemaker
> > cluster deployed using unicast.
> >
> > eg on corosync.conf.
> >
> > nodelist {
> > node {
> > ring0_addr: 10.10.0.10
> > nodeid: 1
> > }
> > node {
> > ring0_addr: 10.10.0.11
> > nodeid: 2
> > }
> > }
> >
> > Tried to add the new node with the new config (meaning, adding the new
> > node) and leaving the other two with the same config, and started
> > services on the third node, but it doesn't seem to work until I update the
> > config on servers #1 and #2 and restart corosync/pacemaker, which
> > does the obvious of bringing every resource down.
> >
> > There should be a way to "hot add" a new node to the cluster, but I
> > don't seem to find one.
> > What is the best way to add a node without bothering the rest? Or,
> > better said, what is the right way to do it?
> 
>  Adding node is possible with following process:
>  - Add node to config file on both existing nodes
>  - Exec corosync-cfgtool -R (it's enough to exec only on one of nodes)
>  - Make sure 3rd (new) node has same config as two existing nodes
>  - Start corosync/pcmk on new node
> 
>  This is (more or less) how pcs works. No stopping of corosync is needed.
> 
>  Regards,
>    Honza
> >>>
> >>> Ah, noted ... I need to update some upstream docs :-)
> >>>
> >
> > PS: I can't implement multicast on my network.
> >
> >>>
> >>> ___
> >>> Users mailing list: Users@clusterlabs.org
> >>> http://lists.clusterlabs.org/mailman/listinfo/users
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>
> >>
> >>
> >> --
> >> Alejandrito
> >
> > Alejandrito
> 
> 
> 
> -- 
> Alejandro Comisario
> CTO | NUBELIU
> E-mail: alejandro@nubeliu.comCell: +54 9 11 3770 1857
> _
> www.nubeliu.com
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] to change resource id - how?

2017-04-04 Thread Dejan Muhamedagic
On Mon, Apr 03, 2017 at 09:26:04AM -0500, Ken Gaillot wrote:
> On 04/03/2017 06:35 AM, lejeczek wrote:
> > hi
> > I'm googling and reading but cannot find any info on how to
> > (programmatically) change resource ids? In other words: how to rename
> > these entities?
> > many thanks
> > L
> 
> As far as I know, higher-level tools don't support this directly -- you

At least crmsh does. There's "crm cfg rename".

Thanks,

Dejan
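
A minimal sketch of the crmsh way (resource ids are placeholders; crmsh
refuses to rename a running resource, so stop it first):

crm resource stop old_rsc_id
crm configure rename old_rsc_id new_rsc_id
crm resource start new_rsc_id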

> have to edit the XML. The basic process is:
> 
> 1. Save a copy of the live CIB to a file.
> 2. Edit that file, and do a search-and-replace on the desired name (so
> you change it in constraints, etc., as well as the resource definition).
> 3. Push the configuration section of that file to the live CIB.
> 
> The higher-level tools have commands to do that, but at the low level,
> it would be something like:
> 
> 1. cibadmin -Q --scope configuration > tmp.cib
> 2. vim tmp.cib
> 3. cibadmin -x tmp.cib --replace --scope configuration
> 
> The cluster will treat it as a completely new resource, so it will stop
> the old one and start the new one.
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: question about ocf metadata actions

2017-03-31 Thread Dejan Muhamedagic
On Fri, Mar 31, 2017 at 09:46:18AM +0200, Kristoffer Grönlund wrote:
> Ulrich Windl  writes:
> 
> > I thought the hierarchy is like this:
> > 1) default timeout
> > 2) RA's default timeout
> > 3) user-specified timeout
> >
> > So crm would go from 1) to 3) taking the last value it finds. Isn't it like
> > that?
> 
> No, step 2) is not taken by crm.

There's a command in crmsh to insert default timeouts in the
resource. There was some initiative to do that at the resource
creation time, i.e. that crmsh automatically propagates all
defaults from the RA metadata, but that would've made the
configuration very large.

> > I mean if there's no timeout in the resource cnfiguration, doesn't the RM 
> > use
> > the default timeout?
> 
> Yes, it then uses the timeout defined in op_defaults:
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-operation-defaults

At any rate, the RA metadata is only for informational purpose.

Thanks,

Dejan
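
For illustration, a cluster-wide default and a per-operation override might
look like this (resource name and values are made up):

crm configure op_defaults timeout=60s
crm configure primitive p_db ocf:heartbeat:mysql \
    op monitor interval=30s timeout=120s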
> 
> Cheers,
> Kristoffer
> 
> >
> > Regards,
> > Ulrich
> >
> >> 
> >> https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra
> >
> >> -dev-guide.asc#_metadata
> >> 
> >>> Every action should list its own timeout value. This is a hint to the
> >>> user what minimal timeout should be configured for the action. This is
> >>> meant to cater for the fact that some resources are quick to start and
> >>> stop (IP addresses or filesystems, for example), some may take several
> >>> minutes to do so (such as databases).
> >> 
> >>> In addition, recurring actions (such as monitor) should also specify a
> >>> recommended minimum interval, which is the time between two
> >>> consecutive invocations of the same action. Like timeout, this value
> >>> does not constitute a default— it is merely a hint for the user which
> >>> action interval to configure, at minimum.
> >> 
> >> Cheers,
> >> Kristoffer
> >> 
> >>>
> >>> Br,
> >>>
> >>> Allen
> >>> ___
> >>> Users mailing list: Users@clusterlabs.org 
> >>> http://lists.clusterlabs.org/mailman/listinfo/users 
> >>>
> >>> Project Home: http://www.clusterlabs.org 
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> >>> Bugs: http://bugs.clusterlabs.org 
> >> 
> >> -- 
> >> // Kristoffer Grönlund
> >> // kgronl...@suse.com 
> >> 
> >> ___
> >> Users mailing list: Users@clusterlabs.org 
> >> http://lists.clusterlabs.org/mailman/listinfo/users 
> >> 
> >> Project Home: http://www.clusterlabs.org 
> >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> >> Bugs: http://bugs.clusterlabs.org 
> >
> >
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-31 Thread Dejan Muhamedagic
Hi,

On Fri, Mar 31, 2017 at 02:39:02AM -0400, Digimer wrote:
> On 31/03/17 02:32 AM, Jan Friesse wrote:
> >> The original message has the logs from nodes 1 and 3. Node 2, the one
> >> that
> >> got fenced in this test, doesn't really show much. Here are the logs from
> >> it:
> >>
> >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
> >> 192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
> >> active_time=3253 secs
> >> Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
> >> fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
> >> dropped=0, active_time=3253 secs
> >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed,
> >> forming new configuration.
> >> Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed,
> >> forming
> >> new configuration.
> >> Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network
> >> interface
> >> is down.
> > 
> > This is problem. Corosync handles ifdown really badly. If this was not
> > intentional it may be caused by NetworkManager. Then please install
> > equivalent of NetworkManager-config-server package (it's actually one
> > file called 00-server.conf so you can extract it from, for example,
> > Fedora package
> > https://www.rpmfind.net/linux/RPM/fedora/devel/rawhide/x86_64/n/NetworkManager-config-server-1.8.0-0.1.fc27.noarch.html)
> 
> ifdown'ing corosync's interface happens a lot, intentionally or
> otherwise.

I'm not sure, but I think that it can happen only intentionally,
i.e. through a human intervention. If there's another problem
with the interface it doesn't disappear from the system.

Thanks,

Dejan

> I think it is reasonable to expect corosync to handle this
> properly. How hard would it be to make corosync resilient to this fault
> case?
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-29 Thread Dejan Muhamedagic
On Fri, Mar 24, 2017 at 01:44:44PM -0700, Seth Reid wrote:
> I have a three node Pacemaker/GFS2 cluster on Ubuntu 16.04. It's not in
> production yet because I'm having a problem during fencing. When I disable
> the network interface of any one machine, the disabled machine is properly
> fenced, leaving me, briefly, with a two node cluster. A second node is then
> fenced off immediately, and the remaining node appears to try to fence
> itself off. This leaves two nodes with corosync/pacemaker stopped, and the
> remaining machine still in the cluster but showing an offline node and an
> UNCLEAN node. What can be causing this behavior?

Man, you've got a very suicidal cluster. Is it depressed? Did you
try psychotherapy?

Otherwise, it looks like corosync crashed. Maybe look for core
dumps. Also, there should be another log in
/var/log/pacemaker.log (or similar) with lower severity messages.

Thanks,

Dejan

> Each machine has a dedicated network interface for the cluster, and there
> is a vlan on the switch devoted to just these interfaces.
> In the following, I disabled the interface on node id 2 (b014). Node 1
> (b013) is fenced as well. Node 3 (b015) is still up.
> 
> Logs from b013:
> Mar 24 16:35:01 b013 CRON[19133]: (root) CMD (command -v debian-sa1 >
> /dev/null && debian-sa1 1 1)
> Mar 24 16:35:13 b013 corosync[2134]: notice  [TOTEM ] A processor failed,
> forming new configuration.
> Mar 24 16:35:13 b013 corosync[2134]:  [TOTEM ] A processor failed, forming
> new configuration.
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] A new membership (
> 192.168.100.13:576) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]: notice  [TOTEM ] Failed to receive the
> leave message. failed: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] A new membership (
> 192.168.100.13:576) was formed. Members left: 2
> Mar 24 16:35:17 b013 corosync[2134]:  [TOTEM ] Failed to receive the leave
> message. failed: 2
> Mar 24 16:35:17 b013 attrd[2223]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 cib[2220]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 cib[2220]:   notice: Purged 1 peers with id=2 and/or
> uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 pacemakerd[2187]:   notice: crm_reap_unseen_nodes:
> Node b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Removing b014-cl/2 from the
> membership list
> Mar 24 16:35:17 b013 attrd[2223]:   notice: Purged 1 peers with id=2 and/or
> uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: crm_update_peer_proc: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Removing b014-cl/2 from
> the membership list
> Mar 24 16:35:17 b013 stonith-ng[2221]:   notice: Purged 1 peers with id=2
> and/or uname=b014-cl from the membership cache
> Mar 24 16:35:17 b013 dlm_controld[2727]: 3091 fence request 2 pid 19223
> nodedown time 1490387717 fence_all dlm_stonith
> Mar 24 16:35:17 b013 kernel: [ 3091.800118] dlm: closing connection to node
> 2
> Mar 24 16:35:17 b013 crmd[2227]:   notice: crm_reap_unseen_nodes: Node
> b014-cl[2] - state is now lost (was member)
> Mar 24 16:35:17 b013 dlm_stonith: stonith_api_time: Found 0 entries for
> 2/(null): 0 in progress, 0 completed
> Mar 24 16:35:18 b013 stonith-ng[2221]:   notice: Operation reboot of
> b014-cl by b015-cl for stonith-api.19223@b013-cl.7aeb2ffb: OK
> Mar 24 16:35:18 b013 stonith-api[19223]: stonith_api_kick: Node 2/(null)
> kicked: reboot
> Mar 24 16:35:18 b013 kernel: [ 3092.421495] dlm: closing connection to node
> 3
> Mar 24 16:35:18 b013 kernel: [ 3092.422246] dlm: closing connection to node
> 1
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace share_data
> Mar 24 16:35:18 b013 dlm_controld[2727]: 3092 abandoned lockspace clvmd
> Mar 24 16:35:18 b013 kernel: [ 3092.426545] dlm: dlm user daemon left 2
> lockspaces
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Main process exited,
> code=exited, status=255/n/a
> Mar 24 16:35:18 b013 cib[2220]:error: Connection to the CPG API failed:
> Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Unit entered failed
> state.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to cib_rw failed
> Mar 24 16:35:18 b013 systemd[1]: corosync.service: Failed with result
> 'exit-code'.
> Mar 24 16:35:18 b013 attrd[2223]:error: Connection to
> cib_rw[0x560754147990] closed (I/O condition=17)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Main process exited,
> code=exited, status=107/n/a
> Mar 24 16:35:18 b013 pacemakerd[2187]:error: Connection to the CPG API
> failed: Library error (2)
> Mar 24 16:35:18 b013 systemd[1]: pacemaker.service: Unit entered failed
> state.

Re: [ClusterLabs] stonith in dual HMC environment

2017-03-28 Thread Dejan Muhamedagic
On Tue, Mar 28, 2017 at 04:20:12PM +0300, Alexander Markov wrote:
> Hello, Dejan,
> 
> >Why? I don't have a test system right now, but for instance this
> >should work:
> >
> >$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
> >$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}
> 
> Ah, I see. Everything (including stonith methods, fencing and failover)
> works just fine under normal circumstances. Sorry if I wasn't clear about
> that. The problem occurs only when I have one datacenter (i.e. one IBM
> machine and one HMC) lost due to power outage.
> 
> For example:
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.8 -lS | wc -l
> info: ibmhmc device OK.
> 39
> test01:~ # stonith -t ibmhmc ipaddr=10.1.2.9 -lS | wc -l
> info: ibmhmc device OK.
> 39
> 
> As I had said stonith device can see and manage all the cluster nodes.

That's great :)

> >If so, then your configuration does not appear to be correct. If
> >both are capable of managing all nodes then you should tell
> >pacemaker about it.
> 
> Thanks for the hint. But if the stonith device returns the node list, isn't it
> obvious to the cluster that it can manage those nodes?

Did you try that? Just drop the location constraints and see if
it works. The pacemaker should actually keep the list of resources
(stonith) capable of managing the node.

> Could you please be more
> precise about what you refer to? I have currently changed the configuration to two
> fencing levels (one per HMC) but still don't think I get the idea here.
> 
> >Survived node, running stonith resource for dead node tries to
> >contact ipmi device (which is also dead). How does cluster understand that
> >lost node is really dead and it's not just a network issue?
> >
> >It cannot.
> 
> How do people then actually solve the problem of a two node metro cluster?

That depends, but if you have a communication channel for stonith
devices which is _independent_ of the cluster communication then
you should be OK. Of course, a fencing device which goes down
together with its node is of no use, but that doesn't seem to be
the case here.

> I mean, I know one option: stonith-enabled=false, but it doesn't seem right
> for me.

Certainly not.

Thanks,

Dejan
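
For illustration only, a configuration where both HMCs can fence every LPAR,
with one tried before the other, might look roughly like this in crmsh
(resource names and addresses are placeholders, not a verified setup):

primitive st_hmc1 stonith:ibmhmc params ipaddr=10.1.2.8
primitive st_hmc2 stonith:ibmhmc params ipaddr=10.1.2.9
fencing_topology st_hmc1 st_hmc2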

> 
> Thank you.
> 
> Regards,
> Alexander Markov
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] stonith in dual HMC environment

2017-03-28 Thread Dejan Muhamedagic
On Mon, Mar 27, 2017 at 01:17:31PM +0300, Alexander Markov wrote:
> Hello, Dejan,
> 
> 
> >The first thing I'd try is making sure you can fence each node from the
> >command line by manually running the fence agent. I'm not sure how to do
> >that for the "stonith:" type agents.
> >
> >There's a program stonith(8). It's easy to replicate the
> >configuration on the command line.
> 
> Unfortunately, it is not.

Why? I don't have a test system right now, but for instance this
should work:

$ stonith -t ibmhmc ipaddr=10.1.2.9 -lS
$ stonith -t ibmhmc ipaddr=10.1.2.9 -T reset {nodename}

Read the examples in the man page:

$ man stonith

Check also the documentation of your agent:

$ stonith -t ibmhmc -h
$ stonith -t ibmhmc -n

> The landscape I refer to is similar to VMWare. We use the cluster for virtual
> machines (LPARs) and everything works OK, but the real pain occurs when the whole
> host system is down. Keeping in mind that it's actually in production now, I
> just can't afford to turn it off for testing.

Yes, I understand. However, I was just talking about how to use
the stonith agents and how to do the testing outside of
pacemaker.

> >Stonith agents are to be queried for the list of nodes they can
> >manage. It's part of the interface. Some agents can figure that
> >out by themself and some need a parameter defining the node list.
> 
> And this is just the place where I'm stuck. I've got two stonith devices (ibmhmc)
> for redundancy. Both of them are capable of managing every node.

If so, then your configuration does not appear to be correct. If
both are capable of managing all nodes then you should tell
pacemaker about it. Digimer had a fairly extensive documentation
on how to configure complex fencing configurations. You can also
check with your vendor's documentation.

> The problem starts when
> 
> 1) one stonith device is completely lost and inaccessible (due to a power
> outage in the datacenter)
> 2) the surviving stonith device can access neither the cluster node nor the hosting
> system (in VMWare terms) for this cluster node, as both of them are also
> lost due to the power outage.

Both lost? What remained? Why do you mention vmware? I thought
that your nodes are LPARs.

> What is the correct solution for this situation?
> 
> >Well, this used to be a standard way to configure one kind of
> >stonith resources, one common representative being ipmi, and
> >served exactly the purpose of restricting the stonith resource
> >from being enabled ("running") on a node which this resource
> >manages.
> 
> Unfortunately, there's no such thing as ipmi in IBM Power boxes.

I mentioned ipmi as an example, not that it has anything to do
with your setup.

> But it
> triggers an interesting question for me: if both a node and its complementary
> ipmi device are lost (due to a power outage) - what happens to the
> cluster?

The cluster gets stuck trying to fence the node. Typically this
would render your cluster unusable. There are some IPMI devices
which have a battery to allow for some extra time to manage the
host.

> The surviving node, running the stonith resource for the dead node, tries to
> contact the ipmi device (which is also dead). How does the cluster understand that
> the lost node is really dead and it's not just a network issue?

It cannot.

Thanks,

Dejan

> 
> Thank you.
> 
> -- 
> Regards,
> Alexander Markov
> +79104531955
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Insert delay between the statup of VirtualDomain

2017-03-03 Thread Dejan Muhamedagic
Hi,

On Wed, Mar 01, 2017 at 01:47:21PM +0100, Oscar Segarra wrote:
> Hi Dejan,
> 
> In my environment, is it possible to launch the check from the hypervisor.
> A simple telnet against an specific port may be enough tp check if service
> is ready.

telnet is not so practical for scripts, better use ssh or
the mysql client.

> In this simple scenario (and check), how can I instruct the second server to
> wait until the mysql server is up?

That's what the ordering constraints in pacemaker are for. You
don't need to do anything special.

Thanks,

Dejan
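
For example, with made-up resource names, a single mandatory ordering is
enough to make the application guest wait until the database guest's start
operation has completed:

order o_db_before_app inf: vd_mysql vd_app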

> 
> Thanks a lot
> 
> El 1 mar. 2017 1:08 p. m., "Dejan Muhamedagic" <deja...@fastmail.fm>
> escribió:
> 
> > Hi,
> >
> > On Sat, Feb 25, 2017 at 09:58:01PM +0100, Oscar Segarra wrote:
> > > Hi,
> > >
> > > Yes,
> > >
> > > The database server can be considered started up when it accepts mysql client
> > > connections
> > > The application server can be considered started as soon as the listening
> > > port is up and accepting connections
> > >
> > > ¿Can you provide any example about how to achieve this?
> >
> > Is it possible to connect to the database from the supervisor?
> > Then something like this would do:
> >
> > mysql -h vm_ip_address ... < /dev/null
> >
> > If not, then if ssh works:
> >
> > echo mysql ... | ssh vm_ip_address
> >
> > I'm afraid I cannot help you more with mysql details and what to
> > put in '...' stead above, but it should do whatever is necessary
> > to test if the database reached the functional state. You can
> > find an example in ocf:heartbeat:mysql: just look for the
> > "test_table" parameter. Of course, you'll need to put that in a
> > script and test output and so on. I guess that there's enough
> > information in internet on how to do that.
> >
> > Good luck!
> >
> > Dejan
> >
> > > Thanks a lot.
> > >
> > >
> > > 2017-02-25 19:35 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>:
> > >
> > > > Hi,
> > > >
> > > > On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote:
> > > > > Hi,
> > > > >
> > > > > In my environment I have 5 guests that have to be started up in a
> > > > > specified order, starting with the MySQL database server.
> > > > >
> > > > > I have set the order constraints and VirtualDomains start in the right
> > > > > order but, the problem I have, is that the second host starts up faster
> > > > > than the database server and therefore applications running on the second
> > > > > host raise errors due to database connectivity problems.
> > > > >
> > > > > I'd like to introduce a delay between the startup of the VirtualDomain of
> > > > > the database server and the startup of the second guest.
> > > >
> > > > Do you have a way to check if this server is up? If so...
> > > > The start action of VirtualDomain won't exit until the monitor
> > > > action returns success. And there's a parameter called
> > > > monitor_scripts (see the meta-data). Note that these programs
> > > > (scripts) are run at the supervisor host and not in the guest.
> > > > It's all a bit involved, but should be doable.
> > > >
> > > > Thanks,
> > > >
> > > > Dejan
> > > >
> > > > > ¿Is it any way to get this?
> > > > >
> > > > > Thanks a lot.
> > > >
> > > > > ___
> > > > > Users mailing list: Users@clusterlabs.org
> > > > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > > >
> > > > > Project Home: http://www.clusterlabs.org
> > > > > Getting started: http://www.clusterlabs.org/
> > doc/Cluster_from_Scratch.pdf
> > > > > Bugs: http://bugs.clusterlabs.org
> > > >
> > > >
> > > > ___
> > > > Users mailing list: Users@clusterlabs.org
> > > > http://lists.clusterlabs.org/mailman/listinfo/users
> > > >
> > > > Project Home: http://www.clusterlabs.org
> > > > Getting started: http://www.clusterlabs.org/
> > doc/Cluster_from_Scratch.pdf
> > > > Bugs: http://bugs.clusterlabs.org
> > > >
> >
> > 

Re: [ClusterLabs] Oralsnr/Oracle resources agents

2017-03-01 Thread Dejan Muhamedagic
Hi,

On Sun, Feb 26, 2017 at 09:51:47AM +0300, Andrei Borzenkov wrote:
> 25.02.2017 23:18, Jihed M'selmi пишет:
> > [DM] I thought that oracle listener is not consuming that many resources.
> > At any rate, ocf:heartbeat:oralsnr doesn't support single listener for
> > multiple instances. Do you have an idea how to do that? How to deal with
> > the tnsping then? Maybe you're better off with the system start script in
> > this case.
> > 
> > [JM] According to the dba, it could lead to memory issues when the
> > listener serves many instances at the same time (in my experience, I have
> > never faced this issue).
> > 
> 
> What "it" means in the above sentence? "Running single listener for
> multiple instances" or "running each instance with own listener"?
> 
> How many instances are we talking about?
> 
> > Let's take a case where the listener is serving multiple instances, and one
> > of the instances fails => ocf:heartbeat:oracle will relocate it to another
> > node, and the listener should follow (especially when we use a
> > collocation constraint between the oracle and oralsnr RAs); this will have a bad
> > impact on the rest of the instances.
> > 
> > One of the options is to have two listeners (one per node), configured
> > outside the cluster to host all the instances. But I keep looking for a
> > better solution.
> > 
> > [DM] Hmm, what should then the RA do? Skip the instance and report it 
> > started?
> > I'm not sure I follow.
> > [JM] The DBA uses a Y/N flag to tell whether this instance should run or not. It
> > would be better for the RA to use this flag too: when it's Y, start the
> > instance, and when it's N, the RA should not start the instance; a suitable
> > message in the log would be useful to describe the situation. Now, the
> > challenge is how to monitor this flag.
> > 
> 
> The DBA still needs to remember to change this flag on each node in the
> cluster. In which case they can just as well remember to use a different way
> to disable automatic startup.

Indeed. I wonder what is the difference between editing a file
and running say "crm rsc stop db".

At any rate, I doubt that there is a sane way for the RA to
handle such a case.

> > One of the issues I faced was when the DBA went to shut down the listener
> > and the instance (to launch a cold backup), but the RA kept pushing them
> > back ON. -- Note the dba team usually doesn't have access to pcs to disable the
> > resource during this type of operation.
> > 
> 
> There is nothing new. Once you put an application under HA control, you in
> general cannot use native application tools to manage it. That is why
> SAP introduced "cluster glue" layer that intercepts native requests to
> start/stop application and forwards them to cluster for actual processing.
> 
> The solution here is really to make it possible to delegate control of
> individual resources to different users, so that DBA can
> start/stop/disable/unmanage individual resources (s)he owns.
> 
> If nothing else, it can be done as SUID scripts that implement this check.

Users other than root/hacluster can be given access to cluster
(essentially editing the CIB). Details escape me now, but there
is a concept of roles and then one can define rules on what roles
are allowed to do.

Thanks,

Dejan
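
A very rough sketch of such an ACL in crmsh (the ids are made up, the exact
syntax differs between crmsh/pacemaker versions, and ACLs must be enabled):

crm configure property enable-acl=true
crm configure role dba_role \
    write meta:p_oradb:target-role \
    write meta:p_oralsnr:target-role
crm configure acl_target dba dba_role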

> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Insert delay between the statup of VirtualDomain

2017-03-01 Thread Dejan Muhamedagic
Hi,

On Mon, Feb 27, 2017 at 12:38:07PM +0100, Ferenc Wágner wrote:
> Oscar Segarra  writes:
> 
> > In my environment I have 5 guests that have to be started up in a
> > specified order, starting with the MySQL database server.
> 
> We use a somewhat redesigned resource agent, which connects to the guest
> using a virtio channel and waits for a signal before exiting from the
> start operation.  The signal is sent by an approriately placed startup
> script from the guest.  This is fully independent from regular network
> traffic and does not need any channel configuration.

Cool. Maybe you'd like to share the code or, best, do a pull
request at github. This is certainly very useful.

Thanks,

Dejan
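
A guest-side sketch of the idea (the channel name, the READY token and the
readiness test are assumptions; the actual agent described above is not shown
here):

#!/bin/sh
# run late in the guest's boot; tell the host we are ready once mysql answers
PORT=/dev/virtio-ports/org.example.cluster.ready
until mysqladmin ping >/dev/null 2>&1; do sleep 2; done
echo READY > "$PORT"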

> -- 
> Feri
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Insert delay between the statup of VirtualDomain

2017-03-01 Thread Dejan Muhamedagic
Hi,

On Sat, Feb 25, 2017 at 09:58:01PM +0100, Oscar Segarra wrote:
> Hi,
> 
> Yes,
> 
> The database server can be considered started up when it accepts mysql client
> connections
> The application server can be considered started as soon as the listening port
> is up and accepting connections
> 
> ¿Can you provide any example about how to achieve this?

Is it possible to connect to the database from the supervisor?
Then something like this would do:

mysql -h vm_ip_address ... < /dev/null

If not, then if ssh works:

echo mysql ... | ssh vm_ip_address

I'm afraid I cannot help you more with mysql details and what to
put in '...' stead above, but it should do whatever is necessary
to test if the database reached the functional state. You can
find an example in ocf:heartbeat:mysql: just look for the
"test_table" parameter. Of course, you'll need to put that in a
script and test output and so on. I guess that there's enough
information in internet on how to do that.

Good luck!

Dejan

> Thanks a lot.
> 
> 
> 2017-02-25 19:35 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>:
> 
> > Hi,
> >
> > On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote:
> > > Hi,
> > >
> > > In my environment I have 5 guests that have to be started up in a
> > > specified order, starting with the MySQL database server.
> > >
> > > I have set the order constraints and VirtualDomains start in the right
> > > order but, the problem I have, is that the second host starts up faster
> > > than the database server and therefore applications running on the second
> > > host raise errors due to database connectivity problems.
> > >
> > > I'd like to introduce a delay between the startup of the VirtualDomain of
> > > the database server and the startup of the second guest.
> >
> > Do you have a way to check if this server is up? If so...
> > The start action of VirtualDomain won't exit until the monitor
> > action returns success. And there's a parameter called
> > monitor_scripts (see the meta-data). Note that these programs
> > (scripts) are run at the supervisor host and not in the guest.
> > It's all a bit involved, but should be doable.
> >
> > Thanks,
> >
> > Dejan
> >
> > > ¿Is it any way to get this?
> > >
> > > Thanks a lot.
> >
> > > ___
> > > Users mailing list: Users@clusterlabs.org
> > > http://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >

> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Oralsnr/Oracle resources agents

2017-02-25 Thread Dejan Muhamedagic
Hi,

On Fri, Feb 24, 2017 at 10:09:28AM +, Jihed M'selmi wrote:
> Hi,
> 
> Using one instance per service leads to memory issues, especially when we
> have many instances per node.

Complain to oracle? ;->

I thought that oracle listener is not consuming that many
resources. At any rate, ocf:heartbeat:oralsnr doesn't support
single listener for multiple instances. Do you have an idea how
to do that? How to deal with the tnsping then? Maybe you're
better off with the system start script in this case.

> Regarding the 2nd point, I think it's better to read the Y/N flag in the
> oratab.
> I see some misbehavior: when a db has N in oratab (the instance
> shouldn't run), the cluster tries to bring it UP. Any thoughts?

Hmm, what should then the RA do? Skip the instance and report it
started? I'm not sure I follow.

Thanks,

Dejan

> Cheers,
> 
> On Thu, Feb 23, 2017, 5:01 PM emmanuel segura  wrote:
> 
> > I think no, in /usr/lib/ocf/resource.d/heartbeat/oralsnr
> >
> > start function, oralsnr_start = "output=`echo lsnrctl start $listener
> > | runasdba`"
> > stop function, oralsnr_stop = "output=`echo lsnrctl stop $listener |
> > runasdba`"
> >
> > Where listener variable is the resource agent parameter given by
> > pacemaker : #   OCF_RESKEY_listener (optional; defaults to LISTENER)
> >
> > Why not use one listener per instance?
> >
> > 2017-02-23 16:37 GMT+01:00 Jihed M'selmi :
> > > I was reading the oralsnr script, and I found that to stop a listener the
> > > agent uses lsnrctl to stop the instances.
> > >
> > > My question: how to configure this agent for an oracle listener attached
> > > to multiple instances?
> > >
> > > My 2nd question: is it possible to enhance ora-common.sh and
> > > resource.d/oracle to take into account the y/n flag in the oratab in order
> > > to start the database or not?
> > >
> > > Cheers,
> > >
> > > --
> > >
> > >
> > > Jihed MSELMI
> > > RHCE, RHCSA, VCP4
> > > 10 Villa Stendhal, 75020 Paris France
> > > Mobile: +33 (0) 753768653
> > >
> > > ___
> > > Users mailing list: Users@clusterlabs.org
> > > http://lists.clusterlabs.org/mailman/listinfo/users
> > >
> > > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://bugs.clusterlabs.org
> > >
> >
> >
> >
> > --
> >   .~.
> >   /V\
> >  //  \\
> > /(   )\
> > ^`~'^
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> -- 
> 
> 
> Jihed MSELMI
> RHCE, RHCSA, VCP4
> 10 Villa Stendhal, 75020 Paris France
> Mobile: +33 (0) 753768653

> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How to configure my ocf_heartbeat_iscsi resource(s) such that I have both paths to the LUN?

2017-02-25 Thread Dejan Muhamedagic
Hi,

On Thu, Feb 23, 2017 at 02:29:46PM -0500, Scott Greenlese wrote:
> 
> Refreshing this post in case anyone might have missed it...   still trying
> to figure out some way to have multiple iscsi paths managed
> by a single ocf_heartbeat_iscsi resource.  Any ideas?
> 
> Thanks!
> 
> Scott Greenlese ... KVM on System Z - Solutions Test, IBM Poughkeepsie,
> N.Y.
>   INTERNET:  swgre...@us.ibm.com
> 
> 
> 
> 
> From: Scott Greenlese/Poughkeepsie/IBM@IBMUS
> To:   users@clusterlabs.org
> Cc:   Si Bo Niu , Michael
> Tebolt/Poughkeepsie/IBM@IBMUS
> Date: 02/15/2017 12:18 PM
> Subject:  [ClusterLabs] How to configure my ocf_heartbeat_iscsi resource
> (s) such that I have both paths to the LUN?
> 
> 
> 
> Hi folks,
> 
> I'm running some test scenarios with an ocf_heartbeat_iscsi pacemaker
> resource,
> using the following XIV multipath'ed configuration:
> 
> I created a single XIV iscsi host definition containing all the pacemaker
> host (cluster node) 'Initiator's:
> 
> XIV 7812475>>host_list_ports host=pacemaker_iscsi
> Host Type Port Name
> pacemaker_iscsi iSCSI iqn.2005-03.org.open-iscsi:6539c3daf095
> pacemaker_iscsi iSCSI iqn.2005-03.org.open-iscsi:11d639c0976c
> pacemaker_iscsi iSCSI iqn.1994-05.com.redhat:74ca24d6476
> pacemaker_iscsi iSCSI iqn.1994-05.com.redhat:ea17bebd09a
> pacemaker_iscsi iSCSI iqn.1994-05.com.redhat:b852a67852c
> 
> Here is the XIV Target IP:port / IQN info:
> 
> 10.20.92.108:3260
> Target: iqn.2005-10.com.xivstorage:012475
> 
> 10.20.92.109:3260
> Target: iqn.2005-10.com.xivstorage:012475
> 
> I mapped a single 17Gb Lun to the XIV host: 20017380030BB110F
> 
> 
> From ocf_heartbeat_iscsi man page, it didn't seem obvious to me how I might
> specify multiple
> IP Portals (i.e. both the .108 and .109 IPs) to a single
> ocf_heartbeat_iscsi resource. So,
> I went ahead and configured my resource using just the .108 IP path, as
> follows:
> 
> Using the above XIV definitions, I created a single iscsi pacemaker
> resource using only one of the two IP
> paths to the XIV LUN:
> 
> [root@zs95kj VD]# pcs resource show iscsi_r4
> 
> Resource: iscsi_r4 (class=ocf provider=heartbeat type=iscsi)
> Attributes: portal=10.20.92.108:3260
> target=iqn.2005-10.com.xivstorage:012475
> Operations: start interval=0s timeout=120 (iscsi_r4-start-interval-0s)
> stop interval=0s timeout=120 (iscsi_r4-stop-interval-0s)
> monitor interval=120 timeout=30 (iscsi_r4-monitor-interval-120)
> 
> I'm looking for suggestions as to how I should configure my iscsi resource
> (s) such that I have both paths (.108 and .109)
> to the LUN available to my application? Do I need to create a second iscsi
> resource for the .109 path, and colocate
> them so that they move about the cluster together?

Yes, you may do that. Then you'll get two devices. For that you
should then use dm-multipath.

> As an aside, I have run into situations where the second IP (.109) comes
> online to where my iscsi_r4 resource is running (how or why,
> I'm not sure), which introduces problems because iscsi_r4 only manages
> the .108 connection.

What does "comes online to where..." exactly mean?

Thanks,

Dejan
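
With the pcs tooling used above, a sketch of the second path could be (the
resource name is made up; the portal and target values are copied from the
thread):

pcs resource create iscsi_r4_b ocf:heartbeat:iscsi \
    portal=10.20.92.109:3260 target=iqn.2005-10.com.xivstorage:012475 \
    op monitor interval=120 timeout=30
pcs constraint colocation add iscsi_r4_b with iscsi_r4 INFINITY

The two sessions then appear as separate SCSI devices, which dm-multipath can
coalesce into a single multipathed device.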

> 
> Thanks in advance...
> 
> Scott Greenlese ... KVM on System Z - Solutions Test, IBM Poughkeepsie,
> N.Y.
> INTERNET: swgre...@us.ibm.com
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 
> 



> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Insert delay between the statup of VirtualDomain

2017-02-25 Thread Dejan Muhamedagic
Hi,

On Thu, Feb 23, 2017 at 08:51:20PM +0100, Oscar Segarra wrote:
> Hi,
> 
> In my environment I have 5 guests that have to be started up in a
> specified order, starting with the MySQL database server.
> 
> I have set the order constraints and the VirtualDomains start in the right
> order, but the problem I have is that the second host starts up faster
> than the database server, and therefore applications running on the second
> host raise errors due to database connectivity problems.
> 
> I'd like to introduce a delay between the startup of the VirtualDomain of
> the database server and the startup of the second guest.

Do you have a way to check if this server is up? If so...
The start action of VirtualDomain won't exit until the monitor
action returns success. And there's a parameter called
monitor_scripts (see the meta-data). Note that these programs
(scripts) are run at the supervisor host and not in the guest.
It's all a bit involved, but should be doable.

Thanks,

Dejan
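
A sketch of such a check and how it could be wired in (guest address,
credentials and file paths are assumptions):

#!/bin/sh
# saved e.g. as /usr/local/bin/check_mysql_up on the hypervisor; exits 0 when ready
mysql -h 192.168.122.10 -u monitor -psecret -e 'SELECT 1' >/dev/null 2>&1

# and referenced from the VirtualDomain resource (crmsh syntax):
primitive vd_mysql ocf:heartbeat:VirtualDomain \
    params config=/etc/libvirt/qemu/mysql.xml \
           monitor_scripts=/usr/local/bin/check_mysql_up \
    op start timeout=300s \
    op monitor interval=30s timeout=60s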

> Is there any way to get this?
> 
> Thanks a lot.

> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] simple active/active router using pacemaker+corosync

2017-01-27 Thread Dejan Muhamedagic
Hi,

On Fri, Jan 27, 2017 at 10:52:25AM +0100, Arturo Borrero Gonzalez wrote:
> On 26 January 2017 at 22:31, Valentin Vidic  wrote:
> > On Thu, Jan 26, 2017 at 09:31:23PM +0100, Valentin Vidic wrote:
> >> Guess you could create a Dummy resource and make INFINITY collocation
> >> constraints for the IPs so they follow Dummy as it moves between the
> >> nodes :)
> >
> > In fact using resource sets this becomes one rule:
> >
> >   colocation ip6-leader6 inf: ( ip6a ip6b ip6c ip6d ) leader6
> >
> > But I also had to set:
> >
> >   property node-action-limit=30
> >
> > or the number of running resource operations was limited to
> > 2 x number of CPUs.
> >
> 
> Great, thanks, I couldn't find this in the docs. Any chance this gets
> included somewhere?

You can find documentation for such attributes in the pengine's
meta-data or, sometimes, crmd's. For instance:

crm ra info pengine
crm ra info crmd

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] no final newline from "crm configure show"

2016-11-23 Thread Dejan Muhamedagic
Hi,

On Wed, Nov 09, 2016 at 02:49:02PM +0100, Ulrich Windl wrote:
> Hi!
> 
> I just found a minor problem when redirecting the output of
> "crm configure show" to a file for archiving purposes: If you

Better use 'configure save' for archive/backup. 'configure show'
is really meant for the terminal.

> use diff to compare two files, diff complains about a missing
> newline (especially after having edited the other file with
> vim). For XML output the final newline is there. Is that an
> omission? Or is it related to the fact that my last entry is an acl_target?

I doubt that it depends on the last entry. But if 'save' doesn't
work, just open an issue at github.

Regards,

Dejan
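
For example (the path is just an illustration):

crm configure save /var/backups/cib-$(date +%F).cli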

> 
> Regards,
> Ulrich
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] How Pacemaker reacts to fast changes of the same parameter in configuration

2016-11-08 Thread Dejan Muhamedagic
On Tue, Nov 08, 2016 at 12:54:10PM +0100, Klaus Wenninger wrote:
> On 11/08/2016 11:40 AM, Kostiantyn Ponomarenko wrote:
> > Hi,
> >
> > I need a way to do a manual fail-back on demand.
> > To be clear, I don't want it to be ON/OFF; I want it to be more like
> > "one shot".
> > So far I have found that the most reasonable way to do it is to set
> > "resource stickiness" to a different value, and then set it back to
> > what it was. 
> > To do that I created a simple script with two lines:
> >
> > crm configure rsc_defaults resource-stickiness=50
> > crm configure rsc_defaults resource-stickiness=150
> >
> > There are no timeouts before setting the original value back.
> > If I call this script, I get what I want - Pacemaker moves resources
> > to their preferred locations, and "resource stickiness" is set back to
> > its original value. 
> >
> > Despite it works, I still have few concerns about this approach.
> > Will I get the same behavior under a big load with delays on systems
> > in the cluster (which is truly possible and a normal case in my environment)?
> > How does Pacemaker treat a fast change of this parameter?
> > I am worried that if "resource stickiness" is set back to its original
> > value too fast, then no fail-back will happen. Is that possible, or
> > shouldn't I worry about it?
> 
> AFAIK the pengine is interrupted when calculating a more complicated transition,
> and a transition that is just being executed is aborted if the input
> from the pengine has changed.
> So I would definitely worry!
> What you could do is to issue 'crm_simulate -Ls' in between and grep for
> an empty transition.
> There might be more elegant ways but that should be safe.

crmsh has an option (-w) to wait for the PE to settle after
committing configuration changes.

Thanks,

Dejan
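
So a safer variant of the two-line script above might be (sketch only):

crm -w configure rsc_defaults resource-stickiness=50
crm configure rsc_defaults resource-stickiness=150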
> 
> > Thank you,
> > Kostia
> >
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Dejan Muhamedagic
On Tue, Sep 20, 2016 at 01:13:23PM +, Auer, Jens wrote:
> Hi,
> 
> >> I've decided to create two answers for the two problems. The cluster
> >> still fails to relocate the resource after unloading the modules even
> >> with resource-agents 3.9.7
> > From the point of view of the resource agent,
> > you configured it to use a non-existing network.
> > Which it considers to be a configuration error,
> > which is treated by pacemaker as
> > "don't try to restart anywhere
> > but let someone else configure it properly, first".
> > Still, I have yet to see what scenario you are trying to test here.
> > To me, this still looks like "scenario evil admin".  If so, I'd not even
> > try, at least not on the pacemaker configuration level.
> It's not an evil admin scenario, as that would not make sense. I am trying to find a way 
> to force a failover condition, e.g. by simulating a network card defect or 
> network outage, without running to the server room every time. 

Better use iptables. Bringing the interface down is not the same
as network card going bad.

Thanks,

Dejan
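
For example, something along these lines on the node under test (the
interface name is an assumption; remember to remove the rules afterwards):

iptables -A INPUT  -i bond0 -j DROP
iptables -A OUTPUT -o bond0 -j DROP
# ... observe the failover, then clean up:
iptables -D INPUT  -i bond0 -j DROP
iptables -D OUTPUT -o bond0 -j DROP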

> > CONFIDENTIALITY NOTICE:
> > Oh please :-/
> > This is a public mailing list.
> Sorry, this is a standard disclaimer I usually remove. We are forced to add 
> this to e-mails, but I think this is fairly common for commercial companies.
> 
> >> Also the netmask and the ip address are wrong. I have configured the
> >> device to 192.168.120.10 with netmask 192.168.120.10. How does IpAddr2
> >> get the wrong configuration? I have no idea.
> >A netmask of "192.168.120.10" is nonsense.
> >That is the address, not a mask.
> Oops, my fault when writing the e-mail. Obviously this is the address. The 
> configured netmask for the device is 255.255.255.0, but after IPaddr2 brings 
> it up again it is 255.255.255.255, which is not what I configured in the 
> network configuration. 
> 
> > Also, according to some posts back,
> > you have configured it in pacemaker with
> > cidr_netmask=32, which is not particularly useful either.
> Thanks for pointing this out. I copied the parameters from the 
> manual/tutorial, but did not think about the values.
> 
> > Again: the IPaddr2 resource agent is supposed to control the assignment
> > of an IP address, hence the name.
> > It is not supposed to create or destroy network interfaces,
> > or configure bonding, or bridges, or anything like that.
> > In fact, it is not even supposed to bring up or down the interfaces,
> > even though for "convenience" it seems to do "ip link set up".
> This is what made me wonder in the beginning. When I bring down the device, 
> this leads to a failure of the resource agent, which is exactly what I 
> expected. I did not expect it to bring the device up again, and definitely 
> not ignoring the default network configuration.
> 
> > Monitoring connectivity, or dealing with removed interface drivers,
> > or unplugged devices, or whatnot, has to be dealt with elsewhere.
> I am using a ping daemon for that. 
> 
> > What you did is: down the bond, remove all slave assignments, even
> > remove the driver, and expect the resource agent to "heal" things that
> > it does not know about. It can not.
> I am not expecting the RA to heal anything. How could it? And why would I 
> expect it? In fact I am expecting the opposite, that is, a consistent failure 
> when the device is down. This may also be wrong because you can assign ip 
> addresses to downed devices.
> 
> My initial expectation was that the resource cannot be started when the 
> device is down and is then relocated. I think this is more or less the core 
> functionality of the cluster. I can see a reason why it does not switch to 
> another node when there is a configuration error in the cluster, because it is 
> fair to assume that the configuration is identical (wrong) on all nodes. But 
> what happens if the network device is broken? Would the server start, fail to 
> assign the ip address and then prevent the whole cluster from working? What 
> happens if the network card breaks while the cluster is running? 
> 
> Best wishes,
>   Jens
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] group resources without order behavior / monitor timeout smaller than interval?

2016-09-20 Thread Dejan Muhamedagic
Hi,

On Wed, Sep 14, 2016 at 02:41:10PM -0500, Ken Gaillot wrote:
> On 09/14/2016 03:01 AM, Stefan Bauer wrote:
> > Hi,
> > 
> > I'm trying to understand some cluster internals and would be happy to
> > get some best practice recommendations:
> > 
> > monitor interval and timeout: shouldn't the timeout value always be smaller
> > than the interval, to avoid starting another check even though the first is not over yet?
> 
> The cluster handles it intelligently. If the previous monitor is still
> in progress when the interval expires, it won't run another one.

The lrmd which got replaced would schedule the next monitor
operation only once the current monitor operation finished, hence
the timeout value was essentially irrelevant. Is that still the
case with the new lrmd?

> It certainly makes sense that the timeout will generally be smaller than
> the interval, but there may be cases where a monitor on rare occasions
> takes a long time, and the user wants the high timeout for those
> occasions, but a shorter interval that will be used most of the time.

Just to add that there's a tendency to make monitor intervals
quite short, often without taking a good look at the nature of
the resource.

Thanks,

Dejan

> > Additionally I would like to use the group function to put all my VMs
> > (ocf:heartbeat:VirtualDomain) in one group and colocate the group with
> > the VIP and my LVM volume. Unfortunately the group function starts the
> > resources in the listed order, so if I stop one VM, the following VMs
> > are also stopped.
> > 
> > Right now I have the following configuration and want to make it
> > less redundant:
> 
> You can use one ordering constraint and one colocation constraint, each
> with two resource sets, one containing the IP and volume with
> sequential=true, and the other containing the VMs with sequential=false.
> See:
> 
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html-single/Pacemaker_Explained/index.html#s-resource-sets
> 
> 
> > 
> > # never let the stonith_service run on the host to stonith
> > 
> > location l_st_srv20 st_ipmi_srv20 -inf: srv20
> > location l_st_srv21 st_ipmi_srv21 -inf: srv21
> > 
> > 
> > # do not run resources on quorum only node
> > location loc_r_lvm_vg-storage_quorum_only_node r_lvm_vg-storage -inf:
> > quorum_only_node
> > location loc_r_vm_ado01_quorum_only_node r_vm_ado01 -inf: quorum_only_node
> > location loc_r_vm_bar01_quorum_only_node r_vm_bar01 -inf: quorum_only_node
> > location loc_r_vm_cmt01_quorum_only_node r_vm_cmt01 -inf: quorum_only_node
> > location loc_r_vm_con01_quorum_only_node r_vm_con01 -inf: quorum_only_node
> > location loc_r_vm_con02_quorum_only_node r_vm_con02 -inf: quorum_only_node
> > location loc_r_vm_dsm01_quorum_only_node r_vm_dsm01 -inf: quorum_only_node
> > location loc_r_vm_jir01_quorum_only_node r_vm_jir01 -inf: quorum_only_node
> > location loc_r_vm_jir02_quorum_only_node r_vm_jir02 -inf: quorum_only_node
> > location loc_r_vm_prx02_quorum_only_node r_vm_prx02 -inf: quorum_only_node
> > location loc_r_vm_src01_quorum_only_node r_vm_src01 -inf: quorum_only_node
> > 
> > 
> > # colocate ip with lvm storage
> > colocation col_r_Failover_IP_r_lvm_vg-storage inf: r_Failover_IP
> > r_lvm_vg-storage
> > 
> > 
> > # colocate each VM with lvm storage
> > colocation col_r_vm_ado01_r_lvm_vg-storage inf: r_vm_ado01 r_lvm_vg-storage
> > colocation col_r_vm_bar01_r_lvm_vg-storage inf: r_vm_bar01 r_lvm_vg-storage
> > colocation col_r_vm_cmt01_r_lvm_vg-storage inf: r_vm_cmt01 r_lvm_vg-storage
> > colocation col_r_vm_con01_r_lvm_vg-storage inf: r_vm_jir01 r_lvm_vg-storage
> > colocation col_r_vm_con02_r_lvm_vg-storage inf: r_vm_con02 r_lvm_vg-storage
> > colocation col_r_vm_dsm01_r_lvm_vg-storage inf: r_vm_dsm01 r_lvm_vg-storage
> > colocation col_r_vm_jir01_r_lvm_vg-storage inf: r_vm_con01 r_lvm_vg-storage
> > colocation col_r_vm_jir02_r_lvm_vg-storage inf: r_vm_jir02 r_lvm_vg-storage
> > colocation col_r_vm_prx02_r_lvm_vg-storage inf: r_vm_prx02 r_lvm_vg-storage
> > colocation col_r_vm_src01_r_lvm_vg-storage inf: r_vm_src01 r_lvm_vg-storage
> > 
> > # start lvm storage before VIP
> > 
> > order ord_r_lvm_vg-storage_r_Failover_IP inf: r_lvm_vg-storage r_Failover_IP
> > 
> > 
> > # start lvm storage before each VM
> > order ord_r_lvm_vg-storage_r_vm_ado01 inf: r_lvm_vg-storage r_vm_ado01
> > order ord_r_lvm_vg-storage_r_vm_bar01 inf: r_lvm_vg-storage r_vm_bar01
> > order ord_r_lvm_vg-storage_r_vm_cmt01 inf: r_lvm_vg-storage r_vm_cmt01
> > order ord_r_lvm_vg-storage_r_vm_con01 inf: r_lvm_vg-storage r_vm_con01
> > order ord_r_lvm_vg-storage_r_vm_con02 inf: r_lvm_vg-storage r_vm_con02
> > order ord_r_lvm_vg-storage_r_vm_dsm01 inf: r_lvm_vg-storage r_vm_dsm01
> > order ord_r_lvm_vg-storage_r_vm_jir01 inf: r_lvm_vg-storage r_vm_jir01
> > order ord_r_lvm_vg-storage_r_vm_jir02 inf: r_lvm_vg-storage r_vm_jir02
> > order ord_r_lvm_vg-storage_r_vm_prx02 inf: r_lvm_vg-storage r_vm_prx02
> > order ord_r_lvm_vg-storage_r_vm_src01 inf: r_lvm_vg-storage r_vm_src01
> > 
> > 
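
As a rough sketch of the resource-set idea above, reusing the resource
names from the posted configuration (only three VMs shown, and the exact
placement of the sets in the colocation should be double-checked against
Pacemaker Explained):

  order ord_storage_then_vms inf: r_lvm_vg-storage r_Failover_IP \
      ( r_vm_ado01 r_vm_bar01 r_vm_cmt01 )
  colocation col_vms_with_storage inf: ( r_vm_ado01 r_vm_bar01 r_vm_cmt01 ) \
      r_lvm_vg-storage r_Failover_IP

If I recall the crmsh syntax correctly, the plain listing forms a
sequential=true set and the parenthesized group a sequential=false set, so
the VMs no longer depend on each other.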

Re: [ClusterLabs] ocf scripts shell and local variables

2016-09-01 Thread Dejan Muhamedagic
On Wed, Aug 31, 2016 at 10:39:22AM -0500, Dmitri Maziuk wrote:
> On 2016-08-31 03:59, Dejan Muhamedagic wrote:
> >On Tue, Aug 30, 2016 at 12:32:36PM -0500, Dimitri Maziuk wrote:
> 
> >>I expect you're being deliberately obtuse.
> >
> >Not sure why you think that
> 
> Because the point I was trying to make was that having shebang line say
> #!/opt/swf/bin/bash
> does not guarantee the script will actually be interpreted by
> /opt/swf/bin/bash. For example
> 
> >When a file is sourced, the "#!" line has no special meaning
> >(apart from documenting purposes).
> 
> (sic) Or when
> 
> >I haven't read the code either, but it must be some of the
> >exec(2) system calls.
> 
> it's execl("/bin/sh", "/bin/sh", "/script/file") instead of
> execl("/script/file", ...) directly.
> 
> (As an aside, I suspect the feature where exec(2) will run the loader which
> will read the magic and load an appropriate binfmt* kernel module, may well
> also be portable between "most" systems, just like "local" is portable to
> "most" shells. I don't think posix specifies anything more than "executable
> image" and that on a strictly posix-compliant system execl( "/my/script.sh",
> ... ) will fail. I am so old that I have a vague recollection it *had to be*
> execl("/bin/sh", "/bin/sh", "/script/file") back when I learned it. But this
> is going even further OT.)
> 
> My point, again, was that solutions involving shebang lines are great as
> long as you can guarantee those shebang lines are being used on all
> supported platforms at all times.

There is no other way to tell the UNIX/Linux system which
interpreter to use. Or am I missing something?

> Sourcing findif.sh from IPAddr2 is proof
> by counter-example that they aren't and you can't.

findif.sh or any other file which is to be sourced therefore
must be compatible with the greatest common denominator
(shellwise).

Thanks,

Dejan

> Dima
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-31 Thread Dejan Muhamedagic
On Tue, Aug 30, 2016 at 06:53:24PM +0200, Lars Ellenberg wrote:
> On Tue, Aug 30, 2016 at 06:15:49PM +0200, Dejan Muhamedagic wrote:
> > On Tue, Aug 30, 2016 at 10:08:00AM -0500, Dmitri Maziuk wrote:
> > > On 2016-08-30 03:44, Dejan Muhamedagic wrote:
> > > 
> > > >The kernel reads the shebang line and it is what defines the
> > > >interpreter which is to be invoked to run the script.
> > > 
> > > Yes, and does the kernel read when the script is source'd or executed via
> > > any of the mechanisms that have the executable specified in the call,
> > > explicitly or implicitly?
> > 
> > I suppose that it is explained in enough detail here:
> > 
> > https://en.wikipedia.org/wiki/Shebang_(Unix)
> > 
> > In particular:
> > 
> > https://en.wikipedia.org/wiki/Shebang_(Unix)#Magic_number
> > 
> > > >None of the /bin/sh RAs requires bash.
> > > 
> > > Yeah, only "local".
> > 
> > As already mentioned elsewhere in the thread, local is supported
> > in most shell implementations and without it we otherwise
> > wouldn't be able to maintain the software. Not sure where local
> > originates, but I wouldn't bet that it's bash.
> 
> Let's just agree that as currently implemented,
> our collection of /bin/sh scripts won't run on ksh as shipped with
> solaris (while there likely are ksh derivatives in *BSD somewhere
> that would be mostly fine with them).

I can recall people running some of the BSD systems and they
didn't complain about resource-agents.

> And before this turns even more into a "yes, I'm that old, too" thread,

I guess that is in human nature.

> may I suggest to document that we expect a
> "dash compatible" /bin/sh, and that we expect scripts
> to have a bash shebang (or as appropriate) if they go beyond that.

That is how it has been in resource-agents: POSIX compatible with
the addition of "local". Anyway, every script must have a #! line
which states the right interpreter.

> Then check for incompatible shells in ocf-shellfuncs,
> and just exit early if we detect incompatibilities.
> 
> For a suggestion on checking for a proper "local" see below.
> (Add more checks later, if someone feels like it.)
> 
> Though, if someone rewrites not the current agents, but the "lib/ocf*"
> help stuff to be sourced by shell based agents in a way that would
> support RAs in all bash, dash, ash, ksh, whatever,
> and the result turns out not too much worse than what we have now,
> I'd have no problem with that...
> 
> Cheers,
> 
> Lars
> 
> 
> And for the "typeset" crowd,
> if you think s/local/typeset/ was all that was necessary
> to support function local variables in ksh, think again:
> 
> ksh -c '
>   function a {
>   echo "start of a: x=$x"
>   typeset x=a
>   echo "before b: x=$x"
>   b
>   echo "end of a: x=$x"
>   }
>   function b {
>   echo "start of b: x=$x ### HAHA guess this one was unexpected 
> to all but ksh users"
>   typeset x=b
>   echo "end of b: x=$x"
>   }
>   x=x
>   echo "before a: x=$x"
>   a
>   echo "after a: x=$x"
> '
> 
> Try the same with bash.

:)

> Also remember that sometimes we set a "local" variable in a function
> and expect it to be visible in nested functions, but also set a new
> value in a nested function and expect that value to be reflected
> in the outer scope (up to the last "local").

I hope that this wasn't (ab)used much, it doesn't sound like it
would be easy to follow.
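
For the record, a tiny illustration of that behaviour (nothing from the
tree, just how "local" works in bash and dash):

  outer() {
      local x=outer
      inner                        # inner sees x=outer ...
      echo "back in outer: x=$x"   # ... and we now see x=inner here
  }
  inner() {
      echo "in inner: x=$x"
      x=inner                      # assignment lands in outer's local x
  }
  x=global
  outer
  echo "at top level: x=$x"        # the global x is untouched

In other words "local" gives dynamic rather than lexical scope, which is
what makes the pattern work and also what makes it hard to follow.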

> diff --git a/heartbeat/ocf-shellfuncs.in b/heartbeat/ocf-shellfuncs.in
> index 6d9669d..4151630 100644
> --- a/heartbeat/ocf-shellfuncs.in
> +++ b/heartbeat/ocf-shellfuncs.in
> @@ -920,3 +920,37 @@ ocf_is_true "$OCF_TRACE_RA" && ocf_start_trace
>  if ocf_is_true "$HA_use_logd"; then
>   : ${HA_LOGD:=yes}
>  fi
> +
> +# We use a lot of function local variables with the "local" keyword.
> +# Which works fine with dash and bash,
> +# but does not work with e.g. ksh.
> +# Fail cleanly with a sane error message,
> +# if the current shell does not feel compatible.
> +
> +__ocf_check_for_incompatible_shell_l2()
> +{
> + [ $__ocf_check_for_incompatible_shell_k = v1 ] || return 1
> + local __ocf_check_for_incompatible_shell_k=v2
> + [ $__ocf_check_for_incompatible_shell_k = v2 ] || return 1
> + return 0
> +}
> +
> +__ocf_check_fo

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-31 Thread Dejan Muhamedagic
On Tue, Aug 30, 2016 at 12:32:36PM -0500, Dimitri Maziuk wrote:
> On 08/30/2016 11:15 AM, Dejan Muhamedagic wrote:
> 
> > I suppose that it is explained in enough detail here:
> > 
> > https://en.wikipedia.org/wiki/Shebang_(Unix)
> 
> I expect you're being deliberately obtuse.

Not sure why you think that when I offer a perfectly good
document on how the "#!" line is interpreted.

> It does not explain which program loader interprets line 1 of findif.sh:
> "#!/bin/sh" when it is invoked from line 69 of IPAddr2 RA:
> 
> . ${OCF_FUNCTIONS_DIR}/findif.sh

When a file is sourced, the "#!" line has no special meaning
(apart from documenting purposes).
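
A contrived way to see the difference (assuming a ksh binary is installed;
this has nothing to do with any RA): put

  #!/bin/ksh
  ps -p $$ -o comm=

into an executable file. Run as ./file, the kernel honours the "#!" line
and ps reports ksh; sourced with ". ./file" from bash or dash, the "#!"
line is nothing but a comment and ps reports the sourcing shell.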

> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/IPaddr2
> https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/findif.sh
> 
> Similarly, I have not read the code so I don't know who invokes IPArrd2
> and how exactly they do it. If you tell me it makes the kernel look at
> the magic number and spawn whatever shell's specified there, I believe you.

I haven't read the code either, but it must be some of the
exec(2) system calls.

> > As already mentioned elsewhere in the thread, local is supported
> > in most shell implementations and without it we otherwise
> > wouldn't be able to maintain the software. Not sure where local
> > originates, but I wouldn't bet that it's bash.
> 
> Well 2 out of 3 is "most", can't argue with that.

There are certainly more than 3.

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Dejan Muhamedagic
On Tue, Aug 30, 2016 at 10:08:00AM -0500, Dmitri Maziuk wrote:
> On 2016-08-30 03:44, Dejan Muhamedagic wrote:
> 
> >The kernel reads the shebang line and it is what defines the
> >interpreter which is to be invoked to run the script.
> 
> Yes, and does the kernel read when the script is source'd or executed via
> any of the mechanisms that have the executable specified in the call,
> explicitly or implicitly?

I suppose that it is explained in enough detail here:

https://en.wikipedia.org/wiki/Shebang_(Unix)

In particular:

https://en.wikipedia.org/wiki/Shebang_(Unix)#Magic_number

> >None of the /bin/sh RAs requires bash.
> 
> Yeah, only "local".

As already mentioned elsewhere in the thread, local is supported
in most shell implementations and without it we otherwise
wouldn't be able to maintain the software. Not sure where local
originates, but I wouldn't bet that it's bash.

Thanks,

Dejan

> Dima
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Dejan Muhamedagic
Hi,

On Mon, Aug 29, 2016 at 05:08:35PM +0200, Gabriele Bulfon wrote:
> Sure, in fact I can change all shebangs to point to /bin/bash and it's ok.
> The question is about the current /bin/sh shebang, which may run into trouble
> (as if one would point to a generic python but use many specific features of a
> particular version of python).
> Also, the question is about whether bash, being much heavier, is a good option
> for RAs.

I'd really suggest installing a smaller shell such as /bin/dash
and using that as /bin/sh. Isn't there a Bourne shell in Solaris?
If you modify the RAs it could cause trouble on subsequent updates.

Thanks,

Dejan

> Gabriele
> 
> Sonicle S.r.l.
> :
> http://www.sonicle.com
> Music:
> http://www.gabrielebulfon.com
> Quantum Mechanics :
> http://www.cdbaby.com/cd/gabrielebulfon
> ------
> Da: Dejan Muhamedagic
> A: kgail...@redhat.com Cluster Labs - All topics related to open-source 
> clustering welcomed
> Data: 29 agosto 2016 16.43.52 CEST
> Oggetto: Re: [ClusterLabs] ocf scripts shell and local variables
> Hi,
> On Mon, Aug 29, 2016 at 08:47:43AM -0500, Ken Gaillot wrote:
> On 08/29/2016 04:17 AM, Gabriele Bulfon wrote:
> Hi Ken,
> I have been talking with the illumos guys about the shell problem.
> They all agreed that ksh (and specially the ksh93 used in illumos) is
> absolutely Bourne-compatible, and that the "local" variables used in the
> ocf shells is not a Bourne syntax, but probably a bash specific.
> This means that pointing the scripts to "#!/bin/sh" is portable as long
> as the scripts are really Bourne-shell only syntax, as any Unix variant
> may link whatever Bourne-shell they like.
> In this case, it should point to "#!/bin/bash" or whatever shell the
> script was written for.
> Also, in this case, the starting point is not the ocf-* script, but the
> original RA (IPaddr, but almost all of them).
> What about making the code base of RA and ocf-* portable?
> It may be just by changing them to point to bash, or with some kind of
> configure modifier to be able to specify the shell to use.
> Meanwhile, changing the scripts by hands into #!/bin/bash worked like a
> charm, and I will start patching.
> Gabriele
> Interesting, I thought local was posix, but it's not. It seems everyone
> but solaris implemented it:
> http://stackoverflow.com/questions/18597697/posix-compliant-way-to-scope-variables-to-a-function-in-a-shell-script
> Please open an issue at:
> https://github.com/ClusterLabs/resource-agents/issues
> The simplest solution would be to require #!/bin/bash for all RAs that
> use local,
> This issue was raised many times, but note that /bin/bash is a
> shell not famous for being lean: it's great for interactive use,
> but not so great if you need to run a number of scripts. The
> complexity in bash, which is superfluous for our use case,
> doesn't go well with the basic principles of HA clusters.
> but I'm not sure that's fair to the distros that support
> local in a non-bash default shell. Another possibility would be to
> modify all RAs to avoid local entirely, by using unique variable
> prefixes per function.
> I doubt that we could write moderately complex shell scripts
> without the capability of limiting the variables' scope while retaining
> sanity at the same time.
> Or, it may be possible to guard every instance of
> local with a check for ksh, which would use typeset instead. Raising the
> issue will allow some discussion of the possibilities.
> Just to mention that this is the first time someone reported
> running a shell which doesn't support local. Perhaps there's an
> option that they install a shell which does.
> Thanks,
> Dejan
> 
> *Sonicle S.r.l. *: http://www.sonicle.com
> *Music: *http://www.gabrielebulfon.com
> *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
> --
> Da: Ken Gaillot
> A: gbul...@sonicle.com Cluster Labs - All topics related to open-source
> clustering welcomed
> Data: 26 agosto 2016 15.56.02 CEST
> Oggetto: Re: ocf scripts shell and local variables
> On 08/26/2016 08:11 AM, Gabriele Bulfon wrote:
> I tried adding some debug in ocf-shellfuncs, showing env and ps
> -ef into
> the corosync.log
> I suspect it's always using ksh, because in the env output I
> produced I
> find this: KSH_VERSION=.sh.version
> This is normally not present in the environment, unless ksh is running
> the shell.
> The RAs typically start with #!/bin

Re: [ClusterLabs] systemd RA start/stop delays

2016-08-30 Thread Dejan Muhamedagic
Hi,

On Thu, Aug 18, 2016 at 09:00:24AM -0500, Ken Gaillot wrote:
> On 08/17/2016 08:17 PM, TEG AMJG wrote:
> > Hi
> > 
> > I am having a problem with a simple Active/Passive cluster which
> > consists in the next configuration
> > 
> > Cluster Name: kamcluster
> > Corosync Nodes:
> >  kam1vs3 kam2vs3
> > Pacemaker Nodes:
> >  kam1vs3 kam2vs3
> > 
> > Resources:
> >  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: ip=10.0.1.206 cidr_netmask=32
> >   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
> >   stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
> >   monitor interval=10s (ClusterIP-monitor-interval-10s)
> >  Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
> >   Attributes: ip=10.0.1.207 cidr_netmask=32
> >   Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
> >   stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
> >   monitor interval=10s (ClusterIP2-monitor-interval-10s)
> >  Resource: rtpproxycluster (class=systemd type=rtpproxy)
> >   Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
> >   stop interval=0s on-fail=block
> > (rtpproxycluster-stop-interval-0s)
> >  Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
> >   Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
> >   Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
> >   monitor interval=10s on-fail=fence
> > (kamailioetcfs-monitor-interval-10s)
> >   stop interval=0s on-fail=fence
> > (kamailioetcfs-stop-interval-0s)
> >  Clone: fence_kam2_xvm-clone
> >   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> >   Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
> >Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
> >Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
> >  Master: kamailioetcclone
> >   Meta Attrs: master-max=1 master-node-max=1 clone-max=2
> > clone-node-max=1 notify=true on-fail=fence
> >   Resource: kamailioetc (class=ocf provider=linbit type=drbd)
> >Attributes: drbd_resource=kamailioetc
> >Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
> >promote interval=0s on-fail=fence
> > (kamailioetc-promote-interval-0s)
> >demote interval=0s on-fail=fence
> > (kamailioetc-demote-interval-0s)
> >stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
> >monitor interval=10s (kamailioetc-monitor-interval-10s)
> >  Clone: fence_kam1_xvm-clone
> >   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
> >   Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
> >Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
> >Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
> >  Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
> >   Attributes: listen_address=10.0.1.206 conffile=/etc/kamailio/kamailio.cfg
> >     pidfile=/var/run/kamailio.pid monitoring_ip=10.0.1.206
> >     monitoring_ip2=10.0.1.207 port=5060 proto=udp
> >     kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
> >   Meta Attrs: target-role=Stopped
> >   Operations: start interval=0s timeout=60
> > (kamailiocluster-start-interval-0s)
> >   stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
> >   monitor interval=5s (kamailiocluster-monitor-interval-5s)
> > 
> > Stonith Devices:
> > Fencing Levels:
> > 
> > Location Constraints:
> > Ordering Constraints:
> >   start fence_kam1_xvm-clone then start fence_kam2_xvm-clone
> > (kind:Mandatory) (id:order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
> >   start fence_kam2_xvm-clone then promote kamailioetcclone
> > (kind:Mandatory) (id:order-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
> >   promote kamailioetcclone then start kamailioetcfs (kind:Optional)
> > (id:order-kamailioetcclone-kamailioetcfs-Optional)
> >   Resource Sets:
> > set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs) set
> > ClusterIP  
> > ClusterIP2 sequential=false
> > (id:pcs_rsc_set_ClusterIP_ClusterIP2) set 

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Dejan Muhamedagic
Hi,

On Mon, Aug 29, 2016 at 10:13:18AM -0500, Dmitri Maziuk wrote:
> On 2016-08-29 04:06, Gabriele Bulfon wrote:
> >Thanks, though this does not work :)
> 
> Uhm... right. Too many languages, sorry: perl's system() will call the login
> shell, system system() uses /bin/sh, and exec()s will run whatever the
> programmer tells them to. The point is none of them cares what shell's in
> shebang line AFAIK.

The kernel reads the shebang line and it is what defines the
interpreter which is to be invoked to run the script.

> But anyway, you're correct; a lot of linux "shell" scripts are bash-only and
> pacemaker RAs are no exception.

None of the /bin/sh RAs requires bash.

Thanks,

Dejan

> 
> Dima
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: ocf scripts shell and local variables

2016-08-30 Thread Dejan Muhamedagic
On Tue, Aug 30, 2016 at 08:09:34AM +0200, Ulrich Windl wrote:
> >>> Dejan Muhamedagic <deja...@fastmail.fm> schrieb am 29.08.2016 um 16:37 in
> Nachricht <20160829143700.GA1538@tuttle.homenet>:
> > Hi,
> > 
> > On Mon, Aug 29, 2016 at 02:58:11PM +0200, Gabriele Bulfon wrote:
> >> I think the main issue is the usage of the "local" operator in ocf*
> >> I'm not an expert on this operator (never used!), don't know how hard it 
> >> is 
> > to replace it with a standard version.
> > 
> > Unfortunately, there's no command defined in POSIX which serves
> > the purpose of local, i.e. setting variables' scope. "local" is,
> 
> Isn't it "typeset"?

I don't think that /bin/dash supports typeset. Anyway, supporting
typeset, which covers much more than limiting the scope, would
also invite people to use it for that other stuff. If it's there,
someone's sure to use it ;-)

Thanks,

Dejan

> 
> > however, supported in almost all shells (including most versions
> > of ksh, but apparently not the one you run) and hence we
> > tolerated that in /bin/sh resource agents.
> > 
> > Thanks,
> > 
> > Dejan
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-30 Thread Dejan Muhamedagic
Hi,

On Tue, Aug 30, 2016 at 09:32:54AM +0200, Kristoffer Grönlund wrote:
> Jehan-Guillaume de Rorthais <j...@dalibo.com> writes:
> 
> > On Mon, 29 Aug 2016 10:02:28 -0500
> > Ken Gaillot <kgail...@redhat.com> wrote:
> >
> >> On 08/29/2016 09:43 AM, Dejan Muhamedagic wrote:
> > ...
> >>> I doubt that we could write moderately complex shell scripts
> >>> without the capability of limiting the variables' scope while retaining
> >>> sanity at the same time.
> >> 
> >> This prefixing approach would definitely be ugly, and it would violate
> >> best practices on shells that do support local, but it should be feasible.
> >> 
> >> I'd argue that anything moderately complex should be converted to python
> >> (maybe after we have OCF 2.0, and some good python bindings ...).
> >
> > For what it's worth, I already raised this discussion some months ago as we
> > wrote some Perl modules equivalent to ocf-shellfuncs, ocf-returncodes and
> > ocf-directories. See:
> >
> >   Subject: [ClusterLabs Developers] Perl Modules for resource agents (was:
> > Resource Agent language discussion) 
> >   Date: Thu, 26 Nov 2015 01:13:36 +0100
> >
> > I don't want to start a flamewar about languages here, this is not about that.
> > Maybe it would be a good time to include various libraries for different
> > languages in the official source? At least for ocf-directories, which is quite
> > simple but often tied to the configure options in various distros. We had to
> > make an ugly wrapper around the ocf-directories library at build time to
> > produce our OCF_Directories.pm module on various distros.
> >
> 
> I don't know Perl so I can't be very helpful reviewing it, but please do
> submit your Perl OCF library to resource-agents if you have one! I don't
> think anyone is opposed to including a Perl (or Python) API on
> principle, it's just a matter of someone actually sitting down to do the
> work putting it together.

Sure, I doubt that anybody would oppose that. Looking at the
amount of the existing supporting code, it is not going to be a
small feat.

***

This is off-topic and I really do hope not to start a discussion
about it so I'll keep it short: I often heard shell programming
being bashed up to the point that it should be banned. While it
is true that there are numerous scripts executed in a way leaving
much to be desired, the opposite is certainly possible too. Shell
does fit well a certain type of task and resource agents
(essentially more robust init scripts) belong to that realm.

Thanks,

Dejan


> Cheers,
> Kristoffer
> 
> -- 
> // Kristoffer Grönlund
> // kgronl...@suse.com
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Dejan Muhamedagic
Hi,

On Mon, Aug 29, 2016 at 08:47:43AM -0500, Ken Gaillot wrote:
> On 08/29/2016 04:17 AM, Gabriele Bulfon wrote:
> > Hi Ken,
> > 
> > I have been talking with the illumos guys about the shell problem.
> > They all agreed that ksh (and specially the ksh93 used in illumos) is
> > absolutely Bourne-compatible, and that the "local" variables used in the
> > ocf shells is not a Bourne syntax, but probably a bash specific.
> > This means that pointing the scripts to "#!/bin/sh" is portable as long
> > as the scripts are really Bourne-shell only syntax, as any Unix variant
> > may link whatever Bourne-shell they like.
> > In this case, it should point to "#!/bin/bash" or whatever shell the
> > script was written for.
> > Also, in this case, the starting point is not the ocf-* script, but the
> > original RA (IPaddr, but almost all of them).
> > 
> > What about making the code base of RA and ocf-* portable?
> > It may be just by changing them to point to bash, or with some kind of
> > configure modifier to be able to specify the shell to use.
> > 
> > Meanwhile, changing the scripts by hands into #!/bin/bash worked like a
> > charm, and I will start patching.
> > 
> > Gabriele
> 
> Interesting, I thought local was posix, but it's not. It seems everyone
> but solaris implemented it:
> 
> http://stackoverflow.com/questions/18597697/posix-compliant-way-to-scope-variables-to-a-function-in-a-shell-script
> 
> Please open an issue at:
> 
> https://github.com/ClusterLabs/resource-agents/issues
> 
> The simplest solution would be to require #!/bin/bash for all RAs that
> use local,

This issue was raised many times, but note that /bin/bash is a
shell not famous for being lean: it's great for interactive use,
but not so great if you need to run a number of scripts. The
complexity in bash, which is superfluous for our use case,
doesn't go well with the basic principles of HA clusters.

> but I'm not sure that's fair to the distros that support
> local in a non-bash default shell. Another possibility would be to
> modify all RAs to avoid local entirely, by using unique variable
> prefixes per function.

I doubt that we could write moderately complex shell scripts
without the capability of limiting the variables' scope while retaining
sanity at the same time.
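
Just to show what the prefix approach would look like, a hypothetical
helper (not taken from any agent):

  # with "local", as the agents are written today
  get_pid() {
      local pidfile="$1" pid
      pid=$(cat "$pidfile" 2>/dev/null) || return 1
      echo "$pid"
  }

  # without "local", every variable prefixed with the function name
  get_pid() {
      get_pid_pidfile="$1"
      get_pid_pid=$(cat "$get_pid_pidfile" 2>/dev/null) || return 1
      echo "$get_pid_pid"
  }

It would run in any POSIX shell, but the variables are still global, the
code gets noisier, and one forgotten prefix silently clobbers state.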

> Or, it may be possible to guard every instance of
> local with a check for ksh, which would use typeset instead. Raising the
> issue will allow some discussion of the possibilities.

Just to mention that this is the first time someone reported
running a shell which doesn't support local. Perhaps there's an
option that they install a shell which does.

Thanks,

Dejan

> > 
> > 
> > *Sonicle S.r.l. *: http://www.sonicle.com 
> > *Music: *http://www.gabrielebulfon.com 
> > *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
> > 
> > 
> > 
> > --
> > 
> > Da: Ken Gaillot 
> > A: gbul...@sonicle.com Cluster Labs - All topics related to open-source
> > clustering welcomed 
> > Data: 26 agosto 2016 15.56.02 CEST
> > Oggetto: Re: ocf scripts shell and local variables
> > 
> > On 08/26/2016 08:11 AM, Gabriele Bulfon wrote:
> > > I tried adding some debug in ocf-shellfuncs, showing env and ps
> > -ef into
> > > the corosync.log
> > > I suspect it's always using ksh, because in the env output I
> > produced I
> > > find this: KSH_VERSION=.sh.version
> > > This is normally not present in the environment, unless ksh is running
> > > the shell.
> > 
> > The RAs typically start with #!/bin/sh, so whatever that points to on
> > your system is what will be used.
> > 
> > > I also tried modifiying all ocf shells with "#!/usr/bin/bash" at the
> > > beginning, no way, same output.
> > 
> > You'd have to change the RA that includes them.
> > 
> > > Any idea how can I change the used shell to support "local" variables?
> > 
> > You can either edit the #!/bin/sh line at the top of each RA, or figure
> > out how to point /bin/sh to a Bourne-compatible shell. ksh isn't
> > Bourne-compatible, so I'd expect lots of #!/bin/sh scripts to fail with
> > it as the default shell.
> > 
> > > Gabriele
> > >
> > >
> > 
> > 
> > > *Sonicle S.r.l. *: http://www.sonicle.com 
> > > *Music: *http://www.gabrielebulfon.com
> > 
> > > *Quantum Mechanics : *http://www.cdbaby.com/cd/gabrielebulfon
> > >
> > >
> > 
> > >
> > >
> > > *Da:* Gabriele Bulfon 
> > > *A:* 

Re: [ClusterLabs] ocf scripts shell and local variables

2016-08-29 Thread Dejan Muhamedagic
Hi,

On Mon, Aug 29, 2016 at 02:58:11PM +0200, Gabriele Bulfon wrote:
> I think the main issue is the usage of the "local" operator in ocf*
> I'm not an expert on this operator (never used!), don't know how hard it is 
> to replace it with a standard version.

Unfortunately, there's no command defined in POSIX which serves
the purpose of local, i.e. setting variables' scope. "local" is,
however, supported in almost all shells (including most versions
of ksh, but apparently not the one you run) and hence we
tolerated that in /bin/sh resource agents.

Thanks,

Dejan

> Happy to contribute, if it's still the case
> Gabriele
> 
> Sonicle S.r.l.
> :
> http://www.sonicle.com
> Music:
> http://www.gabrielebulfon.com
> Quantum Mechanics :
> http://www.cdbaby.com/cd/gabrielebulfon
> --
> Da: Kristoffer Gr?nlund
> A: gbul...@sonicle.com Cluster Labs - All topics related to open-source 
> clustering welcomed
> kgail...@redhat.com Cluster Labs - All topics related to open-source 
> clustering welcomed
> Data: 29 agosto 2016 14.36.23 CEST
> Oggetto: Re: [ClusterLabs] ocf scripts shell and local variables
> Gabriele Bulfon
> writes:
> Hi Ken,
> I have been talking with the illumos guys about the shell problem.
> They all agreed that ksh (and specially the ksh93 used in illumos) is 
> absolutely Bourne-compatible, and that the "local" variables used in the ocf 
> shells is not a Bourne syntax, but probably a bash specific.
> This means that pointing the scripts to "#!/bin/sh" is portable as long as 
> the scripts are really Bourne-shell only syntax, as any Unix variant may link 
> whatever Bourne-shell they like.
> In this case, it should point to "#!/bin/bash" or whatever shell the script 
> was written for.
> Also, in this case, the starting point is not the ocf-* script, but the 
> original RA (IPaddr, but almost all of them).
> What about making the code base of RA and ocf-* portable?
> It may be just by changing them to point to bash, or with some kind of 
> configure modifier to be able to specify the shell to use.
> Meanwhile, changing the scripts by hands into #!/bin/bash worked like a 
> charm, and I will start patching.
> Gabriele
> Hi Gabriele,
> Yes, your observation is correct: The resource scripts are not fully
> POSIX compatible in this respect. We have been fixing these issues as
> they come up, but since we all use bash-like shells it has never become
> a pressing issue (IIRC Debian did have some of the same issues since
> /bin/sh there is dash, which is also not fully bash-compatible).
> It would be fantastic if you could file issues or submit patches at
> https://github.com/ClusterLabs/resource-agents for the resource agents
> where you still find these problems.
> Cheers,
> Kristoffer
> --
> // Kristoffer Gr?nlund
> // kgronl...@suse.com

> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [ClusterLab] : Corosync not initializing successfully

2016-05-03 Thread Dejan Muhamedagic
Hi,

On Mon, May 02, 2016 at 08:54:09AM +0200, Jan Friesse wrote:
> >As your hardware is probably capable of running ppcle, and if you have an
> >environment at hand, it might pay off to try that without too much effort.
> >There are of course distributions out there that support corosync on
> >big-endian architectures, but I don't know if there is an automated
> >regression test for corosync on big-endian that would catch big-endian
> >issues right away with something as current as your 2.3.5.
> 
> No we are not testing big-endian.
> 
> So totally agree with Klaus. Give a try to ppcle. Also make sure all
> nodes are little-endian. Corosync should work in mixed BE/LE
> environment but because it's not tested, it may not work (and it's a
> bug, so if ppcle works I will try to fix BE).

I tested a cluster consisting of big endian/little endian nodes
(s390 and x86-64), but that was a while ago. IIRC, all relevant
bugs in corosync got fixed at that time. Don't know what is the
situation with the latest version.

Thanks,

Dejan

> Regards,
>   Honza
> 
> >
> >Regards,
> >Klaus
> >
> >On 05/02/2016 06:44 AM, Nikhil Utane wrote:
> >>Re-sending as I don't see my post on the thread.
> >>
> >>On Sun, May 1, 2016 at 4:21 PM, Nikhil Utane
> >>> wrote:
> >>
> >> Hi,
> >>
> >> Looking for some guidance here as we are completely blocked
> >> otherwise :(.
> >>
> >> -Regards
> >> Nikhil
> >>
> >> On Fri, Apr 29, 2016 at 6:11 PM, Sriram  >> > wrote:
> >>
> >> Corrected the subject.
> >>
> >> We went ahead and captured corosync debug logs for our ppc board.
> >> After log analysis and comparison with the successful logs
> >> (from the x86 machine),
> >> we didn't find *"[ MAIN  ] Completed service synchronization,
> >> ready to provide service.*" in ppc logs.
> >> So, looks like corosync is not in a position to accept
> >> connection from Pacemaker.
> >> Even I tried with the new corosync.conf with no success.
> >>
> >> Any hints on this issue would be really helpful.
> >>
> >> Attaching ppc_notworking.log, x86_working.log, corosync.conf.
> >>
> >> Regards,
> >> Sriram
> >>
> >>
> >>
> >> On Fri, Apr 29, 2016 at 2:44 PM, Sriram  >> > wrote:
> >>
> >> Hi,
> >>
> >> I went ahead and made some changes in file system(Like I
> >> brought in /etc/init.d/corosync and /etc/init.d/pacemaker,
> >> /etc/sysconfig ), After that I was able to run  "pcs
> >> cluster start".
> >> But it failed with the following error
> >>  # pcs cluster start
> >> Starting Cluster...
> >> Starting Pacemaker Cluster Manager[FAILED]
> >> Error: unable to start pacemaker
> >>
> >> And in the /var/log/pacemaker.log, I saw these errors
> >> pacemakerd: info: mcp_read_config:  cmap connection
> >> setup failed: CS_ERR_TRY_AGAIN.  Retrying in 4s
> >> Apr 29 08:53:47 [15863] node_cu pacemakerd: info:
> >> mcp_read_config:  cmap connection setup failed:
> >> CS_ERR_TRY_AGAIN.  Retrying in 5s
> >> Apr 29 08:53:52 [15863] node_cu pacemakerd:  warning:
> >> mcp_read_config:  Could not connect to Cluster
> >> Configuration Database API, error 6
> >> Apr 29 08:53:52 [15863] node_cu pacemakerd:   notice:
> >> main: Could not obtain corosync config data, exiting
> >> Apr 29 08:53:52 [15863] node_cu pacemakerd: info:
> >> crm_xml_cleanup:  Cleaning up memory from libxml2
> >>
> >>
> >> And in the /var/log/Debuglog, I saw these errors coming
> >> from corosync
> >> 20160429 085347.487050  airv_cu
> >> daemon.warn corosync[12857]:   [QB] Denied connection,
> >> is not ready (12857-15863-14)
> >> 20160429 085347.487067  airv_cu
> >> daemon.info  corosync[12857]:   [QB
> >> ] Denied connection, is not ready (12857-15863-14)
> >>
> >>
> >> I browsed the code of libqb to find that it is failing in
> >>
> >> 
> >> https://github.com/ClusterLabs/libqb/blob/master/lib/ipc_setup.c
> >>
> >> Line 600 :
> >> handle_new_connection function
> >>
> >> Line 637:
> >> if (auth_result == 0 &&
> >> c->service->serv_fns.connection_accept) {
> >> res = c->service->serv_fns.connection_accept(c,
> >>  c->euid, c->egid);
> >> }
> >> if (res != 0) {
> >> goto send_response;
> >> 

Re: [ClusterLabs] Monitoring action of Pacemaker resources fail because of high load on the nodes

2016-04-27 Thread Dejan Muhamedagic
Hi Dmitri,

On Tue, Apr 26, 2016 at 10:20:45AM -0500, Dmitri Maziuk wrote:
> On 2016-04-26 00:58, Klaus Wenninger wrote:
> 
> >But what you are attempting doesn't sound entirely proprietary.
> >So once you have something that looks like it might be useful
> >for others as well let the community participate and free yourself
> >from having to always take care of your private copy ;-)
> 
> Presumably you could try a pull request but last time I failed to
> convince Andrew that wget'ing http://localhost/server-status/ is a
> wrong thing to do in the first place (apache RA).

I'm not sure why it would be wrong, but neither can I vouch that
there's no better way to do a basic apache functionality test. At
any rate, the test URL can be defined using a parameter.
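
For reference, that is the statusurl parameter, e.g. (hypothetical values
below; the RA also has testregex if matching the page contents is wanted):

  primitive web ocf:heartbeat:apache \
      params configfile=/etc/apache2/apache2.conf \
             statusurl="http://localhost/server-status" \
      op monitor interval=20s timeout=40s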

> So your pull
> request may never get merged.

True. That should really depend only on the quality of the
contribution. Sometimes other obstacles, which may not always
look justified from the outside, can get in the way. Though it
may not always look like that, maintainers are only humans too ;-)

Thanks,

Dejan

> Which I suppose is better than my mon scripts: those are
> private-copy-only with no place in the heartbeat packages to try and
> share them.
> 
> Dimitri
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] booth release v1.0

2016-03-20 Thread Dejan Muhamedagic
Hello everybody,

I'm happy to announce that the booth repository was yesterday
tagged as v1.0:

https://github.com/ClusterLabs/booth/releases/tag/v1.0

There were very few patches since the v1.0 rc1. The complete
list of changes is available in the ChangeLog:

https://github.com/ClusterLabs/booth/blob/v1.0/ChangeLog

The binaries are provided for some Linux distributions. Currently,
there are packages for CentOS7 and various openSUSE versions:

http://download.opensuse.org/repositories/network:/ha-clustering:/Stable/

If you don't know what booth is and what is it good for, please
check the README at the bottom of the git repository home page:

https://github.com/ClusterLabs/booth

Cheers,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Removing node from pacemaker.

2016-03-06 Thread Dejan Muhamedagic
Hi,

On Fri, Mar 04, 2016 at 01:07:34PM +0300, Andrei Maruha wrote:
> I have tried it on my cluster, "crm node delete" just removes the node
> from the cib without updating corosync.conf.

Ah, I didn't know that this is about udpu where nodes are listed
in corosync.conf.

> After restart of pacemaker service you will get something like this:
> Online: [ node1 ]
> OFFLINE: [ node2 ]
> 
> 
> BTW, you will get the same state after "pacemaker restart", if you
> remove a node from corosync.conf and do not call "crm corosync
> reload".

Right, obviously one needs to tell corosync that the
configuration file changed.
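
So, recapping the steps already given in this thread, a complete removal
with udpu seems to amount to:

  crm corosync del-node <nodename>    # take it out of corosync.conf
  crm_node -R <nodename> --force      # take it out of the CIB
  crm corosync reload                 # make corosync re-read its configuration

(or, as Andrei noted, edit corosync.conf by hand and use corosync-cfgtool -R
for the corosync part).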

Thanks,

Dejan

> On 03/04/2016 12:07 PM, Dejan Muhamedagic wrote:
> >Hi,
> >
> >On Thu, Mar 03, 2016 at 03:20:56PM +0300, Andrei Maruha wrote:
> >>Hi,
> >>Usually I use the following steps to delete node from the cluster:
> >>1. #crm corosync del-node 
> >>2. #crm_node -R node --force
> >>3. #crm corosync reload
> >I'd expect all this to be wrapped in "crm node delete". Isn't
> >that the case?
> >
> >Also, is "corosync reload" really required after node removal?
> >
> >Thanks,
> >
> >Dejan
> >
> >>Instead of steps 1 and 2 you can delete a certain node from the
> >>corosync config manually and run:
> >>#corosync-cfgtool -R
> >>
> >>On 03/03/2016 02:44 PM, Somanath Jeeva wrote:
> >>>Hi,
> >>>
> >>>I am trying to remove a node from the pacemaker’/corosync cluster,
> >>>using the command “crm_node -R dl360x4061 –force”.
> >>>
> >>>Though this command removes the node from the cluster, it is
> >>>appearing as offline after pacemaker/corosync restart in the nodes
> >>>that are online.
> >>>
> >>>Is there any other command to completely delete the node from the
> >>>pacemaker/corosync cluster.
> >>>
> >>>Pacemaker and Corosync Versions.
> >>>
> >>>PACEMAKER=1.1.10
> >>>
> >>>COROSYNC=1.4.1
> >>>
> >>>Regards
> >>>
> >>>Somanath Thilak J
> >>>
> >>>
> >>>
> >>>___
> >>>Users mailing list: Users@clusterlabs.org
> >>>http://clusterlabs.org/mailman/listinfo/users
> >>>
> >>>Project Home: http://www.clusterlabs.org
> >>>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>Bugs: http://bugs.clusterlabs.org
> >>___
> >>Users mailing list: Users@clusterlabs.org
> >>http://clusterlabs.org/mailman/listinfo/users
> >>
> >>Project Home: http://www.clusterlabs.org
> >>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>Bugs: http://bugs.clusterlabs.org
> >
> >___
> >Users mailing list: Users@clusterlabs.org
> >http://clusterlabs.org/mailman/listinfo/users
> >
> >Project Home: http://www.clusterlabs.org
> >Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >Bugs: http://bugs.clusterlabs.org
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Removing node from pacemaker.

2016-03-04 Thread Dejan Muhamedagic
Hi,

On Thu, Mar 03, 2016 at 03:20:56PM +0300, Andrei Maruha wrote:
> Hi,
> Usually I use the following steps to delete node from the cluster:
> 1. #crm corosync del-node 
> 2. #crm_node -R node --force
> 3. #crm corosync reload

I'd expect all this to be wrapped in "crm node delete". Isn't
that the case?

Also, is "corosync reload" really required after node removal?

Thanks,

Dejan

> Instead of steps 1 and 2 you can delete a certain node from the
> corosync config manually and run:
> #corosync-cfgtool -R
> 
> On 03/03/2016 02:44 PM, Somanath Jeeva wrote:
> >
> >Hi,
> >
> >I am trying to remove a node from the pacemaker’/corosync cluster,
> >using the command “crm_node -R dl360x4061 –force”.
> >
> >Though this command removes the node from the cluster, it is
> >appearing as offline after pacemaker/corosync restart in the nodes
> >that are online.
> >
> >Is there any other command to completely delete the node from the
> >pacemaker/corosync cluster.
> >
> >Pacemaker and Corosync Versions.
> >
> >PACEMAKER=1.1.10
> >
> >COROSYNC=1.4.1
> >
> >Regards
> >
> >Somanath Thilak J
> >
> >
> >
> >___
> >Users mailing list: Users@clusterlabs.org
> >http://clusterlabs.org/mailman/listinfo/users
> >
> >Project Home: http://www.clusterlabs.org
> >Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >Bugs: http://bugs.clusterlabs.org
> 

> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] fence agent and using it with pacemaker

2016-02-11 Thread Dejan Muhamedagic
Hi,

On Thu, Feb 11, 2016 at 03:56:15PM +0100, Stanislav Kopp wrote:
> HI again,
> 
> I think I've sorted it out, correct me if I'm wrong.
> 
> - "fence-agents" are RHCS compatible fence agents (for using with
> RGmanager originally?) located in /usr/sbin/fence_*,

Yes.

> - "cluster-glue" provides agents compatible with pacemaker, they re
> located in "/usr/lib/stonith/plugins/"

Right. But today pacemaker supports both. I don't recall anymore
with which release support for fencing agents was introduced. The
support is compile-time and RHEL includes support for only
fencing agents, whereas most other distributions include support
for both.

> "fence_pve" [1] agent looks like RHCS agent, so it doesn't work
> out-of-box with pacemaker, but like Dejan hints, it's possible to use
> rhcs "adapter" to use RH fencing agent in pacemaker. Unfortunally I
> don't have
> 
> /usr/lib{,64}/stonith/plugins/rhcs directory in Debian (only rhcs.so
> lib), can't find it via "yum provides" in Centos 7 either.

You won't find it in Centos or RHEL.

> Is there some documentation how to use RHCS fence agent with pacemaker
> esp. with crmsh examples?

Well, no, but it should be easy. You just create that directory
and then create a link to your fencing agent:

# mkdir /usr/lib64/stonith/plugins/rhcs   # or /usr/lib/..., depending on the distro
# cd /usr/lib64/stonith/plugins/rhcs
# ln -s /usr/sbin/fence_pve
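
After that, a stonith resource referencing the agent should be possible
along these lines (an untested sketch: parameter values are placeholders,
and whether the agent is then referenced as rhcs/pve or rhcs/fence_pve may
depend on the name of the link):

  primitive st_pve stonith:rhcs/fence_pve \
      params ipaddr=pve-host login=root passwd=secret \
      op monitor interval=3600s

plus the usual location constraint to keep it away from the node it is
supposed to fence.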

Thanks,

Dejan

> 
> Best,
> Stan
> 
> [1] 
> https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/pve/fence_pve.py
> 
> 2016-02-11 12:07 GMT+01:00 Stanislav Kopp <stask...@gmail.com>:
> > HI Dejan,
> >
> > 2016-02-10 20:28 GMT+01:00 Dejan Muhamedagic <deja...@fastmail.fm>:
> >> Hi,
> >>
> >> On Wed, Feb 10, 2016 at 03:20:49PM +0100, Stanislav Kopp wrote:
> >>> Hi all,
> >>>
> >>> I have general, clarification question about how fence agents work
> >>> with pacemaker (crmsh in particular). As far I understood STDIN
> >>
> >> Not sure what crmsh has to do with fencing agents. The
> >> interaction happens between other pacemaker components (normally
> >> stonithd) and the agent.
> >
> > I've mentioned crm, to make sure it's not some stupid syntax error by
> > me with stonith definition.
> >
> >>> arguments can be used within pacemaker resources and command line
> >>> arguments in terminal (for testing and scripting?).
> >>
> >> As for testing, crmsh has support for resource testing with
> >> 'configure rsctest', but apparently only Linux HA stonith agents
> >> (legacy*) are currently supported.
> >>
> >> Otherwise, there's stonith_admin(8) and stonith(8) for stonith
> >> agents. The former uses stonith-ng (stonithd) which makes it
> >> closer to the real cluster operation.
> >>
> >> There is also a (not so well known) stonith interface to fencing
> >> agents, which predates RH fencing agents support in pacemaker.
> >> One just needs to create a link in /usr/lib64/stonith/plugins/rhcs:
> >
> > That sounds interesting, because the agent (fence_pve) comes with
> > "fence-agents" package, and not with "cluster-glue" (in Debian), Do
> > you know, what is difference between agents in /usr/sbin/ (like
> > "/usr/sbin/fence_ilo") and in "/usr/lib/stonith/plugins/external/"?
> >
> > Thanks,
> > Stan
> >
> >> lrwxrwxrwx 1 root root 32 Nov 18 07:56 fence_cisco_ucs -> 
> >> ../../../../sbin/fence_cisco_ucs
> >>
> >> For instance, this agent is referenced as stonith:rhcs/cisco_ucs.
> >>
> >> Finally, it would be great to somehow put together the two
> >> fencing/stonith agents sets.
> >>
> >> Thanks,
> >>
> >> Dejan
> >>
> >> *) Why the Linux HA stonith agents got the attribute/name
> >> "legacy" in Pacemaker at some point is beyond me.
> >>
> >>> I have "fence_pve" [1] agent which works fine with command line
> >>> arguments, but not with pacemaker, it says some parameters like
> >>> "passwd" or "login" does not exist, although STDIN parameters are
> >>> supported [2]
> >>>
> >>> I'm using  stock 1.1.7-1 on Debian Wheezy
> >>>
> >>> Best,
> >>> Stan
> >>>
> >>>
> >>> [1] 
> >>> https://github.com/ClusterLabs/fence-agents/blob/master/fence/agents/pve/fence_pve.py
> >>> [2] https://www.mankier.com/8/f

Re: [ClusterLabs] crmsh configure delete for constraints

2016-02-10 Thread Dejan Muhamedagic
On Wed, Feb 10, 2016 at 07:39:27AM +0300, Vladislav Bogdanov wrote:
[...]
> >> Particularly, imho RAs should not run validate_all on stop
> >> action.
> >
> >I'd disagree here. If the environment is no good (bad
> >installation, missing configuration and similar), then the stop
> >operation probably won't do much good. Ultimately, it may depend
> >on how the resource is managed. In ocf-rarun, validate_all is
> >run, but then the operation is not carried out if the environment
> >is invalid. In particular, the resource is considered to be
> >stopped, and the stop operation exits with success. One of the
> >most common cases is when the software resides on shared
> >non-parallel storage.
> 
> Well, I'd reword. Generally, RA should not exit with error if validation 
> fails on stop.
> Is that better?

Much better! :) Not on probes either.
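
Put differently, for an agent that doesn't use ocf-rarun the stop action
would be shaped roughly like this (an illustration with a made-up "foo"
resource, not code from any shipped agent):

  foo_stop() {
      if ! foo_validate_all; then
          ocf_log info "environment not valid, assuming foo is already stopped"
          return $OCF_SUCCESS
      fi
      # ... the normal stop procedure for foo goes here ...
      return $OCF_SUCCESS
  }

i.e. an invalid environment is logged, but never turned into a stop
failure (which would typically end in fencing).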

Cheers,

Dejan

> >
> >BTW, handling the stop and monitor/probe operations was the
> >primary motivation to develop ocf-rarun. It's often quite
> >difficult to get these things right.
> >
> >Cheers,
> >
> >Dejan
> >
> >
> >> Best,
> >> Vladislav
> >> 
> >> 
> >> ___
> >> Users mailing list: Users@clusterlabs.org
> >> http://clusterlabs.org/mailman/listinfo/users
> >> 
> >> Project Home: http://www.clusterlabs.org
> >> Getting started:
> >http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >> Bugs: http://bugs.clusterlabs.org
> >
> >___
> >Users mailing list: Users@clusterlabs.org
> >http://clusterlabs.org/mailman/listinfo/users
> >
> >Project Home: http://www.clusterlabs.org
> >Getting started:
> >http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >Bugs: http://bugs.clusterlabs.org
> 
> 

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] crmsh configure delete for constraints

2016-02-10 Thread Dejan Muhamedagic
On Wed, Feb 10, 2016 at 12:06:34PM +0100, Ferenc Wágner wrote:
> Dejan Muhamedagic <deja...@fastmail.fm> writes:
> 
> > If the environment is no good (bad installation, missing configuration
> > and similar), then the stop operation probably won't do much good.
> 
> Agreed.  It may not even know how to probe it.
> 
> > In ocf-rarun, validate_all is run, but then the operation is not
> > carried out if the environment is invalid. In particular, the resource
> > is considered to be stopped, and the stop operation exits with
> > success.
> 
> This sounds dangerous.  What if the local configuration of a node gets
> damaged while a resource is running on it?

I understand your worry, but cannot imagine how that could
happen, unless in the case of a more serious failure such as a disk
crash, and such a failure should really cause fencing at another
level.

The most common case, by far, is some mistake or omission during
cluster setup. Humans tend to make mistakes. As Vladislav wrote
elsewhere in this thread, this can cause a fencing loop, which is
no fun, in particular if pacemaker is set to start on boot. It
happened to me a few times and I guess I don't need to describe
the intensity of my feelings toward computers in general and the
cluster stack in particular (not to mention the RA author).

> Eventually the cluster may
> try to stop it, think that it succeeded and start the resource on
> another node.  Now you have two instances running.  Or is the resource
> probed on each node before the start?

No, I don't think so. The probes are run only on crmd start.

> Can a probe failure save your day
> here?  Or do you only mean resource parameters by "environment" (which
> should be identical on each host, so validation would fail everywhere)?

The validation typically checks the configuration and then
whether various files (programs) and directories exist, sometimes
if directories are writable. There could be more, but at least I
would prefer to stop here.

Anyway, we could introduce something like optional
emergency_stop() which would be invoked in ocf-rarun in case the
validation failed. And/or say a RUN_STOP_ANYWAY variable which
would allow stop to be run regardless. But note that it is
extremely difficult to prove or make sure that executing the RA
_after_ the validate step has failed is going to produce meaningful
results.  In addition, there could also be
FENCE_ON_INVALID_ENVIRONMENT (to be set by the user) for the very
paranoid ;-)
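
To make the idea concrete, the dispatch could look roughly like
this (purely a sketch: emergency_stop, RUN_STOP_ANYWAY,
FENCE_ON_INVALID_ENVIRONMENT and do_stop are the hypothetical names
from this thread, not existing ocf-rarun code):

action=$1
if ! validate_all; then
    case "$action" in
    stop)
        if ocf_is_true "$RUN_STOP_ANYWAY"; then
            do_stop                   # regular stop despite the bad environment
        elif type emergency_stop >/dev/null 2>&1; then
            emergency_stop            # optional RA-provided last-resort cleanup
        elif ocf_is_true "$FENCE_ON_INVALID_ENVIRONMENT"; then
            exit $OCF_ERR_GENERIC     # a failed stop gets the node fenced
        else
            exit $OCF_SUCCESS         # consider the resource stopped
        fi
        ;;
    monitor|status)
        exit $OCF_NOT_RUNNING         # probes just report "not running"
        ;;
    *)
        exit $OCF_ERR_INSTALLED
        ;;
    esac
fi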

Cheers,

Dejan

> -- 
> Thanks,
> Feri.
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] nfsserver_monitor() doesn't detect nfsd process is lost.

2016-02-09 Thread Dejan Muhamedagic
Hi,

On Thu, Jan 28, 2016 at 04:42:55PM +0900, yuta takeshita wrote:
> Hi,
> Sorry for replying late.

No problem.

> 2016-01-15 21:19 GMT+09:00 Dejan Muhamedagic <deja...@fastmail.fm>:
> 
> > Hi,
> >
> > On Fri, Jan 15, 2016 at 04:54:37PM +0900, yuta takeshita wrote:
> > > Hi,
> > >
> > > Thanks for responding and making a patch.
> > >
> > > 2016-01-14 19:16 GMT+09:00 Dejan Muhamedagic <deja...@fastmail.fm>:
> > >
> > > > On Thu, Jan 14, 2016 at 11:04:09AM +0100, Dejan Muhamedagic wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, Jan 14, 2016 at 04:20:19PM +0900, yuta takeshita wrote:
> > > > > > Hello.
> > > > > >
> > > > > > I have been having a problem with the nfsserver RA on RHEL 7.1 and systemd.
> > > > > > When the nfsd process is lost through an unexpected failure,
> > > > nfsserver_monitor()
> > > > > > doesn't detect it and doesn't execute a failover.
> > > > > >
> > > > > > I use the RA below (but this problem may occur with the latest
> > > > nfsserver RA
> > > > > > as well)
> > > > > >
> > > >
> > https://github.com/ClusterLabs/resource-agents/blob/v3.9.6/heartbeat/nfsserver
> > > > > >
> > > > > > The cause is following.
> > > > > >
> > > > > > 1. After executing "pkill -9 nfsd", "systemctl status
> > nfs-server.service"
> > > > > > returns 0.
> > > > >
> > > > > I think that it should be systemctl is-active. Already had a
> > > > > problem with systemctl status, well, not being what one would
> > > > > assume status would be. Can you please test that and then open
> > > > > either a pull request or issue at
> > > > > https://github.com/ClusterLabs/resource-agents
> > > >
> > > > I already made a pull request:
> > > >
> > > > https://github.com/ClusterLabs/resource-agents/pull/741
> > > >
> > > > Please test if you find time.
> > > >
> > > I tested the code, but still problems remain.
> > > systemctl is-active returns "active" and the exit code is 0, just like
> > > systemctl status.
> > > Perhaps it is inappropriate to use systemctl for monitoring the kernel
> > > process.
> >
> > OK. My patch was too naive and didn't take into account the
> > systemd/kernel intricacies.
> >
> > > Kay Sievers, one of the systemd developers, said that systemd doesn't
> > > monitor kernel processes, in the following thread:
> > > http://comments.gmane.org/gmane.comp.sysutils.systemd.devel/34367
> >
> > Thanks for the reference. One interesting thing could also be
> > reading /proc/fs/nfsd/threads instead of checking the process
> > existence. Furthermore, we could do some RPC based monitor, but
> > that would be, I guess, better suited for another monitor depth.
> >
> > OK. I surveyed and tested /proc/fs/nfsd/threads.
> It seems to work well on my cluster.
> I made a patch and opened a pull request.
> https://github.com/ClusterLabs/resource-agents/pull/746
> 
> Please check if you have time.

Some return codes of nfsserver_systemd_monitor() follow OCF and one
apparently follows LSB:

301 nfs_exec is-active
302 rc=$?
...
311 if [ $threads_num -gt 0 ]; then
312 return $OCF_SUCCESS
313 else
314 return 3
315 fi
316 else
317 return $OCF_ERR_GENERIC
...
321 return $rc

Given that nfs_exec() returns LSB codes, it should probably be
something like this:

311 if [ $threads_num -gt 0 ]; then
312 return 0
313 else
314 return 3
315 fi
316 else
317 return 1
...
321 return $rc

It won't make any actual difference, but the intent would be
cleaner (i.e. it's just by accident that the OCF codes are the
same in this case).
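
For illustration, a monitor along those lines could look roughly like
this (a sketch only, not the code in the pull request; it assumes the
thread count is readable from /proc/fs/nfsd/threads and keeps LSB codes
throughout, as discussed above):

nfsserver_systemd_monitor() {
    nfs_exec is-active
    rc=$?
    [ $rc -ne 0 ] && return $rc        # unit not active: pass the LSB code through
    # the unit can be "active (exited)" with all nfsd kernel threads gone,
    # so check the thread count as well
    threads_num=`cat /proc/fs/nfsd/threads 2>/dev/null`
    if [ -n "$threads_num" ] && [ "$threads_num" -gt 0 ]; then
        return 0                       # LSB: running
    else
        return 3                       # LSB: not running
    fi
}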

Cheers,

Dejan

> Regards,
> Yuta
> 
> > Cheers,
> >
> > Dejan
> >
> > > I reply to your pull request.
> > >
> > > Regards,
> > > Yuta Takeshita
> > >
> > > >
> > > > Thanks for reporting!
> > > >
> > > > Dejan
> > > >
> > > > > Thanks,
> > > > >
> > > > > Dejan
> > > > >
> > > > > > 2. nfsserver_monitor() judge with the return value of "systemctl
> > st

Re: [ClusterLabs] nfsserver_monitor() doesn't detect nfsd process is lost.

2016-01-14 Thread Dejan Muhamedagic
On Thu, Jan 14, 2016 at 11:04:09AM +0100, Dejan Muhamedagic wrote:
> Hi,
> 
> On Thu, Jan 14, 2016 at 04:20:19PM +0900, yuta takeshita wrote:
> > Hello.
> > 
> > I have been having a problem with the nfsserver RA on RHEL 7.1 and systemd.
> > When the nfsd process is lost through an unexpected failure, nfsserver_monitor()
> > doesn't detect it and doesn't execute a failover.
> > 
> > I use the RA below (but this problem may occur with the latest nfsserver RA
> > as well):
> > https://github.com/ClusterLabs/resource-agents/blob/v3.9.6/heartbeat/nfsserver
> > 
> > The cause is following.
> > 
> > 1. After executing "pkill -9 nfsd", "systemctl status nfs-server.service"
> > returns 0.
> 
> I think that it should be systemctl is-active. Already had a
> problem with systemctl status, well, not being what one would
> assume status would be. Can you please test that and then open
> either a pull request or issue at
> https://github.com/ClusterLabs/resource-agents

I already made a pull request:

https://github.com/ClusterLabs/resource-agents/pull/741

Please test if you find time.

Thanks for reporting!

Dejan

> Thanks,
> 
> Dejan
> 
> > 2. nfsserver_monitor() judge with the return value of "systemctl status
> > nfs-server.service".
> > 
> > --
> > # ps ax | grep nfsd
> > 25193 ?S< 0:00 [nfsd4]
> > 25194 ?S< 0:00 [nfsd4_callbacks]
> > 25197 ?S  0:00 [nfsd]
> > 25198 ?S  0:00 [nfsd]
> > 25199 ?S  0:00 [nfsd]
> > 25200 ?S  0:00 [nfsd]
> > 25201 ?S  0:00 [nfsd]
> > 25202 ?S  0:00 [nfsd]
> > 25203 ?S  0:00 [nfsd]
> > 25204 ?S  0:00 [nfsd]
> > 25238 pts/0S+ 0:00 grep --color=auto nfsd
> > #
> > # pkill -9 nfsd
> > #
> > # systemctl status nfs-server.service
> > ● nfs-server.service - NFS server and services
> >Loaded: loaded (/etc/systemd/system/nfs-server.service; disabled; vendor
> > preset: disabled)
> >Active: active (exited) since 木 2016-01-14 11:35:39 JST; 1min 3s ago
> >   Process: 25184 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited,
> > status=0/SUCCESS)
> >   Process: 25182 ExecStartPre=/usr/sbin/exportfs -r (code=exited,
> > status=0/SUCCESS)
> >  Main PID: 25184 (code=exited, status=0/SUCCESS)
> >CGroup: /system.slice/nfs-server.service
> > (snip)
> > #
> > # echo $?
> > 0
> > #
> > # ps ax | grep nfsd
> > 25256 pts/0S+ 0:00 grep --color=auto nfsd
> > --
> > 
> > It is because nfsd is a kernel process, and systemd does not
> > monitor the state of running kernel processes.
> > 
> > Is there a good way to handle this?
> > (When I use "pidof" instead of "systemctl status", the failover is
> > successful.)
> > 
> > Regards,
> > Yuta Takeshita
> 
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] nfsserver_monitor() doesn't detect nfsd process is lost.

2016-01-14 Thread Dejan Muhamedagic
Hi,

On Thu, Jan 14, 2016 at 04:20:19PM +0900, yuta takeshita wrote:
> Hello.
> 
> I have been having a problem with the nfsserver RA on RHEL 7.1 and systemd.
> When the nfsd process is lost through an unexpected failure, nfsserver_monitor()
> doesn't detect it and doesn't execute a failover.
> 
> I use the RA below (but this problem may occur with the latest nfsserver RA
> as well):
> https://github.com/ClusterLabs/resource-agents/blob/v3.9.6/heartbeat/nfsserver
> 
> The cause is following.
> 
> 1. After executing "pkill -9 nfsd", "systemctl status nfs-server.service"
> returns 0.

I think that it should be systemctl is-active. Already had a
problem with systemctl status, well, not being what one would
assume status would be. Can you please test that and then open
either a pull request or issue at
https://github.com/ClusterLabs/resource-agents

Thanks,

Dejan

> 2. nfsserver_monitor() judge with the return value of "systemctl status
> nfs-server.service".
> 
> --
> # ps ax | grep nfsd
> 25193 ?S< 0:00 [nfsd4]
> 25194 ?S< 0:00 [nfsd4_callbacks]
> 25197 ?S  0:00 [nfsd]
> 25198 ?S  0:00 [nfsd]
> 25199 ?S  0:00 [nfsd]
> 25200 ?S  0:00 [nfsd]
> 25201 ?S  0:00 [nfsd]
> 25202 ?S  0:00 [nfsd]
> 25203 ?S  0:00 [nfsd]
> 25204 ?S  0:00 [nfsd]
> 25238 pts/0S+ 0:00 grep --color=auto nfsd
> #
> # pkill -9 nfsd
> #
> # systemctl status nfs-server.service
> ● nfs-server.service - NFS server and services
>Loaded: loaded (/etc/systemd/system/nfs-server.service; disabled; vendor
> preset: disabled)
>Active: active (exited) since 木 2016-01-14 11:35:39 JST; 1min 3s ago
>   Process: 25184 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited,
> status=0/SUCCESS)
>   Process: 25182 ExecStartPre=/usr/sbin/exportfs -r (code=exited,
> status=0/SUCCESS)
>  Main PID: 25184 (code=exited, status=0/SUCCESS)
>CGroup: /system.slice/nfs-server.service
> (snip)
> #
> # echo $?
> 0
> #
> # ps ax | grep nfsd
> 25256 pts/0S+ 0:00 grep --color=auto nfsd
> --
> 
> It is because nfsd is a kernel process, and systemd does not
> monitor the state of running kernel processes.
> 
> Is there a good way to handle this?
> (When I use "pidof" instead of "systemctl status", the failover is
> successful.)
> 
> Regards,
> Yuta Takeshita

> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact

2016-01-04 Thread Dejan Muhamedagic
Hi,

On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
> On 04.01.2016 16:36, Ken Gaillot wrote:
> > On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> >> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
[...]
> >> Also note, that lrmd spawns *many* monitors like:
> >> root  6495  0.0  0.0  70268  1456 ?Ss2015   4:56  \_
> >> /usr/lib/pacemaker/lrmd
> >> root 31815  0.0  0.0   4440   780 ?S15:08   0:00  |   \_
> >> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31908  0.0  0.0   4440   388 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31910  0.0  0.0   4440   384 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> root 31915  0.0  0.0   4440   392 ?S15:08   0:00  |
> >>   \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> >> ...
> > 
> > At first glance, that looks like your monitor action is calling itself
> > recursively, but I don't see how in your code.
> 
> Yes, it should be a bug in ocf-shellfuncs' ocf_log().

If you're sure about that, please open an issue at
https://github.com/ClusterLabs/resource-agents/issues

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Notice: SLES11SP4 broke exportfs!

2015-12-21 Thread Dejan Muhamedagic
Hi,

On Sat, Dec 12, 2015 at 10:06:57AM +0300, Andrei Borzenkov wrote:
> 11.12.2015 21:27, Ulrich Windl пишет:
> > Hi!
> > 
> > After updating from SLES11SP3 (june version) to SLES11SP4 (todays version) 
> > exportfs fails to get the export status. I have message like this in syslog:
> > 
> > Dec 11 19:22:09 h04 crmd[11128]:   notice: process_lrm_event: 
> > rksaph04-prm_nfs_c11_mnt_exp_monitor_0:93 [ 
> > /usr/lib/ocf/resource.d/heartbeat/exportfs: line 178: 4f838db1: value too 
> > great for base (error token is "4f838db1")\n ]
> > 
> > Why is such broken code released? Here's the diff:
> > 
> > --- /usr/lib/ocf/resource.d/heartbeat/exportfs  2015-03-11 
> > 07:00:04.0 +0100
> ...
> 
> > @@ -165,18 +171,48 @@
> > !
> >  }
> > 
> > +reset_fsid() {
> > +   CURRENT_FSID=$OCF_RESKEY_fsid
> > +}
> > +bump_fsid() {
> > +   let $((CURRENT_FSID++))
> > +}
> 
> Here is where error comes from.
> 
> > +get_fsid() {
> > +   echo $CURRENT_FSID
> > +}
> > +
> > +# run a function on all directories
> > +forall() {
> > +   local func=$1
> > +   shift 1
> > +   local fast_exit=""
> > +   local dir rc=0
> > +   if [ "$2" = fast_exit ]; then
> > +   fast_exit=1
> > +   shift 1
> > +   fi
> > +   reset_fsid
> > +   for dir in $OCF_RESKEY_directory; do
> > +   $func $dir "$@"
> > +   rc=$(($rc | $?))
> > +   bump_fsid
> 
> called here
> 
> > +   [ "$fast_exit" ] && continue
> > +   [ $rc -ne 0 ] && return $rc
> > +   done
> > +   return $rc
> > +}
> > +
> ...
> 
> >  exportfs_validate_all ()
> >  {
> > -   if [ ! -d $OCF_RESKEY_directory ]; then
> > -   ocf_log err "$OCF_RESKEY_directory does not exist or is not 
> > a directory"
> > +   if [ `echo "$OCF_RESKEY_directory" | wc -w` -gt 1 ] &&
> > +   ! ocf_is_decimal "$OCF_RESKEY_fsid"; then
> > +   ocf_log err "use integer fsid when exporting multiple 
> > directories"
> > +   return $OCF_ERR_CONFIGURED
> > +   fi
> > +   if ! forall testdir; then
> > return $OCF_ERR_INSTALLED
> > fi
> >  }
> 
> It is validated to be decimal, but only if more than one directory is
> present, while it is always being incremented, even if only a single
> directory is defined.

Good catch!

Thanks,

Dejan

> The same code is present upstream (the line number, 178, is a bit off).
> 
> Workaround is to change FSID, but yes, it looks like upstream bug.
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Re: duplicate node

2015-12-21 Thread Dejan Muhamedagic
Hi,

On Fri, Dec 11, 2015 at 09:34:59PM +, gerry kernan wrote:
> Hi 
> Tried removing the node with uuid ae4d76e7-af64-4d93-acdd-4d7b5c274eff but I 
> get an error that the node is not present in the cib.
> 
> When I do a crm configure show the uuid is there
> node $id="3b5d1061-8f68-4ab3-b169-e0ebe890c446" gat-voip-01.gdft.org node 
> $id="ae4d76e7-af64-4d93-acdd-4d7b5c274eff" gat-voip-01.gdft.org \
>   attributes standby="on"
> 
> Are there any files I can edit manually to remove this, or should I do a complete 
> erase of the config and start fresh?

It's somewhat involved to edit the cluster configuration (CIB)
offline. All nodes need to be shut down, then remove all cib.*
files on all nodes but one, and on that node you can then edit
the CIB file like this (with crm, for instance):

# CIB_file=/var/lib/*/cib/cib.xml crm configure
...

Then you have to remove the corresponding signature file
(cib.xml.sig), because after the edit it's not going to match
anymore. Perhaps there's a way to generate it by hand, but I
don't know how. At any rate, pacemaker loads cib.xml in case
the .sig file is not present.
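
Putting the steps together (a sketch; CIBDIR below stands for the CIB
directory, typically /var/lib/heartbeat/crm on heartbeat-based installs
or /var/lib/pacemaker/cib on newer ones -- adjust the paths):

1. Stop the cluster on all nodes.
2. On all nodes but one:  rm $CIBDIR/cib.*
3. On the remaining node:
   # CIB_file=$CIBDIR/cib.xml crm configure
     (delete the stale node entry, commit, quit)
   # rm -f $CIBDIR/cib.xml.sig
4. Start the cluster again; the edited cib.xml is loaded since no .sig
   file is present.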

Thanks,

Dejan

> 
> 
> Gerry 
> -Original Message-
> From: Dejan Muhamedagic [mailto:deja...@fastmail.fm] 
> Sent: Thursday 10 December 2015 16:37
> To: Cluster Labs - All topics related to open-source clustering welcomed 
> <users@clusterlabs.org>
> Subject:  Re: [ClusterLabs] duplicate node
> 
> Hi,
> 
> On Tue, Dec 08, 2015 at 09:17:27PM +, gerry kernan wrote:
> > Hi
> >  
> > How would I remove a duplicate node, I have a 2 node setup , but on node is 
> > showing twice .  crm show configure below, node gat-voip-01.gdft.org is 
> > listed twice.
> >  
> >  
> > node $id="0dc85a64-01ad-4fc5-81fd-698208a8322c" gat-voip-02\
> > attributes standby="on"
> > node $id="3b5d1061-8f68-4ab3-b169-e0ebe890c446" gat-voip-01 node 
> > $id="ae4d76e7-af64-4d93-acdd-4d7b5c274eff" gat-voip-01\
> > attributes standby="off"
> 
> First you need to figure out which one is the old uuid, then try:
> 
> # crm node delete 
> 
> This looks like heartbeat, there used to be a crm_uuid or something similar 
> to read the uuid. There's also a uuid file somewhere in /var/lib/heartbeat.
> 
> Thanks,
> 
> Dejan
> 
> > primitive res_Filesystem_rep ocf:heartbeat:Filesystem \
> > params device="/dev/drbd0" directory="/rep" fstype="ext3" \
> > operations $id="res_Filesystem_rep-operations" \
> > op start interval="0" timeout="60" \
> > op stop interval="0" timeout="60" \
> > op monitor interval="20" timeout="40" start-delay="0" \
> > op notify interval="0" timeout="60" \
> > meta target-role="started" is-managed="true"
> > primitive res_IPaddr2_northIP ocf:heartbeat:IPaddr2 \
> > params ip="10.75.29.10" cidr_netmask="26" \
> > operations $id="res_IPaddr2_northIP-operations" \
> > op start interval="0" timeout="20" \
> > op stop interval="0" timeout="20" \
> > op monitor interval="10" timeout="20" start-delay="0" \
> > meta target-role="started" is-managed="true"
> > primitive res_IPaddr2_sipIP ocf:heartbeat:IPaddr2 \
> > params ip="158.255.224.226" nic="bond2" \
> > operations $id="res_IPaddr2_sipIP-operations" \
> > op start interval="0" timeout="20" \
> > op stop interval="0" timeout="20" \
> > op monitor interval="10" timeout="20" start-delay="0" \
> > meta target-role="started" is-managed="true"
> > primitive res_asterisk_res_asterisk lsb:asterisk \
> > operations $id="res_asterisk_res_asterisk-operations" \
> > op start interval="0" timeout="15" \
> > op stop interval="0" timeout="15" \
> > op monitor interval="15" timeout="15" start-delay="15" \
> > meta target-role="started" is-managed="true"
> > primitive res_drbd_1 ocf:linbit:drbd \
> > params drbd_resource="r0" \
> > operations $id="res_drbd_1-operations" \
> > op start interval=&q

Re: [ClusterLabs] Notice: SLES11SP4 broke exportfs!

2015-12-21 Thread Dejan Muhamedagic
Hi,

On Fri, Dec 11, 2015 at 07:27:28PM +0100, Ulrich Windl wrote:
> Hi!
> 
> After updating from SLES11SP3 (june version) to SLES11SP4 (todays version) 
> exportfs fails to get the export status. I have message like this in syslog:
> 
> Dec 11 19:22:09 h04 crmd[11128]:   notice: process_lrm_event: 
> rksaph04-prm_nfs_c11_mnt_exp_monitor_0:93 [ 
> /usr/lib/ocf/resource.d/heartbeat/exportfs: line 178: 4f838db1: value too 
> great for base (error token is "4f838db1")\n ]

The value of the fsid is unexpected. The code (and I) assumed
that it would be decimal and that's mentioned in the fsid
meta-data description.

> Why is such broken code released? Here's the diff:

I suspect that any newly released code is broken in some way
for some deployments.

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Notice: SLES11SP4 broke exportfs!

2015-12-21 Thread Dejan Muhamedagic
On Mon, Dec 21, 2015 at 12:54:49PM +0100, Ulrich Windl wrote:
> >>> Dejan Muhamedagic <deja...@fastmail.fm> schrieb am 21.12.2015 um 11:40 in
> Nachricht <20151221104011.GB9783@walrus.homenet>:
> > Hi,
> > 
> > On Fri, Dec 11, 2015 at 07:27:28PM +0100, Ulrich Windl wrote:
> >> Hi!
> >> 
> >> After updating from SLES11SP3 (june version) to SLES11SP4 (todays version) 
> > exportfs fails to get the export status. I have message like this in syslog:
> >> 
> >> Dec 11 19:22:09 h04 crmd[11128]:   notice: process_lrm_event: 
> > rksaph04-prm_nfs_c11_mnt_exp_monitor_0:93 [ 
> > /usr/lib/ocf/resource.d/heartbeat/exportfs: line 178: 4f838db1: value too 
> > great for base (error token is "4f838db1")\n ]
> > 
> > The value of the fsid is unexpected. The code (and I) assumed
> > that it would be decimal and that's mentioned in the fsid
> > meta-data description.
> 
> Hi!
> 
> Really? crm(live)# ra info exportfs:
> 
> [...]
> fsid* (string): Unique fsid within cluster or starting fsid for multiple 
> exports
> .
> The fsid option to pass to exportfs. This can be a unique positive
> integer, a UUID, or the special string "root" which is functionally
> identical to numeric fsid of 0.
> If multiple directories are being exported, then they are
> assigned ids sequentially starting with this fsid (fsid, fsid+1,
> fsid+2, ...). Obviously, in that case the fsid must be an
> integer.

  Here ^^^

> 0 (root) identifies the export as the root of an NFSv4
> pseudofilesystem -- avoid this setting unless you understand its
> special status.
> This value will override any fsid provided via the options parameter.
> [...]
> 
> Did you read "UUID" also? 
> 
> > 
> >> Why is such broken code released? Here's the diff:
> > 
> > I suspect that every newly released code is broken in some way
> > for some deployments.
> 
> The code clearly does not match the description, and it is broken.

The code _should_ match the description, but, as we all
concluded, there's a bug.

> I would also expect that "validate" would report values for fsid it cannot 
> handle.
> Furthermore, I see no sense in trying to increment an fsid.
> 
> Maybe you can explain.

The RA tries to increase the fsid for a one-directory
configuration. Erroneously. It needs to be fixed _not_ to
manipulate the fsid for such configurations.
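
One possible shape of that fix, as a sketch only (bump_fsid,
CURRENT_FSID and OCF_RESKEY_directory are from the code quoted earlier
in this thread):

bump_fsid() {
    # leave the fsid alone (e.g. a UUID) when only one directory is exported
    [ `echo "$OCF_RESKEY_directory" | wc -w` -gt 1 ] || return 0
    CURRENT_FSID=$((CURRENT_FSID + 1))
}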

Thanks,

Dejan

> 
> Regards,
> Ulrich
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Q] Check on application layer (kamailio, openhab)

2015-12-21 Thread Dejan Muhamedagic
Hi,

On Mon, Dec 21, 2015 at 10:07:25AM -0600, Ken Gaillot wrote:
> On 12/19/2015 10:21 AM, Sebish wrote:
> > Dear all ha-list members,
> > 
> > I am trying to set up two availability checks on the application layer using
> > heartbeat and pacemaker.
> > To be more concrete I need 1 resource agent (ra) for openHAB and 1 for
> > Kamailio SIP Proxy.
> > 
> > *My setup:*
> > 
> >+ Debian 7.9 + Heartbeat + Pacemaker + more
> 
> This should work for your purposes, but FYI, corosync 2 is the preferred
> communications layer these days. Debian 7 provides corosync 1, which
> might be worth using here, to make an eventual switch to corosync 2 easier.
> 
> Also FYI, Pacemaker was dropped from Debian 8, but there is a group
> working on backporting the latest pacemaker/corosync/etc. to it.
> 
> >+ 2 Node Cluster with Hot-Standby Failover
> >+ Active Cluster with clusterip, ip-monitoring, working failover and
> >services
> >+ Copied kamailio ra into /usr/lib/ocf/resource.d/heartbeat, chmod
> >755 and 'crm ra list ocf heartbeat' finds it
> > 
> > *The plan:*
> > 
> > _openHAB_
> > 
> >My idea was to let heartbeat check for the availability of openHAB's
> >website (jetty-based) or check if the process is up and running.
> > 
> >I did not find a fitting resource agent. Is there a general ra in
> >which you would just have to insert the process name 'openhab'?
> > 
> > _Kamailio_
> > 
> >My idea was to let an RA send a SIP request to kamailio and check
> >if it gets an answer AND if it is the correct one.
> > 
> >It seems like the ra
> >   
> > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/kamailio
> > 
> >does exactly what I want,
> >but I do not really understand it. Is it plug and play? Do I have to
> >change values inside the code like users, the complete meta-data or
> >else?
> > 
> >When I try to insert this agent (no changes) into pacemaker using
> >'crm configure primitive kamailio ocf:heartbeat:kamailio' it says:
> > 
> >lrmadmin[4629]: 2015/12/19_16:11:40 ERROR:
> >lrm_get_rsc_type_metadata(578): got a return code HA_FAIL from a
> >reply message of rmetadata with function get_ret_from_msg.
> >ERROR: ocf:heartbeat:kamailio: could not parse meta-data:
> >ERROR: ocf:heartbeat:kamailio: could not parse meta-data:
> >ERROR: ocf:heartbeat:kamailio: no such resource agent
> 
> lrmadmin is no longer used, and I'm not familiar with it, but first I'd
> check that the RA is executable. If it supports running directly from
> the command line, maybe make sure you can run it that way first.

I think that the RA is just not installed.
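
A quick way to check, as a sketch (using the standard OCF paths
mentioned earlier in the thread):

# ls -l /usr/lib/ocf/resource.d/heartbeat/kamailio
# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/kamailio meta-data | head

If the meta-data call fails or prints nothing, that would explain the
"could not parse meta-data" error.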

Thanks,

Dejan

> Most RAs support configuration options, which you can set in the cluster
> configuration (you don't have to edit the RA). Each RA specifies the
> options it accepts in the parameters section of its metadata.
> 
> > *The question:*
> > 
> > Maybe you could give me some hints on what to do next. Perhaps one of
> > you is even already using the kamailio ra successfully or checking a
> > non-apache website?
> > If I simply have to insert all my cluster data into the kamailio ra, it
> > should not throw this error, should it? Could have used a readme for
> > this ra though...
> > If you need any data, I will provide it asap!
> > 
> > *Thanks a lot to all who read this mail!*
> > 
> > Sebish
> > ha-newbie, but not noobie ;)
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] duplicate node

2015-12-10 Thread Dejan Muhamedagic
Hi,

On Tue, Dec 08, 2015 at 09:17:27PM +, gerry kernan wrote:
> Hi 
>  
> How would I remove a duplicate node? I have a 2-node setup, but one node is 
> showing twice. The crm configure show output is below; node gat-voip-01.gdft.org is 
> listed twice.
>  
>  
> node $id="0dc85a64-01ad-4fc5-81fd-698208a8322c" gat-voip-02\
> attributes standby="on"
> node $id="3b5d1061-8f68-4ab3-b169-e0ebe890c446" gat-voip-01
> node $id="ae4d76e7-af64-4d93-acdd-4d7b5c274eff" gat-voip-01\
> attributes standby="off"

First you need to figure out which one is the old uuid, then try:

# crm node delete 

This looks like heartbeat, there used to be a crm_uuid or
something similar to read the uuid. There's also a uuid file
somewhere in /var/lib/heartbeat.

Thanks,

Dejan

> primitive res_Filesystem_rep ocf:heartbeat:Filesystem \
> params device="/dev/drbd0" directory="/rep" fstype="ext3" \
> operations $id="res_Filesystem_rep-operations" \
> op start interval="0" timeout="60" \
> op stop interval="0" timeout="60" \
> op monitor interval="20" timeout="40" start-delay="0" \
> op notify interval="0" timeout="60" \
> meta target-role="started" is-managed="true"
> primitive res_IPaddr2_northIP ocf:heartbeat:IPaddr2 \
> params ip="10.75.29.10" cidr_netmask="26" \
> operations $id="res_IPaddr2_northIP-operations" \
> op start interval="0" timeout="20" \
> op stop interval="0" timeout="20" \
> op monitor interval="10" timeout="20" start-delay="0" \
> meta target-role="started" is-managed="true"
> primitive res_IPaddr2_sipIP ocf:heartbeat:IPaddr2 \
> params ip="158.255.224.226" nic="bond2" \
> operations $id="res_IPaddr2_sipIP-operations" \
> op start interval="0" timeout="20" \
> op stop interval="0" timeout="20" \
> op monitor interval="10" timeout="20" start-delay="0" \
> meta target-role="started" is-managed="true"
> primitive res_asterisk_res_asterisk lsb:asterisk \
> operations $id="res_asterisk_res_asterisk-operations" \
> op start interval="0" timeout="15" \
> op stop interval="0" timeout="15" \
> op monitor interval="15" timeout="15" start-delay="15" \
> meta target-role="started" is-managed="true"
> primitive res_drbd_1 ocf:linbit:drbd \
> params drbd_resource="r0" \
> operations $id="res_drbd_1-operations" \
> op start interval="0" timeout="240" \
> op promote interval="0" timeout="90" \
> op demote interval="0" timeout="90" \
> op stop interval="0" timeout="100" \
> op monitor interval="10" timeout="20" start-delay="0" \
> op notify interval="0" timeout="90"
> primitive res_httpd_res_httpd lsb:httpd \
> operations $id="res_httpd_res_httpd-operations" \
> op start interval="0" timeout="15" \
> op stop interval="0" timeout="15" \
> op monitor interval="15" timeout="15" start-delay="15" \
> meta target-role="started" is-managed="true"
> primitive res_mysqld_res_mysql lsb:mysqld \
> operations $id="res_mysqld_res_mysql-operations" \
> op start interval="0" timeout="15" \
> op stop interval="0" timeout="15" \
> op monitor interval="15" timeout="15" start-delay="15" \
> meta target-role="started"
> group asterisk res_Filesystem_rep res_IPaddr2_northIP res_IPaddr2_sipIP 
> res_mysqld_res_mysql res_httpd_res_httpd res_asterisk_res_asterisk
> ms ms_drbd_1 res_drbd_1 \
> meta clone-max="2" notify="true" interleave="true" 
> resource-stickiness="100"
> location loc_res_httpd_res_httpd_gat-voip-01.gdft.org asterisk inf: 
> gat-voip-01.gdft.org
> location loc_res_mysqld_res_mysql_gat-voip-01.gdft.org asterisk inf: 
> gat-voip-01.gdft.org
> colocation col_res_Filesystem_rep_ms_drbd_1 inf: asterisk ms_drbd_1:Master
> order ord_ms_drbd_1_res_Filesystem_rep inf: ms_drbd_1:promote asterisk:start
> property $id="cib-bootstrap-options" \
> stonith-enabled="false" \
> dc-version="1.0.12-unknown" \
> no-quorum-policy="ignore" \
> cluster-infrastructure="Heartbeat" \
> last-lrm-refresh="1345727614"
>  
>  
>  
> Gerry Kernan
>  
>  
> Infinity IT   |   17 The Mall   |   Beacon Court   |   Sandyford   |   Dublin 
> D18 E3C8   |   Ireland
> Tel:  +353 - (0)1 - 293 0090   |   E-Mail:  gerry.ker...@infinityit.ie
>  
> Managed IT Services   Infinity IT - www.infinityit.ie
> IP TelephonyAsterisk Consulting - 
> www.asteriskconsulting.com
> Contact CentreTotal Interact - www.totalinteract.com
>  



> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



Re: [ClusterLabs] how to configure fence in a two node cluster?who can help me?

2015-11-17 Thread Dejan Muhamedagic
Hi,

On Mon, Nov 16, 2015 at 10:42:50AM -0600, Ken Gaillot wrote:
> On 11/16/2015 06:08 AM, Shilu wrote:
> > The following configuration is mine. It works well when I don't add the 
> > fence configuration.
> > But when I add the configuration primitive st-ipmilan1 stonith:ipmilan, it 
> > reports the following error.
> > I want to know how to configure it and how to confirm that it will work 
> > well.
> > 
> > Failed actions:
> > st-ipmilan1_monitor_0 (node=ubuntu212, call=-1, rc=1, status=Timed Out, 
> > last
> > -rc-change=Mon Nov 16 06:47:19 2015
> > , queued=0ms, exec=0ms
> > ): unknown error
> > 
> 
> Check the logs for any fence_ipmilan messages around this time. You can
> also try to run the command manually (and maybe add "-v" for verbose
> messages):
> 
> fence_ipmilan -A md5 -a 192.168.33.127 -p 12345678 -l admin -o monitor
> -L admin

The configuration has the linux-ha stonith agent ipmilan, not
fence_ipmilan.
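
To see which parameters that agent actually takes, something like this
should do (just a pointer, not verified against this particular build):

# crm ra info stonith:ipmilan
# stonith -t ipmilan -h
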

> I don't think there's a "hostname" option for fence_ipmilan. I think you
> mean "pcmk_host_list=ubuntu211", because without that, Pacemaker will
> try to use st-ipmilan1 when fencing *either* node. (Of course, once this
> is working, you'll want to set up a st-ipmilan2 similarly.)
> 
> I think "priv" should be "privlvl".

No, the parameters are fine.

> If these are on-board IPMI controllers, which share their power supply
> with the host computer, then fencing will fail if the power fails, and
> the cluster will refuse to run any services. If that's undesirable, you
> may want to consider a second fencing level using intelligent power
> switches or some other alternative fencing method.

Right.

Thanks,

Dejan

> > node $id="3232244179" ubuntu211
> > node $id="3232244180" ubuntu212
> > primitive VIP ocf:heartbeat:IPaddr \
> >  params ip="192.168.33.129" \
> >  op monitor timeout="10s" interval="1s"
> > primitive lunit ocf:heartbeat:iSCSILogicalUnit \
> >  params implementation="tgt" target_iqn="hoo" lun="1" 
> > path="rbd/hoo" tgt
> > _bstype="rbd" \
> >  op monitor timeout="10s" interval="1s"
> > primitive st-ipmilan1 stonith:ipmilan \
> >  params hostname="ubuntu211" ipaddr="192.168.33.127" port="623" 
> > auth="md
> > 5" priv="admin" login="admin" password="12345678"
> > primitive tgt ocf:heartbeat:iSCSITarget \
> >  params implementation="tgt" iqn="hoo" tid="5" \
> >  op monitor interval="1s" timeout="10s"
> > group cluster VIP tgt lunit
> > property $id="cib-bootstrap-options" \
> >  dc-version="1.1.10-42f2063" \
> >  cluster-infrastructure="corosync" \
> >  stonith-enabled="true" \
> >  no-quorum-policy="ignore"
> > -
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] how to configure fence in a two node cluster?who can help me?

2015-11-17 Thread Dejan Muhamedagic
Hi,

On Mon, Nov 16, 2015 at 12:08:53PM +, Shilu wrote:
> The following configuration is mine. It works well when I don't add the fence 
> configuration.
> But when I add the configuration primitive st-ipmilan1 stonith:ipmilan, it 
> reports the following error.
> I want to know how to configure it and how to confirm that it will work well.

Please use stonith:external/ipmi as stonith:ipmilan is
deprecated.

Otherwise, if you can use the ipmitool(8) with the same
parameters, then the stonith resource should work too.
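
For example, roughly (a sketch: verify the exact external/ipmi
parameter names with "crm ra info stonith:external/ipmi"; the address
and credentials below are the ones from your configuration):

# ipmitool -I lan -H 192.168.33.127 -U admin -P 12345678 chassis power status

primitive st-ipmi-211 stonith:external/ipmi \
    params hostname="ubuntu211" ipaddr="192.168.33.127" \
        userid="admin" passwd="12345678" interface="lan" \
    op monitor interval="3600s" timeout="60s"
location l-st-ipmi-211 st-ipmi-211 -inf: ubuntu211

The location constraint keeps a node from running (and relying on) its
own fencing device.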

Thanks,

Dejan

> Failed actions:
> st-ipmilan1_monitor_0 (node=ubuntu212, call=-1, rc=1, status=Timed Out, 
> last
> -rc-change=Mon Nov 16 06:47:19 2015
> , queued=0ms, exec=0ms
> ): unknown error
> 
> 
> node $id="3232244179" ubuntu211
> node $id="3232244180" ubuntu212
> primitive VIP ocf:heartbeat:IPaddr \
>  params ip="192.168.33.129" \
>  op monitor timeout="10s" interval="1s"
> primitive lunit ocf:heartbeat:iSCSILogicalUnit \
>  params implementation="tgt" target_iqn="hoo" lun="1" path="rbd/hoo" 
> tgt
> _bstype="rbd" \
>  op monitor timeout="10s" interval="1s"
> primitive st-ipmilan1 stonith:ipmilan \
>  params hostname="ubuntu211" ipaddr="192.168.33.127" port="623" 
> auth="md
> 5" priv="admin" login="admin" password="12345678"
> primitive tgt ocf:heartbeat:iSCSITarget \
>  params implementation="tgt" iqn="hoo" tid="5" \
>  op monitor interval="1s" timeout="10s"
> group cluster VIP tgt lunit
> property $id="cib-bootstrap-options" \
>  dc-version="1.1.10-42f2063" \
>  cluster-infrastructure="corosync" \
>  stonith-enabled="true" \
>  no-quorum-policy="ignore"
> -
> 
> 
> 
> This e-mail and its attachments contain confidential information from H3C, 
> which is
> intended only for the person or entity whose address is listed above. Any use 
> of the
> information contained herein in any way (including, but not limited to, total 
> or partial
> disclosure, reproduction, or dissemination) by persons other than the intended
> recipient(s) is prohibited. If you receive this e-mail in error, please 
> notify the sender
> by phone or email immediately and delete it!

> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: [Question] Question about mysql RA.

2015-11-16 Thread Dejan Muhamedagic
Hi Hideo-san,

On Thu, Nov 12, 2015 at 06:15:29PM +0900, renayama19661...@ybb.ne.jp wrote:
> Hi Ken,
> Hi Ulrich,
> 
> Hi All,
> 
> I sent a patch.
>  * https://github.com/ClusterLabs/resource-agents/pull/698

Your patch was merged. Many thanks.

Cheers,

Dejan

> 
> Please confirm it.
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 
> - Original Message -
> > From: "renayama19661...@ybb.ne.jp" 
> > To: Cluster Labs - All topics related to open-source clustering welcomed 
> > 
> > Cc: 
> > Date: 2015/11/5, Thu 19:36
> > Subject: Re: [ClusterLabs] Antw: Re:  [Question] Question about mysql RA.
> > 
> > Hi Ken,
> > Hi Ulrich,
> > 
> > Thank you for comment
> > 
> > The mysql RA seems to have had a problem of one sort or another from the 
> > beginning, as far as I can tell from the opinions of Ken and Ulrich.
> > 
> > I will wait a little longer for other people's opinions, and then make a patch.
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > 
> > - Original Message -
> >>  From: Ulrich Windl 
> >>  To: users@clusterlabs.org; kgail...@redhat.com
> >>  Cc: 
> >>  Date: 2015/11/5, Thu 16:11
> >>  Subject: [ClusterLabs] Antw: Re:  [Question] Question about mysql RA.
> >> 
> >   Ken Gaillot  schrieb am 04.11.2015 
> > um 
> >>  16:44 in Nachricht
> >>  <563a27c2.5090...@redhat.com>:
> >>>   On 11/04/2015 04:36 AM, renayama19661...@ybb.ne.jp wrote:
> >>  [...]
>        pid=`cat $OCF_RESKEY_pid 2> /dev/null `
>        /bin/kill $pid > /dev/null
> >>> 
> >>>   I think before this line, the RA should do a "kill -0" to 
> > check 
> >>  whether
> >>>   the PID is running, and return $OCF_SUCCESS if not. That way, we can
> >>>   still return an error if the real kill fails.
> >> 
> >>  And remove the stale PID file if there is no such pid. For very busy 
> > systems one 
> >>  could use ps for that PID to see whether the PID belongs to the expected 
> >>  process. There is a small chance that a PID exists, but does not belong 
> >> to 
> > the 
> >>  expected process...
> >> 
> >>> 
>        rc=$?
>        if [ $rc != 0 ]; then
>            ocf_exit_reason "MySQL couldn't be stopped"
>            return $OCF_ERR_GENERIC
>        fi
>    (snip)
>    -
>  
>    The mysql RA has contained such code since the old days.
>     * http://hg.linux-ha.org/agents/file/67234f982ab7/heartbeat/mysql 
> > 
>  
>    Does anyone know the reason the mysql RA was made this way?
>    Could it be something to accommodate MySQL Cluster?
>  
>    I am thinking about a patch for this behaviour of the mysql RA.
>    I want to know the detailed reason.
>  
>    Best Regards,
>    Hideo Yamauchi.
> >>> 
> >>> 
> >>>   ___
> >>>   Users mailing list: Users@clusterlabs.org 
> >>>   http://clusterlabs.org/mailman/listinfo/users 
> >>> 
> >>>   Project Home: http://www.clusterlabs.org 
> >>>   Getting started: 
> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> >>>   Bugs: http://bugs.clusterlabs.org 
> >> 
> >> 
> >> 
> >> 
> >> 
> >>  ___
> >>  Users mailing list: Users@clusterlabs.org
> >>  http://clusterlabs.org/mailman/listinfo/users
> >> 
> >>  Project Home: http://www.clusterlabs.org
> >>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>  Bugs: http://bugs.clusterlabs.org
> >> 
> > 
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.

2015-10-30 Thread Dejan Muhamedagic
Hi Hideo-san,

On Fri, Oct 30, 2015 at 11:41:26AM +0900, renayama19661...@ybb.ne.jp wrote:
> Hi Dejan,
> Hi All,
> 
> How about the patch which I contributed by a former email?
> I would like an opinion.

It somehow slipped.

I suppose that you tested the patch well and nobody objected so
far, so lets apply it.

Many thanks! And sorry about the delay.

Cheers,

Dejan



> 
> Best Regards,
> Hideo Yamauchi.
> 
> - Original Message -
> > From: "renayama19661...@ybb.ne.jp" <renayama19661...@ybb.ne.jp>
> > To: Cluster Labs - All topics related to open-source clustering welcomed 
> > <users@clusterlabs.org>
> > Cc: 
> > Date: 2015/10/14, Wed 09:38
> > Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a 
> > lower case of hostlist.
> > 
> > Hi Dejan,
> > Hi All,
> > 
> > We reconsidered a patch.
> > 
> > 
> > 
> > In Pacemaker 1.1, node names in STONITH are always lower case.
> > When a user uses capital letters in a host name, STONITH via libvirt fails.
> > 
> > This patch lets STONITH via libvirt succeed with the following settings:
> > 
> >  * host name (upper case), hostlist (upper case), domain_id on libvirt (upper case)
> >  * host name (upper case), hostlist (lower case), domain_id on libvirt (lower case)
> >  * host name (lower case), hostlist (upper case), domain_id on libvirt (upper case)
> >  * host name (lower case), hostlist (lower case), domain_id on libvirt (lower case)
> > 
> > 
> > However, with the following settings, STONITH via libvirt causes an error.
> > In these cases the user must make the case of the host name managed by
> > libvirt match the host name given in hostlist.
> > 
> >  * host name (upper case), hostlist (lower case), domain_id on libvirt (upper case)
> >  * host name (upper case), hostlist (upper case), domain_id on libvirt (lower case)
> >  * host name (lower case), hostlist (lower case), domain_id on libvirt (upper case)
> >  * host name (lower case), hostlist (upper case), domain_id on libvirt (lower case)
> > 
> > 
> > This patch makes STONITH via libvirt succeed when the host name contains
> > capital letters.
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > 
> > 
> > 
> > - Original Message -
> >>  From: "renayama19661...@ybb.ne.jp" 
> > <renayama19661...@ybb.ne.jp>
> >>  To: Cluster Labs - All topics related to open-source clustering welcomed 
> > <users@clusterlabs.org>
> >>  Cc: 
> >>  Date: 2015/9/15, Tue 03:28
> >>  Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to 
> >> a 
> > lower case of hostlist.
> >> 
> >>  Hi Dejan,
> >> 
> >>>   I suppose that you'll send another one? I can vaguelly recall
> >>>   a problem with non-lower case node names, but not the specifics.
> >>>   Is that supposed to be handled within a stonith agent?
> >> 
> >> 
> >>  Yes.
> >>  We make a different patch now.
> >>  With the patch, I solve a node name of the non-small letter in the range 
> >> of 
> > 
> >>  stonith agent.
> >>  # But the patch cannot cover all all patterns.
> >> 
> >>  Please wait a little longer.
> >>  I send a patch again.
> >>  For a new patch, please tell me your opinion.
> >> 
> >>  Best Regards,
> >>  Hideo Yamauchi.
> >> 
> >> 
> >> 
> >>  - Original Message -
> >>>   From: Dejan Muhamedagic <deja...@fastmail.fm>
> >>>   To: ClusterLabs-ML <users@clusterlabs.org>
> >>>   Cc: 
> >>>   Date: 2015/9/14, Mon 22:20
> >>>   Subject: Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion 
> > to a 
> >>  lower case of hostlist.
> >>> 
> >>>   Hi Hideo-san,
> >>> 
> >>>   On Tue, Sep 08, 2015 at 05:28:05PM +0900, renayama19661...@ybb.ne.jp 
> > wrote:
> >>>>    Hi All,
> >>>> 
> >>>>    We intend to change some patches.
> >>>>    We withdraw this patch.
> >>> 
> >>>   I suppose that you'll send another one? I can vaguelly recall
> >>>   a problem with non-lower case node names

Re: [ClusterLabs] ORACLE 12 and SLES HAE (Sles 11sp3)

2015-10-29 Thread Dejan Muhamedagic
Hi,

On Wed, Oct 28, 2015 at 02:45:55AM -0600, Cristiano Coltro wrote:
> Hi,
> most of the SLES 11 SP3 installations with HAE are migrating their Oracle DB.
> The migration will be from Oracle 11 to Oracle 12.
> 
> They have verified that the Oracle cluster resources actually support
> - Oracle 10.2 and 11.2
> Command used: "crm ra info ocf:heartbeat:SAPDatabase"
> So it seems they are out of support.

It just means that the agent was tested with those. Oracle 12 was
probably not available at the time. At any rate, as others
already pointed out, the RA should support all databases which
are supported by SAPHostAgent.
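
One quick way to check that on a node would be something like this (a
sketch; the path is the usual SAPHostAgent location and the function
name should be verified against your SAPHostAgent version):

# /usr/sap/hostctrl/exe/saphostctrl -function ListDatabases

If the Oracle 12 database shows up there, SAPDatabase should be able to
manage it.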

Thanks,

Dejan

> So I would like to know which version of the cluster/OS/agent supports Oracle 12.
> AFAIK agents are typically included in the rpm.
> # rpm -qf /usr/lib/ocf/resource.d/heartbeat/SAPDatabase
> resource-agents-3.9.5-0.34.57
> and there are NO updates about that in the channel.
> 
> Any idea on that?
> Thanks,
> Cristiano
> 
> 
> 
> Cristiano Coltro
> Premium Support Engineer
>   
> mail: cristiano.col...@microfocus.com
> phone +39 02 36634936
> mobile +39 3351435589
> 
> 
>  
> 
> 
> 
> 
> 
> 
> 
> 


> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] VIP monitoring failing with Timed Out error

2015-10-29 Thread Dejan Muhamedagic
Hi,

On Thu, Oct 29, 2015 at 10:40:18AM +0530, Pritam Kharat wrote:
> Thank you very much, Ken, for the reply. I will try your suggested steps.

If you cannot figure out from the logs why the stop operation
times out, you can also try to trace the resource agent:

# crm resource help trace
# crm resource trace vip stop

Then take a look at the trace or post it somewhere.
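
If the trace isn't conclusive, running the agent by hand sometimes is;
roughly (a sketch, with placeholder parameter values to replace with
the real ones from your vip resource):

# OCF_ROOT=/usr/lib/ocf OCF_RESKEY_ip=10.0.0.10 OCF_RESKEY_cidr_netmask=24 \
    /usr/lib/ocf/resource.d/heartbeat/IPaddr2 stop; echo $?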

Thanks,

Dejan

> 
> On Wed, Oct 28, 2015 at 11:23 PM, Ken Gaillot  wrote:
> 
> > On 10/28/2015 03:51 AM, Pritam Kharat wrote:
> > > Hi All,
> > >
> > > I am facing one issue in my two-node HA setup. When I stop pacemaker on the
> > > ACTIVE node, it takes a long time to stop, and by that time the migration of
> > > the VIP and the other resources to the STANDBY node fails. (I have seen the
> > > same issue when rebooting the ACTIVE node.)
> >
> > I assume STANDBY in this case is just a description of the node's
> > purpose, and does not mean that you placed the node in pacemaker's
> > standby mode. If the node really is in standby mode, it can't run any
> > resources.
> >
> > > Last change: Wed Oct 28 02:52:57 2015 via cibadmin on node-1
> > > Stack: corosync
> > > Current DC: node-1 (1) - partition with quorum
> > > Version: 1.1.10-42f2063
> > > 2 Nodes configured
> > > 2 Resources configured
> > >
> > >
> > > Online: [ node-1 node-2 ]
> > >
> > > Full list of resources:
> > >
> > >  resource (upstart:resource): Stopped
> > >  vip (ocf::heartbeat:IPaddr2): Started node-2 (unmanaged) FAILED
> > >
> > > Migration summary:
> > > * Node node-1:
> > > * Node node-2:
> > >
> > > Failed actions:
> > > vip_stop_0 (node=node-2, call=-1, rc=1, status=Timed Out,
> > > last-rc-change=Wed Oct 28 03:05:24 2015
> > > , queued=0ms, exec=0ms
> > > ): unknown error
> > >
> > > The VIP monitor is failing here with a Timed Out error. What is the general
> > > reason for a timeout? I have kept default-action-timeout=180s, which
> > > should be enough for monitoring.
> >
> > 180s should be far more than enough, so something must be going wrong.
> > Notice that it is the stop operation on the active node that is failing.
> > Normally in such a case, pacemaker would fence that node to be sure that
> > it is safe to bring it up elsewhere, but you have disabled stonith.
> >
> > Fencing is important in failure recovery such as this, so it would be a
> > good idea to try to get it implemented.
> >
> > > I have added an order constraint: only when the vip is started are the other
> > > resources started.
> > > Any clue how to solve this problem? Most of the time this VIP monitoring is
> > > failing with a Timed Out error.
> >
> > The "stop" in "vip_stop_0" means that the stop operation is what failed.
> > Have you seen timeouts on any other operations?
> >
> > Look through the logs around the time of the failure, and try to see if
> > there are any indications as to why the stop failed.
> >
> > If you can set aside some time for testing or have a test cluster that
> > exhibits the same issue, you can try unmanaging the resource in
> > pacemaker, then:
> >
> > 1. Try adding/removing the IP via normal system commands, and make sure
> > that works.
> >
> > 2. Try running the resource agent manually (with any verbose option) to
> > start/stop/monitor the IP to see if you can reproduce the problem and
> > get more messages.
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >
> 
> 
> 
> -- 
> Thanks and Regards,
> Pritam Kharat.

> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] restarting resources after configuration changes

2015-10-29 Thread Dejan Muhamedagic
Hi,

On Wed, Oct 28, 2015 at 10:21:26AM +, - - wrote:
> Hi,
> I am having problems restarting resources (e.g. apache) after a
> configuration file change. I have tried 'pcs resource restart resourceid',
> which says 'resource successfully restarted', but the httpd process
> does not restart and hence my configuration changes in httpd.conf
> do not take effect.
> I am sure this scenario is quite common, as administrators need to
> update httpd.conf files often - how is it done in an HA cluster?

Exactly as you tried. Something apparently went wrong, but hard
to say what.

Thanks,

Dejan

> I can send a HUP signal to the httpd process to achieve this, but I hope
> there will be a cluster (pcs/crm) method to do this.
> Many Thanks
> 
> krishan

> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] about the configuration for a iSCSITarget resource using the crm(8) shell.

2015-10-21 Thread Dejan Muhamedagic
Hi,

On Wed, Oct 21, 2015 at 12:17:15PM +, Shilu wrote:
> Hi, everyone!
> The following is an example configuration for an iSCSITarget resource using 
> the crm(8) shell:
> primitive tgt ocf:heartbeat:iSCSITarget \
>   params implementation="tgt" iqn="foo" tid="1" \
>   op monitor interval="1s"

This interval is very small. Are you sure that you want to
monitor it so often?

> Now I want to use the param additional_parameters; who can tell me how to use 
> this param?
> The following is how I use it, but it is not correct.
> 
> primitive tgt ocf:heartbeat:iSCSITarget \
>   params implementation="tgt" iqn="foo" tid="1" additional_parameters="lun=1 
> bs-type=rbd backing-store=rbd/foo" \
>   op monitor interval="1s"

What exactly is not correct? Not an iscsi target expert here, but
syntactically it seems to be OK. You can also take a look at the
XML definition of the resource:

$ crm configure show xml tgt

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: Antw: Monitoring Op for LVM - Excessive Logging

2015-10-12 Thread Dejan Muhamedagic
Hi,

On Mon, Oct 12, 2015 at 08:17:15AM +0200, Ulrich Windl wrote:
> >>> Jorge Fábregas  schrieb am 09.10.2015 um 18:00
> in
> Nachricht <5617e49c.6060...@gmail.com>:
> > On 10/09/2015 09:06 AM, Ulrich Windl wrote:
> >> Did you try daemon_options="-d0"? (in clvmd resource)
> > 
> > You've nailed it Ulrich!
> 
> I still think this should be the default value...

Of course, and it is. I can recall a bug report for this and it
was quite a while ago.
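
For reference, the setting under discussion would look roughly like
this (a sketch; the agent name below is the one I recall SLES shipping,
so double-check it with "crm ra list ocf lvm2"):

primitive clvm ocf:lvm2:clvmd \
    params daemon_options="-d0" \
    op monitor interval="60" timeout="90"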

Thanks,

Dejan

> > 
> > Thanks!
> > 
> > -- 
> > Jorge
> > 
> > ___
> > Users mailing list: Users@clusterlabs.org 
> > http://clusterlabs.org/mailman/listinfo/users 
> > 
> > Project Home: http://www.clusterlabs.org 
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> > Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: group resources not grouped ?!?

2015-10-09 Thread Dejan Muhamedagic
Hi,

On Fri, Oct 09, 2015 at 01:56:34PM +0200, zulucloud wrote:
> 
> 
> On 10/08/2015 10:37 AM, Dejan Muhamedagic wrote:
> >On Wed, Oct 07, 2015 at 05:13:40PM +0200, zulucloud wrote:
> 
> >>
> >>Well, they're quite verbose and a little bit cryptic...;) I didn't
> >>find anything what could enlighten that for me...
> >
> >If you're using crmsh, you can at least let history filter out
> >the stuff you don't want to look at. There's an introduction on
> >the feature here:
> >
> >http://crmsh.github.io/history-guide/
> >
> >Thanks,
> >
> >Dejan
> >
> 
> Hi Dejan,
> 
> that looks very good, thank you. I need to get the source and
> compile... Do you know if there are any usage restrictions if the
> rest of the cluster stack software is quite old (pacemaker 1.0.9.1,
> corosync 1.2.1-4) ?

Hmm, v1.0.x. Where did you find such an old thing? :) Does it
come with crmsh (until v1.1.8, crmsh was part of pacemaker)? Even
so, I doubt that it has the history feature.  You could try to
build the v1.2.6 branch, but I'm not sure whether it's going to
work. I also recall that colleagues from NTT Japan were
maintaining a version for some older pacemaker versions, but I
don't know where they keep the packages.
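
For anyone reading along, once a crmsh with the history feature is in
place, typical usage looks along these lines (commands shown from memory,
see the guide above for the authoritative list):

# crm history
crm(live)history# limit "Oct 07 17:00" "Oct 07 18:00"
crm(live)history# events
crm(live)history# resource <resource-name>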

Thanks,

Dejan

> thx
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Monitoring Op for LVM - Excessive Logging

2015-10-09 Thread Dejan Muhamedagic
Hi,

On Fri, Oct 09, 2015 at 08:20:31AM -0400, Jorge Fábregas wrote:
> Hi,
> 
> Is there a way to stop the excessive logging produced by the LVM monitor
> operation?  I got it set at the default (30 seconds) here on SLES 11
> SP4.  However, every time it runs the DC will write 174 lines to
> /var/log/messages (all coming from LVM). I'm referring to the LVM
> primitive resource (the one that activates a VG).  I'm also using DLM/cLVM.

Can you post some samples? If you want this fixed, best is to
open a support call with SUSE.
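
For reference, the knobs that control how chatty LVM is live in the log
section of /etc/lvm/lvm.conf; roughly (exact defaults vary by release):

log {
    verbose = 0      # terseness of command output, 0 is the quietest
    syslog = 1       # send messages to syslog
    level = 0        # 0-7, higher values log more to the log file
}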

Thanks,

Dejan

> I checked /etc/lvm/lvm.conf and the logging defaults are reasonable
> (verbose value set at 0 which is the lowest).
> 
> Thanks!
> 
> -- 
> Jorge
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Linux-HA] fence_ec2 agent

2015-09-24 Thread Dejan Muhamedagic
op monitor interval="3600s" timeout="60s" \
> > op stop interval="0s" timeout="60s"
> >
> >
> >The 1st instance has the "Cluster1=node01" tag-key.
> >The 2nd instance has the "Cluster1=node02" tag-key.
> >The 3rd instance has the "Cluster1=node03" tag-key.
> >...
> >The prmStonith1-2 can fence node01, node02 and node03.
> >
> >
> >If you like the above, I will implement that.
> >
> >
> >Regards,
> >Kazuhiko Higashi
> >
> >
> >On 2015/03/19 1:03, Markus Guertler wrote:
> >>Hi Kazuhiko, Dejan,
> >>
> >>the new resource agent is very good. Since there were a couple of days 
> >>between my original question and the answer from
> >>Kazuhiko, I also have written a stonith agent proof of concept (attached to 
> >>this email) in order to continue in my
> >>project. However, I think that your fence_ec2 agent is better from a 
> >>development perspective and it doesn't make sense
> >>to have two different agents for the same use case.
> >>
> >>Nevertheless, I've implemented an idea that is very useful in EC2
> >>environments with clusters that have more than two nodes: all EC2
> >>instances that belong to a cluster get a unique cluster name via an EC2
> >>instance tag. The agent uses this tag to determine all cluster nodes
> >>that belong to its own cluster.
> >>
> >>--- SNIP ---
> >> gethosts)
> >> # List of hostnames of this cluster
> >> init_agent
> >> ec2-describe-instances --filter "tag-key=Clustername" --filter 
> >> "tag-value=$clustername" | grep "^TAG" |grep
> >>"Hostname" | awk '{ print $5 }' | sort -u
> >>--- SNIP ---
> >>
> >>The advantage of this method is that you just need one configuration
> >>snippet for all nodes. This allows EC2 instances / cluster nodes to be
> >>added to or removed from a cluster dynamically without having to touch
> >>the cluster configuration. Dynamically adding or removing nodes (compute
> >>instances) is a very common scenario in a cloud.
> >>
> >>Would it be possible to implement this idea as an additional configuration
> >>method in the fence_ec2 agent?
> >>
> >>Cheers,
> >>Markus
> >>
> >>>>>東一彦 <higashi.kazuh...@lab.ntt.co.jp> 3/12/2015 10:44 AM >>>
> >>Hi Dejan
> >>
> >>Thank you for adding it and for fixing some issues!
> >>
> >>
> >>  > I was not able to test it, hope it works :)
> >>I confirmed that it works fine in my AWS environment :)
> >>
> >>
> >>Regards,
> >>Kazuhiko Higashi
> >>
> >>On 2015/03/11 21:27, Dejan Muhamedagic wrote:
> >>>Hi Kazuhiko-san,
> >>>
> >>>On Wed, Mar 11, 2015 at 02:36:43PM +0900, 東一彦 wrote:
> >>>>Hi, Dejan
> >>>>
> >>>>Thank you for the comment.
> >>>>
> >>>>I'd like to contribute it to the glue stonith agents.
> >>>>
> >>>>So, I rename it to just "ec2".
> >>>>
> >>>>Would you please add it to glue repository (http://hg.linux-ha.org/glue/) 
> >>>>?
> >>>
> >>>I just added your stonith agent. There was this change in the
> >>>initial changeset:
> >>>
> >>>- replaced '-' which is not allowed in identifiers with '_' in
> >>>function getinfo_xml().
> >>>
> >>>There were other smaller changes. You can find them in the
> >>>repository.
> >>>
> >>>I was not able to test it, hope it works :)
> >>>
> >>>Many thanks for the contribution.
> >>>
> >>>Cheers,
> >>>
> >>>Dejan
> >>>
> >>>>Regards,
> >>>>Kazuhiko Higashi
> >>>>
> >>>>On 2015/03/06 2:38, Dejan Muhamedagic wrote:
> >>>>>Hi,
> >>>>>
> >>>>>On Tue, Mar 03, 2015 at 05:13:49PM +0900, 東一彦 wrote:
> >>>>>>Dear Markus,
> >>>>>>
> >>>>>>I was also thinking the same thing.
> >>>>>>So, I've already created a new one.
> >>>>>
> >>>>>Perhaps you'd like to then contribute it upstream? Either to
> >>>>>glue stonith agents

Re: [ClusterLabs] Small bug in RA heartbeat/syslog-ng

2015-09-21 Thread Dejan Muhamedagic
Hi,

On Mon, Sep 21, 2015 at 09:01:07AM +0200, Ulrich Windl wrote:
> Hi!
> 
> Just a small notice: While having a look at the syslog-ng RA, I found this 
> bug (in SLES11 SP3, resource-agents-3.9.5-0.37.38.19):
> SYSLOG_NG_EXE="${OCF_RESKEY_syslog_ng_binary-/sbin/syslog-ng}" ### line 237 
> of /usr/lib/ocf/resource.d/heartbeat/syslog-ng
> 
> I tried it in bash, but if OCF_RESKEY_syslog_ng_binary is set but empty, the default 
> won't be substituted. It's because the correct syntax is:
> SYSLOG_NG_EXE="${OCF_RESKEY_syslog_ng_binary:-/sbin/syslog-ng}"

Yes. Interestingly, there's some code to handle that case (but
commented out):

# why not default to /sbin/syslog-ng?
#if [[ -z "$SYSLOG_NG_EXE" ]]; then
#   ocf_log err "Undefined parameter:syslog_ng_binary"
#   exit $OCF_ERR_CONFIGURED
#fi
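
For completeness, the difference between the two forms can be seen with a
quick shell check (illustrative only):

v=""                            # set but empty, as RA parameters often arrive
echo "${v-/sbin/syslog-ng}"     # prints an empty line: '-' keeps the empty value
echo "${v:-/sbin/syslog-ng}"    # prints /sbin/syslog-ng: ':-' also covers empty
unset v
echo "${v-/sbin/syslog-ng}"     # prints /sbin/syslog-ng: substituted when unset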

Thanks for the heads up!

Cheers,

Dejan

> Regards,
> Ulrich
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a lower case of hostlist.

2015-09-14 Thread Dejan Muhamedagic
Hi Hideo-san,

On Tue, Sep 08, 2015 at 05:28:05PM +0900, renayama19661...@ybb.ne.jp wrote:
> Hi All,
> 
> We intend to change some patches.
> We withdraw this patch.

I suppose that you'll send another one? I can vaguely recall
a problem with non-lowercase node names, but not the specifics.
Is that supposed to be handled within a stonith agent?
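
If it is, the usual trick is a simple tr conversion before any comparison,
something along these lines (the variable names here are only illustrative):

hostlist=$(echo "$hostlist" | tr '[:upper:]' '[:lower:]')
target=$(echo "$target" | tr '[:upper:]' '[:lower:]')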

Cheers,

Dejan

> Best Regards,
> Hideo Yamauchi.
> 
> 
> - Original Message -
> > From: "renayama19661...@ybb.ne.jp" 
> > To: ClusterLabs-ML 
> > Cc: 
> > Date: 2015/9/7, Mon 09:06
> > Subject: [ClusterLabs] [Patch][glue][external/libvirt] Conversion to a 
> > lower case of hostlist.
> > 
> > Hi All,
> > 
> > When the cluster carries out stonith, Pacemaker handles host names in
> > lower case.
> > When a user sets the OS host name and the host names in the hostlist of
> > external/libvirt in capital letters, stonith is not carried out.
> > 
> > Having external/libvirt convert the host names in hostlist to lower case
> > before comparing them would help catch this kind of configuration error
> > by the user.
> > 
> > Best Regards,
> > Hideo Yamauchi.
> > 
> > ___
> > Users mailing list: Users@clusterlabs.org
> > http://clusterlabs.org/mailman/listinfo/users
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> > 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] node dead timeout

2015-09-14 Thread Dejan Muhamedagic
Hi,

On Mon, Sep 14, 2015 at 09:15:59AM +0200, Nikola Ciprich wrote:
> Hello Andrew and all pacemaker users and developers,
> 
> I'd like to ask for advice - reading the docs, I'm still not sure - how
> can I set the timeout that determines when a node is considered dead (and fenced)?
> 
> Is it dc-deadtime ?

No, it's an event delivered by the underlying messaging layer. I
suppose that you're using corosync, in which case see about token
timeout in corosync.conf(5).
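
For example, in corosync.conf (the values below are examples only, not
recommendations):

totem {
    version: 2
    # how long (in ms) to wait for the token before a node is declared failed;
    # actual failure detection also includes the consensus timeout
    token: 10000
    consensus: 12000
}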

Thanks,

Dejan

> thanks a lot in advance!
> 
> BR
> 
> nik
> 
> -- 
> -
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
> 
> tel.:   +420 591 166 214
> fax:+420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
> 
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -



> ___
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] RES: Can't get a group of IP address up when moving to a new version of Pacemaker/Corosync

2015-09-04 Thread Dejan Muhamedagic
Hi,

On Wed, Sep 02, 2015 at 03:44:47PM -0300, Carlos Xavier wrote:
> Hi Kristoffer.
> 
> Thank you very much for the fast reply.
> 
> I did a cleanup of the resource, took a look at the log, and could see that
> the issue has something to do with the IPaddr2 RA trying to set some iptables
> rule, although we are not using any iptables rules.
> 
> crm(live)resource# cleanup c-ip-httpd
> Cleaning up ip_ccardbusiness:0 on apolo
> Cleaning up ip_ccardbusiness:0 on diana
> Cleaning up ip_ccardgift:0 on apolo
> Cleaning up ip_ccardgift:0 on diana
> Cleaning up ip_intranet:0 on apolo
> Cleaning up ip_intranet:0 on diana
> Cleaning up ip_ccardbusiness:1 on apolo
> Cleaning up ip_ccardbusiness:1 on diana
> Cleaning up ip_ccardgift:1 on apolo
> Cleaning up ip_ccardgift:1 on diana
> Cleaning up ip_intranet:1 on apolo
> Cleaning up ip_intranet:1 on diana
> Waiting for 12 replies from the CRMd OK
> 
> And on the log we have
> 
[...]
> 2015-09-02T14:40:54.184995-03:00 apolo IPaddr2(ip_ccardbusiness)[8360]: 
> ERROR: iptables failed
> 2015-09-02T14:40:54.187651-03:00 apolo lrmd[5182]:   notice: 
> operation_finished: ip_ccardbusiness_start_0:8360:stderr [ iptables: No 
> chain/target/match by that name. ]

The target used is CLUSTERIP. Do you have the iptables extensions
installed (you should)? The package is called xtables-plugins. It
should contain a library with CLUSTERIP support.
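
A quick way to check on the node (the package name and library path may
vary by distribution):

rpm -q xtables-plugins
ls /usr/lib*/xtables/libxt_CLUSTERIP.so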

[...]

> Is there a way to stop the IPaddr2 RA from trying to manage the iptables rule?

No, there isn't. Not a specialist on this, but so far nobody
complained about iptables use. The underlying problem should be
solved.

Since you're running a SLE product, you can also ask for help
from the SUSE support.

Thanks,

Dejan

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] [Linux-HA] file system resource becomes inaccesible when any of the node goes down

2015-07-06 Thread Dejan Muhamedagic
On Mon, Jul 06, 2015 at 03:14:34PM +0500, Muhammad Sharfuddin wrote:
 On 07/06/2015 02:50 PM, Dejan Muhamedagic wrote:
 Hi,
 
 On Sun, Jul 05, 2015 at 09:13:56PM +0500, Muhammad Sharfuddin wrote:
 SLES 11 SP 3 + online updates (pacemaker-1.1.11-0.8.11.70
 openais-1.1.4-5.22.1.7)
 
 It's a dual-primary DRBD cluster, which mounts a file system resource
 on both cluster nodes simultaneously (the file system type is ocfs2).
 
 Whenever any of the nodes goes down, the file system (/sharedata)
 becomes inaccessible for exactly 35 seconds on the other
 (surviving/online) node, and then becomes available again on the
 online node.
 
 Please help me understand why the node which survives or remains
 online is unable to access the file system resource (/sharedata) for 35
 seconds, and how I can fix the cluster so that the file system remains
 accessible on the surviving node without any interruption/delay (as
 in my case, about 35 seconds).
 
 By inaccessible, I meant to say that running ls -l /sharedata and
 df /sharedata does not return any output and does not return the
 prompt back on the online node for exactly 35 seconds once the other
 node becomes offline.
 
 E.g. node1 went offline somewhere around 01:37:15, and then the
 /sharedata file system was inaccessible between 01:37:35 and 01:38:18
 on the online node, i.e. node2.
 Before the failing node gets fenced you won't be able to use the
 ocfs2 filesystem. In this case, the fencing operation takes 40
 seconds:
 so it's expected.
 [...]
 Jul  5 01:37:35 node2 sbd: [6197]: info: Writing reset to node slot node1
 Jul  5 01:37:35 node2 sbd: [6197]: info: Messaging delay: 40
 Jul  5 01:38:15 node2 sbd: [6197]: info: reset successfully
 delivered to node1
 Jul  5 01:38:15 node2 sbd: [6196]: info: Message successfully delivered.
 [...]
 You may want to reduce that sbd timeout.
 OK, so would reducing the sbd timeout (or msgwait) provide
 uninterrupted access to the ocfs2 file system on the
 surviving/online node, or would it just minimize the downtime?

Only the latter. But note that it is important that once sbd
reports success, the target node is really down. sbd is
timeout-based, i.e. it doesn't test whether the node actually
left. Hence this timeout shouldn't be too short.
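
If you decide to change it, the current values can be inspected and
rewritten with the sbd tool, roughly like this (the device path is just an
example; -1 sets the watchdog timeout and -4 the msgwait timeout, both in
seconds, and re-creating the header wipes the old one, so do it with the
cluster stopped):

sbd -d /dev/disk/by-id/your-sbd-device dump
sbd -d /dev/disk/by-id/your-sbd-device -1 10 -4 20 create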

Thanks,

Dejan

 Thanks,
 
 Dejan
 ___
 Linux-HA mailing list is closing down.
 Please subscribe to users@clusterlabs.org instead.
 http://clusterlabs.org/mailman/listinfo/users
 ___
 linux...@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha
 
 --
 Regards,
 
 Muhammad Sharfuddin
 ___
 Linux-HA mailing list is closing down.
 Please subscribe to users@clusterlabs.org instead.
 http://clusterlabs.org/mailman/listinfo/users
 ___
 linux...@lists.linux-ha.org
 http://lists.linux-ha.org/mailman/listinfo/linux-ha

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org