Re: [ClusterLabs] nfsserver_monitor() doesn't detect nfsd process is lost.

2016-02-09 Thread Dejan Muhamedagic
Hi,

On Thu, Jan 28, 2016 at 04:42:55PM +0900, yuta takeshita wrote:
> Hi,
> Sorry for replying late.

No problem.

> 2016-01-15 21:19 GMT+09:00 Dejan Muhamedagic :
> 
> > Hi,
> >
> > On Fri, Jan 15, 2016 at 04:54:37PM +0900, yuta takeshita wrote:
> > > Hi,
> > >
> > > Thanks for responding and making a patch.
> > >
> > > 2016-01-14 19:16 GMT+09:00 Dejan Muhamedagic :
> > >
> > > > On Thu, Jan 14, 2016 at 11:04:09AM +0100, Dejan Muhamedagic wrote:
> > > > > Hi,
> > > > >
> > > > > On Thu, Jan 14, 2016 at 04:20:19PM +0900, yuta takeshita wrote:
> > > > > > Hello.
> > > > > >
> > > > > > I have been having a problem with the nfsserver RA on RHEL 7.1 and systemd.
> > > > > > When the nfsd process is lost due to an unexpected failure,
> > > > > > nfsserver_monitor() doesn't detect it and doesn't execute a failover.
> > > > > >
> > > > > > I use the RA below (but this problem may occur with the latest
> > > > > > nfsserver RA as well):
> > > > > >
> > > >
> > https://github.com/ClusterLabs/resource-agents/blob/v3.9.6/heartbeat/nfsserver
> > > > > >
> > > > > > The cause is the following.
> > > > > >
> > > > > > 1. After executing "pkill -9 nfsd", "systemctl status
> > > > > > nfs-server.service" returns 0.
> > > > >
> > > > > I think that it should be systemctl is-active. Already had a
> > > > > problem with systemctl status, well, not being what one would
> > > > > assume status would be. Can you please test that and then open
> > > > > either a pull request or issue at
> > > > > https://github.com/ClusterLabs/resource-agents
> > > >
> > > > I already made a pull request:
> > > >
> > > > https://github.com/ClusterLabs/resource-agents/pull/741
> > > >
> > > > Please test if you find time.
> > > >
> > > I tested the code, but the problem remains.
> > > systemctl is-active returns "active" and the exit code is 0, just like
> > > systemctl status.
> > > Perhaps it is inappropriate to use systemctl for monitoring a kernel
> > > process.
> >
> > OK. My patch was too naive and didn't take into account the
> > systemd/kernel intricacies.
> >
> > > Kay Sievers, one of the systemd developers, said that systemd doesn't
> > > monitor kernel processes; see:
> > > http://comments.gmane.org/gmane.comp.sysutils.systemd.devel/34367
> >
> > Thanks for the reference. One interesting thing could also be
> > reading /proc/fs/nfsd/threads instead of checking the process
> > existence. Furthermore, we could do some RPC based monitor, but
> > that would be, I guess, better suited for another monitor depth.
> >
> > OK. I surveyed and tested /proc/fs/nfsd/threads.
> It seems to work well on my cluster.
> I made a patch and opened a pull request:
> https://github.com/ClusterLabs/resource-agents/pull/746
> 
> Please check if you have time.

Some return codes of nfsserver_systemd_monitor() follow OCF and one
apparently LSB:

301     nfs_exec is-active
302     rc=$?
...
311         if [ $threads_num -gt 0 ]; then
312             return $OCF_SUCCESS
313         else
314             return 3
315         fi
316     else
317         return $OCF_ERR_GENERIC
...
321     return $rc

Given that nfs_exec() returns LSB codes, it should probably be
something like this:

311         if [ $threads_num -gt 0 ]; then
312             return 0
313         else
314             return 3
315         fi
316     else
317         return 1
...
321     return $rc

It won't make any actual difference, but the intent would be
cleaner (i.e. it's just by accident that the OCF and LSB codes
coincide in this case).
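
For illustration, the whole check could then look roughly like this (a
sketch only, not the exact code from the pull request; nfs_exec comes
from the RA itself):

nfsserver_systemd_monitor()
{
    # Stay with LSB codes throughout, since nfs_exec returns LSB.
    nfs_exec is-active
    rc=$?
    [ $rc -ne 0 ] && return $rc

    # systemd may still report nfs-server as active after the kernel
    # nfsd threads are gone, so check the thread count as well.
    threads_num=$(cat /proc/fs/nfsd/threads 2>/dev/null)
    if [ -n "$threads_num" ] && [ "$threads_num" -gt 0 ]; then
        return 0    # running
    else
        return 3    # not running
    fi
}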

Cheers,

Dejan

> Regards,
> Yuta
> 
> > Cheers,
> >
> > Dejan
> >
> > > I reply to your pull request.
> > >
> > > Regards,
> > > Yuta Takeshita
> > >
> > > >
> > > > Thanks for reporting!
> > > >
> > > > Dejan
> > > >
> > > > > Thanks,
> > > > >
> > > > > Dejan
> > > > >
> > > > > > 2. nfsserver_monitor() judges by the return value of "systemctl
> > > > > > status nfs-server.service".
> > > > > >
> > > > > >
> > --
> > > > > > # ps ax | grep nfsd
> > > > > > 25193 ?S< 0:00 [nfsd4]
> > > > > > 25194 ?S< 0:00 [nfsd4_callbacks]
> > > > > > 25197 ?S  0:00 [nfsd]
> > > > > > 25198 ?S  0:00 [nfsd]
> > > > > > 25199 ?S  0:00 [nfsd]
> > > > > > 25200 ?S  0:00 [nfsd]
> > > > > > 25201 ?S  0:00 [nfsd]
> > > > > > 25202 ?S  0:00 [nfsd]
> > > > > > 25203 ?S  0:00 [nfsd]
> > > > > > 25204 ?S  0:00 [nfsd]
> > > > > > 25238 pts/0S+ 0:00 grep --color=auto nfsd
> > > > > > #
> > > > > > # pkill -9 nfsd
> > > > > > #
> > > > > > # systemctl status nfs-server.service
> > > > > > ● nfs-server.service - NFS server and services
> > > > > >Loaded: loaded (/etc/systemd/system/nfs-server.service;
> > disabled;
> > > > vendor
> > > > > > preset: disabled)
> > > > > >Active: active 

Re: [ClusterLabs] Cluster resources migration from CMAN to Pacemaker

2016-02-09 Thread jaspal singla
Hi Jan/Digiman,

Thanks for your replies. Based on your inputs, I managed to configure these
values and the results were fine, but I still have some doubts for which I
would like your help. I also tried to dig into some of the issues on the
internet, but due to the lack of cman -> pacemaker documentation I couldn't
find anything.

I have configured 8 scripts under one resource group as you recommended,
but 2 of those scripts are not being executed by the cluster itself. When I
execute the same scripts manually, they work, but through Pacemaker they
don't.

For example:

This is the output of crm_mon command:

###
Last updated: Mon Feb  8 17:30:57 2016  Last change: Mon Feb  8
17:03:29 2016 by hacluster via crmd on ha1-103.cisco.com
Stack: corosync
Current DC: ha1-103.cisco.com (version 1.1.13-10.el7-44eb2dd) - partition
with quorum
1 node and 10 resources configured

Online: [ ha1-103.cisco.com ]

 Resource Group: ctm_service
 FSCheck
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FsCheckAgent.py):
 Started ha1-103.cisco.com
 NTW_IF
(lsb:../../..//cisco/PrimeOpticalServer/HA/bin/NtwIFAgent.py):  Started
ha1-103.cisco.com
 CTM_RSYNC
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/RsyncAgent.py):  Started
ha1-103.cisco.com
 REPL_IF
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_IFAgent.py): Started
ha1-103.cisco.com
 ORACLE_REPLICATOR
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ODG_ReplicatorAgent.py):
Started ha1-103.cisco.com
 CTM_SID
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/OracleAgent.py): Started
ha1-103.cisco.com
 CTM_SRV
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/CtmAgent.py):Stopped
 CTM_APACHE
(lsb:../../..//cisco/PrimeOpticalServer/HA/bin/ApacheAgent.py): Stopped
 Resource Group: ctm_heartbeat
 CTM_HEARTBEAT
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/HeartBeat.py):   Started
ha1-103.cisco.com
 Resource Group: ctm_monitoring
 FLASHBACK
 (lsb:../../..//cisco/PrimeOpticalServer/HA/bin/FlashBackMonitor.py):
 Started ha1-103.cisco.com

Failed Actions:
* CTM_SRV_start_0 on ha1-103.cisco.com 'unknown error' (1): call=577,
status=complete, exitreason='none',
last-rc-change='Mon Feb  8 17:12:33 2016', queued=0ms, exec=74ms

#


CTM_SRV and CTM_APACHE are in the stopped state. These services are either
not being executed by the cluster, or the cluster is somehow failing them;
I'm not sure why. When I manually execute the CTM_SRV script, it runs
without issues.
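
For reference, since these are lsb: resources, Pacemaker only looks at the
scripts' exit codes. A quick check along these lines (a rough sketch; the
path is taken from the crm_mon output above) shows whether a script behaves
the way an LSB init script is expected to:

#!/bin/sh
# Sketch: exercise an init-style script the way Pacemaker does and
# print the exit code after each action.
SCRIPT=/cisco/PrimeOpticalServer/HA/bin/CtmAgent.py

for action in status start status stop status; do
    "$SCRIPT" "$action"
    echo "$action -> exit code $?"
done
# LSB expects: start/stop return 0 on success; status returns 0 when
# running and 3 when stopped. Anything else makes Pacemaker treat the
# resource as failed.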

-> For manual execution of this script I ran the command below:

# /cisco/PrimeOpticalServer/HA/bin/OracleAgent.py status

Output:

_
2016-02-08 17:48:41,888 INFO MainThread CtmAgent
=
Executing preliminary checks...
 Check Oracle and Listener availability
  => Oracle and listener are up.
 Migration check
  => Migration check completed successfully.
 Check the status of the DB archivelog
  => DB archivelog check completed successfully.
 Check of Oracle scheduler...
  => Check of Oracle scheduler completed successfully
 Initializing database tables
  => Database tables initialized successfully.
 Install in cache the store procedure
  => Installing store procedures completed successfully
 Gather the oracle system stats
  => Oracle stats completed successfully
Preliminary checks completed.
=
Starting base services...
Starting Zookeeper...
JMX enabled by default
Using config: /opt/CiscoTransportManagerServer/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
 Retrieving name service port...
 Starting name service...
Base services started.
=
Starting Prime Optical services...
Prime Optical services started.
=
Cisco Prime Optical Server Version: 10.5.0.0.214 / Oracle Embedded
-
  USER   PID  %CPU  %MEM START  TIME   PROCESS
-
  root 16282   0.0   0.0  17:48:11  0:00   CTM Server
  root 16308   0.0   0.1  17:48:16  0:00   CTM Server
  root 16172   0.1   0.1  17:48:10  0:00   NameService
  root 16701  24.8   7.5  17:48:27  0:27   TOMCAT
  root 16104   0.2   0.2  17:48:09  0:00   Zookeeper
-
For startup details, see:

Re: [ClusterLabs] [ha-devel] Hawk release 2.0

2016-02-09 Thread Zhu Lingshan

congratulations and thanks!

On 2016/2/8 23:30, Kristoffer Grönlund wrote:

Hello everyone!

It is my great pleasure to announce that Hawk 2.0.0 is released! Yes,
technically the previous actual release was 0.7.2, but for various
reasons I decided to bump the version number all the way up to 2.

One of the reasons for doing so is the huge amount of changes that
have gone into this version of Hawk. Not only does it look completely
different, on the backend of things, everything has changed as well.

First of all, Hawk now has a website! Visit http://hawk-ui.github.io/
and check out the new logo designed by Manuele Carlini.

I have also started working on a User Guide for Hawk, here:
http://hawk-guide.readthedocs.org/ . It's still early days for the
Guide and it needs more work to be truly useful, but already it has
one thing going for it: It's a cluster usage guide which doesn't
ignore fencing. I know some of you will like that, at least.

## New Features

* Redesigned Frontend

   The Hawk frontend has been modernised, and now uses
   Bootstrap 3. The layout and organization of the user
   interface has been rethought with usability in mind.

* Updated Backend

   Hawk 2 is based on Ruby on Rails 4.2 running on the Puma
   web server. By using Puma, we can make Hawk as
   unintrusive as possible on the cluster nodes without
   compromising performance. In fact, thanks to the use of
   asset precompilation Hawk 2 should perform better than
   the Hawk 1 interface despite the updated visual style.

* Wizards

   In Hawk 1, the wizards were implemented as a custom
   solution. For Hawk 2, the wizards have been moved into
   the crm shell, making them available from the command
   line as well. In addition to this move, the wizards have
   been greatly improved. They now feature optional steps
   and multi-step configuration (for example, in case
   resources in an earlier step need to be started before
   configuring the next set of resources). Wizards are also
   able to perform complex actions like installing and
   configuring necessary software packages (see the short
   crmsh example after this feature list).

* Integrated Dashboard

   The multi-cluster dashboard has been integrated into the
   main interface. Now you can monitor multiple clusters
   directly from the regular Hawk UI.

* New Pacemaker Features

   We support many of the features that have been added
   recently to Pacemaker:
   - Location constraints can apply to several resources
 at once.
   - Tags
   - Remote nodes are shown separate from regular nodes
 in the Dashboard

* Configuration view and Command Log

   To make the transition between command line usage and
   the web interface easier, we've added the ability to
   view the current cluster configuration in the command
   line format, complete with syntax highlighting. Also,
   the command log provides a list of recent commands
   executed on the cluster from the web interface. This can
   serve as a basic audit log, as well as helping new users
   learn the command line interface directly by performing
   operations on the cluster.

* History Explorer and Simulator

   Together with the general improvements to the interface,
   the History Explorer has been redesigned to be easier to
   use and more powerful. Now you can see more details for
   each transition, as well as easily navigate forward and
   backward in time through a report. The History Explorer
   now also shows a summary of important events directly
   when opening the report, to make it easy to find the
   relevant events in the log. The report generation,
   download and upload functions are all now accessible
   from a single location.

   Similarly, the Simulator has been updated to not only be
   prettier (if you ever used the old version, you'll know
   what I mean) but also easier to use.
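
As mentioned under "Wizards" above, the wizards are now ordinary crm
shell cluster scripts, so they can also be driven from the command
line. A rough sketch (wizard names and parameters depend on the
installed crmsh version):

   # list the available wizards
   crm script list

   # inspect what a wizard would do, then run it
   crm script show <wizard-name>
   crm script run <wizard-name> <param>=<value>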

## Downloading

Source downloads for the release are available here:

 https://github.com/ClusterLabs/hawk/releases/tag/hawk-2.0.0

openSUSE Tumbleweed has a version of Hawk 2 (package name hawk2) which
is very close to the actual release version, and the release will be
there soon.

Another way to try Hawk is to use the Vagrant configuration which
comes with the User Guide. It configures a 1 - 3 node cluster with
Hawk already installed and running.

Thank you!

Credit for the release goes to Tim Serong (for the original Hawk
version), Thomas Boerger, Manuele Carlini and Thomas Hutterer. And
also me.

Cheers,
Kristoffer




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Working with 2 VIPs

2016-02-09 Thread Ken Gaillot
On 02/08/2016 04:24 AM, Louis Chanouha wrote:
> Hello,
> I'm not sure if this mailing list is the proper place to send my request;
> please tell me where I should send it if not :)

This is the right place :)

> I have a use case that I can't actually run with corosync + pacemaker.
> 
> I have two nodes, two VIPs and two services (one duplicated), in order to
> provide an active/active service (2 physical sites).

By "2 physical sites", do you mean 2 physical machines on the same LAN,
or 2 geographically separate locations?

> In a normal situation, one VIP is associated with one node via a preferred
> location, and the service is running on both nodes (cloned).
> 
> In a failure situation, I want the working node to take over the IP of the
> other host without migrating the service (it listens on 0.0.0.0). Currently:
>   - when the service is down - not working
>   - when the node is down (network or OS layer) - working
> 
> I can't find the proper way to conceptualize this problem with the
> group/colocation/order notions of pacemaker. I would be happy if you could
> give me some thoughts on appropriate options.

I believe your current configuration already does that. What problems
are you seeing?

> 
> Thank you in advance for your help.
> Sorry for my non-native English.
> 
> Louis Chanouha
> 
> **
> 
> My current configuration is this one. I can't translate it in XML if you need 
> it.
> 
> node Gollum

This will likely cause a log warning that "Node names with capitals are
discouraged". It's one of those things that shouldn't matter, but better
safe than sorry ...

> node edison
> primitive cups lsb:cups \
>     op monitor interval="2s"
> primitive vip_edison ocf:heartbeat:IPaddr2 \
>     params nic="eth0" ip="10.1.9.18" cidr_netmask="24" \
>     op monitor interval="2s"
> primitive vip_gollum ocf:heartbeat:IPaddr2 \
>     params nic="eth0" ip="10.1.9.23" cidr_netmask="24" \
>     op monitor interval="2s"
> clone ha-cups cups
> location pref_edison vip_edison 50: edison
> location pref_gollum vip_gollum 50: Gollum
> property $id="cib-bootstrap-options" \
>     dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>     cluster-infrastructure="openais" \
>     expected-quorum-votes="2" \
>     stonith-enabled="false" \

Without stonith, the cluster will be unable to recover from certain
types of failures (for example, network failures). If both nodes are up
but can't talk to each other ("split brain"), they will both bring up
both IP addresses.

>     no-quorum-policy="ignore"

If you can use corosync 2, you can set "two_node: 1" in corosync.conf,
and then you wouldn't need this line.
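
For reference, the corresponding corosync.conf part would look roughly like
this (a sketch; merge it with your existing quorum section):

quorum {
    provider: corosync_votequorum
    # two_node: 1 also enables wait_for_all, so the surviving node keeps
    # quorum when its peer fails, but both nodes must be seen at least
    # once at cluster startup.
    two_node: 1
}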

> -- 
> 
> *Louis Chanouha | Ingénieur Système et Réseaux*
> Service Numérique de l'Université de Toulouse
> *Université Fédérale Toulouse Midi-Pyrénées*
> 15 rue des Lois - BP 61321 - 31013 Toulouse Cedex 6
> Tél. : +33(0)5 61 10 80 45  / poste int. : 18045
> 
> louis.chano...@univ-toulouse.fr
> Facebook | Twitter | www.univ-toulouse.fr


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] crmsh configure delete for constraints

2016-02-09 Thread Vladislav Bogdanov
Dejan Muhamedagic  wrote:
>Hi,
>
>On Tue, Feb 09, 2016 at 05:15:15PM +0300, Vladislav Bogdanov wrote:
>> 09.02.2016 16:31, Kristoffer Grönlund wrote:
>> >Vladislav Bogdanov  writes:
>> >
>> >>Hi,
>> >>
>> >>when performing a delete operation, crmsh (2.2.0) with -F tries
>> >>to stop the resources passed as arguments and then waits for the DC to become idle.
>> >>
>> >
>> >Hi again,
>> >
>> >I have pushed a fix that only waits for the DC if any resources were
>> >actually stopped:
>> >https://github.com/ClusterLabs/crmsh/commit/164aa48
>> 
>> Great!
>> 
>> >
>> >>
>> >>Moreover, it may be worth checking the stop-orphan-resources property
>> >>and passing the stop work to pacemaker if it is set to true.
>> >
>> >I am a bit concerned that this might not be 100% reliable. I found an
>> >older discussion regarding this and the recommendation from David Vossel
>> >then was to always make sure resources were stopped before removing
>> >them, and not relying on stop-orphan-resources to clean things up
>> >correctly. His example of when this might not work well is when removing
>> >a group, as the group members might get stopped out-of-order.
>> 
>> OK, I agree. That was just an idea.
>> 
>> >
>> >At the same time, I have thought before that the current functionality
>> >is not great. Having to stop resources before removing them is if
>> >nothing else annoying! I have a tentative change proposal to this where
>> >crmsh would stop the resources even if --force is not set, and there
>> >would be a flag to pass to stop to get it to ignore whether resources
>> >are running, since that may be useful if the resource is misconfigured
>> >and the stop action doesn't work.
>> 
>> That should result in fencing, no? I think that is an RA issue if that
>> happens.
>
>Right. Unfortunately, this case often gets too little attention;
>people typically test with good and working configurations only.
>The first time we hear about it is from some annoyed user whose
>node got fenced for no good reason. Even worse, with some bad
>configurations, it can happen that the nodes get fenced in a
>round-robin fashion, which certainly won't make your time very
>productive.
>
>> Particularly, imho RAs should not run validate_all on stop
>> action.
>
>I'd disagree here. If the environment is no good (bad
>installation, missing configuration and similar), then the stop
>operation probably won't do much good. Ultimately, it may depend
>on how the resource is managed. In ocf-rarun, validate_all is
>run, but then the operation is not carried out if the environment
>is invalid. In particular, the resource is considered to be
>stopped, and the stop operation exits with success. One of the
>most common cases is when the software resides on shared
>non-parallel storage.

Well, I'd reword. Generally, an RA should not exit with an error if validation
fails on stop.
Is that better?
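
In the stop path, that would mean something like this (a minimal sketch with
hypothetical helpers myapp_validate_all and myapp_stop_daemon; ocf-rarun
handles this pattern more generally):

myapp_stop()
{
    # If the environment is invalid (software not installed, shared
    # storage not mounted, ...), nothing we manage can be running:
    # report the resource as stopped instead of failing the stop,
    # which would otherwise lead to fencing.
    if ! myapp_validate_all; then
        return $OCF_SUCCESS
    fi

    myapp_stop_daemon || return $OCF_ERR_GENERIC
    return $OCF_SUCCESS
}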

>
>BTW, handling the stop and monitor/probe operations was the
>primary motivation to develop ocf-rarun. It's often quite
>difficult to get these things right.
>
>Cheers,
>
>Dejan
>
>
>> Best,
>> Vladislav
>> 
>> 



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org