Re: [ClusterLabs] multiple action= lines sent to STDIN of fencing agents - why?

2015-10-19 Thread Marek marx Grác
Hi,


On 15 October 2015 at 16:45:58, Jan Pokorný (jpoko...@redhat.com) wrote:
> On 15/10/15 12:25 +0100, Adam Spiers wrote:
> > I inserted some debugging into fencing.py and found that stonithd
> > sends stuff like this to STDIN of the fencing agents it forks:
> >
> > action=list
> > param1=value1
> > param2=value2
> > param3=value3
> > action=list
> >
> > where paramX and valueX come from the configuration of the primitive
> > for the fencing agent.
> >
> > As a corollary, if the primitive for the fencing agent has 'action'
> > defined as one of its parameters, this means that there will be three
> > 'action=' lines, and the middle one could have a different value to
> > the two sandwiching it.
> >
> > When I first saw this, I had an extended #wtf moment and thought it
> > was a bug. But on closer inspection, it seems very deliberate, e.g.
> >
> > https://github.com/ClusterLabs/pacemaker/commit/bfd620645f151b71fafafa279969e9d8bd0fd74f
> >   
> >
> > The "regardless of what the admin configured" comment suggests to me
> > that there is an underlying assumption that any fencing agent will
> > ensure that if the same parameter is duplicated on STDIN, the final
> > value will override any previous ones. And indeed fencing.py ensures
> > this, but presumably it is possible to write agents which don't use
> > fencing.py.
> >
> > Is my understanding correct? If so:
> >
> > 1) Is the first 'action=' line intended in order to set some kind of
> > default action, in the case that the admin didn't configure the
> > primitive with an 'action=' parameter *and* _action wasn't one of
> > list/status/monitor/metadata? In what circumstances would this
> > happen?

Fence agents do not use this mechanism for adding attribute values themselves, 
so if multiple 'action' values occur, they were entered by the user manually. 
By default, if no action is defined, it is 'reboot' (or 'off').

In the case of Pacemaker, the action is added according to the 
pcmk_*_action settings. 
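
For illustration (the device address, credentials and agent below are made up), 
this is roughly what an agent effectively receives on STDIN when a user 
configured action=off on the primitive but pacemaker requests a monitor; with 
fencing.py the last action= line wins:

   # hypothetical device/credentials; the final action= line overrides the first
   printf 'action=off\nipaddr=10.0.0.10\nlogin=admin\npasswd=secret\naction=monitor\n' \
       | fence_ipmilan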


> > 2) Is this assumption of the agents always being order-sensitive
> > (i.e. last value always wins) documented anywhere? The best
> > documentation on the API I could find was here:
> >
> > https://fedorahosted.org/cluster/wiki/FenceAgentAPI
> >
> > but it doesn't mention this.
>  
> Agreed that this implicit and undocumented arrangement smells fishy.

This info has now been added to the wiki page.


> In fact I raised this point within RH team 3 month back but there
> wasn't any feedback at that time. Haven't filed a bug because it's
> unclear to me what's intentional and what's not between the two
> interacting sides. Back then, I was proposing that fencing.py should,
> at least, emit a message about subsequent parameter overrides.

Emitting a message when such a situation occurs is a bit tricky but doable. 
The problem is that we use 'default' values for some attributes, and later in 
the program we can't distinguish who entered the value. So in some cases, the 
warning would not be emitted.

m,



[ClusterLabs] Fencing questions.

2015-10-19 Thread Arjun Pandey
Hi

I am running a 2-node cluster with this config on CentOS 6.5/6.6, where I
have a multi-state resource foo running in master/slave mode and a bunch of
floating IP addresses configured. Additionally, I have a colocation constraint
for the IP addresses to be colocated with the master.

Please find the following files attached
cluster.conf
CIB

Issues that I have:
1. Daemons required for fencing
Earlier we were invoking 'cman start quorum' from the pacemaker init script,
which ensured that fenced/gfs and other daemons are not started. This was OK
since fencing wasn't being handled earlier.

For fencing purposes, do we only need fenced to be started? We don't have any
GFS partitions that we want to monitor via pacemaker. My concern here is that
if I use the unmodified script, the pacemaker start time increases
significantly. I see a difference of 60 seconds from the earlier startup
before 'service pacemaker status' shows up as started.

2. Fencing test cases
Based on the internet queries I could find, apart from pulling out the
dedicated cable, the only other case suggested is killing the corosync process
on one of the nodes.
Are there any other basic cases that I should look at?
What about bringing an interface down manually? I understand that this is an
unlikely scenario, but I am just looking for more ways to test this out.

3. Testing whether fencing is working or not
Previously I have been using fence_ilo4 from the shell to test whether the
command is working. I was assuming that a similar invocation would be done by
stonith when actual fencing needs to be done.

However, based on other threads I could find, people also use fence_tool to
try this out. As I understand it, this tests whether fencing succeeds when it
is invoked by fenced for a particular node. Is that valid?

Since we are configuring fence_pcmk as the fence device, the flow of things
is:
fenced -> fence_pcmk -> stonith -> fence agent.

4. Fencing agent to be used (fence_ipmilan vs fence_ilo4)
Also, for iLO fencing I see both fence_ilo4 and fence_ipmilan available. I had
been using fence_ilo4 till now.

I think this mail has multiple questions, and I will probably send out
another mail for a few issues I see after fencing takes place.

Thanks in advance
Arjun


cib
Description: Binary data


cluster.conf
Description: Binary data


[ClusterLabs] Is STONITH resource location special?

2015-10-19 Thread Ferenc Wagner
Hi,

Pacemaker Explained discusses in 13.3 the special treatment of STONITH
resources.  Now, I configured 6 fence_ipmilan instances in a cluster
which runs on 4 nodes currently.  No STONITH resource started on the
node it can kill, even though I haven't configured location constraints
yet.  Is this clever placement accidental, or is it another aspect of
the special treatment of STONITH resources?  Should I configure the
usual location constraints, or are they not needed under 1.1.13?

Also, I haven't configured pcmk_host_check, so it should default to
dynamic-list, but the pcmk_host_list setting is still effective, according to
stonith_admin.  I don't mind this, but it seems to contradict the
documentation.  Which is right?
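
For reference, a sketch of the kind of stonith_admin queries I mean (the node
name is just illustrative):

   # devices stonithd believes can fence a given node
   stonith_admin -l node1
   # all registered fence devices
   stonith_admin -L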
-- 
Thanks,
Feri.



Re: [ClusterLabs] Fencing questions.

2015-10-19 Thread Digimer
On 19/10/15 06:53 AM, Arjun Pandey wrote:
> Hi
> 
> I am running a 2 node cluster with this config on centos 6.5/6.6  where

It's important to keep both nodes on the same minor version,
particularly in this case. Please either upgrade the CentOS 6.5 node to 6.6,
or upgrade both to 6.7.

> i have a multi-state resource foo being run in master/slave mode and  a
> bunch of floating IP addresses configured. Additionally i have
> a collocation constraint for the IP addr to be collocated with the master.
> 
> Please find the following files attached 
> cluster.conf
> CIB

It's preferable on a mailing list to copy the text into the body of the
message. Easier to read.

> Issues that i have :-
> 1. Daemons required for fencing
> Earlier we were invoking cman start quorum from pacemaker script which
> ensured that fenced / gfs and other daemons are not started. This was ok
> since fencing wasn't being handled earlier.

The cman fencing is simply a pass-through to pacemaker. When pacemaker
tells cman that fencing succeeded, it informs DLM and begins cleanup.

> For fencing purpose do we only need the fenced to be started ?  We don't
> have any gfs partitions that we want to monitor via pacemaker. My
> concern here is that if i use the unmodified script then pacemaker start
> time increases significantly. I see a difference of 60 sec from the
> earlier startup before service pacemaker status shows up as started.

Don't start fenced manually; just start pacemaker and let it handle
everything. Ideally, use the pcs command (and the pcsd daemon on the nodes)
to start/stop the cluster, but you'll need to upgrade to 6.7.
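
For example (a sketch, assuming pcs/pcsd from 6.7 are set up):

   # start the whole stack (cman + pacemaker) on every node
   pcs cluster start --all

   # stop it again
   pcs cluster stop --all

   # check where things are at
   pcs status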

> 2. Fencing test cases.
>  Based on the internet queries i could find , apart from plugging out
> the dedicated cable. The only other case suggested is killing corosync
> process on one of the nodes.
> Are there any other basic cases that i should look at ?
> What about bring up interface down manually ? I understand that this is
> an unlikely scenario but i am just looking for more ways to test this out.

echo c > /proc/sysrq-trigger == kernel panic. It's my preferred test.
Also, killing the power to the node will cause IPMI to fail and will
test your backup fence method, if you have it, or ensure the cluster
locks up if you don't (better to hang than to risk corruption).
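
A few concrete triggers I use (the interface name below is just an example):

   # hard-crash the kernel on the victim node
   echo c > /proc/sysrq-trigger

   # kill corosync abruptly so the peer declares the node lost
   killall -9 corosync

   # take the cluster interconnect down
   ip link set eth1 down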

> 3. Testing whether fencing is working or not.
> Previously i have been using fence_ilo4 from the shell to test whether
> the command is working. I was assuming that similar invocation would be
> done by stonith when actual fencing needs to be done. 
> 
> However based on other threads i could find people also use fence_tool
>  to try this out. According to me this tests out whether
> fencing when invoked by fenced for a particular node succeeds or not. Is
> that valid ? 

fence_tool is just a command to control the cluster's fencing. The
fence_X agents do the actual work.
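
To test from the shell, you can poke the agent directly and then exercise the
whole stack via stonithd; a sketch (address, credentials and node name are
placeholders):

   # talk to the iLO directly with the agent, bypassing the cluster
   fence_ilo4 -a 192.168.1.50 -l admin -p secret -o status

   # ask pacemaker's stonithd to really fence a node
   stonith_admin --reboot node2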

> Since we are configuring fence_pcmk as the fence device the flow of
> things is 
> fenced -> fence_pcmk -> stonith -> fence agent.

Basically correct.

> 4. Fencing agent to be used (fence_ipmilan vs fence_ilo4)
> Also for ILO fencing i see fence_ilo4 and fence_ipmilan both available.
> I had been using fence_ilo4 till now. 

Whichever works is fine. I believe a lot of the fence_X out-of-band
agents are actually just links to fence_ipmilan, but I might be wrong.

> I think this mail has multiple questions and i will probably send out
> another mail for a few issues i see after fencing takes place. 
> 
> Thanks in advance
> Arjun
> 
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Fencing questions.

2015-10-19 Thread Arjun Pandey
Hi  Digimer

Please find my responses inline.

On Mon, Oct 19, 2015 at 8:21 PM, Digimer  wrote:

> On 19/10/15 06:53 AM, Arjun Pandey wrote:
> > Hi
> >
> > I am running a 2 node cluster with this config on centos 6.5/6.6  where
>
> It's important to keep both nodes on the same minor version,
> particularly in this case. Please either upgrade centos 6.5 to 6.6 or
> both to 6.7.
>
[Arjun]
My bad. Both nodes are on CentOS 6.6 now. We used to support this on 6.5
earlier.


> > i have a multi-state resource foo being run in master/slave mode and  a
> > bunch of floating IP addresses configured. Additionally i have
> > a collocation constraint for the IP addr to be collocated with the
> master.
> >
> > Please find the following files attached
> > cluster.conf
> > CIB
>
> It's preferable on a mailing list to copy the text into the body of the
> message. Easier to read.
[Arjun] Adding now:

cluster.conf:

  [XML contents not preserved in the archive]

CIB:

  [XML contents not preserved in the archive]

> Issues that i have :-
> > 1. Daemons required for fencing
> > Earlier we were invoking cman start quorum from pacemaker script which
> > ensured that fenced / gfs and other daemons are not started. This was ok
> > since fencing wasn't being handled earlier.
>
> The cman fencing is simply a pass-through to pacemaker. When pacemaker
> tells cman that fencing succeeded, it inform DLM and begins cleanup.
>
> > For fencing purpose do we only need the fenced to be started ?  We don't
> > have any gfs partitions that we want to monitor via pacemaker. My
> > concern here is that if i use the unmodified script then pacemaker start
> > time increases significantly. I see a difference of 60 sec from the
> > earlier startup before service pacemaker status shows up as started.
>
> Don't start fenced manually, just start pacemaker and let it handle
> everything. Ideally, use the pcs command (and pcsd daemon on the nodes)
> to start/stop the cluster, but you'll need to upgrade to 6.7.

[Arjun]
I am not starting fenced manually; it gets started from the pacemaker setup
itself. However, if I look at the cman init script's start routine, there is a
case where, if one calls "cman start quorum", fenced/gfs and the other daemons
are not started.
Source code link:
https://git.fedorahosted.org/cgit/cluster.git/tree/cman/init.d/cman.in?h=RHEL6#n795

This is what gets called from our pacemaker init script.

I was wondering whether I should add a new case in this script, because we
don't really use gfs/dlm. This is leading to a substantial increase in
pacemaker startup time.

> 2. Fencing test cases.
> >  Based on the internet queries i could find , apart from plugging out
> > the dedicated cable. The only other case suggested is killing corosync
> > process on one of the nodes.
> >

Re: [ClusterLabs] Corosync & pacemaker quit between boot and login

2015-10-19 Thread Matthias Ferdinand
On Fri, Oct 16, 2015 at 06:25:29PM +0200, users-requ...@clusterlabs.org wrote:
> From: Russell Martin 
> Subject: [ClusterLabs] Corosync & pacemaker quit between boot and login
> 
> ...
> Both corosync and pacemaker seem to start fine during boot (they both say 
> "[OK]")
> However, once logged in, neither daemon is running.

Hi,

since you are using a Desktop install, could it be that NetworkManager
interferes with your network configuration?

Otherwise you would have to look through /var/log/corosync/corosync.log
for a hint.

Regards
Matthias



[ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on

2015-10-19 Thread Ken Gaillot
Pacemaker supports fencing "topologies", allowing multiple fencing
devices to be used (in conjunction or as fallbacks) when a node needs to
be fenced.

However, there is a catch when using something like redundant power
supplies. If you put two power switches in the same topology level, and
Pacemaker needs to reboot the node, it will reboot the first power
switch and then the second -- which has no effect since the supplies are
redundant.

Pacemaker's upstream master branch has new handling that will be part of
the eventual 1.1.14 release. In such a case, it will turn all the
devices off, then turn them all back on again.

With previous versions, there was a complicated configuration workaround
involving creating separate devices for the off and on actions. With the
new version, it happens automatically, and no special configuration is
needed.

Here's an example where node1 is the affected node, and apc1 and apc2
are the fence devices:

   pcs stonith level add 1 node1 apc1,apc2

Of course you can configure it using crm or XML as well.
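
In raw XML, the equivalent topology level would look roughly like this (the id
is whatever your tool generates):

   <fencing-topology>
 <fencing-level id="fl-node1-1" target="node1" index="1" devices="apc1,apc2"/>
   </fencing-topology>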

The fencing operation will be treated as successful as long as the "off"
commands succeed, because then it is safe for the cluster to recover any
resources that were on the node. Timeouts and errors in the "on" phase
will be logged but ignored.

Any action-specific timeout for the remapped action will be used (for
example, pcmk_off_timeout will be used when executing the "off" command,
not pcmk_reboot_timeout).

The new code knows to skip the "on" step if the fence agent has
automatic unfencing (because it will happen when the node rejoins the
cluster). This allows fence_scsi to work with this feature.
-- 
Ken Gaillot 



Re: [ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on

2015-10-19 Thread Digimer
On 19/10/15 12:34 PM, Ken Gaillot wrote:
> Pacemaker supports fencing "topologies", allowing multiple fencing
> devices to be used (in conjunction or as fallbacks) when a node needs to
> be fenced.
> 
> However, there is a catch when using something like redundant power
> supplies. If you put two power switches in the same topology level, and
> Pacemaker needs to reboot the node, it will reboot the first power
> switch and then the second -- which has no effect since the supplies are
> redundant.
> 
> Pacemaker's upstream master branch has new handling that will be part of
> the eventual 1.1.14 release. In such a case, it will turn all the
> devices off, then turn them all back on again.

How long will it stay in the 'off' state? Is it configurable? I
ask because if it's too short, some PSUs may not actually lose power.
One or two seconds should be way more than enough though.

> With previous versions, there was a complicated configuration workaround
> involving creating separate devices for the off and on actions. With the
> new version, it happens automatically, and no special configuration is
> needed.
> 
> Here's an example where node1 is the affected node, and apc1 and apc2
> are the fence devices:
> 
>pcs stonith level add 1 node1 apc1,apc2

Where would the outlet definition go? 'apc1:4,apc2:4'?

> Of course you can configure it using crm or XML as well.
> 
> The fencing operation will be treated as successful as long as the "off"
> commands succeed, because then it is safe for the cluster to recover any
> resources that were on the node. Timeouts and errors in the "on" phase
> will be logged but ignored.
> 
> Any action-specific timeout for the remapped action will be used (for
> example, pcmk_off_timeout will be used when executing the "off" command,
> not pcmk_reboot_timeout).

I think this answers my question about how long it stays off for. What
would be an example config to control the off time then?

> The new code knows to skip the "on" step if the fence agent has
> automatic unfencing (because it will happen when the node rejoins the
> cluster). This allows fence_scsi to work with this feature.

http://i.imgur.com/i7BzivK.png

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



[ClusterLabs] Automatic IPC buffer adjustment in Pacemaker 1.1.11?

2015-10-19 Thread Ferenc Wagner
Hi,

The http://www.ultrabug.fr/tuning-pacemaker-for-large-clusters/ blog
post states that "Pacemaker v1.1.11 should come with a feature which
will allow the IPC layer to adjust the PCMK_ipc_buffer automagically".
However, I failed to identify this in the ChangeLog.  Did it really
happen, can I drop my PCMK_ipc_buffer settings, and if yes, do I need to
switch this feature on somehow?
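
(For context, this is roughly how I set it today; the location and value are
just what I happen to use, not a recommendation:)

   # /etc/sysconfig/pacemaker (or /etc/default/pacemaker on Debian)
   PCMK_ipc_buffer=1310720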
-- 
Thanks,
Feri.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Coming in 1.1.14: remapping sequential reboots to all-off-then-all-on

2015-10-19 Thread Ken Gaillot
On 10/19/2015 11:42 AM, Digimer wrote:
> On 19/10/15 12:34 PM, Ken Gaillot wrote:
>> Pacemaker supports fencing "topologies", allowing multiple fencing
>> devices to be used (in conjunction or as fallbacks) when a node needs to
>> be fenced.
>>
>> However, there is a catch when using something like redundant power
>> supplies. If you put two power switches in the same topology level, and
>> Pacemaker needs to reboot the node, it will reboot the first power
>> switch and then the second -- which has no effect since the supplies are
>> redundant.
>>
>> Pacemaker's upstream master branch has new handling that will be part of
>> the eventual 1.1.14 release. In such a case, it will turn all the
>> devices off, then turn them all back on again.
> 
> How long will it leave stay in the 'off' state? Is it configurable? I
> ask because if it's too short, some PSUs may not actually lose power.
> One or two seconds should be way more than enough though.

It simply waits for the fence agent to return success from the "off"
command before proceeding. I wouldn't assume any particular time between
that and initiating "on", and there's no way to set a delay there --
it's up to the agent to not return success until the action is actually
complete.

The standard says that agents should actually confirm that the device is
in the desired state after sending a command, so hopefully this is
already baked in.

>> With previous versions, there was a complicated configuration workaround
>> involving creating separate devices for the off and on actions. With the
>> new version, it happens automatically, and no special configuration is
>> needed.
>>
>> Here's an example where node1 is the affected node, and apc1 and apc2
>> are the fence devices:
>>
>>pcs stonith level add 1 node1 apc1,apc2
> 
> Where would the outlet definition go? 'apc1:4,apc2:4'?

"apc1" here is name of a Pacemaker fence resource. Hostname, port, etc.
would be configured in the definition of the "apc1" resource (which I
omitted above to focus on the topology config).

>> Of course you can configure it using crm or XML as well.
>>
>> The fencing operation will be treated as successful as long as the "off"
>> commands succeed, because then it is safe for the cluster to recover any
>> resources that were on the node. Timeouts and errors in the "on" phase
>> will be logged but ignored.
>>
>> Any action-specific timeout for the remapped action will be used (for
>> example, pcmk_off_timeout will be used when executing the "off" command,
>> not pcmk_reboot_timeout).
> 
> I think this answers my question about how long it stays off for. What
> would be an example config to control the off time then?

This isn't a delay, but a timeout before declaring the action failed. If
an "off" command does not return in this amount of time, the command
(and the entire topology level) will be considered failed, and the next
level will be tried.

The timeouts are configured in the fence resource definition. So
combining the above questions, apc1 might be defined like this:

   pcs stonith create apc1 fence_apc_snmp \
  ipaddr=apc1.example.com \
  login=user passwd='supersecret' \
  pcmk_off_timeout=30s \
  pcmk_host_map="node1.example.com:1,node2.example.com:2"
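
A second device would look the same, and the topology level from my earlier
mail then ties the two together (all names are illustrative):

   pcs stonith create apc2 fence_apc_snmp \
  ipaddr=apc2.example.com \
  login=user passwd='supersecret' \
  pcmk_off_timeout=30s \
  pcmk_host_map="node1.example.com:1,node2.example.com:2"

   pcs stonith level add 1 node1.example.com apc1,apc2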

>> The new code knows to skip the "on" step if the fence agent has
>> automatic unfencing (because it will happen when the node rejoins the
>> cluster). This allows fence_scsi to work with this feature.
> 
> http://i.imgur.com/i7BzivK.png

:-D

