Re: [ClusterLabs] Three node cluster becomes completely fenced if one node leaves

2017-03-30 Thread Seth Reid
The original message has the logs from nodes 1 and 3. Node 2, the one that
got fenced in this test, doesn't really show much. Here are the logs from
it:

Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #5 enp6s0f0,
192.168.100.14#123, interface stats: received=0, sent=0, dropped=0,
active_time=3253 secs
Mar 24 16:35:10 b014 ntpd[2318]: Deleting interface #7 enp6s0f0,
fe80::a236:9fff:fe8a:6500%6#123, interface stats: received=0, sent=0,
dropped=0, active_time=3253 secs
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] A processor failed,
forming new configuration.
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] A processor failed, forming
new configuration.
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] The network interface
is down.
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.13}
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.14}
Mar 24 16:35:13 b014 corosync[2166]: notice  [TOTEM ] adding new UDPU
member {192.168.100.15}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] The network interface is
down.
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.13}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.14}
Mar 24 16:35:13 b014 corosync[2166]:  [TOTEM ] adding new UDPU member
{192.168.100.15}
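
For reference, the "adding new UDPU member" lines above correspond to a udpu transport with an explicit nodelist in corosync.conf, roughly along these lines (the member addresses are taken from the log; everything else is an illustrative sketch, not the actual configuration from this cluster):

totem {
    version: 2
    transport: udpu
}

nodelist {
    node {
        ring0_addr: 192.168.100.13
    }
    node {
        ring0_addr: 192.168.100.14
    }
    node {
        ring0_addr: 192.168.100.15
    }
}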

---
Seth Reid



On Wed, Mar 29, 2017 at 7:17 AM, Bob Peterson  wrote:

> - Original Message -
> | I will try to install updated packages from ubuntu 16.10 or newer. It
> | can't get worse than not working.
> |
> | Can you think of any logs that might help? I've enabled debug on corosync
> | log, but it really doesn't show anything else other than corosync exiting.
> | Any diagnostic tools you can recommend?
> |
> | ---
> | Seth Reid
>
>
> Hi Seth,
>
> Can you post the pertinent messages from the consoles of all nodes in the
> cluster? Hopefully you were monitoring them.
>
> Regards,
>
> Bob Peterson
> Red Hat File Systems
>


Re: [ClusterLabs] Antw: Running two independent clusters

2017-03-30 Thread Ken Gaillot
On 03/30/2017 01:17 AM, Nikhil Utane wrote:
> "/Coincidentally, I am about to announce enhanced container support in/
> /pacemaker. I should have a post with more details later today or
> tomorrow./"
> 
> Ken: Were you able to get to it?
> 
> -Thanks
> Nikhil

Not yet, we've been tweaking the syntax a bit, so I wanted to have
something more final first. But it's very close.

> 
> On Thu, Mar 23, 2017 at 7:35 PM, Ken Gaillot wrote:
> 
> On 03/22/2017 11:08 PM, Nikhil Utane wrote:
> > I simplified when I called it a service. Essentially it is a complete
> > system.
> > It is an LTE eNB solution. It provides LTE service (service A) and now
> > we need to provide redundancy for another different but related service
> > (service B). The catch being, the LTE redundancy solution will be tied
> > to one operator whereas the other service can span across multiple
> > operators. Therefore ideally we want two completely independent clusters
> > since different set of nodes will form the two clusters.
> > Now what I am thinking is, to run additional instance of Pacemaker +
> > Corosync in a container which can then notify the service B on host
> > machine to start or stop it's service. That way my CIB file will be
> > independent and I can run corosync on different interfaces.
> >
> > Workable right?
> >
> > -Regards
> > Nikhil
> 
> It's not well-tested, but in theory it should work, as long as the
> container is privileged.
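
For illustration only, a second Pacemaker + Corosync stack in a privileged container might be launched roughly like this; the image name is a placeholder and this is an untested sketch, not a recipe from this thread:

# hypothetical image; host networking lets corosync bind to a dedicated
# VLAN interface on the host for the second cluster's traffic
docker run -d --name ha-serviceB \
    --privileged --net=host \
    my-pacemaker-corosync-image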
> 
> I still think virtualizing the services would be more resilient. It
> makes sense to have a single determination of quorum and fencing for the
> same real hosts. I'd think of it like a cloud provider -- the cloud
> instances are segregated by customer, but the underlying hosts are
> the same.
> 
> You could configure your cluster as asymmetric, and enable each VM only
> on the nodes it's allowed on, so you get the two separate "clusters"
> that way. You could set up the VMs as guest nodes if you want to monitor
> and manage multiple services within them. If your services require
> hardware access that's not easily passed to a VM, containerizing the
> services might be a better option.
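
As a rough sketch of the asymmetric-cluster approach described above (resource, node and file names are placeholders, not anything from this thread):

# opt-in cluster: nothing runs anywhere unless explicitly allowed
crm configure property symmetric-cluster=false

# VM carrying "service A", managed by the one real cluster
crm configure primitive vm_serviceA ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/serviceA.xml" \
    op monitor interval=30s timeout=60s

# allow it only on the nodes that form the logical "cluster A"
crm configure location loc_vm_serviceA_1 vm_serviceA 100: nodeA1
crm configure location loc_vm_serviceA_2 vm_serviceA 100: nodeA2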
> 
> > On Wed, Mar 22, 2017 at 8:06 PM, Ken Gaillot wrote:
> >
> > On 03/22/2017 05:23 AM, Nikhil Utane wrote:
> > > Hi Ulrich,
> > >
> > > It's not an option unfortunately.
> > > Our product runs on a specialized hardware and provides both the
> > > services (A & B) that I am referring to. Hence I cannot have service A
> > > running on some nodes as cluster A and service B running on other nodes
> > > as cluster B.
> > > The two services HAVE to run on same node. The catch being service A and
> > > service B have to be independent of each other.
> > >
> > > Hence looking at Container option since we are using that for some other
> > > product (but not for Pacemaker/Corosync).
> > >
> > > -Regards
> > > Nikhil
> >
> > Instead of containerizing pacemaker, why don't you containerize or
> > virtualize the services, and have pacemaker manage the containers/VMs?
> >
> > Coincidentally, I am about to announce enhanced container support in
> > pacemaker. I should have a post with more details later today or
> > tomorrow.
> >
> > >
> > > On Wed, Mar 22, 2017 at 12:41 PM, Ulrich Windl wrote:
> > >
> > > >>> Nikhil Utane wrote on 22.03.2017 at 07:48 in the message:

Re: [ClusterLabs] Stonith

2017-03-30 Thread Kristoffer Grönlund
Alexander Markov  writes:

> Hello, Kristoffer
>
>> Did you test failover through pacemaker itself?
>
> Yes, I did, no problems here.
>
>> However: Am I understanding it correctly that you have one node in each
>> data center, and a stonith device in each data center?
>
> Yes.
>
>> If the
>> data center is lost, the stonith device for the node in that data center
>> would also be lost and thus not able to fence.
>
> Exactly what happens!
>
>> In such a hardware configuration, only a poison pill solution like SBD
>> could work, I think.
>
> I've got no shared storage here. Every datacenter has its own storage 
> and they have replication on top (similar to drbd). I can organize a 
> cross-shared solution though if it helps, but I don't see how.

The only solution I know which allows for a configuration like this is
using separate clusters in each data center, and using booth for
transferring ticket ownership between them. Booth requires a data
center-level quorum (meaning at least 3 locations), though the third
location can be just a small daemon without an actual cluster, and can
run in a public cloud or similar for example.
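
As a rough sketch of such a booth setup (IP addresses and the ticket name are placeholders), a /etc/booth/booth.conf shared by both sites and the arbitrator might look like this:

transport = UDP
port = 9929
arbitrator = 10.0.0.100
site = 10.1.1.10
site = 10.2.1.10
ticket = "ticket-A"

Here 10.0.0.100 would be the small booth-only daemon at the third location, and 10.1.1.10 / 10.2.1.10 the clusters in the two data centers. Each cluster then ties its resources to the ticket, e.g. with crmsh:

crm configure rsc_ticket svc-needs-ticket-A ticket-A: my_service loss-policy=fence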

Cheers,
Kristoffer

>
>> --
>> Regards,
>> Alexander
>
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] question about ocf metadata actions

2017-03-30 Thread Kristoffer Grönlund
he.hailo...@zte.com.cn writes:

> Hi,
>
>
> Does the timeout configured in the ocf metadata actually take effect?
>
>
>
>
> <actions>
>
> <action name="start" timeout="300s" />
>
> <action name="stop" timeout="200s" />
>
> <action name="status" timeout="20s" />
>
> <action name="monitor" depth="0" timeout="20s" interval="2s" />
>
> <action name="meta-data" timeout="120s" />
>
> <action name="validate-all"  timeout="20s" />
>
> </actions>
>
>
>
>
> what's the relationship with the ones configured using "crm configure 
> primitive" ?

Hi Allen,

The timeouts in the OCF metadata are merely documentation hints, and
ignored by Pacemaker unless configured appropriately in the CIB (which
is what crm configure primitive does). See the OCF documentation:

https://github.com/ClusterLabs/resource-agents/blob/master/doc/dev-guides/ra-dev-guide.asc#_metadata

> Every action should list its own timeout value. This is a hint to the
> user what minimal timeout should be configured for the action. This is
> meant to cater for the fact that some resources are quick to start and
> stop (IP addresses or filesystems, for example), some may take several
> minutes to do so (such as databases).

> In addition, recurring actions (such as monitor) should also specify a
> recommended minimum interval, which is the time between two
> consecutive invocations of the same action. Like timeout, this value
> does not constitute a default — it is merely a hint for the user which
> action interval to configure, at minimum.
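
For example, translating the metadata hints above into actual operation settings in the CIB might look roughly like this with crmsh (resource name and agent are placeholders):

crm configure primitive my_db ocf:heartbeat:mysql \
    op start timeout=300s \
    op stop timeout=200s \
    op monitor interval=2s timeout=20s

Only such op definitions (or op_defaults) are what Pacemaker actually enforces; without them it falls back to its own defaults, not to the values in the agent's metadata.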

Cheers,
Kristoffer

>
> Br,
>
> Allen

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] Stonith

2017-03-30 Thread Alexander Markov

Hello, Kristoffer


Did you test failover through pacemaker itself?


Yes, I did, no problems here.


However: Am I understanding it correctly that you have one node in each
data center, and a stonith device in each data center?


Yes.


If the
data center is lost, the stonith device for the node in that data center
would also be lost and thus not able to fence.


Exactly what happens!


In such a hardware configuration, only a poison pill solution like SBD
could work, I think.


I've got no shared storage here. Every datacenter has its own storage 
and they have replication on top (similar to drbd). I can organize a 
cross-shared solution though if it helps, but I don't see how.



--
Regards,
Alexander





[ClusterLabs] question about ocf metadata actions

2017-03-30 Thread he.hailong5
Hi,


Does the timeout configured in the ocf metadata actually take effect?




<actions>

<action name="start" timeout="300s" />

<action name="stop" timeout="200s" />

<action name="status" timeout="20s" />

<action name="monitor" depth="0" timeout="20s" interval="2s" />

<action name="meta-data" timeout="120s" />

<action name="validate-all"  timeout="20s" />

</actions>




what's the relationship with the ones configured using "crm configure 
primitive" ?

Br,

Allen


Re: [ClusterLabs] Stonith

2017-03-30 Thread Kristoffer Grönlund
Alexander Markov  writes:

> Hello guys,
>
> it looks like I'm missing something obvious, but I just don't get what has
> happened.
>
> I've got a number of stonith-enabled clusters within my big POWER boxes. 
> My stonith devices are two HMC (hardware management consoles) - separate 
> servers from IBM that can reboot separate LPARs (logical partitions) 
> within POWER boxes - one per datacenter.
>
> So my definition for stonith devices was pretty straightforward:
>
> primitive st_dc2_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.9
> primitive st_dc1_hmc stonith:ibmhmc \
> params ipaddr=10.1.2.8
> clone cl_st_dc2_hmc st_dc2_hmc
> clone cl_st_dc1_hmc st_dc1_hmc
>
> Everything was ok when we tested failover. But today upon power outage 

Did you test failover through pacemaker itself?

Otherwise, the logs for the attempted stonith should reveal more about
how Pacemaker tried to call the stonith device, and what went wrong.

However: Am I understanding it correctly that you have one node in each
data center, and a stonith device in each data center? That doesn't
sound like a setup that can recover from data center failure: If the
data center is lost, the stonith device for the node in that data center
would also be lost and thus not able to fence.

In such a hardware configuration, only a poison pill solution like SBD
could work, I think.
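
A minimal sketch of such an SBD (poison pill) setup, assuming a small LUN that both data centers can reach (the device path and resource name are placeholders):

# initialize the shared slot device once, then reference it in
# /etc/sysconfig/sbd as SBD_DEVICE and start the sbd daemon on all nodes
sbd -d /dev/disk/by-id/shared-sbd-lun create

# register the fencing resource in the cluster
crm configure primitive stonith-sbd stonith:external/sbd
crm configure property stonith-enabled=true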

Cheers,
Kristoffer

> we lost one DC completely. Shortly after that, the cluster just literally
> hung upon trying to reboot the nonexistent node. No failover
> occurred. The nonexistent node was marked OFFLINE UNCLEAN and resources were
> marked "Started UNCLEAN" on the nonexistent node.
>
> UNCLEAN seems to flag a problem with the stonith configuration. So my
> question is: how to avoid such behaviour?
>
> Thank you!
>
> -- 
> Regards,
> Alexander
>

-- 
// Kristoffer Grönlund
// kgronl...@suse.com



Re: [ClusterLabs] Antw: Running two independent clusters

2017-03-30 Thread Nikhil Utane
"*Coincidentally, I am about to announce enhanced container support in*
*pacemaker. I should have a post with more details later today or tomorrow.*
"

Ken: Were you able to get to it?

-Thanks
Nikhil

On Thu, Mar 23, 2017 at 7:35 PM, Ken Gaillot  wrote:

> On 03/22/2017 11:08 PM, Nikhil Utane wrote:
> > I simplified when I called it a service. Essentially it is a complete
> > system.
> > It is an LTE eNB solution. It provides LTE service (service A) and now
> > we need to provide redundancy for another different but related service
> > (service B). The catch being, the LTE redundancy solution will be tied
> > to one operator whereas the other service can span across multiple
> > operators. Therefore ideally we want two completely independent clusters
> > since different set of nodes will form the two clusters.
> > Now what I am thinking is, to run additional instance of Pacemaker +
> > Corosync in a container which can then notify the service B on host
> > machine to start or stop it's service. That way my CIB file will be
> > independent and I can run corosync on different interfaces.
> >
> > Workable right?
> >
> > -Regards
> > Nikhil
>
> It's not well-tested, but in theory it should work, as long as the
> container is privileged.
>
> I still think virtualizing the services would be more resilient. It
> makes sense to have a single determination of quorum and fencing for the
> same real hosts. I'd think of it like a cloud provider -- the cloud
> instances are segregated by customer, but the underlying hosts are the
> same.
>
> You could configure your cluster as asymmetric, and enable each VM only
> on the nodes it's allowed on, so you get the two separate "clusters"
> that way. You could set up the VMs as guest nodes if you want to monitor
> and manage multiple services within them. If your services require
> hardware access that's not easily passed to a VM, containerizing the
> services might be a better option.
>
> > On Wed, Mar 22, 2017 at 8:06 PM, Ken Gaillot wrote:
> >
> > On 03/22/2017 05:23 AM, Nikhil Utane wrote:
> > > Hi Ulrich,
> > >
> > > It's not an option unfortunately.
> > > Our product runs on a specialized hardware and provides both the
> > > services (A & B) that I am referring to. Hence I cannot have service A
> > > running on some nodes as cluster A and service B running on other nodes
> > > as cluster B.
> > > The two services HAVE to run on same node. The catch being service A and
> > > service B have to be independent of each other.
> > >
> > > Hence looking at Container option since we are using that for some other
> > > product (but not for Pacemaker/Corosync).
> > >
> > > -Regards
> > > Nikhil
> >
> > Instead of containerizing pacemaker, why don't you containerize or
> > virtualize the services, and have pacemaker manage the containers/VMs?
> >
> > Coincidentally, I am about to announce enhanced container support in
> > pacemaker. I should have a post with more details later today or
> > tomorrow.
> >
> > >
> > > On Wed, Mar 22, 2017 at 12:41 PM, Ulrich Windl wrote:
> > >
> > > >>> Nikhil Utane wrote on 22.03.2017 at 07:48 in the message:
> > > > Hi All,
> > > >
> > > > First of all, let me thank everyone here for providing excellent
> > > > support from the time I started evaluating this tool about a year
> > > > ago. It has helped me to make a timely and good quality release of
> > > > our Redundancy solution using Pacemaker & Corosync. (Three cheers :))
> > > >
> > > > Now for our next release we have a slightly different ask.
> > > > We want to provide Redundancy to two different types of services
> > > > (we can call them Service A and Service B) such that all cluster
> > > > communication for Service A happens on one network/interface (say
> > > > VLAN A) and for service B happens on a different network/interface
> > > > (say VLAN B). Moreover we do not want the details of Service A
> > > > (resource attributes etc) to be seen by Service B and vice-versa.
> > > >
> > > > So essentially we want to be able to run two independent