Re: [ClusterLabs] mail server (postfix)

2016-06-06 Thread Dimitri Maziuk
On 06/06/2016 03:04 AM, Vladislav Bogdanov wrote:

...
> 5) promote service on nodeB - replace config and internally reload/restart
> 6) start VIP on nodeB
> 
> 1-2 and 5-6 pairs may need to be reversed if you bind service to a
> specific VIP (instead of listening on INADDR_ANY).

Yeah, that could work... but if my way works I won't have to write my
own RA -- or at least not for postfix. ;)
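For reference, the promote-then-VIP ordering quoted above (steps 5-6) can be expressed with ordinary constraints. A hypothetical crm shell sketch, with invented resource names ("ms_mail" for the master/slave service, "p_vip" for the VIP), not a tested configuration:

```shell
# Hypothetical crm configure fragment; resource names are invented.
# Run the VIP only where the service is promoted, and start it only
# after promotion has happened.
colocation vip-with-master inf: p_vip ms_mail:Master
order vip-after-promote inf: ms_mail:promote p_vip:start
```

If the service binds to the VIP instead of INADDR_ANY, the order would be reversed, as Vladislav notes.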

Thanks,
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Vladislav Bogdanov

07.06.2016 02:20, Ken Gaillot wrote:

On 06/06/2016 03:30 PM, Vladislav Bogdanov wrote:

06.06.2016 22:43, Ken Gaillot wrote:

On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:

06.06.2016 19:39, Ken Gaillot wrote:

On 06/05/2016 07:27 PM, Andrew Beekhof wrote:

On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot 
wrote:

On 06/02/2016 08:01 PM, Andrew Beekhof wrote:

On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot 
wrote:

A recent thread discussed a proposed new feature, a new environment
variable that would be passed to resource agents, indicating
whether a
stop action was part of a recovery.

Since that thread was long and covered a lot of topics, I'm
starting a
new one to focus on the core issue remaining:

The original idea was to pass the number of restarts remaining
before
the cluster stops trying to start the resource on the same node.
This
involves calculating (fail-count - migration-threshold), and that
implies certain limitations: (1) it will only be set when the
cluster
checks migration-threshold; (2) it will only be set for the failed
resource itself, not for other resources that may be recovered
due to
dependencies on it.

Ulrich Windl proposed an alternative: setting a boolean value
instead. I
forgot to cc the list on my reply, so I'll summarize now: We would
set a
new variable like OCF_RESKEY_CRM_recovery=true


This concept worries me, especially when what we've implemented is
called OCF_RESKEY_CRM_restarting.


Agreed; I plan to rename it yet again, to
OCF_RESKEY_CRM_start_expected.


The name alone encourages people to "optimise" the agent to not
actually stop the service "because it's just going to start again
shortly".  I know that's not what Adam would do, but not everyone
understands how clusters work.

There are any number of reasons why a cluster that intends to
restart
a service may not do so.  In such a scenario, a badly written agent
would cause the cluster to mistakenly believe that the service is
stopped - allowing it to start elsewhere.

It's true there are any number of ways to write bad agents, but I
would
argue that we shouldn't be nudging people in that direction :)


I do have mixed feelings about that. I think if we name it
start_expected, and document it carefully, we can avoid any casual
mistakes.

My main question is how useful would it actually be in the
proposed use
cases. Considering the possibility that the expected start might
never
happen (or fail), can an RA really do anything different if
start_expected=true?


I would have thought not.  Correctness should trump optimal.
But I'm prepared to be mistaken.


If the use case is there, I have no problem with
adding it, but I want to make sure it's worthwhile.


Anyone have comments on this?

A simple example: pacemaker calls an RA stop with start_expected=true,
then before the start happens, someone disables the resource, so the
start is never called. Or the node is fenced before the start happens,
etc.

Is there anything significant an RA can do differently based on
start_expected=true/false without causing problems if an expected start
never happens?


Yep.

It may request a stop of other resources:
* on that node, by removing some node attributes which participate in
location constraints
* or cluster-wide, by revoking (or putting into standby) a cluster
ticket that other resources depend on
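A rough sketch of those two mechanisms using the standard Pacemaker CLI tools; the attribute and ticket names are invented for illustration, and the exact flags should be checked against the locally installed attrd_updater/crm_ticket:

```shell
# Hypothetical fragment from an RA's stop action; "mgs_up" and
# "lustre-ticket" are invented names.

# Node-local: drop an attribute that a location constraint keys on,
# which makes dependent resources stop on this node.
attrd_updater --name mgs_up --delete

# Cluster-wide: revoke a ticket that dependent resources require
# (via rsc_ticket constraints), stopping them everywhere.
crm_ticket --ticket lustre-ticket --revoke --force
```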

The latter case is why I asked about the possibility of passing the
node name the resource is intended to be started on, instead of a boolean
value (in comments to PR #1026). I would use it to request a stop of
lustre MDTs and OSTs by revoking the ticket they depend on if the MGS
(the primary lustre component which does all "request routing") fails to
start anywhere in the cluster. That way, if the RA does not receive any
node name,

Why would ordering constraints be insufficient?


They are in place, but only as advisory ones, to allow MGS fail-over/switch-over.


What happens if the MDTs/OSTs continue running because a start of MGS
was expected, but something prevents the start from actually happening?


Nothing critical: lustre clients won't be able to contact them without
the MGS running and will hang.
But it is safer to shut them down if it is known that the MGS cannot be
started right now, especially if geo-cluster failover is expected in
that case (as the MGS can be local to a site, contrary to all other lustre
parts, which need to be replicated). Actually, that is the only piece of
the puzzle remaining to "solve" that big project, and IMHO it is enough to
have the node name of an intended start, or nothing, in that attribute
(nothing means stop everything and initiate geo-failover if needed). If,
e.g., fencing happens for a node intended to start the resource, then stop
will be called again after the next start failure, once failure-timeout
lapses. That would be much better than no information at all. A total stop
or geo-failover would happen just with some (configurable) delay, instead
of leaving the whole filesystem in an unusable state requiring manual

Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Andrew Beekhof
On Tue, Jun 7, 2016 at 9:07 AM, Ken Gaillot  wrote:
> On 06/06/2016 05:45 PM, Adam Spiers wrote:
>> Adam Spiers  wrote:
>>> Andrew Beekhof  wrote:
 On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers  wrote:
> Ken Gaillot  wrote:
>> My main question is how useful would it actually be in the proposed use
>> cases. Considering the possibility that the expected start might never
>> happen (or fail), can an RA really do anything different if
>> start_expected=true?
>
> That's the wrong question :-)
>
>> If the use case is there, I have no problem with
>> adding it, but I want to make sure it's worthwhile.
>
> The use case which started this whole thread is for
> start_expected=false, not start_expected=true.

 Isn't this just two sides of the same coin?
 If you're not doing the same thing for both cases, then you're just
 reversing the order of the clauses.
>>>
>>> No, because the stated concern about unreliable expectations
>>> ("Considering the possibility that the expected start might never
>>> happen (or fail)") was regarding start_expected=true, and that's the
>>> side of the coin we don't care about, so it doesn't matter if it's
>>> unreliable.
>>
>> BTW, if the expected start happens but fails, then Pacemaker will just
>> keep repeating until migration-threshold is hit, at which point it
>> will call the RA 'stop' action finally with start_expected=false.
>> So that's of no concern.
>
> To clarify, that's configurable, via start-failure-is-fatal and on-fail
>
>> Maybe your point was that if the expected start never happens (so
>> never even gets a chance to fail), we still want to do a nova
>> service-disable?
>
> That is a good question, which might mean it should be done on every
> stop -- or could that cause problems (besides delays)?
>
> Another aspect of this is that the proposed feature could only look at a
> single transition. What if stop is called with start_expected=false, but
> then Pacemaker is able to start the service on the same node in the next
> transition immediately afterward? Would having called service-disable
> cause problems for that start?
>
>> Yes that would be nice, but this proposal was never intended to
>> address that.  I guess we'd need an entirely different mechanism in
>> Pacemaker for that.  But let's not allow perfection to become the
>> enemy of the good ;-)
>
> The ultimate concern is that this will encourage people to write RAs
> that leave services in a dangerous state after stop is called.
>
> I think with naming and documenting it properly, I'm fine to provide the
> option, but I'm on the fence. Beekhof needs a little more convincing :-)

I think the new name is a big step in the right direction



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Andrew Beekhof
On Tue, Jun 7, 2016 at 8:45 AM, Adam Spiers  wrote:
> Adam Spiers  wrote:
>> Andrew Beekhof  wrote:
>> > On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers  wrote:
>> > > Ken Gaillot  wrote:
>> > >> My main question is how useful would it actually be in the proposed use
>> > >> cases. Considering the possibility that the expected start might never
>> > >> happen (or fail), can an RA really do anything different if
>> > >> start_expected=true?
>> > >
>> > > That's the wrong question :-)
>> > >
>> > >> If the use case is there, I have no problem with
>> > >> adding it, but I want to make sure it's worthwhile.
>> > >
>> > > The use case which started this whole thread is for
>> > > start_expected=false, not start_expected=true.
>> >
>> > Isn't this just two sides of the same coin?
>> > If you're not doing the same thing for both cases, then you're just
>> > reversing the order of the clauses.
>>
>> No, because the stated concern about unreliable expectations
>> ("Considering the possibility that the expected start might never
>> happen (or fail)") was regarding start_expected=true, and that's the
>> side of the coin we don't care about, so it doesn't matter if it's
>> unreliable.
>
> BTW, if the expected start happens but fails, then Pacemaker will just
> keep repeating until migration-threshold is hit, at which point it
> will call the RA 'stop' action finally with start_expected=false.

Maybe. Maybe not. People cannot rely on this and I'd put money on them
trying :-)

> So that's of no concern.
>
> Maybe your point was that if the expected start never happens (so
> never even gets a chance to fail), we still want to do a nova
> service-disable?

Exactly :)

>
> Yes that would be nice, but this proposal was never intended to
> address that.  I guess we'd need an entirely different mechanism in
> Pacemaker for that.  But let's not allow perfection to become the
> enemy of the good ;-)
>



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Ken Gaillot
On 06/06/2016 03:30 PM, Vladislav Bogdanov wrote:
> 06.06.2016 22:43, Ken Gaillot wrote:
>> On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:
>>> 06.06.2016 19:39, Ken Gaillot wrote:
 On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot 
> wrote:
>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot 
>>> wrote:
 A recent thread discussed a proposed new feature, a new environment
 variable that would be passed to resource agents, indicating
 whether a
 stop action was part of a recovery.

 Since that thread was long and covered a lot of topics, I'm
 starting a
 new one to focus on the core issue remaining:

 The original idea was to pass the number of restarts remaining
 before
the cluster stops trying to start the resource on the same node.
 This
 involves calculating (fail-count - migration-threshold), and that
 implies certain limitations: (1) it will only be set when the
 cluster
 checks migration-threshold; (2) it will only be set for the failed
 resource itself, not for other resources that may be recovered
 due to
 dependencies on it.

 Ulrich Windl proposed an alternative: setting a boolean value
 instead. I
 forgot to cc the list on my reply, so I'll summarize now: We would
 set a
 new variable like OCF_RESKEY_CRM_recovery=true
>>>
>>> This concept worries me, especially when what we've implemented is
>>> called OCF_RESKEY_CRM_restarting.
>>
>> Agreed; I plan to rename it yet again, to
>> OCF_RESKEY_CRM_start_expected.
>>
>>> The name alone encourages people to "optimise" the agent to not
>>> actually stop the service "because it's just going to start again
>>> shortly".  I know that's not what Adam would do, but not everyone
>>> understands how clusters work.
>>>
>>> There are any number of reasons why a cluster that intends to
>>> restart
>>> a service may not do so.  In such a scenario, a badly written agent
>>> would cause the cluster to mistakenly believe that the service is
>>> stopped - allowing it to start elsewhere.
>>>
>>> It's true there are any number of ways to write bad agents, but I
>>> would
>>> argue that we shouldn't be nudging people in that direction :)
>>
>> I do have mixed feelings about that. I think if we name it
>> start_expected, and document it carefully, we can avoid any casual
>> mistakes.
>>
>> My main question is how useful would it actually be in the
>> proposed use
>> cases. Considering the possibility that the expected start might
>> never
>> happen (or fail), can an RA really do anything different if
>> start_expected=true?
>
> I would have thought not.  Correctness should trump optimal.
> But I'm prepared to be mistaken.
>
>> If the use case is there, I have no problem with
>> adding it, but I want to make sure it's worthwhile.

 Anyone have comments on this?

 A simple example: pacemaker calls an RA stop with start_expected=true,
 then before the start happens, someone disables the resource, so the
 start is never called. Or the node is fenced before the start happens,
 etc.

 Is there anything significant an RA can do differently based on
 start_expected=true/false without causing problems if an expected start
 never happens?
>>>
>>> Yep.
>>>
>>> It may request stop of other resources
>>> * on that node by removing some node attributes which participate in
>>> location constraints
>>> * or cluster-wide by revoking/putting to standby cluster ticket other
>>> resources depend on
>>>
>>> The latter case is why I asked about the possibility of passing the
>>> node name the resource is intended to be started on, instead of a boolean
>>> value (in comments to PR #1026). I would use it to request a stop of
>>> lustre MDTs and OSTs by revoking the ticket they depend on if the MGS
>>> (the primary lustre component which does all "request routing") fails to
>>> start anywhere in the cluster. That way, if the RA does not receive any
>>> node name,
>>
>> Why would ordering constraints be insufficient?
> 
> They are in place, but advisory ones to allow MGS fail/switch-over.
>>
>> What happens if the MDTs/OSTs continue running because a start of MGS
>> was expected, but something prevents the start from actually happening?
> 
> Nothing critical, lustre clients won't be able to contact them without
> MGS running and will hang.
> But it is safer to shutdown them if it is known that MGS cannot be
> started right now. Especially if geo-cluster failover is expected in
> that case (as the MGS can be local to a site, contrary to all other lustre
> parts which need to 

Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Ken Gaillot
On 06/06/2016 05:45 PM, Adam Spiers wrote:
> Adam Spiers  wrote:
>> Andrew Beekhof  wrote:
>>> On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers  wrote:
 Ken Gaillot  wrote:
> My main question is how useful would it actually be in the proposed use
> cases. Considering the possibility that the expected start might never
> happen (or fail), can an RA really do anything different if
> start_expected=true?

 That's the wrong question :-)

> If the use case is there, I have no problem with
> adding it, but I want to make sure it's worthwhile.

 The use case which started this whole thread is for
 start_expected=false, not start_expected=true.
>>>
>>> Isn't this just two sides of the same coin?
>>> If you're not doing the same thing for both cases, then you're just
>>> reversing the order of the clauses.
>>
>> No, because the stated concern about unreliable expectations
>> ("Considering the possibility that the expected start might never
>> happen (or fail)") was regarding start_expected=true, and that's the
>> side of the coin we don't care about, so it doesn't matter if it's
>> unreliable.
> 
> BTW, if the expected start happens but fails, then Pacemaker will just
> keep repeating until migration-threshold is hit, at which point it
> will call the RA 'stop' action finally with start_expected=false.
> So that's of no concern.

To clarify, that's configurable, via start-failure-is-fatal and on-fail

> Maybe your point was that if the expected start never happens (so
> never even gets a chance to fail), we still want to do a nova
> service-disable?

That is a good question, which might mean it should be done on every
stop -- or could that cause problems (besides delays)?

Another aspect of this is that the proposed feature could only look at a
single transition. What if stop is called with start_expected=false, but
then Pacemaker is able to start the service on the same node in the next
transition immediately afterward? Would having called service-disable
cause problems for that start?

> Yes that would be nice, but this proposal was never intended to
> address that.  I guess we'd need an entirely different mechanism in
> Pacemaker for that.  But let's not allow perfection to become the
> enemy of the good ;-)

The ultimate concern is that this will encourage people to write RAs
that leave services in a dangerous state after stop is called.

I think with naming and documenting it properly, I'm fine to provide the
option, but I'm on the fence. Beekhof needs a little more convincing :-)



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Adam Spiers
Adam Spiers  wrote:
> Andrew Beekhof  wrote:
> > On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers  wrote:
> > > Ken Gaillot  wrote:
> > >> My main question is how useful would it actually be in the proposed use
> > >> cases. Considering the possibility that the expected start might never
> > >> happen (or fail), can an RA really do anything different if
> > >> start_expected=true?
> > >
> > > That's the wrong question :-)
> > >
> > >> If the use case is there, I have no problem with
> > >> adding it, but I want to make sure it's worthwhile.
> > >
> > > The use case which started this whole thread is for
> > > start_expected=false, not start_expected=true.
> > 
> > Isn't this just two sides of the same coin?
> > If you're not doing the same thing for both cases, then you're just
> > reversing the order of the clauses.
> 
> No, because the stated concern about unreliable expectations
> ("Considering the possibility that the expected start might never
> happen (or fail)") was regarding start_expected=true, and that's the
> side of the coin we don't care about, so it doesn't matter if it's
> unreliable.

BTW, if the expected start happens but fails, then Pacemaker will just
keep repeating until migration-threshold is hit, at which point it
will call the RA 'stop' action finally with start_expected=false.
So that's of no concern.

Maybe your point was that if the expected start never happens (so
never even gets a chance to fail), we still want to do a nova
service-disable?

Yes that would be nice, but this proposal was never intended to
address that.  I guess we'd need an entirely different mechanism in
Pacemaker for that.  But let's not allow perfection to become the
enemy of the good ;-)



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Adam Spiers
Andrew Beekhof  wrote:
> On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers  wrote:
> > Ken Gaillot  wrote:
> >> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
> >> > On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot  wrote:
> >> >> A recent thread discussed a proposed new feature, a new environment
> >> >> variable that would be passed to resource agents, indicating whether a
> >> >> stop action was part of a recovery.
> >> >>
> >> >> Since that thread was long and covered a lot of topics, I'm starting a
> >> >> new one to focus on the core issue remaining:
> >> >>
> >> >> The original idea was to pass the number of restarts remaining before
> >> >> the cluster stops trying to start the resource on the same node. This
> >> >> involves calculating (fail-count - migration-threshold), and that
> >> >> implies certain limitations: (1) it will only be set when the cluster
> >> >> checks migration-threshold; (2) it will only be set for the failed
> >> >> resource itself, not for other resources that may be recovered due to
> >> >> dependencies on it.
> >> >>
> >> >> Ulrich Windl proposed an alternative: setting a boolean value instead. I
> >> >> forgot to cc the list on my reply, so I'll summarize now: We would set a
> >> >> new variable like OCF_RESKEY_CRM_recovery=true
> >> >
> >> > This concept worries me, especially when what we've implemented is
> >> > called OCF_RESKEY_CRM_restarting.
> >>
> >> Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.
> >
> > [snipped]
> >
> >> My main question is how useful would it actually be in the proposed use
> >> cases. Considering the possibility that the expected start might never
> >> happen (or fail), can an RA really do anything different if
> >> start_expected=true?
> >
> > That's the wrong question :-)
> >
> >> If the use case is there, I have no problem with
> >> adding it, but I want to make sure it's worthwhile.
> >
> > The use case which started this whole thread is for
> > start_expected=false, not start_expected=true.
> 
> Isn't this just two sides of the same coin?
> If you're not doing the same thing for both cases, then you're just
> reversing the order of the clauses.

No, because the stated concern about unreliable expectations
("Considering the possibility that the expected start might never
happen (or fail)") was regarding start_expected=true, and that's the
side of the coin we don't care about, so it doesn't matter if it's
unreliable.



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Andrew Beekhof
On Tue, Jun 7, 2016 at 8:29 AM, Adam Spiers  wrote:
> Ken Gaillot  wrote:
>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>> > On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot  wrote:
>> >> A recent thread discussed a proposed new feature, a new environment
>> >> variable that would be passed to resource agents, indicating whether a
>> >> stop action was part of a recovery.
>> >>
>> >> Since that thread was long and covered a lot of topics, I'm starting a
>> >> new one to focus on the core issue remaining:
>> >>
>> >> The original idea was to pass the number of restarts remaining before
>> >> the cluster stops trying to start the resource on the same node. This
>> >> involves calculating (fail-count - migration-threshold), and that
>> >> implies certain limitations: (1) it will only be set when the cluster
>> >> checks migration-threshold; (2) it will only be set for the failed
>> >> resource itself, not for other resources that may be recovered due to
>> >> dependencies on it.
>> >>
>> >> Ulrich Windl proposed an alternative: setting a boolean value instead. I
>> >> forgot to cc the list on my reply, so I'll summarize now: We would set a
>> >> new variable like OCF_RESKEY_CRM_recovery=true
>> >
>> > This concept worries me, especially when what we've implemented is
>> > called OCF_RESKEY_CRM_restarting.
>>
>> Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.
>
> [snipped]
>
>> My main question is how useful would it actually be in the proposed use
>> cases. Considering the possibility that the expected start might never
>> happen (or fail), can an RA really do anything different if
>> start_expected=true?
>
> That's the wrong question :-)
>
>> If the use case is there, I have no problem with
>> adding it, but I want to make sure it's worthwhile.
>
> The use case which started this whole thread is for
> start_expected=false, not start_expected=true.

Isn't this just two sides of the same coin?
If you're not doing the same thing for both cases, then you're just
reversing the order of the clauses.

"A isn't different from B, B is different from A!" :-)

> When it's false for
> NovaCompute, we call nova service-disable to ensure that nova doesn't
> attempt to schedule any more VMs on that host.
>
> If start_expected=true, we don't *want* to do anything different.  So
> it doesn't matter even if the expected start never happens.



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Adam Spiers
Ken Gaillot  wrote:
> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
> > On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot  wrote:
> >> A recent thread discussed a proposed new feature, a new environment
> >> variable that would be passed to resource agents, indicating whether a
> >> stop action was part of a recovery.
> >>
> >> Since that thread was long and covered a lot of topics, I'm starting a
> >> new one to focus on the core issue remaining:
> >>
> >> The original idea was to pass the number of restarts remaining before
> >> the cluster stops trying to start the resource on the same node. This
> >> involves calculating (fail-count - migration-threshold), and that
> >> implies certain limitations: (1) it will only be set when the cluster
> >> checks migration-threshold; (2) it will only be set for the failed
> >> resource itself, not for other resources that may be recovered due to
> >> dependencies on it.
> >>
> >> Ulrich Windl proposed an alternative: setting a boolean value instead. I
> >> forgot to cc the list on my reply, so I'll summarize now: We would set a
> >> new variable like OCF_RESKEY_CRM_recovery=true
> > 
> > This concept worries me, especially when what we've implemented is
> > called OCF_RESKEY_CRM_restarting.
> 
> Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.

[snipped]

> My main question is how useful would it actually be in the proposed use
> cases. Considering the possibility that the expected start might never
> happen (or fail), can an RA really do anything different if
> start_expected=true?

That's the wrong question :-)

> If the use case is there, I have no problem with
> adding it, but I want to make sure it's worthwhile.

The use case which started this whole thread is for
start_expected=false, not start_expected=true.  When it's false for
NovaCompute, we call nova service-disable to ensure that nova doesn't
attempt to schedule any more VMs on that host.

If start_expected=true, we don't *want* to do anything different.  So
it doesn't matter even if the expected start never happens.
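The pattern Adam describes might look roughly like the following in an RA's stop action. This is a minimal sketch, not the actual NovaCompute agent; `nova_service_disable` is a stub standing in for the real `nova service-disable` call, and the default-to-true fallback is an assumption:

```shell
#!/bin/sh
# Hypothetical sketch: a stop action that disables the nova service
# only when Pacemaker does not expect to start us again on this node.

# Stub standing in for the real "nova service-disable <host> nova-compute".
nova_service_disable() {
    echo "nova service-disable $(uname -n) nova-compute"
}

compute_stop() {
    # ... stop the nova-compute process itself here ...

    # Only when no restart is expected here do we tell the nova
    # scheduler to stop placing new VMs on this host.
    if [ "${OCF_RESKEY_CRM_start_expected:-true}" = "false" ]; then
        nova_service_disable
    fi
    return 0  # OCF_SUCCESS
}

OCF_RESKEY_CRM_start_expected=false
compute_stop
```

Since the agent does nothing extra when start_expected=true, an expected start that never materializes costs nothing, which is exactly the asymmetry discussed above.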



Re: [ClusterLabs] Pacemaker reload Master/Slave resource

2016-06-06 Thread Ken Gaillot
On 05/20/2016 06:20 AM, Felix Zachlod (Lists) wrote:
> version 1.1.13-10.el7_2.2-44eb2dd
> 
> Hello!
> 
> I am currently developing a master/slave resource agent. So far it is working 
> just fine, but this resource agent implements reload() and this does not work 
> as expected when running as Master:
> The reload action is invoked and it succeeds returning 0. The resource is 
> still Master and monitor will return $OCF_RUNNING_MASTER.
> 
> But Pacemaker considers the instance to be a slave afterwards. Actually only
> reload is invoked; no monitor, no demote, etc.
> 
> I first thought that reload should possibly return $OCF_RUNNING_MASTER too 
> but this leads to the resource failing on reload. It seems 0 is the only 
> valid return code.
> 
> I can recover the cluster state by running "resource $resourcename promote",
> which will call
> 
> notify
> promote
> notify
> 
> Afterwards my resource is considered Master again. After the PEngine Recheck
> Timer (I_PE_CALC) pops (90ms), the cluster manager will promote
> the resource itself.
> But this can lead to unexpected results: it could promote the resource on the
> wrong node, so that both sides are actually running as master; the cluster
> will not even notice, since it does not call monitor either.
> 
> Is this a bug?
> 
> regards, Felix

I think it depends on your point of view :)

Reload is implemented as an alternative to stop-then-start. For m/s
clones, start leaves the resource in slave state.

So on the one hand, it makes sense that Pacemaker would expect a m/s
reload to end up in slave state, regardless of the initial state, since
it should be equivalent to stop-then-start.

On the other hand, you could argue that a reload for a master should
logically be an alternative to demote-stop-start-promote.

On the third hand ;) you could argue that reload is ambiguous for master
resources and thus shouldn't be supported at all.

Feel free to open a feature request at http://bugs.clusterlabs.org/ to
say how you think it should work.

As an aside, I think the current implementation of reload in pacemaker
is unsatisfactory for two reasons:

* Using the "unique" attribute to determine whether a parameter is
reloadable was a bad idea. For example, the location of a daemon binary
is generally set to unique=0, which is sensible in that multiple RA
instances can use the same binary, but a reload could not handle that
change. It is not a problem only because no one ever changes that.

* There is a fundamental misunderstanding between pacemaker and most RA
developers as to what reload means. Pacemaker uses the reload action to
make parameter changes in the resource's *pacemaker* configuration take
effect, but RA developers tend to use it to reload the service's own
configuration files (a more natural interpretation, but completely
different from how pacemaker uses it).
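
To make the distinction concrete, here is a minimal, hypothetical sketch (all names are invented, not from any real RA) of a reload action in pacemaker's sense: it applies changed resource *parameters* to the running service, rather than re-reading the service's own config file:

```shell
#!/bin/sh
# Hypothetical RA fragment illustrating pacemaker's meaning of "reload":
# re-apply changed OCF_RESKEY_* resource parameters, NOT the service's
# own configuration file. All names here are illustrative.

OCF_SUCCESS=0

my_ra_reload() {
    # A "reloadable" parameter changed in the CIB; push the new value to
    # the running service instead of doing a full stop/start cycle.
    echo "applying parameter change: loglevel=${OCF_RESKEY_loglevel}"
    return $OCF_SUCCESS
}

# Simulate pacemaker invoking reload after the loglevel parameter changed
OCF_RESKEY_loglevel="debug"
my_ra_reload
```

An RA author who wants "reload the daemon's config file" behavior would instead implement that as a separate, explicitly named action, keeping the two meanings apart.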

> trace   May 20 12:58:31 cib_create_op(609):0: Sending call options: 0010, 
> 1048576
> trace   May 20 12:58:31 cib_native_perform_op_delegate(384):0: Sending 
> cib_modify message to CIB service (timeout=120s)
> trace   May 20 12:58:31 crm_ipc_send(1175):0: Sending from client: cib_shm 
> request id: 745 bytes: 1070 timeout:12 msg...
> trace   May 20 12:58:31 crm_ipc_send(1188):0: Message sent, not waiting for 
> reply to 745 from cib_shm to 1070 bytes...
> trace   May 20 12:58:31 cib_native_perform_op_delegate(395):0: Reply: No data 
> to dump as XML
> trace   May 20 12:58:31 cib_native_perform_op_delegate(398):0: Async call, 
> returning 268
> trace   May 20 12:58:31 do_update_resource(2274):0: Sent resource state 
> update message: 268 for reload=0 on scst_dg_ssd
> trace   May 20 12:58:31 cib_client_register_callback_full(606):0: Adding 
> callback cib_rsc_callback for call 268
> trace   May 20 12:58:31 process_lrm_event(2374):0: Op scst_dg_ssd_reload_0 
> (call=449, stop-id=scst_dg_ssd:449, remaining=3): Confirmed
> notice  May 20 12:58:31 process_lrm_event(2392):0: Operation 
> scst_dg_ssd_reload_0: ok (node=alpha, call=449, rc=0, cib-update=268, 
> confirmed=true)
> debug   May 20 12:58:31 update_history_cache(196):0: Updating history for 
> 'scst_dg_ssd' with reload op
> trace   May 20 12:58:31 crm_ipc_read(992):0: No message from lrmd received: 
> Resource temporarily unavailable
> trace   May 20 12:58:31 mainloop_gio_callback(654):0: Message acquisition 
> from lrmd[0x22b0ec0] failed: No message of desired type (-42)
> trace   May 20 12:58:31 crm_fsa_trigger(293):0: Invoked (queue len: 0)
> trace   May 20 12:58:31 s_crmd_fsa(159):0: FSA invoked with Cause: 
> C_FSA_INTERNAL   State: S_NOT_DC
> trace   May 20 12:58:31 s_crmd_fsa(246):0: Exiting the FSA
> trace   May 20 12:58:31 crm_fsa_trigger(295):0: Exited  (queue len: 0)
> trace   May 20 12:58:31 crm_ipc_read(989):0: Received cib_shm event 2108, 
> size=183, rc=183, text:  cib_callid="268" cib_clientid="60010689-7350-4916-a7bd-bd85ff
> trace   May 20 12:58:31 mainloop_gio_callback(659):0: New message from 
> cib_shm[0x23b7ab0] 

Re: [ClusterLabs] Different pacemaker versions split cluster

2016-06-06 Thread Vladislav Bogdanov

06.06.2016 23:28, Ken Gaillot wrote:

On 05/30/2016 01:14 PM, DacioMF wrote:

Hi,

I had 4 nodes with Ubuntu 14.04 LTS in my cluster and all of them worked well. I 
need to upgrade all my cluster nodes to Ubuntu 16.04 LTS without stopping my 
resources. Two nodes have been updated to 16.04 and the other two remain on 
14.04. The problem is that my cluster was split, and the nodes with Ubuntu 14.04 
only work with the other node on the same version. The same is true for the 
nodes with Ubuntu 16.04. The feature set of pacemaker in Ubuntu 14.04 is v3.0.7 
and in 16.04 it is v3.0.10.

The following commands shows what's happening:

root@xenserver50:/var/log/corosync# crm status
Last updated: Thu May 19 17:19:06 2016
Last change: Thu May 19 09:00:48 2016 via cibadmin on xenserver50
Stack: corosync
Current DC: xenserver51 (51) - partition with quorum
Version: 1.1.10-42f2063
4 Nodes configured
4 Resources configured

Online: [ xenserver50 xenserver51 ]
OFFLINE: [ xenserver52 xenserver54 ]

-

root@xenserver52:/var/log/corosync# crm status
Last updated: Thu May 19 17:20:04 2016
Last change: Thu May 19 08:54:57 2016 by hacluster via crmd on xenserver54
Stack: corosync
Current DC: xenserver52 (version 1.1.14-70404b0) - partition with quorum
4 nodes and 4 resources configured

Online: [ xenserver52 xenserver54 ]
OFFLINE: [ xenserver50 xenserver51 ]

xenserver52 and xenserver54 are Ubuntu 16.04; the others are Ubuntu 14.04.

Does anyone know what the problem is?

Sorry for my poor English.

Best regards,
 DacioMF, Network and Infrastructure Analyst


Hi,

We aim for backward compatibility, so this likely is a bug. Can you
attach the output of crm_report from around this time?

  crm_report --from "YYYY-M-D H:M:S" --to "YYYY-M-D H:M:S"

FYI, you cannot do a rolling upgrade from corosync 1 to corosync 2, but
I believe both 14.04 and 16.04 use corosync 2.


iirc there were incompatible wire changes, probably between 2.1 and 2.2 
(or 2.2 and 2.3) at least if crypto/secauth is enabled.




___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org






Re: [ClusterLabs] Creating a rule based on whether a quorum exists

2016-06-06 Thread Ken Gaillot
On 05/30/2016 08:13 AM, Les Green wrote:
> Hi All,
> 
> I have a two-node cluster with no-quorum-policy=ignore and an external
> ping responder to try to determine whether a node has its own network down
> (it's the dead one), or whether the other node is really dead.
> 
> The ping helps to determine who the master is.
> 
> I have realised in the situation where the ping responder goes down,
> both stop being the master.
> 
> Code can be seen here: https://github.com/greemo/vagrant-fabric
> 
> I currently have the following rule which prevents a node becoming a
> master unless it can access the ping resource. (I may add more ping
> resources later):
> 
> <constraints>
>   <rsc_colocation id="c_mysql_on_drbd" score="INFINITY"
>     rsc="g_mysql" with-rsc="ms_drbd_mysql" with-rsc-role="Master"/>
>   <rsc_location id="l_drbd_master_on_ping" rsc="ms_drbd_mysql">
>     <rule score="-INFINITY" role="Master" boolean-op="or"
>       id="l_drbd_master_on_ping-rule">
>       <expression attribute="pingd" operation="not_defined"
>         id="l_drbd_master_on_ping-rule-expression"/>
>       <expression attribute="pingd" operation="lte" value="0"
>         type="number" id="l_drbd_master_on_ping-rule-expression-0"/>
>     </rule>
>   </rsc_location>
>   <rsc_order id="o_drbd_promote_before_mysql" score="INFINITY"
>     first="ms_drbd_mysql" first-action="promote" then="g_mysql"
>     then-action="start"/>
> </constraints>
> 
> I want to create a rule that says "if I am not in a quorum AND I cannot
> access all the ping resources, do not become the master". I can sort out
> the ping part, but how can I determine within a Pacemaker rule if I am
> part of a quorum?
> 
> I have thought to set up a cron job using shell tools to query the CIB
> and populate an attribute, but surely there has to be an easier way...
> 
> Hopefully, Les

Not that I'm aware of. Some alternatives: set up the ping responder as a
quorum-only node instead; configure fencing and get rid of the ping
resource; list the cluster nodes in the ping resource's host_list and
change the rule to lte 1.
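
The third alternative could look roughly like this in the CIB (a sketch only; the `pingd` attribute name, ids, and host names assume the typical ocf:pacemaker:ping setup and are not from the poster's actual config):

```xml
<!-- host_list now lists the peer cluster nodes themselves, so a node that
     still reaches at least one peer scores above the threshold. -->
<primitive id="p_ping" class="ocf" provider="pacemaker" type="ping">
  <instance_attributes id="p_ping-ia">
    <nvpair id="p_ping-hosts" name="host_list" value="node1 node2"/>
  </instance_attributes>
</primitive>

<!-- ...and the rule's number expression changes from lte 0 to lte 1: -->
<expression attribute="pingd" operation="lte" value="1"
    type="number" id="l_drbd_master_on_ping-rule-expression-0"/>
```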



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Vladislav Bogdanov

06.06.2016 22:43, Ken Gaillot wrote:

On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:

06.06.2016 19:39, Ken Gaillot wrote:

On 06/05/2016 07:27 PM, Andrew Beekhof wrote:

On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot 
wrote:

On 06/02/2016 08:01 PM, Andrew Beekhof wrote:

On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot 
wrote:

A recent thread discussed a proposed new feature, a new environment
variable that would be passed to resource agents, indicating
whether a
stop action was part of a recovery.

Since that thread was long and covered a lot of topics, I'm
starting a
new one to focus on the core issue remaining:

The original idea was to pass the number of restarts remaining before
the cluster will no longer try to start the resource on the same node.
This
involves calculating (fail-count - migration-threshold), and that
implies certain limitations: (1) it will only be set when the cluster
checks migration-threshold; (2) it will only be set for the failed
resource itself, not for other resources that may be recovered due to
dependencies on it.

Ulrich Windl proposed an alternative: setting a boolean value
instead. I
forgot to cc the list on my reply, so I'll summarize now: We would
set a
new variable like OCF_RESKEY_CRM_recovery=true


This concept worries me, especially when what we've implemented is
called OCF_RESKEY_CRM_restarting.


Agreed; I plan to rename it yet again, to
OCF_RESKEY_CRM_start_expected.


The name alone encourages people to "optimise" the agent to not
actually stop the service "because it's just going to start again
shortly".  I know that's not what Adam would do, but not everyone
understands how clusters work.

There are any number of reasons why a cluster that intends to restart
a service may not do so.  In such a scenario, a badly written agent
would cause the cluster to mistakenly believe that the service is
stopped - allowing it to start elsewhere.

It's true there are any number of ways to write bad agents, but I would
argue that we shouldn't be nudging people in that direction :)


I do have mixed feelings about that. I think if we name it
start_expected, and document it carefully, we can avoid any casual
mistakes.

My main question is how useful would it actually be in the proposed use
cases. Considering the possibility that the expected start might never
happen (or fail), can an RA really do anything different if
start_expected=true?


I would have thought not.  Correctness should trump optimal.
But I'm prepared to be mistaken.


If the use case is there, I have no problem with
adding it, but I want to make sure it's worthwhile.


Anyone have comments on this?

A simple example: pacemaker calls an RA stop with start_expected=true,
then before the start happens, someone disables the resource, so the
start is never called. Or the node is fenced before the start happens,
etc.

Is there anything significant an RA can do differently based on
start_expected=true/false without causing problems if an expected start
never happens?


Yep.

It may request stop of other resources
* on that node by removing some node attributes which participate in
location constraints
* or cluster-wide by revoking/putting to standby cluster ticket other
resources depend on

The latter case is why I asked about the possibility of passing the
node name the resource is intended to be started on instead of a boolean
value (in comments to PR #1026) - I would use it to request a stop of
lustre MDTs and OSTs by revoking the ticket they depend on if MGS (the primary
lustre component which does all "request routing") fails to start
anywhere in the cluster. That way, if the RA does not receive any node name,


Why would ordering constraints be insufficient?


They are in place, but advisory ones to allow MGS fail/switch-over.


What happens if the MDTs/OSTs continue running because a start of MGS
was expected, but something prevents the start from actually happening?


Nothing critical; Lustre clients won't be able to contact them without 
MGS running and will hang.
But it is safer to shut them down if it is known that MGS cannot be 
started right now, especially if geo-cluster failover is expected in 
that case (as MGS can be local to a site, contrary to all other Lustre 
parts, which need to be replicated). Actually, that is the only part of the 
puzzle remaining to "solve" that big project, and IMHO it is enough to 
have the node name of an intended start, or nothing, in that attribute 
(nothing means stop everything and initiate geo-failover if needed). If, 
e.g., fencing happens on a node intended to start the resource, then stop 
will be called again after the next start failure once failure-timeout 
lapses. That would be much better than no information at all. A total stop 
or geo-failover will happen just with some (configurable) delay, instead 
of rendering the whole filesystem unusable and requiring manual 
intervention.
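
A dry-run sketch of that idea (ticket and resource names are invented, and a shell function stands in for the real crm_ticket tool):

```shell
#!/bin/sh
# Sketch: if the RA's stop is not accompanied by a node name for an
# expected restart, revoke the ticket the dependent MDT/OST resources
# require, so they stop too. Names are invented; "echo" replaces the
# real crm_ticket call for illustration.
crm_ticket() { echo "crm_ticket $*"; }

mgs_stop() {
    expected_node="$1"   # hypothetical: node of the expected restart, or empty
    if [ -z "$expected_node" ]; then
        # No restart intended anywhere: pull the ticket so MDTs/OSTs stop
        crm_ticket --ticket lustre-mgs --revoke
    fi
    return 0
}

mgs_stop ""        # stop with no expected restart -> ticket revoked
mgs_stop "node-b"  # stop before a restart on node-b -> ticket kept
```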





then it can be "almost sure" pacemaker does not intend to restart

Re: [ClusterLabs] Different pacemaker versions split cluster

2016-06-06 Thread Ken Gaillot
On 05/30/2016 01:14 PM, DacioMF wrote:
> Hi,
> 
> I had 4 nodes with Ubuntu 14.04 LTS in my cluster and all of them worked well. 
> I need to upgrade all my cluster nodes to Ubuntu 16.04 LTS without stopping my 
> resources. Two nodes have been updated to 16.04 and the other two remain on 
> 14.04. The problem is that my cluster was split, and the nodes with 
> Ubuntu 14.04 only work with the other node on the same version. The same is 
> true for the nodes with Ubuntu 16.04. The feature set of pacemaker in 
> Ubuntu 14.04 is v3.0.7 and in 16.04 it is v3.0.10.
> 
> The following commands shows what's happening:
> 
> root@xenserver50:/var/log/corosync# crm status
> Last updated: Thu May 19 17:19:06 2016
> Last change: Thu May 19 09:00:48 2016 via cibadmin on xenserver50
> Stack: corosync
> Current DC: xenserver51 (51) - partition with quorum
> Version: 1.1.10-42f2063
> 4 Nodes configured
> 4 Resources configured
> 
> Online: [ xenserver50 xenserver51 ]
> OFFLINE: [ xenserver52 xenserver54 ]
> 
> -
> 
> root@xenserver52:/var/log/corosync# crm status
> Last updated: Thu May 19 17:20:04 2016
> Last change: Thu May 19 08:54:57 2016 by hacluster via crmd on xenserver54
> Stack: corosync
> Current DC: xenserver52 (version 1.1.14-70404b0) - partition with quorum
> 4 nodes and 4 resources configured
> 
> Online: [ xenserver52 xenserver54 ]
> OFFLINE: [ xenserver50 xenserver51 ]
> 
> xenserver52 and xenserver54 are Ubuntu 16.04; the others are Ubuntu 14.04.
> 
> Does anyone know what the problem is?
> 
> Sorry for my poor English.
> 
> Best regards,
>  DacioMF, Network and Infrastructure Analyst

Hi,

We aim for backward compatibility, so this likely is a bug. Can you
attach the output of crm_report from around this time?

  crm_report --from "YYYY-M-D H:M:S" --to "YYYY-M-D H:M:S"

FYI, you cannot do a rolling upgrade from corosync 1 to corosync 2, but
I believe both 14.04 and 16.04 use corosync 2.



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Ken Gaillot
On 06/06/2016 12:25 PM, Vladislav Bogdanov wrote:
> 06.06.2016 19:39, Ken Gaillot wrote:
>> On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
>>> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot 
>>> wrote:
 On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot 
> wrote:
>> A recent thread discussed a proposed new feature, a new environment
>> variable that would be passed to resource agents, indicating
>> whether a
>> stop action was part of a recovery.
>>
>> Since that thread was long and covered a lot of topics, I'm
>> starting a
>> new one to focus on the core issue remaining:
>>
>> The original idea was to pass the number of restarts remaining before
>> the cluster will no longer try to start the resource on the same node.
>> This
>> involves calculating (fail-count - migration-threshold), and that
>> implies certain limitations: (1) it will only be set when the cluster
>> checks migration-threshold; (2) it will only be set for the failed
>> resource itself, not for other resources that may be recovered due to
>> dependencies on it.
>>
>> Ulrich Windl proposed an alternative: setting a boolean value
>> instead. I
>> forgot to cc the list on my reply, so I'll summarize now: We would
>> set a
>> new variable like OCF_RESKEY_CRM_recovery=true
>
> This concept worries me, especially when what we've implemented is
> called OCF_RESKEY_CRM_restarting.

 Agreed; I plan to rename it yet again, to
 OCF_RESKEY_CRM_start_expected.

> The name alone encourages people to "optimise" the agent to not
> actually stop the service "because it's just going to start again
> shortly".  I know that's not what Adam would do, but not everyone
> understands how clusters work.
>
> There are any number of reasons why a cluster that intends to restart
> a service may not do so.  In such a scenario, a badly written agent
> would cause the cluster to mistakenly believe that the service is
> stopped - allowing it to start elsewhere.
>
> It's true there are any number of ways to write bad agents, but I would
> argue that we shouldn't be nudging people in that direction :)

 I do have mixed feelings about that. I think if we name it
 start_expected, and document it carefully, we can avoid any casual
 mistakes.

 My main question is how useful would it actually be in the proposed use
 cases. Considering the possibility that the expected start might never
 happen (or fail), can an RA really do anything different if
 start_expected=true?
>>>
>>> I would have thought not.  Correctness should trump optimal.
>>> But I'm prepared to be mistaken.
>>>
 If the use case is there, I have no problem with
 adding it, but I want to make sure it's worthwhile.
>>
>> Anyone have comments on this?
>>
>> A simple example: pacemaker calls an RA stop with start_expected=true,
>> then before the start happens, someone disables the resource, so the
>> start is never called. Or the node is fenced before the start happens,
>> etc.
>>
>> Is there anything significant an RA can do differently based on
>> start_expected=true/false without causing problems if an expected start
>> never happens?
> 
> Yep.
> 
> It may request stop of other resources
> * on that node by removing some node attributes which participate in
> location constraints
> * or cluster-wide by revoking/putting to standby cluster ticket other
> resources depend on
> 
> The latter case is why I asked about the possibility of passing the
> node name the resource is intended to be started on instead of a boolean
> value (in comments to PR #1026) - I would use it to request a stop of
> lustre MDTs and OSTs by revoking the ticket they depend on if MGS (the primary
> lustre component which does all "request routing") fails to start
> anywhere in the cluster. That way, if the RA does not receive any node name,

Why would ordering constraints be insufficient?

What happens if the MDTs/OSTs continue running because a start of MGS
was expected, but something prevents the start from actually happening?

> then it can be "almost sure" pacemaker does not intend to restart the
> resource (yet) and can request it to stop everything else (because the
> filesystem is not usable anyway). Later, if another start attempt
> (caused by failure-timeout expiration) succeeds, the RA may grant the ticket
> back, and all other resources start again.
> 
> Best,
> Vladislav



Re: [ClusterLabs] how to "switch on" cLVM ?

2016-06-06 Thread Digimer
On 06/06/16 01:13 PM, Lentes, Bernd wrote:
> Hi,
> 
> I'm currently establishing a two-node cluster. I have a FC SAN and two hosts. 
> My services are integrated into virtual machines (KVM). The VMs should 
> reside on the SAN.
> The hosts are connected to the SAN via FC HBAs. Inside the hosts I already see 
> the volume from the SAN. I'd like to store each VM in a dedicated logical 
> volume (with or without a filesystem in the LV).
> Hosts are SLES 11 SP4. LVM2 and LVM2-clvm are installed.
> How do I have to "switch on" clvm? Is locking_type=3 in /etc/lvm/lvm.conf 
> all that is necessary?

That tells LVM to use cluster locking, but you still need to actually
provide the cluster locking, via DLM. With the cluster formed (and
fencing working!), you should be able to start the clvmd daemon.

> On both nodes?

Yes

> Is restarting the init scripts afterwards sufficient?

No, DLM needs to be added to the cluster and be running.

> And how do I have to proceed afterwards?
> My idea is:
> 1. Create a PV
> 2. Create a VG
> 3. Create several LV's

If the VG is created while dlm is running, it should automatically flag
the VG as clustered. If not, you will need to tell LVM that the VG is
clustered (-cy, iirc).

> And because of clvm I only have to do that on one host, and the other host 
> sees everything automatically?

Once clvmd is running, any changes made (lvcreate, delete, resize, etc)
will immediately appear on the other nodes.
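
A minimal sketch of that workflow (device path and VG/LV names are placeholders; run on one node with dlm and clvmd active cluster-wide):

```shell
# Example only -- device, VG and LV names are placeholders.
pvcreate /dev/mapper/san_lun
vgcreate -cy vg_vms /dev/mapper/san_lun   # -cy marks the VG as clustered
lvcreate -L 20G -n vm1_disk vg_vms        # one LV per VM, as planned
# With clvmd running, the new VG and LV are visible on the other node
# immediately; no rescan is needed there.
```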

> Later on it's possible that some VMs run on host 1 and some on host 2. Does 
> clvm need to be a resource managed by the cluster manager?

Yes, you can live-migrate as well. I do this all the time, except I use
DRBD instead of a SAN and RHEL instead of SUSE, but those are trivial
differences in this case.

> If I use a fs inside the LV, a "normal" fs like ext3 is sufficient, I think. 
> But it has to be a cluster resource, right?

You can format a clustered LV with a cluster-unaware filesystem just
fine. However, the FS is not made magically cluster-aware... If you
mount it on two nodes, you will almost certainly corrupt the FS quickly.
If you want to mount an LV on two+ nodes at once, you need a
cluster-aware filesystem, like GFS2.

> Thanks in advance.
> 
> 
> Bernd
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Vladislav Bogdanov

06.06.2016 19:39, Ken Gaillot wrote:

On 06/05/2016 07:27 PM, Andrew Beekhof wrote:

On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot  wrote:

On 06/02/2016 08:01 PM, Andrew Beekhof wrote:

On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot  wrote:

A recent thread discussed a proposed new feature, a new environment
variable that would be passed to resource agents, indicating whether a
stop action was part of a recovery.

Since that thread was long and covered a lot of topics, I'm starting a
new one to focus on the core issue remaining:

The original idea was to pass the number of restarts remaining before
the cluster will no longer try to start the resource on the same node. This
involves calculating (fail-count - migration-threshold), and that
implies certain limitations: (1) it will only be set when the cluster
checks migration-threshold; (2) it will only be set for the failed
resource itself, not for other resources that may be recovered due to
dependencies on it.

Ulrich Windl proposed an alternative: setting a boolean value instead. I
forgot to cc the list on my reply, so I'll summarize now: We would set a
new variable like OCF_RESKEY_CRM_recovery=true


This concept worries me, especially when what we've implemented is
called OCF_RESKEY_CRM_restarting.


Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.


The name alone encourages people to "optimise" the agent to not
actually stop the service "because it's just going to start again
shortly".  I know that's not what Adam would do, but not everyone
understands how clusters work.

There are any number of reasons why a cluster that intends to restart
a service may not do so.  In such a scenario, a badly written agent
would cause the cluster to mistakenly believe that the service is
stopped - allowing it to start elsewhere.

It's true there are any number of ways to write bad agents, but I would
argue that we shouldn't be nudging people in that direction :)


I do have mixed feelings about that. I think if we name it
start_expected, and document it carefully, we can avoid any casual mistakes.

My main question is how useful would it actually be in the proposed use
cases. Considering the possibility that the expected start might never
happen (or fail), can an RA really do anything different if
start_expected=true?


I would have thought not.  Correctness should trump optimal.
But I'm prepared to be mistaken.


If the use case is there, I have no problem with
adding it, but I want to make sure it's worthwhile.


Anyone have comments on this?

A simple example: pacemaker calls an RA stop with start_expected=true,
then before the start happens, someone disables the resource, so the
start is never called. Or the node is fenced before the start happens, etc.

Is there anything significant an RA can do differently based on
start_expected=true/false without causing problems if an expected start
never happens?


Yep.

It may request stop of other resources
* on that node by removing some node attributes which participate in 
location constraints
* or cluster-wide by revoking/putting to standby cluster ticket other 
resources depend on


The latter case is why I asked about the possibility of passing the 
node name the resource is intended to be started on instead of a boolean 
value (in comments to PR #1026) - I would use it to request a stop of 
Lustre MDTs and OSTs by revoking the ticket they depend on if MGS (the primary 
Lustre component which does all "request routing") fails to start 
anywhere in the cluster. That way, if the RA does not receive any node name, 
then it can be "almost sure" pacemaker does not intend to restart the 
resource (yet) and can request it to stop everything else (because the 
filesystem is not usable anyway). Later, if another start attempt 
(caused by failure-timeout expiration) succeeds, the RA may grant the ticket 
back, and all other resources start again.


Best,
Vladislav





[ClusterLabs] how to "switch on" cLVM ?

2016-06-06 Thread Lentes, Bernd
Hi,

I'm currently establishing a two-node cluster. I have a FC SAN and two hosts. 
My services are integrated into virtual machines (KVM). The VMs should reside 
on the SAN.
The hosts are connected to the SAN via FC HBAs. Inside the hosts I already see 
the volume from the SAN. I'd like to store each VM in a dedicated logical 
volume (with or without a filesystem in the LV).
Hosts are SLES 11 SP4. LVM2 and LVM2-clvm are installed.
How do I have to "switch on" clvm? Is locking_type=3 in /etc/lvm/lvm.conf all 
that is necessary?
On both nodes?
Is restarting the init scripts afterwards sufficient?
And how do I have to proceed afterwards?
My idea is:
1. Create a PV
2. Create a VG
3. Create several LVs
And because of clvm I only have to do that on one host, and the other host sees 
everything automatically?
Later on it's possible that some VMs run on host 1 and some on host 2. Does 
clvm need to be a resource managed by the cluster manager?
If I use a fs inside the LV, a "normal" fs like ext3 is sufficient, I think. 
But it has to be a cluster resource, right?

Thanks in advance.


Bernd

-- 
Bernd Lentes 

Systemadministration 
institute of developmental genetics 
Gebäude 35.34 - Raum 208 
HelmholtzZentrum München 
bernd.len...@helmholtz-muenchen.de 
phone: +49 (0)89 3187 1241 
fax: +49 (0)89 3187 2294 

Wer glaubt das Projektleiter Projekte leiten 
der glaubt auch das Zitronenfalter 
Zitronen falten
 

Helmholtz Zentrum Muenchen
Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
Ingolstaedter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir'in Baerbel Brumme-Bothe
Geschaeftsfuehrer: Prof. Dr. Guenther Wess, Dr. Alfons Enhsen, Renate Schlusen 
(komm.)
Registergericht: Amtsgericht Muenchen HRB 6466
USt-IdNr: DE 129521671




Re: [ClusterLabs] Informing RAs about recovery: failed resource recovery, or any start-stop cycle?

2016-06-06 Thread Ken Gaillot
On 06/05/2016 07:27 PM, Andrew Beekhof wrote:
> On Sat, Jun 4, 2016 at 12:16 AM, Ken Gaillot  wrote:
>> On 06/02/2016 08:01 PM, Andrew Beekhof wrote:
>>> On Fri, May 20, 2016 at 1:53 AM, Ken Gaillot  wrote:
 A recent thread discussed a proposed new feature, a new environment
 variable that would be passed to resource agents, indicating whether a
 stop action was part of a recovery.

 Since that thread was long and covered a lot of topics, I'm starting a
 new one to focus on the core issue remaining:

 The original idea was to pass the number of restarts remaining before
 the cluster will no longer try to start the resource on the same node. This
 involves calculating (fail-count - migration-threshold), and that
 implies certain limitations: (1) it will only be set when the cluster
 checks migration-threshold; (2) it will only be set for the failed
 resource itself, not for other resources that may be recovered due to
 dependencies on it.

 Ulrich Windl proposed an alternative: setting a boolean value instead. I
 forgot to cc the list on my reply, so I'll summarize now: We would set a
 new variable like OCF_RESKEY_CRM_recovery=true
>>>
>>> This concept worries me, especially when what we've implemented is
>>> called OCF_RESKEY_CRM_restarting.
>>
>> Agreed; I plan to rename it yet again, to OCF_RESKEY_CRM_start_expected.
>>
>>> The name alone encourages people to "optimise" the agent to not
>>> actually stop the service "because it's just going to start again
>>> shortly".  I know that's not what Adam would do, but not everyone
>>> understands how clusters work.
>>>
>>> There are any number of reasons why a cluster that intends to restart
>>> a service may not do so.  In such a scenario, a badly written agent
>>> would cause the cluster to mistakenly believe that the service is
>>> stopped - allowing it to start elsewhere.
>>>
>>> It's true there are any number of ways to write bad agents, but I would
>>> argue that we shouldn't be nudging people in that direction :)
>>
>> I do have mixed feelings about that. I think if we name it
>> start_expected, and document it carefully, we can avoid any casual mistakes.
>>
>> My main question is how useful would it actually be in the proposed use
>> cases. Considering the possibility that the expected start might never
>> happen (or fail), can an RA really do anything different if
>> start_expected=true?
> 
> I would have thought not.  Correctness should trump optimal.
> But I'm prepared to be mistaken.
> 
>> If the use case is there, I have no problem with
>> adding it, but I want to make sure it's worthwhile.

Anyone have comments on this?

A simple example: pacemaker calls an RA stop with start_expected=true,
then before the start happens, someone disables the resource, so the
start is never called. Or the node is fenced before the start happens, etc.

Is there anything significant an RA can do differently based on
start_expected=true/false without causing problems if an expected start
never happens?
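
One pattern that stays safe either way (a sketch with invented helper names, not from any real agent) is to make the stop itself unconditional, and let start_expected gate only optional extra work:

```shell
#!/bin/sh
# Sketch: OCF_RESKEY_CRM_start_expected must never be used to skip the
# actual stop -- only work that is pointless before an imminent restart.
# Helper names are invented for illustration.

service_stop()      { echo "service stopped"; }
expensive_cleanup() { echo "cleanup done"; }

ra_stop() {
    service_stop    # ALWAYS really stop, whatever start_expected says
    if [ "${OCF_RESKEY_CRM_start_expected}" != "true" ]; then
        expensive_cleanup   # safe to skip when a start should follow shortly
    fi
    return 0
}

OCF_RESKEY_CRM_start_expected="false"
ra_stop
```

If the expected start then never happens, nothing is incorrect; at worst some optional cleanup was deferred.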



Re: [ClusterLabs] [corosync] Virtual Synchrony Property guarantees in case of network partition

2016-06-06 Thread satish kumar
Thanks, really appreciate your help.

On Mon, Jun 6, 2016 at 9:17 PM, Jan Friesse  wrote:

>> But C1 is *guaranteed* to be delivered *before* m(k)?
>
> Yes
>
>> No case where C1 is delivered after m(k)?
>
> Nope.
>
>>
>> Regards,
>> Satish
>>
>> On Mon, Jun 6, 2016 at 8:10 PM, Jan Friesse  wrote:
>>
>> satish kumar wrote:
>>>
>>> Hello Honza, thanks for the response!
>>>

 With state sync, I simply mean that 'k-1' messages were delivered to N1,
 N2
 and N3 and they have applied these messages to change their program
 state.
 N1.state = apply(m(k-1);
 N2.state = apply(m(k-1);
 N3.state = apply(m(k-1);

 The document you shared cleared many doubts. However I still need one
 clarification.

 According to the document:
 "The configuration change messages warn the application that a
 membership
 change has occurred, so that the application program can take
 appropriate
 action based on the membership change. Extended virtual synchrony
 guarantees a consistent order of messages delivery across a partition,
 which is essential if the application program are to be able to
 reconcile
 their states following repair of a failed processor or reemerging of the
 partitioned network."

 I just want to know that this property is not something related to
 CPG_TYPE_SAFE, which is still not implemented.
 Please consider this scenario:
 0. N1, N2 and N3 has received the message m(k-1).
 1. N1 mcast(CPG_TYPE_AGREED) m(k) message.
 2. As it is not CPG_TYPE_SAFE, m(k) delievered to N1 but was not yet
 delivered to N2 and N3.
 3. Network partition separate N1 from N2 and N3. N2 and N3 can never see
 m(k).
 4. Configuration change message is now delivered to N1, N2 and N3.

 Here, N1 will change its state to N1.state = apply(m(k), thinking all in
 the current configuration has received the message.

 According to your reply it looks like N1 will not receive m(k). So this
 is
 what each node will see:
 N1 will see: m(k-1) -> C1 (config change)
 N2 will see: m(k-1) -> C1 (config change)
 N3 will see: m(k-1) -> C1 (config change)


>>> For N2 and N3, it's not same C1. So let's call it C2. Because C1 for N1
>>> is
>>> (N2 and N3 left) and C2 for N2 and N3 is (N1 left).
>>>
>>>
>>>
>>> Message m(k) will be discarded, and will not be delivered to N1 even if
 it
 was sent by N1 before the network partition.


>>> No. m(k) will be delivered to app running on N1. So N1 will see m(k-1),
>>> C1, m(k). So application exactly knows which node got message m(k).
>>>
>>> Regards,
>>>Honza
>>>
>>>
>>>
>>> This is the expected behavior with CPG_TYPE_AGREED?

 Regards,
 Satish


 On Mon, Jun 6, 2016 at 4:15 PM, Jan Friesse 
 wrote:

 Hi,

>
> Hello,
>
>
>> Virtual Synchrony Property - messages are delivered in agreed order
>> and
>> configuration changes are delivered in agreed order relative to
>> message.
>>
>> What happen to this property when network is partitioned the cluster
>> into
>> two. Consider following scenario (which I took from one of the
>> previous query by Andrei Elkin):
>>
>> * N1, N2 and N3 are in state sync with m(k-1) messages are delivered.
>>
>>
>> What exactly you mean by "state sync"?
>
> * N1 sends m(k) and just now network partition N1 node from N2 and N3.
>
>
>> Does CPG_TYPE_AGREED guarantee that virtual synchrony is held?
>>
>>
>> Yes it does (actually higher level of VS called EVS)
>
>
> When property is held, configuration change message C1 is guaranteed to
>
>> delivered before m(k) to N1.
>> N1 will see: m(k-1) C1 m(k)
>> N2 and N3 will see: m(k-1) C1
>>
>> But if this property is violated:
>> N1 will see: m(k-1) m(k) C1
>> N2 and N3 will see: m(k-1) C1
>>
>> Violation will screw any user application running on the cluster.
>>
>> Could someone please explain what is the behavior of Corosync in this
>> scenario with CPG_TYPE_AGREED ordering.
>>
>>
>> For description how exactly totem synchronization works take a look to
> http://corosync.github.com/corosync/doc/DAAgarwal.thesis.ps.gz
>
> After totem is synchronized, there is another level of synchronization
> of
> services (not described in above doc). All services synchronize in very
> similar way, so you can take a look to CPG as example. Basically only
> state
> held by CPG is connected clients. So every node sends it's connected
> clients list to every other node. If sync is aborted (change of
> membership), it's restarted. These sync messages has priority over user
> messages (actually it's not possible to send messages during 

Re: [ClusterLabs] [corosync] Virtual Synchrony Property guarantees in case of network partition

2016-06-06 Thread Jan Friesse

But C1 is *guaranteed *to deliver *before *m(k)? No case where C1 is


Yes


delivered after m(k)?


Nope.




Regards,
Satish

On Mon, Jun 6, 2016 at 8:10 PM, Jan Friesse  wrote:


satish kumar napsal(a):

Hello honza, thanks for the response !


With state sync, I simply mean that 'k-1' messages were delivered to N1,
N2
and N3 and they have applied these messages to change their program state.
N1.state = apply(m(k-1);
N2.state = apply(m(k-1);
N3.state = apply(m(k-1);

The document you shared cleared many doubts. However I still need one
clarification.

According to the document:
"The configuration change messages warn the application that a membership
change has occurred, so that the application program can take appropriate
action based on the membership change. Extended virtual synchrony
guarantees a consistent order of messages delivery across a partition,
which is essential if the application program are to be able to reconcile
their states following repair of a failed processor or reemerging of the
partitioned network."

I just want to know that this property is not something related to
CPG_TYPE_SAFE, which is still not implemented.
Please consider this scenario:
0. N1, N2 and N3 has received the message m(k-1).
1. N1 mcast(CPG_TYPE_AGREED) m(k) message.
2. As it is not CPG_TYPE_SAFE, m(k) delievered to N1 but was not yet
delivered to N2 and N3.
3. Network partition separate N1 from N2 and N3. N2 and N3 can never see
m(k).
4. Configuration change message is now delivered to N1, N2 and N3.

Here, N1 will change its state to N1.state = apply(m(k), thinking all in
the current configuration has received the message.

According to your reply it looks like N1 will not receive m(k). So this is
what each node will see:
N1 will see: m(k-1) -> C1 (config change)
N2 will see: m(k-1) -> C1 (config change)
N3 will see: m(k-1) -> C1 (config change)



For N2 and N3, it's not the same C1, so let's call it C2: C1 for N1 is
(N2 and N3 left) and C2 for N2 and N3 is (N1 left).




Message m(k) will be discarded, and will not be delivered to N1 even if it
was sent by N1 before the network partition.



No. m(k) will be delivered to the app running on N1, so N1 will see m(k-1),
C1, m(k). The application therefore knows exactly which nodes got message m(k).

Regards,
   Honza




This is the expected behavior with CPG_TYPE_AGREED?

Regards,
Satish


On Mon, Jun 6, 2016 at 4:15 PM, Jan Friesse  wrote:

Hi,


Hello,



Virtual Synchrony Property - messages are delivered in agreed order and
configuration changes are delivered in agreed order relative to message.

What happen to this property when network is partitioned the cluster
into
two. Consider following scenario (which I took from one of the
previous query by Andrei Elkin):

* N1, N2 and N3 are in state sync with m(k-1) messages are delivered.



What exactly you mean by "state sync"?

* N1 sends m(k) and just now network partition N1 node from N2 and N3.



Does CPG_TYPE_AGREED guarantee that virtual synchrony is held?



Yes it does (actually higher level of VS called EVS)


When property is held, configuration change message C1 is guaranteed to

delivered before m(k) to N1.
N1 will see: m(k-1) C1 m(k)
N2 and N3 will see: m(k-1) C1

But if this property is violated:
N1 will see: m(k-1) m(k) C1
N2 and N3 will see: m(k-1) C1

Violation will screw any user application running on the cluster.

Could someone please explain what is the behavior of Corosync in this
scenario with CPG_TYPE_AGREED ordering.



For description how exactly totem synchronization works take a look to
http://corosync.github.com/corosync/doc/DAAgarwal.thesis.ps.gz

After totem is synchronized, there is another level of synchronization of
services (not described in above doc). All services synchronize in very
similar way, so you can take a look to CPG as example. Basically only
state
held by CPG is connected clients. So every node sends it's connected
clients list to every other node. If sync is aborted (change of
membership), it's restarted. These sync messages has priority over user
messages (actually it's not possible to send messages during sync). User
app can be sure that message was delivered only after it gets it's own
message. Also app gets configuration change message so it knows, who got
the message.

Regards,
Honza


Regards,

Satish




Re: [ClusterLabs] [corosync] Virtual Synchrony Property guarantees in case of network partition

2016-06-06 Thread satish kumar
But C1 is *guaranteed* to be delivered *before* m(k)? There is no case where
C1 is delivered after m(k)?


Regards,
Satish

On Mon, Jun 6, 2016 at 8:10 PM, Jan Friesse  wrote:

> satish kumar napsal(a):
>
> Hello honza, thanks for the response !
>>
>> With state sync, I simply mean that 'k-1' messages were delivered to N1,
>> N2
>> and N3 and they have applied these messages to change their program state.
>> N1.state = apply(m(k-1);
>> N2.state = apply(m(k-1);
>> N3.state = apply(m(k-1);
>>
>> The document you shared cleared many doubts. However I still need one
>> clarification.
>>
>> According to the document:
>> "The configuration change messages warn the application that a membership
>> change has occurred, so that the application program can take appropriate
>> action based on the membership change. Extended virtual synchrony
>> guarantees a consistent order of messages delivery across a partition,
>> which is essential if the application program are to be able to reconcile
>> their states following repair of a failed processor or reemerging of the
>> partitioned network."
>>
>> I just want to know that this property is not something related to
>> CPG_TYPE_SAFE, which is still not implemented.
>> Please consider this scenario:
>> 0. N1, N2 and N3 has received the message m(k-1).
>> 1. N1 mcast(CPG_TYPE_AGREED) m(k) message.
>> 2. As it is not CPG_TYPE_SAFE, m(k) delievered to N1 but was not yet
>> delivered to N2 and N3.
>> 3. Network partition separate N1 from N2 and N3. N2 and N3 can never see
>> m(k).
>> 4. Configuration change message is now delivered to N1, N2 and N3.
>>
>> Here, N1 will change its state to N1.state = apply(m(k), thinking all in
>> the current configuration has received the message.
>>
>> According to your reply it looks like N1 will not receive m(k). So this is
>> what each node will see:
>> N1 will see: m(k-1) -> C1 (config change)
>> N2 will see: m(k-1) -> C1 (config change)
>> N3 will see: m(k-1) -> C1 (config change)
>>
>
> For N2 and N3, it's not same C1. So let's call it C2. Because C1 for N1 is
> (N2 and N3 left) and C2 for N2 and N3 is (N1 left).
>
>
>
>> Message m(k) will be discarded, and will not be delivered to N1 even if it
>> was sent by N1 before the network partition.
>>
>
> No. m(k) will be delivered to app running on N1. So N1 will see m(k-1),
> C1, m(k). So application exactly knows which node got message m(k).
>
> Regards,
>   Honza
>
>
>
>> This is the expected behavior with CPG_TYPE_AGREED?
>>
>> Regards,
>> Satish
>>
>>
>> On Mon, Jun 6, 2016 at 4:15 PM, Jan Friesse  wrote:
>>
>> Hi,
>>>
>>> Hello,
>>>

 Virtual Synchrony Property - messages are delivered in agreed order and
 configuration changes are delivered in agreed order relative to message.

 What happen to this property when network is partitioned the cluster
 into
 two. Consider following scenario (which I took from one of the
 previous query by Andrei Elkin):

 * N1, N2 and N3 are in state sync with m(k-1) messages are delivered.


>>> What exactly you mean by "state sync"?
>>>
>>> * N1 sends m(k) and just now network partition N1 node from N2 and N3.
>>>

 Does CPG_TYPE_AGREED guarantee that virtual synchrony is held?


>>> Yes it does (actually higher level of VS called EVS)
>>>
>>>
>>> When property is held, configuration change message C1 is guaranteed to
 delivered before m(k) to N1.
 N1 will see: m(k-1) C1 m(k)
 N2 and N3 will see: m(k-1) C1

 But if this property is violated:
 N1 will see: m(k-1) m(k) C1
 N2 and N3 will see: m(k-1) C1

 Violation will screw any user application running on the cluster.

 Could someone please explain what is the behavior of Corosync in this
 scenario with CPG_TYPE_AGREED ordering.


>>> For description how exactly totem synchronization works take a look to
>>> http://corosync.github.com/corosync/doc/DAAgarwal.thesis.ps.gz
>>>
>>> After totem is synchronized, there is another level of synchronization of
>>> services (not described in above doc). All services synchronize in very
>>> similar way, so you can take a look to CPG as example. Basically only
>>> state
>>> held by CPG is connected clients. So every node sends it's connected
>>> clients list to every other node. If sync is aborted (change of
>>> membership), it's restarted. These sync messages has priority over user
>>> messages (actually it's not possible to send messages during sync). User
>>> app can be sure that message was delivered only after it gets it's own
>>> message. Also app gets configuration change message so it knows, who got
>>> the message.
>>>
>>> Regards,
>>>Honza
>>>
>>>
>>> Regards,
 Satish




Re: [ClusterLabs] [Problem] Start is carried out twice.

2016-06-06 Thread renayama19661014
Hi All, 

When a node joins while a resource start is taking a long time, the start of the
resource is carried out twice.

Step 1) Put a sleep in the start action of the Dummy
resource (/usr/lib/ocf/resource.d/heartbeat/Dummy).
 (snip)
dummy_start() {
 sleep 60
 dummy_monitor
 if [ $? = $OCF_SUCCESS ]; then
(snip)

Step 2) Start one node and load the crm file. 

### Cluster Option ###
property no-quorum-policy="ignore" \ 
stonith-enabled="false" \
crmd-transition-delay="2s" 

### Resource Defaults ###
rsc_defaults resource-stickiness="INFINITY" \ 
migration-threshold="1" 

### Group Configuration ###
group grpDummy \
 prmDummy1 \
 prmDummy2 

### Primitive Configuration ###
primitive prmDummy1 ocf:heartbeat:Dummy \
 op start interval="0s" timeout="120s" on-fail="restart" \
 op monitor interval="10s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block" 
primitive prmDummy2 ocf:heartbeat:Dummy \
 op start interval="0s" timeout="120s" on-fail="restart"\
 op monitor interval="10s" timeout="60s" on-fail="restart" \
 op stop interval="0s" timeout="60s" on-fail="block" 

### Resource Location ###
location rsc_location-grpDummy-1 grpDummy \
 rule 200: #uname eq vm1 \
 rule 100: #uname eq vm2

 
Step 3) While the start of prmDummy1 is in progress, start the second node. 
The start of prmDummy1 is then carried out twice.

 [root@vm1 ~]# grep Initiating /var/log/ha-log
Jun  6 23:55:15 rh72-01 crmd[2921]:  notice: Initiating start operation 
prmDummy1_start_0 locally on vm1
Jun  6 23:56:17 rh72-01 crmd[2921]:  notice: Initiating start operation 
prmDummy1_start_0 locally on vm1 

Starting the resource a second time while the first start has not yet completed 
is undesirable. It seems to be caused by the record of the still-incomplete 
start operation being deleted when the node joins.


I registered these contents with Bugzilla.
  * http://bugs.clusterlabs.org/show_bug.cgi?id=5286

I attach the file which I collected in crm_report to Bugzilla.


Best Regards,
Hideo Yamauchi.





Re: [ClusterLabs] [corosync] Virtual Synchrony Property guarantees in case of network partition

2016-06-06 Thread satish kumar
Hello honza, thanks for the response !

With state sync, I simply mean that 'k-1' messages were delivered to N1, N2
and N3 and they have applied these messages to change their program state.
N1.state = apply(m(k-1));
N2.state = apply(m(k-1));
N3.state = apply(m(k-1));

The document you shared cleared many doubts. However I still need one
clarification.

According to the document:
"The configuration change messages warn the application that a membership
change has occurred, so that the application program can take appropriate
action based on the membership change. Extended virtual synchrony
guarantees a consistent order of messages delivery across a partition,
which is essential if the application program are to be able to reconcile
their states following repair of a failed processor or reemerging of the
partitioned network."

I just want to know that this property is not something related to
CPG_TYPE_SAFE, which is still not implemented.
Please consider this scenario:
0. N1, N2 and N3 have received the message m(k-1).
1. N1 mcasts the m(k) message with CPG_TYPE_AGREED.
2. As it is not CPG_TYPE_SAFE, m(k) is delivered to N1 but has not yet been
delivered to N2 and N3.
3. A network partition separates N1 from N2 and N3. N2 and N3 can never see
m(k).
4. The configuration change message is now delivered to N1, N2 and N3.

Here, N1 will change its state to N1.state = apply(m(k)), thinking everyone in
the current configuration has received the message.

According to your reply it looks like N1 will not receive m(k). So this is
what each node will see:
N1 will see: m(k-1) -> C1 (config change)
N2 will see: m(k-1) -> C1 (config change)
N3 will see: m(k-1) -> C1 (config change)

Message m(k) will be discarded, and will not be delivered to N1 even if it
was sent by N1 before the network partition.

This is the expected behavior with CPG_TYPE_AGREED?

Regards,
Satish


On Mon, Jun 6, 2016 at 4:15 PM, Jan Friesse  wrote:

> Hi,
>
> Hello,
>>
>> Virtual Synchrony Property - messages are delivered in agreed order and
>> configuration changes are delivered in agreed order relative to message.
>>
>> What happen to this property when network is partitioned the cluster into
>> two. Consider following scenario (which I took from one of the
>> previous query by Andrei Elkin):
>>
>> * N1, N2 and N3 are in state sync with m(k-1) messages are delivered.
>>
>
> What exactly you mean by "state sync"?
>
> * N1 sends m(k) and just now network partition N1 node from N2 and N3.
>>
>> Does CPG_TYPE_AGREED guarantee that virtual synchrony is held?
>>
>
> Yes it does (actually higher level of VS called EVS)
>
>
>> When property is held, configuration change message C1 is guaranteed to
>> delivered before m(k) to N1.
>> N1 will see: m(k-1) C1 m(k)
>> N2 and N3 will see: m(k-1) C1
>>
>> But if this property is violated:
>> N1 will see: m(k-1) m(k) C1
>> N2 and N3 will see: m(k-1) C1
>>
>> Violation will screw any user application running on the cluster.
>>
>> Could someone please explain what is the behavior of Corosync in this
>> scenario with CPG_TYPE_AGREED ordering.
>>
>
> For description how exactly totem synchronization works take a look to
> http://corosync.github.com/corosync/doc/DAAgarwal.thesis.ps.gz
>
> After totem is synchronized, there is another level of synchronization of
> services (not described in above doc). All services synchronize in very
> similar way, so you can take a look to CPG as example. Basically only state
> held by CPG is connected clients. So every node sends it's connected
> clients list to every other node. If sync is aborted (change of
> membership), it's restarted. These sync messages has priority over user
> messages (actually it's not possible to send messages during sync). User
> app can be sure that message was delivered only after it gets it's own
> message. Also app gets configuration change message so it knows, who got
> the message.
>
> Regards,
>   Honza
>
>
>> Regards,
>> Satish
>>
>>
>>


Re: [ClusterLabs] mail server (postfix)

2016-06-06 Thread Vladislav Bogdanov

05.06.2016 22:22, Dimitri Maziuk wrote:

On 06/04/2016 01:02 PM, Vladislav Bogdanov wrote:


I'd modify RA to support master/slave concept.


I'm assuming you use a shared mail store on your imapd cluster? I want


No, I use cyrus internal replication.


to host the storage on the same cluster with mail daemons. I want to a)
stop accepting mail, b) fail-over drbd maildirs, then c) restart postfix
in send-only "slave" configuration. On the other node I could simply
restart the "master" postfix after b), but on the node going passive the
b) has to be between a) and c).


Do you have a reason for b) to be strictly between a) and c)?

I'd propose something like the following:
0a) the service is a master on one node (nodeA) - it listens on the socket and 
stores mail in DRBD-backed maildirs.

0b) service is a slave on second node (nodeB) - send-only config
1) stop VIP on nodeA
2) demote service on nodeA (replace config and restart/reload it 
internally in the RA) - that would combine your a) and c)

3) demote DRBD on nodeA (first part of your b) )
4) promote DRBD on nodeB (second part of b) )
5) promote service on nodeB - replace config and internally reload/restart
6) start VIP on nodeB

1-2 and 5-6 pairs may need to be reversed if you bind service to a 
specific VIP (instead of listening on INADDR_ANY).


For 2) and 5) to work correctly you need to colocate the service in the 
*master* role with DRBD in the *master* role. That way the "slave" service 
instance does not require DRBD at all.
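The config swap in steps 2) and 5) can be sketched as shell fragments of such an RA. The file names, the PF_DIR variable and the RELOAD hook are assumptions for illustration, not part of any shipped agent:

```shell
# Sketch of the demote/promote config swap; paths and helper names are
# illustrative assumptions, not an existing agent's interface.
PF_DIR=${PF_DIR:-/etc/postfix}        # postfix configuration directory
RELOAD=${RELOAD:-"postfix reload"}    # how the new config is applied

postfix_demote() {
    # steps 1)-2): switch to the send-only config and reload, so the
    # node stops accepting mail before DRBD is demoted underneath it
    cp "$PF_DIR/main.cf.slave" "$PF_DIR/main.cf" && $RELOAD
}

postfix_promote() {
    # step 5): switch back to the full config and reload; runs only
    # after DRBD has been promoted on this node
    cp "$PF_DIR/main.cf.master" "$PF_DIR/main.cf" && $RELOAD
}
```

The colocation constraint described above is what guarantees the promote side only ever runs where the DRBD master (and thus the spool) lives.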


Hope this helps,
Vladislav




Re: [ClusterLabs] [corosync] Virtual Synchrony Property guarantees in case of network partition

2016-06-06 Thread Jan Friesse

Hi,


Hello,

Virtual Synchrony Property - messages are delivered in agreed order and
configuration changes are delivered in agreed order relative to message.

What happen to this property when network is partitioned the cluster into
two. Consider following scenario (which I took from one of the
previous query by Andrei Elkin):

* N1, N2 and N3 are in state sync with m(k-1) messages are delivered.


What exactly you mean by "state sync"?


* N1 sends m(k) and just now network partition N1 node from N2 and N3.

Does CPG_TYPE_AGREED guarantee that virtual synchrony is held?


Yes it does (actually higher level of VS called EVS)



When property is held, configuration change message C1 is guaranteed to
delivered before m(k) to N1.
N1 will see: m(k-1) C1 m(k)
N2 and N3 will see: m(k-1) C1

But if this property is violated:
N1 will see: m(k-1) m(k) C1
N2 and N3 will see: m(k-1) C1

Violation will screw any user application running on the cluster.

Could someone please explain what is the behavior of Corosync in this
scenario with CPG_TYPE_AGREED ordering.


For a description of how exactly totem synchronization works, take a look at 
http://corosync.github.com/corosync/doc/DAAgarwal.thesis.ps.gz


After totem is synchronized, there is another level of synchronization of 
services (not described in the above doc). All services synchronize in a very 
similar way, so you can take CPG as an example. Basically, the only state held 
by CPG is the list of connected clients, so every node sends its 
connected-client list to every other node. If sync is aborted (by a membership 
change), it is restarted. These sync messages have priority over user messages 
(it is actually not possible to send messages during sync). A user app can be 
sure that a message was delivered only after it receives its own copy of that 
message. The app also gets the configuration change message, so it knows who 
got the message.


Regards,
  Honza



Regards,
Satish





Re: [ClusterLabs] Few questions regarding corosync authkey

2016-06-06 Thread Jan Friesse

Hi,

Would like to understand how secure is the corosync authkey.
As the authkey is a binary file, how is the private key saved inside the
authkey?


Corosync uses symmetric encryption, so there is no public certificate. 
authkey = private key



What safeguard mechanisms are in place if the private key is compromised?


No safeguard mechanisms. Compromised authkey = problem.


For e.g I don't think it uses any temporary session key which refreshes
periodically.


Exactly


Is it possible to dynamically update the key without causing any outage?


Nope
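Since there is no online rekeying, replacing a (possibly compromised) authkey means distributing a new file and restarting corosync everywhere. A small sanity check before distribution can be sketched like this; the 128-byte minimum matches what corosync-keygen produces by default, but treat that size and the mode check as assumptions to verify against your version:

```shell
# Sketch: sanity-check an authkey file before copying it to other nodes.
# The expected minimum size (128 bytes) and the mode check are assumptions.
check_authkey() {
    key=$1
    [ -f "$key" ] || { echo "missing $key" >&2; return 1; }
    # symmetric key = one shared secret: insist on root-only permissions
    mode=$(stat -c %a "$key")
    case $mode in
        400|600) : ;;
        *) echo "unsafe mode $mode on $key" >&2; return 1 ;;
    esac
    # corosync-keygen writes a 128-byte key by default
    [ "$(wc -c < "$key")" -ge 128 ] || { echo "$key too short" >&2; return 1; }
}
```

After copying the new key into place on every node (mode 0400, owner root), corosync has to be restarted node by node, so plan for the outage the "Nope" above implies.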

Regards,
  Honza



-Thanks
Nikhil





[ClusterLabs] Few questions regarding corosync authkey

2016-06-06 Thread Nikhil Utane
Hi,

I would like to understand how secure the corosync authkey is.
As the authkey is a binary file, how is the private key stored inside it?
What safeguard mechanisms are in place if the private key is compromised?
For example, I don't think it uses any temporary session key that is refreshed
periodically.
Is it possible to dynamically update the key without causing any outage?

-Thanks
Nikhil