Re: [ClusterLabs] Antw: Any CLVM/DLM users around?

2018-10-01 Thread Vladislav Bogdanov
On October 1, 2018 8:01:36 PM UTC, Patrick Whitney  
wrote:

[...]

>so we were lucky enough our test environment is a KVM/libvirt
>environment,
>so I used fence_virsh.  Again, I had the same problem... when the "bad"
>node was fenced, dlm_controld would issue (what appears to be) a
>fence_all,
>and I would receive messages that the dlm clone was down on all
>members and would have a log message that the clvm lockspace was
>abandoned.

What is your dlm version btw?

[...]




Re: [ClusterLabs] Antw: Any CLVM/DLM users around?

2018-10-01 Thread Patrick Whitney
Hi Ulrich,

When I first encountered this issue, I posted this:

https://lists.clusterlabs.org/pipermail/users/2018-September/015637.html

... I was using resource fencing in this example, but, as I've mentioned
before, the issue would come about not when fencing occurred, but when the
fenced node was shut down.

During that discussion, you and others suggested that power fencing was the
only way DLM was going to cooperate, and using meatware was one of the
suggestions.

Unfortunately, I found out later that meatware was no longer available (
https://lists.clusterlabs.org/pipermail/users/2018-September/015715.html).
Luckily, our test environment is a KVM/libvirt environment, so I used
fence_virsh instead.  Again, I had the same problem... when the "bad" node
was fenced, dlm_controld would issue (what appears to be) a fence_all, I
would receive messages that the dlm clone was down on all members, and I
would see a log message that the clvm lockspace was abandoned.
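
(For illustration only: a fence_virsh stonith resource of the kind I mean,
in crm syntax. The hypervisor address, login, key path, domain name and
node name below are placeholders, not our real values.)

  primitive st-virsh stonith:fence_virsh \
          params ipaddr=192.168.122.1 login=root \
                 identity_file=/root/.ssh/id_rsa \
                 port=pcmk-test-2 pcmk_host_list=pcmk-test-2 \
          op monitor interval=60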

Only when I disabled fencing for dlm (enable_fencing=0 in dlm.conf, while
keeping fencing enabled in pcmk) did things begin to work as expected.
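
(Concretely, that is a one-line change; dlm_controld normally reads
/etc/dlm/dlm.conf, so the sketch is simply:)

  # /etc/dlm/dlm.conf: tell dlm_controld not to initiate its own fencing
  enable_fencing=0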

One suggestion earlier in this thread was to try disabling DLM's startup
fencing (enable_startup_fencing=0), which sounds like a plausible solution
after looking over the logs, but I haven't tested it yet.

The conclusions I'm coming to are:
1. The reason DLM cannot handle resource fencing is that it keeps its own
"heartbeat/control" channel (for lack of a better term) over the network,
and pcmk cannot instruct DLM "don't worry about that guy over there", which
means we must use power fencing; but
2. DLM does not like to see one of its members disappear; when that
happens, DLM does "something" which causes the lockspace to disappear...
unless you disable fencing for DLM.

I am now speculating that DLM restarts when communications fail, and that
disabling startup fencing for DLM (enable_startup_fencing=0) may be the
solution to my problem (allowing me to revert my enable_fencing=0 DLM
config).

Best,
-Pat

On Mon, Oct 1, 2018 at 3:38 PM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> It would be much more helpful, if you could provide logs around the
> problem events. Personally I think you _must_ implement proper fencing. In
> addition, DLM seems to do its own fencing when there is a communication
> problem.
>
> Regards,
> Ulrich
>
>
> >>> Patrick Whitney  01.10.18 16.25 Uhr >>>
> Hi Everyone,
>
> I wanted to solicit input on my configuration.
>
> I have a two node (test) cluster running corosync/pacemaker with DLM and
> CLVM.
>
> I was running into an issue where when one node failed, the remaining node
> would appear to do the right thing, from the pcmk perspective, that is.
>  It would  create a new cluster (of one) and fence the other node, but
> then, rather surprisingly, DLM would see the other node offline, and it
> would go offline itself, abandoning the lockspace.
>
> I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
> our tests are now working as expected.
>
> I'm a little concern I have masked an issue by doing this, as in all of the
> tutorials and docs I've read, there is no mention of having to configure
> DLM whatsoever.
>
> Is anyone else running a similar stack and can comment?
>
> Best,
> -Pat
> --
> Patrick Whitney
> DevOps Engineer -- Tools
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


-- 
Patrick Whitney
DevOps Engineer -- Tools


[ClusterLabs] Antw: Re: Any CLVM/DLM users around?

2018-10-01 Thread Ulrich Windl
Hi!

Maybe explain how your "node failed": Usually when a node has failed, fencing 
is just to make sure that the node is really dead, so in most cases it won't 
actually do a thing unless a false "node dead" had been detected. Fencing makes 
sure that the fenced node has absolutely no chance to access shared storage 
until it has joined the cluster again (after reboot).

Regards,
Ulrich


>>> Patrick Whitney  01.10.18 19.44 Uhr >>>
We tested with both, and experienced the same behavior using both fencing
strategies:  an abandoned DLM lockspace.   More than once, within this
forum, I've heard that DLM only supports power fencing, but without
explanation.  Can you explain why DLM requires power fencing?

Best,
-Pat

On Mon, Oct 1, 2018 at 1:38 PM Vladislav Bogdanov 
wrote:

> On October 1, 2018 4:55:07 PM UTC, Patrick Whitney 
> wrote:
> >>
> >> Fencing in clustering is always required, but unlike pacemaker that
> >lets
> >> you turn it off and take your chances, DLM doesn't.
> >
> >
> >As a matter of fact, DLM has a setting "enable_fencing=0|1" for what
> >that's
> >worth.
> >
> >
> >> You must have
> >> working fencing for DLM (and anything using it) to function
> >correctly.
> >>
> >
> >We do have fencing enabled in the cluster; we've tested both node level
> >fencing and resource fencing; DLM behaved identically in both
> >scenarios,
> >until we set it to 'enable_fencing=0' in the dlm.conf file.
>
> Do you have power or fabric fencing? Dlm requires former.
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


-- 
Patrick Whitney
DevOps Engineer -- Tools



[ClusterLabs] Antw: Any CLVM/DLM users around?

2018-10-01 Thread Ulrich Windl
Hi!

It would be much more helpful if you could provide logs around the problem 
events. Personally, I think you _must_ implement proper fencing. In addition, 
DLM seems to do its own fencing when there is a communication problem.

Regards,
Ulrich


>>> Patrick Whitney  01.10.18 16.25 Uhr >>>
Hi Everyone,

I wanted to solicit input on my configuration.

I have a two node (test) cluster running corosync/pacemaker with DLM and
CLVM.

I was running into an issue where, when one node failed, the remaining node
would appear to do the right thing from the pcmk perspective; that is, it
would create a new cluster (of one) and fence the other node. But then,
rather surprisingly, DLM would see the other node offline, and it would go
offline itself, abandoning the lockspace.

I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
our tests are now working as expected.

I'm a little concerned that I have masked an issue by doing this, as in all
of the tutorials and docs I've read there is no mention of having to
configure DLM whatsoever.

Is anyone else running a similar stack and can comment?

Best,
-Pat
-- 
Patrick Whitney
DevOps Engineer -- Tools



Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Vladislav Bogdanov
On October 1, 2018 5:44:20 PM UTC, Patrick Whitney  
wrote:
>We tested with both, and experienced the same behavior using both
>fencing
>strategies:  an abandoned DLM lockspace.   More than once, within this
>forum, I've heard that DLM only supports power fencing, but without
>explanation.  Can you explain why DLM requires power fencing?

The main part of dlm runs inside the kernel, and it is very hard, if not 
impossible, to return it to a pristine state programmatically, especially if 
filesystems like gfs2 run on top. If I understand correctly, Sistina originally 
developed dlm mainly for their gfs1, and only later for lvm. Things have 
changed, but I believe the original design remains.

>
>Best,
>-Pat
>
>On Mon, Oct 1, 2018 at 1:38 PM Vladislav Bogdanov
>
>wrote:
>
>> On October 1, 2018 4:55:07 PM UTC, Patrick Whitney
>
>> wrote:
>> >>
>> >> Fencing in clustering is always required, but unlike pacemaker
>that
>> >lets
>> >> you turn it off and take your chances, DLM doesn't.
>> >
>> >
>> >As a matter of fact, DLM has a setting "enable_fencing=0|1" for what
>> >that's
>> >worth.
>> >
>> >
>> >> You must have
>> >> working fencing for DLM (and anything using it) to function
>> >correctly.
>> >>
>> >
>> >We do have fencing enabled in the cluster; we've tested both node
>level
>> >fencing and resource fencing; DLM behaved identically in both
>> >scenarios,
>> >until we set it to 'enable_fencing=0' in the dlm.conf file.
>>
>> Do you have power or fabric fencing? Dlm requires former.
>>
>>
>> ___
>> Users mailing list: Users@clusterlabs.org
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>



Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Patrick Whitney
We tested with both, and experienced the same behavior using both fencing
strategies:  an abandoned DLM lockspace.   More than once, within this
forum, I've heard that DLM only supports power fencing, but without
explanation.  Can you explain why DLM requires power fencing?

Best,
-Pat

On Mon, Oct 1, 2018 at 1:38 PM Vladislav Bogdanov 
wrote:

> On October 1, 2018 4:55:07 PM UTC, Patrick Whitney 
> wrote:
> >>
> >> Fencing in clustering is always required, but unlike pacemaker that
> >lets
> >> you turn it off and take your chances, DLM doesn't.
> >
> >
> >As a matter of fact, DLM has a setting "enable_fencing=0|1" for what
> >that's
> >worth.
> >
> >
> >> You must have
> >> working fencing for DLM (and anything using it) to function
> >correctly.
> >>
> >
> >We do have fencing enabled in the cluster; we've tested both node level
> >fencing and resource fencing; DLM behaved identically in both
> >scenarios,
> >until we set it to 'enable_fencing=0' in the dlm.conf file.
>
> Do you have power or fabric fencing? Dlm requires former.
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>


-- 
Patrick Whitney
DevOps Engineer -- Tools


Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Vladislav Bogdanov
On October 1, 2018 4:55:07 PM UTC, Patrick Whitney  
wrote:
>>
>> Fencing in clustering is always required, but unlike pacemaker that
>lets
>> you turn it off and take your chances, DLM doesn't.
>
>
>As a matter of fact, DLM has a setting "enable_fencing=0|1" for what
>that's
>worth.
>
>
>> You must have
>> working fencing for DLM (and anything using it) to function
>correctly.
>>
>
>We do have fencing enabled in the cluster; we've tested both node level
>fencing and resource fencing; DLM behaved identically in both
>scenarios,
>until we set it to 'enable_fencing=0' in the dlm.conf file.

Do you have power or fabric fencing? DLM requires the former.
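
By power fencing I mean an agent that actually power-cycles the node, for
example something IPMI/BMC based such as fence_ipmilan. A rough crm-syntax
sketch, with placeholder address, credentials and node name:

  primitive st-ipmi-node1 stonith:fence_ipmilan \
          params ipaddr=10.0.0.101 login=admin passwd=secret \
                 lanplus=1 pcmk_host_list=node1 \
          op monitor interval=60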




Re: [ClusterLabs] Antw: Re: How is fencing and unfencing suppose to work?

2018-10-01 Thread Digimer
On 2018-10-01 03:06 AM, Ulrich Windl wrote:
> >>> digimer  wrote on 28.09.2018 at 19:11 in message
> <968d00cd-fad5-8f17-edfd-7787a9964...@alteeve.ca>:
>> On 2018-09-04 8:49 p.m., Ken Gaillot wrote:
>>> On Tue, 2018-08-21 at 10:23 -0500, Ryan Thomas wrote:
 I’m seeing unexpected behavior when using “unfencing” – I don’t think
 I’m understanding it correctly.  I configured a resource that
 “requires unfencing” and have a custom fencing agent which “provides
 unfencing”.   I perform a simple test where I setup the cluster and
 then run “pcs stonith fence node2”, and I see that node2 is
 successfully fenced by sending an “off” action to my fencing agent.
 But, immediately after this, I see an “on” action sent to my fencing
 agent.  My fence agent doesn’t implement the “reboot” action, so
 perhaps its trying to reboot by running an off action followed by a
 on action.  Prior to adding “provides unfencing” to the fencing
 agent, I didn’t see the on action. It seems unsafe to say “node2 you
 can’t run” and then immediately “ you can run”.
>>> I'm not as familiar with unfencing as I'd like, but I believe the basic
>>> idea is:
>>>
>>> - the fence agent's off action cuts the machine off from something
>>> essential needed to run resources (generally shared storage or network
>>> access)
>>>
>>> - the fencing works such that a fenced host is not able to request
>>> rejoining the cluster without manual intervention by a sysadmin
>>>
>>> - when the sysadmin allows the host back into the cluster, and it
>>> contacts the other nodes to rejoin, the cluster will call the fence
>>> agent's on action, which is expected to re-enable the host's access
>>>
>>> How that works in practice, I have only vague knowledge.
>>
>> This is correct. Consider fabric fencing where fiber channel ports are 
>> disconnected. Unfence restores the connection. Similar to a pure 'off' 
>> fence call to switched PDUs, as you mention above. Unfence powers the 
>> outlets back up.
> 
> I doubt whether successful fencing can be emulated by "pausing" I/O: when
> re-establishing the fabric, outstanding I/Os might be performed (which cannot
> happen after real fencing).
> 
> [...]
> 
> Regards,
> Ulrich

I have never been a fan of fabric fencing, and that is exactly one reason
why. Another is panicked admins who, not understanding what's happening,
turn ports back up. To me, power fencing is the only sensible option.

That said, the question was about unfence, and that is what I was
addressing. :)


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Digimer
On 2018-10-01 12:55 PM, Patrick Whitney wrote:
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't.
> 
> 
> As a matter of fact, DLM has a setting "enable_fencing=0|1" for what
> that's worth.   

I did not know that... Interesting. Dangerous, but interesting.

> You must have
> working fencing for DLM (and anything using it) to function correctly.
> 
> 
> We do have fencing enabled in the cluster; we've tested both node level
> fencing and resource fencing; DLM behaved identically in both scenarios,
> until we set it to 'enable_fencing=0' in the dlm.conf file. 
>  
> 
> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
> 
> This isn't quite what I was seeing in the logs.  The "failed" node would
> be fenced off, pacemaker appeared to be sane, reporting services running
> on the running nodes, but once the failed node was seen as missing by
> dlm (dlm_controld), dlm would request fencing, from what I can tell by
> the log entry.  Here is an example of the suspect log entry:
> Sep 26 09:41:35 pcmk-test-1 dlm_controld[837]: 38 fence request 2 pid
> 1446 startup time 1537969264 fence_all dlm_stonith
>  
> 
> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.
> 
> 
> Can you speak more to what "proper fencing" is for DLM? 
> 
> Best,
> -Pat
> 
>   
> 
> On Mon, Oct 1, 2018 at 12:30 PM Digimer  wrote:
> 
> On 2018-10-01 12:04 PM, Ferenc Wágner wrote:
> > Patrick Whitney  writes:
> >
> >> I have a two node (test) cluster running corosync/pacemaker with DLM
> >> and CLVM.
> >>
> >> I was running into an issue where when one node failed, the
> remaining node
> >> would appear to do the right thing, from the pcmk perspective,
> that is.
> >> It would  create a new cluster (of one) and fence the other node, but
> >> then, rather surprisingly, DLM would see the other node offline,
> and it
> >> would go offline itself, abandoning the lockspace.
> >>
> >> I changed my DLM settings to "enable_fencing=0", disabling DLM
> fencing, and
> >> our tests are now working as expected.
> >
> > I'm running a larger Pacemaker cluster with standalone DLM + cLVM
> (that
> > is, they are started by systemd, not by Pacemaker).  I've seen
> weird DLM
> > fencing behavior, but not what you describe above (though I ran with
> > more than two nodes from the very start).  Actually, I don't even
> > understand how it occured to you to disable DLM fencing to fix that...
> 
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't. You must have
> working fencing for DLM (and anything using it) to function correctly.
> 
> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
> 
> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.
> 
> >> I'm a little concern I have masked an issue by doing this, as in all
> >> of the tutorials and docs I've read, there is no mention of having to
> >> configure DLM whatsoever.
> >
> > Unfortunately it's very hard to come by any reliable info about
> DLM.  I
> > had a couple of enlightening exchanges with David Teigland (its
> primary
> > author) on this list, he is very helpful indeed, but I'm still
> very far
> > from having a working understanding of it.
> >
> > But I've been running with --enable_fencing=0 for years without
> issues,
> > leaving all fencing to Pacemaker.  Note that manual cLVM
> operations are
> > the only users of DLM here, so delayed fencing does not cause any
> > problems, the cluster services do not depend on DLM being
> operational (I
> > mean it can stay frozen for several days -- as it happened in a couple
> > of pathological cases).  GFS2 would be a very different thing, I
> guess.
> >
> 
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay
> Gould
> 
> 
> 
> -- 
> Patrick Whitney
> DevOps Engineer -- Tools


-- 
Digimer
Papers and Projects: 

Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread FeldHost™ Admin
Probably you need to set enable_startup_fencing = 0 instead of enable_fencing = 0.
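
I.e. roughly the following, as an untested sketch (dlm_controld usually
reads /etc/dlm/dlm.conf):

  # disable dlm_controld's startup fencing only; leave enable_fencing at its default
  enable_startup_fencing=0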

Best regards, Kristián Feldsam
Tel.: +420 773 303 353, +421 944 137 535
E-mail: supp...@feldhost.cz

www.feldhost.cz - FeldHost™ – Hosting services tailored to you. Do you have 
specific requirements? We can handle them.

FELDSAM s.r.o.
V Chotejně 765/15
Praha 10 – Hostivař, PSČ 102 00
Company ID (IČ): 290 60 958, VAT ID (DIČ): CZ290 60 958
File no. C 200350, registered with the Municipal Court in Prague

Bank: Fio banka a.s.
Account number: 2400330446/2010
BIC: FIOBCZPPXX
IBAN: CZ82 2010  0024 0033 0446

> On 1 Oct 2018, at 18:55, Patrick Whitney  wrote:
> 
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't.
> 
> As a matter of fact, DLM has a setting "enable_fencing=0|1" for what that's 
> worth.   
>  
> You must have
> working fencing for DLM (and anything using it) to function correctly.
> 
> We do have fencing enabled in the cluster; we've tested both node level 
> fencing and resource fencing; DLM behaved identically in both scenarios, 
> until we set it to 'enable_fencing=0' in the dlm.conf file. 
>  
> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
> This isn't quite what I was seeing in the logs.  The "failed" node would be 
> fenced off, pacemaker appeared to be sane, reporting services running on the 
> running nodes, but once the failed node was seen as missing by dlm 
> (dlm_controld), dlm would request fencing, from what I can tell by the log 
> entry.  Here is an example of the suspect log entry:
> Sep 26 09:41:35 pcmk-test-1 dlm_controld[837]: 38 fence request 2 pid 1446 
> startup time 1537969264 fence_all dlm_stonith
>  
> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.
> 
> Can you speak more to what "proper fencing" is for DLM? 
> 
> Best,
> -Pat
> 
>   
> 
On Mon, Oct 1, 2018 at 12:30 PM Digimer  wrote:
> On 2018-10-01 12:04 PM, Ferenc Wágner wrote:
> > Patrick Whitney  writes:
> > 
> >> I have a two node (test) cluster running corosync/pacemaker with DLM
> >> and CLVM.
> >>
> >> I was running into an issue where when one node failed, the remaining node
> >> would appear to do the right thing, from the pcmk perspective, that is.
> >> It would  create a new cluster (of one) and fence the other node, but
> >> then, rather surprisingly, DLM would see the other node offline, and it
> >> would go offline itself, abandoning the lockspace.
> >>
> >> I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
> >> our tests are now working as expected.
> > 
> > I'm running a larger Pacemaker cluster with standalone DLM + cLVM (that
> > is, they are started by systemd, not by Pacemaker).  I've seen weird DLM
> > fencing behavior, but not what you describe above (though I ran with
> > more than two nodes from the very start).  Actually, I don't even
> > understand how it occured to you to disable DLM fencing to fix that...
> 
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't. You must have
> working fencing for DLM (and anything using it) to function correctly.
> 
> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
> 
> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.
> 
> >> I'm a little concern I have masked an issue by doing this, as in all
> >> of the tutorials and docs I've read, there is no mention of having to
> >> configure DLM whatsoever.
> > 
> > Unfortunately it's very hard to come by any reliable info about DLM.  I
> > had a couple of enlightening exchanges with David Teigland (its primary
> > author) on this list, he is very helpful indeed, but I'm still very far
> > from having a working understanding of it.
> > 
> > But I've been running with --enable_fencing=0 for years without issues,
> > leaving all fencing to Pacemaker.  Note that manual cLVM operations are
> > the only users of DLM here, so delayed fencing does not cause any
> > problems, the cluster services do not depend on DLM being operational (I
> > mean it can stay frozen for several days -- as it happened in a couple
> > of pathological cases).  GFS2 would be a very different thing, I guess.
> > 
> 
> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/ 
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and 

Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Patrick Whitney
>
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't.


As a matter of fact, DLM has a setting "enable_fencing=0|1" for what that's
worth.


> You must have
> working fencing for DLM (and anything using it) to function correctly.
>

We do have fencing enabled in the cluster; we've tested both node level
fencing and resource fencing; DLM behaved identically in both scenarios,
until we set it to 'enable_fencing=0' in the dlm.conf file.


> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
>
This isn't quite what I was seeing in the logs.  The "failed" node would be
fenced off and pacemaker appeared to be sane, reporting services running on
the surviving nodes; but once the failed node was seen as missing by dlm
(dlm_controld), dlm would request fencing, as far as I can tell from the log
entry.  Here is an example of the suspect log entry:
Sep 26 09:41:35 pcmk-test-1 dlm_controld[837]: 38 fence request 2 pid 1446
startup time 1537969264 fence_all dlm_stonith


> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.


Can you speak more to what "proper fencing" is for DLM?

Best,
-Pat



On Mon, Oct 1, 2018 at 12:30 PM Digimer  wrote:

> On 2018-10-01 12:04 PM, Ferenc Wágner wrote:
> > Patrick Whitney  writes:
> >
> >> I have a two node (test) cluster running corosync/pacemaker with DLM
> >> and CLVM.
> >>
> >> I was running into an issue where when one node failed, the remaining
> node
> >> would appear to do the right thing, from the pcmk perspective, that is.
> >> It would  create a new cluster (of one) and fence the other node, but
> >> then, rather surprisingly, DLM would see the other node offline, and it
> >> would go offline itself, abandoning the lockspace.
> >>
> >> I changed my DLM settings to "enable_fencing=0", disabling DLM fencing,
> and
> >> our tests are now working as expected.
> >
> > I'm running a larger Pacemaker cluster with standalone DLM + cLVM (that
> > is, they are started by systemd, not by Pacemaker).  I've seen weird DLM
> > fencing behavior, but not what you describe above (though I ran with
> > more than two nodes from the very start).  Actually, I don't even
> > understand how it occured to you to disable DLM fencing to fix that...
>
> Fencing in clustering is always required, but unlike pacemaker that lets
> you turn it off and take your chances, DLM doesn't. You must have
> working fencing for DLM (and anything using it) to function correctly.
>
> Basically, cluster config changes (node declared lost), dlm informed and
> blocks, fence attempt begins and loops until it succeeds, on success,
> informs DLM, dlm reaps locks held by the lost node and normal operation
> continues.
>
> This isn't a question of node count or other configuration concerns.
> It's simply that you must have proper fencing for DLM.
>
> >> I'm a little concern I have masked an issue by doing this, as in all
> >> of the tutorials and docs I've read, there is no mention of having to
> >> configure DLM whatsoever.
> >
> > Unfortunately it's very hard to come by any reliable info about DLM.  I
> > had a couple of enlightening exchanges with David Teigland (its primary
> > author) on this list, he is very helpful indeed, but I'm still very far
> > from having a working understanding of it.
> >
> > But I've been running with --enable_fencing=0 for years without issues,
> > leaving all fencing to Pacemaker.  Note that manual cLVM operations are
> > the only users of DLM here, so delayed fencing does not cause any
> > problems, the cluster services do not depend on DLM being operational (I
> > mean it can stay frozen for several days -- as it happened in a couple
> > of pathological cases).  GFS2 would be a very different thing, I guess.
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com/w/
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
>


-- 
Patrick Whitney
DevOps Engineer -- Tools


Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Digimer
On 2018-10-01 12:04 PM, Ferenc Wágner wrote:
> Patrick Whitney  writes:
> 
>> I have a two node (test) cluster running corosync/pacemaker with DLM
>> and CLVM.
>>
>> I was running into an issue where when one node failed, the remaining node
>> would appear to do the right thing, from the pcmk perspective, that is.
>> It would  create a new cluster (of one) and fence the other node, but
>> then, rather surprisingly, DLM would see the other node offline, and it
>> would go offline itself, abandoning the lockspace.
>>
>> I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
>> our tests are now working as expected.
> 
> I'm running a larger Pacemaker cluster with standalone DLM + cLVM (that
> is, they are started by systemd, not by Pacemaker).  I've seen weird DLM
> fencing behavior, but not what you describe above (though I ran with
> more than two nodes from the very start).  Actually, I don't even
> understand how it occured to you to disable DLM fencing to fix that...

Fencing in clustering is always required, but unlike pacemaker, which lets
you turn it off and take your chances, DLM doesn't. You must have working
fencing for DLM (and anything using it) to function correctly.

Basically: the cluster configuration changes (a node is declared lost), dlm
is informed and blocks, a fence attempt begins and loops until it succeeds;
on success dlm is informed, dlm reaps the locks held by the lost node, and
normal operation continues.

This isn't a question of node count or other configuration concerns.
It's simply that you must have proper fencing for DLM.
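
A rough sketch of the usual arrangement, in crm syntax (agent names can
differ between distributions; this assumes ocf:pacemaker:controld for DLM
and ocf:heartbeat:clvm for clvmd, with real stonith resources configured as
well):

  property stonith-enabled=true
  primitive p_dlm ocf:pacemaker:controld op monitor interval=30
  primitive p_clvmd ocf:heartbeat:clvm op monitor interval=30
  group g_locking p_dlm p_clvmd
  clone cl_locking g_locking meta interleave=true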

>> I'm a little concern I have masked an issue by doing this, as in all
>> of the tutorials and docs I've read, there is no mention of having to
>> configure DLM whatsoever.
> 
> Unfortunately it's very hard to come by any reliable info about DLM.  I
> had a couple of enlightening exchanges with David Teigland (its primary
> author) on this list, he is very helpful indeed, but I'm still very far
> from having a working understanding of it.
> 
> But I've been running with --enable_fencing=0 for years without issues,
> leaving all fencing to Pacemaker.  Note that manual cLVM operations are
> the only users of DLM here, so delayed fencing does not cause any
> problems, the cluster services do not depend on DLM being operational (I
> mean it can stay frozen for several days -- as it happened in a couple
> of pathological cases).  GFS2 would be a very different thing, I guess.
> 


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Ferenc Wágner
Patrick Whitney  writes:

> I have a two node (test) cluster running corosync/pacemaker with DLM
> and CLVM.
>
> I was running into an issue where when one node failed, the remaining node
> would appear to do the right thing, from the pcmk perspective, that is.
> It would  create a new cluster (of one) and fence the other node, but
> then, rather surprisingly, DLM would see the other node offline, and it
> would go offline itself, abandoning the lockspace.
>
> I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
> our tests are now working as expected.

I'm running a larger Pacemaker cluster with standalone DLM + cLVM (that
is, they are started by systemd, not by Pacemaker).  I've seen weird DLM
fencing behavior, but not what you describe above (though I ran with
more than two nodes from the very start).  Actually, I don't even
understand how it occurred to you to disable DLM fencing to fix that...

> I'm a little concern I have masked an issue by doing this, as in all
> of the tutorials and docs I've read, there is no mention of having to
> configure DLM whatsoever.

Unfortunately it's very hard to come by any reliable info about DLM.  I
had a couple of enlightening exchanges with David Teigland (its primary
author) on this list; he is very helpful indeed, but I'm still very far
from having a working understanding of it.

But I've been running with --enable_fencing=0 for years without issues,
leaving all fencing to Pacemaker.  Note that manual cLVM operations are
the only users of DLM here, so delayed fencing does not cause any
problems; the cluster services do not depend on DLM being operational (I
mean it can stay frozen for several days -- as it happened in a couple
of pathological cases).  GFS2 would be a very different thing, I guess.
-- 
Regards,
Feri


Re: [ClusterLabs] Colocation by Node

2018-10-01 Thread Ken Gaillot
On Mon, 2018-10-01 at 11:09 -0400, Marc Smith wrote:
> Hi,
> 
> I'm looking for the correct constraint setup to use for the following
> resource configuration:
> --snip--
> node 1: tgtnode2.parodyne.com
> node 2: tgtnode1.parodyne.com
> primitive p_iscsi_tgtnode1 iscsi \
>         params portal=172.16.0.12 target=tgtnode2_redirect udev=no
> try_recovery=true \
>         op start interval=0 timeout=120 \
>         op stop interval=0 timeout=120 \
>         op monitor interval=120 timeout=30
> primitive p_iscsi_tgtnode2 iscsi \
>         params portal=172.16.0.11 target=tgtnode1_redirect udev=no
> try_recovery=true \
>         op start interval=0 timeout=120 \
>         op stop interval=0 timeout=120 \
>         op monitor interval=120 timeout=30
> primitive p_scst ocf:esos:scst \
>         params alua=false \
>         op start interval=0 timeout=120 \
>         op stop interval=0 timeout=90 \
>         op monitor interval=30 timeout=60
> clone clone_scst p_scst \
>         meta interleave=true target-role=Started
> location l_iscsi_tgtnode1 p_iscsi_tgtnode1 role=Started -inf:
> tgtnode2.parodyne.com
> location l_iscsi_tgtnode2 p_iscsi_tgtnode2 role=Started -inf:
> tgtnode1.parodyne.com
> property cib-bootstrap-options: \
>         have-watchdog=false \
>         dc-version=1.1.16-94ff4df \
>         cluster-infrastructure=corosync \
>         stonith-enabled=false \
>         cluster-name=redirect_test \
>         last-lrm-refresh=1538405190
> --snip--
> 
> The gist of it is the 'ocf:esos:scst' provides an iSCSI target
> interface that each opposing node connects to (just a two node
> cluster) via 'ocf:heartbeat:iscsi' on each node. I have location
> constraints in the above configuration that force the
> "p_iscsi_tgtnode1" and "p_iscsi_tgtnode2" primitives to the correct
> node, but what I'm lacking is a colocation constraint that prevents
> "p_iscsi_tgtnode1" from starting unless "clone_scst" is running on
> the opposing node and vice versa.
> 
> Is a configuration like this possible? Without creating two
> primitives for 'ocf:esos:scst' and ditching the clone rule? Or is the

No, there's no way to constrain against a particular clone instance.
Using separate primitives would be your best bet.
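
For example, something along these lines (an untested crm-syntax sketch;
the p_scst_tgtnode* names are made up, and the ordering constraints are
what tie each initiator to the target on the opposite node):

  primitive p_scst_tgtnode1 ocf:esos:scst \
          params alua=false \
          op monitor interval=30 timeout=60
  primitive p_scst_tgtnode2 ocf:esos:scst \
          params alua=false \
          op monitor interval=30 timeout=60
  location l_scst_tgtnode1 p_scst_tgtnode1 -inf: tgtnode2.parodyne.com
  location l_scst_tgtnode2 p_scst_tgtnode2 -inf: tgtnode1.parodyne.com
  order o_tgt2_then_iscsi1 inf: p_scst_tgtnode2 p_iscsi_tgtnode1
  order o_tgt1_then_iscsi2 inf: p_scst_tgtnode1 p_iscsi_tgtnode2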

> best route creating two separate primitives for 'ocf:esos:scst' and
> then adding more constraints to prevent these from running on the
> same node and forcing their node "ownership"?
> 
> Any help / guidance / advice would be greatly appreciated.
> 
> 
> Thanks,
> 
> Marc
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 


[ClusterLabs] Colocation by Node

2018-10-01 Thread Marc Smith
Hi,

I'm looking for the correct constraint setup to use for the following
resource configuration:
--snip--
node 1: tgtnode2.parodyne.com
node 2: tgtnode1.parodyne.com
primitive p_iscsi_tgtnode1 iscsi \
params portal=172.16.0.12 target=tgtnode2_redirect udev=no
try_recovery=true \
op start interval=0 timeout=120 \
op stop interval=0 timeout=120 \
op monitor interval=120 timeout=30
primitive p_iscsi_tgtnode2 iscsi \
params portal=172.16.0.11 target=tgtnode1_redirect udev=no
try_recovery=true \
op start interval=0 timeout=120 \
op stop interval=0 timeout=120 \
op monitor interval=120 timeout=30
primitive p_scst ocf:esos:scst \
params alua=false \
op start interval=0 timeout=120 \
op stop interval=0 timeout=90 \
op monitor interval=30 timeout=60
clone clone_scst p_scst \
meta interleave=true target-role=Started
location l_iscsi_tgtnode1 p_iscsi_tgtnode1 role=Started -inf:
tgtnode2.parodyne.com
location l_iscsi_tgtnode2 p_iscsi_tgtnode2 role=Started -inf:
tgtnode1.parodyne.com
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.16-94ff4df \
cluster-infrastructure=corosync \
stonith-enabled=false \
cluster-name=redirect_test \
last-lrm-refresh=1538405190
--snip--

The gist of it is that 'ocf:esos:scst' provides an iSCSI target interface
that each opposing node connects to (this is just a two-node cluster) via
'ocf:heartbeat:iscsi' on each node. I have location constraints in the
above configuration that force the "p_iscsi_tgtnode1" and
"p_iscsi_tgtnode2" primitives to the correct node, but what I'm lacking is
a colocation constraint that prevents "p_iscsi_tgtnode1" from starting
unless "clone_scst" is running on the opposing node and vice versa.

Is a configuration like this possible? Without creating two primitives for
'ocf:esos:scst' and ditching the clone rule? Or is the best route creating
two separate primitives for 'ocf:esos:scst' and then adding more
constraints to prevent these from running on the same node and forcing
their node "ownership"?

Any help / guidance / advice would be greatly appreciated.


Thanks,

Marc


Re: [ClusterLabs] The crmd process exited: Generic Pacemaker error (201)

2018-10-01 Thread Ken Gaillot
On Sat, 2018-09-29 at 22:42 +0800, lkxjtu wrote:
> 
> Version information
> [root@paas-controller-172-167-40-24:~]$ rpm -q corosync
> corosync-2.4.0-9.el7_4.2.x86_64
> [root@paas-controller-172-167-40-24:~]$ rpm -q pacemaker
> pacemaker-1.1.16-12.el7_4.2.x86_64
> 
> The crmd process exited with an error code of 201. The pacemakerd
> process tried to fork it 100 times, exceeding the threshold, and the
> crmd process stayed down for good. Here is the log of the last attempt
> to fork the crmd process.
> 
> I have two questions. First, why does the crmd process exit? Second,
> can I set the threshold for the number of retries? Thank you very much!

The cause is unclear from these logs. You'll have to look further back in
the logs for the first sign of any warning or error condition.

The limit is hardcoded.

> 
> 
> 
> Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:   
> error: pcmk_child_exit:    The crmd process (83749) exited: Generic
> Pacemaker error (201)
> Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:  
> notice: pcmk_process_exit:  Respawning failed child process: crmd
> Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:
> info: start_child:    Using uid=189 and group=189 for process
> crmd
> Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:
> info: start_child:    Forked child 88033 for process crmd
> Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:
> info: mcp_cpg_deliver:    Ignoring process list sent by peer for
> local node
> Sep 08 18:10:09 [28446] paas-controller-172-167-40-24 pacemakerd:
> info: mcp_cpg_deliver:    Ignoring process list sent by peer for
> local node
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: crm_log_init:   Changed active directory to
> /var/lib/pacemaker/cores
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: main:   CRM Git Version: 1.1.16-12.el7_4.2 (94ff4df)
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: do_log: Input I_STARTUP received in state S_STARTING from
> crmd_init
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: get_cluster_type:   Verifying cluster type: 'corosync'
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: get_cluster_type:   Assuming an active 'corosync' cluster
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: do_cib_control: CIB connection established
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:  
> notice: crm_cluster_connect:    Connecting to cluster
> infrastructure: corosync
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: crm_get_peer:   Created entry ebd1fc7d-5c48-4c81-85ec-
> bad8a3f6fcb1/0x7fe04dec49a0 for node 172.167.40.24/167040024 (1
> total)
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: crm_get_peer:   Node 167040024 is now known as
> 172.167.40.24
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: peer_update_callback:   172.167.40.24 is now in unknown
> state
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: crm_get_peer:   Node 167040024 has uuid 167040024
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: crm_update_peer_proc:   cluster_connect_cpg: Node
> 172.167.40.24[167040024] - corosync-cpg is now online
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: peer_update_callback:   Client 172.167.40.24/peer now has
> status [online] (DC=, changed=400)
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: init_cs_connection_once:    Connection to 'corosync':
> established
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:  
> notice: cluster_connect_quorum: Quorum acquired
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: do_ha_control:  Connected to the cluster
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: lrmd_ipc_connect:   Connecting to lrmd
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: do_lrm_control: LRM connection established
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: do_started: Delaying start, no membership data
> (0010)
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: do_started: Delaying start, no membership data
> (0010)
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:
> info: pcmk_quorum_notification:   Quorum retained  membership=4
> members=1
> Sep 08 18:10:09 [88033] paas-controller-172-167-40-24   crmd:  
> notice: crm_update_peer_state_iter: Node 172.167.40.24 state is now
> member  nodeid=167040024 

Re: [ClusterLabs] Questions regarding crm_mon tool

2018-10-01 Thread Ken Gaillot
On Fri, 2018-09-28 at 19:41 +, Brian Vagnini wrote:
> Greetings,
> We are implementing an HA cluster solution and as a part of it, are
> using crm_mon. Part of my job is to document training materials for
> certain things. I am running into a problem in defining some of the
> information that gets outputted when running the crm_mon command.
>  
> One of my principle engineers responded with this comment.
>  
> crm_mon
>  online -> explain the different states online, offline, pending and
> standby (there might be more)
>  last updated/change (look up what these mean, but I think one of them
> shows when the cluster last had a config change or maybe changed
> states)
> My dilemma is that I can’t seem to find that within the help system
> or anywhere else. I’m hoping that you can assist me with this.
>  
> Thanks in advance,
>  
> Brian K. Vagnini
> Training & Documentation Coordinator
> Tallahassee office
> 850-270-0387
> Slack (@bkvagnini)

Hi,

Those are good questions. We are planning some significant enhancements
to crm_mon over the next couple of releases to improve configurability
and general usefulness. (Already lined up for the next release is
display of failed and pending fence actions.)

We definitely could use better documentation of the output format;
unfortunately, there are many higher priorities in the queue. That
documentation should definitely include this information.

Possible node states:

online - node is a functioning member of both the cluster layer
(corosync) and its daemon peer group

pending - node is a cluster member, but not a peer group member (this
typically means it is in the process of starting or shutting down)

standby - node is online but in standby mode. This may optionally be
followed by (on-fail), meaning that it is in standby because a resource
with on-fail=standby failed, or by (with active resources), which is new
and indicates that even though the node is in standby, it is still in the
process of moving resources off.

maintenance - node is online but in maintenance mode

UNCLEAN - node must be fenced. This will be followed by (online),
(pending) or (offline).

OFFLINE - node cleanly left the cluster. This may optionally be
followed by (standby) or (maintenance).


Last updated is the time the crm_mon display was last refreshed.

Last change is the time the CIB (cluster configuration and status) last
changed.
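
For illustration, a shortened, made-up crm_mon header and node section
might look something like this (the exact layout varies by version):

  Stack: corosync
  Current DC: node1 (version 1.1.16-94ff4df) - partition with quorum
  Last updated: Mon Oct  1 12:00:00 2018
  Last change: Mon Oct  1 11:58:42 2018 by root via cibadmin on node1

  3 nodes and 4 resources configured

  Node node3: standby
  Online: [ node1 node2 ]
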
-- 
Ken Gaillot 


[ClusterLabs] Any CLVM/DLM users around?

2018-10-01 Thread Patrick Whitney
Hi Everyone,

I wanted to solicit input on my configuration.

I have a two node (test) cluster running corosync/pacemaker with DLM and
CLVM.

I was running into an issue where, when one node failed, the remaining node
would appear to do the right thing from the pcmk perspective; that is, it
would create a new cluster (of one) and fence the other node. But then,
rather surprisingly, DLM would see the other node offline, and it would go
offline itself, abandoning the lockspace.

I changed my DLM settings to "enable_fencing=0", disabling DLM fencing, and
our tests are now working as expected.

I'm a little concerned that I have masked an issue by doing this, as in all
of the tutorials and docs I've read there is no mention of having to
configure DLM whatsoever.

Is anyone else running a similar stack and can comment?

Best,
-Pat
-- 
Patrick Whitney
DevOps Engineer -- Tools


Re: [ClusterLabs] Antw: Re: Corosync 3 release plans?

2018-10-01 Thread Christine Caulfield
On 01/10/18 07:45, Ulrich Windl wrote:
> >>> Ferenc Wágner  wrote on 27.09.2018 at 21:16 in
> message <87zhw23g5p@lant.ki.iif.hu>:
>> Christine Caulfield  writes:
>>
>>> I'm also looking into high‑res timestamps for logfiles too.
>>
>> Wouldn't that be a useful option for the syslog output as well?  I'm
>> sometimes concerned by the batching effect added by the transport
>> between the application and the (local) log server (rsyslog or systemd).
>> Reliably merging messages from different channels can prove impossible
>> without internal timestamps (even considering a single machine only).
>>
>> Another interesting feature could be structured, direct journal output
>> (if you're looking for challenges).
> 
> Make it configurable please; most lines are long enough even without extra
> timestamps.
> 

Don't worry, I will :)

Chrissie

>> ‑‑ 
>> Regards,
>> Feri
>> ___
>> Users mailing list: Users@clusterlabs.org 
>> https://lists.clusterlabs.org/mailman/listinfo/users 
>>
>> Project Home: http://www.clusterlabs.org 
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>> Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



[ClusterLabs] Antw: Re: How is fencing and unfencing suppose to work?

2018-10-01 Thread Ulrich Windl
>>> digimer  wrote on 28.09.2018 at 19:11 in message
<968d00cd-fad5-8f17-edfd-7787a9964...@alteeve.ca>:
> On 2018-09-04 8:49 p.m., Ken Gaillot wrote:
>> On Tue, 2018-08-21 at 10:23 -0500, Ryan Thomas wrote:
>>> I’m seeing unexpected behavior when using “unfencing” – I don’t think
>>> I’m understanding it correctly.  I configured a resource that
>>> “requires unfencing” and have a custom fencing agent which “provides
>>> unfencing”.   I perform a simple test where I setup the cluster and
>>> then run “pcs stonith fence node2”, and I see that node2 is
>>> successfully fenced by sending an “off” action to my fencing agent.
>>> But, immediately after this, I see an “on” action sent to my fencing
>>> agent.  My fence agent doesn’t implement the “reboot” action, so
>>> perhaps its trying to reboot by running an off action followed by a
>>> on action.  Prior to adding “provides unfencing” to the fencing
>>> agent, I didn’t see the on action. It seems unsafe to say “node2 you
>>> can’t run” and then immediately “ you can run”.
>> I'm not as familiar with unfencing as I'd like, but I believe the basic
>> idea is:
>>
>> - the fence agent's off action cuts the machine off from something
>> essential needed to run resources (generally shared storage or network
>> access)
>>
>> - the fencing works such that a fenced host is not able to request
>> rejoining the cluster without manual intervention by a sysadmin
>>
>> - when the sysadmin allows the host back into the cluster, and it
>> contacts the other nodes to rejoin, the cluster will call the fence
>> agent's on action, which is expected to re-enable the host's access
>>
>> How that works in practice, I have only vague knowledge.
> 
> This is correct. Consider fabric fencing where fiber channel ports are 
> disconnected. Unfence restores the connection. Similar to a pure 'off' 
> fence call to switched PDUs, as you mention above. Unfence powers the 
> outlets back up.

I doubt whether successful fencing can be emulated by "pausing" I/O: when
re-establishing the fabric, outstanding I/Os might be performed (which cannot
happen after real fencing).

[...]

Regards,
Ulrich



[ClusterLabs] Antw: Re: Understanding the behavior of pacemaker crash

2018-10-01 Thread Ulrich Windl
>>> Ken Gaillot  wrote on 28.09.2018 at 15:50 in message
<1538142642.4679.1.ca...@redhat.com>:
> On Fri, 2018-09-28 at 15:26 +0530, Prasad Nagaraj wrote:
>> Hi Ken - Only if I turn off corosync on the node [ where I crashed
>> pacemaker] other nodes are able to detect and put the node as
>> OFFLINE.
>> Do you have any other guidance or insights into this ?
> 
> Yes, corosync is the cluster membership layer -- if corosync is
> successfully running, then the node is a member of the cluster.
> Pacemaker's crmd provides a higher level of membership; typically, with
> corosync but no crmd, the node shows up as "pending" in status. However
> I am not sure how it worked with the old corosync plugin.

Maybe crmd should "feed a watchdog with tranquilizers" (meaning that if it
stops doing so, the watchdog wakes up and resets the node). ;-)

Regards,
Ulrich

> 
>> 
>> Thanks
>> Prasad
>> 
>> On Thu, Sep 27, 2018 at 9:33 PM Prasad Nagaraj > l.com> wrote:
>> > Hi Ken - Thanks for the response. Pacemaker is still not running on
>> > that node. So I am still wondering what could be the issue ? Any
>> > other configurations or logs should I be sharing to understand this
>> > more ?
>> > 
>> > Thanks!
>> > 
>> > On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot 
>> > wrote:
>> > > On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
>> > > > Hello - I was trying to understand the behavior or cluster when
>> > > > pacemaker crashes on one of the nodes. So I hard killed
>> > > pacemakerd
>> > > > and its related processes.
>> > > > 
>> > > > -
>> > > --
>> > > > -
>> > > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root  74022  1  0 07:53 pts/000:00:00 pacemakerd
>> > > > 189   74028  74022  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/cib
>> > > > root  74029  74022  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/stonithd
>> > > > root  74030  74022  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/lrmd
>> > > > 189   74031  74022  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/attrd
>> > > > 189   74032  74022  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/pengine
>> > > > 189   74033  74022  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/crmd
>> > > > 
>> > > > root  75228  50092  0 07:54 pts/000:00:00 grep
>> > > pacemaker
>> > > > [root@SG-mysqlold-907 azureuser]# kill -9 74022
>> > > > 
>> > > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root  74030  1  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/lrmd
>> > > > 189   74032  1  0 07:53 ?00:00:00
>> > > > /usr/libexec/pacemaker/pengine
>> > > > 
>> > > > root  75303  50092  0 07:55 pts/000:00:00 grep
>> > > pacemaker
>> > > > [root@SG-mysqlold-907 azureuser]# kill -9 74030
>> > > > [root@SG-mysqlold-907 azureuser]# kill -9 74032
>> > > > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
>> > > > root  75332  50092  0 07:55 pts/000:00:00 grep
>> > > pacemaker
>> > > > 
>> > > > [root@SG-mysqlold-907 azureuser]# crm satus
>> > > > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
>> > > > Transport endpoint is not connected
>> > > > -
>> > > --
>> > > > --
>> > > > 
>> > > > However, this does not seem to be having any effect on the
>> > > cluster
>> > > > status from other nodes
>> > > > -
>> > > --
>> > > > 
>> > > > 
>> > > > [root@SG-mysqlold-909 azureuser]# crm status
>> > > > Last updated: Thu Sep 27 07:56:17 2018  Last change:
>> > > Thu Sep
>> > > > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
>> > > > Stack: classic openais (with plugin)
>> > > > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0)
>> > > -
>> > > > partition with quorum
>> > > > 3 nodes and 3 resources configured, 3 expected votes
>> > > > 
>> > > > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>> > > 
>> > > It most definitely would make the node offline, and if fencing
>> > > were
>> > > configured, the rest of the cluster would fence the node to make
>> > > sure
>> > > it's safely down.
>> > > 
>> > > I see you're using the old corosync 1 plugin. I suspect what
>> > > happened
>> > > in this case is that corosync noticed the plugin died and
>> > > restarted it
>> > > quickly enough that it had rejoined by the time you checked the
>> > > status
>> > > elsewhere.
>> > > 
>> > > > 
>> > > > Full list of resources:
>> > > > 
>> > > >  Master/Slave Set: ms_mysql [p_mysql]
>> > > >  Masters: [ SG-mysqlold-909 ]
>> > > >  Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
>> > 

[ClusterLabs] Antw: Re: Corosync 3 release plans?

2018-10-01 Thread Ulrich Windl
>>> Ferenc Wágner  wrote on 27.09.2018 at 21:16 in
message <87zhw23g5p@lant.ki.iif.hu>:
> Christine Caulfield  writes:
> 
>> I'm also looking into high‑res timestamps for logfiles too.
> 
> Wouldn't that be a useful option for the syslog output as well?  I'm
> sometimes concerned by the batching effect added by the transport
> between the application and the (local) log server (rsyslog or systemd).
> Reliably merging messages from different channels can prove impossible
> without internal timestamps (even considering a single machine only).
> 
> Another interesting feature could be structured, direct journal output
> (if you're looking for challenges).

Make it configurable please; most lines are long enough even without extra
timestamps.

> ‑‑ 
> Regards,
> Feri
> ___
> Users mailing list: Users@clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 





Re: [ClusterLabs] Keep printing "Sent 0 CPG messages" in corosync.log

2018-10-01 Thread Jan Friesse

lkxjtu,




corosync.log has kept printing the following messages for several days. What's 
wrong with the corosync cluster? The CPU load is not high at the moment.


The interesting messages in the logs you've sent are:

Sep 30 01:23:28 [127667] paas-controller-172-21-0-2 corosync warning 
[MAIN  ] timer_function_scheduler_timeout Corosync main process was not 
scheduled for 10470.3652 ms (threshold is 2400. ms). Consider token 
timeout increase.


and

Sep 30 01:23:29 [127667] paas-controller-172-21-0-2 corosync notice 
[TOTEM ] pause_flush Process pause detected for 8760 ms, flushing 
membership messages.



This means that corosync was unable to get the CPU time it needed to run. 
This can happen because of:
- (Most often) the cluster running in highly overloaded VMs (quite often in 
cloud environments)
- Corosync not having an RT priority, or another RT-priority task using 
most of the CPU time

- I/O problem
- Misbehaving watchdog device
- Bug in corosync
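
(If you do decide to raise the token timeout as that log message suggests,
it is the token value in the totem section of corosync.conf; a sketch, with
an arbitrary example value:)

  totem {
          # token timeout in milliseconds; pick a value larger than the observed pauses
          token: 10000
  }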

Honza



Cluster version information:
[root@paas-controller-172-167-40-24:~]$ rpm -q corosync
corosync-2.4.0-9.el7_4.2.x86_64
[root@paas-controller-172-167-40-24:~]$ rpm -q pacemaker
pacemaker-1.1.16-12.el7_4.2.x86_64



Sep 30 01:23:27 [128232] paas-controller-172-21-0-2cib: info: 
crm_cs_flush: Sent 0 CPG messages  (13 remaining, last=363): Try again (6)

...