Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

2019-02-11 Thread Fulong Wang
Klaus,

Thanks for the info!
Did you mean I should compile sbd from the GitHub source myself to include the fixes
you mentioned?

The corosync, pacemaker and sbd version in my setup is as below:
corosync: 2.3.6-9.13.1
pacemaker: 1.1.16-6.5.1
sbd:   1.3.1+20180507



Regards
Fulong

From: Klaus Wenninger 
Sent: Monday, February 11, 2019 18:51
To: Cluster Labs - All topics related to open-source clustering welcomed; 
Fulong Wang; Gao,Yan
Subject: Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

On 02/11/2019 09:49 AM, Fulong Wang wrote:
Thanks Yan,

You gave me more valuable hints on the SBD operation!
Now, I can see the verbose output after the service restart.


>Be aware since pacemaker integration (-P) is enabled by default, which
>means despite the sbd failure, if the node itself was clean and
>"healthy" from pacemaker's point of view and if it's in the cluster
>partition with the quorum, it wouldn't self-fence -- meaning a node just
>being unable to fence doesn't necessarily need to be fenced.

>As described in sbd man page, "this allows sbd to survive temporary
>outages of the majority of devices. However, while the cluster is in
>such a degraded state, it can neither successfully fence nor be shutdown
>cleanly (as taking the cluster below the quorum threshold will
>immediately cause all remaining nodes to self-fence). In short, it will
>not tolerate any further faults.  Please repair the system before
>continuing."

Yes, I can see that "pacemaker integration" was enabled in my sbd config file by 
default.
So, you mean that in some sbd failure cases, if the node is considered "healthy" 
from pacemaker's point of view, it still wouldn't self-fence.

Honestly speaking, I didn't quite follow you on this point. I have the 
"no-quorum-policy=ignore" setting in my setup and it's a two-node cluster.
Can you show me a sample situation for this?

When using sbd with two-node clusters and pacemaker integration, you might
check whether 
https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377
is included in your sbd version.
This is relevant when two_node is configured in corosync.
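
For reference, a quick way to check both things on a node might look like this
(a sketch; the package query assumes an RPM-based distribution such as SLES):

    # is the two_node flag set in corosync?
    grep -A5 'quorum' /etc/corosync/corosync.conf
    # expected to contain something like:
    #   quorum {
    #       provider: corosync_votequorum
    #       two_node: 1
    #   }

    # which sbd build is installed (to compare against the fix above)
    rpm -q sbd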

Regards,
Klaus


Many Thanks!!!




Regards
Fulong




From: Gao,Yan 
Sent: Thursday, January 3, 2019 20:43
To: Fulong Wang; Cluster Labs - All topics related to open-source clustering 
welcomed
Subject: Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

On 12/24/18 7:10 AM, Fulong Wang wrote:
> Yan, klaus and Everyone,
>
>
>   Merry Christmas!!!
>
>
>
> Many thanks for your advice!
> I added the "-v" param in "SBD_OPTS", but didn't see any apparent change
> in the system message log. Am I looking in the wrong place?
Did you restart all cluster services, for example by "crm cluster stop"
and then "crm cluster start"? Basically sbd.service needs to be
restarted. Be aware "systemctl restart pacemaker" only restarts pacemaker.
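
A minimal restart sequence might look like this (a sketch; it assumes the SUSE HAE
tooling mentioned above, where "crm cluster stop/start" handles the whole stack
including sbd):

    # restart the whole cluster stack on this node so sbd picks up SBD_OPTS changes
    crm cluster stop
    crm cluster start

    # confirm sbd really restarted with the new options
    systemctl status sbd.service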

SBD daemons log to syslog. When an sbd watcher receives a "test"
command, there should be a syslog entry like this showing up:

"servant: Received command test from ..."

sbd won't actually do anything about a "test" command other than logging a message.
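
For example, a test message can be injected through the shared device like this
(a sketch; the device path is a placeholder for your actual SBD disk, and the
syslog location may differ by distribution):

    # send a "test" command to node1 via the SBD device
    sbd -d /dev/disk/by-id/example-sbd-disk message node1 test

    # then look for the servant message in syslog
    grep 'Received command test' /var/log/messages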

If you are not yet running a recent version of sbd (maintenance update), a
single "-v" will already make sbd very verbose. But of course you could
use grep.

>
> By the way, we want to test that when the disk access paths (multipath
> devices) are lost, sbd can fence the node automatically.
Be aware that pacemaker integration (-P) is enabled by default, which
means that despite the sbd failure, if the node itself is clean and
"healthy" from pacemaker's point of view and it is in the cluster
partition with quorum, it wouldn't self-fence -- meaning a node just
being unable to fence doesn't necessarily need to be fenced.

As described in sbd man page, "this allows sbd to survive temporary
outages of the majority of devices. However, while the cluster is in
such a degraded state, it can neither successfully fence nor be shutdown
cleanly (as taking the cluster below the quorum threshold will
immediately cause all remaining nodes to self-fence). In short, it will
not tolerate any further faults.  Please repair the system before
continuing."

Regards,
   Yan


> what's your recommendation for this scenario?
>
>
>
>
>
>
>
> The "crm node fence"  did the work.
>
>
>
>
>
>
>
>
>
>
>
>
>
> Regards
> Fulong
>
> 
> *From:* Gao,Yan 
> *Sent:* Friday, December 21, 2018 20:43
> *To:* kwenn...@redhat.com; Cluster Labs - All 
> topics related to
> open-source clustering welcomed; Fulong Wang
> *Subject:* Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
> First thanks for your reply, Klaus!
>
> On 2018/12/21 10:09, Klaus Wenninger wrote:
>> On 12/21/2018 08:15 AM, Fulong Wang wrote:
>>> Hello Experts,
>>>
>>> I'm new to this mailing list.
>>> Pls kindly forgive m

Re: [ClusterLabs] [pacemaker] Discretion with glib v2.59.0+ recommended

2019-02-11 Thread Ken Gaillot
On Mon, 2019-02-11 at 22:48 +0100, Jan Pokorný wrote:
> On 20/01/19 12:44 +0100, Jan Pokorný wrote:
> > On 18/01/19 20:32 +0100, Jan Pokorný wrote:
> > > It was discovered that this release of the glib project slightly changed
> > > some parameters of how the distribution of values within hash table
> > > structures works, undermining pacemaker's hard (alas unfeasible) attempt
> > > to turn this data type into a fully predictable entity.
> > > 
> > > The current impact is unknown beside some internal regression tests
> > > failing due to this, so that, e.g., in the environment variables passed
> > > in the notification messages, the order of the active nodes (being a
> > > space separated list) may appear shuffled in comparison with the long
> > > standing (and perhaps making a false impression of determinism)
> > > behaviour witnessed with older versions of glib in the game.
> > 
> > Our immediate response is to, at the very least, make the
> > cts-scheduler regression suite (the only localhost one that was
> > rendered broken with 52 tests out of 733 failed) skip those tests
> > where reliance on the exact order of hash-table-driven items was
> > sported, so it won't fail as a whole:
> > 
> > 
https://github.com/ClusterLabs/pacemaker/pull/1677/commits/15ace890ef0b987db035ee2d71994e37f7eaff96
> > [above edit: updated with the newer version of the patch]
> 
> Shout-out to Ken for fixing the immediate fallout (deterministic
> output breakages in some cts-scheduler tests, making the above
> change superfluous) for the upcoming 2.0.1 release!
> 
> > > Variations like these are expected, and you may take it as an
> > > opportunity to fix incorrect order-wise (like in the stated case)
> > > assumptions.
> > 
> > [intentionally CC'd developers@, should have done it since
> > beginning]
> > 
> > At this point, testing with glib v2.59.0+, preferably using 2.0.1-
> > rc3
> > due to the release cycle timing, is VERY DESIRED if you are
> > considering
> > providing some volunteer capacity to pacemaker project, especially
> > if
> > you have your own agents and scripts that rely on the exact (and
> > previously likely stable) order of "set data made linear, hence
> > artificially ordered", like with
> > OCF_RESKEY_CRM_meta_notify_active_uname
> > environment variable in clone notifications (as was already
> > suggested;
> > complete list is also unknown at this point, unfortunately, for a
> > lack
> > of systemic and precise data items tracking in general).
> 
> While some of these if not all are now ordered, I'd call using the
> "stable ordered list" approach to these variables, as opposed to the
> "plain unordered set" one, from within agents continuously
> frowned-upon unless explicitly lifted.  For predictable
> backward/forward pacemaker+glib version compatibility if
> for no other reason.
> 
> Ken, do you agree?
> 
> (If so, we shall keep that in mind for future documentation tweaks
> [possibly including also OCF updates], so that false assumptions won't
> be cast for new agent implementations going forward.)

Correct, the lists given to resource agents via clone notification
environment variables are not guaranteed to be in any particular
order.

The documentation already does not claim any ordering, and in fact
gives an example where node names are not in alphabetic order, so I
think it's pretty obvious.
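
For agent authors, a hedged illustration of treating the variable as an unordered
set inside a shell-based agent (membership is what matters, not word order):

    # inside a clone notification handler -- do not rely on word order
    for node in $OCF_RESKEY_CRM_meta_notify_active_uname; do
        if [ "$node" = "$(crm_node -n)" ]; then
            : # this node is part of the active set
        fi
    done

    # if a stable order is needed locally, impose it yourself
    sorted_active=$(echo "$OCF_RESKEY_CRM_meta_notify_active_uname" | tr ' ' '\n' | sort)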

> 
> > > More serious troubles stemming from this expectation-reality
> > > mismatch
> > > regarding said data type cannot be denied at this point, subject
> > > of
> > > further investigation.  When in doubt, staying with glib up to
> > > and
> > > including v2.58.2 (said tests are passing with it, though any
> > > later
> > > v2.58.* may keep working "as always") is likely a good idea for
> > > the
> > > time being.
> 
> I think this still partially holds and will only be time-proven as fully
> settled?  I mean, for anything truly reproducible (as in
> crm_simulate),
> either pacemaker prior to 2.0.1 combined with glib pre- or equal-or-
> post-
> 2.59.0 need to be uniformly (reproducers need to follow the original)
> combined to get the same results, and with pacemaker 2.0.1+,
> identical
> results (but possibly differing against either of the former combos)
> will _likely_ be obtained regardless of particular run-time linked
> glib
> version, but strength of this "likely" will only be established with
> future experience, I suppose (but shall universally hold with the
> same
> glib class per stated division, so no change in this already positive
> regard).
> 
> Just scratched the surface, so gladly be corrected.
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker log showing time mismatch after

2019-02-11 Thread Jan Pokorný
On 11/02/19 15:03 -0600, Ken Gaillot wrote:
> On Fri, 2019-02-01 at 08:10 +0100, Jan Pokorný wrote:
>> On 28/01/19 09:47 -0600, Ken Gaillot wrote:
>>> On Mon, 2019-01-28 at 18:04 +0530, Dileep V Nair wrote:
>>> Pacemaker can handle the clock jumping forward, but not backward.
>> 
>> I am rather surprised, are we not using monotonic time only, then?
>> If so, why?
> 
> The scheduler runs on a single node (the DC) but must take as input the
> resource history (including timestamps) on all nodes. We need wall
> clock time to compare against time-based rules.

Yep, was aware of the troubles with this.

> Also, if we get two resource history entries from a node, we don't
> know if it rebooted in between, so a monotonic timestamp alone
> wouldn't be sufficient.

Ah, that's along the lines of Ulrich's response, I see it now, thanks.

I am not sure if there could be a way around that using boot IDs when
provided by the platform (like the hashes in the case of systemd, assuming
just an equality test is all that's needed, since both histories
cannot arrive at the same time and presumably the FIFO ordering gets
preserved -- it shall under normal circumstances, incl. no time
travel and a sane, non-byzantine process, which might even be partially
detected and acted upon [two differing histories without any
recollection that the node was fenced by the cluster? fence it for sure
and for good measure!]).
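
(For context, the boot ID mentioned here is exposed by systemd-based platforms and
is cheap to read; a sketch, assuming an equality test between two history entries
is all that would be needed:)

    # unique per boot, stable for the lifetime of that boot
    cat /proc/sys/kernel/random/boot_id

    # the same information as systemd tracks it
    journalctl --list-boots | tail -1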

> However, it might be possible to store both time representations in the
> history (and possibly maintain some sort of cluster knowledge about
> monotonic clocks to compare them within and across nodes), and use one
> or the other depending on the context. I haven't tried to determine how
> feasible that would be, but it would be a major project.

Just thinking aloud: the DC node, once it has won the election, will cause
other nodes (incl. newcomers during its rule) to (re)set the
offsets between the DC's and their own internal wall-clock time (for
time-based rules should they ever become DC on their own), all nodes
will (re)establish a monotonic to wall-clock time conversion based on
these one-off inputs, and from that point on they can operate fully
detached from the wall-clock time (so that changing the wall clock
will have no immediate effect on the cluster unless (re)harmonizing is
desired via some configured event handler that could reschedule the
plan appropriately, or delay the sync for a more suitable moment).
This indeed counts on pacemaker being used in a full-fledged cluster mode,
i.e., requiring quorum (not to speak of fencing).

As a corollary, with such a scheme, time-based rules would only be
allowed when quorum is fully honoured (after all, it makes more sense
to employ cron or systemd timer units otherwise, since they all
use local time only, which is then the right fit, as opposed to
a distributed system).

>> We shall not need any explicit time synchronization across the nodes
>> since we are already backed by extended virtual synchrony from
>> corosync, even though it could introduce strangeness when
>> time-based rules kick in.
> 
> Pacemaker determines the state of a resource by replaying its resource
> history in the CIB. A history entry can be replaced only by a newer
> event. Thus if there's a start event in the history, and a stop result
> comes in, we have to know which one is newer to determine whether the
> resource is started or stopped.

But for that, boot ID might be a sufficient determinant, since
a result (keyed with boot ID ABC) for an action never triggered with
boot ID XYZ deemed "current" ATM means that something is off, and
fencing is perhaps the best choice.

> Something along those lines is likely the cause of:
> 
> https://bugs.clusterlabs.org/show_bug.cgi?id=5246

I believe the "detached" scheme sketched above could solve this.
Problem is, the devil is in the details, so it can be unpredictably hard
to get it right in a distributed environment.

-- 
Jan (Poki)




Re: [ClusterLabs] [pacemaker] Discretion with glib v2.59.0+ recommended

2019-02-11 Thread Jan Pokorný
On 20/01/19 12:44 +0100, Jan Pokorný wrote:
> On 18/01/19 20:32 +0100, Jan Pokorný wrote:
>> It was discovered that this release of the glib project slightly changed
>> some parameters of how the distribution of values within hash table
>> structures works, undermining pacemaker's hard (alas unfeasible) attempt
>> to turn this data type into a fully predictable entity.
>> 
>> The current impact is unknown beside some internal regression tests failing
>> due to this, so that, e.g., in the environment variables passed in the
>> notification messages, the order of the active nodes (being a space
>> separated list) may appear shuffled in comparison with the long
>> standing (and perhaps making a false impression of determinism)
>> behaviour witnessed with older versions of glib in the game.
> 
> Our immediate response is to, at the very least, make the
> cts-scheduler regression suite (the only localhost one that was
> rendered broken with 52 tests out of 733 failed) skip those tests
> where reliance on the exact order of hash-table-driven items was
> sported, so it won't fail as a whole:
> 
> https://github.com/ClusterLabs/pacemaker/pull/1677/commits/15ace890ef0b987db035ee2d71994e37f7eaff96
> [above edit: updated with the newer version of the patch]

Shout-out to Ken for fixing the immediate fallout (deterministic
output breakages in some cts-scheduler tests, making the above
change superfluous) for the upcoming 2.0.1 release!

>> Variations like these are expected, and you may take it as an
>> opportunity to fix incorrect order-wise (like in the stated case)
>> assumptions.
> 
> [intentionally CC'd developers@, should have done it since beginning]
> 
> At this point, testing with glib v2.59.0+, preferably using 2.0.1-rc3
> due to the release cycle timing, is VERY DESIRED if you are considering
> providing some volunteer capacity to pacemaker project, especially if
> you have your own agents and scripts that rely on the exact (and
> previously likely stable) order of "set data made linear, hence
> artificially ordered", like with OCF_RESKEY_CRM_meta_notify_active_uname
> environment variable in clone notifications (as was already suggested;
> complete list is also unknown at this point, unfortunately, for a lack
> of systemic and precise data items tracking in general).
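
(For anyone volunteering such testing, a quick way to confirm which glib is actually
in play -- a sketch; the pacemakerd path may differ by distribution:)

    # glib version as seen by pkg-config
    pkg-config --modversion glib-2.0

    # glib library pacemaker is linked against at run time
    ldd /usr/sbin/pacemakerd | grep -i glib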

While some of these if not all are now ordered, I'd call using the
"stable ordered list" approach to these variables, as opposed to the
"plain unordered set" one, from within agents continuously
frowned-upon unless explicitly lifted.  For predictable
backward/forward pacemaker+glib version compatibility if
for no other reason.

Ken, do you agree?

(If so, we shall keep that in mind for future documentation tweaks
[possibly including also OCF updates], so that false assumptions won't
be cast for new agent implementations going forward.)

>> More serious troubles stemming from this expectation-reality mismatch
>> regarding said data type cannot be denied at this point, subject of
>> further investigation.  When in doubt, staying with glib up to and
>> including v2.58.2 (said tests are passing with it, though any later
>> v2.58.* may keep working "as always") is likely a good idea for the
>> time being.

I think this still partially holds and will only be time-proven as fully
settled?  I mean, for anything truly reproducible (as in crm_simulate),
either pacemaker prior to 2.0.1 combined with glib pre- or equal-or-post-
2.59.0 need to be uniformly (reproducers need to follow the original)
combined to get the same results, and with pacemaker 2.0.1+, identical
results (but possibly differing against either of the former combos)
will _likely_ be obtained regardless of particular run-time linked glib
version, but strength of this "likely" will only be established with
future experience, I suppose (but shall universally hold with the same
glib class per stated division, so no change in this already positive
regard).

Just scratched the surface, so gladly be corrected.

-- 
Jan (Poki)




Re: [ClusterLabs] Pacemaker log showing time mismatch after

2019-02-11 Thread Ken Gaillot
On Fri, 2019-02-01 at 08:10 +0100, Jan Pokorný wrote:
> On 28/01/19 09:47 -0600, Ken Gaillot wrote:
> > On Mon, 2019-01-28 at 18:04 +0530, Dileep V Nair wrote:
> > Pacemaker can handle the clock jumping forward, but not backward.
> 
> I am rather surprised, are we not using monotonic time only, then?
> If so, why?

The scheduler runs on a single node (the DC) but must take as input the
resource history (including timestamps) on all nodes. We need wall
clock time to compare against time-based rules. Also, if we get two
resource history entries from a node, we don't know if it rebooted in
between, so a monotonic timestamp alone wouldn't be sufficient.

However, it might be possible to store both time representations in the
history (and possibly maintain some sort of cluster knowledge about
monotonic clocks to compare them within and across nodes), and use one
or the other depending on the context. I haven't tried to determine how
feasible that would be, but it would be a major project.
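
To make the two-representation idea concrete, a hedged shell sketch of what a
history entry could record alongside the usual wall-clock stamp (purely
illustrative, not an existing pacemaker feature):

    # wall-clock time: comparable across nodes, but jumps when the clock is changed
    date +%s

    # monotonic time since boot: never goes backward, but only meaningful per boot
    awk '{print $1}' /proc/uptime

    # boot identifier, to tell two monotonic epochs apart
    cat /proc/sys/kernel/random/boot_id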

> We shall not need any explicit time synchronization across the nodes
> since we are already backed by extended virtual synchrony from
> corosync, even though it could introduce strangeness when
> time-based rules kick in.

Pacemaker determines the state of a resource by replaying its resource
history in the CIB. A history entry can be replaced only by a newer
event. Thus if there's a start event in the history, and a stop result
comes in, we have to know which one is newer to determine whether the
resource is started or stopped.

Something along those lines is likely the cause of:

https://bugs.clusterlabs.org/show_bug.cgi?id=5246
-- 
Ken Gaillot 



Re: [ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-11 Thread Jehan-Guillaume de Rorthais
On Mon, 11 Feb 2019 13:14:53 -0500
Digimer  wrote:

> On 2019-02-11 1:04 p.m., Digimer wrote:
> > On 2019-02-11 12:34 p.m., Jehan-Guillaume de Rorthais wrote:  
> >> On Mon, 11 Feb 2019 11:54:03 -0500
> >> Digimer  wrote:
> >> [...]  
> >>> Also, with pacemaker v2, fencing (stonith) became mandatory at a
> >>> programmatic level.  
> >>
> >> ORLY? What did I miss?
> >>  
> > 
> > It was announced at the last HA summit by Andrew Beekhof. I'll try to
> > find it in the docs somewhere, but iirc, it was going to be something
> > like disabling stonith would effectively put the node into maintenance
> > mode.
> > 
> > Let me find sources though, my recollection has been faulty before. :)  
> 
> I spoke to Ken Gaillot and was told that, though announced, that didn't
> actually make it into 2.0.
> 
> Quote;
> 
> 13:04 < digimer> kgaillot: ping?
> 13:04 < digimer> stonith became required in v2, right?
> 13:05 < digimer> docs on that? I mentioned it in a ML thread but didn't
> cite my sources
> 13:06 <@kgaillot> digimer: no, that turned out to be more difficult in
> code than expected, and time ran short
> 

Oh, OK, thank you for this quick tech-preview Digimer :)
 
> Everything else about stonith being required stands though. :)

Of course :)


Re: [ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-11 Thread Digimer
On 2019-02-11 1:04 p.m., Digimer wrote:
> On 2019-02-11 12:34 p.m., Jehan-Guillaume de Rorthais wrote:
>> On Mon, 11 Feb 2019 11:54:03 -0500
>> Digimer  wrote:
>> [...]
>>> Also, with pacemaker v2, fencing (stonith) became mandatory at a
>>> programmatic level.
>>
>> ORLY? What did I miss?
>>
> 
> It was announced at the last HA summit by Andrew Beekhof. I'll try to
> find it in the docs somewhere, but iirc, it was going to be something
> like disabling stonith would effectively put the node into maintenance
> mode.
> 
> Let me find sources though, my recollection has been faulty before. :)

I spoke to Ken Gaillot and was told that, though announced, that didn't
actually make it into 2.0.

Quote;

13:04 < digimer> kgaillot: ping?
13:04 < digimer> stonith became required in v2, right?
13:05 < digimer> docs on that? I mentioned it in a ML thread but didn't
cite my sources
13:06 <@kgaillot> digimer: no, that turned out to be more difficult in
code than expected, and time ran short


Everything else about stonith being required stands though. :)

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-11 Thread Digimer
On 2019-02-11 12:34 p.m., Jehan-Guillaume de Rorthais wrote:
> On Mon, 11 Feb 2019 11:54:03 -0500
> Digimer  wrote:
> [...]
>> Also, with pacemaker v2, fencing (stonith) became mandatory at a
>> programmatic level.
> 
> ORLY? What did I miss?
> 

It was announced at the last HA summit by Andrew Beekhof. I'll try to
find it in the docs somewhere, but iirc, it was going to be something
like disabling stonith would effectively put the node into maintenance
mode.

Let me find sources though, my recollection has been faulty before. :)

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-11 Thread Jehan-Guillaume de Rorthais
On Mon, 11 Feb 2019 11:54:03 -0500
Digimer  wrote:
[...]
> Also, with pacemaker v2, fencing (stonith) became mandatory at a
> programmatic level.

ORLY? What did I miss?


Re: [ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-11 Thread Digimer
On 2019-02-11 6:34 a.m., Maciej S wrote:
> I was wondering if anyone can give a plain answer if fencing is really
> needed in case there are no shared resources being used (as far as I
> define shared resource). 
> 
> We want to use PAF or other Postgres (with replicated data files on the
> local drives) failover agent together with Corosync, Pacemaker and
> virtual IP resource and I am wondering if there is a need for fencing
> (which is very close bind to an infrastructure) if a Pacemaker is
> already controlling resources state. I know that in failover case there
> might be a need to add functionality to recover master that entered
> dirty shutdown state (eg. in case of power outage), but I can't see any
> case where fencing is really necessary. Am I wrong?
> 
> I was looking for a strict answer but I couldn't find one...
> 
> Regards,
> Maciej

Fencing is as required as wearing a seat belt in a car. You can
physically make things work without it, but the first time you're "in an
accident", you're screwed.

Think of it this way;

If services can run in two or more places at the same time without
coordination, you don't need a cluster, just run things everywhere. If
you need coordination though, you need fencing.

The role of fencing is to take a node that has entered an unknown
state and force it into a known state. In a system that requires
coordination, oftentimes fencing is the only way to ensure sane operation.

Also, with pacemaker v2, fencing (stonith) became mandatory at a
programmatic level.
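
For completeness, a minimal sketch of keeping fencing enabled with pcs (the fence
agent and its options below are placeholders, not a recommendation for any
particular hardware):

    # is fencing currently enabled?
    pcs property list --all | grep stonith-enabled

    # enable it and define at least one fence device
    pcs property set stonith-enabled=true
    pcs stonith create fence_node1 fence_ipmilan pcmk_host_list=node1 \
        ip=192.0.2.10 username=admin password=secret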

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould


Re: [ClusterLabs] fence_azure_arm flooding syslog

2019-02-11 Thread Thomas Berreis
Hi Oyvind,

Debug mode doesn't help me. The error occurs after successfully getting
Bearer token. No other issues are shown:

DEBUG:requests_oauthlib.oauth2_session:Invoking 0 token response hooks.
2019-02-11 15:27:12,687 DEBUG: Invoking 0 token response hooks.
DEBUG:requests_oauthlib.oauth2_session:Obtained token {u'resource':
u'https://management.core.windows.net/', u'access_token': ***',
u'ext_expires_in': u'3599', u'expires_in': u'3599', u'expires_at':
1549902431.687557, u'token_type': u'Bearer', u'not_before': u'1549898532',
u'expires_on': u'1549902432'}.
2019-02-11 15:27:12,687 DEBUG: Obtained token {u'resource':
u'https://management.core.windows.net/', u'access_token': u'***',
u'ext_expires_in': u'3599', u'expires_in': u'3599', u'expires_at':
1549902431.687557, u'token_type': u'Bearer', u'not_before': u'1549898532',
u'expires_on': u'1549902432'}.
WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available. Install the keyrings.alt package if
you want to use the non-recommended backends. See README.rst for details.
2019-02-11 15:27:12,687 WARNING: Keyring cache token has failed: No
recommended backend was available. Install the keyrings.alt package if you
want to use the non-recommended backends. See README.rst for details.
DEBUG:requests_oauthlib.oauth2_session:Encoding `client_id` "***" with
`client_secret` as Basic auth credentials.
2019-02-11 15:27:12,727 DEBUG: Encoding `client_id` "***" with
`client_secret` as Basic auth credentials.
DEBUG:requests_oauthlib.oauth2_session:Requesting url
https://login.microsoftonline.com/***/oauth2/token using method POST.
2019-02-11 15:27:12,727 DEBUG: Requesting url
https://login.microsoftonline.com/***/oauth2/token using method POST.

2019-02-11 15:27:13,072 DEBUG: Obtained token {u'resource':
u'https://management.core.windows.net/', u'access_token': u'e***w',
u'ext_expires_in': u'3600', u'expires_in': u'3600', u'expires_at':
1549902433.072768, u'token_type': u'Bearer', u'not_before': u'1549898533',
u'expires_on': u'1549902433'}.
WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available. Install the keyrings.alt package if
you want to use the non-recommended backends. See README.rst for details.
2019-02-11 15:27:13,073 WARNING: Keyring cache token has failed: No
recommended backend was available. Install the keyrings.alt package if you
want to use the non-recommended backends. See README.rst for details.
INFO:root:getting power status for VM node01

Thomas
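
(As the warning text itself suggests, one hedged workaround is to install the
alternative keyring backends into whichever Python environment the fence agent runs
under; this only addresses the warning, not the monitor timeout:)

    # provide a (non-recommended) keyring backend so the token cache stops complaining
    pip install keyrings.alt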

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Montag, 11. Februar 2019 14:58
To: Cluster Labs - All topics related to open-source clustering welcomed

Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

You can try adding verbose=1 and see if that gives you some more details on
where it's failing.

On 11/02/19 14:47 +0100, Thomas Berreis wrote:
>We need this feature because shutdown / reboot takes too much time (> 5
>min) and network fencing fences the virtual machine much faster (< 5 sec).
>We finished all the required steps and network fencing works as 
>expected but I'm still confused about these errors in the log and the 
>failure counts showed by pcs ...
>
>fence_azure_arm: Keyring cache token has failed: No recommended backend 
>was available. Install the keyrings.alt package if you want to use the 
>non-recommended backends. See README.rst for details.
>
>stonith-ng[7789]: warning: fence_azure_arm[7896] stderr: [ 2019-02-11
>13:41:19,178 WARNING: Keyring cache token has failed: No recommended 
>backend was available. Install the keyrings.alt package if you want to 
>use the non-recommended backends. See README.rst for details. ]
>
>stonith-net_monitor_6 on node02 'unknown error' (1): call=369, 
>status=Timed Out, exitreason='', last-rc-change='Mon Feb 11 10:38:02 
>2019', queued=0ms, exec=20007ms
>
>Thomas
>
>-Original Message-
>From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind 
>Albrigtsen
>Sent: Montag, 11. Februar 2019 13:58
>To: Cluster Labs - All topics related to open-source clustering 
>welcomed 
>Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog
>
>Oh. Thanks for clarifying.
>
>Network-fencing requires extra setup as explained if you run "pcs 
>stonith resource describe fence_azure_arm".
>
>You probably dont need network-fencing, so you can just skip that part.
>
>On 11/02/19 13:50 +0100, Thomas Berreis wrote:
>>> It seems like the issue might be the space in "pcmk_host_list= node01".
>>Sorry, this typo was only in my mail because I anonymized in- and output.
>>
>>The issue only occurs when I add the parameter "network-fencing=on".
>>
>>WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
>>No recommended backend was available. Install the keyrings.alt package 
>>if you want to use the non-recommended backends. See README.rst for
>details.
>>201

Re: [ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-11 Thread Jehan-Guillaume de Rorthais
On Mon, 11 Feb 2019 12:34:44 +0100
Maciej S  wrote:

> I was wondering if anyone can give a plain answer if fencing is really
> needed in case there are no shared resources being used (as far as I define
> shared resource).
> 
> We want to use PAF or other Postgres (with replicated data files on the
> local drives) 

I'm not sure I understand. Are you talking about PostgreSQL internal replication
or FS replication? What do you mean by "on local drives"?

> failover agent together with Corosync, Pacemaker and virtual
> IP resource and I am wondering if there is a need for fencing (which is
> very close bind to an infrastructure) if a Pacemaker is already controlling
> resources state. I know that in failover case there might be a need to add
> functionality to recover master that entered dirty shutdown state (eg. in
> case of power outage), but I can't see any case where fencing is really
> necessary. Am I wrong?
> 
> I was looking for a strict answer but I couldn't find one...

You need fencing for various reasons:

* with the default config, Pacemaker will refuse to promote a new primary if
  the state of the current primary is unknown. Fencing solves this.
* a non-responsive primary doesn't imply the primary failed. Before
  failing over, the cluster must be able to determine the state of the
  resource/node by itself: this is fencing.
* your service is not safe after a failover if you leave a rogue node in an
  inconsistent state alive. What if the node comes back to activity after
  some freezing time with all its services still running? IP and PgSQL up? Maybe
  Pacemaker will detect it, but 1) it might take some time (interval) 2) the
  action taken might be the opposite of what you want and trigger another
  unavailability if the service must move again.
* split brain is usually much worse than a simple service outage.
* I would recommend NOT auto-failing back a node. This is another layer of high
  complexity with additional failure scenarios. If PostgreSQL or a node went
  wrong, a human needs to understand why before getting it back on its feet.

Good luck.


[ClusterLabs] Antw: Is fencing really a must for Postgres failover?

2019-02-11 Thread Ulrich Windl
>>> Maciej S  wrote on 11.02.2019 at 12:34 in message
:
> I was wondering if anyone can give a plain answer if fencing is really
> needed in case there are no shared resources being used (as far as I define
> shared resource).
> 
> We want to use PAF or other Postgres (with replicated data files on the
> local drives) failover agent together with Corosync, Pacemaker and virtual
> IP resource and I am wondering if there is a need for fencing (which is
> very close bind to an infrastructure) if a Pacemaker is already controlling
> resources state. I know that in failover case there might be a need to add
> functionality to recover master that entered dirty shutdown state (eg. in
> case of power outage), but I can't see any case where fencing is really
> necessary. Am I wrong?
> 
> I was looking for a strict answer but I couldn't find one...

I think you can try without.

(as it was on TV yesterday: in "Terminator 2" Arnie rides a motorcycle 
without a helmet... so if you feel you can do pacemaker without fencing, go 
ahead ;-))

> 
> Regards,
> Maciej






Re: [ClusterLabs] fence_azure_arm flooding syslog

2019-02-11 Thread Oyvind Albrigtsen

You can try adding verbose=1 and see if that gives you some more
details on where it's failing.
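
(For a cluster-managed fence device, the option can be added in place; "stonith-net"
below is simply the device name used elsewhere in this thread:)

    # turn on verbose logging for the existing fence device, then watch syslog
    pcs stonith update stonith-net verbose=1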

On 11/02/19 14:47 +0100, Thomas Berreis wrote:

We need this feature because shutdown / reboot takes too much time ( > 5
min) and network fencing fences the virtual machine much faster ( < 5 sec).
We finished all the required steps and network fencing works as expected but
I'm still confused about these errors in the log and the failure counts
showed by pcs ...

fence_azure_arm: Keyring cache token has failed: No recommended backend was
available. Install the keyrings.alt package if you want to use the
non-recommended backends. See README.rst for details.

stonith-ng[7789]: warning: fence_azure_arm[7896] stderr: [ 2019-02-11
13:41:19,178 WARNING: Keyring cache token has failed: No recommended backend
was available. Install the keyrings.alt package if you want to use the
non-recommended backends. See README.rst for details. ]

stonith-net_monitor_6 on node02 'unknown error' (1): call=369,
status=Timed Out, exitreason='', last-rc-change='Mon Feb 11 10:38:02 2019',
queued=0ms, exec=20007ms

Thomas

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Montag, 11. Februar 2019 13:58
To: Cluster Labs - All topics related to open-source clustering welcomed

Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

Oh. Thanks for clarifying.

Network-fencing requires extra setup as explained if you run "pcs stonith
resource describe fence_azure_arm".

You probably don't need network-fencing, so you can just skip that part.

On 11/02/19 13:50 +0100, Thomas Berreis wrote:

It seems like the issue might be the space in "pcmk_host_list= node01".

Sorry, this typo was only in my mail because I anonymized in- and output.

The issue only occurs when I add the parameter "network-fencing=on".

WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available. Install the keyrings.alt package
if you want to use the non-recommended backends. See README.rst for

details.

2019-02-11 12:29:37,079 WARNING: Keyring cache token has failed: No
recommended backend was available. Install the keyrings.alt package if
you want to use the non-recommended backends. See README.rst for details.


Thomas

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Montag, 11. Februar 2019 10:38
To: Cluster Labs - All topics related to open-source clustering
welcomed 
Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

pcmk_host_list is used by Pacemaker to limit which hosts can be fenced
by the STONITH device, so you'll have to use the plug parameter for

testing.

Pacemaker uses it internally when it fences a host, so you wouldn't use
plug= as part of the pcs stonith create command.

It seems like the issue might be the space in "pcmk_host_list= node01".

Oyvind

On 10/02/19 14:43 +0100, Thomas Berreis wrote:

Hi Oyvind,

Seems, option "pcmk_host_list" is not supported by fence_azure_arm.
Instead fence_azure_arm requires "plug" and however this is not
supported

by pcs.

What's the correct syntax to add such a stonith device with

fence_azure_arm?


[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
pcmk_host_list= node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***
WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'

ERROR:root:Failed: You have to enter plug number or machine
identification

2019-02-10 12:54:51,862 ERROR: Failed: You have to enter plug number
or machine identification

--

[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
pcmk_host_list= node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***
plug=node01
WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'

WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available.
2019-02-10 12:56:30,377 WARNING: Keyring cache token has failed: No
recommended backend was available. Install th
WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available.
2019-02-10 12:56:30,729 WARNING: Keyring cache token has failed: No
recommended backend was available. Install th
Success: Powered OFF

--

[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
plug=node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***

Success: Powered OFF

--

[root ~]# pcs stonith create stonith-net fence_azure_arm [...]
plug=node01 network-fencing=on
Error: invalid stonith option 'plug'

[root ~]# fence_azure_arm -h
Usage:
   fence_azure_arm [options]
Options:
[...]
  -n, --plug=[id]Physical plug number on device, UUID or
identification of machine


Thanks!

Thomas

-Original Message-
From: Users [mailto:users

Re: [ClusterLabs] fence_azure_arm flooding syslog

2019-02-11 Thread Thomas Berreis
We need this feature because shutdown / reboot takes too much time ( > 5
min) and network fencing fences the virtual machine much faster ( < 5 sec).
We finished all the required steps and network fencing works as expected but
I'm still confused about these errors in the log and the failure counts
shown by pcs ...

fence_azure_arm: Keyring cache token has failed: No recommended backend was
available. Install the keyrings.alt package if you want to use the
non-recommended backends. See README.rst for details.

stonith-ng[7789]: warning: fence_azure_arm[7896] stderr: [ 2019-02-11
13:41:19,178 WARNING: Keyring cache token has failed: No recommended backend
was available. Install the keyrings.alt package if you want to use the
non-recommended backends. See README.rst for details. ]

stonith-net_monitor_6 on node02 'unknown error' (1): call=369,
status=Timed Out, exitreason='', last-rc-change='Mon Feb 11 10:38:02 2019',
queued=0ms, exec=20007ms

Thomas

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Montag, 11. Februar 2019 13:58
To: Cluster Labs - All topics related to open-source clustering welcomed

Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

Oh. Thanks for clarifying.

Network-fencing requires extra setup as explained if you run "pcs stonith
resource describe fence_azure_arm".

You probably don't need network-fencing, so you can just skip that part.

On 11/02/19 13:50 +0100, Thomas Berreis wrote:
>> It seems like the issue might be the space in "pcmk_host_list= node01".
>Sorry, this typo was only in my mail because I anonymized in- and output.
>
>The issue only occurs when I add the parameter "network-fencing=on".
>
>WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
>No recommended backend was available. Install the keyrings.alt package 
>if you want to use the non-recommended backends. See README.rst for
details.
>2019-02-11 12:29:37,079 WARNING: Keyring cache token has failed: No 
>recommended backend was available. Install the keyrings.alt package if 
>you want to use the non-recommended backends. See README.rst for details.
>
>
>Thomas
>
>-Original Message-
>From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind 
>Albrigtsen
>Sent: Montag, 11. Februar 2019 10:38
>To: Cluster Labs - All topics related to open-source clustering 
>welcomed 
>Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog
>
>pcmk_host_list is used by Pacemaker to limit which hosts can be fenced 
>by the STONITH device, so you'll have to use the plug parameter for
testing.
>Pacemaker uses it internally when it fences a host, so you wouldnt use 
>plug= as part of the pcs stonith create command.
>
>It seems like the issue might be the space in "pcmk_host_list= node01".
>
>Oyvind
>
>On 10/02/19 14:43 +0100, Thomas Berreis wrote:
>>Hi Oyvind,
>>
>>Seems, option "pcmk_host_list" is not supported by fence_azure_arm.
>>Instead fence_azure_arm requires "plug" and however this is not 
>>supported
>by pcs.
>>What's the correct syntax to add such a stonith device with
>fence_azure_arm?
>>
>>[root ~]# fence_azure_arm
>>login=***
>>network-fencing=on
>>passwd=***
>>pcmk_host_list= node01
>>resourceGroup=***
>>retry_on=0
>>subscriptionId=***
>>tenantId=***
>>WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'
>>
>>ERROR:root:Failed: You have to enter plug number or machine 
>>identification
>>
>>2019-02-10 12:54:51,862 ERROR: Failed: You have to enter plug number 
>>or machine identification
>>
>>--
>>
>>[root ~]# fence_azure_arm
>>login=***
>>network-fencing=on
>>passwd=***
>>pcmk_host_list= node01
>>resourceGroup=***
>>retry_on=0
>>subscriptionId=***
>>tenantId=***
>>plug=node01
>>WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'
>>
>>WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
>>No recommended backend was available.
>>2019-02-10 12:56:30,377 WARNING: Keyring cache token has failed: No 
>>recommended backend was available. Install th 
>>WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
>>No recommended backend was available.
>>2019-02-10 12:56:30,729 WARNING: Keyring cache token has failed: No 
>>recommended backend was available. Install th
>>Success: Powered OFF
>>
>>--
>>
>>[root ~]# fence_azure_arm
>>login=***
>>network-fencing=on
>>passwd=***
>>plug=node01
>>resourceGroup=***
>>retry_on=0
>>subscriptionId=***
>>tenantId=***
>>
>>Success: Powered OFF
>>
>>--
>>
>>[root ~]# pcs stonith create stonith-net fence_azure_arm [...]
>>plug=node01 network-fencing=on
>>Error: invalid stonith option 'plug'
>>
>>[root ~]# fence_azure_arm -h
>>Usage:
>>fence_azure_arm [options]
>>Options:
>>[...]
>>   -n, --plug=[id]Physical plug number on device, UUID or
>>identification of machine
>>
>>
>>Thanks!
>>
>>Thoma

Re: [ClusterLabs] fence_azure_arm flooding syslog

2019-02-11 Thread Oyvind Albrigtsen

Oh. Thanks for clarifying.

Network-fencing requires extra setup as explained if you run "pcs stonith
resource describe fence_azure_arm".

You probably don't need network-fencing, so you can just skip that
part.

On 11/02/19 13:50 +0100, Thomas Berreis wrote:

It seems like the issue might be the space in "pcmk_host_list= node01".

Sorry, this typo was only in my mail because I anonymized in- and output.

The issue only occurs when I add the parameter "network-fencing=on".

WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available. Install the keyrings.alt package if
you want to use the non-recommended backends. See README.rst for details.
2019-02-11 12:29:37,079 WARNING: Keyring cache token has failed: No
recommended backend was available. Install the keyrings.alt package if you
want to use the non-recommended backends. See README.rst for details.


Thomas

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Montag, 11. Februar 2019 10:38
To: Cluster Labs - All topics related to open-source clustering welcomed

Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

pcmk_host_list is used by Pacemaker to limit which hosts can be fenced by
the STONITH device, so you'll have to use the plug parameter for testing.
Pacemaker uses it internally when it fences a host, so you wouldn't use plug=
as part of the pcs stonith create command.
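
A hedged example of such a manual test, feeding the parameters on stdin one per line
and finishing with Ctrl+D (all values below are placeholders):

    fence_azure_arm
    login=app-id-placeholder
    passwd=secret-placeholder
    tenantId=tenant-placeholder
    subscriptionId=subscription-placeholder
    resourceGroup=resource-group-placeholder
    plug=node01
    action=status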

It seems like the issue might be the space in "pcmk_host_list= node01".

Oyvind

On 10/02/19 14:43 +0100, Thomas Berreis wrote:

Hi Oyvind,

It seems the option "pcmk_host_list" is not supported by fence_azure_arm.
Instead, fence_azure_arm requires "plug"; however, this is not supported
by pcs.

What's the correct syntax to add such a stonith device with

fence_azure_arm?


[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
pcmk_host_list= node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***
WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'

ERROR:root:Failed: You have to enter plug number or machine
identification

2019-02-10 12:54:51,862 ERROR: Failed: You have to enter plug number or
machine identification

--

[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
pcmk_host_list= node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***
plug=node01
WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'

WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available.
2019-02-10 12:56:30,377 WARNING: Keyring cache token has failed: No
recommended backend was available. Install th
WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available.
2019-02-10 12:56:30,729 WARNING: Keyring cache token has failed: No
recommended backend was available. Install th
Success: Powered OFF

--

[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
plug=node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***

Success: Powered OFF

--

[root ~]# pcs stonith create stonith-net fence_azure_arm [...]
plug=node01 network-fencing=on
Error: invalid stonith option 'plug'

[root ~]# fence_azure_arm -h
Usage:
   fence_azure_arm [options]
Options:
[...]
  -n, --plug=[id]Physical plug number on device, UUID or
identification of machine


Thanks!

Thomas

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Freitag, 8. Februar 2019 14:16
To: Cluster Labs - All topics related to open-source clustering
welcomed 
Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

Hi Thomas,

It looks like it's trying to use credentials from keyring, which
generates these error messages.

Maybe there's a typo in one of the authentication parameters in your
config. Can you post the output of "pcs stonith show --full"?

You can try running fence_azure_arm without parameters on the command
line and copy the parameters (one per line) to it, and add e.g.
action=status and end with Ctrl+D to see .


Oyvind

On 08/02/19 14:02 +0100, Thomas Berreis wrote:

Hi,



I'm having a two-node cluster in Azure with fencing enabled via module
fence_azure_arm.

The cluster management is flooding my syslog on both nodes with these
messages:



** fence_azure_arm: Keyring cache token has failed: No recommended
backend was available. Install the keyrings.alt package if you want to
use the non-recommended backends. See README.rst for details.



** stonith-ng[5220]: warning: fence_azure_arm[14847] stderr: [
2019-02-08
12:52:02,703 WARNING: Keyring cache token has failed: No recommended
backend was available. Install the keyrings.alt package if you want to
use the non-recommended backends. See README.rst for details. ]





Can anyone give me a hint

Re: [ClusterLabs] fence_azure_arm flooding syslog

2019-02-11 Thread Thomas Berreis
> It seems like the issue might be the space in "pcmk_host_list= node01".
Sorry, this typo was only in my mail because I anonymized in- and output.

The issue only occurs when I add the parameter "network-fencing=on".

WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available. Install the keyrings.alt package if
you want to use the non-recommended backends. See README.rst for details.
2019-02-11 12:29:37,079 WARNING: Keyring cache token has failed: No
recommended backend was available. Install the keyrings.alt package if you
want to use the non-recommended backends. See README.rst for details.


Thomas

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Montag, 11. Februar 2019 10:38
To: Cluster Labs - All topics related to open-source clustering welcomed

Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

pcmk_host_list is used by Pacemaker to limit which hosts can be fenced by
the STONITH device, so you'll have to use the plug parameter for testing.
Pacemaker uses it internally when it fences a host, so you wouldn't use plug=
as part of the pcs stonith create command.

It seems like the issue might be the space in "pcmk_host_list= node01".

Oyvind

On 10/02/19 14:43 +0100, Thomas Berreis wrote:
>Hi Oyvind,
>
>It seems the option "pcmk_host_list" is not supported by fence_azure_arm.
>Instead, fence_azure_arm requires "plug"; however, this is not supported
by pcs.
>What's the correct syntax to add such a stonith device with
fence_azure_arm?
>
>[root ~]# fence_azure_arm
>login=***
>network-fencing=on
>passwd=***
>pcmk_host_list= node01
>resourceGroup=***
>retry_on=0
>subscriptionId=***
>tenantId=***
>WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'
>
>ERROR:root:Failed: You have to enter plug number or machine 
>identification
>
>2019-02-10 12:54:51,862 ERROR: Failed: You have to enter plug number or 
>machine identification
>
>--
>
>[root ~]# fence_azure_arm
>login=***
>network-fencing=on
>passwd=***
>pcmk_host_list= node01
>resourceGroup=***
>retry_on=0
>subscriptionId=***
>tenantId=***
>plug=node01
>WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'
>
>WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
>No recommended backend was available.
>2019-02-10 12:56:30,377 WARNING: Keyring cache token has failed: No 
>recommended backend was available. Install th 
>WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
>No recommended backend was available.
>2019-02-10 12:56:30,729 WARNING: Keyring cache token has failed: No 
>recommended backend was available. Install th
>Success: Powered OFF
>
>--
>
>[root ~]# fence_azure_arm
>login=***
>network-fencing=on
>passwd=***
>plug=node01
>resourceGroup=***
>retry_on=0
>subscriptionId=***
>tenantId=***
>
>Success: Powered OFF
>
>--
>
>[root ~]# pcs stonith create stonith-net fence_azure_arm [...] 
>plug=node01 network-fencing=on
>Error: invalid stonith option 'plug'
>
>[root ~]# fence_azure_arm -h
>Usage:
>fence_azure_arm [options]
>Options:
>[...]
>   -n, --plug=[id]Physical plug number on device, UUID or
>identification of machine
>
>
>Thanks!
>
>Thomas
>
>-Original Message-
>From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind 
>Albrigtsen
>Sent: Freitag, 8. Februar 2019 14:16
>To: Cluster Labs - All topics related to open-source clustering 
>welcomed 
>Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog
>
>Hi Thomas,
>
>It looks like it's trying to use credentials from keyring, which 
>generates these error messages.
>
>Maybe there's a typo in one of the authentication parameters in your 
>config. Can you post the output of "pcs stonith show --full"?
>
>You can try running fence_azure_arm without parameters on the command 
>line and copy the parameters (one per line) to it, and add e.g.
>action=status and end with Ctrl+D to see .
>
>
>Oyvind
>
>On 08/02/19 14:02 +0100, Thomas Berreis wrote:
>>Hi,
>>
>>
>>
>>I'm having a two-node cluster in Azure with fencing enabled via module 
>>fence_azure_arm.
>>
>>The cluster management is flooding my syslog on both nodes with these
>>messages:
>>
>>
>>
>>** fence_azure_arm: Keyring cache token has failed: No recommended 
>>backend was available. Install the keyrings.alt package if you want to 
>>use the non-recommended backends. See README.rst for details.
>>
>>
>>
>>** stonith-ng[5220]: warning: fence_azure_arm[14847] stderr: [
>>2019-02-08
>>12:52:02,703 WARNING: Keyring cache token has failed: No recommended 
>>backend was available. Install the keyrings.alt package if you want to 
>>use the non-recommended backends. See README.rst for details. ]
>>
>>
>>
>>
>>
>>Can anyone give me a hint how to fix these issue?
>>
>>
>>
>>pcs version: 0.9.165
>>
>>pacemaker version: 1.1.

[ClusterLabs] Is fencing really a must for Postgres failover?

2019-02-11 Thread Maciej S
I was wondering if anyone can give a plain answer as to whether fencing is really
needed in case there are no shared resources being used (as far as I define a
shared resource).

We want to use PAF or another Postgres failover agent (with data files replicated
on the local drives) together with Corosync, Pacemaker and a virtual
IP resource, and I am wondering if there is a need for fencing (which is
very closely bound to the infrastructure) if Pacemaker is already controlling
the resources' state. I know that in the failover case there might be a need to add
functionality to recover a master that entered a dirty shutdown state (e.g. in
case of a power outage), but I can't see any case where fencing is really
necessary. Am I wrong?

I was looking for a strict answer but I couldn't find one...

Regards,
Maciej
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

2019-02-11 Thread Klaus Wenninger
On 02/11/2019 09:49 AM, Fulong Wang wrote:
> Thanks Yan,
>
> You gave me more valuable hints on the SBD operation!
> Now, I can see the verbose output after the service restart.
>
>
> >Be aware since pacemaker integration (-P) is enabled by default, which 
> >means despite the sbd failure, if the node itself was clean and 
> >"healthy" from pacemaker's point of view and if it's in the cluster 
> >partition with the quorum, it wouldn't self-fence -- meaning a node just 
> >being unable to fence doesn't necessarily need to be fenced.
>
> >As described in sbd man page, "this allows sbd to survive temporary 
> >outages of the majority of devices. However, while the cluster is in 
> >such a degraded state, it can neither successfully fence nor be shutdown 
> >cleanly (as taking the cluster below the quorum threshold will 
> >immediately cause all remaining nodes to self-fence). In short, it will 
> >not tolerate any further faults.  Please repair the system before 
> >continuing."
>
> Yes, I can see the "pacemaker integration" was enabled in my sbd
> config file by default.
> So you mean that in some sbd failure cases, if the node is considered
> "healthy" from pacemaker's point of view, it still wouldn't self-fence.
>
> Honestly speaking, I didn't quite get you at this point. I have
> "no-quorum-policy=ignore" set in my setup and it's a two-node
> cluster.
> Can you show me a sample situation for this?

When using sbd with 2-node-clusters and pacemaker-integration you might
check
https://github.com/ClusterLabs/sbd/commit/4bd0a66da3ac9c9afaeb8a2468cdd3ed51ad3377
to be included in your sbd-version.
This is relevant when 2-node is configured in corosync.
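For reference, "2-node" here means the votequorum two_node flag; a minimal
sketch of the relevant corosync.conf section (values are illustrative, your
file will contain more than this):

quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node implicitly enables wait_for_all unless you override it
}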

Regards,
Klaus

>
> Many Thanks!!!
>
>
>
>
> Regards
> Fulong
>
>
>
> 
> *From:* Gao,Yan 
> *Sent:* Thursday, January 3, 2019 20:43
> *To:* Fulong Wang; Cluster Labs - All topics related to open-source
> clustering welcomed
> *Subject:* Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
>  
> On 12/24/18 7:10 AM, Fulong Wang wrote:
> > Yan, klaus and Everyone,
> >
> >
> >   Merry Christmas!!!
> >
> >
> >
> > Many thanks for your advice!
> > I added the "-v" param to "SBD_OPTS", but didn't see any apparent
> > change in the system message log. Am I looking at the wrong place?
> Did you restart all cluster services, for example by "crm cluster stop"
> and then "crm cluster start"? Basically sbd.service needs to be
> restarted. Be aware "systemctl restart pacemaker" only restarts pacemaker.
>
> SBD daemons log to syslog. When an sbd watcher receives a "test"
> command, a syslog message like this should show up:
>
> "servant: Received command test from ..."
>
> sbd won't actually do anything about a "test" command other than logging a
> message.
>
> If you are not yet running a recent version of sbd (maintenance update), a
> single "-v" will already make sbd quite verbose. But of course you could
> use grep.
>
> >
> > By the way, we want to test that when the disk access paths (multipath
> > devices) are lost, sbd can fence the node automatically.
> Be aware since pacemaker integration (-P) is enabled by default, which
> means despite the sbd failure, if the node itself was clean and
> "healthy" from pacemaker's point of view and if it's in the cluster
> partition with the quorum, it wouldn't self-fence -- meaning a node just
> being unable to fence doesn't necessarily need to be fenced.
>
> As described in sbd man page, "this allows sbd to survive temporary
> outages of the majority of devices. However, while the cluster is in
> such a degraded state, it can neither successfully fence nor be shutdown
> cleanly (as taking the cluster below the quorum threshold will
> immediately cause all remaining nodes to self-fence). In short, it will
> not tolerate any further faults.  Please repair the system before
> continuing."
>
> Regards,
>    Yan
>
>
> > what's your recommendation for this scenario?
> >
> > The "crm node fence"  did the work.
> >
> > Regards
> > Fulong
> >
> > 
> > *From:* Gao,Yan 
> > *Sent:* Friday, December 21, 2018 20:43
> > *To:* kwenn...@redhat.com; Cluster Labs - All topics related to
> > open-source clustering welcomed; Fulong Wang
> > *Subject:* Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
> > First thanks for your reply, Klaus!
> >
> > On 2018/12/21 10:09, Klaus Wenninger wrote:
> >> On 12/21/2018 08:15 AM, Fulong Wang wrote:
> >>> Hello Experts,
> >>>
> >>> I'm new to this mailing list.
> >>> Please kindly forgive me if this mail has disturbed you!
> >>>
> >>> Our company is currently evaluating the usage of SuSE HAE on the x86
> >>> platform.
> >>> When simulating a storage disaster fail-over, I finally found that
> >>> the SBD communication functioned normally on SuSE11 SP4 but abnormally
> >>> on SuSE12 SP3.
> >>
> >> I have no experien

Re: [ClusterLabs] fence_azure_arm flooding syslog

2019-02-11 Thread Oyvind Albrigtsen

pcmk_host_list is used by Pacemaker to limit which hosts can be fenced
by the STONITH device, so you'll have to use the plug parameter for
testing. Pacemaker uses it internally when it fences a host, so you
wouldn't use plug= as part of the pcs stonith create command.

It seems like the issue might be the space in "pcmk_host_list= node01".
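For example, something along these lines should work (all values are
placeholders for your own Azure credentials; note there is no space after
the '='):

pcs stonith create stonith-net fence_azure_arm \
    login=<appid> passwd=<secret> tenantId=<tenant> \
    subscriptionId=<subscription> resourceGroup=<rg> \
    network-fencing=on pcmk_host_list="node01 node02"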

Oyvind

On 10/02/19 14:43 +0100, Thomas Berreis wrote:

Hi Oyvind,

It seems the option "pcmk_host_list" is not supported by fence_azure_arm.
Instead, fence_azure_arm requires "plug", but that option is not supported by
pcs. What's the correct syntax to add such a stonith device with
fence_azure_arm?

[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
pcmk_host_list= node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***
WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'

ERROR:root:Failed: You have to enter plug number or machine identification

2019-02-10 12:54:51,862 ERROR: Failed: You have to enter plug number or
machine identification

--

[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
pcmk_host_list= node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***
plug=node01
WARNING:root:Parse error: Ignoring unknown option 'pcmk_host_list= node01'

WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available.
2019-02-10 12:56:30,377 WARNING: Keyring cache token has failed: No
recommended backend was available. Install th
WARNING:msrestazure.azure_active_directory:Keyring cache token has failed:
No recommended backend was available.
2019-02-10 12:56:30,729 WARNING: Keyring cache token has failed: No
recommended backend was available. Install th
Success: Powered OFF

--

[root ~]# fence_azure_arm
login=***
network-fencing=on
passwd=***
plug=node01
resourceGroup=***
retry_on=0
subscriptionId=***
tenantId=***

Success: Powered OFF

--

[root ~]# pcs stonith create stonith-net fence_azure_arm [...] plug=node01
network-fencing=on
Error: invalid stonith option 'plug'

[root ~]# fence_azure_arm -h
Usage:
   fence_azure_arm [options]
Options:
[...]
  -n, --plug=[id]   Physical plug number on device, UUID or
                    identification of machine


Thanks!

Thomas

-Original Message-
From: Users [mailto:users-boun...@clusterlabs.org] On Behalf Of Oyvind
Albrigtsen
Sent: Friday, 8 February 2019 14:16
To: Cluster Labs - All topics related to open-source clustering welcomed

Subject: Re: [ClusterLabs] fence_azure_arm flooding syslog

Hi Thomas,

It looks like it's trying to use credentials from keyring, which generates
these error messages.

Maybe there's a typo in one of the authentication parameters in your
config. Can you post the output of "pcs stonith show --full"?

You can try running fence_azure_arm without parameters on the command line,
paste the parameters (one per line) into it, add e.g. action=status, and end
with Ctrl+D to see the result.
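A sketch of such a test via stdin (a here-document is equivalent to typing
the lines and finishing with Ctrl+D; the values are placeholders):

fence_azure_arm <<'EOF'
login=<appid>
passwd=<secret>
tenantId=<tenant>
subscriptionId=<subscription>
resourceGroup=<rg>
plug=node01
action=status
EOF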


Oyvind

On 08/02/19 14:02 +0100, Thomas Berreis wrote:

Hi,



I have a two-node cluster in Azure with fencing enabled via the
fence_azure_arm module.

The cluster management is flooding my syslog on both nodes with these
messages:



** fence_azure_arm: Keyring cache token has failed: No recommended
backend was available. Install the keyrings.alt package if you want to
use the non-recommended backends. See README.rst for details.



** stonith-ng[5220]: warning: fence_azure_arm[14847] stderr: [
2019-02-08
12:52:02,703 WARNING: Keyring cache token has failed: No recommended
backend was available. Install the keyrings.alt package if you want to
use the non-recommended backends. See README.rst for details. ]





Can anyone give me a hint on how to fix this issue?



pcs version: 0.9.165

pacemaker version: 1.1.19-8.el7_6.1

fence_azure_arm version: 4.2.1

corosync 2.4.3



Thanks!



Thomas




___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

2019-02-11 Thread Fulong Wang
Thanks Yan,

You gave me more valuable hints on the SBD operation!
Now, I can see the verbose output after the service restart.


>Be aware since pacemaker integration (-P) is enabled by default, which
>means despite the sbd failure, if the node itself was clean and
>"healthy" from pacemaker's point of view and if it's in the cluster
>partition with the quorum, it wouldn't self-fence -- meaning a node just
>being unable to fence doesn't necessarily need to be fenced.

>As described in sbd man page, "this allows sbd to survive temporary
>outages of the majority of devices. However, while the cluster is in
>such a degraded state, it can neither successfully fence nor be shutdown
>cleanly (as taking the cluster below the quorum threshold will
>immediately cause all remaining nodes to self-fence). In short, it will
>not tolerate any further faults.  Please repair the system before
>continuing."

Yes, I can see the "pacemaker integration" was enabled in my sbd config file
by default.
So you mean that in some sbd failure cases, if the node is considered
"healthy" from pacemaker's point of view, it still wouldn't self-fence.

Honestly speaking, I didn't quite get you at this point. I have
"no-quorum-policy=ignore" set in my setup and it's a two-node cluster.
Can you show me a sample situation for this?

Many Thanks!!!




Regards
Fulong




From: Gao,Yan 
Sent: Thursday, January 3, 2019 20:43
To: Fulong Wang; Cluster Labs - All topics related to open-source clustering 
welcomed
Subject: Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue

On 12/24/18 7:10 AM, Fulong Wang wrote:
> Yan, klaus and Everyone,
>
>
>   Merry Christmas!!!
>
>
>
> Many thanks for your advice!
> I added the "-v" param to "SBD_OPTS", but didn't see any apparent change
> in the system message log. Am I looking at the wrong place?
Did you restart all cluster services, for example by "crm cluster stop"
and then "crm cluster start"? Basically sbd.service needs to be
restarted. Be aware "systemctl restart pacemaker" only restarts pacemaker.
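For example, on the affected node (just a sketch; the point is that
sbd.service itself must go down and come back up, which a bare "systemctl
restart pacemaker" does not do):

crm cluster stop
crm cluster start
systemctl status sbd.service   # verify sbd was actually restarted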

SBD daemons log to syslog. When an sbd watcher receives a "test"
command, a syslog message like this should show up:

"servant: Received command test from ..."

sbd won't actually do anything about a "test" command other than logging a message.

If you are not yet running a recent version of sbd (maintenance update), a
single "-v" will already make sbd quite verbose. But of course you could
use grep.
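For example, a quick way to exercise this (the device path and node name are
placeholders, and /var/log/messages is assumed to be the syslog target):

# send a harmless "test" command to a node's sbd servant via the shared disk
sbd -d /dev/disk/by-id/<your-sbd-device> message <nodename> test

# then look for the corresponding log entry on that node
grep -i servant /var/log/messages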

>
> By the way, we want to test that when the disk access paths (multipath
> devices) are lost, sbd can fence the node automatically.
Be aware since pacemaker integration (-P) is enabled by default, which
means despite the sbd failure, if the node itself was clean and
"healthy" from pacemaker's point of view and if it's in the cluster
partition with the quorum, it wouldn't self-fence -- meaning a node just
being unable to fence doesn't necessarily need to be fenced.

As described in sbd man page, "this allows sbd to survive temporary
outages of the majority of devices. However, while the cluster is in
such a degraded state, it can neither successfully fence nor be shutdown
cleanly (as taking the cluster below the quorum threshold will
immediately cause all remaining nodes to self-fence). In short, it will
not tolerate any further faults.  Please repair the system before
continuing."

Regards,
   Yan


> what's your recommendation for this scenario?
>
> The "crm node fence"  did the work.
>
> Regards
> Fulong
>
> 
> *From:* Gao,Yan 
> *Sent:* Friday, December 21, 2018 20:43
> *To:* kwenn...@redhat.com; Cluster Labs - All topics related to
> open-source clustering welcomed; Fulong Wang
> *Subject:* Re: [ClusterLabs] SuSE12SP3 HAE SBD Communication Issue
> First thanks for your reply, Klaus!
>
> On 2018/12/21 10:09, Klaus Wenninger wrote:
>> On 12/21/2018 08:15 AM, Fulong Wang wrote:
>>> Hello Experts,
>>>
>>> I'm new to this mailing list.
>>> Please kindly forgive me if this mail has disturbed you!
>>>
>>> Our company is currently evaluating the usage of SuSE HAE on the x86
>>> platform.
>>> When simulating a storage disaster fail-over, I finally found that
>>> the SBD communication functioned normally on SuSE11 SP4 but abnormally
>>> on SuSE12 SP3.
>>
>> I have no experience with SBD on SLES but I know that handling of the
>> logging verbosity-levels has changed recently in the upstream-repo.
>> Given that it was done by Yan Gao iirc I'd assume it went into SLES.
>> So changing the verbosity of the sbd-daemon might get you back
>> these logs.
> Yes, I think it's the issue. Could you please retrieve the latest
> maintenance update for SLE12SP3 and try? Otherwise of course you could
> temporarily enable verbose/debug logging by adding a couple of "-v" into
>"SBD_OPTS" in /etc/sysconfig/sbd.
>
> But frankly, it makes more sense to manually trigger fencing for example
> by "crm node fence" and se