Re: [ClusterLabs] Antw: Salvaging aborted resource migration

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 18:00 +0200, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
> > On Thu, 2018-09-27 at 09:36 +0200, Ulrich Windl wrote:
> > 
> > > Obviously you violated the most important cluster rule that is
> > > "be
> > > patient".  Maybe the next important is "Don't change the
> > > configuration while the cluster is not in IDLE state" ;-)
> > 
> > Agreed -- although even idle, removing a ban can result in a
> > migration
> > back (if something like stickiness doesn't prevent it).
> 
> I've got no problem with that in general.  However, I can't guarantee
> that every configuration change happens in idle state, certain
> operations (mostly resource additions) are done by several
> administrators without synchronization, and of course asynchronous
> cluster events can also happen any time.  So I have to ask: what are
> the
> consequences of breaking this "impossible" rule?

It's not truly a rule, just a "better safe than sorry" approach. In
general the cluster is very forgiving about frequent config changes
from any node. The only non-obvious consequence is that if the cluster
is still making changes based on the previous config, then it will wait
for any action already in progress to complete, then abandon that and
recalculate based on the new config, which might reverse actions just
taken.
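
For scripted changes, one way to approximate the "wait until idle" rule
is to block on crm_resource --wait before touching the CIB. A sketch,
assuming a Pacemaker new enough to ship that option (the resource name
is just the one from this thread):

# block until no cluster actions are pending, then make the change
crm_resource --wait
crm_resource --clear -r vm-conv-4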

> > There's currently no way to tell pacemaker that an operation (i.e.
> > migrate_from) is a no-op and can be ignored. If a migration is only
> > partially completed, it has to be considered a failure and
> > reverted.
> 
> OK.  Are there other complex operations which can "partially
> complete"
> if a transition is aborted by some event?

I believe migration is the only one in that category. Perhaps a restart
could be considered similar, as it involves a separate stop and start,
but a completed stop doesn't have to be reversed in that case, so that
wouldn't cause any similar issues.

> Now let's suppose a pull migration scenario: migrate_to does nothing,
> but in this tiny window a configuration change aborts the transition.
> The resources would go through a full recovery (stop+start), right?

Yes

> Now let's suppose migrate_from gets scheduled and starts performing
> the
> migration.  Before it finishes, a configuration change aborts the
> transition.  The cluster waits for the outstanding operation to
> finish,
> doesn't it?  And if it finishes successfully, is the migration
> considered complete requiring no recovery?

Correct. If an agent has actually been executed, the cluster will wait
for that operation to complete or timeout before recalculating.

(As an aside, that can cause problems of a different sort: if an
operation in progress has a very long timeout and takes that whole
time, it can delay recovery of other resources that newly fail, even if
their recovery would not depend on the outcome of that operation.
That's a complicated problem to solve because that last clause is not
obvious to a computer program without simulating all possible results,
and even then, it can't be sure that the operation won't do something
like change a node attribute that might affect other resources.)

> > I'm not sure why the reload was scheduled; I suspect it's a bug due
> > to
> > a restart being needed but no parameters having changed. There
> > should
> > be special handling for a partial migration to make the stop
> > required.
> 
> Probably CLBZ#5309 again...  You debugged a pe-input file for me with
> a
> similar issue almost exactly a year ago (thread subject "Pacemaker
> resource parameter reload confusion").  Time to upgrade this cluster,
> I
> guess.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Ferenc Wágner
Christine Caulfield  writes:

> I'm also looking into high-res timestamps for logfiles.

Wouldn't that be a useful option for the syslog output as well?  I'm
sometimes concerned by the batching effect added by the transport
between the application and the (local) log server (rsyslog or systemd).
Reliably merging messages from different channels can prove impossible
without internal timestamps (even considering a single machine only).

Another interesting feature could be structured, direct journal output
(if you're looking for challenges).
-- 
Regards,
Feri
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Ferenc Wágner
Ken Gaillot  writes:

> libqb would simply provide the API for reopening the log, and clients
> such as pacemaker would intercept the signal and call the API.

Just for posterity: you needn't restrict yourself to signals.  Logrotate
has nothing to do with signals.  Signals are a rather limited form of
IPC, which might be a good tool for some applications.  Both Pacemaker
and Corosync already employ much richer IPC mechanisms, which might be
more natural to extend for triggering log rotation than adding a new IPC
mechanism.  Logrotate optionally runs scripts before and after renaming
the log files; these can invoke kill, corosync-cmapctl, cibadmin and so
on all the same.  It's entirely your call.
-- 
Regards,
Feri
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: meatware stonith

2018-09-27 Thread Digimer
On 2018-09-27 03:13 AM, Kristoffer Grönlund wrote:
> On Thu, 2018-09-27 at 02:49 -0400, Digimer wrote:
>> On 2018-09-27 01:54 AM, Ulrich Windl wrote:
>>> Digimer  wrote on 26.09.2018 at 18:29 in
>>> message <1c70b5e2-ea8e-8cbe-3d83-e207ca47b...@alteeve.ca>:
 On 2018-09-26 11:11 AM, Patrick Whitney wrote:
> Hey everyone,
>
> I'm doing some pacemaker/corosync/dlm/clvm testing.  I'm
> without a power
> fencing solution at the moment, so I wanted to utilize
> meatware, but it
> doesn't show when I list available stonith devices (pcs stonith
> list).
>
> I do seem to have it on the system, as cluster-glue is
> installed, and I
> see meatware.so and meatclient on the system, and I also see
> meatware
> listed when running the command 'stonith -L' 
>
> Can anyone guide me as to how to create a stonith meatware
> resource
> using pcs? 
>
> Best,
> -Pat

 The "fence_manual" agent was removed after EL5 days, a long
 time ago, because it so often led to split-brains through misuse.
 Manual fencing is NOT recommended.

 There are new options, like SBD (storage-based death) if you have
 a
 watchdog timer.
>>>
>>> And even if you do not ;-)
>>
>> I've not used SBD. How, without a watchdog timer, can you be sure the
>> target node is dead?
> 
> You can't. You can use the Linux softdog module, but since it is
> a pure software solution it is limited and not ideal.

That was my understanding, thank you.
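
For the archives, a minimal watchdog-only SBD sketch. Assumptions: the
softdog module is acceptable as a last resort, sbd and a reasonably
recent Pacemaker are installed, and the config path matches your
distribution (it may be /etc/default/sbd instead):

# load a software watchdog if no hardware watchdog is available
modprobe softdog
# in /etc/sysconfig/sbd:
#   SBD_WATCHDOG_DEV=/dev/watchdog
# enable sbd so it starts with the cluster, then tell Pacemaker to
# treat watchdog self-fencing as sufficient:
systemctl enable sbd
pcs property set stonith-watchdog-timeout=10s
pcs property set stonith-enabled=true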

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Understanding the behavior of pacemaker crash

2018-09-27 Thread Prasad Nagaraj
Hi Ken - Thanks for the response. Pacemaker is still not running on that
node. So I am still wondering what could be the issue ? Any other
configurations or logs should I be sharing to understand this more ?

Thanks!

On Thu, Sep 27, 2018 at 8:08 PM Ken Gaillot  wrote:

> On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
> > Hello - I was trying to understand the behavior of the cluster when
> > pacemaker crashes on one of the nodes. So I hard killed pacemakerd
> > and its related processes.
> >
> > ---
> > -
> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root  74022  1  0 07:53 pts/000:00:00 pacemakerd
> > 189   74028  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/cib
> > root  74029  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/stonithd
> > root  74030  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/lrmd
> > 189   74031  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/attrd
> > 189   74032  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/pengine
> > 189   74033  74022  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/crmd
> >
> > root  75228  50092  0 07:54 pts/000:00:00 grep pacemaker
> > [root@SG-mysqlold-907 azureuser]# kill -9 74022
> >
> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root  74030  1  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/lrmd
> > 189   74032  1  0 07:53 ?00:00:00
> > /usr/libexec/pacemaker/pengine
> >
> > root  75303  50092  0 07:55 pts/000:00:00 grep pacemaker
> > [root@SG-mysqlold-907 azureuser]# kill -9 74030
> > [root@SG-mysqlold-907 azureuser]# kill -9 74032
> > [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> > root  75332  50092  0 07:55 pts/000:00:00 grep pacemaker
> >
> > [root@SG-mysqlold-907 azureuser]# crm satus
> > ERROR: status: crm_mon (rc=107): Connection to cluster failed:
> > Transport endpoint is not connected
> > ---
> > --
> >
> > However, this does not seem to be having any effect on the cluster
> > status from other nodes
> > ---
> > 
> >
> > [root@SG-mysqlold-909 azureuser]# crm status
> > Last updated: Thu Sep 27 07:56:17 2018  Last change: Thu Sep
> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > Stack: classic openais (with plugin)
> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> > partition with quorum
> > 3 nodes and 3 resources configured, 3 expected votes
> >
> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
>
> It most definitely would make the node offline, and if fencing were
> configured, the rest of the cluster would fence the node to make sure
> it's safely down.
>
> I see you're using the old corosync 1 plugin. I suspect what happened
> in this case is that corosync noticed the plugin died and restarted it
> quickly enough that it had rejoined by the time you checked the status
> elsewhere.
>
> >
> > Full list of resources:
> >
> >  Master/Slave Set: ms_mysql [p_mysql]
> >  Masters: [ SG-mysqlold-909 ]
> >  Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> >
> >
> > [root@SG-mysqlold-908 azureuser]# crm status
> > Last updated: Thu Sep 27 07:56:08 2018  Last change: Thu Sep
> > 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> > Stack: classic openais (with plugin)
> > Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> > partition with quorum
> > 3 nodes and 3 resources configured, 3 expected votes
> >
> > Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> >
> > Full list of resources:
> >
> >  Master/Slave Set: ms_mysql [p_mysql]
> >  Masters: [ SG-mysqlold-909 ]
> >  Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> >
> > ---
> > ---
> >
> > I am a bit surprised that other nodes are not able to detect that
> > pacemaker is down on one of the nodes - SG-mysqlold-907
> >
> > Even if I kill pacemaker on the node which is a DC - I observe the
> > same behavior, with the rest of the nodes not detecting that the DC is down.
> >
> > Could someone explain what the expected behavior is in these cases?
> >
> > I am using corosync 1.4.7 and pacemaker 1.1.14
> >
> > Thanks in advance
> > Prasad
> >
> > ___
> > Users mailing list: Users@clusterlabs.org
> > https://lists.clusterlabs.org/mailman/listinfo/users
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> > pdf
> > 

Re: [ClusterLabs] Antw: Salvaging aborted resource migration

2018-09-27 Thread Ferenc Wágner
Ken Gaillot  writes:

> On Thu, 2018-09-27 at 09:36 +0200, Ulrich Windl wrote:
> 
>> Obviously you violated the most important cluster rule that is "be
>> patient".  Maybe the next important is "Don't change the
>> configuration while the cluster is not in IDLE state" ;-)
>
> Agreed -- although even idle, removing a ban can result in a migration
> back (if something like stickiness doesn't prevent it).

I've got no problem with that in general.  However, I can't guarantee
that every configuration change happens in idle state, certain
operations (mostly resource additions) are done by several
administrators without synchronization, and of course asynchronous
cluster events can also happen any time.  So I have to ask: what are the
consequences of breaking this "impossible" rule?

> There's currently no way to tell pacemaker that an operation (i.e.
> migrate_from) is a no-op and can be ignored. If a migration is only
> partially completed, it has to be considered a failure and reverted.

OK.  Are there other complex operations which can "partially complete"
if a transition is aborted by some event?

Now let's suppose a pull migration scenario: migrate_to does nothing,
but in this tiny window a configuration change aborts the transition.
The resources would go through a full recovery (stop+start), right?
Now let's suppose migrate_from gets scheduled and starts performing the
migration.  Before it finishes, a configuration change aborts the
transition.  The cluster waits for the outstanding operation to finish,
doesn't it?  And if it finishes successfully, is the migration
considered complete requiring no recovery?

> I'm not sure why the reload was scheduled; I suspect it's a bug due to
> a restart being needed but no parameters having changed. There should
> be special handling for a partial migration to make the stop required.

Probably CLBZ#5309 again...  You debugged a pe-input file for me with a
similar issue almost exactly a year ago (thread subject "Pacemaker
resource parameter reload confusion").  Time to upgrade this cluster, I
guess.
-- 
Thanks,
Feri
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: pcmk 1.1.17: Which effective user is calling OCF agents for querying meta-data?

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 14:57 +, cfpubl...@verimatrix.com wrote:
> > On Thu, 2018-09-27 at 10:23 +, cfpubl...@verimatrix.com wrote:
> > > > > With pacemaker 1.1.17, we observe the following messages
> > > > > during 
> > > > > startup of
> > > > > pacemaker:
> > > > > 2018-09-18T11:58:18.452951+03:00 p12-0001-bcsm03
> > > > > crmd[2871]:  warning:
> > > > > Cannot execute
> > > > > '/usr/lib/ocf/resource.d/verimatrix/anything4':
> > > > > Permission denied (13)
> > > > > 2018-09-18T11:58:18.453179+03:00 p12-0001-bcsm03
> > > > > crmd[2871]:error:
> > > > > Failed to retrieve meta-data for ocf:verimatrix:anything4
> > > > > 2018-09-18T11:58:18.453291+03:00 p12-0001-bcsm03
> > > > > crmd[2871]:error: No
> > > > > metadata for ocf::verimatrix:anything4
> > > > > 
> > > > 
> > > > Could it be as simple as
> > > > /usr/lib/ocf/resource.d/verimatrix/anything4 not having the
> > > > execute 
> > > > bit set (for the user)?
> > > 
> > > The OCF agent is not physically located there.
> > > /usr/lib/ocf/resource.d/verimatrix is a symbolic link that points
> > > to a 
> > > directory in our software distribution. That part is not
> > > reachable for 
> > > "other".
> > > 
> > > > > 
> > > > > It seems that on startup, crmd is querying the meta-data on
> > > > > the 
> > > > > OCF agents using a non-root user (hacluster?) while the
> > > > > regular 
> > > > > resource control activity seems to be done as root.
> > > > > The OCF resource in question intentionally resides in a
> > > > > directory 
> > > > > that is inaccessible to non-root users.
> > > > 
> > > > Why? You can selectively grant access (man setfacl)!
> > > 
> > > That's an option, thank you.
> > > 
> > > > > 
> > > > > Is this behavior of using different users intended? If yes,
> > > > > any 
> > > > > clue why was it working with pacemaker 1.1.7 under RHEL6?
> > > > 
> > > > Finally: Why are you asking this list for help, when you
> > > > removed 
> > > > execute permission for your home-grown (as it seems) resource
> > > > agent?
> > > > What could WE do?
> > > 
> > > I would like to understand the concept behind it to determine the
> > > best 
> > > solution. If there was a way to configure or tell crmd to do the 
> > > meta-data query as root, I would prefer that. It's because even
> > > if the 
> > > actual access to the OCF agent scripts worked, we would hit the
> > > next 
> > > problems because some OCF scripts use the "ocf_is_root" function
> > > from 
> > > /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs. These OCFs would fail 
> > > miserably.
> > > So before I revisit all our OCFs to check if they behave well if
> > > called 
> > > as non-root, I wanted to check if there is another way.
> > > 
> > > Thanks,
> > >   Carsten
> > 
> > The crmd does indeed run meta-data queries as the hacluster user,
> > while the lrmd executes all other resource actions as root. It's a
> > longstanding goal to correct this by making crmd go through lrmd to
> > get meta-data, but unfortunately it's a big project and there are
> > many
> > more pressing issues to address. :-(
> > 
> > There's no workaround within pacemaker, but the setfacl approach
> > sounds useful.
> > 
> > As a best practice, an agent's meta-data action should not do
> > anything other than print meta-data. I.e. many agents have common
> > initialization that needs to be done before all actions, but this
> > should be avoided for meta-data. Even without this issue, regular
> > users and
> > higher-level tools should be able to obtain meta-data without
> > permission issues or side effects.
> >  --
> > Ken Gaillot 
> 
> Thank you, Ken, for the confirmation that there is no workaround
> within pacemaker. I will re-visit all our OCFs then and make sure
> they are world accessible and the meta-data action requires no
> special permissions.
> 
> I still wonder why it worked with pacemaker 1.1.7, though ;-)
> 
> Thanks,
>   Carsten

I'm going to guess you were using the corosync 1 plugin, and that it
ran everything as root. I never worked with that code, so it's just a
guess :)
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 16:09 +0100, Christine Caulfield wrote:
> On 27/09/18 16:01, Ken Gaillot wrote:
> > On Thu, 2018-09-27 at 09:58 -0500, Ken Gaillot wrote:
> > > On Thu, 2018-09-27 at 15:32 +0200, Ferenc Wágner wrote:
> > > > Christine Caulfield  writes:
> > > > 
> > > > > TBH I would be quite happy to leave this to logrotate but the
> > > > > message I
> > > > > was getting here is that we need additional help from libqb.
> > > > > I'm
> > > > > willing
> > > > > to go with a consensus on this though
> > > > 
> > > > Yes, to do a proper job logrotate has to have a way to get the
> > > > log
> > > > files
> > > > reopened.  And applications can't do that without support from
> > > > libqb,
> > > > if
> > > > I understood Honza right.
> > > 
> > > There are two related issues:
> > > 
> > > * Issue #142, about automatically rotating logs once they reach a
> > > certain size, can be done with logrotate already. If it's a one-
> > > time
> > > thing (e.g. running a test with trace), the admin can control
> > > rotation
> > > directly with logrotate --force /etc/logrotate.d/whatever.conf.
> > > If
> > > it's
> > > desired permanently, a maxsize or size line can be added to the
> > > logrotate config (which, now that I think about it, would be a
> > > good
> > > idea for the default pacemaker config).
> > > 
> > > * Issue #239, about the possibility of losing messages with
> > > copytruncate, has a widely used, easily implemented, and robust
> > > solution of using a signal to indicate reopening a log. logrotate
> > > is
> > > then configured to rotate by moving the log to a new name,
> > > sending
> > > the
> > > signal, then compressing the old log.
> > 
> > Regarding implementation in libqb's case, libqb would simply
> > provide
> > the API for reopening the log, and clients such as pacemaker would
> > intercept the signal and call the API.
> > 
> 
> That sounds pretty easy to achieve. I'm also looking into high-res
> timestamps for logfiles.
> 
> > A minor complication is that pacemaker would have to supply
> > different
> > logrotate configs depending on the version of libqb available.
> > 
> 
> Can't you just intercept the signal anyway and not do anything if an
> old
> libqb is linked in?
> 
> Chrissie

Yes, but the logrotate config will need to keep the copytruncate option
or not depending on whether the new API is available. It's not too
difficult via the configure script, just reminding myself to do it :)
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: pcmk 1.1.17: Which effective user is calling OCF agents for querying meta-data?

2018-09-27 Thread cfpubl...@verimatrix.com
> On Thu, 2018-09-27 at 10:23 +, cfpubl...@verimatrix.com wrote:
> > > > With pacemaker 1.1.17, we observe the following messages during 
> > > > startup of
> > > > pacemaker:
> > > > 2018-09-18T11:58:18.452951+03:00 p12-0001-bcsm03
> > > > crmd[2871]:  warning:
> > > > Cannot execute '/usr/lib/ocf/resource.d/verimatrix/anything4':
> > > > Permission denied (13)
> > > > 2018-09-18T11:58:18.453179+03:00 p12-0001-bcsm03
> > > > crmd[2871]:error:
> > > > Failed to retrieve meta-data for ocf:verimatrix:anything4
> > > > 2018-09-18T11:58:18.453291+03:00 p12-0001-bcsm03
> > > > crmd[2871]:error: No
> > > > metadata for ocf::verimatrix:anything4
> > > > 
> > > 
> > > Could it be as simple as
> > > /usr/lib/ocf/resource.d/verimatrix/anything4 not having the execute 
> > > bit set (for the user)?
> > 
> > The OCF agent is not physically located there.
> > /usr/lib/ocf/resource.d/verimatrix is a symbolic link that points to a 
> > directory in our software distribution. That part is not reachable for 
> > "other".
> > 
> > > > 
> > > > It seems that on startup, crmd is querying the meta-data on the 
> > > > OCF agents using a non-root user (hacluster?) while the regular 
> > > > resource control activity seems to be done as root.
> > > > The OCF resource in question intentionally resides in a directory 
> > > > that is inaccessible to non-root users.
> > > 
> > > Why? You can selectively grant access (man setfacl)!
> > 
> > That's an option, thank you.
> > 
> > > > 
> > > > Is this behavior of using different users intended? If yes, any 
> > > > clue why was it working with pacemaker 1.1.7 under RHEL6?
> > > 
> > > Finally: Why are you asking this list for help, when you removed 
> > > execute permission for your home-grown (as it seems) resource agent?
> > > What could WE do?
> > 
> > I would like to understand the concept behind it to determine the best 
> > solution. If there was a way to configure or tell crmd to do the 
> > meta-data query as root, I would prefer that. It's because even if the 
> > actual access to the OCF agent scripts worked, we would hit the next 
> > problems because some OCF scripts use the "ocf_is_root" function from 
> > /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs. These OCFs would fail 
> > miserably.
> > So before I revisit all our OCFs to check if they behave well if called 
> > as non-root, I wanted to check if there is another way.
> > 
> > Thanks,
> >   Carsten
>
> The crmd does indeed run meta-data queries as the hacluster user, while the 
> lrmd executes all other resource actions as root. It's a
> longstanding goal to correct this by making crmd go through lrmd to get 
> meta-data, but unfortunately it's a big project and there are many
> more pressing issues to address. :-(
>
> There's no workaround within pacemaker, but the setfacl approach sounds 
> useful.
>
> As a best practice, an agent's meta-data action should not do anything other 
> than print meta-data. I.e. many agents have common
> initialization that needs to be done before all actions, but this should be 
> avoided for meta-data. Even without this issue, regular users and
> higher-level tools should be able to obtain meta-data without permission 
> issues or side effects.
>  --
> Ken Gaillot 

Thank you, Ken, for the confirmation that there is no workaround within 
pacemaker. I will re-visit all our OCFs then and make sure they are world 
accessible and the meta-data action requires no special permissions.

I still wonder why it worked with pacemaker 1.1.7, though ;-)

Thanks,
  Carsten


___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Christine Caulfield
On 27/09/18 16:01, Ken Gaillot wrote:
> On Thu, 2018-09-27 at 09:58 -0500, Ken Gaillot wrote:
>> On Thu, 2018-09-27 at 15:32 +0200, Ferenc Wágner wrote:
>>> Christine Caulfield  writes:
>>>
 TBH I would be quite happy to leave this to logrotate but the
 message I
 was getting here is that we need additional help from libqb. I'm
 willing
 to go with a consensus on this though
>>>
>>> Yes, to do a proper job logrotate has to have a way to get the log
>>> files
>>> reopened.  And applications can't do that without support from
>>> libqb,
>>> if
>>> I understood Honza right.
>>
>> There are two related issues:
>>
>> * Issue #142, about automatically rotating logs once they reach a
>> certain size, can be done with logrotate already. If it's a one-time
>> thing (e.g. running a test with trace), the admin can control
>> rotation
>> directly with logrotate --force /etc/logrotate.d/whatever.conf. If
>> it's
>> desired permanently, a maxsize or size line can be added to the
>> logrotate config (which, now that I think about it, would be a good
>> idea for the default pacemaker config).
>>
>> * Issue #239, about the possibility of losing messages with
>> copytruncate, has a widely used, easily implemented, and robust
>> solution of using a signal to indicate reopening a log. logrotate is
>> then configured to rotate by moving the log to a new name, sending
>> the
>> signal, then compressing the old log.
> 
> Regarding implementation in libqb's case, libqb would simply provide
> the API for reopening the log, and clients such as pacemaker would
> intercept the signal and call the API.
> 

That sounds pretty easy to achieve. I'm also looking into high-res
timestamps for logfiles.

> A minor complication is that pacemaker would have to supply different
> logrotate configs depending on the version of libqb available.
> 

Can't you just intercept the signal anyway and not do anything if an old
libqb is linked in?

Chrissie

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 09:58 -0500, Ken Gaillot wrote:
> On Thu, 2018-09-27 at 15:32 +0200, Ferenc Wágner wrote:
> > Christine Caulfield  writes:
> > 
> > > TBH I would be quite happy to leave this to logrotate but the
> > > message I
> > > was getting here is that we need additional help from libqb. I'm
> > > willing
> > > to go with a consensus on this though
> > 
> > Yes, to do a proper job logrotate has to have a way to get the log
> > files
> > reopened.  And applications can't do that without support from
> > libqb,
> > if
> > I understood Honza right.
> 
> There are two related issues:
> 
> * Issue #142, about automatically rotating logs once they reach a
> certain size, can be done with logrotate already. If it's a one-time
> thing (e.g. running a test with trace), the admin can control
> rotation
> directly with logrotate --force /etc/logrotate.d/whatever.conf. If
> it's
> desired permanently, a maxsize or size line can be added to the
> logrotate config (which, now that I think about it, would be a good
> idea for the default pacemaker config).
> 
> * Issue #239, about the possibility of losing messages with
> copytruncate, has a widely used, easily implemented, and robust
> solution of using a signal to indicate reopening a log. logrotate is
> then configured to rotate by moving the log to a new name, sending
> the
> signal, then compressing the old log.

Regarding implementation in libqb's case, libqb would simply provide
the API for reopening the log, and clients such as pacemaker would
intercept the signal and call the API.

A minor complication is that pacemaker would have to supply different
logrotate configs depending on the version of libqb available.
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 15:32 +0200, Ferenc Wágner wrote:
> Christine Caulfield  writes:
> 
> > TBH I would be quite happy to leave this to logrotate but the
> > message I
> > was getting here is that we need additional help from libqb. I'm
> > willing
> > to go with a consensus on this though
> 
> Yes, to do a proper job logrotate has to have a way to get the log
> files
> reopened.  And applications can't do that without support from libqb,
> if
> I understood Honza right.

There are two related issues:

* Issue #142, about automatically rotating logs once they reach a
certain size, can be done with logrotate already. If it's a one-time
thing (e.g. running a test with trace), the admin can control rotation
directly with logrotate --force /etc/logrotate.d/whatever.conf. If it's
desired permanently, a maxsize or size line can be added to the
logrotate config (which, now that I think about it, would be a good
idea for the default pacemaker config).

* Issue #239, about the possibility of losing messages with
copytruncate, has a widely used, easily implemented, and robust
solution of using a signal to indicate reopening a log. logrotate is
then configured to rotate by moving the log to a new name, sending the
signal, then compressing the old log.
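
To make that concrete, a sketch of what the pacemaker logrotate config
could look like with a size cap added (paths and existing options vary
by distribution, and the postrotate part is hypothetical until a reopen
signal/API actually exists):

/var/log/pacemaker.log {
    missingok
    notifempty
    compress
    delaycompress
    copytruncate
    weekly
    rotate 4
    maxsize 100M
}
# once log reopening is supported, copytruncate could be replaced with:
#   postrotate
#       /usr/bin/killall -HUP pacemakerd 2>/dev/null || true
#   endscript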
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: pcmk 1.1.17: Which effective user is calling OCF agents for querying meta-data?

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 10:23 +, cfpubl...@verimatrix.com wrote:
> > > With pacemaker 1.1.17, we observe the following messages during 
> > > startup of
> > > pacemaker:
> > > 2018-09-18T11:58:18.452951+03:00 p12-0001-bcsm03
> > > crmd[2871]:  warning: 
> > > Cannot execute '/usr/lib/ocf/resource.d/verimatrix/anything4': 
> > > Permission denied (13)
> > > 2018-09-18T11:58:18.453179+03:00 p12-0001-bcsm03
> > > crmd[2871]:error: 
> > > Failed to retrieve meta-data for ocf:verimatrix:anything4
> > > 2018-09-18T11:58:18.453291+03:00 p12-0001-bcsm03
> > > crmd[2871]:error: No 
> > > metadata for ocf::verimatrix:anything4
> > > 
> > 
> > Could it be as simple as
> > /usr/lib/ocf/resource.d/verimatrix/anything4 not having the execute
> > bit set (for the user)?
> 
> The OCF agent is not physically located there.
> /usr/lib/ocf/resource.d/verimatrix is a symbolic link that points to
> a directory in our software distribution. That part is not reachable
> for "other".
> 
> > > 
> > > It seems that on startup, crmd is querying the meta-data on the
> > > OCF 
> > > agents using a non-root user (hacluster?) while the regular
> > > resource 
> > > control activity seems to be done as root.
> > > The OCF resource in question intentionally resides in a directory
> > > that 
> > > is inaccessible to non-root users.
> > 
> > Why? You can selectively grant access (man setfacl)!
> 
> That's an option, thank you.
> 
> > > 
> > > Is this behavior of using different users intended? If yes, any
> > > clue 
> > > why was
> > > it working with pacemaker 1.1.7 under RHEL6?
> > 
> > Finally: Why are you asking this list for help, when you removed
> > execute permission for your home-grown (as it seems) resource
> > agent?
> > What could WE do?
> 
> I would like to understand the concept behind it to determine the
> best solution. If there was a way to configure or tell crmd to do the
> meta-data query as root, I would prefer that. It's because even if
> the actual access to the OCF agent scripts worked, we would hit the
> next problems because some OCF scripts use the "ocf_is_root" function
> from /usr/lib/ocf/lib/heartbeat/ocf-shellfuncs. These OCFs would fail
> miserably.
> So before I revisit all our OCFs to check if they behave well if
> called as non-root, I wanted to check if there is another way.
> 
> Thanks,
>   Carsten

The crmd does indeed run meta-data queries as the hacluster user, while
the lrmd executes all other resource actions as root. It's a
longstanding goal to correct this by making crmd go through lrmd to get
meta-data, but unfortunately it's a big project and there are many more
pressing issues to address. :-(

There's no workaround within pacemaker, but the setfacl approach sounds
useful.
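
For example (illustrative paths only -- I don't know where your agents
actually live):

# give hacluster search access along the real path, plus read+execute
# on the agent itself
setfacl -m u:hacluster:x /opt/vendor /opt/vendor/ocf
setfacl -m u:hacluster:rx /opt/vendor/ocf/anything4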

As a best practice, an agent's meta-data action should not do anything
other than print meta-data. I.e. many agents have common initialization
that needs to be done before all actions, but this should be avoided
for meta-data. Even without this issue, regular users and higher-level
tools should be able to obtain meta-data without permission issues or
side effects.
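
A sketch of that pattern, with placeholder meta-data (not your real
agent, obviously):

#!/bin/sh
# Answer meta-data before any initialization that needs root or
# protected files, so any user can query it.
if [ "$1" = "meta-data" ]; then
cat <<'EOF'
<?xml version="1.0"?>
<resource-agent name="anything4" version="0.1">
  <version>1.0</version>
  <shortdesc lang="en">example agent</shortdesc>
  <longdesc lang="en">placeholder meta-data</longdesc>
  <parameters/>
  <actions>
    <action name="start" timeout="20s"/>
    <action name="stop" timeout="20s"/>
    <action name="monitor" timeout="20s" interval="10s"/>
    <action name="meta-data" timeout="5s"/>
  </actions>
</resource-agent>
EOF
exit 0
fi
# root-only setup (ocf_is_root and friends) stays below this point
. "${OCF_ROOT:-/usr/lib/ocf}/lib/heartbeat/ocf-shellfuncs"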
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Understanding the behavior of pacemaker crash

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 13:45 +0530, Prasad Nagaraj wrote:
> Hello - I was trying to understand the behavior of the cluster when
> pacemaker crashes on one of the nodes. So I hard killed pacemakerd
> and its related processes.
> 
> ---
> -
> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> root      74022      1  0 07:53 pts/0    00:00:00 pacemakerd
> 189       74028  74022  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/cib
> root      74029  74022  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/stonithd
> root      74030  74022  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/lrmd
> 189       74031  74022  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/attrd
> 189       74032  74022  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/pengine
> 189       74033  74022  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/crmd
> 
> root      75228  50092  0 07:54 pts/0    00:00:00 grep pacemaker
> [root@SG-mysqlold-907 azureuser]# kill -9 74022
> 
> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> root      74030      1  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/lrmd
> 189       74032      1  0 07:53 ?        00:00:00
> /usr/libexec/pacemaker/pengine
> 
> root      75303  50092  0 07:55 pts/0    00:00:00 grep pacemaker
> [root@SG-mysqlold-907 azureuser]# kill -9 74030
> [root@SG-mysqlold-907 azureuser]# kill -9 74032
> [root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
> root      75332  50092  0 07:55 pts/0    00:00:00 grep pacemaker
> 
> [root@SG-mysqlold-907 azureuser]# crm satus
> ERROR: status: crm_mon (rc=107): Connection to cluster failed:
> Transport endpoint is not connected
> ---
> --
> 
> However, this does not seem to be having any effect on the cluster
> status from other nodes
> ---
> 
> 
> [root@SG-mysqlold-909 azureuser]# crm status
> Last updated: Thu Sep 27 07:56:17 2018          Last change: Thu Sep
> 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> Stack: classic openais (with plugin)
> Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> partition with quorum
> 3 nodes and 3 resources configured, 3 expected votes
> 
> Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]

It most definitely would make the node offline, and if fencing were
configured, the rest of the cluster would fence the node to make sure
it's safely down.

I see you're using the old corosync 1 plugin. I suspect what happened
in this case is that corosync noticed the plugin died and restarted it
quickly enough that it had rejoined by the time you checked the status
elsewhere.
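
One quick way to check whether that's what happened (assuming the usual
RHEL 6 log location; adjust if corosync logs elsewhere):

# new PIDs after the kill mean the processes were respawned, not left dead
ps -ef | grep '[p]acemaker'
# the plugin restart should also show up in the corosync log
grep -i pacemaker /var/log/cluster/corosync.log | tail -20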

> 
> Full list of resources:
> 
>  Master/Slave Set: ms_mysql [p_mysql]
>      Masters: [ SG-mysqlold-909 ]
>      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> 
> 
> [root@SG-mysqlold-908 azureuser]# crm status
> Last updated: Thu Sep 27 07:56:08 2018          Last change: Thu Sep
> 27 07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
> Stack: classic openais (with plugin)
> Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) -
> partition with quorum
> 3 nodes and 3 resources configured, 3 expected votes
> 
> Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]
> 
> Full list of resources:
> 
>  Master/Slave Set: ms_mysql [p_mysql]
>      Masters: [ SG-mysqlold-909 ]
>      Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]
> 
> ---
> ---
> 
> I am a bit surprised that other nodes are not able to detect that
> pacemaker is down on one of the nodes - SG-mysqlold-907 
> 
> Even if I kill pacemaker on the node which is a DC - I observe the
> same behavior, with the rest of the nodes not detecting that the DC is down. 
> 
> Could someone explain what the expected behavior is in these cases?
>  
> I am using corosync 1.4.7 and pacemaker 1.1.14
> 
> Thanks in advance
> Prasad
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Salvaging aborted resource migration

2018-09-27 Thread Ken Gaillot
On Thu, 2018-09-27 at 09:36 +0200, Ulrich Windl wrote:
> Hi!
> 
> Obviously you violated the most important cluster rule that is "be
> patient".
> Maybe the next important is "Don't change the configuration while the
> cluster
> is not in IDLE state" ;-)

Agreed -- although even idle, removing a ban can result in a migration
back (if something like stickiness doesn't prevent it).
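
For example, something like this (values illustrative; pcs has
equivalents) keeps a successfully migrated resource where it landed
once the ban goes away:

# cluster-wide default stickiness (crm shell syntax)
crm configure rsc_defaults resource-stickiness=100
# or per resource
crm_resource --resource vm-conv-4 --meta \
    --set-parameter resource-stickiness --parameter-value 100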

There's currently no way to tell pacemaker that an operation (i.e.
migrate_from) is a no-op and can be ignored. If a migration is only
partially completed, it has to be considered a failure and reverted.

I'm not sure why the reload was scheduled; I suspect it's a bug due to
a restart being needed but no parameters having changed. There should
be special handling for a partial migration to make the stop required.

> I feel these are issues that should be fixed, but the above rules
> make your
> life easier while these issues still exist.
> 
> Regards,
> Ulrich
> 
> > > > Ferenc Wágner  wrote on 27.09.2018 at 08:37
> > > > in message <87tvmb5ttw@lant.ki.iif.hu>:
> > Hi,
> > 
> > The current behavior of cancelled migration with Pacemaker 1.1.16
> > with a
> > resource implementing push migration:
> > 
> > # /usr/sbin/crm_resource --ban -r vm-conv-4
> > 
> > vhbl03 crmd[10017]:   notice: State transition S_IDLE ‑>
> > S_POLICY_ENGINE
> > vhbl03 pengine[10016]:   notice: Migrate vm‑conv‑4#011(Started
> > vhbl07 ‑>
> 
> vhbl04)
> > vhbl03 crmd[10017]:   notice: Initiating migrate_to operation 
> > vm‑conv‑4_migrate_to_0 on vhbl07
> > vhbl03 pengine[10016]:   notice: Calculated transition 4633, saving
> > inputs 
> > in /var/lib/pacemaker/pengine/pe‑input‑1069.bz2
> > [...]
> > 
> > At this point, with the migration still ongoing, I wanted to get
> > rid of
> > the constraint:
> > 
> > # /usr/sbin/crm_resource --clear -r vm-conv-4
> > 
> > vhbl03 crmd[10017]:   notice: Transition aborted by deletion of 
> > rsc_location[@id='cli‑ban‑vm‑conv‑4‑on‑vhbl07']: Configuration
> > change
> > vhbl07 crmd[10233]:   notice: Result of migrate_to operation for
> > vm‑conv‑4
> 
> on 
> > vhbl07: 0 (ok)
> > vhbl03 crmd[10017]:   notice: Transition 4633 (Complete=6,
> > Pending=0, 
> > Fired=0, Skipped=1, Incomplete=6, 
> > Source=/var/lib/pacemaker/pengine/pe‑input‑1069.bz2): Stopped
> > vhbl03 pengine[10016]:   notice: Resource vm‑conv‑4 can no longer
> > migrate to
> > vhbl04. Stopping on vhbl07 too
> > vhbl03 pengine[10016]:   notice: Reload  vm‑conv‑4#011(Started
> > vhbl07)
> > vhbl03 pengine[10016]:   notice: Calculated transition 4634, saving
> > inputs 
> > in /var/lib/pacemaker/pengine/pe‑input‑1070.bz2
> > vhbl03 crmd[10017]:   notice: Initiating stop operation
> > vm‑conv‑4_stop_0 on
> > vhbl07
> > vhbl03 crmd[10017]:   notice: Initiating stop operation
> > vm‑conv‑4_stop_0 on
> > vhbl04
> > vhbl03 crmd[10017]:   notice: Initiating reload operation
> > vm‑conv‑4_reload_0
> > on vhbl04
> > 
> > This recovery was entirely unnecessary, as the resource
> > successfully
> > migrated to vhbl04 (the migrate_from operation does
> > nothing).  Pacemaker
> > does not know this, but is there a way to educate it?  I think in
> > this
> > special case it is possible to redesign the agent making migrate_to
> > a
> > no‑op and doing everything in migrate_from, which would
> > significantly
> > reduce the window between the start points of the two "halfs", but
> > I'm
> > not sure that would help in the end: Pacemaker could still decide
> > to do
> > an unnecessary stop+start recovery.  Would it?  I failed to find
> > any
> > documentation on recovery from aborted migration transitions.  I
> > don't
> > expect on‑fail (for migrate_* ops, not me) to apply here, does it?
> > 
> > Side question: why initiate a reload in any case, like above?
> > 
> > Even more side question: could you please consider using space
> > instead
> > of TAB in syslog messages?  (Actually, I wouldn't mind getting rid
> > of
> > them altogether in any output.)
> > ‑‑ 
> > Thanks,
> > Feri
> > ___
> > Users mailing list: Users@clusterlabs.org 
> > https://lists.clusterlabs.org/mailman/listinfo/users 
> > 
> > Project Home: http://www.clusterlabs.org 
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratc
> > h.pdf 
> > Bugs: http://bugs.clusterlabs.org 
> 
> 
> 
> ___
> Users mailing list: Users@clusterlabs.org
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.
> pdf
> Bugs: http://bugs.clusterlabs.org
-- 
Ken Gaillot 
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Ferenc Wágner
Christine Caulfield  writes:

> TBH I would be quite happy to leave this to logrotate but the message I
> was getting here is that we need additional help from libqb. I'm willing
> to go with a consensus on this though

Yes, to do a proper job logrotate has to have a way to get the log files
reopened.  And applications can't do that without support from libqb, if
I understood Honza right.
-- 
Regards,
Feri
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Christine Caulfield
On 27/09/18 12:52, Ferenc Wágner wrote:
> Christine Caulfield  writes:
> 
>> I'm looking into new features for libqb and the option in
>> https://github.com/ClusterLabs/libqb/issues/142#issuecomment-76206425
>> looks like a good option to me.
> 
> It feels backwards to me: traditionally, increasing numbers signify
> older rotated logs, while this proposal does the opposite.  And what
> happens on application restart?  Do you overwrite from 0?  Do you ever
> jump back to 0?  It also leaves the problem of cleaning up old log files
> unsolved...

The idea I had was to look for logs with 'old' numbers at startup and
then start a new log with the next number, starting at 0 or 1. Good
point about the numbers going the other way with logrotate though, I
hadn't considered that

> 
>> Though adding an API call to re-open the log file could be done too -
>> I'm not averse to having both,
> 
> Not adding log rotation policy (and implementation!) to each application
> is a win in my opinion, and also unifies local administration.  Syslog
> is pretty good in this regard, my only gripe with it is that its time
> stamps can't be quite as precise as the ones from the (realtime)
> application (even nowadays, under systemd).  And that it can block the
> log stream... on the other hand, disk latencies can block log writes
> just as well.
> 

TBH I would be quite happy to leave this to logrotate but the message I
was getting here is that we need additional help from libqb. I'm willing
to go with a consensus on this though

Chrissie
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Ferenc Wágner
Christine Caulfield  writes:

> I'm looking into new features for libqb and the option in
> https://github.com/ClusterLabs/libqb/issues/142#issuecomment-76206425
> looks like a good option to me.

It feels backwards to me: traditionally, increasing numbers signify
older rotated logs, while this proposal does the opposite.  And what
happens on application restart?  Do you overwrite from 0?  Do you ever
jump back to 0?  It also leaves the problem of cleaning up old log files
unsolved...

> Though adding an API call to re-open the log file could be done too -
> I'm not averse to having both,

Not adding log rotation policy (and implementation!) to each application
is a win in my opinion, and also unifies local administration.  Syslog
is pretty good in this regard, my only gripe with it is that its time
stamps can't be quite as precise as the ones from the (realtime)
application (even nowadays, under systemd).  And that it can block the
log stream... on the other hand, disk latencies can block log writes
just as well.
-- 
Regards,
Feri
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: pcmk 1.1.17: Which effective user is calling OCF agents for querying meta-data?

2018-09-27 Thread cfpubl...@verimatrix.com
>> With pacemaker 1.1.17, we observe the following messages during 
>> startup of
>> pacemaker:
>> 2018-09-18T11:58:18.452951+03:00 p12-0001-bcsm03 crmd[2871]:  warning: 
>> Cannot execute '/usr/lib/ocf/resource.d/verimatrix/anything4': 
>> Permission denied (13)
>> 2018-09-18T11:58:18.453179+03:00 p12-0001-bcsm03 crmd[2871]:error: 
>> Failed to retrieve meta-data for ocf:verimatrix:anything4
>> 2018-09-18T11:58:18.453291+03:00 p12-0001-bcsm03 crmd[2871]:error: No 
>> metadata for ocf::verimatrix:anything4
>> 
> 
> Could it be as simple as /usr/lib/ocf/resource.d/verimatrix/anything4 not 
> having the execute bit set (for the user)?

The OCF agent is not physically located there. 
/usr/lib/ocf/resource.d/verimatrix is a symbolic link that points to a 
directory in our software distribution. That part is not reachable for "other".

>> 
>> It seems that on startup, crmd is querying the meta-data on the OCF 
>> agents using a non-root user (hacluster?) while the regular resource 
>> control activity seems to be done as root.
>> The OCF resource in question intentionally resides in a directory that 
>> is inaccessible to non-root users.
> 
> Why? You can selectively grant access (man setfacl)!

That's an option, thank you.

>> 
>> Is this behavior of using different users intended? If yes, any clue 
>> why was
>> it working with pacemaker 1.1.7 under RHEL6?
> 
> Finally: Why are you asking this list for help, when you removed execute 
> permission for your home-grown (as it seems) resource agent?
> What could WE do?

I would like to understand the concept behind it to determine the best 
solution. If there was a way to configure or tell crmd to do the meta-data 
query as root, I would prefer that. It's because even if the actual access to 
the OCF agent scripts worked, we would hit the next problems because some OCF 
scripts use the "ocf_is_root" function from 
/usr/lib/ocf/lib/heartbeat/ocf-shellfuncs. These OCFs would fail miserably.
So before I revisit all our OCFs to check if they behave well if called as 
non-root, I wanted to check if there is another way.

Thanks,
  Carsten

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: pcmk 1.1.17: Which effective user is calling OCF agents for querying meta-data?

2018-09-27 Thread Ulrich Windl
>>> "cfpubl...@verimatrix.com"  wrote on 27.09.2018 at 11:19 in message:


> Hi all,
> 
> we have been using pacemaker 1.1.7 for many years on RedHat 6. Recently, we

> moved to RedHat 7.3 and pacemaker 1.1.17.
> Note that we build pacemaker from source RPMs and don’t use the packages 
> supplied by RedHat.
> 
> With pacemaker 1.1.17, we observe the following messages during startup of 
> pacemaker:
> 2018-09-18T11:58:18.452951+03:00 p12-0001-bcsm03 crmd[2871]:  warning: 
> Cannot execute '/usr/lib/ocf/resource.d/verimatrix/anything4': Permission 
> denied (13)
> 2018-09-18T11:58:18.453179+03:00 p12-0001-bcsm03 crmd[2871]:error: 
> Failed to retrieve meta-data for ocf:verimatrix:anything4
> 2018-09-18T11:58:18.453291+03:00 p12-0001-bcsm03 crmd[2871]:error: No 
> metadata for ocf::verimatrix:anything4
> 

Hi!

Could it be as simple as /usr/lib/ocf/resource.d/verimatrix/anything4 not
having the execute bit set (for the user)?

> However, apart from that, we can control the respective cluster resource 
> (start, stop, move, etc.) as expected.
> 
> crmd is running as user ‘hacluster’, both on the old pacemaker 1.1.7 
> deployment on RHEL6 and on the new pacemaker 1.1.17 deployment on RHEL7.
> 
> It seems that on startup, crmd is querying the meta-data on the OCF agents 
> using a non-root user (hacluster?) while the regular resource control 
> activity seems to be done as root.
> The OCF resource in question intentionally resides in a directory that is 
> inaccessible to non-root users.

Why? You can selectively grant access (man setfacl)!

> 
> Is this behavior of using different users intended? If yes, any clue why was

> it working with pacemaker 1.1.7 under RHEL6?

Finally: Why are you asking this list for help, when you removed execute
permission for your home-grown (as it seems) resource agent? What could WE do?

Regards,
Ulrich

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] pcmk 1.1.17: Which effective user is calling OCF agents for querying meta-data?

2018-09-27 Thread cfpubl...@verimatrix.com
Hi all,

we have been using pacemaker 1.1.7 for many years on RedHat 6. Recently, we 
moved to RedHat 7.3 and pacemaker 1.1.17.
Note that we build pacemaker from source RPMs and don’t use the packages 
supplied by RedHat.

With pacemaker 1.1.17, we observe the following messages during startup of 
pacemaker:
2018-09-18T11:58:18.452951+03:00 p12-0001-bcsm03 crmd[2871]:  warning: Cannot 
execute '/usr/lib/ocf/resource.d/verimatrix/anything4': Permission denied (13)
2018-09-18T11:58:18.453179+03:00 p12-0001-bcsm03 crmd[2871]:error: Failed 
to retrieve meta-data for ocf:verimatrix:anything4
2018-09-18T11:58:18.453291+03:00 p12-0001-bcsm03 crmd[2871]:error: No 
metadata for ocf::verimatrix:anything4

However, apart from that, we can control the respective cluster resource 
(start, stop, move, etc.) as expected.

crmd is running as user ‘hacluster’, both on the old pacemaker 1.1.7 deployment 
on RHEL6 and on the new pacemaker 1.1.17 deployment on RHEL7.

It seems that on startup, crmd is querying the meta-data on the OCF agents 
using a non-root user (hacluster?) while the regular resource control activity 
seems to be done as root.
The OCF resource in question intentionally resides in a directory that is 
inaccessible to non-root users.

Is this behavior of using different users intended? If yes, any clue why was it 
working with pacemaker 1.1.7 under RHEL6?

Thanks,
  Carsten

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Understanding the behavior of pacemaker crash

2018-09-27 Thread Prasad Nagaraj
Hello - I was trying to understand the behavior of the cluster when pacemaker
crashes on one of the nodes. So I hard killed pacemakerd and its related
processes.


[root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
root  74022  1  0 07:53 pts/000:00:00 pacemakerd
189   74028  74022  0 07:53 ?00:00:00 /usr/libexec/pacemaker/cib
root  74029  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/stonithd
root  74030  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/lrmd
189   74031  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/attrd
189   74032  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/pengine
189   74033  74022  0 07:53 ?00:00:00
/usr/libexec/pacemaker/crmd

root  75228  50092  0 07:54 pts/000:00:00 grep pacemaker
[root@SG-mysqlold-907 azureuser]# kill -9 74022

[root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
root  74030  1  0 07:53 ?00:00:00
/usr/libexec/pacemaker/lrmd
189   74032  1  0 07:53 ?00:00:00
/usr/libexec/pacemaker/pengine

root  75303  50092  0 07:55 pts/000:00:00 grep pacemaker
[root@SG-mysqlold-907 azureuser]# kill -9 74030
[root@SG-mysqlold-907 azureuser]# kill -9 74032
[root@SG-mysqlold-907 azureuser]# ps -ef | grep pacemaker
root  75332  50092  0 07:55 pts/000:00:00 grep pacemaker

[root@SG-mysqlold-907 azureuser]# crm satus
ERROR: status: crm_mon (rc=107): Connection to cluster failed: Transport
endpoint is not connected
-

However, this does not seem to be having any effect on the cluster status
from other nodes
---

[root@SG-mysqlold-909 azureuser]# crm status
Last updated: Thu Sep 27 07:56:17 2018  Last change: Thu Sep 27
07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
Stack: classic openais (with plugin)
Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition
with quorum
3 nodes and 3 resources configured, 3 expected votes

Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
 Masters: [ SG-mysqlold-909 ]
 Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]


[root@SG-mysqlold-908 azureuser]# crm status
Last updated: Thu Sep 27 07:56:08 2018  Last change: Thu Sep 27
07:53:43 2018 by root via crm_attribute on SG-mysqlold-909
Stack: classic openais (with plugin)
Current DC: SG-mysqlold-908 (version 1.1.14-8.el6_8.1-70404b0) - partition
with quorum
3 nodes and 3 resources configured, 3 expected votes

Online: [ SG-mysqlold-907 SG-mysqlold-908 SG-mysqlold-909 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
 Masters: [ SG-mysqlold-909 ]
 Slaves: [ SG-mysqlold-907 SG-mysqlold-908 ]

--

I am a bit surprised that the other nodes are not able to detect that pacemaker
is down on one of the nodes (SG-mysqlold-907).

Even if I kill pacemaker on the node which is the DC, I observe the same
behavior, with the rest of the nodes not detecting that the DC is down.

Could someone explain what the expected behavior is in these cases?

I am using corosync 1.4.7 and pacemaker 1.1.14
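
For reference, one way to check what each layer still sees would be something
like this on a surviving node (a sketch; corosync-objctl comes with corosync 1.x
and crm_node with pacemaker):

corosync-objctl runtime.totem.pg.mrp.srp.members   # corosync's membership view
crm_node -l                                        # pacemaker's list of known peers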

Thanks in advance
Prasad
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Antw: Salvaging aborted resource migration

2018-09-27 Thread Ulrich Windl
Hi!

Obviously you violated the most important cluster rule that is "be patient".
Maybe the next important is "Don't change the configuration while the cluster
is not in IDLE state" ;-)


I feel these are issues that should be fixed, but the above rules make your
life easier while these issues still exist.
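
In scripted form, the "wait for IDLE" rule can look roughly like this (just a
sketch; crm_resource --wait exists in recent 1.1.x releases):

crm_resource --ban -r vm-conv-4
crm_resource --wait                 # blocks until the cluster has nothing left to do
crm_resource --clear -r vm-conv-4
# Alternatively, poll the DC with "crmadmin -S <dc-node>" until it reports S_IDLE.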

Regards,
Ulrich

>>> Ferenc Wágner  wrote on 27.09.2018 at 08:37 in message
<87tvmb5ttw@lant.ki.iif.hu>:
> Hi,
> 
> The current behavior of cancelled migration with Pacemaker 1.1.16 with a
> resource implementing push migration:
> 
> # /usr/sbin/crm_resource --ban -r vm-conv-4
> 
> vhbl03 crmd[10017]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> vhbl03 pengine[10016]:   notice: Migrate vm-conv-4#011(Started vhbl07 -> vhbl04)
> vhbl03 crmd[10017]:   notice: Initiating migrate_to operation 
> vm-conv-4_migrate_to_0 on vhbl07
> vhbl03 pengine[10016]:   notice: Calculated transition 4633, saving inputs 
> in /var/lib/pacemaker/pengine/pe-input-1069.bz2
> [...]
> 
> At this point, with the migration still ongoing, I wanted to get rid of
> the constraint:
> 
> # /usr/sbin/crm_resource --clear -r vm-conv-4
> 
> vhbl03 crmd[10017]:   notice: Transition aborted by deletion of 
> rsc_location[@id='cli-ban-vm-conv-4-on-vhbl07']: Configuration change
> vhbl07 crmd[10233]:   notice: Result of migrate_to operation for vm-conv-4 on 
> vhbl07: 0 (ok)
> vhbl03 crmd[10017]:   notice: Transition 4633 (Complete=6, Pending=0, 
> Fired=0, Skipped=1, Incomplete=6, 
> Source=/var/lib/pacemaker/pengine/pe-input-1069.bz2): Stopped
> vhbl03 pengine[10016]:   notice: Resource vm-conv-4 can no longer migrate to 
> vhbl04. Stopping on vhbl07 too
> vhbl03 pengine[10016]:   notice: Reload  vm-conv-4#011(Started vhbl07)
> vhbl03 pengine[10016]:   notice: Calculated transition 4634, saving inputs 
> in /var/lib/pacemaker/pengine/pe-input-1070.bz2
> vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on 
> vhbl07
> vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on 
> vhbl04
> vhbl03 crmd[10017]:   notice: Initiating reload operation vm-conv-4_reload_0 
> on vhbl04
> 
> This recovery was entirely unnecessary, as the resource successfully
> migrated to vhbl04 (the migrate_from operation does nothing).  Pacemaker
> does not know this, but is there a way to educate it?  I think in this
> special case it is possible to redesign the agent making migrate_to a
> no-op and doing everything in migrate_from, which would significantly
> reduce the window between the start points of the two "halves", but I'm
> not sure that would help in the end: Pacemaker could still decide to do
> an unnecessary stop+start recovery.  Would it?  I failed to find any
> documentation on recovery from aborted migration transitions.  I don't
> expect on-fail (for migrate_* ops, not me) to apply here, does it?
> 
> Side question: why initiate a reload in any case, like above?
> 
> Even more side question: could you please consider using space instead
> of TAB in syslog messages?  (Actually, I wouldn't mind getting rid of
> them altogether in any output.)
> -- 
> Thanks,
> Feri
> ___
> Users mailing list: Users@clusterlabs.org 
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> Project Home: http://www.clusterlabs.org 
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
> Bugs: http://bugs.clusterlabs.org 



___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Corosync 3 release plans?

2018-09-27 Thread Christine Caulfield
On 26/09/18 09:21, Ferenc Wágner wrote:
> Jan Friesse  writes:
> 
>> wagner.fer...@kifu.gov.hu writes:
>>
>>> triggered by your favourite IPC mechanism (SIGHUP and SIGUSRx are common
>>> choices, but logging.* cmap keys probably fit Corosync better).  That
>>> would enable proper log rotation.
>>
>> What is the reason that you consider "copytruncate" improper log
>> rotation? I know there is a risk of losing some lines, but it should be
>> pretty small.
> 
> Yes, there's a chance of losing some messages.  It may be acceptable in
> some cases, but it's never desirable.  The copy operation also wastes
> I/O bandwidth.  Reopening the log files on some external trigger is a
> better solution on all accounts and also an industry standard.
> 
>> Anyway, this again one of the feature where support from libqb would
>> be nice to have (there is actually issue opened
>> https://github.com/ClusterLabs/libqb/issues/239).
> 
> That's a convoluted one for a simple reopen!  But yes, if libqb does not
> expose such functionality, you can't do much about it.  I'll stay with
> syslog for now. :)  In cluster environments centralised log management is
> a must anyway, and that's annoying to achieve with direct file logs.
> 

I'm looking into new features for libqb, and the approach in
https://github.com/ClusterLabs/libqb/issues/142#issuecomment-76206425
looks like a good option to me. Adding an API call to re-open the
log file could be done too - I'm not averse to having both.

Chrissie
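
For concreteness, the copytruncate rotation discussed above boils down to a
logrotate stanza along these lines (a sketch; the drop-in name and the log file
path depend on the local logging configuration and are assumptions):

cat > /etc/logrotate.d/corosync <<'EOF'
/var/log/cluster/corosync.log {
    weekly
    rotate 8
    compress
    missingok
    # copy, then truncate in place; lines written during the copy window can be lost
    copytruncate
}
EOF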

>>> Jan Friesse  writes:
>>>
>>>> No matter how much I still believe totemsrp as a library would be
>>>> super nice to have - but the current state is far away from what I would
>>>> call a library (= something small, without non-related things like
>>>> transports/ip/..., testable (ideally with unit tests testing corner
>>>> cases)) and making one fat binary looks like a better way.
>>>>
>>>> I'll make a patch and send a PR (it should be easy).
>>>
>>> Sounds sensible.  Somebody can still split it out later if needed.
>>
>> Yep (and PR send + merged already :) )
> 
> Great!  Did you mean to keep the totem.h, totemip.h, totempg.h and
> totemstats.h header files installed nevertheless?  And totem_pg.pc could
> go as well, I guess.
> 

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: meatware stonith

2018-09-27 Thread Kristoffer Grönlund
On Thu, 2018-09-27 at 02:49 -0400, Digimer wrote:
> On 2018-09-27 01:54 AM, Ulrich Windl wrote:
> > > > > Digimer  wrote on 26.09.2018 at 18:29 in
> > > > > message
> > 
> > <1c70b5e2-ea8e-8cbe-3d83-e207ca47b...@alteeve.ca>:
> > > On 2018-09-26 11:11 AM, Patrick Whitney wrote:
> > > > Hey everyone,
> > > > 
> > > > I'm doing some pacemaker/corosync/dlm/clvm testing.  I'm
> > > > without a power
> > > > fencing solution at the moment, so I wanted to utilize
> > > > meatware, but it
> > > > doesn't show when I list available stonith devices (pcs stonith
> > > > list).
> > > > 
> > > > I do seem to have it on the system, as cluster-glue is
> > > > installed, and I
> > > > see meatware.so and meatclient on the system, and I also see
> > > > meatware
> > > > listed when running the command 'stonith -L' 
> > > > 
> > > > Can anyone guide me as to how to create a stonith meatware
> > > > resource
> > > > using pcs? 
> > > > 
> > > > Best,
> > > > -Pat
> > > 
> > > The "fence_manual" agent was removed after EL5 days, a lng
> > > time ago,
> > > because it so often led to split-brains because of misuse. Manual
> > > fencing is NOT recommended.
> > > 
> > > There are new options, like SBD (storage-based death) if you have
> > > a
> > > watchdog timer.
> > 
> > And even if you do not ;-)
> 
> I've not used SBD. How, without a watchdog timer, can you be sure the
> target node is dead?

You can't. You can use the Linux softdog module though, but since it is
a pure software solution it is limited and not ideal.
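
A minimal sketch of that softdog route (the sbd sysconfig path below is
distro-dependent and an assumption):

modprobe softdog                                   # load the software watchdog now
echo softdog > /etc/modules-load.d/softdog.conf    # and on every boot
# then point sbd at the resulting device, e.g. in /etc/sysconfig/sbd:
#   SBD_WATCHDOG_DEV=/dev/watchdog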

> 
-- 

Cheers,
Kristoffer

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Antw: Re: meatware stonith

2018-09-27 Thread Digimer
On 2018-09-27 01:54 AM, Ulrich Windl wrote:
 Digimer  wrote on 26.09.2018 at 18:29 in message
> <1c70b5e2-ea8e-8cbe-3d83-e207ca47b...@alteeve.ca>:
>> On 2018-09-26 11:11 AM, Patrick Whitney wrote:
>>> Hey everyone,
>>>
>>> I'm doing some pacemaker/corosync/dlm/clvm testing.  I'm without a power
>>> fencing solution at the moment, so I wanted to utilize meatware, but it
>>> doesn't show when I list available stonith devices (pcs stonith list).
>>>
>>> I do seem to have it on the system, as cluster-glue is installed, and I
>>> see meatware.so and meatclient on the system, and I also see meatware
>>> listed when running the command 'stonith -L' 
>>>
>>> Can anyone guide me as to how to create a stonith meatware resource
>>> using pcs? 
>>>
>>> Best,
>>> -Pat
>>
>> The "fence_manual" agent was removed after EL5 days, a lng time ago,
>> because it so often led to split-brains because of misuse. Manual
>> fencing is NOT recommended.
>>
>> There are new options, like SBD (storage-based death) if you have a
>> watchdog timer.
> 
> And even if you do not ;-)

I've not used SBD. How, without a watchdog timer, can you be sure the
target node is dead?

-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] Salvaging aborted resource migration

2018-09-27 Thread Ferenc Wágner
Hi,

The current behavior of cancelled migration with Pacemaker 1.1.16 with a
resource implementing push migration:

# /usr/sbin/crm_resource --ban -r vm-conv-4

vhbl03 crmd[10017]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
vhbl03 pengine[10016]:   notice: Migrate vm-conv-4#011(Started vhbl07 -> vhbl04)
vhbl03 crmd[10017]:   notice: Initiating migrate_to operation 
vm-conv-4_migrate_to_0 on vhbl07
vhbl03 pengine[10016]:   notice: Calculated transition 4633, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-1069.bz2
[...]

At this point, with the migration still ongoing, I wanted to get rid of
the constraint:

# /usr/sbin/crm_resource --clear -r vm-conv-4

vhbl03 crmd[10017]:   notice: Transition aborted by deletion of 
rsc_location[@id='cli-ban-vm-conv-4-on-vhbl07']: Configuration change
vhbl07 crmd[10233]:   notice: Result of migrate_to operation for vm-conv-4 on 
vhbl07: 0 (ok)
vhbl03 crmd[10017]:   notice: Transition 4633 (Complete=6, Pending=0, Fired=0, 
Skipped=1, Incomplete=6, Source=/var/lib/pacemaker/pengine/pe-input-1069.bz2): 
Stopped
vhbl03 pengine[10016]:   notice: Resource vm-conv-4 can no longer migrate to 
vhbl04. Stopping on vhbl07 too
vhbl03 pengine[10016]:   notice: Reload  vm-conv-4#011(Started vhbl07)
vhbl03 pengine[10016]:   notice: Calculated transition 4634, saving inputs in 
/var/lib/pacemaker/pengine/pe-input-1070.bz2
vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on 
vhbl07
vhbl03 crmd[10017]:   notice: Initiating stop operation vm-conv-4_stop_0 on 
vhbl04
vhbl03 crmd[10017]:   notice: Initiating reload operation vm-conv-4_reload_0 on 
vhbl04

This recovery was entirely unnecessary, as the resource successfully
migrated to vhbl04 (the migrate_from operation does nothing).  Pacemaker
does not know this, but is there a way to educate it?  I think in this
special case it is possible to redesign the agent making migrate_to a
no-op and doing everything in migrate_from, which would significantly
reduce the window between the start points of the two "halves", but I'm
not sure that would help in the end: Pacemaker could still decide to do
an unnecessary stop+start recovery.  Would it?  I failed to find any
documentation on recovery from aborted migration transitions.  I don't
expect on-fail (for migrate_* ops, not me) to apply here, does it?
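
To make the idea concrete, the pull-style split would look roughly like this
inside the agent (only a sketch; pull_vm_from is a hypothetical helper, the rest
is standard OCF plumbing):

. ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

vm_migrate_to() {
    # no-op: all real work happens on the destination in migrate_from
    ocf_log info "migrate_to: deferring to migrate_from on the target node"
    return $OCF_SUCCESS
}

vm_migrate_from() {
    # Pacemaker exports the source node name to the agent:
    local src="$OCF_RESKEY_CRM_meta_migrate_source"
    pull_vm_from "$src" || return $OCF_ERR_GENERIC   # hypothetical helper
    return $OCF_SUCCESS
}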

Side question: why initiate a reload in any case, like above?

Even more side question: could you please consider using space instead
of TAB in syslog messages?  (Actually, I wouldn't mind getting rid of
them altogether in any output.)
-- 
Thanks,
Feri
___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org