Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-12-12 Thread Ken Gaillot
On Wed, 2017-11-01 at 10:04 +0100, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
> > When an operation completes, a history entry (<lrm_rsc_op>) is added to
> > the pe-input file. If the agent supports reload, the entry will include
> > op-force-restart and op-restart-digest fields. Now I see those are
> > present in the vm-alder_last_0 entry, so agent support isn't the issue.
> 
> Thanks for the explanation.
> 
> > However, the operation is recorded as a *failed* probe (i.e. the
> > resource was running where it wasn't expected). This gets recorded as a
> > separate vm-alder_last_failure_0 entry, which does not get the special
> > fields. It looks to me like this failure entry is forcing the restart.
> > That would be a good idea if it's an actual failure; if we find a
> > resource unexpectedly running, we don't know how it was started, so a
> > full restart makes sense.
> > 
> > However, I'm guessing it may not have been a real error, but a resource
> > cleanup. A cleanup clears the history so the resource is re-probed, and
> > I suspect that re-probe is what got recorded here as a failure. Does
> > that match what actually happened?
> 
> Well, I can't really remember, it happened two months ago...  I'm pretty
> sure the resource wasn't running unexpectedly, I'd surely recall such a
> grave failure.  Interestingly, though, my shell history contains a
> cleanup operation shortly after the parameter change.  Also, if you look
> at the logs in my thread starting mail, you'll find
> 
> warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)
> 
> which does not seem to match up with the failure in the lrm_rsc_op entry
> in pe-input.  It's sort of "normal" that such a resource disappears and
> gets restarted by the cluster.  If that report survived the unexpected
> restart, I might have wanted to routinely clean it up afterwards.
> 
> (I'm leaving for a short holiday now, expect longer delays.)

Looking at it again with crm_simulate with 1.1.18 + patches, it does
appear that the combination of a cleanup and a parameter change in the
same transition turned the reload into a restart.

The cleanup results in a failed probe being recorded, and that history
entry does not have the magic attributes indicating reloadability.

I suspect if you changed the parameter, waited for the reload to
happen, then did the cleanup, it would have been fine.

I'll have to investigate a fix.
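
As a minimal sketch of the ordering suggested above (change the parameter, let the cluster settle, then clean up), the three steps could be scripted roughly as follows. The function name and the injectable `run` argument are illustrative assumptions; the `-r/-p/-v` invocation is the one from the original report, and `--wait` and `--cleanup` are assumed to be supported by the installed crm_resource.

```python
import shlex
import subprocess


def change_then_cleanup(resource, param, value, run=subprocess.run):
    """Change a parameter, wait for the cluster to settle, then clean up.

    Hypothetical helper: keeping the cleanup out of the transition that
    handles the parameter change is the whole point of the ordering.
    """
    steps = [
        # 1. Change the (reloadable) parameter, as in the original report.
        ["crm_resource", "-r", resource, "-p", param, "-v", value],
        # 2. Block until the cluster has no pending actions (assumed flag).
        ["crm_resource", "--wait"],
        # 3. Only now clear the resource's operation history.
        ["crm_resource", "--cleanup", "-r", resource],
    ]
    for cmd in steps:
        run(cmd, check=True)
    return steps


# Dry run: record the commands instead of executing them.
issued = []
change_then_cleanup("vm-alder", "admins", "kissg wferi",
                    run=lambda cmd, check: issued.append(cmd))
for cmd in issued:
    print(shlex.join(cmd))
```

Calling the function without the `run` argument would execute the commands for real, so the dry run is the safer way to check the sequence first.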
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-11-01 Thread Ferenc Wágner
Ken Gaillot  writes:

> When an operation completes, a history entry (<lrm_rsc_op>) is added to
> the pe-input file. If the agent supports reload, the entry will include
> op-force-restart and op-restart-digest fields. Now I see those are
> present in the vm-alder_last_0 entry, so agent support isn't the issue.

Thanks for the explanation.

> However, the operation is recorded as a *failed* probe (i.e. the
> resource was running where it wasn't expected). This gets recorded as a
> separate vm-alder_last_failure_0 entry, which does not get the special
> fields. It looks to me like this failure entry is forcing the restart.
> That would be a good idea if it's an actual failure; if we find a
> resource unexpectedly running, we don't know how it was started, so a
> full restart makes sense. 
>
> However, I'm guessing it may not have been a real error, but a resource
> cleanup. A cleanup clears the history so the resource is re-probed, and
> I suspect that re-probe is what got recorded here as a failure. Does
> that match what actually happened?

Well, I can't really remember, it happened two months ago...  I'm pretty
sure the resource wasn't running unexpectedly, I'd surely recall such a
grave failure.  Interestingly, though, my shell history contains a
cleanup operation shortly after the parameter change.  Also, if you look
at the logs in my thread starting mail, you'll find

warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)

which does not seem to match up with the failure in the lrm_rsc_op entry
in pe-input.  It's sort of "normal" that such a resource disappears and
gets restarted by the cluster.  If that report survived the unexpected
restart, I might have wanted to routinely clean it up afterwards.

(I'm leaving for a short holiday now, expect longer delays.)
-- 
Regards,
Feri



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-31 Thread Ken Gaillot
On Tue, 2017-10-31 at 18:44 +0100, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
> > The pe-input is indeed entirely sufficient.
> > 
> > I forgot to check why the reload was not possible in this case. It
> > turns out it is this:
> > 
> >    trace: check_action_definition:  Resource vm-alder doesn't know how to reload
> > 
> > Does the resource agent implement the "reload" action and advertise it
> > in the <actions> section of its metadata?
> 
> Absolutely, I use this operation routinely.
> 
> $ /usr/sbin/crm_resource --show-metadata=ocf:niif:TransientDomain
> [...]
> 
> And the implementation is just a no-op.
> 
> vm-alder is based on a template, just like all other VMs:
> 
>  type="TransientDomain">
>   
>  name="migr_timeout" value="120"/>
> [...]
>   
>   [...]
>   
>  name="migr_timeout" value="10"/>
> [...]
>  value="kissg wferi"/>
>   
>   
>  timeout="1500" record-pending="true"/>
>  record-pending="true"/>
>  name="migrate_from" timeout="20"/>
>  timeout="20"/>
>  timeout="120" record-pending="true"/>
>   
>   [...]
> 
> 
> I wonder why it wouldn't know how to reload.  How is that visible in the
> pe-input file?  I'd check the other resources...

When an operation completes, a history entry (<lrm_rsc_op>) is added to
the pe-input file. If the agent supports reload, the entry will include
op-force-restart and op-restart-digest fields. Now I see those are
present in the vm-alder_last_0 entry, so agent support isn't the issue.

However, the operation is recorded as a *failed* probe (i.e. the
resource was running where it wasn't expected). This gets recorded as a
separate vm-alder_last_failure_0 entry, which does not get the special
fields. It looks to me like this failure entry is forcing the restart.
That would be a good idea if it's an actual failure; if we find a
resource unexpectedly running, we don't know how it was started, so a
full restart makes sense. 

However, I'm guessing it may not have been a real error, but a resource
cleanup. A cleanup clears the history so the resource is re-probed, and
I suspect that re-probe is what got recorded here as a failure. Does
that match what actually happened?
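
For anyone wanting to check this on their own pe-input file, here is a minimal sketch that lists which history entries of a resource lack the two reload-related fields. The XML below is a hypothetical, heavily trimmed stand-in for the status section (attribute values are made up); a real pe-input would first need to be decompressed, e.g. with Python's bz2 module.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the element nesting and the two field names
# discussed above matter here, the values are invented for illustration.
SAMPLE = """
<cib>
  <status>
    <node_state uname="vhbl05">
      <lrm>
        <lrm_resources>
          <lrm_resource id="vm-alder">
            <lrm_rsc_op id="vm-alder_last_0" operation="monitor"
                        op-force-restart=" example " op-restart-digest="abc123"/>
            <lrm_rsc_op id="vm-alder_last_failure_0" operation="monitor"/>
          </lrm_resource>
        </lrm_resources>
      </lrm>
    </node_state>
  </status>
</cib>
"""

RELOAD_FIELDS = ("op-force-restart", "op-restart-digest")


def missing_reload_fields(xml_text, resource_id):
    """Map each history entry id of a resource to the reload fields it lacks."""
    root = ET.fromstring(xml_text)
    missing = {}
    for rsc in root.iter("lrm_resource"):
        if rsc.get("id") != resource_id:
            continue
        for op in rsc.iter("lrm_rsc_op"):
            missing[op.get("id")] = [f for f in RELOAD_FIELDS
                                     if op.get(f) is None]
    return missing


report = missing_reload_fields(SAMPLE, "vm-alder")
for entry_id, lacking in report.items():
    print(entry_id, "lacks:", ", ".join(lacking) or "nothing")
```

An entry that lacks both fields, like the failure entry in this sample, is the kind described above as not carrying the information needed for a reload.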
-- 
Ken Gaillot 



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-31 Thread Ferenc Wágner
Ken Gaillot  writes:

> The pe-input is indeed entirely sufficient.
>
> I forgot to check why the reload was not possible in this case. It
> turns out it is this:
>
>    trace: check_action_definition:  Resource vm-alder doesn't know how to reload
>
> Does the resource agent implement the "reload" action and advertise it
> in the <actions> section of its metadata?

Absolutely, I use this operation routinely.

$ /usr/sbin/crm_resource --show-metadata=ocf:niif:TransientDomain
[...]

And the implementation is just a no-op.

vm-alder is based on a template, just like all other VMs:

[...]

I wonder why it wouldn't know how to reload.  How is that visible in the
pe-input file?  I'd check the other resources...
-- 
Thanks,
Feri



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-31 Thread Ken Gaillot
On Tue, 2017-10-31 at 09:33 +0100, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
> > On Fri, 2017-10-20 at 15:52 +0200, Ferenc Wágner wrote:
> > 
> > > Ken Gaillot  writes:
> > > 
> > > > On Fri, 2017-09-22 at 18:30 +0200, Ferenc Wágner wrote:
> > > > 
> > > > > Ken Gaillot  writes:
> > > > > 
> > > > > > Hmm, stop+reload is definitely a bug. Can you attach (or
> > > > > > email it
> > > > > > to me privately, or file a bz with it attached) the above
> > > > > > pe-input
> > > > > > file with any sensitive info removed?
> > > > > 
> > > > > I sent you the pe-input file privately.  It indeed shows the
> > > > > issue:
> > > > > 
> > > > > $ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
> > > > > [...]
> > > > > Executing cluster transition:
> > > > > >  * Resource action: vm-alder stop on vhbl05
> > > > > >  * Resource action: vm-alder reload on vhbl05
> > > > > [...]
> > > > > 
> > > > > Hope you can easily get to the bottom of this.
> > > > 
> > > > This turned out to have the same underlying cause as CLBZ#5309.
> > > > I
> > > > have a fix pending review, which I expect to make it into the
> > > > soon-to-be-released 1.1.18.
> > > 
> > > Great!
> > > 
> > > > It is a regression introduced in 1.1.15 by commit 2558d76f. The
> > > > logic for reloads was consolidated in one place, but that
> > > > happened
> > > > to be before restarts were scheduled, so it no longer had the
> > > > right
> > > > information about whether a restart was needed. Now, it sets an
> > > > ordering flag that is used later to cancel the reload if the
> > > > restart
> > > > becomes required. I've also added a regression test for it.
> > > 
> > > Restarts shouldn't even enter the picture here, so I don't get
> > > your
> > > explanation.  But I also don't know the code, so that doesn't
> > > mean a
> > > thing.  I'll test the next RC to be sure.
> > 
> > :-)
> > 
> > Reloads are done in place of restarts, when circumstances allow. So
> > reloads are always related to (potential) restarts.
> > 
> > The problem arose because not all of the relevant circumstances are
> > known at the time the reload action is created. We may figure out
> > later
> > that a resource the reloading resource depends on must be
> > restarted,
> > therefore the reloading resource must be fully restarted instead of
> > reloaded. E.g. a database resource might otherwise be able to
> > reload,
> > but not if the filesystem it's using is going away.
> > 
> > Previously in those cases, we would end up scheduling both the
> > reload
> > and the restart. Now, we schedule only the restart.
> 
> Hi Ken,
> 
> 1.1.18-rc3 indeed schedules a restart, not a reload, like 1.1.16 did.
> However, this wasn't my problem, I really expect a reload on the change
> of a non-unique parameter.  The problem was that 1.1.16 also executed a
> stop action in parallel with the reload.
> 
> Maybe I test it wrong: I just copied the pe-input file to another system
> (which doesn't even know this resource agent) running 1.1.18-rc3 and
> gave it to crm_simulate.  Does the pe-input file contain all the
> information necessary to decide between restart and reload?  The
> op-force-restart attribute does not contain the name of the changed
> parameter, but I can't find any info on what changed at all.  Should I
> see a clean reload in this test setup at all?

The pe-input is indeed entirely sufficient.

I forgot to check why the reload was not possible in this case. It
turns out it is this:

   trace: check_action_definition:  Resource vm-alder doesn't know how to reload

Does the resource agent implement the "reload" action and advertise it
in the <actions> section of its metadata?
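
A rough way to answer that question mechanically is sketched below: parse the agent's metadata and report whether a "reload" action is advertised and which parameters are marked unique="0". The metadata here is a hypothetical example, not the real TransientDomain output; only the parameter name "admins" is taken from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCF-style metadata, trimmed to the parts that matter here.
SAMPLE_METADATA = """
<resource-agent name="ExampleAgent">
  <parameters>
    <parameter name="config" unique="1"/>
    <parameter name="admins" unique="0"/>
  </parameters>
  <actions>
    <action name="start" timeout="120"/>
    <action name="stop" timeout="120"/>
    <action name="reload" timeout="20"/>
  </actions>
</resource-agent>
"""


def reload_support(metadata_xml):
    """Return (advertises_reload, names of parameters with unique="0")."""
    root = ET.fromstring(metadata_xml)
    advertises = any(action.get("name") == "reload"
                     for action in root.iter("action"))
    non_unique = [param.get("name") for param in root.iter("parameter")
                  if param.get("unique") == "0"]
    return advertises, non_unique


advertises, non_unique = reload_support(SAMPLE_METADATA)
print("reload advertised:", advertises)
print('unique="0" parameters:', non_unique)
```

The real metadata could be fed in by capturing the output of the crm_resource --show-metadata command quoted in this thread.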
-- 
Ken Gaillot 



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-31 Thread Ferenc Wágner
Ken Gaillot  writes:

> On Fri, 2017-10-20 at 15:52 +0200, Ferenc Wágner wrote:
>
>> Ken Gaillot  writes:
>> 
>>> On Fri, 2017-09-22 at 18:30 +0200, Ferenc Wágner wrote:
>>>
>>>> Ken Gaillot writes:
>>>> 
>>>>> Hmm, stop+reload is definitely a bug. Can you attach (or email it
>>>>> to me privately, or file a bz with it attached) the above pe-input
>>>>> file with any sensitive info removed?
>>>> 
>>>> I sent you the pe-input file privately.  It indeed shows the
>>>> issue:
>>>> 
>>>> $ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
>>>> [...]
>>>> Executing cluster transition:
>>>>  * Resource action: vm-alder stop on vhbl05
>>>>  * Resource action: vm-alder reload on vhbl05
>>>> [...]
>>>> 
>>>> Hope you can easily get to the bottom of this.
>>> 
>>> This turned out to have the same underlying cause as CLBZ#5309. I
>>> have a fix pending review, which I expect to make it into the
>>> soon-to-be-released 1.1.18.
>> 
>> Great!
>> 
>>> It is a regression introduced in 1.1.15 by commit 2558d76f. The
>>> logic for reloads was consolidated in one place, but that happened
>>> to be before restarts were scheduled, so it no longer had the right
>>> information about whether a restart was needed. Now, it sets an
>>> ordering flag that is used later to cancel the reload if the restart
>>> becomes required. I've also added a regression test for it.
>> 
>> Restarts shouldn't even enter the picture here, so I don't get your
>> explanation.  But I also don't know the code, so that doesn't mean a
>> thing.  I'll test the next RC to be sure.
>
> :-)
>
> Reloads are done in place of restarts, when circumstances allow. So
> reloads are always related to (potential) restarts.
>
> The problem arose because not all of the relevant circumstances are
> known at the time the reload action is created. We may figure out later
> that a resource the reloading resource depends on must be restarted,
> therefore the reloading resource must be fully restarted instead of
> reloaded. E.g. a database resource might otherwise be able to reload,
> but not if the filesystem it's using is going away.
>
> Previously in those cases, we would end up scheduling both the reload
> and the restart. Now, we schedule only the restart.

Hi Ken,

1.1.18-rc3 indeed schedules a restart, not a reload, like 1.1.16 did.
However, this wasn't my problem, I really expect a reload on the change
of a non-unique parameter.  The problem was that 1.1.16 also executed a
stop action in parallel with the reload.

Maybe I test it wrong: I just copied the pe-input file to another system
(which doesn't even know this resource agent) running 1.1.18-rc3 and
gave it to crm_simulate.  Does the pe-input file contain all the
information necessary to decide between restart and reload?  The
op-force-restart attribute does not contain the name of the changed
parameter, but I can't find any info on what changed at all.  Should I
see a clean reload in this test setup at all?
-- 
Thanks,
Feri



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-20 Thread Ken Gaillot
On Fri, 2017-10-20 at 15:52 +0200, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
> > On Fri, 2017-09-22 at 18:30 +0200, Ferenc Wágner wrote:
> > > Ken Gaillot  writes:
> > > 
> > > > Hmm, stop+reload is definitely a bug. Can you attach (or email it to
> > > > me privately, or file a bz with it attached) the above pe-input file
> > > > with any sensitive info removed?
> > > 
> > > I sent you the pe-input file privately.  It indeed shows the
> > > issue:
> > > 
> > > $ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
> > > [...]
> > > Executing cluster transition:
> > >  * Resource action: vm-alder stop on vhbl05
> > >  * Resource action: vm-alder reload on vhbl05
> > > [...]
> > > 
> > > Hope you can easily get to the bottom of this.
> > 
> > This turned out to have the same underlying cause as CLBZ#5309. I have
> > a fix pending review, which I expect to make it into the
> > soon-to-be-released 1.1.18.
> 
> Great!
> 
> > It is a regression introduced in 1.1.15 by commit 2558d76f. The logic
> > for reloads was consolidated in one place, but that happened to be
> > before restarts were scheduled, so it no longer had the right
> > information about whether a restart was needed. Now, it sets an
> > ordering flag that is used later to cancel the reload if the restart
> > becomes required. I've also added a regression test for it.
> 
> Restarts shouldn't even enter the picture here, so I don't get your
> explanation.  But I also don't know the code, so that doesn't mean a
> thing.  I'll test the next RC to be sure.

:-)

Reloads are done in place of restarts, when circumstances allow. So
reloads are always related to (potential) restarts.

The problem arose because not all of the relevant circumstances are
known at the time the reload action is created. We may figure out later
that a resource the reloading resource depends on must be restarted,
therefore the reloading resource must be fully restarted instead of
reloaded. E.g. a database resource might otherwise be able to reload,
but not if the filesystem it's using is going away.

Previously in those cases, we would end up scheduling both the reload
and the restart. Now, we schedule only the restart.
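
To make the two-phase idea concrete, here is a toy model (not Pacemaker's actual scheduler code; all names are hypothetical). A reload is scheduled tentatively for each resource whose reloadable parameter changed, and is later replaced by a restart if anything the resource depends on turns out to need a restart.

```python
def plan_actions(changed, depends_on, must_restart):
    """Toy planner illustrating "reload unless a dependency restarts".

    changed:      resources with a changed reloadable parameter
    depends_on:   mapping of resource -> list of resources it depends on
    must_restart: resources that need a full restart for other reasons
    """
    # Phase 1: tentatively schedule reloads, and record the known restarts.
    plan = {rsc: "reload" for rsc in changed}
    for rsc in must_restart:
        plan[rsc] = "restart"

    # Phase 2: a restart of a dependency cancels the reload and forces a
    # restart of the dependent resource; repeat until nothing changes.
    updated = True
    while updated:
        updated = False
        for rsc, deps in depends_on.items():
            if plan.get(rsc) == "restart":
                continue
            if any(plan.get(dep) == "restart" for dep in deps):
                plan[rsc] = "restart"
                updated = True
    return plan


deps = {"database": ["filesystem"]}
print(plan_actions({"database"}, deps, must_restart=set()))
print(plan_actions({"database"}, deps, must_restart={"filesystem"}))
```

In the first call the database is only reloaded; in the second, the filesystem restart upgrades the database's action to a restart, and each resource ends up with exactly one planned action rather than both a reload and a restart.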
-- 
Ken Gaillot 



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-20 Thread Ferenc Wágner
Ken Gaillot  writes:

> On Fri, 2017-09-22 at 18:30 +0200, Ferenc Wágner wrote:
>> Ken Gaillot  writes:
>> 
>>> Hmm, stop+reload is definitely a bug. Can you attach (or email it to
>>> me privately, or file a bz with it attached) the above pe-input file
>>> with any sensitive info removed?
>> 
>> I sent you the pe-input file privately.  It indeed shows the issue:
>> 
>> $ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
>> [...]
>> Executing cluster transition:
>>  * Resource action: vm-alder stop on vhbl05
>>  * Resource action: vm-alder reload on vhbl05
>> [...]
>> 
>> Hope you can easily get to the bottom of this.
>
> This turned out to have the same underlying cause as CLBZ#5309. I have
> a fix pending review, which I expect to make it into the
> soon-to-be-released 1.1.18.

Great!

> It is a regression introduced in 1.1.15 by commit 2558d76f. The logic
> for reloads was consolidated in one place, but that happened to be
> before restarts were scheduled, so it no longer had the right
> information about whether a restart was needed. Now, it sets an
> ordering flag that is used later to cancel the reload if the restart
> becomes required. I've also added a regression test for it.

Restarts shouldn't even enter the picture here, so I don't get your
explanation.  But I also don't know the code, so that doesn't mean a
thing.  I'll test the next RC to be sure.
-- 
Thanks,
Feri



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-10-17 Thread Ken Gaillot
On Fri, 2017-09-22 at 18:30 +0200, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
> > Hmm, stop+reload is definitely a bug. Can you attach (or email it to me
> > privately, or file a bz with it attached) the above pe-input file with
> > any sensitive info removed?
> 
> I sent you the pe-input file privately.  It indeed shows the issue:
> 
> $ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
> [...]
> Executing cluster transition:
>  * Resource action: vm-alder stop on vhbl05
>  * Resource action: vm-alder reload on vhbl05
> [...]
> 
> Hope you can easily get to the bottom of this.
> 
> > Nothing's been done about reload yet. It's waiting until we get around
> > to an overhaul of the OCF resource agent standard, so we can define
> > the semantics more clearly. It will involve replacing "unique" with
> > separate meta-data for reloadability and GUI hinting, and possibly
> > changes to the reload operation. Of course we'll try to stay
> > backward-compatible.
> 
> Thanks for the confirmation.

This turned out to have the same underlying cause as CLBZ#5309. I have
a fix pending review, which I expect to make it into the
soon-to-be-released 1.1.18.

It is a regression introduced in 1.1.15 by commit 2558d76f. The logic
for reloads was consolidated in one place, but that happened to be
before restarts were scheduled, so it no longer had the right
information about whether a restart was needed. Now, it sets an
ordering flag that is used later to cancel the reload if the restart
becomes required. I've also added a regression test for it.
-- 
Ken Gaillot 



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-09-22 Thread Ferenc Wágner
Ken Gaillot  writes:

> Hmm, stop+reload is definitely a bug. Can you attach (or email it to me
> privately, or file a bz with it attached) the above pe-input file with
> any sensitive info removed?

I sent you the pe-input file privately.  It indeed shows the issue:

$ /usr/sbin/crm_simulate -x pe-input-1033.bz2 -RS
[...]
Executing cluster transition:
 * Resource action: vm-alder stop on vhbl05
 * Resource action: vm-alder reload on vhbl05
[...]

Hope you can easily get to the bottom of this.

> Nothing's been done about reload yet. It's waiting until we get around
> to an overhaul of the OCF resource agent standard, so we can define
> the semantics more clearly. It will involve replacing "unique" with
> separate meta-data for reloadability and GUI hinting, and possibly
> changes to the reload operation. Of course we'll try to stay
> backward-compatible.

Thanks for the confirmation.
-- 
Feri



Re: [ClusterLabs] Pacemaker resource parameter reload confusion

2017-09-22 Thread Ken Gaillot
On Fri, 2017-09-22 at 16:23 +0200, Ferenc Wágner wrote:
> Hi,
> 
> I'm running a custom resource agent under Pacemaker 1.1.16, which has
> several reloadable parameters:
> 
> $ /usr/sbin/crm_resource --show-metadata=ocf:niif:TransientDomain | fgrep unique=
> [...]
> 
> I used to routinely change the unique="0" parameters without having the
> corresponding resources restarted.  But now something like
> 
> $ sudo crm_resource -r vm-alder -p admins -v "kissg wferi"
> 
> restarts the resource in a somewhat strange way:
> 
> crmd[27037]:   notice: State transition S_IDLE -> S_POLICY_ENGINE
> pengine[27036]:   notice: Reload  vm-alder#011(Started vhbl05)
> pengine[27036]:   notice: Calculated transition 1309, saving inputs in /var/lib/pacemaker/pengine/pe-input-1033.bz2
> crmd[27037]:   notice: Initiating stop operation vm-alder_stop_0 on vhbl05
> crmd[27037]:   notice: Initiating reload operation vm-alder_reload_0 on vhbl05
> crmd[27037]:   notice: Transition aborted by deletion of lrm_rsc_op[@id='vm-alder_last_failure_0']: Resource operation removal
> crmd[27037]:   notice: Transition 1309 (Complete=10, Pending=0, Fired=0, Skipped=1, Incomplete=2, Source=/var/lib/pacemaker/pengine/pe-input-1033.bz2): Stopped
> pengine[27036]:   notice: Calculated transition 1310, saving inputs in /var/lib/pacemaker/pengine/pe-input-1034.bz2

Hmm, stop+reload is definitely a bug. Can you attach (or email it to me
privately, or file a bz with it attached) the above pe-input file with
any sensitive info removed?

> crmd[27037]:   notice: Initiating monitor operation vm-alder_monitor_6 on vhbl05
> crmd[27037]:  warning: Action 228 (vm-alder_monitor_6) on vhbl05 failed (target: 0 vs. rc: 7): Error
> crmd[27037]:   notice: Transition aborted by operation vm-alder_monitor_6 'create' on vhbl05: Event failed
> crmd[27037]:   notice: Transition 1310 (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1034.bz2): Complete
> pengine[27036]:  warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)
> pengine[27036]:   notice: Recover vm-alder#011(Started vhbl05)
> pengine[27036]:   notice: Calculated transition 1311, saving inputs in /var/lib/pacemaker/pengine/pe-input-1035.bz2
> pengine[27036]:  warning: Processing failed op monitor for vm-alder on vhbl05: not running (7)
> pengine[27036]:   notice: Recover vm-alder#011(Started vhbl05)
> pengine[27036]:   notice: Calculated transition 1312, saving inputs in /var/lib/pacemaker/pengine/pe-input-1036.bz2
> crmd[27037]:   notice: Initiating stop operation vm-alder_stop_0 on vhbl05
> crmd[27037]:   notice: Initiating start operation vm-alder_start_0 on vhbl05
> crmd[27037]:   notice: Initiating monitor operation vm-alder_monitor_6 on vhbl05
> crmd[27037]:   notice: Transition 1312 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-1036.bz2): Complete
> crmd[27037]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
> 
> I've got info level logs as well, but those are rather long and maybe
> someone can pinpoint my problem without going through those.  I remember
> past discussions about "doing reload right", but I'm not sure what was
> implemented in the end, and I can't find anything in the changelog
> either.  So, what do I miss here?  Parallel reload and stop looks rather
> suspicious, though...

Nothing's been done about reload yet. It's waiting until we get around
to an overhaul of the OCF resource agent standard, so we can define the
semantics more clearly. It will involve replacing "unique" with
separate meta-data for reloadability and GUI hinting, and possibly
changes to the reload operation. Of course we'll try to stay
backward-compatible.
-- 
Ken Gaillot 



