Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-09 Thread lkxjtu


>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.
I have tried adding "start-delay=60s" to the monitor operation. The first monitor
was indeed delayed by 60s, but during those 60s it blocked the other resources
too! The result is the same as sleeping in the monitor.
So I think the best method for me is to use a timestamp to decide whether the
monitor function needs to return success.
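A minimal sketch of that timestamp idea in an OCF shell agent, combining it with
the private-attribute suggestion from earlier in the thread (the attribute name,
grace period, the helper functions, and the parsing of the attrd_updater -Q
output are assumptions for illustration, not the real fm_mgt agent):

# Minimal sketch, assuming the agent sources the usual OCF shell functions;
# attribute name, grace period, and healthcheck are placeholders.
. ${OCF_FUNCTIONS_DIR:-${OCF_ROOT}/lib/heartbeat}/ocf-shellfuncs

GRACE=600                        # grace period after start, in seconds
ATTR="fm_mgt-start-time"         # private node attribute holding the start time

fm_mgt_start() {
    start_service_in_background  # placeholder for the real start logic
    # record the start time as a private node attribute (not written to the CIB)
    attrd_updater -n "$ATTR" -U "$(date +%s)" -p
    return $OCF_SUCCESS
}

fm_mgt_monitor() {
    if real_healthcheck; then    # placeholder for the real healthcheck
        # healthy: the grace period is no longer needed
        attrd_updater -n "$ATTR" -D
        return $OCF_SUCCESS
    fi
    # healthcheck failed: still report success while inside the grace period
    started=$(attrd_updater -n "$ATTR" -Q 2>/dev/null \
              | sed -n 's/.*value="\([0-9]*\)".*/\1/p')
    if [ -n "$started" ] && [ $(( $(date +%s) - started )) -lt $GRACE ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_NOT_RUNNING
}

The intent is that a failed healthcheck right after start no longer holds up
recovery of other resources; once the grace period expires, failures surface
normally.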
Thank you very much!

At 2017-11-06 21:53:53, "Ken Gaillot"  wrote:
>On Sat, 2017-11-04 at 22:46 +0800, lkxjtu wrote:
>>
>>
>> >Another possibility would be to have the start return immediately, and
>> >make the monitor artificially return success for the first 10 minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable.
>> I tried to put the sleep into the monitor function (I added a "sleep
>> 60" at the monitor entry for debugging); the start function returns
>> immediately. I found an interesting thing: the first monitor after
>> start blocks other resources too, but from the second time on it
>> doesn't block other resources! Is this normal?
>
>Yes, the first result is for an unknown status, but after that, the
>cluster assumes the resource is OK unless/until the monitor says
>otherwise.
>
>However, I wasn't suggesting putting a sleep inside the monitor -- I
>was just thinking of having the monitor check the time, and if it's
>within 10 minutes of start, return success.
>
>> >My first thought on how to implement this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
>> >do its usual check, and if it succeeds, remove that node attribute,
>> >but if it fails, check the node attribute to see whether it's within
>> >the desired delay.
>> This means that if it is within the desired delay, the monitor should
>> return success even if the healthcheck failed?
>> I think this can solve my problem except for what "crm status" shows.
>
>Yes, that's what I had in mind. The status would show "running", which
>may or may not be what you want in this case.
>
>Also, I forgot about the undocumented/unsupported start-delay operation
>attribute, that you can put on the status operation to delay the first
>monitor. That may give you the behavior you want.
>
>> At 2017-11-01 21:20:50, "Ken Gaillot"  wrote:
>> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
>> >>
>> >> Thank you for your response! This means that there shouldn't be a
>> >> long "sleep" in an OCF script.
>> >> If my service takes 10 minutes from service start to a normal
>> >> healthcheck, then what should I do?
>> >
>> >That is a tough situation with no great answer.
>> >
>> >You can leave it as it is, and live with the delay. Note that it only
>> >happens if a resource fails after the slow resource has already begun
>> >starting ... if they fail at the same time (as with a node failure),
>> >the cluster will schedule recovery for both at the same time.
>> >
>> >Another possibility would be to have the start return immediately, and
>> >make the monitor artificially return success for the first 10 minutes
>> >after starting. It's hacky, and it depends on your situation whether
>> >the behavior is acceptable. My first thought on how to implement this
>> >would be to have the start action set a private node attribute
>> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
>> >do its usual check, and if it succeeds, remove that node attribute,
>> >but if it fails, check the node attribute to see whether it's within
>> >the desired delay.
>> >
>> >> Thank you very much!
>> >>
>> >> > Hi,
>> >> > If I remember correctly, any pending actions from a previous
>> >> > transition must be completed before a new transition can be
>> >> > calculated. Otherwise, there's the possibility that the pending
>> >> > action could change the state in a way that makes the second
>> >> > transition's decisions harmful.
>> >> > Theoretically (and ideally), pacemaker could figure out whether
>> >> > some of the actions in the second transition would be needed
>> >> > regardless of whether the pending actions succeeded or failed,
>> >> > but in practice, that would be difficult to implement (and
>> >> > possibly take more time to calculate than is desirable in a
>> >> > recovery situation).
>> >>
>> >> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
>> >>
>> >> > I have two clone resources in my corosync/pacemaker cluster. They
>> >> > are fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes
>> >> > 1 minute to start the service (calling the OCF start function for
>> >> > 1 minute). Configured as below:
>> >> > # crm configure show
>> >> > node 168002177: 192.168.2.177
>> >> > node 

Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-06 Thread Ken Gaillot
On Sat, 2017-11-04 at 22:46 +0800, lkxjtu wrote:
>
>
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable.
> I tried to put the sleep into the monitor function (I added a "sleep
> 60" at the monitor entry for debugging); the start function returns
> immediately. I found an interesting thing: the first monitor after
> start blocks other resources too, but from the second time on it
> doesn't block other resources! Is this normal?

Yes, the first result is for an unknown status, but after that, the
cluster assumes the resource is OK unless/until the monitor says
otherwise.

However, I wasn't suggesting putting a sleep inside the monitor -- I
was just thinking of having the monitor check the time, and if it's
within 10 minutes of start, return success.

> >My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
> >do its usual check, and if it succeeds, remove that node attribute,
> >but if it fails, check the node attribute to see whether it's within
> >the desired delay.
> This means that if it is within the desired delay, the monitor should
> return success even if the healthcheck failed?
> I think this can solve my problem except for what "crm status" shows.

Yes, that's what I had in mind. The status would show "running", which
may or may not be what you want in this case.

Also, I forgot about the undocumented/unsupported start-delay operation
attribute, that you can put on the status operation to delay the first
monitor. That may give you the behavior you want.
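For reference, a hedged sketch of what that would look like on the crm
configuration used in this thread (the delay value is illustrative and
untested; note also lkxjtu's later report that the delayed first monitor
still blocks other resources):

# untested sketch: delay the first monitor by 10 minutes via the
# undocumented/unsupported start-delay operation attribute
primitive fm_mgt fm_mgt \
    op monitor interval=20s timeout=120s start-delay=600s \
    op stop interval=0 timeout=120s on-fail=restart \
    op start interval=0 timeout=120s on-fail=restart \
    meta target-role=Started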

> At 2017-11-01 21:20:50, "Ken Gaillot"  wrote:
> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
> >>
> >> Thank you for your response! This means that there shouldn't be a
> >> long "sleep" in an OCF script.
> >> If my service takes 10 minutes from service start to a normal
> >> healthcheck, then what should I do?
> >
> >That is a tough situation with no great answer.
> >
> >You can leave it as it is, and live with the delay. Note that it only
> >happens if a resource fails after the slow resource has already begun
> >starting ... if they fail at the same time (as with a node failure),
> >the cluster will schedule recovery for both at the same time.
> >
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable. My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
> >do its usual check, and if it succeeds, remove that node attribute,
> >but if it fails, check the node attribute to see whether it's within
> >the desired delay.
> >
> >> Thank you very much!
> >>
> >> > Hi,
> >> > If I remember correctly, any pending actions from a previous
> >> > transition must be completed before a new transition can be
> >> > calculated. Otherwise, there's the possibility that the pending
> >> > action could change the state in a way that makes the second
> >> > transition's decisions harmful.
> >> > Theoretically (and ideally), pacemaker could figure out whether
> >> > some of the actions in the second transition would be needed
> >> > regardless of whether the pending actions succeeded or failed,
> >> > but in practice, that would be difficult to implement (and
> >> > possibly take more time to calculate than is desirable in a
> >> > recovery situation).
> >>
> >> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
> >>
> >> > I have two clone resources in my corosync/pacemaker cluster. They
> >> > are fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes
> >> > 1 minute to start the service (calling the OCF start function for
> >> > 1 minute). Configured as below:
> >> > # crm configure show
> >> > node 168002177: 192.168.2.177
> >> > node 168002178: 192.168.2.178
> >> > node 168002179: 192.168.2.179
> >> > primitive fm_mgt fm_mgt \
> >> > op monitor interval=20s timeout=120s \
> >> > op stop interval=0 timeout=120s on-fail=restart \
> >> > op start interval=0 timeout=120s on-fail=restart \
> >> > meta target-role=Started
> >> > primitive logserver logserver \
> >> > op monitor interval=20s timeout=120s \
> >> > op stop interval=0 timeout=120s on-fail=restart \
> >> > op start interval=0 timeout=120s on-fail=restart \
> >> > meta target-role=Started
> >> > clone fm_mgt_replica fm_mgt
> >> > clone logserver_replica logserver
> >> > property cib-bootstrap-options: \
> >> > 

Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-06 Thread Klaus Wenninger
Hi!

Not saying that the use of start-delay on the monitor operation is
a good thing. In most cases it should definitely be better to delay
the return of start until a monitor would succeed. I have seen discussion
about deprecating start-delay - I don't know the current state though.
But this case - if I got the use-case right - with a 10min delay might
be a legitimate use of start-delay - if any exists at all ;-)

Regards,
Klaus

On 11/04/2017 03:46 PM, lkxjtu wrote:
>
>
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable.
>
> I tried to put the sleep into the monitor function (I added a "sleep
> 60" at the monitor entry for debugging); the start function returns
> immediately. I found an interesting thing: the first monitor after
> start blocks other resources too, but from the second time on it
> doesn't block other resources! Is this normal?
>
> >My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
> >do its usual check, and if it succeeds, remove that node attribute,
> >but if it fails, check the node attribute to see whether it's within
> >the desired delay.
> This means that if it is within the desired delay, the monitor should
> return success even if the healthcheck failed?
> I think this can solve my problem except for what "crm status" shows.
>
>
> At 2017-11-01 21:20:50, "Ken Gaillot"  wrote:
> >On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
> >>
> >> Thank you for your response! This means that there shouldn't be a
> >> long "sleep" in an OCF script.
> >> If my service takes 10 minutes from service start to a normal
> >> healthcheck, then what should I do?
> >
> >That is a tough situation with no great answer.
> >
> >You can leave it as it is, and live with the delay. Note that it only
> >happens if a resource fails after the slow resource has already begun
> >starting ... if they fail at the same time (as with a node failure),
> >the cluster will schedule recovery for both at the same time.
> >
> >Another possibility would be to have the start return immediately, and
> >make the monitor artificially return success for the first 10 minutes
> >after starting. It's hacky, and it depends on your situation whether
> >the behavior is acceptable. My first thought on how to implement this
> >would be to have the start action set a private node attribute
> >(attrd_updater -p) with a timestamp. When the monitor runs, it could
> >do its usual check, and if it succeeds, remove that node attribute,
> >but if it fails, check the node attribute to see whether it's within
> >the desired delay.
> >
> >> Thank you very much!
> >>
> >> > Hi,
> >> > If I remember correctly, any pending actions from a previous
> >> > transition must be completed before a new transition can be
> >> > calculated. Otherwise, there's the possibility that the pending
> >> > action could change the state in a way that makes the second
> >> > transition's decisions harmful.
> >> > Theoretically (and ideally), pacemaker could figure out whether
> >> > some of the actions in the second transition would be needed
> >> > regardless of whether the pending actions succeeded or failed,
> >> > but in practice, that would be difficult to implement (and
> >> > possibly take more time to calculate than is desirable in a
> >> > recovery situation).
> >>
> >> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
> >>
> >> > I have two clone resources in my corosync/pacemaker cluster. They
> >> > are fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes
> >> > 1 minute to start the service (calling the OCF start function for
> >> > 1 minute). Configured as below:
> >> > # crm configure show
> >> > node 168002177: 192.168.2.177
> >> > node 168002178: 192.168.2.178
> >> > node 168002179: 192.168.2.179
> >> > primitive fm_mgt fm_mgt \
> >> > op monitor interval=20s timeout=120s \
> >> > op stop interval=0 timeout=120s on-fail=restart \
> >> > op start interval=0 timeout=120s on-fail=restart \
> >> > meta target-role=Started
> >> > primitive logserver logserver \
> >> > op monitor interval=20s timeout=120s \
> >> > op stop interval=0 timeout=120s on-fail=restart \
> >> > op start interval=0 timeout=120s on-fail=restart \
> >> > meta target-role=Started
> >> > clone fm_mgt_replica fm_mgt
> >> > clone logserver_replica logserver
> >> > property cib-bootstrap-options: \
> >> > have-watchdog=false \
> >> > dc-version=1.1.13-10.el7-44eb2dd \
> >> > cluster-infrastructure=corosync \
> >> > stonith-enabled=false \
> >> > start-failure-is-fatal=false
> >> > When I 

Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-04 Thread lkxjtu



>Another possibility would be to have the start return immediately, and
>make the monitor artificially return success for the first 10 minutes
>after starting. It's hacky, and it depends on your situation whether
>the behavior is acceptable.
I tried to put the sleep into the monitor function (I added a "sleep 60" at the
monitor entry for debugging); the start function returns immediately. I found an
interesting thing: the first monitor after start blocks other resources too, but
from the second time on it doesn't block other resources! Is this normal?



>My first thought on how to implement this
>would be to have the start action set a private node attribute
>(attrd_updater -p) with a timestamp. When the monitor runs, it could do
>its usual check, and if it succeeds, remove that node attribute, but if
>it fails, check the node attribute to see whether it's within the
>desired delay.
This means that if it is within the desired delay, the monitor should return
success even if the healthcheck failed?
I think this can solve my problem except for what "crm status" shows.

At 2017-11-01 21:20:50, "Ken Gaillot"  wrote:
>On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
>>
>> Thank you for your response! This means that there shouldn't be a long
>> "sleep" in an OCF script.
>> If my service takes 10 minutes from service start to a normal
>> healthcheck, then what should I do?
>
>That is a tough situation with no great answer.
>
>You can leave it as it is, and live with the delay. Note that it only
>happens if a resource fails after the slow resource has already begun
>starting ... if they fail at the same time (as with a node failure),
>the cluster will schedule recovery for both at the same time.
>
>Another possibility would be to have the start return immediately, and
>make the monitor artificially return success for the first 10 minutes
>after starting. It's hacky, and it depends on your situation whether
>the behavior is acceptable. My first thought on how to implement this
>would be to have the start action set a private node attribute
>(attrd_updater -p) with a timestamp. When the monitor runs, it could do
>its usual check, and if it succeeds, remove that node attribute, but if
>it fails, check the node attribute to see whether it's within the
>desired delay.
>
>> Thank you very much!
>>
>> > Hi,
>> > If I remember correctly, any pending actions from a previous
>> > transition must be completed before a new transition can be
>> > calculated. Otherwise, there's the possibility that the pending
>> > action could change the state in a way that makes the second
>> > transition's decisions harmful.
>> > Theoretically (and ideally), pacemaker could figure out whether
>> > some of the actions in the second transition would be needed
>> > regardless of whether the pending actions succeeded or failed,
>> > but in practice, that would be difficult to implement (and
>> > possibly take more time to calculate than is desirable in a
>> > recovery situation).
>>
>> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
>>
>> > I have two clone resources in my corosync/pacemaker cluster. They
>> > are fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes
>> > 1 minute to start the service (calling the OCF start function for
>> > 1 minute). Configured as below:
>> > # crm configure show
>> > node 168002177: 192.168.2.177
>> > node 168002178: 192.168.2.178
>> > node 168002179: 192.168.2.179
>> > primitive fm_mgt fm_mgt \
>> > op monitor interval=20s timeout=120s \
>> > op stop interval=0 timeout=120s on-fail=restart \
>> > op start interval=0 timeout=120s on-fail=restart \
>> > meta target-role=Started
>> > primitive logserver logserver \
>> > op monitor interval=20s timeout=120s \
>> > op stop interval=0 timeout=120s on-fail=restart \
>> > op start interval=0 timeout=120s on-fail=restart \
>> > meta target-role=Started
>> > clone fm_mgt_replica fm_mgt
>> > clone logserver_replica logserver
>> > property cib-bootstrap-options: \
>> > have-watchdog=false \
>> > dc-version=1.1.13-10.el7-44eb2dd \
>> > cluster-infrastructure=corosync \
>> > stonith-enabled=false \
>> > start-failure-is-fatal=false
>> > When I kill the fm_mgt service on one node, pacemaker will immediately
>> > recover it after the monitor fails. This looks perfectly normal. But in
>> > this 1 minute of fm_mgt starting, if I kill the logserver service on
>> > any node, the monitor will catch the failure normally too, but
>> > pacemaker will not restart it immediately, instead waiting for the
>> > fm_mgt start to finish. After the fm_mgt start finishes, pacemaker
>> > begins restarting logserver. It seems that there is some dependency
>> > between pacemaker resources.
>> > # crm status
>> > Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct
>> > 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
>> 

Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-01 Thread Vladislav Bogdanov

01.11.2017 17:20, Ken Gaillot wrote:

On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:


Thank you for your response! This means that there shouldn't be a long
"sleep" in an OCF script.
If my service takes 10 minutes from service start to a normal
healthcheck, then what should I do?


That is a tough situation with no great answer.

You can leave it as it is, and live with the delay. Note that it only
happens if a resource fails after the slow resource has already begun
starting ... if they fail at the same time (as with a node failure),
the cluster will schedule recovery for both at the same time.

Another possibility would be to have the start return immediately, and
make the monitor artificially return success for the first 10 minutes
after starting. It's hacky, and it depends on your situation whether
the behavior is acceptable. My first thought on how to implement this
would be to have the start action set a private node attribute
(attrd_updater -p) with a timestamp. When the monitor runs, it could do
its usual check, and if it succeeds, remove that node attribute, but if
it fails, check the node attribute to see whether it's within the
desired delay.


Or write a master/slave resource agent, like DRBD has.
It sets a low master score on an outdated node while sync is in progress
and raises it after sync finishes, so the node is not promoted to master
until sync is complete.


If you "map" service states to a pacemaker states like

Starting - Slave
Started - Master

that would help.
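A rough sketch of that mapping, assuming a crm shell configuration and an agent
that maintains its own master score the way DRBD's does (all names and score
values here are illustrative):

# hypothetical: run fm_mgt as a master/slave clone; every healthy
# instance may be promoted
ms fm_mgt_ms fm_mgt \
    meta master-max=3 clone-max=3 target-role=Master

# inside the agent's monitor action: advertise readiness via the master score
if real_healthcheck; then          # placeholder for the real check
    crm_master -l reboot -v 100    # Started: eligible for promotion to Master
else
    crm_master -l reboot -v 1      # Starting: keep this instance a Slave
fi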

Thank you very much!

Hi,
If I remember correctly, any pending actions from a previous
transition must be completed before a new transition can be
calculated. Otherwise, there's the possibility that the pending
action could change the state in a way that makes the second
transition's decisions harmful.
Theoretically (and ideally), pacemaker could figure out whether
some of the actions in the second transition would be needed
regardless of whether the pending actions succeeded or failed,
but in practice, that would be difficult to implement (and
possibly take more time to calculate than is desirable in a
recovery situation).

On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:

I have two clone resources in my corosync/pacemaker cluster. They
are fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes
1 minute to start the service (calling the OCF start function for
1 minute). Configured as below:
# crm configure show
node 168002177: 192.168.2.177
node 168002178: 192.168.2.178
node 168002179: 192.168.2.179
primitive fm_mgt fm_mgt \
  op monitor interval=20s timeout=120s \
  op stop interval=0 timeout=120s on-fail=restart \
  op start interval=0 timeout=120s on-fail=restart \
  meta target-role=Started
primitive logserver logserver \
  op monitor interval=20s timeout=120s \
  op stop interval=0 timeout=120s on-fail=restart \
  op start interval=0 timeout=120s on-fail=restart \
  meta target-role=Started
clone fm_mgt_replica fm_mgt
clone logserver_replica logserver
property cib-bootstrap-options: \
  have-watchdog=false \
  dc-version=1.1.13-10.el7-44eb2dd \
  cluster-infrastructure=corosync \
  stonith-enabled=false \
  start-failure-is-fatal=false
When I kill the fm_mgt service on one node, pacemaker will immediately
recover it after the monitor fails. This looks perfectly normal. But in
this 1 minute of fm_mgt starting, if I kill the logserver service on any
node, the monitor will catch the failure normally too, but pacemaker
will not restart it immediately, instead waiting for the fm_mgt start to
finish. After the fm_mgt start finishes, pacemaker begins restarting
logserver. It seems that there is some dependency between pacemaker
resources.
# crm status
Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct
26 06:36:33 2017 by root via crm_resource on 192.168.2.177
Stack: corosync
Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition
with quorum
3 nodes and 6 resources configured
Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
Full list of resources:
   Clone Set: logserver_replica [logserver]
   logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
   Started: [ 192.168.2.178 192.168.2.179 ]
   Clone Set: fm_mgt_replica [fm_mgt]
   Started: [ 192.168.2.178 192.168.2.179 ]
   Stopped: [ 192.168.2.177 ]
I am very confused. Is there something wrong with the configuration?
Thank you very much!
James
best regards


Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-11-01 Thread Ken Gaillot
On Sat, 2017-10-28 at 01:11 +0800, lkxjtu wrote:
> 
> Thank you for your response! This means that there shouldn't be a long
> "sleep" in an OCF script.
> If my service takes 10 minutes from service start to a normal
> healthcheck, then what should I do?

That is a tough situation with no great answer.

You can leave it as it is, and live with the delay. Note that it only
happens if a resource fails after the slow resource has already begun
starting ... if they fail at the same time (as with a node failure),
the cluster will schedule recovery for both at the same time.

Another possibility would be to have the start return immediately, and
make the monitor artificially return success for the first 10 minutes
after starting. It's hacky, and it depends on your situation whether
the behavior is acceptable. My first thought on how to implement this
would be to have the start action set a private node attribute
(attrd_updater -p) with a timestamp. When the monitor runs, it could do
its usual check, and if it succeeds, remove that node attribute, but if
it fails, check the node attribute to see whether it's within the
desired delay.
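
For concreteness, the attrd_updater calls that sketch implies would look
roughly like this (the attribute name is illustrative; see the attrd_updater
man page for exact behavior):

attrd_updater -n myres-start-time -U "$(date +%s)" -p  # in start: record the timestamp (private)
attrd_updater -n myres-start-time -Q                   # in monitor: read it back after a failed check
attrd_updater -n myres-start-time -D                   # in monitor: remove it once the check succeeds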

> Thank you very much!
>  
> > Hi,
> > If I remember correctly, any pending actions from a previous
> > transition must be completed before a new transition can be
> > calculated. Otherwise, there's the possibility that the pending
> > action could change the state in a way that makes the second
> > transition's decisions harmful.
> > Theoretically (and ideally), pacemaker could figure out whether
> > some of the actions in the second transition would be needed
> > regardless of whether the pending actions succeeded or failed,
> > but in practice, that would be difficult to implement (and
> > possibly take more time to calculate than is desirable in a
> > recovery situation).
>
> > On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
>
> > I have two clone resources in my corosync/pacemaker cluster. They
> > are fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes
> > 1 minute to start the service (calling the OCF start function for
> > 1 minute). Configured as below:
> > # crm configure show
> > node 168002177: 192.168.2.177
> > node 168002178: 192.168.2.178
> > node 168002179: 192.168.2.179
> > primitive fm_mgt fm_mgt \
> > op monitor interval=20s timeout=120s \
> > op stop interval=0 timeout=120s on-fail=restart \
> > op start interval=0 timeout=120s on-fail=restart \
> > meta target-role=Started
> > primitive logserver logserver \
> > op monitor interval=20s timeout=120s \
> > op stop interval=0 timeout=120s on-fail=restart \
> > op start interval=0 timeout=120s on-fail=restart \
> > meta target-role=Started
> > clone fm_mgt_replica fm_mgt
> > clone logserver_replica logserver
> > property cib-bootstrap-options: \
> > have-watchdog=false \
> > dc-version=1.1.13-10.el7-44eb2dd \
> > cluster-infrastructure=corosync \
> > stonith-enabled=false \
> > start-failure-is-fatal=false
> > When I kill the fm_mgt service on one node, pacemaker will immediately
> > recover it after the monitor fails. This looks perfectly normal. But in
> > this 1 minute of fm_mgt starting, if I kill the logserver service on
> > any node, the monitor will catch the failure normally too, but
> > pacemaker will not restart it immediately, instead waiting for the
> > fm_mgt start to finish. After the fm_mgt start finishes, pacemaker
> > begins restarting logserver. It seems that there is some dependency
> > between pacemaker resources.
> > # crm status
> > Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct
> > 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
> > Stack: corosync
> > Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition
> > with quorum
> > 3 nodes and 6 resources configured
> > Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> > Full list of resources:
> >  Clone Set: logserver_replica [logserver]
> >  logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
> >  Started: [ 192.168.2.178 192.168.2.179 ]
> >  Clone Set: fm_mgt_replica [fm_mgt]
> >  Started: [ 192.168.2.178 192.168.2.179 ]
> >  Stopped: [ 192.168.2.177 ]
> > I am very confused. Is there something wrong with the configuration?
> > Thank you very much!
> > James
> > best regards
-- 
Ken Gaillot 



Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-10-27 Thread lkxjtu

Thank you for your response! This means that there shouldn't be a long "sleep"
in an OCF script.
If my service takes 10 minutes from service start to a normal healthcheck,
then what should I do?
Thank you very much!
 
> Hi,
> If I remember correctly, any pending actions from a previous transition
> must be completed before a new transition can be calculated. Otherwise,
> there's the possibility that the pending action could change the state
> in a way that makes the second transition's decisions harmful.
> Theoretically (and ideally), pacemaker could figure out whether some of
> the actions in the second transition would be needed regardless of
> whether the pending actions succeeded or failed, but in practice, that
> would be difficult to implement (and possibly take more time to
> calculate than is desirable in a recovery situation).
 
> On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:

> I have two clone resources in my corosync/pacemaker cluster. They are
> fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes 1 minute
> to start the service (calling the OCF start function for 1 minute).
> Configured as below:
> # crm configure show
> node 168002177: 192.168.2.177
> node 168002178: 192.168.2.178
> node 168002179: 192.168.2.179
> primitive fm_mgt fm_mgt \
> op monitor interval=20s timeout=120s \
> op stop interval=0 timeout=120s on-fail=restart \
> op start interval=0 timeout=120s on-fail=restart \
> meta target-role=Started
> primitive logserver logserver \
> op monitor interval=20s timeout=120s \
> op stop interval=0 timeout=120s on-fail=restart \
> op start interval=0 timeout=120s on-fail=restart \
> meta target-role=Started
> clone fm_mgt_replica fm_mgt
> clone logserver_replica logserver
> property cib-bootstrap-options: \
> have-watchdog=false \
> dc-version=1.1.13-10.el7-44eb2dd \
> cluster-infrastructure=corosync \
> stonith-enabled=false \
> start-failure-is-fatal=false
> When I kill the fm_mgt service on one node, pacemaker will immediately
> recover it after the monitor fails. This looks perfectly normal. But in
> this 1 minute of fm_mgt starting, if I kill the logserver service on any
> node, the monitor will catch the failure normally too, but pacemaker will
> not restart it immediately, instead waiting for the fm_mgt start to
> finish. After the fm_mgt start finishes, pacemaker begins restarting
> logserver. It seems that there is some dependency between pacemaker
> resources.
> # crm status
> Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct
> 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
> Stack: corosync
> Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition
> with quorum
> 3 nodes and 6 resources configured
> Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> Full list of resources:
>  Clone Set: logserver_replica [logserver]
>  logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
>  Started: [ 192.168.2.178 192.168.2.179 ]
>  Clone Set: fm_mgt_replica [fm_mgt]
>  Started: [ 192.168.2.178 192.168.2.179 ]
>  Stopped: [ 192.168.2.177 ]
> I am very confused. Is there something wrong with the configuration?
> Thank you very much!
> James
> best regards



Re: [ClusterLabs] Pacemaker resource start delay when another resource is starting

2017-10-27 Thread Ken Gaillot
Hi,

If I remember correctly, any pending actions from a previous transition
must be completed before a new transition can be calculated. Otherwise,
there's the possibility that the pending action could change the state
in a way that makes the second transition's decisions harmful.

Theoretically (and ideally), pacemaker could figure out whether some of
the actions in the second transition would be needed regardless of
whether the pending actions succeeded or failed, but in practice, that
would be difficult to implement (and possibly take more time to
calculate than is desirable in a recovery situation).

On Fri, 2017-10-27 at 23:54 +0800, lkxjtu wrote:
> I have two clone resources in my corosync/pacemaker cluster. They are
> fm_mgt and logserver. Both of their RAs are OCF. fm_mgt takes 1 minute
> to start the service (calling the OCF start function for 1 minute).
> Configured as below:
> # crm configure show
> node 168002177: 192.168.2.177
> node 168002178: 192.168.2.178
> node 168002179: 192.168.2.179
> primitive fm_mgt fm_mgt \
>     op monitor interval=20s timeout=120s \
>     op stop interval=0 timeout=120s on-fail=restart \
>     op start interval=0 timeout=120s on-fail=restart \
>     meta target-role=Started
> primitive logserver logserver \
>     op monitor interval=20s timeout=120s \
>     op stop interval=0 timeout=120s on-fail=restart \
>     op start interval=0 timeout=120s on-fail=restart \
>     meta target-role=Started
> clone fm_mgt_replica fm_mgt
> clone logserver_replica logserver
> property cib-bootstrap-options: \
>     have-watchdog=false \
>     dc-version=1.1.13-10.el7-44eb2dd \
>     cluster-infrastructure=corosync \
>     stonith-enabled=false \
>     start-failure-is-fatal=false
> When I kill the fm_mgt service on one node, pacemaker will immediately
> recover it after the monitor fails. This looks perfectly normal. But in
> this 1 minute of fm_mgt starting, if I kill the logserver service on any
> node, the monitor will catch the failure normally too, but pacemaker will
> not restart it immediately, instead waiting for the fm_mgt start to
> finish. After the fm_mgt start finishes, pacemaker begins restarting
> logserver. It seems that there is some dependency between pacemaker
> resources.
> # crm status
> Last updated: Thu Oct 26 06:40:24 2017  Last change: Thu Oct
> 26 06:36:33 2017 by root via crm_resource on 192.168.2.177
> Stack: corosync
> Current DC: 192.168.2.179 (version 1.1.13-10.el7-44eb2dd) - partition
> with quorum
> 3 nodes and 6 resources configured
> Online: [ 192.168.2.177 192.168.2.178 192.168.2.179 ]
> Full list of resources:
>  Clone Set: logserver_replica [logserver]
>  logserver  (ocf::heartbeat:logserver): FAILED 192.168.2.177
>  Started: [ 192.168.2.178 192.168.2.179 ]
>  Clone Set: fm_mgt_replica [fm_mgt]
>  Started: [ 192.168.2.178 192.168.2.179 ]
>  Stopped: [ 192.168.2.177 ]
> I am very confused. Is there something wrong with the configuration?
> Thank you very much!
> James
> best regards
-- 
Ken Gaillot 

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org