Re: [Linux-ha-dev] OCF RA for named
No interest? On Tue, Jul 12, 2011 at 3:50 PM, Serge Dubrouski serge...@gmail.com wrote:

Hello - I've created an OCF RA for named (the BIND server). There is an existing one in the redhat directory, but I don't like how it does monitoring, and I doubt that it works with Pacemaker. So please review the attached RA and see whether it can be included in the project. -- Serge Dubrouski.
[Linux-HA] Antw: Re: [ha-wg-technical] The mess with OCF_CHECK_LEVEL (crm aborts during commit)
Dejan Muhamedagic de...@suse.de wrote on 04.08.2011 at 18:32 in message 20110804163245.GA28585@rondo.homenet:

Hi, On Thu, Aug 04, 2011 at 05:45:16PM +0200, Ulrich Windl wrote: Hi! Some RAs support OCF_CHECK_LEVEL (e.g. ocf:heartbeat:Raid1). However, OCF_CHECK_LEVEL is not advertised in the metadata. Also, OCF_CHECK_LEVEL is not a global parameter (that wouldn't make much sense). So, using crm_gui, one can obviously add OCF_CHECK_LEVEL for some resource, and that seems to work. So far, so good. Then I tried to add more resources without an OCF_CHECK_LEVEL using the crm command line; I added the new resources to a group that contained resources using OCF_CHECK_LEVEL.

[Dejan:] OCF_CHECK_LEVEL is to be defined on a per-monitor basis, like this:

primitive ... op monitor OCF_CHECK_LEVEL=10 interval=...

[...]

[Ulrich:] So, is a configuration like the following incorrect?

primitive prm_c11_as_1_raid1 ocf:heartbeat:Raid1 \
  params raidconf=/etc/mdadm/mdadm.conf raiddev=/dev/md15 OCF_CHECK_LEVEL=1 \
  operations $id=prm_c11_as_1_raid1-operations \
  op start interval=0 timeout=20s \
  op stop interval=0 timeout=20s \
  op monitor interval=60 timeout=60s

Ulrich

P.S. Moving the issue to the linux-ha list as requested.
Re: [Linux-HA] Antw: Re: location and orders : Question about a behavior ...
Maloja01 maloj...@arcor.de wrote on 04.08.2011 at 18:49 in message 4e3acd86.1020...@arcor.de:

Hi Ulrich, I did not follow the complete thread, I just jumped in - sorry. Is the resource inside a resource group? In that case the stickiness is multiplied, and therefore the stickiness could be greater than the location rule's score.

[Ulrich:] Hi! Yes, a group with about 20 resources has resource-stickiness=10 and a location constraint "location loc_grp_cbw grp_cbw 50: node". As the group is somewhat indivisible, assigning varying stickinesses to individual resources just makes things unreadable and complicated. I feel that a group stickiness should override the individual resource stickinesses, and not be used as a default stickiness for every resource in the group. Regards, Ulrich

[Earlier in the thread, on 08/04/2011 03:10 PM, Ulrich Windl wrote:] Hi! Isn't the stickiness effectively based on the failcount? We have one resource that has a location constraint for one node with a weight of 50 and a stickiness of 10. The resource runs on a different node and shows no tendency of moving back (not even after restarts).

[Fabian:] No, stickiness has nothing to do with the failcount. The policy engine takes both into account: the stickiness (for RUNNING resources) and the failcount (for RUNNING or non-running resources). If you ever had an on-start failure of a resource on a node, the failcount is set to infinity, which means the resource cannot be started on that node.

[Ulrich:] Fabian, I know that, and the errors were removed by crm_resource -C. Still the resource is happy where it is and doesn't want to move away.

[Fabian:] If the policy engine needs to evaluate where to run a resource, it uses the location/colocation/anti-colocation constraints, failcounts, stickiness and maybe some other scores to evaluate WHERE to run a resource. So in my opinion the stickiness does exactly what you are asking for.

[Ulrich:] Unfortunately someone did a manual migrate yesterday, so I cannot show the scores that led to the problem. Regards, Ulrich
Re: [Linux-HA] Antw: Re: [ha-wg-technical] The mess with OCF_CHECK_LEVEL (crm aborts during commit)
Hi, On Fri, Aug 05, 2011 at 08:23:43AM +0200, Ulrich Windl wrote: [...] So, is a configuration like the following incorrect?

primitive prm_c11_as_1_raid1 ocf:heartbeat:Raid1 \
  params raidconf=/etc/mdadm/mdadm.conf raiddev=/dev/md15 OCF_CHECK_LEVEL=1 \
  operations $id=prm_c11_as_1_raid1-operations \
  op start interval=0 timeout=20s \
  op stop interval=0 timeout=20s \
  op monitor interval=60 timeout=60s

[Dejan:] Yes. See an example here: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-operation-monitor-multiple.html Though it's XML, you can see that OCF_CHECK_LEVEL is defined within a monitor operation. Thanks, Dejan
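In other words, OCF_CHECK_LEVEL moves out of params and into the monitor operation. A minimal crm shell sketch with Ulrich's names and timeouts (crmsh syntax of that era):

primitive prm_c11_as_1_raid1 ocf:heartbeat:Raid1 \
  params raidconf="/etc/mdadm/mdadm.conf" raiddev="/dev/md15" \
  op start interval=0 timeout=20s \
  op stop interval=0 timeout=20s \
  op monitor interval=60 timeout=60s OCF_CHECK_LEVEL=1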
[Linux-HA] About OCF RA exportfs
Hi, I checked all the threads about HA NFS active/active, and I understood that the solution was to keep a periodic backup of rmtab in a .rmtab file on the shared FS, as is effectively done in the OCF RA exportfs delivered in resource-agents-3.0.12-15. I just wonder whether that is still the recommended solution for HA NFS active/active, and whether there is a newer version of this exportfs OCF RA somewhere. The thing I don't understand in this exportfs RA script is that the monitor function greps for OCF_RESKEY_directory in the rmtab, and fails if it does not find it in the file; but unless at least one NFS client mounts the directory via NFS, there is no chance of this directory appearing in the rmtab file... so, once the resource is started, the first monitor fails. Where am I wrong? Thanks Alain Moullé
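For reference, the monitor behaviour Alain describes amounts to something like this (a simplified sketch of the logic as described in his mail, not the verbatim RA code; /var/lib/nfs/rmtab is the usual path on Linux):

# simplified sketch of the exportfs monitor check as described above
if grep -q "$OCF_RESKEY_directory" /var/lib/nfs/rmtab; then
    exit 0    # OCF_SUCCESS: a client mount has been recorded in rmtab
else
    exit 7    # OCF_NOT_RUNNING: no entry yet - the case right after start,
fi            # before any NFS client has mounted the export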
Re: [Linux-HA] Antw: Re: [ha-wg-technical] The mess with OCF_CHECK_LEVEL (crm aborts during commit)
Dejan Muhamedagic de...@suse.de wrote on 05.08.2011 at 08:39 in message 20110805063900.GB31749@rondo.homenet: [...] Yes. See an example here: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-operation-monitor-multiple.html Though it's XML, you can see that OCF_CHECK_LEVEL is defined within a monitor operation.

[Ulrich:] Amazingly, crm_verify -LV does not report any problem, however. Regards, Ulrich
Re: [Linux-HA] Antw: Re: location and orders : Question about a behavior ...
On 08/05/2011 08:30 AM, Ulrich Windl wrote: [...] Yes, a group with about 20 resources has resource-stickiness=10

[Fabian:] In this case - if I remember correctly - the score for a RUNNING group is 20*10 = 200 > 50. Can you describe your problem; what are you missing?

a) You want a RUNNING group NOT to fall back - stickiness should do that here: 2M (active node) > 500K (preferred node) [if active node != preferred node ;-)]

b) You want a STOPPED group to be placed on a specific node (to have an orderly administration at least at the start point) - a location score should help here: 500K (preferred node) > 0 (not preferred node)

I miss the point where you argued that stickiness is not implemented as you expected it to be. Could you explain what is missing or wrong? Maybe we can try it as a state description: status-before (e.g. group on node1), change in the cluster (either event- or admin-based), and status-after (both the currently implemented behaviour and the one you expected). Kind regards Fabian

[...]
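Expressed as configuration, the magnitudes Fabian recommends would look roughly like this (his illustrative 2M/500K numbers, not values from Ulrich's cluster; constraint and node names taken from the thread):

# stickiness far above any location score: a RUNNING group never falls back
rsc_defaults resource-stickiness=2000000
# the location score then only decides placement when the group is (re)started
location loc_grp_cbw grp_cbw 500000: node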
Re: [Linux-HA] Antw: Re: location and orders : Question about a behavior ...
Hi, I was the guy who initiated this thread with a simple question, but the thread has been re-oriented with other, similar questions... so I don't know who is answering whom... please, Fabian, if you can just reopen my first message in this thread, that would be nice for me... Thanks a lot anyway. Alain

From: Maloja01 maloj...@arcor.de
To: linux-ha@lists.linux-ha.org
Date: 05/08/2011 11:02
Subject: Re: [Linux-HA] Antw: Re: location and orders : Question about a behavior ...
Sent by: linux-ha-boun...@lists.linux-ha.org

[...]
Re: [Linux-HA] Antw: Re: location and orders : Question about a behavior ...
On 08/05/2011 11:26 AM, alain.mou...@bull.net wrote: Hi, I was the guy who initiated this thread with a simple question, but the thread has been re-oriented with other, similar questions... so I don't know who is answering whom... please, Fabian, if you can just reopen my first message in this thread, that would be nice for me...

[Fabian:] Yes, you are right - so I will rewind the thread, beginning from message 1 :)

[...]
Re: [Linux-HA] location and orders : Question about a behavior ...
On 08/02/2011 05:06 PM, alain.mou...@bull.net wrote: Hi, I have this simple configuration of locations and orders between resources group-1, group-2 and clone-1 (on a two-node HA cluster with Pacemaker 1.1.2-7 / corosync 1.2.3-21):

location loc1-group-1 group-1 +100: node2
location loc1-group-2 group-2 +100: node3
order order-group-1 inf: group-1 clone-1
order order-group-2 inf: group-2 clone-1
property $id=cib-bootstrap-options \
  dc-version=1.1.2-f059ec7ced7a86f18e5490b67ebf4a0b963bccfe \
  cluster-infrastructure=openais \
  expected-quorum-votes=2 \
  stonith-enabled=true \
  no-quorum-policy=ignore \
  default-resource-stickiness=5000

(and no current cli- preferences)

When I stop node2, group-1 is migrated to node3 as expected. But when node2 is up again and I start Pacemaker on it again, group-1 automatically comes back to node2, and I wonder why. I have other, similar configurations with the same location constraints and the same default-resource-stickiness value, but without an order against a clone resource, and there the group does not come back automatically. I don't understand why this order constraint would change the behavior... Thanks for your help Alain Moullé

[Fabian:] We should focus our thoughts on the fact that when node2 comes back into the cluster, clone-1 undergoes a change, because it is now started on node2 as well - am I right? I do not have a good explanation at this point in time, but this could be the reason why group-1 loses its stickiness: it is first stopped and then restarted (after the clone is completely up again). Can you check the following in your setup: either set clone-max to 1 (just for a test, of course) or add an anti-location so that clone-1 will not run on node2 (so after node2 rejoins, clone-1 does not see a change in its setup); see the sketch after this message. With your current config (without my changes): you should also check whether you see any stops of clone instances when node2 rejoins the cluster. That could be the case if you have limited the number of clones and have additional location constraints for the clone. Can you tell us more about the clone and the group? Are there any possible side effects in the functionality of the resources? Kind regards Fabian
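For reference, the two test changes Fabian suggests might look like this in crm shell (a sketch; resource and node names taken from the thread, command syntax per crmsh of that era):

# test 1: cap clone-1 at a single instance cluster-wide (just for the test)
crm resource meta clone-1 set clone-max 1
# test 2: an anti-location keeping clone-1 off node2, so node2's rejoin
# does not change the clone's placement
crm configure location loc-test-clone-1 clone-1 -inf: node2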
[Linux-HA] ocf::LVM monitor needs excessive time to complete
Hi, we run a cluster that has about 30 LVM VGs that are monitored every minute with a timeout of 90s. Surprisingly, even when the system is in a nominal state, the LVM monitor times out. I suspect this has to do with multiple LVM commands being run in parallel, like this:

# ps ax | grep vg
2014 pts/0  D+  0:00 vgs
2580 ?      D   0:00 vgdisplay -v NFS_C11_IO
2638 ?      D   0:00 vgck CBW_DB_BTD
2992 ?      D   0:00 vgdisplay -v C11_DB_Exe
3002 ?      D   0:00 vgdisplay -v C11_DB_15k
4564 pts/2  S+  0:00 grep vg
# ps ax | grep vg
8095 ?      D   0:00 vgck CBW_DB_Exe
8119 ?      D   0:00 vgdisplay -v C11_DB_FATA
8194 ?      D   0:00 vgdisplay -v NFS_SAP_Exe

When I ran vgs manually, it could not be suspended or killed, and it took more than 30 seconds to complete. Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on a machine with lots of disks, RAM and CPUs). As I had changed all the timeouts via crm configure edit, I suspect the LRM starts all these monitors at the same time, creating massive parallelism. Maybe a random start delay would be more useful than having the user specify a fixed start delay for the monitor. Possibly those stuck monitor operations also delay monitors that would otherwise finish in time. Here's part of the mess on one node:

Aug 5 13:50:55 h03 lrmd: [14526]: WARN: operation monitor[360] on ocf::LVM::prm_cbw_ci_mnt_lvm for client 14529, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] CRM_meta_timeout=[3] CRM_meta_interval=[1] volgrpname=[CBW_CI] : pid [29910] timed out
Aug 5 13:50:55 h03 crmd: [14529]: ERROR: process_lrm_event: LRM operation prm_cbw_ci_mnt_lvm_monitor_1 (360) Timed Out (timeout=3ms)
Aug 5 13:50:55 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation monitor[154] on ocf::IPaddr2::prm_a20_ip_1 for client 14529, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.5] CRM_meta_record_pending=[true] CRM_meta_timeout=[2] CRM_meta_interval=[1] iflabel=[a20] ip=[172.20.17.54] stayed in operation list for 24020 ms (longer than 1 ms)
Aug 5 13:50:56 h03 lrmd: [14526]: WARN: perform_ra_op: the operation operation monitor[179] on ocf::Raid1::prm_nfs_cbw_trans_raid1 for client 14529, its parameters: CRM_meta_record_pending=[true] raidconf=[/etc/mdadm/mdadm.conf] crm_feature_set=[3.0.5] OCF_CHECK_LEVEL=[1] raiddev=[/dev/md8] CRM_meta_name=[monitor] CRM_meta_timeout=[6] CRM_meta_interval=[6] stayed in operation list for 24010 ms (longer than 1 ms)
Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed from h04
Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_local_callback: Expanded fail-count-prm_cbw_ci_mnt_lvm=value++ to 9
Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-prm_cbw_ci_mnt_lvm (9)
Aug 5 13:50:56 h03 attrd: [14527]: info: attrd_perform_update: Sent update 416: fail-count-prm_cbw_ci_mnt_lvm=9
Aug 5 13:50:56 h03 attrd: [14527]: notice: attrd_ais_dispatch: Update relayed from h04

Regards, Ulrich
Re: [Linux-HA] location and orders : Question about a behavior ...
Hi Fabian, many thanks for having a look at my initial problem. I can't try it again today, as I'm testing another configuration on both servers (HA NFS active/active; I posted another thread about it this morning), but I should be able to try again next week. If I understand your explanation correctly: you suppose that when the clone-1 instance on node2 starts again after the reboot, it could disturb the clone-1 instance on node3 by stopping/restarting it on node3 too? I have not noticed, via crm_mon, any state change of the clone-1 instance on node3 when node2 is restarted, nor any state change of group-2, which remains started on node3 (if clone-1 had been stopped/restarted on node3, even quickly, I should also have seen group-2 stopped/restarted due to the order-group-2 constraint). Hope this helps to clarify... Thanks again Alain

From: Maloja01 maloj...@arcor.de
To: linux-ha@lists.linux-ha.org
Date: 05/08/2011 11:40
Subject: Re: [Linux-HA] location and orders : Question about a behavior ...
Sent by: linux-ha-boun...@lists.linux-ha.org

[...]
Re: [Linux-HA] Antw: Re: [ha-wg-technical] The mess with OCF_CHECK_LEVEL (crm aborts during commit)
On Fri, Aug 05, 2011 at 09:15:33AM +0200, Ulrich Windl wrote: [...] Amazingly, crm_verify -LV does not report any problem, however.

[Dejan:] crm_verify doesn't know which parameters the RA supports. crm configure verify should complain, however, because it looks at the RA meta-data and does checks which are beyond crm_verify. Thanks, Dejan
Re: [Linux-HA] ocf::LVM monitor needs excessive time to complete
Hi, On Fri, Aug 05, 2011 at 01:55:25PM +0200, Ulrich Windl wrote: [...] When I ran vgs manually, it could not be suspended or killed, and it took more than 30 seconds to complete. Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on a machine with lots of disks, RAM and CPUs).

[Dejan:] I guess that this is somehow related to the storage. Best to report directly to SUSE.

[Ulrich:] As I had changed all the timeouts via crm configure edit, I suspect the LRM starts all these monitors at the same time, creating massive parallelism. [...]

[Dejan:] lrmd starts at most max-children operations in parallel. That's 4 by default. Thanks, Dejan
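If that concurrency cap is the knob of interest, the lrmd limit Dejan mentions could be adjusted at runtime with cluster-glue's lrmadmin. A sketch; the parameter name comes from Dejan's reply, but the exact option syntax may vary between versions:

# raise the lrmd concurrency cap from the default of 4
lrmadmin -p max-children 8
# note: in this thread more parallelism may make things worse, since the
# parallel vgdisplay/vgck runs are themselves the suspected bottleneck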
Re: [Linux-HA] ocf::LVM monitor needs excessive time to complete
On 8/5/2011 7:18 AM, Dejan Muhamedagic wrote: Hi, On Fri, Aug 05, 2011 at 01:55:25PM +0200, Ulrich Windl wrote: ... When I ran vgs manually, it could not be suspended or killed, and it took more than 30 seconds to complete. Thus the LVM monitoring is quite useless as it is now (SLES 11 SP1 x86_64 on a machine with lots of disks, RAM and CPUs).

[Dejan:] I guess that this is somehow related to the storage. Best to report directly to SUSE.

[Dima:] What sort of disks, and how many? -- The last time we ran out of room, I had to add a different-sized IDE disk (smaller, because you couldn't buy a big one anymore), so I had to use LVM. I/O performance went down the drain right away. (That was CentOS 5 a couple of years ago.) Dima (thank Cthulhu for SATA and mdadm)
[Linux-HA] ocf:heartbeat:exportfs and crm configure verify
Hi! I think the exportfs RA needs to be changed to allow a list of hosts (I had mentioned that before). Linux only allows either a hostname pattern, an IP mask, or a netgroup, but you cannot specify a thing like host[358] or host{3,5,8}. So, as an ugly workaround, one uses one resource per host. This works, but crm configure verify complains about it:

WARNING: Resources prm_nfs_cbw_trans_exp_h02,prm_nfs_cbw_trans_exp_h03,prm_nfs_cbw_trans_exp_h04,prm_nfs_cbw_trans_exp_h06,prm_nfs_cbw_trans_exp_h07,prm_nfs_cbw_trans_exp_n01,prm_nfs_cbw_trans_exp_v01,prm_nfs_cbw_trans_exp_v03 violate uniqueness for parameter fsid: ba57bee9-5872-46f2-9a87-0d178851d795

So for one filesystem it seems to be required (by the RA only?) that only one exportfs resource exists. That's bad. Also, the documentation for clientspec is not that precise (resource-agents-1.0.3-0.10.1):

clientspec* (string): Client ACL. The client specification allowing remote machines to mount the directory over NFS.

If I find time, I'll suggest a patch for the RA. Regards, Ulrich
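Concretely, the per-host workaround Ulrich describes looks roughly like this in crm shell (a sketch with hypothetical host names and paths; the shared fsid is deliberate, and is exactly what trips the uniqueness check):

primitive prm_exp_h02 ocf:heartbeat:exportfs \
  params clientspec="h02" directory="/export/trans" fsid="101" options="rw,sync"
primitive prm_exp_h03 ocf:heartbeat:exportfs \
  params clientspec="h03" directory="/export/trans" fsid="101" options="rw,sync"
# both resources export the same filesystem, so they must share one fsid -
# but the RA metadata declares fsid unique, hence the verify warning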
[Linux-HA] Q: default vs. default (e.g. exportfs)
Hi! I frequently see problems I don't understand: when configuring an exportfs resource using the crm shell without explicitly specifying operations or timeouts, I get warnings like these:

WARNING: prm_nfs_v03: default timeout 20s for start is smaller than the advised 40

I wonder: if the advised default is 40s and I specify none, why isn't that default used? Is it because the CRM has its own defaults? Regards, Ulrich
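As the warning text hints, the cluster applies its own default operation timeout (20s unless changed) rather than the RA's advertised value, which is advisory only; declaring a timeout explicitly, per operation or cluster-wide, silences the warning. A sketch (resource name taken from the warning, params elided as in the original):

primitive prm_nfs_v03 ocf:heartbeat:exportfs \
  params ... \
  op start interval=0 timeout=40s

# or raise the cluster-wide default for all operations:
crm configure op_defaults timeout=60s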
Re: [Linux-HA] OCF RA for named
No interest? On Tue, Jul 12, 2011 at 3:50 PM, Serge Dubrouski serge...@gmail.com wrote:

Hello - I've created an OCF RA for named (the BIND server). There is an existing one in the redhat directory, but I don't like how it does monitoring, and I doubt that it works with Pacemaker. So please review the attached RA and see whether it can be included in the project. -- Serge Dubrouski.
[Linux-HA] Antw: Re: ocf::LVM monitor needs excessive time to complete
Dejan Muhamedagic deja...@fastmail.fm wrote on 05.08.2011 at 14:18 in message 20110805121851.GB950@rondo.homenet: [...] I guess that this is somehow related to the storage. Best to report directly to SUSE.

[Ulrich:] Hi! I suspect that LVM uses an exclusive lock while examining the state. Basically, vgdisplay on Linux does a stupid thing: it always scans all disks to find PVs. Compare HP-UX LVM, which only scans the disks if you explicitly request it via vgscan; a simple vgdisplay there accesses in-kernel structures, but you can only vgdisplay VGs that are active (otherwise the kernel doesn't know them), and the PVs for each VG are stored in a file. I don't think the disk system is the problem; it's the LVM implementation.

A very quick test series showed that vgdisplay for a named VG that exists takes 0.3 to 0.8 seconds; that's rather slow. Looking for a VG that does not exist takes 0.8 to 1.5 seconds. The system in question has 192 SCSI disks that are combined into 44 multipath disks. About half of those are combined into RAID1s, and a few of those RAIDs are partitioned. All RAIDs have a VG with at least one LV. This gives 72 device-mapper devices. If LVM scans all those devices, it can take a while to complete.

While playing around I made an interesting observation: if you use just vgdisplay to display all VGs, the command takes about 0.05s, but when you specify a name, it takes about 0.7s. And when using awk to locate the desired VG in the full output, the command isn't much slower than without awk:

# time (vgdisplay | awk '$1 == "VG" && $2 == "Name" && $3 == "dd" { print $3 }')
real 0m0.082s
user 0m0.020s
sys 0m0.012s
# time (vgdisplay | awk '$1 == "VG" && $2 == "Name" && $3 == "sys" { print $3 }')
sys
real 0m0.098s
user 0m0.012s
sys 0m0.020s
# time vgdisplay sys
[...]
real 0m0.063s
user 0m0.020s
sys 0m0.004s
# time vgdisplay sysX
Volume group sysX not found
real 0m0.806s
user 0m0.012s
sys 0m0.060s

So the status operation as implemented now takes much longer to return "stopped" than it takes to return "started". Maybe someone wants to have a look at what terrible things happen when a non-existent VG is passed to vgdisplay. Regards, Ulrich
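Ulrich's timings suggest a cheap presence check that avoids naming the VG on the vgdisplay command line; a sketch (only sees active VGs, and the function name is made up for illustration):

# quick "is this VG active?" test built on the fast, unfiltered vgdisplay
vg_active() {
    vgdisplay 2>/dev/null |
        awk -v vg="$1" '$1 == "VG" && $2 == "Name" && $3 == vg { found = 1 }
                        END { exit !found }'
}
vg_active CBW_CI && echo active || echo "not active"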
Re: [Linux-HA] ocf::LVM monitor needs excessive time to complete
Hi, processes in state D look like they are blocked in a kernel call or device request. Do you have a problem with your storage? This is not cluster-related. Kind regards Fabian

On 08/05/2011 01:55 PM, Ulrich Windl wrote: [...]
Re: [Linux-HA] location and orders : Question about a behavior ...
Hi Alain, yes, your arguments about group-2 make sense. To get an idea whether you are seeing a side effect of one resource disturbing the other, OR whether it is a reproducible plan of the pengine, you should check whether this also happens if you only put node2 into standby and back to active again. If so, you could create a shadow CIB, change only the node status in the shadow CIB, and start the what-if analysis (see the sketch after this message). This gives us an idea whether the cluster does that relocation purely because of your configuration, or whether there are also external factors which only occur with actually running resources. Kind regards Fabian

On 08/05/2011 02:17 PM, alain.mou...@bull.net wrote: [...]
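A possible command sequence for the shadow-CIB what-if analysis Fabian describes, using pacemaker tooling of that era (a sketch; the shadow name is arbitrary, and the score-display tool changed names across releases, so option spellings may differ):

# create a shadow copy of the live CIB; this opens a shell where the usual
# CIB commands operate on the copy instead of the live cluster
crm_shadow --create what-if
# inside the shadow shell: flip node2 to standby and back, as Fabian suggests
crm node standby node2
crm node online node2
# ask the policy engine what it would do, including allocation scores
ptest -L -s        # on newer releases: crm_simulate -L -s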