Hi,

On 9/27/19 6:12 PM, Lentes, Bernd wrote:
>
> ----- On Sep 26, 2019, at 5:19 PM, Yan Gao [email protected] wrote:
>
>> Hi,
>>
>> On 9/26/19 3:25 PM, Lentes, Bernd wrote:
>>> Hi,
>>>
>>> I had two errors with a GFS2 partition several days ago:
>>>
>>> gfs2_share_monitor_30000 on ha-idg-2 'unknown error' (1): call=103, status=Timed Out,
>>> exitreason='', last-rc-change='Thu Sep 19 13:44:22 2019', queued=0ms, exec=0ms
>>>
>>> gfs2_share_monitor_30000 on ha-idg-1 'unknown error' (1): call=103, status=Timed Out,
>>> exitreason='', last-rc-change='Thu Sep 19 13:44:12 2019', queued=0ms, exec=0ms
>>>
>>> Now I wanted to get rid of these messages and did a "resource cleanup".
>>> I had to do this several times until both disappeared.
>>>
>>> But then all VirtualDomain resources restarted.
>>>
>>> The config for the GFS2 filesystem is:
>>>
>>> primitive gfs2_share Filesystem \
>>>     params device="/dev/vg_san/lv_share" directory="/mnt/share" fstype=gfs2 options=acl \
>>>     op monitor interval=30 timeout=20 \
>>>     op start timeout=60 interval=0 \
>>>     op stop timeout=60 interval=0 \
>>>     meta is-managed=true
>>>
>>> /mnt/share holds the config files for the VirtualDomains.
>>>
>>> Here is one VirtualDomain config (the others are the same):
>>>
>>> primitive vm_crispor VirtualDomain \
>>>     params config="/mnt/share/crispor.xml" \
>>>     params hypervisor="qemu:///system" \
>>>     params migration_transport=ssh \
>>>     params migrate_options="--p2p --tunnelled" \
>>>     op start interval=0 timeout=120 \
>>>     op stop interval=0 timeout=180 \
>>>     op monitor interval=30 timeout=25 \
>>>     op migrate_from interval=0 timeout=300 \
>>>     op migrate_to interval=0 timeout=300 \
>>>     meta allow-migrate=true target-role=Started is-managed=true maintenance=false \
>>>     utilization cpu=2 hv_memory=8192
>>>
>>> The GFS2 share is part of a group, and the group is cloned:
>>>
>>> group gr_share dlm clvmd gfs2_share gfs2_snap fs_ocfs2
>>> clone cl_share gr_share \
>>>     meta target-role=Started interleave=true
>>>
>>> And for each VirtualDomain I have an order constraint:
>>>
>>> order or_vm_crispor_after_gfs2 Mandatory: cl_share vm_crispor symmetrical=true
>>>
>>> Why are the domains restarted? I thought a cleanup would just delete the
>>> error message.
>>
>> It could potentially be fixed by this:
>> https://github.com/ClusterLabs/pacemaker/pull/1765
>>
>> Regards,
>> Yan
>
> Hi Yan,
>
> thanks for that information. I saw that this patch is included in the SuSE
> updates for Pacemaker for SLES 12 SP4.
> I will install it soon and let you know.
> I had a look in the logs, and what happened when I issued a "resource cleanup"
> of the GFS2 resource is that the cluster deleted an entry in the status section:
>
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: --- 2.9157.0 2
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 2.9157.1 (null)
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: -- /cib/status/node_state[@id='1084777482']/lrm[@id='1084777482']/lrm_resources/lrm_resource[@id='dlm']
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: + /cib: @num_updates=1
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='ha-idg-1']//lrm_resource[@id='dlm']: OK (rc=0, origin=ha-idg-1/crmd/113, version=2.9157.0)
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: --- 2.9157.0 2
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: Diff: +++ 2.9157.1 (null)
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: -- /cib/status/node_state[@id='1084777482']/lrm[@id='1084777482']/lrm_resources/lrm_resource[@id='dlm']
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_perform_op: + /cib: @num_updates=1
> Sep 26 14:52:52 [9317] ha-idg-2 cib: info: cib_process_request: Completed cib_delete operation for section //node_state[@uname='ha-idg-1']//lrm_resource[@id='dlm']: OK (rc=0, origin=ha-idg-1/crmd/114, version=2.9157.1)
> Sep 26 14:52:52 [9322] ha-idg-2 crmd: info: abort_transition_graph: Transition 1028 aborted by deletion of lrm_resource[@id='dlm']: Resource state removal | cib=2.9157.1 source=abort_unless_down:344 path=/cib/status/node_state[@id='1084777482']/lrm[@id='1084777482']/lrm_resources/lrm_resource[@id='dlm'] complete=true
> Sep 26 14:52:52 [9322] ha-idg-2 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE | input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph
>
> and soon afterwards it recognized dlm on ha-idg-1 as stopped (or stopped it):
>
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed monitor of gfs2_share:1 on ha-idg-2: unknown error | rc=1
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed monitor of vm_severin on ha-idg-2: not running | rc=7
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: warning: unpack_rsc_op_failure: Processing failed monitor of vm_geneious on ha-idg-2: not running | rc=7
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777482 is already processed
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777492 is already processed
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777482 is already processed
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: unpack_node_loop: Node 1084777492 is already processed
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: fence_ilo_ha-idg-2 (stonith:fence_ilo2): Started ha-idg-1
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: fence_ilo_ha-idg-1 (stonith:fence_ilo4): Started ha-idg-2
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: clone_print: Clone Set: cl_share [gr_share]
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: group_print: Resource Group: gr_share:0
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: dlm (ocf::pacemaker:controld): Stopped  <===============================
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: clvmd (ocf::heartbeat:clvm): Started ha-idg-1
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: gfs2_share (ocf::heartbeat:Filesystem): Started ha-idg-1
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: gfs2_snap (ocf::heartbeat:Filesystem): Started ha-idg-1
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: common_print: fs_ocfs2 (ocf::heartbeat:Filesystem): Started ha-idg-1
> Sep 26 14:52:54 [9321] ha-idg-2 pengine: info: short_print: Started: [ ha-idg-2 ]
>
> According to the logs, dlm was running before. Does the deletion of that entry
> lead to the stop of the dlm resource?
> Is that expected behaviour?

First, unless "force" is specified, a cleanup issued for a child resource will do the work for the whole resource group.
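For example, with crm_resource the difference would look roughly like this (a sketch only; the exact behaviour of --force together with --cleanup depends on the Pacemaker version, so please check crm_resource --help on your nodes):

    # Without --force, cleaning up one member of gr_share acts on the
    # whole group (dlm, clvmd, gfs2_share, gfs2_snap, fs_ocfs2):
    crm_resource --cleanup --resource gfs2_share --node ha-idg-2

    # With --force, the cleanup should be limited to the named resource only:
    crm_resource --cleanup --resource gfs2_share --node ha-idg-2 --force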
Cleanup deletes the resources' history, which triggers a (re-)probe of the resources. But before the probe of a resource has finished, the resource will be shown as "Stopped", which doesn't necessarily mean it is actually stopped. A running resource will be detected as "Started" by the probe.

The VMs were restarted because pengine/crmd thought the resources they depend on were really "Stopped", and it wasn't patient enough to wait for the probes to finish. That's what the pull request resolved.

Regards,
Yan

> I simulated the deletion of that entry with crm_simulate, and the same thing happened again.
>
> Bernd
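Regarding that crm_simulate check: if you want to watch this directly, the status entry in question can be queried before and after a cleanup, for example with cibadmin (a sketch, using the node and resource names from the logs above):

    # Show the stored operation history for dlm on ha-idg-1, i.e. the entry
    # that the cib_delete in the log above removes; right after a cleanup
    # the query finds no match until the re-probe has reported back.
    cibadmin --query --xpath "//node_state[@uname='ha-idg-1']//lrm_resource[@id='dlm']"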
