>>> Gang He <[email protected]> schrieb am 22.01.2021 um 09:44 in Nachricht <[email protected]>:
>
> On 2021/1/22 16:17, Ulrich Windl wrote:
>>>>> Gang He <[email protected]> wrote on 2021-01-22 at 09:13 in message
>> <[email protected]>:
>>> Hi Ulrich,
>>>
>>> I reviewed the crm configuration file; here are some comments:
>>> 1) The lvmlockd resource is only needed for shared VGs. If you do not
>>> plan to add any shared VG to your cluster, I suggest dropping this
>>> resource and its clone.
>>> 2) The lvmlockd service depends on the DLM service: it creates
>>> "lvm_xxx" lockspaces when a shared VG is created/activated. Some other
>>> resources also depend on DLM to create lockspaces and avoid race
>>> conditions, e.g. clustered MD, OCFS2, etc. The file system resource
>>> should therefore start later than the lvm2 (lvmlockd) related
>>> resources. That means this order is wrong:
>>> order ord_lockspace_fs__lvmlockd Mandatory: cln_lockspace_ocfs2 cln_lvmlockd
>>
>> But cln_lockspace_ocfs2 provides the shared filesystem that lvmlockd uses. I
>> thought for locking in a cluster it needs a cluster-wide filesystem.
>
> An ocfs2 file system resource depends only on the DLM resource if you use
> a shared raw disk (e.g. /dev/vdb3), e.g.
> primitive dlm ocf:pacemaker:controld \
>     op start interval=0 timeout=90 \
>     op stop interval=0 timeout=100 \
>     op monitor interval=20 timeout=600
> primitive ocfs2-2 Filesystem \
>     params device="/dev/vdb3" directory="/mnt/shared" fstype=ocfs2 \
>     op monitor interval=20 timeout=40
> group base-group dlm ocfs2-2
> clone base-clone base-group
>
> If you use an ocfs2 file system on top of a shared VG (e.g. /dev/vg1/lv1),
> you need to add the lvmlockd/LVM-activate resources before the ocfs2 file
> system, e.g.
> primitive dlm ocf:pacemaker:controld \
>     op monitor interval=60 timeout=60
> primitive lvmlockd lvmlockd \
>     op start timeout=90 interval=0 \
>     op stop timeout=100 interval=0 \
>     op monitor interval=30 timeout=90
> primitive ocfs2-2 Filesystem \
>     params device="/dev/vg1/lv1" directory="/mnt/shared" fstype=ocfs2 \
>     op monitor interval=20 timeout=40
> primitive vg1 LVM-activate \
>     params vgname=vg1 vg_access_mode=lvmlockd activation_mode=shared \
>     op start timeout=90s interval=0 \
>     op stop timeout=90s interval=0 \
>     op monitor interval=30s timeout=90s
> group base-group dlm lvmlockd vg1 ocfs2-2
> clone base-clone base-group

Hi!

I don't see the problem: As said before, the OCFS2 filesystem used for the
lockspace does not use LVM itself; it uses a clustered MD
(prm_lockspace_ocfs2 Filesystem, cln_lockspace_ocfs2). That is co-located
with DLM and the RAID (cln_lockspace_raid_md10), and so is cln_lvmlockd.
The ordering is somewhat redundant, as the clustered RAID needs DLM, and
OCFS2 needs DLM and the RAID.

lvmlockd (prm_lvmlockd, cln_lvmlockd) is co-located with DLM (hmm... does
that mean it uses DLM and maybe does NOT need a shared filesystem?) and
with cln_lockspace_ocfs2. Accordingly, the ordering is that lvmlockd starts
after DLM (cln_DLM) and after OCFS2 (cln_lockspace_ocfs2).

To summarize the related resources:

Node List:
  * Online: [ h16 h18 h19 ]

Full List of Resources:
  * Clone Set: cln_DLM [prm_DLM]:
    * Started: [ h16 h18 h19 ]
  * Clone Set: cln_lvmlockd [prm_lvmlockd]:
    * Started: [ h16 h18 h19 ]
  * Clone Set: cln_lockspace_raid_md10 [prm_lockspace_raid_md10]:
    * Started: [ h16 h18 h19 ]
  * Clone Set: cln_lockspace_ocfs2 [prm_lockspace_ocfs2]:
    * Started: [ h16 h18 h19 ]

Regards,
Ulrich

>
> Thanks
> Gang
>
>
>>
>>>
>>>
>>> Thanks
>>> Gang
>>>
>>> On 2021/1/21 20:08, Ulrich Windl wrote:
>>>>>>> Gang He <[email protected]> wrote on 2021-01-21 at 11:30 in message
>>>> <[email protected]>:
>>>>> Hi Ulrich,
>>>>>
>>>>> Is the problem reproduced stably?
>>>>> Could you share your pacemaker CRM configuration and your
>>>>> OS/lvm2/resource-agents version information?
>>>>
>>>> OK, the problem occurred on every node, so I guess it's reproducible.
>>>> OS is SLES15 SP2 with all current updates (lvm2-2.03.05-8.18.1.x86_64,
>>>> pacemaker-2.0.4+20200616.2deceaa3a-3.3.1.x86_64,
>>>> resource-agents-4.4.0+git57.70549516-3.12.1.x86_64).
>>>>
>>>> The configuration (somewhat trimmed) is attached.
>>>>
>>>> The only VG the cluster node sees is:
>>>> ph16:~ # vgs
>>>>   VG  #PV #LV #SN Attr   VSize   VFree
>>>>   sys   1   3   0 wz--n- 222.50g    0
>>>>
>>>> Regards,
>>>> Ulrich
>>>>
>>>>> I feel the problem was probably caused by the lvmlockd resource agent
>>>>> script, which did not handle this corner case correctly.
>>>>>
>>>>> Thanks
>>>>> Gang
>>>>>
>>>>>
>>>>> On 2021/1/21 17:53, Ulrich Windl wrote:
>>>>>> Hi!
>>>>>>
>>>>>> I have a problem: For tests I had configured lvmlockd. Now that the
>>>>>> tests have ended, no LVM is used for cluster resources any more, but
>>>>>> lvmlockd is still configured.
>>>>>> Unfortunately I ran into this problem:
>>>>>> One OCFS2 mount was unmounted successfully; another, holding the
>>>>>> lockspace for lvmlockd, is still active.
>>>>>> lvmlockd shuts down. At least it says so.
>>>>>>
>>>>>> Unfortunately that stop never succeeds (it runs into a timeout).
>>>>>>
>>>>>> My suspicion is something like this:
>>>>>> Some non-LVM lock exists for the now unmounted OCFS2 filesystem.
>>>>>> lvmlockd wants to access that filesystem for unknown reasons.
>>>>>>
>>>>>> I don't understand what's going on.
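[When a stop like this hangs, it can help to see which lockspaces actually still exist on the node. A minimal, hedged checklist using standard dlm/lvm2 utilities that the thread itself later uses; the guards make it a no-op on a host without a cluster stack, and the lockspace names in the comments are only what this thread suggests looking for:]

```shell
# Guarded so the script is harmless on a host without a cluster stack.
{
  # DLM lockspaces still present on this node; an "lvm_<vgname>" or
  # "lvm_global" entry means lvmlockd still holds a lockspace.
  command -v dlm_tool   >/dev/null && dlm_tool ls

  # lvmlockd's own view of the lockspaces it manages.
  command -v lvmlockctl >/dev/null && lvmlockctl --info

  # Any VG still marked shared? (shared VGs are flagged in the Attr column)
  command -v vgs        >/dev/null && vgs
} || true
echo "lockspace checks done"
```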
>>>>>>
>>>>>> The events at node shutdown were:
>>>>>> Some Xen PVM was live-migrated successfully to another node, but
>>>>>> during that there was a message like this:
>>>>>> Jan 21 10:20:13 h19 virtlockd[41990]: libvirt version: 6.0.0
>>>>>> Jan 21 10:20:13 h19 virtlockd[41990]: hostname: h19
>>>>>> Jan 21 10:20:13 h19 virtlockd[41990]: resource busy: Lockspace resource
>>>>>> '4c6bebd1f4bc581255b422a65d317f31deef91f777e51ba0daf04419dda7ade5' is
>>>>>> not locked
>>>>>> Jan 21 10:20:13 h19 libvirtd[41991]: libvirt version: 6.0.0
>>>>>> Jan 21 10:20:13 h19 libvirtd[41991]: hostname: h19
>>>>>> Jan 21 10:20:13 h19 libvirtd[41991]: resource busy: Lockspace resource
>>>>>> '4c6bebd1f4bc581255b422a65d317f31deef91f777e51ba0daf04419dda7ade5' is
>>>>>> not locked
>>>>>> Jan 21 10:20:13 h19 libvirtd[41991]: Unable to release lease on test-jeos4
>>>>>> Jan 21 10:20:13 h19 VirtualDomain(prm_xen_test-jeos4)[32786]: INFO:
>>>>>> test-jeos4: live migration to h18 succeeded.
>>>>>>
>>>>>> Unfortunately the log message makes it practically impossible to guess
>>>>>> what the locked object actually is (an indirect lock using SHA256 as
>>>>>> hash, it seems).
>>>>>>
>>>>>> Then the OCFS2 for the VM images unmounts successfully while the stop
>>>>>> of lvmlockd is still busy:
>>>>>> Jan 21 10:20:16 h19 lvmlockd(prm_lvmlockd)[32945]: INFO: stop the
>>>>>> lockspaces of shared VG(s)...
>>>>>> ...
>>>>>> Jan 21 10:21:56 h19 pacemaker-controld[42493]: error: Result of stop
>>>>>> operation for prm_lvmlockd on h19: Timed Out
>>>>>>
>>>>>> As said before: I don't have shared VGs any more. I don't understand.
>>>>>>
>>>>>> On a node without VMs running I see:
>>>>>> h19:~ # lvmlockctl -d
>>>>>> 1611221190 lvmlockd started
>>>>>> 1611221190 No lockspaces found to adopt
>>>>>> 1611222560 new cl 1 pi 2 fd 8
>>>>>> 1611222560 recv client[10817] cl 1 dump_info .
>>>>>> "" mode iv flags 0
>>>>>> 1611222560 send client[10817] cl 1 dump result 0 dump_len 149
>>>>>> 1611222560 send_dump_buf delay 0 total 149
>>>>>> 1611222560 close client[10817] cl 1 fd 8
>>>>>> 1611222563 new cl 2 pi 2 fd 8
>>>>>> 1611222563 recv client[10818] cl 2 dump_log . "" mode iv flags 0
>>>>>>
>>>>>> On a node with VMs running I see:
>>>>>> h16:~ # lvmlockctl -d
>>>>>> 1611216942 lvmlockd started
>>>>>> 1611216942 No lockspaces found to adopt
>>>>>> 1611221684 new cl 1 pi 2 fd 8
>>>>>> 1611221684 recv pvs[17159] cl 1 lock gl "" mode sh flags 0
>>>>>> 1611221684 lockspace "lvm_global" not found for dlm gl, adding...
>>>>>> 1611221684 add_lockspace_thread dlm lvm_global version 0
>>>>>> 1611221684 S lvm_global lm_add_lockspace dlm wait 0 adopt 0
>>>>>> 1611221685 S lvm_global lm_add_lockspace done 0
>>>>>> 1611221685 S lvm_global R GLLK action lock sh
>>>>>> 1611221685 S lvm_global R GLLK res_lock cl 1 mode sh
>>>>>> 1611221685 S lvm_global R GLLK lock_dlm
>>>>>> 1611221685 S lvm_global R GLLK res_lock rv 0 read vb 0 0 0
>>>>>> 1611221685 S lvm_global R GLLK res_lock all versions zero
>>>>>> 1611221685 S lvm_global R GLLK res_lock invalidate global state
>>>>>> 1611221685 send pvs[17159] cl 1 lock gl rv 0
>>>>>> 1611221685 recv pvs[17159] cl 1 lock vg "sys" mode sh flags 0
>>>>>> 1611221685 lockspace "lvm_sys" not found
>>>>>> 1611221685 send pvs[17159] cl 1 lock vg rv -210 ENOLS
>>>>>> 1611221685 close pvs[17159] cl 1 fd 8
>>>>>> 1611221685 S lvm_global R GLLK res_unlock cl 1 from close
>>>>>> 1611221685 S lvm_global R GLLK unlock_dlm
>>>>>> 1611221685 S lvm_global R GLLK res_unlock lm done
>>>>>> 1611222582 new cl 2 pi 2 fd 8
>>>>>> 1611222582 recv client[19210] cl 2 dump_log . "" mode iv flags 0
>>>>>>
>>>>>> Note: "lvm_sys" may refer to VG sys used for the hypervisor.
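[About the opaque resource name in the virtlockd messages quoted above: with indirect leases, virtlockd appears to derive the lockspace resource name from the SHA-256 hash of the qualified disk path, which is also the guess voiced in the thread. If that assumption holds, hashing candidate image paths lets one match a logged hash back to a file. A sketch under that assumption; the path below is hypothetical:]

```shell
# Hypothetical candidate disk path; substitute each disk of the migrated VM.
path='/var/lib/libvirt/images/test-jeos4.qcow2'

# SHA-256 hex digest of the path, as virtlockd's indirect-lease naming is
# assumed to produce it (64 hex characters, like the name in the log).
hash=$(printf '%s' "$path" | sha256sum | awk '{print $1}')
echo "$hash"

# Compare against the logged resource name, e.g.
# 4c6bebd1f4bc581255b422a65d317f31deef91f777e51ba0daf04419dda7ade5
```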
>>>>>> Regards,
>>>>>> Ulrich

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
