Re: [lustre-discuss] problem with an MDT
Yeah, I don't know how successful lctl --device abort_recovery is going to be versus abort_recov on the device itself; I think by the time you get to aborting it via lctl it's probably already too late. But to confirm, you're back online again? (Also, time to upgrade!) =)

On Tue, Nov 10, 2020 at 7:57 AM wrote:
>
> Hi Colin,
>
> Thank you. That was the tip I needed!
>
> We are running IEEL, so I did the following...
>
> * mount the MDT by hand with -o abort_recov
>
>   mount -v -t lustre -o abort_recov /dev/mapper/mpatha /mnt/lustre02-MDT
>
> * after it mounted up, umount it
> * start the MDT via IEEL
> * mount the file system on the clients.
>
> I also tried to start the MDT with IEEL and then use
>
>   lctl --device 4 abort_recovery
>
> but that didn't work.
>
> Cheers,
>
> sb. Scott Blomquist
>
>
> Colin Faber writes:
>
> > Scott,
> >
> > Have you tried aborting recovery on mount?
> >
> > On Mon, Nov 9, 2020 at 1:15 PM wrote:
> >
> >>
> >> Hi All,
> >>
> >> After the recent power glitch last week, one of our Lustre file systems
> >> failed to come up.
> >>
> >> We diagnosed the problem down to a file system error on the MDT. This
> >> is an old IEEL system running on Dell equipment.
> >>
> >> Here are the facts...
> >>
> >> * the RAID 6 array running on a Dell MD32xx is OK.
> >>
> >> * when we bring up the MDT it goes read-only, then the MDS host crashes
> >>
> >> * after this the MDT file system is dirty and we have to e2fsck it
> >>
> >> * I have tried multiple combinations of MDS up/down and OSS up/down,
> >>   with nothing changing the results.
> >>
> >> * This seems to be Lustre 2.7.15
> >>
> >> I think this may be
> >>
> >> https://jira.whamcloud.com/browse/LU-7045
> >>
> >> or something like that.
> >>
> >> Is there a way to LFSCK (or something) this error away? Or is this a
> >> "please update Lustre" error?
> >>
> >> Thanks for any help.
> >>
> >> I have attached the error below.
> >>
> >> Thanks for any insight,
> >>
> >> sb.
> >> Scott Blomquist
> >>
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [ cut here ]
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: WARNING: at /tmp/rpmbuild-lustre-jenkins-U6NXEPsD/BUILD/lustre-2.7.15.3/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]()
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) vfat fat usb_storage mpt3sas mptctl mptbase dell_rbu lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic fuse crypto_null libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) vxlan ip6_udp_tunnel udp_tunnel intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt dcdbas cryptd iTCO_vendor_support dm_round_robin pcspkr sg ipmi_devintf sb_edac edac_core acpi_power_meter ntb ipmi_si wmi shpchp acpi_pad ipmi_msghandler lpc_ich mei_me mei mfd_core
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: knem(OE) nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) sr_mod cdrom sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper crct10dif_pclmul crct10dif_common mpt2sas crc32c_intel ttm ahci raid_class drm libahci scsi_transport_sas mlx4_core(OE) mlx_compat(OE) libata tg3 i2c_core ptp megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: CPU: 10 PID: 6577 Comm: mdt01_003 Tainted: G OE 3.10.0-327.el7_lustre.gd4cb884.x86_64 #1
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: fff8486a 881f815234f0 81635429
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: 881f81523528 8107b200 880fc708f1a0 880fe535b060
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: 881fa4aff7c8 a10b9a9c 0325 881f81523538
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Call Trace:
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] dump_stack+0x19/0x1b
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] warn_slowpath_common+0x70/0xb0
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] warn_slowpath_null+0x1a/0x20
> >> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
> >> Nov 4 09:29:09
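For anyone hitting the same recovery hang, the two approaches discussed above boil down to the commands below. This is only a sketch: the device path, mount point and device index are the examples quoted in this thread, so check "lctl dl" on your own MDS for the real values.

    # Skip recovery at mount time (the approach that worked above):
    mount -v -t lustre -o abort_recov /dev/mapper/mpatha /mnt/lustre02-MDT

    # Or abort recovery on an already-mounted target via lctl. List the
    # local OBD devices to find the MDT's index first, e.g. an entry like
    # "4 UP mdt lustre02-MDT0000 ...", then abort recovery on that index:
    lctl dl
    lctl --device 4 abort_recovery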
Re: [lustre-discuss] problem with an MDT
Hi Colin,

Thank you. That was the tip I needed!

We are running IEEL, so I did the following...

* mount the MDT by hand with -o abort_recov

    mount -v -t lustre -o abort_recov /dev/mapper/mpatha /mnt/lustre02-MDT

* after it mounted up, umount it
* start the MDT via IEEL
* mount the file system on the clients.

I also tried to start the MDT with IEEL and then use

    lctl --device 4 abort_recovery

but that didn't work.

Cheers,

sb. Scott Blomquist


Colin Faber writes:

> Scott,
>
> Have you tried aborting recovery on mount?
>
> On Mon, Nov 9, 2020 at 1:15 PM wrote:
>
>>
>> Hi All,
>>
>> After the recent power glitch last week, one of our Lustre file systems
>> failed to come up.
>>
>> We diagnosed the problem down to a file system error on the MDT. This
>> is an old IEEL system running on Dell equipment.
>>
>> Here are the facts...
>>
>> * the RAID 6 array running on a Dell MD32xx is OK.
>>
>> * when we bring up the MDT it goes read-only, then the MDS host crashes
>>
>> * after this the MDT file system is dirty and we have to e2fsck it
>>
>> * I have tried multiple combinations of MDS up/down and OSS up/down,
>>   with nothing changing the results.
>>
>> * This seems to be Lustre 2.7.15
>>
>> I think this may be
>>
>> https://jira.whamcloud.com/browse/LU-7045
>>
>> or something like that.
>>
>> Is there a way to LFSCK (or something) this error away? Or is this a
>> "please update Lustre" error?
>>
>> Thanks for any help.
>>
>> I have attached the error below.
>>
>> Thanks for any insight,
>>
>> sb. Scott Blomquist
>>
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [ cut here ]
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: WARNING: at /tmp/rpmbuild-lustre-jenkins-U6NXEPsD/BUILD/lustre-2.7.15.3/ldiskfs/ext4_jbd2.c:266 __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]()
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Modules linked in: osp(OE) mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) vfat fat usb_storage mpt3sas mptctl mptbase dell_rbu lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) sha512_generic fuse crypto_null libcfs(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) vxlan ip6_udp_tunnel udp_tunnel intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper iTCO_wdt dcdbas cryptd iTCO_vendor_support dm_round_robin pcspkr sg ipmi_devintf sb_edac edac_core acpi_power_meter ntb ipmi_si wmi shpchp acpi_pad ipmi_msghandler lpc_ich mei_me mei mfd_core
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: knem(OE) nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2 mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) sr_mod cdrom sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt i2c_algo_bit drm_kms_helper crct10dif_pclmul crct10dif_common mpt2sas crc32c_intel ttm ahci raid_class drm libahci scsi_transport_sas mlx4_core(OE) mlx_compat(OE) libata tg3 i2c_core ptp megaraid_sas pps_core dm_mirror dm_region_hash dm_log dm_mod
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: CPU: 10 PID: 6577 Comm: mdt01_003 Tainted: G OE 3.10.0-327.el7_lustre.gd4cb884.x86_64 #1
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Hardware name: Dell Inc. PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: fff8486a 881f815234f0 81635429
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: 881f81523528 8107b200 880fc708f1a0 880fe535b060
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: 881fa4aff7c8 a10b9a9c 0325 881f81523538
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: Call Trace:
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] dump_stack+0x19/0x1b
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] warn_slowpath_common+0x70/0xb0
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] warn_slowpath_null+0x1a/0x20
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] ldiskfs_getblk+0x131/0x200 [ldiskfs]
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] ldiskfs_bread+0x27/0xc0 [ldiskfs]
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] osd_ldiskfs_write_record+0x169/0x360 [osd_ldiskfs]
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] osd_write+0xf8/0x230 [osd_ldiskfs]
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] dt_record_write+0x45/0x130 [obdclass]
>> Nov 4 09:29:09 mds1-002.lustre.cluster kernel: [] tgt_last_rcvd_update+0x732/0xef0 [ptlrpc]
>>
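Since the question above asks whether the error can be fscked away, the usual repair sequence for a dirty ldiskfs MDT is roughly the following. This is only a sketch using the device and filesystem names from this thread; the target name lustre02-MDT0000 is a guess, and on a 2.7-era system the Whamcloud-patched e2fsprogs should be used rather than the distro version.

    # With the MDT unmounted, check and repair the ldiskfs backing device:
    e2fsck -fp /dev/mapper/mpatha    # preen pass; bails out on serious errors
    e2fsck -fy /dev/mapper/mpatha    # full pass, answering yes to all fixes

    # Once the MDT mounts cleanly again, an online consistency check can be
    # started from the MDS:
    lctl lfsck_start -M lustre02-MDT0000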
Re: [lustre-discuss] ZFS w/Lustre problem
On Mon, 9 Nov 2020, Hans Henrik Happe wrote:

> It sounds like this issue, but I'm not sure what your dnodesize is:
>
> https://github.com/openzfs/zfs/issues/8458
>
> ZFS 0.8.1+ on the receiving side should fix it. Then again, ZFS 0.8 is
> not supported in Lustre 2.12, so it's a bit hard to restore without
> copying the underlying devices.

Hans Henrik,

Many thanks for your input. I had in fact known about the dnodesize issue, and tested a workaround. Unfortunately, it turned out not to be this. Instead, I have tested a patch to zfs_send.c, which does appear to have solved the issue. The zfs send/recv is still running, however; if it completes successfully, I will post again with details of the patch.

Steve

--
Steve Thompson                 E-mail:      smt AT vgersoft DOT com
Voyager Software LLC           Web:         http://www DOT vgersoft DOT com
3901 N Charles St              VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
  "186,282 miles per second: it's not just a good idea, it's the law"
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
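For anyone else chasing the same send/recv failure, the checks implied by the issue linked above are roughly these; the dataset name is a placeholder rather than anything taken from this thread.

    # The issue concerns datasets using large dnodes (dnodesize other than
    # "legacy") being received by an older ZFS, so check the sending side:
    zfs get dnodesize pool0/mdt0

    # And confirm the ZFS module version on the receiving host; per the
    # note above, 0.8.1+ should carry the fix:
    modinfo zfs | grep '^version:'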