Re: [lustre-discuss] problem with an MDT

2020-11-10 Thread Colin Faber
Yeah, I don't know how successful lctl --device abort_recovery is going to
be compared with abort_recov on the device itself; I think by the time you
get to aborting it via lctl, it's probably already too late.

But to confirm, you're back online again? (Also, time to upgrade!) =)

On Tue, Nov 10, 2020 at 7:57 AM  wrote:

>
> Hi Colin,
>
> Thank you.  That was the tip I needed!
>
> We are running IEEL so I did the following...
>
> * mount the MDT by hand with -o abort_recov
>
>   mount -v -t lustre -o abort_recov /dev/mapper/mpatha
> /mnt/lustre02-MDT
>
> * after it mounted, umount it
> * start the MDT via IEEL
> * mount the file system on the clients.
>
> I also tried to start the MDT with IEEL and then use
>
> lctl --device 4 abort_recovery
>
> but that didn't work.
>
> Cheers,
>
> sb. Scott Blomquist
>
>
> Colin Faber  writes:
>
> > Scott,
> >
> > Have you tried aborting recovery on mount?
> >
> > On Mon, Nov 9, 2020 at 1:15 PM  wrote:
> >
> >>
> >> Hi All,
> >>
> >> After the recent power glitch last week one of our lustre file systems
> >> failed to come up.
> >>
> >> We diagnosed the problem down to a file system error on the MDT.  This
> >> is an old IEEL system running on Dell equipment.
> >>
> >> Here are the facts...
> >>
> >>   * the RAID 6 array running on a Dell MD32xx is OK.
> >>
> >>   * when we bring up the MDT, it goes read-only and then the MDS host crashes
> >>
> >>   * after this the MDT file system is dirty and we have to e2fsck it
> >>
> >>   * I have tried multiple combinations of MDS up/down, OSS up/down with
> >> nothing changing the results.
> >>
> >>   * This seems to be Lustre 2.7.15
> >>
> >> I think this may be
> >>
> >> https://jira.whamcloud.com/browse/LU-7045
> >>
> >> or something like that.
> >>
> >> Is there a way to LFSCK (or something) this error away?  Or is this a
> >> "please update Lustre" error?
> >>
> >> Thanks for any help.
> >>
> >> I have attached the error below.
> >>
> >> Thanks for any insight,
> >>
> >> sb. Scott Blomquist
> >>

Re: [lustre-discuss] problem with an MDT

2020-11-10 Thread s_b


Hi Colin,

Thank you.  That was the tip I needed!

We are running IEEL, so I did the following (sketched below)...

* mount the MDT by hand with -o abort_recov

  mount -v -t lustre -o abort_recov /dev/mapper/mpatha /mnt/lustre02-MDT

* after it mounted, umount it
* start the MDT via IEEL
* mount the file system on the clients.
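
Roughly, as commands (the device path is the one above; the mount point and
the client mount line are placeholders rather than our real names):

  mount -v -t lustre -o abort_recov /dev/mapper/mpatha /mnt/mdt   # skip client recovery; /mnt/mdt is a placeholder
  umount /mnt/mdt                                                 # unmount once it has come up cleanly
  # restart the MDT through IEEL as usual, then remount on the clients, e.g.:
  #   mount -t lustre <mgsnode>@tcp:/<fsname> /mnt/<fsname>       # placeholder NID/fsname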

I also tried to start the MDT with IEEL and then use

lctl --device 4 abort_recovery

but that didn't work.
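
For anyone trying the same route, this is roughly what that looks like; the
device index is whatever 'lctl dl' reports for the local MDT (4 in our case),
and as the reply above notes, by that point it may already be too late:

  lctl dl                          # list local OBD devices and their indexes
  lctl --device 4 abort_recovery   # abort recovery on the device at that index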

Cheers,

sb. Scott Blomquist


Colin Faber  writes:

> Scott,
>
> Have you tried aborting recovery on mount?
>
> On Mon, Nov 9, 2020 at 1:15 PM  wrote:
>
>>
>> Hi All,
>>
>> After the recent power glitch last week one of our lustre file systems
>> failed to come up.
>>
>> We diagnosed the problem down to a file system error on the MDT.  This
>> is an old IEEL system running on Dell equipment.
>>
>> Here are the facts...
>>
>>   * the RAID 6 array running on a Dell MD32xx is OK.
>>
>>   * when we bring up the MDT, it goes read-only and then the MDS host crashes
>>
>>   * after this the MDT file system is dirty and we have to e2fsck it
>>
>>   * I have tried multiple combinations of MDS up/down, OSS up/down with
>> nothing changing the results.
>>
>>   * This seems to be Lustre 2.7.15
>>
>> I think this may be
>>
>> https://jira.whamcloud.com/browse/LU-7045
>>
>> or something like that.
>>
>> Is there a way to LFSCK (or something) this error away?  Or is this a
>> "please update Lustre" error?
>>
>> Thanks for any help.
>>
>> I have attached the error below.
>>
>> Thanks for any insight,
>>
>> sb. Scott Blomquist
>>
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: [ cut here
>> ]
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: WARNING: at
>> /tmp/rpmbuild-lustre-jenkins-U6NXEPsD/BUILD/lustre-2.7.15.3/ldiskfs/ext4_jbd2.c:266
>> __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]()
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: Modules linked in: osp(OE)
>> mdd(OE) lod(OE) mdt(OE) lfsck(OE) mgs(OE) mgc(OE) osd_ldiskfs(OE)
>> ldiskfs(OE) lquota(OE) vfat fat usb_storage mpt3sas mptctl mptbase dell_rbu
>> lustre(OE) lmv(OE) mdc(OE) lov(OE) fid(OE) fld(OE) ko2iblnd(OE) ptlrpc(OE)
>> obdclass(OE) lnet(OE) sha512_generic fuse crypto_null libcfs(OE)
>> rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE)
>> ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_en(OE) vxlan
>> ip6_udp_tunnel udp_tunnel intel_powerclamp coretemp intel_rapl kvm_intel
>> kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper
>> ablk_helper iTCO_wdt dcdbas cryptd iTCO_vendor_support dm_round_robin
>> pcspkr sg ipmi_devintf sb_edac edac_core acpi_power_meter ntb ipmi_si wm
>> i shpchp acpi_pad ipmi_msghandler lpc_ich mei_me mei mfd_core
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: knem(OE) nfsd auth_rpcgss
>> nfs_acl lockd grace sunrpc dm_multipath ip_tables ext4 mbcache jbd2
>> mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) sr_mod cdrom
>> sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect
>> sysimgblt i2c_algo_bit drm_kms_helper crct10dif_pclmul crct10dif_common
>> mpt2sas crc32c_intel ttm ahci raid_class drm libahci scsi_transport_sas
>> mlx4_core(OE) mlx_compat(OE) libata tg3 i2c_core ptp megaraid_sas pps_core
>> dm_mirror dm_region_hash dm_log dm_mod
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: CPU: 10 PID: 6577 Comm:
>> mdt01_003 Tainted: G   OE  
>>  3.10.0-327.el7_lustre.gd4cb884.x86_64 #1
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: Hardware name: Dell Inc.
>> PowerEdge R620/0PXXHP, BIOS 2.5.4 01/22/2016
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: 
>> fff8486a 881f815234f0 81635429
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: 881f81523528
>> 8107b200 880fc708f1a0 880fe535b060
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: 881fa4aff7c8
>> a10b9a9c 0325 881f81523538
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: Call Trace:
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> dump_stack+0x19/0x1b
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> warn_slowpath_common+0x70/0xb0
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> warn_slowpath_null+0x1a/0x20
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> __ldiskfs_handle_dirty_metadata+0x1c2/0x220 [ldiskfs]
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> ldiskfs_getblk+0x131/0x200 [ldiskfs]
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> ldiskfs_bread+0x27/0xc0 [ldiskfs]
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> osd_ldiskfs_write_record+0x169/0x360 [osd_ldiskfs]
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> osd_write+0xf8/0x230 [osd_ldiskfs]
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> dt_record_write+0x45/0x130 [obdclass]
>> Nov  4 09:29:09 mds1-002.lustre.cluster kernel: []
>> tgt_last_rcvd_update+0x732/0xef0 [ptlrpc]
>> 

Re: [lustre-discuss] ZFS w/Lustre problem

2020-11-10 Thread Steve Thompson

On Mon, 9 Nov 2020, Hans Henrik Happe wrote:


It sounds like this issue, but I'm not sure what your dnodesize is:

https://github.com/openzfs/zfs/issues/8458

ZFS 0.8.1+ on the receiving side should fix it. Then again, ZFS 0.8 is
not supported in Lustre 2.12, so it's a bit hard to restore without
copying the underlying devices.


Hans Henrik,

Many thanks for your input. I had in fact known about the dnodesize issue
and had tested a workaround. Unfortunately, that turned out not to be the
problem. Instead, I have tested a patch to zfs_send.c, which does appear to
have solved the issue. The zfs send/recv is still running, however; if it
completes successfully, I will post again with details of the patch.
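
For anyone following along, this is roughly the shape of the checks and of
the copy being tested; the pool, dataset, snapshot, and host names below are
placeholders, not the real ones here:

  zfs get dnodesize tank/lustre-mdt0        # the property asked about above (legacy vs auto/1k/...); placeholder dataset
  cat /sys/module/zfs/version               # confirm the receiving side runs ZFS 0.8.1 or newer
  zfs snapshot -r tank/lustre-mdt0@migrate
  zfs send -R tank/lustre-mdt0@migrate | ssh desthost zfs recv -F tank2/lustre-mdt0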


Steve
--

Steve Thompson E-mail:  smt AT vgersoft DOT com
Voyager Software LLC   Web: http://www DOT vgersoft DOT com
3901 N Charles St  VSW Support: support AT vgersoft DOT com
Baltimore MD 21218
  "186,282 miles per second: it's not just a good idea, it's the law"

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org