Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-10 Thread Andrew Elwell via lustre-discuss
On Wed, 11 May 2022 at 04:37, Laura Hild  wrote:
> The non-dummy SRP module is in the kmod-srp package, which isn't included in 
> the Lustre repository...

Thanks Laura,
Yeah, I realised that earlier in the week, and have rebuilt the srp
module from source via mlnxofedinstall, and sure enough installing
srp-4.9-OFED.4.9.4.1.6.1.kver.3.10.0_1160.49.1.el7_lustre.x86_64.x86_64.rpm
(gotta love those short names) gives me working srp again.

Hat tip to a DDN contact here (we owe him even more beers now) for
some extra tuning parameters:

  options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0

I'm pleased to say that it _seems_ to be working much better. I'd
done one half of the HA pairs earlier in the week, lfsck completed, a
full robinhood scan done (dropped the DB and rescanned from fresh),
and I'm just bringing the other half of the pairs up to the same
software stack now.
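
To make the tuning persistent, a minimal sketch (the file path is just
the usual modprobe.d convention, and the values are the ones above -
tune for your own hardware):

  # /etc/modprobe.d/ib_srp.conf
  options ib_srp cmd_sg_entries=255 indirect_sg_entries=2048 allow_ext_sg=1 ch_count=1 use_imm_data=0

  # reload with no SRP LUNs in use so the new parameters take effect,
  # then confirm the running values
  modprobe -r ib_srp && modprobe ib_srp
  grep -H . /sys/module/ib_srp/parameters/*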

A couple of pointers for anyone caught in the same boat - things we
apparently did correctly:
* upgrade your e2fsprogs to the latest - if you're fsck'ing disks, make
sure you're not introducing more problems with a buggy old e2fsck
* tunefs.lustre --writeconf isn't too destructive (heed the warnings:
you'll lose pool info, but in our case that wasn't critical) - see the
sketch after this list
* monitoring is good, but honestly the rate of change, and the fact it
happened out of hours, means we likely couldn't have intervened anyway
* so quotas are better.
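
For reference, the writeconf sequence is the standard one from the
Lustre manual - a sketch only, with example device names, and every
target in the filesystem must be unmounted first:

  # run on the node hosting each target
  tunefs.lustre --writeconf /dev/mapper/MGS       # MGS first
  tunefs.lustre --writeconf /dev/mapper/MDT0000   # then every MDT
  tunefs.lustre --writeconf /dev/mapper/MDT0001
  tunefs.lustre --writeconf /dev/mapper/OST0000   # ...and every OST
  # remount in the same order: MGS, MDTs, then OSTs.
  # OST pool definitions are lost and need recreating afterwards
  # with lctl pool_new / lctl pool_add.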

Thanks to those who replied on and off-list - I'm just grateful we
only had the pair of MDTs, not the 40 (!!!) that Origin's getting
(yeah, I was watching the LUG talk last night) - service isn't quite
back to users but we're getting there!

Andrew


Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-10 Thread Laura Hild via lustre-discuss
Hi Andrew-

The non-dummy SRP module is in the kmod-srp package, which isn't included in 
the Lustre repository.  I'm less certain than I'd like to be, as ours is a DKMS 
setup rather than kmod, and the last time I had an SRP setup was a couple years 
ago, but I suspect you may have success if you fetch the full MLNX_OFED from

  
https://content.mellanox.com/ofed/MLNX_OFED-4.9-4.1.7.0/MLNX_OFED_LINUX-4.9-4.1.7.0-rhel7.9-x86_64.tgz

and rebuild it for the _lustre kernel (mlnxofedinstall --add-kernel-support 
--kmp).  When I do that, I get modules that load successfully into the kernel 
with the kmods from the Lustre repository.
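
Roughly something like this (a sketch from memory - the tarball name
matches the link above, and the rebuilt output location may differ by
MOFED version):

  tar xf MLNX_OFED_LINUX-4.9-4.1.7.0-rhel7.9-x86_64.tgz
  cd MLNX_OFED_LINUX-4.9-4.1.7.0-rhel7.9-x86_64
  # run while booted into (or building against) the _lustre kernel
  ./mlnxofedinstall --add-kernel-support --kmp
  # the rebuilt package set, including kmod-srp, is written out as a
  # new MLNX_OFED_LINUX-*-ext tarball to install from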

-Laura



Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-08 Thread Andrew Elwell via lustre-discuss
On Fri, 6 May 2022 at 20:04, Andreas Dilger  wrote:
> MOFED is usually preferred over in-kernel OFED, it is just tested and fixed a 
> lot more.

Fair enough. However, is the 2.12.8-ib tree built with all the features?
Specifically:
https://downloads.whamcloud.com/public/lustre/lustre-2.12.8-ib/MOFED-4.9-4.1.7.0/el7/server/

If I compare the ib_srp module from 2.12 in-kernel

[root@astrofs-oss3 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
[root@astrofs-oss3 ~]# rpm -qf
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
kernel-3.10.0-1160.49.1.el7_lustre.x86_64
[root@astrofs-oss3 ~]# modinfo ib_srp
filename:
/lib/modules/3.10.0-1160.49.1.el7_lustre.x86_64/kernel/drivers/infiniband/ulp/srp/ib_srp.ko.xz
license:Dual BSD/GPL
description:InfiniBand SCSI RDMA Protocol initiator
author: Roland Dreier
retpoline:  Y
rhelversion:7.9
srcversion: 1FB80E3A962EE7F39AD3959
depends:ib_core,scsi_transport_srp,ib_cm,rdma_cm
intree: Y
vermagic:   3.10.0-1160.49.1.el7_lustre.x86_64 SMP mod_unload modversions
signer: CentOS Linux kernel signing key
sig_key:FA:A3:27:4B:D9:17:36:F0:FD:43:6A:42:1B:6A:A4:FA:FE:D0:AC:FA
sig_hashalgo:   sha256
parm:   srp_sg_tablesize:Deprecated name for cmd_sg_entries (uint)
parm:   cmd_sg_entries:Default number of gather/scatter
entries in the SRP command (default is 12, max 255) (uint)
parm:   indirect_sg_entries:Default max number of
gather/scatter entries (default is 12, max is 2048) (uint)
parm:   allow_ext_sg:Default behavior when there are more than
cmd_sg_entries S/G entries after mapping; fails the request when false
(default false) (bool)
parm:   topspin_workarounds:Enable workarounds for
Topspin/Cisco SRP target bugs if != 0 (int)
parm:   prefer_fr:Whether to use fast registration if both FMR
and fast registration are supported (bool)
parm:   register_always:Use memory registration even for
contiguous memory regions (bool)
parm:   never_register:Never register memory (bool)
parm:   reconnect_delay:Time between successive reconnect attempts
parm:   fast_io_fail_tmo:Number of seconds between the
observation of a transport layer error and failing all I/O. "off"
means that this functionality is disabled.
parm:   dev_loss_tmo:Maximum number of seconds that the SRP
transport should insulate transport layer errors. After this time has
been exceeded the SCSI host is removed. Should be between 1 and
SCSI_DEVICE_BLOCK_MAX_TIMEOUT if fast_io_fail_tmo has not been set.
"off" means that this functionality is disabled.
parm:   ch_count:Number of RDMA channels to use for
communication with an SRP target. Using more than one channel improves
performance if the HCA supports multiple completion vectors. The
default value is the minimum of four times the number of online CPU
sockets and the number of completion vectors supported by the HCA.
(uint)
parm:   use_blk_mq:Use blk-mq for SRP (bool)
[root@astrofs-oss3 ~]#

... it all looks normal and capable of mounting our ExaScaler LUNs.

cf. the one from 2.12.8-ib:

================================================================================================
 Package                    Arch      Version                      Repository          Size
================================================================================================
Installing:
 kernel                     x86_64    3.10.0-1160.49.1.el7_lustre  lustre-2.12-mofed    50 M
 kmod-lustre-osd-ldiskfs    x86_64    2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   469 k
 lustre                     x86_64    2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   805 k
Installing for dependencies:
 kmod-lustre                x86_64    2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed   3.9 M
 kmod-mlnx-ofa_kernel       x86_64    4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed   1.3 M
 lustre-osd-ldiskfs-mount   x86_64    2.12.8_6_g5457c37-1.el7      lustre-2.12-mofed    15 k
 mlnx-ofa_kernel            x86_64    4.9-OFED.4.9.4.1.7.1         lustre-2.12-mofed   108 k

[root@astrofs-oss1 ~]# find /lib/modules/`uname -r` -name ib_srp.ko.xz

Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-06 Thread Laura Hild via lustre-discuss
Absolutely try MOFED.  The problem you're describing is extremely similar to 
one we were dealing with in March after we patched to 2.12.8, right down to 
those call traces.  Went away when we switched.

-Laura



Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-06 Thread Andreas Dilger via lustre-discuss
On May 5, 2022, at 07:16, Andrew Elwell via lustre-discuss
<lustre-discuss@lists.lustre.org> wrote:
> I've got a case open with the vendor to see if there are any firmware
> updates - but I'm not hopeful. These are 6-core, single-socket
> Broadwells with 128G of RAM; storage disks are mounted over SRP from
> a DDN appliance. Would jumping to MOFED make a difference? Otherwise
> I'm open to suggestions, as it's getting very tiring wrangling servers
> back to life.

MOFED is usually preferred over in-kernel OFED; it is just tested and fixed a
lot more.

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud


Re: [lustre-discuss] Corrupted? MDT not mounting

2022-05-05 Thread Andrew Elwell via lustre-discuss
> It's looking more like something filled up our space - I'm just
> copying the files out as a backup (mounted as ldiskfs just now) -

Ahem. Inode quotas are a good idea. Turns out that a user creating
about 130 million directories rapidly is more than a small MDT volume
can take.
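
For anyone wanting to put a cap in place, a rough example (the mount
point, username and limits are made up - pick numbers that suit your
MDT):

  # per-user inode limits, set from any client as root
  lfs setquota -u someuser -i 900000 -I 1000000 /lustre/astrofs
  # check what they're actually using
  lfs quota -u someuser /lustre/astrofs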

An update on recovery progress: upgrading the MDS to 2.12 got us over
the issue in LU-12674 enough to recover, and I've migrated half (one
of the HA pairs) of the OSSs to RHEL 7.9 / Lustre 2.12.8 too.

It needed a set of writeconfs doing before they'd mount, and e2fsck
has run over any suspect LUNs. The filesystem "works" in that under
light testing I can read/write OK, but as soon as it gets stressed,
the OSSs fall over:

[ 1226.864430] BUG: unable to handle kernel NULL pointer dereference
at   (null)
[ 1226.872281] IP: [] __list_add+0x1b/0xc0
[ 1226.877699] PGD 1ffba0d067 PUD 1ffa48e067 PMD 0
[ 1226.882360] Oops:  [#1] SMP
[ 1226.885619] Modules linked in: osp(OE) ofd(OE) lfsck(OE) ost(OE)
mgc(OE) osd_ldiskfs(OE) ldiskfs(OE) lquota(OE) fid(OE) fld(OE)
ko2iblnd(OE) ptlrpc(OE) obdclass(OE) lnet(OE) libcfs(OE)
dm_round_robin ib_srp scsi_transport_srp scsi_tgt tcp_diag inet_diag
ib_isert iscsi_target_mod target_core_mod rpcrdma rdma_ucm ib_iser
ib_umad bonding rdma_cm ib_ipoib iw_cm libiscsi scsi_transport_iscsi
ib_cm mlx4_ib ib_uverbs ib_core sunrpc ext4 mbcache jbd2 sb_edac
intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel iTCO_wdt kvm
iTCO_vendor_support irqbypass crc32_pclmul ghash_clmulni_intel
aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr
i2c_i801 lpc_ich mei_me joydev mei sg ioatdma wmi ipmi_si ipmi_devintf
ipmi_msghandler dm_multipath acpi_pad acpi_power_meter dm_mod
ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mlx4_en
ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm
drm igb ahci libahci mpt2sas mlx4_core ptp crct10dif_pclmul
crct10dif_common libata crc32c_intel pps_core dca raid_class devlink
i2c_algo_bit drm_panel_orientation_quirks scsi_transport_sas nfit
libnvdimm [last unloaded: scsi_tgt]
[ 1226.987670] CPU: 6 PID: 366 Comm: kworker/u24:6 Kdump: loaded
Tainted: G   OE  
3.10.0-1160.49.1.el7_lustre.x86_64 #1
[ 1227.000168] Hardware name: SGI.COM CH-C1104-GP6/X10SRW-F, BIOS 3.1 06/06/2018
[ 1227.007310] Workqueue: rdma_cm cma_work_handler [rdma_cm]
[ 1227.012725] task: 934839f0b180 ti: 934836c2 task.ti:
934836c2
[ 1227.020195] RIP: 0010:[]  []
__list_add+0x1b/0xc0
[ 1227.028036] RSP: 0018:934836c23d68  EFLAGS: 00010246
[ 1227.09] RAX:  RBX: 934836c23d90 RCX: 
[ 1227.040463] RDX: 932fa518e680 RSI:  RDI: 934836c23d90
[ 1227.047587] RBP: 934836c23d80 R08:  R09: b2df8c1b3dcb3100
[ 1227.054712] R10: b2df8c1b3dcb3100 R11: 00ff R12: 932fa518e680
[ 1227.061835] R13:  R14:  R15: 932fa518e680
[ 1227.068958] FS:  () GS:93483f38()
knlGS:
[ 1227.077034] CS:  0010 DS:  ES:  CR0: 80050033
[ 1227.082772] CR2:  CR3: 001fe47a8000 CR4: 003607e0
[ 1227.089895] DR0:  DR1:  DR2: 
[ 1227.097020] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1227.104142] Call Trace:
[ 1227.106593]  [] __mutex_lock_slowpath+0xa6/0x1d0
[ 1227.112770]  [] ? __switch_to+0xce/0x580
[ 1227.118255]  [] mutex_lock+0x1f/0x2f
[ 1227.123399]  [] cma_work_handler+0x25/0xa0 [rdma_cm]
[ 1227.129922]  [] process_one_work+0x17f/0x440
[ 1227.135752]  [] worker_thread+0x126/0x3c0
[ 1227.141324]  [] ? manage_workers.isra.26+0x2a0/0x2a0
[ 1227.147849]  [] kthread+0xd1/0xe0
[ 1227.152729]  [] ? insert_kthread_work+0x40/0x40
[ 1227.158822]  [] ret_from_fork_nospec_begin+0x7/0x21
[ 1227.165260]  [] ? insert_kthread_work+0x40/0x40
[ 1227.171348] Code: ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00
55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 4c 8b 42 08 48 89 fb 49
39 f0 75 2a <4d> 8b 45 00 4d 39 c4 75 68 4c 39 e3 74 3e 4c 39 eb 74 39
49 89
[ 1227.191295] RIP  [] __list_add+0x1b/0xc0
[ 1227.196798]  RSP 
[ 1227.200284] CR2: 


and I'm able to reproduce this on multiple servers :-/

I can see a few mentions (https://access.redhat.com/solutions/4969471,
for example) that seem to hint it's triggered by low memory, but they
also say it's fixed in the Red Hat 7.9 kernel (and we're running the
2.12.8 stock 3.10.0-1160.49.1.el7_lustre.x86_64).

I've got a case open with the vendor to see if there are any firmware
updates - but I'm not hopeful. These are 6-core, single-socket
Broadwells with 128G of RAM; storage disks are mounted over SRP from
a DDN appliance. Would jumping to MOFED make a difference? Otherwise
I'm open to suggestions, as it's getting very tiring wrangling servers
back to life.

[root@astrofs-oss1 ~]# ls -l /var/crash/ | grep 2022
drwxr-xr-x 2 root root 44 

Re: [lustre-discuss] Corrupted? MDT not mounting

2022-04-20 Thread Andrew Elwell via lustre-discuss
Thanks Stéphane,

It's looking more like something filled up our space - I'm just
copying the files out as a backup (mounted as ldiskfs just now).
We're running DNE (MDT and this one, MDT0001), but I don't
understand why so much space is being taken up in REMOTE_PARENT_DIR -
we seem to have actual user data stashed in there:


[root@astrofs-mds2 SSINS_uvfits]# pwd
/mnt/REMOTE_PARENT_DIR/0xa40002340:0x1:0x0/MWA/data/1061313128/SSINS_uvfits
[root@astrofs-mds2 SSINS_uvfits]# ls -l
total 0
-rw-rw-r--+ 1 redacted redacted 67153694400 Oct  9  2018
1061313128_noavg_noflag_00.uvfits
-rw-rw-r--+ 1 redacted redacted   0 Oct  9  2018
1061313128_noavg_noflag_01.uvfits
[root@astrofs-mds2 SSINS_uvfits]#

and although this one was noticeably large, it's not the only non-zero
sized file under REMOTE_PARENT_DIR:
[root@astrofs-mds2 1061314832]# ls -l | head
total 116
-rw-rw-r--+ 1 redacted redacted7338240 Nov 14  2017 1061314832_01.mwaf
-rw-rw-r--+ 1 redacted redacted7338240 Nov 14  2017 1061314832_02.mwaf
-rw-rw-r--+ 1 redacted redacted7404480 Nov 14  2017 1061314832_03.mwaf
-rw-rw-r--+ 1 redacted redacted7404480 Nov 14  2017 1061314832_04.mwaf
-rw-rw-r--+ 1 redacted redacted7338240 Nov 14  2017 1061314832_05.mwaf
-rw-rw-r--+ 1 redacted redacted7338240 Nov 14  2017 1061314832_06.mwaf
-rw-rw-r--+ 1 redacted redacted7404480 Nov 14  2017 1061314832_07.mwaf
-rw-rw-r--+ 1 redacted redacted7404480 Nov 14  2017 1061314832_08.mwaf
-rw-rw-r--+ 1 redacted redacted7404480 Nov 14  2017 1061314832_09.mwaf
[root@astrofs-mds2 1061314832]# pwd
/mnt/REMOTE_PARENT_DIR/0xa40002340:0x1:0x0/MWA/data/1061314832
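
(The hex triple in that path looks like a Lustre FID; assuming the
filesystem is mounted on a client at, say, /lustre/astrofs, it should
be resolvable back to a visible path with:

  lfs fid2path /lustre/astrofs '[0xa40002340:0x1:0x0]'

which might at least show which directory tree is landing there.)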

Suggestions for how to clean up and recover, anyone?

Andrew


Re: [lustre-discuss] Corrupted? MDT not mounting

2022-04-19 Thread Stephane Thiell via lustre-discuss
Hi Andrew,

kernel: LustreError: 13921:0:(genops.c:478:class_register_device())
astrofs-OST-osc-MDT0001: already exists, won't add

is symptomatic of a llog index issue/mismatch on the MDT vs. MGT. I would check 
if the llog backup of MDT0001 (over ldiskfs in CONFIGS) matches the one on the 
MGT. The llog indexes should match. If your MDT ran out of space/inodes (!), 
perhaps the llog backup has failed somehow or got corrupted. There are multiple 
patches in 2.12+ that address various issues with config llog (for example, if 
you used llog_cancel). I don’t think lfsck can repair config llog issues.
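
One way to eyeball both copies (a sketch - log names follow the
<fsname>-<target> convention, and the ldiskfs mount point is an
example):

  # on the MGS, while mounted as lustre, print what the MGT holds:
  lctl --device MGS llog_print astrofs-MDT0001
  # on the MDS, with the MDT mounted as ldiskfs, dump the local copy:
  llog_reader /mnt/mdt1-ldiskfs/CONFIGS/astrofs-MDT0001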

At worst, you could try doing a writeconf on the whole filesystem.

Good luck,
Stéphane


> On Apr 19, 2022, at 2:40 AM, Andrew Elwell via lustre-discuss 
>  wrote:
> 
> Hi Folks,
> 
> One of our filesystems seemed to fail over the holiday weekend - we're
> running DNE and MDT0001 won't mount. At first it looked like we'd run
> out of space (rc = -28) but then we were seeing this
> 
> mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001
> failed: File exists retries left: 0
> mount.lustre: mount /dev/mapper/MDT0001 at /lustre/astrofs-MDT0001
> failed: File exists
> 
> possibly
> kernel: LustreError: 13921:0:(genops.c:478:class_register_device())
> astrofs-OST-osc-MDT0001: already exists, won't add
> 
> lustre_rmmod wouldn't remove everything cleanly (osc in use) and so
> after a reboot everything *seemed* to start OK
> 
> [root@astrofs-mds1 ~]# mount -t lustre
> /dev/mapper/MGS on /lustre/MGS type lustre (ro)
> /dev/mapper/MDT on /lustre/astrofs-MDT type lustre (ro)
> /dev/mapper/MDT0001 on /lustre/astrofs-MDT0001 type lustre (ro)
> 
> ... but not for long
> 
> kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add())
> ASSERTION( ctxt ) failed:
> kernel: LustreError: 12355:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
> 
> possibly corrupt llog?
> 
> I see LU-12674 which looks like our problem, but only backported to
> 2.12 branch (these servers are still 2.10.8)
> 
> Piecing together what *might* have happened: a user possibly ran out
> of inodes and then did an rm -r before the system stopped responding.
> 
> Mounting just now I'm getting:
> [ 1985.078422] LustreError: 10953:0:(llog.c:654:llog_process_thread())
> astrofs-OST0001-osc-MDT0001: Local llog found corrupted #0x7ede0:1:0
> plain index 35518 count 2
> [ 1985.095129] LustreError:
> 10959:0:(llog_osd.c:961:llog_osd_next_block()) astrofs-MDT0001-osd:
> invalid llog tail at log id [0x7ef40:0x1:0x0]:0 offset 577536 bytes
> 4096
> [ 1985.109892] LustreError:
> 10959:0:(osp_sync.c:1242:osp_sync_thread())
> astrofs-OST0004-osc-MDT0001: llog process with osp_sync_process_queues
> failed: -22
> [ 1985.126797] LustreError:
> 10973:0:(llog_cat.c:269:llog_cat_id2handle())
> astrofs-OST000b-osc-MDT0001: error opening log id [0x7ef76:0x1:0x0]:0:
> rc = -2
> [ 1985.140169] LustreError:
> 10973:0:(llog_cat.c:823:llog_cat_process_cb())
> astrofs-OST000b-osc-MDT0001: cannot find handle for llog
> [0x7ef76:0x1:0x0]: rc = -2
> [ 1985.155321] Lustre: astrofs-MDT0001: Imperative Recovery enabled,
> recovery window shrunk from 300-900 down to 150-900
> [ 1985.169404] Lustre: astrofs-MDT0001: in recovery but waiting for
> the first client to connect
> [ 1985.177869] Lustre: astrofs-MDT0001: Will be in recovery for at
> least 2:30, or until 1508 clients reconnect
> [ 1985.187612] Lustre: astrofs-MDT0001: Connection restored to
> a5e41149-73fc-b60a-30b1-da096a5c2527 (at 1170@gni1)
> [ 2017.251374] Lustre: astrofs-MDT0001: Connection restored to
> 7a388f58-bc16-6bd7-e0c8-4ffa7c0dd305 (at 400@gni1)
> [ 2017.261374] Lustre: Skipped 1275 previous similar messages
> [ 2081.458117] Lustre: astrofs-MDT0001: Connection restored to
> 10.10.36.143@o2ib4 (at 10.10.36.143@o2ib4)
> [ 2081.467419] Lustre: Skipped 277 previous similar messages
> [ 2082.324547] Lustre: astrofs-MDT0001: Recovery over after 1:37, of
> 1508 clients 1508 recovered and 0 were evicted.
> 
> Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
> kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add())
> ASSERTION( ctxt ) failed:
> 
> Message from syslogd@astrofs-mds2 at Apr 19 17:32:49 ...
> kernel: LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) LBUG
> [ 2082.392381] LustreError:
> 11082:0:(osp_sync.c:343:osp_sync_declare_add()) ASSERTION( ctxt )
> failed:
> [ 2082.401422] LustreError: 11082:0:(osp_sync.c:343:osp_sync_declare_add()) 
> LBUG
> [ 2082.408558] Pid: 11082, comm: orph_cleanup_as
> 3.10.0-957.1.3.el7_lustre.x86_64 #1 SMP Mon May 27 03:45:37 UTC 2019
> [ 2082.418891] Call Trace:
> [ 2082.421340]  [] libcfs_call_trace+0x8c/0xc0 [libcfs]
> [ 2082.427890]  [] lbug_with_loc+0x4c/0xa0 [libcfs]
> [ 2082.434077]  [] osp_sync_declare_add+0x3a9/0x3e0 [osp]
> [ 2082.440797]  [] osp_declare_destroy+0xc9/0x1c0 [osp]
> [ 2082.447338]  [] lod_sub_declare_destroy+0xce/0x2d0 [lod]
> [ 2082.454237]  [] lod_obj_stripe_destroy_cb+0x85/0x90 [lod]
> [