Re: [lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers

2023-05-10 Thread Jeff Johnson
Jane,

You're having hardware errors. The codes in those mpt3sas errors decode to
"PL_LOGINFO_SUB_CODE_OPEN_FAILURE_ORR_TIMEOUT"; in other words, your SAS
HBA cannot open a command dialogue with your disk. I'd suspect backplane or
cabling issues, since an internal disk failure would be reported by the target
disk with its own error code. In this case your HBA can't even talk to it
properly.
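
For reference, here is a rough sketch of how a log_info word such as
0x3112011a from the mpt3sas lines in your log splits into the originator,
code and sub_code fields the driver prints. It assumes the bit layout used
by the Linux mpt3sas driver (4 bits bus type, 4 bits originator, 8 bits
code, 16 bits sub_code); the function names and the regular expression are
made up for illustration and are not part of any tool.

```python
import re

# Assumed layout, taken from the Linux mpt3sas driver's loginfo union:
# bits 31-28 bus type, 27-24 originator, 23-16 code, 15-0 sub_code.
ORIGINATORS = {0x0: "IOP", 0x1: "PL", 0x2: "IR"}

# Hypothetical pattern for the "log_info(0x........)" token in a syslog line.
LOG_INFO_RE = re.compile(r"log_info\(0x([0-9a-fA-F]{8})\)")


def decode_log_info(value):
    """Split a 32-bit mpt3sas log_info word into its fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": ORIGINATORS.get((value >> 24) & 0xF, "unknown"),
        "code": (value >> 16) & 0xFF,
        "sub_code": value & 0xFFFF,
    }


def decode_line(line):
    """Decode the log_info word in a syslog line, or return None if absent."""
    match = LOG_INFO_RE.search(line)
    if match is None:
        return None
    return decode_log_info(int(match.group(1), 16))


if __name__ == "__main__":
    sample = (
        "kernel: mpt3sas_cm1: log_info(0x3112011a): "
        "originator(PL), code(0x12), sub_code(0x011a)"
    )
    fields = decode_line(sample)
    print({k: (hex(v) if isinstance(v, int) else v) for k, v in fields.items()})
```

The decoded fields should match what the driver already prints on the same
line (originator PL, code 0x12, sub_code 0x011a). PL is the protocol layer
of the HBA firmware, which is why I read this as a link or transport problem
rather than a fault reported by the disk itself.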

Is sdah the partner mpath device to sdef? Or is sdah a second failing disk
interface?

Looking at this, I don't think your hardware is deploy-ready.

--Jeff



On Wed, May 10, 2023 at 9:29 AM Jane Liu via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> We recently attempted to add several new OSS servers (RHEL 8.7 and
> Lustre 2.15.2). While creating new OSTs, I noticed that mdstat reported
> some disk failures after the mkfs, even though the disks were functional
> before the mkfs command. Our hardware admins managed to resolve the
> mdstat issue and restore the disks to normal operation. However, when I
> ran the OST mount command (while the network had a problem, so the mount
> timed out), similar problems occurred, and several disks were kicked
> out. The relevant /var/log/messages entries are provided below.
>
> This problem was consistent across all our OSS servers. Any insights
> into the possible cause would be appreciated.
>
> Jane
>
> -
>
> May  9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem
> with ordered data mode. Opts: errors=remount-ro
> May  9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
> May  9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores:
> 72, npartitions: 2
> May  9 13:33:16 sphnxoss47 kernel: alg: No test for adler32
> (adler32-zlib)
> May  9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
> May  9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
> May  9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
> May  9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp
> [8/256/0/180]
> May  9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
> May  9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem
> with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> May  9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled
> 'large_dir' feature on device /dev/md0
> May  9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user
> root.
> May  9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
> May  9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244:
> cannot register this server with the MGS: rc = -110. Is the MGS running?
> May  9 13:34:36 sphnxoss47 kernel: LustreError:
> 45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start
> targets: -110
> May  9 13:34:36 sphnxoss47 kernel: LustreError:
> 45314:0:(obd_mount_server.c:1644:server_put_super()) no obd
> sphnx01-OST0244
> May  9 13:34:36 sphnxoss47 kernel: LustreError:
> 45314:0:(obd_mount_server.c:131:server_deregister_mount())
> sphnx01-OST0244 not registered
> May  9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244
> complete
> May  9 13:34:39 sphnxoss47 kernel: LustreError:
> 45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount
> : rc = -110
> May  9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem
> with ordered data mode. Opts: errors=remount-ro
> May  9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
> May  9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem
> with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
> May  9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled
> 'large_dir' feature on device /dev/md1
> May  9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245:
> cannot register this server with the MGS: rc = -110. Is the MGS running?
> May  9 13:36:00 sphnxoss47 kernel: LustreError:
> 46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start
> targets: -110
> May  9 13:36:00 sphnxoss47 kernel: LustreError:
> 46127:0:(obd_mount_server.c:1644:server_put_super()) no obd
> sphnx01-OST0245
> May  9 13:36:00 sphnxoss47 kernel: LustreError:
> 46127:0:(obd_mount_server.c:131:server_deregister_mount())
> sphnx01-OST0245 not registered
> May  9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245
> complete
> May  9 13:36:08 sphnxoss47 kernel: LustreError:
> 46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount
> : rc = -110
> May  9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem
> with ordered data mode. Opts: errors=remount-ro
> May  9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
> May  9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem
> with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
>
> -
>
> it just repeats for all of the md 

[lustre-discuss] Disk failures triggered during OST creation and mounting on OSS Servers

2023-05-10 Thread Jane Liu via lustre-discuss

Hi,

We recently attempted to add several new OSS servers (RHEL 8.7 and
Lustre 2.15.2). While creating new OSTs, I noticed that mdstat reported
some disk failures after the mkfs, even though the disks were functional
before the mkfs command. Our hardware admins managed to resolve the
mdstat issue and restore the disks to normal operation. However, when I
ran the OST mount command (while the network had a problem, so the mount
timed out), similar problems occurred, and several disks were kicked
out. The relevant /var/log/messages entries are provided below.


This problem was consistent across all our OSS servers. Any insights 
into the possible cause would be appreciated.


Jane

-

May  9 13:33:15 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem 
with ordered data mode. Opts: errors=remount-ro

May  9 13:33:15 sphnxoss47 systemd[1]: tmp-mntmirJ5z.mount: Succeeded.
May  9 13:33:16 sphnxoss47 kernel: LNet: HW NUMA nodes: 2, HW CPU cores: 
72, npartitions: 2
May  9 13:33:16 sphnxoss47 kernel: alg: No test for adler32 
(adler32-zlib)

May  9 13:33:16 sphnxoss47 kernel: Key type ._llcrypt registered
May  9 13:33:16 sphnxoss47 kernel: Key type .llcrypt registered
May  9 13:33:16 sphnxoss47 kernel: Lustre: Lustre: Build Version: 2.15.2
May  9 13:33:16 sphnxoss47 kernel: LNet: Added LNI 169.254.1.2@tcp 
[8/256/0/180]

May  9 13:33:16 sphnxoss47 kernel: LNet: Accept secure, port 988
May  9 13:33:17 sphnxoss47 kernel: LDISKFS-fs (md0): mounted filesystem 
with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May  9 13:33:17 sphnxoss47 kernel: Lustre: sphnx01-OST0244-osd: enabled 
'large_dir' feature on device /dev/md0
May  9 13:33:25 sphnxoss47 systemd-logind[8609]: New session 7 of user 
root.

May  9 13:33:25 sphnxoss47 systemd[1]: Started Session 7 of user root.
May  9 13:34:36 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0244: 
cannot register this server with the MGS: rc = -110. Is the MGS running?
May  9 13:34:36 sphnxoss47 kernel: LustreError: 
45314:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start 
targets: -110
May  9 13:34:36 sphnxoss47 kernel: LustreError: 
45314:0:(obd_mount_server.c:1644:server_put_super()) no obd 
sphnx01-OST0244
May  9 13:34:36 sphnxoss47 kernel: LustreError: 
45314:0:(obd_mount_server.c:131:server_deregister_mount()) 
sphnx01-OST0244 not registered
May  9 13:34:39 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0244 
complete
May  9 13:34:39 sphnxoss47 kernel: LustreError: 
45314:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount 
: rc = -110
May  9 13:34:40 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem 
with ordered data mode. Opts: errors=remount-ro

May  9 13:34:40 sphnxoss47 systemd[1]: tmp-mntXT85fz.mount: Succeeded.
May  9 13:34:41 sphnxoss47 kernel: LDISKFS-fs (md1): mounted filesystem 
with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
May  9 13:34:41 sphnxoss47 kernel: Lustre: sphnx01-OST0245-osd: enabled 
'large_dir' feature on device /dev/md1
May  9 13:36:00 sphnxoss47 kernel: LustreError: 15f-b: sphnx01-OST0245: 
cannot register this server with the MGS: rc = -110. Is the MGS running?
May  9 13:36:00 sphnxoss47 kernel: LustreError: 
46127:0:(obd_mount_server.c:2027:server_fill_super()) Unable to start 
targets: -110
May  9 13:36:00 sphnxoss47 kernel: LustreError: 
46127:0:(obd_mount_server.c:1644:server_put_super()) no obd 
sphnx01-OST0245
May  9 13:36:00 sphnxoss47 kernel: LustreError: 
46127:0:(obd_mount_server.c:131:server_deregister_mount()) 
sphnx01-OST0245 not registered
May  9 13:36:08 sphnxoss47 kernel: Lustre: server umount sphnx01-OST0245 
complete
May  9 13:36:08 sphnxoss47 kernel: LustreError: 
46127:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount 
: rc = -110
May  9 13:36:08 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem 
with ordered data mode. Opts: errors=remount-ro

May  9 13:36:08 sphnxoss47 systemd[1]: tmp-mnt17IOaq.mount: Succeeded.
May  9 13:36:09 sphnxoss47 kernel: LDISKFS-fs (md2): mounted filesystem 
with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc

-

This just repeats for all of the md RAIDs; then the errors start, and the
drive fails and is disabled:


May  9 13:44:31 sphnxoss47 kernel: LustreError: 
48069:0:(super25.c:176:lustre_fill_super()) llite: Unable to mount 
: rc = -110
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): 
originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): 
originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): 
originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): 
originator(PL), code(0x12), sub_code(0x011a)
May  9 13:44:33 sphnxoss47 kernel: mpt3sas_cm1: log_info(0x3112011a): 
originator(PL), code(0x12), 

[lustre-discuss] Build Lustre from source ldiskfs

2023-05-10 Thread Nick dan via lustre-discuss
Hi

Is there a document that describes how to build Lustre with ldiskfs from source?

Thanks and regards
Nick
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org