Re: [lustre-discuss] Unresponsiveness of OSS and Directory Listing Hang-up

2023-05-22 Thread Jane Liu via lustre-discuss

Hi,

We found a solution to the directory-listing hang-ups and the
unresponsive OSS. There appeared to be metadata inconsistency on the
MDT. Since the on-the-fly "lctl lfsck_start ..." did not work, we
tried an offline ext4-level fix: we unmounted the MDT and ran e2fsck
to repair the metadata inconsistencies. The command we executed was:


e2fsck -fp /dev/mapper/mds01-mds01

After the MDT was remounted, everything just worked.
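
For anyone hitting the same problem, the rough offline repair sequence
looks like the following. The device path is ours; the mount point is
just a placeholder, and the read-only pass first is only a precaution:

  umount /mnt/mdt                       # unmount the MDT (mount point is a placeholder)
  e2fsck -fn /dev/mapper/mds01-mds01    # optional read-only pass to preview problems
  e2fsck -fp /dev/mapper/mds01-mds01    # force check; preen mode fixes safe inconsistencies
  mount -t lustre /dev/mapper/mds01-mds01 /mnt/mdt   # remount the MDT as lustre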

Jane


On 2023-05-18 16:44, Jane Liu via lustre-discuss wrote:

Hi,

We recently upgraded our Lustre servers to RHEL 8.7, along with
Lustre 2.15.2. After running smoothly for several weeks, we have
encountered an issue that is the same as the one reported at
https://jira.whamcloud.com/browse/LU-10697. Although the Lustre
version described there differs from ours, the symptoms are
identical.

“uname -a” on our system returns the following output:
4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023
x86_64 x86_64 x86_64 GNU/Linux. And the content of /etc/redhat-release
is: Red Hat Enterprise Linux release 8.7 (Ootpa)

Here are the details about the issue.

On 5/16, around 4:47 am, one OSS named oss29 began experiencing
problems. The number of active requests rose rapidly from 1 to 123
between roughly 4:47 am and 10:20 am. At around 10:20 am on 5/16, I/O
on oss29 stopped entirely, and the number of active requests stayed
at 123. Concurrently, the system load shot up from a very low value
to as high as 400 within the same 4:47 am to 10:20 am window.
Interestingly, despite the extreme system load, the CPUs remained
idle.

Furthermore, when executing the lfs df command on the MDS, no OSTs on
oss29 were visible.

We noticed a lot of errors about “This server is not able to keep up
with request traffic (cpu-bound)” in the syslog on oss29:

May 16 05:44:52 oss29 kernel: Lustre: ost_io: This server is not able
to keep up with request traffic (cpu-bound).
May 16 06:13:49 oss29 kernel: Lustre: ost_io: This server is not able
to keep up with request traffic (cpu-bound).
May 16 06:23:39 oss29 kernel: Lustre: ost_io: This server is not able
to keep up with request traffic (cpu-bound).
May 16 06:32:56 oss29 kernel: Lustre: ost_io: This server is not able
to keep up with request traffic (cpu-bound).
May 16 06:42:46 oss29 kernel: Lustre: ost_io: This server is not able
to keep up with request traffic (cpu-bound).
…

At the same time, users also reported that a specific directory had
become inaccessible. Running ls -l on this directory hung, while a
plain ls worked. Users found they could read certain files within the
directory, but not all of them.

In an attempt to fix the situation, I tried to unmount the OSTs on
oss29, but this was unsuccessful. We then decided to reboot oss29 on
5/17 around 3:30 pm. However, upon its return, oss29 immediately
reverted to its previous unresponsive state with high load, and
listing the directory still hung.

I ran lfsck on the MDT, but it just hung.
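
For reference, the lfsck attempt was along the lines of the following;
the MDT target name below is approximate, since the index does not
appear in our logs:

  lctl lfsck_start -M sphnx01-MDT0000 -t all   # start layout and namespace scans on the MDT
  lctl lfsck_query -M sphnx01-MDT0000          # check scan progress
  lctl lfsck_stop  -M sphnx01-MDT0000          # abort the scan if needed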

Here are the related MDS syslog entries from that period:

May 16 05:09:13 sphnxmds01 kernel: Lustre:
sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at
10.42.73.42@tcp) was lost; in progress operations using this service
will wait for recovery to complete
May 16 05:09:13 sphnxmds01 kernel: LustreError:
384795:0:(osp_precreate.c:677:osp_precreate_send())
sphnx01-OST0192-osc-MDT: can't precreate: rc = -11
May 16 05:09:13 sphnxmds01 kernel: Lustre:
sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp
(at 10.42.73.42@tcp)
May 16 05:09:13 sphnxmds01 kernel: LustreError:
384795:0:(osp_precreate.c:1340:osp_precreate_thread())
sphnx01-OST0192-osc-MDT: cannot precreate objects: rc = -11

…
May 16 05:22:17 sphnxmds01 kernel: Lustre:
sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at
10.42.73.42@tcp) was lost; in progress operations using this service
will wait for recovery to complete
May 16 05:22:17 sphnxmds01 kernel: LustreError:
384795:0:(osp_precreate.c:967:osp_precreate_cleanup_orphans())
sphnx01-OST0192-osc-MDT: cannot cleanup orphans: rc = -11
May 16 05:22:17 sphnxmds01 kernel: Lustre:
sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp
(at 10.42.73.42@tcp)

And these are the OSS syslog entries:

May 16 04:47:21 oss29 kernel: Lustre: sphnx01-OST0192: Export
9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:26 oss29 kernel: Lustre: sphnx01-OST0192: Export
9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:31 oss29 kernel: Lustre: sphnx01-OST0192: Export
12e8b1d0 already connecting from 130.199.48.37@tcp
May 16 04:47:36 oss29 kernel: Lustre: sphnx01-OST0192: Export
9a00357a already

Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Peter Grandi via lustre-discuss
>>> On Mon, 22 May 2023 13:08:19 +0530, Nick dan via lustre-discuss 
>>>  said:

> Hi I had one doubt. In lustre, data is divided into stripes
> and stored in multiple OSTs. So each OST will have some part
> of data. My question is if one OST fails, will there be data
> loss?

This is extensively discussed in the Lustre manual with
comprehensive illustrations:

  https://doc.lustre.org/lustre_manual.xhtml#understandinglustre.storageio
  https://doc.lustre.org/lustre_manual.xhtml#pfl
  https://doc.lustre.org/lustre_manual.xhtml#understandingfailover

The usual practice is to use RAID10 for the MDT(s) on
"enterprise" high-endurance SSDs, and RAID6 for the OSTs on
"professional" mixed-load SSDs or "small" (1-2TB at most)
"datacenter" HDDs, fronted by failover servers.

I personally think it is best to rely on Lustre striping and the
"new" PFL/FLR layouts (across two OST "pools"), to have each OST on a
single device, and to have very few OSTs per OSS, when Lustre is used
as a "scratch" area for an HPC cluster.

  https://doc.lustre.org/lustre_manual.xhtml#flr
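
As a rough illustration (the pool names 'flash' and 'disk' are made
up), such a two-pool setup can be expressed with something like:

  # PFL: first 64 MiB on one 'flash' OST, the remainder striped over 4 'disk' OSTs
  lfs setstripe -E 64M -c 1 -p flash -E -1 -c 4 -p disk /scratch/somedir

  # FLR: one mirror in each pool for an individual file
  lfs mirror create -N -p flash -N -p disk /scratch/somefile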


Re: [lustre-discuss] Recovering MDT failure

2023-05-22 Thread Peter Grandi via lustre-discuss
>>> On Thu, 27 Apr 2023 10:20:54 +0100, Peter Grandi
>>>  said:

>> - When I started this system I tried to backup MDT data
>> - without success.

> [...] The ZFS version of MDT can do filesystem-level snapshots
> when mounted as 'zfs' instead of 'lustre'.

Just to be sure, even if this was already mentioned in a later
thread: even if the MDT is mounted as 'lustre', all 'zpool' and
'zfs' commands are still available by using the zpool and
filesystem dataset names instead of the mountpoint, including
snapshotting (snapshots are read-only). 'zfs snapshot' and
'zfs send' do not require mounting the filesystem dataset, but
(always read-only) snapshots of the MDT mounted as 'lustre' can
be mounted as 'zfs' anyhow. My guess is that the ZFS 'lustre'
filesystem type is just a wrapper around 'zfs', even more so
than for 'ldiskfs'.
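
As a rough sketch, with made-up pool and dataset names and the MDT
still mounted as 'lustre':

  zfs snapshot mdtpool/mdt0@backup-20230522             # read-only snapshot of the live MDT dataset
  zfs send mdtpool/mdt0@backup-20230522 | \
      ssh backuphost zfs receive backuppool/mdt0-copy   # stream the snapshot to another host
  mount -t zfs mdtpool/mdt0@backup-20230522 /mnt/mdt-snap  # the snapshot should also mount read-only as zfs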


Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Nick dan via lustre-discuss
Hi

Thank you for your reply

Yes, the OSTs must provide internal redundancy - RAID-6 typically
Can RAID-6 be replaced with mirror/RAID0?

Which type of RAID is recommended for the MDT and OSTs?

Also, can you briefly explain how data is read and written in Lustre
when ZFS is used as the backend filesystem?

Thanks and regards
Nick



On Mon, 22 May 2023 at 13:36, Andreas Dilger  wrote:

> Yes, the OSTs must provide internal redundancy - RAID-6 typically.
>
> There is File Level Redundancy (FLR = mirroring) possible in Lustre file
> layouts, but it is "unmanaged", so users or other system-level tools are
> required to resync FLR files if they are written after mirroring.
>
> Cheers, Andreas
>
> > On May 22, 2023, at 09:39, Nick dan via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
> >
> > 
> > Hi
> >
> > I had one doubt.
> > In lustre, data is divided into stripes and stored in multiple OSTs. So
> each OST will have some part of data.
> > My question is if one OST fails, will there be data loss?
> >
> > Please advise for the same.
> >
> > Thanks and regards
> > Nick
> > ___
> > lustre-discuss mailing list
> > lustre-discuss@lists.lustre.org
> > http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


Re: [lustre-discuss] Data stored in OST

2023-05-22 Thread Andreas Dilger via lustre-discuss
Yes, the OSTs must provide internal redundancy - RAID-6 typically. 

There is File Level Redundancy (FLR = mirroring) possible in Lustre file 
layouts, but it is "unmanaged", so users or other system-level tools are 
required to resync FLR files if they are written after mirroring.
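
For example, the manual workflow for a mirrored file looks roughly
like this (the file path is just illustrative):

  lfs mirror extend -N /mnt/lustre/data/file   # add a second mirror to an existing file
  lfs mirror resync /mnt/lustre/data/file      # resync after the file is written (one mirror is now stale)
  lfs mirror verify /mnt/lustre/data/file      # check that the mirrors are consistent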

Cheers, Andreas

> On May 22, 2023, at 09:39, Nick dan via lustre-discuss 
>  wrote:
> 
> 
> Hi
> 
> I had one doubt.
> In lustre, data is divided into stripes and stored in multiple OSTs. So each 
> OST will have some part of data. 
> My question is if one OST fails, will there be data loss?
> 
> Please advise for the same.
> 
> Thanks and regards
> Nick
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] Data stored in OST

2023-05-22 Thread Nick dan via lustre-discuss
Hi

I had one doubt.
In lustre, data is divided into stripes and stored in multiple OSTs. So
each OST will have some part of data.
My question is if one OST fails, will there be data loss?

Please advise for the same.

Thanks and regards
Nick