Re: [lustre-discuss] Unresponsiveness of OSS and Directory Listing Hang-up
Hi,

We found a solution to the directory listing hang-ups and the unresponsive OSS. There appeared to be metadata inconsistency on the MDT. Since the on-the-fly "lctl lfsck_start ..." did not work, we tried an offline ext4-level fix: we unmounted the MDT and ran e2fsck to repair the metadata inconsistencies. The command we executed was:

e2fsck -fp /dev/mapper/mds01-mds01

After the MDT was remounted, everything just worked.

Jane

On 2023-05-18 16:44, Jane Liu via lustre-discuss wrote:

Hi,

We recently upgraded our Lustre servers to RHEL 8.7 with Lustre 2.15.2. After running smoothly for several weeks, we have encountered an issue that is the same as the one reported at https://jira.whamcloud.com/browse/LU-10697 . Although the Lustre version described there differs from ours, the symptoms are identical.

"uname -a" on our system returns: 4.18.0-425.3.1.el8_lustre.x86_64 #1 SMP Wed Jan 11 23:55:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux, and the content of /etc/redhat-release is: Red Hat Enterprise Linux release 8.7 (Ootpa).

Here are the details of the issue. On 5/16, around 4:47 am, one OSS named oss29 began experiencing problems. The number of active requests rose rapidly from 1 to 123 between roughly 4:47 am and 10:20 am. At around 10:20 am, I/O on oss29 stopped entirely and the number of active requests stayed at 123. Over the same window the system load climbed sharply, from a very low value to as high as 400. Interestingly, despite the extreme system load, the CPUs remained idle. Furthermore, when executing the "lfs df" command on the MDS, no OSTs on oss29 were visible.
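For reference, the full offline repair sequence looked roughly like the following. This is a sketch, not a verbatim transcript: the mount point /mnt/mdt is an assumption (only the device name /dev/mapper/mds01-mds01 comes from our setup), and e2fsck must only ever be run on an unmounted MDT.

```shell
# Stop the MDT before any ext4-level repair; running e2fsck on a
# mounted device can corrupt the filesystem.
umount /mnt/mdt                      # mount point is an assumption

# -f forces a full check even if the filesystem looks clean;
# -p "preens", i.e. automatically fixes safe inconsistencies.
e2fsck -fp /dev/mapper/mds01-mds01

# If -p bails out because a fix needs confirmation, rerun interactively:
# e2fsck -f /dev/mapper/mds01-mds01

# Bring the MDT back as a Lustre target.
mount -t lustre /dev/mapper/mds01-mds01 /mnt/mdt
```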
We noticed many errors of the form "This server is not able to keep up with request traffic (cpu-bound)" in the syslog on oss29:

May 16 05:44:52 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:13:49 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:23:39 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:32:56 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
May 16 06:42:46 oss29 kernel: Lustre: ost_io: This server is not able to keep up with request traffic (cpu-bound).
…

At the same time, users reported that a specific directory had become inaccessible. Running "ls -l" on this directory hung, while a plain "ls" worked. Users found they could read certain files within the directory, but not all of them. In an attempt to fix the situation, I tried to unmount the OSTs on oss29, but this was unsuccessful. We then decided to reboot oss29 on 5/17 around 3:30 pm. However, after the system came back, oss29 immediately reverted to its previous unresponsive state with high load, and listing the directory still hung. I ran lfsck on the MDT, but it just hung.
Here are the related MDS syslog entries during that period:

May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress operations using this service will wait for recovery to complete
May 16 05:09:13 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:677:osp_precreate_send()) sphnx01-OST0192-osc-MDT: can't precreate: rc = -11
May 16 05:09:13 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)
May 16 05:09:13 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:1340:osp_precreate_thread()) sphnx01-OST0192-osc-MDT: cannot precreate objects: rc = -11
…
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection to sphnx01-OST0192 (at 10.42.73.42@tcp) was lost; in progress operations using this service will wait for recovery to complete
May 16 05:22:17 sphnxmds01 kernel: LustreError: 384795:0:(osp_precreate.c:967:osp_precreate_cleanup_orphans()) sphnx01-OST0192-osc-MDT: cannot cleanup orphans: rc = -11
May 16 05:22:17 sphnxmds01 kernel: Lustre: sphnx01-OST0192-osc-MDT: Connection restored to 10.42.73.42@tcp (at 10.42.73.42@tcp)

And these are the OSS syslog entries:

May 16 04:47:21 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:26 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already connecting from 130.199.206.80@tcp
May 16 04:47:31 oss29 kernel: Lustre: sphnx01-OST0192: Export 12e8b1d0 already connecting from 130.199.48.37@tcp
May 16 04:47:36 oss29 kernel: Lustre: sphnx01-OST0192: Export 9a00357a already
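For anyone hitting similar symptoms: the online consistency check we attempted is started from the MDS roughly as below. The filesystem name sphnx01 is taken from the logs above, and the target name assumes MDT index 0; in our case the check hung, which is why we fell back to the offline e2fsck.

```shell
# Start a full online LFSCK on the MDT (run on the MDS).
lctl lfsck_start -M sphnx01-MDT0000 -t all

# Check progress / current phase.
lctl lfsck_query -M sphnx01-MDT0000

# Stop it if it hangs, before falling back to offline e2fsck.
lctl lfsck_stop -M sphnx01-MDT0000
```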
Re: [lustre-discuss] Data stored in OST
>>> On Mon, 22 May 2023 13:08:19 +0530, Nick dan via lustre-discuss >>> said:

> Hi I had one doubt. In lustre, data is divided into stripes
> and stored in multiple OSTs. So each OST will have some part
> of data. My question is if one OST fails, will there be data
> loss?

This is extensively discussed in the Lustre manual with comprehensive illustrations:

https://doc.lustre.org/lustre_manual.xhtml#understandinglustre.storageio
https://doc.lustre.org/lustre_manual.xhtml#pfl
https://doc.lustre.org/lustre_manual.xhtml#understandingfailover

The usual practice is to use RAID10 for the MDT(s) on "enterprise" high-endurance SSDs, and RAID6 for the OSTs on "professional" mixed-load SSDs or "small" (1-2TB at most) "datacenter" HDDs, fronted by failover servers. I personally think it is best to rely on Lustre striping and the "new" PFL and FLR layouts (across two OST "pools"), with each OST on a single device and very few OSTs per OSS, when Lustre is used as a "scratch" area for an HPC cluster.

https://doc.lustre.org/lustre_manual.xhtml#flr

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
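A minimal sketch of such a setup, assuming two OST pools named "ssd" and "hdd" have already been defined with 'lctl pool_new' / 'lctl pool_add' (pool names, sizes, and paths here are illustrative, not from any real configuration):

```shell
# PFL layout on a directory: the first 1 GiB of each new file goes on a
# single stripe from the "ssd" pool, the remainder is striped 4 ways
# over the "hdd" pool.
lfs setstripe -E 1G -c 1 -p ssd -E -1 -c 4 -p hdd /mnt/lustre/scratch

# FLR: create a file mirrored across the two pools. Resync after writes
# is manual ("unmanaged") with 'lfs mirror resync'.
lfs mirror create -N1 -p ssd -N1 -p hdd /mnt/lustre/scratch/important.dat
```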
Re: [lustre-discuss] Recovering MDT failure
>>> On Thu, 27 Apr 2023 10:20:54 +0100, Peter Grandi >>> said:

>> - When I started this system I tried to backup MDT data
>> - without success.
> [...] The ZFS version of MDT can do filesystem-level snapshots
> when mounted as 'zfs' instead of 'lustre'.

Just to be sure, even if this was already mentioned in a later thread: even if the MDT is mounted as 'lustre', all 'zpool' and 'zfs' commands are still available by using the zpool and filesystem dataset names instead of the mountpoint, including snapshotting (snapshots are read-only). 'zfs snapshot' and 'zfs send' do not require mounting the filesystem dataset, and the (always read-only) snapshots of an MDT mounted as 'lustre' can still be mounted as 'zfs'. My guess is that, even more so than for 'ldiskfs', the ZFS 'lustre' filesystem type is just a wrapper around 'zfs'.
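A sketch of what that looks like in practice, assuming the MDT lives in a dataset named 'mdtpool/mdt0' (dataset, host, snapshot, and mount-point names are all illustrative):

```shell
# Snapshot the MDT dataset by name; this works even while it is
# mounted as type 'lustre', since zfs addresses the dataset directly.
zfs snapshot mdtpool/mdt0@mdt-backup-1

# Stream the snapshot to a backup host without mounting anything.
zfs send mdtpool/mdt0@mdt-backup-1 | ssh backuphost zfs receive tank/mdt0-backup

# The read-only snapshot can also be inspected locally as plain ZFS:
mount -t zfs mdtpool/mdt0@mdt-backup-1 /mnt/mdt-snap
```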
Re: [lustre-discuss] Data stored in OST
Hi,

Thank you for your reply.

> Yes, the OSTs must provide internal redundancy - RAID-6 typically

Can RAID-6 be replaced with mirror/RAID0? Which type of RAID is recommended for the MDT and the OSTs? Also, can you briefly explain how data is read/written in Lustre when ZFS is used as the backend filesystem?

Thanks and regards
Nick

On Mon, 22 May 2023 at 13:36, Andreas Dilger wrote:

> Yes, the OSTs must provide internal redundancy - RAID-6 typically.
>
> There is File Level Redundancy (FLR = mirroring) possible in Lustre file
> layouts, but it is "unmanaged", so users or other system-level tools are
> required to resync FLR files if they are written after mirroring.
>
> Cheers, Andreas
>
> > On May 22, 2023, at 09:39, Nick dan via lustre-discuss
> > <lustre-discuss@lists.lustre.org> wrote:
> >
> > Hi
> >
> > I had one doubt.
> > In lustre, data is divided into stripes and stored in multiple OSTs. So each OST will have some part of data.
> > My question is if one OST fails, will there be data loss?
> >
> > Please advise for the same.
> >
> > Thanks and regards
> > Nick
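On the RAID question: RAID0 offers no redundancy at all, so an OST on RAID0 would lose data on any single disk failure; mirroring (RAID10) or parity RAID is what provides the internal redundancy. A hedged sketch of typical backing layouts when ZFS is the backend (device, pool, and filesystem names are illustrative, and the mkfs.lustre options are abbreviated):

```shell
# MDT: mirrored vdevs (the ZFS analogue of RAID10) on fast SSDs.
zpool create mdtpool mirror /dev/nvme0n1 /dev/nvme1n1
mkfs.lustre --mdt --backfstype=zfs --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp mdtpool/mdt0

# OST: raidz2 (analogous to RAID-6) tolerates two disk failures.
zpool create ostpool raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
mkfs.lustre --ost --backfstype=zfs --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp ostpool/ost0
```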
Re: [lustre-discuss] Data stored in OST
Yes, the OSTs must provide internal redundancy - RAID-6 typically.

There is File Level Redundancy (FLR = mirroring) possible in Lustre file layouts, but it is "unmanaged", so users or other system-level tools are required to resync FLR files if they are written after mirroring.

Cheers, Andreas

> On May 22, 2023, at 09:39, Nick dan via lustre-discuss wrote:
>
> Hi
>
> I had one doubt.
> In lustre, data is divided into stripes and stored in multiple OSTs. So each OST will have some part of data.
> My question is if one OST fails, will there be data loss?
>
> Please advise for the same.
>
> Thanks and regards
> Nick
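The "unmanaged" resync mentioned above looks roughly like this (the file path is illustrative):

```shell
# Add a mirror copy to an existing file (FLR).
lfs mirror extend -N1 /mnt/lustre/data.bin

# After the file is written, mirror copies go stale; resync them
# explicitly - Lustre does not do this automatically.
lfs mirror resync /mnt/lustre/data.bin

# Verify that all mirror copies are consistent.
lfs mirror verify /mnt/lustre/data.bin
```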
[lustre-discuss] Data stored in OST
Hi

I had one doubt. In lustre, data is divided into stripes and stored in multiple OSTs, so each OST will have some part of the data. My question is: if one OST fails, will there be data loss?

Please advise for the same.

Thanks and regards
Nick