Re: [lustre-discuss] CentOS Stream 8/9 support?

2023-06-22 Thread Andreas Dilger via lustre-discuss
On Jun 22, 2023, at 06:58, Will Furnass via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Hi,

I imagine that many here might have seen RedHat's announcement
yesterday about ceasing to provide sources for EL8 and EL9 to those
who aren't paying customers (see [1] - CentOS 7 unaffected).  For many
HPC sites using or planning to adopt Alma/Rocky 8/9 this prompts a
change of tack:
- buy RHEL 8/9
- switch to CentOS 8/9 Stream for something EL-like
- switch to something else (SUSE or Ubuntu)

Those wanting to stick with EL-like will be interested in how well
Lustre works with Stream 8/9.  Seems it's not in the support matrix
[2].  Have others here used Lustre with Stream successfully?  If so,
anything folks would care to share about gotchas encountered?  Did
you use patched or unpatched kernels?

For clients I don't think it will matter much, since users often have to build
their own client RPMs (possibly via DKMS), or they use weak updates to
avoid rebuilding the RPMs at all across kernel updates.  The Lustre client code
itself works with a wide range of kernel versions (3.10-6.0 currently), and
I suspect that relatively few production systems want to be on the bleeding
edge of Linux kernels either, so the lack of 6.1-6.3 kernel support is likely
not affecting anyone, and even then patches are already in flight for them.
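
As a rough sketch (the package names and kernel path below are only
illustrative, not tied to any particular release), the two usual client
routes look something like:

  # DKMS route: client modules are rebuilt automatically for each new kernel
  dnf install lustre-client-dkms lustre-client

  # build-from-source route: client-only RPMs against the running kernel
  sh autogen.sh
  ./configure --disable-server --with-linux=/usr/src/kernels/$(uname -r)
  make rpms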

Definitely servers will be more tricky, since the baseline will always be
moving, and more quickly than EL kernels.

[1] https://www.redhat.com/en/blog/furthering-evolution-centos-stream
[2] https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix

Cheers,

Will

--
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield


Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] No space left on device MDT DoM but not full nor run out of inodes

2023-06-22 Thread Andreas Dilger via lustre-discuss
There is a bug in the grant accounting that leaks under certain operations 
(maybe O_DIRECT?).  It is resolved by unmounting and remounting the clients, 
and/or upgrading.  There was a thread about it on lustre-discuss a couple of 
years ago.
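
If it is the same issue, the client-side grant can be eyeballed with something
like the following (parameter and path names here are just examples):

  lctl get_param osc.*.cur_grant_bytes      # per-target grant currently held
  umount /mnt/scratchc                      # the workaround: drop and
  mount -t lustre mgsnode@o2ib:/scratchc /mnt/scratchc   # re-establish grants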

Cheers, Andreas

On Jun 20, 2023, at 09:32, Jon Marshall via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

Sorry, typo in the version number - the version we are actually running is 
2.12.6

From: Jon Marshall
Sent: 20 June 2023 16:18
To: lustre-discuss@lists.lustre.org
Subject: No space left on device MDT DoM but not full nor run out of inodes

Hi,

We've been running lustre 2.15.1 in production for over a year and recently 
decided to enable PFL with DoM on our filesystem. Things have been fine up 
until last week, when users started reporting issues copying files, 
specifically "No space left on device". The MDT is running ldiskfs as the 
backend.

I've searched through the mailing list and found a couple of people reporting
similar problems, which prompted me to check the inode allocation. It is
currently:

UUID                      Inodes       IUsed       IFree IUse% Mounted on
scratchc-MDT0000_UUID  624492544    71144384   553348160   12% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID   57712579    24489934    33222645   43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID   57114064    24505876    32608188   43% /mnt/scratchc[OST:1]

filesystem_summary:    136975217    71144384    65830833   52% /mnt/scratchc

So, nowhere near full - the disk usage is a little higher:

UUID                     bytes        Used   Available Use% Mounted on
scratchc-MDT0000_UUID   882.1G      451.9G      355.8G   56% /mnt/scratchc[MDT:0]
scratchc-OST0000_UUID    53.6T       22.7T       31.0T   43% /mnt/scratchc[OST:0]
scratchc-OST0001_UUID    53.6T       23.0T       30.6T   43% /mnt/scratchc[OST:1]

filesystem_summary:     107.3T       45.7T       61.6T   43% /mnt/scratchc
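
For reference, the two listings above correspond to something like the
following (mount point as on our clients):

  lfs df -i /mnt/scratchc   # inodes per target
  lfs df -h /mnt/scratchc   # space per target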

But not full either! The errors are accompanied in the logs by:

LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) scratchc-MDT0000:
cli ba0195c7-1ab4-4f7c-9e28-8689478f5c17/9e331e231c00 left 82586337280 <
tot_grant 82586681321 unstable 0 pending 0 dirty 1044480
LustreError: 15450:0:(tgt_grant.c:463:tgt_grant_space_left()) Skipped 33050 
previous similar messages

For reference the DoM striping we're using is:

  lcm_layout_gen:    0
  lcm_mirror_count:  1
  lcm_entry_count:   3
    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 0
    lcme_extent.e_end:   1048576
      stripe_count:  0    stripe_size: 1048576    pattern: mdt      stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1048576
    lcme_extent.e_end:   1073741824
      stripe_count:  1    stripe_size: 1048576    pattern: raid0    stripe_offset: -1

    lcme_id:             N/A
    lcme_mirror_id:      N/A
    lcme_flags:          0
    lcme_extent.e_start: 1073741824
    lcme_extent.e_end:   EOF
      stripe_count:  -1   stripe_size: 1048576    pattern: raid0    stripe_offset: -1

So the first 1MB is on the MDT.
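
For context, a layout like that would normally be applied with something along
these lines (the directory path is just an example):

  lfs setstripe -E 1M -L mdt \
                -E 1G -c 1 -S 1M \
                -E -1 -c -1 -S 1M /mnt/scratchc/some/dir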

My question is obviously what is causing these errors? I'm not massively 
familiar with Lustre internals, so any pointers on where to look would be 
greatly appreciated!

Cheers
Jon

Jon Marshall
High Performance Computing Specialist



IT and Scientific Computing Team



Cancer Research UK Cambridge Institute
Li Ka Shing Centre | Robinson Way | Cambridge | CB2 0RE

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud







___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] CentOS Stream 8/9 support?

2023-06-22 Thread Jeff Johnson
This has the makings of a significant enough impact that I don't think it is
a done deal. I'm sure someone in DC is calling someone at IBM. Even if the
USG does nothing, this is the kind of thing that EU regulators have stomped
on in the past.

I suspect this isn't open and shut...yet.

On Thu, Jun 22, 2023 at 11:35 AM Laura Hild via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> We have one, small Stream 8 cluster, which is currently running a Lustre
> client to which I cherry-picked a kernel compatibility patch.  I could
> imagine the effort being considerably more for the server component.  I
> also wonder, even if Whamcloud were to provide releases for Stream kernels,
> how many sites would be happy with Stream's five-year lifetimes.
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>


-- 
--
Jeff Johnson
Co-Founder
Aeon Computing

jeff.john...@aeoncomputing.com
www.aeoncomputing.com
t: 858-412-3810 x1001   f: 858-412-3845
m: 619-204-9061

4170 Morena Boulevard, Suite C - San Diego, CA 92117

High-Performance Computing / Lustre Filesystems / Scale-out Storage
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] CentOS Stream 8/9 support?

2023-06-22 Thread Laura Hild via lustre-discuss
We have one, small Stream 8 cluster, which is currently running a Lustre client 
to which I cherry-picked a kernel compatibility patch.  I could imagine the 
effort being considerably more for the server component.  I also wonder, even 
if Whamcloud were to provide releases for Stream kernels, how many sites would 
be happy with Stream's five-year lifetimes.
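
For what it's worth, pulling a compatibility patch from Gerrit and rebuilding
the client went roughly like this (the change reference below is a
placeholder, not the actual patch I used):

  git clone git://git.whamcloud.com/fs/lustre-release.git && cd lustre-release
  git fetch https://review.whamcloud.com/fs/lustre-release refs/changes/NN/NNNNN/P
  git cherry-pick FETCH_HEAD
  sh autogen.sh && ./configure --disable-server && make rpms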
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] MDTs will only mount read only

2023-06-22 Thread Mike Mosley via lustre-discuss
Rick,

You were on the right track!

We were fortunate enough to get an expert from Cambridge Computing to take
a look at things and he managed to get us back into a normal state.

He remounted the MDTs with the *abort_recov* option and we were finally
able to get things going again.
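
For the archives, the remount was essentially of this form (device and mount
point below are examples, not necessarily our actual ones):

  umount /mnt/mdt0
  mount -t lustre -o abort_recov /dev/sdb /mnt/mdt0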

Thanks to all who responded and special shout out to Brad at Cambridge
Computing for making time to help us get this fixed.

Mike




On Wed, Jun 21, 2023 at 4:32 PM Mohr, Rick  wrote:

> Mike,
>
> On the off chance that the recovery process is causing the issue, you
> could try mounting the mdt with the "abort_recov" option and see if the
> behavior changes.
>
> --Rick
>
>
>
> On 6/21/23, 2:33 PM, "lustre-discuss on behalf of Jeff Johnson"
> <lustre-discuss-boun...@lists.lustre.org on behalf of
> jeff.john...@aeoncomputing.com> wrote:
>
>
> Maybe someone else in the list can add clarity but I don't believe a
> recovery process on mount would keep the MDS read-only or trigger that
> trace. Something else may be going on.
>
>
> I would start from the ground up. Bring your servers up, unmounted. Ensure
> lnet is loaded and configured properly. Test lnet using ping or
> lnet_selftest from your MDS to all of your OSS nodes. Then mount your
> combined MGS/MDT volume on the MDS and see what happens.
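>
> Concretely, something like this (the NID and device below are placeholders):
>
> modprobe lnet
> lctl network up
> lctl list_nids
> lctl ping <oss-ib-ip>@o2ib
> mount -t lustre /dev/<mgs-mdt-device> /mnt/mdt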
>
>
>
>
> Is your MDS in a high-availability pair?
> What version of Lustre are you running?
>
>
>
>
> ...just a few things readers on the list might want to know.
>
>
>
>
> --Jeff
>
> On Wed, Jun 21, 2023 at 11:21 AM Mike Mosley wrote:
>
>
> Jeff,
>
>
> At this point we have the OSS shut down. We were coming back from a full
> outage, so we are trying to get the MDS up before starting to bring up
> the OSS.
>
>
>
>
> Mike
>
> On Wed, Jun 21, 2023 at 2:15 PM Jeff Johnson <jeff.john...@aeoncomputing.com> wrote:
>
>
> Mike,
>
>
> Have you made sure that the o2ib interfaces on all of your Lustre servers
> (MDS & OSS) are functioning properly? Are you able to `lctl ping
> x.x.x.x@o2ib` successfully between MDS and OSS nodes?
>
>
>
>
> --Jeff
>
> On Wed, Jun 21, 2023 at 10:08 AM Mike Mosley via lustre-discuss
> <lustre-discuss@lists.lustre.org> wrote:
>
>
> Rick, 172.16.100.4 is the IB address of one of the OSS servers. I
> believe the mgt and mdt0 are the same target. My understanding is that we
> have a single instance of the MGT, which is on the first MDT server, i.e. it
> was created via a command similar to:
>
>
>
>
> # mkfs.lustre --fsname=scratch --index=0 --mdt --mgs --replace /dev/sdb
>
> Does that make sense?
>
> On Wed, Jun 21, 2023 at 12:55 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
> Which host is 172.16.100.4? Also, are the mgt and mdt0 on the same target
> or are they two separate targets just on the same host?
>
>
> --Rick
>
> On 6/21/23, 12:52 PM, "Mike Mosley" <mike.mos...@charlotte.edu> wrote:
>
>
>
>
> Hi Rick,
>
>
>
>
> The MGS/MDS are combined. The output I posted is from the primary.
>
> Thanks,
>
> Mike
>
> On Wed, Jun 21, 2023 at 12:27 PM Mohr, Rick <moh...@ornl.gov> wrote:
>
>
>
>
> Mike,
>
>
>
>
> It looks like the mds server is having a problem contacting the mgs
> server. I'm guessing the mgs is a separate host? I would start by looking
> for possible network problems that might explain the LNet timeouts. You can
> try using "lctl ping" to test the LNet connection between nodes, and you
> can also try regular "ping" between the IP addresses on the IB interfaces.
>
>
>
>
> --Rick
>
> On 6/21/23, 11:35 AM, "lustre-discuss on behalf of Mike Mosley via
> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
> lustre-discuss@lists.lustre.org> wrote:
>
> Greetings,
>

[lustre-discuss] CentOS Stream 8/9 support?

2023-06-22 Thread Will Furnass via lustre-discuss
Hi,

I imagine that many here might have seen RedHat's announcement
yesterday about ceasing to provide sources for EL8 and EL9 to those
who aren't paying customers (see [1] - CentOS 7 unaffected).  For many
HPC sites using or planning to adopt Alma/Rocky 8/9 this prompts a
change of tack:
 - buy RHEL 8/9
 - switch to CentOS 8/9 Stream for something EL-like
 - switch to something else (SUSE or Ubuntu)

Those wanting to stick with EL-like will be interested in how well
Lustre works with Stream 8/9.  Seems it's not in the support matrix
[2].  Have others here used Lustre with Stream successfully?  If so,
anything folks would care to share about gotchas encountered?  Did
you use patched or unpatched kernels?

[1] https://www.redhat.com/en/blog/furthering-evolution-centos-stream
[2] https://wiki.whamcloud.com/display/PUB/Lustre+Support+Matrix

Cheers,

Will

-- 
Dr Will Furnass | Research Platforms Engineer
IT Services | University of Sheffield
+44 (0)114 22 29693
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] [EXTERNAL] I/O error on lctl ping although ibping successful

2023-06-22 Thread Youssef Eldakar via lustre-discuss
Quite strangely, I found 2 good hosts (both successfully mount the file system)
where the TCP ping goes through on one, while it does not on the other
(though LNET ping is OK for both).

- Youssef
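
In case it helps anyone following along, the per-host comparison is basically
the following (the MDS address below is a placeholder):

  lctl ping <mds-ib-ip>@o2ib   # LNet-level ping
  ping <mds-ib-ip>             # plain IP ping of the IPoIB address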

On Wed, Jun 21, 2023 at 6:08 PM Youssef Eldakar wrote:

> Thanks, Rick, for that suggestion. TCP ping between a problematic host and
> the MDS indeed does not go through.
>
> Not exactly sure what to investigate next, but that gives me somewhere to
> start...
>
> - Youssef
>
> On Tue, Jun 20, 2023 at 7:00 PM Mohr, Rick via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
>> Have you tried tcp pings on the IP addresses associated with the IB
>> interfaces?
>>
>> --Rick
>>
>>
>> On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via
>> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
>> lustre-discuss@lists.lustre.org> wrote:
>>
>>
>> In a cluster having ~100 Lustre clients (compute nodes) connected
>> together with the MDS and OSS over Intel True Scale InfiniBand
>> (discontinued product), we started seeing certain nodes failing to mount
>> the Lustre file system and giving I/O error on LNET (lctl) ping even though
>> an ibping test to the MDS gives no errors. We tried rebooting the
>> problematic nodes and even fresh-installing the OS and Lustre client, which
>> did not help. However, rebooting the MDS seems to help momentarily once the
>> MDS comes back up, but the same set of problematic nodes always eventually
>> reverts to the state where they fail to ping the MDS over LNET.
>>
>>
>> Thank you for any pointers we may pursue.
>>
>>
>>
>>
>> Youssef Eldakar
>> Bibliotheca Alexandrina
>> www.bibalex.org
>> hpc.bibalex.org
>>
>>
>>
>>
>>
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org