Re: [lustre-discuss] [EXTERNAL] I/O error on lctl ping although ibping successful

2023-06-22 Thread Youssef Eldakar via lustre-discuss
Quite strangely, I found 2 good hosts (both successfully mount the file system)
where the TCP ping goes through on one but does not on the other
(though LNET ping is OK for both).
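
For anyone who wants to repeat this comparison across many nodes, here is a
minimal sketch, not a tested tool. It assumes the Linux `ping` utility and
`lctl` are on the PATH, and that `lctl ping` exits non-zero when the ping
fails (an assumption worth verifying on your Lustre version). The function
names, labels, and result strings are made up for illustration.

```python
import subprocess
from typing import Callable, Dict, Tuple


def ip_ping(ip: str, timeout_s: int = 2) -> bool:
    """Send one echo request with the Linux `ping` utility; True on a reply."""
    done = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return done.returncode == 0


def lnet_ping(nid: str) -> bool:
    """Run `lctl ping <nid>`; assumes a non-zero exit status means failure."""
    done = subprocess.run(
        ["lctl", "ping", nid],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return done.returncode == 0


def classify(ip_ok: bool, lnet_ok: bool) -> str:
    """Turn the two check results into one of four readable labels."""
    if ip_ok and lnet_ok:
        return "both ok"
    if lnet_ok:
        return "LNet ok, IP ping fails"
    if ip_ok:
        return "IP ping ok, LNet fails"
    return "both fail"


def survey(
    peers: Dict[str, Tuple[str, str]],
    ip_check: Callable[[str], bool] = ip_ping,
    lnet_check: Callable[[str], bool] = lnet_ping,
) -> Dict[str, str]:
    """`peers` maps a label to (ip, nid); returns label -> classification."""
    return {
        label: classify(ip_check(ip), lnet_check(nid))
        for label, (ip, nid) in peers.items()
    }
```

Run from a client, something like `survey({"mds": (mds_ip, mds_nid)})` with
your own address and NID would show which of the four cases that client is in.
The two checks are passed in as parameters so the logic can be tried without
touching the network.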

- Youssef

On Wed, Jun 21, 2023 at 6:08 PM Youssef Eldakar wrote:

> Thanks, Rick, for that suggestion. TCP ping between a problematic host and
> the MDS indeed does not go through.
>
> Not exactly sure what to investigate next, but that gives me somewhere to
> start...
>
> - Youssef
>
> On Tue, Jun 20, 2023 at 7:00 PM Mohr, Rick via lustre-discuss <
> lustre-discuss@lists.lustre.org> wrote:
>
>> Have you tried tcp pings on the IP addresses associated with the IB
>> interfaces?
>>
>> --Rick
>>
>>
>> On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via
>> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
>> lustre-discuss@lists.lustre.org> wrote:
>>
>>
>> In a cluster having ~100 Lustre clients (compute nodes) connected
>> together with the MDS and OSS over Intel True Scale InfiniBand
>> (discontinued product), we started seeing certain nodes failing to mount
>> the Lustre file system and giving I/O error on LNET (lctl) ping even though
>> an ibping test to the MDS gives no errors. We tried rebooting the
>> problematic nodes and even fresh-installing the OS and Lustre client, which
>> did not help. However, rebooting the MDS seems to help momentarily
>> after it starts up again, but the same set of problematic nodes always
>> seems to eventually revert to the state where they fail to ping the
>> MDS over LNET.
>>
>>
>> Thank you for any pointers we may pursue.
>>
>>
>>
>>
>> Youssef Eldakar
>> Bibliotheca Alexandrina
>> www.bibalex.org
>> hpc.bibalex.org
>>
>> ___
>> lustre-discuss mailing list
>> lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>


Re: [lustre-discuss] [EXTERNAL] I/O error on lctl ping although ibping successful

2023-06-21 Thread Youssef Eldakar via lustre-discuss
Thanks, Rick, for that suggestion. TCP ping between a problematic host and
the MDS indeed does not go through.
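
To make sure the IP being pinged is the one LNet itself uses, the address can
be taken straight from the NID. On an o2ib network the address part of a NID
is, as far as I understand, the IP configured on the IB interface. Below is a
minimal sketch of splitting a NID of the usual `address@network` form; the
`Nid` type and `parse_nid` helper are hypothetical names, and the example
addresses are made up.

```python
from typing import NamedTuple


class Nid(NamedTuple):
    address: str  # for tcp and o2ib networks this is an IPv4 address
    lnd: str      # network type, e.g. "o2ib" or "tcp"
    net_num: int  # network number; a bare "o2ib" means network 0


def parse_nid(nid: str) -> Nid:
    """Split 'address@network' into its address, type, and network number."""
    address, sep, net = nid.partition("@")
    if not sep or not address or not net:
        raise ValueError(f"not a NID: {nid!r}")
    # Trailing digits are the network number ("o2ib1" -> "o2ib" and 1);
    # the "2" inside "o2ib" is not trailing, so it stays part of the type.
    lnd = net.rstrip("0123456789")
    if not lnd:
        raise ValueError(f"missing network type in NID: {nid!r}")
    suffix = net[len(lnd):]
    return Nid(address, lnd, int(suffix) if suffix else 0)
```

For example, `parse_nid("10.0.0.5@o2ib").address` gives `"10.0.0.5"`, which
can then be handed to a plain `ping`.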

Not exactly sure what to investigate next, but that gives me somewhere to
start...

- Youssef

On Tue, Jun 20, 2023 at 7:00 PM Mohr, Rick via lustre-discuss <
lustre-discuss@lists.lustre.org> wrote:

> Have you tried tcp pings on the IP addresses associated with the IB
> interfaces?
>
> --Rick
>
>
> On 6/20/23, 12:11 PM, "lustre-discuss on behalf of Youssef Eldakar via
> lustre-discuss" <lustre-discuss-boun...@lists.lustre.org on behalf of
> lustre-discuss@lists.lustre.org> wrote:
>
>
> In a cluster having ~100 Lustre clients (compute nodes) connected together
> with the MDS and OSS over Intel True Scale InfiniBand (discontinued
> product), we started seeing certain nodes failing to mount the Lustre file
> system and giving I/O error on LNET (lctl) ping even though an ibping test
> to the MDS gives no errors. We tried rebooting the problematic nodes and
> even fresh-installing the OS and Lustre client, which did not help.
> However, rebooting the MDS seems to help momentarily after it starts up
> again, but the same set of problematic nodes always seems to eventually
> revert to the state where they fail to ping the MDS over LNET.
>
>
> Thank you for any pointers we may pursue.
>
>
>
>
> Youssef Eldakar
> Bibliotheca Alexandrina
> www.bibalex.org
> hpc.bibalex.org
>
>


[lustre-discuss] I/O error on lctl ping although ibping successful

2023-06-20 Thread Youssef Eldakar via lustre-discuss
In a cluster having ~100 Lustre clients (compute nodes) connected together
with the MDS and OSS over Intel True Scale InfiniBand (discontinued
product), we started seeing certain nodes failing to mount the Lustre file
system and giving I/O error on LNET (lctl) ping even though an ibping test
to the MDS gives no errors. We tried rebooting the problematic nodes and
even fresh-installing the OS and Lustre client, which did not help.
However, rebooting the MDS seems to help momentarily after it starts up
again, but the same set of problematic nodes always seems to eventually
revert to the state where they fail to ping the MDS over LNET.

Thank you for any pointers we may pursue.

Youssef Eldakar
Bibliotheca Alexandrina
www.bibalex.org
hpc.bibalex.org