Hey Olaf,

We'll investigate as suggested. I'm hopeful the journald logs will provide
some additional insight.

As for OFED versions, we use the same Mellanox version across the cluster and 
haven't seen any issues with working nodes that mount the filesystem.

We also have a PMR open with IBM, and we'll send a follow-up to the group if
we discover anything more worth discussing.



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu






________________________________
From: gpfsug-discuss-boun...@spectrumscale.org 
<gpfsug-discuss-boun...@spectrumscale.org> on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 
<gpfsug-discuss-requ...@spectrumscale.org>
Sent: Tuesday, March 30, 2021 1:07 AM
To: gpfsug-discuss@spectrumscale.org <gpfsug-discuss@spectrumscale.org>
Subject: gpfsug-discuss Digest, Vol 110, Issue 34


Today's Topics:

   1. Filesystem mount attempt hangs GPFS client node
      (Saula, Oluwasijibomi)
   2. Re: Filesystem mount attempt hangs GPFS client node (Olaf Weiser)


----------------------------------------------------------------------

Message: 1
Date: Mon, 29 Mar 2021 18:38:00 +0000
From: "Saula, Oluwasijibomi" <oluwasijibomi.sa...@ndsu.edu>
To: "gpfsug-discuss@spectrumscale.org"
        <gpfsug-discuss@spectrumscale.org>
Subject: [gpfsug-discuss] Filesystem mount attempt hangs GPFS client
        node
Message-ID:
        
<ph0pr08mb6598f1a7bc557225d417d8c998...@ph0pr08mb6598.namprd08.prod.outlook.com>

Content-Type: text/plain; charset="utf-8"

Hello Folks,

We are experiencing a mind-boggling issue where just a couple of nodes in our
cluster hang at GPFS startup so badly that the node must be power reset.

These AMD client nodes are diskless and have at least 256G of memory. We have
other AMD nodes that are working just fine in a separate GPFS cluster, albeit
on RHEL7.

Just before GPFS (or a related process) seizes up the node, the following lines
appear in /var/mmfs/gen/mmfslog:


2021-03-29_12:47:37.343-0500: [N] mmfsd ready
2021-03-29_12:47:37.426-0500: mmcommon mmfsup invoked. Parameters: 10.12.50.47 10.12.50.242 all
2021-03-29_12:47:37.587-0500: mounting /dev/mmfs1
2021-03-29_12:47:37.590-0500: [I] Command: mount mmfs1
2021-03-29_12:47:37.859-0500: [N] Connecting to 10.12.50.243 tier1-sn-02.pixstor <c0n2>
2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 (tier1-sn-01.pixstor) on mlx5_0 port 1 fabnum 0 sl 0 index 0
2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 (tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1
2021-03-29_12:47:37.866-0500: [I] VERBS RDMA connected to 10.12.50.242 (tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 0
2021-03-29_12:47:37.867-0500: [I] VERBS RDMA connected to 10.12.50.242 (tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1
2021-03-29_12:47:37.868-0500: [I] Connected to 10.12.50.243 tier1-sn-02 <c0n2>
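
In case it helps anyone comparing a hanging node against a working one, here is a minimal Python sketch that pulls the last events out of a log excerpt like the one above. It assumes the timestamp layout and the optional [N]/[I] tags seen in these lines and nothing more about the mmfs log format. The names parse_log and parse_mmfs_line are hypothetical helpers, not part of any GPFS tooling.

```python
import re
from datetime import datetime

# Timestamp layout inferred only from the excerpt, e.g. 2021-03-29_12:47:37.343-0500
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}:")
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}_\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}):\s+"
    r"(?:\[(?P<sev>[A-Z])\]\s+)?(?P<msg>.*)$"
)


def parse_mmfs_line(line):
    """Return (timestamp, severity or None, message), or None if no match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d_%H:%M:%S.%f%z")
    return ts, m.group("sev"), m.group("msg")


def parse_log(lines):
    """Join mail-wrapped continuation lines, then parse each entry."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between entries
        if TS_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line  # continuation of the previous entry
    parsed = (parse_mmfs_line(e) for e in entries)
    return [p for p in parsed if p is not None]


# Three lines copied from the excerpt, including one wrapped by the mailer.
SAMPLE = [
    "2021-03-29_12:47:37.343-0500: [N] mmfsd ready",
    "",
    "2021-03-29_12:47:37.426-0500: mmcommon mmfsup invoked. Parameters: 10.12.50.47 ",
    "10.12.50.242 all",
    "",
    "2021-03-29_12:47:37.868-0500: [I] Connected to 10.12.50.243 tier1-sn-02 <c0n2>",
]

if __name__ == "__main__":
    events = parse_log(SAMPLE)
    first, last = events[0], events[-1]
    print("last event before hang:", last[2])
    print("seconds since mmfsd ready:", (last[0] - first[0]).total_seconds())
```

Running the same parser over the log from a node that mounts cleanly would show which messages normally follow the last "Connected to" line and never appear on the hanging node.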

There have been hunches that this might be a network issue; however, other
nodes connected to the IB network switch are mounting the filesystem without
incident.

I'm inclined to believe there's a GPFS- or OS-specific setting causing these
hangs, especially since the node does not hang when automount is disabled on
the client. However, once we issue mmmount, the node seizes up shortly
afterwards.

Please let me know if you have any thoughts on where to look for root causes,
as a few colleagues and I are stuck here.



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu




------------------------------

Message: 2
Date: Tue, 30 Mar 2021 06:06:54 +0000
From: "Olaf Weiser" <olaf.wei...@de.ibm.com>
To: gpfsug-discuss@spectrumscale.org
Cc: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] Filesystem mount attempt hangs GPFS
        client node
Message-ID:
        
<of4ff5120b.5e2b3de7-on002586a8.0021023a-002586a8.00219...@notes.na.collabserv.com>

Content-Type: text/plain; charset="us-ascii"

An HTML attachment was scrubbed...
URL: 
<http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20210330/ae3c3cdd/attachment.html>

------------------------------

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


End of gpfsug-discuss Digest, Vol 110, Issue 34
***********************************************