Well done - great job. 👏

Chris

Chris James
________________________________
From: Marcos Melo <marcos.mk.m...@gmail.com>
Sent: Thursday, May 22, 2025 8:04:01 PM
To: xcat-user@lists.sourceforge.net <xcat-user@lists.sourceforge.net>; Mark 
Frenette <mark2.frene...@gmail.com>
Subject: Re: [xcat-user] Node freezes after xCAT provisioning - maybe related 
to RAID1 partitioning?


Hi everyone,

I found the solution. The issue was the MTU on the switch interfaces — they 
weren’t configured properly, so SSH couldn’t display the output of the commands 
I was running. I really appreciate everyone’s effort. I wasn’t able to respond 
earlier because I didn’t get notified about the replies.

During testing, I lowered the MTU of the 10GbE interfaces, and the issue was 
resolved. When I checked the switch configuration, I noticed that the ports 
connected to the nodes did not have an MTU configured. I then set the MTU to 
9216 on those ports, and the problem was fully resolved.

Now, the nodes are using an MTU of 9000 on their 10GbE interfaces because the 
switch is properly handling it.
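
For anyone who hits the same symptom, here is a minimal sketch of how the path MTU could be checked from the management node. It assumes Linux iputils `ping` (for the `-M do` "don't fragment" option) and IPv4 without header options; the function names `icmp_payload_for_mtu` and `probe_mtu` are made up for illustration. A mismatch like this would likely explain why only commands with large output froze: short replies fit in small packets, while long output fills full-size frames that a switch port with a smaller MTU drops.

```shell
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Send three pings sized for the given MTU with fragmentation prohibited.
# Assumption: iputils ping on Linux ("-M do" sets the Don't Fragment bit).
probe_mtu() {
    host=$1
    mtu=$2
    ping -M do -c 3 -s "$(icmp_payload_for_mtu "$mtu")" "$host"
}
```

For example, `probe_mtu node01 9000` sends 8972-byte payloads. If that fails or times out while `probe_mtu node01 1500` succeeds, something along the path is probably not passing jumbo frames.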

By the way, the switch I’m using for the 10GbE network is the Supermicro 
SSE-X3548S/SSE-X3548SR.

Thanks again to everyone who made an effort to assist, and a special thank you 
to @Mark Frenette for pointing me in the right 
direction!

On Wed, May 7, 2025 at 10:37 AM Marcos Melo <marcos.mk.m...@gmail.com> wrote:

Hey everyone,

I’m running into a strange problem after provisioning a node with xCAT, and I’m 
trying to figure out if it’s something related to how I set up RAID1.

Setup:

  *   Hardware: Supermicro server with AMD EPYC 7763 (64 cores)

  *   OS: Oracle Linux 8.8 (kernel 4.18.0-477.21.1.el8_8)

  *   Provisioning: xCAT

  *   Storage:

     *   2x 480GB SATA SSDs in RAID1 for system partitions

     *   1x 1.8TB NVMe drive for /scratch

  *   Filesystem: XFS

  *   Network: Infiniband (ConnectX-5, switch: Mellanox SB8700/SB8790)

What's happening:

After provisioning, the node (node01) looks fine — it boots, mounts storage, 
RAID syncs, networking is working, etc.

But if I run simple commands like:

cat /proc/cpuinfo
cat /etc/fstab
cat /proc/mounts

vim /root/file_test

the SSH session freezes.

(Other sessions are still fine and I can reconnect, so it is not a full system 
crash.)

Other commands like:

cat /proc/mdstat
xfs_info /dev/md2
dmesg | grep error
dd if=/dev/sda of=/dev/null

work normally without any issues.

What I already checked:

  *   RAID1 is synced (/proc/mdstat shows [UU]).

  *   XFS filesystems mount cleanly (xfs_info looks good).

  *   No obvious errors in dmesg or journalctl.

  *   Disk performance (dd) looks normal.

  *   CPU microcode seems fine (0xa0011d5 for all cores).

  *   Unloading Infiniband drivers (mlx5_ib, mlx5_core) had no effect.

  *   strace shows the freeze while reading through /proc/cpuinfo.

Also important:

Other nodes (Dell servers with ConnectX-6) provisioned via the same xCAT 
environment do not have this problem.

Could this be the cause?

Could it be that something went wrong during the RAID1 creation with the 
partitionfile script during provisioning?

I created the RAID arrays (mdadm) during provisioning, plus a standalone 
/scratch partition on the NVMe.

Thanks a lot if you have any ideas.

I’m happy to share more info if needed — just trying to understand if I missed 
anything obvious.

_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user
