Hey everyone, I’m running into a strange problem after provisioning a node with xCAT, and I’m trying to figure out if it’s something related to how I set up RAID1.
Setup:
- Hardware: Supermicro server with AMD EPYC 7763 (64 cores)
- OS: Oracle Linux 8.8 (kernel 4.18.0-477.21.1.el8_8)
- Provisioning: xCAT
- Storage:
  - 2x 480GB SATA SSDs in RAID1 for system partitions
  - 1x 1.8TB NVMe drive for /scratch
- Filesystem: XFS
- Network: Infiniband (ConnectX-5, switch: Mellanox SB8700/SB8790)

What’s happening:
After provisioning, the node (node01) looks fine: it boots, mounts storage, the RAID syncs, networking works, etc. But if I run simple commands like:

    cat /proc/cpuinfo
    cat /etc/fstab
    cat /proc/mounts
    vim /root/file_test

the *SSH session freezes*. (Other sessions are still fine and I can reconnect; it’s not a full system crash.)

Other commands like:

    cat /proc/mdstat
    xfs_info /dev/md2
    dmesg | grep error
    dd if=/dev/sda of=/dev/null

work normally without any issues.

What I already checked:
- RAID1 is synced (/proc/mdstat shows [UU]).
- XFS filesystems mount cleanly (xfs_info looks good).
- No obvious errors in dmesg or journalctl.
- Disk performance (dd) looks normal.
- CPU microcode seems fine (0xa0011d5 for all cores).
- Unloading the Infiniband drivers (mlx5_ib, mlx5_core) had no effect.
- strace shows the freeze while reading through /proc/cpuinfo.

Also important: other nodes (Dell servers with ConnectX-6) provisioned via the same xCAT environment do not have this problem.

Could this be the cause? *Could it be that something went wrong during the RAID1 creation with the partitionfile script during provisioning?* I created the RAID arrays (mdadm) during provisioning, plus a standalone /scratch partition on the NVMe.

Thanks a lot if you have any ideas. I’m happy to share more info if needed. I’m just trying to understand if I missed anything obvious.
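
P.S. In case it helps anyone reproduce this, here is a rough sketch of how the freezing and non-freezing commands could be checked with their output discarded, to see whether the command itself stalls or only the SSH session that displays it. The helper names probe and probe_all and the 10-second limit are made up for illustration, and it assumes GNU coreutils timeout (exit status 124 on timeout). vim and dd are left out because they are interactive or long-running.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of xCAT): run a command with its output
# discarded and a time limit, then report whether it finished.
# Usage: probe SECONDS COMMAND [ARGS...]
probe() {
    local limit=$1
    shift
    local rc=0
    # GNU timeout exits with 124 when the time limit is reached.
    # Caveat: a process stuck in uninterruptible sleep may not be
    # killable, in which case timeout itself will not return.
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG: $*"
    elif [ "$rc" -eq 0 ]; then
        echo "OK: $*"
    else
        echo "FAIL($rc): $*"
    fi
}

# Probe the non-interactive commands from the post, one per line.
probe_all() {
    probe 10 cat /proc/cpuinfo
    probe 10 cat /etc/fstab
    probe 10 cat /proc/mounts
    probe 10 cat /proc/mdstat
    probe 10 xfs_info /dev/md2
}

# Only run the probes when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    probe_all
fi
```

Running the script with --run on node01 prints one OK/HANG/FAIL line per command. If everything reports OK with the output discarded, that would suggest the commands themselves complete and the freeze happens when the output is sent back over the SSH session.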
_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user