To whom it may concern,

Hello. I am new to Slurm, and I am having a problem using Slurm with InfiniBand. When I run MPI jobs on a freshly rebooted node, I get fabric errors. For example, I tried a simple “hello world” with Intel MPI:

$ salloc -N1 -n12 -w cn117    # cn117 is the node that was just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
  1) modules   2) null   3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0
================================================================

However, once I manually restart Slurm on cn117, the problem goes away.
For example:

$ ssh root@cn117
cn117# service slurm restart
stopping slurmd: [ OK ]
slurmd is stopped
starting slurmd: [ OK ]
cn117# exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
This is Process 9 out of 12 running on host cn117
This is Process 3 out of 12 running on host cn117
This is Process 2 out of 12 running on host cn117
This is Process 7 out of 12 running on host cn117
This is Process 6 out of 12 running on host cn117
This is Process 0 out of 12 running on host cn117
This is Process 5 out of 12 running on host cn117
This is Process 1 out of 12 running on host cn117
This is Process 4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process 8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117
=============================================================

Although I can do this manually, I would like it to happen automatically. I tried adding “sleep 10s;/etc/init.d/slurm restart” to the end of rc.local, but the issue is still there. Can anyone help me with this?
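
For reference, a variation on the rc.local attempt could poll the InfiniBand port state instead of sleeping for a fixed 10 seconds. The sketch below is only an illustration: the function names ib_port_active and wait_for_ib, the --run switch, and the retry count and delay are all hypothetical, and whether link readiness is actually what causes the error is not established here. It relies only on ibv_devinfo printing a "state: PORT_ACTIVE" line, as in the output in the PS.

```shell
#!/bin/sh
# Hypothetical rc.local helper: wait for an InfiniBand port to report
# PORT_ACTIVE before restarting slurmd, rather than sleeping a fixed time.

# Succeeds if the ibv_devinfo-style text on stdin reports an active port.
ib_port_active() {
    grep -q 'state:[[:space:]]*PORT_ACTIVE'
}

# Poll ibv_devinfo up to $1 times, $2 seconds apart.
# Returns 0 as soon as a port is active, 1 if it never becomes active.
wait_for_ib() {
    tries=$1
    delay=$2
    while [ "$tries" -gt 0 ]; do
        if ibv_devinfo 2>/dev/null | ib_port_active; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Only act when called with --run (an assumed switch), so that the
# functions can be sourced and checked without touching slurmd.
if [ "${1:-}" = "--run" ]; then
    # 30 tries, 2 seconds apart: assumed values, adjust as needed.
    wait_for_ib 30 2 && /etc/init.d/slurm restart
fi
```

If saved as, say, /root/wait_ib_restart.sh (a made-up path), the line in rc.local would be "/root/wait_ib_restart.sh --run" in place of the sleep command.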
Sincerely,
Tingyang Xu
HPC Administrator
University of Connecticut

PS: Some information about the InfiniBand setup:

$ slurmd -V
slurm 14.03.0
cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):
cn117$ ibv_devinfo
hca_id: mlx4_0
        transport:      InfiniBand (0)
        fw_ver:         2.11.550
        node_guid:
        sys_image_guid: ##########
        vendor_id:      ##########
        vendor_part_id: ########
        hw_ver:         0x0
        board_id:       ########
        phys_port_cnt:  2
                port: 1
                        state:      PORT_ACTIVE (4)
                        max_mtu:    4096 (5)
                        active_mtu: 4096 (5)
                        sm_lid:     1
                        port_lid:   131
                        port_lmc:   0x00
                        link_layer: InfiniBand
                port: 2
                        state:      PORT_DOWN (1)
                        max_mtu:    4096 (5)
                        active_mtu: 4096 (5)
                        sm_lid:     0
                        port_lid:   0
                        port_lmc:   0x00
                        link_layer: InfiniBand
cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)
