To whom it may concern,
Hello. I am new to Slurm and am having a problem using it with InfiniBand.
When I run MPI jobs on a freshly rebooted node, I get fabric errors. For
example, I tried a simple “hello world” built with Intel MPI, as follows:
$ salloc -N1 -n12 -w cn117 #cn117 is the node just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
  1) modules                    2) null                       3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0
================================================================
However, once I manually restart Slurm on cn117, the problem goes away.
For example:
$ ssh root@cn117
cn117#  service slurm restart
stopping slurmd:                                           [  OK  ]
slurmd is stopped
starting slurmd:                                           [  OK  ]
# exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
This is Process  9 out of 12 running on host cn117
This is Process  3 out of 12 running on host cn117
This is Process  2 out of 12 running on host cn117
This is Process  7 out of 12 running on host cn117
This is Process  6 out of 12 running on host cn117
This is Process  0 out of 12 running on host cn117
This is Process  5 out of 12 running on host cn117
This is Process  1 out of 12 running on host cn117
This is Process  4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process  8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117
=============================================================
Although I can do this manually, I would like it to happen automatically. I
tried adding “sleep 10s;/etc/init.d/slurm restart” to the end of the rc.local
file, but the issue is still there. Can anyone help me with this?
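
For reference, the rc.local attempt described above could be sketched as follows, with the fixed 10-second sleep replaced by a wait on the InfiniBand port state that ibv_devinfo reports. This is only an illustration under assumptions: the function names and the 60-second timeout are hypothetical, and it is not established that start-up timing is the cause of the errors, so this may not help any more than the fixed sleep did.

```shell
#!/bin/sh
# Sketch only: the function names and the default timeout are illustrative
# assumptions, not part of Slurm or OFED.

# Succeeds if ibv_devinfo-style text on stdin reports at least one port
# in the PORT_ACTIVE state.
ib_port_active() {
    grep -q 'state:[[:space:]]*PORT_ACTIVE'
}

# Polls ibv_devinfo once per second for up to $1 seconds (default 60).
# Afterwards it restarts Slurm with the same init script as the manual fix,
# whether or not a port became active in time.
wait_for_ib_then_restart() {
    timeout=${1:-60}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if ibv_devinfo 2>/dev/null | ib_port_active; then
            break
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    /etc/init.d/slurm restart
}
```

With these definitions placed in rc.local, the last line of the file would call `wait_for_ib_then_restart 60` in place of the fixed `sleep 10s`.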

Sincerely,
Tingyang Xu
HPC Administrator
University of Connecticut


PS: Some information about the system and the InfiniBand setup:
$ slurmd -V
slurm 14.03.0

cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):

cn117$ ibv_devinfo
hca_id: mlx4_0
 transport:   InfiniBand (0)
 fw_ver:    2.11.550
 node_guid:
 sys_image_guid:   ##########
 vendor_id:   ##########
 vendor_part_id:   ########
 hw_ver:    0x0
 board_id:   ########
 phys_port_cnt:   2
  port: 1
   state:   PORT_ACTIVE (4)
   max_mtu:  4096 (5)
   active_mtu:  4096 (5)
   sm_lid:   1
   port_lid:  131
   port_lmc:  0x00
   link_layer:  InfiniBand

  port: 2
   state:   PORT_DOWN (1)
   max_mtu:  4096 (5)
   active_mtu:  4096 (5)
   sm_lid:   0
   port_lid:  0
   port_lmc:  0x00
   link_layer:  InfiniBand

cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)
