Hi!

This is almost certainly not related to Slurm; it looks like a problem with your
InfiniBand + Intel MPI configuration. You will probably get better help on other
forums or mailing lists ;)

First, I would suggest configuring the dat.conf file correctly. In my case it is
"/etc/dat.conf". You have to comment out all the lines for the unsupported IB
modes.
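
As a rough sketch of that step, the following comments out every active entry
in a dat.conf-style file except those that mention one device. The helper name
keep_only_provider and the sample entries are made up for illustration, and the
device string "mlx4_0 1" is an assumption based on the ibv_devinfo output in
the quoted message (port 1 active). Try it on a copy before touching
/etc/dat.conf.

```shell
#!/bin/sh
# keep_only_provider FILE DEVICE
# Comment out every active entry in FILE that does not mention DEVICE.
# A backup is written to FILE.bak first.
# (Hypothetical helper for illustration; not part of DAPL or Intel MPI.)
keep_only_provider() {
    file=$1
    device=$2
    cp "$file" "$file.bak" || return 1
    # Comments, blank lines, and lines mentioning DEVICE pass through
    # unchanged; every other line gets a leading '#'.
    awk -v dev="$device" '
        /^[[:space:]]*#/ || /^[[:space:]]*$/ || index($0, dev) { print; next }
        { print "#" $0 }
    ' "$file.bak" > "$file"
}

# Demo on a scratch file with sample entries (not a real /etc/dat.conf).
demo=$(mktemp)
cat > "$demo" <<'EOF'
# sample entries
ofa-v2-mlx4_0-1 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 1" ""
ofa-v2-mlx4_0-2 u2.0 nonthreadsafe default libdaploscm.so.2 dapl.2.0 "mlx4_0 2" ""
ofa-v2-ib0 u2.0 nonthreadsafe default libdaplofa.so.2 dapl.2.0 "ib0 0" ""
EOF
keep_only_provider "$demo" '"mlx4_0 1"'
cat "$demo"
```

Which entries count as "unsupported" depends on your hardware and OFED
installation, so treat the device string as something to adapt.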

Then you should export some Intel MPI variables to set up the correct
environment.
Look up the documentation for the Intel MPI variables, such as I_MPI_DEVICE,
I_MPI_FABRICS, I_MPI_FALLBACK, I_MPI_DAPL_PROVIDER_LIST and I_MPI_DEBUG. If you
experiment enough I am sure you will get the desired result.

In our case we had set, for example, "I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1",
which solved similar problems if I remember correctly.
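
A minimal sketch of such an environment is below. Only the provider name comes
from our setup; the other values are illustrative assumptions to adapt, so
check the Intel MPI reference for your version. The srun line is left
commented out so the snippet can be sourced on its own.

```shell
# Illustrative values only; adjust for your site and Intel MPI version.

# Assumption: shared memory within a node, DAPL between nodes.
export I_MPI_FABRICS=shm:dapl
# Provider name from our setup; it must match an entry in dat.conf.
export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1
# Assumption: fail at startup rather than fall back to another fabric.
export I_MPI_FALLBACK=0
# Assumption: a debug level high enough to report the fabric chosen at startup.
export I_MPI_DEBUG=5

# srun ./hello
```

With fallback disabled, a wrong provider name shows up as an error at MPI
startup instead of a silently slower run, which makes experimenting easier.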

Best Regards,
Chrysovalantis Paschoulas


On 10/20/2014 06:46 PM, Tingyang Xu wrote:
To whom it may concern,
Hello. I am new to Slurm and am having a problem using Slurm with InfiniBand.
When I run MPI jobs on a freshly rebooted node, I get fabric errors. For
example, I tried a simple “hello world” with Intel MPI, like this:
$ salloc -N1 -n12 -w cn117 #cn117 is the node just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
 1) modules                    2) null                       3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0
================================================================
However, once I manually restart the slurm service on cn117, the problem is
fixed. For example:
$ ssh root@cn117
cn117#  service slurm restart
stopping slurmd:                                           [  OK  ]
slurmd is stopped
starting slurmd:                                           [  OK  ]
# exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
This is Process  9 out of 12 running on host cn117
This is Process  3 out of 12 running on host cn117
This is Process  2 out of 12 running on host cn117
This is Process  7 out of 12 running on host cn117
This is Process  6 out of 12 running on host cn117
This is Process  0 out of 12 running on host cn117
This is Process  5 out of 12 running on host cn117
This is Process  1 out of 12 running on host cn117
This is Process  4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process  8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117
=============================================================
Although I can do this manually, I would like it to be automatic. I
tried adding “sleep 10s;/etc/init.d/slurm restart” at the end of the
rc.local file, but the issue is still there. Can anyone help me with that?

Sincerely,
Tingyang Xu
HPC Administrator
University of Connecticut


PS: Some information about the system and the InfiniBand setup:
$ slurmd -V
slurm 14.03.0

cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):

cn117$ ibv_devinfo
hca_id: mlx4_0
transport:   InfiniBand (0)
fw_ver:    2.11.550
node_guid:
sys_image_guid:   ##########
vendor_id:   ##########
vendor_part_id:   ########
hw_ver:    0x0
board_id:   ########
phys_port_cnt:   2
 port: 1
  state:   PORT_ACTIVE (4)
  max_mtu:  4096 (5)
  active_mtu:  4096 (5)
  sm_lid:   1
  port_lid:  131
  port_lmc:  0x00
  link_layer:  InfiniBand

 port: 2
  state:   PORT_DOWN (1)
  max_mtu:  4096 (5)
  active_mtu:  4096 (5)
  sm_lid:   0
  port_lid:  0
  port_lmc:  0x00
  link_layer:  InfiniBand

cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)





------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
