Hi! This is almost certainly not related to Slurm; it looks like a problem with your InfiniBand + Intel MPI configuration. You may get better help on other forums or mailing lists ;)
First, I suggest configuring the dat.conf file correctly. In my case it is "/etc/dat.conf". You have to comment out all the lines for the unsupported IB modes. Then export the relevant Intel MPI variables to set up the correct environment. Look up the documentation for Intel MPI variables such as I_MPI_DEVICE, I_MPI_FABRICS, I_MPI_FALLBACK, I_MPI_DAPL_PROVIDER_LIST and I_MPI_DEBUG. If you experiment enough, I am sure you will get the desired result. In our case we set, for example, "I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1", which solved similar problems if I remember correctly.

Best Regards,
Chrysovalantis Paschoulas

On 10/20/2014 06:46 PM, Tingyang Xu wrote:

To whom it may concern,

Hello. I am new to Slurm. I am facing a problem using Slurm with InfiniBand. When I run MPI jobs on a freshly rebooted node, I get fabric errors. For example, I tried a simple “hello world” with Intel MPI:

$ salloc -N1 -n12 -w cn117 #cn117 is the node just rebooted
salloc: Granted job allocation 1201
$ module list
Currently Loaded Modulefiles:
  1) modules   2) null   3) intelics/2013.1.039
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
[3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
[2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
srun: error: cn117: tasks 0-11: Exited with exit code 254
srun: Terminating job step 1201.0
================================================================

However, as soon as I manually restart Slurm on cn117, the problem goes away. For example:

$ ssh root@cn117
cn117# service slurm restart
stopping slurmd: [ OK ]
slurmd is stopped
starting slurmd: [ OK ]
# exit
$ salloc -N1 -n12 -w cn117
salloc: Granted job allocation 1203
$ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
$ srun ./hello
This is Process 9 out of 12 running on host cn117
This is Process 3 out of 12 running on host cn117
This is Process 2 out of 12 running on host cn117
This is Process 7 out of 12 running on host cn117
This is Process 6 out of 12 running on host cn117
This is Process 0 out of 12 running on host cn117
This is Process 5 out of 12 running on host cn117
This is Process 1 out of 12 running on host cn117
This is Process 4 out of 12 running on host cn117
This is Process 10 out of 12 running on host cn117
This is Process 8 out of 12 running on host cn117
This is Process 11 out of 12 running on host cn117
=============================================================

Although I can do this manually, I would like the system to handle it automatically. I tried adding “sleep 10s;/etc/init.d/slurm restart” at the end of the rc.local file, but the issue is still there. Can anyone help me with that?
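
A fixed "sleep 10s" only helps if everything needed is ready within ten seconds. One alternative, sketched here under the assumption (not confirmed anywhere in this thread) that the restart only has to wait for the InfiniBand port to become active, is to poll ibv_devinfo for the PORT_ACTIVE state instead of sleeping for a fixed time. The helper names, the retry count, and the one-second interval are hypothetical choices, not part of Slurm or OFED.

```shell
#!/bin/sh
# Sketch only: helper names, retry count and interval are illustrative assumptions.

# Succeeds if the ibv_devinfo-style text on stdin reports at least one active port.
port_is_active() {
    grep -q 'state:[[:space:]]*PORT_ACTIVE'
}

# Runs the given command (expected to print ibv_devinfo-style output)
# up to $1 times, one second apart, until an active port is reported.
wait_for_active_port() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        if "$@" 2>/dev/null | port_is_active; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Possible use at the end of rc.local (left commented out on purpose):
# wait_for_active_port 60 ibv_devinfo && /etc/init.d/slurm restart
```

If the restart still does not help when issued this way, the cause is probably not port timing, and comparing the environment of the boot-time slurmd with that of a manually restarted one may be more informative.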

Sincerely,
Tingyang Xu
HPC Administrator
University of Connecticut

PS: Some information about the InfiniBand setup:

$ slurmd -V
slurm 14.03.0

cn117$ ofed_info|head -n1
MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):

cn117$ ibv_devinfo
hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.11.550
    node_guid:
    sys_image_guid: ##########
    vendor_id: ##########
    vendor_part_id: ########
    hw_ver: 0x0
    board_id: ########
    phys_port_cnt: 2
        port: 1
            state: PORT_ACTIVE (4)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 1
            port_lid: 131
            port_lmc: 0x00
            link_layer: InfiniBand
        port: 2
            state: PORT_DOWN (1)
            max_mtu: 4096 (5)
            active_mtu: 4096 (5)
            sm_lid: 0
            port_lid: 0
            port_lmc: 0x00
            link_layer: InfiniBand

cn117$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.5 (Santiago)

Forschungszentrum Juelich GmbH, 52425 Juelich
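
The two steps suggested in the reply (trimming dat.conf, then exporting Intel MPI variables) could look roughly like the sketch below. The provider name and the PMI library path come from the thread. The helper names, the choice of "shm:dapl" for I_MPI_FABRICS, disabling fallback, and debug level 5 are assumptions to adapt to your site. The filter writes to standard output rather than editing "/etc/dat.conf" in place, so the result can be reviewed first.

```shell
#!/bin/sh
# Sketch only: helper names and most values are illustrative assumptions.

# Print the dat.conf-style file $1 with every provider entry commented out
# except the one whose first field equals $2.
keep_only_provider() {
    awk -v keep="$2" '
        /^[ \t]*(#|$)/ { print; next }    # comment or blank line: pass through
        $1 == keep     { print; next }    # the provider to keep
                       { print "#" $0 }   # any other entry: comment it out
    ' "$1"
}

# Export the Intel MPI variables discussed in the reply.
setup_impi_env() {
    # Path taken from the original post.
    export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
    # Assumption: shared memory inside a node, DAPL between nodes.
    export I_MPI_FABRICS=shm:dapl
    # Provider named in the reply; it must match an active dat.conf entry.
    export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1
    # Assumption: fail loudly instead of falling back to another fabric.
    export I_MPI_FALLBACK=0
    # Assumption: a level high enough to print the selected fabric at startup.
    export I_MPI_DEBUG=5
}
```

A possible way to use it: run "keep_only_provider /etc/dat.conf ofa-v2-mlx4_0-1 > /tmp/dat.conf.new", review the output, then call setup_impi_env in the job shell before "srun ./hello". With I_MPI_DEBUG set, the startup messages should indicate which fabric was actually selected.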
