Thank you very much, Chrysovalantis. I have just created a topic on the Intel 
forum, although your suggestion did not fix our issue. I will also update this 
thread if I find a solution, in case other Slurm users run into a similar issue.

Thanks,
Tingyang Xu

From: Chrysovalantis Paschoulas 
Sent: Monday, October 27, 2014 10:45 AM
To: slurm-dev 
Subject: [slurm-dev] Re: slurm cannot work with Infiniband after rebooting

Hi!

This is certainly not connected to Slurm; it is a problem with your 
InfiniBand+IMPI configuration. You should ask for help on other forums or 
mailing lists ;)

First, I would suggest that you configure the dat.conf file correctly. In my 
case it is "/etc/dat.conf". You have to comment out all the lines for the 
unsupported IB modes.
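
As a rough sketch of that step: the helper below is hypothetical (the name 
keep_dat_provider is mine, not part of any package). It assumes the usual 
dat.conf layout where the first field of each entry is the provider name, and 
it comments out every entry except the one you name. It writes a ".bak" copy 
first, but please try it on a copy of the file before touching the real one.

```shell
# Hypothetical helper: comment out every dat.conf entry except one provider.
# Usage: keep_dat_provider <provider-name> <path-to-dat.conf>
keep_dat_provider() {
    keep="$1"
    conf="$2"
    # Back up the file, then rewrite it from the backup.
    cp "$conf" "$conf.bak" &&
    awk -v keep="$keep" '
        /^[ \t]*#/ || /^[ \t]*$/ { print; next }   # leave comments and blank lines alone
        $1 == keep               { print; next }   # leave the wanted provider active
                                 { print "#" $0 }  # comment out everything else
    ' "$conf.bak" > "$conf"
}
```

For example, "keep_dat_provider ofa-v2-mlx4_0-1 /etc/dat.conf" (as root) would 
leave only that entry active, assuming that provider name exists in your file.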

Then you should export some Intel MPI variables to set up the correct 
environment.
Look up the documentation for Intel MPI variables such as I_MPI_DEVICE, 
I_MPI_FABRICS, I_MPI_FALLBACK, I_MPI_DAPL_PROVIDER_LIST and I_MPI_DEBUG. If you 
experiment enough, I am sure you will get the desired result. 

In our case, for example, we had set "I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1", 
which solved similar problems, if I remember correctly.
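
To make that concrete, here is a minimal sketch of such an environment setup. 
The function name set_impi_dapl_env is hypothetical, and the values are only 
examples for an mlx4 HCA, not a verified fix for your cluster; the right 
fabric and provider depend on your Intel MPI version and your dat.conf.

```shell
# Hypothetical helper: export example Intel MPI settings for DAPL over InfiniBand.
# Usage: set_impi_dapl_env [dapl-provider-name]
set_impi_dapl_env() {
    # Shared memory within a node, DAPL between nodes (example choice).
    export I_MPI_FABRICS=shm:dapl
    # Fail instead of silently falling back to another fabric.
    export I_MPI_FALLBACK=disable
    # Provider name must match an active entry in dat.conf; the default is an assumption.
    export I_MPI_DAPL_PROVIDER_LIST="${1:-ofa-v2-mlx4_0-1}"
    # Ask Intel MPI for more verbose startup output.
    export I_MPI_DEBUG=5
}
```

After calling the function in the shell where you run salloc, try "srun ./hello" 
again; the extra I_MPI_DEBUG output should make it easier to see which fabric 
and provider were actually selected.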

Best Regards,
Chrysovalantis Paschoulas


On 10/20/2014 06:46 PM, Tingyang Xu wrote:

  To whom it may concern,
  Hello. I am new to Slurm and am facing a problem using it with InfiniBand. 
  When I run MPI jobs on a rebooted node, I get fabric errors. For example, I 
  tried a simple "hello world" with Intel MPI as follows:
  $ salloc -N1 -n12 -w cn117 #cn117 is the node just rebooted
  salloc: Granted job allocation 1201
  $ module list
  Currently Loaded Modulefiles:
    1) modules                    2) null                       3) intelics/2013.1.039
  $ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
  $ srun ./hello
  [3] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [4] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [5] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [6] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [7] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [8] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [10] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [11] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [0] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [9] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [1] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  [2] MPI startup(): ofa fabric is not available and fallback fabric is not enabled
  srun: error: cn117: tasks 0-11: Exited with exit code 254
  srun: Terminating job step 1201.0
  ================================================================
  However, once I manually restart Slurm on cn117, the problem is fixed. For 
  example:
  $ ssh root@cn117
  cn117#  service slurm restart
  stopping slurmd:                                           [  OK  ]
  slurmd is stopped
  starting slurmd:                                           [  OK  ]
  cn117# exit
  $ salloc -N1 -n12 -w cn117
  salloc: Granted job allocation 1203
  $ export I_MPI_PMI_LIBRARY=/gpfs/slurm/lib/libpmi.so
  $ srun ./hello
  This is Process  9 out of 12 running on host cn117
  This is Process  3 out of 12 running on host cn117
  This is Process  2 out of 12 running on host cn117
  This is Process  7 out of 12 running on host cn117
  This is Process  6 out of 12 running on host cn117
  This is Process  0 out of 12 running on host cn117
  This is Process  5 out of 12 running on host cn117
  This is Process  1 out of 12 running on host cn117
  This is Process  4 out of 12 running on host cn117
  This is Process 10 out of 12 running on host cn117
  This is Process  8 out of 12 running on host cn117
  This is Process 11 out of 12 running on host cn117
  =============================================================
  Although I can do this manually, I would still like the system to handle it 
  automatically. I tried adding "sleep 10s;/etc/init.d/slurm restart" to the end 
  of the rc.local file, but the issue is still there. Can anyone help me with this?

  Sincerely,
  Tingyang Xu
  HPC Administrator
  University of Connecticut


  PS: some information about the InfiniBand setup:
  $ slurmd -V
  slurm 14.03.0

  cn117$ ofed_info|head -n1
  MLNX_OFED_LINUX-2.2-1.0.1 (OFED-2.2-1.0.0):

  cn117$ ibv_devinfo
  hca_id: mlx4_0
  transport:   InfiniBand (0)
  fw_ver:    2.11.550
  node_guid:  
  sys_image_guid:   ##########
  vendor_id:   ##########
  vendor_part_id:   ########
  hw_ver:    0x0
  board_id:   ########
  phys_port_cnt:   2
    port: 1
     state:   PORT_ACTIVE (4)
     max_mtu:  4096 (5)
     active_mtu:  4096 (5)
     sm_lid:   1
     port_lid:  131
     port_lmc:  0x00
     link_layer:  InfiniBand

    port: 2
     state:   PORT_DOWN (1)
     max_mtu:  4096 (5)
     active_mtu:  4096 (5)
     sm_lid:   0
     port_lid:  0
     port_lmc:  0x00
     link_layer:  InfiniBand


  cn117$ cat /etc/redhat-release 
  Red Hat Enterprise Linux Workstation release 6.5 (Santiago)






 
