I'm getting the following error with openmpi/3.1.4 and openmpi/3.1.6 compiled 
with intel/19.5 (openmpi/2 and openmpi/4 do not exhibit the problem).  When I 
run 'mpirun --display-devel-allocation hostname' across two nodes that include 
node125 of our cluster, I get an error stating there are not enough slots in 
the system; the full error is at the end of this message.  If I run on node125 
alone, no error occurs (probably because everything stays on shared memory).
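For context, the failing run is launched from inside a two-node Slurm 
allocation.  The exact invocation below is illustrative, but it is something 
like this (node names match the allocation output at the end):

$ salloc -N 2 -w node117,node125
$ mpirun --display-devel-allocation hostname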

This error occurs only for node125; all other nodes behave correctly.  Node125 
is cloned from an image, and I recently re-cloned it, so the OS is identical 
to the other nodes'.  The motherboard was also recently replaced for unrelated 
reasons, but the error persists.  We've run multiple stress tests without 
error, and I've tried different network cables to no effect.  The system runs 
CentOS 7 with an InfiniBand network and Slurm 19.05.
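In case anyone wants to compare, these are the standard Slurm queries for 
checking what the controller thinks of node125 versus a known-good node (the 
grep pattern is mine, just to cut the output down):

$ scontrol show node node125 | grep -E 'CPUTot|Sockets|CoresPerSocket|ThreadsPerCore'
$ scontrol show node node117 | grep -E 'CPUTot|Sockets|CoresPerSocket|ThreadsPerCore'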

At this point, I don't know what else to try.  None of the Open MPI debug 
options I've tried has gotten me anywhere.  Any help would be greatly 
appreciated.
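For the record, the debug runs I mean are along these lines, using the 
standard MCA verbosity parameters for the allocation, mapping, and launch 
frameworks (the level 10 is arbitrary):

$ mpirun --mca ras_base_verbose 10 --mca rmaps_base_verbose 10 hostname
$ mpirun --mca plm_base_verbose 10 hostname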

Thanks,
Kris Garrett


$ mpirun --display-devel-allocation hostname
======================   ALLOCATED NODES   ======================
        node117: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
        node125: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 56
slots that were requested by the application:

  hostname

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
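(For reference, the workaround that last paragraph suggests would just be

$ mpirun --oversubscribe hostname

but that only sidesteps the slot count; it doesn't explain why node125 
triggers the error in the first place.)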
