I'm getting the following error with openmpi/3.1.4 and openmpi/3.1.6 compiled with intel/19.5 (openmpi/2 and openmpi/4 do not exhibit the problem). When I run 'mpirun --display-devel-allocation hostname' across 2 nodes including node125 of our cluster, I get an error stating there are not enough slots in the system; the full error is at the end of this message. If I run on just node125 alone, no error occurs (probably because it is then using only shared memory).
This error only occurs for node125; all other nodes behave correctly. Node125 is cloned from an image, and I recently re-cloned it, so the OS is identical to that of the other nodes. The motherboard was also recently replaced for unrelated reasons, but the error persists. We've run multiple stress tests without error, and I've tried different network cables with no effect. The system runs CentOS 7 with an InfiniBand network and Slurm 19.05. At this point I don't know what else to try; none of the Open MPI debug options I've tried have gotten me anywhere. Any help would be greatly appreciated.

Thanks,
Kris Garrett

$ mpirun --display-devel-allocation hostname

======================   ALLOCATED NODES   ======================
    node117: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
    node125: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 56
slots that were requested by the application:

  hostname

Either request fewer slots for your application, or make more slots
available for use.

A "slot" is the Open MPI term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the number of available slots when deciding the number of processes to launch.
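For context, we are not using a hostfile here (slots come from the Slurm allocation), but for item 1 in the error text above, a hostfile expressing what the allocation printout already shows for these two nodes would look roughly like this (a sketch; the filename is arbitrary):

    # hosts.txt -- one line per node, slot counts from the allocation above
    node117 slots=28
    node125 slots=28

and would be passed with 'mpirun --hostfile hosts.txt ...'. I mention it only to note that the requested 56 slots (2 nodes x 28) matches what the allocation printout itself reports as available, which is what makes the error so puzzling.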