Are you SURE node125 is identical to the others? Systems can boot up and silently disable DIMMs, for instance.
I would log on there and run free, lscpu, lspci, and dmidecode, then diff those outputs against the same outputs from a known-good node. hwloc/lstopo might also show a difference.

On Thu, 2 Apr 2020 at 20:38, Garrett, Charles via users <users@lists.open-mpi.org> wrote:

> I’m getting the following error with openmpi/3.1.4 and openmpi/3.1.6
> compiled with intel/19.5 (openmpi/2 and openmpi/4 do not exhibit the
> problem). When I run ‘mpirun --display-devel-allocation hostname’ over 2
> nodes including node125 of our cluster, I get an error stating there are
> not enough slots in the system. You can see the full error at the end of
> the message. If I run on just node125, no error occurs (probably because
> it is using only shared memory).
>
> This error only occurs for node125. All other nodes behave correctly.
> Node125 is cloned from an image. I recently re-cloned it, so the OS is
> identical to the other nodes. The motherboard was also recently replaced
> for other reasons, but the error persists. We’ve run multiple stress tests
> without error. I’ve tried different network cables without effect. The
> system runs with CentOS 7, an InfiniBand network, and Slurm 19.05.
>
> At this point, I don’t know what else to try. All debug messages from
> openmpi that I’ve tried have gotten me nowhere. Any help would be greatly
> appreciated.
> Thanks,
>
> Kris Garrett
>
> $ mpirun --display-devel-allocation hostname
> ====================== ALLOCATED NODES ======================
> node117: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
> node125: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 56
> slots that were requested by the application:
>
>   hostname
>
> Either request fewer slots for your application, or make more slots
> available for use.
>
> A "slot" is the Open MPI term for an allocatable unit where we can
> launch a process. The number of slots available are defined by the
> environment in which Open MPI processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, Open MPI defaults to the number of processor cores
>
> In all the above cases, if you want Open MPI to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --oversubscribe option to ignore the
> number of available slots when deciding the number of processes to
> launch.
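The compare-against-a-known-good-node procedure suggested at the top could be sketched roughly like this (a minimal sketch only: the output filename is just a convention, dmidecode generally requires root, and lspci/lstopo may be absent on minimal installs):

```shell
#!/bin/sh
# Run this once on the suspect node (node125) and once on a known-good
# node (e.g. node117), then diff the two resulting files. Each tool is
# allowed to fail gracefully so the script still produces a usable dump.
out="hwinfo-$(hostname).txt"
{
  echo "== free =="
  free -h 2>/dev/null || echo "(free unavailable)"
  echo "== lscpu =="
  lscpu 2>/dev/null || echo "(lscpu unavailable)"
  echo "== lspci =="
  lspci 2>/dev/null || echo "(lspci unavailable)"
  echo "== dmidecode (memory/processor) =="
  dmidecode -t memory -t processor 2>/dev/null || echo "(dmidecode needs root)"
  echo "== lstopo =="
  lstopo --of console 2>/dev/null || echo "(hwloc/lstopo unavailable)"
} > "$out"
echo "wrote $out"
```

Copy both files to one machine and run `diff hwinfo-node117.txt hwinfo-node125.txt`; a disabled DIMM, a missing PCI device, or a different core count should stand out immediately.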