Are you SURE node125 is identical to the others?
Systems can boot up with some DIMMs disabled, for instance.

I would log on there and run free, lscpu, lspci, and dmidecode.
Take those outputs and run a diff against the outputs from a known-good node.
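
Something like this, as a rough sketch (assuming node117 is one of the
known-good nodes, you can ssh to both, and you have root for dmidecode):

  for cmd in free lscpu lspci dmidecode; do
      ssh node117 "$cmd" > node117.$cmd.txt   # known-good reference
      ssh node125 "$cmd" > node125.$cmd.txt   # suspect node
      diff node117.$cmd.txt node125.$cmd.txt
  done

Any hunk in the diff (total memory, core count, a missing PCI device)
is a lead.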

Also, hwloc/lstopo might show some difference.
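For example, with the same assumptions as above (on CentOS 7 the
console tool may be packaged as lstopo-no-graphics):

  # compare hardware topology reports between the two nodes
  ssh node117 "lstopo --of console" > node117.topo.txt
  ssh node125 "lstopo --of console" > node125.topo.txt
  diff node117.topo.txt node125.topo.txt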

On Thu, 2 Apr 2020 at 20:38, Garrett, Charles via users <users@lists.open-mpi.org> wrote:

> I’m getting the following error with openmpi/3.1.4 and openmpi/3.1.6
> compiled with intel/19.5 (openmpi/2 and openmpi/4 do not exhibit the
> problem).  When I run ‘mpirun --display-devel-allocation hostname’ over 2
> nodes including node125 of our cluster, I get an error stating there are
> not enough slots in the system.  You can see the full error at the end of
> the message.  If I run on just node125, no error occurs (probably because
> it is using only shared memory).
>
> This error only occurs for node125.  All other nodes behave correctly.
> Node125 is cloned from an image.  I recently re-cloned it, so the OS is
> identical to the other nodes.  The motherboard was also recently replaced
> for other reasons, but the error persists.  We’ve run multiple stress tests
> without error.  I’ve tried different network cables without effect.  The
> system runs with CentOS 7, infiniband network, slurm 19.05.
>
> At this point, I don’t know what else to try.  All debug messages from
> openmpi that I’ve tried have gotten me nowhere.  Any help would be greatly
> appreciated.
>
> Thanks,
>
> Kris Garrett
>
> $ mpirun --display-devel-allocation hostname
> ======================   ALLOCATED NODES   ======================
>         node117: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
>         node125: flags=0x11 slots=28 max_slots=0 slots_inuse=0 state=UP
> =================================================================
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 56
> slots that were requested by the application:
>
>   hostname
>
> Either request fewer slots for your application, or make more slots
> available for use.
>
> A "slot" is the Open MPI term for an allocatable unit where we can
> launch a process.  The number of slots available are defined by the
> environment in which Open MPI processes are run:
>
>   1. Hostfile, via "slots=N" clauses (N defaults to number of
>      processor cores if not provided)
>   2. The --host command line parameter, via a ":N" suffix on the
>      hostname (N defaults to 1 if not provided)
>   3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
>   4. If none of a hostfile, the --host command line parameter, or an
>      RM is present, Open MPI defaults to the number of processor cores
>
> In all the above cases, if you want Open MPI to default to the number
> of hardware threads instead of the number of processor cores, use the
> --use-hwthread-cpus option.
>
> Alternatively, you can use the --oversubscribe option to ignore the
> number of available slots when deciding the number of processes to
> launch.
