Kurt,

Assuming you built Open MPI with tm support (the default if tm is detected
at configure time; you can also configure --with-tm to have configure abort
if tm support is not found), you should not need to use a hostfile.
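You can double-check with ompi_info (for example, ompi_info | grep tm
should list the tm components of the ras and plm frameworks when tm support
was built in).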

As a workaround, I would suggest you try
mpirun --map-by node -np 21 ...
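
For the spawn side, a minimal sketch of keeping each child on the manager's
own node via the reserved "host" info key could look like the code below
(the "./worker" command and spawning a single child at a time are
placeholders for your actual setup, and the spawn can still fail if no slot
is free on that node):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* name of the node this manager is running on */
    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    /* ask for the child to be placed on this same node */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);

    MPI_Comm intercomm;
    int errcode;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &intercomm, &errcode);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}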


Cheers,

Gilles

On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users <
users@lists.open-mpi.org> wrote:

> I’m using Open MPI 4.1.1, compiled with Nvidia’s nvc++ 20.9 and with
> Torque support.
>
> I want to reserve multiple slots on each node, and then launch a single
> manager process on each node.   The remaining slots would be filled up as
> the manager spawns new processes with MPI_Comm_spawn on its local node.
>
> Here is the abbreviated mpiexec command, which I assume is the source of
> the problem described below (?).   The hostfile was created by Torque and
> it contains many repeated node names, one for each slot that it reserved.
>
> $ mpiexec --hostfile  MyHostFile  -np 21 -npernode 1  (etc.)
>
> When MPI_Comm_spawn is called, MPI is reporting that “All nodes which are
> allocated for this job are already filled.”  They don’t appear to be
> filled, as it also reports that only one slot is in use on each node:
>
> ======================   ALLOCATED NODES   ======================
>
>         n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>         n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>
> Do you have any idea what I am doing wrong?   My Torque qsub arguments are
> unchanged from when I successfully launched this kind of job structure
> under MPICH.   The relevant argument to qsub is the resource list, which is
> “-l  nodes=21:ppn=9”.
>
