Kurt,

Assuming you built Open MPI with tm support (the default when tm is detected at configure time; you can also pass --with-tm to configure so that it aborts if tm support is not found), you should not need to use a hostfile at all.
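To illustrate what I mean (the executable name ./manager below is only a placeholder): you can check whether the tm components were built with

$ ompi_info | grep " tm"

and then, from inside the Torque job, launch without any hostfile, e.g.

$ mpirun -np 21 ./manager

since Open MPI should pick up the allocation directly from the tm API.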
As a workaround, I would suggest you try

$ mpirun --map-by node -np 21 ...

Cheers,

Gilles

On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

> I’m using Open MPI 4.1.1, built with Nvidia’s nvc++ 20.9 and with Torque support.
>
> I want to reserve multiple slots on each node and then launch a single manager process on each node. The remaining slots would be filled as each manager spawns new processes with MPI_Comm_spawn on its local node.
>
> Here is the abbreviated mpiexec command, which I assume is the source of the problem described below (?). The hostfile was created by Torque, and it contains many repeated node names, one for each slot that was reserved.
>
> $ mpiexec --hostfile MyHostFile -np 21 -npernode 1 (etc.)
>
> When MPI_Comm_spawn is called, MPI reports that “All nodes which are allocated for this job are already filled.” The nodes don’t appear to be filled, however, since the same output shows only one slot in use on each node:
>
> ====================== ALLOCATED NODES ======================
> n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
> n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
> n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
>
> Do you have any idea what I am doing wrong? My Torque qsub arguments are unchanged from when I successfully launched this kind of job structure under MPICH. The relevant argument to qsub is the resource list, which is “-l nodes=21:ppn=9”.
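P.S. In case it helps, here is a minimal sketch of the kind of spawn call I have in mind, using the standard "host" info key to keep the child on the manager's own node. The worker executable name and the single-process spawn are only illustrative, not taken from your code.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Name of the node this manager is running on. */
    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    /* Ask the runtime to place the spawned process on this node
     * ("host" is a reserved MPI_Comm_spawn info key). */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);

    /* Spawn one worker; "./worker" is a placeholder executable name. */
    MPI_Comm intercomm;
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, info,
                   0, MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

    MPI_Info_free(&info);
    MPI_Comm_disconnect(&intercomm);
    MPI_Finalize();
    return 0;
}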