Gilles and Ralph,

I did build with --with-tm.   I tried Gilles' workaround, but the failure still 
occurred.    What do I need to provide so that you can investigate this 
possible bug?

Thanks,
Kurt

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via 
users
Sent: Wednesday, November 3, 2021 8:45 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Reserving slots and filling them after job 
launch with MPI_Comm_spawn

Sounds like a bug to me - regardless of configuration, if the hostfile contains 
an entry for each slot on a node, OMPI should have added those up.



On Nov 3, 2021, at 2:49 AM, Gilles Gouaillardet via users 
<users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>> wrote:

Kurt,

Assuming you built Open MPI with tm support (default if tm is detected at 
configure time, but you can configure --with-tm to have it abort if tm support 
is not found), you should not need to use a hostfile.

As a workaround, I would suggest you try to
mpirun --map-by node -np 21 ...


Cheers,

Gilles

On Wed, Nov 3, 2021 at 6:06 PM Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>> wrote:
I’m using Open MPI 4.1.1, compiled with Nvidia’s nvc++ 20.9 and with Torque 
support.

I want to reserve multiple slots on each node, and then launch a single manager 
process on each node.   The remaining slots would be filled up as the manager 
spawns new processes with MPI_Comm_spawn on its local node.
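The manager-side spawn described above can be sketched roughly as follows. This is an illustrative assumption of what the manager might do, not Kurt's actual code: each manager passes a "host" info key naming its own node to MPI_Comm_spawn so the child lands on a slot local to that manager. The executable name "worker" is hypothetical.

```c
/* Hedged sketch: one manager spawning a worker on its own node.
 * "worker" is an illustrative executable name, not from the post. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    /* Ask the runtime to place the child on this manager's node. */
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);

    MPI_Comm intercomm;
    int spawn_err;  /* one error code per spawned process (maxprocs = 1) */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 1, info, 0,
                   MPI_COMM_SELF, &intercomm, &spawn_err);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```

It is this per-node spawn step that triggers the "already filled" error reported below.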

Here is the abbreviated mpiexec command, which I assume is the source of the 
problem described below.   The hostfile was created by Torque and contains 
many repeated node names, one for each slot that it reserved.

$ mpiexec --hostfile  MyHostFile  -np 21 -npernode 1  (etc.)


When MPI_Comm_spawn is called, MPI reports that “All nodes which are 
allocated for this job are already filled.”   The nodes don’t appear to be filled, 
as it also reports that only one slot is in use on each node:

======================   ALLOCATED NODES   ======================
        n022: flags=0x11 slots=9 max_slots=0 slots_inuse=1 state=UP
        n021: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n020: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n018: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n017: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n016: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n015: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n014: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n013: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n012: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n011: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n010: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n009: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n008: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n007: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n006: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n005: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n004: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n003: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n002: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP
        n001: flags=0x13 slots=9 max_slots=0 slots_inuse=1 state=UP

Do you have any idea what I am doing wrong?   My Torque qsub arguments are 
unchanged from when I successfully launched this kind of job structure under 
MPICH.   The relevant argument to qsub is the resource list, which is 
“-l nodes=21:ppn=9”.
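For reference, the Torque side of the setup above might look like the sketch below (the job-script and executable names are hypothetical). With -l nodes=21:ppn=9, Torque writes each node's name into $PBS_NODEFILE once per reserved slot, which is the hostfile handed to mpiexec:

```shell
# Illustrative submission (script name is an assumption):
qsub -l nodes=21:ppn=9 run_job.sh

# Inside the job, $PBS_NODEFILE lists each node nine times,
# one line per reserved slot, e.g.:
#   n001
#   n001
#   ...     (9 lines of n001, then 9 of n002, and so on)
#
# The launch then starts one manager per node, leaving the
# remaining 8 slots per node for MPI_Comm_spawn to fill:
mpiexec --hostfile $PBS_NODEFILE -np 21 -npernode 1 ./manager
```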

