Darn, I was hoping the flags would give a clue to the malfunction, which I’ve 
been trying to solve for weeks.  MPI_Comm_spawn() correctly spawns a worker on 
the node that mpirun is executing on, but for the other nodes it fails with the 
following:


****
There are no allocated resources for the application:
  /home/kmccall/mav/9.15_mpi/mav
that match the requested mapping:
  -host: n002.cluster.com:3

Verify that you have mapped the allocated resources properly for the
indicated specification.

[n002:08645] *** An error occurred in MPI_Comm_spawn
[n002:08645] *** reported by process [1225916417,4]
[n002:08645] *** on communicator MPI_COMM_SELF
[n002:08645] *** MPI_ERR_SPAWN: could not spawn processes
[n002:08645] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n002:08645] ***    and potentially your MPI job)

As you suggested several weeks ago, I added a process count to the host name 
(n001.cluster.com:3), but it didn’t help.  Here is how I set up the “info” 
argument to MPI_Comm_spawn to spawn a single worker:

        char info_str[64], host_str[64];
        MPI_Info info;

        // map one worker process per node
        snprintf(info_str, sizeof(info_str), "ppr:%d:node", 1);
        // appended ":3" (the slot count) to the host name, as suggested
        snprintf(host_str, sizeof(host_str), "%s:%d", host_name_.c_str(), 3);

        MPI_Info_create(&info);
        MPI_Info_set(info, "host", host_str);
        MPI_Info_set(info, "map-by", info_str);
        MPI_Info_set(info, "ompi_non_mpi", "true");
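
For completeness, here is a minimal sketch of the spawn call itself, using the 
info object built above. The executable path and MPI_COMM_SELF are taken from 
the error output; the argv and error-code handling are simplified placeholders 
rather than my exact code:

        MPI_Comm worker_comm;        // intercommunicator to the spawned worker
        int spawn_errcode[1];        // one entry per spawned process

        MPI_Comm_spawn("/home/kmccall/mav/9.15_mpi/mav",  // worker executable
                       MPI_ARGV_NULL,    // no extra command-line arguments
                       1,                // spawn a single worker
                       info,             // "host", "map-by", "ompi_non_mpi" set above
                       0,                // root rank of the spawning communicator
                       MPI_COMM_SELF,    // communicator named in the error output
                       &worker_comm,
                       spawn_errcode);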


From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via 
users
Sent: Tuesday, April 14, 2020 8:13 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

Then those flags are correct. I suspect mpirun is executing on n006, yes? The 
"location verified" just means that the daemon of rank N reported back from the 
node we expected it to be on - Slurm and Cray sometimes renumber the ranks. 
Torque doesn't and so you should never see a problem. Since mpirun isn't 
launched by itself, its node is never "verified", though I probably should 
alter that as it is obviously in the "right" place.

I don't know what you mean by your app isn't behaving correctly on the remote 
nodes - best guess is that perhaps some envar they need isn't being forwarded?



On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

CentOS, Torque.



From: Ralph Castain <r...@open-mpi.org>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

What kind of system are you running on? Slurm? Cray? ...?




On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

Thanks Ralph.  So the difference between the working node’s flag (0x11) and the 
non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED.  What 
does that imply?  That the location of the daemon has NOT been verified?

Kurt

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of a numerical value) for 
OMPI v5. In brief:

#define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
#define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                 // environments where the daemon's final destination is uncertain
#define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
#define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
#define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
#define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs
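
As a quick sketch of decoding a flags value against those bits (using nothing 
beyond the definitions above; 0x13 is one of the values from your output):

    #include <stdio.h>

    int main(void)
    {
        int flags = 0x13;   // e.g., nodes n001-n005 in the "ALLOCATED NODES" output

        if (flags & 0x01) printf("daemon launched\n");      // DAEMON_LAUNCHED
        if (flags & 0x02) printf("location verified\n");    // LOC_VERIFIED
        if (flags & 0x04) printf("oversubscribed\n");       // OVERSUBSCRIBED
        if (flags & 0x08) printf("mapped\n");                // MAPPED
        if (flags & 0x10) printf("slots given\n");           // SLOTS_GIVEN
        if (flags & 0x20) printf("non-usable (hosting a tool)\n");  // NON_USABLE
        return 0;
    }

So 0x11 decodes to DAEMON_LAUNCHED | SLOTS_GIVEN, and 0x13 additionally has 
LOC_VERIFIED set.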






On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org<mailto:users@lists.open-mpi.org>> wrote:

My application is behaving correctly on node n006, and incorrectly on the 
lower-numbered nodes.  The flags in the error message below may give a clue as 
to why.  What is the meaning of the flag values 0x11 and 0x13?

======================   ALLOCATED NODES   ======================
        n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
        n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
        n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I’m using Open MPI 4.0.3.

Thanks,
Kurt
