Darn, I was hoping the flags would give a clue to the malfunction, which I've been trying to solve for weeks. MPI_Comm_spawn() correctly spawns a worker on the node where mpirun is executing, but for the other nodes it fails with the following:
**** There are no allocated resources for the application:
  /home/kmccall/mav/9.15_mpi/mav
that match the requested mapping:
  -host: n002.cluster.com:3
Verify that you have mapped the allocated resources properly for the
indicated specification.
[n002:08645] *** An error occurred in MPI_Comm_spawn
[n002:08645] *** reported by process [1225916417,4]
[n002:08645] *** on communicator MPI_COMM_SELF
[n002:08645] *** MPI_ERR_SPAWN: could not spawn processes
[n002:08645] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[n002:08645] ***    and potentially your MPI job)

As you suggested several weeks ago, I added a process count to the host name (n001.cluster.com:3), but it didn't help. Here is how I set up the "info" argument to MPI_Comm_spawn to spawn a single worker:

    MPI_Info info;
    char info_str[64], host_str[64];

    sprintf(info_str, "ppr:%d:node", 1);
    sprintf(host_str, "%s:%d", host_name_.c_str(), 3);   // added ":3" to the host name

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_str);
    MPI_Info_set(info, "map-by", info_str);
    MPI_Info_set(info, "ompi_non_mpi", "true");
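For context, here is a minimal, self-contained sketch of how an info object like the one above might then be handed to MPI_Comm_spawn to launch one worker. The executable path and MPI_COMM_SELF are taken from the error output above; the hard-coded host string, the argv, the root rank, and the error-code handling are placeholder assumptions, not the actual mav code:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);

        /* Build the info object: pin the worker to one host, one process per node.
           The host value is a placeholder; the real code builds it at run time. */
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "host", "n002.cluster.com:3");
        MPI_Info_set(info, "map-by", "ppr:1:node");
        MPI_Info_set(info, "ompi_non_mpi", "true");

        /* Spawn a single worker; MPI_COMM_SELF matches the communicator
           reported in the error message. */
        MPI_Comm intercomm;
        MPI_Comm_spawn("/home/kmccall/mav/9.15_mpi/mav",  /* path from the error output */
                       MPI_ARGV_NULL, 1, info, 0,
                       MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);

        MPI_Info_free(&info);
        MPI_Finalize();
        return 0;
    }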
From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Tuesday, April 14, 2020 8:13 AM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

Then those flags are correct. I suspect mpirun is executing on n006, yes? The "location verified" flag just means that the daemon of rank N reported back from the node we expected it to be on - Slurm and Cray sometimes renumber the ranks. Torque doesn't, so you should never see a problem. Since mpirun isn't launched by itself, its node is never "verified", though I probably should alter that as it is obviously in the "right" place.

I don't know what you mean by your app not behaving correctly on the remote nodes - my best guess is that perhaps some envar it needs isn't being forwarded?

On Apr 14, 2020, at 2:04 AM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

CentOS, Torque.

From: Ralph Castain <r...@open-mpi.org>
Sent: Monday, April 13, 2020 5:44 PM
To: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

What kind of system are you running on? Slurm? Cray? ...?

On Apr 13, 2020, at 3:11 PM, Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov> wrote:

Thanks, Ralph. So the difference between the working node's flags (0x11) and the non-working nodes' flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED. What does that imply? That the location of the daemon has NOT been verified?

Kurt

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Ralph Castain via users
Sent: Monday, April 13, 2020 4:47 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Ralph Castain <r...@open-mpi.org>
Subject: [EXTERNAL] Re: [OMPI users] Meaning of mpiexec error flags

I updated the message to explain the flags (instead of a numerical value) for OMPI v5. In brief:

    #define PRRTE_NODE_FLAG_DAEMON_LAUNCHED   0x01   // whether or not the daemon on this node has been launched
    #define PRRTE_NODE_FLAG_LOC_VERIFIED      0x02   // whether or not the location has been verified - used for
                                                     //   environments where the daemon's final destination is uncertain
    #define PRRTE_NODE_FLAG_OVERSUBSCRIBED    0x04   // whether or not this node is oversubscribed
    #define PRRTE_NODE_FLAG_MAPPED            0x08   // whether we have been added to the current map
    #define PRRTE_NODE_FLAG_SLOTS_GIVEN       0x10   // the number of slots was specified - used only in non-managed environments
    #define PRRTE_NODE_NON_USABLE             0x20   // the node is hosting a tool and is NOT to be used for jobs

On Apr 13, 2020, at 2:15 PM, Mccall, Kurt E. (MSFC-EV41) via users <users@lists.open-mpi.org> wrote:

My application is behaving correctly on node n006 and incorrectly on the lower-numbered nodes. The flags in the error message below may give a clue as to why. What is the meaning of the flag values 0x11 and 0x13?

    ======================   ALLOCATED NODES   ======================
    n006: flags=0x11 slots=3 max_slots=0 slots_inuse=2 state=UP
    n005: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n004: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n003: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n002: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP
    n001: flags=0x13 slots=3 max_slots=0 slots_inuse=1 state=UP

I'm using Open MPI 4.0.3.

Thanks,
Kurt
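As an aside, decoding the values in the table against the flag definitions above: 0x11 is DAEMON_LAUNCHED | SLOTS_GIVEN, while 0x13 additionally has LOC_VERIFIED set. A small throwaway sketch (not part of Open MPI or the original code) that prints the expansion:

    #include <stdio.h>

    /* Flag bits copied from the PRRTE definitions quoted above. */
    #define PRRTE_NODE_FLAG_DAEMON_LAUNCHED 0x01
    #define PRRTE_NODE_FLAG_LOC_VERIFIED    0x02
    #define PRRTE_NODE_FLAG_OVERSUBSCRIBED  0x04
    #define PRRTE_NODE_FLAG_MAPPED          0x08
    #define PRRTE_NODE_FLAG_SLOTS_GIVEN     0x10
    #define PRRTE_NODE_NON_USABLE           0x20

    static void decode(int flags)
    {
        printf("0x%02x =", flags);
        if (flags & PRRTE_NODE_FLAG_DAEMON_LAUNCHED) printf(" DAEMON_LAUNCHED");
        if (flags & PRRTE_NODE_FLAG_LOC_VERIFIED)    printf(" LOC_VERIFIED");
        if (flags & PRRTE_NODE_FLAG_OVERSUBSCRIBED)  printf(" OVERSUBSCRIBED");
        if (flags & PRRTE_NODE_FLAG_MAPPED)          printf(" MAPPED");
        if (flags & PRRTE_NODE_FLAG_SLOTS_GIVEN)     printf(" SLOTS_GIVEN");
        if (flags & PRRTE_NODE_NON_USABLE)           printf(" NON_USABLE");
        printf("\n");
    }

    int main(void)
    {
        decode(0x11);   /* n006 (the node running mpirun)   */
        decode(0x13);   /* n001-n005 (the remote nodes)     */
        return 0;
    }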