Ralph,

I downloaded and compiled the August 7 tarball with Torque/Maui support and ran 
my test program, which resulted in the same error when MPI_Comm_spawn was called.  
Can you suggest anything that might be wrong with my environment or our Torque 
configuration that could be causing this?

Thanks,
Kurt

From: Ralph Castain <r...@open-mpi.org>
Sent: Tuesday, August 6, 2019 1:53 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: [EXTERNAL] Re: [OMPI users] MPI_Comm_Spawn failure: All nodes already 
filled

I'm afraid I cannot replicate this problem on OMPI master, so it could be 
something different about OMPI 4.0.1 or your environment. Can you download and 
test one of the nightly tarballs from the "master" branch and see if it works 
for you?

https://www.open-mpi.org/nightly/master/

Ralph



On Aug 6, 2019, at 3:58 AM, Mccall, Kurt E. (MSFC-EV41) via users 
<users@lists.open-mpi.org> wrote:

Hi,

MPI_Comm_spawn() is failing with the error message “All nodes which are 
allocated for this job are already filled.”  I compiled Open MPI 4.0.1 with the 
Portland Group C++ compiler, v. 19.5.0, both with and without Torque/Maui 
support.  I thought that building without Torque/Maui support would give me finer 
control over where MPI_Comm_spawn() places the processes, but the failure 
message was the same in either case.  Perhaps Torque is interfering with 
process creation somehow?

For the pared-down test code, I am following the instructions here to make 
mpiexec create exactly one manager process on a remote node, and then to force 
that manager to spawn one worker process on the same remote node:

https://stackoverflow.com/questions/47743425/controlling-node-mapping-of-mpi-comm-spawn
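
The essence of that approach, as I understand it, is to pass placement hints to 
MPI_Comm_spawn() through an MPI_Info object.  Here is a minimal sketch of just 
that part (the helper name make_placement_info is only for illustration; the 
“host” and “map-by” keys are the same ones I use in the full program below):

<code>
#include <mpi.h>

// Build an MPI_Info asking that the spawned process land on 'host',
// mapped one process per node.
static MPI_Info make_placement_info(const char *host)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);            // pin the spawn to this node
    MPI_Info_set(info, "map-by", "ppr:1:node");  // one process per node
    return info;
}
</code>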




Here is the full error message.  Note the “Max slots: 0” value in the second job map (?):

Data for JOB [39020,1] offset 0 Total slots allocated 22

========================   JOB MAP   ========================

Data for node: n001    Num slots: 2    Max slots: 2    Num procs: 1
        Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: N/A

=============================================================
Data for JOB [39020,1] offset 0 Total slots allocated 22

========================   JOB MAP   ========================

Data for node: n001    Num slots: 2    Max slots: 0    Num procs: 1
        Process OMPI jobid: [39020,1] App: 0 Process rank: 0 Bound: socket 
0[core 0[hwt 0]]:[B/././././././././.][./././././././././.]

=============================================================
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
[n001:08114] *** An error occurred in MPI_Comm_spawn
[n001:08114] *** reported by process [2557214721,0]
[n001:08114] *** on communicator MPI_COMM_SELF
[n001:08114] *** MPI_ERR_SPAWN: could not spawn processes
[n001:08114] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now 
abort,
[n001:08114] ***    and potentially your MPI job)




Here is my mpiexec command:

mpiexec --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by 
ppr:1:node SpawnTestManager
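
For reference, running the same command with Open MPI’s --display-allocation 
option added would also print the resource allocation that mpiexec detected, if 
that output would be useful:

mpiexec --display-allocation --display-map --v --x DISPLAY -hostfile MyNodeFile --np 1 -map-by ppr:1:node SpawnTestManager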




Here is my hostfile “MyNodeFile”:

n001.cluster.com slots=2 max_slots=2




Here is my SpawnTestManager code:

<code>
#include <iostream>
#include <string>
#include <cstdio>

#ifdef SUCCESS
#undef SUCCESS
#endif
#include "/opt/openmpi_pgc_tm/include/mpi.h"

using std::string;
using std::cout;
using std::endl;

int main(int argc, char *argv[])
{
    int rank, world_size;
    char *argv2[2];
    MPI_Comm mpi_comm;
    MPI_Info info;
    char host[MPI_MAX_PROCESSOR_NAME + 1];
    int host_name_len;

    string worker_cmd = "SpawnTestWorker";
    string host_name = "n001.cluster.com";

    argv2[0] = const_cast<char *>("dummy_arg");
    argv2[1] = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Get_processor_name(host, &host_name_len);
    cout << "Host name from MPI_Get_processor_name is " << host << endl;

    char info_str[64];
    sprintf(info_str, "ppr:%d:node", 1);
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host_name.c_str());
    MPI_Info_set(info, "map-by", info_str);

    MPI_Comm_spawn(worker_cmd.c_str(), argv2, 1, info, rank, MPI_COMM_SELF,
        &mpi_comm, MPI_ERRCODES_IGNORE);
    MPI_Comm_set_errhandler(mpi_comm, MPI::ERRORS_THROW_EXCEPTIONS);

    std::cout << "Manager success!" << std::endl;

    MPI_Finalize();
    return 0;
}
</code>
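
In case more detail would help, here is an untested sketch of a variant I could 
try that asks MPI_Comm_spawn() to return its error code instead of aborting, and 
then prints the translated message (the helper name spawn_with_diagnostics is 
only for illustration):

<code>
#include <mpi.h>
#include <cstdio>

// Sketch: switch MPI_COMM_SELF to MPI_ERRORS_RETURN so that MPI_Comm_spawn()
// returns an error code rather than aborting, then translate the code.
static int spawn_with_diagnostics(const char *cmd, char *spawn_argv[],
                                  MPI_Info info, MPI_Comm *intercomm)
{
    MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

    int err = MPI_Comm_spawn(cmd, spawn_argv, 1, info, 0, MPI_COMM_SELF,
                             intercomm, MPI_ERRCODES_IGNORE);
    if (err != MPI_SUCCESS) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(err, msg, &len);
        std::fprintf(stderr, "MPI_Comm_spawn failed: %s\n", msg);
    }
    return err;
}
</code>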



Here is my SpawnTestWorker code:

<code>
#include "/opt/openmpi_pgc_tm/include/mpi.h"
#include <iostream>

int main(int argc, char *argv[])
{
    int world_size, rank;
    MPI_Comm manager_intercom;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    MPI_Comm_get_parent(&manager_intercom);
    MPI_Comm_set_errhandler(manager_intercom, MPI::ERRORS_THROW_EXCEPTIONS);

    std::cout << "Worker success!" << std::endl;

    MPI_Finalize();
    return 0;
}
</code>

My config.log can be found here:
https://gist.github.com/kmccall882/e26bc2ea58c9328162e8959b614a6fce.js

I’ve attached the other info requested on the help page, except the output 
of "ompi_info -v ompi full --parsable".   My version of ompi_info doesn’t 
accept the “ompi full” arguments, and the “-all” arg doesn’t produce much 
output.

Thanks for your help,
Kurt








