To rhc,
    Thanks for those suggestions. Here are the results:

(1) Added "--oversubscribe" to the mpirun command (I also added
"--output-filename junk" -- see the output from that file below).
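With the new options added to the command line from my earlier
message, the full command was of the form:

   /usr/lib64/openmpi-1.10/bin/mpirun --oversubscribe --output-filename junk \
       --hostfile nodes120 -n 1 Rank0Pgm : -n 116 RanknPgm < InputFile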
The terminal output had this fairly usual error message (shortened):

-------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero
status, thus causing the job to be terminated. The first process
to do so was:

  Process name: [[37749,2],0]
  Exit code:    1
-------------------------------------------------------

A file junk.2.000 (presumably stderr) was also written; its edited
contents follow (duplicate output from multiple nodes deleted):

-------------------------------------------------------
[Node0.rockefeller.edu:20366] PSM EP connect error (Endpoint could not be reached):
[Node0.rockefeller.edu:20366] Node0
[Node0.rockefeller.edu:20366] Node0
[Node0.rockefeller.edu:20366] Node0
----A bunch of identical lines deleted----
[Node0.rockefeller.edu:20366] n0003
[Node0.rockefeller.edu:20366] n0003
[Node0.rockefeller.edu:20366] n0003
----A bunch of identical lines deleted----
[Node0.rockefeller.edu:20366] n0004
[Node0.rockefeller.edu:20366] n0004
[Node0.rockefeller.edu:20366] n0004
----A bunch of identical lines deleted----
[Node0.rockefeller.edu:20366]
[Node0.rockefeller.edu:20366] [[37749,2],0] ORTE_ERROR_LOG: Error in file dpm_orte.c at line 523
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[Node0.rockefeller.edu:20366] Local abort before MPI_INIT completed
successfully; not able to aggregate error messages, and not able to
guarantee that all other processes were killed!
-------------------------------------------------------

I note that these errors apparently occurred in MPI_Init, before my
attempt to spawn additional processes.

(2) Modified the MPI_INFO to be "host", "Node0:22" so it thinks there
are more slots available. Since I actually try to spawn two processes,
I put "Node0:22" for the first one and "Node0:23" for the second one.
I simply get this on the terminal, with no "junk" files:

--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

This is the same whether I have "slots=22 max-slots=22" or
"slots=21 max-slots=24" in the hostfile.
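In code, the spawn setup for the first process now looks like this
sketch (assuming the usual #include <mpi.h>, with argv, hostid, and
commc set up as in my earlier message; the second spawn is analogous,
with "Node0:23"):

   MPI_Info mpinfo;
   MPI_Comm commd;
   int sperr;

   MPI_Info_create(&mpinfo);
   /* "host" value with a :slots suffix, per your suggestion */
   MPI_Info_set(mpinfo, "host", "Node0:22");
   MPI_Comm_spawn("andmsg", argv, 1, mpinfo,
                  hostid, commc, &commd, &sperr);
   MPI_Info_free(&mpinfo);   /* no longer needed after the spawn */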
(3) Using the MPI_INFO as in (2), I also tried adding "--bind-to core"
to the mpirun line. This may be the most interesting output:

--------------------------------------------------------------------------
WARNING: a request was made to bind a process. While the system
supports binding the process itself, at least one node does NOT
support binding memory to the process location.

  Node:  Node0

This usually is due to not having the required NUMA support installed
on the node. In some Linux distributions, the required support is
contained in the libnumactl and libnumactl-devel packages.
This is a warning only; your job will continue, though performance
may be degraded.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A request was made to bind to that would result in binding more
processes than cpus on a resource:

  Bind to:     CORE
  Node:        Node0
  #processes:  2
  #cpus:       1

You can override this protection by adding the "overload-allowed"
option to your binding directive.
--------------------------------------------------------------------------

Indeed, the packages mentioned are not installed. I found some
discussion of this at https://github.com/open-mpi/ompi/issues/1087
which claims this message should really refer to "hwloc", which is
another thing I know nothing about.
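(If it matters, I gather from that last message that the protection
could be overridden with a qualifier on the binding directive,
presumably something like "--bind-to core:overload-allowed" on the
mpirun line, but I have not tried that.)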
Does any of this help or suggest something else to try?

Thanks,
George Reeke

On Fri, 2017-10-06 at 13:55 -0700, r...@open-mpi.org wrote:
> Couple of things you can try:
>
> * add --oversubscribe to your mpirun cmd line so it doesn't care how
>   many slots there are
>
> * modify your MPI_INFO to be "host", "node0:22" so it thinks there
>   are more slots available
>
> It's possible that the "host" info processing has a bug in it, but
> this will tell us a little more and hopefully get you running. If
> you want to bind your processes to cores, then add "--bind-to core"
> to the cmd line
>
> > On Oct 6, 2017, at 1:35 PM, George Reeke <re...@mail.rockefeller.edu> wrote:
> >
> > Dear colleagues,
> > I need some help controlling where a process spawned with
> > MPI_Comm_spawn goes. I am running openmpi-1.10 under CentOS 6.7.
> > My application is written in C and runs on a RedBarn system with
> > a master node (hardware box) that connects to the outside world
> > and two other nodes connected to it via Ethernet and InfiniBand.
> > There are two executable files: one (I'll call it "Rank0Pgm")
> > that expects to be rank 0 and does all the I/O, and the other
> > ("RanknPgm") that only communicates via MPI messages. There are
> > two MPI_Comm_spawns that run just after MPI_Init and an initial
> > broadcast that shares some setup info, like this:
> >    MPI_Comm_spawn("andmsg", argv, 1, MPI_INFO_NULL,
> >                   hostid, commc, &commd, &sperr);
> > where "andmsg" is a program that needs to communicate with the
> > internet and with all the other processes via a new communicator
> > that will be called commd (and another name for the other one).
> > When I run this program with no hostfile and an mpirun line
> > something like this on a node with 32 cores:
> >    /usr/lib64/openmpi-1.10/bin/mpirun -n 1 Rank0Pgm : -n 28 RanknPgm \
> >      < InputFile
> > everything works fine. I assume the spawns use 2 of the 3
> > available cores that I did not ask the program to use.
> >
> > Now I want to run on the full network, so I make a hostfile like
> > this (call it "nodes120"):
> >    node0 slots=22 max-slots=22
> >    n0003 slots=40 max-slots=40
> >    n0004 slots=56 max-slots=56
> > where node0 has 24 cores and I am trying to leave room for my two
> > spawned processes. The spawned processes have to be able to
> > contact the internet, so I make an MPI_Info with MPI_Info_create
> > and
> >    MPI_Info_set(mpinfo, "host", "node0")
> > and change the MPI_INFO_NULL in the spawn calls to point to this
> > new MPI_Info. (If I leave the MPI_INFO_NULL I get a different
> > error that is probably not of interest here.)
> >
> > Now I run mpirun as above, except with "--hostfile nodes120" and
> > "-n 116" after the colon. Now I get this error:
> >
> > "There are not enough slots available in the system to satisfy
> > the 1 slots that were requested by the application:
> >   andmsg
> > Either request fewer slots for your application, or make more
> > slots available for use."
> >
> > I get the same error with "max-slots=24" on the first line of the
> > hosts file.
> >
> > Sorry for the length of all that. Request for help: How do I set
> > things up to run my rank 0 program and enough copies of RanknPgm
> > to fill all but some number of cores on the master hardware node,
> > and all the other rank n programs on the other hardware "nodes"
> > (boxes of CPUs)? [My application will do best with the default
> > "by slot" scheduling.]
> >
> > Suggestions much appreciated. I am quite convinced my code is OK,
> > in that it runs OK as shown above on one hardware box. It also
> > runs on my laptop with 4 cores and "-n 3 RanknPgm", so I guess I
> > don't even really need to reserve cores for the two spawned
> > processes. I thought of using old-fashioned 'fork' but I really
> > want the extra communicators to keep asynchronous messages
> > separated. The documentation says overloading is OK by default,
> > so maybe something else is wrong here.
> >
> > George Reeke

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users