Hi Ralph,

I've not yet determined whether this is actually a PMIx issue or a problem with the way the dpm code in OMPI handles PMIx namespaces.

Howard

On Tue, 11 Aug 2020 at 19:34, Ralph Castain via users <users@lists.open-mpi.org> wrote:

> Howard - if there is a problem in PMIx that is causing this problem, then
> we really could use a report on it ASAP as we are getting ready to release
> v3.1.6 and I doubt we have addressed anything relevant to what is being
> discussed here.
>
> On Aug 11, 2020, at 4:35 PM, Martín Morales via users
> <users@lists.open-mpi.org> wrote:
>
> Hi Howard.
>
> Great! That works for the crashing problem with OMPI 4.0.4. However, it
> still hangs if I remove "master" (the host which launches the spawning
> processes) from my hostfile.
> I need to spawn only on "worker". Is there a way or workaround for doing
> this without mpirun?
> Thanks a lot for your assistance.
>
> Martín
>
> From: Howard Pritchard <hpprit...@gmail.com>
> Sent: Monday, 10 August 2020 19:13
> To: Martín Morales <martineduardomora...@hotmail.com>
> Cc: Open MPI Users <users@lists.open-mpi.org>
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
> Hi Martin,
>
> I was able to reproduce this with the 4.0.x branch. I'll open an issue.
>
> If you really want to use 4.0.4, then what you'll need to do is build an
> external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and
> then build Open MPI using --with-pmix=<where your PMIx is installed>.
> You will also need to build both Open MPI and PMIx against the same
> libevent. There's a configure option in both packages to use an
> external libevent installation.
>
> Howard
>
> On Mon, 10 Aug 2020 at 13:52, Martín Morales
> <martineduardomora...@hotmail.com> wrote:
>
> Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Should I
> post this in the bug section? Thanks and regards.
>
> Martín
>
> From: Howard Pritchard <hpprit...@gmail.com>
> Sent: Monday, 10 August 2020 14:44
> To: Open MPI Users <users@lists.open-mpi.org>
> Cc: Martín Morales <martineduardomora...@hotmail.com>
> Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
> Hello Martin,
>
> Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx
> version, which introduced a problem with spawn in versions 4.0.2-4.0.4.
> This is supposed to be fixed in the 4.0.5 release. Could you try the
> 4.0.5rc1 tarball and see if that addresses the problem you're seeing?
>
> https://www.open-mpi.org/software/ompi/v4.0/
>
> Howard
>
> On Thu, 6 Aug 2020 at 09:50, Martín Morales via users
> <users@lists.open-mpi.org> wrote:
>
> Hello people!
> I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one
> "master" and one "worker", on an Ethernet LAN, both running Ubuntu 18.04.
> I built OMPI just like this:
>
> ./configure --prefix=/usr/local/openmpi-4.0.4/bin/
>
> My hostfile is this:
>
> master slots=2
> worker slots=2
>
> I'm trying to dynamically allocate the processes with MPI_Comm_spawn().
> If I launch the processes only on the "master" machine, it's OK. But if I
> use the hostfile, it crashes with this:
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
>   Process 2 ([[35155,1],0]) is on host: unknown!
>   BTLs attempted: tcp self
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems. This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> --------------------------------------------------------------------------
> [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
> [nos-GF7050VT-M:22526] *** on a NULL communicator
> [nos-GF7050VT-M:22526] *** Unknown error
> [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> [nos-GF7050VT-M:22526] ***    and potentially your MPI job)
>
> Note: host "nos-GF7050VT-M" is "worker"
>
> But if I run without "master" in the hostfile, the processes are launched
> but it hangs: MPI_Init() doesn't return.
> I launched the program (pasted below) in these 2 ways with the same result:
>
> $ ./simple_spawn 2
> $ mpirun -np 1 ./simple_spawn 2
>
> The "simple_spawn" program:
>
> #include "mpi.h"
> #include <stdio.h>
> #include <stdlib.h>
>
> int main(int argc, char ** argv){
>     int processesToRun;
>     MPI_Comm parentcomm, intercomm;
>     MPI_Info info;
>     int rank, size, hostName_len;
>     /* Sized to the MPI limit so MPI_Get_processor_name cannot overflow */
>     char hostName[MPI_MAX_PROCESSOR_NAME];
>
>     MPI_Init( &argc, &argv );
>     MPI_Comm_get_parent( &parentcomm );
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Get_processor_name(hostName, &hostName_len);
>
>     if (parentcomm == MPI_COMM_NULL) {
>         /* No parent communicator: this is the original (parent) process */
>         if(argc < 2 ){
>             printf("Processes number needed!\n");
>             MPI_Finalize();  /* finalize on the early-exit path too */
>             return 0;
>         }
>         processesToRun = atoi(argv[1]);
>         MPI_Info_create( &info );
>         MPI_Info_set( info, "hostfile", "./hostfile" );
>         MPI_Info_set( info, "map_by", "node" );
>         MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
>                         MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
>         printf("I'm the parent.\n");
>     } else {
>         /* Spawned child: report host, rank and size */
>         printf("I'm the spawned h: %s r/s: %i/%i.\n", hostName, rank, size );
>     }
>     fflush(stdout);
>     MPI_Finalize();
>     return 0;
> }
>
> I came from OMPI 4.0.1. In that version it's working... with some
> inconsistencies, I'm afraid. That's why I decided to upgrade to OMPI 4.0.4.
> I tried several versions with no luck. Is there maybe an intrinsic problem
> with the OMPI dynamic allocation functionality?
> Any help will be very much appreciated. Best regards.
>
> Martín