Hi Howard, that's right. This happens some time after I run ./simple_spawn <PROCESSES_NUMBER> with the hostfile that has no "master" host in it (just "worker").

Regards,
Martín
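[Editor's note: for concreteness, the worker-only hostfile being described here is presumably just the hostfile from the original report (quoted at the bottom of this thread) with the master line removed:

worker slots=2
]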
From: Howard Pritchard <hpprit...@gmail.com>
Sent: Saturday, August 15, 2020 15:09
To: Martín Morales <martineduardomora...@hotmail.com>
Cc: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamic process allocation. OMPI 4.0.1 doesn't.

Hi Martin,

Thanks, this is helpful. Are you getting this timeout when you're running the spawner process as a singleton?

Howard

On Fri., Aug 14, 2020 at 17:44, Martín Morales <martineduardomora...@hotmail.com> wrote:

Howard, I pasted below the error message that shows up a while after the hang I referred to. Regards, Martín

--------------------------------------------------------------------------
A request has timed out and will therefore fail:

  Operation: LOOKUP: orted/pmix/pmix_server_pub.c:345

Your job may terminate as a result of this problem. You may want to
adjust the MCA parameter pmix_server_max_wait and try again. If this
occurred during a connect/accept operation, you can adjust that time
using the pmix_base_exchange_timeout parameter.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Timeout" (-15) instead of "Success" (0)
--------------------------------------------------------------------------
[nos-GF7050VT-M:03767] *** An error occurred in MPI_Init
[nos-GF7050VT-M:03767] *** reported by process [2337734658,0]
[nos-GF7050VT-M:03767] *** on a NULL communicator
[nos-GF7050VT-M:03767] *** Unknown error
[nos-GF7050VT-M:03767] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nos-GF7050VT-M:03767] ***    and potentially your MPI job)
[osboxes:02457] *** An error occurred in MPI_Comm_spawn
[osboxes:02457] *** reported by process [2337734657,0]
[osboxes:02457] *** on communicator MPI_COMM_WORLD
[osboxes:02457] *** MPI_ERR_UNKNOWN: unknown error
[osboxes:02457] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[osboxes:02457] ***    and potentially your MPI job)
[osboxes:02458] 1 more process has sent help message help-orted.txt / timedout
[osboxes:02458] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

From: Martín Morales via users <users@lists.open-mpi.org>
Sent: Friday, August 14, 2020 19:40
To: Howard Pritchard <hpprit...@gmail.com>
Cc: Martín Morales <martineduardomora...@hotmail.com>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamic process allocation. OMPI 4.0.1 doesn't.

Hi Howard. Thanks for tracking this on GitHub. I ran with mpirun, without "master" in the hostfile, and it runs OK. The hang occurs when I run as a singleton (no mpirun), which is the way I need to run. If I run top on both machines, the processes are correctly mapped but hung; it seems the MPI_Init() function doesn't return. Thanks for your help.

Best regards,
Martín
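[Editor's note: the two timeouts named in the help text above are ordinary MCA parameters, so when launching through mpirun they could presumably be raised like this (120 is an arbitrary number of seconds, not a recommendation):

$ mpirun --mca pmix_server_max_wait 120 --mca pmix_base_exchange_timeout 120 -np 1 ./simple_spawn 2

For a singleton launch there is no mpirun command line, but MCA parameters can normally be supplied through the environment instead; whether the 4.0.x singleton picks these up under the OMPI_MCA_ prefix or the PMIX_MCA_ prefix is an assumption worth verifying:

$ OMPI_MCA_pmix_base_exchange_timeout=120 ./simple_spawn 2
]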
From: Howard Pritchard <hpprit...@gmail.com>
Sent: Friday, August 14, 2020 15:18
To: Martín Morales <martineduardomora...@hotmail.com>
Cc: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamic process allocation. OMPI 4.0.1 doesn't.

Hi Martin,

I opened an issue on Open MPI's GitHub to track this: https://github.com/open-mpi/ompi/issues/8005

You may be seeing another problem if you removed "master" from the host file. Could you add the --debug-daemons option to mpirun and post the output?

Howard

On Tue., Aug 11, 2020 at 17:35, Martín Morales <martineduardomora...@hotmail.com> wrote:

Hi Howard. Great, that fixes the crashing problem with OMPI 4.0.4! However, it still hangs if I remove "master" (the host that launches the spawning processes) from my hostfile; I need to spawn only on "worker". Is there a way or workaround to do this without mpirun? Thanks a lot for your assistance. Martín

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, August 10, 2020 19:13
To: Martín Morales <martineduardomora...@hotmail.com>
Cc: Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamic process allocation. OMPI 4.0.1 doesn't.

Hi Martin,

I was able to reproduce this with the 4.0.x branch; I'll open an issue. If you really want to use 4.0.4, what you'll need to do is build an external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and then build Open MPI using --with-pmix=<where your PMIx is installed>. You will also need to build both Open MPI and PMIx against the same libevent; there's a configure option in both packages for using an external libevent installation. (A sketch of these build steps appears below, after this exchange.)

Howard

On Mon., Aug 10, 2020 at 13:52, Martín Morales <martineduardomora...@hotmail.com> wrote:

Hi Howard. Unfortunately, the issue persists in OMPI 4.0.5rc1. Should I post this in the bug section? Thanks and regards. Martín

From: Howard Pritchard <hpprit...@gmail.com>
Sent: Monday, August 10, 2020 14:44
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Martín Morales <martineduardomora...@hotmail.com>
Subject: Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamic process allocation. OMPI 4.0.1 doesn't.

Hello Martin,

Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx version, which introduced a problem with spawn in the 4.0.2-4.0.4 versions. This is supposed to be fixed in the 4.0.5 release. Could you try the 4.0.5rc1 tarball and see if that addresses the problem you're seeing? https://www.open-mpi.org/software/ompi/v4.0/

Howard
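[Editor's note: a hedged sketch of the external-PMIx build Howard describes above. The install prefixes here are hypothetical, and the exact configure options should be checked against each package's own documentation:

$ # Build PMIx 3.1.2 and Open MPI 4.0.4 against the same external libevent.
$ cd pmix-3.1.2
$ ./configure --prefix=/opt/pmix-3.1.2 --with-libevent=/opt/libevent
$ make -j4 && make install
$ cd ../openmpi-4.0.4
$ ./configure --prefix=/opt/openmpi-4.0.4 \
              --with-pmix=/opt/pmix-3.1.2 --with-libevent=/opt/libevent
$ make -j4 && make install
]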
On Thu., Aug 6, 2020 at 09:50, Martín Morales via users <users@lists.open-mpi.org> wrote:

Hello people! I'm using OMPI 4.0.4 in a very simple scenario: just two machines, one "master" and one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built OMPI like this:

./configure --prefix=/usr/local/openmpi-4.0.4/bin/

My hostfile is this:

master slots=2
worker slots=2

I'm trying to dynamically allocate the processes with MPI_Comm_spawn(). If I launch the processes only on the "master" machine, it's OK, but if I use the hostfile it crashes with this:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
  Process 2 ([[35155,1],0]) is on host: unknown!
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
[nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
[nos-GF7050VT-M:22526] *** on a NULL communicator
[nos-GF7050VT-M:22526] *** Unknown error
[nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nos-GF7050VT-M:22526] ***    and potentially your MPI job)

Note: host "nos-GF7050VT-M" is "worker".

But if I run without "master" in the hostfile, the processes are launched but hang: MPI_Init() doesn't return. I launched the program (pasted below) in these two ways, with the same result:

$ ./simple_spawn 2
$ mpirun -np 1 ./simple_spawn 2

The "simple_spawn" program:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int processesToRun;
    MPI_Comm parentcomm, intercomm;
    MPI_Info info;
    int rank, size, hostName_len;
    char hostName[200];

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parentcomm);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(hostName, &hostName_len);

    if (parentcomm == MPI_COMM_NULL) {
        /* No parent communicator: this is the original (spawner) process. */
        if (argc < 2) {
            printf("Processes number needed!\n");
            MPI_Finalize();   /* finalize before the early exit */
            return 0;
        }
        processesToRun = atoi(argv[1]);
        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "./hostfile");
        MPI_Info_set(info, "map_by", "node");
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
                       MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
        printf("I'm the parent.\n");
    } else {
        printf("I'm the spawned h: %s r/s: %i/%i.\n", hostName, rank, size);
    }

    fflush(stdout);
    MPI_Finalize();
    return 0;
}

I came from OMPI 4.0.1; in that version it works, though with some inconsistencies, I'm afraid, which is why I decided to upgrade to OMPI 4.0.4. I have tried several versions with no luck. Is there maybe an intrinsic problem with OMPI's dynamic allocation functionality? Any help will be very appreciated. Best regards. Martín
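[Editor's note: one way to get more diagnostics out of the failing spawn in the program above is to ask MPI to return errors instead of aborting, and to collect the per-process error codes that the original discards via MPI_ERRCODES_IGNORE. A minimal, untested sketch of the parent side follows; it will not help with the hang case, where the children never return from MPI_Init, but it can make the crash case reportable:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Comm parentcomm, intercomm;
    MPI_Info info;
    int i, rc, n;
    int *errcodes;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parentcomm);

    if (parentcomm == MPI_COMM_NULL) {          /* parent (spawner) side only */
        n = (argc > 1) ? atoi(argv[1]) : 2;
        errcodes = malloc(n * sizeof(int));

        /* Return errors from calls on this communicator instead of aborting,
           so a failed spawn can be inspected by the caller. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        MPI_Info_create(&info);
        MPI_Info_set(info, "hostfile", "./hostfile");   /* same keys as above */
        MPI_Info_set(info, "map_by", "node");

        rc = MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, n, info, 0,
                            MPI_COMM_WORLD, &intercomm, errcodes);
        if (rc != MPI_SUCCESS) {
            /* One error code per requested child process. */
            for (i = 0; i < n; i++)
                fprintf(stderr, "spawn errcode[%d] = %d\n", i, errcodes[i]);
        } else {
            printf("spawn succeeded\n");
        }

        MPI_Info_free(&info);
        free(errcodes);
    }
    /* Spawned children just fall through to Finalize. */
    MPI_Finalize();
    return 0;
}
]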