Hello people!

I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one "master" and one "worker", on an Ethernet LAN. Both run Ubuntu 18.04. I built OMPI just like this:

    ./configure --prefix=/usr/local/openmpi-4.0.4/bin/

My hostfile is this:

    master slots=2
    worker slots=2

I'm trying to dynamically allocate the processes with MPI_Comm_spawn(). If I launch the processes only on the "master" machine, it works. But if I use the hostfile, it crashes with this:

--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M
  Process 2 ([[35155,1],0]) is on host: unknown!
  BTLs attempted: tcp self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.
This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
[nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
[nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
[nos-GF7050VT-M:22526] *** on a NULL communicator
[nos-GF7050VT-M:22526] *** Unknown error
[nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[nos-GF7050VT-M:22526] *** and potentially your MPI job)

Note: host "nos-GF7050VT-M" is "worker".

But if I run without "master" in the hostfile, the processes are launched but it hangs: MPI_Init() doesn't return.

I launched the program (pasted below) in these 2 ways, with the same result:

    $ ./simple_spawn 2
    $ mpirun -np 1 ./simple_spawn 2

The "simple_spawn" program:

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char ** argv){
        int processesToRun;
        MPI_Comm parentcomm, intercomm;
        MPI_Info info;
        int rank, size, hostName_len;
        char hostName[200];

        MPI_Init( &argc, &argv );
        MPI_Comm_get_parent( &parentcomm );
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(hostName, &hostName_len);

        if (parentcomm == MPI_COMM_NULL) {
            /* No parent communicator: this is the original process. */
            if(argc < 2 ){
                printf("Processes number needed!");
                return 0;
            }
            processesToRun = atoi(argv[1]);

            /* Ask the runtime to place the children using ./hostfile, one per node. */
            MPI_Info_create( &info );
            MPI_Info_set( info, "hostfile", "./hostfile" );
            MPI_Info_set( info, "map_by", "node" );

            MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
                            MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);
            printf("I'm the parent.\n");
        } else {
            /* Spawned child: report host, rank and size of its own MPI_COMM_WORLD. */
            printf("I'm the spawned h: %s r/s: %i/%i.\n", hostName, rank, size );
        }

        fflush(stdout);
        MPI_Finalize();
        return 0;
    }

I came from OMPI 4.0.1. In that version it works... with some inconsistencies, I'm afraid.
That's why I decided to upgrade to OMPI 4.0.4. I tried several versions with no luck. Is there maybe an intrinsic problem with the OMPI dynamic allocation functionality?

Any help would be much appreciated.

Best regards.
Martín