Hi there,

I have an issue with OpenMPI 4.0.2 and 4.1.1: MPI_COMM_SPAWN() cannot spawn processes across nodes, whereas the same call works in OpenMPI 2.1.1. I am testing on a cluster with CentOS 7.9, the LSF batch system, and GCC 6.3.0.

I used this code for testing (saved as "spawn_example.c"):

   #include "mpi.h"
   #include <stdio.h>
   #include <stdlib.h>

   #define NUM_SPAWNS 3

   int main( int argc, char *argv[] )
   {
       int np = NUM_SPAWNS;
       int errcodes[NUM_SPAWNS];
       MPI_Comm parentcomm, intercomm;

       MPI_Init( &argc, &argv );
       MPI_Comm_get_parent( &parentcomm );
       if (parentcomm == MPI_COMM_NULL)
       {
           /* Create 3 more processes - this example must be called
              spawn_example.exe for this to work. */
           MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, np, MPI_INFO_NULL, 0,
                           MPI_COMM_WORLD, &intercomm, errcodes );
           printf("I'm the parent.\n");
       }
       else
       {
           printf("I'm the spawned.\n");
       }
       fflush(stdout);
       MPI_Finalize();
       return 0;
   }

Running on one node, it looked fine:

   $ bsub -n 6 -I "mpirun -n 1 spawn_example"
   MPI job.
   Job <195486300> is submitted to queue <normal.4h>.
   <<Waiting for dispatch ...>>
   <<Starting on eu-a2p-154>>
   I'm the spawned.
   I'm the spawned.
   I'm the spawned.
   I'm the parent.

But on 2 nodes, an error occurred:

   $ bsub -n 6 -R "span[ptile=3]" -I "mpirun -n 1 spawn_example"
   MPI job.
   Job <195486678> is submitted to queue <normal.4h>.
   <<Waiting for dispatch ...>>
   <<Starting on eu-a2p-274>>
   [eu-a2p-217:30058] pml_ucx.c:175 Error: Failed to receive UCX worker address: Not found (-13)
   [eu-a2p-217:30058] [[18089,2],2] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
   --------------------------------------------------------------------------
   It looks like MPI_INIT failed for some reason; your parallel process is
   likely to abort. There are many reasons that a parallel process can
   fail during MPI_INIT; some of which are due to configuration or
   environment problems. This failure appears to be an internal failure;
   here's some additional information (which may only be relevant to an
   Open MPI developer):

     ompi_dpm_dyn_init() failed
     --> Returned "Error" (-1) instead of "Success" (0)
   --------------------------------------------------------------------------
   [eu-a2p-217:30058] *** An error occurred in MPI_Init
   [eu-a2p-217:30058] *** reported by process [1185480706,2]
   [eu-a2p-217:30058] *** on a NULL communicator
   [eu-a2p-217:30058] *** Unknown error
   [eu-a2p-217:30058] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
   [eu-a2p-217:30058] ***    and potentially your MPI job)
   [eu-a2p-274:107025] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2147

I would greatly appreciate your advice. I have read threads with similar questions, but I did not find a solution there.

Best regards,
Jarunan
