Just an update: I eliminated the error below by telling MPI_Comm_spawn to 
create non-MPI processes, via the info key:

MPI_Info_set(info, "ompi_non_mpi", "true");

If you still want to pursue this matter, let me know.

Kurt

From: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Sent: Thursday, March 17, 2022 5:58 PM
To: Open MPI Users <users@lists.open-mpi.org>
Cc: Mccall, Kurt E. (MSFC-EV41) <kurt.e.mcc...@nasa.gov>
Subject: OpenMpi crash in MPI_Comm_spawn / developer message

My job successfully spawned a large number of subprocesses via MPI_Comm_spawn, 
filling up the available cores. When some of those subprocesses terminated, 
the job attempted to spawn more. It appears that these later calls to 
MPI_Comm_spawn caused this error:

[n022.cluster.com:08996] [[56319,0],0] grpcomm:direct:send_relay proc [[56319,0],1] not running - cannot relay: NOT ALIVE

An internal error has occurred in ORTE:

[[56319,0],0] FORCE-TERMINATE AT Unreachable:-12 - error grpcomm_direct.c(601)

This is something that should be reported to the developers.

I would attach the output created by the mpiexec arguments "--mca 
ras_base_verbose 5 --display-devel-map --mca rmaps_base_verbose 5", but it is 
22 MB. Do you have a location where I can drop the file?

Thanks for any help.
Kurt
