My job successfully spawned a large number of subprocesses via MPI_Comm_spawn, filling up the available cores. When some of those subprocesses terminated, it attempted to spawn more. It appears that these later calls to MPI_Comm_spawn caused this error:

[n022.cluster.com:08996] [[56319,0],0] grpcomm:direct:send_relay proc [[56319,0],1] not running - cannot relay: NOT ALIVE
An internal error has occurred in ORTE:
[[56319,0],0] FORCE-TERMINATE AT Unreachable:-12 - error grpcomm_direct.c(601)
This is something that should be reported to the developers.

I would attach the output created by the mpiexec arguments "--mca ras_base_verbose 5 --display-devel-map --mca rmaps_base_verbose 5", but it is 22 MB. Do you have a location where I can drop the file?

Thanks for any help.

Kurt