My job successfully spawned a large number of subprocesses via MPI_Comm_spawn, 
filling up the available cores. When some of those subprocesses terminated, 
the job attempted to spawn more. It appears that these later calls to 
MPI_Comm_spawn caused this error:

[n022.cluster.com:08996] [[56319,0],0] grpcomm:direct:send_relay proc [[56319,0],1] not running - cannot relay: NOT ALIVE

An internal error has occurred in ORTE:

[[56319,0],0] FORCE-TERMINATE AT Unreachable:-12 - error grpcomm_direct.c(601)

This is something that should be reported to the developers.

I would attach the output produced by the mpiexec arguments "--mca 
ras_base_verbose 5 --display-devel-map --mca rmaps_base_verbose 5", but it is 
22 MB. Do you have a location where I can drop the file?

Thanks for any help.
Kurt
