Ralph and all,
The following trivial test hangs:
/* it hangs at least 99% of the time in my environment; in the remaining 1%
the race goes the other way and the program behaves as expected */
mpirun -np 1 --mca btl self /bin/false
The same behaviour happens with the following trivial MPI program:
#include <mpi.h>
int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    /* non-zero exit status, same as /bin/false */
    return 1;
}
The attached patch fixes the hang (i.e. the program aborts cleanly with
the correct error message).
I did not commit it since I am not at all confident in it.
Could you please review it?
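To make the intent of the patch easier to review, here is a stand-alone sketch of the condition it adds: a proc that was killed by command is marked as no longer alive only once both its waitpid has fired and its IOF is complete. The flag values, the proc_t type and the mark_dead_if_done() name are made up for illustration; they are not the real ORTE definitions, and this is only a model of the check, not the errmgr code itself.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the ORTE proc flags (NOT the real values). */
enum {
    FLAG_ALIVE        = 1u << 0,
    FLAG_WAITPID      = 1u << 1,
    FLAG_IOF_COMPLETE = 1u << 2
};

/* Hypothetical, minimal replacement for the ORTE proc object. */
typedef struct {
    unsigned flags;
} proc_t;

/*
 * Model of the patched check: if the proc was killed by command and both
 * the waitpid and IOF-complete flags are set, clear the alive flag.
 * Returns true when the alive flag was cleared by this call.
 */
static bool mark_dead_if_done(proc_t *p, bool killed_by_cmd)
{
    const unsigned done = FLAG_WAITPID | FLAG_IOF_COMPLETE;

    if (!killed_by_cmd) {
        return false;
    }
    if ((p->flags & done) != done) {
        /* one of the two events has not happened yet: leave it alive */
        return false;
    }
    p->flags &= ~(unsigned)FLAG_ALIVE;
    return true;
}
```

With only FLAG_WAITPID set the proc stays alive; once FLAG_IOF_COMPLETE is also set, the call clears FLAG_ALIVE. Whether this is the right place to clear the flag in the real state machine is exactly what I am unsure about.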
Cheers
Gilles
Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===================================================================
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c	(revision 32642)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c	(working copy)
@@ -10,6 +10,8 @@
* Copyright (c) 2011-2013 Los Alamos National Security, LLC.
* All rights reserved.
* Copyright (c) 2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014 Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
* $COPYRIGHT$
*
* Additional copyrights may follow
@@ -382,6 +384,14 @@
jdata->num_terminated++;
}
+/* FIXME ???
+ * mark the proc as no more alive if needed
+ */
+if (ORTE_PROC_STATE_KILLED_BY_CMD == state) {
+if (ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_WAITPID) && ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_IOF_COMPLETE)) {
+ORTE_FLAG_UNSET(pptr, ORTE_PROC_FLAG_ALIVE);
+}
+}
/* if we were ordered to terminate, mark this proc as dead and see if
* any of our routes or local children remain alive - if not, then
* terminate ourselves. */