Re: [OMPI devel] mpirun hangs when a task exits with a non zero code

2014-08-29 Thread Ralph Castain
I dug into this a bit and think the patch wasn't quite complete, so I modified 
the approach to ensure this race condition gets resolved in every scenario. 
Hopefully, r32643 takes care of it for you.


On Aug 29, 2014, at 1:08 AM, Gilles Gouaillardet wrote:

> Ralph and all,
> 
> The following trivial test hangs
> /* it hangs at least 99% of the time in my environment, 1% is a race
> condition and the program behaves as expected */
> 
> mpirun -np 1 --mca btl self /bin/false
> 
> The same behaviour happens with the following trivial MPI program:
> 
> #include <mpi.h>
> 
> int main (int argc, char *argv[]) {
>     MPI_Init(&argc, &argv);
>     MPI_Finalize();
>     return 1;
> }
> 
> The attached patch fixes the hang (i.e. the program aborts cleanly with
> the correct error message).
> 
> I did not commit it since I am not at all confident in it.
> 
> Could you please review it?
> 
> Cheers
> 
> Gilles
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/08/15751.php



[OMPI devel] mpirun hangs when a task exits with a non zero code

2014-08-29 Thread Gilles Gouaillardet
Ralph and all,

The following trivial test hangs
/* it hangs at least 99% of the time in my environment, 1% is a race
condition and the program behaves as expected */

mpirun -np 1 --mca btl self /bin/false

The same behaviour happens with the following trivial MPI program:

#include <mpi.h>

int main (int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return 1;
}
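
As an illustration only (this is not Open MPI code), the sketch below shows how a launcher can observe a child's non-zero exit code with fork() and waitpid(), which is the event mpirun has to handle here. The helper name run_and_get_exit_code is hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] (looked up in PATH) as a child process
 * and return its exit code. Returns 127 if the exec fails, and -1 if
 * fork()/waitpid() fails or the child did not exit normally
 * (for example, it was killed by a signal). */
int run_and_get_exit_code(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    int status = 0;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {       /* retry only when interrupted */
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Called with an argument vector such as {"false", NULL}, the helper returns the non-zero code of the child; a launcher is expected to report that code and terminate rather than keep waiting.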

The attached patch fixes the hang (i.e. the program aborts cleanly with
the correct error message).

I did not commit it since I am not at all confident in it.

Could you please review it?

Cheers

Gilles
Index: orte/mca/errmgr/default_hnp/errmgr_default_hnp.c
===================================================================
--- orte/mca/errmgr/default_hnp/errmgr_default_hnp.c	(revision 32642)
+++ orte/mca/errmgr/default_hnp/errmgr_default_hnp.c	(working copy)
@@ -10,6 +10,8 @@
  * Copyright (c) 2011-2013 Los Alamos National Security, LLC.
  * All rights reserved.
  * Copyright (c) 2014  Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014  Research Organization for Information Science
+ * and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -382,6 +384,14 @@
 jdata->num_terminated++;
 }

+/* FIXME ???
+ * mark the proc as no more alive if needed
+ */
+if (ORTE_PROC_STATE_KILLED_BY_CMD == state) {
+    if (ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_WAITPID) && ORTE_FLAG_TEST(pptr, ORTE_PROC_FLAG_IOF_COMPLETE)) {
+        ORTE_FLAG_UNSET(pptr, ORTE_PROC_FLAG_ALIVE);
+    }
+}
 /* if we were ordered to terminate, mark this proc as dead and see if
  * any of our routes or local  children remain alive - if not, then
  * terminate ourselves. */
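
The condition added by the patch can be modelled outside of ORTE. The following is a minimal sketch, assuming a plain bit-flag field: the FLAG_* macros, proc_model_t and mark_dead_if_complete are hypothetical stand-ins for the ORTE flags and macros named in the diff, not the real definitions. It shows only the rule itself: the ALIVE flag is cleared once both the waitpid event and the I/O-forwarding completion have been recorded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the ORTE proc flags named in the patch. */
#define FLAG_ALIVE        ((uint32_t)1 << 0)
#define FLAG_WAITPID      ((uint32_t)1 << 1)
#define FLAG_IOF_COMPLETE ((uint32_t)1 << 2)

/* Hypothetical, reduced model of a proc object: just its flag word. */
typedef struct {
    uint32_t flags;
} proc_model_t;

/* Clear ALIVE only when both the waitpid event and IOF completion have
 * been recorded. Returns true if the proc is no longer marked alive. */
bool mark_dead_if_complete(proc_model_t *p)
{
    const uint32_t need = FLAG_WAITPID | FLAG_IOF_COMPLETE;

    if ((p->flags & need) == need) {
        p->flags &= ~FLAG_ALIVE;
    }
    return (p->flags & FLAG_ALIVE) == 0;
}
```

In this model a proc that has only seen the waitpid event stays alive, so whichever of the two events arrives last is the one that clears the flag.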