Thank you for your comments.

Our application relies upon "dum.sh" to clean up after the process exits, whether the process exits normally or abnormally because of MPI_ABORT.  If the whole process group is killed by MPI_ABORT, this cleanup is not performed.  And if exec is used to launch the executable from dum.sh, then the exec replaces dum.sh, so dum.sh cannot perform any cleanup.
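
For reference, here is a minimal sketch of the pattern dum.sh follows; the cleanup command is just a placeholder for illustration, not our real script:

    #!/bin/sh
    # Run the executable as a child of this shell.  Because the shell
    # stays alive, it can clean up after the child exits, whether the
    # exit is normal or caused by MPI_ABORT.
    /home/buildadina/src/aborttest02/aborttest02.exe
    rm -f /tmp/aborttest.scratch.$$    # placeholder cleanup

    # If the executable were launched with
    #     exec /home/buildadina/src/aborttest02/aborttest02.exe
    # the shell would be replaced by the executable and the cleanup
    # line would never run.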

I suppose that other user applications might work similarly, so it would be good to have an MCA parameter to control the behavior of MPI_ABORT.
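
(Such a parameter would presumably be set on the mpirun command line like any other MCA parameter; the parameter name below is purely hypothetical, since no such parameter exists yet:

    # "odls_signal_process_group" is a hypothetical parameter name
    mpirun --mca odls_signal_process_group 0 ... -app addmpw2

)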

We could rewrite our shell script that invokes mpirun, so that the cleanup now done by dum.sh is done by the invoking shell script after mpirun exits.  Perhaps this technique is the preferred way to clean up after mpirun is invoked.
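
If we went that route, the invoking script would look something like this sketch (again, the cleanup command is a placeholder):

    #!/bin/sh
    mpirun ... -app addmpw2
    # mpirun has now exited, whether the job ended normally or via
    # MPI_ABORT, so cleanup here runs in every case.
    rm -f /tmp/aborttest.scratch.*     # placeholder cleanup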

By the way, I have also tested with Open MPI 1.10.7, which behaves differently from both Open MPI 1.4.3 and Open MPI 2.1.1.  For this explanation, it is important to know that the aborttest executable sleeps for 20 seconds.

When running example 2:

1.4.3:  process 1 immediately aborts.
1.10.7: process 1 does not abort and never stops.
2.1.1:  process 1 does not abort, but stops after it has finished sleeping.

Sincerely,

Ted Sussman

On 15 Jun 2017 at 9:18, r...@open-mpi.org wrote:

> Here is how the system is working:
> 
> Master: each process is put into its own process group upon launch. When we issue a "kill", however, we only issue it to the individual process (instead of the process group that is headed by that child process). This is probably a bug, as I don't believe that is what we intended, but set that aside for now.
> 
> 2.x: each process is put into its own process group upon launch. When we issue a "kill", we issue it to the process group. Thus, every child proc of that child proc will receive it. IIRC, this was the intended behavior.
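
[To illustrate the distinction described above: from a shell, a signal can be delivered either to a single process or to the whole process group it heads, using a negative PID for the group.  The PID below is hypothetical.

    # signal only the one process (what master currently does):
    kill -s TERM 1234

    # signal the entire process group headed by PID 1234 (what 2.x does):
    kill -s TERM -- -1234

]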
> 
> It is rather trivial to make the change (it only involves 3 lines of code), but I'm not sure what our intended behavior is supposed to be. Once we clarify that, it is also trivial to add another MCA param (you can never have too many!) to allow you to select the other behavior.
> 
> 
> > On Jun 15, 2017, at 5:23 AM, Ted Sussman <ted.suss...@adina.com> wrote:
> > 
> > Hello Gilles,
> > 
> > Thank you for your quick answer.  I confirm that if exec is used, both processes immediately abort.
> > 
> > Now suppose that the line
> > 
> > echo "After aborttest: OMPI_COMM_WORLD_RANK="$OMPI_COMM_WORLD_RANK
> > 
> > is added to the end of dum.sh.
> > 
> > If Example 2 is run with Open MPI 1.4.3, the output is
> > 
> > After aborttest: OMPI_COMM_WORLD_RANK=0
> > 
> > which shows that the shell script for the process with rank 0 continues after the abort, but that the shell script for the process with rank 1 does not continue after the abort.
> > 
> > If Example 2 is run with Open MPI 2.1.1, with exec used to invoke aborttest02.exe, then there is no such output, which shows that neither shell script continues after the abort.
> > 
> > I prefer the Open MPI 1.4.3 behavior because our original application depends upon it.  (Our original application will also work if both executables are aborted and both shell scripts continue after the abort.)
> > 
> > It might be too much to expect, but is there a way to recover the Open MPI 1.4.3 behavior using Open MPI 2.1.1?
> > 
> > Sincerely,
> > 
> > Ted Sussman
> > 
> > 
> > On 15 Jun 2017 at 9:50, Gilles Gouaillardet wrote:
> > 
> >> Ted,
> >> 
> >> 
> >> fwiw, the 'master' branch has the behavior you expect.
> >> 
> >> 
> >> meanwhile, you can simply edit your 'dum.sh' script and replace
> >> 
> >> /home/buildadina/src/aborttest02/aborttest02.exe
> >> 
> >> with
> >> 
> >> exec /home/buildadina/src/aborttest02/aborttest02.exe
> >> 
> >> 
> >> Cheers,
> >> 
> >> 
> >> Gilles
> >> 
> >> 
> >> On 6/15/2017 3:01 AM, Ted Sussman wrote:
> >>> Hello,
> >>> 
> >>> My question concerns MPI_ABORT, indirect execution of executables by mpirun, and Open MPI 2.1.1.  When mpirun runs executables directly, MPI_ABORT works as expected, but when mpirun runs executables indirectly, MPI_ABORT does not work as expected.
> >>> 
> >>> If Open MPI 1.4.3 is used instead of Open MPI 2.1.1, MPI_ABORT works as expected in all cases.
> >>> 
> >>> The examples given below have been simplified as far as possible to show the issues.
> >>> 
> >>> ---
> >>> 
> >>> Example 1
> >>> 
> >>> Consider an MPI job run in the following way:
> >>> 
> >>> mpirun ... -app addmpw1
> >>> 
> >>> where the appfile addmpw1 lists two executables:
> >>> 
> >>> -n 1 -host gulftown ... aborttest02.exe
> >>> -n 1 -host gulftown ... aborttest02.exe
> >>> 
> >>> The two executables are executed on the local node gulftown.  aborttest02 calls MPI_ABORT for rank 0, then sleeps.
> >>> 
> >>> The above MPI job runs as expected.  Both processes immediately abort when rank 0 calls MPI_ABORT.
> >>> 
> >>> ---
> >>> 
> >>> Example 2
> >>> 
> >>> Now change the above example as follows:
> >>> 
> >>> mpirun ... -app addmpw2
> >>> 
> >>> where the appfile addmpw2 lists shell scripts:
> >>> 
> >>> -n 1 -host gulftown ... dum.sh
> >>> -n 1 -host gulftown ... dum.sh
> >>> 
> >>> dum.sh invokes aborttest02.exe, so aborttest02.exe is executed indirectly by mpirun.
> >>> 
> >>> In this case, the MPI job only aborts process 0 when rank 0 calls MPI_ABORT.  Process 1 continues to run.  This behavior is unexpected.
> >>> 
> >>> ----
> >>> 
> >>> I have attached all files to this E-mail.  Since there are absolute pathnames in the files, to reproduce my findings you will need to update the pathnames in the appfiles and shell scripts.  To run example 1,
> >>> 
> >>> sh run1.sh
> >>> 
> >>> and to run example 2,
> >>> 
> >>> sh run2.sh
> >>> 
> >>> ---
> >>> 
> >>> I have tested these examples with Open MPI 1.4.3 and 2.0.3.  In Open MPI 1.4.3, both examples work as expected.  Open MPI 2.0.3 has the same behavior as Open MPI 2.1.1.
> >>> 
> >>> ---
> >>> 
> >>> I would prefer that Open MPI 2.1.1 abort both processes, even when the executables are invoked indirectly by mpirun.  If there is an MCA setting that is needed to make Open MPI 2.1.1 abort both processes, please let me know.
> >>> 
> >>> 
> >>> Sincerely,
> >>> 
> >>> Theodore Sussman
> >>> 
> >>> 
> >>>    ---- Attachments -----------
> >>>      config.log.bz2     14 Jun 2017, 13:35   146548 bytes
> >>>      ompi_info.bz2      14 Jun 2017, 13:35   24088 bytes
> >>>      aborttest02.tgz    14 Jun 2017, 13:52   4285 bytes
> >>> 