Yeah, the behavior jittered a little across releases as we debated the "right" behavior.
Generally, when we see that happening it means an MCA param is required so users can pick
the behavior they want, but somehow we never got to that point.

See if https://github.com/open-mpi/ompi/pull/3704 helps - if so, I can schedule it for the
next 2.x release if the RMs agree to take it.

Ralph

> On Jun 15, 2017, at 12:20 PM, Ted Sussman <ted.suss...@adina.com> wrote:
> 
> Thank you for your comments.
> 
> Our application relies upon "dum.sh" to clean up after the process exits, whether the process
> exits normally or exits abnormally because of MPI_ABORT.  If the process group is killed by
> MPI_ABORT, this cleanup will not be performed.  If exec is used to launch the executable from
> dum.sh, then dum.sh is terminated by the exec, so dum.sh cannot perform any cleanup.
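>
> For illustration, a rough trap-based sketch of such a wrapper (the scratch file is a
> placeholder, not our real cleanup): a trap lets the wrapper still run its cleanup when the
> runtime delivers a catchable signal to the process group, though nothing helps against SIGKILL.
>
> #!/bin/sh
> # hypothetical wrapper in the spirit of dum.sh; the cleanup target is a placeholder
> cleanup() { rm -f /tmp/aborttest02.$OMPI_COMM_WORLD_RANK.scratch; }
> trap 'cleanup' EXIT           # runs on normal exit, or on the exit forced by the trap below
> trap 'exit 143' TERM INT      # a catchable kill from the runtime ends the script cleanly
> /home/buildadina/src/aborttest02/aborttest02.exe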
> 
> I suppose that other user applications might work similarly, so it would be good to have an
> MCA parameter to control the behavior of MPI_ABORT.
> 
> We could rewrite our shell script that invokes mpirun, so that the cleanup that is now done
> by dum.sh is done by the invoking shell script after mpirun exits.  Perhaps this technique is
> the preferred way to clean up after mpirun is invoked.
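>
> Something along these lines, for example (just a sketch; the rm is a stand-in for whatever
> cleanup dum.sh performs today):
>
> #!/bin/sh
> # invoking script: cleanup runs after mpirun returns, whether the job ended normally or
> # was torn down by MPI_ABORT
> mpirun -app addmpw2        # plus whatever other options run2.sh currently passes
> status=$?
> rm -f /tmp/aborttest02.*.scratch    # placeholder for the real cleanup
> exit $status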
> 
> By the way, I have also tested with Open MPI 1.10.7, and Open MPI 1.10.7 behaves differently
> from both Open MPI 1.4.3 and Open MPI 2.1.1.  In this explanation, it is important to know
> that the aborttest executable sleeps for 20 sec.
> 
> When running example 2:
> 
> 1.4.3: process 1 immediately aborts.
> 1.10.7: process 1 doesn't abort and never stops.
> 2.1.1: process 1 doesn't abort, but stops after it has finished sleeping.
> 
> Sincerely,
> 
> Ted Sussman
> 
> On 15 Jun 2017 at 9:18, r...@open-mpi.org wrote:
> 
>> Here is how the system is working:
>> 
>> Master: each process is put into its own process group upon launch. When we issue a "kill",
>> however, we only issue it to the individual process (instead of the process group that is
>> headed by that child process). This is probably a bug, as I don't believe that is what we
>> intended, but set that aside for now.
>>
>> 2.x: each process is put into its own process group upon launch. When we issue a "kill", we
>> issue it to the process group. Thus, every child proc of that child proc will receive it.
>> IIRC, this was the intended behavior.
>>
>> It is rather trivial to make the change (it only involves 3 lines of code), but I'm not sure
>> what our intended behavior is supposed to be. Once we clarify that, it is also trivial to add
>> another MCA param (you can never have too many!) to allow you to select the other behavior.
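>>
>> To see the difference outside of OMPI, here is a stand-alone sketch (nothing OMPI-specific
>> in it; run it as a script rather than from an interactive prompt, since job control would
>> change the process groups):
>>
>> #!/bin/sh
>> # a wrapper shell with one child, placed in its own process group (like dum.sh + aborttest)
>> setsid sh -c 'sleep 300 & wait' &
>> wrapper=$!
>> sleep 1                                  # give the wrapper a moment to start its child
>> pgid=$(ps -o pgid= -p "$wrapper" | tr -d ' ')
>> kill "$wrapper"                          # master-style: SIGTERM to the wrapper only
>> sleep 1
>> pgrep -l -g "$pgid"                      # the sleep child is still alive
>> kill -- "-$pgid"                         # 2.x-style: SIGTERM to the whole group, child included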
>> 
>> 
>>> On Jun 15, 2017, at 5:23 AM, Ted Sussman <ted.suss...@adina.com> wrote:
>>> 
>>> Hello Gilles,
>>> 
>>> Thank you for your quick answer.  I confirm that if exec is used, both processes
>>> immediately abort.
>>> 
>>> Now suppose that the line
>>> 
>>> echo "After aborttest: OMPI_COMM_WORLD_RANK="$OMPI_COMM_WORLD_RANK
>>> 
>>> is added to the end of dum.sh.
>>> 
>>> If Example 2 is run with Open MPI 1.4.3, the output is
>>> 
>>> After aborttest: OMPI_COMM_WORLD_RANK=0
>>> 
>>> which shows that the shell script for the process with rank 0 continues after the abort,
>>> but that the shell script for the process with rank 1 does not continue after the abort.
>>> 
>>> If Example 2 is run with Open MPI 2.1.1, with exec used to invoke aborttest02.exe, then
>>> there is no such output, which shows that neither shell script continues after the abort.
>>> 
>>> I prefer the Open MPI 1.4.3 behavior because our original application depends upon the
>>> Open MPI 1.4.3 behavior.  (Our original application will also work if both executables are
>>> aborted, and if both shell scripts continue after the abort.)
>>> 
>>> It might be too much to expect, but is there a way to recover the Open MPI 1.4.3 behavior
>>> using Open MPI 2.1.1?
>>> 
>>> Sincerely,
>>> 
>>> Ted Sussman
>>> 
>>> 
>>> On 15 Jun 2017 at 9:50, Gilles Gouaillardet wrote:
>>> 
>>>> Ted,
>>>> 
>>>> 
>>>> fwiw, the 'master' branch has the behavior you expect.
>>>> 
>>>> 
>>>> meanwhile, you can simply edit your 'dum.sh' script and replace
>>>> 
>>>> /home/buildadina/src/aborttest02/aborttest02.exe
>>>> 
>>>> with
>>>> 
>>>> exec /home/buildadina/src/aborttest02/aborttest02.exe
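>>>>
>>>> i.e. the whole script would then look something like this (just a sketch; nothing else
>>>> needs to change):
>>>>
>>>> #!/bin/sh
>>>> # exec replaces the shell with the executable, so there is no intermediate shell left and
>>>> # the runtime's kill reaches aborttest02.exe directly; nothing after this line would ever run
>>>> exec /home/buildadina/src/aborttest02/aborttest02.exe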
>>>> 
>>>> 
>>>> Cheers,
>>>> 
>>>> 
>>>> Gilles
>>>> 
>>>> 
>>>> On 6/15/2017 3:01 AM, Ted Sussman wrote:
>>>>> Hello,
>>>>> 
>>>>> My question concerns MPI_ABORT, indirect execution of executables by mpirun, and
>>>>> Open MPI 2.1.1.  When mpirun runs executables directly, MPI_ABORT works as expected,
>>>>> but when mpirun runs executables indirectly, MPI_ABORT does not work as expected.
>>>>>
>>>>> If Open MPI 1.4.3 is used instead of Open MPI 2.1.1, MPI_ABORT works as expected in
>>>>> all cases.
>>>>>
>>>>> The examples given below have been simplified as far as possible to show the issues.
>>>>> 
>>>>> ---
>>>>> 
>>>>> Example 1
>>>>> 
>>>>> Consider an MPI job run in the following way:
>>>>> 
>>>>> mpirun ... -app addmpw1
>>>>> 
>>>>> where the appfile addmpw1 lists two executables:
>>>>> 
>>>>> -n 1 -host gulftown ... aborttest02.exe
>>>>> -n 1 -host gulftown ... aborttest02.exe
>>>>> 
>>>>> The two executables are executed on the local node gulftown.  aborttest02 calls
>>>>> MPI_ABORT for rank 0, then sleeps.
>>>>>
>>>>> The above MPI job runs as expected.  Both processes immediately abort when rank 0
>>>>> calls MPI_ABORT.
>>>>> 
>>>>> ---
>>>>> 
>>>>> Example 2
>>>>> 
>>>>> Now change the above example as follows:
>>>>> 
>>>>> mpirun ... -app addmpw2
>>>>> 
>>>>> where the appfile addmpw2 lists shell scripts:
>>>>> 
>>>>> -n 1 -host gulftown ... dum.sh
>>>>> -n 1 -host gulftown ... dum.sh
>>>>> 
>>>>> dum.sh invokes aborttest02.exe.  So aborttest02.exe is executed indirectly by mpirun.
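>>>>>
>>>>> (For reference, dum.sh is essentially a thin wrapper around the executable, roughly:
>>>>>
>>>>> #!/bin/sh
>>>>> /home/buildadina/src/aborttest02/aborttest02.exe
>>>>>
>>>>> with any follow-up commands placed after that line.)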
>>>>> 
>>>>> In this case, the MPI job only aborts process 0 when rank 0 calls MPI_ABORT.  Process 1
>>>>> continues to run.  This behavior is unexpected.
>>>>> 
>>>>> ----
>>>>> 
>>>>> I have attached all files to this E-mail.  Since there are absolute pathnames in the
>>>>> files, to reproduce my findings, you will need to update the pathnames in the appfiles
>>>>> and shell scripts.  To run example 1,
>>>>> 
>>>>> sh run1.sh
>>>>> 
>>>>> and to run example 2,
>>>>> 
>>>>> sh run2.sh
>>>>> 
>>>>> ---
>>>>> 
>>>>> I have tested these examples with Open MPI 1.4.3 and 2.0.3.  In Open MPI 1.4.3, both
>>>>> examples work as expected.  Open MPI 2.0.3 has the same behavior as Open MPI 2.1.1.
>>>>> 
>>>>> ---
>>>>> 
>>>>> I would prefer that Open MPI 2.1.1 abort both processes, even when the executables are
>>>>> invoked indirectly by mpirun.  If there is an MCA setting that is needed to make
>>>>> Open MPI 2.1.1 abort both processes, please let me know.
>>>>> 
>>>>> 
>>>>> Sincerely,
>>>>> 
>>>>> Theodore Sussman
>>>>> 
>>>>> 
>>>>>   ---- File attachments -----------
>>>>>     config.log.bz2    (14 Jun 2017, 13:35, 146548 bytes)
>>>>>     ompi_info.bz2     (14 Jun 2017, 13:35, 24088 bytes)
>>>>>     aborttest02.tgz   (14 Jun 2017, 13:52, 4285 bytes)
>>>>> 

_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
