Hi Christof

Sorry if I missed this, but it sounds like you are saying that one of your 
procs abnormally terminates, and we are failing to kill the remaining job? Is 
that correct?

If so, I just did some work that might relate to that problem that is pending 
in PR #2528: https://github.com/open-mpi/ompi/pull/2528

Would you be able to try that?

Ralph

> On Dec 7, 2016, at 9:37 AM, Christof Koehler 
> <christof.koeh...@bccms.uni-bremen.de> wrote:
> 
> Hello,
> 
> On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
>>> On Dec 7, 2016, at 10:07 AM, Christof Koehler 
>>> <christof.koeh...@bccms.uni-bremen.de> wrote:
>>>> 
>>> I really think the hang is a consequence of
>>> unclean termination (in the sense that the non-root ranks are not
>>> terminated) and probably not the cause, in my interpretation of what I
>>> see. Would you have any suggestion to catch signals sent between orterun
>>> (mpirun) and the child tasks ?
>> 
>> Do you know where in the code the termination call is?  Is it actually 
>> calling mpi_abort(), or just doing something ugly like calling fortran 
>> “stop”?  If the latter, would that explain a possible hang?
> Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90
> input contains an error: a restart is requested, but the wannier90.chk file
> containing the restart information is missing.
> "
> Exiting.......
> Error: restart requested but wannier90.chk file not found
> "
> So it must terminate.
> 
> The termination happens in the libwannier.a, source file io.F90:
> 
> write(stdout,*)  'Exiting.......'
> write(stdout, '(1x,a)') trim(error_msg)
> close(stdout)
> stop "wannier90 error: examine the output/error file for details"
> 
> So it calls stop, as you assumed.
> 
>> Presumably someone here can comment on what the standard says about the 
>> validity of terminating without mpi_abort.
> 
> Well, probably stop is not a good way to terminate then.
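> 
> A sketch of what a cleaner termination might look like (hypothetical; the
> stdout unit and error message are taken from the io.F90 snippet above, and
> real code would also need to check whether MPI is initialized before
> aborting):
> 
>    write(stdout,*) 'Exiting.......'
>    write(stdout,'(1x,a)') trim(error_msg)
>    close(stdout)
>    call MPI_Abort(MPI_COMM_WORLD, 1, ierr)  ! ask mpirun to kill all ranks
> 
> With MPI_Abort, mpirun is told explicitly to tear down the remaining ranks,
> instead of one rank simply disappearing after a Fortran "stop" while the
> others wait in collectives.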
> 
> My main point was the change relative to 1.10 anyway :-) 
> 
> 
>> 
>> Actually, if you’re willing to share enough input files to reproduce, I 
>> could take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix a 
>> crash that was apparently addressed by some change in the memory allocator 
>> in a recent version of openmpi.  Just e-mail me if that’s the case.
> 
> I think that is no longer necessary? In principle it is no problem, but it
> occurs at the end of a (small) GW calculation, the Si tutorial example.
> So the mail would be a bit larger due to the WAVECAR.
> 
> 
>> 
>>                                                                      Noam
>> 
>> 
>> ____________
>> ||
>> |U.S. NAVAL|
>> |_RESEARCH_|
>> LABORATORY
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://www.nrl.navy.mil
> 
> -- 
> Dr. rer. nat. Christof Köhler       email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS          phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12       fax: +49-(0)421-218-62770
> 28359 Bremen  
> 
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
