Re: [OMPI users] Abort/ Deadlock issue in allreduce (Gilles Gouaillardet)

2016-12-12 Thread Christof Koehler
Hello,

yes, I already tried the 2.0.x git branch with the original problem. It
now dies quite noisily

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
vasp-mpi-sca   040DD64D  Unknown   Unknown Unknown
...
...
...
mpirun has exited due to process rank 0 with PID 0 on
node node109 exiting improperly. There are three reasons this could
occur:
...
...

but apparently does not hang any more.

Thanks to everyone involved for fixing this !

Best Regards

Christof




On Mon, Dec 12, 2016 at 12:00:01PM -0700, users-requ...@lists.open-mpi.org 
wrote:
> Send users mailing list submissions to
>   users@lists.open-mpi.org
> 
> To subscribe or unsubscribe via the World Wide Web, visit
>   https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> or, via email, send a message with subject or body 'help' to
>   users-requ...@lists.open-mpi.org
> 
> You can reach the person managing the list at
>   users-ow...@lists.open-mpi.org
> 
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of users digest..."
> 
> 
> Today's Topics:
> 
>1. Re: Abort/ Deadlock issue in allreduce (Gilles Gouaillardet)
>2. Re: How to yield CPU more when not computing (was curious
>   behavior during wait for broadcast: 100% cpu) (Dave Love)
> 
> 
> --
> 
> Message: 1
> Date: Mon, 12 Dec 2016 09:32:25 +0900
> From: Gilles Gouaillardet <gil...@rist.or.jp>
> To: users@lists.open-mpi.org
> Subject: Re: [OMPI users] Abort/ Deadlock issue in allreduce
> Message-ID: <8316882f-01a6-8886-5308-bfed9af2a...@rist.or.jp>
> Content-Type: text/plain; charset="windows-1252"; Format="flowed"
> 
> Christof,
> 
> 
> Ralph fixed the issue,
> 
> meanwhile, the patch can be manually downloaded at 
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch
> 
> 
> Cheers,
> 
> 
> Gilles
> 
> 
> 
> On 12/9/2016 5:39 PM, Christof Koehler wrote:
> > Hello,
> >
> > our case is different. The libwannier.a is a "third party"
> > library which is built separately and then just linked in. So the vasp
> > preprocessor never touches it. As far as I can see, no preprocessing of
> > the f90 source is involved in the libwannier build process.
> >
> > I finally managed to set a breakpoint at the program exit of the root
> > rank:
> >
> > (gdb) bt
> > #0  0x2b7ccd2e4220 in _exit () from /lib64/libc.so.6
> > #1  0x2b7ccd25ee2b in __run_exit_handlers () from /lib64/libc.so.6
> > #2  0x2b7ccd25eeb5 in exit () from /lib64/libc.so.6
> > #3  0x0407298d in for_stop_core ()
> > #4  0x012fad41 in w90_io_mp_io_error_ ()
> > #5  0x01302147 in w90_parameters_mp_param_read_ ()
> > #6  0x012f49c6 in wannier_setup_ ()
> > #7  0x00e166a8 in mlwf_mp_mlwf_wannier90_ ()
> > #8  0x004319ff in vamp () at main.F:2640
> > #9  0x0040d21e in main ()
> > #10 0x2b7ccd247b15 in __libc_start_main () from /lib64/libc.so.6
> > #11 0x0040d129 in _start ()
> >
> > So for_stop_core is called apparently ? Of course it is below the main()
> > process of vasp, so additional things might happen which are not
> > visible. Is SIGCHLD (as observed when catching signals in mpirun) the
> > signal expected after a for_stop_core ?
> >
> > Thank you very much for investigating this !
> >
> > Cheers
> >
> > Christof
> >
> > On Thu, Dec 08, 2016 at 03:15:47PM -0500, Noam Bernstein wrote:
> >>> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet 
> >>> <gilles.gouaillar...@gmail.com> wrote:
> >>>
> >>> Christof,
> >>>
> >>>
> >>> There is something really odd with this stack trace.
> >>> count is zero, and some pointers do not point to valid addresses (!)
> >>>
> >>> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests 
> >>> that
> >>> the stack has been corrupted inside MPI_Allreduce(), or that you are not 
> >>> using the library you think you use
> >>> pmap  will show you which lib is used
> >>>
> >>> btw, this was not started with
> >>> mpirun --mca coll ^tuned ...
> >>> right ?
> >>>
> >>> just to make it clear ...
> >>> a task from your program bluntly issues a fortran STOP, and this is kind 
> >>> of a feature.

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-11 Thread Gilles Gouaillardet

Christof,


Ralph fixed the issue,

meanwhile, the patch can be manually downloaded at 
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/2552.patch



Cheers,


Gilles



On 12/9/2016 5:39 PM, Christof Koehler wrote:

Hello,

our case is different. The libwannier.a is a "third party"
library which is built separately and then just linked in. So the vasp
preprocessor never touches it. As far as I can see, no preprocessing of
the f90 source is involved in the libwannier build process.

I finally managed to set a breakpoint at the program exit of the root
rank:

(gdb) bt
#0  0x2b7ccd2e4220 in _exit () from /lib64/libc.so.6
#1  0x2b7ccd25ee2b in __run_exit_handlers () from /lib64/libc.so.6
#2  0x2b7ccd25eeb5 in exit () from /lib64/libc.so.6
#3  0x0407298d in for_stop_core ()
#4  0x012fad41 in w90_io_mp_io_error_ ()
#5  0x01302147 in w90_parameters_mp_param_read_ ()
#6  0x012f49c6 in wannier_setup_ ()
#7  0x00e166a8 in mlwf_mp_mlwf_wannier90_ ()
#8  0x004319ff in vamp () at main.F:2640
#9  0x0040d21e in main ()
#10 0x2b7ccd247b15 in __libc_start_main () from /lib64/libc.so.6
#11 0x0040d129 in _start ()

So for_stop_core is called apparently ? Of course it is below the main()
process of vasp, so additional things might happen which are not
visible. Is SIGCHLD (as observed when catching signals in mpirun) the
signal expected after a for_stop_core ?

Thank you very much for investigating this !

Cheers

Christof

On Thu, Dec 08, 2016 at 03:15:47PM -0500, Noam Bernstein wrote:

On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet  
wrote:

Christof,


There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)

in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not using 
the library you think you use
pmap  will show you which lib is used

btw, this was not started with
mpirun --mca coll ^tuned ...
right ?

just to make it clear ...
a task from your program bluntly issues a fortran STOP, and this is kind of a 
feature.
the *only* issue is mpirun does not kill the other MPI tasks and mpirun never 
completes.
did i get it right ?

I just ran across very similar behavior in VASP (which we just switched over to 
openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call one, 
others call the other), and I discovered several interesting things.

The most important is that when MPI is active, the preprocessor converts (via a 
#define in symbol.inc) fortran STOP into calls to m_exit() (defined in mpi.F), 
which is a wrapper around mpi_finalize.  So in my case some processes in the 
communicator call mpi_finalize, others call mpi_allreduce.  I’m not really 
surprised this hangs, because I think the correct thing to replace STOP with is 
mpi_abort, not mpi_finalize.  If you know where the STOP is called, you can 
check the preprocessed equivalent file (.f90 instead of .F), and see if it’s 
actually been replaced with a call to m_exit.  I’m planning to test whether 
replacing m_exit with m_stop in symbol.inc gives more sensible behavior, i.e. 
program termination when the original source file executes a STOP.

I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected to 
hang, but just in case that’s surprising, here are my stack traces:


hung in collective:

(gdb) where
#0  0x2b8d5a095ec6 in opal_progress () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
#1  0x2b8d59b3a36d in ompi_request_default_wait_all () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#2  0x2b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling () 
from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3  0x2b8d59b495ac in PMPI_Allreduce () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#4  0x2b8d598e4027 in pmpi_allreduce__ () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#5  0x00414077 in m_sum_i (comm=..., ivec=warning: Range for type 
(null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
..., n=2) at mpi.F:989
#6  0x00daac54 in full_kpoints::set_indpw_full (grid=..., wdes=..., 
kpoints_f=...) at mkpoints_full.F:1099
#7  0x01441654 in set_indpw_fock (t_info=..., p=warning: Range for type 
(null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Noam Bernstein
> On Dec 9, 2016, at 3:39 AM, Christof Koehler 
>  wrote:
> 
> Hello,
> 
> our case is different. The libwannier.a is a "third party"
> library which is built separately and then just linked in. So the vasp
> preprocessor never touches it. As far as I can see, no preprocessing of
> the f90 source is involved in the libwannier build process. 
> 
> I finally managed to set a breakpoint at the program exit of the root
> rank:

Looks like my case really was just VASP’s fault, and really I’d call it a VASP 
bug (you shouldn't call mpi_finalize from a subset of the tasks).  Yours is 
similar, but not actually the same, since it’s actually trying to stop the 
task, and one would at least hope that OpenMPI could detect it and exit.


Noam



||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Christof Koehler
Hello,

our case is different. The libwannier.a is a "third party"
library which is built separately and then just linked in. So the vasp
preprocessor never touches it. As far as I can see, no preprocessing of
the f90 source is involved in the libwannier build process. 

I finally managed to set a breakpoint at the program exit of the root
rank:

(gdb) bt
#0  0x2b7ccd2e4220 in _exit () from /lib64/libc.so.6
#1  0x2b7ccd25ee2b in __run_exit_handlers () from /lib64/libc.so.6
#2  0x2b7ccd25eeb5 in exit () from /lib64/libc.so.6
#3  0x0407298d in for_stop_core ()
#4  0x012fad41 in w90_io_mp_io_error_ ()
#5  0x01302147 in w90_parameters_mp_param_read_ ()
#6  0x012f49c6 in wannier_setup_ ()
#7  0x00e166a8 in mlwf_mp_mlwf_wannier90_ ()
#8  0x004319ff in vamp () at main.F:2640
#9  0x0040d21e in main ()
#10 0x2b7ccd247b15 in __libc_start_main () from /lib64/libc.so.6
#11 0x0040d129 in _start ()

So for_stop_core is called apparently ? Of course it is below the main()
process of vasp, so additional things might happen which are not
visible. Is SIGCHLD (as observed when catching signals in mpirun) the
signal expected after a for_stop_core ?

Thank you very much for investigating this !

Cheers 

Christof

On Thu, Dec 08, 2016 at 03:15:47PM -0500, Noam Bernstein wrote:
> > On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet 
> >  wrote:
> > 
> > Christof,
> > 
> > 
> > There is something really odd with this stack trace.
> > count is zero, and some pointers do not point to valid addresses (!)
> > 
> > in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> > the stack has been corrupted inside MPI_Allreduce(), or that you are not 
> > using the library you think you use
> > pmap  will show you which lib is used
> > 
> > btw, this was not started with
> > mpirun --mca coll ^tuned ...
> > right ?
> > 
> > just to make it clear ...
> > a task from your program bluntly issues a fortran STOP, and this is kind of 
> > a feature.
> > the *only* issue is mpirun does not kill the other MPI tasks and mpirun 
> > never completes.
> > did i get it right ?
> 
> I just ran across very similar behavior in VASP (which we just switched over 
> to openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call 
> one, others call the other), and I discovered several interesting things.
> 
> The most important is that when MPI is active, the preprocessor converts (via 
> a #define in symbol.inc) fortran STOP into calls to m_exit() (defined in 
> mpi.F), which is a wrapper around mpi_finalize.  So in my case some processes 
> in the communicator call mpi_finalize, others call mpi_allreduce.  I’m not 
> really surprised this hangs, because I think the correct thing to replace 
> STOP with is mpi_abort, not mpi_finalize.  If you know where the STOP is 
> called, you can check the preprocessed equivalent file (.f90 instead of .F), 
> and see if it’s actually been replaced with a call to m_exit.  I’m planning 
> to test whether replacing m_exit with m_stop in symbol.inc gives more 
> sensible behavior, i.e. program termination when the original source file 
> executes a STOP.
> 
> I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected 
> to hang, but just in case that’s surprising, here are my stack traces:
> 
> 
> hung in collective:
> 
> (gdb) where
> #0  0x2b8d5a095ec6 in opal_progress () from 
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
> #1  0x2b8d59b3a36d in ompi_request_default_wait_all () from 
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #2  0x2b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling () 
> from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x2b8d59b495ac in PMPI_Allreduce () from 
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #4  0x2b8d598e4027 in pmpi_allreduce__ () from 
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #5  0x00414077 in m_sum_i (comm=..., ivec=warning: Range for type 
> (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> ..., n=2) at mpi.F:989
> #6  0x00daac54 in full_kpoints::set_indpw_full (grid=..., wdes=..., 
> kpoints_f=...) at mkpoints_full.F:1099
> #7  0x01441654 in set_indpw_fock (t_info=..., p=warning: Range for 
> type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-09 Thread Gilles Gouaillardet
Folks,

the problem is indeed pretty trivial to reproduce

i opened https://github.com/open-mpi/ompi/issues/2550 (and included a
reproducer)


Cheers,

Gilles

On Fri, Dec 9, 2016 at 5:15 AM, Noam Bernstein
 wrote:
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet
>  wrote:
>
> Christof,
>
>
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
>
> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> the stack has been corrupted inside MPI_Allreduce(), or that you are not
> using the library you think you use
> pmap  will show you which lib is used
>
> btw, this was not started with
> mpirun --mca coll ^tuned ...
> right ?
>
> just to make it clear ...
> a task from your program bluntly issues a fortran STOP, and this is kind of
> a feature.
> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> never completes.
> did i get it right ?
>
>
> I just ran across very similar behavior in VASP (which we just switched over
> to openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call
> one, others call the other), and I discovered several interesting things.
>
> The most important is that when MPI is active, the preprocessor converts
> (via a #define in symbol.inc) fortran STOP into calls to m_exit() (defined
> in mpi.F), which is a wrapper around mpi_finalize.  So in my case some
> processes in the communicator call mpi_finalize, others call mpi_allreduce.
> I’m not really surprised this hangs, because I think the correct thing to
> replace STOP with is mpi_abort, not mpi_finalize.  If you know where the
> STOP is called, you can check the preprocessed equivalent file (.f90 instead
> of .F), and see if it’s actually been replaced with a call to m_exit.  I’m
> planning to test whether replacing m_exit with m_stop in symbol.inc gives
> more sensible behavior, i.e. program termination when the original source
> file executes a STOP.
>
> I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected
> to hang, but just in case that’s surprising, here are my stack traces:
>
>
> hung in collective:
>
> (gdb) where
>
> #0  0x2b8d5a095ec6 in opal_progress () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
> #1  0x2b8d59b3a36d in ompi_request_default_wait_all () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #2  0x2b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling
> () from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #3  0x2b8d59b495ac in PMPI_Allreduce () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
> #4  0x2b8d598e4027 in pmpi_allreduce__ () from
> /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
> #5  0x00414077 in m_sum_i (comm=..., ivec=warning: Range for type
> (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> warning: Range for type (null) has invalid bounds 1..-12884901892
> ..., n=2) at mpi.F:989
> #6  0x00daac54 in full_kpoints::set_indpw_full (grid=..., wdes=...,
> kpoints_f=...) at mkpoints_full.F:1099
> #7  0x01441654 in set_indpw_fock (t_info=..., p=warning: Range for
> type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> ..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address
> 0x1
> ) at fock.F:1669
> #8  fock::setup_fock (t_info=..., p=warning: Range for type (null) has
> invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> warning: Range for type (null) has invalid bounds 1..-1
> ..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address
> 0x1
> ) at fock.F:1413
> #9  0x02976478 in vamp () at main.F:2093
> #10 0x00412f9e in main ()
> #11 0x00383a41ed1d in __libc_start_main () from /lib64/libc.so.6
> #12 0x00412ea9 in _start ()
>
>
> hung in mpi_finalize:
>
> #0  0x00383a4acbdd in nanosleep () from /lib64/libc.so.6
> #1  

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Noam Bernstein
> On Dec 8, 2016, at 6:05 AM, Gilles Gouaillardet 
>  wrote:
> 
> Christof,
> 
> 
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
> 
> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> the stack has been corrupted inside MPI_Allreduce(), or that you are not 
> using the library you think you use
> pmap  will show you which lib is used
> 
> btw, this was not started with
> mpirun --mca coll ^tuned ...
> right ?
> 
> just to make it clear ...
> a task from your program bluntly issues a fortran STOP, and this is kind of a 
> feature.
> the *only* issue is mpirun does not kill the other MPI tasks and mpirun never 
> completes.
> did i get it right ?

I just ran across very similar behavior in VASP (which we just switched over to 
openmpi 2.0.1), also in an allreduce + STOP combination (some nodes call one, 
others call the other), and I discovered several interesting things.

The most important is that when MPI is active, the preprocessor converts (via a 
#define in symbol.inc) fortran STOP into calls to m_exit() (defined in mpi.F), 
which is a wrapper around mpi_finalize.  So in my case some processes in the 
communicator call mpi_finalize, others call mpi_allreduce.  I’m not really 
surprised this hangs, because I think the correct thing to replace STOP with is 
mpi_abort, not mpi_finalize.  If you know where the STOP is called, you can 
check the preprocessed equivalent file (.f90 instead of .F), and see if it’s 
actually been replaced with a call to m_exit.  I’m planning to test whether 
replacing m_exit with m_stop in symbol.inc gives more sensible behavior, i.e. 
program termination when the original source file executes a STOP.
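
A minimal sketch of the pattern described above (a hypothetical stand-alone
reproducer, not the VASP code): rank 0 leaves via MPI_Finalize, the way a
preprocessed STOP turned into m_exit() would, while the other ranks sit in
MPI_Allreduce and never return.

program finalize_vs_allreduce
  ! Hypothetical reproducer: mix MPI_Finalize on rank 0 with
  ! MPI_Allreduce on the other ranks.
  use mpi
  implicit none
  integer :: ierr, rank, sendval, recvval

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) then
     ! roughly what a preprocessed STOP -> m_exit() amounts to
     call MPI_Finalize(ierr)
  else
     sendval = rank
     ! the surviving ranks wait here forever for rank 0's contribution
     call MPI_Allreduce(sendval, recvval, 1, MPI_INTEGER, MPI_SUM, &
                        MPI_COMM_WORLD, ierr)
     call MPI_Finalize(ierr)
  end if
end program finalize_vs_allreduce

Run with two or more ranks; the non-zero ranks spin in the collective, which
matches the stack traces below.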

I’m assuming that a mix of mpi_allreduce and mpi_finalize is really expected to 
hang, but just in case that’s surprising, here are my stack traces:


hung in collective:

(gdb) where
#0  0x2b8d5a095ec6 in opal_progress () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.20
#1  0x2b8d59b3a36d in ompi_request_default_wait_all () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#2  0x2b8d59b8107c in ompi_coll_base_allreduce_intra_recursivedoubling () 
from /usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3  0x2b8d59b495ac in PMPI_Allreduce () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#4  0x2b8d598e4027 in pmpi_allreduce__ () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#5  0x00414077 in m_sum_i (comm=..., ivec=warning: Range for type 
(null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
warning: Range for type (null) has invalid bounds 1..-12884901892
..., n=2) at mpi.F:989
#6  0x00daac54 in full_kpoints::set_indpw_full (grid=..., wdes=..., 
kpoints_f=...) at mkpoints_full.F:1099
#7  0x01441654 in set_indpw_fock (t_info=..., p=warning: Range for type 
(null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1669
#8  fock::setup_fock (t_info=..., p=warning: Range for type (null) has invalid 
bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
warning: Range for type (null) has invalid bounds 1..-1
..., wdes=..., grid=..., latt_cur=..., lmdim=Cannot access memory at address 0x1
) at fock.F:1413
#9  0x02976478 in vamp () at main.F:2093
#10 0x00412f9e in main ()
#11 0x00383a41ed1d in __libc_start_main () from /lib64/libc.so.6
#12 0x00412ea9 in _start ()

hung in mpi_finalize:

#0  0x00383a4acbdd in nanosleep () from /lib64/libc.so.6
#1  0x00383a4e1d94 in usleep () from /lib64/libc.so.6
#2  0x2b11db1e0ae7 in ompi_mpi_finalize () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi.so.20
#3  0x2b11daf8b399 in pmpi_finalize__ () from 
/usr/local/openmpi/2.0.1/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.20
#4  0x004199c5 in m_exit () at mpi.F:375
#5  0x00dab17f in 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread r...@open-mpi.org
To the best I can determine, mpirun catches SIGTERM just fine and will hit the 
procs with SIGCONT, followed by SIGTERM and then SIGKILL. It will then wait to 
see the remote daemons complete after they hit their procs with the same 
sequence.


> On Dec 8, 2016, at 5:18 AM, Christof Koehler 
>  wrote:
> 
> Hello  again,
> 
> I am still not sure about breakpoints. But I did a "catch signal" in
> gdb; gdb was attached to the two vasp processes and to mpirun.
> 
> When the root rank exits I see in the gdb attached to it
> [Thread 0x2b2787df8700 (LWP 2457) exited]
> [Thread 0x2b277f483180 (LWP 2455) exited]
> [Inferior 1 (process 2455) exited normally]
> 
> In the gdb attached to the mpirun
> Catchpoint 1 (signal SIGCHLD), 0x2b16560f769d in poll () from
> /lib64/libc.so.6
> 
> In the gdb attached to the second rank I see no output.
> 
> Issuing "continue" in the gdb session attached to mpi run does not lead
> to anything new as far as I can tell.
> 
> The stack trace of the mpirun after that (Ctrl-C'ed to stop it again) is
> #0  0x2b16560f769d in poll () from /lib64/libc.so.6
> #1  0x2b1654b3a496 in poll_dispatch () from
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #2  0x2b1654b32fa5 in opal_libevent2022_event_base_loop () from
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3  0x00406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at
> orterun.c:1071
> #4  0x004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at
> main.c:13
> 
> So there is a signal and mpirun does nothing with it ?
> 
> Cheers
> 
> Christof
> 
> 
> On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
>> Hello,
>> 
>> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
>>> Christof,
>>> 
>>> 
>>> There is something really odd with this stack trace.
>>> count is zero, and some pointers do not point to valid addresses (!)
>> Yes, I assumed it was interesting :-) Note that the program is compiled
>> with   -O2 -fp-model source, so optimization is on. I can try with -O0
>> or with gcc/gfortran (will take a moment) to make sure it is not a
>> problem caused by that.
>> 
>>> 
>>> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
>>> the stack has been corrupted inside MPI_Allreduce(), or that you are not
>>> using the library you think you use
>>> pmap  will show you which lib is used
>> The pmap of the survivor is at the very end of this mail.
>> 
>>> 
>>> btw, this was not started with
>>> mpirun --mca coll ^tuned ...
>>> right ?
>> This is correct, not started with "mpirun --mca coll ^tuned". Using it
>> does not change anything.
>> 
>>> 
>>> just to make it clear ...
>>> a task from your program bluntly issues a fortran STOP, and this is kind of
>>> a feature.
>> Yes. The library where the STOP occurs is/was written for serial use as
>> far as I can tell. As I mentioned, it is not our code but this one
>> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which 
>> should
>> be a working combination.
>> 
>>> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
>>> never completes.
>>> did i get it right ?
>> Yes ! So it is not a really big problem IMO. Just a bit nasty if this
>> would happen with a job in the queueing system.
>> 
>> Best Regards
>> 
>> Christof
>> 
>> Note: git branch 2.0.2 of openmpi was configured and installed (make
>> install) with
>> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
>> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
>> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
>> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
>> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
>> 
>> The OS is Centos 7, relatively current :-) with current Omni-Path driver
>> package from Intel (10.2).
>> 
>> vasp is linked against Intel MKL Lapack/Blas, self-compiled scalapack
>> (trunk 206) and FFTW 3.3.5. FFTW and scalapack statically linked. And of
>> course the libwannier.a version 1.2 statically linked.
>> 
>> pmap -p of the survivor
>> 
>> 32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 0040  65200K r-x-- 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 045ab000100K r 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 045c4000   2244K rw--- 
>> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
>> 047f5000 100900K rw---   [ anon ]
>> 0bfaa000684K rw---   [ anon ]
>> 0c055000 20K rw---   [ anon ]
>> 0c05a000424K rw---   [ anon ]
>> 0c0c4000 68K rw---   [ anon ]
>> 0c0d5000  25384K rw---   [ anon ]
>> 2b17e34f6000132K r-x-- /usr/lib64/ld-2.17.so
>> 2b17e3517000  4K rw---   [ anon ]
>> 2b17e3518000 28K rw-s- /dev/infiniband/uverbs0
>> 2b17e3523000 88K rw---   [ anon ]
>> 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello  again,

I am still not sure about breakpoints. But I did a "catch signal" in
gdb; gdb was attached to the two vasp processes and to mpirun.

When the root rank exits I see in the gdb attached to it
[Thread 0x2b2787df8700 (LWP 2457) exited]
[Thread 0x2b277f483180 (LWP 2455) exited]
[Inferior 1 (process 2455) exited normally]

In the gdb attached to the mpirun
Catchpoint 1 (signal SIGCHLD), 0x2b16560f769d in poll () from
/lib64/libc.so.6

In the gdb attached to the second rank I see no output.

Issuing "continue" in the gdb session attached to mpi run does not lead
to anything new as far as I can tell.

The stack trace of the mpirun after that (Ctrl-C'ed to stop it again) is
#0  0x2b16560f769d in poll () from /lib64/libc.so.6
#1  0x2b1654b3a496 in poll_dispatch () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x2b1654b32fa5 in opal_libevent2022_event_base_loop () from
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x00406311 in orterun (argc=7, argv=0x7ffdabfbebc8) at
orterun.c:1071
#4  0x004037e0 in main (argc=7, argv=0x7ffdabfbebc8) at
main.c:13

So there is a signal and mpirun does nothing with it ?

Cheers

Christof


On Thu, Dec 08, 2016 at 12:39:06PM +0100, Christof Koehler wrote:
> Hello,
> 
> On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> > Christof,
> > 
> > 
> > There is something really odd with this stack trace.
> > count is zero, and some pointers do not point to valid addresses (!)
> Yes, I assumed it was interesting :-) Note that the program is compiled
> with   -O2 -fp-model source, so optimization is on. I can try with -O0
> or with gcc/gfortran (will take a moment) to make sure it is not a
> problem caused by that.
> 
> > 
> > in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> > the stack has been corrupted inside MPI_Allreduce(), or that you are not
> > using the library you think you use
> > pmap  will show you which lib is used
> The pmap of the survivor is at the very end of this mail.
> 
> > 
> > btw, this was not started with
> > mpirun --mca coll ^tuned ...
> > right ?
> This is correct, not started with "mpirun --mca coll ^tuned". Using it
> does not change anything.
> 
> > 
> > just to make it clear ...
> > a task from your program bluntly issues a fortran STOP, and this is kind of
> > a feature.
> Yes. The library where the STOP occurs is/was written for serial use as
> far as I can tell. As I mentioned, it is not our code but this one
> http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which 
> should
> be a working combination.
> 
> > the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> > never completes.
> > did i get it right ?
> Yes ! So it is not a really big problem IMO. Just a bit nasty if this
> would happen with a job in the queueing system.
> 
> Best Regards
> 
> Christof
> 
> Note: git branch 2.0.2 of openmpi was configured and installed (make
> install) with
> ./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
> precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
> FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
> --with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
> --prefix=/cluster/mpi/openmpi/2.0.2/intel2016
> 
> The OS is Centos 7, relatively current :-) with current Omni-Path driver
> package from Intel (10.2).
> 
> vasp is linked against Intel MKL Lapack/Blas, self-compiled scalapack
> (trunk 206) and FFTW 3.3.5. FFTW and scalapack statically linked. And of
> course the libwannier.a version 1.2 statically linked.
> 
> pmap -p of the survivor
> 
> 32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 0040  65200K r-x-- 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 045ab000100K r 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 045c4000   2244K rw--- 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
> 047f5000 100900K rw---   [ anon ]
> 0bfaa000684K rw---   [ anon ]
> 0c055000 20K rw---   [ anon ]
> 0c05a000424K rw---   [ anon ]
> 0c0c4000 68K rw---   [ anon ]
> 0c0d5000  25384K rw---   [ anon ]
> 2b17e34f6000132K r-x-- /usr/lib64/ld-2.17.so
> 2b17e3517000  4K rw---   [ anon ]
> 2b17e3518000 28K rw-s- /dev/infiniband/uverbs0
> 2b17e3523000 88K rw---   [ anon ]
> 2b17e3539000772K rw-s- /dev/infiniband/uverbs0
> 2b17e35fa000772K rw-s- /dev/infiniband/uverbs0
> 2b17e36bb000196K rw-s- /dev/infiniband/uverbs0
> 2b17e36ec000 28K rw-s- /dev/infiniband/uverbs0
> 2b17e36f3000 20K rw-s- /dev/infiniband/uverbs0
> 2b17e3717000  4K r /usr/lib64/ld-2.17.so
> 2b17e3718000  4K rw--- /usr/lib64/ld-2.17.so
> 2b17e3719000  4K rw---   [ anon ]
> 2b17e371a000 88K r-x-- 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello,

On Thu, Dec 08, 2016 at 08:05:44PM +0900, Gilles Gouaillardet wrote:
> Christof,
> 
> 
> There is something really odd with this stack trace.
> count is zero, and some pointers do not point to valid addresses (!)
Yes, I assumed it was interesting :-) Note that the program is compiled
with   -O2 -fp-model source, so optimization is on. I can try with -O0
or with gcc/gfortran (will take a moment) to make sure it is not a
problem caused by that.

> 
> in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
> the stack has been corrupted inside MPI_Allreduce(), or that you are not
> using the library you think you use
> pmap  will show you which lib is used
The pmap of the survivor is at the very end of this mail.

> 
> btw, this was not started with
> mpirun --mca coll ^tuned ...
> right ?
This is correct, not started with "mpirun --mca coll ^tuned". Using it
does not change anything.

> 
> just to make it clear ...
> a task from your program bluntly issues a fortran STOP, and this is kind of
> a feature.
Yes. The library where the STOP occurs is/was written for serial use as
far as I can tell. As I mentioned, it is not our code but this one
http://www.wannier.org/ (Version 1.2) linked into https://www.vasp.at/ which 
should
be a working combination.

> the *only* issue is mpirun does not kill the other MPI tasks and mpirun
> never completes.
> did i get it right ?
Yes ! So it is not a really big problem IMO. Just a bit nasty if this
would happen with a job in the queueing system.

Best Regards

Christof

Note: git branch 2.0.2 of openmpi was configured and installed (make
install) with
./configure CC=icc CXX=icpc FC=ifort F77=ifort FFLAGS="-O1 -fp-model
precise" CFLAGS="-O1 -fp-model precise" CXXFLAGS="-O1 -fp-model precise"
FCFLAGS="-O1 -fp-model precise" --with-psm2 --with-tm
--with-hwloc=internal --enable-static --enable-orterun-prefix-by-default
--prefix=/cluster/mpi/openmpi/2.0.2/intel2016

The OS is Centos 7, relatively current :-) with current Omni-Path driver
package from Intel (10.2).

vasp is linked against Intel MKL Lapack/Blas, self-compiled scalapack
(trunk 206) and FFTW 3.3.5. FFTW and scalapack statically linked. And of
course the libwannier.a version 1.2 statically linked.

pmap -p of the survivor

32282:   /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
0040  65200K r-x-- 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
045ab000100K r 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
045c4000   2244K rw--- 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi-sca
047f5000 100900K rw---   [ anon ]
0bfaa000684K rw---   [ anon ]
0c055000 20K rw---   [ anon ]
0c05a000424K rw---   [ anon ]
0c0c4000 68K rw---   [ anon ]
0c0d5000  25384K rw---   [ anon ]
2b17e34f6000132K r-x-- /usr/lib64/ld-2.17.so
2b17e3517000  4K rw---   [ anon ]
2b17e3518000 28K rw-s- /dev/infiniband/uverbs0
2b17e3523000 88K rw---   [ anon ]
2b17e3539000772K rw-s- /dev/infiniband/uverbs0
2b17e35fa000772K rw-s- /dev/infiniband/uverbs0
2b17e36bb000196K rw-s- /dev/infiniband/uverbs0
2b17e36ec000 28K rw-s- /dev/infiniband/uverbs0
2b17e36f3000 20K rw-s- /dev/infiniband/uverbs0
2b17e3717000  4K r /usr/lib64/ld-2.17.so
2b17e3718000  4K rw--- /usr/lib64/ld-2.17.so
2b17e3719000  4K rw---   [ anon ]
2b17e371a000 88K r-x-- /usr/lib64/libpthread-2.17.so
2b17e373   2048K - /usr/lib64/libpthread-2.17.so
2b17e393  4K r /usr/lib64/libpthread-2.17.so
2b17e3931000  4K rw--- /usr/lib64/libpthread-2.17.so
2b17e3932000 16K rw---   [ anon ]
2b17e3936000   1028K r-x-- /usr/lib64/libm-2.17.so
2b17e3a37000   2044K - /usr/lib64/libm-2.17.so
2b17e3c36000  4K r /usr/lib64/libm-2.17.so
2b17e3c37000  4K rw--- /usr/lib64/libm-2.17.so
2b17e3c38000 12K r-x-- /usr/lib64/libdl-2.17.so
2b17e3c3b000   2044K - /usr/lib64/libdl-2.17.so
2b17e3e3a000  4K r /usr/lib64/libdl-2.17.so
2b17e3e3b000  4K rw--- /usr/lib64/libdl-2.17.so
2b17e3e3c000184K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e3e6a000   2044K - 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e4069000  4K r 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e406a000  4K rw--- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempif08.so.20.0.0
2b17e406b000 36K r-x-- 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
2b17e4074000   2044K - 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
2b17e4273000  4K r 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libmpi_usempi_ignore_tkr.so.20.0.0
2b17e4274000  4K rw--- 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Gilles Gouaillardet
Christof,


There is something really odd with this stack trace.
count is zero, and some pointers do not point to valid addresses (!)

in OpenMPI, MPI_Allreduce(...,count=0,...) is a no-op, so that suggests that
the stack has been corrupted inside MPI_Allreduce(), or that you are not
using the library you think you use
pmap  will show you which lib is used
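
If in doubt, a quick sanity check (a sketch, not from the thread, assuming the
Fortran mpi module bindings) is to call a zero-count allreduce on its own; it
should return immediately, so a hang with count=0 points at a corrupted stack
or a mismatched library rather than at the collective itself:

program zero_count_check
  ! Sketch: MPI_Allreduce with count=0 is valid and reads no data,
  ! so it is expected to complete at once.
  use mpi
  implicit none
  integer :: ierr, sendbuf(1), recvbuf(1)

  call MPI_Init(ierr)
  call MPI_Allreduce(sendbuf, recvbuf, 0, MPI_INTEGER, MPI_SUM, &
                     MPI_COMM_WORLD, ierr)
  print *, 'zero-count allreduce returned, ierr =', ierr
  call MPI_Finalize(ierr)
end program zero_count_check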

btw, this was not started with
mpirun --mca coll ^tuned ...
right ?

just to make it clear ...
a task from your program bluntly issues a fortran STOP, and this is kind of
a feature.
the *only* issue is mpirun does not kill the other MPI tasks and mpirun
never completes.
did i get it right ?

Cheers,

Gilles

On Thursday, December 8, 2016, Christof Koehler <
christof.koeh...@bccms.uni-bremen.de> wrote:

> Hello everybody,
>
> I tried it with the nightly and the direct 2.0.2 branch from git which
> according to the log should contain that patch
>
> commit d0b97d7a408b87425ca53523de369da405358ba2
> Merge: ac8c019 b9420bb
> Author: Jeff Squyres >
> Date:   Wed Dec 7 18:24:46 2016 -0500
> Merge pull request #2528 from rhc54/cmr20x/signals
>
> Unfortunately it changes nothing. The root rank stops and all other
> ranks (and mpirun) just stay, the remaining ranks waiting at 100 % CPU,
> apparently in that allreduce. The stack trace looks a bit more
> interesting (is a git build always a debug build ?), so I include it at the very
> bottom just in case.
>
> Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
> __exit etc. to try to catch signals. Would that be useful ? I need a
> moment to figure out how to do this, but I can definitely try.
>
> Some remark: During "make install" from the git repo I see a
>
> WARNING!  Common symbols found:
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2complex
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2double_complex
>   mpi-f08-types.o: 0004 C
> ompi_f08_mpi_2double_precision
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2integer
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_2real
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_aint
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_band
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_bor
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_bxor
>   mpi-f08-types.o: 0004 C ompi_f08_mpi_byte
>
> I have never noticed this before.
>
>
> Best Regards
>
> Christof
>
> Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
> #0  0x2af84e4c669d in poll () from /lib64/libc.so.6
> #1  0x2af850517496 in poll_dispatch () from /cluster/mpi/openmpi/2.0.2/
> intel2016/lib/libopen-pal.so.20
> #2  0x2af85050ffa5 in opal_libevent2022_event_base_loop () from
> /cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
> #3  0x2af85049fa1f in opal_progress () at runtime/opal_progress.c:207
> #4  0x2af84e02f7f7 in ompi_request_default_wait_all (count=233618144,
> requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
> #5  0x2af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling
> (sbuf=0xdecbae0,
> rbuf=0x2, count=0, dtype=0x, op=0x0, comm=0x1,
> module=0xdee69e0) at base/coll_base_allreduce.c:225
> #6  0x2af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed
> (sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0x, op=0x0,
> comm=0x1, module=0x1) at coll_tuned_decision_fixed.c:66
> #7  0x2af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2,
> count=0, datatype=0x, op=0x0, comm=0x1) at pallreduce.c:107
> #8  0x2af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005",
> recvbuf=0x2 , count=0x0,
> datatype=0x, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at
> pallreduce_f.c:87
> #9  0x0045ecc6 in m_sum_i_ ()
> #10 0x00e172c9 in mlwf_mp_mlwf_wannier90_ ()
> #11 0x004325ff in vamp () at main.F:2640
> #12 0x0040de1e in main ()
> #13 0x2af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
> #14 0x0040dd29 in _start ()
>
> On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org 
> wrote:
> > Hi Christof
> >
> > Sorry if I missed this, but it sounds like you are saying that one of
> your procs abnormally terminates, and we are failing to kill the remaining
> job? Is that correct?
> >
> > If so, I just did some work that might relate to that problem that is
> pending in PR #2528: https://github.com/open-mpi/ompi/pull/2528 <
> https://github.com/open-mpi/ompi/pull/2528>
> >
> > Would you be able to try that?
> >
> > Ralph
> >
> > > On Dec 7, 2016, at 9:37 AM, Christof Koehler <
> christof.koeh...@bccms.uni-bremen.de > wrote:
> > >
> > > Hello,
> > >
> > > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler <
> christof.koeh...@bccms.uni-bremen.de 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-08 Thread Christof Koehler
Hello everybody,

I tried it with the nightly and the direct 2.0.2 branch from git which
according to the log should contain that patch

commit d0b97d7a408b87425ca53523de369da405358ba2
Merge: ac8c019 b9420bb
Author: Jeff Squyres 
Date:   Wed Dec 7 18:24:46 2016 -0500
Merge pull request #2528 from rhc54/cmr20x/signals

Unfortunately it changes nothing. The root rank stops and all other
ranks (and mpirun) just stay, the remaining ranks waiting at 100 % CPU,
apparently in that allreduce. The stack trace looks a bit more
interesting (is a git build always a debug build ?), so I include it at the very 
bottom just in case.

Off-list Gilles Gouaillardet suggested to set breakpoints at exit,
__exit etc. to try to catch signals. Would that be useful ? I need a
moment to figure out how to do this, but I can definitely try.

Some remark: During "make install" from the git repo I see a 

WARNING!  Common symbols found:
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2complex
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2double_complex
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2double_precision
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2integer
  mpi-f08-types.o: 0004 C ompi_f08_mpi_2real
  mpi-f08-types.o: 0004 C ompi_f08_mpi_aint
  mpi-f08-types.o: 0004 C ompi_f08_mpi_band
  mpi-f08-types.o: 0004 C ompi_f08_mpi_bor
  mpi-f08-types.o: 0004 C ompi_f08_mpi_bxor
  mpi-f08-types.o: 0004 C ompi_f08_mpi_byte

I have never noticed this before.


Best Regards

Christof

Thread 1 (Thread 0x2af84cde4840 (LWP 11219)):
#0  0x2af84e4c669d in poll () from /lib64/libc.so.6
#1  0x2af850517496 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#2  0x2af85050ffa5 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.2/intel2016/lib/libopen-pal.so.20
#3  0x2af85049fa1f in opal_progress () at runtime/opal_progress.c:207
#4  0x2af84e02f7f7 in ompi_request_default_wait_all (count=233618144, 
requests=0x2, statuses=0x0) at ../opal/threads/wait_sync.h:80
#5  0x2af84e0758a7 in ompi_coll_base_allreduce_intra_recursivedoubling 
(sbuf=0xdecbae0,
rbuf=0x2, count=0, dtype=0x, op=0x0, comm=0x1, 
module=0xdee69e0) at base/coll_base_allreduce.c:225
#6  0x2af84e07b747 in ompi_coll_tuned_allreduce_intra_dec_fixed 
(sbuf=0xdecbae0, rbuf=0x2, count=0, dtype=0x, op=0x0, comm=0x1, 
module=0x1) at coll_tuned_decision_fixed.c:66
#7  0x2af84e03e832 in PMPI_Allreduce (sendbuf=0xdecbae0, recvbuf=0x2, 
count=0, datatype=0x, op=0x0, comm=0x1) at pallreduce.c:107
#8  0x2af84ddaac90 in ompi_allreduce_f (sendbuf=0xdecbae0 "\005", 
recvbuf=0x2 , count=0x0, 
datatype=0x, op=0x0, comm=0x1, ierr=0x7ffdf3cffe9c) at 
pallreduce_f.c:87
#9  0x0045ecc6 in m_sum_i_ ()
#10 0x00e172c9 in mlwf_mp_mlwf_wannier90_ ()
#11 0x004325ff in vamp () at main.F:2640
#12 0x0040de1e in main ()
#13 0x2af84e3fbb15 in __libc_start_main () from /lib64/libc.so.6
#14 0x0040dd29 in _start ()

On Wed, Dec 07, 2016 at 09:47:48AM -0800, r...@open-mpi.org wrote:
> Hi Christof
> 
> Sorry if I missed this, but it sounds like you are saying that one of your 
> procs abnormally terminates, and we are failing to kill the remaining job? Is 
> that correct?
> 
> If so, I just did some work that might relate to that problem that is pending 
> in PR #2528: https://github.com/open-mpi/ompi/pull/2528 
> 
> 
> Would you be able to try that?
> 
> Ralph
> 
> > On Dec 7, 2016, at 9:37 AM, Christof Koehler 
> >  wrote:
> > 
> > Hello,
> > 
> > On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> >>> On Dec 7, 2016, at 10:07 AM, Christof Koehler 
> >>>  wrote:
>  
> >>> I really think the hang is a consequence of
> >>> unclean termination (in the sense that the non-root ranks are not
> >>> terminated) and probably not the cause, in my interpretation of what I
> >>> see. Would you have any suggestion to catch signals sent between orterun
> >>> (mpirun) and the child tasks ?
> >> 
> >> Do you know where in the code the termination call is?  Is it actually 
> >> calling mpi_abort(), or just doing something ugly like calling fortran 
> >> “stop”?  If the latter, would that explain a possible hang?
> > Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90 
> > input contains
> > an error, a restart is requested, and the wannier90.chk file with the restart
> > information is missing.
> > "
> > Exiting...
> > Error: restart requested but wannier90.chk file not found
> > "
> > So it must terminate.
> > 
> > The termination happens in the libwannier.a, source file io.F90:

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
> On Dec 7, 2016, at 12:37 PM, Christof Koehler 
>  wrote:
> 
> 
>> Presumably someone here can comment on what the standard says about the 
>> validity of terminating without mpi_abort.
> 
> Well, probably stop is not a good way to terminate then.
> 
> My main point was the change relative to 1.10 anyway :-) 

It’s definitely not the clean way to terminate, but I think everyone agrees 
that it shouldn’t hang if it can be avoided.

> 
> 
>> 
>> Actually, if you’re willing to share enough input files to reproduce, I 
>> could take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix a 
>> crash that was apparently addressed by some change in the memory allocator 
>> in a recent version of openmpi.  Just e-mail me if that’s the case.
> 
> I think that is no longer necessary ? In principle it is no problem but
> it is at the end of a (small) GW calculation, the Si tutorial example. 
> So the mail would be a bit larger due to the WAVECAR.

I agree.  It sounds like it’s clearly a failure to exit from a collective 
communication when a process dies (from the point of view of mpi, since 
mpi_abort is not being called, it’s just a process dying).  Maybe the patch in 
Ralph’s e-mail fixes it.

Noam



||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread r...@open-mpi.org
Hi Christof

Sorry if I missed this, but it sounds like you are saying that one of your 
procs abnormally terminates, and we are failing to kill the remaining job? Is 
that correct?

If so, I just did some work that might relate to that problem that is pending 
in PR #2528: https://github.com/open-mpi/ompi/pull/2528 


Would you be able to try that?

Ralph

> On Dec 7, 2016, at 9:37 AM, Christof Koehler 
>  wrote:
> 
> Hello,
> 
> On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
>>> On Dec 7, 2016, at 10:07 AM, Christof Koehler 
>>>  wrote:
 
>>> I really think the hang is a consequence of
>>> unclean termination (in the sense that the non-root ranks are not
>>> terminated) and probably not the cause, in my interpretation of what I
>>> see. Would you have any suggestion to catch signals sent between orterun
>>> (mpirun) and the child tasks ?
>> 
>> Do you know where in the code the termination call is?  Is it actually 
>> calling mpi_abort(), or just doing something ugly like calling fortran 
>> “stop”?  If the latter, would that explain a possible hang?
> Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90 
> input contains
> an error, a restart is requested, and the wannier90.chk file with the restart
> information is missing.
> "
> Exiting...
> Error: restart requested but wannier90.chk file not found
> "
> So it must terminate.
> 
> The termination happens in the libwannier.a, source file io.F90:
> 
> write(stdout,*)  'Exiting...'
> write(stdout, '(1x,a)') trim(error_msg)
> close(stdout)
> stop "wannier90 error: examine the output/error file for details"
> 
> So it calls stop  as you assumed.
> 
>> Presumably someone here can comment on what the standard says about the 
>> validity of terminating without mpi_abort.
> 
> Well, probably stop is not a good way to terminate then.
> 
> My main point was the change relative to 1.10 anyway :-) 
> 
> 
>> 
>> Actually, if you’re willing to share enough input files to reproduce, I 
>> could take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix a 
>> crash that was apparently addressed by some change in the memory allocator 
>> in a recent version of openmpi.  Just e-mail me if that’s the case.
> 
> I think that is no longer necessary ? In principle it is no problem but
> it is at the end of a (small) GW calculation, the Si tutorial example. 
> So the mail would be a bit larger due to the WAVECAR.
> 
> 
>> 
>>  Noam
>> 
>> 
>> 
>> ||
>> |U.S. NAVAL|
>> |_RESEARCH_|
>> LABORATORY
>> Noam Bernstein, Ph.D.
>> Center for Materials Physics and Technology
>> U.S. Naval Research Laboratory
>> T +1 202 404 8628  F +1 202 404 7546
>> https://www.nrl.navy.mil 
> 
> -- 
> Dr. rer. nat. Christof Köhler   email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS  phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12   fax: +49-(0)421-218-62770
> 28359 Bremen  
> 
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello,

On Wed, Dec 07, 2016 at 10:19:10AM -0500, Noam Bernstein wrote:
> > On Dec 7, 2016, at 10:07 AM, Christof Koehler 
> >  wrote:
> >> 
> > I really think the hang is a consequence of
> > unclean termination (in the sense that the non-root ranks are not
> > terminated) and probably not the cause, in my interpretation of what I
> > see. Would you have any suggestion to catch signals sent between orterun
> > (mpirun) and the child tasks ?
> 
> Do you know where in the code the termination call is?  Is it actually 
> calling mpi_abort(), or just doing something ugly like calling fortran 
> “stop”?  If the latter, would that explain a possible hang?
Well, basically it tries to use wannier90 (LWANNIER=.TRUE.). The wannier90 
input contains
an error, a restart is requested, and the wannier90.chk file with the restart
information is missing.
"
Exiting...
 Error: restart requested but wannier90.chk file not found
"
So it must terminate.

The termination happens in the libwannier.a, source file io.F90:

write(stdout,*)  'Exiting...'
write(stdout, '(1x,a)') trim(error_msg)
close(stdout)
stop "wannier90 error: examine the output/error file for details"

So it calls stop  as you assumed.
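
For illustration only, a hedged sketch (not the actual wannier90 or VASP code)
of what terminating through MPI_Abort instead of a bare STOP could look like;
since libwannier itself is a serial library, a more realistic place for such a
call would be the MPI-aware caller:

subroutine io_error_abort(error_msg, stdout)
  ! Hypothetical variant of the io_error code quoted above: ask the MPI
  ! runtime to tear down every rank instead of stopping only this one.
  use mpi
  implicit none
  character(len=*), intent(in) :: error_msg
  integer, intent(in) :: stdout
  integer :: ierr

  write (stdout, *) 'Exiting...'
  write (stdout, '(1x,a)') trim(error_msg)
  close (stdout)
  ! Unlike STOP, MPI_Abort terminates the whole job, so mpirun does not
  ! leave the other ranks waiting in a collective.
  call MPI_Abort(MPI_COMM_WORLD, 1, ierr)
end subroutine io_error_abort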

> Presumably someone here can comment on what the standard says about the 
> validity of terminating without mpi_abort.

Well, probably stop is not a good way to terminate then.

My main point was the change relative to 1.10 anyway :-) 


> 
> Actually, if you’re willing to share enough input files to reproduce, I could 
> take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix a crash 
> that was apparently addressed by some change in the memory allocator in a 
> recent version of openmpi.  Just e-mail me if that’s the case.

I think that is no longer necessary ? In principle it is no problem but
it is at the end of a (small) GW calculation, the Si tutorial example. 
So the mail would be a bit larger due to the WAVECAR.


> 
>   Noam
> 
> 
> 
> ||
> |U.S. NAVAL|
> |_RESEARCH_|
> LABORATORY
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil 

-- 
Dr. rer. nat. Christof Köhler   email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS  phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12   fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/


signature.asc
Description: Digital signature
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Noam Bernstein
> On Dec 7, 2016, at 10:07 AM, Christof Koehler 
>  wrote:
>> 
> I really think the hang is a consequence of
> unclean termination (in the sense that the non-root ranks are not
> terminated) and probably not the cause, in my interpretation of what I
> see. Would you have any suggestion to catch signals sent between orterun
> (mpirun) and the child tasks ?

Do you know where in the code the termination call is?  Is it actually calling 
mpi_abort(), or just doing something ugly like calling fortran “stop”?  If the 
latter, would that explain a possible hang?

Presumably someone here can comment on what the standard says about the 
validity of terminating without mpi_abort.

Actually, if you’re willing to share enough input files to reproduce, I could 
take a look.  I just recompiled our VASP with openmpi 2.0.1 to fix a crash that 
was apparently addressed by some change in the memory allocator in a recent 
version of openmpi.  Just e-mail me if that’s the case.

Noam



||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY
Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello,

On Wed, Dec 07, 2016 at 11:07:49PM +0900, Gilles Gouaillardet wrote:
> Christof,
> 
> out of curiosity, can you run
> dmesg
> and see if you find some tasks killed by the oom-killer ?
Definitely not the oom-killer. It is a really tiny example. I checked
the machine's logfile and dmesg.

> 
> the error message you see is a consequence of a task that unexpectedly died,
> and there is no evidence the task crashed or was killed.
Yes, confusing isn't it ? 

> 
> when you observe a hang with two tasks, you can
> - retrieve the pids with ps
> - run 'pstack <pid>' on both pids in order to collect the stacktrace.
When it hangs, one is already gone ! The pstack traces I sent are from the
survivor(s). It is not terminating completely as it should.

> 
> assuming they both hang in MPI_Allreduce(), the relevant part to us is
> - datatype (MPI_INT)
> - count (n)
> - communicator (COMM%MPI_COMM) (size, check this is the same communicator
> used by all tasks)
> - is all the buffer accessible (ivec(1:n))

As I said, the root rank terminates (normally, according to gdb). The other
remains and hangs in allreduce, possibly because its partner (the
root rank) is gone without saying goodbye properly.

This is not a real hang IMO, but a failure to terminate all ranks cleanly.

I really think the hang is a consequence of
unclean termination (in the sense that the non-root ranks are not
terminated) and probably not the cause, in my interpretation of what I
see. Would you have any suggestion to catch signals sent between orterun
(mpirun) and the child tasks ?

I will try to get the information you want, but I will have to figure
out how to do that first. 

Cheers

Christof

> 
> Cheers,
> 
> Gilles
> 
> On Wednesday, December 7, 2016, Christof Koehler <
> christof.koeh...@bccms.uni-bremen.de> wrote:
> 
> > Hello,
> >
> > thank you for the fast answer.
> >
> > On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> > > Christoph,
> > >
> > > can you please try again with
> > >
> > > mpirun --mca btl tcp,self --mca pml ob1 ...
> >
> > mpirun -n 20 --mca btl tcp,self --mca pml ob1
> > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> >
> > Deadlocks/ hangs, has no effect.
> >
> > > mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
> > mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned
> > /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> >
> > Deadlocks/ hangs, has no effect. There is additional output.
> >
> > wannier90 error: examine the output/error file for details
> > [node109][[55572,1],16][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
> > (104)[node109][[55572,1],8][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [node109][[55572,1],4][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [node109][[55572,1],1][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> > [node109][[55572,1],2][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv]
> > mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> >
> > Please note: The "wannier90 error: examine the output/error file for
> > details" is expected, there is in fact an error in the input file. It
> > is supposed to terminate.
> >
> > However, with mvapich2 and openmpi 1.10.4 it terminates
> > completely, i.e. I get my shell prompt back. If a segfault is involved with
> > mvapich2 (as is apparently the case with openmpi 1.10.4 based in the
> > termination message) I do not know. I tried
> >
> > export MV2_DEBUG_SHOW_BACKTRACE=1
> > mpirun -n 20  /cluster/vasp/5.3.5/intel2016/mvapich2-2.2/bin/vasp-mpi
> >
> > but did not get any indication of a problem (segfault), the last lines
> > are
> >
> >  calculate QP shifts : iteration 1
> >  writing wavefunctions
> > wannier90 error: examine the output/error file for details
> > node109 14:00 /scratch/ckoe/gw %
> >
> > The last line is my shell prompt.
> >
> > >
> > > if everything fails, can you describe of MPI_Allreduce is invoked ?
> > > /* number of tasks, datatype, number of elements */
> > Difficult, this is not our code in the first place [1] and the problem
> > occurs when using an ("officially" supported) third party library [2].
> >
> > From the stack trace of the hanging process the vasp routine which calls
> > allreduce is "m_sum_i_". That is in the mpi.F source file. Allreduce is
> > called as
> >
> > CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
> >  &MPI_SUM, COMM%MPI_COMM, ierror )
> >
> > n and ivec(1) are data type integer. It was originally with 20 ranks, I
> > tried 2 ranks now also and it hangs, too. With one (!) rank
> >
> > mpirun -n 1 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned
> > 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello again,

attaching gdb to mpirun, the backtrace when it hangs is
(gdb) bt
#0  0x2b039f74169d in poll () from /usr/lib64/libc.so.6
#1  0x2b039e1a9c42 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2b039e1a2751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x004056ef in orterun (argc=13, argv=0x7ffef20a79f8) at 
orterun.c:1057
#4  0x004035a0 in main (argc=13, argv=0x7ffef20a79f8) at main.c:13

Using pstack on mpirun I see several threads, listed below:

Thread 5 (Thread 0x2b03a33b0700 (LWP 11691)):
#0  0x2b039f743413 in select () from /usr/lib64/libc.so.6
#1  0x2b039c599979 in listen_thread () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-rte.so.20
#2  0x2b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x2b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x2b03a3be9700 (LWP 11692)):
#0  0x2b039f74c2c3 in epoll_wait () from /usr/lib64/libc.so.6
#1  0x2b039e1a0f42 in epoll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2b039e1a2751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x2b039e1fa996 in progress_engine () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4  0x2b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#5  0x2b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x2b03a3dea700 (LWP 11693)):
#0  0x2b039f743413 in select () from /usr/lib64/libc.so.6
#1  0x2b039e1f3a5f in listen_thread () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x2b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x2b03a3feb700 (LWP 11694)):
#0  0x2b039f743413 in select () from /usr/lib64/libc.so.6
#1  0x2b039c55616b in listen_thread_fn () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-rte.so.20
#2  0x2b039defedc5 in start_thread () from /usr/lib64/libpthread.so.0
#3  0x2b039f74bced in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x2b039c324100 (LWP 11690)):
#0  0x2b039f74169d in poll () from /usr/lib64/libc.so.6
#1  0x2b039e1a9c42 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2b039e1a2751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x004056ef in orterun (argc=13, argv=0x7ffef20a79f8) at 
orterun.c:1057
#4  0x004035a0 in main (argc=13, argv=0x7ffef20a79f8) at main.c:13

Best Regards

Christof



On Wed, Dec 07, 2016 at 02:07:27PM +0100, Christof Koehler wrote:
> Hello,
> 
> thank you for the fast answer.
> 
> On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> > Christoph,
> > 
> > can you please try again with
> > 
> > mpirun --mca btl tcp,self --mca pml ob1 ...
> 
> mpirun -n 20 --mca btl tcp,self --mca pml ob1 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> 
> Deadlocks/ hangs, has no effect.
> 
> > mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
> mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned 
> /cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
> 
> Deadlocks/ hangs, has no effect. There is additional output.
> 
> wannier90 error: examine the output/error file for details
> [node109][[55572,1],16][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
> (104)[node109][[55572,1],8][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node109][[55572,1],4][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node109][[55572,1],1][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> [node109][[55572,1],2][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
> mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
> 
> Please note: The "wannier90 error: examine the output/error file for
> details" is expected, there is in fact an error in the input file. It
> is supposed to terminate.
> 
> However, with mvapich2 and openmpi 1.10.4 it terminates
> completely, i.e. I get my shell prompt back. If a segfault is involved with 
> mvapich2 (as is apparently the case with openmpi 1.10.4 based in the
> termination message) I do not know. I tried
> 
> export MV2_DEBUG_SHOW_BACKTRACE=1
> mpirun -n 20  /cluster/vasp/5.3.5/intel2016/mvapich2-2.2/bin/vasp-mpi
> 
> but did not get any indication of a problem (segfault), the last lines
> are
> 
>  calculate QP shifts : iteration 1
>  writing wavefunctions
> wannier90 error: examine the output/error file for 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello,

thank you for the fast answer.

On Wed, Dec 07, 2016 at 08:23:43PM +0900, Gilles Gouaillardet wrote:
> Christoph,
> 
> can you please try again with
> 
> mpirun --mca btl tcp,self --mca pml ob1 ...

mpirun -n 20 --mca btl tcp,self --mca pml ob1 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

Deadlocks/ hangs, has no effect.

> mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...
mpirun -n 20 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

Deadlocks/ hangs, has no effect. There is additional output.

wannier90 error: examine the output/error file for details
[node109][[55572,1],16][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer
(104)[node109][[55572,1],8][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],4][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],1][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[node109][[55572,1],2][btl_tcp_frag.c:230:mca_btl_tcp_frag_recv] 
mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)

Please note: The "wannier90 error: examine the output/error file for
details" is expected, there is in fact an error in the input file. It
is supposed to terminate.

However, with mvapich2 and openmpi 1.10.4 it terminates
completely, i.e. I get my shell prompt back. Whether a segfault is involved with
mvapich2 (as is apparently the case with openmpi 1.10.4, based on the
termination message) I do not know. I tried

export MV2_DEBUG_SHOW_BACKTRACE=1
mpirun -n 20  /cluster/vasp/5.3.5/intel2016/mvapich2-2.2/bin/vasp-mpi

but did not get any indication of a problem (segfault), the last lines
are

 calculate QP shifts : iteration 1
 writing wavefunctions
wannier90 error: examine the output/error file for details
node109 14:00 /scratch/ckoe/gw %

The last line is my shell prompt.

> 
> if everything fails, can you describe how MPI_Allreduce is invoked ?
> /* number of tasks, datatype, number of elements */
Difficult, this is not our code in the first place [1] and the problem
occurs when using an ("officially" supported) third party library [2].

From the stack trace of the hanging process the vasp routine which calls
allreduce is "m_sum_i_". That is in the mpi.F source file. Allreduce is
called as

CALL MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
 &MPI_SUM, COMM%MPI_COMM, ierror )

n and ivec(1) are of data type integer. It was originally run with 20 ranks; I
tried 2 ranks now also and it hangs, too. With one (!) rank

mpirun -n 1 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi

I of course get a shell prompt back. 
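
In case it is useful, I believe the situation boils down to something like
the following toy program (a sketch I made up, not code from vasp or
wannier90): the root rank leaves via a plain Fortran stop while the other
ranks enter the allreduce.

program stop_vs_allreduce
  ! sketch: rank 0 stops without MPI_Abort/MPI_Finalize while the
  ! remaining ranks wait in MPI_Allreduce
  use mpi
  implicit none
  integer :: rank, ierr, n
  integer :: ivec(4)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  if (rank == 0) stop "simulated wannier90 error on the root rank"

  n = 4
  ivec = rank
  call MPI_ALLREDUCE( MPI_IN_PLACE, ivec(1), n, MPI_INTEGER, &
       &              MPI_SUM, MPI_COMM_WORLD, ierr )

  call MPI_Finalize(ierr)
end program stop_vs_allreduce

I would expect "mpirun -n 2" on something like this to show the same
behaviour I see with vasp: rank 0 disappears while the surviving rank
spins in the allreduce.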

I then started it normally in the shell with 2 ranks 
mpirun -n 2 --mca btl tcp,self --mca pml ob1 --mca coll ^tuned 
/cluster/vasp/5.3.5/intel2016/openmpi-2.0/bin/vasp-mpi
and attached gdb to the rank with the lowest pid (3478). I do not get a prompt 
back (it hangs), the second rank 3479 is still at 100 % CPU and mpirun is still 
a process
I can see with "ps", but gdb says
(gdb) continue <- that is where I attached it !
Continuing.
[Thread 0x2b8366806700 (LWP 3480) exited]
[Thread 0x2b835da1c040 (LWP 3478) exited]
[Inferior 1 (process 3478) exited normally]
(gdb) bt
No stack.

So, as far as gdb is concerned the rank with the lowest pid (which is
gone while the other rank is still eating CPU time) terminated normally
? 

I hope this helps. I have only very basic experience with debuggers
(never needed them really) and even less with using them in parallel.
I can try to catch the contents of ivec, but I do not think that would
be helpful ? If you need them I can try of course; I have no idea how
large the vector is.


Best Regards

Christof

[1] https://www.vasp.at/
[2] http://www.wannier.org/, Old version 1.2
> 
> 
> 
> Cheers,
> 
> Gilles
> 
> On Wed, Dec 7, 2016 at 7:38 PM, Christof Koehler
>  wrote:
> > Hello everybody,
> >
> > I am observing a deadlock in allreduce with openmpi 2.0.1 on a Single
> > node. A stack tracke (pstack) of one rank is below showing the program (vasp
> > 5.3.5) and the two psm2 progress threads. However:
> >
> > In fact, the vasp input is not ok and it should abort at the point where
> > it hangs. It does when using mvapich 2.2. With openmpi 2.0.1 it just
> > deadlocks in some allreduce operation. Originally it was started with 20
> > ranks, when it hangs there are only 19 left. From the PIDs I would
> > assume it is the master rank which is missing. So, this looks like a
> > failure to terminate.
> >
> > With 1.10 I get a clean
> > --
> > mpiexec 

Re: [OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Gilles Gouaillardet
Christoph,

can you please try again with

mpirun --mca btl tcp,self --mca pml ob1 ...

that will help figure out whether pml/cm and/or mtl/psm2 is involved or not.


if that causes a crash, then can you please try

mpirun --mca btl tcp,self --mca pml ob1 --mca coll ^tuned ...

that will help figure out whether coll/tuned is involved or not

coll/tuned is known not to correctly handle collectives with different
but matching signatures (e.g. some tasks invoke the collective with one
vector of N elements, and some others invoke the same collective with N
elements)
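
for example, something along these lines (just an illustration, unrelated
to vasp) uses different but matching signatures:

program matching_signatures
  ! illustration: every rank broadcasts the same 4 integers, but rank 0
  ! describes them as 1 element of a contiguous derived type while the
  ! other ranks pass 4 elements of MPI_INTEGER
  use mpi
  implicit none
  integer :: rank, ierr, four_ints
  integer :: buf(4)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  buf = rank
  call MPI_Type_contiguous(4, MPI_INTEGER, four_ints, ierr)
  call MPI_Type_commit(four_ints, ierr)

  if (rank == 0) then
     call MPI_Bcast(buf, 1, four_ints, 0, MPI_COMM_WORLD, ierr)
  else
     call MPI_Bcast(buf, 4, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
  end if

  call MPI_Type_free(four_ints, ierr)
  call MPI_Finalize(ierr)
end program matching_signatures

this is legal MPI, but it is the kind of pattern coll/tuned may get wrong.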


if everything fails, can you describe how MPI_Allreduce is invoked ?
/* number of tasks, datatype, number of elements */



Cheers,

Gilles

On Wed, Dec 7, 2016 at 7:38 PM, Christof Koehler
 wrote:
> Hello everybody,
>
> I am observing a deadlock in allreduce with openmpi 2.0.1 on a Single
> node. A stack tracke (pstack) of one rank is below showing the program (vasp
> 5.3.5) and the two psm2 progress threads. However:
>
> In fact, the vasp input is not ok and it should abort at the point where
> it hangs. It does when using mvapich 2.2. With openmpi 2.0.1 it just
> deadlocks in some allreduce operation. Originally it was started with 20
> ranks, when it hangs there are only 19 left. From the PIDs I would
> assume it is the master rank which is missing. So, this looks like a
> failure to terminate.
>
> With 1.10 I get a clean
> --
> mpiexec noticed that process rank 0 with PID 18789 on node node109
> exited on signal 11 (Segmentation fault).
> --
>
> Any ideas what to try ? Of course in this situation it may well be the
> program. Still, with the observed difference between 2.0.1 and 1.10 (and
> mvapich) this might be interesting to someone.
>
> Best Regards
>
> Christof
>
>
> Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
> #0  0x2ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
> #1  0x2ad35d114f42 in epoll_dispatch () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x2ad35d116751 in opal_libevent2022_event_base_loop () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x2ad35d16e996 in progress_engine () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x2ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> #5  0x2ad35b155ced in clone () from /lib64/libc.so.6
> Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
> #0  0x2ad35b14b69d in poll () from /lib64/libc.so.6
> #1  0x2ad35d11dc42 in poll_dispatch () from
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x2ad35d116751 in opal_libevent2022_event_base_loop () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x2ad35d0c61d1 in progress_engine () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x2ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
> #5  0x2ad35b155ced in clone () from /lib64/libc.so.6
> Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
> #0  0x2ad35b14b69d in poll () from /lib64/libc.so.6
> #1  0x2ad35d11dc42 in poll_dispatch () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #2  0x2ad35d116751 in opal_libevent2022_event_base_loop () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #3  0x2ad35d0c28cf in opal_progress () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
> #4  0x2ad35adce8d8 in ompi_request_wait_completion () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #5  0x2ad35adce838 in mca_pml_cm_recv () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #6  0x2ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () 
> from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #7  0x2ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #8  0x2ad35ad1f0f4 in PMPI_Allreduce () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
> #9  0x2ad35aa99c38 in pmpi_allreduce__ () from 
> /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
> #10 0x0045f8c6 in m_sum_i_ ()
> #11 0x00e1ce69 in mlwf_mp_mlwf_wannier90_ ()
> #12 0x004331ff in vamp () at main.F:2640
> #13 0x0040ea1e in main ()
> #14 0x2ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
> #15 0x0040e929 in _start ()
>
>
> --
> Dr. rer. nat. Christof Köhler   email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS  phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12   fax: +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>
> ___
> users mailing list
> 

[OMPI users] Abort/ Deadlock issue in allreduce

2016-12-07 Thread Christof Koehler
Hello everybody,

I am observing a deadlock in allreduce with openmpi 2.0.1 on a single
node. A stack trace (pstack) of one rank is below, showing the program (vasp
5.3.5) and the two psm2 progress threads. However:

In fact, the vasp input is not ok and it should abort at the point where
it hangs. It does when using mvapich 2.2. With openmpi 2.0.1 it just
deadlocks in some allreduce operation. Originally it was started with 20
ranks; when it hangs there are only 19 left. From the PIDs I would
assume it is the master rank which is missing. So, this looks like a
failure to terminate.

With 1.10 I get a clean
--
mpiexec noticed that process rank 0 with PID 18789 on node node109
exited on signal 11 (Segmentation fault).
--

Any ideas what to try ? Of course in this situation it may well be the
program. Still, with the observed difference between 2.0.1 and 1.10 (and
mvapich) this might be interesting to someone.

Best Regards

Christof


Thread 3 (Thread 0x2ad362577700 (LWP 4629)):
#0  0x2ad35b1562c3 in epoll_wait () from /lib64/libc.so.6
#1  0x2ad35d114f42 in epoll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2ad35d116751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x2ad35d16e996 in progress_engine () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4  0x2ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x2ad35b155ced in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x2ad362778700 (LWP 4640)):
#0  0x2ad35b14b69d in poll () from /lib64/libc.so.6
#1  0x2ad35d11dc42 in poll_dispatch () from
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2ad35d116751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x2ad35d0c61d1 in progress_engine () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4  0x2ad359efbdc5 in start_thread () from /lib64/libpthread.so.0
#5  0x2ad35b155ced in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x2ad35978d040 (LWP 4609)):
#0  0x2ad35b14b69d in poll () from /lib64/libc.so.6
#1  0x2ad35d11dc42 in poll_dispatch () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#2  0x2ad35d116751 in opal_libevent2022_event_base_loop () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#3  0x2ad35d0c28cf in opal_progress () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libopen-pal.so.20
#4  0x2ad35adce8d8 in ompi_request_wait_completion () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#5  0x2ad35adce838 in mca_pml_cm_recv () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#6  0x2ad35ad4da42 in ompi_coll_base_allreduce_intra_recursivedoubling () 
from /cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#7  0x2ad35ad52906 in ompi_coll_tuned_allreduce_intra_dec_fixed () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#8  0x2ad35ad1f0f4 in PMPI_Allreduce () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi.so.20
#9  0x2ad35aa99c38 in pmpi_allreduce__ () from 
/cluster/mpi/openmpi/2.0.1/intel2016/lib/libmpi_mpifh.so.20
#10 0x0045f8c6 in m_sum_i_ ()
#11 0x00e1ce69 in mlwf_mp_mlwf_wannier90_ ()
#12 0x004331ff in vamp () at main.F:2640
#13 0x0040ea1e in main ()
#14 0x2ad35b080b15 in __libc_start_main () from /lib64/libc.so.6
#15 0x0040e929 in _start ()


-- 
Dr. rer. nat. Christof Köhler   email: c.koeh...@bccms.uni-bremen.de
Universitaet Bremen/ BCCMS  phone:  +49-(0)421-218-62334
Am Fallturm 1/ TAB/ Raum 3.12   fax: +49-(0)421-218-62770
28359 Bremen  

PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/



Re: [OMPI users] Abort

2010-08-16 Thread David Ronis
Hi Jeff,

I've reproduced your test here, with the same results.  Moreover, if I
put the nodes with rank>0 into a blocking MPI call (MPI_Bcast or
MPI_Barrier) I still get the same behavior; namely, rank 0's calling
abort() generates a core file and leads to termination, which is the
behavior I want.  I'll look at my code a bit more, but the only
difference I see now is that in my code a floating point exception
triggers a signal-handler that calls abort().   I don't see why that
should be different from your test.

Thanks for your help.

David

On Mon, 2010-08-16 at 09:54 -0700, Jeff Squyres wrote:
> FWIW, I'm unable to replicate your behavior.  This is with Open MPI 1.4.2 on 
> RHEL5:
> 
> 
> [9:52] svbu-mpi:~/mpi % cat abort.c
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <mpi.h>
> 
> int main(int argc, char **argv)
> {
> int rank;
> 
> MPI_Init(&argc, &argv);
> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> if (0 == rank) {
> abort();
> }
> printf("Rank %d sleeping...\n", rank);
> sleep(600);
> printf("Rank %d finalizing...\n", rank);
> MPI_Finalize();
> return 0;
> }
> [9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
> [9:52] svbu-mpi:~/mpi % ls -l core*
> ls: No match.
> [9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 
> ./abort
> Rank 1 sleeping...
> [svbu-mpi055:03991] *** Process received signal ***
> [svbu-mpi055:03991] Signal: Aborted (6)
> [svbu-mpi055:03991] Signal code:  (-6)
> [svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
> [svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
> [svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
> [svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
> [svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) 
> [0x2b45cacf2994]
> [svbu-mpi055:03991] [ 5] ./abort [0x400809]
> [svbu-mpi055:03991] *** End of error message ***
> Rank 3 sleeping...
> Rank 2 sleeping...
> --
> mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited 
> on signal 6 (Aborted).
> --
> [9:52] svbu-mpi:~/mpi % ls -l core*
> -rw--- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
> [9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991 
> core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1 
> (SYSV), SVR4-style, from 'abort'
> [9:52] svbu-mpi:~/mpi % 
> -
> 
> You can see that all processes die immediately, and I get a corefile from the 
> process that called abort().
> 
> 
> On Aug 16, 2010, at 9:25 AM, David Ronis wrote:
> 
> > I've tried both--as you said, MPI_Abort doesn't drop a core file, but
> > does kill off the entire MPI job.   abort() drops core when I'm running
> > on 1 processor, but not in a multiprocessor run.  In addition, a node
> > calling abort() doesn't lead to the entire run being killed off.
> > 
> > David
> > On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> >> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> >> 
> >>> I'm using mpirun and the nodes are all on the same machin (a 8 cpu box
> >>> with an intel i7).  coresize is unlimited:
> >>> 
> >>> ulimit -a
> >>> core file size  (blocks, -c) unlimited
> >> 
> >> That looks good.
> >> 
> >> In reviewing the email thread, it's not entirely clear: are you calling 
> >> abort() or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() 
> >> should.
> >> 
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 



Re: [OMPI users] Abort

2010-08-16 Thread Jeff Squyres
FWIW, I'm unable to replicate your behavior.  This is with Open MPI 1.4.2 on 
RHEL5:


[9:52] svbu-mpi:~/mpi % cat abort.c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
int rank;

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
abort();
}
printf("Rank %d sleeping...\n", rank);
sleep(600);
printf("Rank %d finalizing...\n", rank);
MPI_Finalize();
return 0;
}
[9:52] svbu-mpi:~/mpi % mpicc abort.c -o abort
[9:52] svbu-mpi:~/mpi % ls -l core*
ls: No match.
[9:52] svbu-mpi:~/mpi % mpirun -np 4 --bynode --host svbu-mpi055,svbu-mpi056 
./abort
Rank 1 sleeping...
[svbu-mpi055:03991] *** Process received signal ***
[svbu-mpi055:03991] Signal: Aborted (6)
[svbu-mpi055:03991] Signal code:  (-6)
[svbu-mpi055:03991] [ 0] /lib64/libpthread.so.0 [0x2b45caac87c0]
[svbu-mpi055:03991] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2b45cad05265]
[svbu-mpi055:03991] [ 2] /lib64/libc.so.6(abort+0x110) [0x2b45cad06d10]
[svbu-mpi055:03991] [ 3] ./abort(main+0x36) [0x4008ee]
[svbu-mpi055:03991] [ 4] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x2b45cacf2994]
[svbu-mpi055:03991] [ 5] ./abort [0x400809]
[svbu-mpi055:03991] *** End of error message ***
Rank 3 sleeping...
Rank 2 sleeping...
--
mpirun noticed that process rank 0 with PID 3991 on node svbu-mpi055 exited on 
signal 6 (Aborted).
--
[9:52] svbu-mpi:~/mpi % ls -l core*
-rw--- 1 jsquyres eng5 26009600 Aug 16 09:52 core.abort-1281977540-3991
[9:52] svbu-mpi:~/mpi % file core.abort-1281977540-3991 
core.abort-1281977540-3991: ELF 64-bit LSB core file AMD x86-64, version 1 
(SYSV), SVR4-style, from 'abort'
[9:52] svbu-mpi:~/mpi % 
-

You can see that all processes die immediately, and I get a corefile from the 
process that called abort().


On Aug 16, 2010, at 9:25 AM, David Ronis wrote:

> I've tried both--as you said, MPI_Abort doesn't drop a core file, but
> does kill off the entire MPI job.   abort() drops core when I'm running
> on 1 processor, but not in a multiprocessor run.  In addition, a node
> calling abort() doesn't lead to the entire run being killed off.
> 
> David
> On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
>> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
>> 
>>> I'm using mpirun and the nodes are all on the same machin (a 8 cpu box
>>> with an intel i7).  coresize is unlimited:
>>> 
>>> ulimit -a
>>> core file size  (blocks, -c) unlimited
>> 
>> That looks good.
>> 
>> In reviewing the email thread, it's not entirely clear: are you calling 
>> abort() or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() should.
>> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Abort

2010-08-16 Thread David Ronis
I've tried both--as you said, MPI_Abort doesn't drop a core file, but
does kill off the entire MPI job.   abort() drops core when I'm running
on 1 processor, but not in a multiprocessor run.  In addition, a node
calling abort() doesn't lead to the entire run being killed off.

David
On Mon, 2010-08-16 at 08:51 -0700, Jeff Squyres wrote:
> On Aug 13, 2010, at 12:53 PM, David Ronis wrote:
> 
> > I'm using mpirun and the nodes are all on the same machin (a 8 cpu box
> > with an intel i7).  coresize is unlimited:
> > 
> > ulimit -a
> > core file size  (blocks, -c) unlimited
> 
> That looks good.
> 
> In reviewing the email thread, it's not entirely clear: are you calling 
> abort() or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() should.
> 



Re: [OMPI users] Abort

2010-08-16 Thread Jeff Squyres
On Aug 13, 2010, at 12:53 PM, David Ronis wrote:

> I'm using mpirun and the nodes are all on the same machin (a 8 cpu box
> with an intel i7).  coresize is unlimited:
> 
> ulimit -a
> core file size  (blocks, -c) unlimited

That looks good.

In reviewing the email thread, it's not entirely clear: are you calling abort() 
or MPI_Abort()?  MPI_Abort() won't drop a core file.  abort() should.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Abort

2010-08-13 Thread David Ronis
I'm using mpirun and the nodes are all on the same machine (an 8 CPU box
with an Intel i7).  Coresize is unlimited:


ulimit -a
core file size  (blocks, -c) unlimited

David


On Fri, 2010-08-13 at 13:47 -0400, Jeff Squyres wrote:
> On Aug 13, 2010, at 1:18 PM, David Ronis wrote:
> 
> > Second coredumpsize is unlimited, and indeed I DO get core dumps when
> > I'm running a single-processor version.  
> 
> What launcher are you using underneath Open MPI?
> 
> You might want to make sure that the underlying launcher actually sets the 
> coredumpsize to unlimited on each server where you're running.  E.g., if 
> you're using rsh/ssh, check that your shell startup files set coredumpsize to 
> unlimited for non-interactive logins.  Or, if you're using (for example) 
> Torque, check to ensure that jobs launched under Torque don't have their 
> coredumpsize automatically reset to 0, etc.
> 



Re: [OMPI users] Abort

2010-08-13 Thread Jeff Squyres
On Aug 13, 2010, at 1:18 PM, David Ronis wrote:

> Second coredumpsize is unlimited, and indeed I DO get core dumps when
> I'm running a single-processor version.  

What launcher are you using underneath Open MPI?

You might want to make sure that the underlying launcher actually sets the 
coredumpsize to unlimited on each server where you're running.  E.g., if you're 
using rsh/ssh, check that your shell startup files set coredumpsize to 
unlimited for non-interactive logins.  Or, if you're using (for example) 
Torque, check to ensure that jobs launched under Torque don't have their 
coredumpsize automatically reset to 0, etc.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Abort

2010-08-13 Thread David Ronis
Thanks to all who replied.  

First, I'm running openmpi 1.4.2.  

Second, coredumpsize is unlimited, and indeed I DO get core dumps when
I'm running a single-processor version.  Third, the problem isn't
stopping the program (MPI_Abort does that just fine), rather it's getting
a coredump.  According to the man page, MPI_Abort sends a SIGTERM, not a
SIGABRT, so perhaps that's what should happen.

Finally, my guess as to what's happening if I use the libc abort is that
the other nodes get stuck in an MPI call (I do lots of MPI_Reduces or
MPI_Bcasts in this code), but this doesn't explain why the node calling
abort doesn't exit with a coredump.

David

On Thu, 2010-08-12 at 20:44 -0600, Ralph Castain wrote:
> Sounds very strange - what OMPI version, on what type of machine, and how was 
> it configured?
> 
> 
> On Aug 12, 2010, at 7:49 PM, David Ronis wrote:
> 
> > I've got a mpi program that is supposed to to generate a core file if
> > problems arise on any of the nodes.   I tried to do this by adding a
> > call to abort() to my exit routines but this doesn't work; I get no core
> > file, and worse, mpirun doesn't detect that one of my nodes has
> > aborted(?) and doesn't kill off the entire job, except in the trivial
> > case where the number of processors I'm running on is 1.   I've replaced
> > abort with MPI_Abort, which kills everything off, but leaves no core
> > file.  Any suggestions how I can get one and still have mpi exit?
> > 
> > Thanks in advance.
> > 
> > David
> > 
> > 
> > 
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 




Re: [OMPI users] Abort

2010-08-12 Thread Gus Correa

David Zhang wrote:
When my MPI code fails (seg fault), it usually cause the rest of the mpi 
process to abort as well.  Perhaps rather than calling abort(), perhaps 
you could do a divide-by-zero operation to halt the program?

David Zhang
University of California, San Diego

>
On Thu, Aug 12, 2010 at 6:49 PM, David Ronis > wrote:


I've got a mpi program that is supposed to to generate a core file if
problems arise on any of the nodes.   I tried to do this by adding a
call to abort() to my exit routines but this doesn't work; I get no core
file, and worse, mpirun doesn't detect that one of my nodes has
aborted(?) and doesn't kill off the entire job, except in the trivial
case where the number of processors I'm running on is 1.   I've replaced
abort with MPI_Abort, which kills everything off, but leaves no core
file.  Any suggestions how I can get one and still have mpi exit?

Thanks in advance.

David
 


Also, make sure your computers' coredumpsize / core file size
limit is not zero, which is sometimes the case.

Gus Correa


Re: [OMPI users] Abort

2010-08-12 Thread Ralph Castain
Sounds very strange - what OMPI version, on what type of machine, and how was 
it configured?


On Aug 12, 2010, at 7:49 PM, David Ronis wrote:

> I've got a mpi program that is supposed to to generate a core file if
> problems arise on any of the nodes.   I tried to do this by adding a
> call to abort() to my exit routines but this doesn't work; I get no core
> file, and worse, mpirun doesn't detect that one of my nodes has
> aborted(?) and doesn't kill off the entire job, except in the trivial
> case where the number of processors I'm running on is 1.   I've replaced
> abort with MPI_Abort, which kills everything off, but leaves no core
> file.  Any suggestions how I can get one and still have mpi exit?
> 
> Thanks in advance.
> 
> David
> 
> 
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Abort

2010-08-12 Thread David Zhang
When my MPI code fails (seg fault), it usually causes the rest of the MPI
processes to abort as well.  Rather than calling abort(), perhaps you
could do a divide-by-zero operation to halt the program?

On Thu, Aug 12, 2010 at 6:49 PM, David Ronis  wrote:

> I've got a mpi program that is supposed to to generate a core file if
> problems arise on any of the nodes.   I tried to do this by adding a
> call to abort() to my exit routines but this doesn't work; I get no core
> file, and worse, mpirun doesn't detect that one of my nodes has
> aborted(?) and doesn't kill off the entire job, except in the trivial
> case where the number of processors I'm running on is 1.   I've replaced
> abort with MPI_Abort, which kills everything off, but leaves no core
> file.  Any suggestions how I can get one and still have mpi exit?
>
> Thanks in advance.
>
> David
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>



-- 
David Zhang
University of California, San Diego


[OMPI users] Abort

2010-08-12 Thread David Ronis
I've got an MPI program that is supposed to generate a core file if
problems arise on any of the nodes.   I tried to do this by adding a
call to abort() to my exit routines but this doesn't work; I get no core
file, and worse, mpirun doesn't detect that one of my nodes has
aborted(?) and doesn't kill off the entire job, except in the trivial
case where the number of processors I'm running on is 1.   I've replaced
abort with MPI_Abort, which kills everything off, but leaves no core
file.  Any suggestions how I can get one and still have mpi exit?

Thanks in advance.

David