Re: [OMPI users] Possible bug in MPI_Barrier() ?

2016-04-12 Thread Gilles Gouaillardet
This is quite unlikely, and FWIW, your test program works for me. I suggest you check that your 3 TCP networks are usable, for example:

  $ mpirun -np 2 -hostfile ~/hostfile -mca btl self,tcp -mca pml ob1 --mca btl_tcp_if_include xxx ./mpitest

in which xxx is a [list of] interface name : eth0 eth1

Re: [OMPI users] Debugging help

2016-04-12 Thread Jeff Squyres (jsquyres)
On Apr 12, 2016, at 2:38 PM, dpchoudh . wrote:
>
> Hello all
>
> I am trying to set a breakpoint during the modex exchange process so I can
> see the data being passed for different transport types. I assume that this is
> being done in the context of orted since this is

[OMPI users] Debugging help

2016-04-12 Thread dpchoudh .
Hello all,

I am trying to set a breakpoint during the modex exchange process so I can see the data being passed for different transport types. I assume that this is being done in the context of orted since this is part of process launch. Here is what I did: (All of this pertains to the master

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Ralph Castain
My apologies for the tardy response - been stuck in meetings. I'm glad to hear that you are making progress tracking this down. FWIW: the error message you received indicates that the socket from that node unexpectedly reset during execution of the application. So it sounds like there is something

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 01:30:37PM +0200, Stefan Friedel wrote:
-thanks for your support!- nope, no core, just the "orte has lost"...

Dear list - the problem is _not_ related to openmpi. I compiled mvapich2 and I get communication errors, too. Probably this is a hardware problem. Sorry for the

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 07:51:48PM +0900, Gilles Gouaillardet wrote:
what if you run ulimit -c unlimited: does orted generate a core dump?

Hi Gilles, -thanks for your support!- nope, no core, just the "orte has lost"... I now tested with a simple hello-world MPI program - printf("rank, processor")

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet
Stefan,

what if you run ulimit -c unlimited: does orted generate a core dump?

Cheers

Gilles

On Tuesday, April 12, 2016, Stefan Friedel < stefan.frie...@iwr.uni-heidelberg.de> wrote:
> On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
> Dear Gilles,
>
>> which version of
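For readers following the `ulimit -c unlimited` suggestion: the shell builtin adjusts the core-file size limit, which a process can also inspect through the POSIX getrlimit()/setrlimit() calls. The following is a minimal sketch, not something taken from the thread or from Open MPI; the function names core_limit_is_unlimited and raise_core_limit_to_hard_max are hypothetical. It assumes a POSIX system and only touches the limit of the calling process.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: report whether the soft core-file size limit of the
 * calling process is unlimited. Returns 1 if unlimited, 0 if it is capped,
 * and -1 if the limit could not be read. */
int core_limit_is_unlimited(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }
    return rl.rlim_cur == RLIM_INFINITY ? 1 : 0;
}

/* Hypothetical helper: raise the soft limit up to the hard limit, which is
 * the most an unprivileged process is allowed to do. Returns 0 on success
 * and -1 on failure. */
int raise_core_limit_to_hard_max(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("getrlimit(RLIMIT_CORE)");
        return -1;
    }
    rl.rlim_cur = rl.rlim_max;
    if (setrlimit(RLIMIT_CORE, &rl) != 0) {
        perror("setrlimit(RLIMIT_CORE)");
        return -1;
    }
    return 0;
}
```

Calling core_limit_is_unlimited() early in a test program shows which limit that process actually inherited, which may differ from the limit in the interactive shell where ulimit was typed.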

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
On Tue, Apr 12, 2016 at 05:11:59PM +0900, Gilles Gouaillardet wrote:
Dear Gilles,
which version of OpenMPI are you using ?
as I wrote: openmpi-1.10.2, slurm-15.08.9; homes mounted via NFS/RDMA/ipoib, mpi
when does the error occur ? is it before MPI_Init() completes ? is it in the middle

Re: [OMPI users] orte has lost communication

2016-04-12 Thread Gilles Gouaillardet
Stefan,

Which version of OpenMPI are you using? When does the error occur? Is it before MPI_Init() completes? Is it in the middle of the job? If yes, are you sure no task invoked MPI_Abort()? Also, you might want to check the system logs and make sure there was no OOM (Out Of Memory).

[OMPI users] orte has lost communication

2016-04-12 Thread Stefan Friedel
Good Morning List,

we have a problem on our cluster with bigger jobs (~> 200 nodes) - almost every job ends with a message like:

### Starting at Mon Apr 11 15:54:06 CEST 2016
Running on hosts: stek[034-086,088-201,203-247,249-344,346-379,381-388]
Running on 350 nodes.
Current