Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Maxime Boissonneault
It's building... to be continued tomorrow morning. On 2014-08-18 16:45, Rolf vandeVaart wrote: Just to help reduce the scope of the problem, can you retest with a non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?
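For reference, a rebuild along those lines might look like the following (a minimal sketch; the source directory and install prefix are assumptions, not from the thread — omitting --with-cuda yields a non-CUDA-aware build):

    # Sketch: build Open MPI 1.8.1 without CUDA support, with debug symbols
    cd openmpi-1.8.1                               # hypothetical source directory
    ./configure --prefix=$HOME/ompi-1.8.1-debug --enable-debug
    make -j4 && make install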

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Maxime Boissonneault
Same thing: [mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1 [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple malloc: using debugging hooks malloc: using debugging hooks [gpu-k20-07:47628] *** Process received signal *** [gpu-k20-07:47628] Si

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Alex A. Granovsky
Try the following: export MALLOC_CHECK_=1 and then run it again. Kind regards, Alex Granovsky -Original Message- From: Maxime Boissonneault Sent: Tuesday, August 19, 2014 12:23 AM To: Open MPI Users Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes Hi, Since my previ
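For context, MALLOC_CHECK_ is glibc's heap-consistency switch; a small sketch of the values it accepts (the run command is taken from the thread):

    # MALLOC_CHECK_=1 prints a diagnostic on heap corruption,
    # 2 aborts immediately, 3 does both
    export MALLOC_CHECK_=1
    mpiexec -np 2 --map-by ppr:1:node cudampi_simple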

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Rolf vandeVaart
Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace? >-Original Message- >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime >Boissonn

[OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Maxime Boissonneault
Hi, Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simpler one. I reduced the code to the minimum that would reproduce the bug. I have pasted it here: http://pa
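The pasted reproducer itself is truncated out of the archive; as a sketch only, a minimal CUDA + MPI test of this kind is typically built and launched as follows (the .cu source file name is hypothetical; the binary name and run line match the thread):

    # Build the reproducer and run one rank per node
    nvcc -c cudampi_simple.cu -o cudampi_simple.o      # hypothetical source file
    mpicc cudampi_simple.o -o cudampi_simple -L$CUDA_HOME/lib64 -lcudart
    mpiexec -np 2 --map-by ppr:1:node cudampi_simple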

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-18 Thread Mike Dubman
Most likely you are running an old OFED which does not have this parameter. Try: # modinfo mlx4_core and see if it is there. I would suggest installing the latest OFED or Mellanox OFED. On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota wrote: > I get "ofed_info: command not found". Note that I don't install t
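A quick way to run that check, plus the usual way the parameter is set when present (a sketch; the modprobe.d file name and values are common conventions, not from this thread):

    # Does the mlx4_core driver expose log_num_mtt?
    modinfo mlx4_core | grep log_num_mtt
    # If so, it can be set at module load time, e.g. in /etc/modprobe.d/mlx4_core.conf:
    #   options mlx4_core log_num_mtt=24 log_mtts_per_seg=3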

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-18 Thread Rio Yokota
I get "ofed_info: command not found". Note that I don't install the entire OFED, but do a component wise installation by doing "apt-get install infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and utilities. > Hi, > what ofed version do you use? > (ofed_info -s) > > > On Su

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Indeed odd - I'm afraid that this is just the kind of case that has been causing problems. I think I've figured out the problem, but have been buried with my "day job" for the last few weeks and unable to pursue it. On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault wrote: > Ok, I confirm th

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Ok, I confirm that with mpiexec -mca oob_tcp_if_include lo ring_c it works. It also works with mpiexec -mca oob_tcp_if_include ib0 ring_c. We have four interfaces on this node: lo (the local loopback), ib0 (InfiniBand), eth2 (a management network), and eth3 (the public network). It seems that mpiexec atte
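As a workaround, the OOB interface can be pinned either per run or persistently; a sketch (the per-run line is from the thread, the params-file location is the standard Open MPI convention):

    # Per run: restrict the out-of-band TCP channel to one interface
    mpiexec -mca oob_tcp_if_include ib0 ring_c
    # Persistently: in $PREFIX/etc/openmpi-mca-params.conf add
    #   oob_tcp_if_include = ib0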

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Yeah, there are some issues with the internal connection logic that need to get fixed. We haven't had many cases where it's been an issue, but a couple like this have cropped up - enough that I need to set aside some time to fix it. My apologies for the problem. On Aug 18, 2014, at 10:31 AM, M

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Indeed, that makes sense now. Why isn't OpenMPI attempting to connect over the local loopback for the same node? This used to work with 1.6.5. Maxime On 2014-08-18 13:11, Ralph Castain wrote: Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [heli

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Yep, that pinpointed the problem: [helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING [helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer [[63019,0],0] on socket 11 [helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect: connection failed: Co

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Here it is. Maxime On 2014-08-18 12:59, Ralph Castain wrote: Ah... now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it. On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: This is all on one node indeed. Attached is th

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
Ah... now that showed the problem. To pinpoint it better, please add -mca oob_base_verbose 10 and I think we'll have it. On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault wrote: > This is all on one node indeed. > > Attached is the output of > mpirun -np 4 --mca plm_base_verbose 10 -mca odls_
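Combined with the flags suggested earlier in the thread, the full diagnostic run would look roughly like this (a reconstruction; the tee file name follows the one used in the reply below):

    mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 \
           -mca state_base_verbose 5 -mca errmgr_base_verbose 5 \
           -mca oob_base_verbose 10 ring_c |& tee output_ringc_verbose.txt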

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
This is all on one node indeed. Attached is the output of mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee output_ringc_verbose.txt Maxime On 2014-08-18 12:48, Ralph Castain wrote: This is all on one nod

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lots of garbage, but it should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: > Here it is > On 2014-08-18 12:30, Joshua Ladd

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Here it is. On 2014-08-18 12:30, Joshua Ladd wrote: mpirun -np 4 --mca plm_base_verbose 10 [mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose 10 ring_c [helios-login1:27853] mca: base: components_register: registering plm components [helios-login1:27853] mca: base: compone

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Joshua Ladd
Maxime, Can you run with: mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote: > Hi, > I just compiled without CUDA, and the result is the same. No output, > exits with code

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Maxime Boissonneault
Hi, I just compiled without CUDA, and the result is the same. No output; it exits with code 65. [mboisson@helios-login1 examples]$ ldd ring_c linux-vdso.so.1 => (0x7fff3ab31000) libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x7fab9ec

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-18 Thread Mike Dubman
Hi, what OFED version do you use? (ofed_info -s) On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota wrote: > I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the > following warning upon execution, which did not appear before the upgrade. > > WARNING: It appears that your OpenFabr