It's building... to be continued tomorrow morning.
On 2014-08-18 16:45, Rolf vandeVaart wrote:
Just to help reduce the scope of the problem, can you retest with a non
CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the
configure line to help with the stack trace?
Same thing:
[mboisson@gpu-k20-07 simple_cuda_mpi]$ export MALLOC_CHECK_=1
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node
cudampi_simple
malloc: using debugging hooks
malloc: using debugging hooks
[gpu-k20-07:47628] *** Process received signal ***
[gpu-k20-07:47628] Si
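For context, MALLOC_CHECK_ is a glibc facility, not an Open MPI one: setting
it to 1 makes malloc/free print a diagnostic on stderr when they detect heap
corruption (2 aborts instead, 3 does both). A standalone toy illustration,
unrelated to the actual reproducer in this thread:

/* toy_heap_bug.c -- illustrative only, not the code from this thread.
 * Build: gcc toy_heap_bug.c -o toy_heap_bug
 * Run:   MALLOC_CHECK_=1 ./toy_heap_bug
 */
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *buf = malloc(8);
    memset(buf, 'x', 9);   /* one-byte heap overflow past the allocation */
    free(buf);             /* with MALLOC_CHECK_=1 glibc may print a
                              diagnostic such as "free(): invalid pointer"
                              here instead of failing silently */
    return 0;
}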
Try the following:
export MALLOC_CHECK_=1
and then run it again
Kind regards,
Alex Granovsky
-----Original Message-----
From: Maxime Boissonneault
Sent: Tuesday, August 19, 2014 12:23 AM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi,
Since my previ
Just to help reduce the scope of the problem, can you retest with a non
CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the
configure line to help with the stack trace?
>-----Original Message-----
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonn
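To make Rolf's suggestion concrete: a non-CUDA-aware, debug-enabled build
just means configuring without --with-cuda and with --enable-debug, roughly
as follows (the install prefix is illustrative):

./configure --prefix=$HOME/openmpi-1.8.1-nocuda-debug --enable-debug
make && make install

The debug build keeps the symbols and internal checks that make the stack
trace from the segfault readable.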
Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of
derailed into two problems, one of which has been addressed, I figured I
would start a new, more precise and simpler one.
I reduced the code to the minimum that would reproduce the bug. I have
pasted it here:
http://pa
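The paste URL above is truncated in the archive. For readers following
along, a representative minimal CUDA + MPI reproducer of the kind being
described, illustrative only and not Maxime's actual code, looks like:

/* Rank 0 sends a device buffer directly to rank 1, relying on a
 * CUDA-aware MPI to handle the GPU pointer. Sizes are arbitrary. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *d_buf;                       /* device memory, not host memory */
    cudaMalloc((void **)&d_buf, 1024 * sizeof(double));

    if (rank == 0)
        MPI_Send(d_buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}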
Most likely you are installing an old OFED which does not have this parameter.
try:
#modinfo mlx4_core
and see if it is there.
I would suggest installing the latest OFED or Mellanox OFED.
On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota wrote:
> I get "ofed_info: command not found". Note that I don't install t
I get "ofed_info: command not found". Note that I don't install the entire
OFED, but do a component wise installation by doing "apt-get install
infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and
utilities.
> Hi,
> what ofed version do you use?
> (ofed_info -s)
>
>
> On Su
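Since ofed_info ships only with the full OFED bundle, a component-wise
install can be interrogated per package instead. Assuming the Ubuntu
packages named above, something like:

dpkg -s libmlx4-dev | grep '^Version'
ibv_devinfo | grep fw_ver

dpkg -s reports the installed package version, and ibv_devinfo (from
ibverbs-utils) reports the adapter firmware version.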
Indeed odd - I'm afraid that this is just the kind of case that has been
causing problems. I think I've figured out the problem, but have been buried
with my "day job" for the last few weeks and unable to pursue it.
On Aug 18, 2014, at 11:10 AM, Maxime Boissonneault
wrote:
> Ok, I confirm th
Ok, I confirm that with
mpiexec -mca oob_tcp_if_include lo ring_c
it works.
It also works with
mpiexec -mca oob_tcp_if_include ib0 ring_c
We have 4 interfaces on this node.
- lo, the local loop
- ib0, infiniband
- eth2, a management network
- eth3, the public network
It seems that mpiexec atte
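The workaround that emerges from this exchange is to pin the out-of-band
TCP layer (and, if MPI message traffic also needs restricting, its BTL
counterpart) to interfaces that can actually reach each other, e.g.:

mpiexec -mca oob_tcp_if_include lo,ib0 -mca btl_tcp_if_include lo,ib0 ring_c

Both oob_tcp_if_include and btl_tcp_if_include accept a comma-separated
list of interface names or CIDR subnets.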
Yeah, there are some issues with the internal connection logic that need to get
fixed. We haven't had many cases where it's been an issue, but a couple like
this have cropped up - enough that I need to set aside some time to fix it.
My apologies for the problem.
On Aug 18, 2014, at 10:31 AM, M
Indeed, that makes sense now.
Why isn't OpenMPI attempting to connect with the local loop for the same
node? This used to work with 1.6.5.
Maxime
On 2014-08-18 13:11, Ralph Castain wrote:
Yep, that pinpointed the problem:
[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[heli
Yep, that pinpointed the problem:
[helios-login1:28558] [[63019,1],0] tcp:send_handler CONNECTING
[helios-login1:28558] [[63019,1],0]:tcp:complete_connect called for peer
[[63019,0],0] on socket 11
[helios-login1:28558] [[63019,1],0]-[[63019,0],0] tcp_peer_complete_connect:
connection failed: Co
Here it is.
Maxime
On 2014-08-18 12:59, Ralph Castain wrote:
Ah...now that showed the problem. To pinpoint it better, please add
-mca oob_base_verbose 10
and I think we'll have it
On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault
wrote:
This is all on one node indeed.
Attached is th
Ah...now that showed the problem. To pinpoint it better, please add
-mca oob_base_verbose 10
and I think we'll have it
On Aug 18, 2014, at 9:54 AM, Maxime Boissonneault
wrote:
> This is all on one node indeed.
>
> Attached is the output of
> mpirun -np 4 --mca plm_base_verbose 10 -mca odls_
This is all on one node indeed.
Attached is the output of
mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca
state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee
output_ringc_verbose.txt
Maxime
On 2014-08-18 12:48, Ralph Castain wrote:
This is all on one nod
This is all on one node, yes?
Try adding the following:
-mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5
Lots of garbage, but it should tell us what is going on.
On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault
wrote:
> Here it is
> On 2014-08-18 12:30, Joshua Ladd wrote:
Here it is
On 2014-08-18 12:30, Joshua Ladd wrote:
mpirun -np 4 --mca plm_base_verbose 10
[mboisson@helios-login1 examples]$ mpirun -np 4 --mca plm_base_verbose
10 ring_c
[helios-login1:27853] mca: base: components_register: registering plm
components
[helios-login1:27853] mca: base: compone
Maxime,
Can you run with:
mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c
On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <
maxime.boissonnea...@calculquebec.ca> wrote:
> Hi,
> I just compiled without CUDA, and the result is the same. No output,
> exits with code
Hi,
I just compiled without CUDA, and the result is the same. No output;
it exits with code 65.
[mboisson@helios-login1 examples]$ ldd ring_c
linux-vdso.so.1 => (0x7fff3ab31000)
libmpi.so.1 =>
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1
(0x7fab9ec
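For readers without an Open MPI tree handy: ring_c is the stock example
program that passes a token around all ranks. A minimal sketch of the same
pattern (illustrative, not the shipped examples/ring_c.c):

/* Rank 0 injects a counter; each rank forwards it to (rank + 1) % size;
 * rank 0 decrements it each lap until it reaches zero.
 * Run with at least 2 ranks to avoid a self-send. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, msg;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    if (rank == 0) {
        msg = 10;                                    /* number of laps */
        MPI_Send(&msg, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }
    do {
        MPI_Recv(&msg, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank == 0)
            msg--;
        MPI_Send(&msg, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    } while (msg > 0);
    if (rank == 0)                  /* drain the final zero-valued token */
        MPI_Recv(&msg, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("rank %d of %d exited cleanly\n", rank, size);
    MPI_Finalize();
    return 0;
}

Build and run as usual: mpicc ring_sketch.c -o ring_sketch && mpiexec -np 4
./ring_sketch. A hang or a silent exit with a nonzero code, as reported
above, points at the runtime rather than the application.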
Hi,
what ofed version do you use?
(ofed_info -s)
On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota wrote:
> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the
> following warning upon execution, which did not appear before the upgrade.
>
> WARNING: It appears that your OpenFabr