Re: [OMPI users] static OpenMPI with GNU

2015-11-13 Thread Rolf vandeVaart
A workaround is to add --disable-vt to your configure line if you do not care 
about having VampirTrace support.
Not a solution, but might help you make progress.
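
For example, a sketch of the reported configure line with the workaround added 
(paths and flags taken from the message below):

  ../configure --prefix=/home/milias/bin/openmpi-1.10.1-gnu-static \
      CXX=g++ CC=gcc F77=gfortran FC=gfortran \
      LDFLAGS="--static" LIBS="-ldl -lrt" \
      --disable-shared --enable-static --disable-vt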

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ilias Miroslav
Sent: Friday, November 13, 2015 11:30 AM
To: us...@open-mpi.org
Subject: [OMPI users] static OpenMPI with GNU


Greeting,



I am trying to compile a static version of OpenMPI with the GNU compilers.



The configuration command:

mil...@login.grid.umb.sk:~/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/../configure \
  --prefix=/home/milias/bin/openmpi-1.10.1-gnu-static \
  CXX=g++ CC=gcc F77=gfortran FC=gfortran \
  LDFLAGS="--static" LIBS="-ldl -lrt" \
  --disable-shared --enable-static

But the compilation ends with the error below.

I thought that -lrt should fix it (/usr/lib64/librt.a), but it did not help. 
Any help please?
Miro

make[10]: Entering directory 
`/home/milias/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/ompi/contrib/vt/vt/extlib/otf/tools/otfmerge/mpi'
  CC   otfmerge_mpi-handler.o
  CC   otfmerge_mpi-otfmerge.o
  CCLD otfmerge-mpi
/home/milias/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/opal/.libs/libopen-pal.a(memory_linux_munmap.o):
 In function `opal_memory_linux_free_ptmalloc2_munmap':
memory_linux_munmap.c:(.text+0x3d): undefined reference to `__munmap'
/home/milias/bin/openmpi-1.10.1-gnu-static/openmpi-1.10.1/opal/.libs/libopen-pal.a(memory_linux_munmap.o):
 In function `munmap':
memory_linux_munmap.c:(.text+0x87): undefined reference to `__munmap'
collect2: ld returned 1 exit status
make[10]: *** [otfmerge-mpi] Error 1









Re: [OMPI users] How does MPI_Allreduce work?

2015-09-25 Thread Rolf vandeVaart
In the case of reductions, yes, we copy into host memory so we can do the 
reduction.  For other collectives or point to point communication, then GPU 
Direct RDMA will be used (for smaller messages).

Rolf
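
For reference, a sketch of how the GPU Direct RDMA message-size cutoff could be 
inspected and tuned; the btl_openib_cuda_rdma_limit parameter name is an 
assumption here, so confirm it against ompi_info on your build, and ./app is a 
placeholder for the application:

  ompi_info --all | grep cuda_rdma
  mpirun --mca btl_openib_cuda_rdma_limit 65536 -np 2 ./app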

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang Zhang
>Sent: Friday, September 25, 2015 11:37 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] How does MPI_Allreduce work?
>
>Hi Rolf,
>
>Thanks very much for the info! So with a CUDA-aware build, OpenMPI still has
>to copy all the data first into host memory, and then do send/recv on the host
>memory? I thought OpenMPI would use GPUDirect and RDMA to send/recv
>GPU memory directly.
>
>I will try a debug build and see what it says. Thanks!
>
>Best,
>Yang
>
>
>
>Sent by Apple Mail
>
>Yang ZHANG
>
>PhD candidate
>
>Networking and Wide-Area Systems Group
>Computer Science Department
>New York University
>
>715 Broadway Room 705
>New York, NY 10003
>
>> On Sep 25, 2015, at 11:07 AM, Rolf vandeVaart <rvandeva...@nvidia.com>
>wrote:
>>
>> Hello Yang:
>> It is not clear to me if you are asking about a CUDA-aware build of Open MPI
>where you do the MPI_Allreduce() on the GPU buffer, or if you are handling
>staging the GPU data into host memory yourself and then calling the
>MPI_Allreduce().  Either way, they are somewhat similar.  With CUDA-aware, the
>MPI_Allreduce() of GPU data simply first copies the data into a host buffer
>and then calls the underlying implementation.
>>
>> Depending on how you have configured your Open MPI, the underlying
>implementation may vary.  I would suggest you compile a debug version (--
>enable-debug) and then run some tests with --mca coll_base_verbose 100
>which will give you some insight into what is actually happening under the
>covers.
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang
>>> Zhang
>>> Sent: Thursday, September 24, 2015 11:41 PM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] How does MPI_Allreduce work?
>>>
>>> Hello OpenMPI users,
>>>
>>> Is there any document on MPI_Allreduce() implementation? I’m using it
>>> to do summation on GPU data. I wonder if OpenMPI will first do
>>> summation on processes in the same node, and then do summation on the
>>> intermediate results across nodes. This would be preferable since it
>>> reduces cross node communication and should be faster?
>>>
>>> I’m using OpenMPI 1.10.0 and CUDA 7.0. I need to sum 40 million float
>>> numbers on 6 nodes, each node running 4 processes. The nodes are
>>> connected via InfiniBand.
>>>
>>> Thanks very much!
>>>
>>> Best,
>>> Yang
>>>
>>> -
>>> ---
>>>
>>> Sent by Apple Mail
>>>
>>> Yang ZHANG
>>>
>>> PhD candidate
>>>
>>> Networking and Wide-Area Systems Group Computer Science Department
>>> New York University
>>>
>>> 715 Broadway Room 705
>>> New York, NY 10003
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-
>>> mpi.org/community/lists/users/2015/09/27675.php
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/09/27678.php
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/09/27679.php


Re: [OMPI users] How does MPI_Allreduce work?

2015-09-25 Thread Rolf vandeVaart
Hello Yang:
It is not clear to me if you are asking about a CUDA-aware build of Open MPI 
where you do the MPI_Allreduce() on the GPU buffer, or if you are handling 
staging the GPU data into host memory yourself and then calling the 
MPI_Allreduce().  Either way, they are somewhat similar.  With CUDA-aware, the 
MPI_Allreduce() of GPU data simply first copies the data into a host buffer 
and then calls the underlying implementation.

Depending on how you have configured your Open MPI, the underlying 
implementation may vary.  I would suggest you compile a debug version 
(--enable-debug) and then run some tests with --mca coll_base_verbose 100 which 
will give you some insight into what is actually happening under the covers.
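
For example, a minimal sketch of such a run against a debug build ("./app" and 
the rank count are placeholders for your own application and job size):

  mpirun -np 4 --mca coll_base_verbose 100 ./app 2>&1 | tee coll_verbose.log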

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Yang Zhang
>Sent: Thursday, September 24, 2015 11:41 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] How does MPI_Allreduce work?
>
>Hello OpenMPI users,
>
>Is there any document on MPI_Allreduce() implementation? I’m using it to do
>summation on GPU data. I wonder if OpenMPI will first do summation on
>processes in the same node, and then do summation on the intermediate
>results across nodes. This would be preferable since it reduces cross node
>communication and should be faster?
>
>I’m using OpenMPI 1.10.0 and CUDA 7.0. I need to sum 40 million float
>numbers on 6 nodes, each node running 4 processes. The nodes are
>connected via InfiniBand.
>
>Thanks very much!
>
>Best,
>Yang
>
>
>
>Sent by Apple Mail
>
>Yang ZHANG
>
>PhD candidate
>
>Networking and Wide-Area Systems Group
>Computer Science Department
>New York University
>
>715 Broadway Room 705
>New York, NY 10003
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/09/27675.php



Re: [OMPI users] tracking down what's causing a cuIpcOpenMemHandle error emitted by OpenMPI

2015-09-03 Thread Rolf vandeVaart
Lev:
Can you run with --mca mpi_common_cuda_verbose 100 --mca mpool_rgpusm_verbose 
100 and send me (rvandeva...@nvidia.com) the output of that.
Thanks,
Rolf
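
For example, a sketch of the full command line; "python my_program.py" stands 
in for the actual launch of the mpi4py program:

  mpiexec -np 16 --mca mpi_common_cuda_verbose 100 \
      --mca mpool_rgpusm_verbose 100 \
      python my_program.py 2>&1 | tee cuda_verbose.log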

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Wednesday, September 02, 2015 7:15 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] tracking down what's causing a cuIpcOpenMemHandle
>error emitted by OpenMPI
>
>I recently noticed the following error when running a Python program I'm
>developing that repeatedly performs GPU-to-GPU data transfers via
>OpenMPI:
>
>The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol
>cannot be used.
>  cuIpcGetMemHandle return value:   1
>  address: 0x602e75000
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>
>The system is running Ubuntu 14.04.3 and contains several Tesla S2050 GPUs.
>I'm using the following software:
>
>- Linux kernel 3.19.0 (backported to Ubuntu 14.04.3 from 15.04)
>- CUDA 7.0 (installed via NVIDIA's deb packages)
>- NVIDIA kernel driver 346.82
>- OpenMPI 1.10.0 (manually compiled with CUDA support)
>- Python 2.7.10
>- pycuda 2015.1.3 (manually compiled against CUDA 7.0)
>- mpi4py (manually compiled git revision 1d8ab22)
>
>OpenMPI, Python, pycuda, and mpi4py are all locally installed in a conda
>environment.
>
>Judging from my program's logs, the error pops up during one of the
>program's first few iterations. The error isn't fatal, however - the program
>continues running to completion after the message appears.  Running
>mpiexec with --mca plm_base_verbose 10 doesn't seem to produce any
>additional debug info of use in tracking this down.  I did notice, though, that
>there are undeleted cuda.shm.* files in /run/shm after the error message
>appears and my program exits. Deleting the files does not prevent the error
>from recurring if I subsequently rerun the program.
>
>Oddly, the above problem doesn't crop up when I run the same code on an
>Ubuntu
>14.04.3 system with the exact same software containing 2 non-Tesla GPUs
>(specifically, a GTX 470 and 750). The error seems to have started occurring
>over the past two weeks, but none of the changes I made to my code over
>that time seem to be related to the problem (i.e., running an older revision
>resulted in the same errors). I also tried running my code using older releases
>of OpenMPI (e.g., 1.8.5) and mpi4py (e.g., from about 4 weeks ago), but the
>error message still occurs. Both Ubuntu systems are 64-bit and have been
>kept up to date with the latest package updates.
>
>Any thoughts as to what could be causing the problem?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/09/27526.php


Re: [OMPI users] Wrong distance calculations in multi-rail setup?

2015-08-28 Thread Rolf vandeVaart
Let me send you a patch off list that will print out some extra information to 
see if we can figure out where things are going wrong.
We basically depend on the information reported by hwloc so the patch will 
print out some extra information to see if we are getting good data from hwloc.

Thanks,
Rolf
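
For reference, a sketch of inspecting the hwloc-reported topology and distances 
directly with the standard hwloc tools (output format varies by hwloc version):

  lstopo --of console          # show the PCI/NUMA topology hwloc sees
  hwloc-distances              # print the distance matrix hwloc reports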

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Marcin
>Krotkiewski
>Sent: Friday, August 28, 2015 12:13 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] Wrong distance calculations in multi-rail setup?
>
>
>Brilliant! Thank you, Rolf. This works: all ranks have reported using the
>expected port number, and performance is twice of what I was observing
>before :)
>
>I can certainly live with this workaround, but I will be happy to do some
>debugging to find the problem. If you tell me what is needed / where I can
>look, I could help to find the issue.
>
>Thanks a lot!
>
>Marcin
>
>
>On 08/28/2015 05:28 PM, Rolf vandeVaart wrote:
>> I am not sure why the distances are being computed as you are seeing. I do
>not have a dual rail card system to reproduce with. However, short term, I
>think you could get what you want by running like the following.  The first
>argument tells the selection logic to ignore locality, so both cards will be
>available to all ranks.  Then, using the application specific notation you can 
>pick
>the exact port for each rank.
>>
>> Something like:
>>   mpirun -gmca btl_openib_ignore_locality -np 1 --mca
>> btl_openib_if_include mlx4_0:1 a.out : -np 1 --mca
>> btl_openib_if_include mlx4_0:2 a.out : -np 1 --mca
>> btl_openib_if_include mlx4_1:1 a.out : --mca btl_openib_if_include
>> mlx4_1:2 a.out
>>
>> Kind of messy, but that is the general idea.
>>
>> Rolf
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>>> marcin.krotkiewski
>>> Sent: Friday, August 28, 2015 10:49 AM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] Wrong distance calculations in multi-rail setup?
>>>
>>> I have a 4-socket machine with two dual-port Infiniband cards
>>> (devices
>>> mlx4_0 and mlx4_1). The cards are connected to PCI slots of different
>>> CPUs (I hope..), both ports are active on both cards, everything
>>> connected to the same physical network.
>>>
>>> I use openmpi-1.10.0 and run the IBM-MPI1 benchmark with 4 MPI ranks
>>> bound to the 4 sockets, hoping to use both IB cards (and both ports):
>>>
>>>  mpirun --map-by socket --bind-to core -np 4 --mca btl
>>> openib,self --mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1
>>> SendRecv
>>>
>>> but OpenMPI refuses to use the mlx4_1 device
>>>
>>>  [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it
>>> is too far away
>>>  [ the same for other ranks ]
>>>
>>> This is confusing, since I have read that OpenMPI automatically uses
>>> a closer HCA, so at least some (>=one) rank should choose mlx4_1. I
>>> use binding by socket, here is the reported map:
>>>
>>>  [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]:
>>>
>[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././.
>>> /./.]
>>>  [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]:
>>>
>[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././.
>>> /./.]
>>>  [node1.local:28263] MCW rank 0 bound to socket 0[core  0[hwt 0]]:
>>>
>[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././.
>>> /./.]
>>>  [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]:
>>>
>[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././.
>>> /./.]
>>>
>>> To check what's going on I have modified btl_openib_component.c to
>>> print the computed distances.
>>>
>>>  opal_output_verbose(1,
>ompi_btl_base_framework.framework_output,
>>>  "[rank=%d] openib: device %d/%d distance
>>> %lf", ORTE_PROC_MY_NAME->vpid,
>>>  (int)i, (int)num_devs,
>>> (double)dev_sorted[i].distance);
>>>
>>> Here is what I get:
>>>
>>>  [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
>>>  [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00

Re: [OMPI users] Wrong distance calculations in multi-rail setup?

2015-08-28 Thread Rolf vandeVaart
I am not sure why the distances are being computed as you are seeing. I do not 
have a dual rail card system to reproduce with. However, short term, I think 
you could get what you want by running like the following.  The first argument 
tells the selection logic to ignore locality, so both cards will be available 
to all ranks.  Then, using the application specific notation you can pick the 
exact port for each rank.

Something like:
 mpirun -gmca btl_openib_ignore_locality 1 \
     -np 1 --mca btl_openib_if_include mlx4_0:1 a.out : \
     -np 1 --mca btl_openib_if_include mlx4_0:2 a.out : \
     -np 1 --mca btl_openib_if_include mlx4_1:1 a.out : \
            --mca btl_openib_if_include mlx4_1:2 a.out

Kind of messy, but that is the general idea.

Rolf
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>marcin.krotkiewski
>Sent: Friday, August 28, 2015 10:49 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Wrong distance calculations in multi-rail setup?
>
>I have a 4-socket machine with two dual-port Infiniband cards (devices
>mlx4_0 and mlx4_1). The cards are connected to PCI slots of different CPUs (I
>hope..), both ports are active on both cards, everything connected to the
>same physical network.
>
>I use openmpi-1.10.0 and run the IBM-MPI1 benchmark with 4 MPI ranks
>bound to the 4 sockets, hoping to use both IB cards (and both ports):
>
> mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self --mca
>btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv
>
>but OpenMPI refuses to use the mlx4_1 device
>
> [node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is too far
>away
> [ the same for other ranks ]
>
>This is confusing, since I have read that OpenMPI automatically uses a closer
>HCA, so at least some (>=one) rank should choose mlx4_1. I use binding by
>socket, here is the reported map:
>
> [node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]:
>[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././.
>/./.]
> [node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]:
>[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././.
>/./.]
> [node1.local:28263] MCW rank 0 bound to socket 0[core  0[hwt 0]]:
>[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././.
>/./.]
> [node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]:
>[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././.
>/./.]
>
>To check what's going on I have modified btl_openib_component.c to print
>the computed distances.
>
> opal_output_verbose(1, ompi_btl_base_framework.framework_output,
> "[rank=%d] openib: device %d/%d distance %lf",
>ORTE_PROC_MY_NAME->vpid,
> (int)i, (int)num_devs, 
> (double)dev_sorted[i].distance);
>
>Here is what I get:
>
> [node1.local:28265] [rank=0] openib: device 0/2 distance 0.00
> [node1.local:28266] [rank=1] openib: device 0/2 distance 0.00
> [node1.local:28267] [rank=2] openib: device 0/2 distance 0.00
> [node1.local:28268] [rank=3] openib: device 0/2 distance 0.00
> [node1.local:28265] [rank=0] openib: device 1/2 distance 2.10
> [node1.local:28266] [rank=1] openib: device 1/2 distance 1.00
> [node1.local:28267] [rank=2] openib: device 1/2 distance 2.10
> [node1.local:28268] [rank=3] openib: device 1/2 distance 2.10
>
>So the computed distance for mlx4_0 is 0 on all ranks. I believe this should 
>not
>be so. The distance should be smaller on 1 rank and larger for 3 others, as is
>the case for mlx4_1. Looks like a bug?
>
>Another question is, In my configuration two ranks will have a 'closer'
>IB card, but two others will not. Since the correct distance to both devices 
>will
>likely be equal, which device will they choose, if they do that automatically? 
>I'd
>rather they didn't both choose mlx4_0.. I guess it would be nice if I could by
>hand specify the device/port, which should be used by a given MPI rank. Is
>this (going to be) possible with OpenMPI?
>
>Thanks a lot,
>
>Marcin
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/08/27503.php


Re: [OMPI users] cuda aware mpi

2015-08-21 Thread Rolf vandeVaart
No, it is not.  You have to use pml ob1, which will pull in the smcuda and 
openib BTLs that have CUDA-aware support built into them.
Rolf
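
For example, a sketch of forcing the ob1 PML so the CUDA-aware smcuda/openib 
path is used (./app is a placeholder for the application):

  mpirun --mca pml ob1 --mca btl smcuda,openib,self -np 2 ./app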

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Subhra Mazumdar
Sent: Friday, August 21, 2015 12:18 AM
To: Open MPI Users
Subject: [OMPI users] cuda aware mpi

Hi,

Is cuda aware mpi supported with pml yalla?

Thanks,
Subhra



Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-12 Thread Rolf vandeVaart
Hi Geoff:

Our original implementation used cuMemcpy for copying GPU memory into and out 
of host memory.  However, what we learned is that the cuMemcpy causes a 
synchronization for all work on the GPU.  This means that one could not overlap 
very well running a kernel and doing communication.  So, now we create an 
internal stream and then use that along with cuMemcpyAsync/cuStreamSynchronize 
for doing the copy.

It turns out that in Jeremia’s case, he wanted to have a long-running kernel and he 
wanted the MPI_Send/MPI_Recv to happen at the same time.  With the use of 
cuMemcpy, the MPI library was waiting for his kernel to complete before doing 
the cuMemcpy.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Geoffrey Paulsen
Sent: Wednesday, August 12, 2015 12:55 PM
To: us...@open-mpi.org
Cc: us...@open-mpi.org; Sameh S Sharkawi
Subject: Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

I'm confused why this application needs an asynchronous cuMemcpyAsync() in a 
blocking MPI call.   Rolf, could you please explain?

And how is a call to cuMemcpyAsync() followed by a synchronization any 
different from a cuMemcpy() in this use case?

I would still expect that if the MPI_Send / Recv call issued the 
cuMemcpyAsync() that it would be MPI's responsibility to issue the 
synchronization call as well.



---
Geoffrey Paulsen
Software Engineer, IBM Platform MPI
IBM Platform-MPI
Phone: 720-349-2832
Email: gpaul...@us.ibm.com<mailto:gpaul...@us.ibm.com>
www.ibm.com<http://www.ibm.com>


- Original message -
From: Rolf vandeVaart <rvandeva...@nvidia.com<mailto:rvandeva...@nvidia.com>>
Sent by: "users" <users-boun...@open-mpi.org<mailto:users-boun...@open-mpi.org>>
To: Open MPI Users <us...@open-mpi.org<mailto:us...@open-mpi.org>>
Cc:
Subject: Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
Date: Tue, Aug 11, 2015 1:45 PM

I talked with Jeremia off list and we figured out what was going on.  There is 
the ability to use the cuMemcpyAsync/cuStreamSynchronize rather than the 
cuMemcpy but it was never made the default for Open MPI 1.8 series.  So, to get 
that behavior you need the following:

--mca mpi_common_cuda_cumemcpy_async 1

It is too late to change this in 1.8 but it will be made the default behavior 
in 1.10 and all future versions.  In addition, he is right about not being able 
to see these variables in the Open MPI 1.8 series.  This was a bug and it has 
been fixed in Open MPI v2.0.0.  Currently, there are no plans to bring that 
back into 1.10.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeremia Bär
>Sent: Tuesday, August 11, 2015 9:17 AM
>To: us...@open-mpi.org<mailto:us...@open-mpi.org>
>Subject: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
>
>Hi!
>
>In my current application, MPI_Send/MPI_Recv hangs when using buffers in
>GPU device memory of a Nvidia GPU. I realized this is due to the fact that
>OpenMPI uses the synchronous cuMemcpy rather than the asynchronous
>cuMemcpyAsync (see stacktrace at the bottom). However, in my application,
>synchronous copies cannot be used.
>
>I scanned through the source and saw support for async memcpy's are
>available. It's controlled by 'mca_common_cuda_cumemcpy_async' in
>./ompi/mca/common/cuda/common_cuda.c
>However, I can't find a way to enable it. It's not exposed in 'ompi_info' (but
>registered?). How can I enforce the use of cuMemcpyAsync in OpenMPI?
>Version used is OpenMPI 1.8.5.
>
>Thank you,
>Jeremia
>
>(gdb) bt
>#0  0x2aaaba11 in clock_gettime ()
>#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
>#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
>#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
>#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
>#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
>#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
>#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
>#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
>#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
>#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_c
>uda.so.1
>#11 0x2c992544 in opal_cuda_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#12 0x2c98adf7 in opal_convertor_pack () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#14 0x2aaab167353f in mca_pml_ob1_send () from
>/home/jbaer/local_root/opt/openmpi_from_src_

Re: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's

2015-08-11 Thread Rolf vandeVaart
I talked with Jeremia off list and we figured out what was going on.  There is 
the ability to use the cuMemcpyAsync/cuStreamSynchronize rather than the 
cuMemcpy but it was never made the default for Open MPI 1.8 series.  So, to get 
that behavior you need the following:

--mca mpi_common_cuda_cumemcpy_async 1

It is too late to change this in 1.8 but it will be made the default behavior 
in 1.10 and all future versions.  In addition, he is right about not being able 
to see these variables in the Open MPI 1.8 series.  This was a bug and it has 
been fixed in Open MPI v2.0.0.  Currently, there are no plans to bring that 
back into 1.10.

Rolf
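
For example, a sketch of a run with the asynchronous copies enabled (./gpu_app 
is a placeholder for the CUDA application):

  mpirun --mca mpi_common_cuda_cumemcpy_async 1 -np 2 ./gpu_app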

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeremia Bär
>Sent: Tuesday, August 11, 2015 9:17 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] CUDA Buffers: Enforce asynchronous memcpy's
>
>Hi!
>
>In my current application, MPI_Send/MPI_Recv hangs when using buffers in
>GPU device memory of a Nvidia GPU. I realized this is due to the fact that
>OpenMPI uses the synchronous cuMemcpy rather than the asynchronous
>cuMemcpyAsync (see stacktrace at the bottom). However, in my application,
>synchronous copies cannot be used.
>
>I scanned through the source and saw support for async memcpy's are
>available. It's controlled by 'mca_common_cuda_cumemcpy_async' in
>./ompi/mca/common/cuda/common_cuda.c
>However, I can't find a way to enable it. It's not exposed in 'ompi_info' (but
>registered?). How can I enforce the use of cuMemcpyAsync in OpenMPI?
>Version used is OpenMPI 1.8.5.
>
>Thank you,
>Jeremia
>
>(gdb) bt
>#0  0x2aaaba11 in clock_gettime ()
>#1  0x0039e5803e46 in clock_gettime () from /lib64/librt.so.1
>#2  0x2b58a7ae in ?? () from /usr/lib64/libcuda.so.1
>#3  0x2af41dfb in ?? () from /usr/lib64/libcuda.so.1
>#4  0x2af1f623 in ?? () from /usr/lib64/libcuda.so.1
>#5  0x2af17361 in ?? () from /usr/lib64/libcuda.so.1
>#6  0x2af180b6 in ?? () from /usr/lib64/libcuda.so.1
>#7  0x2ae860c2 in ?? () from /usr/lib64/libcuda.so.1
>#8  0x2ae8621a in ?? () from /usr/lib64/libcuda.so.1
>#9  0x2ae69d85 in cuMemcpy () from /usr/lib64/libcuda.so.1
>#10 0x2f0a7dea in mca_common_cuda_cu_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmca_common_c
>uda.so.1
>#11 0x2c992544 in opal_cuda_memcpy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#12 0x2c98adf7 in opal_convertor_pack () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libopen-pal.so.6
>#13 0x2aaab167c611 in mca_pml_ob1_send_request_start_copy () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#14 0x2aaab167353f in mca_pml_ob1_send () from
>/home/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/openmpi/mca_pm
>l_ob1.so
>#15 0x2bf4f322 in PMPI_Send () from
>/users/jbaer/local_root/opt/openmpi_from_src_1.8.5/lib/libmpi.so.1
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/08/27424.php


Re: [OMPI users] openmpi 1.8.7 build error with cuda support using pgi compiler 15.4

2015-08-04 Thread Rolf vandeVaart
Hi Shahzeb:
I believe another colleague of mine may have helped you with this issue (I was 
not around last week).  However, to help me better understand the issue you are 
seeing, could you send me your config.log file  from when you did the 
configuration?  You can just send to rvandeva...@nvidia.com.
Thanks, Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Shahzeb
>Sent: Thursday, July 30, 2015 9:45 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] openmpi 1.8.7 build error with cuda support using pgi
>compiler 15.4
>
>Hello,
>
>I am getting error in my make and make installl  with building OpenMPI with
>CUDA support using PGI compiler. Please help me fix this problem.
>No clue why it is happening. We are using PGI 15.4
>
>  ./configure --prefix=/usr/global/openmpi/pgi/1.8.7 CC=pgcc CXX=pgCC
>FC=pgfortran --with-cuda=/usr/global/cuda/7.0/include/
>
>
> fi
>make[2]: Leaving directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/sm'
>Making all in mca/common/verbs
>make[2]: Entering directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/verbs'
>if test -z "libmca_common_verbs.la"; then \
>   rm -f "libmca_common_verbs.la"; \
>   ln -s "libmca_common_verbs_noinst.la" "libmca_common_verbs.la"; \
> fi
>make[2]: Leaving directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/verbs'
>Making all in mca/common/cuda
>make[2]: Entering directory
>`/gpfs/work/i/install/openmpi/openmpi-1.8.7/ompi/mca/common/cuda'
>   CC   common_cuda.lo
>PGC-S-0039-Use of undeclared variable
>mca_common_cuda_cumemcpy_async
>(common_cuda.c: 320)
>PGC-S-0039-Use of undeclared variable libcuda_handle (common_cuda.c:
>396)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 396)
>PGC-S-0103-Illegal operand types for comparison operator (common_cuda.c:
>397)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 441)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 441)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 442)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 442)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 443)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 443)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 444)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 444)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 445)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 445)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 446)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 446)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 447)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 447)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 448)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 448)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 449)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 449)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 450)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 450)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 451)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 451)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 452)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 452)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 453)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 453)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 454)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 454)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 455)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 455)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 463)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 463)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 464)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 464)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 465)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 465)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 469)
>PGC-W-0155-Pointer value created from a nonlong integral type
>(common_cuda.c: 469)
>PGC-W-0095-Type cast required for this conversion (common_cuda.c: 470)
>PGC-W-0155-Pointer value created from a nonlong integral 

Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-06 Thread Rolf vandeVaart
Just an FYI that this issue has been found and fixed and will be available in 
the next release.
https://github.com/open-mpi/ompi-release/pull/357

Rolf

From: Rolf vandeVaart
Sent: Wednesday, July 01, 2015 4:47 PM
To: us...@open-mpi.org
Subject: RE: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak


Hi Stefan (and Steven who reported this earlier with CUDA-aware program)



I have managed to observe the leak when running LAMMPS as well.  Note that 
this has nothing to do with CUDA-aware features.  I am going to move this 
discussion to the Open MPI developer’s list to dig deeper into this issue.  
Thanks for reporting.



Rolf



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Stefan Paquay
Sent: Wednesday, July 01, 2015 11:43 AM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Subject: Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

Hi all,
Hopefully this mail gets posted in the right thread...
I have noticed the (I guess same) leak using OpenMPI 1.8.6 with LAMMPS, a 
molecular dynamics program, without any use of CUDA. I am not that familiar 
with how the internal memory management of LAMMPS works, but it does not appear 
CUDA-related.
The symptoms are the same:
OpenMPI 1.8.5: everything is fine
OpenMPI 1.8.6: same setup, pretty large leak
Unfortunately, I have no idea how to isolate the bug, but to reproduce it:
1. clone LAMMPS (git clone 
git://git.lammps.org/lammps-ro.git<http://git.lammps.org/lammps-ro.git> lammps)
2. cd src/, compile with openMPI 1.8.6
3. run the example listed in lammps/examples/melt
I would like to help find this bug but I am not sure what would help. LAMMPS 
itself is pretty big so I can imagine you might not want to go through all of 
the code...




Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-07-01 Thread Rolf vandeVaart
Hi Stefan (and Steven who reported this earlier with CUDA-aware program)



I have managed to observe the leak when running LAMMPS as well.  Note that 
this has nothing to do with CUDA-aware features.  I am going to move this 
discussion to the Open MPI developer’s list to dig deeper into this issue.  
Thanks for reporting.



Rolf



From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Stefan Paquay
Sent: Wednesday, July 01, 2015 11:43 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

Hi all,
Hopefully this mail gets posted in the right thread...
I have noticed the (I guess same) leak using OpenMPI 1.8.6 with LAMMPS, a 
molecular dynamics program, without any use of CUDA. I am not that familiar 
with how the internal memory management of LAMMPS works, but it does not appear 
CUDA-related.
The symptoms are the same:
OpenMPI 1.8.5: everything is fine
OpenMPI 1.8.6: same setup, pretty large leak
Unfortunately, I have no idea how to isolate the bug, but to reproduce it:
1. clone LAMMPS (git clone 
git://git.lammps.org/lammps-ro.git lammps)
2. cd src/, compile with openMPI 1.8.6
3. run the example listed in lammps/examples/melt
I would like to help find this bug but I am not sure what would help. LAMMPS 
itself is pretty big so I can imagine you might not want to go through all of 
the code...




Re: [OMPI users] 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

2015-06-30 Thread Rolf vandeVaart
Hi Steven,
Thanks for the report.  Very little has changed between 1.8.5 and 1.8.6 within 
the CUDA-aware specific code so I am perplexed.  Also interesting that you do 
not see the issue with 1.8.5 and CUDA 7.0.
You mentioned that it is hard to share the code on this but maybe you could 
share how you observed the behavior.  Does the code need to run for a while to 
see this?
Any suggestions on how I could reproduce this?

Thanks,
Rolf


From: Steven Eliuk [mailto:s.el...@samsung.com]
Sent: Tuesday, June 30, 2015 6:05 PM
To: Rolf vandeVaart
Cc: Open MPI Users
Subject: 1.8.6 w/ CUDA 7.0 & GDR Huge Memory Leak

Hi All,

Looks like we have found a large memory leak,

Very difficult to share code on this but here are some details,

1.8.5 w/ Cuda 7.0 - no memory leak
1.8.5 w/ cuda 6.5 - no memory leak
1.8.6 w/ cuda 7.0 - large memory leak
1.8.5 w/ cuda 6.5 - no memory leak
mvapich2 2.1 GDR - no issue on either flavor of CUDA.

We have a relatively basic program that reproduces the error and have even 
narrowed it back to a single machine w/ multiple GPUs and only two slaves. 
Looks like something in the IPC within a single node.

We don't have many free cycles at the moment, but let us know if we can help w/ 
something basic.

Here's our config flags for 1.8.5:

./configure FC=gfortran --without-mx --with-openib=/usr 
--with-openib-libdir=/usr/lib64/ --enable-openib-rdmacm --without-psm 
--with-cuda=/cm/shared/apps/cuda70/toolkit/current 
--prefix=/cm/shared/OpenMPI_1_8_5_CUDA70

Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Project Lead,
Computing Science Innovation Center,
SRA - SV,
Samsung Electronics,
665 Clyde Avenue,
Mountain View, CA 94043,
Work: +1 650-623-2986,
Cell: +1 408-819-4407.




Re: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

2015-06-17 Thread Rolf vandeVaart
There is no short-term plan but we are always looking at ways to improve things 
so this could be looked at some time in the future.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Fei Mao
Sent: Wednesday, June 17, 2015 1:48 PM
To: Open MPI Users
Subject: Re: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

Hi Rolf,

Thank you very much for clarifying the problem. Is there any plan to support 
GPU RDMA for reduction in the future?

On Jun 17, 2015, at 1:38 PM, Rolf vandeVaart 
<rvandeva...@nvidia.com<mailto:rvandeva...@nvidia.com>> wrote:


Hi Fei:

The reduction support for CUDA-aware in Open MPI is rather simple.  The GPU 
buffers are copied into temporary host buffers and then the reduction is done 
with the host buffers.  At the completion of the host reduction, the data is 
copied back into the GPU buffers.  So, there is no use of CUDA IPC or GPU 
Direct RDMA in the reduction.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Fei Mao
Sent: Wednesday, June 17, 2015 1:08 PM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Subject: [OMPI users] CUDA-aware MPI_Reduce problem in Openmpi 1.8.5

Hi there,

I am doing benchmarks on a GPU cluster with two CPU sockets and 4 K80 GPUs per 
node. Two K80s are connected to CPU socket 0, the other two to socket 1. An IB 
ConnectX-3 (FDR) is also under socket 1. We are using Linux's OFED, so I know 
there is no way to do GPU RDMA inter-node communication. I can do intra-node 
IPC for MPI_Send and MPI_Recv with two K80s (4 GPUs in total) which are 
connected under the same socket (PCI-e switch). So I thought I could do 
intra-node MPI_Reduce with IPC support in openmpi 1.8.5.

The benchmark I was using is osu-micro-benchmarks-4.4.1, and I got the same 
results when I use two GPU under the same socket or different socket. The 
result was the same even I used two GPUs in different nodes.

Does MPI_Reduce use IPC for intra-node? Should I have to install Mellanox OFED 
stack to support GPU RDMA reduction on GPUs even they are under with the same 
PCI-e switch?

Thanks,

Fei Mao
High Performance Computing Technical Consultant
SHARCNET | http://www.sharcnet.ca<http://www.sharcnet.ca/>
Compute/Calcul Canada | 
http://www.computecanada.ca<http://www.computecanada.ca/>


___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/06/27147.php



Re: [OMPI users] Problems running linpack benchmark on old Sunfire opteron nodes

2015-05-26 Thread Rolf vandeVaart
I think we bumped up a default value in Open MPI 1.8.5.  To go back to the old 
64 MB value, try running with:

--mca mpool_sm_min_size 67108864

Rolf
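
For example, a sketch of the original command line from the report below with 
the old value restored:

  mpirun -np 16 --report-bindings --hostfile hostfile \
      --prefix /hpc/apps/mpi/openmpi/1.8.5-dev \
      --mca btl_tcp_if_include eth0 \
      --mca mpool_sm_min_size 67108864 xhpl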

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Aurélien Bouteiller
Sent: Tuesday, May 26, 2015 10:10 AM
To: Open MPI Users
Subject: Re: [OMPI users] Problems running linpack benchmark on old Sunfire 
opteron nodes

* PGP Signed by an unknown key
You can also change the location of tmp files with the following mca option:
-mca orte_tmpdir_base /some/place

ompi_info --param all all -l 9 | grep tmp
MCA orte: parameter "orte_tmpdir_base" (current value: "", data 
source: default, level: 9 dev/all, type: string)
MCA orte: parameter "orte_local_tmpdir_base" (current value: 
"", data source: default, level: 9 dev/all, type: string)
MCA orte: parameter "orte_remote_tmpdir_base" (current value: 
"", data source: default, level: 9 dev/all, type: string)

--
Aurélien Bouteiller ~~ https://icl.cs.utk.edu/~bouteill/

On 23 May 2015 at 03:55, Gilles Gouaillardet 
> wrote:

Bill,

the root cause is likely there is not enough free space in /tmp.

the simplest, but slowest, option is to run mpirun --mca btl tcp ...
if you cannot make enough space under /tmp (maybe you run diskless),
there are some options to create these kinds of files under /dev/shm

Cheers,

Gilles


On Saturday, May 23, 2015, Lane, William 
> wrote:
I've compiled the linpack benchmark using openMPI 1.8.5 libraries
and include files on CentOS 6.4.

I've tested the binary on the one Intel node (some
sort of 4-core Xeon) and it runs, but when I try to run it on any of
the old Sunfire opteron compute nodes it appears to hang (although
top indicates CPU and memory usage) and eventually terminates
by itself. I'm also getting the following openMPI error messages/warnings:

mpirun -np 16 --report-bindings --hostfile hostfile --prefix 
/hpc/apps/mpi/openmpi/1.8.5-dev --mca btl_tcp_if_include eth0 xhpl

[cscld1-0-6:24370] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-3:24734] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-7:25152] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-4:18079] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-8:21443] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-2:19704] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-5:13481] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1-0-0:21884] create_and_attach: unable to create shared memory BTL 
coordinating structure :: size 134217728
[cscld1:24240] 7 more processes have sent help message help-opal-shmem-mmap.txt 
/ target full

Note these errors also occur when I try to run the linpack benchmark on a single
node as well.

Does anyone know what's going on here? Google came up w/ nothing and I have no
idea what a BTL coordinating structure is.

-Bill L.
IMPORTANT WARNING: This message is intended for the use of the person or entity 
to which it is addressed and may contain information that is privileged and 
confidential, the disclosure of which is governed by applicable law. If the 
reader of this message is not the intended recipient, or the employee or agent 
responsible for delivering it to the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this information is 
strictly prohibited. Thank you for your cooperation.
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/05/26907.php

* Unknown Key
* 0xBF250A1F

* PGP Unprotected
* text/plain body




Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-21 Thread Rolf vandeVaart
Answers below...
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Thursday, May 21, 2015 2:19 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Lev Givon on Thu, May 21, 2015 at 11:32:33AM EDT:
>> Received from Rolf vandeVaart on Wed, May 20, 2015 at 07:48:15AM EDT:
>>
>> (snip)
>>
>> > I see that you mentioned you are starting 4 MPS daemons.  Are you
>> > following the instructions here?
>> >
>> > http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-se
>> > rvice-mps.html
>>
>> Yes - also
>>
>https://docs.nvidia.com/deploy/pdf/CUDA_Multi_Process_Service_Overvie
>w
>> .pdf
>>
>> > This relies on setting CUDA_VISIBLE_DEVICES which can cause problems
>> > for CUDA IPC. Since you are using CUDA 7 there is no more need to
>> > start multiple daemons. You simply leave CUDA_VISIBLE_DEVICES
>> > untouched and start a single MPS control daemon which will handle all
>GPUs.  Can you try that?
>>
>> I assume that this means that only one CUDA_MPS_PIPE_DIRECTORY value
>> should be passed to all MPI processes.
There is no need to do anything with CUDA_MPS_PIPE_DIRECTORY with CUDA 7.  

>>
>> Several questions related to your comment above:
>>
>> - Should the MPI processes select and initialize the GPUs they respectively 
>> need
>>   to access as they normally would when MPS is not in use?
Yes.  

>> - Can CUDA_VISIBLE_DEVICES be used to control what GPUs are visible to MPS 
>> (and
>>   hence the client processes)? I ask because SLURM uses CUDA_VISIBLE_DEVICES 
>> to
>>   control GPU resource allocation, and I would like to run my program (and 
>> the
>>   MPS control daemon) on a cluster via SLURM.
Yes, I believe that is true.  

>> - Does the clash between setting CUDA_VISIBLE_DEVICES and CUDA IPC imply that
>>   MPS and CUDA IPC cannot reliably be used simultaneously in a multi-GPU 
>> setting
>>   with CUDA 6.5 even when one starts multiple MPS control daemons as  
>> described
>>   in the aforementioned blog post?
>
>Using a single control daemon with CUDA_VISIBLE_DEVICES unset appears to
>solve the problem when IPC is enabled.
>--
Glad to see this worked.  And you are correct that CUDA IPC will not work 
between devices if they are segregated by the use of CUDA_VISIBLE_DEVICES as we 
do with MPS in 6.5.

Rolf


Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-20 Thread Rolf vandeVaart
-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 10:25 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] cuIpcOpenMemHandle failure when using
>OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service
>
>Received from Rolf vandeVaart on Tue, May 19, 2015 at 08:28:46PM EDT:
>> >-Original Message-
>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>> >Givon
>> >Sent: Tuesday, May 19, 2015 6:30 PM
>> >To: us...@open-mpi.org
>> >Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>> >1.8.5 with CUDA 7.0 and Multi-Process Service
>> >
>> >I'm encountering intermittent errors while trying to use the
>> >Multi-Process Service with CUDA 7.0 for improving concurrent access
>> >to a Kepler K20Xm GPU by multiple MPI processes that perform
>> >GPU-to-GPU communication with each other (i.e., GPU pointers are
>passed to the MPI transmission primitives).
>> >I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI
>> >1.8.5, which is in turn built against CUDA 7.0. In my current
>> >configuration, I have 4 MPS server daemons running, each of which
>> >controls access to one of 4 GPUs; the MPI processes spawned by my
>> >program are partitioned into 4 groups (which might contain different
>> >numbers of processes) that each talk to a separate daemon. For
>> >certain transmission patterns between these processes, the program
>> >runs without any problems. For others (e.g., 16 processes partitioned into
>4 groups), however, it dies with the following error:
>> >
>> >[node05:20562] Failed to register remote memory, rc=-1
>> >-
>> >--
>> >The call to cuIpcOpenMemHandle failed. This is an unrecoverable
>> >  cuIpcOpenMemHandle return value:   21199360
>> >  address: 0x1
>> >Check the cuda.h file for what the return value means. Perhaps a
>> >reboot of the node will clear the problem.
>
>(snip)
>
>> >After the above error occurs, I notice that /dev/shm/ is littered
>> >with
>> >cuda.shm.* files. I tried cleaning up /dev/shm before running my
>> >program, but that doesn't seem to have any effect upon the problem.
>> >Rebooting the machine also doesn't have any effect. I should also add
>> >that my program runs without any error if the groups of MPI processes
>> >talk directly to the GPUs instead of via MPS.
>> >
>> >Does anyone have any ideas as to what could be going on?
>>
>> I am not sure why you are seeing this.  One thing that is clear is
>> that you have found a bug in the error reporting.  The error message
>> is a little garbled and I see a bug in what we are reporting. I will fix 
>> that.
>>
>> If possible, could you try running with --mca btl_smcuda_use_cuda_ipc
>> 0.  My expectation is that you will not see any errors, but may lose
>> some performance.
>
>The error does indeed go away when IPC is disabled, although I do want to
>avoid degrading the performance of data transfers between GPU memory
>locations.
>
>> What does your hardware configuration look like?  Can you send me
>> output from "nvidia-smi topo -m"
>--

I see that you mentioned you are starting 4 MPS daemons.  Are you following the 
instructions here?

http://cudamusing.blogspot.de/2013/07/enabling-cuda-multi-process-service-mps.html
 

This relies on setting CUDA_VISIBLE_DEVICES, which can cause problems for CUDA 
IPC. Since you are using CUDA 7 there is no longer a need to start multiple 
daemons. You simply leave CUDA_VISIBLE_DEVICES untouched and start a single MPS 
control daemon which will handle all GPUs.  Can you try that?  Because of this 
question, we realized we need to update our documentation as well.
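
For reference, a minimal sketch of running a single MPS control daemon with 
CUDA 7; the directories are illustrative and the MPI launch line is a placeholder:

  export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_pipe   # optional; defaults are used if unset
  export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log     # optional
  nvidia-cuda-mps-control -d                     # one control daemon for all GPUs
  mpiexec -np 16 python my_program.py            # CUDA_VISIBLE_DEVICES left untouched
  echo quit | nvidia-cuda-mps-control            # shut the daemon down afterwards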

Thanks,
Rolf




Re: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI 1.8.5 with CUDA 7.0 and Multi-Process Service

2015-05-19 Thread Rolf vandeVaart
I am not sure why you are seeing this.  One thing that is clear is that you 
have found a bug in the error reporting.  The error message is a little garbled 
and I see a bug in what we are reporting. I will fix that.

If possible, could you try running with --mca btl_smcuda_use_cuda_ipc 0.  My 
expectation is that you will not see any errors, but may lose some performance.
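
For example, a sketch of such a run; the Python launch line is a placeholder 
for the actual program:

  mpiexec -np 16 --mca btl_smcuda_use_cuda_ipc 0 python my_program.py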

What does your hardware configuration look like?  Can you send me output from 
"nvidia-smi topo -m"

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Tuesday, May 19, 2015 6:30 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] cuIpcOpenMemHandle failure when using OpenMPI
>1.8.5 with CUDA 7.0 and Multi-Process Service
>
>I'm encountering intermittent errors while trying to use the Multi-Process
>Service with CUDA 7.0 for improving concurrent access to a Kepler K20Xm GPU
>by multiple MPI processes that perform GPU-to-GPU communication with
>each other (i.e., GPU pointers are passed to the MPI transmission primitives).
>I'm using GitHub revision 41676a1 of mpi4py built against OpenMPI 1.8.5,
>which is in turn built against CUDA 7.0. In my current configuration, I have 4
>MPS server daemons running, each of which controls access to one of 4 GPUs;
>the MPI processes spawned by my program are partitioned into 4 groups
>(which might contain different numbers of processes) that each talk to a
>separate daemon. For certain transmission patterns between these
>processes, the program runs without any problems. For others (e.g., 16
>processes partitioned into 4 groups), however, it dies with the following 
>error:
>
>[node05:20562] Failed to register remote memory, rc=-1
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>  cuIpcOpenMemHandle return value:   21199360
>  address: 0x1
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>[node05:20562] [[58522,2],4] ORTE_ERROR_LOG: Error in file
>pml_ob1_recvreq.c at line 477
>---
>Child job 2 terminated normally, but 1 process returned a non-zero exit code..
>Per user-direction, the job has been aborted.
>---
>[node05][[58522,2],5][btl_tcp_frag.c:142:mca_btl_tcp_frag_send]
>mca_btl_tcp_frag_send: writev failed: Connection reset by peer (104)
>[node05:20564] Failed to register remote memory, rc=-1 [node05:20564]
>[[58522,2],6] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20566] Failed to register remote memory, rc=-1 [node05:20566]
>[[58522,2],8] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20567] Failed to register remote memory, rc=-1 [node05:20567]
>[[58522,2],9] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05][[58522,2],11][btl_tcp_frag.c:237:mca_btl_tcp_frag_recv]
>mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
>[node05:20569] Failed to register remote memory, rc=-1 [node05:20569]
>[[58522,2],11] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20571] Failed to register remote memory, rc=-1 [node05:20571]
>[[58522,2],13] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>[node05:20572] Failed to register remote memory, rc=-1 [node05:20572]
>[[58522,2],14] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 477
>
>After the above error occurs, I notice that /dev/shm/ is littered with
>cuda.shm.* files. I tried cleaning up /dev/shm before running my program,
>but that doesn't seem to have any effect upon the problem. Rebooting the
>machine also doesn't have any effect. I should also add that my program runs
>without any error if the groups of MPI processes talk directly to the GPUs
>instead of via MPS.
>
>Does anyone have any ideas as to what could be going on?
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/05/26881.php
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] getting OpenMPI 1.8.4 w/ CUDA to look for absolute path to libcuda.so.1

2015-04-29 Thread Rolf vandeVaart
Hi Lev:
Any chance you can try Open MPI 1.8.5rc3 and see if you see the same behavior?  
That code has changed a bit from the 1.8.4 series and I am curious if you will 
still see the same issue.  

http://www.open-mpi.org/software/ompi/v1.8/downloads/openmpi-1.8.5rc3.tar.gz

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Wednesday, April 29, 2015 10:54 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] getting OpenMPI 1.8.4 w/ CUDA to look for absolute
>path to libcuda.so.1
>
>I'm trying to build/package OpenMPI 1.8.4 with CUDA support enabled on
>Linux
>x86_64 so that the compiled software can be downloaded/installed as one of
>the dependencies of a project I'm working on with no further user
>configuration.  I noticed that MPI programs built with the above will try to
>access
>/usr/lib/i386-linux-gnu/libcuda.so.1 (and obviously complain about it being the
>wrong ELF class) if /usr/lib/i386-linux-gnu precedes /usr/lib/x86_64-linux-gnu
>in one's ld.so cache. While one can get around this by modifying one's ld.so
>configuration (or tweaking LD_LIBRARY_PATH), is there some way to compile
>OpenMPI such that programs built with it (on x86_64) look for the full soname
>of
>libcuda.so.1 - i.e., /usr/lib/x86_64-linux-gnu/libcuda.so.1 - rather than fall 
>back
>on ld.so? I tried setting the rpath of MPI programs built with the above (by
>modifying the OpenMPI compiler wrappers to include -Wl,-rpath -
>Wl,/usr/lib/x86_64-linux-gnu), but that doesn't seem to help.
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/04/26809.php
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] Different HCA from different OpenMP threads (same rank using MPI_THREAD_MULTIPLE)

2015-04-06 Thread Rolf vandeVaart
It is my belief that you cannot do this, at least with the openib BTL.  The IB 
card to be used for communication is selected during the MPI_Init() phase based 
on where the CPU process is bound.  You can see some of this selection by using 
the --mca btl_base_verbose 1 flag.  There is a bunch of output (which I have 
deleted), but you will see a few lines like this.

[ivy5] [rank=1] openib: using port mlx5_0:1
[ivy5] [rank=1] openib: using port mlx5_0:2
[ivy4] [rank=0] openib: using port mlx5_0:1
[ivy4] [rank=0] openib: using port mlx5_0:2

And if you have multiple NICs, you may also see some messages like this:
 "[rank=%d] openib: skipping device %s; it is too far away"
(This was lifted from the code. I do not have a configuration right now where 
I can generate the second message.)

I cannot see how we can make this specific to a thread.  Maybe others have a 
different opinion.
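
For what it is worth, here is a minimal sketch (just an illustration of the 
point above, not a workaround): the HCA selection happens once, inside MPI 
initialization, based on where the process is bound, so every OpenMP thread in 
that rank inherits the same choice.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* NIC/HCA selection for this process happens in here, once. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided\n");
    /* ... OpenMP threads issuing MPI calls from this rank all use
       whatever port(s) were chosen during initialization ... */
    MPI_Finalize();
    return 0;
}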
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, April 06, 2015 5:46 AM
>To: Open MPI Users
>Cc: Mohammed Sourouri
>Subject: [OMPI users] Different HCA from different OpenMP threads (same
>rank using MPI_THREAD_MULTIPLE)
>
>Dear Open MPI developers,
>
>I wonder if there is a way to address this particular scenario using MPI_T or
>other strategies in Open MPI. I saw a similar discussion few days ago, I assume
>the same challenges are applied in this case but I just want to check. Here is
>the scenario:
>
>We have a system composed by dual rail Mellanox IB, two distinct Connect-IB
>cards per node each one sitting on a different PCI-E lane out of two distinct
>sockets. We are seeking a way to control MPI traffic thought each one of
>them directly into the application. In specific we have a single MPI rank per
>node that goes multi-threading using OpenMP. MPI_THREAD_MULTIPLE is
>used, each OpenMP thread may initiate MPI communication. We would like to
>assign IB-0 to thread 0 and IB-1 to thread 1.
>
>Via mpirun or env variables we can control which IB interface to use by binding
>it to a specific MPI rank (or by apply a policy that relate IB to MPi ranks). 
>But if
>there is only one MPI rank active, how we can differentiate the traffic across
>multiple IB cards?
>
>Thanks in advance for any suggestion about this matter.
>
>Regards,
>Filippo
>
>--
>Mr. Filippo SPIGA, M.Sc.
>http://filippospiga.info ~ skype: filippo.spiga
>
>«Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>
>*
>Disclaimer: "Please note this message and any attachments are
>CONFIDENTIAL and may be privileged or otherwise protected from disclosure.
>The contents are not to be disclosed to anyone other than the addressee.
>Unauthorized recipients are requested to preserve this confidentiality and to
>advise the sender immediately of any error in transmission."
>
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/04/26614.php

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-30 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rolf
>vandeVaart
>Sent: Monday, March 30, 2015 9:37 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>>-Original Message-
>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>>Sent: Sunday, March 29, 2015 10:11 PM
>>To: Open MPI Users
>>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting
>>GPU arrays between multiple GPUs
>>
>>Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:
>>> >-Original Message-
>>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>>> >Givon
>>> >Sent: Friday, March 27, 2015 3:47 PM
>>> >To: us...@open-mpi.org
>>> >Subject: [OMPI users] segfault during MPI_Isend when transmitting
>>> >GPU arrays between multiple GPUs
>>> >
>>> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded
>>> >today) built against OpenMPI 1.8.4 with CUDA support activated to
>>> >asynchronously send GPU arrays between multiple Tesla GPUs (Fermi
>>> >generation). Each MPI process is associated with a single GPU; the
>>> >process has a run loop that starts several Isends to transmit the
>>> >contents of GPU arrays to destination processes and several Irecvs
>>> >to receive data from source processes into GPU arrays on the process'
>>> >GPU. Some of the sends/recvs use one tag, while the remainder use a
>>> >second tag. A single Waitall invocation is used to wait for all of
>>> >these sends and receives to complete before the next iteration of
>>> >the loop
>>can commence. All GPU arrays are preallocated before the run loop starts.
>>> >While this pattern works most of the time, it sometimes fails with a
>>> >segfault that appears to occur during an Isend:
>>
>>(snip)
>>
>>> >Any ideas as to what could be causing this problem?
>>> >
>>> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>>>
>>> Hi Lev:
>>>
>>> I am not sure what is happening here but there are a few things we
>>> can do to try and narrow things down.
>>> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this
>error
>>>will go away?
>>
>>Yes - that appears to be the case.
>>
>>> 2. Do you know if when you see this error it happens on the first
>>> pass
>>through
>>>your communications?  That is, you mention how there are multiple
>>>iterations through the loop and I am wondering when you see failures if 
>>> it
>>>is the first pass through the loop.
>>
>>When the segfault occurs, it appears to always happen during the second
>>iteration of the loop, i.e., at least one slew of Isends (and
>>presumably Irecvs) is successfully performed.
>>
>>Some more details regarding the Isends: each process starts two Isends
>>for each destination process to which it transmits data. The Isends use
>>two different tags, respectively; one is passed None (by design), while
>>the other is passed the pointer to a GPU array with nonzero length. The
>>segfault appears to occur during the latter Isend.
>>--
>
>Lev, can you send me the test program off list.  I may try to create a C 
>version
>of the test and see if I can reproduce the problem.
>Not sure at this point what is happening.
>
>Thanks,
>Rolf
>
We figured out what was going on and I figured I would post here in case others 
see it.

After running for a while, some CUDA files related to CUDA IPC may get left in 
the /dev/shm directory.  These files can sometimes cause problems with later 
runs, causing errors (or SEGVs) when calling some CUDA APIs.  The solution is to 
clear out that directory (the stale cuda.shm.* files) periodically.

This issue is fixed in CUDA 7.0
Rolf
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-30 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Sunday, March 29, 2015 10:11 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>Received from Rolf vandeVaart on Fri, Mar 27, 2015 at 04:09:58PM EDT:
>> >-Original Message-
>> >From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev
>> >Givon
>> >Sent: Friday, March 27, 2015 3:47 PM
>> >To: us...@open-mpi.org
>> >Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
>> >arrays between multiple GPUs
>> >
>> >I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded
>> >today) built against OpenMPI 1.8.4 with CUDA support activated to
>> >asynchronously send GPU arrays between multiple Tesla GPUs (Fermi
>> >generation). Each MPI process is associated with a single GPU; the
>> >process has a run loop that starts several Isends to transmit the
>> >contents of GPU arrays to destination processes and several Irecvs to
>> >receive data from source processes into GPU arrays on the process'
>> >GPU. Some of the sends/recvs use one tag, while the remainder use a
>> >second tag. A single Waitall invocation is used to wait for all of
>> >these sends and receives to complete before the next iteration of the loop
>can commence. All GPU arrays are preallocated before the run loop starts.
>> >While this pattern works most of the time, it sometimes fails with a
>> >segfault that appears to occur during an Isend:
>
>(snip)
>
>> >Any ideas as to what could be causing this problem?
>> >
>> >I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>>
>> Hi Lev:
>>
>> I am not sure what is happening here but there are a few things we can
>> do to try and narrow things down.
>> 1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error
>>will go away?
>
>Yes - that appears to be the case.
>
>> 2. Do you know if when you see this error it happens on the first pass
>through
>>your communications?  That is, you mention how there are multiple
>>iterations through the loop and I am wondering when you see failures if it
>>is the first pass through the loop.
>
>When the segfault occurs, it appears to always happen during the second
>iteration of the loop, i.e., at least one slew of Isends (and presumably 
>Irecvs)
>is successfully performed.
>
>Some more details regarding the Isends: each process starts two Isends for
>each destination process to which it transmits data. The Isends use two
>different tags, respectively; one is passed None (by design), while the other 
>is
>passed the pointer to a GPU array with nonzero length. The segfault appears
>to occur during the latter Isend.
>--

Lev, can you send me the test program off list.  I may try to create a C 
version of the test and see if I can reproduce the problem.
Not sure at this point what is happening.

Thanks,
Rolf


---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] segfault during MPI_Isend when transmitting GPU arrays between multiple GPUs

2015-03-27 Thread Rolf vandeVaart
Hi Lev:
I am not sure what is happening here but there are a few things we can do to 
try and narrow things down.
1. If you run with --mca btl_smcuda_use_cuda_ipc 0 then I assume this error 
will go away?
2. Do you know if when you see this error it happens on the first pass through 
your communications?  That is, you mention how there are multiple iterations 
through the loop and I am wondering when you see failures if it is the first 
pass through the loop.

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Lev Givon
>Sent: Friday, March 27, 2015 3:47 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] segfault during MPI_Isend when transmitting GPU
>arrays between multiple GPUs
>
>I'm using PyCUDA 2014.1 and mpi4py (git commit 3746586, uploaded today)
>built against OpenMPI 1.8.4 with CUDA support activated to asynchronously
>send GPU arrays between multiple Tesla GPUs (Fermi generation). Each MPI
>process is associated with a single GPU; the process has a run loop that starts
>several Isends to transmit the contents of GPU arrays to destination
>processes and several Irecvs to receive data from source processes into GPU
>arrays on the process' GPU. Some of the sends/recvs use one tag, while the
>remainder use a second tag. A single Waitall invocation is used to wait for 
>all of
>these sends and receives to complete before the next iteration of the loop
>can commence. All GPU arrays are preallocated before the run loop starts.
>While this pattern works most of the time, it sometimes fails with a segfault
>that appears to occur during an Isend:
>
>[myhost:05471] *** Process received signal *** [myhost:05471] Signal:
>Segmentation fault (11) [myhost:05471] Signal code:  (128) [myhost:05471]
>Failing at address: (nil) [myhost:05471] [ 0] /lib/x86_64-linux-
>gnu/libpthread.so.0(+0x10340)[0x2ac2bb176340]
>[myhost:05471] [ 1]
>/usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x1f6b18)[0x2ac2c48bfb18]
>[myhost:05471] [ 2]
>/usr/lib/x86_64-linux-gnu/libcuda.so.1(+0x16dcc3)[0x2ac2c4836cc3]
>[myhost:05471] [ 3]
>/usr/lib/x86_64-linux-
>gnu/libcuda.so.1(cuIpcGetEventHandle+0x5d)[0x2ac2c480bccd]
>[myhost:05471] [ 4]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_common_cuda_construct_event_and_handle+0x27
>)[0x2ac2c27d3087]
>[myhost:05471] [ 5]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(ompi_free_list_grow+0x199)[0x2ac2c277b8e9]
>[myhost:05471] [ 6]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_mpool_gpusm_register+0xf4)[0x2ac2c28c9fd4]
>[myhost:05471] [ 7]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_pml_ob1_rdma_cuda_btls+0xcd)[0x2ac2c28f8afd]
>[myhost:05471] [ 8]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_pml_ob1_send_request_start_cuda+0xbf)[0x2ac2c
>28f8d5f]
>[myhost:05471] [ 9]
>/opt/openmpi-
>1.8.4/lib/libmpi.so.1(mca_pml_ob1_isend+0x60e)[0x2ac2c28eb6fe]
>[myhost:05471] [10]
>/opt/openmpi-1.8.4/lib/libmpi.so.1(MPI_Isend+0x137)[0x2ac2c27b7cc7]
>[myhost:05471] [11]
>/home/lev/Work/miniconda/envs/MYENV/lib/python2.7/site-
>packages/mpi4py/MPI.so(+0xd3bb2)[0x2ac2c24b3bb2]
>(Python-related debug lines omitted.)
>
>Any ideas as to what could be causing this problem?
>
>I'm using CUDA 6.5-14 with NVIDIA drivers 340.29 on Ubuntu 14.04.
>--
>Lev Givon
>Bionet Group | Neurokernel Project
>http://www.columbia.edu/~lev/
>http://lebedov.github.io/
>http://neurokernel.github.io/
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/03/26553.php
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] issue with openmpi + CUDA

2015-03-26 Thread Rolf vandeVaart
Hi Jason:
The issue is that Open MPI is (presumably) a 64-bit application and it is 
trying to load up a 64-bit libcuda.so.1 but not finding one.  Making the link 
as you did will not fix the problem (as you saw).  In all my installations, I 
also have a 64-bit driver installed in /usr/lib64/libcuda.so.1 and everything 
works fine.

Let me investigate some more and get back to you.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Zhisong Fu
>Sent: Wednesday, March 25, 2015 10:31 PM
>To: Open MPI Users
>Subject: [OMPI users] issue with openmpi + CUDA
>
>Hi,
>
>I just started to use openmpi and am trying to run a MPI/GPU code. My code
>compiles but when I run, I get this error:
>The library attempted to open the following supporting CUDA libraries, but
>each of them failed.  CUDA-aware support is disabled.
>/usr/lib/libcuda.so.1: wrong ELF class: ELFCLASS32
>/usr/lib/libcuda.so.1: wrong ELF class: ELFCLASS32 If you are not interested in
>CUDA-aware support, then run with --mca mpi_cuda_support 0 to suppress
>this message.  If you are interested in CUDA-aware support, then try setting
>LD_LIBRARY_PATH to the location of libcuda.so.1 to get passed this issue.
>
>I could not find a libcuda.so.1 in my system but I do find libcuda.so in
>/usr/local/cuda/lib64/stubs. Why is openmpi looking for libcuda.so.1 instead
>of libcuda.so?
>I created a symbolic link to libcuda.so, now I get CUDA error 35: CUDA driver
>version is insufficient for CUDA runtime version.
>I am not sure if this is related to libcuda.so or the driver since I could run 
>this
>code using mvapich.
>
>Any input on the issue is really appreciated.
>My openmpi version is 1.8.4, my cuda version is 6.5, driver version is 340.65.
>
>Thanks.
>Jason
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2015/03/26537.php
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] GPUDirect with OpenMPI

2015-03-03 Thread Rolf vandeVaart
Hi Rob:
Sorry for the slow reply but it took me a while to figure this out.  It turns 
out that this issue had to do with how some of the memory within the smcuda BTL 
was being registered with CUDA.  This was fixed a few weeks ago and will be 
available in the 1.8.5 release.  Perhaps you could retry with a pre-release 
version of Open MPI 1.8.5 that is available here and confirm it fixes your 
issue.  Any of the ones listed on that page should be fine.

http://www.open-mpi.org/nightly/v1.8/

Thanks,
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rolf vandeVaart
Sent: Wednesday, February 11, 2015 3:50 PM
To: Open MPI Users
Subject: Re: [OMPI users] GPUDirect with OpenMPI

Let me try to reproduce this.  This should not have anything to do with GPU 
Direct RDMA.  However, to eliminate it, you could run with:
--mca btl_openib_want_cuda_gdr 0.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Aulwes, Rob
Sent: Wednesday, February 11, 2015 2:17 PM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Subject: [OMPI users] GPUDirect with OpenMPI

Hi,

I built OpenMPI 1.8.3 using PGI 14.7 and enabled CUDA support for CUDA 6.0.  I 
have a Fortran test code that tests GPUDirect and have included it here.  When 
I run it across 2 nodes using 4 MPI procs, sometimes it fails with incorrect 
results.  Specifically, sometimes rank 1 does not receive the correct value 
from one of the neighbors.

The code was compiled using PGI 14.7:
mpif90 -o direct.x -acc acc_direct.f90

and executed with:
mpirun -np 4 -npernode 2 -mca btl_openib_want_cudagdr 1 ./direct.x

Does anyone know if I'm missing something when using GPUDirect?

Thanks,Rob Aulwes


program acc_direct



 include 'mpif.h'





 integer :: ierr, rank, nranks

integer, dimension(:), allocatable :: i_ra



   call mpi_init(ierr)



   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

   rank = rank + 1

   write(*,*) 'hello from rank ',rank



   call MPI_COMM_SIZE(MPI_COMM_WORLD, nranks, ierr)



   allocate( i_ra(nranks) )



   call nb_exchange



   call mpi_finalize(ierr)





 contains



 subroutine nb_exchange



   integer :: i, j

   integer, dimension(nranks - 1) :: sendreq, recvreq

   logical :: done

   integer :: stat(MPI_STATUS_SIZE)



   i_ra = -1

   i_ra(rank) = rank



   !$acc data copy(i_ra(1:nranks))



   !$acc host_data use_device(i_ra)



   cnt = 0

   do i = 1,nranks

  if ( i .ne. rank ) then

 cnt = cnt + 1



 call MPI_ISEND(i_ra(rank), 1, MPI_INTEGER, i - 1, rank, &

MPI_COMM_WORLD, sendreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'isend call failed.'



 call MPI_IRECV(i_ra(i), 1, MPI_INTEGER, i - 1, i, &

MPI_COMM_WORLD, recvreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'irecv call failed.'



  endif



   enddo



   !$acc end host_data



   i = 0

   do while ( i .lt. 2*cnt )

 do j = 1, cnt

if ( recvreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(recvreq(j), done, stat, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif



if ( sendreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(sendreq(j), done, MPI_STATUS_IGNORE, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif

 enddo

   enddo



   write(*,*) rank,': nb_exchange: Updating host...'

   !$acc update host(i_ra(1:nranks))





   do j = 1, nranks

 if ( i_ra(j) .ne. j ) then

   write(*,*) 'isend/irecv failed.'

   write(*,*) 'rank', rank,': i_ra(',j,') = ',i_ra(j)

 endif

   enddo



   !$acc end data



 end subroutine





end program


This email message is for the sole use of the intended recipient(s) and may 
contain confidential information.  Any unauthorized review, use, disclosure or 
distribution is prohibited.  If you are not the intended recipient, please 
contact the sender by reply email and destroy all copies of the original 
message.



Re: [OMPI users] GPUDirect with OpenMPI

2015-02-11 Thread Rolf vandeVaart
Let me try to reproduce this.  This should not have anything to do with GPU 
Direct RDMA.  However, to eliminate it, you could run with:
--mca btl_openib_want_cuda_gdr 0.

Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Aulwes, Rob
Sent: Wednesday, February 11, 2015 2:17 PM
To: us...@open-mpi.org
Subject: [OMPI users] GPUDirect with OpenMPI

Hi,

I built OpenMPI 1.8.3 using PGI 14.7 and enabled CUDA support for CUDA 6.0.  I 
have a Fortran test code that tests GPUDirect and have included it here.  When 
I run it across 2 nodes using 4 MPI procs, sometimes it fails with incorrect 
results.  Specifically, sometimes rank 1 does not receive the correct value 
from one of the neighbors.

The code was compiled using PGI 14.7:
mpif90 -o direct.x -acc acc_direct.f90

and executed with:
mpirun -np 4 -npernode 2 -mca btl_openib_want_cudagdr 1 ./direct.x

Does anyone know if I'm missing something when using GPUDirect?

Thanks,Rob Aulwes


program acc_direct



 include 'mpif.h'





 integer :: ierr, rank, nranks

integer, dimension(:), allocatable :: i_ra



   call mpi_init(ierr)



   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)

   rank = rank + 1

   write(*,*) 'hello from rank ',rank



   call MPI_COMM_SIZE(MPI_COMM_WORLD, nranks, ierr)



   allocate( i_ra(nranks) )



   call nb_exchange



   call mpi_finalize(ierr)





 contains



 subroutine nb_exchange



   integer :: i, j

   integer, dimension(nranks - 1) :: sendreq, recvreq

   logical :: done

   integer :: stat(MPI_STATUS_SIZE)



   i_ra = -1

   i_ra(rank) = rank



   !$acc data copy(i_ra(1:nranks))



   !$acc host_data use_device(i_ra)



   cnt = 0

   do i = 1,nranks

  if ( i .ne. rank ) then

 cnt = cnt + 1



 call MPI_ISEND(i_ra(rank), 1, MPI_INTEGER, i - 1, rank, &

MPI_COMM_WORLD, sendreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'isend call failed.'



 call MPI_IRECV(i_ra(i), 1, MPI_INTEGER, i - 1, i, &

MPI_COMM_WORLD, recvreq(cnt), ierr)

 if ( ierr .ne. MPI_SUCCESS ) write(*,*) 'irecv call failed.'



  endif



   enddo



   !$acc end host_data



   i = 0

   do while ( i .lt. 2*cnt )

 do j = 1, cnt

if ( recvreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(recvreq(j), done, stat, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif



if ( sendreq(j) .ne. MPI_REQUEST_NULL ) then

call MPI_TEST(sendreq(j), done, MPI_STATUS_IGNORE, ierr)

if ( ierr .ne. MPI_SUCCESS ) &

   write(*,*) 'test for irecv call failed.'

if ( done ) then

   i = i + 1

endif

endif

 enddo

   enddo



   write(*,*) rank,': nb_exchange: Updating host...'

   !$acc update host(i_ra(1:nranks))





   do j = 1, nranks

 if ( i_ra(j) .ne. j ) then

   write(*,*) 'isend/irecv failed.'

   write(*,*) 'rank', rank,': i_ra(',j,') = ',i_ra(j)

 endif

   enddo



   !$acc end data



 end subroutine





end program


---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] Segmentation fault when using CUDA Aware feature

2015-01-12 Thread Rolf vandeVaart
I think I found a bug in your program with how you were allocating the GPU 
buffers.  I will send you a version off list with the fix.
Also, there is no need to rerun with the flags I had mentioned below.
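
The actual fixed version went off list and is not in the archive. Purely as a 
hedged sketch: one common pitfall in this kind of program is allocating the 
device buffers inside a helper that receives the pointers by value, so the 
cudaMalloc'd addresses never reach the caller and the later MPI_Send sees an 
invalid device pointer. A minimal pattern that does work with a CUDA-aware 
build (the helper name and sizes are assumptions, not the user's actual code):

#include <mpi.h>
#include <cuda_runtime.h>

/* Take the device pointers by address so the cudaMalloc'd
   addresses actually reach the caller. */
static void initializeGPU(const int *h_a, const int *h_b,
                          int **d_a, int **d_b, size_t bytes)
{
    cudaMalloc((void **)d_a, bytes);
    cudaMalloc((void **)d_b, bytes);
    cudaMemcpy(*d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_b, h_b, bytes, cudaMemcpyHostToDevice);
}

int main(int argc, char **argv)
{
    int rank, h_a[16], h_b[16], *d_a, *d_b, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 16; i++) { h_a[i] = i; h_b[i] = 2 * i; }
    initializeGPU(h_a, h_b, &d_a, &d_b, sizeof(h_a));
    if (rank == 0)          /* device pointers handed straight to CUDA-aware MPI */
        MPI_Send(d_a, 16, MPI_INT, 1, 99, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_a, 16, MPI_INT, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaFree(d_a); cudaFree(d_b);
    MPI_Finalize();
    return 0;
}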
Rolf


From: Rolf vandeVaart
Sent: Monday, January 12, 2015 9:38 AM
To: us...@open-mpi.org
Subject: RE: [OMPI users] Segmentation fault when using CUDA Aware feature

That is strange, not sure why that is happening.  I will try to reproduce with 
your program on my system.  Also, perhaps you could rerun with --mca 
mpi_common_cuda_verbose 100 and send me that output.

Thanks

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Xun Gong
Sent: Sunday, January 11, 2015 11:41 PM
To: us...@open-mpi.org
Subject: [OMPI users] Segmentation fault when using CUDA Aware feature

Hi,

The OpenMpi I used is 1.8.4. I just tried to run a test program to see if the 
CUDA aware feature works. But I got the following errors.

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 s1
[ss-Inspiron-5439:32514] *** Process received signal ***
[ss-Inspiron-5439:32514] Signal: Segmentation fault (11)
[ss-Inspiron-5439:32514] Signal code: Address not mapped (1)
[ss-Inspiron-5439:32514] Failing at address: 0x3
[ss-Inspiron-5439:32514] [ 0] 
/lib/x86_64-linux-gnu/libc.so.6(+0x36c30)[0x7f74d7048c30]
[ss-Inspiron-5439:32514] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(+0x98a70)[0x7f74d70aaa70]
[ss-Inspiron-5439:32514] [ 2] 
/usr/local/openmpi-1.8.4/lib/libopen-pal.so.6(opal_convertor_pack+0x187)[0x7f74d673f097]
[ss-Inspiron-5439:32514] [ 3] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_btl_self.so(mca_btl_self_prepare_src+0xb8)[0x7f74ce196888]
[ss-Inspiron-5439:32514] [ 4] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x4c)[0x7f74cd2c183c]
[ss-Inspiron-5439:32514] [ 5] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5ba)[0x7f74cd2b78aa]
[ss-Inspiron-5439:32514] [ 6] 
/usr/local/openmpi-1.8.4/lib/libmpi.so.1(PMPI_Send+0xf2)[0x7f74d79602a2]
[ss-Inspiron-5439:32514] [ 7] s1[0x408b1e]
[ss-Inspiron-5439:32514] [ 8] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f74d7033ec5]
[ss-Inspiron-5439:32514] [ 9] s1[0x4088e9]
[ss-Inspiron-5439:32514] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 32514 on node ss-Inspiron-5439 
exited on signal 11 (Segmentation fault).

Looks like MPI_Send can not send CUDA buffer. But I already did  the command
  ./configure --with-cuda for OpenMPI.


The command I used is:

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ nvcc -c k1.cu<http://k1.cu>
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -c main.cc
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -o s1 main.o k1.o 
-L/usr/local/cuda/lib64/ -lcudart
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 ./s1



The code I'm running is

main.cc file
#include <iostream>
using namespace std;
#include <mpi.h>
#include "k1.h"
#define vect_len 16
const int blocksize = 16;

int main(int argv, char *argc[])
{
  int numprocs, myid;
  MPI_Status status;
  const int vect_size = vect_len*sizeof(int);

  int *vect1 = new int[vect_size];
  int *vect2 = new int[vect_size];
  int *result = new int[vect_size];
  bool flag;

  int *ad;
  int *bd;

  MPI_Init(&argv, &argc);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  if(myid == 0)
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = i;
  vect2[i] = 2 * i;
  }
  }
  else
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = 2 * i;
  vect2[i] = i;
  }
  }

  initializeGPU(vect1, vect2, ad, bd, vect_size);

  if(myid == 0)
  {
  for(int i = 0; i < numprocs; i++)
  {
  MPI_Send(ad,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  MPI_Send(bd,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  }
  }
  else
  {
  MPI_Recv(ad, vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
 &status);
  MPI_Recv(bd, vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
 &status);
  }



  computeGPU(blocksize, vect_len, ad, bd, result, vect_size);

  //Verify
  flag = true;

  for(int i = 0; i < vect_len; i++)
  {
  if (i < 8)
  vect1[i] += vect2[i];
  else
  vect1[i] -= vect2[i];

Re: [OMPI users] Segmentation fault when using CUDA Aware feature

2015-01-12 Thread Rolf vandeVaart
That is strange, not sure why that is happening.  I will try to reproduce with 
your program on my system.  Also, perhaps you could rerun with –mca 
mpi_common_cuda_verbose 100 and send me that output.

Thanks

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Xun Gong
Sent: Sunday, January 11, 2015 11:41 PM
To: us...@open-mpi.org
Subject: [OMPI users] Segmentation fault when using CUDA Aware feature

Hi,

The OpenMpi I used is 1.8.4. I just tried to run a test program to see if the 
CUDA aware feature works. But I got the following errors.

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 s1
[ss-Inspiron-5439:32514] *** Process received signal ***
[ss-Inspiron-5439:32514] Signal: Segmentation fault (11)
[ss-Inspiron-5439:32514] Signal code: Address not mapped (1)
[ss-Inspiron-5439:32514] Failing at address: 0x3
[ss-Inspiron-5439:32514] [ 0] 
/lib/x86_64-linux-gnu/libc.so.6(+0x36c30)[0x7f74d7048c30]
[ss-Inspiron-5439:32514] [ 1] 
/lib/x86_64-linux-gnu/libc.so.6(+0x98a70)[0x7f74d70aaa70]
[ss-Inspiron-5439:32514] [ 2] 
/usr/local/openmpi-1.8.4/lib/libopen-pal.so.6(opal_convertor_pack+0x187)[0x7f74d673f097]
[ss-Inspiron-5439:32514] [ 3] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_btl_self.so(mca_btl_self_prepare_src+0xb8)[0x7f74ce196888]
[ss-Inspiron-5439:32514] [ 4] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x4c)[0x7f74cd2c183c]
[ss-Inspiron-5439:32514] [ 5] 
/usr/local/openmpi-1.8.4/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x5ba)[0x7f74cd2b78aa]
[ss-Inspiron-5439:32514] [ 6] 
/usr/local/openmpi-1.8.4/lib/libmpi.so.1(PMPI_Send+0xf2)[0x7f74d79602a2]
[ss-Inspiron-5439:32514] [ 7] s1[0x408b1e]
[ss-Inspiron-5439:32514] [ 8] 
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f74d7033ec5]
[ss-Inspiron-5439:32514] [ 9] s1[0x4088e9]
[ss-Inspiron-5439:32514] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 32514 on node ss-Inspiron-5439 
exited on signal 11 (Segmentation fault).

Looks like MPI_Send can not send CUDA buffer. But I already did  the command
  ./configure --with-cuda for OpenMPI.


The command I used is:

ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ nvcc -c k1.cu
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -c main.cc
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpic++ -o s1 main.o k1.o 
-L/usr/local/cuda/lib64/ -lcudart
ss@ss-Inspiron-5439:~/cuda-workspace/cuda_mpi_ex1$ mpirun -np 2 ./s1



The code I'm running is

main.cc file
#include <iostream>
using namespace std;
#include <mpi.h>
#include "k1.h"
#define vect_len 16
const int blocksize = 16;

int main(int argv, char *argc[])
{
  int numprocs, myid;
  MPI_Status status;
  const int vect_size = vect_len*sizeof(int);

  int *vect1 = new int[vect_size];
  int *vect2 = new int[vect_size];
  int *result = new int[vect_size];
  bool flag;

  int *ad;
  int *bd;

  MPI_Init(&argv, &argc);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  if(myid == 0)
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = i;
  vect2[i] = 2 * i;
  }
  }
  else
  {
  for(int i = 0; i < vect_len; i++)
  {
  vect1[i] = 2 * i;
  vect2[i] = i;
  }
  }

  initializeGPU(vect1, vect2, ad, bd, vect_size);

  if(myid == 0)
  {
  for(int i = 0; i < numprocs; i++)
  {
  MPI_Send(ad,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  MPI_Send(bd,vect_len, MPI_INT, i, 99, 
MPI_COMM_WORLD );
  }
  }
  else
  {
  MPI_Recv(ad, vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
 &status);
  MPI_Recv(bd, vect_len, MPI_INT, 0, 99, MPI_COMM_WORLD, 
 &status);
  }



  computeGPU(blocksize, vect_len, ad, bd, result, vect_size);

  //Verify
  flag = true;

  for(int i = 0; i < vect_len; i++)
  {
  if (i < 8)
  vect1[i] += vect2[i];
  else
  vect1[i] -= vect2[i];

  }

  for(int i = 0; i < vect_len; i++)
  {
  if( result[i] != vect1[i] )
  {
  cout<<"the result ["<

Re: [OMPI users] Randomly long (100ms vs 7000+ms) fulfillment of MPI_Ibcast

2014-11-06 Thread Rolf vandeVaart
The CUDA person is now responding.  I will try and reproduce.  I looked through 
the zip file but did not see the mpirun command.   Can this be reproduced with 
-np 4 running across four nodes?
Also, in your original message you wrote "Likewise, it doesn't matter if I 
enable CUDA support or not. "  Can you provide more detail about what that 
means?
Thanks

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, November 06, 2014 1:05 PM
To: Open MPI Users
Subject: Re: [OMPI users] Randomly long (100ms vs 7000+ms) fulfillment of 
MPI_Ibcast

I was hoping our CUDA person would respond, but in the interim - I would 
suggest trying the nightly 1.8.4 tarball as we are getting ready to release it, 
and I know there were some CUDA-related patches since 1.8.1

http://www.open-mpi.org/nightly/v1.8/


On Nov 5, 2014, at 4:45 PM, Steven Eliuk wrote:

OpenMPI: 1.8.1 with CUDA RDMA...

Thanks sir and sorry for the late response,

Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.


From: Ralph Castain
Reply-To: Open MPI Users
List-Post: users@lists.open-mpi.org
Date: Monday, November 3, 2014 at 10:02 AM
To: Open MPI Users
Subject: Re: [OMPI users] Randomly long (100ms vs 7000+ms) fulfillment of 
MPI_Ibcast

Which version of OMPI were you testing?

On Nov 3, 2014, at 9:14 AM, Steven Eliuk wrote:

Hello,

We were using OpenMPI for some testing; everything works fine, but randomly 
MPI_Ibcast() takes a long time to finish. We have a standalone program just to 
test it.  The following are the profiling results of the simple test program on 
our cluster:

Ibcast 604 mb takes 103 ms
Ibcast 608 mb takes 106 ms
Ibcast 612 mb takes 105 ms
Ibcast 616 mb takes 105 ms
Ibcast 620 mb takes 107 ms
Ibcast 624 mb takes 107 ms
Ibcast 628 mb takes 108 ms
Ibcast 632 mb takes 110 ms
Ibcast 636 mb takes 110 ms
Ibcast 640 mb takes 7437 ms
Ibcast 644 mb takes 115 ms
Ibcast 648 mb takes 111 ms
Ibcast 652 mb takes 112 ms
Ibcast 656 mb takes 112 ms
Ibcast 660 mb takes 114 ms
Ibcast 664 mb takes 114 ms
Ibcast 668 mb takes 115 ms
Ibcast 672 mb takes 116 ms
Ibcast 676 mb takes 116 ms
Ibcast 680 mb takes 116 ms
Ibcast 684 mb takes 122 ms
Ibcast 688 mb takes 7385 ms
Ibcast 692 mb takes 8729 ms
Ibcast 696 mb takes 120 ms
Ibcast 700 mb takes 124 ms
Ibcast 704 mb takes 121 ms
Ibcast 708 mb takes 8240 ms
Ibcast 712 mb takes 122 ms
Ibcast 716 mb takes 123 ms
Ibcast 720 mb takes 123 ms
Ibcast 724 mb takes 124 ms
Ibcast 728 mb takes 125 ms
Ibcast 732 mb takes 125 ms
Ibcast 736 mb takes 126 ms

As you can see, Ibcast takes a long time to finish and it's totally random.
The same program was compiled and tested with MVAPICH2-GDR but it went smoothly.
Both tests were running exclusively on our four-node cluster without 
contention. Likewise, it doesn't matter if I enable CUDA support or not.  The 
following is the configuration of our servers:

We have four nodes in this test, each with one K40 GPU and connected with 
mellanox IB.

Please find attached config details and some sample code...
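
The attached sample code is not reproduced in the archive; purely as an 
illustration of the kind of timing loop that would produce the numbers above 
(the message sizes, the root rank, and the use of MPI_DOUBLE are assumptions), 
something like:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    size_t mb;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (mb = 604; mb <= 736; mb += 4) {
        size_t count = mb * 1024 * 1024 / sizeof(double);
        double *buf = malloc(count * sizeof(double));
        MPI_Request req;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Ibcast(buf, (int)count, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        if (rank == 0)
            printf("Ibcast %zu mb takes %.0f ms\n", mb, (MPI_Wtime() - t0) * 1e3);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}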

Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/11/25662.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/11/25695.php


---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] CuEventCreate Failed...

2014-10-20 Thread Rolf vandeVaart
Hi:
I just tried running a program similar to yours with CUDA 6.5 and Open MPI and 
I could not reproduce.  Just to make sure I am doing things correctly, your 
example below is running with np=5 and on a single node? Which version of CUDA 
are you using?  Can you also send the output from nvidia-smi?  Also, based on 
the usage of --allow-run-as-root I assume you are running the program as root?


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Steven Eliuk
Sent: Monday, October 20, 2014 1:59 PM
To: Open MPI Users
Subject: Re: [OMPI users] CuEventCreate Failed...

Thanks for your quick response,

1)mpiexec --allow-run-as-root --mca btl_openib_want_cuda_gdr 1 --mca 
btl_openib_cuda_rdma_limit 6 --mca mpi_common_cuda_event_max 1000 -n 5 
test/RunTests
2)Yes, cuda aware support using Mellanox IB,
3)Yes, we have the ability to use several version of OpenMPI, Mvapich2, etc.

Also, our defaults for openmpi-mca-params.conf are:

mtl=^mxm

btl=^usnic,tcp

btl_openib_flags=1


service nv_peer_mem status

nv_peer_mem module is loaded.

Kindest Regards,
-
Steven Eliuk,


From: Rolf vandeVaart <rvandeva...@nvidia.com>
Reply-To: Open MPI Users <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Sunday, October 19, 2014 at 7:33 PM
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] CuEventCreate Failed...

The error 304 corresponds to CUDA_ERROR_OPERATING_SYSTEM, which means an OS call 
failed.  However, I am not sure how that relates to the call that is getting 
the error.
Also, the last error you report is from MVAPICH2-GDR, not from Open MPI.  I 
guess then I have a few questions.


1.  Can you supply your configure line for Open MPI?

2.  Are you making use of CUDA-aware support?

3.  Are you set up so that users can use both Open MPI and MVAPICH2?

Thanks,
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Steven Eliuk
Sent: Friday, October 17, 2014 6:48 PM
To: us...@open-mpi.org
Subject: [OMPI users] CuEventCreate Failed...

Hi All,

We have run into issues that don't really seem to materialize into incorrect 
results; nonetheless, we hope to figure out why we are getting them.

We have several test environments, from one machine with, say, 1-16 processes 
per node, to several machines with 1-16 processes. All systems are certified by 
Nvidia and use Nvidia Tesla K40 GPUs.

We notice frequent situations of the following,

--

The call to cuEventCreate failed. This is a unrecoverable error and will

cause the program to abort.

  Hostname: aHost

  cuEventCreate return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetEventHandle failed. This is a unrecoverable error and will

cause the program to abort.

  cuIpcGetEventHandle return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol

cannot be used.

  cuIpcGetMemHandle return value:   304

  address: 0x700fd0400

Check the cuda.h file for what the return value means. Perhaps a reboot

of the node will clear the problem.

--

Now, our test suite still verifies results but this does cause the following 
when it happens,

The call to cuEventDestory failed. This is a unrecoverable error and will

cause the program to abort.

  cuEventDestory return value:   400

Check the cuda.h file for what the return value means.

--

---

Primary job  terminated normally, but 1 process returned

a non-zero exit code.. Per user-direction, the job has been aborted.

---

--

mpiexec detected that one or more processes exited with non-zero status, thus 
causing

the job to be terminated. The first process to do so was:



  Process name: [[37290,1],2]

  Exit code:1


We have traced the code back to the following files:
-ompi/mca/common/cuda/common_cuda.c :: 
mca_common_cuda_construct_event_and_handle()

We also know the the following:
-it happens on every machine on the very first entry to the function previously 
mentioned,
-does not happen if the buffer size is under 128 bytes... likely a different 
mech. Used for t

Re: [OMPI users] CuEventCreate Failed...

2014-10-19 Thread Rolf vandeVaart
The error 304 corresponds to CUDA_ERROR_OPERATING_SYSTEM, which means an OS call 
failed.  However, I am not sure how that relates to the call that is getting 
the error.
Also, the last error you report is from MVAPICH2-GDR, not from Open MPI.  I 
guess then I have a few questions.


1.   Can you supply your configure line for Open MPI?

2.   Are you making use of CUDA-aware support?

3.   Are you set up so that users can use both Open MPI and MVAPICH2?

Thanks,
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Steven Eliuk
Sent: Friday, October 17, 2014 6:48 PM
To: us...@open-mpi.org
Subject: [OMPI users] CuEventCreate Failed...

Hi All,

We have run into issues that don't really seem to materialize into incorrect 
results; nonetheless, we hope to figure out why we are getting them.

We have several test environments, from one machine with, say, 1-16 processes 
per node, to several machines with 1-16 processes. All systems are certified by 
Nvidia and use Nvidia Tesla K40 GPUs.

We notice frequent situations of the following,

--

The call to cuEventCreate failed. This is a unrecoverable error and will

cause the program to abort.

  Hostname: aHost

  cuEventCreate return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetEventHandle failed. This is a unrecoverable error and will

cause the program to abort.

  cuIpcGetEventHandle return value:   304

Check the cuda.h file for what the return value means.

--

--

The call to cuIpcGetMemHandle failed. This means the GPU RDMA protocol

cannot be used.

  cuIpcGetMemHandle return value:   304

  address: 0x700fd0400

Check the cuda.h file for what the return value means. Perhaps a reboot

of the node will clear the problem.

--

Now, our test suite still verifies results but this does cause the following 
when it happens,

The call to cuEventDestory failed. This is a unrecoverable error and will

cause the program to abort.

  cuEventDestory return value:   400

Check the cuda.h file for what the return value means.

--

---

Primary job  terminated normally, but 1 process returned

a non-zero exit code.. Per user-direction, the job has been aborted.

---

--

mpiexec detected that one or more processes exited with non-zero status, thus 
causing

the job to be terminated. The first process to do so was:



  Process name: [[37290,1],2]

  Exit code:1


We have traced the code back to the following files:
-ompi/mca/common/cuda/common_cuda.c :: 
mca_common_cuda_construct_event_and_handle()

We also know the the following:
-it happens on every machine on the very first entry to the function previously 
mentioned,
-does not happen if the buffer size is under 128 bytes... likely a different 
mechanism is used for the IPC,

Last, here is an intermittent one, and it produces a lot of failed tests in our 
suite... when in fact they are solid apart from this error. It causes 
notifications and annoyances, and it would be nice to clean it up.

mpi_rank_3][cudaipc_allocate_ipc_region] 
[src/mpid/ch3/channels/mrail/src/gen2/ibv_cuda_ipc.c:487] cuda failed with 
mapping of buffer object failed


We have not been able to duplicate these errors in other MPI libs,

Thank you for your time & looking forward to your response,


Kindest Regards,
-
Steven Eliuk, Ph.D. Comp Sci,
Advanced Software Platforms Lab,
SRA - SV,
Samsung Electronics,
1732 North First Street,
San Jose, CA 95112,
Work: +1 408-652-1976,
Work: +1 408-544-5781 Wednesdays,
Cell: +1 408-819-4407.


---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


[OMPI users] CUDA-aware Users

2014-10-09 Thread Rolf vandeVaart
If you are utilizing the CUDA-aware support in Open MPI, can you send me an 
email with some information about the application and the cluster you are on.  
I will consolidate information.

Thanks,

Rolf (rvandeva...@nvidia.com)

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

2014-08-26 Thread Rolf vandeVaart
Hi Christoph:
I will try and reproduce this issue and will let you know what I find.  There 
may be an issue with CUDA IPC support with certain traffic patterns.
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Christoph Winter
Sent: Tuesday, August 26, 2014 2:46 AM
To: us...@open-mpi.org
Subject: [OMPI users] OMPI CUDA IPC synchronisation/fail-silent problem

Hey all,

to test the performance of my application I duplicated the call to the function 
that will issue the computation on two GPUs 5 times. During the 4th and 5th run 
of the algorithm, however, the algorithm yields different results (9 instead of 
20):

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 820 9
121.* 1000 820 9

For communication I use Open MPI 1.8 and/or Open MPI 1.8.1, both compiled with 
cuda-awareness. The CUDA Toolkit version is 6.0.
Both GPUs are under the control of one single CPU, so that CUDA IPC can be used 
(because no QPI link has to be traversed).
Running the application with "mpirun -np 2 --mca btl_smcuda_cuda_ipc_verbose 
100", shows that IPC is used.

I tracked my problem down to an MPI_Allgather, which seems not to work since 
the first GPU  identifies 9 clusters, the second GPU identifies 11 clusters 
(makes 20 clusters total). Debugging the application shows that all clusters 
are identified correctly; however, the exchange of the identified clusters 
seems not to work: each MPI process stores its identified clusters in a 
buffer that both processes exchange using MPI_Allgather:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
d_dec, columns, MPI_DOUBLE, communicator);

I later discovered that if I introduce a temporary host buffer that will 
receive the results of both GPUs, all results are computed correctly:

value_type* d_dec = thrust::raw_pointer_cast(&dec[0]);
thrust::host_vector<value_type> h_dec(dec.size());
MPI_Allgather( d_dec+columns*comm.rank(), columns, MPI_DOUBLE,
&h_dec[0], columns, MPI_DOUBLE, communicator);
dec = h_dec; //copy results back from host to device

This led me to the conclusion that something with OMPI's CUDA IPC seems to 
cause the problems (synchronisation and/or fail-silent error), and indeed, 
disabling CUDA IPC:

mpirun --mca btl_smcuda_use_cuda_ipc 0 --mca btl_smcuda_use_cuda_ipc_same_gpu 0 
-np 2 ./double_test ../data/similarities2.double.-300 
ex.2.double.2.gpus 1000 1000 0.9

will calculate correct results:

# datatype: double
# datapoints: 2
# max_iterations: 1000
# conv_iterations: 1000
# damping: 0.9
# communicator.size: 2
# time elapsed [s]; iterations executed; convergent since; clusters identified
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20
121.* 1000 807 20

Surprisingly, the wrong results _always_ occur during the 4th and 5th run. Is 
there a way to force synchronisation (I tried MPI_Barrier() without success), 
or has anybody discovered similar problems?

I posted some of the code to pastebin: http://pastebin.com/wCmc36k5

Thanks in advance,
Christoph

---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Rolf vandeVaart
Glad it was solved.  I will submit a bug at NVIDIA as that does not seem like a 
very friendly way to handle that error.
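
As an aside, a tiny per-rank sanity check like the sketch below (purely 
illustrative, not part of the original thread) would have exposed the 
duplicated CUDA_VISIBLE_DEVICES list right away, since each rank prints what 
it actually sees:

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, ndev = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const char *vis = getenv("CUDA_VISIBLE_DEVICES");
    cudaError_t err = cudaGetDeviceCount(&ndev);
    printf("rank %d: CUDA_VISIBLE_DEVICES=%s, device count=%d (%s)\n",
           rank, vis ? vis : "(unset)", ndev, cudaGetErrorString(err));
    MPI_Finalize();
    return 0;
}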

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Tuesday, August 19, 2014 10:39 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>I believe I found what the problem was. My script set the
>CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the
>GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had
>CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>
>instead of
>CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
>
>Sorry for the false bug and thanks for directing me toward the solution.
>
>Maxime
>
>
>Le 2014-08-19 09:15, Rolf vandeVaart a écrit :
>> Hi:
>> This problem does not appear to have anything to do with MPI. We are
>getting a SEGV during the initial call into the CUDA driver.  Can you log on to
>gpu-k20-08, compile your simple program without MPI, and run it there?  Also,
>maybe run dmesg on gpu-k20-08 and see if there is anything in the log?
>>
>> Also, does your program run if you just run it on gpu-k20-07?
>>
>> Can you include the output from nvidia-smi on each node?
>>
>> Thanks,
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>>> Boissonneault
>>> Sent: Tuesday, August 19, 2014 8:55 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not
>>> give me much more information.
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>>>Prefix:
>>> /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>>>Internal debug support: yes
>>> Memory debugging support: no
>>>
>>>
>>> Is there something I need to do at run time to get more information
>>> out of it ?
>>>
>>>
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by
>>> ppr:1:node cudampi_simple [gpu-k20-08:46045] *** Process received
>>> signal *** [gpu-k20-08:46045] Signal: Segmentation fault (11)
>>> [gpu-k20-08:46045] Signal code: Address not mapped (1)
>>> [gpu-k20-08:46045] Failing at address: 0x8 [gpu-k20-08:46045] [ 0]
>>> /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>>> [gpu-k20-08:46045] [ 1]
>>> /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>>> [gpu-k20-08:46045] [ 2]
>>> /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>>> [gpu-k20-08:46045] [ 3]
>>> /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>>> [gpu-k20-08:46045] [ 4]
>>> /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>>> [gpu-k20-08:46045] [ 5]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df
>>> 4965]
>>> [gpu-k20-08:46045] [ 6]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df
>>> 4a0a]
>>> [gpu-k20-08:46045] [ 7]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df
>>> 4a3b]
>>> [gpu-k20-08:46045] [ 8]
>>> /software-
>>> gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0
>>> f647] [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>>> [gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received
>>> signal *** [gpu-k20-07:61816] Signal: Segmentation fault (11)
>>> [gpu-k20-07:61816] Signal code: Address not mapped (1)
>>> [gpu-k20-07:61816] Failing at address: 0x8 [gpu-k20-07:61816] [ 0]
>>> /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>>> [gpu-k20-07:61816] [ 1]
>>> /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>>> [gpu-k20-07:61816] [ 2]
>>> /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>>> [gpu-k20-07:61816] [ 3]
>>> /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>>> [gpu-k20-07:61816] [ 4]
>>> /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>>> [gpu-k20-07:61816] [ 5]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b
>>> 6965]
>>> [gpu-k20-07:61816] [ 6]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b
>>> 6a0a]
>>> [gpu-k20-07:61816] [ 7]
>>> /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Rolf vandeVaart
Hi:
This problem does not appear to have anything to do with MPI. We are getting a 
SEGV during the initial call into the CUDA driver.  Can you log on to 
gpu-k20-08, compile your simple program without MPI, and run it there?  Also, 
maybe run dmesg on gpu-k20-08 and see if there is anything in the log?  
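
If it helps, a minimal CUDA-only check is enough to exercise the same first
call into the driver that is failing in the backtrace.  This is just an
illustrative sketch (file and binary names are placeholders, not your code):

/* cuda_only.c - illustrative sketch: first CUDA runtime calls, no MPI */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *buf = NULL;
    cudaError_t err;

    err = cudaSetDevice(0);   /* this triggers cuInit() inside the driver */
    if (err != cudaSuccess) {
        printf("cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaMalloc(&buf, 1024);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("CUDA init and cudaMalloc OK\n");
    cudaFree(buf);
    return 0;
}

Compiling it with nvcc (or gcc plus the CUDA include/lib paths and -lcudart)
and running it directly on gpu-k20-08 should tell us whether the driver
installation itself is the problem.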

Also, does your program run if you just run it on gpu-k20-07?  

Can you include the output from nvidia-smi on each node?

Thanks,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Tuesday, August 19, 2014 8:55 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me
>much more information.
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>   Prefix:
>/software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>   Internal debug support: yes
>Memory debugging support: no
>
>
>Is there something I need to do at run time to get more information out
>of it ?
>
>
>[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by
>ppr:1:node
>cudampi_simple
>[gpu-k20-08:46045] *** Process received signal ***
>[gpu-k20-08:46045] Signal: Segmentation fault (11)
>[gpu-k20-08:46045] Signal code: Address not mapped (1)
>[gpu-k20-08:46045] Failing at address: 0x8
>[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>[gpu-k20-08:46045] [ 1]
>/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>[gpu-k20-08:46045] [ 2]
>/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>[gpu-k20-08:46045] [ 3]
>/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>[gpu-k20-08:46045] [ 4]
>/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>[gpu-k20-08:46045] [ 5]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
>[gpu-k20-08:46045] [ 6]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
>[gpu-k20-08:46045] [ 7]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
>[gpu-k20-08:46045] [ 8]
>/software-
>gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
>[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
>[gpu-k20-07:61816] Signal: Segmentation fault (11)
>[gpu-k20-07:61816] Signal code: Address not mapped (1)
>[gpu-k20-07:61816] Failing at address: 0x8
>[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>[gpu-k20-07:61816] [ 1]
>/usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>[gpu-k20-07:61816] [ 2]
>/usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>[gpu-k20-07:61816] [ 3]
>/usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>[gpu-k20-07:61816] [ 4]
>/usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>[gpu-k20-07:61816] [ 5]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
>[gpu-k20-07:61816] [ 6]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
>[gpu-k20-07:61816] [ 7]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
>[gpu-k20-07:61816] [ 8]
>/software-
>gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647
>]
>[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
>[gpu-k20-07:61816] [10]
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
>[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
>[gpu-k20-07:61816] *** End of error message ***
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
>[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
>[gpu-k20-08:46045] *** End of error message ***
>--
>mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08
>exited on signal 11 (Segmentation fault).
>--
>
>
>Thanks,
>
>Maxime
>
>
>Le 2014-08-18 16:45, Rolf vandeVaart a écrit :
>> Just to help reduce the scope of the problem, can you retest with a non
>CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the
>configure line to help with the stack trace?
>>
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>>> Boissonneault
>>> Sent: Monday, August 18, 2014 4:23 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
>>> derailed into two problems, o

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-18 Thread Rolf vandeVaart
Just to help reduce the scope of the problem, can you retest with a non 
CUDA-aware Open MPI 1.8.1?   And if possible, use --enable-debug in the 
configure line to help with the stack trace?
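
For example (illustrative only; the prefix is a placeholder, keep your other
options), a non-CUDA debug build could be configured along the lines of:

./configure --prefix=$HOME/openmpi-1.8.1-debug --enable-debug --without-cuda

so that the resulting stack trace carries more symbol information.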


>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Monday, August 18, 2014 4:23 PM
>To: Open MPI Users
>Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
>Hi,
>Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kindda
>derailed into two problems, one of which has been addressed, I figured I
>would start a new, more precise and simple one.
>
>I reduced the code to the minimal that would reproduce the bug. I have
>pasted it here :
>http://pastebin.com/1uAK4Z8R
>Basically, it is a program that initializes MPI and cudaMalloc memory, and then
>free memory and finalize MPI. Nothing else.
>
>When I compile and run this on a single node, everything works fine.
>
>When I compile and run this on more than one node, I get the following stack
>trace :
>[gpu-k20-07:40041] *** Process received signal *** [gpu-k20-07:40041] Signal:
>Segmentation fault (11) [gpu-k20-07:40041] Signal code: Address not mapped
>(1) [gpu-k20-07:40041] Failing at address: 0x8 [gpu-k20-07:40041] [ 0]
>/lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
>[gpu-k20-07:40041] [ 1]
>/usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
>[gpu-k20-07:40041] [ 2]
>/usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
>[gpu-k20-07:40041] [ 3]
>/usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
>[gpu-k20-07:40041] [ 4]
>/usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
>[gpu-k20-07:40041] [ 5]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
>[gpu-k20-07:40041] [ 6]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
>[gpu-k20-07:40041] [ 7]
>/software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
>[gpu-k20-07:40041] [ 8]
>/software-
>gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
>[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5] [gpu-k20-07:40041] [10]
>/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
>[gpu-k20-07:40041] [11] cudampi_simple[0x400699] [gpu-k20-07:40041] ***
>End of error message ***
>
>
>The stack trace is the same weither I use OpenMPI 1.6.5 (not cuda aware) or
>OpenMPI 1.8.1 (cuda aware).
>
>I know this is more than likely a problem with Cuda than with OpenMPI (since
>it does the same for two different versions), but I figured I would ask here if
>somebody has a clue of what might be going on. I have yet to be able to fill a
>bug report on NVidia's website for Cuda.
>
>
>Thanks,
>
>
>--
>-
>Maxime Boissonneault
>Analyste de calcul - Calcul Québec, Université Laval Ph. D. en physique
>
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2014/08/25064.php


Re: [OMPI users] Help with multirail configuration

2014-07-21 Thread Rolf vandeVaart
With Open MPI 1.8.1, the library will use the NIC that is "closest" to the CPU. 
There was a bug in earlier versions of the Open MPI 1.8 series that prevented this.  
You can see this by running with some verbosity using the "btl_base_verbose" 
flag.  For example, this is what I observed on a two node cluster with two NICs 
on each node.

[rvandevaart@ivy0] $ mpirun --mca btl_base_verbose 1 -host ivy0,ivy1 -np 4 
--mca pml ob1 --mca btl_openib_warn_default_gid_prefix 0 MPI_Alltoall_c
[ivy0.nvidia.com:28896] [rank=0] openib: using device mlx5_0
[ivy0.nvidia.com:28896] [rank=0] openib: skipping device mlx5_1; it is too far 
away
[ivy0.nvidia.com:28897] [rank=1] openib: using device mlx5_1
[ivy0.nvidia.com:28897] [rank=1] openib: skipping device mlx5_0; it is too far 
away
[ivy1.nvidia.com:04652] [rank=2] openib: using device mlx5_0
[ivy1.nvidia.com:04652] [rank=2] openib: skipping device mlx5_1; it is too far 
away
[ivy1.nvidia.com:04653] [rank=3] openib: using device mlx5_1
[ivy1.nvidia.com:04653] [rank=3] openib: skipping device mlx5_0; it is too far 
away

So, maybe the right thing is happening by default?  Or you are looking for more 
fine-grained control?
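
If it is the latter, the usual knob is the openib include/exclude list, e.g. an
illustrative command line (device names and counts are placeholders):

mpirun --mca btl openib,self,sm --mca btl_openib_if_include mlx5_0 -np 8 ./app

although note that this restricts the whole run to that HCA rather than giving
true per-rank control.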

Rolf

From: users [users-boun...@open-mpi.org] On Behalf Of Tobias Kloeffel 
[tobias.kloef...@fau.de]
Sent: Sunday, July 20, 2014 12:33 PM
To: Open MPI Users
Subject: Re: [OMPI users] Help with multirail configuration

I found no option in 1.6.5 and 1.8.1...


Am 7/20/2014 6:29 PM, schrieb Ralph Castain:
> What version of OMPI are you talking about?
>
> On Jul 20, 2014, at 9:11 AM, Tobias Kloeffel  wrote:
>
>> Hello everyone,
>>
>> I am trying to get the maximum performance out of my two node testing setup. 
>> Each node consists of 4 Sandy Bridge CPUs and each CPU has one directly 
>> attached Mellanox QDR card. Both nodes are connected via a 8-port Mellanox 
>> switch.
>> So far I found no option that allows binding mpi-ranks to a specific card, 
>> as it is available in MVAPICH2. Is there a way to change the round robin 
>> behavior of openMPI?
>> Maybe something like "btl_tcp_if_seq" that I have missed?
>>
>>
>> Kind regards,
>> Tobias
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2014/07/24822.php
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24825.php

___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2014/07/24827.php


Re: [OMPI users] deprecated cuptiActivityEnqueueBuffer

2014-06-16 Thread Rolf vandeVaart
Do you need the VampirTrace (VT) support in your build?  If not, you could add this to 
configure.
--disable-vt
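
For example (an illustrative configure line only -- the CUDA path is a
placeholder, keep whatever other options you already use):

./configure --with-cuda=/usr/local/cuda-6.0 --disable-vt ...

This skips building the VampirTrace subtree entirely, including the CUPTI code
that is failing here.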
  
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>jcabe...@computacion.cs.cinvestav.mx
>Sent: Monday, June 16, 2014 1:40 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] deprecated cuptiActivityEnqueueBuffer
>
>Hi all:
>
>I'm having trouble to compile OMPI from trunk svn with the new 6.0 nvidia
>SDK because deprecated cuptiActivityEnqueueBuffer
>
>this is the problem:
>
>  CC   libvt_la-vt_cupti_activity.lo
>  CC   libvt_la-vt_iowrap_helper.lo
>  CC   libvt_la-vt_libwrap.lo
>  CC   libvt_la-vt_mallocwrap.lo
>vt_cupti_activity.c: In function 'vt_cuptiact_queueNewBuffer':
>vt_cupti_activity.c:203:3: error: implicit declaration of function
>'cuptiActivityEnqueueBuffer' [-Werror=implicit-function-declaration]
>   VT_CUPTI_CALL(cuptiActivityEnqueueBuffer(cuCtx, 0,
>ALIGN_BUFFER(buffer, 8),
>
>Does any body known any patch?
>___
>users mailing list
>us...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>Link to this post: http://www.open-
>mpi.org/community/lists/users/2014/06/24652.php


Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI

2014-05-27 Thread Rolf vandeVaart
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Tuesday, May 27, 2014 4:07 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI
>
>Answers inline too.
>>> 2) Is the absence of btl_openib_have_driver_gdr an indicator of
>>> something missing ?
>> Yes, that means that somehow the GPU Direct RDMA is not installed
>correctly. All that check does is make sure that the file
>/sys/kernel/mm/memory_peers/nv_mem/version exists.  Does that exist?
>>
>It does not. There is no
>/sys/kernel/mm/memory_peers/
>
>>> 3) Are the default parameters, especially the rdma limits and such,
>>> optimal for our configuration ?
>> That is hard to say.  GPU Direct RDMA does not work well when the GPU
>and IB card are not "close" on the system. Can you run "nvidia-smi topo -m"
>on your system?
>nvidia-smi topo -m
>gives me the error
>[mboisson@login-gpu01 ~]$ nvidia-smi topo -m Invalid combination of input
>arguments. Please run 'nvidia-smi -h' for help.
Sorry, my mistake.  That may be a future feature.

>
>I could not find anything related to topology in the help. However, I can tell
>you the following which I believe to be true
>- GPU0 and GPU1 are on PCIe bus 0, socket 0
>- GPU2 and GPU3 are on PCIe bus 1, socket 0
>- GPU4 and GPU5 are on PCIe bus 2, socket 1
>- GPU6 and GPU7 are on PCIe bus 3, socket 1
>
>There is one IB card which I believe is on socket 0.
>
>
>I know that we do not have the Mellanox Ofed. We use the Linux RDMA from
>CentOS 6.5. However, should that completely disable GDR within a single
>node ? i.e. does GDR _have_ to go through IB ? I would assume that our lack
>of Mellanox OFED would result in no-GDR inter-node, but GDR intra-node.

Without Mellanox OFED, GPU Direct RDMA is unavailable.  However, the term 
GPU Direct is a somewhat overloaded term and I think that is where I was 
getting confused.  GPU Direct (also known as CUDA IPC) will work between GPUs 
that do not cross a QPI connection.  That means that I believe GPU0,1,2,3 
should be able to use GPU Direct between them and GPU4,5,6,7 can also between 
them.   In this case, this means that GPU memory does not need to get staged 
through host memory for transferring between the GPUs.  With Open MPI, there is 
a mca parameter you can set that will allow you to see whether GPU Direct is 
being used between the GPUs.

--mca btl_smcuda_cuda_ipc_verbose 100
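
For example, an illustrative command line would be:

mpirun -np 2 --mca btl_smcuda_cuda_ipc_verbose 100 ./your_app

and the verbose output should show, for each pair of ranks, whether CUDA IPC
was set up between their GPUs.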

 Rolf



Re: [OMPI users] Advices for parameter tuning for CUDA-aware MPI

2014-05-27 Thread Rolf vandeVaart
Answers inline...
>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
>Boissonneault
>Sent: Friday, May 23, 2014 4:31 PM
>To: Open MPI Users
>Subject: [OMPI users] Advices for parameter tuning for CUDA-aware MPI
>
>Hi,
>I am currently configuring a GPU cluster. The cluster has 8 K20 GPUs per node
>on two sockets, 4 PCIe bus (2 K20 per bus, 4 K20 per socket), with a single QDR
>InfiniBand card on each node. We have the latest NVidia drivers and Cuda 6.0.
>
>I am wondering if someone could tell me if all the default MCA parameters are
>optimal for cuda. More precisely, I am interrested about GDR and IPC. It
>seems from the parameters (see below) that they are both available
>(although GDR is disabled by default). However, my notes from
>GTC14 mention the btl_openib_have_driver_gdr parameter, which I do not
>see at all.
>
>So, I guess, my questions :
>1) Why is GDR disabled by default when available ?
It was disabled by default because it did not always give optimum performance.  
That may change in the future but for now, as you mentioned, you have to turn 
on the feature explicitly.
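
(For reference, enabling it is just an MCA parameter at run time, e.g. an
illustrative command line:

mpirun --mca btl_openib_want_cuda_gdr 1 -np 2 ./your_app

see the FAQ entry on running CUDA-aware Open MPI for the exact parameter name
in your version.)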

>2) Is the absence of btl_openib_have_driver_gdr an indicator of something
>missing ?
Yes, that means that somehow the GPU Direct RDMA is not installed correctly. 
All that check does is make sure that the file 
/sys/kernel/mm/memory_peers/nv_mem/version exists.  Does that exist?
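
(In other words, a quick check such as
ls /sys/kernel/mm/memory_peers/nv_mem/version
on a compute node tells you whether the GPU Direct RDMA kernel module is
loaded.)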

>3) Are the default parameters, especially the rdma limits and such, optimal for
>our configuration ?
That is hard to say.  GPU Direct RDMA does not work well when the GPU and IB 
card are not "close" on the system. Can you run "nvidia-smi topo -m" on your 
system? 

>4) Do I want to enable or disable IPC by default (my notes state that bandwith
>is much better with MPS than IPC).
Yes, you should leave IPC enabled by default.  That should give good 
performance.  They were some issues with earlier CUDA versions, but they were 
all fixed in CUDA 6.
>
>Thanks,
>
>Here is what I get from
>ompi_info --all | grep cuda
>
>[mboisson@login-gpu01 ~]$ ompi_info --all | grep cuda [login-
>gpu01.calculquebec.ca:11486] mca: base: components_register:
>registering filem components
>[login-gpu01.calculquebec.ca:11486] mca: base: components_register:
>found loaded component raw
>[login-gpu01.calculquebec.ca:11486] mca: base: components_register:
>component raw register function successful [login-
>gpu01.calculquebec.ca:11486] mca: base: components_register:
>registering snapc components
>   Prefix: /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37
>  Exec_prefix: /software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37
>   Bindir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/bin
>  Sbindir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/sbin
>   Libdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib
>   Incdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/include
>   Mandir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/man
>Pkglibdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/openmpi
>   Libexecdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/libexec
>  Datarootdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share
>  Datadir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share
>   Sysconfdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/etc
>   Sharedstatedir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/com
>Localstatedir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/var
>  Infodir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/info
>   Pkgdatadir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/share/openmpi
>Pkglibdir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/openmpi
>Pkgincludedir:
>/software-gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/include/openmpi
>  MCA mca: parameter "mca_param_files" (current value:
>"/home/mboisson/.openmpi/mca-params.conf:/software-
>gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/etc/openmpi-mca-params.conf",
>data source: default, level: 2 user/detail, type: string, deprecated, synonym
>of: mca_base_param_files)
>  MCA mca: parameter "mca_component_path" (current
>value:
>"/software-
>gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/lib/openmpi:/home/mboisson/.o
>penmpi/components",
>data source: default, level: 9 dev/all, type: string, deprecated, synonym of:
>mca_base_component_path)
>  MCA mca: parameter "mca_base_param_files" (current
>value:
>"/home/mboisson/.openmpi/mca-params.conf:/software-
>gpu/mpi/openmpi/1.8.1_gcc4.8_cuda6.0.37/etc/openmpi-mca-params.conf",
>data source: default, level: 2 user/detail, type: string, synonyms:
>mca_param_files)
>  MCA mca: informational 

[MTT users] Username

2014-05-21 Thread Rolf vandeVaart
Can I get a username/password for submitting mtt results?

username=nvidia


Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
There is something going wrong with the ml collective component.  So, if you 
disable it, things work.
I just reconfigured without any CUDA-aware support, and I see the same failure 
so it has nothing to do with CUDA.

Looks like Jeff Squyres just made a bug for it.

https://svn.open-mpi.org/trac/ompi/ticket/4331
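
(Until that is fixed, the workaround can also be made persistent by putting a
line such as

coll = ^ml

into $HOME/.openmpi/mca-params.conf instead of passing --mca coll ^ml on every
mpirun.)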



>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:32 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy
>exited with error."
>
>Dear Rolf,
>
>your suggestion works!
>
>$ mpirun -np 4 --map-by ppr:1:socket -bind-to core  --mca coll ^ml osu_alltoall
># OSU MPI All-to-All Personalized Exchange Latency Test v4.2
># Size   Avg Latency(us)
>1   8.02
>2   2.96
>4   2.91
>8   2.91
>16  2.96
>32  3.07
>64  3.25
>128 3.74
>256 3.85
>512 4.11
>10244.79
>20485.91
>4096   15.84
>8192   24.88
>16384  35.35
>32768  56.20
>65536  66.88
>131072114.89
>262144209.36
>524288396.12
>1048576   765.65
>
>
>Can you clarify exactly where the problem come from?
>
>Regards,
>Filippo
>
>
>On Mar 4, 2014, at 12:17 AM, Rolf vandeVaart <rvandeva...@nvidia.com>
>wrote:
>> Can you try running with --mca coll ^ml and see if things work?
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo
>>> Spiga
>>> Sent: Monday, March 03, 2014 7:14 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy
>>> exited with error."
>>>
>>> Dear Open MPI developers,
>>>
>>> I hit an expected error running OSU osu_alltoall benchmark using Open
>>> MPI 1.7.5rc1. Here the error:
>>>
>>> $ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall In
>>> bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory
>failed In
>>> bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory
>>> failed
>>> [tesla50][[6927,1],1][../../../../../ompi/mca/coll/ml/coll_ml_module.
>>> c:2996:mc a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited
>>> with error.
>>>
>>> [tesla50:42200] In base_bcol_masesmuma_setup_library_buffers and
>>> mpool was not successfully setup!
>>> [tesla50][[6927,1],0][../../../../../ompi/mca/coll/ml/coll_ml_module.
>>> c:2996:mc a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited
>>> with error.
>>>
>>> [tesla50:42201] In base_bcol_masesmuma_setup_library_buffers and
>>> mpool was not successfully setup!
>>> # OSU MPI All-to-All Personalized Exchange Latency Test v4.2
>>> # Size   Avg Latency(us)
>>> -
>>> - mpirun noticed that process rank 3 with PID 4508 on node
>>> tesla51 exited on signal 11 (Segmentation fault).
>>> -
>>> -
>>> 2 total processes killed (some possibly by mpirun during cleanup)
>>>
>>> Any idea where this come from?
>>>
>>> I compiled Open MPI using Intel 12.1, latest Mellanox stack and CUDA
>6.0RC.
>>> Attached outputs grabbed from configure, make and run. The configure
>>> was
>>>
>>> export MXM_DIR=/opt/mellanox/mxm
>>> export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*"
>>> -print0) export FCA_DIR=/opt/mellanox/fca export
>>> HCOLL_DIR=/opt/mellanox/hcoll
>>>
>>> ../configure CC=icc CXX=icpc F77=ifort FC=ifort FFLAGS="-xSSE4.2
>>> -axAVX -ip -
>>> O3 -fno-fnalias" FCFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias"
>>> --prefix=<...> --enable-mpirun-prefix-by-default --with-fca=$FCA_DIR
>>> --with- mxm=$MXM_DIR --with-knem=$KNEM_DIR  --with-
>>> cuda=$CUDA_INSTALL_PATH --enable-mpi-thread-multiple --with-
>>> hwloc=internal --with-verbs 2>&1 | tee config.out
>>>
>>>
>>> Thanks in advance,
>>> Regards
>>>
>>> Filippo
>>>
>>> --
>>&

Re: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited with error."

2014-03-03 Thread Rolf vandeVaart
Can you try running with --mca coll ^ml and see if things work? 

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Monday, March 03, 2014 7:14 PM
>To: Open MPI Users
>Subject: [OMPI users] 1.7.5rc1, error "COLL-ML ml_discover_hierarchy exited
>with error."
>
>Dear Open MPI developers,
>
>I hit an unexpected error running the OSU osu_alltoall benchmark using Open MPI
>1.7.5rc1. Here is the error:
>
>$ mpirun -np 4 --map-by ppr:1:socket -bind-to core osu_alltoall In
>bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory failed
>In bcol_comm_query hmca_bcol_basesmuma_allocate_sm_ctl_memory
>failed
>[tesla50][[6927,1],1][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mc
>a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>
>[tesla50:42200] In base_bcol_masesmuma_setup_library_buffers and mpool
>was not successfully setup!
>[tesla50][[6927,1],0][../../../../../ompi/mca/coll/ml/coll_ml_module.c:2996:mc
>a_coll_ml_comm_query] COLL-ML ml_discover_hierarchy exited with error.
>
>[tesla50:42201] In base_bcol_masesmuma_setup_library_buffers and mpool
>was not successfully setup!
># OSU MPI All-to-All Personalized Exchange Latency Test v4.2
># Size   Avg Latency(us)
>--
>mpirun noticed that process rank 3 with PID 4508 on node tesla51 exited on
>signal 11 (Segmentation fault).
>--
>2 total processes killed (some possibly by mpirun during cleanup)
>
>Any idea where this come from?
>
>I compiled Open MPI using Intel 12.1, latest Mellanox stack and CUDA 6.0RC.
>Attached outputs grabbed from configure, make and run. The configure was
>
>export MXM_DIR=/opt/mellanox/mxm
>export KNEM_DIR=$(find /opt -maxdepth 1 -type d -name "knem*" -print0)
>export FCA_DIR=/opt/mellanox/fca export HCOLL_DIR=/opt/mellanox/hcoll
>
>../configure CC=icc CXX=icpc F77=ifort FC=ifort FFLAGS="-xSSE4.2 -axAVX -ip -
>O3 -fno-fnalias" FCFLAGS="-xSSE4.2 -axAVX -ip -O3 -fno-fnalias" --prefix=<...>
>--enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-
>mxm=$MXM_DIR --with-knem=$KNEM_DIR  --with-
>cuda=$CUDA_INSTALL_PATH --enable-mpi-thread-multiple --with-
>hwloc=internal --with-verbs 2>&1 | tee config.out
>
>
>Thanks in advance,
>Regards
>
>Filippo
>
>--
>Mr. Filippo SPIGA, M.Sc.
>http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>
> ~ David Hilbert
>
>*
>Disclaimer: "Please note this message and any attachments are
>CONFIDENTIAL and may be privileged or otherwise protected from disclosure.
>The contents are not to be disclosed to anyone other than the addressee.
>Unauthorized recipients are requested to preserve this confidentiality and to
>advise the sender immediately of any error in transmission."



Re: [OMPI users] Configure issue with/without HWLOC when PGI used and CUDA support enabled

2014-02-14 Thread Rolf vandeVaart
I assume your first issue is happening because you configured hwloc with cuda 
support which creates a dependency on libcudart.so.  Not sure why that would 
mess up Open MPI.  Can you send me how you configured hwloc?

I am not sure I understand the second issue.  Open MPI puts everything in lib 
even though you may be building for 64 bits.  So all of these are fine.
  -I/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC/lib 
 -Wl,-rpath 
-Wl,/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC/lib 
 -L/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC/lib

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Filippo Spiga
>Sent: Friday, February 14, 2014 9:44 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Configure issue with/without HWLOC when PGI used
>and CUDA support enabled
>
>Dear Open MPI developers,
>
>I just want to point to a weird behavior of the configure procedure I
>discovered. I am compiling Open MPI 1.7.4 with CUDA support (CUDA 6.0 RC)
>and PGI 14.1
>
>If I explicitly compile against a self-compiled version of HWLOC (1.8.1) using
>this configure line ../configure CC=pgcc CXX=pgCC FC=pgf90 F90=pgf90 --
>prefix=/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC  --
>enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-
>mxm=$MXM_DIR --with-knem=$KNEM_DIR --with-hwloc=/usr/local/Cluster-
>Users/fs395/hwlock-1.8.1/gcc-4.4.7_cuda-6.0RC --with-
>slurm=/usr/local/Cluster-Apps/slurm  --with-cuda=/usr/local/Cluster-
>Users/fs395/cuda/6.0-RC
>
>make fails telling me that it cannot find "-lcudart".
>
>
>If I compile without HWLOC using this configure line:
>../configure CC=pgcc CXX=pgCC FC=pgf90 F90=pgf90 --
>prefix=/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-6.0RC  --
>enable-mpirun-prefix-by-default --with-fca=$FCA_DIR --with-
>mxm=$MXM_DIR --with-knem=$KNEM_DIR  --with-slurm=/usr/local/Cluster-
>Apps/slurm  --with-cuda=/usr/local/Cluster-Users/fs395/cuda/6.0-RC
>
>make succeeds and I have Open MPI compiled properly.
>
>$ mpif90 -show
>pgf90 -I/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-
>6.0RC/include -I/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-14.1_cuda-
>6.0RC/lib -Wl,-rpath -Wl,/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-
>14.1_cuda-6.0RC/lib -L/usr/local/Cluster-Users/fs395/openmpi-1.7.4/pgi-
>14.1_cuda-6.0RC/lib -lmpi_usempif08 -lmpi_usempi_ignore_tkr -lmpi_mpifh -
>lmpi $ ompi_info --all | grep btl_openib_have_cuda_gdr
> MCA btl: informational "btl_openib_have_cuda_gdr" (current 
> value:
>"true", data source: default, level: 5 tuner/detail, type: bool)
>
>I wonder why the configure picks up lib instead of lib64. I will test the build
>using real codes.
>
>Cheers,
>Filippo
>
>--
>Mr. Filippo SPIGA, M.Sc.
>http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>
> ~ David Hilbert
>
>*
>Disclaimer: "Please note this message and any attachments are
>CONFIDENTIAL and may be privileged or otherwise protected from disclosure.
>The contents are not to be disclosed to anyone other than the addressee.
>Unauthorized recipients are requested to preserve this confidentiality and to
>advise the sender immediately of any error in transmission."
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Cuda Aware MPI Problem

2013-12-13 Thread Rolf vandeVaart
Yes, this was a bug with Open MPI 1.7.3.  I could not reproduce it, but it was 
definitely an issue in certain configurations.
Here was the fix.   https://svn.open-mpi.org/trac/ompi/changeset/29762

We fixed it in Open MPI 1.7.4 and the trunk version, so as you have seen, they 
do not have the problem.

Rolf


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Özgür Pekçagliyan
Sent: Friday, December 13, 2013 8:03 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Cuda Aware MPI Problem

Hello again,

I have compiled openmpi-1.9a1r29873 from the nightly build trunk and so far 
everything looks alright. But I have not tested the CUDA support yet.

On Fri, Dec 13, 2013 at 2:38 PM, Özgür Pekçağlıyan 
> wrote:
Hello,

I am having difficulties compiling Open MPI with CUDA support. I have 
followed this (http://www.open-mpi.org/faq/?category=building#build-cuda) FAQ 
entry, as below:

$ cd openmpi-1.7.3/
$ ./configure --with-cuda=/urs/local/cuda-5.5
$ make all install

everything goes perfectly during compilation. But when I try to execute the 
simplest MPI hello world application, I get the following error:

$ mpicc hello.c -o hello
$ mpirun -np 2 hello

hello: symbol lookup error: /usr/local/lib/openmpi/mca_pml_ob1.so: undefined 
symbol: progress_one_cuda_htod_event
hello: symbol lookup error: /usr/local/lib/openmpi/mca_pml_ob1.so: undefined 
symbol: progress_one_cuda_htod_event
--
mpirun has exited due to process rank 0 with PID 30329 on
node cudalab1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--

$ mpirun -np 1 hello

hello: symbol lookup error: /usr/local/lib/openmpi/mca_pml_ob1.so: undefined 
symbol: progress_one_cuda_htod_event
--
mpirun has exited due to process rank 0 with PID 30327 on
node cudalab1 exiting improperly. There are three reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).

You can avoid this message by specifying -quiet on the mpirun command line.

--


Any suggestions?
I have two PCs with Intel I3 CPUs and Geforce GTX 480 GPUs.


And here is the hello.c file;
#include <mpi.h>
#include <stdio.h>


int main (int argc, char **argv)
{
  int rank, size;

  MPI_Init (&argc, &argv); /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}



--
Özgür Pekçağlıyan
B.Sc. in Computer Engineering
M.Sc. in Computer Engineering



--
Özgür Pekçağlıyan
B.Sc. in Computer Engineering
M.Sc. in Computer Engineering


Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI

2013-12-02 Thread Rolf vandeVaart
Hi Peter:
The reason behind not having the reduction support (I believe) was just the 
complexity of adding it to the code.  I will at least submit a ticket so we can 
look at it again.

Here is a link to the FAQ which lists the APIs that are CUDA-aware.  
http://www.open-mpi.org/faq/?category=running#mpi-cuda-support
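
Until reduction support lands, the usual workaround is to stage the reduction
through host memory yourself.  A minimal sketch (untested, error handling
omitted; the function name and the assumption of a double/sum reduction are
mine):

/* Hedged sketch: reduce a device buffer by staging through host memory,
 * since the CUDA-aware path does not currently cover MPI_Reduce. */
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

static void reduce_device_sum(double *d_buf, int count, int root, MPI_Comm comm)
{
    int rank;
    double *h_in  = (double *)malloc(count * sizeof(double));
    double *h_out = (double *)malloc(count * sizeof(double));

    MPI_Comm_rank(comm, &rank);
    cudaMemcpy(h_in, d_buf, count * sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Reduce(h_in, h_out, count, MPI_DOUBLE, MPI_SUM, root, comm);
    if (rank == root) {
        /* only the root holds the reduced result; copy it back to its GPU */
        cudaMemcpy(d_buf, h_out, count * sizeof(double), cudaMemcpyHostToDevice);
    }
    free(h_in);
    free(h_out);
}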

Regards,
Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Zaspel
>Sent: Monday, December 02, 2013 8:29 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>
>* PGP Signed by an unknown key
>
>Hi Rolf,
>
>OK, I didn't know that. Sorry.
>
>Yes, it would be a pretty important feature in cases when you are doing
>reduction operations on many, many entries in parallel. Therefore, each
>reduction is not very complex or time-consuming but potentially hundreds of
>thousands reductions are done at the same time. This is definitely a point
>where a CUDA-aware implementation can give some performance
>improvements.
>
>I'm curious: Rather complex operations like allgatherv are CUDA-aware, but a
>reduction is not. Is there a reasoning for this? Is there some documentation,
>which MPI calls are CUDA-aware and which not?
>
>Best regards
>
>Peter
>
>
>
>On 12/02/2013 02:18 PM, Rolf vandeVaart wrote:
>> Thanks for the report.  CUDA-aware Open MPI does not currently support
>doing reduction operations on GPU memory.
>> Is this a feature you would be interested in?
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter
>>> Zaspel
>>> Sent: Friday, November 29, 2013 11:24 AM
>>> To: us...@open-mpi.org
>>> Subject: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>>>
>>> Hi users list,
>>>
>>> I would like to report a bug in the CUDA-aware OpenMPI 1.7.3
>>> implementation. I'm using CUDA 5.0 and Ubuntu 12.04.
>>>
>>> Attached, you will find an example code file, to reproduce the bug.
>>> The point is that MPI_Reduce with normal CPU memory fully works but
>>> the use of GPU memory leads to a segfault. (GPU memory is used when
>>> defining USE_GPU).
>>>
>>> The segfault looks like this:
>>>
>>> [peak64g-36:25527] *** Process received signal *** [peak64g-36:25527]
>>> Signal: Segmentation fault (11) [peak64g-36:25527] Signal code:
>>> Invalid permissions (2) [peak64g-36:25527] Failing at address:
>>> 0x600100200 [peak64g- 36:25527] [ 0]
>>> /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
>>> [0x7ff2abdb24a0]
>>> [peak64g-36:25527] [ 1]
>>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(+0x7d410)
>>> [0x7ff2ac4b9410] [peak64g-36:25527] [ 2]
>>> /data/zaspel/openmpi-
>>>
>1.7.3_build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intr
>>> a_
>>> basic_linear+0x371)
>>> [0x7ff2a5987531]
>>> [peak64g-36:25527] [ 3]
>>> /data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(MPI_Reduce+0x135)
>>> [0x7ff2ac499d55]
>>> [peak64g-36:25527] [ 4] /home/zaspel/testMPI/test_reduction()
>>> [0x400ca0] [peak64g-36:25527] [ 5]
>>> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed)
>>> [0x7ff2abd9d76d] [peak64g-36:25527] [ 6]
>>> /home/zaspel/testMPI/test_reduction() [0x400af9] [peak64g-36:25527]
>>> *** End of error message ***
>>> -
>>> - mpirun noticed that process rank 0 with PID 25527 on node
>>> peak64g-36 exited on signal 11 (Segmentation fault).
>>> -
>>> -
>>>
>>> Best regards,
>>>
>>> Peter
>> --
>> - This email message is for the sole use of the intended
>> recipient(s) and may contain confidential information.  Any
>> unauthorized review, use, disclosure or distribution is prohibited.
>> If you are not the intended recipient, please contact the sender by
>> reply email and destroy all copies of the original message.
>> --
>> - ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
>--
>Dipl.-Inform. Peter Zaspel
>Institut fuer Numerische Simulation, Universitaet Bonn Wegelerstr.6, 53115
>Bonn, Germany
>tel: +49 228 73-2748   mailto:zas...@ins.uni-bonn.de
>fax: +49 228 73-7527   http://wissrech.ins.uni-bonn.de/people/zaspel.html
>
>* Unknown Key
>* 0x8611E59B(L)
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI

2013-12-02 Thread Rolf vandeVaart
Thanks for the report.  CUDA-aware Open MPI does not currently support doing 
reduction operations on GPU memory.
Is this a feature you would be interested in?

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Peter Zaspel
>Sent: Friday, November 29, 2013 11:24 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] Bug in MPI_REDUCE in CUDA-aware MPI
>
>Hi users list,
>
>I would like to report a bug in the CUDA-aware OpenMPI 1.7.3
>implementation. I'm using CUDA 5.0 and Ubuntu 12.04.
>
>Attached, you will find an example code file, to reproduce the bug.
>The point is that MPI_Reduce with normal CPU memory fully works but the
>use of GPU memory leads to a segfault. (GPU memory is used when defining
>USE_GPU).
>
>The segfault looks like this:
>
>[peak64g-36:25527] *** Process received signal *** [peak64g-36:25527]
>Signal: Segmentation fault (11) [peak64g-36:25527] Signal code: Invalid
>permissions (2) [peak64g-36:25527] Failing at address: 0x600100200 [peak64g-
>36:25527] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x364a0)
>[0x7ff2abdb24a0]
>[peak64g-36:25527] [ 1]
>/data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(+0x7d410)
>[0x7ff2ac4b9410] [peak64g-36:25527] [ 2]
>/data/zaspel/openmpi-
>1.7.3_build/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_reduce_intra_
>basic_linear+0x371)
>[0x7ff2a5987531]
>[peak64g-36:25527] [ 3]
>/data/zaspel/openmpi-1.7.3_build/lib/libmpi.so.1(MPI_Reduce+0x135)
>[0x7ff2ac499d55]
>[peak64g-36:25527] [ 4] /home/zaspel/testMPI/test_reduction() [0x400ca0]
>[peak64g-36:25527] [ 5]
>/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7ff2abd9d76d]
>[peak64g-36:25527] [ 6] /home/zaspel/testMPI/test_reduction() [0x400af9]
>[peak64g-36:25527] *** End of error message ***
>--
>mpirun noticed that process rank 0 with PID 25527 on node peak64g-36 exited
>on signal 11 (Segmentation fault).
>--
>
>Best regards,
>
>Peter


Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread Rolf vandeVaart
The CUDA-aware support is only available when running with the verbs interface 
to Infiniband.  It does not work with the PSM interface which is being used in 
your installation.
To verify this, you need to disable the usage of PSM.  This can be done in a 
variety of ways, but try running like this:

mpirun -mca pml ob1 .

This will force the use of the verbs support layer (openib) with the CUDA-aware 
support.
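
As a concrete illustration only (substitute your own binary and rank count):

mpirun -np 2 -mca pml ob1 ./jacobi_cuda_aware_mpi

Selecting the ob1 pml keeps the openib/smcuda path, which is where the
CUDA-aware support lives, instead of the PSM-based cm pml.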


From: users [mailto:users-boun...@open-mpi.org] On Behalf Of KESTENER Pierre
Sent: Wednesday, October 30, 2013 12:07 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] OpenMPI-1.7.3 - cuda support

Dear Rolf,

thank for looking into this.
Here is the complete backtrace for execution using 2 GPUs on the same node:

(cuda-gdb) bt
#0  0x7711d885 in raise () from /lib64/libc.so.6
#1  0x7711f065 in abort () from /lib64/libc.so.6
#2  0x70387b8d in psmi_errhandler_psm (ep=,
err=PSM_INTERNAL_ERR, error_string=,
token=) at psm_error.c:76
#3  0x70387df1 in psmi_handle_error (ep=0xfffe,
error=PSM_INTERNAL_ERR, buf=) at psm_error.c:154
#4  0x70382f6a in psmi_am_mq_handler_rtsmatch (toki=0x7fffc6a0,
args=0x7fffed0461d0, narg=,
buf=, len=) at ptl.c:200
#5  0x7037a832 in process_packet (ptl=0x737818, pkt=0x7fffed0461c0,
isreq=) at am_reqrep_shmem.c:2164
#6  0x7037d90f in amsh_poll_internal_inner (ptl=0x737818, replyonly=0)
at am_reqrep_shmem.c:1756
#7  amsh_poll (ptl=0x737818, replyonly=0) at am_reqrep_shmem.c:1810
#8  0x703a0329 in __psmi_poll_internal (ep=0x737538,
poll_amsh=) at psm.c:465
#9  0x7039f0af in psmi_mq_wait_inner (ireq=0x7fffc848)
at psm_mq.c:299
#10 psmi_mq_wait_internal (ireq=0x7fffc848) at psm_mq.c:334
#11 0x7037db21 in amsh_mq_send_inner (ptl=0x737818,
mq=, epaddr=0x6eb418, flags=,
tag=844424930131968, ubuf=0x130835, len=32768)
---Type  to continue, or q  to quit---
at am_reqrep_shmem.c:2339
#12 amsh_mq_send (ptl=0x737818, mq=, epaddr=0x6eb418,
flags=, tag=844424930131968, ubuf=0x130835,
len=32768) at am_reqrep_shmem.c:2387
#13 0x7039ed71 in __psm_mq_send (mq=,
dest=, flags=,
stag=, buf=,
len=) at psm_mq.c:413
#14 0x705c4ea8 in ompi_mtl_psm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_mtl_psm.so
#15 0x71eeddea in mca_pml_cm_send ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/openmpi/mca_pml_cm.so
#16 0x779253da in PMPI_Sendrecv ()
   from /gpfslocal/pub/openmpi/1.7.3/lib/libmpi.so.1
#17 0x004045ef in ExchangeHalos (cartComm=0x715460,
devSend=0x130835, hostSend=0x7b8710, hostRecv=0x7c0720,
devRecv=0x1308358000, neighbor=1, elemCount=4096) at CUDA_Aware_MPI.c:70
#18 0x004033d8 in TransferAllHalos (cartComm=0x715460,
domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90,
copyStream=0xaa4450, devBlocks=0x7fffcd30,
devSideEdges=0x7fffcd20, devHaloLines=0x7fffcd10,
hostSendLines=0x7fffcd00, hostRecvLines=0x7fffccf0) at Host.c:400
#19 0x0040363c in RunJacobi (cartComm=0x715460, rank=0, size=2,
---Type  to continue, or q  to quit---
domSize=0x7fffcd80, topIndex=0x7fffcd60, neighbors=0x7fffcd90,
useFastSwap=0, devBlocks=0x7fffcd30, devSideEdges=0x7fffcd20,
devHaloLines=0x7fffcd10, hostSendLines=0x7fffcd00,
hostRecvLines=0x7fffccf0, devResidue=0x131048,
copyStream=0xaa4450, iterations=0x7fffcd44,
avgTransferTime=0x7fffcd48) at Host.c:466
#20 0x00401ccb in main (argc=4, argv=0x7fffcea8) at Jacobi.c:60
Pierre.




De : KESTENER Pierre
Date d'envoi : mercredi 30 octobre 2013 16:34
À : us...@open-mpi.org
Cc: KESTENER Pierre
Objet : OpenMPI-1.7.3 - cuda support
Hello,

I'm having problems running a simple cuda-aware mpi application; the one found 
at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example

I have modified symbol ENV_LOCAL_RANK into OMPI_COMM_WORLD_LOCAL_RANK
My cluster has 2 K20m GPUs per node, with QLogic IB stack.

The normal CUDA/MPI application works fine;
 but the cuda-aware MPI app is crashing when using 2 MPI processes over the 2 GPUs of 
the same node:
the error message is:
Assertion failure at ptl.c:200: nbytes == msglen
I can send the complete backtrace from cuda-gdb if needed.

The same app when running on 2 GPUs on 2 different nodes give another error:
jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 
SP=7fffc06c21f8.  Backtrace:
/gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]


Can someone give me hints where to look to track this problem ?
Thank you.

Pierre Kestener.



Re: [OMPI users] OpenMPI-1.7.3 - cuda support

2013-10-30 Thread Rolf vandeVaart
Let me try this out and see what happens for me.  But yes, please go ahead and 
send me the complete backtrace.
Rolf

From: users [mailto:users-boun...@open-mpi.org] On Behalf Of KESTENER Pierre
Sent: Wednesday, October 30, 2013 11:34 AM
To: us...@open-mpi.org
Cc: KESTENER Pierre
Subject: [OMPI users] OpenMPI-1.7.3 - cuda support

Hello,

I'm having problems running a simple cuda-aware mpi application; the one found 
at
https://github.com/parallel-forall/code-samples/tree/master/posts/cuda-aware-mpi-example

I have modified symbol ENV_LOCAL_RANK into OMPI_COMM_WORLD_LOCAL_RANK
My cluster has 2 K20m GPUs per node, with QLogic IB stack.

The normal CUDA/MPI application works fine;
 but the cuda-aware MPI app is crashing when using 2 MPI processes over the 2 GPUs of 
the same node:
the error message is:
Assertion failure at ptl.c:200: nbytes == msglen
I can send the complete backtrace from cuda-gdb if needed.

The same app when running on 2 GPUs on 2 different nodes give another error:
jacobi_cuda_aware_mpi:28280 terminated with signal 11 at PC=2aae9d7c9f78 
SP=7fffc06c21f8.  Backtrace:
/gpfslocal/pub/local/lib64/libinfinipath.so.4(+0x8f78)[0x2aae9d7c9f78]


Can someone give me hints where to look to track this problem ?
Thank you.

Pierre Kestener.





Re: [OMPI users] [EXTERNAL] Re: Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11

2013-10-07 Thread Rolf vandeVaart
Good.  This is fixed in Open MPI 1.7.3, by the way.  I will add a note to the 
FAQ on building Open MPI 1.7.2.

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Hammond,
>Simon David (-EXP)
>Sent: Monday, October 07, 2013 4:17 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] [EXTERNAL] Re: Build Failing for OpenMPI 1.7.2 and
>CUDA 5.5.11
>
>Thanks Rolf, that seems to have made the code compile and make
>successfully.
>
>S.
>
>--
>Simon Hammond
>Scalable Computer Architectures (CSRI/146, 01422) Sandia National
>Laboratories, NM, USA
>
>
>
>
>
>
>On 10/7/13 1:47 PM, "Rolf vandeVaart" <rvandeva...@nvidia.com> wrote:
>
>>That might be a bug.  While I am checking, you could try configuring with
>>this additional flag:
>>
>>--enable-mca-no-build=pml-bfo
>>
>>Rolf
>>
>>>-Original Message-
>>>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of
>Hammond,
>>>Simon David (-EXP)
>>>Sent: Monday, October 07, 2013 3:30 PM
>>>To: us...@open-mpi.org
>>>Subject: [OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11
>>>
>>>Hey everyone,
>>>
>>>I am trying to build OpenMPI 1.7.2 with CUDA enabled, OpenMPI will
>>>configure successfully but I am seeing a build error relating to the
>>>inclusion of
>>>the CUDA options (at least I think so). Do you guys know if this is a
>>>bug or
>>>whether something is wrong with how we are configuring OpenMPI for our
>>>cluster.
>>>
>>>Configure Line: ./configure
>>>--prefix=/home/projects/openmpi/1.7.2/gnu/4.7.2 --enable-shared --
>enable-
>>>static --disable-vt --with-cuda=/home/projects/cuda/5.5.11
>>>CC=`which gcc` CXX=`which g++` FC=`which gfortran`
>>>
>>>Running make V=1 gives:
>>>
>>>make[2]: Entering directory `/tmp/openmpi-1.7.2/ompi/tools/ompi_info'
>>>/bin/sh ../../../libtool  --tag=CC   --mode=link
>>>/home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>>>DOPAL_CONFIGURE_USER="\"\"" -
>>>DOPAL_CONFIGURE_HOST="\"k20-0007\""
>>>-DOPAL_CONFIGURE_DATE="\"Mon Oct  7 13:16:12 MDT 2013\""
>>>-DOMPI_BUILD_USER="\"$USER\"" -
>DOMPI_BUILD_HOST="\"`hostname`\""
>>>-DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -
>>>DNDEBUG -finline-functions -fno-strict-aliasing -pthread\""
>>>-DOMPI_BUILD_CPPFLAGS="\"-I../../..
>>>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>>>-I/usr/include/infiniband -I/usr/include/infiniband
>>>-I/usr/include/infiniband -
>>>I/usr/include/infiniband -I/usr/include/infiniband\"" -
>>>DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG -finline-functions -
>pthread\"" -
>>>DOMPI_BUILD_CXXCPPFLAGS="\"-I../../..  \""
>>>-DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\""
>>>-DOMPI_BUILD_LDFLAGS="\"-export-dynamic  \"" -
>DOMPI_BUILD_LIBS="\"-
>>>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE="\"\""
>>>-DOMPI_CXX_ABSOLUTE="\"none\"" -O3 -DNDEBUG -finline-functions
>>>-fno-strict-aliasing -pthread  -export-dynamic   -o ompi_info ompi_info.o
>>>param.o components.o version.o ../../../ompi/libmpi.la -lrt -lnsl
>>>-lutil -lm
>>>libtool: link: /home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>>>DOPAL_CONFIGURE_USER=\"\" -
>>>DOPAL_CONFIGURE_HOST=\"k20-0007\"
>>>"-DOPAL_CONFIGURE_DATE=\"Mon Oct  7 13:16:12 MDT 2013\""
>>>-DOMPI_BUILD_USER=\"\" -DOMPI_BUILD_HOST=\"k20-
>0007\"
>>>"-DOMPI_BUILD_DATE=\"Mon Oct  7 13:26:23 MDT 2013\""
>>>"-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions -fno-strict-
>>>aliasing -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../..
>>>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>>>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>>>-I/usr/include/infiniband -I/usr/include/infiniband
>>>-I/usr/include/infiniband -
>>>I/usr/include/infiniband -I/us

Re: [OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11

2013-10-07 Thread Rolf vandeVaart
That might be a bug.  While I am checking, you could try configuring with this 
additional flag:

--enable-mca-no-build=pml-bfo
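
That is, appended to an otherwise unchanged configure invocation, e.g. roughly:

./configure --prefix=... --with-cuda=/home/projects/cuda/5.5.11 --enable-mca-no-build=pml-bfo ...

so the bfo pml, which is missing the CUDA symbols in your link error, is simply
not built.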

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Hammond,
>Simon David (-EXP)
>Sent: Monday, October 07, 2013 3:30 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] Build Failing for OpenMPI 1.7.2 and CUDA 5.5.11
>
>Hey everyone,
>
>I am trying to build OpenMPI 1.7.2 with CUDA enabled, OpenMPI will
>configure successfully but I am seeing a build error relating to the inclusion 
>of
>the CUDA options (at least I think so). Do you guys know if this is a bug or
>whether something is wrong with how we are configuring OpenMPI for our
>cluster.
>
>Configure Line: ./configure
>--prefix=/home/projects/openmpi/1.7.2/gnu/4.7.2 --enable-shared --enable-
>static --disable-vt --with-cuda=/home/projects/cuda/5.5.11
>CC=`which gcc` CXX=`which g++` FC=`which gfortran`
>
>Running make V=1 gives:
>
>make[2]: Entering directory `/tmp/openmpi-1.7.2/ompi/tools/ompi_info'
>/bin/sh ../../../libtool  --tag=CC   --mode=link
>/home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>DOPAL_CONFIGURE_USER="\"\"" -
>DOPAL_CONFIGURE_HOST="\"k20-0007\""
>-DOPAL_CONFIGURE_DATE="\"Mon Oct  7 13:16:12 MDT 2013\""
>-DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"`hostname`\""
>-DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O3 -
>DNDEBUG -finline-functions -fno-strict-aliasing -pthread\""
>-DOMPI_BUILD_CPPFLAGS="\"-I../../..
>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>-I/usr/include/infiniband -I/usr/include/infiniband -I/usr/include/infiniband -
>I/usr/include/infiniband -I/usr/include/infiniband\"" -
>DOMPI_BUILD_CXXFLAGS="\"-O3 -DNDEBUG -finline-functions -pthread\"" -
>DOMPI_BUILD_CXXCPPFLAGS="\"-I../../..  \""
>-DOMPI_BUILD_FFLAGS="\"\"" -DOMPI_BUILD_FCFLAGS="\"\""
>-DOMPI_BUILD_LDFLAGS="\"-export-dynamic  \"" -DOMPI_BUILD_LIBS="\"-
>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE="\"\""
>-DOMPI_CXX_ABSOLUTE="\"none\"" -O3 -DNDEBUG -finline-functions
>-fno-strict-aliasing -pthread  -export-dynamic   -o ompi_info ompi_info.o
>param.o components.o version.o ../../../ompi/libmpi.la -lrt -lnsl  -lutil -lm
>libtool: link: /home/projects/gcc/4.7.2/bin/gcc -std=gnu99 -
>DOPAL_CONFIGURE_USER=\"\" -
>DOPAL_CONFIGURE_HOST=\"k20-0007\"
>"-DOPAL_CONFIGURE_DATE=\"Mon Oct  7 13:16:12 MDT 2013\""
>-DOMPI_BUILD_USER=\"\" -DOMPI_BUILD_HOST=\"k20-0007\"
>"-DOMPI_BUILD_DATE=\"Mon Oct  7 13:26:23 MDT 2013\""
>"-DOMPI_BUILD_CFLAGS=\"-O3 -DNDEBUG -finline-functions -fno-strict-
>aliasing -pthread\"" "-DOMPI_BUILD_CPPFLAGS=\"-I../../..
>-I/tmp/openmpi-1.7.2/opal/mca/hwloc/hwloc152/hwloc/include
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent
>-I/tmp/openmpi-1.7.2/opal/mca/event/libevent2019/libevent/include
>-I/usr/include/infiniband -I/usr/include/infiniband -I/usr/include/infiniband -
>I/usr/include/infiniband -I/usr/include/infiniband\"" "-
>DOMPI_BUILD_CXXFLAGS=\"-O3 -DNDEBUG -finline-functions -pthread\"" "-
>DOMPI_BUILD_CXXCPPFLAGS=\"-I../../..  \""
>-DOMPI_BUILD_FFLAGS=\"\" -DOMPI_BUILD_FCFLAGS=\"\"
>"-DOMPI_BUILD_LDFLAGS=\"-export-dynamic  \"" "-DOMPI_BUILD_LIBS=\"-
>lrt -lnsl  -lutil -lm \"" -DOPAL_CC_ABSOLUTE=\"\" -
>DOMPI_CXX_ABSOLUTE=\"none\"
>-O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -o
>.libs/ompi_info ompi_info.o param.o components.o version.o -Wl,--export-
>dynamic  ../../../ompi/.libs/libmpi.so -L/usr/lib64 -lrdmacm -losmcomp -
>libverbs /tmp/openmpi-1.7.2/orte/.libs/libopen-rte.so
>/tmp/openmpi-1.7.2/opal/.libs/libopen-pal.so -lcuda -lnuma -ldl -lrt -lnsl 
>-lutil -
>lm -pthread -Wl,-rpath -Wl,/home/projects/openmpi/1.7.2/gnu/4.7.2/lib
>../../../ompi/.libs/libmpi.so: undefined reference to
>`mca_pml_bfo_send_request_start_cuda'
>../../../ompi/.libs/libmpi.so: undefined reference to
>`mca_pml_bfo_cuda_need_buffers'
>collect2: error: ld returned 1 exit status
>
>
>
>Thanks.
>
>S.
>
>--
>Simon Hammond
>Scalable Computer Architectures (CSRI/146, 01422) Sandia National
>Laboratories, NM, USA
>
>
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


[OMPI users] CUDA-aware usage

2013-10-01 Thread Rolf vandeVaart
We have done some work over the last year or two to add some CUDA-aware support 
into the Open MPI library.  Details on building and using the feature are here.

http://www.open-mpi.org/faq/?category=building#build-cuda
http://www.open-mpi.org/faq/?category=running#mpi-cuda-support

I am looking for any feedback on this feature from anyone who has taken 
advantage of it.  You can send just send the response to me if you want and I 
will compile the feedback.

Rolf



Re: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35

2013-08-14 Thread Rolf vandeVaart
Check to see if you have libcuda.so in /usr/lib64.  If so, then this should 
work:

--with-cuda=/opt/nvidia/cudatoolkit/5.0.35 

The configure will find the libcuda.so in /usr/lib64.

>-Original Message-
>From: Ray Sheppard [mailto:rshep...@iu.edu]
>Sent: Wednesday, August 14, 2013 2:59 PM
>To: Open MPI Users
>Cc: Rolf vandeVaart
>Subject: Re: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35
>
>Thank you for the quick reply Rolf,
>   I personally don't know the Cuda libraries. I was hoping there had been a
>name change.  I am on a Cray XT-7.
>Here is my configure command:
>
>./configure CC=gcc FC=gfortran CFLAGS="-O2" F77=gfortran FCFLAGS="-O2"
>--enable-static --disable-shared  --disable-vt --with-threads=posix --with-gnu-
>ld --with-alps --with-cuda=/opt/nvidia/cudatoolkit/5.0.35
>--with-cuda-libdir=/opt/nvidia/cudatoolkit/5.0.35/lib64
>--prefix=/N/soft/cle4/openmpi/gnu/1.7.2/cuda
>
>Ray
>
>On 8/14/2013 2:50 PM, Rolf vandeVaart wrote:
>> It is looking for the libcuda.so file, not the libcudart.so file.   So, 
>> maybe --
>with-libdir=/usr/lib64
>> You need to be on a machine with the CUDA driver installed.  What was your
>configure command?
>>
>> http://www.open-mpi.org/faq/?category=building#build-cuda
>>
>> Rolf
>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ray
>>> Sheppard
>>> Sent: Wednesday, August 14, 2013 2:49 PM
>>> To: Open MPI Users
>>> Subject: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35
>>>
>>> Hello,
>>>When I try to run my configure script, it dies with the following.
>>> Below it are the actual libraries in the directory. Could the
>>> solution be as simple as adding "rt" somewhere in the configure script?
>Thanks.
>>>   Ray
>>>
>>> checking if --with-cuda-libdir is set... not found
>>> configure: WARNING: Expected file
>>> /opt/nvidia/cudatoolkit/5.0.35/lib64/libcuda.* not found
>>> configure: error: Cannot continue
>>> rsheppar@login1:/N/dc/projects/ray/br2/openmpi-1.7.2> ls -l
>>> /opt/nvidia/cudatoolkit/5.0.35/lib64/
>>> total 356284
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libcublas.so ->
>>> libcublas.so.5.0
>>> lrwxrwxrwx 1 root root19 Mar 18 14:35 libcublas.so.5.0 ->
>>> libcublas.so.5.0.35
>>> -rwxr-xr-x 1 root root  58852880 Sep 26  2012 libcublas.so.5.0.35
>>> -rw-r--r-- 1 root root  21255400 Sep 26  2012 libcublas_device.a
>>> -rw-r--r-- 1 root root456070 Sep 26  2012 libcudadevrt.a
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libcudart.so ->
>>> libcudart.so.5.0
>>> lrwxrwxrwx 1 root root19 Mar 18 14:35 libcudart.so.5.0 ->
>>> libcudart.so.5.0.35
>>> -rwxr-xr-x 1 root root375752 Sep 26  2012 libcudart.so.5.0.35
>>> lrwxrwxrwx 1 root root15 Mar 18 14:35 libcufft.so -> libcufft.so.5.0
>>> lrwxrwxrwx 1 root root18 Mar 18 14:35 libcufft.so.5.0 ->
>>> libcufft.so.5.0.35
>>> -rwxr-xr-x 1 root root  30787712 Sep 26  2012 libcufft.so.5.0.35
>>> lrwxrwxrwx 1 root root17 Mar 18 14:35 libcuinj64.so ->
>>> libcuinj64.so.5.0
>>> lrwxrwxrwx 1 root root20 Mar 18 14:35 libcuinj64.so.5.0 ->
>>> libcuinj64.so.5.0.35
>>> -rwxr-xr-x 1 root root   1306496 Sep 26  2012 libcuinj64.so.5.0.35
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libcurand.so ->
>>> libcurand.so.5.0
>>> lrwxrwxrwx 1 root root19 Mar 18 14:35 libcurand.so.5.0 ->
>>> libcurand.so.5.0.35
>>> -rwxr-xr-x 1 root root  25281224 Sep 26  2012 libcurand.so.5.0.35
>>> lrwxrwxrwx 1 root root18 Mar 18 14:35 libcusparse.so ->
>>> libcusparse.so.5.0
>>> lrwxrwxrwx 1 root root21 Mar 18 14:35 libcusparse.so.5.0 ->
>>> libcusparse.so.5.0.35
>>> -rwxr-xr-x 1 root root 132455240 Sep 26  2012 libcusparse.so.5.0.35
>>> lrwxrwxrwx 1 root root13 Mar 18 14:35 libnpp.so -> libnpp.so.5.0
>>> lrwxrwxrwx 1 root root16 Mar 18 14:35 libnpp.so.5.0 ->
>>> libnpp.so.5.0.35
>>> -rwxr-xr-x 1 root root  93602912 Sep 26  2012 libnpp.so.5.0.35
>>> lrwxrwxrwx 1 root root20 Mar 18 14:35 libnvToolsExt.so ->
>>> libnvToolsExt.so.5.0
>>> lrwxrwxrwx 1 root root23 Mar 18 14:35 libnvToolsExt.so.5.0 ->
>>> libnvToolsExt.so.5.0.35
>>> -rwxr-xr-x 1 root root 31280 Sep 26  2012 libnvToolsExt.so

Re: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35

2013-08-14 Thread Rolf vandeVaart
It is looking for the libcuda.so file, not the libcudart.so file.   So, maybe 
--with-libdir=/usr/lib64
You need to be on a machine with the CUDA driver installed.  What was your 
configure command?

http://www.open-mpi.org/faq/?category=building#build-cuda

Rolf

>-Original Message-
>From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ray
>Sheppard
>Sent: Wednesday, August 14, 2013 2:49 PM
>To: Open MPI Users
>Subject: [OMPI users] Trouble configuring 1.7.2 for Cuda 5.0.35
>
>Hello,
>   When I try to run my configure script, it dies with the following.
>Below it are the actual libraries in the directory. Could the solution be as
>simple as adding "rt" somewhere in the configure script?  Thanks.
>  Ray
>
>checking if --with-cuda-libdir is set... not found
>configure: WARNING: Expected file
>/opt/nvidia/cudatoolkit/5.0.35/lib64/libcuda.* not found
>configure: error: Cannot continue
>rsheppar@login1:/N/dc/projects/ray/br2/openmpi-1.7.2> ls -l
>/opt/nvidia/cudatoolkit/5.0.35/lib64/
>total 356284
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libcublas.so ->
>libcublas.so.5.0
>lrwxrwxrwx 1 root root19 Mar 18 14:35 libcublas.so.5.0 ->
>libcublas.so.5.0.35
>-rwxr-xr-x 1 root root  58852880 Sep 26  2012 libcublas.so.5.0.35
>-rw-r--r-- 1 root root  21255400 Sep 26  2012 libcublas_device.a
>-rw-r--r-- 1 root root456070 Sep 26  2012 libcudadevrt.a
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libcudart.so ->
>libcudart.so.5.0
>lrwxrwxrwx 1 root root19 Mar 18 14:35 libcudart.so.5.0 ->
>libcudart.so.5.0.35
>-rwxr-xr-x 1 root root375752 Sep 26  2012 libcudart.so.5.0.35
>lrwxrwxrwx 1 root root15 Mar 18 14:35 libcufft.so -> libcufft.so.5.0
>lrwxrwxrwx 1 root root18 Mar 18 14:35 libcufft.so.5.0 ->
>libcufft.so.5.0.35
>-rwxr-xr-x 1 root root  30787712 Sep 26  2012 libcufft.so.5.0.35
>lrwxrwxrwx 1 root root17 Mar 18 14:35 libcuinj64.so ->
>libcuinj64.so.5.0
>lrwxrwxrwx 1 root root20 Mar 18 14:35 libcuinj64.so.5.0 ->
>libcuinj64.so.5.0.35
>-rwxr-xr-x 1 root root   1306496 Sep 26  2012 libcuinj64.so.5.0.35
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libcurand.so ->
>libcurand.so.5.0
>lrwxrwxrwx 1 root root19 Mar 18 14:35 libcurand.so.5.0 ->
>libcurand.so.5.0.35
>-rwxr-xr-x 1 root root  25281224 Sep 26  2012 libcurand.so.5.0.35
>lrwxrwxrwx 1 root root18 Mar 18 14:35 libcusparse.so ->
>libcusparse.so.5.0
>lrwxrwxrwx 1 root root21 Mar 18 14:35 libcusparse.so.5.0 ->
>libcusparse.so.5.0.35
>-rwxr-xr-x 1 root root 132455240 Sep 26  2012 libcusparse.so.5.0.35
>lrwxrwxrwx 1 root root13 Mar 18 14:35 libnpp.so -> libnpp.so.5.0
>lrwxrwxrwx 1 root root16 Mar 18 14:35 libnpp.so.5.0 ->
>libnpp.so.5.0.35
>-rwxr-xr-x 1 root root  93602912 Sep 26  2012 libnpp.so.5.0.35
>lrwxrwxrwx 1 root root20 Mar 18 14:35 libnvToolsExt.so ->
>libnvToolsExt.so.5.0
>lrwxrwxrwx 1 root root23 Mar 18 14:35 libnvToolsExt.so.5.0 ->
>libnvToolsExt.so.5.0.35
>-rwxr-xr-x 1 root root 31280 Sep 26  2012 libnvToolsExt.so.5.0.35
>
>
>
>--
>  Respectfully,
>Ray Sheppard
>rshep...@iu.edu
>http://pti.iu.edu/sciapt
>317-274-0016
>
>Principal Analyst
>Senior Technical Lead
>Scientific Applications and Performance Tuning
>Research Technologies
>University Information Technological Services
>IUPUI campus
>Indiana University
>
>My "pithy" saying:  Science is the art of translating the world
>into language. Unfortunately, that language is mathematics.
>Bumper sticker wisdom: Make it idiot-proof and they will make a
>better idiot.
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 1.7.2

2013-07-08 Thread Rolf vandeVaart
With respect to the CUDA-aware support, Ralph is correct.  The ability to send 
and receive GPU buffers is in the Open MPI 1.7 series.  And incremental 
improvements will be added to the Open MPI 1.7 series.  CUDA 5.0 is supported.
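
For anyone unfamiliar with what "CUDA-aware" means in practice: with a CUDA-aware build of the 1.7 series, a device pointer can be handed straight to the MPI calls. A minimal, hedged sketch (not taken from any official example) might look like this:

/* sketch: two ranks exchanging a GPU buffer directly; assumes a CUDA-aware build */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1024;
    double *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    cudaMalloc((void **)&d_buf, n * sizeof(double));  /* device memory, no explicit host staging */

    if (rank == 0)
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

Run with at least two ranks (for example, mpirun -np 2 ./a.out) on a CUDA-aware build.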



From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Saturday, July 06, 2013 5:14 PM
To: Open MPI Users
Subject: Re: [OMPI users] Support for CUDA and GPU-direct with OpenMPI 1.6.5 an 
1.7.2

There was discussion of this on a prior email thread on the OMPI devel mailing 
list:

http://www.open-mpi.org/community/lists/devel/2013/05/12354.php


On Jul 6, 2013, at 2:01 PM, Michael Thomadakis 
> wrote:


thanks,
Do you guys have any plan to support Intel Phi in the future? That is, running 
MPI code on the Phi cards or across the multicore and Phi, as Intel MPI does?
thanks...
Michael

On Sat, Jul 6, 2013 at 2:36 PM, Ralph Castain 
> wrote:
Rolf will have to answer the question on level of support. The CUDA code is not 
in the 1.6 series as it was developed after that series went "stable". It is in 
the 1.7 series, although the level of support will likely be incrementally 
increasing as that "feature" series continues to evolve.


On Jul 6, 2013, at 12:06 PM, Michael Thomadakis 
> wrote:

> Hello OpenMPI,
>
> I am wondering what level of support is there for CUDA and GPUdirect on 
> OpenMPI 1.6.5 and 1.7.2.
>
> I saw the ./configure --with-cuda=CUDA_DIR option in the FAQ. However, it 
> seems that with configure v1.6.5 it was ignored.
>
> Can you identify GPU memory and send messages from it directly without 
> copying to host memory first?
>
>
> Or in general, what level of CUDA support is there on 1.6.5 and 1.7.2 ? Do 
> you support SDK 5.0 and above?
>
> Cheers ...
> Michael
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Application hangs on mpi_waitall

2013-06-27 Thread Rolf vandeVaart
Ed, how large are the messages that you are sending and receiving?
Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ed Blosch
Sent: Thursday, June 27, 2013 9:01 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] Application hangs on mpi_waitall

It ran a bit longer but still deadlocked.  All matching sends are posted 
1:1 with posted recvs so it is a delivery issue of some kind.  I'm running a 
debug compiled version tonight to see what that might turn up.  I may try to 
rewrite with blocking sends and see if that works.  I can also try adding a 
barrier (irecvs, barrier, isends, waitall) to make sure sends are not buffering 
waiting for recvs to be posted.
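
For reference, the posting order described above (all receives, then a barrier, then all sends, then a single waitall) can be sketched as follows; this is a minimal ring exchange, not the application's actual communication pattern:

/* minimal sketch of the irecv / barrier / isend / waitall ordering discussed above */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, tag = 99;
    double sbuf, rbuf = -1.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int to   = (rank + 1) % size;              /* simple ring: send right, receive from left */
    int from = (rank - 1 + size) % size;
    sbuf = (double)rank;

    MPI_Irecv(&rbuf, 1, MPI_DOUBLE, from, tag, MPI_COMM_WORLD, &reqs[0]);
    MPI_Barrier(MPI_COMM_WORLD);               /* all receives posted before any send starts */
    MPI_Isend(&sbuf, 1, MPI_DOUBLE, to, tag, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %f from rank %d\n", rank, rbuf, from);
    MPI_Finalize();
    return 0;
}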


Sent via the Samsung Galaxy S™ III, an AT&T 4G LTE smartphone



 Original message 
From: George Bosilca >
Date:
To: Open MPI Users >
Subject: Re: [OMPI users] Application hangs on mpi_waitall


Ed,

I'm not sure, but there might be a case where the BTL is getting overwhelmed by 
the non-blocking operations while trying to set up the connection. There is a 
simple test for this. Add an MPI_Alltoall with a reasonable size (100k) before 
you start posting the non-blocking receives, and let's see if this solves your 
issue.

  George.


On Jun 26, 2013, at 04:02 , eblo...@1scom.net wrote:

> An update: I recoded the mpi_waitall as a loop over the requests with
> mpi_test and a 30 second timeout.  The timeout happens unpredictably,
> sometimes after 10 minutes of run time, other times after 15 minutes, for
> the exact same case.
>
> After 30 seconds, I print out the status of all outstanding receive
> requests.  The message tags that are outstanding have definitely been
> sent, so I am wondering why they are not getting received?
>
> As I said before, everybody posts non-blocking standard receives, then
> non-blocking standard sends, then calls mpi_waitall. Each process is
> typically waiting on 200 to 300 requests. Is deadlock possible via this
> implementation approach under some kind of unusual conditions?
>
> Thanks again,
>
> Ed
>
>> I'm running OpenMPI 1.6.4 and seeing a problem where mpi_waitall never
>> returns.  The case runs fine with MVAPICH.  The logic associated with the
>> communications has been extensively debugged in the past; we don't think
>> it has errors.   Each process posts non-blocking receives, non-blocking
>> sends, and then does waitall on all the outstanding requests.
>>
>> The work is broken down into 960 chunks. If I run with 960 processes (60
>> nodes of 16 cores each), things seem to work.  If I use 160 processes
>> (each process handling 6 chunks of work), then each process is handling 6
>> times as much communication, and that is the case that hangs with OpenMPI
>> 1.6.4; again, seems to work with MVAPICH.  Is there an obvious place to
>> start, diagnostically?  We're using the openib btl.
>>
>> Thanks,
>>
>> Ed
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] Compiling openmpi 1.6.4 without CUDA

2013-05-20 Thread Rolf vandeVaart
I can speak to part of your issue.  There are no CUDA-aware features in the 1.6 
series of Open MPI.  Therefore, the various configure flags you tried would not 
affect Open MPI itself.  Those configure flags are relevant with the 1.7 series 
and later, but as the FAQ says, the CUDA-aware feature is only included when 
explicitly requested.

The issue is with the CUDA support that is being configured into the 
Vampirtrace support.  If you do not need the Vampirtrace support, then just 
configure with the --disable-vt option, as you discovered.

I am not sure what configure flags to give to VampirTrace to tell it to not 
build in CUDA support.

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of dani
Sent: Monday, May 20, 2013 11:05 AM
To: us...@open-mpi.org
Subject: [OMPI users] Compiling openmpi 1.6.4 without CUDA

Hi List,

I've encountered an issue today - building an openmpi 1.6.4 from source rpm, on 
a machine which has cuda-5 (latest) installed, resulted in openmpi always using 
the cuda headers and libs.
I should mention that I have added the cuda libs dir to ldconfig, and the bin 
dir to the path (nvcc is in path).
When building openmpi 1.6.4 (rpmbuild --rebuild openmpi.src.rpm) the package is 
automatically build with cuda.
I have tried to define --without-cuda , --disable-cuda, --disable-cudawrapers 
but the rpm is always built with cuda, and fails to install as the required 
libs are not in rpmdb.
If installing with --disable-vt, cuda is not looked for or installed.
So i guess my question is two-fold:
1. Is this by design? from the FAQ 
(http://www.open-mpi.org/faq/?category=building#build-cuda) I was sure cuda is 
not built by default.
2. Is there a way to keep vampirtrace without cuda?

The reason I don't want cuda in mpi is due to the target cluster 
characteristics: Except for 1 node, it will have no gpus, so I saw no reason to 
deploy cuda to it. unfortunately, I had to use the single node with cuda as the 
compilation node, as it was the only node with complete development packages.

I can always mv the cuda dirs during build phase, but I'm wondering if this is 
how openmpi build is supposed to behave.



Re: [OMPI users] status of cuda across multiple IO hubs?

2013-03-11 Thread Rolf vandeVaart
Yes, unfortunately, that issue is still unfixed.  I just created the ticket and 
included a possible workaround.

https://svn.open-mpi.org/trac/ompi/ticket/3531

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Russell Power
>Sent: Monday, March 11, 2013 11:28 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] status of cuda across multiple IO hubs?
>
>I'm running into issues when trying to use GPUs in a multiprocessor system
>when using the latest release candidate (1.7rc8).
>Specifically, it looks like the OpenMPI code is still assuming that all GPUs 
>are
>on the same IOH, as in this message from a few months
>ago:
>
>http://www.open-mpi.org/community/lists/users/2012/07/19879.php
>
>I couldn't determine what happened to the ticket mentioned in that thread.
>
>For the moment, I'm just constraining myself to using the GPUs attached to
>one processor, but obviously that's less then ideal :).
>
>Curiously, the eager send path doesn't seem to have the same issue - if I
>adjust btl_smcuda_eager_limit up, sends work up to that threshold.
>Unfortunately, if I increase it beyond 10 megabytes I start seeing bus errors.
>
>I can manually breakup my own sends to be below the eager limit, but that
>seems non-optimal.
>
>Any other recommendations?
>
>Thanks,
>
>R
>
>The testing code and output is pasted below.
>
>---
>
>#include 
>#include 
>#include 
>
>
>#  define CUDA_SAFE_CALL( call) {\
>cudaError err = call;\
>if( cudaSuccess != err) {\
>fprintf(stderr, "Cuda error in file '%s' in line %i : %s.\n",\
>__FILE__, __LINE__, cudaGetErrorString( err) );  \
>exit(EXIT_FAILURE);  \
>} }
>
>int recv(int src) {
>  cudaSetDevice(0);
>  for (int bSize = 1; bSize < 100e6; bSize *= 2) {
>fprintf(stderr, "Recv: %d\n", bSize);
>void* buffer;
>CUDA_SAFE_CALL(cudaMalloc(&buffer, bSize));
>auto world = MPI::COMM_WORLD;
>world.Recv(buffer, bSize, MPI::BYTE, src, 0);
>CUDA_SAFE_CALL(cudaFree(buffer))
>  }
>}
>
>int send(int dst) {
>  cudaSetDevice(2);
>  for (int bSize = 1; bSize < 100e6; bSize *= 2) {
>fprintf(stderr, "Send: %d\n", bSize);
>void* buffer;
>CUDA_SAFE_CALL(cudaMalloc(&buffer, bSize));
>auto world = MPI::COMM_WORLD;
>world.Send(buffer, bSize, MPI::BYTE, dst, 0);
>CUDA_SAFE_CALL(cudaFree(buffer))
>  }
>}
>
>void checkPeerAccess() {
>  fprintf(stderr, "Access capability: gpu -> gpu\n");
>  for (int a = 0; a < 3; ++a) {
>for (int b = a; b < 3; ++b) {
>  if (a == b) { continue; }
>  int res;
>  cudaDeviceCanAccessPeer(&res, a, b);
>  fprintf(stderr, "%d <-> %d: %d\n", a, b, res);
>}
>  }
>}
>
>int main() {
>  MPI::Init_thread(MPI::THREAD_MULTIPLE);
>  if (MPI::COMM_WORLD.Get_rank() == 0) {
>checkPeerAccess();
>recv(1);
>  } else {
>send(0);
>  }
>  MPI::Finalize();
>}
>
>output from running:
>mpirun -mca btl_smcuda_eager_limit 64 -n 2 ./a.out Access capability: gpu ->
>gpu
>0 <-> 1: 1
>0 <-> 2: 0
>1 <-> 2: 0
>Send: 1
>Recv: 1
>Send: 2
>Recv: 2
>Send: 4
>Recv: 4
>Send: 8
>Recv: 8
>Send: 16
>Recv: 16
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>  cuIpcOpenMemHandle return value:   217
>  address: 0x230020
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users



[hwloc-users] Single hwloc.h header files that work on linux and windows

2013-01-03 Thread Rolf vandeVaart

I have an application is supposed to work with both windows and linux.  To that 
end, I downloaded hwloc, configured and then included the hwloc header files in 
my application.  I dynamically load the libhwloc.so library and map the 
functions I need.  If libhwloc.so is not there, then I can still run but give a 
warning.  However, I have run into a problem.  hwloc.h includes a whole bunch 
of other headers, one of which is config.h.  And config.h is specific to how 
the library was configured.  Therefore, when I attempt to compile my 
application on windows, I get an error about missing pthread.h file.  This is 
probably one of many differences.

Is there a special hwloc.h and supporting headers that is system independent so 
I can include them and build on both windows and linux?  Or do I need to have 
two different sets of header files, one for linux and one for windows?   
Perhaps I just need a config.h for windows and one for linux and select them at 
build time.
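
For context, the Linux side of the loading scheme described above is roughly the following (a sketch; hwloc_topology_init is used only as an example of a symbol to resolve):

/* sketch: load libhwloc.so at runtime and fall back to a warning if it is missing */
#include <dlfcn.h>
#include <stdio.h>

typedef int (*topo_init_fn)(void *);

int main(void)
{
    void *handle = dlopen("libhwloc.so", RTLD_NOW);
    if (handle == NULL) {
        fprintf(stderr, "warning: libhwloc.so not found, continuing without it (%s)\n", dlerror());
        return 0;
    }

    /* example lookup; the real application maps whichever functions it needs */
    topo_init_fn topo_init = (topo_init_fn) dlsym(handle, "hwloc_topology_init");
    if (topo_init == NULL)
        fprintf(stderr, "warning: hwloc_topology_init not found in libhwloc.so\n");

    dlclose(handle);
    return 0;
}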

Rolf



Re: [OMPI users] mpi_leave_pinned is dangerous

2012-11-08 Thread Rolf vandeVaart
Not sure.  I will look into this.   And thank you for the feedback Jens!
Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Jeff Squyres
>Sent: Thursday, November 08, 2012 8:49 AM
>To: Open MPI Users
>Subject: Re: [OMPI users] mpi_leave_pinned is dangerous
>
>On Nov 7, 2012, at 7:21 PM, Jens Glaser wrote:
>
>> With the help of MVAPICH2 developer S. Potluri the problem was isolated
>and fixed.
>
>Sorry about not replying; we're all (literally) very swamped trying to prepare
>for the Supercomputing trade show/conference next week.  I know I'm
>wy behind on OMPI user mails; sorry folks.  :-(
>
>> It was, as expected, due to the library not intercepting the
>> cudaHostAlloc() and cudaFreeHost() calls to register pinned memory, as
>would be required for the registration cache to work.
>
>Rolf/NVIDIA -- what's the chance of getting that to be intercepted properly?
>Do you guys have good hooks for this?  (HINT HINT :-) )
>
>--
>Jeff Squyres
>jsquy...@cisco.com
>For corporate legal information go to:
>http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] ompi-clean on single executable

2012-10-24 Thread Rolf vandeVaart
And just to give a little context, ompi-clean was created initially to "clean" 
up a node, not for cleaning up a specific job.  It was for the case where MPI 
jobs would leave some files behind or leave some processes running.  (I do not 
believe this happens much at all anymore.)  But, as was said, no reason it 
could not be modified.

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Jeff Squyres
>Sent: Wednesday, October 24, 2012 12:56 PM
>To: Open MPI Users
>Subject: Re: [OMPI users] ompi-clean on single executable
>
>...but patches would be greatly appreciated.  :-)
>
>On Oct 24, 2012, at 12:24 PM, Ralph Castain wrote:
>
>> All things are possible, including what you describe. Not sure when we
>would get to it, though.
>>
>>
>> On Oct 24, 2012, at 4:01 AM, Nicolas Deladerriere
> wrote:
>>
>>> Reuti,
>>>
>>> The problem I am facing is a small small part of our production
>>> system, and I cannot modify our mpirun submission system. This is why
>>> i am looking at solution using only ompi-clean of mpirun command
>>> specification.
>>>
>>> Thanks,
>>> Nicolas
>>>
>>> 2012/10/24, Reuti :
 Am 24.10.2012 um 11:33 schrieb Nicolas Deladerriere:

> Reuti,
>
> Thanks for your comments,
>
> In our case, we are currently running different mpirun commands on
> clusters sharing the same frontend. Basically we use a wrapper to
> run the mpirun command and to run an ompi-clean command to clean
>up
> the mpi job if required.
> Using ompi-clean like this just kills all other mpi jobs running on
> same frontend. I cannot use queuing system

 Why? Using it on a single machine was only one possible setup. Its
 purpose is to distribute jobs to slave hosts. If you have already
 one frontend as login-machine it fits perfect: the qmaster (in case
 of SGE) can run there and the execd on the nodes.

 -- Reuti


> as you have suggested this
> is why I was wondering a option or other solution associated to
> ompi-clean command to avoid this general mpi jobs cleaning.
>
> Cheers
> Nicolas
>
> 2012/10/24, Reuti :
>> Hi,
>>
>> Am 24.10.2012 um 09:36 schrieb Nicolas Deladerriere:
>>
>>> I am having issue running ompi-clean which clean up (this is
>>> normal) session associated to a user which means it kills all
>>> running jobs assoicated to this session (this is also normal).
>>> But I would like to be able to clean up session associated to a
>>> job (a not user).
>>>
>>> Here is my point:
>>>
>>> I am running two executable :
>>>
>>> % mpirun -np 2 myexec1
>>> --> run with PID 2399 ...
>>> % mpirun -np 2 myexec2
>>> --> run with PID 2402 ...
>>>
>>> When I run orte-clean I got this result :
>>> % orte-clean -v
>>> orte-clean: cleaning session dir tree
>>> openmpi-sessions-ndelader@myhost_0
>>> orte-clean: killing any lingering procs
>>> orte-clean: found potential rogue orterun process
>>> (pid=2399,user=ndelader), sending SIGKILL...
>>> orte-clean: found potential rogue orterun process
>>> (pid=2402,user=ndelader), sending SIGKILL...
>>>
>>> Which means that both jobs have been killed :-( Basically I would
>>> like to perform orte-clean using executable name or PID or
>>> whatever that identify which job I want to stop an clean. It
>>> seems I would need to create an openmpi session per job. Does it
>make sense ?
>>> And
>>> I would like to be able to do something like following command
>>> and get following result :
>>>
>>> % orte-clean -v myexec1
>>> orte-clean: cleaning session dir tree
>>> openmpi-sessions-ndelader@myhost_0
>>> orte-clean: killing any lingering procs
>>> orte-clean: found potential rogue orterun process
>>> (pid=2399,user=ndelader), sending SIGKILL...
>>>
>>>
>>> Does it make sense ? Is there a way to perform this kind of
>>> selection in cleaning process ?
>>
>> How many jobs are you starting on how many nodes at one time? This
>> requirement could be a point to start to use a queuing system,
>> where can remove job individually and also serialize your
>> workflow. In fact: we use GridEngine also local on workstations
>> for this purpose.
>>
>> -- Reuti
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


 ___
 users mailing list
 us...@open-mpi.org
 

Re: [OMPI users] RDMA GPUDirect CUDA...

2012-08-14 Thread Rolf vandeVaart
To answer the original questions, Open MPI will look at taking advantage of 
GPUDirect RDMA for CUDA when it is available.  Obviously, work needs to be done to figure out 
the best way to integrate into the library.  Much like there are a variety of 
protocols under the hood to support host transfer of data via IB, we will have 
to see what works  best for transferring GPU buffers.

It is unclear how this will affect the send/receive latency.

Lastly, the support will be for Kepler -class Quadro and Tesla devices.

Rolf


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Durga Choudhury
Sent: Tuesday, August 14, 2012 4:46 PM
To: Open MPI Users
Subject: Re: [OMPI users] RDMA GPUDirect CUDA...

Dear OpenMPI developers

I'd like to add my 2 cents that this would be a very desirable feature 
enhancement for me as well (and perhaps others).

Best regards
Durga

On Tue, Aug 14, 2012 at 4:29 PM, Zbigniew Koza 
> wrote:
Hi,

I've just found this information on  nVidia's plans regarding enhanced support 
for MPI in their CUDA toolkit:
http://developer.nvidia.com/cuda/nvidia-gpudirect

The idea that two GPUs can talk to each other via network cards without CPU as 
a middleman looks very promising.
This technology is supposed to be revealed and released in September.

My questions:

1. Will OpenMPI include   RDMA support in its CUDA interface?
2. Any idea how much can this technology reduce the CUDA Send/Recv latency?
3. Any idea whether this technology will be available for Fermi-class Tesla 
devices or only for Keplers?

Regards,

Z Koza



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] CUDA in v1.7? (was: Compilation of OpenMPI 1.5.4 & 1.6.X fail for PGI compiler...)

2012-08-09 Thread Rolf vandeVaart
>-Original Message-
>From: Jeff Squyres [mailto:jsquy...@cisco.com]
>Sent: Thursday, August 09, 2012 9:45 AM
>To: Open MPI Users
>Cc: Rolf vandeVaart
>Subject: CUDA in v1.7? (was: Compilation of OpenMPI 1.5.4 & 1.6.X fail for PGI
>compiler...)
>
>On Aug 9, 2012, at 9:37 AM, ESCOBAR Juan wrote:
>
>> ... but as I'am also interested in testing the Open-MPI/CUDA feature (
>> with potentially pgi-acc or open-acc directive ) I've 'googled' and finish 
>> in the
>the Open-MPI  'trunck' .
>>
>> => this Open-MPI/CUDA feature will be only in the 1.9 serie or also on
>1.7/1.8 ?
>
>Good question.
>
>Rolf -- do you plan to bring cuda stuff to v1.7.x?
>
>--

Yes, there currently is support in Open MPI 1.7 for CUDA.  What is missing is 
some improvements for internode transfers over IB which I still plan to check 
into the trunk and then move over to 1.7.  
Hopefully within the next month or so.

Rolf



Re: [OMPI users] bug in CUDA support for dual-processor systems?

2012-07-31 Thread Rolf vandeVaart
The current implementation does assume that the GPUs are on the same IOH and 
therefore can use the IPC features of the CUDA library for communication.
One of the initial motivations for this was that to be able to detect whether 
GPUs can talk to one another, the CUDA library has to be initialized and the 
GPUs have to be selected by each rank.  It is at that point that we can 
determine whether the IPC will work between the GPUs.  However, this means 
that the GPUs need to be selected by each rank prior to the call to MPI_Init as 
that is where we determine whether IPC is possible, and we were trying to avoid 
that requirement.

I will submit a ticket against this and see if we can improve this.
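
In the meantime, one way to select the device in each rank before MPI_Init is to key off the local rank that mpirun exports in the environment. A rough sketch (it assumes the OMPI_COMM_WORLD_LOCAL_RANK variable is set by Open MPI's mpirun):

/* sketch: pick a GPU per rank before MPI_Init(), using the local-rank
   environment variable assumed to be exported by Open MPI's mpirun */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    const char *lr = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = (lr != NULL) ? atoi(lr) : 0;
    int ndev = 0;

    cudaGetDeviceCount(&ndev);
    if (ndev > 0)
        cudaSetDevice(local_rank % ndev);   /* done before MPI_Init on purpose */

    MPI_Init(&argc, &argv);
    /* ... MPI + CUDA work ... */
    MPI_Finalize();
    return 0;
}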

Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Zbigniew Koza
>Sent: Tuesday, July 31, 2012 12:38 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] bug in CUDA support for dual-processor systems?
>
>Hi,
>
>I wrote a simple program to see if OpenMPI can really handle cuda pointers as
>promised in the FAQ and how efficiently.
>The program (see below) breaks if MPI communication is to be performed
>between two devices that are on the same node but under different IOHs in a
>dual-processor Intel machine.
>Note that  cudaMemCpy works for such devices, although not as efficiently as
>for the devices on the same IOH and GPUDirect enabled.
>
>Here's the output from my program:
>
>===
>
> >  mpirun -n 6 ./a.out
>Init
>Init
>Init
>Init
>Init
>Init
>rank: 1, size: 6
>rank: 2, size: 6
>rank: 3, size: 6
>rank: 4, size: 6
>rank: 5, size: 6
>rank: 0, size: 6
>device 3 is set
>Process 3 is on typhoon1
>Using regular memory
>device 0 is set
>Process 0 is on typhoon1
>Using regular memory
>device 4 is set
>Process 4 is on typhoon1
>Using regular memory
>device 1 is set
>Process 1 is on typhoon1
>Using regular memory
>device 5 is set
>Process 5 is on typhoon1
>Using regular memory
>device 2 is set
>Process 2 is on typhoon1
>Using regular memory
>^C^[[A^C
>zkoza@typhoon1:~/multigpu$
>zkoza@typhoon1:~/multigpu$ vim cudamussings.c
>zkoza@typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart
>-L/usr/local/cuda/lib64 -I/usr/local/cuda/include
>zkoza@typhoon1:~/multigpu$ vim cudamussings.c
>zkoza@typhoon1:~/multigpu$ mpicc cudamussings.c -lcuda -lcudart
>-L/usr/local/cuda/lib64 -I/usr/local/cuda/include
>zkoza@typhoon1:~/multigpu$ mpirun -n 6 ./a.out Process 1 of 6 is on
>typhoon1 Process 2 of 6 is on typhoon1 Process 0 of 6 is on typhoon1 Process
>4 of 6 is on typhoon1 Process 5 of 6 is on typhoon1 Process 3 of 6 is on
>typhoon1 device 2 is set device 1 is set device 0 is set Using regular memory
>device 5 is set device 3 is set device 4 is set
>Host->device bandwidth for processor 1: 1587.993499 MB/sec device
>Host->bandwidth for processor 2: 1570.275316 MB/sec device bandwidth for
>Host->processor 3: 1569.890751 MB/sec device bandwidth for processor 5:
>Host->1483.637702 MB/sec device bandwidth for processor 0: 1480.888029
>Host->MB/sec device bandwidth for processor 4: 1476.241371 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [1] bandwidth: 3338.57 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [1] bandwidth: 420.85 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[1] bandwidth: 362.13 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Device[1] bandwidth: 6552.35 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [2] bandwidth: 3238.88 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [2] bandwidth: 418.18 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[2] bandwidth: 362.06 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Device[2] bandwidth: 5022.82 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [3] bandwidth: 3295.32 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [3] bandwidth: 418.90 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[3] bandwidth: 359.16 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Device[3] bandwidth: 5019.89 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Host  [4] bandwidth: 4619.55 MB/sec
>MPI_Send/MPI_Receive,  Device[0] -> Host  [4] bandwidth: 419.24 MB/sec
>MPI_Send/MPI_Receive,  Host  [0] -> Device[4] bandwidth: 364.52 MB/sec
>--
>The call to cuIpcOpenMemHandle failed. This is an unrecoverable error and
>will cause the program to abort.
>   cuIpcOpenMemHandle return value:   205
>   address: 0x20020
>Check the cuda.h file for what the return value means. Perhaps a reboot of
>the node will clear the problem.
>--
>[typhoon1:06098] Failed to register remote memory, rc=-1 [typhoon1:06098]
>[[33788,1],4] ORTE_ERROR_LOG: Error in file pml_ob1_recvreq.c at line 465
>
>
>
>
>
>Comment:
>In my machine there are 2 six-core intel processors with HT on, yielding
>24 virtual processors, and  6 Tesla C2070s.
>The 

Re: [OMPI users] gpudirect p2p (again)?

2012-07-09 Thread Rolf vandeVaart
Yes, this feature is in Open MPI 1.7.  It is implemented in the "smcuda" btl.  
If you configure as outlined in the FAQ, then things should just work.  The 
smcuda btl will be selected and P2P will be used between GPUs on the same node. 
 This is only utilized on transfers of buffers that are larger than 4K in size.

Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Crni Gorac
>Sent: Monday, July 09, 2012 1:25 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] gpudirect p2p (again)?
>
>Trying to examine CUDA support in OpenMPI, using OpenMPI current feature
>series (v1.7).  There was a question on this mailing list back in October 2011
>(http://www.open-mpi.org/community/lists/users/2011/10/17539.php),
>about OpenMPI being able to use P2P transfers in case when two MPI
>processed involved in the transfer operation happens to execute on the same
>machine, and the answer was that this feature is being implemented.  So my
>question is - what is the current status here, is this feature supported now?
>
>Thanks.
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does not take arguments

2012-06-18 Thread Rolf vandeVaart
Dmitry:

It turns out that by default in Open MPI 1.7, configure enables warnings for 
deprecated MPI functionality.  In Open MPI 1.6, these warnings were disabled by 
default.
That explains why you would not see this issue in the earlier versions of Open 
MPI.

I assume that gcc must have added support for __attribute__((__deprecated__)) 
and then later on __attribute__((__deprecated__(msg))) and your version of gcc 
supports both of these.  (My version of gcc, 4.5.1 does not support the msg in 
the attribute)

The version of nvcc you have does not support the "msg" argument so everything 
blows up.
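
For illustration, the two attribute forms in question look like this (a made-up example, not the actual mpi.h text):

/* older form: accepted by both gcc and nvcc */
__attribute__((__deprecated__)) void old_fn(void);

/* newer form with a message: accepted by recent gcc, rejected by this nvcc */
__attribute__((__deprecated__("use new_fn instead"))) void old_fn2(void);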

I suggest you configure with --disable-mpi-interface-warning, which will prevent 
any of the deprecated attributes from being used and then things should work 
fine.

Let me know if this fixes your problem.

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rolf vandeVaart
Sent: Monday, June 18, 2012 11:00 AM
To: Open MPI Users
Cc: Олег Рябков
Subject: Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does 
not take arguments

Hi Dmitry:
Let me look into this.

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Dmitry N. Mikushin
Sent: Monday, June 18, 2012 10:56 AM
To: Open MPI Users
Cc: Олег Рябков
Subject: Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does 
not take arguments

Yeah, definitely. Thank you, Jeff.

- D.
2012/6/18 Jeff Squyres <jsquy...@cisco.com<mailto:jsquy...@cisco.com>>
On Jun 18, 2012, at 10:41 AM, Dmitry N. Mikushin wrote:

> No, I'm configuring with gcc, and for openmpi-1.6 it works with nvcc without 
> a problem.
Then I think Rolf (from Nvidia) should figure this out; I don't have access to 
nvcc.  :-)

> Actually, nvcc always meant to be more or less compatible with gcc, as far as 
> I know. I'm guessing in case of trunk nvcc is the source of the issue.
>
> And with ./configure CC=nvcc etc. it won't build:
> /home/dmikushin/forge/openmpi-trunk/opal/mca/event/libevent2019/libevent/include/event2/util.h:126:2:
>  error: #error "No way to define ev_uint64_t"
You should complain to Nvidia about that.

--
Jeff Squyres
jsquy...@cisco.com<mailto:jsquy...@cisco.com>
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does not take arguments

2012-06-18 Thread Rolf vandeVaart
Hi Dmitry:
Let me look into this.

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Dmitry N. Mikushin
Sent: Monday, June 18, 2012 10:56 AM
To: Open MPI Users
Cc: Олег Рябков
Subject: Re: [OMPI users] NVCC mpi.h: error: attribute "__deprecated__" does 
not take arguments

Yeah, definitely. Thank you, Jeff.

- D.
2012/6/18 Jeff Squyres >
On Jun 18, 2012, at 10:41 AM, Dmitry N. Mikushin wrote:

> No, I'm configuring with gcc, and for openmpi-1.6 it works with nvcc without 
> a problem.
Then I think Rolf (from Nvidia) should figure this out; I don't have access to 
nvcc.  :-)

> Actually, nvcc always meant to be more or less compatible with gcc, as far as 
> I know. I'm guessing in case of trunk nvcc is the source of the issue.
>
> And with ./configure CC=nvcc etc. it won't build:
> /home/dmikushin/forge/openmpi-trunk/opal/mca/event/libevent2019/libevent/include/event2/util.h:126:2:
>  error: #error "No way to define ev_uint64_t"
You should complain to Nvidia about that.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] GPU and CPU timing - OpenMPI and Thrust

2012-05-08 Thread Rolf vandeVaart
You should be running with one GPU per MPI process.  If I understand correctly, 
you have a 3 node cluster and each node has a GPU so you should run with np=3.
Maybe you can try that and see if your numbers come out better.
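
For example, a launch along these lines should give one rank per node (a sketch; the hostfile name and application name are placeholders, and -npernode limits the number of ranks started per node):

mpirun -np 3 -npernode 1 -hostfile myhostfile ./my_mpi_thrust_app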


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rohan Deshpande
Sent: Monday, May 07, 2012 9:38 PM
To: Open MPI Users
Subject: [OMPI users] GPU and CPU timing - OpenMPI and Thrust

 I am running MPI and Thrust code on a cluster and measuring time for 
calculations.

My MPI code -

#include "mpi.h"
#include 
#include 
#include 
#include 
#include 
#include 

#define  MASTER 0
#define ARRAYSIZE 2000

int 
*masterarray,*onearray,*twoarray,*threearray,*fourarray,*fivearray,*sixarray,*sevenarray,*eightarray,*ninearray;
   int main(int argc, char* argv[])
{
  int   numtasks, taskid,chunksize, namelen;
  int mysum,one,two,three,four,five,six,seven,eight,nine;

  char myname[MPI_MAX_PROCESSOR_NAME];
  MPI_Status status;
  int a,b,c,d,e,f,g,h,i,j;

/* Initializations */
MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
MPI_Get_processor_name(myname, &namelen);
printf ("MPI task %d has started on host %s...\n", taskid, myname);

masterarray= malloc(ARRAYSIZE * sizeof(int));
onearray= malloc(ARRAYSIZE * sizeof(int));
twoarray= malloc(ARRAYSIZE * sizeof(int));
threearray= malloc(ARRAYSIZE * sizeof(int));
fourarray= malloc(ARRAYSIZE * sizeof(int));
fivearray= malloc(ARRAYSIZE * sizeof(int));
sixarray= malloc(ARRAYSIZE * sizeof(int));
sevenarray= malloc(ARRAYSIZE * sizeof(int));
eightarray= malloc(ARRAYSIZE * sizeof(int));
ninearray= malloc(ARRAYSIZE * sizeof(int));

/* Master task only **/
if (taskid == MASTER){
   for(a=0; a < ARRAYSIZE; a++){
 masterarray[a] = 1;

}
   mysum = run_kernel0(masterarray,ARRAYSIZE,taskid, myname);

 }  /* end of master section */

  if (taskid > MASTER) {

 if(taskid == 1){
for(b=0;b

Re: [OMPI users] MPI over tcp

2012-05-04 Thread Rolf vandeVaart

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Don Armstrong
>Sent: Thursday, May 03, 2012 5:43 PM
>To: us...@open-mpi.org
>Subject: Re: [OMPI users] MPI over tcp
>
>On Thu, 03 May 2012, Rolf vandeVaart wrote:
>> I tried your program on a single node and it worked fine.
>
>It works fine on a single node, but deadlocks when it communicates in
>between nodes. Single node communication doesn't use tcp by default.
>
>> Yes, TCP message passing in Open MPI has been working well for some
>> time.
>
>Ok. Which version(s) of openmpi are you using successfully? [I'm assuming
>that this is in an environment which doesn't use IB.]

I was using a trunk version from a month or so ago.  However, TCP has not 
changed too much over the years, so I would expect all versions to work just 
fine.

>
>> 1. Can you run something like hostname successfully (mpirun -np 10
>> -hostfile yourhostfile hostname)
>
>Yes, but this only shows that processes start and output is returned, which
>doesn't utilize the in-band message passing at all.

Yes, I agree. But it at least shows that TCP connections can work between the 
machines.  We typically first make sure that something like hostname works.
Then we try something like the connectivity_c.c program in the examples 
directory to test out MPI communication.

>
>> 2. If that works, then you can also run with a debug switch to see
>> what connections are being made by MPI.
>
>You can see the connections being made in the attached log:
>
>[archimedes:29820] btl: tcp: attempting to connect() to [[60576,1],2] address
>138.23.141.162 on port 2001

Yes, I missed that.  So, can we simplify the problem.  Can you run with np=2 
and one process on each node?
Also, maybe you can send the ifconfig output from each node.  We sometimes see 
this type of hanging when
a node has two different interfaces on the same subnet.  

Assuming there are multiple interfaces, can you experiment with the runtime 
flags outlined here?
http://www.open-mpi.org/faq/?category=tcp#tcp-selection

Maybe by restricting to specific interfaces you can figure out which network is 
the problem.
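
For example, something like the following restricts the TCP BTL to a single interface (a sketch; replace eth0 with whichever interface should carry the MPI traffic):

mpirun --mca btl_tcp_if_include eth0 -np 2 -hostfile yourhostfile ./a.out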

>
>> I would suggest reading through here for some ideas and for the debug
>> switch.
>
>Thanks. I checked the FAQ, and didn't see anything that shed any light,
>unfortunately.
>
>
>Don Armstrong
>
>--
>Fate and Temperament are two words for one and the same concept.
> -- Novalis [Hermann Hesse _Demian_]
>
>http://www.donarmstrong.com  http://rzlab.ucr.edu
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI over tcp

2012-05-03 Thread Rolf vandeVaart
I tried your program on a single node and it worked fine.  Yes, TCP message 
passing in Open MPI has been working well for some time.
I have a few suggestions.
1. Can you run something like hostname successfully (mpirun -np 10 -hostfile 
yourhostfile hostname)
2. If that works, then you can also run with a debug switch to see what 
connections are being made by MPI.

I would suggest reading through here for some ideas and for the debug switch.

http://www.open-mpi.org/faq/?category=tcp
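
Concretely, the two steps above might look like this (the application name and the verbosity level are just example placeholders):

mpirun -np 10 -hostfile yourhostfile hostname
mpirun --mca btl_base_verbose 30 -np 10 -hostfile yourhostfile ./your_mpi_app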

Rolf

>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Don Armstrong
>Sent: Thursday, May 03, 2012 2:51 PM
>To: us...@open-mpi.org
>Subject: [OMPI users] MPI over tcp
>
>I'm attempting to use MPI over tcp; the attached (rather trivial) code gets
>stuck in MPI_Send. Looking at TCP dumps indicates that the TCP connection is
>made successfully to the right port, but the actual data doesn't appear to be
>sent.
>
>I'm beginning to suspect that there's some basic problem with my
>configuration, or an underlying bug in TCP message passing in MPI. Any
>suggestions to try (or a response indicating that MPI over TCP actually works,
>and that it's some problem with my setup) appreciated.
>
>The relevant portion of the hostfile looks like this:
>
>archimedes.int.donarmstrong.com slots=2
>krel.int.donarmstrong.com slots=8
>
>and the output of the run and tcpdump is attached.
>
>Thanks in advance.
>
>
>Don Armstrong
>
>--
>[T]he question of whether Machines Can Think, [...] is about as relevant as the
>question of whether Submarines Can Swim.
> -- Edsger W. Dijkstra "The threats to computing science"
>
>http://www.donarmstrong.com  http://rzlab.ucr.edu




Re: [OMPI users] MPI and CUDA

2012-04-24 Thread Rolf vandeVaart
I am not sure about everything that is going wrong, but there are at least two 
issues I found.
First, you are skipping the first line that you read from integers.txt.  Maybe 
something like this instead.

  while(fgets(line, sizeof line, fp) != NULL){
    sscanf(line, "%d", &data[k]);
    sum = sum + data[k]; // calculate sum to verify later on
    k++;
  }

Secondly, your function run_kernel is returning a pointer to an integer, but 
you are treating it as an integer.
A quick hack fix is:

int *mysumptr;
mysumptr = run_kernel(...);
mysum = *mysumptr;

I would suggest adding lots of printfs or walking through a debugger to find 
out other places there might be problems.

Rolf

From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] On Behalf Of 
Rohan Deshpande [rohan...@gmail.com]
Sent: Tuesday, April 24, 2012 3:35 AM
To: Open MPI Users
Subject: [OMPI users] MPI and CUDA

I am combining mpi and cuda. Trying to find out sum of array elements using 
cuda and using mpi to distribute the array.

my cuda code

#include 

__global__ void add(int *devarray, int *devsum)
{
int index = blockIdx.x * blockDim.x + threadIdx.x;
*devsum = *devsum + devarray[index];
}

extern "C"

int * run_kernel(int array[],int nelements)
{
int  *devarray, *sum, *devsum;
sum =(int *) malloc(1 * sizeof(int));

printf("\nrun_kernel called..");


cudaMalloc((void**) &devarray, sizeof(int)*nelements);
cudaMalloc((void**) &devsum, sizeof(int));
cudaMemcpy(devarray, array, sizeof(int)*nelements, 
cudaMemcpyHostToDevice);

//cudaMemcpy(devsum, sum, sizeof(int), cudaMemcpyHostToDevice);
add<<<2, 3>>>(devarray, devsum);
  //  printf("\ndevsum is %d", devsum);

cudaMemcpy(sum, devsum, sizeof(int), cudaMemcpyDeviceToHost);

printf(" \nthe sum is %d\n", *sum);
cudaFree(devarray);

cudaFree(devsum);
return sum;

}



#include "mpi.h"

#include 
#include 

#include 

#define  ARRAYSIZE  2000

#define  MASTER 0
int  data[ARRAYSIZE];


int main(int argc, char* argv[])
{


int   numtasks, taskid, rc, dest, offset, i, j, tag1, tag2, source, chunksize, 
namelen;

int mysum;
long sum;
int update(int myoffset, int chunk, int myid);

char myname[MPI_MAX_PROCESSOR_NAME];
MPI_Status status;
double start = 0.0, stop = 0.0, time = 0.0;

double totaltime;
FILE *fp;
char line[128];

char element;
int n;
int k=0;


/* Initializations */

MPI_Init(&argc, &argv);

MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
MPI_Comm_rank(MPI_COMM_WORLD, &taskid);

MPI_Get_processor_name(myname, &namelen);
printf ("MPI task %d has started on host %s...\n", taskid, myname);

chunksize = (ARRAYSIZE / numtasks);
tag2 = 1;

tag1 = 2;

/* Master task only **/


if (taskid == MASTER){

  fp=fopen("integers.txt", "r");

  if(fp != NULL){
   sum = 0;

   while(fgets(line, sizeof line, fp)!= NULL){

fscanf(fp,"%d",&data[k]);
sum = sum + data[k]; // calculate sum to verify later on

k++;
   }
  }


printf("Initialized array sum %d\n", sum);


  /* Send each task its portion of the array - master keeps 1st part */

  offset = chunksize;
  for (dest=1; dest MASTER) {


  /* Receive my portion of array from the master task */

  start= MPI_Wtime();
  source = MASTER;

  MPI_Recv(&offset, 1, MPI_INT, source, tag1, MPI_COMM_WORLD, &status);

  MPI_Recv(&data[offset], chunksize, MPI_INT, source, tag2, MPI_COMM_WORLD, &status);

  mysum = run_kernel(&data[offset], chunksize);
  printf("\nKernel returns sum %d ", mysum);


// mysum = update(offset, chunksize, taskid);

  stop = MPI_Wtime();
  time = stop -start;

  printf("time taken by process %d to recieve elements and caluclate own sum is 
= %lf seconds \n", taskid, time);

 // totaltime = totaltime + time;




  /* Send my results back to the master task */

  dest = MASTER;
  MPI_Send(&offset, 1, MPI_INT, dest, tag1, MPI_COMM_WORLD);

  MPI_Send(&data[offset], chunksize, MPI_INT, MASTER, tag2, MPI_COMM_WORLD);

  MPI_Reduce(&mysum, &sum, 1, MPI_INT, MPI_SUM, MASTER, MPI_COMM_WORLD);


  } /* end of non-master */


 

Re: [OMPI users] Open MPI 1.4.5 and CUDA support

2012-04-17 Thread Rolf vandeVaart
Yes, they are supported in the sense that they can work together.  However, if 
you want to have the ability to send/receive GPU buffers directly via MPI 
calls, then I recommend you get CUDA 4.1 and use the Open MPI trunk.

http://www.open-mpi.org/faq/?category=building#build-cuda

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Rohan Deshpande
Sent: Tuesday, April 17, 2012 2:13 AM
To: Open MPI Users
Subject: [OMPI users] Open MPI 1.4.5 and CUDA support

Hi,

I am using Open MPI 1.4.5 and I have CUDA 3.2 installed.

Anyone knows whether CUDA 3.2 is supported by OpenMPI?

Thanks






Re: [OMPI users] Problem running an mpi applicatio​n on nodes with more than one interface

2012-02-17 Thread Rolf vandeVaart
Open MPI cannot handle having two interfaces on a node on the same subnet.  I 
believe it has to do with our matching code when we try to match up a 
connection.
The result is a hang as you observe.  I also believe it is not good practice to 
have two interfaces on the same subnet.
If you put them on different subnets, things will work fine and communication 
will stripe over the two of them.
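
For illustration (the interface names and addresses come from the report quoted 
below; the renumbering itself is just a sketch of the suggestion above), the two 
ports on each node could be given their own subnets:

  denver:  eth23 -> 10.3.1.1/24    eth24 -> 10.3.2.1/24
  chicago: eth29 -> 10.3.1.3/24    eth30 -> 10.3.2.3/24

With that layout each subnet has exactly one interface per node, and Open MPI 
will stripe traffic across both of them.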

Rolf


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Richard Bardwell
Sent: Friday, February 17, 2012 5:37 AM
To: Open MPI Users
Subject: Re: [OMPI users] Problem running an mpi applicatio​n on nodes with 
more than one interface

I had exactly the same problem.
Trying to run mpi between 2 separate machines, with each machine having
2 ethernet ports, causes really weird behaviour on the most basic code.
I had to disable one of the ethernet ports on each of the machines
and it worked just fine after that. No idea why though !

- Original Message -
From: Jingcha Joba
To: us...@open-mpi.org
Sent: Thursday, February 16, 2012 8:43 PM
Subject: [OMPI users] Problem running an mpi applicatio​n on nodes with more 
than one interface

Hello Everyone,
This is my 1st post in open-mpi forum.
I am trying to run a simple program which does Sendrecv between two nodes 
having 2 interface cards on each of two nodes.
Both the nodes are running RHEL6, with open-mpi 1.4.4 on a 8 core Xeon 
processor.
What I noticed was that when using two or more interface on both the nodes, the 
mpi "hangs" attempting to connect.
These details might help,
Node 1 - Denver has a single port "A" card (eth21 - 25.192.xx.xx - which I use 
to ssh to that machine), and a double port "B" card (eth23 - 10.3.1.1 & eth24 - 
10.3.1.2).
Node 2 - Chicago also the same single port A card (eth19 - 25.192.xx.xx - again 
uses for ssh) and a double port B card ( eth29 - 10.3.1.3 & eth30 - 10.3.1.4).
My /etc/host looks like
25.192.xx.xx denver.xxx.com denver
10.3.1.1 denver.xxx.com denver
10.3.1.2 denver.xxx.com denver
25.192.xx.xx chicago.xxx.com chicago
10.3.1.3 chicago.xxx.com chicago
10.3.1.4 chicago.xxx.com chicago
...
...
...
This is how I run,
mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude 
eth21,eth19,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
I get bunch of things from both chicago and denver, which says its has found 
components like tcp, sm, self and stuffs, and then hangs at
[denver.xxx.com:21682] btl: tcp: attempting to 
connect() to address 10.3.1.3 on port 4
[denver.xxx.com:21682] btl: tcp: attempting to 
connect() to address 10.3.1.4 on port 4
However, if I run the same program by excluding eth29 or eth30, then it works 
fine. Something like this:
mpirun --hostfile host1 --mca btl tcp,sm,self --mca btl_tcp_if_exclude 
eth21,eth19,eth29,lo,virbr0 --mca btl_base_verbose 30 -np 4 ./Sendrecv
My hostfile looks like this
[sshuser@denver Sendrecv]$ cat host1
denver slots=2
chicago slots=2
I am not sure if I have to provide somethbing else. Please if I have to, please 
feel to ask me..
thanks,
--
Joba

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] How "CUDA Init prior to MPI_Init" co-exists with unique GPU for each MPI process?

2011-12-14 Thread Rolf vandeVaart
To add to this, yes, we recommend that the CUDA context exists prior to a call 
to MPI_Init.  That is because during MPI_Init the library attempts to register 
some internal buffers with the CUDA library, which requires that a CUDA context 
already exist.  Note that this is only relevant if you plan to send and receive 
CUDA device memory directly from MPI calls.  There is a little more about this 
in the FAQ here.

http://www.open-mpi.org/faq/?category=running#mpi-cuda-support
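
As a minimal sketch (not taken from this thread; the choice of device 0 is just 
a placeholder), the context can be created with any CUDA runtime call that 
touches the device before MPI_Init:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
{
    /* Touch the device first so the CUDA context exists before MPI_Init.
       cudaFree(0) is a common idiom for forcing context creation. */
    cudaSetDevice(0);          /* placeholder device id */
    cudaFree(0);

    MPI_Init(&argc, &argv);
    /* ... send/receive device buffers through MPI calls ... */
    MPI_Finalize();
    return 0;
}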


Rolf

From: Matthieu Brucher [mailto:matthieu.bruc...@gmail.com]
Sent: Wednesday, December 14, 2011 10:47 AM
To: Open MPI Users
Cc: Rolf vandeVaart
Subject: Re: [OMPI users] How "CUDA Init prior to MPI_Init" co-exists with 
unique GPU for each MPI process?

Hi,

Processes are not spawned by MPI_Init. They are spawned before by some 
applications between your mpirun call and when your program starts. When it 
does, you already have all MPI processes (you can check by adding a sleep or 
something like that), but they are not synchronized and do not know each other. 
This is what MPI_Init is used for.

Matthieu Brucher
2011/12/14 Dmitry N. Mikushin <maemar...@gmail.com>
Dear colleagues,

For GPU Winter School powered by Moscow State University cluster
"Lomonosov", the OpenMPI 1.7 was built to test and popularize CUDA
capabilities of MPI. There is one strange warning I cannot understand:
OpenMPI runtime suggests to initialize CUDA prior to MPI_Init. Sorry,
but how could it be? I thought processes are spawned during MPI_Init,
and such context will be created on the very first root process. Why
do we need existing CUDA context before MPI_Init? I think there was no
such error in previous versions.

Thanks,
- D.
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher



Re: [OMPI users] configure with cuda

2011-10-27 Thread Rolf vandeVaart
Actually, that is not quite right.  From the FAQ:

"This feature currently only exists in the trunk version of the Open MPI 
library."

You need to download and use the trunk version for this to work.

http://www.open-mpi.org/nightly/trunk/

Rolf

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Thursday, October 27, 2011 11:43 AM
To: Open MPI Users
Subject: Re: [OMPI users] configure with cuda


I'm pretty sure cuda support was never moved to the 1.4 series. You will, 
however, find it in the 1.5 series. I suggest you get the latest tarball from 
there.


On Oct 27, 2011, at 12:38 PM, Peter Wells wrote:



I am attempting to configure OpenMPI 1.4.3 with cuda support on a Redhat 5 box. 
When I try to run configure with the following command:

 ./configure --prefix=/opt/crc/sandbox/pwells2/openmpi/1.4.3/intel-12.0-cuda/ 
FC=ifort F77=ifort CXX=icpc CC=icc --with-sge --disable-dlopen --enable-static 
--enable-shared --disable-openib-connectx-xrc --disable-openib-rdmacm 
--without-openib --with-cuda=/opt/crc/cuda/4.0/cuda 
--with-cuda-libdir=/opt/crc/cuda/4.0/cuda/lib64

I receive the warning that '--with-cuda' and '--with-cuda-libdir' are 
unrecognized options. According to the FAQ these options are supported in this 
version of OpenMPI. I attempted the same thing with v.1.4.4 downloaded directly 
from open-mpi.org with similar results. Attached are the 
results of configure and make on v.1.4.3. Any help would be greatly appreciated.

Peter Wells
HPC Intern
Center for Research Computing
University of Notre Dame
pwel...@nd.edu
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] gpudirect p2p?

2011-10-14 Thread Rolf vandeVaart
>-Original Message-
>From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
>On Behalf Of Chris Cooper
>Sent: Friday, October 14, 2011 1:28 AM
>To: us...@open-mpi.org
>Subject: [OMPI users] gpudirect p2p?
>
>Hi,
>
>Are the recent peer to peer capabilities of cuda leveraged by Open MPI when
>eg you're running a rank per gpu on the one workstation?

Currently, no.  I am actively working on adding that capability. 

>
>It seems in my testing that I only get in the order of about 1GB/s as per
>http://www.open-mpi.org/community/lists/users/2011/03/15823.php,
>whereas nvidia's simpleP2P test indicates ~6 GB/s.
>
>Also, I ran into a problem just trying to test.  It seems you have to do
>cudaSetDevice/cuCtxCreate with the appropriate gpu id which I was wanting
>to derive from the rank.  You don't however know the rank until after
>MPI_Init() and you need to initialize cuda before.  Not sure if there's a
>standard way to do it?  I have a workaround atm.
>

The recommended way is to put the GPU in exclusive mode first.

#nvidia-smi -c 1

Then, have this kind of snippet at the beginning of the program. (this is driver
API, probably should use runtime API)

res = cuInit(0);
if (CUDA_SUCCESS != res) {
exit(1);
} 

if (CUDA_SUCCESS != cuDeviceGetCount(&cuDevCount)) {
    exit(2);
}
for (device = 0; device < cuDevCount; device++) {
    if (CUDA_SUCCESS != (res = cuDeviceGet(&cuDev, device))) {
        exit(3);
    }
    if (CUDA_SUCCESS != cuCtxCreate(&cuContext, 0, cuDev)) {
        /* Another process must have grabbed it.  Go to the next one. */
    } else {
        break;
    }
    i++;
}
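
For reference, here is a rough runtime-API version of the same idea (my own 
sketch, not from the original reply; pick_free_gpu is a made-up helper name):

#include <cuda_runtime.h>

/* With the GPUs in exclusive mode, try devices in order until one
   accepts a context; return its id, or -1 if none is free. */
static int pick_free_gpu(void)
{
    int count = 0, dev;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return -1;
    for (dev = 0; dev < count; dev++) {
        if (cudaSetDevice(dev) == cudaSuccess &&
            cudaFree(0) == cudaSuccess)   /* forces context creation */
            return dev;                   /* this GPU is ours */
    }
    return -1;
}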



>Thanks,
>Chris
>___
>users mailing list
>us...@open-mpi.org
>http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] MPI hangs on multiple nodes

2011-09-20 Thread Rolf vandeVaart

>> 1: After a reboot of two nodes I ran again, and the inter-node freeze didn't
>happen until the third iteration. I take that to mean that the basic
>communication works, but that something is saturating. Is there some notion
>of buffer size somewhere in the MPI system that could explain this?
>
>Hmm.  This is not a good sign; it somewhat indicates a problem with your OS.
>Based on this email and your prior emails, I'm guessing you're using TCP for
>communication, and that the problem is based on inter-node communication
>(e.g., the problem would occur even if you only run 1 process per machine,
>but does not occur if you run all N processes on a single machine, per your #4,
>below).
>

I agree with Jeff here.  Open MPI uses lazy connections to establish 
connections and round robins through the interfaces.
So, the first few communications could work as they are using interfaces that 
could communicate between the nodes, but the third iteration uses an interface 
that for some reason cannot establish the connection.

One flag you can use that may help is --mca btl_base_verbose 20, like this;

mpirun --mca btl_base_verbose 20 connectivity_c

It will dump out a bunch of stuff, but there will be a few lines that look like 
this:

[...snip...]
[dt:09880] btl: tcp: attempting to connect() to [[58627,1],1] address 
10.20.14.101 on port 1025
[...snip...]

Rolf





Re: [OMPI users] Program hangs when using OpenMPI and CUDA

2011-06-06 Thread Rolf vandeVaart
Hi Fengguang:

It is odd that you see the problem even when running with the openib flags 
set as Brice indicated.  Just to be extra sure there are no typos in your 
flag settings, maybe you can verify with the ompi_info command like this?

ompi_info -mca btl_openib_flags 304 -param btl openib | grep btl_openib_flags

When running with the 304 setting, then all communications travel through a 
regular send/receive protocol on IB.  The message is broken up into a 12K 
fragment, followed by however many 64K fragments it takes to move the message.
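
For example (rough arithmetic, assuming those fragment sizes), a 1 MiB message 
would go out as one 12 KiB fragment followed by sixteen 64 KiB fragments, since 
(1024 - 12) / 64 rounds up to 16.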

I will try and find the time to reproduce the other 1 Mbyte issue that Brice 
reported.

Rolf



PS: Not sure if you are interested, but in the trunk, you can configure in 
support so that you can send and receive GPU buffers directly.  There are still 
many performance issues to be worked out, but just thought I would mention it.


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Fengguang Song
Sent: Sunday, June 05, 2011 9:54 AM
To: Open MPI Users
Subject: Re: [OMPI users] Program hangs when using OpenMPI and CUDA

Hi Brice,

Thank you! I saw your previous discussion and actually have tried "--mca 
btl_openib_flags 304".
It didn't solve the problem unfortunately. In our case, the MPI buffer is 
different from the cudaMemcpy buffer and we do manually copy between them. I'm 
still trying to figure out how to configure OpenMPI's mca parameters to solve 
the problem...

Thanks,
Fengguang


On Jun 5, 2011, at 2:20 AM, Brice Goglin wrote:

> On 05/06/2011 00:15, Fengguang Song wrote:
>> Hi,
>> 
>> I'm confronting a problem when using OpenMPI 1.5.1 on a GPU cluster. 
>> My program uses MPI to exchange data between nodes, and uses cudaMemcpyAsync 
>> to exchange data between Host and GPU devices within a node.
>> When the MPI message size is less than 1MB, everything works fine. 
>> However, when the message size is > 1MB, the program hangs (i.e., an MPI 
>> send never reaches its destination based on my trace).
>> 
>> The issue may be related to locked-memory contention between OpenMPI and 
>> CUDA.
>> Does anyone have the experience to solve the problem? Which MCA 
>> parameters should I tune to increase the message size to be > 1MB (to avoid 
>> the program hang)? Any help would be appreciated.
>> 
>> Thanks,
>> Fengguang
> 
> Hello,
> 
> I may have seen the same problem when testing GPU direct. Do you use 
> the same host buffer for copying from/to GPU and for sending/receiving 
> on the network ? If so, you need a GPUDirect enabled kernel and 
> mellanox drivers, but it only helps before 1MB.
> 
> You can work around the problem with one of the following solution:
> * add --mca btl_openib_flags 304 to force OMPI to always send/recv 
> through an intermediate (internal buffer), but it'll decrease 
> performance before 1MB too
> * use different host buffers for the GPU and the network and manually 
> copy between them
> 
> I never got any reply from NVIDIA/Mellanox/here when I reported this 
> problem with GPUDirect and messages larger than 1MB.
> http://www.open-mpi.org/community/lists/users/2011/03/15823.php
> 
> Brice
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart
-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Brice Goglin
Sent: Monday, February 28, 2011 2:14 PM
To: Open MPI Users
Subject: Re: [OMPI users] anybody tried OMPI with gpudirect? 

On 28/02/2011 19:49, Rolf vandeVaart wrote:
> For the GPU Direct to work with Infiniband, you need to get some updated OFED 
> bits from your Infiniband vendor. 
>
> In terms of checking the driver updates, you can do a grep on the string 
> get_driver_pages in the file/proc/kallsyms.  If it is there, then the Linux 
> kernel is updated correctly.
>   

The kernel looks ok then. But I couldn't find any kernel modules (tried 
nvidia.ko and all ib modules) which references this symbol. So I guess my OFED 
kernel modules aren't ok. I'll check on Mellanox website (we have some very 
recent Mellanox ConnectX QDR boards).

thanks
Brice

--
I have since learned that you can check /sys/module/ib_core/parameters/*  which 
will list a couple of GPU direct files if the driver is installed correctly and 
loaded.
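
For instance (the exact command form is my paraphrase of the above):

ls /sys/module/ib_core/parameters/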

Rolf




Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart

For the GPU Direct to work with Infiniband, you need to get some updated OFED 
bits from your Infiniband vendor. 

In terms of checking the driver updates, you can do a grep on the string 
get_driver_pages in the file/proc/kallsyms.  If it is there, then the Linux 
kernel is updated correctly.
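
For example (a concrete form of that check, assuming the usual /proc layout):

grep get_driver_pages /proc/kallsyms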

The GPU Direct functioning should be independent of the MPI you are using.

Rolf  


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Brice Goglin
Sent: Monday, February 28, 2011 11:42 AM
To: us...@open-mpi.org
Subject: Re: [OMPI users] anybody tried OMPI with gpudirect?

On 28/02/2011 17:30, Rolf vandeVaart wrote:
> Hi Brice:
> Yes, I have tried OMPI 1.5 with gpudirect and it worked for me.  You 
> definitely need the patch or you will see the behavior just as you described, 
> a hang. One thing you could try is disabling the large message RDMA in OMPI 
> and see if that works.  That can be done by adjusting the openib BTL flags.
>
> -- mca btl_openib_flags 304
>
> Rolf 
>   

Thanks Rolf. Adding this mca parameter worked-around the hang indeed.
The kernel is supposed to be properly patched for gpudirect. Are you
aware of anything else we might need to make this work? Do we need to
rebuild some OFED kernel modules for instance?

Also, is there any reliable/easy way to check if gpudirect works in our
kernel ? (we had to manually fix the gpudirect patch for SLES11).

Brice

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] anybody tried OMPI with gpudirect?

2011-02-28 Thread Rolf vandeVaart
Hi Brice:
Yes, I have tried OMPI 1.5 with gpudirect and it worked for me.  You definitely 
need the patch or you will see the behavior just as you described, a hang. One 
thing you could try is disabling the large message RDMA in OMPI and see if that 
works.  That can be done by adjusting the openib BTL flags.

-- mca btl_openib_flags 304

Rolf 

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Brice Goglin
Sent: Monday, February 28, 2011 11:16 AM
To: us...@open-mpi.org
Subject: [OMPI users] anybody tried OMPI with gpudirect?

Hello,

I am trying to play with nvidia's gpudirect. The test program given with the 
gpudirect tarball just does a basic MPI ping-pong between two process that 
allocated their buffers with cudaHostMalloc instead of malloc. It seems to work 
with Intel MPI but Open MPI 1.5 hangs in the first MPI_Send. Replacing the cuda 
buffer with a normally-malloc'ed buffer makes the program work again. I assume 
that something goes wrong when OMPI tries to register/pin the cuda buffer in 
the IB stack (that's what gpudirect seems to be about), but I don't see why 
Intel MPI would succeed there.

Has anybody ever looked at this?

FWIW, we're using OMPI 1.5, OFED 1.5.2, Intel MPI 4.0.0.28 and SLES11 w/ and 
w/o the gpudirect patch.

Thanks
Brice Goglin

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] One-sided datatype errors

2010-12-14 Thread Rolf vandeVaart

Hi James:
I can reproduce the problem on a single node with Open MPI 1.5 and the 
trunk.  I have submitted a ticket with the information.

https://svn.open-mpi.org/trac/ompi/ticket/2656

Rolf

On 12/13/10 18:44, James Dinan wrote:

Hi,

I'm getting strange behavior using datatypes in a one-sided 
MPI_Accumulate operation.


The attached example performs an accumulate into a patch of a shared 
2d matrix.  It uses indexed datatypes and can be built with 
displacement or absolute indices (hindexed) - both cases fail.  I'm 
seeing data validation errors, hanging, or other erroneous behavior 
under OpenMPI 1.5 on Infiniband.  The example works correctly under 
the current release of MVAPICH on IB and under MPICH on shared memory.


Any help would be greatly appreciated.

Best,
 ~Jim.


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] [Rocks-Discuss] compiling Openmpi on solaris studio express

2010-11-29 Thread Rolf vandeVaart
No, I do not believe so.  First, I assume you are trying to build either 
1.4 or 1.5, not the trunk.
Secondly, I assume you are building from a tarfile that you have 
downloaded.  Assuming these
two things are true, then (as stated in the bug report), prior to 
running configure, you want to
make the following edits to config/libtool.m4 in all the places you see 
it. ( I think just one place)


FROM:

   *Sun\ F*)
 # Sun Fortran 8.3 passes all unrecognized flags to the linker
 _LT_TAGVAR(lt_prog_compiler_pic, $1)='-KPIC'
 _LT_TAGVAR(lt_prog_compiler_static, $1)='-Bstatic'
 _LT_TAGVAR(lt_prog_compiler_wl, $1)=''
 ;;

TO:

   *Sun\ F*)
 # Sun Fortran 8.3 passes all unrecognized flags to the linker
 _LT_TAGVAR(lt_prog_compiler_pic, $1)='-KPIC'
 _LT_TAGVAR(lt_prog_compiler_static, $1)='-Bstatic'
 _LT_TAGVAR(lt_prog_compiler_wl, $1)='-Qoption ld '
 ;;



Note the difference in the lt_prog_compiler_wl line. 

Then, you need to run ./autogen.sh.  Then, redo your configure but you 
do not need to do anything
with LDFLAGS.  Just use your original flags.  I think this should work, 
but I am only reading

what is in the ticket.

Rolf


On 11/29/10 16:26, Nehemiah Dacres wrote:

that looks about right. So the suggestion:

./configure LDFLAGS="-notpath ... ... ..."

-notpath should be replaced by whatever the proper flag should be, in my case -L ? 

  

On Mon, Nov 29, 2010 at 3:16 PM, Rolf vandeVaart 
<rolf.vandeva...@oracle.com> wrote:


This problem looks a lot like a thread from earlier today.  Can
you look at this
ticket and see if it helps?  It has a workaround documented in it.

https://svn.open-mpi.org/trac/ompi/ticket/2632

Rolf


On 11/29/10 16:13, Prentice Bisbal wrote:

No, it looks like ld is being called with the option -path, and your
linker doesn't use that switch. Grep you Makefile(s) for the string
"-path". It's probably in a statement defining LDFLAGS somewhere.

When you find it, replace it with the equivalent switch for your
compiler. You may be able to override it's value on the configure
command-line, which is usually easiest/best:

./configure LDFLAGS="-notpath ... ... ..."

--
Prentice


Nehemiah Dacres wrote:
  

it may have been that  I didn't set ld_library_path

On Mon, Nov 29, 2010 at 2:36 PM, Nehemiah Dacres <dacre...@slu.edu> wrote:

thank you, you have been doubly helpful, but I am having linking
errors and I do not know what the solaris studio compiler's
preferred linker is. The

the configure statement was

./configure --prefix=/state/partition1/apps/sunmpi/
--enable-mpi-threads --with-sge --enable-static
--enable-sparse-groups CC=/opt/oracle/solstudio12.2/bin/suncc
CXX=/opt/oracle/solstudio12.2/bin/sunCC
F77=/opt/oracle/solstudio12.2/bin/sunf77
FC=/opt/oracle/solstudio12.2/bin/sunf90

   compile statement was

make all install 2>errors


error below is

f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -soname passed to ld, if ld is invoked, ignored
otherwise
/usr/bin/ld: unrecognized option '-path'
/usr/bin/ld: use the --help option for usage information
make[4]: *** [libmpi_f90.la] Error 2
make[3]: *** [all-recursive] Error 1
make[2]: *** [all] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

am I doing this wrong? are any of those configure flags unnecessary
or inappropriate



On Mon, Nov 29, 2010 at 2:06 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

Nehemiah Dacres wrote:

I want to compile openmpi to work with the solaris studio
express  or
solaris studio. This is a different version than is installed on
rockscluster 5.2  and would like to know if there any
gotchas or configure
flags I should use to get it working or portable to nodes on
the cluster.
Software-wise,  it is a fairly homogeneous environment with
only slight
variations on the hardware side which could be isolated

Re: [OMPI users] [Rocks-Discuss] compiling Openmpi on solaris studio express

2010-11-29 Thread Rolf vandeVaart
This problem looks a lot like a thread from earlier today.  Can you look 
at this ticket and see if it helps?  It has a workaround documented in it.

https://svn.open-mpi.org/trac/ompi/ticket/2632

Rolf

On 11/29/10 16:13, Prentice Bisbal wrote:

No, it looks like ld is being called with the option -path, and your
linker doesn't use that switch. Grep you Makefile(s) for the string
"-path". It's probably in a statement defining LDFLAGS somewhere.

When you find it, replace it with the equivalent switch for your
compiler. You may be able to override it's value on the configure
command-line, which is usually easiest/best:

./configure LDFLAGS="-notpath ... ... ..."

--
Prentice


Nehemiah Dacres wrote:
  

it may have been that  I didn't set ld_library_path

On Mon, Nov 29, 2010 at 2:36 PM, Nehemiah Dacres <dacre...@slu.edu> wrote:

thank you, you have been doubly helpful, but I am having linking
errors and I do not know what the solaris studio compiler's
preferred linker is. The

the configure statement was

./configure --prefix=/state/partition1/apps/sunmpi/
--enable-mpi-threads --with-sge --enable-static
--enable-sparse-groups CC=/opt/oracle/solstudio12.2/bin/suncc
CXX=/opt/oracle/solstudio12.2/bin/sunCC
F77=/opt/oracle/solstudio12.2/bin/sunf77
FC=/opt/oracle/solstudio12.2/bin/sunf90

   compile statement was

make all install 2>errors


error below is

f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -soname passed to ld, if ld is invoked, ignored
otherwise
/usr/bin/ld: unrecognized option '-path'
/usr/bin/ld: use the --help option for usage information
make[4]: *** [libmpi_f90.la ] Error 2
make[3]: *** [all-recursive] Error 1
make[2]: *** [all] Error 2
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

am I doing this wrong? are any of those configure flags unnecessary
or inappropriate



On Mon, Nov 29, 2010 at 2:06 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:

Nehemiah Dacres wrote:

I want to compile openmpi to work with the solaris studio
express  or
solaris studio. This is a different version than is installed on
rockscluster 5.2  and would like to know if there any
gotchas or configure
flags I should use to get it working or portable to nodes on
the cluster.
Software-wise,  it is a fairly homogeneous environment with
only slight
variations on the hardware side which could be isolated
(machinefile flag
and what-not)
Please advise


Hi Nehemiah
I just answered your email to the OpenMPI list.
I want to add that if you build OpenMPI with Torque support,
the machine file for each is not needed, it is provided by Torque.
I believe the same is true for SGE (but I don't use SGE).
Gus Correa




-- 
Nehemiah I. Dacres
System Administrator 
Advanced Technology Group Saint Louis University





--
Nehemiah I. Dacres
System Administrator 
Advanced Technology Group Saint Louis University





___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  




Re: [OMPI users] Test Program works on 1, 2 or 3 nodes. Hangs on 4 or more nodes.

2010-09-21 Thread Rolf vandeVaart

Ethan:

Can you run just "hostname" successfully?  In other words, a non-MPI 
program.
If that does not work, then we know the problem is in the runtime.  If 
it does work, then something is wrong with the way the MPI library is 
setting up its connections.
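
For example (just a sketch, reusing the host names from the report below):

mpirun -host merope,electra,atlas -np 3 hostname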


Is there more than one interface on the nodes?

Rolf

On 09/21/10 14:41, Ethan Deneault wrote:

Prentice Bisbal wrote:



I'm assuming you already tested ssh connectivity and verified everything
is working as it should. (You did test all that, right?)


Yes. I am able to log in remotely to all nodes from the master, and to 
each node from each node without a password. Each node mounts the same 
/home directory from the master, so they have the same copy of all the 
ssh and rsh keys.



This sounds like configuration problem on one of the nodes, or a problem
with ssh. I suspect it's not a problem with the number of processes, but
  whichever node is the 4th in your machinefile has a connectivity or
configuration issue:

I would try the following:

1. reorder the list of hosts in your machine file.

> 3. Change your machinefile to include 4 completely different hosts.

This does not seem to have any beneficial effect.

The test program run from the master (pleiades) with any combination 
of 3 other nodes hangs during communication. This includes not using 
--machinefile and using -host; i.e.


$ mpirun -host merope,electra,atlas -np 4 ./test.out (hangs)
$ mpirun -host merope,electra,atlas -np 3 ./test.out (hangs)
$ mpirun -host merope,electra -np 3 ./test.out
 node   1 : Hello world
 node   0 : Hello world
 node   2 : Hello world


2. Run the mpirun command from a different host. I'd try running it from
several different hosts.


The mpirun command does not seem to work when launched from one of the 
nodes. As an example:


Running on node asterope:

asterope$ mpirun -debug-daemons -host atlas,electra -np 4 ./test.out

Daemon was launched on atlas - beginning to initialize
Daemon was launched on electra - beginning to initialize
Daemon [[54956,0],1] checking in as pid 2716 on host atlas
Daemon [[54956,0],1] not using static ports
Daemon [[54956,0],2] checking in as pid 2741 on host electra
Daemon [[54956,0],2] not using static ports

(hangs)


I think someone else recommended that you should be specifying the
number of process with -np. I second that.

If the above fails, you might want to post your machine file your using.


The machine file is a simple list of hostnames, as an example:

m43
taygeta
asterope



Cheers,
Ethan





Re: [OMPI users] [openib] segfault when using openib btl

2010-07-13 Thread Rolf vandeVaart

Hi Eloi:
To select the different bcast algorithms, you need to add an extra mca 
parameter that tells the library to use dynamic selection.

--mca coll_tuned_use_dynamic_rules 1

One way to make sure you are typing this in correctly is to use it with 
ompi_info.  Do the following:

ompi_info -mca coll_tuned_use_dynamic_rules 1 --param coll

You should see lots of output with all the different algorithms that can 
be selected for the various collectives.

Therefore, you need this:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_bcast_algorithm 1

Rolf

On 07/13/10 11:28, Eloi Gaudry wrote:

Hi,

I've found that "--mca coll_tuned_bcast_algorithm 1" allowed to switch to the 
basic linear algorithm.
Anyway whatever the algorithm used, the segmentation fault remains.

Does anyone could give some advice on ways to diagnose the issue I'm facing ?

Regards,
Eloi


On Monday 12 July 2010 10:53:58 Eloi Gaudry wrote:
  

Hi,

I'm focusing on the MPI_Bcast routine that seems to randomly segfault when
using the openib btl. I'd like to know if there is any way to make OpenMPI
switch to a different algorithm than the default one being selected for
MPI_Bcast.

Thanks for your help,
Eloi

On Friday 02 July 2010 11:06:52 Eloi Gaudry wrote:


Hi,

I'm observing a random segmentation fault during an internode parallel
computation involving the openib btl and OpenMPI-1.4.2 (the same issue
can be observed with OpenMPI-1.3.3).

   mpirun (Open MPI) 1.4.2
   Report bugs to http://www.open-mpi.org/community/help/
   [pbn08:02624] *** Process received signal ***
   [pbn08:02624] Signal: Segmentation fault (11)
   [pbn08:02624] Signal code: Address not mapped (1)
   [pbn08:02624] Failing at address: (nil)
   [pbn08:02624] [ 0] /lib64/libpthread.so.0 [0x349540e4c0]
   [pbn08:02624] *** End of error message ***
   sh: line 1:  2624 Segmentation fault

\/share\/hpc3\/actran_suite\/Actran_11\.0\.rc2\.41872\/RedHatEL\-5\/x86_6
4\ /bin\/actranpy_mp
'--apl=/share/hpc3/actran_suite/Actran_11.0.rc2.41872/RedHatEL-5/x86_64/A
c tran_11.0.rc2.41872'
'--inputfile=/work/st25652/LSF_130073_0_47696_0/Case1_3Dreal_m4_n2.dat'
'--scratch=/scratch/st25652/LSF_130073_0_47696_0/scratch' '--mem=3200'
'--threads=1' '--errorlevel=FATAL' '--t_max=0.1' '--parallel=domain'

If I choose not to use the openib btl (by using --mca btl self,sm,tcp on
the command line, for instance), I don't encounter any problem and the
parallel computation runs flawlessly.

I would like to get some help to be able:
- to diagnose the issue I'm facing with the openib btl
- understand why this issue is observed only when using the openib btl
and not when using self,sm,tcp

Any help would be very much appreciated.

The outputs of ompi_info and the configure scripts of OpenMPI are
enclosed to this email, and some information on the infiniband drivers
as well.

Here is the command line used when launching a parallel computation

using infiniband:
   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca

btl openib,sm,self,tcp  --display-map --verbose --version --mca
mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]

and the command line used if not using infiniband:
   path_to_openmpi/bin/mpirun -np $NPROCESS --hostfile host.list --mca

btl self,sm,tcp  --display-map --verbose --version --mca
mpi_warn_on_fork 0 --mca btl_openib_want_fork_support 0 [...]

Thanks,
Eloi
  


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users
  




[OMPI users] Leftover session directories [was sm btl choices]

2010-03-01 Thread Rolf Vandevaart

On 03/01/10 11:51, Ralph Castain wrote:

On Mar 1, 2010, at 8:41 AM, David Turner wrote:


On 3/1/10 1:51 AM, Ralph Castain wrote:

Which version of OMPI are you using? We know that the 1.2 series was unreliable 
about removing the session directories, but 1.3 and above appear to be quite 
good about it. If you are having problems with the 1.3 or 1.4 series, I would 
definitely like to know about it.

Oops; sorry!  OMPI 1.4.1, compiled with PGI 10.0 compilers,
running on Scientific Linux 5.4, ofed 1.4.2.

The session directories are *frequently* left behind.  I have
not really tried to characterize under what circumstances they
are removed. But please confirm:  they *should* be removed by
OMPI.


Most definitely - they should always be removed by OMPI. This is the first 
report we have had of them -not- being removed in the 1.4 series, so it is 
disturbing.

What environment are you running under? Does this happen under normal 
termination, or under abnormal failures (the more you can tell us, the better)?




Hi Ralph:

It turns out that I am seeing session directories left behind as well 
with v1.4 (r22713)  I have not tested any other versions.  I believe 
there are two elements that make this reproducible.

1. Run across 2 or more nodes.
2. CTRL-C out of the MPI job.

Then take a look at the remote nodes and you may see a leftover session 
directory.  The mpirun node seems to be clean.


Here is an example using two nodes.  I also added some sleeps to the 
ring_c program to slow things down so I could hit CTRL-C.


First, tmp directories are empty:
[rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.
[rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
ls: No match.

Now run test:
[rolfv@burl-ct-x2200-6 ~/examples]$ mpirun -np 4 -host 
burl-ct-x2200-6,burl-ct-x2200-6,burl-ct-x2200-7,burl-ct-x2200-7 ring_slow_c

Process 0 sending 10 to 1, tag 201 (4 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
mpirun: killing job...

--
mpirun noticed that process rank 0 with PID 3002 on node burl-ct-x2200-6 
exited on signal 0 (Unknown signal 0).

--
4 total processes killed (some possibly by mpirun during cleanup)
mpirun: clean termination accomplished

[burl-ct-x2200-6:02990] 2 more processes have sent help message 
help-mpi-btl-openib.txt / default subnet prefix


Now check tmp directories:
[rolfv@burl-ct-x2200-6 ~/examples]$ ls -lt /tmp/openmpi-sessions-rolfv* 
ls: No match.

[rolfv@burl-ct-x2200-7 ~]$ ls -lt /tmp/openmpi-sessions-rolfv*
total 8
drwx-- 3 rolfv hpcgroup 4096 Mar  1 17:27 20007/

Rolf

--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] Bad MPI_Bcast behaviour when running over openib

2009-09-11 Thread Rolf Vandevaart
Hi, how exactly do you run this to get this error?  I tried and it 
worked for me.


burl-ct-x2200-16 50 =>mpirun -mca btl_openib_warn_default_gid_prefix 0 
-mca btl self,sm,openib -np 2 -host burl-ct-x2200-16,burl-ct-x2200-17 
-mca btl_openib_ib_timeout 16 a.out

I am 0 at 1252670691
I am 1 at 1252670559
I am 0 at 1252670692
I am 1 at 1252670559
 burl-ct-x2200-16 51 =>

Rolf

On 09/11/09 07:18, Ake Sandgren wrote:

Hi!

The following code shows a bad behaviour when running over openib.

Openmpi: 1.3.3
With openib it dies with "error polling HP CQ with status WORK REQUEST
FLUSHED ERROR status number 5 ", with tcp or shmem it works as expected.


#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
int  rank;
int  n;

MPI_Init( &argc, &argv );

MPI_Comm_rank( MPI_COMM_WORLD, &rank );

fprintf(stderr, "I am %d at %d\n", rank, time(NULL));
fflush(stderr);

n = 4;
MPI_Bcast(&n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD);
fprintf(stderr, "I am %d at %d\n", rank, time(NULL));
fflush(stderr);
if (rank == 0) {
sleep(60);
}
MPI_Barrier(MPI_COMM_WORLD);

MPI_Finalize( );
exit(0);
}

I know about the internal openmpi reason for it do behave as it does.
But i think that it should be allowed to behave as it does.

This example is a bit engineered but there are codes where a similar
situation can occur, i.e. the Bcast sender doing lots of other work
after the Bcast before the next MPI call. VASP is a candidate for this.




--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] an MPI process using about 12 file descriptors per neighbour processes - isn't it a bit too much?

2009-08-14 Thread Rolf Vandevaart

Hi Paul:
I tried running the same way as you did and I saw the same thing.  I 
was using ClusterTools 8.2 (Open MPI 1.3.3r21324) and running on 
Solaris.  I looked at the mpirun process and it was definitely consuming 
approximately 12 file descriptors per a.out process.


 burl-ct-v440-0 59 =>limit descriptors
descriptors 1024
 burl-ct-v440-0 60 =>mpirun -np 84 a.out
Connectivity test on 84 processes PASSED.
 burl-ct-v440-0 61 =>mpirun -np 85 a.out
[burl-ct-v440-0:27083] [[38835,0],0] ORTE_ERROR_LOG: The system limit on 
number of network connections a process can open was reached in file 
oob_tcp.c at line 446

--
Error: system limit exceeded on number of network connections that can 
be open


This can be resolved by setting the mca parameter 
opal_set_max_sys_limits to 1,

increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--
 burl-ct-v440-0 62 =>

This should not be happening.  I will try and look to see what is going 
on.  The process that is complaining is the mpirun process which in this 
scenario forks/execs all the a.outs.


Rolf

On 08/14/09 08:52, Paul Kapinos wrote:

Hi OpenMPI folks,

We use Sun MPI (Cluster Tools 8.2) and also native OpenMPI 1.3.3 and we 
wonder us about the way OpenMPI devours file descriptors: on our 
computers, ulimit -n is currently set to 1024, and we found out that we 
may run maximally 84 MPI processes per box, and if we try to run 85 (or 
above) processes, we got such error message:


--
Error: system limit exceeded on number of network connections that can 
be open

.
--

Simple computing tells us, 1024/85 is about 12. This lets us believe 
that there is an single OpenMPI process, which needs 12 file descriptor 
per other MPI process.


By now, we have only one box with more than 100 CPUs on which it may be 
meaningfull to run more than 85 processes. But in the quite near future, 
many-core boxes are arising (we also ordered 128-way nehalems), so it 
may be disadvantageous to consume a lot of file descriptors per MPI 
process.



We see a possibility to awod this problem by setting the ulimit for file 
descriptor to a higher value.  This is not easy unter linux: you need 
either to recompile the kernel (which is not a choise for us), or to set 
a root process somewhere which will set the ulimit to a higher value 
(which is a security risk and not easy to implement).


We also tryed to set the opal_set_max_sys_limits to 1, as the help says 
(by adding  "-mca opal_set_max_sys_limits 1" to the command line), but 
we does not see any change of behaviour).


What is your meaning?

Best regards,
Paul Kapinos
RZ RWTH Aachen



#
 /opt/SUNWhpc/HPC8.2/intel/bin/mpiexec -mca opal_set_max_sys_limits 1 
-np 86   a.out


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] pipes system limit

2009-08-07 Thread Rolf Vandevaart
This message is telling you that you have run out of file descriptors. 
I am surprised that the -mca parameter setting did not fix the problem.
Can you run limit or ulimit on your shell and send the information?  I 
typically set my limit to 65536 assuming the system allows it.


burl-16 58 =>limit descriptors
descriptors 65536
burl-16 59 =>

bash-3.00$ ulimit -n
65536
bash-3.00$


Rolf

On 08/07/09 11:21, Yann JOBIC wrote:

Hello all,

I'm using hpc8.2 :
Lidia-jobic% ompi_info
Displaying Open MPI information for 32-bit ...
Package: ClusterTools 8.2
   Open MPI: 1.3.3r21324-ct8.2-b09j-r40
[...]

And i've got a X4600 machine (8*4 cores).

When i'm trying to run a 32 processor jobs, i've got :

Lidia-jobic% mpiexec --mca opal_set_max_sys_limits 1 -n 32 ./exe
[Lidia:29384] [[61597,0],0] ORTE_ERROR_LOG: The system limit on number 
of pipes a process can open was reached in file base/iof_base_setup.c at 
line 112
[Lidia:29384] [[61597,0],0] ORTE_ERROR_LOG: The system limit on number 
of pipes a process can open was reached in file odls_default_module.c at 
line 203
[Lidia:29384] [[61597,0],0] ORTE_ERROR_LOG: The system limit on number 
of network connections a process can open was reached in file oob_tcp.c 
at line 446

--
Error: system limit exceeded on number of network connections that can 
be open


This can be resolved by setting the mca parameter 
opal_set_max_sys_limits to 1,

increasing your limit descriptor setting (using limit or ulimit commands),
or asking the system administrator to increase the system limit.
--

I tried the ulimit, the mca parameter, i've got no idea of where to look 
at.

I've got the same computer under linux, and it's working fine...

Have you seen it ?
Do you know a way to bypass it ?

Many thanks,

Yann





--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] problem w sge 6.2 & openmpi

2009-08-05 Thread Rolf Vandevaart
I assume it is working with np=8 because the 8 processes are getting 
launched on the same node as mpirun and therefore there is no call to 
qrsh to start up any remote processes.  When you go beyond 8, mpirun 
calls qrsh to start up processes on some of the remote nodes.


I would suggest first that you replace your MPI program with just 
hostname to simplify debug.  Then maybe you can forward along your qsub 
script as well as what your PE environment looks like (qconf -sp PE_NAME 
--- where PE_NAME is the name of your parallel environment).
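
For example (a sketch based on the qsub script quoted below; PE_NAME is a 
placeholder for your actual parallel environment name):

/opt/openmpi/bin/mpirun -np $NSLOTS hostname
qconf -sp PE_NAME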


Rolf

Eli Morris wrote:

Hi guys,

I'm trying to run an example program, mpi-ring, on a rocks cluster. 
When launched via sge with 8 processors (we have 8 procs per node), 
the program works fine, but with any more processors and the program 
fails.
I'm using open-mpi 1.3.2, included below, at end of post, is output of 
ompi_info -all


Any help with this vexing problem is appreciated.

thanks,

Eli

[emorris@nimbus ~/test]$ echo $LD_LIBRARY_PATH
/opt/openmpi/lib:/lib:/usr/lib:/share/apps/sunstudio/rtlibs
[emorris@nimbus ~/test]$ echo $PATH
/opt/openmpi/bin:/share/apps/sunstudio/bin:/opt/ncl/bin:/home/tobrien/scripts:/usr/java/latest/bin:/opt/local/grads/bin:/share/apps/openmpilib/bin:/opt/local/ncl/ncl/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/opt/gridengine/bin/lx26-amd64:/usr/kerberos/bin:/opt/gridengine/bin/lx26-amd64:/usr/java/latest/bin:/usr/local/bin:/bin:/usr/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin:/opt/eclipse:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/maven/bin:/opt/openmpi/bin/:/opt/rocks/bin:/opt/rocks/sbin:/home/emorris/.sage/bin 


[emorris@nimbus ~/test]$

Here is the mpirun command from the script:

/opt/openmpi/bin/mpirun --debug-daemons --mca plm_base_verbose 40 -mca 
plm_rsh_agent ssh -np $NSLOTS $HOME/test/mpi-ring


Here is the verbose output of a successful program start and failure:



Success:

[root@nimbus test]# more mpi-ring.qsub.o246
[compute-0-11.local:32126] mca: base: components_open: Looking for plm 
components
[compute-0-11.local:32126] mca: base: components_open: opening plm 
components
[compute-0-11.local:32126] mca: base: components_open: found loaded 
component rsh
[compute-0-11.local:32126] mca: base: components_open: component rsh 
has no register function
[compute-0-11.local:32126] mca: base: components_open: component rsh 
open function successful
[compute-0-11.local:32126] mca: base: components_open: found loaded 
component slurm
[compute-0-11.local:32126] mca: base: components_open: component slurm 
has no register function
[compute-0-11.local:32126] mca: base: components_open: component slurm 
open function successful

[compute-0-11.local:32126] mca:base:select: Auto-selecting plm components
[compute-0-11.local:32126] mca:base:select:(  plm) Querying component 
[rsh]
[compute-0-11.local:32126] [[INVALID],INVALID] plm:rsh: using 
/opt/gridengine/bin/lx26-amd64/qrsh for launching
[compute-0-11.local:32126] mca:base:select:(  plm) Query of component 
[rsh] set priority to 10
[compute-0-11.local:32126] mca:base:select:(  plm) Querying component 
[slurm]
[compute-0-11.local:32126] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a module
[compute-0-11.local:32126] mca:base:select:(  plm) Selected component 
[rsh]

[compute-0-11.local:32126] mca: base: close: component slurm closed
[compute-0-11.local:32126] mca: base: close: unloading component slurm
[compute-0-11.local:32126] [[22715,0],0] node[0].name compute-0-11 
daemon 0 arch ffc91200
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received 
add_local_procs
[compute-0-11.local:32126] [[22715,0],0] orted_recv: received 
sync+nidmap from local proc [[22715,1],1]
[compute-0-11.local:32126] [[22715,0],0] orted_recv: received 
sync+nidmap from local proc [[22715,1],0]
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received 
collective data cmd
[compute-0-11.local:32126] [[22715,0],0] orted_cmd: received 
collective data cmd

.
.
.

failure:

[root@nimbus test]# more mpi-ring.qsub.o244
[compute-0-14.local:31175] mca:base:select:(  plm) Querying component 
[rsh]
[compute-0-14.local:31175] [[INVALID],INVALID] plm:rsh: using 
/opt/gridengine/bin/lx26-amd64/qrsh for launc

hing
[compute-0-14.local:31175] mca:base:select:(  plm) Query of component 
[rsh] set priority to 10
[compute-0-14.local:31175] mca:base:select:(  plm) Querying component 
[slurm]
[compute-0-14.local:31175] mca:base:select:(  plm) Skipping component 
[slurm]. Query failed to return a mod

ule
[compute-0-14.local:31175] mca:base:select:(  plm) Selected component 
[rsh]

Starting server daemon at host "compute-0-6.local"
Server daemon successfully started with task id "1.compute-0-6"
error: error: ending connection before all data received
error:
error reading job context from "qlogin_starter"
-- 


Re: [OMPI users] Problem launching jobs in SGE (with loose integration), OpenMPI 1.3.3

2009-07-23 Thread Rolf Vandevaart

I think what you are looking for is this:

--mca plm_rsh_disable_qrsh 1

This means we will disable the use of qrsh and use rsh or ssh instead.

The --mca pls ^sge does not work anymore for two reasons.  First, the 
"pls" framework was renamed "plm".  Secondly, the gridengine plm was 
folded into the rsh/ssh one.


A few more details at
http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
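
For example (a sketch only, applied to the mpirun line quoted below):

$OMPI/bin/mpirun -mca btl openib,sm,self -mca plm_rsh_disable_qrsh 1 \
    -machinefile $MACHINE_FILE -x LD_LIBRARY_PATH -np 16 ./xhpl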

Rolf

On 07/23/09 10:34, Craig Tierney wrote:

I have built OpenMPI 1.3.3 without support for SGE.
I just want to launch jobs with loose integration right
now.

Here is how I configured it:

./configure CC=pgcc CXX=pgCC F77=pgf90 F90=pgf90 FC=pgf90 
--prefix=/opt/openmpi/1.3.3-pgi --without-sge
 --enable-io-romio --with-openib=/opt/hjet/ofed/1.4.1 
--with-io-romio-flags=--with-file-system=lustre 
--enable-orterun-prefix-by-default


I can start jobs from the commandline just fine.  When
I try to do the same thing inside an SGE job, I get
errors like the following:


error: executing task of job 5041155 failed:
--
A daemon (pid 13324) died unexpectedly with status 1 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished


I am starting mpirun with the following options:

$OMPI/bin/mpirun -mca btl openib,sm,self --mca pls ^sge \
-machinefile $MACHINE_FILE -x LD_LIBRARY_PATH -np 16 ./xhpl

The options are to ensure I am using IB, that SGE is not used, and that
the LD_LIBRARY_PATH is sent along to ensure dynamic linking is done 
correctly.


This worked with 1.2.7 (except setting the pls option as gridengine 
instead of sge), but I can't get it to work with 1.3.3.


Am I missing something obvious for getting jobs with loose integration
started?

Thanks,
Craig

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] selectively bind MPI to one HCA out of available ones

2009-07-15 Thread Rolf Vandevaart
As Lenny said, you should use the if_include parameter.  Specifically, 
it would look like this depending on which ones you want to select.


-mca btl_openib_if_include mthca0

or

-mca btl_openib_if_include mthca1

Rolf

On 07/15/09 09:33, nee...@crlindia.com wrote:


Thanks Ralph,

i found the mca parameter. It is btl_openib_max_btls which 
controls the available HCAs.


Thanks for helping.

Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634



*Ralph Castain *
Sent by: users-boun...@open-mpi.org

07/15/2009 06:54 PM
Please respond to
Open MPI Users 



To
Open MPI Users 
cc

Subject
	Re: [OMPI users] selectively bind MPI to one HCA out of available   
 ones









Take a look at the output from "ompi_info --params btl openib" and you 
will see the available MCA params to direct the openib subsystem. I 
believe you will find that you can indeed specify the interface.



On Wed, Jul 15, 2009 at 7:15 AM, <neeraj@crlindia.com> wrote:


Hi all,

I have a cluster where both HCA's of blade are active, but 
connected to different subnet.
Is there an option in MPI to select one HCA out of available 
one's? I know it can be done by making changes in openmpi code, but i 
need clean interface like option during mpi launch time to select mthca0 
or mthca1?


Any help is appreciated. Btw i just checked Mvapich and feature 
is there inside.


Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634













--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] OpenMPI and SGE

2009-06-23 Thread Rolf Vandevaart

Ray Muno wrote:

Rolf Vandevaart wrote:

Ray Muno wrote:

We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE.  MPI communication is over InfiniBand.

We also have OpenMPI 1.3 installed and receive similar errors.

This does sound like a problem with SGE.  By default, we use qrsh to
start the jobs on all the remote nodes.  I believe that is the command
that is failing.  There are two things you can try to get more info
depending on the version of Open MPI.   With version 1.2, you can try
this to get more information.

--mca pls_gridengine_verbose 1



This did not look like it gave me any more info.

  

With Open MPI 1.3.2 and later the verbose flag will not help.  But
instead, you can disable the use of qrsh and instead use rsh/ssh to
start the remote jobs.

--mca plm_rsh_disable_qrsh 1




That gives me

PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
environment variable: MPIRUN_RANK
PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
Missing required environment variable: MPIRUN_RANK
  
I do not recognize these errors as part of Open MPI.  A google search 
showed they might be coming from MVAPICH.  Is there a chance we are 
using Open MPI to launch the jobs (via Open MPI mpirun) but we are 
actually launching an application that is linked to MVAPICH?
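
One quick way to check that (a sketch, assuming a dynamically linked
binary; ./a.out stands in for the real application name):

  which mpirun               # confirm which mpirun is first in the PATH
  ldd ./a.out | grep -i mpi  # see which MPI libraries the binary pulls in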


--

=
rolf.vandeva...@sun.com
781-442-3043
=



Re: [OMPI users] OpenMPI and SGE

2009-06-23 Thread Rolf Vandevaart

Ray Muno wrote:

We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE.  MPI communication is over InfiniBand.

We also have OpenMPI 1.3 installed and receive similar errors.

This does sound like a problem with SGE.  By default, we use qrsh to 
start the jobs on all the remote nodes.  I believe that is the command 
that is failing.  There are two things you can try to get more info 
depending on the version of Open MPI.   With version 1.2, you can try 
this to get more information.


--mca pls_gridengine_verbose 1

With Open MPI 1.3.2 and later the verbose flag will not help.  But 
instead, you can disable the use of qrsh and instead use rsh/ssh to 
start the remote jobs.


--mca plm_rsh_disable_qrsh 1

Maybe trying one or both of these might provide some extra clues.
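
For instance, on 1.3.2 and later a full launch line with that option
might look like this (a sketch; the process count and ./a.out are
placeholders):

  mpirun -np 16 -mca plm_rsh_disable_qrsh 1 ./a.out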

Rolf




--

=
rolf.vandeva...@sun.com
781-442-3043
=



Re: [OMPI users] Problem getting OpenMPI to run

2009-06-01 Thread Rolf Vandevaart

On 06/01/09 14:58, Jeff Layton wrote:

Jeff Squyres wrote:

On Jun 1, 2009, at 2:04 PM, Jeff Layton wrote:


error: executing task of job 3084 failed: execution daemon on host
"compute-2-2.local" didn't accept task



This looks like an error message from the resource manager/scheduler 
-- not from OMPI (i.e., OMPI tried to launch a process on a node and 
the launch failed because something rejected it).


Which one are you using?


SGE



Take a look at the following link for some info on SGE.

http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge

I do not know exactly what your error message is telling us, but I would 
first double check to see that you have your parallel environment set up 
similarly to what is shown in the FAQ.
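
A quick way to review the local setup before comparing it against the
FAQ (a sketch; the PE name orte matches the FAQ example and may differ
on your cluster):

  qconf -spl        # list the parallel environments SGE knows about
  qconf -sp orte    # show the settings of one PE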


Rolf



--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] scaling problem with openmpi

2009-05-20 Thread Rolf Vandevaart

The correct MCA parameters are the following:
-mca coll_tuned_use_dynamic_rules 1
-mca coll_tuned_dynamic_rules_filename ./dyn_rules

You can also run the following command:
ompi_info -mca coll_tuned_use_dynamic_rules 1 -param coll tuned
This will give some insight into all the various algorithms that make up 
the tuned collectives.
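
For example, a complete launch line that picks up a rules file named
./dyn_rules could look like this (a sketch; the process count and the
executable name ./a.out are only placeholders):

  mpirun -np 48 \
      -mca coll_tuned_use_dynamic_rules 1 \
      -mca coll_tuned_dynamic_rules_filename ./dyn_rules \
      ./a.out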


If I am understanding what is happening, it looks like the original 
MPI_Alltoall made use of three algorithms.  (You can look in 
coll_tuned_decision_fixed.c)


If message size < 200 or communicator size > 12
  bruck
else if message size < 3000
  basic linear
else
  pairwise
end

With the file Pavel has provided, things have changed to the following
(maybe someone can confirm):


If message size < 8192
  bruck
else
  pairwise
end

Rolf


On 05/20/09 07:48, Roman Martonak wrote:

Many thanks for the highly helpful analysis. Indeed, what Peter says
seems to be precisely the case here. I tried to run the 32 waters test
on 48 cores now, with the original cutoff of 100 Ry and with a slightly
increased one of 110 Ry. Normally, a larger cutoff should obviously take
more time per step. Increasing the cutoff, however, also increases the
size of the data buffer, and it appears to cross the packet size
threshold for different behaviour (the test was run with
openmpi-1.3.2).


cutoff 100 Ry

time per 1 step is 2.869 s

 = ALL TO ALL COMM      151583. BYTES            2211.  =
 = ALL TO ALL COMM      16.741  MB/S           20.020 SEC  =


cutoff 110 Ry

time per 1 step is 1.879 s

 = ALL TO ALL COMM      167057. BYTES            2211.  =
 = ALL TO ALL COMM      43.920  MB/S            8.410 SEC  =


So it actually runs much faster, and ALL TO ALL COMM is 2.6 times
faster. In my case the threshold seems to be somewhere between
167057/48 = 3480 and 151583/48 = 3157 bytes.

I saved the text that Pavel suggested

1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
64 # comm size 8
2 # number of msg sizes
0 3 0 0 # for message size 0, bruck 1, topo 0, 0 segmentation
8192 2 0 0 # 8k+, pairwise 2, no topo or segmentation
# end of first collective

to the file dyn_rules and tried to run, appending the options
"--mca use_dynamic_rules 1 --mca dynamic_rules_filename ./dyn_rules" to
mpirun, but it does not make any change. Is this the correct syntax to
enable the rules?
And will the above sample file shift the threshold to lower values (to
what value)?

Best regards

Roman

On Wed, May 20, 2009 at 10:39 AM, Peter Kjellstrom  wrote:

On Tuesday 19 May 2009, Peter Kjellstrom wrote:

On Tuesday 19 May 2009, Roman Martonak wrote:

On Tue, May 19, 2009 at 3:29 PM, Peter Kjellstrom  wrote:

On Tuesday 19 May 2009, Roman Martonak wrote:
...


openmpi-1.3.2   time per one MD step is 3.66 s
   ELAPSED TIME :0 HOURS  1 MINUTES 25.90 SECONDS
 = ALL TO ALL COMM   102033. BYTES   4221.  =
 = ALL TO ALL COMM 7.802  MB/S  55.200 SEC  =

...


With TASKGROUP=2 the summary looks as follows

...


 = ALL TO ALL COMM   231821. BYTES   4221.  =
 = ALL TO ALL COMM82.716  MB/S  11.830 SEC  =

Wow, according to this it takes 1/5th the time to do the same number (4221)
of alltoalls if the size is (roughly) doubled... (ten times better
performance with the larger transfer size)

Something is not quite right, could you possibly try to run just the
alltoalls like I suggested in my previous e-mail?

I was curious so I ran some tests. First it seems that the size reported by
CPMD is the total size of the data buffer not the message size. Running
alltoalls with 231821/64 and 102033/64 gives this (on a similar setup):

bw for   4221x 1595 B :  36.5 Mbytes/s   time was:  23.3 s
bw for   4221x 3623 B : 125.4 Mbytes/s   time was:  15.4 s
bw for   4221x 1595 B :  36.4 Mbytes/s   time was:  23.3 s
bw for   4221x 3623 B : 125.6 Mbytes/s   time was:  15.3 s

So it does seem that OpenMPI has some problems with small alltoalls. It is
obviously broken when you can get things across faster by sending more...

As a reference I ran with a commercial MPI using the same program and node-set
(I did not have MVAPICH or Intel MPI on this system):

bw for   4221x 1595 B :  71.4 Mbytes/s   time was:  11.9 s
bw for   4221x 3623 B : 125.8 Mbytes/s   time was:  15.3 s
bw for   4221x 1595 B :  71.1 Mbytes/s   time was:  11.9 s
bw for   

Re: [OMPI users] Strange behaviour of SGE+OpenMPI

2009-04-01 Thread Rolf Vandevaart
It turns out that the use of --host and --hostfile acts as a filter for
which nodes to run on when you are running under SGE.  So, listing them 
several times does not affect where the processes land.  However, this 
still does not explain why you are seeing what you are seeing.  One 
thing you can try is to add this to the mpirun command.


 -mca ras_gridengine_verbose 100

This will provide some additional information as to what Open MPI is 
seeing as nodes and slots from SGE.  (Is there any chance that node0002 
actually has 8 slots?)


I just retried on my cluster of 2-CPU SPARC Solaris nodes.  When I run
with np=2, both MPI processes land on a single node, because that node
has two slots.  When I go up to np=4, the additional processes move on
to the other node.  The --host option acts as a filter for where they
should run.


In terms of using "IB bonding", I do not know what that means exactly.
Open MPI does stripe over multiple IB interfaces, so I think the answer
is yes.


Rolf

PS:  Here is what my np=4 job script looked like.  (I just changed np=2 
for the other run)


 burl-ct-280r-0 148 =>more run.sh
#! /bin/bash
#$ -S /bin/bash
#$ -V
#$ -cwd
#$ -N Job1
#$ -pe orte 200
#$ -j y
#$ -l h_rt=00:20:00  # Run time (hh:mm:ss) - 20 min

echo $NSLOTS
/opt/SUNWhpc/HPC8.2/sun/bin/mpirun -mca ras_gridengine_verbose 100 -v 
-np 4 -host burl-ct-280r-1,burl-ct-280r-0 -mca btl self,sm,tcp hostname


Here is the output (somewhat truncated)
 burl-ct-280r-0 150 =>more Job1.o199
200
[burl-ct-280r-2:22132] ras:gridengine: JOB_ID: 199
[burl-ct-280r-2:22132] ras:gridengine: PE_HOSTFILE: 
/ws/ompi-tools/orte/sge/sge6_2u1/default/spool/burl-ct-280r-2/active_jobs/199.1/pe_hostfile

[..snip..]
[burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-0: PE_HOSTFILE shows 
slots=2
[burl-ct-280r-2:22132] ras:gridengine: burl-ct-280r-1: PE_HOSTFILE shows 
slots=2

[..snip..]
burl-ct-280r-1
burl-ct-280r-1
burl-ct-280r-0
burl-ct-280r-0
 burl-ct-280r-0 151 =>


On 03/31/09 22:39, PN wrote:

Dear Rolf,

Thanks for your reply.
I've created another PE and changed the submission script, explicitly
specifying the hostnames with "--host".

However the result is the same.

# qconf -sp orte
pe_name            orte
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE

$ cat hpl-8cpu-test.sge
#!/bin/bash
#
#$ -N HPL_8cpu_GB
#$ -pe orte 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
cd /home/admin/hpl-2.0
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS --host 
node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 
./bin/goto-openmpi-gcc/xhpl



# pdsh -a ps ax --width=200|grep hpl
node0002: 18901 ?S  0:00 /opt/openmpi-gcc/bin/mpirun -v -np 
8 --host 
node0001,node0001,node0001,node0001,node0002,node0002,node0002,node0002 
./bin/goto-openmpi-gcc/xhpl

node0002: 18902 ?RLl0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18903 ?RLl0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18904 ?RLl0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18905 ?RLl0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18906 ?RLl0:29 ./bin/goto-openmpi-gcc/xhpl
node0002: 18907 ?RLl0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18908 ?RLl0:28 ./bin/goto-openmpi-gcc/xhpl
node0002: 18909 ?RLl0:28 ./bin/goto-openmpi-gcc/xhpl

Any hint to debug this situation?

Also, if I have 2 IB ports in each node, on which IB bonding has been
done, will Open MPI automatically benefit from the double bandwidth?


Thanks a lot.

Best Regards,
PN

2009/4/1 Rolf Vandevaart <rolf.vandeva...@sun.com>


On 03/31/09 11:43, PN wrote:

Dear all,

I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2
I have 2 compute nodes for testing, each node has a single quad
core CPU.

Here is my submission script and PE config:
$ cat hpl-8cpu.sge
#!/bin/bash
#
#$ -N HPL_8cpu_IB
#$ -pe mpi-fu 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
cd /home/admin/hpl-2.0
# For IB
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile
$TMPDIR/machines ./bin/goto-openmpi-gcc/xhpl

I've tested the mpirun command can be run correctly in command line.

$ qconf -sp mpi-fu
pe_name            mpi-fu
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE


I've checked the $TMP

Re: [OMPI users] Strange behaviour of SGE+OpenMPI

2009-03-31 Thread Rolf Vandevaart

On 03/31/09 14:50, Dave Love wrote:

Rolf Vandevaart <rolf.vandeva...@sun.com> writes:


However, I found that if I explicitly specify the "-machinefile
$TMPDIR/machines", all 8 mpi processes were spawned within a single
node, i.e. node0002.


I had that sort of behaviour recently when the tight integration was
broken on the installation we'd been given, and it took me a long time
to spot.  [Is the orte_leave_session_attached fix relevant here?]
No, orte_leave_session_attached is needed to avoid the errno=2 errors 
from the sm btl. (It is fixed in 1.3.2 and trunk)



And for what it is worth, as you have seen,
you do not need to specify a machines file.  Open MPI will use the
ones that were allocated by SGE.  


Yes, but there's a problem with the recommended (as far as I remember)
setup, with one slot per node to ensure a single job per node.  In that
case, you have no control over allocation -- -bynode and -byslot are
equivalent, which apparently can badly affect some codes.  We're
currently using a starter to generate a hosts file for that reason
(complicated by having dual- and quad-core nodes) and would welcome a
better idea.

I am not sure what you are asking here.  Are you trying to get a single 
MPI process per node?  You could use -npernode 1.  Sorry for my confusion.
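
As a quick sketch, running something like the following under the
allocation should then start exactly one process per node (hostname is
just a convenient test program):

  mpirun -npernode 1 hostname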


Rolf

--

=
rolf.vandeva...@sun.com
781-442-3043
=


Re: [OMPI users] Strange behaviour of SGE+OpenMPI

2009-03-31 Thread Rolf Vandevaart

On 03/31/09 11:43, PN wrote:

Dear all,

I'm using Open MPI 1.3.1 and SGE 6.2u2 on CentOS 5.2
I have 2 compute nodes for testing, each node has a single quad core CPU.

Here is my submission script and PE config:
$ cat hpl-8cpu.sge
#!/bin/bash
#
#$ -N HPL_8cpu_IB
#$ -pe mpi-fu 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#$ -V
#
cd /home/admin/hpl-2.0
# For IB
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines 
./bin/goto-openmpi-gcc/xhpl


I've tested the mpirun command can be run correctly in command line.

$ qconf -sp mpi-fu
pe_name            mpi-fu
slots              8
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/sge/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/sge/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE


I've checked the $TMPDIR/machines after submit, it was correct.
node0002
node0002
node0002
node0002
node0001
node0001
node0001
node0001

However, I found that if I explicitly specify the "-machinefile 
$TMPDIR/machines", all 8 mpi processes were spawned within a single 
node, i.e. node0002.


However, if I omit "-machinefile $TMPDIR/machines" in the line mpirun, i.e.
/opt/openmpi-gcc/bin/mpirun -v -np $NSLOTS ./bin/goto-openmpi-gcc/xhpl

The mpi processes can start correctly, 4 processes in node0001 and 4 
processes in node0002.


Is this normal behaviour of Open MPI?


I just tried it both ways and I got the same result both times.  The 
processes are split between the nodes.  Perhaps to be extra sure, you 
can just run hostname?  And for what it is worth, as you have seen, you 
do not need to specify a machines file.  Open MPI will use the ones that 
were allocated by SGE.  You can also change your parallel queue to not 
run any scripts.  Like this:


start_proc_args    /bin/true
stop_proc_args     /bin/true
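
One way to make that change, as a sketch (assuming the PE is still
named mpi-fu), is to edit the PE definition in place and set the two
*_proc_args entries to /bin/true:

  qconf -mp mpi-fu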



Also, I wondered: if I have an IB interface, for example where the IB
hostnames are node0001-clust and node0002-clust, will Open MPI
automatically use the IB interface?

Yes, it should use the IB interface.
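
If you want to be explicit about it, you can also name the BTLs on the
launch line, for example (a sketch reusing the paths from your job
script):

  mpirun -np $NSLOTS -mca btl openib,sm,self ./bin/goto-openmpi-gcc/xhpl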


How about if I have 2 IB ports in each node, on which IB bonding has
been done: will Open MPI automatically benefit from the double bandwidth?


Thanks a lot.

Best Regards,
PN







--

=
rolf.vandeva...@sun.com
781-442-3043
=


  1   2   >