Re: [OMPI users] CUDA mpi question

2019-11-28 Thread Justin Luitjens via users
That is not guaranteed to work.  There is no streaming concept in the MPI 
standard.  The fundamental issue here is that MPI is only asynchronous on the 
completion, not the initiation, of the send/recv.

It would be nice if the next version of MPI added something like a 
triggered send or receive that only initiates when it receives a signal saying 
the memory is ready.  This would be vendor neutral and enable things like 
streaming.

For example, at the end of a kernel that creates data, the GPU could poke a 
memory location to signal that the send is ready.  Then the IB device could 
initiate the transfer.
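
To make the idea concrete, here is a rough sketch of that "poke" pattern with 
today's APIs (the kernel names, sizes, stream, and MPI arguments are invented 
for illustration; a real triggered operation would let the NIC react to the 
flag instead of a polling host thread):

    __global__ void poke_ready(volatile int *ready)
    {
        *ready = 1;   /* runs after the data-producing kernel in the same stream */
    }

    /* host side, simplified and without error checking */
    int *ready_h, *ready_d;
    cudaHostAlloc((void **)&ready_h, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&ready_d, ready_h, 0);
    *ready_h = 0;
    produce<<<grid, block, 0, stream>>>(d_data, n);   /* hypothetical producer kernel */
    poke_ready<<<1, 1, 0, stream>>>(ready_d);         /* "pokes" the memory location */
    while (*(volatile int *)ready_h == 0)
        ;                                             /* a progress thread would poll here */
    MPI_Isend(d_data, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &req);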

Sent from my iPhone

On Nov 28, 2019, at 8:21 AM, George Bosilca via users wrote:


Wonderful maybe but extremely unportable. Thanks but no thanks!

  George.

On Wed, Nov 27, 2019 at 11:07 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
Interesting idea, but using MPI_THREAD_MULTIPLE has other side effects. If MPI 
nonblocking calls could take an extra stream argument and work like a kernel 
launch, it would be wonderful.
--Junchao Zhang


On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote:
Why not spawn num_threads threads, where num_threads is the number of kernels 
to launch, and compile with the “--default-stream per-thread” option?

Then you could use MPI in thread multiple mode to achieve your objective.

Something like:



#include <pthread.h>
#include <stdio.h>
#include <mpi.h>

const int N = 1 << 20;   /* problem size per thread; value assumed, not in the original */

/* placeholder kernel; left undefined in the original message */
__global__ void kernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * i;
}

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc(&data, N * sizeof(float));

    /* with --default-stream per-thread this launches on the calling thread's own stream */
    kernel<<<(N + 255) / 256, 256>>>(data, N);

    cudaStreamSynchronize(0);

    MPI_Isend(data, ...);   /* remaining arguments elided in the original message */
    return NULL;
}

int main()
{
    int provided;
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, 0)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }
    cudaDeviceReset();

    MPI_Finalize();
    return 0;
}
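
For completeness, one way this might be compiled (assuming an nvcc plus Open MPI 
toolchain; the file name and MPI_HOME layout are placeholders):

    nvcc --default-stream per-thread -I${MPI_HOME}/include threads.cu \
         -L${MPI_HOME}/lib -lmpi -lpthread -o threads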




From: users <users-boun...@lists.open-mpi.org> On Behalf Of Zhang, Junchao via users
Sent: Wednesday, November 27, 2019 5:43 PM
To: George Bosilca <bosi...@icl.utk.edu>
Cc: Zhang, Junchao <jczh...@mcs.anl.gov>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] CUDA mpi question

I was pointed to "2.7. Synchronization and Memory Ordering" of 
https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf. It is on topic, but 
unfortunately it is too short and I could not understand it.
I also checked cudaStreamAddCallback/cudaLaunchHostFunc, which say the host 
function "must not make any CUDA API calls". I am not sure whether MPI_Isend 
qualifies as such a function.
--Junchao Zhang
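
For reference, a bare-bones sketch of what such a delegated Isend could look 
like with cudaLaunchHostFunc (the context struct and all names are invented 
here; the "must not make any CUDA API calls" restriction above is exactly why 
an MPI_Isend on a device buffer may not be safe inside the callback):

    struct send_ctx { float *buf; int count; int peer; MPI_Request req; };

    void isend_host_fn(void *arg)   /* matches cudaHostFn_t: void (*)(void *) */
    {
        struct send_ctx *c = (struct send_ctx *)arg;
        MPI_Isend(c->buf, c->count, MPI_FLOAT, c->peer, 0, MPI_COMM_WORLD, &c->req);
    }

    /* d_buf, n, peer, and stream are assumed from the surrounding code */
    struct send_ctx ctx = { d_buf, n, peer, MPI_REQUEST_NULL };
    /* ...launch the kernel that fills d_buf on `stream`, then: */
    cudaLaunchHostFunc(stream, isend_host_fn, &ctx);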


On Wed, Nov 27, 2019 at 4:18 PM George Bosilca <bosi...@icl.utk.edu> wrote:
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
Short and portable answer: you need to sync before the Isend or you will send 
garbage data.
Ideally, I want to formulate my code into a series of asynchronous "kernel 
launch, kernel launch, ..." operations without synchronization, so that I can 
hide kernel launch overhead. It now seems I have to sync before MPI calls (even 
nonblocking ones).

Then you need a means to ensure sequential execution, and this is what the 
streams provide. Unfortunately, I looked into the code and I'm afraid there is 
currently no realistic way to do what you need. My previous comment was based 
on older code that seems to be 1) currently unmaintained and 2) only applicable 
to the OB1 PML + OpenIB BTL combo. As recent versions of OMPI have moved away 
from the OpenIB BTL, relying more heavily on UCX for InfiniBand support, the 
old code is now deprecated. Sorry for giving you hope on this.

Maybe you can delegate the MPI call to a CUDA event callback?

  George.




Assuming you are willing to go for a less portable solution, you can get the 
OMPI streams and add your kernels to them, so that the sequential order will 
guarantee the correctness of your Isend. We have 2 hidden CUDA streams in OMPI, 
one for device-to-host and one for host-to-device, that can be queried with the 
non-MPI-standard functions mca_common_cuda_get_dtoh_stream and 
mca_common_cuda_get_htod_stream.
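
A rough sketch of that non-portable route (this assumes the OMPI-internal 
functions return the stream handle as a void* and that the dtoh stream is the 
appropriate one for data about to be sent; per the later reply above, this path 
is deprecated in recent OMPI releases):

    /* OMPI-internal prototypes; not in mpi.h, availability depends on the build */
    void *mca_common_cuda_get_dtoh_stream(void);
    void *mca_common_cuda_get_htod_stream(void);

    cudaStream_t dtoh = (cudaStream_t) mca_common_cuda_get_dtoh_stream();
    produce<<<grid, block, 0, dtoh>>>(d_buf, n);  /* hypothetical kernel filling the send buffer */
    MPI_Isend(d_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, &req);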

Which streams (dtoh or htod) should I use to insert kernels producing send data 
and kernels using received data? I imagine MPI uses GPUDirect RDMA to move data 
directly from GPU to NIC. Why do we need to bother dtoh or 

[OMPI users] OpenMPI 3.0.0 Failing To Compile

2018-02-28 Thread Justin Luitjens

I'm trying to build OpenMPI on Ubuntu 16.04.3 and I'm getting an error.


Here is how I configure and build:
./configure --with-cuda=$CUDA_HOME --prefix=$MPI_HOME && make clean &&  make -j 
&& make install


Here is the error I see:

make[2]: Entering directory 
'/tmpnfs/jluitjens/libs/src/openmpi-3.0.0/opal/mca/crs'
  CC   base/crs_base_open.lo
  GENERATE opal_crs.7
  CC   base/crs_base_select.lo
  CC   base/crs_base_close.lo
  CC   base/crs_base_fns.lo
Option package-version requires an argument
Usage: ../../../ompi/mpi/man/make_manpage.pl --package-name= 
--package-version= --ompi-date= --opal-date= --orte-date= --input= --output= 
[--nocxx] [ --nofortran] [--nof08]
Makefile:2199: recipe for target 'opal_crs.7' failed
make[2]: *** [opal_crs.7] Error 1
make[2]: *** Waiting for unfinished jobs
make[2]: Leaving directory 
'/tmpnfs/jluitjens/libs/src/openmpi-3.0.0/opal/mca/crs'
Makefile:2364: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/tmpnfs/jluitjens/libs/src/openmpi-3.0.0/opal'
Makefile:1885: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1


Any suggestions on what might be going on?


[OMPI users] Crash in libopen-pal.so

2017-06-19 Thread Justin Luitjens
I have an application that works on other systems, but on the system I'm 
currently running on I'm seeing the following crash:

[dt04:22457] *** Process received signal ***
[dt04:22457] Signal: Segmentation fault (11)
[dt04:22457] Signal code: Address not mapped (1)
[dt04:22457] Failing at address: 0x6a1da250
[dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2b353370]
[dt04:22457] [ 1] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2cbcf810]
[dt04:22457] [ 2] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2cbcff3b]
[dt04:22457] [ 3] ./hacc_tpm[0x42f068]
[dt04:22457] [ 4] ./hacc_tpm[0x42f231]
[dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
[dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2c30db35]
[dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
[dt04:22457] *** End of error message ***


This app is a CUDA app but doesn't use GPUDirect, so that should be irrelevant.

I'm building with gcc/5.3.0, cuda/8.0.44, and openmpi/1.10.7.

I'm running on CentOS 7 and am using a vanilla MPI configure line:
./configure --prefix=/home/jluitjens/libs/openmpi/

Currently I'm trying to do this with just a single MPI process but multiple MPI 
processes fail in the same way:

mpirun  --oversubscribe -np 1 ./command

What is odd is that the crash occurs around the same spot in the code, but not 
consistently at exactly the same spot.  The spot where the single thread is at 
the time of the crash is nowhere near MPI code; it is simply calling malloc to 
allocate some memory.  This makes me think the crash is caused either by a 
thread outside of the application I'm working on (perhaps in OpenMPI itself) or 
by OpenMPI hijacking malloc/free.

Does anyone have any ideas of what I could try to work around this issue?
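
One thing that might be worth trying (only a guess based on the malloc/free 
hijacking suspicion above, not a known fix) is to rebuild Open MPI without its 
memory manager, so the ptmalloc2 hooks never wrap malloc/free:

    ./configure --prefix=/home/jluitjens/libs/openmpi/ --without-memory-manager && make && make install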

Thanks,
Justin













Re: [OMPI users] Problem building OpenMPI with CUDA 8.0

2016-10-18 Thread Justin Luitjens
After looking into this a bit more, it appears the issue is that I am building 
on a head node which does not have the driver installed.  Building on a back-end 
node resolves the issue.  In CUDA 8.0 the NVML stubs can be found in the toolkit 
at the following path:  ${CUDA_HOME}/lib64/stubs

For 8.0 I'd suggest updating the configure/make scripts to look for NVML there 
and link in the stubs.  This way the build depends only on the toolkit, not on 
the driver being installed.
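
In the meantime, a possible stopgap (untested here, and it assumes configure 
passes LDFLAGS through to the link step) is to point the build at the stubs 
directory explicitly:

    ./configure --prefix=$PREFIXPATH --with-cuda=$CUDA_HOME LDFLAGS="-L${CUDA_HOME}/lib64/stubs" && make && sudo make install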

Thanks,
Justin

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Justin 
Luitjens
Sent: Tuesday, October 18, 2016 9:53 AM
To: users@lists.open-mpi.org
Subject: [OMPI users] Problem building OpenMPI with CUDA 8.0

I have the release version of CUDA 8.0 installed and am trying to build OpenMPI.

Here is my configure and build line:

./configure --prefix=$PREFIXPATH --with-cuda=$CUDA_HOME --with-tm= 
--with-openib= && make && sudo make install

Where CUDA_HOME points to the cuda install path.

When I run the above command it builds for quite a while but eventually errors 
out with this:

make[2]: Entering directory 
`/home/jluitjens/Perforce/jluitjens_dtlogin_p4sw/sw/devrel/DevtechCompute/Internal/Tools/dtlogin/scripts/mpi/openmpi-1.10.1-gcc5.0_2014_11-cuda8.0/opal/tools/wrappers'
  CCLD opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `nvmlInit_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetHandleByIndex_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetCount_v2'


Any idea what I might need to change to get around this error?

Thanks,
Justin


[OMPI users] Problem building OpenMPI with CUDA 8.0

2016-10-18 Thread Justin Luitjens
I have the release version of CUDA 8.0 installed and am trying to build OpenMPI.

Here is my configure and build line:

./configure --prefix=$PREFIXPATH --with-cuda=$CUDA_HOME --with-tm= 
--with-openib= && make && sudo make install

Where CUDA_HOME points to the cuda install path.

When I run the above command it builds for quite a while but eventually errors 
out with this:

make[2]: Entering directory 
`/home/jluitjens/Perforce/jluitjens_dtlogin_p4sw/sw/devrel/DevtechCompute/Internal/Tools/dtlogin/scripts/mpi/openmpi-1.10.1-gcc5.0_2014_11-cuda8.0/opal/tools/wrappers'
  CCLD opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `nvmlInit_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetHandleByIndex_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetCount_v2'


Any idea what I might need to change to get around this error?

Thanks,
Justin


Re: [OMPI users] CUDA IPC/RDMA Not Working

2016-03-30 Thread Justin Luitjens
We have figured this out.  It turns out that the first call to each 
MPI_Isend/Irecv is staged through the host but subsequent calls are not.
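
For anyone else profiling this, a minimal warm-up sketch (buffer, count, and 
peer names are placeholders) that absorbs that first host-staged exchange on 
the same device buffers before the measured iterations:

    /* throwaway exchange so the CUDA IPC path is established before profiling */
    MPI_Sendrecv(d_send, count, MPI_FLOAT, peer, 0,
                 d_recv, count, MPI_FLOAT, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Barrier(MPI_COMM_WORLD);
    /* ... the MPI_Isend/MPI_Irecv iterations captured by nvprof go here ... */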

Thanks,
Justin

From: Justin Luitjens
Sent: Wednesday, March 30, 2016 9:37 AM
To: us...@open-mpi.org
Subject: CUDA IPC/RDMA Not Working

Hello,

I have installed OpenMPI 1.10.2 with cuda support:

[jluitjens@dt03 repro]$ ompi_info --parsable --all | grep 
mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true


I'm trying to verify that GPUDirect is working and that messages aren't 
traversing through the host.  On a K80 GPU I'm starting 2 MPI processes where 
each takes one of the GPUs of the K80.  They then do a send/receive of a 
certain size.

In addition, I'm recording a timeline with nvprof to visualize what is 
happening.  What I'm expecting to happen is that there will be one memcpy D2D 
on each device, corresponding to the send and the receive.  However, what I'm 
seeing is that each device does a D2H followed by an H2D copy, suggesting the 
data is staged through the host.

Here is how I'm currently running the application:

mpirun --mca btl_smcuda_cuda_ipc_verbose 100 --mca btl_smcuda_use_cuda_ipc 1 
--mca btl smcuda,self --mca btl_openib_want_cuda_gdr 1 -np 2 nvprof -o 
profile.%p ./a.out



I'm getting the following diagnostic output:

[dt03:21732] Sending CUDA IPC REQ (try=1): myrank=1, mydev=1, peerrank=0
[dt03:21731] Sending CUDA IPC REQ (try=1): myrank=0, mydev=0, peerrank=1
[dt03:21731] Not sending CUDA IPC ACK because request already initiated
[dt03:21732] Analyzed CUDA IPC request: myrank=1, mydev=1, peerrank=0, 
peerdev=0 --> ACCESS=1
[dt03:21732] BTL smcuda: rank=1 enabling CUDA IPC to rank=0 on node=dt03
[dt03:21732] Sending CUDA IPC ACK:  myrank=1, mydev=1, peerrank=0, peerdev=0
[dt03:21731] Received CUDA IPC ACK, notifying PML: myrank=0, peerrank=1
[dt03:21731] BTL smcuda: rank=0 enabling CUDA IPC to rank=1 on node=dt03

Here it seems like IPC is correctly being enabled between ranks 0 and 1.

I have tried both very large and very small messages and they all seem to stage 
through the host.

What am I doing wrong?

For reference here is my ompi_info output:

[jluitjens@dt03 repro]$ ompi_info
 Package: Open MPI jluitjens@dt04 Distribution
Open MPI: 1.10.2
  Open MPI repo revision: v1.10.1-145-g799148f
   Open MPI release date: Jan 21, 2016
Open RTE: 1.10.2
  Open RTE repo revision: v1.10.1-145-g799148f
   Open RTE release date: Jan 21, 2016
OPAL: 1.10.2
  OPAL repo revision: v1.10.1-145-g799148f
   OPAL release date: Jan 21, 2016
 MPI API: 3.0.0
Ident string: 1.10.2
  Prefix: 
/shared/devtechapps/mpi/gnu-4.7.3/openmpi-1.10.2/cuda-7.5
Configured architecture: x86_64-pc-linux-gnu
  Configure host: dt04
   Configured by: jluitjens
   Configured on: Tue Feb  9 10:56:22 PST 2016
  Configure host: dt04
Built by: jluitjens
Built on: Tue Feb  9 11:21:51 PST 2016
  Built host: dt04
  C bindings: yes
C++ bindings: yes
 Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gcc
 C compiler absolute:
  C compiler family name: GNU
  C compiler version: 4.7.3
C++ compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/g++
  C++ compiler absolute: none
   Fort compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gfortran
   Fort compiler abs:
 Fort ignore TKR: no
   Fort 08 assumed shape: no
  Fort optional args: no
  Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
   Fort STORAGE_SIZE: no
  Fort BIND(C) (all): no
  Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): no
   Fort TYPE,BIND(C): no
Fort T,BIND(C,name="a"): no
Fort PRIVATE: no
  Fort PROTECTED: no
   Fort ABSTRACT: no
   Fort ASYNCHRONOUS: no
  Fort PROCEDURE: no
 Fort USE...ONLY: no
   Fort C_FUNLOC: no
Fort f08 using wrappers: no
 Fort MPI_SIZEOF: no
 C profiling: yes
   C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: no
  C++ exceptions: no
  Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
  OMPI progress: no, ORTE progress: yes, Event lib:
  yes)
   Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
   

[OMPI users] CUDA IPC/RDMA Not Working

2016-03-30 Thread Justin Luitjens
Hello,

I have installed OpenMPI 1.10.2 with cuda support:

[jluitjens@dt03 repro]$ ompi_info --parsable --all | grep 
mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true


I'm trying to verify that GPUDirect is working and that messages aren't 
traversing through the host.  On a K80 GPU I'm starting 2 MPI processes where 
each takes one of the GPUs of the K80.  They then do a send/receive of a 
certain size.

In addition, I'm recording a timeline with nvprof to visualize what is 
happening.  What I'm expecting to happen is that there will be one memcpy D2D 
on each device, corresponding to the send and the receive.  However, what I'm 
seeing is that each device does a D2H followed by an H2D copy, suggesting the 
data is staged through the host.

Here is how I'm currently running the application:

mpirun --mca btl_smcuda_cuda_ipc_verbose 100 --mca btl_smcuda_use_cuda_ipc 1 
--mca btl smcuda,self --mca btl_openib_want_cuda_gdr 1 -np 2 nvprof -o 
profile.%p ./a.out



I'm getting the following diagnostic output:

[dt03:21732] Sending CUDA IPC REQ (try=1): myrank=1, mydev=1, peerrank=0
[dt03:21731] Sending CUDA IPC REQ (try=1): myrank=0, mydev=0, peerrank=1
[dt03:21731] Not sending CUDA IPC ACK because request already initiated
[dt03:21732] Analyzed CUDA IPC request: myrank=1, mydev=1, peerrank=0, 
peerdev=0 --> ACCESS=1
[dt03:21732] BTL smcuda: rank=1 enabling CUDA IPC to rank=0 on node=dt03
[dt03:21732] Sending CUDA IPC ACK:  myrank=1, mydev=1, peerrank=0, peerdev=0
[dt03:21731] Received CUDA IPC ACK, notifying PML: myrank=0, peerrank=1
[dt03:21731] BTL smcuda: rank=0 enabling CUDA IPC to rank=1 on node=dt03

Here it seems like IPC is correctly being enabled between ranks 0 and 1.

I have tried both very large and very small messages and they all seem to stage 
through the host.

What am I doing wrong?

For reference here is my ompi_info output:

[jluitjens@dt03 repro]$ ompi_info
 Package: Open MPI jluitjens@dt04 Distribution
Open MPI: 1.10.2
  Open MPI repo revision: v1.10.1-145-g799148f
   Open MPI release date: Jan 21, 2016
Open RTE: 1.10.2
  Open RTE repo revision: v1.10.1-145-g799148f
   Open RTE release date: Jan 21, 2016
OPAL: 1.10.2
  OPAL repo revision: v1.10.1-145-g799148f
   OPAL release date: Jan 21, 2016
 MPI API: 3.0.0
Ident string: 1.10.2
  Prefix: 
/shared/devtechapps/mpi/gnu-4.7.3/openmpi-1.10.2/cuda-7.5
Configured architecture: x86_64-pc-linux-gnu
  Configure host: dt04
   Configured by: jluitjens
   Configured on: Tue Feb  9 10:56:22 PST 2016
  Configure host: dt04
Built by: jluitjens
Built on: Tue Feb  9 11:21:51 PST 2016
  Built host: dt04
  C bindings: yes
C++ bindings: yes
 Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gcc
 C compiler absolute:
  C compiler family name: GNU
  C compiler version: 4.7.3
C++ compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/g++
  C++ compiler absolute: none
   Fort compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gfortran
   Fort compiler abs:
 Fort ignore TKR: no
   Fort 08 assumed shape: no
  Fort optional args: no
  Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
   Fort STORAGE_SIZE: no
  Fort BIND(C) (all): no
  Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): no
   Fort TYPE,BIND(C): no
Fort T,BIND(C,name="a"): no
Fort PRIVATE: no
  Fort PROTECTED: no
   Fort ABSTRACT: no
   Fort ASYNCHRONOUS: no
  Fort PROCEDURE: no
 Fort USE...ONLY: no
   Fort C_FUNLOC: no
Fort f08 using wrappers: no
 Fort MPI_SIZEOF: no
 C profiling: yes
   C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: no
  C++ exceptions: no
  Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
  OMPI progress: no, ORTE progress: yes, Event lib:
  yes)
   Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
  dl support: yes
   Heterogeneous support: no
mpirun default --prefix: no
 MPI I/O support: yes
   MPI_WTIME support: gettimeofday
 Symbol vis. support: yes
   Host topology support: yes
  MPI extensions:
   FT Checkpoint support: no (checkpoint thread: no)
   C/R 

Re: [OMPI users] MPI-Send for entire entire matrix when allocating memory dynamically

2009-10-31 Thread Justin Luitjens
Here is how you can do this without having to redescribe the data type all
the time.  This will also keep your data layout together and improve cache
coherency.


#include <cstdlib>
#include <iostream>
#include <mpi.h>
using namespace std;
int main()
{
  int N=2, M=3;
  //Allocate the matrix
  double **A=(double**)malloc(sizeof(double*)*N);
  double *A_data=(double*)malloc(sizeof(double)*N*M);

  //assign some values to the matrix
  for(int i=0;i<N;i++)
A[i]=&A_data[i*M];

  int j=0;
  for(int n=0;n<N;n++)
for(int m=0;m<M;m++)
  A[n][m]=j++;

  //print the matrix
  cout << "Matrix:\n";
  for(int n=0;n<N;n++)
  {
for(int m=0;m<M;m++)
{
  cout << A[n][m] << " ";
}
cout << endl;
  }

  //to send over mpi
  //MPI_Send(A_data,M*N,MPI_DOUBLE,dest,tag,MPI_COMM_WORLD);

  //delete the matrix
  free(A);
  free(A_data);

  return 0;
}
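
The receiving rank can use the same layout, so the whole matrix arrives in one 
contiguous buffer (src and tag below are placeholders, assuming matching N and M):

    //on the receiving rank:
    double *B_data=(double*)malloc(sizeof(double)*N*M);
    MPI_Recv(B_data,M*N,MPI_DOUBLE,src,tag,MPI_COMM_WORLD,MPI_STATUS_IGNORE);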


On Sat, Oct 31, 2009 at 11:32 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:

> Eugene is right, every time you create a new matrix you will have to
> describe it with a new datatype (even when using MPI_BOTTOM).
>
> george.
>
>
> On Oct 30, 2009, at 18:11 , Natarajan CS wrote:
>
>  Thanks for the replies guys! Definitely two suggestions worth trying.
>> Definitely didn't consider a derived datatype. I wasn't really sure that the
>> MPI_Send call overhead was significant enough that increasing the buffer
>> size and decreasing the number of calls would cause any speed up. Will
>> change the code over the weekend and see what happens! Also, maybe if one
>> passes the absolute address maybe there is no need for creating multiple
>> definitions of the datatype? Haven't gone through the man pages yet, so
>> apologies for ignorance!
>>
>> On Fri, Oct 30, 2009 at 2:44 PM, Eugene Loh <eugene@sun.com> wrote:
>> Wouldn't you need to create a different datatype for each matrix instance?
>>  E.g., let's say you create twelve 5x5 matrices.  Wouldn't you need twelve
>> different derived datatypes?  I would think so because each time you create
>> a matrix, the footprint of that matrix in memory will depend on the whims of
>> malloc().
>>
>> George Bosilca wrote:
>>
>> Even with the original way to create the matrices, one can use
>>  MPI_Type_create_struct to create an MPI datatype (
>> http://web.mit.edu/course/13/13.715/OldFiles/build/mpich2-1.0.6p1/www/www3/MPI_Type_create_struct.html
>>  )
>> using MPI_BOTTOM as the original displacement.
>>
>> On Oct 29, 2009, at 15:31 , Justin Luitjens wrote:
>>
>> Why not do something like this:
>>
>> double **A=new double*[N];
>> double *A_data = new double[N*N];
>>
>> for(int i=0;i<N;i++)
>> A[i]=&A_data[i*N];
>>
>> This way you have contiguous data (in A_data) but can access it as a  2D
>> array using A[i][j].
>>
>> (I haven't compiled this but I know we represent our matrices this  way).
>>
>> On Thu, Oct 29, 2009 at 12:30 PM, Natarajan CS <csnata...@gmail.com>
>>  wrote:
>> Hi
>> thanks for the quick response. Yes, that is what I meant. I  thought there
>> was no other way around what I am doing but It is  always good to ask a
>> expert rather than assume!
>>
>> Cheers,
>>
>> C.S.N
>>
>>
>> On Thu, Oct 29, 2009 at 11:25 AM, Eugene Loh <eugene@sun.com>  wrote:
>> Natarajan CS wrote:
>>
>> Hello all,
>>Firstly, My apologies for a duplicate post in LAM/MPI list I  have the
>> following simple MPI code. I was wondering if there was a  workaround for
>> sending a dynamically allocated 2-D matrix? Currently  I can send the matrix
>> row-by-row, however, since rows are not  contiguous I cannot send the entire
>> matrix at once. I realize one  option is to change the malloc to act as one
>> contiguous block but  can I keep the matrix definition as below and still
>> send the entire  matrix in one go?
>>
>> You mean with one standard MPI call?  I don't think so.
>>
>> In MPI, there is a notion of derived datatypes, but I'm not  convinced
>> this is what you want.  A derived datatype is basically a  static template
>> of data and holes in memory.  E.g., 3 bytes, then  skip 7 bytes, then
>> another 2 bytes, then skip 500 bytes, then 1 last  byte.  Something like
>> that.  Your 2d matrices differ in two  respects.  One is that the pattern in
>> memory is different for each  matrix you allocate.  The other is that your
>> matrix definition  includes pointer information that won't be the same in
>> every  process's address space.  I guess you could overcome the first
>>  problem by changing alloc_matrix(

Re: [OMPI users] MPI-Send for entire entire matrix when allocating memory dynamically

2009-10-29 Thread Justin Luitjens
Why not do something like this:

double **A=new double*[N];
double *A_data = new double[N*N];

for(int i=0;i<N;i++)
  A[i]=&A_data[i*N];

This way you have contiguous data (in A_data) but can access it as a 2D array 
using A[i][j].

(I haven't compiled this but I know we represent our matrices this way).

On Thu, Oct 29, 2009 at 12:30 PM, Natarajan CS <csnata...@gmail.com> wrote:

> Hi
>thanks for the quick response. Yes, that is what I meant. I thought
> there was no other way around what I am doing but It is always good to ask a
> expert rather than assume!
>
> Cheers,
>
> C.S.N
>
>
> On Thu, Oct 29, 2009 at 11:25 AM, Eugene Loh  wrote:
>
>> Natarajan CS wrote:
>>
>>  Hello all,
>>>Firstly, My apologies for a duplicate post in LAM/MPI list I have
>>> the following simple MPI code. I was wondering if there was a workaround for
>>> sending a dynamically allocated 2-D matrix? Currently I can send the matrix
>>> row-by-row, however, since rows are not contiguous I cannot send the entire
>>> matrix at once. I realize one option is to change the malloc to act as one
>>> contiguous block but can I keep the matrix definition as below and still
>>> send the entire matrix in one go?
>>>
>>
>> You mean with one standard MPI call?  I don't think so.
>>
>> In MPI, there is a notion of derived datatypes, but I'm not convinced this
>> is what you want.  A derived datatype is basically a static template of data
>> and holes in memory.  E.g., 3 bytes, then skip 7 bytes, then another 2
>> bytes, then skip 500 bytes, then 1 last byte.  Something like that.  Your 2d
>> matrices differ in two respects.  One is that the pattern in memory is
>> different for each matrix you allocate.  The other is that your matrix
>> definition includes pointer information that won't be the same in every
>> process's address space.  I guess you could overcome the first problem by
>> changing alloc_matrix() to some fixed pattern in memory for some r and c,
>> but you'd still have pointer information in there that you couldn't blindly
>> copy from one process address space to another.


[OMPI users] Segfault when using valgrind

2009-07-06 Thread Justin Luitjens
Hi, I am attempting to debug a memory corruption in an MPI program using 
valgrind.  However, when I run with valgrind I get semi-random segfaults and 
valgrind messages from within the OpenMPI library.  Here is an example of such 
a segfault:

==6153==
==6153== Invalid read of size 8
==6153==at 0x19102EA0: (within
/usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==6153==by 0x182ABACB: (within
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0x182A3040: (within
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0xB425DD3: PMPI_Isend (in
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153==by 0x7B83DA8: int Uintah::SFC::MergeExchange(int, std::vector >&,
std::vector >&,
std::vector >&) (SFC.h:2989)
==6153==by 0x7B84A8F: void Uintah::SFC::Batchers(std::vector >&,
std::vector >&,
std::vector >&) (SFC.h:3730)
==6153==by 0x7B8857B: void Uintah::SFC::Cleanup(std::vector >&,
std::vector >&,
std::vector >&) (SFC.h:3695)
==6153==by 0x7B88CC6: void Uintah::SFC::Parallel0<3, unsigned
char>() (SFC.h:2928)
==6153==by 0x7C00AAB: void Uintah::SFC::Parallel<3, unsigned
char>() (SFC.h:1108)
==6153==by 0x7C0EF39: void Uintah::SFC::GenerateDim<3>(int)
(SFC.h:694)
==6153==by 0x7C0F0F2: Uintah::SFC::GenerateCurve(int)
(SFC.h:670)
==6153==by 0x7B30CAC:
Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle const&,
int*) (DynamicLoadBalancer.cc:429)
==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
^G^G^GThread "main"(pid 6153) caught signal SIGSEGV at address (nil)
(segmentation violation)

Looking at the code for our Isend at SFC.h:298, I do not see any errors:

=
  MergeInfo myinfo,theirinfo;

  MPI_Request srequest, rrequest;
  MPI_Status status;

  myinfo.n=n;
  if(n!=0)
  {
myinfo.min=sendbuf[0].bits;
myinfo.max=sendbuf[n-1].bits;
  }
  //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" <<
(int)myinfo.max << endl;

  MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest);
==

myinfo is a struct located on the stack, to is the rank of the processor
that the message is being sent to, and srequest is also on the stack.  When
I don't run with valgrind my program runs past this point just fine.
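
One aside that may help separate real errors from Open MPI noise (the install 
path and binary name below are guesses for this setup): Open MPI ships a 
valgrind suppression file for known false positives in its own internals, which 
can be passed to valgrind, e.g.

    mpirun -np 2 valgrind \
        --suppressions=/usr/share/openmpi/openmpi-valgrind.supp ./sus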

I am currently using openmpi 1.3 from the debian unstable branch.  I also
see the same type of segfault in a different portion of the code involving
an MPI_Allgather which can be seen below:

==
==22736== Use of uninitialised value of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck
(coll_tuned_util.h:60)
==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457:
Uintah::Grid::problemSetup(Uintah::Handle const&,
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)