Hey guys,

I have installed OpenMPI v3.1.4 with UCX, gdrcopy and CUDA from source.
When I load the module, here is the list of dependencies:
[image: image.png]

In order to test the sanity of this new build, I use the CUDA-aware OpenMPI
<https://github.com/NVIDIA-developer-blog/code-samples/tree/master/posts/cuda-aware-mpi-example>
example on GitHub. I originally opened ticket #21
<https://github.com/NVIDIA-developer-blog/code-samples/issues/21>, but then
I was advised to contact the OpenMPI user list.

When I run jacobi_cuda_normal_mpi with 4 processes, I get output, followed
by many similar warning messages at the bottom. The warnings look like this
(full stdout/stderr attached below):


[1557488926.059031] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN
destroying inuse address:0x2b11a8000000


Should I be concerned about these UCX warnings? Is there a way to inspect
this further and fix the issue, so that stderr stays clean?

Kind regards,
Ehsan Moravveji
Topology size: 2 x 2
Local domain size (current node): 4096 x 4096
Global domain size (all nodes): 8192 x 8192
Starting Jacobi run with 4 processes using "Tesla P100-SXM2-16GB" GPUs (ECC enabled: 4 / 4):
Iteration: 0 - Residue: 0.250000
Iteration: 100 - Residue: 0.002397
Iteration: 200 - Residue: 0.001204
Iteration: 300 - Residue: 0.000804
Iteration: 400 - Residue: 0.000603
Iteration: 500 - Residue: 0.000483
Iteration: 600 - Residue: 0.000403
Iteration: 700 - Residue: 0.000345
Iteration: 800 - Residue: 0.000302
Iteration: 900 - Residue: 0.000269
Stopped after 1000 iterations with residue 0.000242
Total Jacobi run time: 1.2674 sec.
Average per-process communication time: 0.1054 sec.
Measured lattice updates: 52.92 GLU/s (total), 13.23 GLU/s (per process)
Measured FLOPS: 264.62 GFLOPS (total), 66.15 GFLOPS (per process)
Measured device bandwidth: 3.39 TB/s (total), 846.78 GB/s (per process)
[1557488926.059031] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11a8000000 
[1557488926.059057] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0200000 
[1557488926.059062] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0208000 
[1557488926.059030] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af544000000 
[1557488926.059045] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c200000 
[1557488926.059049] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c208000 
[1557488926.059054] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c210000 
[1557488926.059058] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c218000 
[1557488926.059062] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c220000 
[1557488926.059066] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54e000000 
[1557488926.059066] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0210000 
[1557488926.059071] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0218000 
[1557488926.059075] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0220000 
[1557488926.059079] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b2000000 
[1557488926.059150] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99a4000000 
[1557488926.059177] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac200000 
[1557488926.059183] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac208000 
[1557488926.059188] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac210000 
[1557488926.059226] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac218000 
[1557488926.059231] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac220000 
[1557488926.059236] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ae000000 
[1557488926.059150] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4bf8000000 
[1557488926.059177] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00200000 
[1557488926.059182] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00208000 
[1557488926.059187] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00210000 
[1557488926.059192] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00218000 
[1557488926.059226] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00220000 
[1557488926.059231] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c02000000 
[1557488926.914026] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99a4000000 
[1557488926.914056] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac200000 
[1557488926.914068] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac208000 
[1557488926.914079] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac210000 
[1557488926.914089] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac218000 
[1557488926.914100] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ac220000 
[1557488926.914110] [r24g38:186802:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b99ae000000 
[1557488926.914149] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4bf8000000 
[1557488926.914177] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00200000 
[1557488926.914227] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00208000 
[1557488926.914240] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00210000 
[1557488926.914251] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00218000 
[1557488926.914261] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c00220000 
[1557488926.914271] [r24g38:186804:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b4c02000000 
[1557488926.996036] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11a8000000 
[1557488926.996063] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0200000 
[1557488926.996097] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0208000 
[1557488926.996108] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0210000 
[1557488926.996118] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0218000 
[1557488926.996127] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b0220000 
[1557488926.996137] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2b11b2000000 
[1557488927.077068] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af544000000 
[1557488927.077095] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c200000 
[1557488927.077107] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c208000 
[1557488927.077117] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c210000 
[1557488927.077154] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c218000 
[1557488927.077165] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54c220000 
[1557488927.077175] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN  
destroying inuse address:0x2af54e000000 
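
The warnings above all follow one pattern, so a first step in inspecting them is to group them by process and address. Below is a minimal sketch in Python, assuming the stderr has been captured as text; the `summarize` helper and the `sample` string are hypothetical names made up for this illustration and are not part of UCX or OpenMPI. The regular expression is written against the exact format shown in this log (including the line wrap between "WARN" and "destroying") and may need adjusting for other UCX versions.

```python
import re
from collections import defaultdict

# Matches one UCX "destroying inuse address" warning as it appears in this
# log. \s+ between fields tolerates the line wrap inserted after "WARN".
WARNING = re.compile(
    r"\[(?P<ts>\d+\.\d+)\]\s+\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"memtype_cache\.c:\d+\s+UCX\s+WARN\s+"
    r"destroying inuse address:(?P<addr>0x[0-9a-fA-F]+)"
)


def summarize(log_text):
    """Return {pid: {address: count}} for every matching warning.

    Hypothetical helper for illustration only.
    """
    summary = defaultdict(lambda: defaultdict(int))
    for match in WARNING.finditer(log_text):
        summary[int(match["pid"])][match["addr"]] += 1
    return {pid: dict(addrs) for pid, addrs in summary.items()}


# A few warnings copied from the output above, wrapped as in the mail.
sample = """\
[1557488926.059031] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN
destroying inuse address:0x2b11a8000000
[1557488926.059030] [r24g38:186801:0]  memtype_cache.c:137  UCX  WARN
destroying inuse address:0x2af544000000
[1557488926.996036] [r24g38:186803:0]  memtype_cache.c:137  UCX  WARN
destroying inuse address:0x2b11a8000000
"""

if __name__ == "__main__":
    for pid, addrs in sorted(summarize(sample).items()):
        for addr, count in sorted(addrs.items()):
            print(f"pid {pid}  {addr}  reported {count}x")
```

To run it on a real capture, replace `sample` with the contents of the saved stderr file. Grouping by PID makes it easier to see whether every rank reports the same number of addresses and whether the same address is reported more than once.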
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
