Dear Open MPI developers and users,

unless I am mistaken, I have found a bug in the Open MPI ptmalloc2 memory module in combination with recent GCC code optimizations.
Affected Open MPI releases:
===========================
All (non-debug) releases using the opal/mca/memory/linux/memory_linux_ptmalloc2.c (2010-05-11) implementation, and probably all preceding implementations using GNU C memory allocation hooks, on systems with HAVE_POSIX_MEMALIGN *and* GCC >= 4.9 with optimization (-O3) turned on.

Severity:
=========
Critical for all affected Open MPI releases mentioned above.

Problem description:
====================
The critical code in question is in opal/mca/memory/linux/memory_linux_ptmalloc2.c:

#####
 92 #if HAVE_POSIX_MEMALIGN
 93     /* Double check for posix_memalign, too */
 94     if (mca_memory_linux_component.memalign_invoked) {
 95         mca_memory_linux_component.memalign_invoked = false;
 96         if (0 != posix_memalign(&p, sizeof(void*), 1024 * 1024)) {
 97             return OPAL_ERR_IN_ERRNO;
 98         }
 99         free(p);
100     }
101 #endif
102
103     if (mca_memory_linux_component.malloc_invoked &&
104         mca_memory_linux_component.realloc_invoked &&
105         mca_memory_linux_component.memalign_invoked &&
106         mca_memory_linux_component.free_invoked) {
107         /* Happiness; our functions were invoked */
108         val |= OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_CHUNK_SUPPORT;
109     }
[...]
121     /* All done */
122     if (val > 0) {
123         opal_mem_hooks_set_support(val);
124         return OPAL_SUCCESS;
125     }
#####

The code at lines 103-109 is legally optimized away by GCC >= 4.9 with optimizations turned on, because with the compiler's/optimizer's local knowledge the condition at line 105 can never become true: if mca_memory_linux_component.memalign_invoked == true at line 94, it is set to false at line 95; if it is false at line 94, it is still false at line 103. In both cases the condition at lines 103-106 can never evaluate to true, opal_mem_hooks_set_support() is never called with OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_CHUNK_SUPPORT, and mpi_leave_pinned is (silently) disabled.

In the global view this local assumption does not hold: posix_memalign() at line 96 calls the hook public_mEMALIGn() in opal/mca/memory/linux/malloc.c, which in turn sets mca_memory_linux_component.memalign_invoked = true. With that, the OPAL_MEMORY_*_SUPPORT flags would be configured correctly at line 123, and opal_mem_hooks_support_level(), used by ompi/mca/btl/openib/btl_openib_component.c and indirectly by the ompi/mca/mpool/grdma/mpool_grdma* module, would enable the use of pinned memory.

The optimization can be disabled by adding -fno-tree-partial-pre to the CFLAGS in opal/mca/memory/linux/Makefile, but this is only a temporary workaround.
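For illustration, the pattern can be reduced to the following sketch (simplified, hypothetical names; this is not the actual Open MPI code, the flag is assumed to be defined in another file, and whether a particular GCC build really drops the branch depends on what it assumes about posix_memalign()'s side effects):

#####
#include <stdlib.h>
#include <stdbool.h>

/* Set to true by an allocation hook that lives in a *different*
 * translation unit (malloc.c in Open MPI); the compiler never sees
 * that write while compiling this file.  In the real code this is
 * mca_memory_linux_component.memalign_invoked. */
extern bool memalign_invoked;

int check_memalign_hook(void)
{
    void *p;

    if (memalign_invoked) {
        memalign_invoked = false;          /* reset the flag ...            */
        if (0 != posix_memalign(&p, sizeof(void *), 1024 * 1024)) {
            return -1;                     /* ... and re-trigger the hook   */
        }
        free(p);
    }

    /* From the optimizer's local point of view, memalign_invoked is either
     * still false (it was never true) or was just set to false and never
     * set back, so this branch is dead code and may be removed at -O3.
     * Globally that is wrong: posix_memalign() ends up in the hooked
     * allocator, which sets the flag again. */
    if (memalign_invoked) {
        return 1;   /* hooks verified: full memory support */
    }
    return 0;       /* hooks (apparently) not invoked */
}
#####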
Patch:
======
The proposed patch is as follows; it changes the compiler's/optimizer's view of the *_invoked variables, which are used by different code paths (memory_linux_ptmalloc2.c vs. malloc.c):

#####
diff -rupN openmpi-1.8.5.org/opal/mca/memory/linux/memory_linux.h openmpi-1.8.5/opal/mca/memory/linux/memory_linux.h
--- openmpi-1.8.5.org/opal/mca/memory/linux/memory_linux.h	2014-10-03 22:32:23.000000000 +0200
+++ openmpi-1.8.5/opal/mca/memory/linux/memory_linux.h	2015-06-04 10:01:44.941544282 +0200
@@ -33,11 +33,11 @@ typedef struct opal_memory_linux_compone
 
 #if MEMORY_LINUX_PTMALLOC2
     /* Ptmalloc2-specific data */
-    bool free_invoked;
-    bool malloc_invoked;
-    bool realloc_invoked;
-    bool memalign_invoked;
-    bool munmap_invoked;
+    volatile bool free_invoked;
+    volatile bool malloc_invoked;
+    volatile bool realloc_invoked;
+    volatile bool memalign_invoked;
+    volatile bool munmap_invoked;
 #endif
 } opal_memory_linux_component_t;
#####

Additionally, a further patch should be applied that emits a warning in the GPUDirect module if leave_pinned is not available for some reason. In that case, GPUDirect support should be disabled, because it runs faster without it (see Case 2 below).
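To illustrate why the volatile qualifier helps, here is the matching other half of the sketch above (again with simplified, hypothetical names, loosely modelled on the hook in opal/mca/memory/linux/malloc.c). Once the flag is volatile-qualified -- and the extern declaration in the checking file carries the same qualifier, which the patch achieves via the shared memory_linux.h -- every access must be an actual load or store, so the compiler may no longer assume the value is still false after the posix_memalign() call:

#####
/* hooks.c -- simplified hook side of the sketch (hypothetical names). */
#include <stdbool.h>

/* Matches the proposed patch: the volatile qualifier forbids the compiler
 * from caching or constant-folding the flag across calls, which keeps the
 * check in the other translation unit alive. */
volatile bool memalign_invoked = false;

/* In Open MPI this happens inside the intercepted memalign path
 * (public_mEMALIGn() in malloc.c); here it only records the event. */
void record_memalign_hook(void)
{
    memalign_invoked = true;
}
#####

Qualifying the five flags as volatile is a small, localized change to memory_linux.h; the only cost is that accesses to these bools can no longer be optimized, which should be negligible.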
Symptoms:
=========
Very high latency with GPUDirect and fluctuating bandwidth with InfiniBand transfers, caused by mpi_leave_pinned being disabled at run time.

We are using the OSU Micro-Benchmarks 4.4.1 to show these GPUDirect latency and multi-rail bandwidth performance problems.

System specification: 2 nodes with 2x Intel E5-2670 processors, Mellanox Connect-IB MCB194A-FCAT HCA (dual-port FDR, PCIe 3.0 x16) and NVIDIA Tesla K40c GPU connected to different PCIe root complexes/CPUs.
Software: CentOS 6.6, Mellanox OFED 2.4, CUDA 7.0, GCC 4.9.2 (local build), Open MPI 1.8.5 (local build)

Without applied patch:
#####
# Case 1:
mpirun -report-bindings -display-map -map-by node -np 2 -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

 Data for JOB [12959,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [12959,1] App: 0 Process rank: 0

 Data for node: e5-2670-2   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [12959,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:09670] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[e5-2670-2:06302] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
# OSU MPI-CUDA Latency Test v4.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.26
1                    1442.93
2                    1440.30
4                    1440.63
8                    1443.72
16                   1444.98
32                   1442.04
64                   1441.51
128                  1442.62
256                  1443.31
512                  1443.67
1024                 1446.23
2048                 1449.38
4096                 1458.05
8192                 1476.22
16384                1515.97
32768                  36.86
65536                  45.16
131072                 60.57
262144                 94.38
524288                130.83
1048576               199.23
2097152               328.85
4194304               603.71
##
# Case 2:
mpirun -report-bindings -display-map -map-by node -np 2 -mca btl_openib_want_cuda_gdr 0 -x CUDA_VISIBLE_DEVICES=0 /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

 Data for JOB [19644,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [19644,1] App: 0 Process rank: 0

 Data for node: e5-2670-2   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [19644,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:23525] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[e5-2670-2:08479] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
# OSU MPI-CUDA Latency Test v4.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.27
1                      14.83
2                      15.11
4                      14.82
8                      14.85
16                     15.14
32                     14.92
64                     15.00
128                    15.40
256                    15.52
512                    15.53
1024                   15.68
2048                   16.39
4096                   18.92
8192                   21.69
16384                  32.64
32768                  36.92
65536                  44.26
131072                 60.99
262144                 94.18
524288                130.59
1048576               199.84
2097152               328.17
4194304               575.35
##
# Case 3:
mpirun -report-bindings -display-map -map-by node -np 2 -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

 Data for JOB [12768,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [12768,1] App: 0 Process rank: 0

 Data for node: e5-2670-2   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [12768,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:09913] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[e5-2670-2:06639] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
# OSU MPI-CUDA Bandwidth Test v4.4.1
# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
# Size          Bandwidth (MB/s)
1                       1.09
2                       2.17
4                       4.31
8                       8.74
16                     16.67
32                     32.77
64                     65.24
128                   134.89
256                   268.24
512                   760.80
1024                 1436.22
2048                 2401.94
4096                 4501.21
8192                 5777.17
16384                5736.33
32768                6952.33
65536               10443.88
131072              11450.45
262144              11332.89
524288               8804.98
1048576              8820.94
2097152             11294.32
4194304             10869.27
#####

Expected behavior:
==================
With applied patch:
#####
# Case 4:
mpirun -report-bindings -display-map -map-by node -np 2 -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency -d cuda D D

 Data for JOB [17394,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [17394,1] App: 0 Process rank: 0

 Data for node: e5-2670-2   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [17394,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:21675] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[e5-2670-2:06719] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
# OSU MPI-CUDA Latency Test v4.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.27
1                       6.52
2                       6.50
4                       6.50
8                       6.74
16                      6.51
32                      6.54
64                      6.52
128                     6.75
256                     7.18
512                     7.82
1024                   10.01
2048                   14.12
4096                   22.31
8192                   33.27
16384                  55.25
32768                  37.42
65536                  44.22
131072                 60.00
262144                 94.27
524288                130.41
1048576               198.48
2097152               328.50
4194304               601.53
##
# Case 5:
mpirun -report-bindings -display-map -map-by node -np 2 -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

 Data for JOB [17296,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [17296,1] App: 0 Process rank: 0

 Data for node: e5-2670-2   Num slots: 16   Max slots: 0   Num procs: 1
        Process OMPI jobid: [17296,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:21705] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
[e5-2670-2:06754] MCW rank 1 bound to socket 0[core 0[hwt 0]]: [B/././././././.][./././././././.]
# OSU MPI-CUDA Bandwidth Test v4.4.1
# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
# Size          Bandwidth (MB/s)
1                       1.28
2                       2.56
4                       5.14
8                      10.26
16                     20.27
32                     40.31
64                     80.85
128                   161.58
256                   320.43
512                   880.34
1024                 1598.03
2048                 2819.98
4096                 4431.01
8192                 5809.84
16384                9668.16
32768               10930.90
65536               11789.82
131072              12245.28
262144              12494.67
524288              12615.41
1048576             12679.62
2097152             12689.27
4194304             12725.77
#####

Best regards,
René "oere" Oertel

Computer Architecture Group
Faculty of Computer Science
Technische Universität Chemnitz
Straße der Nationen 62 | R. 014A
09111 Chemnitz
Germany

Tel: +49 371 531-37652
Fax: +49 371 531-837652
rene.oer...@informatik.tu-chemnitz.de
http://www.tu-chemnitz.de/informatik/RA