Dear Open MPI developers and users,

if I am not totally mistaken, I have found a bug in the Open MPI ptmalloc2
memory module in combination with recent GCC code optimizations.

Affected Open MPI releases:
==========================

All (non-debug) releases using the
opal/mca/memory/linux/memory_linux_ptmalloc2.c (2010-05-11)
implementation, and probably all preceding implementations using GNU C
Memory Allocation Hooks, on systems with HAVE_POSIX_MEMALIGN *and*
GCC >= 4.9 with optimization (-O3) turned on.

Severity:
========

Critical for all affected Open MPI releases mentioned before.

Problem description:
===================

The critical code in question is in
opal/mca/memory/linux/memory_linux_ptmalloc2.c:
#####
92 #if HAVE_POSIX_MEMALIGN
93     /* Double check for posix_memalign, too */
94     if (mca_memory_linux_component.memalign_invoked) {
95         mca_memory_linux_component.memalign_invoked = false;
96         if (0 != posix_memalign(&p, sizeof(void*), 1024 * 1024)) {
97             return OPAL_ERR_IN_ERRNO;
98         }
99         free(p);
100     }
101 #endif
102
103     if (mca_memory_linux_component.malloc_invoked &&
104         mca_memory_linux_component.realloc_invoked &&
105         mca_memory_linux_component.memalign_invoked &&
106         mca_memory_linux_component.free_invoked) {
107         /* Happiness; our functions were invoked */
108         val |= OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_CHUNK_SUPPORT;
109     }
[...]
121     /* All done */
122     if (val > 0) {
123         opal_mem_hooks_set_support(val);
124         return OPAL_SUCCESS;
125     }
#####

The code at lines 103-109 is legally optimized away by GCC >= 4.9 with
optimizations turned on, because with the compiler/optimizer's local
knowledge the condition at line 105 can never become true:
If mca_memory_linux_component.memalign_invoked == true when line 94 is
reached, it is set to false at line 95.
If mca_memory_linux_component.memalign_invoked == false at that point,
it is still false at line 103.
In both cases the condition at lines 103-106 can never evaluate to true,
so opal_mem_hooks_set_support() is never called with
OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_CHUNK_SUPPORT, resulting in
(silently) disabled mpi_leave_pinned.

In the global view this local assumption does not hold: posix_memalign()
at line 96 calls the hook public_mEMALIGn() in
opal/mca/memory/linux/malloc.c, which in turn sets
mca_memory_linux_component.memalign_invoked = true.
As a result, the OPAL_MEMORY_*_SUPPORT flags would be configured
correctly at line 123, and the opal_mem_hooks_support_level() used by
ompi/mca/btl/openib/btl_openib_component.c (and indirectly by the
ompi/mca/mpool/grdma/mpool_grdma* module) would enable the use of
pinned memory.
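
To make the compiler's local view explicit, here is a reduced,
hypothetical sketch of the pattern (not the actual Open MPI sources; the
names hook(), check() and component are made up for illustration and
only mirror the structure of memory_linux_ptmalloc2.c and malloc.c):

#####
/* Illustration only: the *_invoked flag is tested, cleared and then
 * expected to be set again as a side effect of posix_memalign(). */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdbool.h>

struct component_t {
    bool memalign_invoked;   /* no "volatile" here, as in the original */
    /* ... other *_invoked flags ... */
} component;

/* Stands in for public_mEMALIGn() in malloc.c; in the real build it is
 * reached through posix_memalign() because ptmalloc2 is interposed. */
void hook(void) { component.memalign_invoked = true; }

int check(void)
{
    void *p;
    if (component.memalign_invoked) {           /* cf. line 94 */
        component.memalign_invoked = false;     /* cf. line 95 */
        if (0 != posix_memalign(&p, sizeof(void*), 1024 * 1024)) {
            return -1;
        }
        free(p);
    }
    /* Local view of the optimizer: the flag is either false on entry or
     * was just cleared above, so this test can be folded to "always
     * false" and the whole branch removed (cf. lines 103-109). */
    if (component.memalign_invoked) {
        return 1;   /* support detected */
    }
    return 0;       /* support silently lost */
}
#####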

The optimization can be disabled by adding -fno-tree-partial-pre to
the CFLAGS in opal/mca/memory/linux/Makefile, but this is only a
temporary workaround.
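
A build-wide alternative (not tested here) would be to pass the flag at
configure time instead of editing the generated Makefile, e.g.:

#####
./configure CFLAGS="-O3 -fno-tree-partial-pre" ...
make && make install
#####

Note that this affects all of Open MPI, not just the
opal/mca/memory/linux directory.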

Patch:
=====

The proposed patch is as follows; it changes the compiler's/optimizer's
view of the *_invoked variables, which are used by different code paths
(memory_linux_ptmalloc2.c vs. malloc.c). Declaring them volatile forces
the compiler to assume they may change outside the code it can see, so
the check at lines 103-106 is no longer removed:

#####
diff -rupN openmpi-1.8.5.org/opal/mca/memory/linux/memory_linux.h openmpi-1.8.5/opal/mca/memory/linux/memory_linux.h
--- openmpi-1.8.5.org/opal/mca/memory/linux/memory_linux.h     2014-10-03 22:32:23.000000000 +0200
+++ openmpi-1.8.5/opal/mca/memory/linux/memory_linux.h  2015-06-04 10:01:44.941544282 +0200
@@ -33,11 +33,11 @@ typedef struct opal_memory_linux_compone

 #if MEMORY_LINUX_PTMALLOC2
     /* Ptmalloc2-specific data */
-    bool free_invoked;
-    bool malloc_invoked;
-    bool realloc_invoked;
-    bool memalign_invoked;
-    bool munmap_invoked;
+    volatile bool free_invoked;
+    volatile bool malloc_invoked;
+    volatile bool realloc_invoked;
+    volatile bool memalign_invoked;
+    volatile bool munmap_invoked;
 #endif
 } opal_memory_linux_component_t;

#####

Additionally, a further patch should be applied that emits a warning in
the GPUDirect code path if leave_pinned is not available for some reason.
In that case, GPUDirect support should be disabled, because the transfers
run faster without it (compare Case 2 below).
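
A rough sketch of the kind of check meant here (illustration only, not a
ready-made patch; the exact place in btl_openib_component.c and the way
the GDR path is disabled are left open):

#####
/* Illustration only: warn and fall back if the memory hooks required
 * for mpi_leave_pinned are not active. */
int support = opal_mem_hooks_support_level();
if (0 == (support & (OPAL_MEMORY_FREE_SUPPORT | OPAL_MEMORY_CHUNK_SUPPORT))) {
    opal_output(0, "WARNING: memory hooks / mpi_leave_pinned not available; "
                   "disabling GPUDirect RDMA to avoid the latency penalty");
    /* ... disable the CUDA GDR path here ... */
}
#####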

Symptoms:
========

Very high latency with GPUDirect and fluctuating InfiniBand transfer
bandwidth, caused by mpi_leave_pinned being disabled at run time.

We are using the OSU Micro-Benchmarks 4.4.1 to show these GPUDirect
latency and multi-rail bandwidth performance problems.
System specification: 2 nodes with 2x Intel E5-2670 processors, Mellanox
Connect-IB MCB194A-FCAT HCA (dual-port FDR, PCIe 3.0 x16) and NVIDIA
Tesla K40c GPU connected to different PCIe root complexes/CPUs.
Software: CentOS 6.6, Mellanox OFED 2.4, CUDA 7.0, GCC 4.9.2 (local
build), Open MPI 1.8.5 (local build)

Without the patch applied:
#####
# Case 1:
mpirun -report-bindings -display-map -map-by node -np 2 \
    -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 \
    /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency \
    -d cuda D D
 Data for JOB [12959,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12959,1] App: 0 Process rank: 0

 Data for node: e5-2670-2       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12959,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:09670] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
[e5-2670-2:06302] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
# OSU MPI-CUDA Latency Test v4.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.26
1                    1442.93
2                    1440.30
4                    1440.63
8                    1443.72
16                   1444.98
32                   1442.04
64                   1441.51
128                  1442.62
256                  1443.31
512                  1443.67
1024                 1446.23
2048                 1449.38
4096                 1458.05
8192                 1476.22
16384                1515.97
32768                  36.86
65536                  45.16
131072                 60.57
262144                 94.38
524288                130.83
1048576               199.23
2097152               328.85
4194304               603.71
##
# Case 2:
mpirun -report-bindings -display-map -map-by node -np 2 \
    -mca btl_openib_want_cuda_gdr 0 -x CUDA_VISIBLE_DEVICES=0 \
    /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency \
    -d cuda D D
 Data for JOB [19644,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [19644,1] App: 0 Process rank: 0

 Data for node: e5-2670-2       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [19644,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:23525] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
[e5-2670-2:08479] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
# OSU MPI-CUDA Latency Test v4.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.27
1                      14.83
2                      15.11
4                      14.82
8                      14.85
16                     15.14
32                     14.92
64                     15.00
128                    15.40
256                    15.52
512                    15.53
1024                   15.68
2048                   16.39
4096                   18.92
8192                   21.69
16384                  32.64
32768                  36.92
65536                  44.26
131072                 60.99
262144                 94.18
524288                130.59
1048576               199.84
2097152               328.17
4194304               575.35
##
# Case 3:
mpirun -report-bindings -display-map -map-by node -np 2 \
    -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 \
    /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

 Data for JOB [12768,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12768,1] App: 0 Process rank: 0

 Data for node: e5-2670-2       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [12768,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:09913] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
[e5-2670-2:06639] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
# OSU MPI-CUDA Bandwidth Test v4.4.1
# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
# Size      Bandwidth (MB/s)
1                       1.09
2                       2.17
4                       4.31
8                       8.74
16                     16.67
32                     32.77
64                     65.24
128                   134.89
256                   268.24
512                   760.80
1024                 1436.22
2048                 2401.94
4096                 4501.21
8192                 5777.17
16384                5736.33
32768                6952.33
65536               10443.88
131072              11450.45
262144              11332.89
524288               8804.98
1048576              8820.94
2097152             11294.32
4194304             10869.27
#####

Expected behavior with the patch applied:
#####
# Case 4:
mpirun -report-bindings -display-map -map-by node -np 2 \
    -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 \
    /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_latency \
    -d cuda D D
 Data for JOB [17394,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [17394,1] App: 0 Process rank: 0

 Data for node: e5-2670-2       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [17394,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:21675] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
[e5-2670-2:06719] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
# OSU MPI-CUDA Latency Test v4.4.1
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size          Latency (us)
0                       1.27
1                       6.52
2                       6.50
4                       6.50
8                       6.74
16                      6.51
32                      6.54
64                      6.52
128                     6.75
256                     7.18
512                     7.82
1024                   10.01
2048                   14.12
4096                   22.31
8192                   33.27
16384                  55.25
32768                  37.42
65536                  44.22
131072                 60.00
262144                 94.27
524288                130.41
1048576               198.48
2097152               328.50
4194304               601.53
##
# Case 5:
mpirun -report-bindings -display-map -map-by node -np 2 \
    -mca btl_openib_want_cuda_gdr 1 -x CUDA_VISIBLE_DEVICES=0 \
    /exports/bin/osu-micro-benchmarks-4.4.1/openmpi/1.8.5/gcc/4.9.2/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw

 Data for JOB [17296,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: e5-2670-1       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [17296,1] App: 0 Process rank: 0

 Data for node: e5-2670-2       Num slots: 16   Max slots: 0    Num procs: 1
        Process OMPI jobid: [17296,1] App: 0 Process rank: 1

 =============================================================
[e5-2670-1:21705] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
[e5-2670-2:06754] MCW rank 1 bound to socket 0[core 0[hwt 0]]:
[B/././././././.][./././././././.]
# OSU MPI-CUDA Bandwidth Test v4.4.1
# Send Buffer on HOST (H) and Receive Buffer on HOST (H)
# Size      Bandwidth (MB/s)
1                       1.28
2                       2.56
4                       5.14
8                      10.26
16                     20.27
32                     40.31
64                     80.85
128                   161.58
256                   320.43
512                   880.34
1024                 1598.03
2048                 2819.98
4096                 4431.01
8192                 5809.84
16384                9668.16
32768               10930.90
65536               11789.82
131072              12245.28
262144              12494.67
524288              12615.41
1048576             12679.62
2097152             12689.27
4194304             12725.77
#####

Best regards,

René "oere" Oertel

Computer Architecture Group
Faculty of Computer Science

Technische Universität Chemnitz
Straße der Nationen 62 | R. 014A
09111 Chemnitz
Germany

Tel:    +49 371 531-37652
Fax:    +49 371 531-837652

rene.oer...@informatik.tu-chemnitz.de
http://www.tu-chemnitz.de/informatik/RA
