[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
Todd Lipcon has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9108 ) Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation .. Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation This bumps to a new gperftools/tcmalloc version which ought to be a bit faster. It also enables sized-deallocation support on compilers that support it. The upgrade is relatively straigthforward but for a few items: - for whatever reason, we developed a conflict between our own base::SpinLock from gutil and tcmalloc's. This conflict has always existed, but apparently tcmalloc's implementation changed sizes slightly, meaning that now the linker complains about the duplicate symbol definition. I created a simple find-replace patch to change tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'. - our previous patches are all now included in tcmalloc upstream and no longer necessary. - for unknown reasons, gperftools now fails to build with TSAN instrumentation: it shows duplicate symbols between the tcmalloc 'operator new' and clang_rt's 'operator new'. We never actually needed to use gperftools in TSAN builds anyway, so this just removes the instrumented build and properly ifdefs out a few of the places where we'd included gperftools headers unconditionally. Sized deallocation is a C++14 feature (which can be selectively enabled on C++11) which adds a new 'operator delete(void*, size_t)'. In the common case where we delete a Foo*, the compiler already knows sizeof(Foo) and thus can pass this second argument to 'operator delete'. This goes through an optimized code path inside tcmalloc -- namely, it can skip the expensive lookup which otherwise would be required to map the pointer back to its size class. Unfortuantely, we can't always rely on the system allocator to provide sized deallocation, and thus we can't use this in the exported variants of our code (i.e. the client library). So, this patch only enables it when building the Kudu binaries and not the client library. I ran a quick performance check using full_stack-insert-scan-test twice each before and after: Before: Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans': 322623.135418 task-clock (msec) #4.964 CPUs utilized 11,751,640 context-switches #0.036 M/sec 3,619,010 cpu-migrations#0.011 M/sec 117,706 page-faults #0.365 K/sec 1,057,656,296,542 cycles#3.278 GHz 695,668,728,662 instructions #0.66 insn per cycle 127,565,245,327 branches # 395.400 M/sec 1,480,987,998 branch-misses #1.16% of all branches 64.996506196 seconds time elapsed 330206.351604 task-clock (msec) #5.330 CPUs utilized 12,518,370 context-switches #0.038 M/sec 3,695,790 cpu-migrations#0.011 M/sec 113,580 page-faults #0.344 K/sec 1,059,061,436,907 cycles#3.207 GHz 697,782,051,928 instructions #0.66 insn per cycle 127,949,774,596 branches # 387.484 M/sec 1,371,967,917 branch-misses #1.07% of all branches 61.952217532 seconds time elapsed After: 310788.334202 task-clock (msec) #5.315 CPUs utilized 12,486,276 context-switches #0.040 M/sec 3,690,660 cpu-migrations#0.012 M/sec 113,988 page-faults #0.367 K/sec 1,027,033,932,375 cycles#3.305 GHz 663,508,168,985 instructions #0.65 insn per cycle 125,864,966,052 branches # 404.986 M/sec 1,442,683,495 branch-misses #1.15% of all branches 58.479033228 seconds time elapsed Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test.263 --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans': 317505.548908 task-clock (msec) #5.323 CPUs utilized 12,461,003 context-switches #0.039 M/sec 3,688,491 cpu-migrations#0.012 M/sec 113,952 page-faults #0.359 K/sec 1,029,595,155,391 cycles#3.243 GHz 663,514,269,656 instructions #0.64 insn per cycle 125,865,138,975 branches # 396.419 M/sec 1,419,717,877 branch-misses
[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/9108 ) Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation .. Patch Set 2: Code-Review+2 LGTM. Tests pass on macos 10.12. -- To view, visit http://gerrit.cloudera.org:8080/9108 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b Gerrit-Change-Number: 9108 Gerrit-PatchSet: 2 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 24 Jan 2018 18:30:31 + Gerrit-HasComments: No
[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
Hello Dan Burkert, Kudu Jenkins, Adar Dembo, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9108 to look at the new patch set (#2). Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation .. Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation This bumps to a new gperftools/tcmalloc version which ought to be a bit faster. It also enables sized-deallocation support on compilers that support it. The upgrade is relatively straigthforward but for a few items: - for whatever reason, we developed a conflict between our own base::SpinLock from gutil and tcmalloc's. This conflict has always existed, but apparently tcmalloc's implementation changed sizes slightly, meaning that now the linker complains about the duplicate symbol definition. I created a simple find-replace patch to change tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'. - our previous patches are all now included in tcmalloc upstream and no longer necessary. - for unknown reasons, gperftools now fails to build with TSAN instrumentation: it shows duplicate symbols between the tcmalloc 'operator new' and clang_rt's 'operator new'. We never actually needed to use gperftools in TSAN builds anyway, so this just removes the instrumented build and properly ifdefs out a few of the places where we'd included gperftools headers unconditionally. Sized deallocation is a C++14 feature (which can be selectively enabled on C++11) which adds a new 'operator delete(void*, size_t)'. In the common case where we delete a Foo*, the compiler already knows sizeof(Foo) and thus can pass this second argument to 'operator delete'. This goes through an optimized code path inside tcmalloc -- namely, it can skip the expensive lookup which otherwise would be required to map the pointer back to its size class. Unfortuantely, we can't always rely on the system allocator to provide sized deallocation, and thus we can't use this in the exported variants of our code (i.e. the client library). So, this patch only enables it when building the Kudu binaries and not the client library. I ran a quick performance check using full_stack-insert-scan-test twice each before and after: Before: Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans': 322623.135418 task-clock (msec) #4.964 CPUs utilized 11,751,640 context-switches #0.036 M/sec 3,619,010 cpu-migrations#0.011 M/sec 117,706 page-faults #0.365 K/sec 1,057,656,296,542 cycles#3.278 GHz 695,668,728,662 instructions #0.66 insn per cycle 127,565,245,327 branches # 395.400 M/sec 1,480,987,998 branch-misses #1.16% of all branches 64.996506196 seconds time elapsed 330206.351604 task-clock (msec) #5.330 CPUs utilized 12,518,370 context-switches #0.038 M/sec 3,695,790 cpu-migrations#0.011 M/sec 113,580 page-faults #0.344 K/sec 1,059,061,436,907 cycles#3.207 GHz 697,782,051,928 instructions #0.66 insn per cycle 127,949,774,596 branches # 387.484 M/sec 1,371,967,917 branch-misses #1.07% of all branches 61.952217532 seconds time elapsed After: 310788.334202 task-clock (msec) #5.315 CPUs utilized 12,486,276 context-switches #0.040 M/sec 3,690,660 cpu-migrations#0.012 M/sec 113,988 page-faults #0.367 K/sec 1,027,033,932,375 cycles#3.305 GHz 663,508,168,985 instructions #0.65 insn per cycle 125,864,966,052 branches # 404.986 M/sec 1,442,683,495 branch-misses #1.15% of all branches 58.479033228 seconds time elapsed Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test.263 --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans': 317505.548908 task-clock (msec) #5.323 CPUs utilized 12,461,003 context-switches #0.039 M/sec 3,688,491 cpu-migrations#0.012 M/sec 113,952 page-faults #0.359 K/sec 1,029,595,155,391 cycles#3.243 GHz 663,514,269,656 instructions #0.64 insn per cycle 125,865,138,975 branches
[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/9108 ) Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt File CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt@610 PS1, Line 610: Exported variants should conform to the C++11 ABI > Technically they conform to the pre-C++11 ABI, no? Well, at least all of th good point. I thought you were on vacation, though :P -- To view, visit http://gerrit.cloudera.org:8080/9108 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b Gerrit-Change-Number: 9108 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon Gerrit-Comment-Date: Wed, 24 Jan 2018 00:17:44 + Gerrit-HasComments: Yes
[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/9108 ) Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation .. Patch Set 1: (1 comment) http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt File CMakeLists.txt: http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt@610 PS1, Line 610: Exported variants should conform to the C++11 ABI Technically they conform to the pre-C++11 ABI, no? Well, at least all of the exported ABI symbols must adhere to the C++03 symbols; not sure about "hidden" symbols. -- To view, visit http://gerrit.cloudera.org:8080/9108 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b Gerrit-Change-Number: 9108 Gerrit-PatchSet: 1 Gerrit-Owner: Todd Lipcon Gerrit-Reviewer: Adar Dembo Gerrit-Reviewer: Dan Burkert Gerrit-Reviewer: Kudu Jenkins Gerrit-Comment-Date: Tue, 23 Jan 2018 23:16:26 + Gerrit-HasComments: Yes
[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
Hello Dan Burkert, I'd like you to do a code review. Please visit http://gerrit.cloudera.org:8080/9108 to review the following change. Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation .. Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation This bumps to a new gperftools/tcmalloc version which ought to be a bit faster. It also enables sized-deallocation support on compilers that support it. The upgrade is relatively straigthforward but for a few items: - for whatever reason, we developed a conflict between our own base::SpinLock from gutil and tcmalloc's. This conflict has always existed, but apparently tcmalloc's implementation changed sizes slightly, meaning that now the linker complains about the duplicate symbol definition. I created a simple find-replace patch to change tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'. - our previous patches are all now included in tcmalloc upstream and no longer necessary. Sized deallocation is a C++14 feature (which can be selectively enabled on C++11) which adds a new 'operator delete(void*, size_t)'. In the common case where we delete a Foo*, the compiler already knows sizeof(Foo) and thus can pass this second argument to 'operator delete'. This goes through an optimized code path inside tcmalloc -- namely, it can skip the expensive lookup which otherwise would be required to map the pointer back to its size class. Unfortuantely, we can't always rely on the system allocator to provide sized deallocation, and thus we can't use this in the exported variants of our code (i.e. the client library). So, this patch only enables it when building the Kudu binaries and not the client library. I ran a quick performance check using full_stack-insert-scan-test twice each before and after: Before: Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans': 322623.135418 task-clock (msec) #4.964 CPUs utilized 11,751,640 context-switches #0.036 M/sec 3,619,010 cpu-migrations#0.011 M/sec 117,706 page-faults #0.365 K/sec 1,057,656,296,542 cycles#3.278 GHz 695,668,728,662 instructions #0.66 insn per cycle 127,565,245,327 branches # 395.400 M/sec 1,480,987,998 branch-misses #1.16% of all branches 64.996506196 seconds time elapsed 330206.351604 task-clock (msec) #5.330 CPUs utilized 12,518,370 context-switches #0.038 M/sec 3,695,790 cpu-migrations#0.011 M/sec 113,580 page-faults #0.344 K/sec 1,059,061,436,907 cycles#3.207 GHz 697,782,051,928 instructions #0.66 insn per cycle 127,949,774,596 branches # 387.484 M/sec 1,371,967,917 branch-misses #1.07% of all branches 61.952217532 seconds time elapsed After: 310788.334202 task-clock (msec) #5.315 CPUs utilized 12,486,276 context-switches #0.040 M/sec 3,690,660 cpu-migrations#0.012 M/sec 113,988 page-faults #0.367 K/sec 1,027,033,932,375 cycles#3.305 GHz 663,508,168,985 instructions #0.65 insn per cycle 125,864,966,052 branches # 404.986 M/sec 1,442,683,495 branch-misses #1.15% of all branches 58.479033228 seconds time elapsed Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test.263 --gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 --concurrent_inserts=10 --rows_per_batch=1 --skip_scans': 317505.548908 task-clock (msec) #5.323 CPUs utilized 12,461,003 context-switches #0.039 M/sec 3,688,491 cpu-migrations#0.012 M/sec 113,952 page-faults #0.359 K/sec 1,029,595,155,391 cycles#3.243 GHz 663,514,269,656 instructions #0.64 insn per cycle 125,865,138,975 branches # 396.419 M/sec 1,419,717,877 branch-misses #1.13% of all branches 59.642651558 seconds time elapsed Seems like a small improvement in cycle count, wall clock, etc. Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b --- M CMakeLists.txt M thirdparty/download-thirdparty.sh D thirdparty/patches/gperftools-Change-default-TCMALLOC_TRANSFER_NUM_OBJ-to-40.patch A thirdparty/patche