[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

2018-01-24 Thread Todd Lipcon (Code Review)
Todd Lipcon has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
..

Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.

The upgrade is relatively straigthforward but for a few items:
- for whatever reason, we developed a conflict between our own
  base::SpinLock from gutil and tcmalloc's. This conflict has always
  existed, but apparently tcmalloc's implementation changed sizes
  slightly, meaning that now the linker complains about the duplicate
  symbol definition. I created a simple find-replace patch to change
  tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches are all now included in tcmalloc upstream and no
  longer necessary.
- for unknown reasons, gperftools now fails to build with TSAN
  instrumentation: it shows duplicate symbols between the tcmalloc
  'operator new' and clang_rt's 'operator new'. We never actually needed
  to use gperftools in TSAN builds anyway, so this just removes the
  instrumented build and properly ifdefs out a few of the places where
  we'd included gperftools headers unconditionally.

Sized deallocation is a C++14 feature (which can be selectively enabled
on C++11) which adds a new 'operator delete(void*, size_t)'. In the
common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and thus can pass this second argument to 'operator delete'.
This goes through an optimized code path inside tcmalloc -- namely, it
can skip the expensive lookup which otherwise would be required to map
the pointer back to its size class.

Unfortuantely, we can't always rely on the system allocator to provide
sized deallocation, and thus we can't use this in the exported variants
of our code (i.e. the client library). So, this patch only enables it
when building the Kudu binaries and not the client library.

I ran a quick performance check using full_stack-insert-scan-test twice
each before and after:

Before:
 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test 
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

 322623.135418  task-clock (msec) #4.964 CPUs utilized
11,751,640  context-switches  #0.036 M/sec
 3,619,010  cpu-migrations#0.011 M/sec
   117,706  page-faults   #0.365 K/sec
 1,057,656,296,542  cycles#3.278 GHz
   695,668,728,662  instructions  #0.66  insn per cycle
   127,565,245,327  branches  #  395.400 M/sec
 1,480,987,998  branch-misses #1.16% of all branches

  64.996506196 seconds time elapsed

 330206.351604  task-clock (msec) #5.330 CPUs utilized
12,518,370  context-switches  #0.038 M/sec
 3,695,790  cpu-migrations#0.011 M/sec
   113,580  page-faults   #0.344 K/sec
 1,059,061,436,907  cycles#3.207 GHz
   697,782,051,928  instructions  #0.66  insn per cycle
   127,949,774,596  branches  #  387.484 M/sec
 1,371,967,917  branch-misses #1.07% of all branches

  61.952217532 seconds time elapsed

After:

 310788.334202  task-clock (msec) #5.315 CPUs utilized
12,486,276  context-switches  #0.040 M/sec
 3,690,660  cpu-migrations#0.012 M/sec
   113,988  page-faults   #0.367 K/sec
 1,027,033,932,375  cycles#3.305 GHz
   663,508,168,985  instructions  #0.65  insn per cycle
   125,864,966,052  branches  #  404.986 M/sec
 1,442,683,495  branch-misses #1.15% of all branches

  58.479033228 seconds time elapsed

 Performance counter stats for 
'build/latest/bin/full_stack-insert-scan-test.263 
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

 317505.548908  task-clock (msec) #5.323 CPUs utilized
12,461,003  context-switches  #0.039 M/sec
 3,688,491  cpu-migrations#0.012 M/sec
   113,952  page-faults   #0.359 K/sec
 1,029,595,155,391  cycles#3.243 GHz
   663,514,269,656  instructions  #0.64  insn per cycle
   125,865,138,975  branches  #  396.419 M/sec
 1,419,717,877  branch-misses 

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

2018-01-24 Thread Dan Burkert (Code Review)
Dan Burkert has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
..


Patch Set 2: Code-Review+2

LGTM.  Tests pass on macos 10.12.


--
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 2
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 24 Jan 2018 18:30:31 +
Gerrit-HasComments: No


[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

2018-01-23 Thread Todd Lipcon (Code Review)
Hello Dan Burkert, Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9108

to look at the new patch set (#2).

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
..

Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.

The upgrade is relatively straigthforward but for a few items:
- for whatever reason, we developed a conflict between our own
  base::SpinLock from gutil and tcmalloc's. This conflict has always
  existed, but apparently tcmalloc's implementation changed sizes
  slightly, meaning that now the linker complains about the duplicate
  symbol definition. I created a simple find-replace patch to change
  tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches are all now included in tcmalloc upstream and no
  longer necessary.
- for unknown reasons, gperftools now fails to build with TSAN
  instrumentation: it shows duplicate symbols between the tcmalloc
  'operator new' and clang_rt's 'operator new'. We never actually needed
  to use gperftools in TSAN builds anyway, so this just removes the
  instrumented build and properly ifdefs out a few of the places where
  we'd included gperftools headers unconditionally.

Sized deallocation is a C++14 feature (which can be selectively enabled
on C++11) which adds a new 'operator delete(void*, size_t)'. In the
common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and thus can pass this second argument to 'operator delete'.
This goes through an optimized code path inside tcmalloc -- namely, it
can skip the expensive lookup which otherwise would be required to map
the pointer back to its size class.

Unfortuantely, we can't always rely on the system allocator to provide
sized deallocation, and thus we can't use this in the exported variants
of our code (i.e. the client library). So, this patch only enables it
when building the Kudu binaries and not the client library.

I ran a quick performance check using full_stack-insert-scan-test twice
each before and after:

Before:
 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test 
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

 322623.135418  task-clock (msec) #4.964 CPUs utilized
11,751,640  context-switches  #0.036 M/sec
 3,619,010  cpu-migrations#0.011 M/sec
   117,706  page-faults   #0.365 K/sec
 1,057,656,296,542  cycles#3.278 GHz
   695,668,728,662  instructions  #0.66  insn per cycle
   127,565,245,327  branches  #  395.400 M/sec
 1,480,987,998  branch-misses #1.16% of all branches

  64.996506196 seconds time elapsed

 330206.351604  task-clock (msec) #5.330 CPUs utilized
12,518,370  context-switches  #0.038 M/sec
 3,695,790  cpu-migrations#0.011 M/sec
   113,580  page-faults   #0.344 K/sec
 1,059,061,436,907  cycles#3.207 GHz
   697,782,051,928  instructions  #0.66  insn per cycle
   127,949,774,596  branches  #  387.484 M/sec
 1,371,967,917  branch-misses #1.07% of all branches

  61.952217532 seconds time elapsed

After:

 310788.334202  task-clock (msec) #5.315 CPUs utilized
12,486,276  context-switches  #0.040 M/sec
 3,690,660  cpu-migrations#0.012 M/sec
   113,988  page-faults   #0.367 K/sec
 1,027,033,932,375  cycles#3.305 GHz
   663,508,168,985  instructions  #0.65  insn per cycle
   125,864,966,052  branches  #  404.986 M/sec
 1,442,683,495  branch-misses #1.15% of all branches

  58.479033228 seconds time elapsed

 Performance counter stats for 
'build/latest/bin/full_stack-insert-scan-test.263 
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

 317505.548908  task-clock (msec) #5.323 CPUs utilized
12,461,003  context-switches  #0.039 M/sec
 3,688,491  cpu-migrations#0.012 M/sec
   113,952  page-faults   #0.359 K/sec
 1,029,595,155,391  cycles#3.243 GHz
   663,514,269,656  instructions  #0.64  insn per cycle
   125,865,138,975  branches   

[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

2018-01-23 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt
File CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt@610
PS1, Line 610: Exported variants should conform to the C++11 ABI
> Technically they conform to the pre-C++11 ABI, no? Well, at least all of th
good point. I thought you were on vacation, though :P



--
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 24 Jan 2018 00:17:44 +
Gerrit-HasComments: Yes


[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

2018-01-23 Thread Adar Dembo (Code Review)
Adar Dembo has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9108 )

Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt
File CMakeLists.txt:

http://gerrit.cloudera.org:8080/#/c/9108/1/CMakeLists.txt@610
PS1, Line 610: Exported variants should conform to the C++11 ABI
Technically they conform to the pre-C++11 ABI, no? Well, at least all of the 
exported ABI symbols must adhere to the C++03 symbols; not sure about "hidden" 
symbols.



--
To view, visit http://gerrit.cloudera.org:8080/9108
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
Gerrit-Change-Number: 9108
Gerrit-PatchSet: 1
Gerrit-Owner: Todd Lipcon 
Gerrit-Reviewer: Adar Dembo 
Gerrit-Reviewer: Dan Burkert 
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 23 Jan 2018 23:16:26 +
Gerrit-HasComments: Yes


[kudu-CR] Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

2018-01-23 Thread Todd Lipcon (Code Review)
Hello Dan Burkert,

I'd like you to do a code review. Please visit

http://gerrit.cloudera.org:8080/9108

to review the following change.


Change subject: Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation
..

Upgrade to tcmalloc 2.6.3 and enabled sized-deallocation

This bumps to a new gperftools/tcmalloc version which ought to be a bit
faster. It also enables sized-deallocation support on compilers that
support it.

The upgrade is relatively straigthforward but for a few items:
- for whatever reason, we developed a conflict between our own
  base::SpinLock from gutil and tcmalloc's. This conflict has always
  existed, but apparently tcmalloc's implementation changed sizes
  slightly, meaning that now the linker complains about the duplicate
  symbol definition. I created a simple find-replace patch to change
  tcmalloc to use a 'tcmalloc::' namespace instead of 'base::'.
- our previous patches are all now included in tcmalloc upstream and no
  longer necessary.

Sized deallocation is a C++14 feature (which can be selectively enabled
on C++11) which adds a new 'operator delete(void*, size_t)'. In the
common case where we delete a Foo*, the compiler already knows
sizeof(Foo) and thus can pass this second argument to 'operator delete'.
This goes through an optimized code path inside tcmalloc -- namely, it
can skip the expensive lookup which otherwise would be required to map
the pointer back to its size class.

Unfortuantely, we can't always rely on the system allocator to provide
sized deallocation, and thus we can't use this in the exported variants
of our code (i.e. the client library). So, this patch only enables it
when building the Kudu binaries and not the client library.

I ran a quick performance check using full_stack-insert-scan-test twice
each before and after:

Before:
 Performance counter stats for 'build/latest/bin/full_stack-insert-scan-test 
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

 322623.135418  task-clock (msec) #4.964 CPUs utilized
11,751,640  context-switches  #0.036 M/sec
 3,619,010  cpu-migrations#0.011 M/sec
   117,706  page-faults   #0.365 K/sec
 1,057,656,296,542  cycles#3.278 GHz
   695,668,728,662  instructions  #0.66  insn per cycle
   127,565,245,327  branches  #  395.400 M/sec
 1,480,987,998  branch-misses #1.16% of all branches

  64.996506196 seconds time elapsed

 330206.351604  task-clock (msec) #5.330 CPUs utilized
12,518,370  context-switches  #0.038 M/sec
 3,695,790  cpu-migrations#0.011 M/sec
   113,580  page-faults   #0.344 K/sec
 1,059,061,436,907  cycles#3.207 GHz
   697,782,051,928  instructions  #0.66  insn per cycle
   127,949,774,596  branches  #  387.484 M/sec
 1,371,967,917  branch-misses #1.07% of all branches

  61.952217532 seconds time elapsed

After:

 310788.334202  task-clock (msec) #5.315 CPUs utilized
12,486,276  context-switches  #0.040 M/sec
 3,690,660  cpu-migrations#0.012 M/sec
   113,988  page-faults   #0.367 K/sec
 1,027,033,932,375  cycles#3.305 GHz
   663,508,168,985  instructions  #0.65  insn per cycle
   125,864,966,052  branches  #  404.986 M/sec
 1,442,683,495  branch-misses #1.15% of all branches

  58.479033228 seconds time elapsed

 Performance counter stats for 
'build/latest/bin/full_stack-insert-scan-test.263 
--gtest_filter=*MRSOnlyStressTest* --inserts_per_client=20 
--concurrent_inserts=10 --rows_per_batch=1 --skip_scans':

 317505.548908  task-clock (msec) #5.323 CPUs utilized
12,461,003  context-switches  #0.039 M/sec
 3,688,491  cpu-migrations#0.012 M/sec
   113,952  page-faults   #0.359 K/sec
 1,029,595,155,391  cycles#3.243 GHz
   663,514,269,656  instructions  #0.64  insn per cycle
   125,865,138,975  branches  #  396.419 M/sec
 1,419,717,877  branch-misses #1.13% of all branches

  59.642651558 seconds time elapsed

Seems like a small improvement in cycle count, wall clock, etc.

Change-Id: I3b76c9d71b29d58f0b6dade3fc33b32d0c3de23b
---
M CMakeLists.txt
M thirdparty/download-thirdparty.sh
D 
thirdparty/patches/gperftools-Change-default-TCMALLOC_TRANSFER_NUM_OBJ-to-40.patch
A 
thirdparty/patche