[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Internal Jenkins has submitted this change and it was merged. Change subject: IMPALA-4123: Fast bit unpacking .. IMPALA-4123: Fast bit unpacking Adds utility functions for fast unpacking of batches of bit-packed values. These support reading batches of any number of values, provided that the start of the batch is aligned to a byte boundary. Callers that want to read smaller batches that don't align to byte boundaries will need to implement their own buffering. The unpacking code uses only portable C++ and no SIMD intrinsics, but is fairly efficient because unpacking a full batch of 32 values compiles down to 32-bit loads, shifts by constants, masks by constants, bitwise ORs where a value straddles two 32-bit words, and stores. Further speedups should be possible using SIMD intrinsics. Testing: Added unit tests for unpacking, exhaustively covering different bit widths with additional test dimensions (memory alignment, various input sizes, etc.). Tested under ASAN to ensure the bit unpacking doesn't read past the end of buffers. Perf: Added a microbenchmark that shows on average an 8-9x speedup over the existing BitReader code. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Reviewed-on: http://gerrit.cloudera.org:8080/4494 Reviewed-by: Tim Armstrong Tested-by: Internal Jenkins --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc M be/src/exprs/expr-test.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/openssl-util-test.cc 13 files changed, 929 insertions(+), 84 deletions(-) Approvals: Internal Jenkins: Verified Tim Armstrong: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 16 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
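The commit message's claim that unpacking compiles down to 32-bit loads, constant shifts, constant masks, and ORs across word boundaries can be illustrated with a minimal sketch. This is hypothetical code, not Impala's actual BitPacking API; it assumes LSB-first packing and a little-endian target such as x86.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal sketch of bit unpacking (NOT Impala's BitPacking API): unpack
// 'num_values' BIT_WIDTH-bit values whose first value starts on a byte
// boundary. Each value is extracted with a 32-bit load, a shift, and a mask;
// when a value straddles two 32-bit words, the high bits are ORed in from the
// next word. Assumes LSB-first packing and a little-endian target.
template <int BIT_WIDTH>
std::vector<uint32_t> UnpackNaive(const uint8_t* in, int num_values) {
  static_assert(BIT_WIDTH > 0 && BIT_WIDTH < 32, "sketch covers 1..31 bits");
  const uint32_t mask = (1u << BIT_WIDTH) - 1;
  std::vector<uint32_t> out(num_values);
  for (int i = 0; i < num_values; ++i) {
    const int64_t first_bit = static_cast<int64_t>(i) * BIT_WIDTH;
    const int64_t word_idx = first_bit / 32;
    const int shift = static_cast<int>(first_bit % 32);
    uint32_t lo;
    std::memcpy(&lo, in + word_idx * 4, 4);  // unaligned-safe 32-bit load
    uint32_t v = lo >> shift;
    if (shift + BIT_WIDTH > 32) {  // value straddles two 32-bit words
      uint32_t hi;
      std::memcpy(&hi, in + (word_idx + 1) * 4, 4);
      v |= hi << (32 - shift);
    }
    out[i] = v & mask;
  }
  return out;
}
```

A real implementation must guard the trailing 32-bit loads so they never read past the end of the input buffer (the ASAN testing mentioned above targets exactly that class of bug); in this sketch the caller is expected to over-allocate the packed buffer.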
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Internal Jenkins has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 15: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 15: (1 comment) http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: Line 132: > No seed is the same as a single default seed for this type. Oh ok, this should be ok too then. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 15: Code-Review+2 Carry +2 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 15 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Hello Jim Apple, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4494 to look at the new patch set (#14). Change subject: IMPALA-4123: Fast bit unpacking .. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc M be/src/exprs/expr-test.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/openssl-util-test.cc 13 files changed, 929 insertions(+), 84 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/14 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 13: Code-Review+2 (2 comments) http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: Line 132: > Done but didn't seed it since flakiness would only happen if we have a bad No seed is the same as a single default seed for this type. http://gerrit.cloudera.org:8080/#/c/4494/13/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS13, Line 137: 0, numeric_limits<uint32_t>::max() these are the default arguments to the constructor. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#13). Change subject: IMPALA-4123: Fast bit unpacking .. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc M be/src/exprs/expr-test.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/openssl-util-test.cc 13 files changed, 929 insertions(+), 84 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/13 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 13 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 12: (8 comments) http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: Line 19: #include > cstd... Done Line 132: > rand() is only guaranteed to have 15 non-zero bits - the high 17 bits may a Done but didn't seed it since flakiness would only happen if we have a bad product bug with incorrect results. http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS11, Line 74: /// Implementation of Unpack32Values() that uses 32-bit integer loads to : /// unpack values with the > Did you check to see that this benchmarks to be actually faster than if BIT Looks like it might be slightly faster (although within the margin of error) to do the switch on bit_width within Unpack32Values() and UnpackUpTo32Values(). So I'll go with that. http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: Line 57: return std::make_pair(in_pos, values_to_read); > After some reading, I think you can and should throw a static on these cons For classes it makes sense - if you have a non-static constexpr in a class definition then the compiler almost certainly has to allocate storage in the memory layout of the object (since you could pass the object into third-party code, which could take the address of the member). I don't think we want to do that for local variables. I don't see any benefit - the compiler shouldn't have any problem fully propagating the constants, and there's no reason it would allocate storage locally unless we take the address of the variable. Line 140: case i: return Unpack32Values(in, in_bytes, out); > constexpr BYTES_TO_READ, if you mark BitUtil::RoundUpNumBytes constexpr, or Done. Marked a few other functions in BitUtil constexpr for consistency. 
http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-stream-utils.h File be/src/util/bit-stream-utils.h: Line 100: /// 'buffer' is the buffer to read from. The buffer's length is 'buffer_len'. > While you're here, can you add "Does not take ownership of the pointer"? Done Line 142: /// Maximum byte length of a vlq encoded int > While you're here, can you DISALLOW_COPY_AND_ASSIGN? I believe we copy them in a couple of places. It's actually harmless and maybe useful since you can fork the state of the reader. I added a comment to document that it was deliberately left defined. http://gerrit.cloudera.org:8080/#/c/4494/12/be/src/util/openssl-util-test.cc File be/src/util/openssl-util-test.cc: Line 27: using std::uniform_int_distribution; Switched from boost for consistency -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#12). Change subject: IMPALA-4123: Fast bit unpacking .. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc M be/src/exprs/expr-test.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h M be/src/util/bit-util.h M be/src/util/openssl-util-test.cc 13 files changed, 928 insertions(+), 84 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/12 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 12 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: Line 57: constexpr int BATCH_SIZE = 32; After some reading, I think you can and should throw a static on these constexprs. For instance, http://stackoverflow.com/questions/26152096/when-and-why-would-you-use-static-with-constexpr -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 11: (1 comment) http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: Line 140: const int bytes_to_read = BitUtil::RoundUpNumBytes(32 * BIT_WIDTH); constexpr BYTES_TO_READ, if you mark BitUtil::RoundUpNumBytes constexpr, or if you write BIT_WIDTH*32/CHAR_BIT instead -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 11: (5 comments) http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: Line 19: #include cstd... Line 132: std::generate(std::begin(in), std::end(in), rand); rand() is only guaranteed to have 15 non-zero bits - the high 17 bits may all be 0 each time. More modern rand, and also seeded to prevent flaky tests: std::mt19937 pseudorandom(0); std::uniform_int_distribution<uint32_t> uniform; std::generate(std::begin(in), std::end(in), [&pseudorandom, &uniform]() { return uniform(pseudorandom); }); http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS11, Line 74: /// Templated versions of UnpackValues() that can be used if BIT_WIDTH is a : /// compile-time constant. Did you check to see that this benchmarks to be actually faster than if BIT_WIDTH is not a template parameter? http://gerrit.cloudera.org:8080/#/c/4494/11/be/src/util/bit-stream-utils.h File be/src/util/bit-stream-utils.h: Line 100: /// 'buffer' is the buffer to read from. The buffer's length is 'buffer_len'. While you're here, can you add "Does not take ownership of the pointer"? Line 142: const uint8_t* buffer_; While you're here, can you DISALLOW_COPY_AND_ASSIGN? -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
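Jim's suggestion above, expanded into a compilable, self-contained form. The distribution's value type of uint32_t is an assumption for this sketch; a default-constructed uniform_int_distribution spans the type's full range, and std::rand() is only required to cover [0, 32767].

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <random>

// Seeded-PRNG fill along the lines Jim suggests: std::rand() guarantees only
// 15 random bits (RAND_MAX may be as small as 32767), while std::mt19937
// yields full 32-bit values; the fixed seed keeps test inputs reproducible.
std::array<uint32_t, 64> FillRandom() {
  std::mt19937 pseudorandom(0);  // fixed seed: deterministic, non-flaky tests
  // Default ctor is equivalent to (0, numeric_limits<uint32_t>::max()).
  std::uniform_int_distribution<uint32_t> uniform;
  std::array<uint32_t, 64> in;
  std::generate(std::begin(in), std::end(in),
      [&pseudorandom, &uniform]() { return uniform(pseudorandom); });
  return in;
}
```

Because the generator is seeded with a constant, repeated runs see identical inputs, so a failure reproduces instead of flaking.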
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 10: (4 comments) http://gerrit.cloudera.org:8080/#/c/4494/10/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS10, Line 39: it > them Done PS10, Line 39: Both > Both buffers Done Line 60: uint8_t* packed = storage.data() + aligned; > If aligned is true, then packed is not aligned Done http://gerrit.cloudera.org:8080/#/c/4494/10/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: Line 41: #define UNPACK_VALUES_CASE(ignore1, i, ignore2) \ > Here is a way to do this without any macros, but it's a bit heavyweight fro IMO the macro is easier to understand because the macro bears some relation to the code being stamped out. Also you have a better chance of debugging it by looking at the preprocessed output. I think the macro makes sense in this case since we're really just trying to stamp out repetitive code instead of build an abstraction. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
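The macro-versus-template trade-off debated above can be made concrete with a tiny standalone sketch. All names here are invented; Impala's UNPACK_VALUES_CASE takes (ignore1, i, ignore2), the callback shape used by preprocessor repetition utilities. The macro stamps out one switch case per bit width, each dispatching to a template instantiation where the width is a compile-time constant.

```cpp
#include <cassert>
#include <cstdint>

// Compile-time-constant bit width: the compiler can fold the mask.
template <int BIT_WIDTH>
uint32_t MaskValue(uint32_t v) {
  // BIT_WIDTH % 32 keeps the shift in range in the (unused) 32-bit branch.
  return v & ((BIT_WIDTH == 32) ? 0xFFFFFFFFu : ((1u << (BIT_WIDTH % 32)) - 1u));
}

// The macro stamps out repetitive switch cases, mapping a runtime width to
// the matching template instantiation (hypothetical, simplified analogue of
// UNPACK_VALUES_CASE).
#define MASK_VALUE_CASE(i) \
  case i: return MaskValue<i>(v);

uint32_t MaskValueRuntime(int bit_width, uint32_t v) {
  switch (bit_width) {
    MASK_VALUE_CASE(0) MASK_VALUE_CASE(1) MASK_VALUE_CASE(2)
    MASK_VALUE_CASE(3) MASK_VALUE_CASE(4) MASK_VALUE_CASE(5)
    MASK_VALUE_CASE(6) MASK_VALUE_CASE(7) MASK_VALUE_CASE(8)
    default: return 0;  // sketch only handles widths 0..8
  }
}
```

As Tim argues, the stamped-out cases stay close to the code they generate and remain visible in preprocessed output; the macro-free alternative Jim posts trades this for heavier template syntax.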
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#11). Change subject: IMPALA-4123: Fast bit unpacking .. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 10 files changed, 886 insertions(+), 39 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/11 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 11 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 10: (4 comments) http://gerrit.cloudera.org:8080/#/c/4494/10/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS10, Line 39: it them PS10, Line 39: Both Both buffers Line 60: uint8_t* packed = storage.data() + aligned; If aligned is true, then packed is not aligned http://gerrit.cloudera.org:8080/#/c/4494/10/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: Line 41: #define UNPACK_VALUES_CASE(ignore1, i, ignore2) \ Here is a way to do this without any macros, but it's a bit heavyweight from a syntax POV. I found it was roughly the same speed as the macro approach:
--- a/be/src/util/bit-packing.inline.h
+++ b/be/src/util/bit-packing.inline.h
@@ -31,10 +31,33 @@
 namespace impala {
 
+template <typename T, int... I>
+constexpr std::array<decltype(&T::template f<0>), sizeof...(I)> MakeArray(
+    std::integer_sequence<int, I...>) {
+  return {&T::template f<I>...};
+}
+
+template <typename OutType>
+struct BitPackStruct {
+  template <int BIT_WIDTH>
+  static auto f(const uint8_t* __restrict__ in, int64_t in_bytes, int64_t num_values,
+      OutType* __restrict__ out) {
+    return BitPacking::UnpackValues<BIT_WIDTH>(in, in_bytes, num_values, out);
+  }
+};
+
 template <typename OutType>
 std::pair<const uint8_t*, int64_t> BitPacking::UnpackValues(int bit_width,
     const uint8_t* __restrict__ in, int64_t in_bytes, int64_t num_values,
     OutType* __restrict__ out) {
+  static constexpr auto UNPACKERS =
+      MakeArray<BitPackStruct<OutType>>(std::make_integer_sequence<int, 33>());
+  if (bit_width < 0 || bit_width > 32) {
+    DCHECK(false);
+    return {nullptr, 0};
+  }
+  return UNPACKERS[bit_width](in, in_bytes, num_values, out);
Similar things can be done below. Line 146 can be done without std::make_integer_sequence. 
-- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
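Jim's diff builds a constexpr table of function-pointer instantiations indexed by the runtime bit width. Here is a stripped-down, self-contained illustration of that technique; all names (Masker, MaskDispatch) are invented for the sketch, and it requires C++14 for std::integer_sequence.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// A struct wrapping a templated static function, so its instantiations can
// be collected into an array (illustrative stand-in for BitPackStruct).
struct Masker {
  template <int BIT_WIDTH>
  static uint32_t f(uint32_t v) {
    return v & ((BIT_WIDTH == 32) ? 0xFFFFFFFFu : ((1u << (BIT_WIDTH % 32)) - 1u));
  }
};

// Expand an integer_sequence into a constexpr function-pointer table:
// entry i points at T::f<i>.
template <typename T, int... I>
constexpr std::array<uint32_t (*)(uint32_t), sizeof...(I)> MakeArray(
    std::integer_sequence<int, I...>) {
  return {{&T::template f<I>...}};
}

// Runtime dispatch is then a bounds check plus one indexed indirect call.
uint32_t MaskDispatch(int bit_width, uint32_t v) {
  static constexpr auto kTable =
      MakeArray<Masker>(std::make_integer_sequence<int, 33>());
  if (bit_width < 0 || bit_width > 32) return 0;
  return kTable[bit_width](v);
}
```

Compared with the macro-stamped switch, this replaces 33 hand-written cases with one pack expansion, at the cost of the template machinery Tim calls heavyweight.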
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#10). Change subject: IMPALA-4123: Fast bit unpacking .. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 10 files changed, 884 insertions(+), 39 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/10 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 10 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 7: (3 comments) Sorry about that - I wish there were a good way to see whether I've responded to all of the comments on different patch set versions. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS7, Line 42: misaligned > I think it would help the reader here to spell out that misaligned is with Done PS7, Line 132: min(length, 1) > Still, if we're only allowing those two you could switch it to a bool 'misa Made the change, then switched to 'aligned' to avoid having to think about a double negative. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS7, Line 69: in_bytes -= batch_size * (BIT_WIDTH / CHAR_BIT); > Should there be an additional test that would demonstrate that bug? Running the test under ASAN caught it once I fixed the bug in the test code. The bug was exactly the kind of bug I was trying to catch with the test, so we do have the coverage. It seems weird (although not entirely without merit) to add a regression test for a test bug. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 9: > (11 comments) It looks like you might have missed my comments on Base and PS7 in this group of comments. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 8: (11 comments) http://gerrit.cloudera.org:8080/#/c/4494/8//COMMIT_MSG Commit Message: PS8, Line 17: 64 > 32, now? Done PS8, Line 18: ors at 64-bit boundaries > What does it mean to have an or at a k-bit boundary? Reworded to be hopefully clearer. http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: Line 273: #include "common/names.h" > This isn't needed if using namespace std;, yes? I think I prefer keeping this and removing the "using namespace" declarations below. http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/benchmarks/bswap-benchmark.cc File be/src/benchmarks/bswap-benchmark.cc: Line 123 > Thanks for fixing this. When this is committed, can you file a bug against Sure, will do. http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/testutil/mem-util.h File be/src/testutil/mem-util.h: PS8, Line 31: posix_memalign > #include <stdlib.h>. I'd say cstdlib, but posix_memalign doesn't seem to be p Added cstdlib. I think in practice it's reasonable to assume that cstdlib includes the same things as stdlib.h in some form. The glibc one from the toolchain literally has #include <stdlib.h> in it. Line 45: private: > DISALLOW_COPY_AND_ASSIGN Done http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: Line 39: void UnpackSubset(const uint32_t* in, const uint8_t* packed, int num_in_values, > I think it would aid clarity to add a short description of what the meaning Added descriptions and elaborated a little on what the function is doing. PS8, Line 45: uint32_t > const uint32_t * in Done PS8, Line 72: INFO > Was this just for debugging, or did you mean to leave it in? Didn't mean to leave it in - removed. PS8, Line 84: 8 > Why 8 and not 4? Used CHAR_BIT and added an intermediate variable to make it a bit clearer. 
http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS8, Line 64: //LOG(INFO) << "bit_width " << BIT_WIDTH << "\n" : // << "in_bytes " << in_bytes << " num_values " << num_values << " max_input_values " << max_input_values << "\n" :// << "values_to_read " << values_to_read << " batches_to_read " << batches_to_read << "\n" :// << "remainder_values " << remainder_values; > remove Oops, sorry missed this. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
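The mem-util.h helper under review is not shown in this thread; here is a rough sketch of what an aligned-allocation RAII class along the lines discussed could look like. The class name and shape are guesses from the review comments, not the actual Impala file; it uses the posix_memalign call and the copy prohibition mentioned above.

```cpp
#include <cstdint>
#include <cstdlib>  // posix_memalign and free; POSIX, not ISO C

// Hypothetical sketch of an aligned-allocation RAII helper in the spirit of
// the reviewed be/src/testutil/mem-util.h (not the actual file).
class AlignedAllocation {
 public:
  AlignedAllocation(size_t bytes, size_t alignment) {
    // 'alignment' must be a power of two and a multiple of sizeof(void*);
    // posix_memalign returns non-zero on failure.
    if (posix_memalign(&data_, alignment, bytes) != 0) data_ = nullptr;
  }
  ~AlignedAllocation() { free(data_); }  // free(nullptr) is a no-op

  // Copying is disallowed, per the DISALLOW_COPY_AND_ASSIGN review comment.
  AlignedAllocation(const AlignedAllocation&) = delete;
  AlignedAllocation& operator=(const AlignedAllocation&) = delete;

  uint8_t* data() const { return static_cast<uint8_t*>(data_); }

 private:
  void* data_ = nullptr;
};
```

Deleting the copy operations prevents double-free bugs that a default memberwise copy of the raw pointer would invite.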
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#9). Change subject: IMPALA-4123: Fast bit unpacking .. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 10 files changed, 883 insertions(+), 39 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/9
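The commit message describes unpacking a full batch of 32 values compiling down to 32-bit loads, shifts by constants, masks by constants, and bitwise ors when a value straddles two 32-bit words. A minimal scalar sketch of that technique (illustrative code, not Impala's actual BitPacking implementation; the byte-wise Load32 and the bit-by-bit packer are only there to keep the sketch portable and testable):

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch of scalar bit-unpacking with 32-bit loads: each value is
// 'bit_width' bits, values are packed LSB-first, and a value straddling two
// 32-bit words is assembled with a shift and a bitwise or. Assumes the packed
// buffer has a few bytes of padding past the data so whole-word loads are
// safe (the real code avoids over-reads with more careful handling).
static uint32_t Load32(const uint8_t* in, int64_t word_idx) {
  uint32_t v = 0;
  for (int b = 0; b < 4; ++b) v |= uint32_t(in[word_idx * 4 + b]) << (8 * b);
  return v;  // little-endian 32-bit load
}

std::vector<uint32_t> UnpackScalar(const uint8_t* in, int bit_width, int num_values) {
  const uint32_t mask = bit_width == 32 ? ~0u : (1u << bit_width) - 1u;
  std::vector<uint32_t> out(num_values);
  int64_t bit_pos = 0;
  for (int i = 0; i < num_values; ++i, bit_pos += bit_width) {
    const int64_t word_idx = bit_pos / 32;
    const int shift = bit_pos % 32;
    uint32_t v = Load32(in, word_idx) >> shift;
    // Value straddles two words: or in the leading bits from the next word.
    if (shift + bit_width > 32) v |= Load32(in, word_idx + 1) << (32 - shift);
    out[i] = v & mask;
  }
  return out;
}

// Bit-by-bit reference packer, used only to exercise the unpacker.
std::vector<uint8_t> PackScalar(const std::vector<uint32_t>& vals, int bit_width) {
  const int64_t total_bits = int64_t(vals.size()) * bit_width;
  std::vector<uint8_t> out((total_bits + 7) / 8 + 4, 0);  // + padding for word loads
  int64_t bit_pos = 0;
  for (uint32_t v : vals) {
    for (int b = 0; b < bit_width; ++b, ++bit_pos) {
      if ((v >> b) & 1) out[bit_pos / 8] |= uint8_t(1u << (bit_pos % 8));
    }
  }
  return out;
}
```

Round-tripping values through PackScalar/UnpackScalar at every width from 1 to 32 is essentially the exhaustive bit-width dimension the patch's unit test covers.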
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 8: (14 comments) http://gerrit.cloudera.org:8080/#/c/4494/8//COMMIT_MSG Commit Message: PS8, Line 17: 64 32, now? PS8, Line 18: ors at 64-bit boundaries What does it mean to have an or at a k-bit boundary? http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: Line 273: #include "common/names.h" This isn't needed if using namespace std;, yes? http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/benchmarks/bswap-benchmark.cc File be/src/benchmarks/bswap-benchmark.cc: Line 123 Thanks for fixing this. When this is committed, can you file a bug against me to fix the other posix_memalign locations? http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/testutil/mem-util.h File be/src/testutil/mem-util.h: PS8, Line 31: posix_memalign #include stdlib.h. I'd say cstdlib, but posix_memalign doesn't seem to be part of the C standard, so I don't know if it is technically in that header. Line 45: private: DISALLOW_COPY_AND_ASSIGN http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS7, Line 42: values' va > In practice it should be 64-bit aligned with TCMalloc beyond a certain allo I think it would help the reader here to spell out that misaligned is with respect to 64. 
PS7, Line 132: > Other ones should work but I don't think improve coverage - can't see a sce Still, if we're only allowing those two you could switch it to a bool 'misaligned' http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: Line 39: void UnpackSubset(const uint32_t* in, const uint8_t* packed, int num_in_values, I think it would aid clarity to add a short description of what the meanings are of 'in' and 'packed' PS8, Line 45: uint32_t const uint32_t * in PS8, Line 72: INFO Was this just for debugging, or did you mean to leave it in? PS8, Line 84: 8 Why 8 and not 4? Can you add a more exact ASSERT here about num_to_unpack and it's relationship with writer.bytes_written()? http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS7, Line 69: // First unpack as many full batches as possible. > Yep, that was a bug. Should there be an additional test that would demonstrate that bug? http://gerrit.cloudera.org:8080/#/c/4494/8/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS8, Line 64: //LOG(INFO) << "bit_width " << BIT_WIDTH << "\n" : // << "in_bytes " << in_bytes << " num_values " << num_values << " max_input_values " << max_input_values << "\n" :// << "values_to_read " << values_to_read << " batches_to_read " << batches_to_read << "\n" :// << "remainder_values " << remainder_values; remove -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
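The mem-util.h comments above (include the header that declares posix_memalign; add DISALLOW_COPY_AND_ASSIGN) suggest what the aligned-allocation test helper looks like. A hypothetical sketch of such an RAII helper, with illustrative names that are not the actual mem-util.h API:

```cpp
#include <cstdint>
#include <cstdlib>  // posix_memalign is POSIX, declared via <stdlib.h>/<cstdlib> on most platforms

// Hypothetical RAII aligned-buffer helper in the spirit of the one added in
// be/src/testutil/mem-util.h. Tests can take data() + 1 to get a deliberately
// misaligned pointer with a known offset from a known alignment.
class AlignedBuffer {
 public:
  AlignedBuffer(size_t bytes, size_t alignment) {
    // posix_memalign requires alignment to be a power of two and a multiple
    // of sizeof(void*); it leaves the pointer untouched on failure.
    if (posix_memalign(&data_, alignment, bytes) != 0) data_ = nullptr;
  }
  ~AlignedBuffer() { free(data_); }
  uint8_t* data() const { return static_cast<uint8_t*>(data_); }

  // Non-copyable, mirroring DISALLOW_COPY_AND_ASSIGN.
  AlignedBuffer(const AlignedBuffer&) = delete;
  AlignedBuffer& operator=(const AlignedBuffer&) = delete;

 private:
  void* data_ = nullptr;
};
```

Using posix_memalign rather than relying on TCMalloc's behavior makes the alignment guarantee explicit, which is the point raised in the review.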
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 7: (28 comments) http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS7, Line 330: int argc, char **argv > unused It seems weird to me to omit the arguments to main() PS7, Line 336: int64_t > const, or just make this the param to the data constructor and use data.siz Done PS7, Line 343: suite.AddBenchmark(Substitute("UnpackScalar", bit_width), UnpackBenchmark, &params); > This line at bit_width 32 is not in the output you display in the comment a Oops, fat finger error http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS7, Line 35: a subset of : /// 'num_in_values' > I still find this wording confusing - Which is the to buffer, and which the I reworded to make it clearer that it's a subarray. PS7, Line 42: misaligned > When misalignment is 0, what alignment is the memory guaranteed to have? 0 In practice it should be 64-bit aligned with TCMalloc beyond a certain allocation size. I switched to using posix_memalign so that it's explicitly guaranteed. PS7, Line 49: uint32_t > const Done Line 93: uint32_t* in, uint8_t* packed, int num_in_values, int bit_width, int misalignment) { > const T* Done PS7, Line 94: int > const Done PS7, Line 99: vector > std::aligned_storage? Ended up adding a helper class. PS7, Line 102: sizeof(uint32_t) > Why is this needed? Fixed. It really should have been uint8_t. It turned out this was masking a bug where we overran the buffer with 64-bit loads for odd bitwidths. I ended up just switching to doing 32-bit loads always. There wasn't a significant performance hit (if anything it looked a bit more consistent). PS7, Line 132: min(length, 1) > so the only misalignments allowed are 0 and 1? Why so restrictive?
Other ones should work but I don't think improve coverage - can't see a scenario where it would work for 1 byte-aligned and 8 byte-aligned but not 2-byte or 4-byte aligned. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.h File be/src/util/bit-packing.h: Line 22: #ifndef IMPALA_UTIL_BIT_PACKING_H > These usually go above the #includes Done PS7, Line 30: values > 'value' Done PS7, Line 48: BitPacking > Why not put this into use in this patch? In other words, why only the micro The BitReader need some rework to be able to make use of the batched interface - I thought it would be better to break it up into smaller patches. It also makes it easier to compare again the older implementation before reworking that. PS7, Line 67: bits of data > You can elide this phrase Done http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS4, Line 125: const uint32_t trailing_bits = in_words[IN_WORD_IDX + 1] % > I'm not so sure. Let's consider reading 10 bit ints using 32-bit loads. To That doesn't seem right - the nonoverlapping aligned loads will always minimise the number of loads required. If you have a series of overlapping unaligned loads if you shift each load along by enough bytes that it doesn't overlap with the previous you get a series of nonoverlapping aligned loads that read all the bits you need. For batches of 32, the input is always a multiple of 32 bits, so the aligned loads always exactly cover the input bytes. So you can't shift one of the loads without creating a gap in the input array that you need to add another load to read. The example has a bug somewhere - those loads only get you bits 0 to 112, which only contains the first 11 values. Reading 11 10-bit values would require 4 aligned loads, so that might be a case where you could slightly improve things by using unaligned loads. 
It does depend on the # of input values though so it's unclear how you'd exploit the # of input values without adding runtime overhead. Anyway, from what I've seen it would be more impactful to put later optimisation effort into adding AVX2, which could be 4x faster. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS7, Line 45: make_ > make_ can be omitted. Done PS7, Line 57: const > constexpr BATCH_SIZE Done PS7, Line 69: in_bytes -= batch_size * (BIT_WIDTH / CHAR_BIT); > so in_bytes does not move if BIT_WIDTH is 7? Yep, that was a bug. PS7, Line 90: onstants > constants Done PS7, Line 91: should > Should this be "must", since VALUE_IDX is now a template param? Not that t Yes, that's true - updated. PS7, Line 115: else > if branch returns, so you can drop the "else" here. I found it clearer to have the if/else if/else just to make it obvious that the three branches are mutually exclusive. Normally I'd use the early retur
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#8). Change subject: IMPALA-4123: Fast bit unpacking .. IMPALA-4123: Fast bit unpacking Adds utility functions for fast unpacking of batches of bit-packed values. These support reading batches of any number of values provided that the start of the batch is aligned to a byte boundary. Callers that want to read smaller batches that don't align to byte boundaries will need to implement their own buffering. The unpacking code uses only portable C++ and no SIMD intrinsics, but is fairly efficient because unpacking a full batch of 32 values compiles down to 64-bit loads, shifts by constants, masks by constants, bitwise ors at 64-bit boundaries, and stores. Further speedups should be possible using SIMD intrinsics. Testing: Added unit tests for unpacking, exhaustively covering different bitwidths with additional test dimensions (memory alignment, various input sizes, etc). Tested under ASAN to ensure the bit unpacking doesn't read past the end of buffers. Perf: Added microbenchmark that shows on average an 8-9x speedup over the existing BitReader code. 
Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/benchmarks/bswap-benchmark.cc A be/src/testutil/mem-util.h M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 10 files changed, 879 insertions(+), 39 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/8
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 7: (10 comments) found with help of clang-tidy http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS7, Line 330: int argc, char **argv unused http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.h File be/src/util/bit-packing.h: Line 22: #ifndef IMPALA_UTIL_BIT_PACKING_H These usually go above the #includes PS7, Line 48: BitPacking Why not put this into use in this patch? In other words, why only the microbenchmark? http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS7, Line 45: make_ make_ can be omitted. PS7, Line 90: onstants constants PS7, Line 115: else if branch returns, so you can drop the "else" here. PS7, Line 118: else else if branch returns, so you can drop the else here, if you want to. I'm ambivalent. Line 120: DCHECK_LT(VALUE_IDX, 31) << "Trailing bits after last word."; Can you explain more why VALUE_IDX < 31? Can you add a static_assert at the top with the maximum value it could possibly be? PS7, Line 126: warning What template values cause that warning to trigger? PS7, Line 184: BIT_WIDTH if BIT_WIDTH is 0, I don't think this is technically allowed.
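The VALUE_IDX static_assert request and the BIT_WIDTH == 0 question above both circle a classic pitfall in this kind of code: in C++, shifting a 32-bit value by 32 or more is undefined behavior, so the all-ones mask for the widest case needs a special case. A sketch of the usual guard (my example, not the patch's code):

```cpp
#include <cstdint>

// (1u << 32) is undefined behavior in C++ (shift count equals the type
// width), so the mask for bit_width == 32 must be special-cased rather than
// computed as (1u << 32) - 1. bit_width == 0 falls out naturally as mask 0.
uint32_t BitWidthMask(int bit_width) {
  return bit_width == 32 ? ~0u : (1u << bit_width) - 1u;
}
```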
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 7: (20 comments) Thanks for your patience http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS7, Line 336: int64_t const, or just make this the param to the data constructor and use data.size() as the last argument to the params initializer list. PS7, Line 343: suite.AddBenchmark(Substitute("UnpackScalar", bit_width), UnpackBenchmark, &params); This line at bit_width 32 is not in the output you display in the comment at the top. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS7, Line 35: a subset of : /// 'num_in_values' I still find this wording confusing - Which is the to buffer, and which the from? What does it mean to have a subset of an int? PS7, Line 42: misaligned When misalignment is 0, what alignment is the memory guaranteed to have? 0 mod 64? PS7, Line 49: uint32_t const Line 93: uint32_t* in, uint8_t* packed, int num_in_values, int bit_width, int misalignment) { const T* PS7, Line 94: int const PS7, Line 99: vector std::aligned_storage? PS7, Line 102: sizeof(uint32_t) Why is this needed? PS7, Line 132: min(length, 1) so the only misalignments allowed are 0 and 1? Why so restrictive? http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS2, Line 57: ast one return > 8 is the smallest number such that (3 * 8) % 8 == 0. I.e. where the last bi OK, but you said "you have to choose" 8. A batch size of 32 would still work, right?
http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS7, Line 30: values 'value' PS7, Line 67: bits of data You can elide this phrase http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS4, Line 125: const uint32_t trailing_bits = in_words[IN_WORD_IDX + 1] % > That's true, but there's a downside - if the loads are aligned relative to I'm not so sure. Let's consider reading 10 bit ints using 32-bit loads. To unpack 16 of them with your method, you need to read 160 bits, which means 5 (aligned) loads. 12 of the unpacked values can be read without using this third (more expensive) branch. In this branch, there are two more ALU ops, so that's 4*2 = 8 extra ALU ops. OTOH, using un-aligned loads, you could do only 4 loads and (at bits 0, 24, 48, and 80) and no more than two ALU ops for each extraction. http://gerrit.cloudera.org:8080/#/c/4494/7/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS7, Line 57: const constexpr BATCH_SIZE PS7, Line 69: in_bytes -= batch_size * (BIT_WIDTH / CHAR_BIT); so in_bytes does not move if BIT_WIDTH is 7? PS7, Line 91: should Should this be "must", since VALUE_IDX is now a template param? Not that the explanation is wrong, but now you have forced the caller into efficient behavior with your choice. PS7, Line 178: const constexpr MAX_BATCH_SIZE PS7, Line 179: int const PS7, Line 183: // Copy into padded temporary buffer to avoid reading past the end of 'in'. Can this be avoided sometimes? For instance, if BIT_WIDTH is 6 and num_values is 16, can we just use 'in'? 
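The load-count arithmetic in the 10-bit example above is easy to check mechanically. Assuming a word-aligned input, the number of aligned 32-bit loads needed is ceil(bit_width * num_values / 32); the helper below (mine, not patch code) reproduces the figures from the discussion: 5 aligned loads for 16 ten-bit values, 10 for a full batch of 32, and 4 for 11 values, while four loads ending at bit 112 cover only the first 11 ten-bit values.

```cpp
#include <cstdint>

// Minimum number of word-aligned 32-bit loads that cover the first
// 'num_values' values of 'bit_width' bits each, starting from a word-aligned
// input buffer. A quick check of the arithmetic in the review discussion.
int AlignedLoads(int bit_width, int num_values) {
  const int64_t bits = int64_t(bit_width) * num_values;
  return static_cast<int>((bits + 31) / 32);  // ceil(bits / 32)
}
```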
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#7). Change subject: IMPALA-4123: Fast bit unpacking. (Commit message identical to patch set 8 above.) Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 8 files changed, 807 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/7
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 5: (15 comments) http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS4, Line 296: bool > const Done PS4, Line 303: UnpackValues > Unpack32Values Done PS4, Line 312: + offset > This could go off the end of out_buffer if NUM_OUT_VALUES is not a multiple Yeah, that's an assumption to simplify the benchmark code. Added a static_assert up the top. PS4, Line 322: int64_t > const Done http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS4, Line 30: compute_mask > CamelCaseName Done PS4, Line 35: smaller > smaller than what? clarified. Line 116: const int num_in_values = 64 * 1024; > constexpr, and all caps name Done Line 119: in[i] = rand(); > std::generate Done http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS2, Line 57: ast one return > I don't understand yet why you have to choose a batch size of 8 for a bit w 8 is the smallest number such that (3 * 8) % 8 == 0. I.e. where the last bit of the batch aligns to a byte boundary. http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS4, Line 66: bytes > bits? Done PS4, Line 66: values > values of type OutType Done PS4, Line 70: in_bytes > Can you add to the comment a description of how this parameter is used? Ended up rewording the comment a bit to fit this in. http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS3, Line 143: undef > https://gcc.gnu.org/onlinedocs/gcc/Push_002fPop-Macro-Pragmas.html#Push_002 got you - done. 
http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS4, Line 125: return lower_bits | (trailing_bits << TRAILING_BITS_SHIFT); > This comes out to two shifts, a bitwise or, and a bitwise and. If you did u That's true, but there's a downside - if the loads are aligned relative to the start of the buffer, the optimiser can reuse the results of the loads. Doing the unaligned load adds an extra load, which is probably comparable in cost to the two ALU ops that were saved. Line 142: BOOST_PP_REPEAT_FROM_TO(0, 33, UNPACK_VALUE_CALL, ignore); > So, even after turning this into a templated function, a loop with static b Even with #pragma unroll and ALWAYS_INLINE. Inlining happens but the generated code is bad. I think either the unrolling or inlining happens too late in GCC's optimisation pipeline for the rest of the optimisation to be fully effective. I think part of it is that we have to demote the template argument VALUE_IDX to an argument, so the template doesn't fully specialise the code when it's instantiated. Here's an example of the generated assembly. Note that there are a lot of jmps and jes in the loop version https://gist.github.com/timarmstrong/4f68a74054cfd0df91a89a824ed04961 I added a note in the UnpackValue() comment to explain how it should be stamped out. -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
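The gist linked above shows why the patch stamps out UnpackValue calls with BOOST_PP_REPEAT_FROM_TO rather than a runtime loop: each call must be a full specialization so the word index, shift, and mask become compile-time constants. In C++17 the same stamping-out can be sketched without the preprocessor using an index_sequence and a fold expression (illustrative names, not Impala's code):

```cpp
#include <cstdint>
#include <utility>

// One fully-specialized extraction per value index: with I a template
// parameter, the word index, shift, and mask are all compile-time constants,
// which is the effect the patch gets from BOOST_PP_REPEAT_FROM_TO.
template <int BIT_WIDTH, int I>
uint32_t UnpackValue(const uint32_t* in) {
  constexpr int BIT_POS = BIT_WIDTH * I;
  constexpr int WORD = BIT_POS / 32;
  constexpr int SHIFT = BIT_POS % 32;
  constexpr uint32_t MASK = BIT_WIDTH == 32 ? ~0u : (1u << BIT_WIDTH) - 1u;
  uint32_t v = in[WORD] >> SHIFT;
  // 'if constexpr' keeps the cross-word or (and its shift) out of
  // instantiations where the value never straddles a word boundary.
  if constexpr (SHIFT + BIT_WIDTH > 32) v |= in[WORD + 1] << (32 - SHIFT);
  return v & MASK;
}

template <int BIT_WIDTH, std::size_t... I>
void Unpack32Values(const uint32_t* in, uint32_t* out, std::index_sequence<I...>) {
  // Fold expansion: one store per index, no runtime loop or branch.
  ((out[I] = UnpackValue<BIT_WIDTH, int(I)>(in)), ...);
}

template <int BIT_WIDTH>
void Unpack32Values(const uint32_t* in, uint32_t* out) {
  Unpack32Values<BIT_WIDTH>(in, out, std::make_index_sequence<32>());
}
```

Whether this optimizes better than the macro expansion under a given GCC version is exactly the kind of thing the generated-assembly comparison in the gist would have to settle.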
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#6). Change subject: IMPALA-4123: Fast bit unpacking. (Commit message identical to patch set 8 above.) Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 8 files changed, 807 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/6
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 4: (15 comments) http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS4, Line 296: bool const PS4, Line 303: UnpackValues Unpack32Values PS4, Line 312: + offset This could go off the end of out_buffer if NUM_OUT_VALUES is not a multiple of 32, right? PS4, Line 322: int64_t const http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS4, Line 30: compute_mask CamelCaseName PS4, Line 35: smaller smaller than what? Line 116: const int num_in_values = 64 * 1024; constexpr, and all caps name Line 119: in[i] = rand(); std::generate http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS2, Line 57: ast one return > A batch size of 32 is the smallest that works for all supported bit widths. I don't understand yet why you have to choose a batch size of 8 for a bit width of 3. http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS4, Line 66: bytes bits? PS4, Line 66: values values of type OutType PS4, Line 70: in_bytes Can you add to the comment a description of how this parameter is used? http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS3, Line 143: undef > I'm not quite sure that I understand the intent or how I'd do that, aside f https://gcc.gnu.org/onlinedocs/gcc/Push_002fPop-Macro-Pragmas.html#Push_002fPop-Macro-Pragmas http://gerrit.cloudera.org:8080/#/c/4494/4/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS4, Line 125: return lower_bits | (trailing_bits << TRAILING_BITS_SHIFT); This comes out to two shifts, a bitwise or, and a bitwise and. 
If you did unaligned 64-bit reads to fit the full value inside of a single word, it could be done in one shift and one bitwise and, yes? Line 142: BOOST_PP_REPEAT_FROM_TO(0, 33, UNPACK_VALUE_CALL, ignore); So, even after turning this into a templated function, a loop with static bounds doesn't cause inlining? Maybe leave a note on why you do it this way, in that case.
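The one-shift-one-mask extraction suggested above can be sketched with an unaligned 64-bit load, written portably as memcpy (illustrative, not the patch's code; it assumes a little-endian host, bit_width <= 57, and padding past the end of the buffer, which is the very over-read hazard the review discusses):

```cpp
#include <cstdint>
#include <cstring>

// Extract the 'idx'-th bit_width-bit value with a single unaligned 64-bit
// load: one shift and one mask, no cross-word or. Requires bit_width <= 57 so
// the value always fits in the 8 loaded bytes regardless of the in-byte bit
// offset, and requires readable padding past the packed data.
uint64_t ExtractUnaligned(const uint8_t* in, int bit_width, int64_t idx) {
  const int64_t bit_pos = idx * bit_width;
  uint64_t word;
  std::memcpy(&word, in + bit_pos / 8, sizeof(word));  // unaligned load (little-endian host)
  const uint64_t mask = (uint64_t{1} << bit_width) - 1;
  return (word >> (bit_pos % 8)) & mask;
}
```

The trade-off Tim raises still applies: the unaligned load cannot be reused across adjacent values the way aligned word loads can, so fewer ALU ops are paid for with extra loads.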
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#5). Change subject: IMPALA-4123: Fast bit unpacking. (Commit message identical to patch set 8 above.) Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 8 files changed, 799 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/5
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 2: (33 comments) http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: Line 296: reader.Reset(p->data, p->data_len); > Should offset advance in this loop, then? The code was a bit obfuscated. The loop only really retries once, so there's no need to advance offset. Changed it to be more obvious. Offset is advanced once per value read by virtue of the offset calculation above. http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: Line 270: #include "common/names.h" > Why the blank line here? We separate out common/names.h like this most places because it needs to come after all the other includes. PS3, Line 277: > Maybe NUM_OUT_VALUES Done PS3, Line 305: * p > const uint8_t * const data_end ... Done Line 334: for (int i = 0; i < data_len; ++i) { > std::iota Done PS3, Line 336: > I think you can even elide the parens Done http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS3, Line 29: compute_mask > ComputeMask, and in an anonymous namespace Done PS3, Line 33: UnpackSubset > With a comment here, if I'm going to use it in PackUnpack. Done PS3, Line 37: with memory that is > this phrase can be omitted Done PS3, Line 39: int > unsigned? DCHECK <= something? I don't think there's a particular limit on the number of values we could use here. PS3, Line 39: num_values > num_in_values? Done PS3, Line 39: int > also unsigned? I think int should be ok - I normally prefer using signed types by default even for variables that logically should always be positive because it's easier to reason about the number going negative compared with underflowing (also easier to catch the out-of-range value with dchecks). 
PS3, Line 45: int > const, here and elsewhere Done PS3, Line 51: & compute_mask(bit_width) > AKA mask; already hoisted Done PS3, Line 58: becase > because Done PS3, Line 58: the input buffer size > "the input buffer size (num_values)" Done PS3, Line 60: // Size buffer exactly so that ASAN can detect buffer overruns. > I think manual canaray checks might be called for here We rely on ASAN to detect read overruns anyway (which I was more concerned about since they're tricky to avoid in the code), so it doesn't seem worth it to me on balance to add the extra test code. PS3, Line 65: ASSERT_EQ > Can you add more " << error message " to your ASSERTs? Done http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: Line 36: static const std::pair UnpackValues(int bit_width, > Yeah, the verbosity is frustrating. However, cstdint doesn't necessarily pt It seems like most implementations of cstdint put the typedefs in the global namespace anyway, so I think we're ok for now. If that turns out to be a problem I think we should create a wrapper header with the using declarations. PS2, Line 57: Unpack32Values > So, this boundary condition happens for any multiple of 8, and you chose 32 A batch size of 32 is the smallest that works for all supported bit widths. Otherwise you have to choose different batch sizes for different bit widths (e.g 32 for bit width 31, 8 for bit width 3, etc), which gets more complicated. Lemire's bitpacking libraries use 32 everywhere (https://github.com/lemire/) so it seemed like a reasonable choice - he obviously put some effort into validating the performance. I could benchmark different numbers but I think at this point I'd be optimising for the microbenchmark rather than a real workload. http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS3, Line 55: IT_WIDTH from 'in' to 'out'. 
> But the last byte of 'in' that was read might only have been partially used That's correct. I deliberately made the decision that these functions shouldn't deal with resuming reads in the middle of bytes, because it makes things way more complicated (and probably slower). I expect that callers will either read in batches or do their own buffering if they need to read a value at a time. Expanded on the comment to make this explicit and explain what the caller can do. http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS3, Line 56: >(in, > can be omitted if the branches are swapped Done Line 56: case 21: return UnpackValues(in, in_bytes, num_values, out); > long line Done PS3, Line 90: : if (r > I think you accidentally a word I ended up retrying the template function thing and it works ok this time around. I think something else was confounding it before
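[Editorial aside] The batch-size and byte-boundary replies above come down to one arithmetic property: any batch whose size is a multiple of 8 values ends exactly on a byte boundary regardless of bit width, which is what lets these functions avoid resuming reads mid-byte. A minimal check of that property (the helper name is ours for illustration, not part of the patch):

```cpp
#include <cassert>
#include <climits>

// Hypothetical helper, not the patch's API: a batch of 'num_values' values
// packed at 'bit_width' bits each ends on a byte boundary iff the total bit
// count is a whole number of bytes.
constexpr bool EndsOnByteBoundary(int num_values, int bit_width) {
  return (num_values * bit_width) % CHAR_BIT == 0;
}
```

For every supported width 0..32, EndsOnByteBoundary(32, bit_width) holds, so consecutive 32-value batches always start byte-aligned; a 4-value batch at width 3 (12 bits) does not, which is the mid-byte case the functions deliberately avoid.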
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#4). Change subject: IMPALA-4123: Fast bit unpacking .. IMPALA-4123: Fast bit unpacking Adds utility functions for fast unpacking of batches of bit-packed values. These support reading batches of any number of values provided that the start of the batch is aligned to a byte boundary. Callers that want to read smaller batches that don't align to byte boundaries will need to implement their own buffering. The unpacking code uses only portable C++ and no SIMD intrinsics, but is fairly efficient because unpacking a full batch of 32 values compiles down to 64-bit loads, shifts by constants, masks by constants, bitwise ors at 64-bit boundaries, and stores. Further speedups should be possible using SIMD intrinsics. Testing: Added unit tests for unpacking, exhaustively covering different bitwidths with additional test dimensions (memory alignment, various input sizes, etc). Tested under ASAN to ensure the bit unpacking doesn't read past the end of buffers. Perf: Added microbenchmark that shows on average an 8-9x speedup over the existing BitReader code. Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 8 files changed, 799 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/4 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
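[Editorial aside] The commit message above says callers that read values one at a time should implement their own buffering on top of the batch interface. A minimal sketch of that pattern, assuming hypothetical names throughout ('unpack_batch' stands in for the patch's Unpack32Values, and bit_width is assumed > 0):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical caller-side buffering: always unpack a whole 32-value batch
// (which starts and ends on a byte boundary) into a scratch buffer, then
// hand out one value per GetNext() call. Illustrative only, not the patch.
class BufferedReader {
 public:
  using UnpackBatchFn =
      std::function<void(const uint8_t* in, int bit_width, uint32_t* out)>;

  BufferedReader(const uint8_t* data, size_t len, int bit_width,
                 UnpackBatchFn unpack_batch)
      : data_(data), len_(len), bit_width_(bit_width),
        unpack_batch_(std::move(unpack_batch)) {}

  // Returns false once no complete 32-value batch remains in 'data'.
  bool GetNext(uint32_t* val) {
    if (pos_ == 32) {
      // A full batch of 32 values occupies exactly 4 * bit_width bytes.
      const size_t batch_bytes = 4 * static_cast<size_t>(bit_width_);
      if (offset_ + batch_bytes > len_) return false;
      unpack_batch_(data_ + offset_, bit_width_, buffer_);
      offset_ += batch_bytes;
      pos_ = 0;
    }
    *val = buffer_[pos_++];
    return true;
  }

 private:
  const uint8_t* data_;
  size_t len_;
  int bit_width_;
  UnpackBatchFn unpack_batch_;
  uint32_t buffer_[32];
  size_t offset_ = 0;
  int pos_ = 32;  // start "empty" so the first GetNext() refills
};
```

Because every refill consumes a whole number of bytes, the reader never has to track a bit offset between batches, which is the simplification the patch's interface is designed around.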
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 3: (34 comments) http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: Line 296: } > We need to do this to start the reader reading from the start of the buffer Should offset advance in this loop, then? http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: Line 270: Why the blank line here? PS3, Line 277: NUM_VALUES Maybe NUM_OUT_VALUES PS3, Line 305: * d const uint8_t * const data_end ... Line 334: data[i] = static_cast(i); std::iota PS3, Line 336: ( I think you can even elide the parens http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing-test.cc File be/src/util/bit-packing-test.cc: PS3, Line 29: compute_mask ComputeMask, and in an anonymous namespace PS3, Line 33: UnpackSubset With a comment here, if I'm going to use it in PackUnpack. PS3, Line 37: with memory that is this phrase can be omitted PS3, Line 39: int also unsigned? PS3, Line 39: int unsigned? DCHECK <= something? PS3, Line 39: num_values num_in_values? PS3, Line 45: int const, here and elsewhere PS3, Line 51: & compute_mask(bit_width) AKA mask; already hoisted PS3, Line 58: becase because PS3, Line 58: the input buffer size "the input buffer size (num_values)" PS3, Line 60: // Size buffer exactly so that ASAN can detect buffer overruns. I think manual canary checks might be called for here PS3, Line 65: ASSERT_EQ Can you add more " << error message " to your ASSERTs? http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: Line 36: /// little-endian order). E.g. the values 1, 2, 3, 4 packed with bit width 4 results > We don't do this anywhere else, seems unnecessarily verbose. Yeah, the verbosity is frustrating.
However, cstdint doesn't necessarily put these in the global namespace, and stdint.h is deprecated. What color do you think we should paint this bikeshed? Maybe a using declaration at the top? PS2, Line 57: Type> > Added some explanation to the class comment for the magic number. It works So, this boundary condition happens for any multiple of 8, and you chose 32 to amortize the loop some more? Did you do any testing of 16 and 64 to see how they compare? http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.h File be/src/util/bit-packing.h: PS3, Line 55: after the last byte of 'in' that was read But the last byte of 'in' that was read might only have been partially used, right? http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS2, Line 108: \ : if (end_bit_offset < load_bit_width) { > Reworded this. I did some experiments early on looking at the assembly and Did you try always_inline and template parameters? If yes, what else did you try? http://gerrit.cloudera.org:8080/#/c/4494/3/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: PS3, Line 56: == 0 can be omitted if the branches are swapped Line 56: const int64_t max_input_values = BIT_WIDTH == 0 ? num_values : (in_bytes * CHAR_BIT) / BIT_WIDTH; long line PS3, Line 90: early : // with I think you accidentally a word PS3, Line 93: LOAD_TYPE I think "LOAD_TYPE" could be "LoadType" to more clearly match the lexical standard for other type names PS3, Line 93: i If 'i' is a compile-time constant, maybe call it "I" or "NTH"? PS3, Line 96: load_bit_width I think constexprs should get the static const treatment of all caps PS3, Line 106: shifted Can you add a comment describing what this is for? PS3, Line 108: 32 How is this 32 derived? PS3, Line 109: end_bit_offset so, in this case all the bits we need are in one word? PS3, Line 119: Make non-negative to avoid spurious compiler warning Maybe give it an unsigned type?
PS3, Line 124: Special-case impossible BIT_WIDTH 32 to avoid spurious compiler warning this could use a bit more explanation PS3, Line 143: define Check not already defined? -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
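[Editorial aside] The questions above about end_bit_offset, shifted values, and bits split across words are easier to follow against a scalar reference version. This is our own naive single-value sketch, not the patch's Unpack32Values: it gathers the bytes covering one value, shifts the value down, and masks, which is the load/shift/mask/or pattern the commit message describes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference routine: extract the value at 'index' from a
// little-endian bit-packed stream. At most 5 bytes cover a value of width
// <= 32 (7 bits of leading offset + 32 value bits), so a uint64_t
// accumulator always suffices.
uint32_t UnpackOne(const uint8_t* in, int bit_width, int64_t index) {
  const int64_t first_bit = index * bit_width;
  const int64_t first_byte = first_bit / 8;
  const int bit_offset = static_cast<int>(first_bit % 8);
  const int bytes_needed = (bit_offset + bit_width + 7) / 8;
  uint64_t bits = 0;
  for (int b = 0; b < bytes_needed; ++b) {
    bits |= static_cast<uint64_t>(in[first_byte + b]) << (8 * b);
  }
  // A bit width of 0 means every value is zero.
  if (bit_width == 0) return 0;
  const uint64_t mask = ~uint64_t{0} >> (64 - bit_width);
  return static_cast<uint32_t>((bits >> bit_offset) & mask);
}
```

In the patch's batched code BIT_WIDTH is a template parameter, so the shift amounts and masks that are computed at runtime here become compile-time constants there, which is where the speedup comes from.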
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 2: (34 comments) http://gerrit.cloudera.org:8080/#/c/4494/2//COMMIT_MSG Commit Message: PS2, Line 12: was > "want" Done PS2, Line 18: possible > long line Done http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS2, Line 259: include > I think cmath, cstdlib, cstdio are the non-deprecated versions for C++ Done Line 260: #include > Can you arrange these in three blocks: Done Line 276: #define NUM_VALUES 1024 * 1024 > constexpr int Done PS2, Line 281: BenchmarkParams > You can leave the constructor out and use aggregate initialization: Done. I didn't realise you could use this with new Line 290: BenchmarkParams* p = reinterpret_cast(data); > maybe const auto p = reinterpret_cast I added the const. It doesn't seem verbose enough to require auto. PS2, Line 294: int64_t > const Done PS2, Line 294: i * 32 % NUM_VALUES + j; > Shouldn't this whole thing be modulo NUM_VALUES, not just the non-+j part? Done Line 296: reader.Reset(p->data, p->data_len); > Why Reset? We need to do this to start the reader reading from the start of the buffer again. PS2, Line 333: uint8_t* data = new uint8_t[data_len]; > unique_ptr to clean up after Changed to a vector PS2, Line 334: int > int64_t Done PS2, Line 337: BenchmarkParams > This can be a local var, right? Done http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: Line 18: #include > cstdint Done PS2, Line 29: Unpack bit-packed values > What does this phrase mean? I added an explanation to the class comment, probably needed to make the rest of the code understandable. PS2, Line 30: outputted > unpacked Done PS2, Line 30: is > are Done PS2, Line 30: 'out' must have : /// enough space for 'num_values'. > And in must point to at least in_bytes of addressable memory?
Done PS2, Line 31: 0 > What does a zero bit_width mean in this context? Added an explanation to the class comment. It means every value is zero. PS2, Line 33: plus > "And also" Done PS2, Line 32: utType. : /// Retu > extra line break here, please Done PS2, Line 33: the byte > a pointer to the byte Done PS2, Line 33: the number of : /// values that were read. > I can infer that by just subtracting, right? I think the pair might be over It's the number of logical values that were read. The calculation is a function of the input arguments: min(num_values, in_bytes * 8 / bit_width) but it seems like the kind of thing that is easy to get wrong - that was the reason for making in_bytes a parameter rather than letting the caller do the calculation. Also division by bit_width is slower here than inside the function where we're dispatching to bit_width-specialized code anyway. PS2, Line 36: const > This const has no effect, I believe. Done Line 36: static const std::pair UnpackValues(int bit_width, > std::uint8_t, etc. We don't do this anywhere else, seems unnecessarily verbose. PS2, Line 57: Unpack32Values > Why not up to 64? Added some explanation to the class comment for the magic number. It works out because for batches of 32 packed values, the end of the packed batch always falls on a byte boundary, so there's no need to deal with partial bytes between batches. If we went to larger batches I think the additional benefit from amortising loop overheads is fairly minimal and the code will get a lot bigger. http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: Line 35: case 0: return UnpackValues(in, in_bytes, num_values, out); > BOOST_PP_REPEAT_FROM_TO would be worth it here, I think Done here and elsewhere. I'm a little concerned that it makes it hard to debug any compile errors, but hopefully the code shouldn't change much. PS2, Line 70: > can be omitted In this case it can't be inferred from the arguments.
PS2, Line 70: NULL > nullptr Done. Have we actually standardised on this? PS2, Line 78: DCHECK_GE > static_assert Done here and elsewhere. This seems to have made the debug build a lot faster, which is great. PS2, Line 79: 8 > CHAR_BIT, here and below Done PS2, Line 80: in_bytes * 8 / BIT_WIDTH > I think this line would be clearer with more parens Done PS2, Line 99: unsigned integer type > static_check with http://en.cppreference.com/w/cpp/types/is_unsigned? Done. PS2, Line 108: inlined as : // soon as possible and subject to all optimizations > Can you elaborate on this? Are there benchmarks on that? Reworded this. I did some experiments early on looking at the assembly and wasn't able to convince gcc to generate the optimal code without resorting to the macro approach.
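[Editorial aside] The BOOST_PP_REPEAT_FROM_TO and static_assert replies above both concern the same idiom: a switch on the runtime bit width dispatches to a function templated on BIT_WIDTH, so shifts and masks become compile-time constants in each instantiation. A stripped-down sketch of the idiom (the function names and bodies are ours, not the patch's):

```cpp
#include <cassert>
#include <cstdint>

// With BIT_WIDTH a template parameter, the mask folds to a constant and
// impossible widths are rejected at compile time rather than via a DCHECK.
template <int BIT_WIDTH>
uint32_t MaskForWidth() {
  static_assert(BIT_WIDTH >= 0 && BIT_WIDTH <= 32, "unsupported bit width");
  // Compute in 64 bits so BIT_WIDTH == 32 doesn't shift by the full word
  // width (which would be undefined behaviour on uint32_t).
  return static_cast<uint32_t>((uint64_t{1} << BIT_WIDTH) - 1);
}

// Runtime-to-compile-time dispatch; the real patch generates the full set
// of cases with BOOST_PP_REPEAT_FROM_TO instead of writing them by hand.
uint32_t DispatchMask(int bit_width) {
  switch (bit_width) {
    case 0: return MaskForWidth<0>();
    case 1: return MaskForWidth<1>();
    case 2: return MaskForWidth<2>();
    case 3: return MaskForWidth<3>();
    // ... cases 4 through 31 elided in this sketch ...
    case 32: return MaskForWidth<32>();
    default: return 0;  // unreachable for valid widths
  }
}
```

The static_assert fires per instantiation at compile time, which is why swapping it in for DCHECK_GE made the debug build faster: no runtime check remains in the hot loop.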
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#3). Change subject: IMPALA-4123: Fast bit unpacking .. IMPALA-4123: Fast bit unpacking Adds utility functions for fast unpacking of batches of bit-packed values. These support reading batches of any number of values provided that the start of the batch is aligned to a byte boundary. Callers that want to read smaller batches that don't align to byte boundaries will need to implement their own buffering. The unpacking code uses only portable C++ and no SIMD intrinsics, but is fairly efficient because unpacking a full batch of 32 values compiles down to 64-bit loads, shifts by constants, masks by constants, bitwise ors at 64-bit boundaries, and stores. Further speedups should be possible using SIMD intrinsics. Testing: Added unit tests for unpacking, exhaustively covering different bitwidths with additional test dimensions (memory alignment, various input sizes, etc). Tested under ASAN to ensure the bit unpacking doesn't read past the end of buffers. Perf: Added microbenchmark that shows on average an 8-9x speedup over the existing BitReader code. Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h M be/src/util/bit-stream-utils.inline.h 8 files changed, 788 insertions(+), 21 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/3 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Jim Apple has posted comments on this change. Change subject: IMPALA-4123: Fast bit unpacking .. Patch Set 2: (34 comments) First pass, didn't get to all of it yet but I thought you might want to see what was done http://gerrit.cloudera.org:8080/#/c/4494/2//COMMIT_MSG Commit Message: PS2, Line 12: was "want" PS2, Line 18: possible long line http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/benchmarks/bit-packing-benchmark.cc File be/src/benchmarks/bit-packing-benchmark.cc: PS2, Line 259: include I think cmath, cstdlib, cstdio are the non-deprecated versions for C++ Line 260: #include Can you arrange these in three blocks: 1. C standard headers 2. C++ standard headers 3. Other alphabetized in each block Line 276: #define NUM_VALUES 1024 * 1024 constexpr int PS2, Line 281: BenchmarkParams You can leave the constructor out and use aggregate initialization: BenchmarkParams bp {p,q,r}; Line 290: BenchmarkParams* p = reinterpret_cast(data); maybe const auto p = reinterpret_cast PS2, Line 294: int64_t const PS2, Line 294: i * 32 % NUM_VALUES + j; Shouldn't this whole thing be modulo NUM_VALUES, not just the non-+j part? Line 296: reader.Reset(p->data, p->data_len); Why Reset? PS2, Line 333: uint8_t* data = new uint8_t[data_len]; unique_ptr to clean up after PS2, Line 334: int int64_t PS2, Line 337: BenchmarkParams This can be a local var, right? http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.h File be/src/util/bit-packing.h: Line 18: #include cstdint PS2, Line 29: Unpack bit-packed values What does this phrase mean? PS2, Line 30: outputted unpacked PS2, Line 30: is are PS2, Line 31: 0 What does a zero bit_width mean in this context? PS2, Line 30: 'out' must have : /// enough space for 'num_values'. And in must point to at least in_bytes of addressable memory? PS2, Line 33: the byte a pointer to the byte PS2, Line 33: plus "And also" PS2, Line 32: utType. : /// Retu extra line break here, please PS2, Line 33: the number of : /// values that were read.
I can infer that by just subtracting, right? I think the pair might be overkill PS2, Line 36: const This const has no effect, I believe. Line 36: static const std::pair UnpackValues(int bit_width, std::uint8_t, etc. PS2, Line 57: Unpack32Values Why not up to 64? http://gerrit.cloudera.org:8080/#/c/4494/2/be/src/util/bit-packing.inline.h File be/src/util/bit-packing.inline.h: Line 35: case 0: return UnpackValues(in, in_bytes, num_values, out); BOOST_PP_REPEAT_FROM_TO would be worth it here, I think PS2, Line 70: NULL nullptr PS2, Line 70: can be omitted PS2, Line 78: DCHECK_GE static_assert PS2, Line 79: 8 CHAR_BIT, here and below PS2, Line 80: in_bytes * 8 / BIT_WIDTH I think this line would be clearer with more parens PS2, Line 99: unsigned integer type static_check with http://en.cppreference.com/w/cpp/types/is_unsigned? PS2, Line 108: inlined as : // soon as possible and subject to all optimizations Can you elaborate on this? Are there benchmarks on that? -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Jim Apple Gerrit-HasComments: Yes
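[Editorial aside] The returned count questioned above ("I can infer that by just subtracting, right?") is the quantity Tim's reply describes: the caller's request capped by how many whole values fit in the input buffer. A hedged sketch of that calculation (illustrative name, not the patch's API):

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <cstdint>

// Number of values an unpack call can actually produce: capped both by the
// request and by the whole values available in 'in_bytes' of input. A width
// of 0 consumes no input, so only the request caps it.
int64_t MaxUnpackableValues(int bit_width, int64_t in_bytes, int64_t num_values) {
  if (bit_width == 0) return num_values;
  return std::min(num_values, in_bytes * CHAR_BIT / bit_width);
}
```

For example, 3 bytes of input at width 3 hold 24 / 3 = 8 whole values, so a request for 100 yields 8. Keeping this division inside the library, where the code is already specialized per bit width, is the rationale the reply gives for returning the count instead of letting callers derive it.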
[Impala-ASF-CR] IMPALA-4123: Fast bit unpacking
Tim Armstrong has uploaded a new patch set (#2). Change subject: IMPALA-4123: Fast bit unpacking .. IMPALA-4123: Fast bit unpacking Adds utility functions for fast unpacking of batches of bit-packed values. These support reading batches of any number of values provided that the start of the batch is aligned to a byte boundary. Callers that was to read smaller batches that don't align to byte boundaries will need to implement their own buffering. The unpacking code uses only portable C++ and no SIMD intrinsics, but is fairly efficient because unpacking a full batch of 32 values compiles down to 64-bit loads, shifts by constants, masks by constants, bitwise ors at 64-bit boundaries, and stores. Further speedups should be possible using SIMD intrinsics. Testing: Added unit tests for unpacking, exhaustively covering different bitwidths with additional test dimensions (memory alignment, various input sizes, etc). Tested under ASAN to ensure the bit unpacking doesn't read past the end of buffers. Perf: Added microbenchmark that shows on average an 8-9x speedup over the existing BitReader code. Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 --- M be/src/benchmarks/CMakeLists.txt A be/src/benchmarks/bit-packing-benchmark.cc M be/src/util/CMakeLists.txt A be/src/util/bit-packing-test.cc A be/src/util/bit-packing.h A be/src/util/bit-packing.inline.h M be/src/util/bit-stream-utils.h 7 files changed, 862 insertions(+), 18 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/94/4494/2 -- To view, visit http://gerrit.cloudera.org:8080/4494 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I12db69409483d208cd4c0f41c27a78aeb6cd3622 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Tim Armstrong