[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. Patch Set 5: (1 comment) http://gerrit.cloudera.org:8080/#/c/4813/5/be/src/util/bloom-filter.cc File be/src/util/bloom-filter.cc: PS5, Line 166: i < simd_size; See IMPALA-4381 - the index computations here seem wrong now - it's not clear whether i/simd_size are denominated in sizeof(double) or sizeof(__m256d). I have a suspicion the bug is here. We should probably use the same convention as below where i/simd_size are denominated in sizeof(__m256d). -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Internal Jenkins Gerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/4813 to look at the new patch set (#4). Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. IMPALA-4300: Speed up BloomFilter::Or with SIMD The previous code was not written in a way that GCC could auto-vectorize it. Manually vectorizing speeds up BloomFilter::Or by up to 184x. Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 --- M be/src/benchmarks/bloom-filter-benchmark.cc M be/src/testutil/mem-util.h M be/src/util/bloom-filter.cc M be/src/util/cpu-info.cc M be/src/util/cpu-info.h 5 files changed, 178 insertions(+), 75 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/4813/4 -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. Patch Set 3: Code-Review+2 (1 comment) http://gerrit.cloudera.org:8080/#/c/4813/3//COMMIT_MSG Commit Message: Line 11: The existing loop should auto-vectorize with our GCC setting of -O3, Can you fix this too? Maybe just say "The previous code was not written in a way that GCC could auto-vectorize it". Missed it first time over. -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Jim Apple has posted comments on this change. Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4813/2/be/src/util/bloom-filter.cc File be/src/util/bloom-filter.cc: Line 181: // The trivial loop out[i] |= in[i] should auto-vectorize with gcc at -O3, but it is not > Let's fix this comment so it's clear that the issue is that the code is not Done -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Tim Armstrong has posted comments on this change. Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. Patch Set 1: (4 comments) LGTM, just want to make sure the comment is a little clearer. http://gerrit.cloudera.org:8080/#/c/4813/1/be/src/util/bloom-filter.cc File be/src/util/bloom-filter.cc: Line 167 > What's stringe is that without the pragma and without __restrict__, Matt Go I experimented with some variants: https://godbolt.org/g/rFww4g .BloomFilterOrPointersUnrolledIvDep() and BloomFilterOrIvDepInt() emit exactly the code you want. There are a few things that make a difference: * Avoiding the indirection via the in/out pointers helps a lot. If we extract vector.data()/vector.size() from the loop body we can get the same effect. * Rather subtly, using a signed/unsigned loop counter makes a difference (I think with the signed type the compiler is allow to assume no overflow and this somehow helps). * The unrolling and #pragma ivdep aren't always necessary to emit the code, but it's a lot cleaner with them. I think gcc's vectoriser is smart enough to insert a check for whether the arrays overlap and two versions of the code for the overlapping/non-overlapping cases. The explicit SIMD version has the virtue of not being sensitive to all these things. PS1, Line 168: _mm256_loadu_pd > BloomFilters do, but TBloomFilters do not, and from using gdb I see that th Oh, duh. PS1, Line 199: _mm_storeu_si128 > Let's continue this conversation above. Yeah, I doubt there's much difference between the aligned/unaligned versions, just didn't want to use the unaligned ones if there was already an invariant that the memory was aligned. http://gerrit.cloudera.org:8080/#/c/4813/2/be/src/util/bloom-filter.cc File be/src/util/bloom-filter.cc: Line 181: // The trivial loop out[i] |= in[i] should auto-vectorize with gcc at -O3, but it is not Let's fix this comment so it's clear that the issue is that the code is not written in a way that is friendly to auto-vectorization. I've seen other cryptic comments about these kind of things ("compiler is confused by x") and they're hard to know what to do with (if not outright misleading). -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Jim Apple has posted comments on this change. Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. Patch Set 1: (13 comments) http://gerrit.cloudera.org:8080/#/c/4813/1/be/src/util/bloom-filter.cc File be/src/util/bloom-filter.cc: Line 167 > One reason auto-vectorisation must fail here is that the two directories ma What's stringe is that without the pragma and without __restrict__, Matt Godbolt's compiler explorer still shows the loop being autovectorized. I tried loop unrolling, but it still didn't vectorize. PS1, Line 162: DCHECK > DCHECK_EQ Done Line 163: auto simd_in = reinterpret_cast(in); > I think double*/const double* is more readable than auto here - I'm used to Done PS1, Line 166: int > Use size_t for consistency? I had to think about whether the implicit casts Done PS1, Line 168: _mm256_loadu_pd > We should be able to use the aligned versions here, right, since we allocat BloomFilters do, but TBloomFilters do not, and from using gdb I see that they are not 32-byte aligned. PS1, Line 181: should > It shouldn't actually with the code in it's current form since it has to as Let's continue conversation on overlapping at the other comment. Line 184: // TODO: Tune gcc flags to auto-vectorize. This might not be possible. > See my other comment - should be possible. Let's continue conversation on that comment. Line 186: // TODO: IMPALA-4312: Evaluate clang for builds other than ASAN. > Just remove this TODO? I don't think it's informative for people trying to Done Line 190: auto simd_in = reinterpret_cast([0]); > This code is simple enough that I'm fine with moving forward with it even i I will leave as-is. PS1, Line 191: auto > I think just repeating the types const__m128i* or __m128i* is more readable Done PS1, Line 192: auto > I'd prefer the int type was explicit here. Done PS1, Line 196: int > size_t for consistency? Done PS1, Line 199: _mm_storeu_si128 > We should be able to use the aligned versions here too. Let's continue this conversation above. FWIW, i also tried some very complex methods of using non-simd ops to move both pointers forward until at least one was aligned. The performance was sometimes worse, so I put that aside for another day. -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Jim Apple Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Jim Apple has uploaded a new patch set (#2). Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. IMPALA-4300: Speed up BloomFilter::Or with SIMD Manually vectorizing speeds up BloomFilter::Or by up to 184x. The existing loop should auto-vectorize with our GCC setting of -O3, but it does not for reasons that remain mysterious. Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 --- M be/src/benchmarks/bloom-filter-benchmark.cc M be/src/testutil/mem-util.h M be/src/util/bloom-filter.cc M be/src/util/cpu-info.cc M be/src/util/cpu-info.h 5 files changed, 176 insertions(+), 75 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/4813/2 -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim AppleGerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD
Jim Apple has uploaded a new change for review. http://gerrit.cloudera.org:8080/4813 Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD .. IMPALA-4300: Speed up BloomFilter::Or with SIMD Manually vectorizing speeds up BloomFilter::Or by up to 184x. The existing loop should auto-vectorize with our GCC setting of -O3, but it does not for reasons that remain mysterious. Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 --- M be/src/benchmarks/bloom-filter-benchmark.cc M be/src/testutil/mem-util.h M be/src/util/bloom-filter.cc M be/src/util/cpu-info.cc M be/src/util/cpu-info.h 5 files changed, 177 insertions(+), 75 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/4813/1 -- To view, visit http://gerrit.cloudera.org:8080/4813 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67 Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Jim Apple