[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4813/5/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

PS5, Line 166:  i < simd_size;
See IMPALA-4381 - the index computations here seem wrong now - it's not clear 
whether i/simd_size are denominated in sizeof(double) or sizeof(__m256d). I 
have a suspicion the bug is here. We should probably use the same convention as 
below where i/simd_size are denominated in sizeof(__m256d).


-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-24 Thread Jim Apple (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4813

to look at the new patch set (#4).

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..

IMPALA-4300: Speed up BloomFilter::Or with SIMD

The previous code was not written in a way that GCC could
auto-vectorize it. Manually vectorizing speeds up BloomFilter::Or by
up to 184x.

Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/testutil/mem-util.h
M be/src/util/bloom-filter.cc
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
5 files changed, 178 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/4813/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..


Patch Set 3: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4813/3//COMMIT_MSG
Commit Message:

Line 11: The existing loop should auto-vectorize with our GCC setting of -O3,
Can you fix this too? Maybe just say "The previous code was not written in a 
way that GCC could auto-vectorize it". Missed it first time over.


-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-23 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change.

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4813/2/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

Line 181:   // The trivial loop out[i] |= in[i] should auto-vectorize with gcc 
at -O3, but it is not
> Let's fix this comment so it's clear that the issue is that the code is not
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..


Patch Set 1:

(4 comments)

LGTM, just want to make sure the comment is a little clearer.

http://gerrit.cloudera.org:8080/#/c/4813/1/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

Line 167
> What's stringe is that without the pragma and without __restrict__, Matt Go
I experimented with some variants: https://godbolt.org/g/rFww4g 
.BloomFilterOrPointersUnrolledIvDep() and BloomFilterOrIvDepInt() emit exactly 
the code you want.
There are a few things that make a difference:

* Avoiding the indirection via the in/out pointers helps a lot. If we extract 
vector.data()/vector.size() from the loop body we can get the same effect.
* Rather subtly, using a signed/unsigned loop counter makes a difference (I 
think with the signed type the compiler is allow to assume no overflow and this 
somehow helps).
* The unrolling and #pragma ivdep aren't always necessary to emit the code, but 
it's a lot cleaner with them.  I think gcc's vectoriser is smart enough to 
insert a check for whether the arrays overlap and two versions of the code for 
the overlapping/non-overlapping cases.

The explicit SIMD version has the virtue of not being sensitive to all these 
things.


PS1, Line 168: _mm256_loadu_pd
> BloomFilters do, but TBloomFilters do not, and from using gdb I see that th
Oh, duh.


PS1, Line 199: _mm_storeu_si128
> Let's continue this conversation above.
Yeah, I doubt there's much difference between the aligned/unaligned versions, 
just didn't want to use the unaligned ones if there was already an invariant 
that the memory was aligned.


http://gerrit.cloudera.org:8080/#/c/4813/2/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

Line 181:   // The trivial loop out[i] |= in[i] should auto-vectorize with gcc 
at -O3, but it is not
Let's fix this comment so it's clear that the issue is that the code is not 
written in a way that is friendly to auto-vectorization.

I've seen other cryptic comments about these kind of things ("compiler is 
confused by x")  and they're hard to know what to do with (if not outright 
misleading).


-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-23 Thread Jim Apple (Code Review)
Jim Apple has posted comments on this change.

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..


Patch Set 1:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/4813/1/be/src/util/bloom-filter.cc
File be/src/util/bloom-filter.cc:

Line 167
> One reason auto-vectorisation must fail here is that the two directories ma
What's stringe is that without the pragma and without __restrict__, Matt 
Godbolt's compiler explorer still shows the loop being autovectorized.

I tried loop unrolling, but it still didn't vectorize.


PS1, Line 162: DCHECK
> DCHECK_EQ
Done


Line 163:   auto simd_in = reinterpret_cast(in);
> I think double*/const double* is more readable than auto here - I'm used to
Done


PS1, Line 166: int
> Use size_t for consistency? I had to think about whether the implicit casts
Done


PS1, Line 168: _mm256_loadu_pd
> We should be able to use the aligned versions here, right, since we allocat
BloomFilters do, but TBloomFilters do not, and from using gdb I see that they 
are not 32-byte aligned.


PS1, Line 181: should
> It shouldn't actually with the code in it's current form since it has to as
Let's continue conversation on overlapping at the other comment.


Line 184:   // TODO: Tune gcc flags to auto-vectorize. This might not be 
possible.
> See my other comment - should be possible.
Let's continue conversation on that comment.


Line 186:   // TODO: IMPALA-4312: Evaluate clang for builds other than ASAN.
> Just remove this TODO? I don't think it's informative for people trying to 
Done


Line 190: auto simd_in = reinterpret_cast([0]);
> This code is simple enough that I'm fine with moving forward with it even i
I will leave as-is.


PS1, Line 191: auto
> I think just repeating the types const__m128i* or __m128i* is more readable
Done


PS1, Line 192: auto
> I'd prefer the int type was explicit here.
Done


PS1, Line 196: int
> size_t for consistency?
Done


PS1, Line 199: _mm_storeu_si128
> We should be able to use the aligned versions here too.
Let's continue this conversation above.

FWIW, i also tried some very complex methods of using non-simd ops to move both 
pointers forward until at least one was aligned. The performance was sometimes 
worse, so I put that aside for another day.


-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-23 Thread Jim Apple (Code Review)
Jim Apple has uploaded a new patch set (#2).

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..

IMPALA-4300: Speed up BloomFilter::Or with SIMD

Manually vectorizing speeds up BloomFilter::Or by up to 184x.

The existing loop should auto-vectorize with our GCC setting of -O3,
but it does not for reasons that remain mysterious.

Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/testutil/mem-util.h
M be/src/util/bloom-filter.cc
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
5 files changed, 176 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/4813/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4300: Speed up BloomFilter::Or with SIMD

2016-10-23 Thread Jim Apple (Code Review)
Jim Apple has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4813

Change subject: IMPALA-4300: Speed up BloomFilter::Or with SIMD
..

IMPALA-4300: Speed up BloomFilter::Or with SIMD

Manually vectorizing speeds up BloomFilter::Or by up to 184x.

The existing loop should auto-vectorize with our GCC setting of -O3,
but it does not for reasons that remain mysterious.

Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
---
M be/src/benchmarks/bloom-filter-benchmark.cc
M be/src/testutil/mem-util.h
M be/src/util/bloom-filter.cc
M be/src/util/cpu-info.cc
M be/src/util/cpu-info.h
5 files changed, 177 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/13/4813/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4813
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I840799d9cfb81285c796e2abfe2029bb869b0f67
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Jim Apple