[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-06 Thread Tamas Mate (Code Review)
Tamas Mate has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 23:

(7 comments)

Hi Daniel, looks good, commented some nits. I will go through the change one 
more time tomorrow.

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/exec/parquet/parquet-bloom-filter-util.h
File be/src/exec/parquet/parquet-bloom-filter-util.h:

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/exec/parquet/parquet-bloom-filter-util.h@17
PS23, Line 17:
nit: missing include guard


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/exec/parquet/parquet-bloom-filter-util.cc
File be/src/exec/parquet/parquet-bloom-filter-util.cc:

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/exec/parquet/parquet-bloom-filter-util.cc@18
PS23, Line 18: #include "exec/parquet/parquet-bloom-filter-util.h"
 :
 : #include 
 :
 : #include "exprs/scalar-expr-evaluator.h"
 : #include "util/parquet-bloom-filter.h"
nit: include order, usually it is standard include directories first then user 
defined files


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@1
PS23, Line 1: /*
Maybe we could create a Jira to disable clang on third party libraries. It 
looks like we would need a new directory specific clang-format file, I only did 
a quick research on it, so it could be more complex.


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/util/impala-bloom-filter-buffer-allocator.cc
File be/src/util/impala-bloom-filter-buffer-allocator.cc:

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/util/impala-bloom-filter-buffer-allocator.cc@19
PS23, Line 19:
nit: empty line


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/util/parquet-bloom-filter-test.cc
File be/src/util/parquet-bloom-filter-test.cc:

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/util/parquet-bloom-filter-test.cc@18
PS23, Line 18: #include "util/parquet-bloom-filter.h"
 : #include "testutil/gtest-util.h"
 :
 : #include 
 : #include 
 : #include 
 :
 : #include "common/names.h"
nit: include order, usually it is standard include directories first then user 
defined files


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/util/parquet-bloom-filter.cc
File be/src/util/parquet-bloom-filter.cc:

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/util/parquet-bloom-filter.cc@18
PS23, Line 18: #include "parquet-bloom-filter.h"
 :
 : #include 
 : #include 
 :
 : #include "kudu/util/slice.h"
 : #include "kudu/util/status.h"
 : #include "util/kudu-status-util.h"
 :
 : #include "thirdparty/xxhash/xxhash.h"
nit: include order, usually it is standard include directories first then user 
defined files


http://gerrit.cloudera.org:8080/#/c/17026/23/tests/query_test/test_parquet_bloom_filter.py
File tests/query_test/test_parquet_bloom_filter.py:

http://gerrit.cloudera.org:8080/#/c/17026/23/tests/query_test/test_parquet_bloom_filter.py@20
PS23, Line 20:
nit: empty line



--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Apr 2021 14:44:50 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 23:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8505/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Apr 2021 10:26:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-06 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 23: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17026/23//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17026/23//COMMIT_MSG@7
PS23, Line 7:  Part 1
Please mention in the title that this is about read support.


http://gerrit.cloudera.org:8080/#/c/17026/23//COMMIT_MSG@10
PS23, Line 10: CHAR(N)
Can you add more info about the unsupported type? E.g.
Impala type | Problem
VARCHAR(N)  | truncation can change hash
CHAR(N) | padding / truncation can change hash
DECIMAL | multiple encodings supported
TIMESTAMP:| multiple encodings supported, timezone conversion
DATE  | ?

A follow up Jira could be also created and mentioned here that would add 
support for more problematic types if possible (e.g DATE, DECIMAL, maybe 
TIMESTAMP)



--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Tue, 06 Apr 2021 10:22:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 23:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@70
PS23, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@92
PS23, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@113
PS23, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@243
PS23, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@253
PS23, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@254
PS23, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@270
PS23, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@429
PS23, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@441
PS23, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@476
PS23, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@628
PS23, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@631
PS23, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@700
PS23, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@724
PS23, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@743
PS23, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@756
PS23, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@766
PS23, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@768
PS23, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@791
PS23, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/23/be/src/thirdparty/xxhash/xxhash.h@792
PS23, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-06 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#23). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
  - Added unit tests for ParquetBloomFilter in file
be/src/util/parquet-bloom-filter-test.cc
  - A minor, unrelated change was done in
be/src/util/bloom-filter-test.cc: the MakeRandom() function had
return type uint64_t, the documentation claimed it returned a 64 bit
random number, but the actual number of random bits is 32, which is
what is intended in the tests. The return type and documentation
have been corrected to use 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
A be/src/exec/parquet/parquet-bloom-filter-util.cc
A be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/kudu/util/block_bloom_filter_avx2.cc
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter-test.cc
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
M testdata/data/README
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
33 files changed, 7,133 insertions(+), 139 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/23
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 23
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tamas Mate 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 22:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8495/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 22
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 
Gerrit-Comment-Date: Thu, 01 Apr 2021 23:01:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 22:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@70
PS22, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@92
PS22, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@113
PS22, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@243
PS22, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@253
PS22, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@254
PS22, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@270
PS22, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@429
PS22, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@441
PS22, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@476
PS22, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@628
PS22, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@631
PS22, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@700
PS22, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@724
PS22, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@743
PS22, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@756
PS22, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@766
PS22, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@768
PS22, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@791
PS22, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/22/be/src/thirdparty/xxhash/xxhash.h@792
PS22, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-04-01 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#22). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
  - Added unit tests for ParquetBloomFilter in file
be/src/util/parquet-bloom-filter-test.cc
  - A minor, unrelated change was done in
be/src/util/bloom-filter-test.cc: the MakeRandom() function had
return type uint64_t, the documentation claimed it returned a 64 bit
random number, but the actual number of random bits is 32, which is
what is intended in the tests. The return type and documentation
have been corrected to use 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
A be/src/exec/parquet/parquet-bloom-filter-util.cc
A be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/kudu/util/block_bloom_filter_avx2.cc
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter-test.cc
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
M testdata/data/README
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
33 files changed, 7,133 insertions(+), 139 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/22
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 22
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Zoltan Borok-Nagy 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-30 Thread Daniel Becker (Code Review)
Daniel Becker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 20:

(4 comments)

Thanks for the reviews. There are a few TODOs about which I'm not quite sure, 
could you please help me with them?

http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/exec/parquet/hdfs-parquet-scanner.cc@1482
PS20, Line 1482:   DCHECK(false); // TODO: Is it possible?
Do you think this is a possibility?


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/exec/parquet/hdfs-parquet-scanner.cc@1533
PS20, Line 1533: // TODO: That's what we do for stats filtering. Is it 
the correct way to handle
What do you think of this?


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/exec/parquet/parquet-bloom-filter-util.cc
File be/src/exec/parquet/parquet-bloom-filter-util.cc:

http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/exec/parquet/parquet-bloom-filter-util.cc@170
PS18, Line 170:   *ptr = storage->data();
  :   *len = storage->size();
> Resizing could be moved inside ParquetBloomFilterEncodeValue to make the ca
Done


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/exprs/expr-value.h
File be/src/exprs/expr-value.h:

http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/exprs/expr-value.h@236
PS20, Line 236: // TODO: Is it OK for all these string types?
What do you think of this, especially for TYPE_FIXED_UDA_INTERMEDIATE?



--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 20
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 30 Mar 2021 13:34:19 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 20:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8459/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 20
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 29 Mar 2021 16:38:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 20:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@70
PS20, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@92
PS20, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@113
PS20, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@243
PS20, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@253
PS20, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@254
PS20, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@270
PS20, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@429
PS20, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@441
PS20, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@476
PS20, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@628
PS20, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@631
PS20, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@700
PS20, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@724
PS20, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@743
PS20, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@756
PS20, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@766
PS20, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@768
PS20, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@791
PS20, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/20/be/src/thirdparty/xxhash/xxhash.h@792
PS20, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-29 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#20). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
  - Added unit tests for ParquetBloomFilter in file
be/src/util/parquet-bloom-filter-test.cc
  - A minor, unrelated change was done in
be/src/util/bloom-filter-test.cc: the MakeRandom() function had
return type uint64_t, the documentation claimed it returned a 64 bit
random number, but the actual number of random bits is 32, which is
what is intended in the tests. The return type and documentation
have been corrected to use 32 bits.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
A be/src/exec/parquet/parquet-bloom-filter-util.cc
A be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/kudu/util/block_bloom_filter_avx2.cc
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter-test.cc
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter-test.cc
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
M testdata/data/README
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
33 files changed, 7,140 insertions(+), 139 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/20
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 20
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 19:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8458/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 29 Mar 2021 15:30:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 19:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@70
PS19, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@92
PS19, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@113
PS19, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@243
PS19, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@253
PS19, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@254
PS19, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@270
PS19, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@429
PS19, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@441
PS19, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@476
PS19, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@628
PS19, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@631
PS19, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@700
PS19, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@724
PS19, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@743
PS19, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@756
PS19, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@766
PS19, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@768
PS19, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@791
PS19, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/19/be/src/thirdparty/xxhash/xxhash.h@792
PS19, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-29 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#19). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
  - Added unit tests for ParquetBloomFilter in file
be/src/util/parquet-bloom-filter-test.cc

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
A be/src/exec/parquet/parquet-bloom-filter-util.cc
A be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/kudu/util/block_bloom_filter_avx2.cc
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter-test.cc
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
M testdata/data/README
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
32 files changed, 7,137 insertions(+), 136 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/19
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 19
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-29 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 18:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/exec/parquet/parquet-bloom-filter-util.cc
File be/src/exec/parquet/parquet-bloom-filter-util.cc:

http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/exec/parquet/parquet-bloom-filter-util.cc@170
PS18, Line 170: const int exp_size = sizeof(double);
  : storage->resize(exp_size);
Resizing could be moved inside ParquetBloomFilterEncodeValue to make the call 
sites more compact. So this could look like 
ParquetBloomFilterEncodeValue(value, storage, sizeof(double));



-- 
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 18
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 29 Mar 2021 08:23:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 18:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8424/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 18
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 23 Mar 2021 16:56:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-23 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 18:

(184 comments)

http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/exec/parquet/parquet-bloom-filter-util.cc
File be/src/exec/parquet/parquet-bloom-filter-util.cc:

http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/exec/parquet/parquet-bloom-filter-util.cc@218
PS18, Line 218: ScalarExprEvaluator* eval, const parquet::Type::type& 
parquet_type, vector* storage,
line too long (97 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@70
PS18, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@92
PS18, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@113
PS18, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@243
PS18, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@253
PS18, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@254
PS18, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@270
PS18, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@429
PS18, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@441
PS18, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@476
PS18, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@628
PS18, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@631
PS18, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@700
PS18, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@724
PS18, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@743
PS18, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@756
PS18, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@766
PS18, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@768
PS18, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/18/be/src/thirdparty/xxhash/xxhash.h@791
PS18, Line 791: XXH_PUBLIC_API 

[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-23 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#18). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for types that
can reasonably be supported in Impala. Other types, such as CHAR(N),
would be very difficult to support because the length may be different
in Parquet and in Impala which results in truncation or padding, and
that changes the hash which makes using the Bloom filter impossible.
Write support will be added in a later change.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests
whether Parquet Bloom filtering works for the supported types and
that we do not incorrectly discard row groups for the unsupported
type VARCHAR. The Parquet file used in the test was generated with
an external tool.
  - TODO: Add unit tests for ParquetBloomFilter.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
A be/src/exec/parquet/parquet-bloom-filter-util.cc
A be/src/exec/parquet/parquet-bloom-filter-util.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/kudu/util/block_bloom_filter_avx2.cc
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
M testdata/data/README
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
31 files changed, 6,943 insertions(+), 136 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/18
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 18
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-18 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 17:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@7
PS17, Line 7: Part 1
Can you add some info about what is expected in later parts?


http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@26
PS17, Line 26: Testing:
It would be great to add some unit tests for ParquetBloomFilter, especially if 
there are paths that are not used in the EE test.


http://gerrit.cloudera.org:8080/#/c/17026/17//COMMIT_MSG@28
PS17, Line 28: Parquet Bloom filtering works for the supported types and 
that we do
Please mention that we use a Parquet file generated by some other tool. This 
info should be also added to 
https://github.com/apache/impala/blob/aeeff53e884a67ee7f5980654a1d394c6e3e34ac/testdata/data/README


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h@529
PS17, Line 529: buffer_pool_client_
BufferPool::ClientHandle should be only used from a single thread:
https://github.com/apache/impala/blob/master/be/src/runtime/bufferpool/buffer-pool.h#L332

I think that can cause problems if there are multiple scanners for a single 
scan node.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.h@801
PS17, Line 801:   static bool IsParquetBloomFilterSupported(parquet::Type::type 
parquet_type,
There may be better places for these functions, e.g. ParquetMetadataUtils, 
ParquetCommon, or probably a separate file for Parquet bloom filter related 
stuff.


http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h@704
PS3, Line 704:   ///
It could be noted that this is read from metadata_range_.


http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.h@718
PS3, Line 718:   /// Decides how to divide stream_->reservation() between the 
columns. May increase
consistency: EvalDictionaryFilters uses skip_row_group for the same purpose.


http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@1108
PS3, Line 1108: nst string& fn_
A few DCHECKs would be nice, e.g. to ensure that metadata_range_ is filled.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@843
PS17, Line 843:   continue;
This means that we will skip the row group without raising an y counters. I 
think that we should process the row group if there are issues with the 
bloomfilter.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1470
PS17, Line 1470: FindChildSlotRef
Is this needed for any supported types? e.g. char(N) in the example shouldn't 
be supported.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1684
PS17, Line 1684:   const int8_t* const cast_value = reinterpret_cast(value);
   :   const int byte_len = 
ParquetPlainEncoder::Encode(*cast_value,
   :   -1 /* fixed_len_size */, storage->data());
   :   DCHECK_EQ(byte_len, output_len);
Create a template function for this code?


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1729
PS17, Line 1729: const int exp_size = 
ParquetPlainEncoder::ByteSize(*cast_value);
   : storage->resize(exp_size);
if this was moved out, then the same template function could be used as the one 
mentioned at line 1684


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1825
PS17, Line 1825: __isset.meta_data
We shouldn't need to check this here.


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1834
PS17, Line 1834: _size, _filter_header));
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1842
PS17, Line 1842:   return Status(Substitute("Could not allocate buffer 
of $0 bytes for Parquet "
nit: too much indentation


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/exec/parquet/hdfs-parquet-scanner.cc@1857
PS17, Line 1857: data_buffer.buffer() + data_already_read, 

[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 17:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8349/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 17
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Fri, 12 Mar 2021 11:21:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 17:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@70
PS17, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@92
PS17, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@113
PS17, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@243
PS17, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@253
PS17, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@254
PS17, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@270
PS17, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@429
PS17, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@441
PS17, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@476
PS17, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@628
PS17, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@631
PS17, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@700
PS17, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@724
PS17, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@743
PS17, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@756
PS17, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@766
PS17, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@768
PS17, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@791
PS17, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/17/be/src/thirdparty/xxhash/xxhash.h@792
PS17, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-12 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#17). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for some types.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests that
Parquet Bloom filtering works for the supported types and that we do
not incorrectly discard row groups for the unsupported type VARCHAR.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
27 files changed, 6,910 insertions(+), 132 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/17
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 17
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 16:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8342/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 16
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Thu, 11 Mar 2021 13:44:21 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-11 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 16:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@70
PS16, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@92
PS16, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@113
PS16, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@243
PS16, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@253
PS16, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@254
PS16, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@270
PS16, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@429
PS16, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@441
PS16, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@476
PS16, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@628
PS16, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@631
PS16, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@700
PS16, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@724
PS16, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@743
PS16, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@756
PS16, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@766
PS16, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@768
PS16, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@791
PS16, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/16/be/src/thirdparty/xxhash/xxhash.h@792
PS16, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-11 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#16). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for some types.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests that
Parquet Bloom filtering works for the supported types and that we do
not incorrectly discard row groups for the unsupported type VARCHAR.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
27 files changed, 6,843 insertions(+), 123 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/16
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 16
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 15:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8328/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 10 Mar 2021 11:52:22 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 15:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@70
PS15, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@92
PS15, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@113
PS15, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@243
PS15, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@253
PS15, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@254
PS15, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@270
PS15, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@429
PS15, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@441
PS15, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@476
PS15, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@628
PS15, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@631
PS15, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@700
PS15, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@724
PS15, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@743
PS15, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@756
PS15, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@766
PS15, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@768
PS15, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@791
PS15, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/15/be/src/thirdparty/xxhash/xxhash.h@792
PS15, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-10 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#15). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for some types.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests that
Parquet Bloom filtering works for the supported types and that we do
not incorrectly discard row groups for the unsupported type VARCHAR.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
27 files changed, 6,848 insertions(+), 123 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/15
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 15
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 14:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8324/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 14
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 09 Mar 2021 22:39:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 14:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@70
PS14, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@92
PS14, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@113
PS14, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@243
PS14, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@253
PS14, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@254
PS14, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@270
PS14, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@429
PS14, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@441
PS14, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@476
PS14, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@628
PS14, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@631
PS14, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@700
PS14, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@724
PS14, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@743
PS14, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@756
PS14, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@766
PS14, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@768
PS14, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@791
PS14, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/14/be/src/thirdparty/xxhash/xxhash.h@792
PS14, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-09 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#14). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for some types.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests that
Parquet Bloom filtering works for the supported types and that we do
not incorrectly discard row groups for the unsupported type VARCHAR.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
27 files changed, 6,825 insertions(+), 123 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/14
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 14
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 13:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/8321/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 13
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 09 Mar 2021 19:08:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 13:

(183 comments)

http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/exec/parquet/hdfs-parquet-scanner.cc@1819
PS13, Line 1819:   const int bytes = std::max(1, 
bloom_filter_header.numBytes); // TODO: Trying to silence a clang warning.
line too long (110 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@70
PS13, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@92
PS13, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@113
PS13, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@243
PS13, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@253
PS13, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@254
PS13, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@270
PS13, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@429
PS13, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@441
PS13, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@476
PS13, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@628
PS13, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@631
PS13, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@700
PS13, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@724
PS13, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@743
PS13, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@756
PS13, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@766
PS13, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@768
PS13, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/13/be/src/thirdparty/xxhash/xxhash.h@791
PS13, Line 791: 

[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-09 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#13). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for some types.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests that
Parquet Bloom filtering works for the supported types and that we do
not incorrectly discard row groups for the unsupported type VARCHAR.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M common/thrift/parquet.thrift
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
27 files changed, 6,823 insertions(+), 123 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/13
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 13
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 12:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8308/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 12
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 08 Mar 2021 16:58:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 12:

(182 comments)

http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@70
PS12, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@92
PS12, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@113
PS12, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@243
PS12, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@253
PS12, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@254
PS12, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@270
PS12, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@429
PS12, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@441
PS12, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@476
PS12, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@628
PS12, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@631
PS12, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@700
PS12, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@724
PS12, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@743
PS12, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@756
PS12, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@766
PS12, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@768
PS12, Line 768: XXH_PUBLIC_API XXH_errorcode XXH3_64bits_update (XXH3_state_t* 
statePtr, const void* input, size_t length);
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@791
PS12, Line 791: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (100 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/12/be/src/thirdparty/xxhash/xxhash.h@792
PS12, Line 792: XXH_PUBLIC_API XXH128_hash_t XXH3_128bits_withSecret(const 
void* data, size_t len, const void* secret, size_t secretSize);
line too long (122 > 90)



[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-08 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded a new patch set (#12). ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for some types.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests that
Parquet Bloom filtering works for the supported types and that we do
not incorrectly discard row groups for the unsupported type VARCHAR.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M bin/rat_exclude_files.txt
M common/thrift/parquet.thrift
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
26 files changed, 6,820 insertions(+), 122 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/12
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 12
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 11:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/8307/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 11
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Mon, 08 Mar 2021 15:44:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17026 )

Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..


Patch Set 11:

(185 comments)

http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/exec/parquet/hdfs-parquet-scanner.cc@1576
PS11, Line 1576: status = DeserializeThriftMsg(buffer->buffer(), 
header_size, true, bloom_filter_header);
line too long (92 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/exec/parquet/hdfs-parquet-scanner.cc@1819
PS11, Line 1819:   "Bloom filter data for file '$1'.", 
bloom_filter_header.numBytes, filename()));
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h
File be/src/thirdparty/xxhash/xxhash.h:

http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@70
PS11, Line 70: 
https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html?showComment=1552696407071#c3490092340461170735
line too long (112 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@92
PS11, Line 92:  *  
https://fastcompression.blogspot.com/2018/03/xxhash-for-small-keys-impressive-power.html
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@113
PS11, Line 113: #  elif defined (__cplusplus) || (defined (__STDC_VERSION__) && 
(__STDC_VERSION__ >= 199901L) /* C99 */)
line too long (104 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@243
PS11, Line 243: #  define XXH3_64bits_reset_withSecret XXH_NAME2(XXH_NAMESPACE, 
XXH3_64bits_reset_withSecret)
line too long (93 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@253
PS11, Line 253: #  define XXH3_128bits_reset_withSeed XXH_NAME2(XXH_NAMESPACE, 
XXH3_128bits_reset_withSeed)
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@254
PS11, Line 254: #  define XXH3_128bits_reset_withSecret 
XXH_NAME2(XXH_NAMESPACE, XXH3_128bits_reset_withSecret)
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@270
PS11, Line 270: #define XXH_VERSION_NUMBER  (XXH_VERSION_MAJOR *100*100 + 
XXH_VERSION_MINOR *100 + XXH_VERSION_RELEASE)
line too long (103 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@429
PS11, Line 429:  * @param statePtr A pointer to an @ref XXH32_state_t allocated 
with @ref XXH32_createState().
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@441
PS11, Line 441: XXH_PUBLIC_API void XXH32_copyState(XXH32_state_t* dst_state, 
const XXH32_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@476
PS11, Line 476: XXH_PUBLIC_API XXH_errorcode XXH32_update (XXH32_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@628
PS11, Line 628: XXH_PUBLIC_API void XXH64_copyState(XXH64_state_t* dst_state, 
const XXH64_state_t* src_state);
line too long (94 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@631
PS11, Line 631: XXH_PUBLIC_API XXH_errorcode XXH64_update (XXH64_state_t* 
statePtr, const void* input, size_t length);
line too long (102 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@700
PS11, Line 700: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSeed(const void* 
data, size_t len, XXH64_hash_t seed);
line too long (98 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@724
PS11, Line 724: XXH_PUBLIC_API XXH64_hash_t XXH3_64bits_withSecret(const void* 
data, size_t len, const void* secret, size_t secretSize);
line too long (120 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@743
PS11, Line 743: XXH_PUBLIC_API void XXH3_copyState(XXH3_state_t* dst_state, 
const XXH3_state_t* src_state);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@756
PS11, Line 756: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSeed(XXH3_state_t* statePtr, XXH64_hash_t seed);
line too long (99 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@766
PS11, Line 766: XXH_PUBLIC_API XXH_errorcode 
XXH3_64bits_reset_withSecret(XXH3_state_t* statePtr, const void* secret, size_t 
secretSize);
line too long (121 > 90)


http://gerrit.cloudera.org:8080/#/c/17026/11/be/src/thirdparty/xxhash/xxhash.h@768
PS11, Line 768: XXH_PUBLIC_API XXH_errorcode 

[Impala-ASF-CR] IMPALA-9470: Use Parquet Bloom filters - Part 1

2021-03-08 Thread Daniel Becker (Code Review)
Daniel Becker has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/17026


Change subject: IMPALA-9470: Use Parquet Bloom filters - Part 1
..

IMPALA-9470: Use Parquet Bloom filters - Part 1

This change adds read support for Parquet Bloom filters for some types.
The supported Parquet type - Impala type pairs are the following:

 ---
|Parquet type |  Impala type|
|---|
|INT32|  TINYINT, SMALLINT, INT |
|INT64|  BIGINT |
|FLOAT|  FLOAT  |
|DOUBLE   |  DOUBLE |
|BYTE_ARRAY   |  STRING |
 ---

If a Bloom filter is available for a column that is fully dictionary
encoded, the Bloom filter is not used as the dictionary can give exact
results in filtering.

Testing:
  - Added tests/query_test/test_parquet_bloom_filter.py that tests that
Parquet Bloom filtering works for the supported types and that we do
not incorrectly discard row groups for the unsupported type VARCHAR.

Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
---
M LICENSE.txt
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exprs/expr-value.h
M be/src/exprs/literal.cc
M be/src/exprs/literal.h
M be/src/kudu/util/block_bloom_filter.cc
M be/src/kudu/util/block_bloom_filter.h
M be/src/runtime/bufferpool/buffer-pool-internal.h
M be/src/runtime/bufferpool/buffer-pool.cc
M be/src/runtime/bufferpool/buffer-pool.h
A be/src/thirdparty/xxhash/README.md
A be/src/thirdparty/xxhash/xxhash.h
M be/src/util/CMakeLists.txt
M be/src/util/bloom-filter.cc
M be/src/util/bloom-filter.h
A be/src/util/impala-bloom-filter-buffer-allocator.cc
A be/src/util/impala-bloom-filter-buffer-allocator.h
A be/src/util/parquet-bloom-filter.cc
A be/src/util/parquet-bloom-filter.h
M common/thrift/parquet.thrift
A testdata/data/parquet-bloom-filtering.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-bloom-filter.test
A tests/query_test/test_parquet_bloom_filter.py
25 files changed, 6,815 insertions(+), 122 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/17026/11
--
To view, visit http://gerrit.cloudera.org:8080/17026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I7119c7161fa3658e561fc1265430cb90079d8287
Gerrit-Change-Number: 17026
Gerrit-PatchSet: 11
Gerrit-Owner: Daniel Becker 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Daniel Becker