[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 13:

I've gotten into a similar situation before and git commit --amend 
--reset-author has gotten me out of it. Anyway, not a big deal, I just don't 
want to get credit (or blame!) for your work :)


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 23 Mar 2018 15:10:37 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-23 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 13:

I think that it is related to an incident during my commit:
- by mistake, I committed first with "commit --amend", which modified the last 
commit in my repo (which was yours)
- I searched for a solution to redo this change without loosing my 
modification, and did the one I found (I don't remember the exact commands, I 
can look it up if necessary)
- everything looked ok to me, my changes were in separate commit, so I pushed 
the change to gerrit

My guess is that gerrit was tricked somehow by this maneuver, and the Author 
field was inherited from the last commit.


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 23 Mar 2018 12:17:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-22 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 13:

I just noticed that I'm listed as the author on this patch - weird - wonder 
what happened...


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 23 Mar 2018 00:09:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 12: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 22 Mar 2018 02:47:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Reviewed-on: http://gerrit.cloudera.org:8080/9403
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/rle-benchmark.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
10 files changed, 331 insertions(+), 53 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 13
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2150/


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 23:01:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 12: Code-Review+2

There was a minor clang-tidy issue. Just fixed it myself.


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 12
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 23:01:15 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#11) to the change originally 
created by Csaba Ringhofer. ( http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/rle-benchmark.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
10 files changed, 331 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/11
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 11
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 10: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2143/


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 20:37:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 10: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 16:31:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2143/


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 10
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 16:31:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 9:

Patch set 9 is a rebase + conflict resolution.


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 12:17:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-21 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9403

to look at the new patch set (#9).

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/rle-benchmark.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
10 files changed, 331 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/9
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 9
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 8:

I can merge once the merge conflict is fixed.


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 20 Mar 2018 22:30:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 8: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 20 Mar 2018 22:00:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-20 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190
PS7, Line 190:   DCHECK_EQ(encoding_, parquet::Encoding::RLE);
> I think it's better to omit the "not for 2.x" so it doesn't confuse people
Thanks for the answer! As the commit message does not contain "not for 2.x", no 
modification is needed related to this.


http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649
PS2, Line 649: "testdata/data/", file_name)
> For me a refactor you mentioned makes sense. Do you know the proper way of
Sorry for the late answer. I would prefer to do that refactor in a separate 
patch, so I have created Jira for it: 
https://issues.apache.org/jira/browse/IMPALA-6709



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 20 Mar 2018 14:50:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-19 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190
PS7, Line 190:   DCHECK_EQ(encoding_, parquet::Encoding::RLE);
> Sure, I can do the conflict resolution. Does this mean that I should add "C
I think it's better to omit the "not for 2.x" so it doesn't confuse people 
looking at the commit message in the future. Once the change is merged you can 
do the backport and if the cherry-pick job breaks in the meantime it'll 
continue once the backport gets merged.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 19 Mar 2018 17:08:04 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-19 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190
PS7, Line 190:   DCHECK_EQ(encoding_, parquet::Encoding::RLE);
> We'll need to resolve a conflict here on the 2.x branch since that also sup
Sure, I can do the conflict resolution. Does this mean that I should add 
"Cherry-picks: not for 2.x." to the commit message, and do the cherry picking 
locally?


http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h@687
PS7, Line 687:   DCHECK(false);
> Can't this DCHECK be triggered by invalid data? GetLiteralValues() could fa
I have removed the DCHECK, and created a Jira (IMPALA-6700) to improve error 
handling, as currently it is a bit hard to understand it in my opinion.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 19 Mar 2018 16:54:43 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-19 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9403

to look at the new patch set (#8).

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/rle-benchmark.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
10 files changed, 331 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/8
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 8
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 7:

(2 comments)

I think this is good to go once we remove that DCHECK.

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/exec/parquet-column-readers.cc@190
PS7, Line 190:   DCHECK_EQ(encoding_, parquet::Encoding::RLE);
We'll need to resolve a conflict here on the 2.x branch since that also 
supports BIT_PACKED levels. Once we've merged this to master, can you 
cherry-pick it to the 2.x branch and resolve the conflict?


http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/7/be/src/util/rle-encoding.h@687
PS7, Line 687:   DCHECK(false);
Can't this DCHECK be triggered by invalid data? GetLiteralValues() could fail 
if the input was truncated.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 14 Mar 2018 21:08:13 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-14 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 6:

"I'm curious what was wrong with the decoding - I didn't see an obvious bugfix 
between PS3 and this patchset."

It was fixed by skipping the first 4 bytes, which contained the length of the 
data, and caused some extra values to be inserted. see 
https://gerrit.cloudera.org/#/c/9403/3..4/be/src/exec/parquet-column-readers.cc

As the inserted values were 0s, and the values shifted out at the end of were 
also 0s, the count of 0s and 1s remained correct, so the original tests didn't 
catch this issue.


--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 14 Mar 2018 15:18:10 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-14 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG@17
PS6, Line 17: filled. As the cache size is 128, I considered this to be outside 
the
> I agree that this overhead seems minimal - I'm not concerned about a predic
Done


http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc
File be/src/benchmarks/rle-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39
PS4, Line 39: //   for loop / run length: 100.4  
0.4  0.4 0.487X 0.487X 0.486X
: // memset / run length: 10  0.396  
0.4  0.4 0.482X 0.487X 0.486X
> I agree that it would make sense to pick the denser encoding in cases where
I have created a Jira for the size issue: 
https://issues.apache.org/jira/browse/IMPALA-6658


http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@60
PS4, Line 60: std::
> nit: we generally prefer "using" declarations in .cc files instead of using
Done


http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h@611
PS6, Line 611: literal_count_ = (indicator_value >> 1) * 8;
> I remember you pointed out on another patchset that, according to the gramm
I am unsure whether it is valid or not. There are tests related to 0 counters, 
but I think that these test only the case when the data starts with 0, which 
cause the RleBatchDecoder (or its original copy) to return early and consume 0 
elements, which is treated as error in some places. Meanwhile if the 0 comes 
later, Impala may ignore it silently.

the tests were added here:
https://github.com/apache/impala/commit/025fd3bd7f3b7f4bc3dcb6b21e3598124b19b577

I would prefer if RleBatchDecoder would be stricter, and treat 0 counters 
always as error, but this change could lead to errors in tables that can be 
read by Impala at the moment.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 14 Mar 2018 15:10:32 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-14 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9403

to look at the new patch set (#7).

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop".

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/rle-benchmark.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
10 files changed, 332 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/7
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 7
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-13 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 6:

(4 comments)

Only very minor comments. I'm curious what was wrong with the decoding - I 
didn't see an obvious bugfix between PS3 and this patchset.

http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9403/6//COMMIT_MSG@17
PS6, Line 17: filled. As the cache size is 128, I considered this to be outside 
the
I agree that this overhead seems minimal - I'm not concerned about a 
predictable branch per 128 values.


http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc
File be/src/benchmarks/rle-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39
PS4, Line 39: //   for loop / run length: 100.4  
0.4  0.4 0.487X 0.487X 0.486X
: // memset / run length: 10  0.396  
0.4  0.4 0.482X 0.487X 0.486X
> Some thoughts about this performance degradation compared to the run_length
I agree that it would make sense to pick the denser encoding in cases where the 
literal run is actually smaller. Please feel free to file a JIRA and fix it.

I'm not sure that the performance difference will hold up permanently, since we 
can optimise handling of repeated runs better in the caller and avoid unpacking 
the N values.


http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@60
PS4, Line 60: std::
nit: we generally prefer "using" declarations in .cc files instead of using the 
namespace everywhere.


http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/6/be/src/util/rle-encoding.h@611
PS6, Line 611: literal_count_ = (indicator_value >> 1) * 8;
I remember you pointed out on another patchset that, according to the grammar, 
this could technically be 0 and still be a valid literal run. Should we leave a 
comment explaining that while we're here?

https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 13 Mar 2018 18:55:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc
File be/src/benchmarks/rle-benchmark.cc:

http://gerrit.cloudera.org:8080/#/c/9403/4/be/src/benchmarks/rle-benchmark.cc@39
PS4, Line 39: //   for loop / run length: 100.4  
0.4  0.4 0.487X 0.487X 0.486X
: // memset / run length: 10  0.396  
0.4  0.4 0.482X 0.487X 0.486X
Some thoughts about this performance degradation compared to the run_length=1 
case:

If bit_width=1, then 8 values encoded as a repeated run can use more space than 
if they were encoded as a literal run. RleEncoder currently always uses 
repeated runs if it finds 8 repeated values - it may be good to change this to 
a higher number (16?) if bit_width=1.

The performance seems to be similar if bit_width>1, so the space inefficiency 
above is probably not the real cause. I suspect that the problem with short 
repeated runs is that BatchedBitReader is optimized for 32*N batches, and 
shorter literal runs are buffered by RleBatchDecoder. It would be possible to 
avoid buffering in case of 8*N batches too, which could improve the performance 
of shorter runs.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 13 Mar 2018 15:14:53 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-13 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 6:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@673
PS3, Line 673: switch (page_encoding_) {
> nit: space before (
Done


http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/exec/parquet-column-readers.cc@1304
PS3, Line 1304: }
> Convert to using Substitute() while you're here - we generally prefer using
Done


http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@138
PS3, Line 138:   /// Returns the number of consumed values or 0 if an error 
occurred.
> Can you add some tests to be/src/util/rle-test.cc for the new interface? Th
Done


http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674
PS3, Line 674:for (int i = 0; i < num_repeats_to_set; ++i) {
> Yes, please do add the benchmark. The old benchmark was comparing two old i
Done


http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/data/README@160
PS3, Line 160: Parquet v1 file with RLE encoded boolean column "b" and int 
column "i".
> How was it created? Parquet-mr?
Yes, it was created with parquet-mr (parquet-cli's convert-csv command). I had 
to modify parquet-mr to force RLE encoding for booleans, because by default it 
will use bit packed for Parquet v1. Parquet v2 uses RLE by default, but Impala 
does not read v2 data pages at the moment.


http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test:

http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test@4
PS3, Line 4: select count(*) from rle_encoded_bool where b;
> Thanks for the tip, I have created such a table, and something is actually
The new test will catch if rle encoded booleans are returned in the wrong order.


http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9403/3/tests/query_test/test_scanners.py@656
PS3, Line 656: self.client.execute(("""CREATE TABLE {0}.rle_encoded_bool (b 
boolean, i int)
> Nit: trailing """ can go on same line.
Done



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 13 Mar 2018 13:59:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-13 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9403

to look at the new patch set (#6).

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

A new benchmark, rle-benchmark.cc is added to test the speed of RLE
decoding for different bit widths and run lengths.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop", but some more performance measurement may be needed to
validate this.

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/rle-benchmark.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
10 files changed, 328 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/6
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-13 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9403

to look at the new patch set (#4).

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop", but some performance measurement may be needed to validate
this.

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/rle-benchmark.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
10 files changed, 306 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/4
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-09 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674
PS3, Line 674:memset(values + num_consumed, repeated_value, 
num_repeats_to_set);
> I have created a benchmark and it turned out that memset and  the for loop
Yes, please do add the benchmark. The old benchmark was comparing two old 
implementations that were no longer relevant.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 09 Mar 2018 17:34:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-09 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 3:

(2 comments)

There is something wrong with the decoding, so I have no patch ready for 
review, but I have a question related to this commit in the comment for 
rle-encoding.h.

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674
PS3, Line 674:memset(values + num_consumed, repeated_value, 
num_repeats_to_set);
> We could also be lazy and not implement it for sizeof(T) > 1 until we actua
I have created a benchmark and it turned out that memset and  the for loop are 
actually different, memset seems to be a few percent faster for short runs 
(<=32), and the loop is a few percent faster for long runs (>=128). These 
results are surprising to me, I have expected the opposite.

Should I add the benchmark (rle-benchmark.cc) to this commit? I saw that you 
have removed a file with similar name in https://gerrit.cloudera.org/#/c/8267/


http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test:

http://gerrit.cloudera.org:8080/#/c/9403/3/testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test@4
PS3, Line 4: select count(*) from rle_encoded_bool where b;
> This wouldn't detect if the boolean values were returned in the wrong order
Thanks for the tip, I have created such a table, and something is actually 
wrong with the decoded values. I am still investigating the issue.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 09 Mar 2018 17:25:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/3/be/src/util/rle-encoding.h@674
PS3, Line 674:memset(values + num_consumed, repeated_value, 
num_repeats_to_set);
> It's not valid to use memset() here in general since sizeof(T) may be > 1.
We could also be lazy and not implement it for sizeof(T) > 1 until we actually 
need it, but I think we should just fix it properly while we're here.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 06 Mar 2018 23:21:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-06 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 3: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649
PS2, Line 649: "testdata/data/", file_name)
> I have copy-pasted the template of this functions from test_def_levels, whi
For me a refactor you mentioned makes sense. Do you know the proper way of 
doing this? Open a new Jira, or proactively include this to a random patch that 
touches this area (like this patch)?



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 06 Mar 2018 16:15:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-05 Thread Csaba Ringhofer (Code Review)
Hello Lars Volker, Gabor Kaszab, Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9403

to look at the new patch set (#3).

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop", but some performance measurement may be needed to validate
this.

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query from it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
7 files changed, 102 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/3
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-03-05 Thread Gabor Kaszab (Code Review)
Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9403 )

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..


Patch Set 2:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9403/2//COMMIT_MSG@24
PS2, Line 24: query into it
nit: query from it?


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@185
PS2, Line 185:   return num_cached_levels > 0;
instead?
return num_cached_levels != NULL

num_cached_levels > 0 doesn't indicate for me that num_cached_levels is a 
pointer.


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@665
PS2, Line 665: switch(page_encoding_) {
Would it have any impact if you did a reset on both bool_values_ and 
rle_decoder_ here unconditionally?
If this has to stay, could you enhance the function's comment to reflect?

As I see you branch on page_encoding_ anyway where you use them.


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/exec/parquet-column-readers.cc@714
PS2, Line 714:   num_unpacked_values_ = page_encoding_ == 
parquet::Encoding::PLAIN
nit: I find a simple if more readable than the ternary operator. Might be just 
my preference, though.


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h
File be/src/util/rle-encoding.h:

http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@139
PS2, Line 139:   int32_t GetValues(int32_t num_values_to_consume, T* values);
As I see the return value here serves 2 different purposes:
 1) As a general return value to indicate success or failure
 2) An out parameter to get the num_values_to_consume.
I find this a bit confusing. Can't you have a return value for the purpose of 
1) and an out param for 2) ?
What do you think?


http://gerrit.cloudera.org:8080/#/c/9403/2/be/src/util/rle-encoding.h@666
PS2, Line 666:   int32_t num_unpacked_values = 0;
The names of num_unpacked_values and num_values_to_consume are not in line in 
my opinion. Either the first one should be num_consumed_values or the second 
one num_values_to_unpack.


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@152
PS2, Line 152: rle_encoded_bool.parquet:
According to parquet-rle-encoded-bool.test there are 140 True values in column 
b. Could you please mention in the comment this? (and the total row count would 
also seem reasonable for future use).


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@154
PS2, Line 154: as I found no other way to force RLE
 : encoding of booleans in Parquet v1
nit: This part of the sentence doesn't provide extra information for the reader 
in my opinion.


http://gerrit.cloudera.org:8080/#/c/9403/2/testdata/data/README@155
PS2, Line 155: It became the default encoding in Parquet v2, but
 : Impala is not able to read Parquet v2 files yet.
nit: I'm not sure this information is required here either. (what if someone 
reads this after Parquet v2 reading is implemented in Impala?)


http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/9403/2/tests/query_test/test_scanners.py@649
PS2, Line 649: "refresh {0}.rle_encoded_bool".format(unique_database));
For my understanding: Why is this refresh needed here? I checked other tests 
that include .parquet files (e.g. test_zero_rows) but those doesn't do this 
refresh step.



--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 05 Mar 2018 16:04:02 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-02-22 Thread Csaba Ringhofer (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9403

to look at the new patch set (#2).

Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

There might be a small performance impact on PLAIN encoded booleans,
because of the additional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop", but some performance measurement may be needed to validate
this.

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
that uses this file, and tries to query into it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
7 files changed, 101 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/2
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

2018-02-22 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9403


Change subject: IMPALA-6324: Support reading RLE-encoded boolean values in 
Parquet scanner
..

IMPALA-6324: Support reading RLE-encoded boolean values in Parquet scanner

Impala already supported RLE encoding for levels and dictionary pages, so
the only task was to integrate it into BoolColumnReader.

There might be a small performance impact on PLAIN encoded booleans,
because of the addittional branch when the cache of BoolColumnReader is
filled. As the cache size is 128, I considered this to be outside the
"hot loop", but some performance measurement may be needed to validate
this.

Testing:

As Impala cannot write RLE encoded bool columns at the moment, parquet-mr
was used to create a test file, testdata/data/rle_encoded_bool.parquet

tests/query_test/test_scanners.py#test_rle_encoded_bools creates a table
and that uses this file, and tries to query into it.

Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
---
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/rle-encoding.h
M testdata/data/README
A testdata/data/rle_encoded_bool.parquet
A 
testdata/workloads/functional-query/queries/QueryTest/parquet-rle-encoded-bool.test
M tests/query_test/test_scanners.py
7 files changed, 101 insertions(+), 43 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/03/9403/1
--
To view, visit http://gerrit.cloudera.org:8080/9403
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4644bf8cf5d2b7238b05076407fbf78ab5d2c14f
Gerrit-Change-Number: 9403
Gerrit-PatchSet: 1
Gerrit-Owner: Csaba Ringhofer 
Gerrit-Reviewer: Tim Armstrong