[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 7: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 03 May 2018 19:18:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..

IMPALA-4123 (prep): Parquet column reader cleanup

Some miscellaneous cleanup to make it easier to understand and
make future changes to the Parquet scanner.

A lot of the refactoring is about more cleanly separating functions
so that they have clearer purpose, e.g.:
* Functions that strictly do decoding, i.e. materialize values, convert
  and validate them. These are changed to operate on single values, not tuples.
* Functions that are used for the non-batched decoding path (i.e. driven
  by CollectionColumnReader or BoolColumnReader).
* Functions that dispatch to a templated implementation based on one or
  more runtime values.

Other misc changes:
* Move large functions out of class bodies.
* Use parquet::Encoding instead of bool to indicate encoding.
* Add some additional DCHECKs.

Testing:
* Ran exhaustive tests
* Ran fuzz test in a loop

Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Reviewed-on: http://gerrit.cloudera.org:8080/9799
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/rle-encoding.h
6 files changed, 482 insertions(+), 398 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 8
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 7: Code-Review+2

carry


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 03 May 2018 17:50:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2403/


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 7
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 03 May 2018 15:22:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-02 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 6: Code-Review+1

I'm confident that this is ready to be submitted. Dan, do you want to have 
another look?


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 03 May 2018 01:03:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 6: Code-Review+1

carry


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 01 May 2018 19:59:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@270
PS5, Line 270:
> Can you add a comment to describe the difference between the two?
Done


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@413
PS5, Line 413: , dict_decoder_ is uninitialized
> I think this comment could be more clear, e.g. ", we don't need to decode a
Done


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@456
PS5, Line 456: template
> nit: space?
Done


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@463
PS5, Line 463: if (IN_COLLECTION)
> nit: else?
Done


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@517
PS5, Line 517: template
> nit: space missing?
Did a search-and-replace for "template<" in the whole file.



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 01 May 2018 19:54:19 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-01 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9799

to look at the new patch set (#6).

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..

IMPALA-4123 (prep): Parquet column reader cleanup

Some miscellaneous cleanup to make it easier to understand and
make future changes to the Parquet scanner.

A lot of the refactoring is about more cleanly separating functions
so that they have clearer purpose, e.g.:
* Functions that strictly do decoding, i.e. materialize values, convert
  and validate them. These are changed to operate on single values, not tuples.
* Functions that are used for the non-batched decoding path (i.e. driven
  by CollectionColumnReader or BoolColumnReader).
* Functions that dispatch to a templated implementation based on one or
  more runtime values.

Other misc changes:
* Move large functions out of class bodies.
* Use parquet::Encoding instead of bool to indicate encoding.
* Add some additional DCHECKs.

Testing:
* Ran exhaustive tests
* Ran fuzz test in a loop

Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/rle-encoding.h
6 files changed, 482 insertions(+), 398 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/6
--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 6
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-05-01 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 5:

(5 comments)

Had only minor comments.

http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@270
PS5, Line 270:
Can you add a comment to describe the difference between the two?


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@413
PS5, Line 413: , dict_decoder_ is uninitialized
I think this comment could be more clear, e.g. ", we don't need to decode any 
values and dict_decoder_ does not need to be uninitialized."


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@456
PS5, Line 456: template
nit: space?


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@463
PS5, Line 463: if (IN_COLLECTION)
nit: else?


http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@517
PS5, Line 517: template
nit: space missing?



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 01 May 2018 18:12:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-04-30 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 5: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 5
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Apr 2018 23:06:50 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-04-04 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 4: Code-Review+1

Carry +1


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Apr 2018 22:56:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-29 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9799

to look at the new patch set (#4).

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..

IMPALA-4123 (prep): Parquet column reader cleanup

Some miscellaneous cleanup to make it easier to understand and
make future changes to the Parquet scanner.

A lot of the refactoring is about more cleanly separating functions
so that they have clearer purpose, e.g.:
* Functions that strictly do decoding, i.e. materialize values, convert
  and validate them. These are changed to operate on single values, not tuples.
* Functions that are used for the non-batched decoding path (i.e. driven
  by CollectionColumnReader or BoolColumnReader).
* Functions that dispatch to a templated implementation based on one or
  more runtime values.

Other misc changes:
* Move large functions out of class bodies.
* Use parquet::Encoding instead of bool to indicate encoding.
* Add some additional DCHECKs.

Testing:
* Ran exhaustive tests
* Ran fuzz test in a loop

Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/rle-encoding.h
6 files changed, 465 insertions(+), 387 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/4
--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 4
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@299
PS2, Line 299:   bool ReadValueBatch(MemPool* RESTRICT pool, int max_values, 
int tuple_size,
> I also found it quite hard to navigate in this file. I think that the real
Yeah I agree that would make sense (and probably is the way it should have been 
organised initially)



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 28 Mar 2018 20:31:20 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-28 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597
PS1, Line 597: template 
> Ahh, I missed this. Unfortunately it looks like there's some reason that is
Sorry, I was wrong here, and I don't know of any good alternative. If ENCODING 
was used as an argument's type, then using overloading instead of function 
template would do the job.


http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@299
PS2, Line 299:   bool ReadValueBatch(MemPool* RESTRICT pool, int max_values, 
int tuple_size,
> I realised I should probably move some of these large functions out-of-line
I also found it quite hard to navigate in this file. I think that the real 
solution would be to split it into more files, mainly by moving scalar column 
stuff to its own h/cc, and possible do the same with CollectionColumnReader. 
This would make cherry picking harder, but moving function bodies  within the 
file can lead to similar issues.



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 28 Mar 2018 14:30:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-27 Thread Tim Armstrong (Code Review)
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9799

to look at the new patch set (#3).

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..

IMPALA-4123 (prep): Parquet column reader cleanup

Some miscellaneous cleanup to make it easier to understand and
make future changes to the Parquet scanner.

A lot of the refactoring is about more cleanly separating functions
so that they have clearer purpose, e.g.:
* Functions that strictly do decoding, i.e. materialize values, convert
  and validate them. These are changed to operate on single values, not tuples.
* Functions that are used for the non-batched decoding path (i.e. driven
  by CollectionColumnReader or BoolColumnReader).
* Functions that dispatch to a templated implementation based on one or
  more runtime values.

Other misc changes:
* Move large functions out of class bodies.
* Use parquet::Encoding instead of bool to indicate encoding.
* Add some additional DCHECKs.

Testing:
* Ran exhaustive tests
* Ran fuzz test in a loop

Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/rle-encoding.h
6 files changed, 449 insertions(+), 367 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/3
--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 3
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@299
PS2, Line 299:   bool ReadValueBatch(MemPool* RESTRICT pool, int max_values, 
int tuple_size,
I realised I should probably move some of these large functions out-of-line 
too. I don't know if it's just me, but I get confused about which 
implementation of ReadValueBatch() I'm looking at since the class body is large 
enough that I don't have a reference point when looking at each function.



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 27 Mar 2018 23:38:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@758
PS2, Line 758:   inline bool ReadSlot(bool* slot, MemPool* pool)  {
I should probably rename this DecodeValue() to be consistent and remove the 
unnecessary 'pool' argument.



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 27 Mar 2018 18:18:07 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 2: Code-Review+1

carry +1


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 27 Mar 2018 18:13:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 2: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 27 Mar 2018 04:41:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 2:

Carry +1. Running tests just to get clang-tidy coverage, etc.


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 27 Mar 2018 00:56:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-26 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2181/


--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 27 Mar 2018 00:56:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@518
PS1, Line 518:   if (def_level < 
def_level_of_immediate_repeated_ancestor()) {
 : // A containing repeated field is empty or NULL. Skip 
the value but
 : // move to the next repetition level if necessary.
 : if (pos_slot_desc_ != nullptr) 
rep_levels_.CacheSkipLevels(1);
 : continue;
 :   }
 :   if (pos_slot_desc_ != nullptr) {
 : ReadPositionBatched(rep_levels_.CacheGetNext(),
 : 
static_cast(tuple->GetSlot(pos_slot_desc_->tuple_offset(;
 :   }
> I would move this whole block to ReadPositionBatched, and rename it to some
I think the current structure is the lesser of evils for a couple of reads:
* It is tied into the control flow of the loop, since we use "continue" to jump 
to the top of the loop based on the def level. I don't see a clean way to 
factor out the def_level < .. branch.
* I want to use ReadPositionBatched() as a helper function in a follow-on patch.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@531
PS1, Line 531: def_level >= max_def_level()
> This was not changed, but I am curious: can def_level be greater than max_d
It shouldn't be possible in a valid file. It would break our decoding in a lot 
of places, since we infer the maximum from the schema, then use that to 
determine the bit width when decoding levels.

This predates me working on the code, but I suspect the looser check was used 
to avoid crashing without paying any extra runtime overhead to validate each 
value.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@586
PS1, Line 586: if (!continue_execution) return false;
> Replacing this with !parent_->parse_status_.ok() and removing continue_exec
Done


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597
PS1, Line 597: template 
> This could be split to two specialized template functions.
Ahh, I missed this. Unfortunately it looks like there's some reason that isn't 
not possible to specialize a member function template when the class template 
is unspecialised - I'm getting this from clang:

  template 
  template <>
  bool ScalarColumnReader::DecodeValue(

 error| cannot specialize (with 'template<>') a member of an unspecialized 
template


If there's an alternative you know of, I'm open to it.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@600
PS1, Line 600: MemPool* RESTRICT pool
> "pool" could be removed, as it is not used in the function (ReadSlot() used
Oh good point. I removed the pool argument.

I'll hold off on making the other change. I can see the argument for it, but 
one advantage of the current code is that it's more obvious in 
HdfsParquetScanner::AssembleRows() that ReadValueBatch() allocates memory from 
aux_mem_pool, so the memory lifetime is a bit clearer.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@1320
PS1, Line 1320: return 
ReadSlot(static_cast(tuple->GetSlot(tuple_offset_)), pool);
> Tuple has a GetCollectionSlot() functions that does the cast. Some new type
Ah I forgot about those functions, good catch. Added some new functions. I used 
GetBigIntVal() since I think the name should reflect the logical type of the 
slot.



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 27 Mar 2018 00:55:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-26 Thread Tim Armstrong (Code Review)
Hello Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9799

to look at the new patch set (#2).

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..

IMPALA-4123 (prep): Parquet column reader cleanup

Some miscellaneous cleanup to make it easier to understand and
make future changes to the Parquet scanner.

A lot of the refactoring is about more cleanly separating functions
so that they have clearer purpose, e.g.:
* Functions that strictly do decoding, i.e. materialize values, convert
  and validate them. These are changed to operate on single values, not tuples.
* Functions that are used for the non-batched decoding path (i.e. driven
  by CollectionColumnReader or BoolColumnReader).
* Functions that dispatch to a templated implementation based on one or
  more runtime values.

Other misc changes:
* Move large functions out of class bodies.
* Use parquet::Encoding instead of bool to indicate encoding.
* Add some additional DCHECKs.

Testing:
* Ran exhaustive tests
* Ran fuzz test in a loop

Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/runtime/tuple.h
M be/src/util/bit-stream-utils.h
M be/src/util/rle-encoding.h
6 files changed, 228 insertions(+), 168 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/2
--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 2
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-26 Thread Csaba Ringhofer (Code Review)
Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9799 )

Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..


Patch Set 1: Code-Review+1

(6 comments)

My suggestions are mainly from the "why not refactor a bit, if we are already 
touching this area" type, so feel free to skip them.

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@518
PS1, Line 518:   if (def_level < 
def_level_of_immediate_repeated_ancestor()) {
 : // A containing repeated field is empty or NULL. Skip 
the value but
 : // move to the next repetition level if necessary.
 : if (pos_slot_desc_ != nullptr) 
rep_levels_.CacheSkipLevels(1);
 : continue;
 :   }
 :   if (pos_slot_desc_ != nullptr) {
 : ReadPositionBatched(rep_levels_.CacheGetNext(),
 : 
static_cast(tuple->GetSlot(pos_slot_desc_->tuple_offset(;
 :   }
I would move this whole block to ReadPositionBatched, and rename it to 
something like HandleNextRepLevel(Tuple*, def_level) to make 
MaterializeValueBatch()'s non collection logic easier read.

Line 509 could be moved to, as rep_levels_.CacheHasNext() should be true at the 
start of every iteration of the loop.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@531
PS1, Line 531: def_level >= max_def_level()
This was not changed, but I am curious: can def_level be greater than 
max_def_level?


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@586
PS1, Line 586: if (!continue_execution) return false;
Replacing this with !parent_->parse_status_.ok() and removing 
continue_execution would be a bit shorter.

Marking this branch as UNLIKELY could probably make the other branch a bit 
faster.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597
PS1, Line 597: template 
This could be split to two specialized template functions.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@600
PS1, Line 600: MemPool* RESTRICT pool
"pool" could be removed, as it is not used in the function (ReadSlot() used it 
only as an argument to ConvertSlot()). 

It could be also turned into a member (it is always the scanners 
scratch_batch_->aux_mem_pool) instead of passing it as an argument.


http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@1320
PS1, Line 1320: return 
ReadSlot(static_cast(tuple->GetSlot(tuple_offset_)), pool);
Tuple has a GetCollectionSlot() functions that does the cast. Some new typed 
function like GetInt64Slot() could be added too replace the other GetSlot() 
casts.



--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Comment-Date: Mon, 26 Mar 2018 19:12:22 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup

2018-03-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9799


Change subject: IMPALA-4123 (prep): Parquet column reader cleanup
..

IMPALA-4123 (prep): Parquet column reader cleanup

Some miscellaneous cleanup to make it easier to understand and
make future changes to the Parquet scanner.

A lot of the refactoring is about more cleanly separating functions
so that they have clearer purpose, e.g.:
* Functions that strictly do decoding, i.e. materialize values, convert
  and validate them. These are changed to operate on single values, not tuples.
* Functions that are used for the non-batched decoding path (i.e. driven
  by CollectionColumnReader or BoolColumnReader).
* Functions that dispatch to a templated implementation based on one or
  more runtime values.

Other misc changes:
* Move large functions out of class bodies.
* Use parquet::Encoding instead of bool to indicate encoding.
* Add some additional DCHECKs.

Testing:
* Ran exhaustive tests
* Ran fuzz test in a loop

Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/bit-stream-utils.h
M be/src/util/rle-encoding.h
5 files changed, 226 insertions(+), 166 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/1
--
To view, visit http://gerrit.cloudera.org:8080/9799
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1
Gerrit-Change-Number: 9799
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong