[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 7: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 7 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 03 May 2018 19:18:08 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. IMPALA-4123 (prep): Parquet column reader cleanup Some miscellaneous cleanup to make it easier to understand and make future changes to the Parquet scanner. A lot of the refactoring is about more cleanly separating functions so that they have clearer purpose, e.g.: * Functions that strictly do decoding, i.e. materialize values, convert and validate them. These are changed to operate on single values, not tuples. * Functions that are used for the non-batched decoding path (i.e. driven by CollectionColumnReader or BoolColumnReader). * Functions that dispatch to a templated implementation based on one or more runtime values. Other misc changes: * Move large functions out of class bodies. * Use parquet::Encoding instead of bool to indicate encoding. * Add some additional DCHECKs. Testing: * Ran exhaustive tests * Ran fuzz test in a loop Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Reviewed-on: http://gerrit.cloudera.org:8080/9799 Reviewed-by: Tim ArmstrongTested-by: Impala Public Jenkins --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/tuple.h M be/src/util/bit-stream-utils.h M be/src/util/rle-encoding.h 6 files changed, 482 insertions(+), 398 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 8 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 7: Code-Review+2 carry -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 7 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 03 May 2018 17:50:47 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 7: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2403/ -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 7 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 03 May 2018 15:22:48 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 6: Code-Review+1 I'm confident that this is ready to be submitted. Dan, do you want to have another look? -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 6 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 03 May 2018 01:03:06 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 6: Code-Review+1 carry -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 6 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 01 May 2018 19:59:17 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 5: (5 comments) http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@270 PS5, Line 270: > Can you add a comment to describe the difference between the two? Done http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@413 PS5, Line 413: , dict_decoder_ is uninitialized > I think this comment could be more clear, e.g. ", we don't need to decode a Done http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@456 PS5, Line 456: template > nit: space? Done http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@463 PS5, Line 463: if (IN_COLLECTION) > nit: else? Done http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@517 PS5, Line 517: template > nit: space missing? Did a search-and-replace for "template<" in the whole file. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 5 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 01 May 2018 19:54:19 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9799 to look at the new patch set (#6). Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. IMPALA-4123 (prep): Parquet column reader cleanup Some miscellaneous cleanup to make it easier to understand and make future changes to the Parquet scanner. A lot of the refactoring is about more cleanly separating functions so that they have clearer purpose, e.g.: * Functions that strictly do decoding, i.e. materialize values, convert and validate them. These are changed to operate on single values, not tuples. * Functions that are used for the non-batched decoding path (i.e. driven by CollectionColumnReader or BoolColumnReader). * Functions that dispatch to a templated implementation based on one or more runtime values. Other misc changes: * Move large functions out of class bodies. * Use parquet::Encoding instead of bool to indicate encoding. * Add some additional DCHECKs. Testing: * Ran exhaustive tests * Ran fuzz test in a loop Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/tuple.h M be/src/util/bit-stream-utils.h M be/src/util/rle-encoding.h 6 files changed, 482 insertions(+), 398 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/6 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 6 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 5: (5 comments) Had only minor comments. http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@270 PS5, Line 270: Can you add a comment to describe the difference between the two? http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@413 PS5, Line 413: , dict_decoder_ is uninitialized I think this comment could be more clear, e.g. ", we don't need to decode any values and dict_decoder_ does not need to be uninitialized." http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@456 PS5, Line 456: template nit: space? http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@463 PS5, Line 463: if (IN_COLLECTION) nit: else? http://gerrit.cloudera.org:8080/#/c/9799/5/be/src/exec/parquet-column-readers.cc@517 PS5, Line 517: template nit: space missing? -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 5 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 01 May 2018 18:12:24 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 5: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 5 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 30 Apr 2018 23:06:50 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 4: Code-Review+1 Carry +1 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 4 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 04 Apr 2018 22:56:42 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9799 to look at the new patch set (#4). Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. IMPALA-4123 (prep): Parquet column reader cleanup Some miscellaneous cleanup to make it easier to understand and make future changes to the Parquet scanner. A lot of the refactoring is about more cleanly separating functions so that they have clearer purpose, e.g.: * Functions that strictly do decoding, i.e. materialize values, convert and validate them. These are changed to operate on single values, not tuples. * Functions that are used for the non-batched decoding path (i.e. driven by CollectionColumnReader or BoolColumnReader). * Functions that dispatch to a templated implementation based on one or more runtime values. Other misc changes: * Move large functions out of class bodies. * Use parquet::Encoding instead of bool to indicate encoding. * Add some additional DCHECKs. Testing: * Ran exhaustive tests * Ran fuzz test in a loop Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/tuple.h M be/src/util/bit-stream-utils.h M be/src/util/rle-encoding.h 6 files changed, 465 insertions(+), 387 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/4 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 4 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@299 PS2, Line 299: bool ReadValueBatch(MemPool* RESTRICT pool, int max_values, int tuple_size, > I also found it quite hard to navigate in this file. I think that the real Yeah I agree that would make sense (and probably is the way it should have been organised initially) -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 28 Mar 2018 20:31:20 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 1: (2 comments) http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597 PS1, Line 597: template > Ahh, I missed this. Unfortunately it looks like there's some reason that is Sorry, I was wrong here, and I don't know of any good alternative. If ENCODING was used as an argument's type, then using overloading instead of function template would do the job. http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@299 PS2, Line 299: bool ReadValueBatch(MemPool* RESTRICT pool, int max_values, int tuple_size, > I realised I should probably move some of these large functions out-of-line I also found it quite hard to navigate in this file. I think that the real solution would be to split it into more files, mainly by moving scalar column stuff to its own h/cc, and possible do the same with CollectionColumnReader. This would make cherry picking harder, but moving function bodies within the file can lead to similar issues. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 1 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 28 Mar 2018 14:30:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Hello Lars Volker, Csaba Ringhofer, Impala Public Jenkins, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9799 to look at the new patch set (#3). Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. IMPALA-4123 (prep): Parquet column reader cleanup Some miscellaneous cleanup to make it easier to understand and make future changes to the Parquet scanner. A lot of the refactoring is about more cleanly separating functions so that they have clearer purpose, e.g.: * Functions that strictly do decoding, i.e. materialize values, convert and validate them. These are changed to operate on single values, not tuples. * Functions that are used for the non-batched decoding path (i.e. driven by CollectionColumnReader or BoolColumnReader). * Functions that dispatch to a templated implementation based on one or more runtime values. Other misc changes: * Move large functions out of class bodies. * Use parquet::Encoding instead of bool to indicate encoding. * Add some additional DCHECKs. Testing: * Ran exhaustive tests * Ran fuzz test in a loop Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/tuple.h M be/src/util/bit-stream-utils.h M be/src/util/rle-encoding.h 6 files changed, 449 insertions(+), 367 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/3 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 3 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@299 PS2, Line 299: bool ReadValueBatch(MemPool* RESTRICT pool, int max_values, int tuple_size, I realised I should probably move some of these large functions out-of-line too. I don't know if it's just me, but I get confused about which implementation of ReadValueBatch() I'm looking at since the class body is large enough that I don't have a reference point when looking at each function. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 27 Mar 2018 23:38:47 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/2/be/src/exec/parquet-column-readers.cc@758 PS2, Line 758: inline bool ReadSlot(bool* slot, MemPool* pool) { I should probably rename this DecodeValue() to be consistent and remove the unnecessary 'pool' argument. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 27 Mar 2018 18:18:07 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 2: Code-Review+1 carry +1 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 27 Mar 2018 18:13:00 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 2: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 27 Mar 2018 04:41:31 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 2: Carry +1. Running tests just to get clang-tidy coverage, etc. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 27 Mar 2018 00:56:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 2: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2181/ -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 27 Mar 2018 00:56:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 1: (6 comments) http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@518 PS1, Line 518: if (def_level < def_level_of_immediate_repeated_ancestor()) { : // A containing repeated field is empty or NULL. Skip the value but : // move to the next repetition level if necessary. : if (pos_slot_desc_ != nullptr) rep_levels_.CacheSkipLevels(1); : continue; : } : if (pos_slot_desc_ != nullptr) { : ReadPositionBatched(rep_levels_.CacheGetNext(), : static_cast(tuple->GetSlot(pos_slot_desc_->tuple_offset(; : } > I would move this whole block to ReadPositionBatched, and rename it to some I think the current structure is the lesser of evils for a couple of reads: * It is tied into the control flow of the loop, since we use "continue" to jump to the top of the loop based on the def level. I don't see a clean way to factor out the def_level < .. branch. * I want to use ReadPositionBatched() as a helper function in a follow-on patch. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@531 PS1, Line 531: def_level >= max_def_level() > This was not changed, but I am curious: can def_level be greater than max_d It shouldn't be possible in a valid file. It would break our decoding in a lot of places, since we infer the maximum from the schema, then use that to determine the bit width when decoding levels. This predates me working on the code, but I suspect the looser check was used to avoid crashing without paying any extra runtime overhead to validate each value. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@586 PS1, Line 586: if (!continue_execution) return false; > Replacing this with !parent_->parse_status_.ok() and removing continue_exec Done http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597 PS1, Line 597: template > This could be split to two specialized template functions. Ahh, I missed this. Unfortunately it looks like there's some reason that isn't not possible to specialize a member function template when the class template is unspecialised - I'm getting this from clang: template template <> bool ScalarColumnReader ::DecodeValue( error| cannot specialize (with 'template<>') a member of an unspecialized template If there's an alternative you know of, I'm open to it. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@600 PS1, Line 600: MemPool* RESTRICT pool > "pool" could be removed, as it is not used in the function (ReadSlot() used Oh good point. I removed the pool argument. I'll hold off on making the other change. I can see the argument for it, but one advantage of the current code is that it's more obvious in HdfsParquetScanner::AssembleRows() that ReadValueBatch() allocates memory from aux_mem_pool, so the memory lifetime is a bit clearer. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@1320 PS1, Line 1320: return ReadSlot(static_cast (tuple->GetSlot(tuple_offset_)), pool); > Tuple has a GetCollectionSlot() functions that does the cast. Some new type Ah I forgot about those functions, good catch. Added some new functions. I used GetBigIntVal() since I think the name should reflect the logical type of the slot. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 27 Mar 2018 00:55:42 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Hello Csaba Ringhofer, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/9799 to look at the new patch set (#2). Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. IMPALA-4123 (prep): Parquet column reader cleanup Some miscellaneous cleanup to make it easier to understand and make future changes to the Parquet scanner. A lot of the refactoring is about more cleanly separating functions so that they have clearer purpose, e.g.: * Functions that strictly do decoding, i.e. materialize values, convert and validate them. These are changed to operate on single values, not tuples. * Functions that are used for the non-batched decoding path (i.e. driven by CollectionColumnReader or BoolColumnReader). * Functions that dispatch to a templated implementation based on one or more runtime values. Other misc changes: * Move large functions out of class bodies. * Use parquet::Encoding instead of bool to indicate encoding. * Add some additional DCHECKs. Testing: * Ran exhaustive tests * Ran fuzz test in a loop Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/tuple.h M be/src/util/bit-stream-utils.h M be/src/util/rle-encoding.h 6 files changed, 228 insertions(+), 168 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/2 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 2 Gerrit-Owner: Tim ArmstrongGerrit-Reviewer: Csaba Ringhofer
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/9799 ) Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. Patch Set 1: Code-Review+1 (6 comments) My suggestions are mainly from the "why not refactor a bit, if we are already touching this area" type, so feel free to skip them. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@518 PS1, Line 518: if (def_level < def_level_of_immediate_repeated_ancestor()) { : // A containing repeated field is empty or NULL. Skip the value but : // move to the next repetition level if necessary. : if (pos_slot_desc_ != nullptr) rep_levels_.CacheSkipLevels(1); : continue; : } : if (pos_slot_desc_ != nullptr) { : ReadPositionBatched(rep_levels_.CacheGetNext(), : static_cast(tuple->GetSlot(pos_slot_desc_->tuple_offset(; : } I would move this whole block to ReadPositionBatched, and rename it to something like HandleNextRepLevel(Tuple*, def_level) to make MaterializeValueBatch()'s non collection logic easier read. Line 509 could be moved to, as rep_levels_.CacheHasNext() should be true at the start of every iteration of the loop. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@531 PS1, Line 531: def_level >= max_def_level() This was not changed, but I am curious: can def_level be greater than max_def_level? http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@586 PS1, Line 586: if (!continue_execution) return false; Replacing this with !parent_->parse_status_.ok() and removing continue_execution would be a bit shorter. Marking this branch as UNLIKELY could probably make the other branch a bit faster. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@597 PS1, Line 597: template This could be split to two specialized template functions. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@600 PS1, Line 600: MemPool* RESTRICT pool "pool" could be removed, as it is not used in the function (ReadSlot() used it only as an argument to ConvertSlot()). It could be also turned into a member (it is always the scanners scratch_batch_->aux_mem_pool) instead of passing it as an argument. http://gerrit.cloudera.org:8080/#/c/9799/1/be/src/exec/parquet-column-readers.cc@1320 PS1, Line 1320: return ReadSlot(static_cast (tuple->GetSlot(tuple_offset_)), pool); Tuple has a GetCollectionSlot() functions that does the cast. Some new typed function like GetInt64Slot() could be added too replace the other GetSlot() casts. -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong Gerrit-Reviewer: Csaba Ringhofer Gerrit-Comment-Date: Mon, 26 Mar 2018 19:12:22 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-4123 (prep): Parquet column reader cleanup
Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9799 Change subject: IMPALA-4123 (prep): Parquet column reader cleanup .. IMPALA-4123 (prep): Parquet column reader cleanup Some miscellaneous cleanup to make it easier to understand and make future changes to the Parquet scanner. A lot of the refactoring is about more cleanly separating functions so that they have clearer purpose, e.g.: * Functions that strictly do decoding, i.e. materialize values, convert and validate them. These are changed to operate on single values, not tuples. * Functions that are used for the non-batched decoding path (i.e. driven by CollectionColumnReader or BoolColumnReader). * Functions that dispatch to a templated implementation based on one or more runtime values. Other misc changes: * Move large functions out of class bodies. * Use parquet::Encoding instead of bool to indicate encoding. * Add some additional DCHECKs. Testing: * Ran exhaustive tests * Ran fuzz test in a loop Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/util/bit-stream-utils.h M be/src/util/rle-encoding.h 5 files changed, 226 insertions(+), 166 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/99/9799/1 -- To view, visit http://gerrit.cloudera.org:8080/9799 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ibc00352df3a0b2d605f872ae7e43db2dc90faab1 Gerrit-Change-Number: 9799 Gerrit-PatchSet: 1 Gerrit-Owner: Tim Armstrong