[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 8: Thanks for the contribution! -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 10 Oct 2017 22:22:55 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 7: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 7 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Tue, 10 Oct 2017 01:30:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. IMPALA-5448: fix invalid number of splits reported in Parquet scan node Parquet splits with multi columns are marked as completed by using HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types as column codec types. Thus the number of parquet splits are the real count multiplies number of materialized columns. Furthermore, according to the Parquet definition, it allows mixed compression codecs on different columns. This's handled in this patch as well. A parquet file using gzip and snappy compression codec will be reported as: FileFormats: PARQUET/(GZIP,SNAPPY):1 This patch introduces a compression types set for the above cases. Testing: Add end-to-end tests handling parquet files with all columns compressed in snappy, and handling parquet files with multi compression codec. Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Reviewed-on: http://gerrit.cloudera.org:8080/8147 Reviewed-by: Tim ArmstrongTested-by: Impala Public Jenkins --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A testdata/multi_compression_parquet_data/README A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq A testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test M tests/query_test/test_scanners.py 7 files changed, 132 insertions(+), 13 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 8 Gerrit-Owner: Quanlong Huang Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 7: Code-Review+2 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 7 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 09 Oct 2017 21:26:34 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 7: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1318/ -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 7 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 09 Oct 2017 21:26:32 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Hello Tim Armstrong, Mostafa Mokhtar, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8147 to look at the new patch set (#5). Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. IMPALA-5448: fix invalid number of splits reported in Parquet scan node Parquet splits with multi columns are marked as completed by using HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types as column codec types. Thus the number of parquet splits are the real count multiplies number of materialized columns. Furthermore, according to the Parquet definition, it allows mixed compression codecs on different columns. This's handled in this patch as well. A parquet file using gzip and snappy compression codec will be reported as: FileFormats: PARQUET/(GZIP,SNAPPY):1 This patch introduces a compression types set for the above cases. Testing: Add end-to-end tests handling parquet files with all columns compressed in snappy, and handling parquet files with multi compression codec. Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A testdata/multi_compression_parquet_data/README A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq A testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test M tests/query_test/test_scanners.py 7 files changed, 132 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/5 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 5 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 4: I think the end-to-end test is a good idea, -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 05 Oct 2017 17:24:49 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 4: Yeah, the profile is: File Formats: PARQUET/GZIP:1 PARQUET/SNAPPY:1 Should I add an end-to-end test for this? My test processes are: hive> use functional_parquet; hive> create table alltypes_mix like alltypes; hive> set parquet.compression=snappy; hive> insert into table alltypes_mix partition (year=2017,month=1) select id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col from alltypes where year=2010 and month=1; hive> set parquet.compression=gzip; hive> insert into table alltypes_mix partition (year=2017,month=2) select id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col from alltypes where year=2010 and month=1; Then in impala-shell: [localhost:21000] > invalidate metadata functional_parquet.alltypes_mix; [localhost:21000] > select * from functional_parquet.alltypes_mix; [localhost:21000] > profile; -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 05 Oct 2017 03:22:36 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Mostafa Mokhtar has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 4: Did you the test the case if one partition is snappy and another gzip both scanned by the same HDFS SCAN Node? Can you please paste what the output in the query profile looks like? -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 05 Oct 2017 02:51:04 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 4: Code-Review+1 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 05 Oct 2017 00:15:51 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h@537 PS3, Line 537: int Size() { return BitUtil::Popcount(bit_map_); } > One nit: can you add a blank link after this method. Sorry, I know you mean a blank line now... Done -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 04 Oct 2017 23:58:38 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Hello Tim Armstrong, Mostafa Mokhtar, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8147 to look at the new patch set (#4). Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. IMPALA-5448: fix invalid number of splits reported in Parquet scan node Parquet splits with multi columns are marked as completed by using HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types as column codec types. Thus the number of parquet splits are the real count multiplies number of materialized columns. Furthermore, according to the Parquet definition, it allows mixed compression codecs on different columns. This's handled in this patch as well. A parquet file using gzip and snappy compression codec will be reported as: FileFormats: PARQUET/(GZIP,SNAPPY):1 This patch introduces a compression types set for the above cases. Testing: Add end-to-end tests handling parquet files with all columns compressed in snappy, and handling parquet files with multi compression codec. Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A testdata/multi_compression_parquet_data/README A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq A testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test M tests/query_test/test_scanners.py 7 files changed, 112 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/4 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 4 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 3: (2 comments) http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h@537 PS3, Line 537: int Size() { return BitUtil::Popcount(bit_map_); } > One nit: can you add a blank link after this method. Sorry, what is a blank link? Should I write the function in 3 lines? http://gerrit.cloudera.org:8080/#/c/8147/3/testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test File testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test: http://gerrit.cloudera.org:8080/#/c/8147/3/testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test@10 PS3, Line 10: select * from functional_parquet.multi_compression Forgot to change the db name! Will fix it -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 3 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 04 Oct 2017 23:49:26 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 3: Code-Review+1 Looks good. I'll ask Mostafa to confirm that the new output looks good to him but then I think we should be able to go ahead and merge. -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 3 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 04 Oct 2017 23:18:28 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@557 PS2, Line 557: > Should call BitUtil::Popcount(), which will use hardware acceleration if ap Done http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@579 PS2, Line 579: > We put an underscore at the end of private members, i.e. 'bit_map_' Oops, forgot it! Done http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@582 PS2, Line 582: > Not your change, but it should mention the second entry in the tuple - whet Done. Change to: /// Mapping of file formats to the number of splits of that type. The key is a tuple /// containing: /// * file type /// * whether the split was skipped /// * compression types set http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql@1581 PS2, Line 1581: -- IMPALA-1658: Timestamps written by Hive are local-to-UTC adjusted. > We moved to loading "special" files as part of the tests rather than part o Done http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@82 PS2, Line 82: # Test all the scanners with a simple limit clause. The limit clause triggers > This only applies to parquet so should go in TestParquet below (TestScanner Done http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@337 PS2, Line 337: " stored as parquet" % unique_database) > This is an example of the alternative way of loading data files as part of Great! Done -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 3 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Wed, 04 Oct 2017 12:13:40 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8147 to look at the new patch set (#3). Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. IMPALA-5448: fix invalid number of splits reported in Parquet scan node Parquet splits with multi columns are marked as completed by using HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types as column codec types. Thus the number of parquet splits are the real count multiplies number of materialized columns. Furthermore, according to the Parquet definition, it allows mixed compression codecs on different columns. This's handled in this patch as well. A parquet file using gzip and snappy compression codec will be reported as: FileFormats: PARQUET/(GZIP,SNAPPY):1 This patch introduces a compression types set for the above cases. Testing: Add end-to-end tests handling parquet files with all columns compressed in snappy, and handling parquet files with multi compression codec. Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h A testdata/multi_compression_parquet_data/README A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq A testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test M tests/query_test/test_scanners.py 7 files changed, 111 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/3 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 3 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 2: (6 comments) http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@557 PS2, Line 557: __builtin_popcount Should call BitUtil::Popcount(), which will use hardware acceleration if appropriate. http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@579 PS2, Line 579: bit_map We put an underscore at the end of private members, i.e. 'bit_map_' http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@582 PS2, Line 582: /// Mapping of file formats (file type, compression types set) to the number of Not your change, but it should mention the second entry in the tuple - whether the split was skipped. http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql File testdata/datasets/functional/functional_schema_template.sql: http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql@1581 PS2, Line 1581: -- IMPALA-5448: parquet files with multiple compression types We moved to loading "special" files as part of the tests rather than part of the data loading in a lot of cases. I think that is better practically because if you change this template then everyone has to reload data. I commented on an instance of the alternative approach that we should switch to. http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py File tests/query_test/test_scanners.py: http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@82 PS2, Line 82: def test_hdfs_parquet_scan_node_profile(self, vector): This only applies to parquet so should go in TestParquet below (TestScannersAllTableFormats runs the test for all table formats). http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@337 PS2, Line 337: def test_corrupt_rle_counts(self, vector, unique_database): This is an example of the alternative way of loading data files as part of the test. -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 2 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Mon, 02 Oct 2017 21:47:57 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 2: Hi Tim Armstrong and Mostafa Mokhtar, I've uploaded a new patch after review. Could you reexamine it? https://gerrit.cloudera.org/#/c/8147/ Thanks -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 2 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Sat, 30 Sep 2017 03:13:43 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/8147 to look at the new patch set (#2). Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. IMPALA-5448: fix invalid number of splits reported in Parquet scan node Parquet splits with multi columns are marked as completed by using HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types as column codec types. Thus the number of parquet splits are the real count multiplies number of materialized columns. Furthermore, according to the Parquet definition, it allows mixed compression codecs on different columns. This's handled in this patch as well. A parquet file using gzip and snappy compression codec will be reported as: FileFormats: PARQUET/(GZIP,SNAPPY):1 This patch introduces a compression types set for the above cases. Testing: Add end-to-end tests handling parquet files with all columns compressed in snappy, and handling parquet files with multi compression codec. Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/multi_compression_parquet_data/README A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq A testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test M tests/query_test/test_scanners.py 9 files changed, 111 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/2 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 2 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 1: (10 comments) http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@497 PS1, Line 497: /// Mapping of file formats (file type, compression types set) to the number of > Can you move the comment below the class definition and above the map? Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@499 PS1, Line 499: struct HdfsCompressionTypesSet { > Can you make this a class and make the member variables private? I don't th Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@500 PS1, Line 500: uint32_t bit_map; > Can you add an assertion to the constructor to make sure that bit_map is la Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@501 PS1, Line 501: THdfsCompression::type last_type; > Is last_type needed? I think we can remove it. Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@504 PS1, Line 504: hasType > We capitalise the first letter in C++ method names, i.e. HasType(). The goo Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@506 PS1, Line 506: } > Please add blank lines between the method definitions. Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@507 PS1, Line 507: addType > AddType Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@512 PS1, Line 512: bool operator< (const HdfsCompressionTypesSet& o) const { > Can you comment that this is needed so it can be part of the std::map key. Done http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc@897 PS1, Line 897: for (auto i = compressions_map.begin(); i != compressions_map.end(); ++i) { > I think this would be more readable with a ranged for loop. E.g. Cool! Done http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README File testdata/multi_compression_parquet_data/README: http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README@5 PS1, Line 5: These files have two string columns 'a' and 'b'. Each columns using different compression types. > Cool! Done -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Fri, 29 Sep 2017 08:26:36 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 1: (10 comments) The change makes sense to me. Comments are mainly about style. http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@497 PS1, Line 497: /// Mapping of file formats (file type, compression types set) to the number of Can you move the comment below the class definition and above the map? http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@499 PS1, Line 499: struct HdfsCompressionTypesSet { Can you make this a class and make the member variables private? I don't think there's a reason we need to expose them. http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@500 PS1, Line 500: uint32_t bit_map; Can you add an assertion to the constructor to make sure that bit_map is large enough to hold all compression types? Something like: DCHECK_GE(sizeof(bit_map) * CHAR_BIT, _THdfsCompression_VALUES_TO_NAMES.size()) http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@501 PS1, Line 501: THdfsCompression::type last_type; Is last_type needed? I think we can remove it. http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@504 PS1, Line 504: hasType We capitalise the first letter in C++ method names, i.e. HasType(). The google C++ guide is a good reference: https://google.github.io/styleguide/cppguide.html#Function_Names http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@506 PS1, Line 506: } Please add blank lines between the method definitions. http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@507 PS1, Line 507: addType AddType http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@512 PS1, Line 512: bool operator< (const HdfsCompressionTypesSet& o) const { Can you comment that this is needed so it can be part of the std::map key. http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc@897 PS1, Line 897: for (auto i = compressions_map.begin(); i != compressions_map.end(); ++i) { I think this would be more readable with a ranged for loop. E.g. for (auto& elem : _THdfsCompression_VALUES_TO_NAMES) http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README File testdata/multi_compression_parquet_data/README: http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README@5 PS1, Line 5: These files have two string columns 'a' and 'b'. Each columns using different compression types. Cool! -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 28 Sep 2017 15:17:55 + Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/8147 ) Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. Patch Set 1: Passed Jenkins pre-review-test: https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/342/console -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong HuangGerrit-Reviewer: Quanlong Huang Gerrit-Reviewer: Tim Armstrong Gerrit-Comment-Date: Thu, 28 Sep 2017 06:20:12 + Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node
Quanlong Huang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8147 Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet scan node .. IMPALA-5448: fix invalid number of splits reported in Parquet scan node Parquet splits with multi columns are marked as completed by using HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types as column codec types. Thus the number of parquet splits are the real count multiplies number of materialized columns. Furthermore, according to the Parquet definition, it allows mixed compression codecs on different columns. This’s handled in this patch as well. A parquet file using gzip and snappy compression codec will be reported as: FileFormats: PARQUET/(GZIP,SNAPPY):1 This patch introduces a compression types set for the above cases. Testing: Add end-to-end tests handling parquet files with all columns compressed in snappy, and handling parquet files with multi compression codec. Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 --- M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M testdata/datasets/functional/functional_schema_template.sql M testdata/datasets/functional/schema_constraints.csv A testdata/multi_compression_parquet_data/README A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq A testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test M tests/query_test/test_scanners.py 9 files changed, 85 insertions(+), 8 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/1 -- To view, visit http://gerrit.cloudera.org:8080/8147 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1 Gerrit-Change-Number: 8147 Gerrit-PatchSet: 1 Gerrit-Owner: Quanlong Huang