[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-05 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#5).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
7 files changed, 132 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/5
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 4:

Yeah, the profile is:
File Formats: PARQUET/GZIP:1 PARQUET/SNAPPY:1

Should I add an end-to-end test for this? My test processes are:
hive> use functional_parquet;
hive> create table alltypes_mix like alltypes;
hive> set parquet.compression=snappy;
hive> insert into table alltypes_mix partition (year=2017,month=1) select 
id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col
 from alltypes where year=2010 and month=1;

hive> set parquet.compression=gzip;
hive> insert into table alltypes_mix partition (year=2017,month=2) select 
id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col
 from alltypes where year=2010 and month=1;

Then in impala-shell:
[localhost:21000] > invalidate metadata functional_parquet.alltypes_mix;
[localhost:21000] > select * from functional_parquet.alltypes_mix;
[localhost:21000] > profile;


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 05 Oct 2017 03:22:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h@537
PS3, Line 537: int Size() { return BitUtil::Popcount(bit_map_); }
> One nit: can you add a blank link after this method.
Sorry, I know you mean a blank line now... Done



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Oct 2017 23:58:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#4).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
7 files changed, 112 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/4
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h@537
PS3, Line 537: int Size() { return BitUtil::Popcount(bit_map_); }
> One nit: can you add a blank link after this method.
Sorry, what is a blank link? Should I write the function in 3 lines?


http://gerrit.cloudera.org:8080/#/c/8147/3/testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
File 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test:

http://gerrit.cloudera.org:8080/#/c/8147/3/testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test@10
PS3, Line 10: select * from functional_parquet.multi_compression
Forgot to change the db name! Will fix it



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Oct 2017 23:49:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@557
PS2, Line 557:
> Should call BitUtil::Popcount(), which will use hardware acceleration if ap
Done


http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@579
PS2, Line 579:
> We put an underscore at the end of private members, i.e. 'bit_map_'
Oops, forgot it! Done


http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@582
PS2, Line 582:
> Not your change, but it should mention the second entry in the tuple - whet
Done. Change to:
  /// Mapping of file formats to the number of splits of that type. The key is 
a tuple
  /// containing:
  /// * file type
  /// * whether the split was skipped
  /// * compression types set


http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql@1581
PS2, Line 1581: -- IMPALA-1658: Timestamps written by Hive are local-to-UTC 
adjusted.
> We moved to loading "special" files as part of the tests rather than part o
Done


http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@82
PS2, Line 82: # Test all the scanners with a simple limit clause. The limit 
clause triggers
> This only applies to parquet so should go in TestParquet below (TestScanner
Done


http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@337
PS2, Line 337: " stored as parquet" % unique_database)
> This is an example of the alternative way of loading data files as part of
Great! Done



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Oct 2017 12:13:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#3).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
7 files changed, 111 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/3
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-29 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 2:

Hi Tim Armstrong and Mostafa Mokhtar,

I've uploaded a new patch after review. Could you reexamine it?
 https://gerrit.cloudera.org/#/c/8147/

Thanks


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 30 Sep 2017 03:13:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-29 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#2).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
9 files changed, 111 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/2
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-29 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 1:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@497
PS1, Line 497:   /// Mapping of file formats (file type, compression types set) 
to the number of
> Can you move the comment below the class definition and above the map?
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@499
PS1, Line 499:   struct HdfsCompressionTypesSet {
> Can you make this a class and make the member variables private? I don't th
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@500
PS1, Line 500: uint32_t bit_map;
> Can you add an assertion to the constructor to make sure that bit_map is la
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@501
PS1, Line 501: THdfsCompression::type last_type;
> Is last_type needed? I think we can remove it.
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@504
PS1, Line 504: hasType
> We capitalise the first letter in C++ method names, i.e. HasType(). The goo
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@506
PS1, Line 506: }
> Please add blank lines between the method definitions.
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@507
PS1, Line 507: addType
> AddType
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@512
PS1, Line 512: bool operator< (const HdfsCompressionTypesSet& o) const {
> Can you comment that this is needed so it can be part of the std::map key.
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc@897
PS1, Line 897:   for (auto i = compressions_map.begin(); i != 
compressions_map.end(); ++i) {
> I think this would be more readable with a ranged for loop. E.g.
Cool! Done


http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README
File testdata/multi_compression_parquet_data/README:

http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README@5
PS1, Line 5: These files have two string columns 'a' and 'b'. Each columns 
using different compression types.
> Cool!
Done



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 29 Sep 2017 08:26:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-28 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 1:

Passed Jenkins pre-review-test: 
https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/342/console


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 28 Sep 2017 06:20:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-26 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/8147


Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This’s handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
9 files changed, 85 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/1
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang