[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 8:

Thanks for the contribution!


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 10 Oct 2017 22:22:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 7: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 10 Oct 2017 01:30:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Reviewed-on: http://gerrit.cloudera.org:8080/8147
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
7 files changed, 132 insertions(+), 13 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-09 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 7: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 09 Oct 2017 21:26:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1318/


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 09 Oct 2017 21:26:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-05 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#5).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
7 files changed, 132 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/5
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-05 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 4:

I think the end-to-end test is a good idea,


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 05 Oct 2017 17:24:49 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 4:

Yeah, the profile is:
File Formats: PARQUET/GZIP:1 PARQUET/SNAPPY:1

Should I add an end-to-end test for this? My test processes are:
hive> use functional_parquet;
hive> create table alltypes_mix like alltypes;
hive> set parquet.compression=snappy;
hive> insert into table alltypes_mix partition (year=2017,month=1) select 
id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col
 from alltypes where year=2010 and month=1;

hive> set parquet.compression=gzip;
hive> insert into table alltypes_mix partition (year=2017,month=2) select 
id,bool_col,tinyint_col,smallint_col,int_col,bigint_col,float_col,double_col,date_string_col,string_col,timestamp_col
 from alltypes where year=2010 and month=1;

Then in impala-shell:
[localhost:21000] > invalidate metadata functional_parquet.alltypes_mix;
[localhost:21000] > select * from functional_parquet.alltypes_mix;
[localhost:21000] > profile;


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 05 Oct 2017 03:22:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 4:

Did you the test the case if one partition is snappy and another gzip both 
scanned by the same HDFS SCAN Node?
Can you please paste what the output in the query profile looks like?


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 05 Oct 2017 02:51:04 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 4: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 05 Oct 2017 00:15:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h@537
PS3, Line 537: int Size() { return BitUtil::Popcount(bit_map_); }
> One nit: can you add a blank link after this method.
Sorry, I know you mean a blank line now... Done



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Oct 2017 23:58:38 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Mostafa Mokhtar,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#4).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
7 files changed, 112 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/4
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/3/be/src/exec/hdfs-scan-node-base.h@537
PS3, Line 537: int Size() { return BitUtil::Popcount(bit_map_); }
> One nit: can you add a blank link after this method.
Sorry, what is a blank link? Should I write the function in 3 lines?


http://gerrit.cloudera.org:8080/#/c/8147/3/testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
File 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test:

http://gerrit.cloudera.org:8080/#/c/8147/3/testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test@10
PS3, Line 10: select * from functional_parquet.multi_compression
Forgot to change the db name! Will fix it



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Oct 2017 23:49:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 3: Code-Review+1

Looks good. I'll ask Mostafa to confirm that the new output looks good to him 
but then I think we should be able to go ahead and merge.


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Oct 2017 23:18:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@557
PS2, Line 557:
> Should call BitUtil::Popcount(), which will use hardware acceleration if ap
Done


http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@579
PS2, Line 579:
> We put an underscore at the end of private members, i.e. 'bit_map_'
Oops, forgot it! Done


http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@582
PS2, Line 582:
> Not your change, but it should mention the second entry in the tuple - whet
Done. Change to:
  /// Mapping of file formats to the number of splits of that type. The key is 
a tuple
  /// containing:
  /// * file type
  /// * whether the split was skipped
  /// * compression types set


http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql@1581
PS2, Line 1581: -- IMPALA-1658: Timestamps written by Hive are local-to-UTC 
adjusted.
> We moved to loading "special" files as part of the tests rather than part o
Done


http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@82
PS2, Line 82: # Test all the scanners with a simple limit clause. The limit 
clause triggers
> This only applies to parquet so should go in TestParquet below (TestScanner
Done


http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@337
PS2, Line 337: " stored as parquet" % unique_database)
> This is an example of the alternative way of loading data files as part of
Great! Done



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Oct 2017 12:13:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-04 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#3).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
7 files changed, 111 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/3
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-10-02 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@557
PS2, Line 557: __builtin_popcount
Should call BitUtil::Popcount(), which will use hardware acceleration if 
appropriate.


http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@579
PS2, Line 579: bit_map
We put an underscore at the end of private members, i.e. 'bit_map_'


http://gerrit.cloudera.org:8080/#/c/8147/2/be/src/exec/hdfs-scan-node-base.h@582
PS2, Line 582:   /// Mapping of file formats (file type, compression types set) 
to the number of
Not your change, but it should mention the second entry in the tuple - whether 
the split was skipped.


http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/8147/2/testdata/datasets/functional/functional_schema_template.sql@1581
PS2, Line 1581: -- IMPALA-5448: parquet files with multiple compression types
We moved to loading "special" files as part of the tests rather than part of 
the data loading in a lot of cases. I think that is better practically because 
if you change this template then everyone has to reload data.

I commented on an instance of the alternative approach that we should switch to.


http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@82
PS2, Line 82:   def test_hdfs_parquet_scan_node_profile(self, vector):
This only applies to parquet so should go in TestParquet below 
(TestScannersAllTableFormats runs the test for all table formats).


http://gerrit.cloudera.org:8080/#/c/8147/2/tests/query_test/test_scanners.py@337
PS2, Line 337:   def test_corrupt_rle_counts(self, vector, unique_database):
This is an example of the alternative way of loading data files as part of the 
test.



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 02 Oct 2017 21:47:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-29 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 2:

Hi Tim Armstrong and Mostafa Mokhtar,

I've uploaded a new patch after review. Could you reexamine it?
 https://gerrit.cloudera.org/#/c/8147/

Thanks


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 30 Sep 2017 03:13:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-29 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/8147

to look at the new patch set (#2).

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This's handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
9 files changed, 111 insertions(+), 13 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/2
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-29 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 1:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@497
PS1, Line 497:   /// Mapping of file formats (file type, compression types set) 
to the number of
> Can you move the comment below the class definition and above the map?
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@499
PS1, Line 499:   struct HdfsCompressionTypesSet {
> Can you make this a class and make the member variables private? I don't th
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@500
PS1, Line 500: uint32_t bit_map;
> Can you add an assertion to the constructor to make sure that bit_map is la
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@501
PS1, Line 501: THdfsCompression::type last_type;
> Is last_type needed? I think we can remove it.
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@504
PS1, Line 504: hasType
> We capitalise the first letter in C++ method names, i.e. HasType(). The goo
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@506
PS1, Line 506: }
> Please add blank lines between the method definitions.
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@507
PS1, Line 507: addType
> AddType
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@512
PS1, Line 512: bool operator< (const HdfsCompressionTypesSet& o) const {
> Can you comment that this is needed so it can be part of the std::map key.
Done


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc@897
PS1, Line 897:   for (auto i = compressions_map.begin(); i != 
compressions_map.end(); ++i) {
> I think this would be more readable with a ranged for loop. E.g.
Cool! Done


http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README
File testdata/multi_compression_parquet_data/README:

http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README@5
PS1, Line 5: These files have two string columns 'a' and 'b'. Each columns 
using different compression types.
> Cool!
Done



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 29 Sep 2017 08:26:36 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 1:

(10 comments)

The change makes sense to me. Comments are mainly about style.

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@497
PS1, Line 497:   /// Mapping of file formats (file type, compression types set) 
to the number of
Can you move the comment below the class definition and above the map?


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@499
PS1, Line 499:   struct HdfsCompressionTypesSet {
Can you make this a class and make the member variables private? I don't think 
there's a reason we need to expose them.


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@500
PS1, Line 500: uint32_t bit_map;
Can you add an assertion to the constructor to make sure that bit_map is large 
enough to hold all compression types?

Something like:

  DCHECK_GE(sizeof(bit_map) * CHAR_BIT, 
_THdfsCompression_VALUES_TO_NAMES.size())


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@501
PS1, Line 501: THdfsCompression::type last_type;
Is last_type needed? I think we can remove it.


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@504
PS1, Line 504: hasType
We capitalise the first letter in C++ method names, i.e. HasType(). The google 
C++ guide is a good reference: 
https://google.github.io/styleguide/cppguide.html#Function_Names


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@506
PS1, Line 506: }
Please add blank lines between the method definitions.


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@507
PS1, Line 507: addType
AddType


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.h@512
PS1, Line 512: bool operator< (const HdfsCompressionTypesSet& o) const {
Can you comment that this is needed so it can be part of the std::map key.


http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/8147/1/be/src/exec/hdfs-scan-node-base.cc@897
PS1, Line 897:   for (auto i = compressions_map.begin(); i != 
compressions_map.end(); ++i) {
I think this would be more readable with a ranged for loop. E.g.

for (auto& elem : _THdfsCompression_VALUES_TO_NAMES)


http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README
File testdata/multi_compression_parquet_data/README:

http://gerrit.cloudera.org:8080/#/c/8147/1/testdata/multi_compression_parquet_data/README@5
PS1, Line 5: These files have two string columns 'a' and 'b'. Each columns 
using different compression types.
Cool!



--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 28 Sep 2017 15:17:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-28 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8147 )

Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..


Patch Set 1:

Passed Jenkins pre-review-test: 
https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/342/console


--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 28 Sep 2017 06:20:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5448: fix invalid number of splits reported in Parquet scan node

2017-09-26 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/8147


Change subject: IMPALA-5448: fix invalid number of splits reported in Parquet 
scan node
..

IMPALA-5448: fix invalid number of splits reported in Parquet scan node

Parquet splits with multi columns are marked as completed by using
HdfsScanNodeBase::RangeComplete(). It duplicately counts the file types
as column codec types. Thus the number of parquet splits are the real count
multiplies number of materialized columns.

Furthermore, according to the Parquet definition, it allows mixed compression
codecs on different columns. This’s handled in this patch as well. A parquet 
file
using gzip and snappy compression codec will be reported as:
FileFormats: PARQUET/(GZIP,SNAPPY):1

This patch introduces a compression types set for the above cases.

Testing:
Add end-to-end tests handling parquet files with all columns compressed in
snappy, and handling parquet files with multi compression codec.

Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/multi_compression_parquet_data/README
A testdata/multi_compression_parquet_data/tinytable_0_gzip_snappy.parq
A testdata/multi_compression_parquet_data/tinytable_1_snappy_gzip.parq
A 
testdata/workloads/functional-query/queries/QueryTest/hdfs_parquet_scan_node_profile.test
M tests/query_test/test_scanners.py
9 files changed, 85 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/47/8147/1
--
To view, visit http://gerrit.cloudera.org:8080/8147
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Iaacc2d775032f5707061e704f12e0a63cde695d1
Gerrit-Change-Number: 8147
Gerrit-PatchSet: 1
Gerrit-Owner: Quanlong Huang