[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-11 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 21:

Awesome work, Quanlong! Thanks for your contribution.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 21
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 12 Apr 2018 00:05:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-11 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 21:

Thank you, Tim, Dan, and Joe! I do appreciate your review. Hopes we can work 
together to move forward the orc-support feature!


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 21
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 11 Apr 2018 22:06:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 20: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 11 Apr 2018 05:13:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Reviewed-on: http://gerrit.cloudera.org:8080/9134
Reviewed-by: Quanlong Huang 
Reviewed-by: Tim Armstrong 
Tested-by: Impala Public Jenkins 
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,745 insertions(+), 307 deletions(-)

Approvals:
  Quanlong Huang: Looks good to me, but someone else must approve
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 21
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: 

[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 20: Code-Review+2

You'll still be the author of the patch.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 11 Apr 2018 04:46:05 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 20: Code-Review+1

LGTM. Thank you Tim! The errors now are more accurate.

One more question: will the author of this patch change to you since you made 
the last commit? Should I add a final commit?


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 11 Apr 2018 02:17:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 20:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2278/


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 20
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 11 Apr 2018 01:24:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 18:

I made a couple of changes to the error handling to make sure that we returned 
Status objects that were more consistent with what would be returned from other 
scanners. I won't +2 yet until you've had a look to make sure that it looks ok 
to you too.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 11 Apr 2018 01:23:48 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has uploaded a new patch set (#19) to the change originally 
created by Quanlong Huang. ( http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,745 insertions(+), 307 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/19
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 19
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-10 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 18:

Hi all, this patch is ready to be reviewed.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 10 Apr 2018 08:56:42 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-09 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 18:

The bug I mentioned before is due to stale data read from DiskIoMgr. It seems 
that DiskIoMgr still keeps the file handler of the removed ORC file in HDFS. I 
can only set expected_local to false in AllocateScanRange to work around this.

I filed a JIRA (IMPALA-6830) about this. Will be very appreciated if anyone can 
help to look into it. It can be reproduced in parquet-scanner.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 18
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 10 Apr 2018 02:17:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-09 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 17:

Sorry that I find a bug in this patch. Please wait until I fix it. Repro:

 hive> create table tmp_orc (a int, b string, c int) stored as orc;
 hive> insert overwrite table tmp_orc select 1, "abc", 2;
 impala> invalidate metadata tmp_orc;
 impala> select * from tmp_orc;
 hive> insert overwrite table tmp_orc select 3, "ccc", 3;
 impala> refresh tmp_orc;
 impala> select * from tmp_orc;

The orc lib reports that ORC postscript length is invalid in the final query.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 09 Apr 2018 13:44:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-09 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 17:

Refactor exceptions to fix clang-tidy warnings


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 09 Apr 2018 13:15:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-09 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#17).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,746 insertions(+), 307 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/17
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 17
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-09 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 16: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2253/


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 09 Apr 2018 06:09:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2253/


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 09 Apr 2018 02:07:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-08 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 16: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 16
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 09 Apr 2018 02:06:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-05 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 14:

Fixed some compiler warnings. I can build with -asan. Running tests locally too.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 06 Apr 2018 01:52:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-05 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#14).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,753 insertions(+), 307 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/14
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 14
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-05 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 13:

Actually I think the ASAN build is just slow - not sure if the LLVM 5 upgrade 
is related.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 06 Apr 2018 00:20:13 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-05 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 13: Code-Review+2

I'm still in the process of running tests. I'm a little concerned that the 
AddressSanitizer test is hung, but I may not have time to investigate before 
the weekend (I'm taking Friday off).

AddressSanitizer is easy enough to run locally - you just pass the -asan option 
to buildall.sh.

S3 is tricky to set up - you need a data snapshot and an S3 bucket. I'm not 
sure that I'm actually going to be able to run that precommit because it's not 
trivial to get it set up correctly with new data.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 06 Apr 2018 00:16:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-05 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 13:

(4 comments)

Sure, Tim. Thank you for doing local tests for me! Just refactor the patch.

BTW, how do you test it in S3 and AddressSanitizer? Is there a doc for this?

http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@24
PS12, Line 24: #include "runtime/runtime-filter.inline.h"
> Some includes not needed:
Done


http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@350
PS12, Line 350: default: break;
> Should include filename() in the message so users can trace back the error.
Done


http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@355
PS12, Line 355: }
> Should include filename() in the message so users can trace back the error.
Done


http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@606
PS12, Line 606: return Co
> I think this should be VLOG_ROW because it logs many times per file.
Done



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 05 Apr 2018 09:31:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-05 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#13).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,753 insertions(+), 307 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/13
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 13
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-05 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 12:

(4 comments)

Had some minor comments, but I'm ready to +2 this otherwise. I want to do some 
additional testing myself to make sure that it passes some common build types - 
S3, AddressSanitizer, etc, but I can do that in parallel with the last 
revisions.

http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@24
PS12, Line 24: #include "runtime/collection-value-builder.h"
Some includes not needed:
collection-value-builder.h
scoped-buffer.h
bitmap.h
dict-encoding.h


http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@350
PS12, Line 350:   return Status(Substitute("ORC column $0 can't be 
truncated to table column $1",
Should include filename() in the message so users can trace back the error.


http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@355
PS12, Line 355:   return Status(Substitute("Type mismatch: table column $0 is 
map to ORC column $1",
Should include filename() in the message so users can trace back the error.


http://gerrit.cloudera.org:8080/#/c/9134/12/be/src/exec/hdfs-orc-scanner.cc@606
PS12, Line 606: VLOG_FILE
I think this should be VLOG_ROW because it logs many times per file.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 12
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 05 Apr 2018 06:38:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-04 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 11:

The reason that test_codegen.py fails is that I haven't run compute stats on 
table functional.alltypes. It's ok actually.

I encountered some env errors when rebase onto the Hadoop3 patch, so tests are 
not run thoroughly in patch set 9 and 10. The patch set 11 has passed the 
pre-review-test: 
https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/1812/console
Please have a look when you have time.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 04 Apr 2018 12:44:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-03 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 11:

Just find that test_codegen.py is broken. Will fix it tomorrow.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 03 Apr 2018 15:08:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-04-03 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#11).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

A startup option --enable_orc_scanner is added for this feature. It's
set to true by default. Setting it to false will fail queries on ORC
tables.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,755 insertions(+), 307 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/11
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 11
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-30 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#9).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3. This
feature is enable by default. A startup option --disable_orc is added
to disable it.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner-ir.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/util/backend-gflag-util.cc
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/BackendGflags.thrift
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/BackendConfig.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
M tests/query_test/test_tpch_queries.py
62 files changed, 1,755 insertions(+), 307 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/9
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 9
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 8:

(4 comments)

I think we're close to the point where we'd be best off merging this and 
getting the code into the hands of developers and the community so we can play 
around and see how it works in practice.

As you mentioned, you've run this in production so I'm sure a lot of issues 
were already flushed out.

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@19
PS5, Line 19:  - Most of the end-to-end tests can run on ORC format.
> Sorry that I should have mentioned this in the commit message. I've enabled
Haha ok, we can't do too much about that :). The ORC project should thank you 
for the free QA. Some of the existing file formats (Seq, RC) also have bugs 
like this with the fuzz test. I was hoping to avoid that from the start here, 
but we can't do too much about that.

Can you create a follow-on JIRA to enable the fuzz test and link the ORC bugs 
you found?

I think we should also add an --enable-orc-scanner feature flag (it can be 
on-by-default) that fails the query if it tries to scan ORC files, just so 
admins have a safety valve to disable it until we've flushed out all the issues.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@72
PS5, Line 72:
> We need this as the above reason.
Ah makes sense :(. Naming member functions the same thing as a standard library 
function is pretty annoying!


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@259
PS5, Line 259:   VLOG_QUERY << "Unknown compression kind of 
orc::CompressionKind: " << kind;
> The ORC library can read ORC files using ZSTD compression. Here we just mis
I had a look and I don't think anything should be broken if you just add it to 
the end of the enum. Enabling ZSTD support to Parquet, other file formats, etc, 
requires making changes elsewhere, so ZSTD wouldn't get accidentally enable by 
adding it to the enum.

There's a similar situation with LZO where it is only supported for text/lzo, 
not Parquet.


http://gerrit.cloudera.org:8080/#/c/9134/7/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/7/be/src/exec/hdfs-orc-scanner.cc@77
PS7, Line 77: MemLimitExceeded
Maybe ThrowMemLimitExceeded()?

Also, there's an extra space here.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 28 Mar 2018 22:02:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(2 comments)

Answered your questions (sorry it took so long) and working through the whole 
change.

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@567
PS5, Line 567: // TODO: combine this with the Parquet implementation
> Not quite familiar with the LLVM Codegen codes. I've simply tried extractin
We need to make sure that any functions that need to be used by codegen are in 
an -ir.cc file and compiled into the correct places. Easiest way to see is 
where a current file is included:

$ git grep hdfs-parquet-scanner-ir.cc
be/src/codegen/impala-ir.cc:#include "exec/hdfs-parquet-scanner-ir.cc"
be/src/exec/CMakeLists.txt:  hdfs-parquet-scanner-ir.cc

If we needed to move this to hdfs-scanner, we could add it to 
hdfs-scanner-ir.cc, which is already cross-compiled to LLVM IR.

Then functions that are loaded by codegen need to have the symbol name added to 
be/src/codegen/gen_ir_descriptions.py so that it can be found. Easiest way to 
figure out the symbol name is dump symbols in a binary and find the appropraite 
mangled name, e.g.:

  $ nm -g be/build/latest/service/impalad  | grep EvalRuntimeFilter
  01d47982 T 
_ZN6impala18HdfsParquetScanner17EvalRuntimeFilterEiPNS_8TupleRowE
  01d2f9de T 
_ZN6impala18HdfsParquetScanner18EvalRuntimeFiltersEPNS_8TupleRowE


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@590
PS5, Line 590: ColumnVectorBatch
> To change this, we have to change the type of scratch_batch_ to unique_ptr<
Can't the caller do the cast though? My thought was that would make the 
function more self-documenting.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 28 Mar 2018 21:14:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-25 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 8:

Just rebase the patch


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 26 Mar 2018 04:55:16 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-25 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#8).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_tpch_queries.py
55 files changed, 1,637 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/8
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 8
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-22 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#7).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/create-load-data.sh
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_tpch_queries.py
55 files changed, 1,637 insertions(+), 162 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/7
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-22 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 6:

(4 comments)

Sorry that some tests were failed. I've fixed them in the next patch.

http://gerrit.cloudera.org:8080/#/c/9134/6/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/6/be/src/exec/hdfs-orc-scanner.cc@339
PS6, Line 339:   if (type.type != TYPE_DECIMAL) break;
We should check scale as well.


http://gerrit.cloudera.org:8080/#/c/9134/6/be/src/exec/hdfs-orc-scanner.cc@347
PS6, Line 347:   } else {
if orc_type.getPrecision <= 9, it's valid to have type.GetByteSize() == 4


http://gerrit.cloudera.org:8080/#/c/9134/6/be/src/exec/hdfs-orc-scanner.cc@767
PS6, Line 767:   DCHECK_NE(slot_desc->type().GetByteSize(), 4);
ORC decimal(1,1) is stored in int64 batch as well. We shouldn't assume this and 
should add a case for slot_desc->type().GetByteSize() == 4.


http://gerrit.cloudera.org:8080/#/c/9134/6/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/9134/6/testdata/bin/load_nested.py@297
PS6, Line 297: def load_orc():
 :   cluster.hdfs.ensure_home_dir()
 :   db = "nested_orc"
 :   tbl = "dummy"
> After the refactor of the nested types test, I think the changes in this fi
Yes! I'll remove these.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 22 Mar 2018 15:25:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-21 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 6:

(1 comment)

Thank you for refactoring the nested types test

http://gerrit.cloudera.org:8080/#/c/9134/6/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/9134/6/testdata/bin/load_nested.py@297
PS6, Line 297: def load_orc():
 :   cluster.hdfs.ensure_home_dir()
 :   db = "nested_orc"
 :   tbl = "dummy"
After the refactor of the nested types test, I think the changes in this file 
can be removed.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 17:03:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-21 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Joe McDonnell, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#6).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

Currently, we only support reading primitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Haven't enabled test_scanner_fuzz for ORC yet, since the ORC library
   is not robust for corrupt files (ORC-315).

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/generate-schema-statements.py
M testdata/bin/load_nested.py
M testdata/bin/run-hive-server.sh
M testdata/cluster/node_templates/common/etc/hadoop/conf/hdfs-site.xml.tmpl
A testdata/data/chars-formats.orc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/impala_test_suite.py
M tests/common/test_dimensions.py
M tests/common/test_vector.py
M tests/comparison/cli_options.py
M tests/query_test/test_chars.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
M tests/query_test/test_tpch_queries.py
55 files changed, 1,668 insertions(+), 164 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/6
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 6
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-21 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(4 comments)

I've refactored the previous patch. Main changes include
* shrink the size of ORC test files
* refactor nested types test
* refactor types check and add tests for type conversions
* refactor exceptions
* add support for Char(N), Varchar(N) and add tests
* correct codec for orc tests (orc/none -> orc/def/block)

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@567
PS5, Line 567: // TODO: combine this with the Parquet implementation
> Can we do this as part of this patch? I'd prefer to avoid adding duplicate
Not quite familiar with the LLVM Codegen codes. I've simply tried extracting 
these two functions and parquet scanner's into hdfs-scanner. However, the 
impalad cannot launch after that... The failure is

 I0321 06:25:05.327670  1320 status.cc:125] Failed to find function
 _ZN6impala18HdfsParquetScanner17EvalRuntimeFilterEiPNS_8TupleRowE
@  0x16b63bd  impala::Status::Status()
@  0x1bd20b1  impala::LlvmCodeGen::InitializeLlvm()
@  0x19c46e6  ImpaladMain()
@  0x1658c10  main
@ 0x7fa14a578f45  __libc_start_main
@  0x1658a81  (unknown)
 F0321 06:25:05.327700  1320 llvm-codegen.cc:448] Check failed: 
execution_engine_ == nullptr Must Close() before destruction

Looks like Codegen functions are not initialized correctly. Maybe codegen 
logics like HdfsParquetScanner::CodegenEvalRuntimeFilters should be extracted 
too. Is it trivial to fix this? If we extract the EvalRuntimeFilter codes into 
hdfs-scanner, does that mean we can support RuntimeFilter in all file formats? 
If so we could create a JIRA for this complicate refactoring.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-scanner.h@326
PS5, Line 326:   /// Size of the file footer.  This is a guess.  If this value 
is too little, we will
> Mention ORC and Parquet? E.g. "Size of the file footer for ORC and Parquet"
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
File testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc:

PS5:
> These files are pretty big - they'll increase the size of the Impala repo q
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/load_nested.py@297
PS5, Line 297: def load_orc():
> That's really helpful! Thanks, Joe!
Done



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 21 Mar 2018 14:12:54 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-20 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

Thanks for addressing all those comments! I'll take a look once you upload the 
next patchset.


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 20 Mar 2018 21:58:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-20 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(36 comments)

Updates: I'm currently working on refactoring the patch. Things that haven't 
finished:
* add test coverage for types mismatch
* refactor nested types test
* shrink the size of orc test files
* refactor RuntimeEval codes

Hopefully, tomorrow I may be able to upload a new patch.

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@22
PS5, Line 22:  - Have passed all the tests.
> We're also missing test coverage to CHAR and VARCHAR. That looks like it wa
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@91
PS5, Line 91:   virtual Status Open(ScannerContext *context) WARN_UNUSED_RESULT;
> Can we add override to the functions where it fits?
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@141
PS5, Line 141: map
> Could this be unordered_map? We generally prefer that that unless there's a
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.h@177
PS5, Line 177: inline
> My intuition is that many of these "inline" specifiers will not be useful,
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517:
> Thanks for filing it. I did spent a little time reading the ORC code and it
Yeah, I just heard from Alibaba that they implemented an async InputStream to 
achieve this in their own branch. It's a pity that the ORC community is with 
slow responses. So they haven't planned to contribute back yet.
I think it's not hard to implement such an InputStream. I can have a try after 
this patch.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@59
PS5, Line 59: Clear
> Let's call this FreeAll() to match the equivalent method on MemPool. MemPoo
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@62
PS5, Line 62:  std::
> std:: prefix not needed for free().
This class has a free function and a malloc function too. To avoid name 
conflict we need the "std::" prefix.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@71
PS5, Line 71:   mem_tracker_->Consume(size);
> We should do a TryConsume() here so that we don't go over the memory limit.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@72
PS5, Line 72: (std::m
> std:: prefix not needed for malloc
We need this as the above reason.


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@77
PS5, Line 77: throw std::runtime_error("memory limit exceeded");
> We could do the same status-wrapping thing here.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@90
PS5, Line 90:   if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> Null check shouldn't be needed, right? We don't have it above.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@182
PS5, Line 182:
> Remove extra blank line. There is a bit too much vertical whitespace (i.e.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@208
PS5, Line 208:  &&
 :   reader_->getCompression() != 
orc::CompressionKind::CompressionKind_NONE
> CompressionKind_NONE is already handled in TranslateCompressionKind, right?
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@229
PS5, Line 229: Encounter
> Encountered
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@230
PS5, Line 230: Encounter
> Encountered.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@233
PS5, Line 233:
> We can remove most of the blank lines in this function.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@234
PS5, Line 234:   if (reader_->getNumberOfRows() == 0) {
> Nit: can fit if() and return on one line.
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@240
PS5, Line 240: return Status(Substitute("Invalid file: $0. No stripes in 
this file but numberOfRows"
> "Invalid ORC file"
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@250
PS5, Line 250: case orc::CompressionKind::CompressionKind_NONE:return 
THdfsCompression::NONE;
> Nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/9134/5/be/src/exec/hdfs-orc-scanner.cc@259

[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-15 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/load_nested.py@297
PS5, Line 297: def load_orc():
> We have several tables in the functional-query dataset that use complex typ
That's really helpful! Thanks, Joe!



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 16 Mar 2018 01:41:34 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-15 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/load_nested.py
File testdata/bin/load_nested.py:

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/load_nested.py@297
PS5, Line 297: def load_orc():
> I don't think this is the best way to load the data - I think we should add
We have several tables in the functional-query dataset that use complex types. 
I think it would make sense to extend those to do ORC.

For example, complextypes_fileformat is used for positive/negative testing of 
nested types for each file format. complextypes_multifileformat does something 
similar at a partition level. I think it would make sense for both of these to 
incorporate ORC. (See 
testdata/dataset/functional/functional_schema_template.sql)

complextypes_fileformat is used in our frontend planner tests that exercise the 
planner without the rest of execution. See 
fe/src/test/java/org/apache/impala/planner/PlannerTest.java. The specific test 
that uses complextypes_fileformat is testComplexTypesFileFormat(), which 
references 
testdata/workloads/functional-planner/queries/PlannerTest/complex-types-file-formats.test.
 This test is doing a similar check to what you are doing in 
nested-types-orc-basic.test, so I think it would make sense to incorporate that 
test into testComplexTypesFileFormat().



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 15 Mar 2018 17:03:48 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-15 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(1 comment)

Thanks for your detailed comments, Tim! Wrapping the Status in the exception is 
quite a good idea! I'm working on bug fixes of the ORC library these days, so 
please expect my slow updates.
I'll refactor codes and add support & tests for VARCHAR and CHAR.

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@19
PS5, Line 19:  - Most of the end-to-end tests can run on ORC format.
> Can you enable orc in test_scanners_fuzz.py and run that in a loop for a wh
Sorry that I should have mentioned this in the commit message. I've enabled the 
test_scanner_fuzz.py in my local branch and found many bugs in the ORC lib when 
reading corrupt files, i.e. ORC-311, ORC-312, ORC-313, ORC-314, ORC-317, 
ORC-319. Some of my PRs are merged and some are under review. The stacktrace 
you pasted is due to ORC-313.

The ORC reader in version 1.4.3-release is not robust enough for 
test_scanner_fuzz.py, so I finally disable this test. I think it's ok for now 
since the random corrupt test is too strict. We can enable it when ORC release 
a new version.

In Hulu, we've deployed Impala-ORC (impala-2.5 on ORC-1.2.3) in production for 
more than half a year. It's luckily that we haven't encountered corrupt files 
that crash Impala. So I think we can compromise on this until a new ORC version 
is released.



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 15 Mar 2018 12:20:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(39 comments)

Finally made time to do another detailed pass. Looks like we're getting closer.

What I'm aiming for is good test coverage, clean code, and stability. I think 
this will initially be an experimental feature, so it's ok if we're missing 
some features or performance.

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@15
PS5, Line 15: premitive
primitive.


http://gerrit.cloudera.org:8080/#/c/9134/5//COMMIT_MSG@19
PS5, Line 19:  - Most of the end-to-end tests can run on ORC format.
Can you enable orc in test_scanners_fuzz.py and run that in a loop for a while? 
That can help flush out bugs in the scanner. I think this is the required diff:

tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ git diff
diff --git a/tests/query_test/test_scanners_fuzz.py 
b/tests/query_test/test_scanners_fuzz.py
index c336a17..4b97a5f 100644
--- a/tests/query_test/test_scanners_fuzz.py
+++ b/tests/query_test/test_scanners_fuzz.py
@@ -62,7 +62,7 @@ class TestScannersFuzzing(ImpalaTestSuite):
   'mem_limit' : cls.MEM_LIMITS}))
 # TODO: enable for more table formats once they consistently pass the fuzz 
test.
 cls.ImpalaTestMatrix.add_constraint(lambda v:
-v.get_value('table_format').file_format in ('avro', 'parquet') or
+v.get_value('table_format').file_format in ('avro', 'parquet', 'orc') 
or
 (v.get_value('table_format').file_format == 'text' and
   v.get_value('table_format').compression_codec in ('none', 'lzo')))

@@ -194,11 +194,12 @@ class TestScannersFuzzing(ImpalaTestSuite):
   continue
 msg = "Should not throw error when abort_on_error=0: '{0}'".format(e)
 LOG.error(msg)
-# Parquet and compressed text can fail the query for some parse errors.
+# Parquet, ORC and compressed text can fail the query for some parse 
errors.
 # E.g. corrupt Parquet footer (IMPALA-3773) or a corrupt LZO index file
 # (IMPALA-4013).
 table_format = vector.get_value('table_format')
 if table_format.file_format != 'parquet' \
+and table_format.file_format != 'orc' \
 and not (table_format.file_format == 'text' and
 table_format.compression_codec != 'none'):
   raise


I ran it for a few iterations and it found at least one crash:

(gdb) bt
#0  0x7f035c3bc428 in __GI_raise (sig=sig@entry=6) at 
../sysdeps/unix/sysv/linux/raise.c:54
#1  0x7f035c3be02a in __GI_abort () at abort.c:89
#2  0x7f035f300069 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#3  0x7f035f4b3997 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#4  0x7f035f4b40b7 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#5  0x7f035f304922 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#6  0x7f035f3094a9 in JVM_handle_linux_signal ()
   from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#7  0x7f035f2fd1a8 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#8  
#9  0x7f035f2fa936 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#10 0x7f035ef10fe1 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#11 0x7f035f4b29b2 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#12 0x7f035f4b34eb in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#13 0x7f035f30954f in JVM_handle_linux_signal ()
   from /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#14 0x7f035f2fd1a8 in ?? () from 
/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/amd64/server/libjvm.so
#15 
#16 0x03d155e1 in orc::convertType(orc::proto::Type const&, 
orc::proto::Footer const&) ()
#17 0x03d15928 in orc::convertType(orc::proto::Type const&, 
orc::proto::Footer const&) ()
#18 0x03cffbb1 in 
orc::ReaderImpl::ReaderImpl(std::shared_ptr, 
orc::ReaderOptions const&, unsigned long, unsigned long) ()
#19 0x03d026dc in orc::createReader(std::unique_ptr, orc::ReaderOptions const&) ()
#20 0x01ceb163 in impala::HdfsOrcScanner::ProcessFileTail 
(this=0xa9722c0)
at be/src/exec/hdfs-orc-scanner.cc:227
#21 0x01cea885 in impala::HdfsOrcScanner::Open (this=0xa9722c0, 
context=0x7f02df7bf200)
at be/src/exec/hdfs-orc-scanner.cc:167
#22 0x01c2e919 in impala::HdfsScanNodeBase::CreateAndOpenScanner 
(this=0xc239500, partition=0xb8ae380,
context=0x7f02df7bf200, scanner=0x7f02df7bf3d0) at 

[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 5:

(3 comments)

Thanks for the new patchset. I need to do a deep dive into the new changes but 
will respond to your comments first.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517:
> Though ORC-262 has no progress, I think we can still prefech data and let t
Thanks for filing it. I did spent a little time reading the ORC code and it 
does seem like we could achieve this with some modifications to the ORC library 
- they have two layers of InputStream abstraction, the top-level which does the 
decompression and a lower level that does I/O.)


http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/9134/5/testdata/bin/run-hive-server.sh@75
PS5, Line 75: tabls
typo: tables


http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: functional, compression_codec: none, 
compression_type: none
> Yeah, the default ORC codec is zlib (deflate in Impala).
We should the name of compression_codec to be deflate for our ORC tables so 
that it's accurate (we did the wrong thing with Parquet here).



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 13 Mar 2018 00:28:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-12 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#5).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

Currently, we only support reading premitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Have passed all the tests.

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/generate-schema-statements.py
M testdata/bin/load_nested.py
M testdata/bin/run-hive-server.sh
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/QueryTest/nested-types-orc-basic.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/comparison/cli_options.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_tpch_queries.py
49 files changed, 1,425 insertions(+), 156 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/5
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 5
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-12 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9134/4/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/4/be/src/exec/hdfs-scanner.h@379
PS4, Line 379: virtual
> why make this virtual? This subroutine interface is pretty ill-defined, so
Oh, I forgot this! At first, I found that the signature of 
HdfsParquetScanner::CommitRows differ from this (just has reverse argument 
list). I think it may be a mistake so I was going to fix that. But finally 
abandon it to keep this patch smaller.
I'll remove the HdfsOrcScanner::CommitRows as well :)



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 12 Mar 2018 22:38:56 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-12 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9134/4/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/4/be/src/exec/hdfs-scanner.h@379
PS4, Line 379: virtual
why make this virtual? This subroutine interface is pretty ill-defined, so I 
think we should avoid making it overridable.

I don't think it's needed to be virtual anyway, since only the subclasses call 
it, so you could always just not use it from the orc scanner. But, it looks 
like the orc scanner version is the same implementation anyway...



--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 12 Mar 2018 19:58:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-10 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for reading ORC data files
..


Patch Set 4:

pre-review test for this patch: 
https://jenkins.impala.io/job/pre-review-test/150/console


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 10 Mar 2018 10:50:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for reading ORC data files

2018-03-09 Thread Quanlong Huang (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9134

to look at the new patch set (#4).

Change subject: IMPALA-5717: Support for reading ORC data files
..

IMPALA-5717: Support for reading ORC data files

This patch integrates the orc library into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies input needed from the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch. The ORC version we used is release-1.4.3.

Currently, we only support reading premitive types. Writing into ORC
table has not been supported neither.

Tests
 - Most of the end-to-end tests can run on ORC format.
 - Add tpcds, tpch tests for ORC.
 - Add some ORC specific tests.
 - Have passed all the tests.

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M CMakeLists.txt
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M bin/bootstrap_toolchain.py
M bin/impala-config.sh
A cmake_modules/FindOrc.cmake
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
M fe/src/main/java/org/apache/impala/service/Frontend.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/LineItemMultiBlock/README.dox
A testdata/LineItemMultiBlock/lineitem_orc_multiblock_one_stripe.orc
A testdata/LineItemMultiBlock/lineitem_sixblocks.orc
A testdata/LineItemMultiBlock/lineitem_threeblocks.orc
M testdata/bin/generate-schema-statements.py
M testdata/bin/load_nested.py
M testdata/bin/run-hive-server.sh
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
A 
testdata/workloads/functional-query/queries/QueryTest/nested-types-orc-basic.test
M testdata/workloads/tpcds/tpcds_core.csv
M testdata/workloads/tpcds/tpcds_dimensions.csv
M testdata/workloads/tpcds/tpcds_exhaustive.csv
M testdata/workloads/tpcds/tpcds_pairwise.csv
M testdata/workloads/tpch/tpch_core.csv
M testdata/workloads/tpch/tpch_dimensions.csv
M testdata/workloads/tpch/tpch_exhaustive.csv
M testdata/workloads/tpch/tpch_pairwise.csv
M tests/common/test_dimensions.py
M tests/comparison/cli_options.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_nested_types.py
M tests/query_test/test_scanners.py
M tests/query_test/test_tpch_queries.py
49 files changed, 1,447 insertions(+), 157 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/4
--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong