[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-03-09 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 3:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@130
PS3, Line 130: ScanRange* HdfsOrcScanner::FindFooterSplit(HdfsFileDesc* file) {
> We could move this to HdfsScanner and share the code - the logic is generic
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@142
PS3, Line 142:   mem_tracker_.reset(new MemTracker(-1, "OrcReader", 
mem_tracker));
> I think we should track it against the HdfsScanNode's MemTracker instead of
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151:   mem_tracker_->CloseAndUnregisterFromParent();
> Just add the comment since I made it crashed when use Close at first...
Removed these since we use HdfsScanNode's MemTracker


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162
PS3, Line 162:   if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> So the non-NULL checks in mem-pool.cc are redundant too? I learn this from
To be consistent with the logic in MemPool, let's update the metric.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@333
PS3, Line 333: if (col_type.type == TYPE_ARRAY) {
> Complex types will be skipped here. Their column ids won't be set into the
Added tests for this. Replaced these with a DCHECK.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@484
PS3, Line 484:   if (start_with_first_stripe && misaligned_stripe_skipped) {
> I think we had a specific Parquet test for this code path, so we're probabl
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@498
PS3, Line 498: // TODO ValidateColumnOffsets
> Is this still needed?
Removed this. The ORC library handles it, so we only need to catch 
exceptions thrown by the ORC lib.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@500
PS3, Line 500: google::uint64 stripe_offset = stripe.offset();
> Why not uint64_t? Are they not the same underlying type?
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@513
PS3, Line 513: // TODO: check if this stripe can be skipped by stats.
> File a follow-on JIRA for this enhancement?
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@517
PS3, Line 517: unique_ptr input_stream(new 
ScanRangeInputStream(this));
> I haven't fully thought through this idea but I wonder if we should have an
Though ORC-262 has made no progress, I think we can still prefetch data and let 
the ORC lib read from an in-memory InputStream. Created a JIRA for this: 
IMPALA-6636.
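
A minimal sketch of the in-memory stream idea, assuming the stripe bytes have already been prefetched into a buffer. `PositionalStream` and `InMemoryStream` are hypothetical names used for illustration; they are not the ORC library's own classes, and the real interface has more methods.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical minimal positional-read interface (an assumption for this
// sketch, not the ORC library's class).
class PositionalStream {
 public:
  virtual ~PositionalStream() = default;
  virtual uint64_t getLength() const = 0;
  virtual void read(void* buf, uint64_t length, uint64_t offset) = 0;
};

// Serves reads from a buffer that was filled ahead of time, so the reader
// never blocks on I/O while decoding.
class InMemoryStream : public PositionalStream {
 public:
  explicit InMemoryStream(std::vector<uint8_t> data) : data_(std::move(data)) {}

  uint64_t getLength() const override { return data_.size(); }

  void read(void* buf, uint64_t length, uint64_t offset) override {
    // Reject reads that would run past the prefetched bytes.
    if (offset > data_.size() || length > data_.size() - offset) {
      throw std::out_of_range("read past end of prefetched buffer");
    }
    if (length > 0) std::memcpy(buf, data_.data() + offset, length);
  }

 private:
  std::vector<uint8_t> data_;
};
```

The bounds check is written as `length > size - offset` rather than `offset + length > size` so that it cannot overflow for large unsigned values.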


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@677
PS3, Line 677:   vector decompressed_footer_buffer;
> You're right! Will review these logics deeper.
The file tail parsing logic has now been replaced by the ORC lib.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@751
PS3, Line 751:   // TODO
> We should do this in the initial patch if it's possible to crash Impala wit
Handled by the ORC lib


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@820
PS3, Line 820:   while (row_id < capacity && ScratchBatchNotEmpty()) {
> We should file a follow-on JIRA to codegen the runtime filter + conjunct ev
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@859
PS3, Line 859: bool HdfsOrcScanner::EvalRuntimeFilters(TupleRow* row) {
> Can you add a TODO to combine this with the Parquet implementation?
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@881
PS3, Line 881: inline void HdfsOrcScanner::ReadRow(const orc::ColumnVectorBatch 
, int row_idx,
> We don't need to do this in this patch, but it would probably be faster to
Added a TODO for this.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@884
PS3, Line 884: dynamic_cast
> Sure! Excited to know that you start to test the scanner's performance!
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@975
PS3, Line 975:   // TODO warn if slot_desc->type().GetByteSize() != 16
> You could DCHECK in that case, the only valid values are 4, 8 and 16
Added DCHECKs in the default case.
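
A rough sketch of that pattern, with the standard `assert` standing in for Impala's DCHECK. `DecimalStorage` and `StorageForByteSize` are hypothetical names for illustration, not code from the patch.

```cpp
#include <cassert>

// Hypothetical tag for the three storage widths a decimal slot can have.
enum class DecimalStorage { kInt32, kInt64, kInt128 };

// Maps a decimal slot's byte size to its storage width. The only valid
// sizes are 4, 8 and 16, so anything else trips the assertion in debug
// builds instead of being silently treated as 16 bytes.
DecimalStorage StorageForByteSize(int byte_size) {
  switch (byte_size) {
    case 4:
      return DecimalStorage::kInt32;
    case 8:
      return DecimalStorage::kInt64;
    case 16:
      return DecimalStorage::kInt128;
    default:
      assert(false && "unexpected decimal byte size");
      return DecimalStorage::kInt128;  // Unreachable in debug builds.
  }
}
```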


http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv
File testdata/workloads/functional-query/functional-query_exhaustive.csv:

http://gerrit.cloudera.org:8080/#/c/9134/3/testdata/workloads/functional-query/functional-query_exhaustive.csv@25
PS3, Line 25: file_format: orc, dataset: 

[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-02-09 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 3:

Thank you! I'll test your native-toolchain patch as well.

I welcome the email discussion!


--
To view, visit http://gerrit.cloudera.org:8080/9134
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 10 Feb 2018 01:00:08 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-02-09 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 3:

I've been meaning to get back to you. I tried adding it to native-toolchain and 
got it working: https://gerrit.cloudera.org/#/c/9274/

It required a bit of hacking of their build scripts but it seems to work.

I wanted to have a discussion on the mailing list about some of the big-picture 
aspects to make sure that we have a consensus about the design. I'm going to 
put together an email summarising the points for discussion now.



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 10 Feb 2018 00:49:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-02-09 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@15
PS3, Line 15: Instead of linking the orc-reader as a third party library, it's
> Which version or commit hash of ORC did you import?
The version is 1.2.3. It's a little old since we started the Impala-ORC project 
one year ago. I can update it to the latest version if you think it's essential.

As for using it as a third-party library, I'm also concerned about the memory 
paradigm gap between the ORC lib and Impala. On the other hand, I can try to 
reduce the size of this patch, e.g. remove unused code like ColumnPrinter and 
reuse as much Impala code as possible, e.g. the RLE decoders and int128. My goal 
is to make this reader no more complex than the Parquet column readers.

If you finally decide to add it to the native-toolchain project, could 
you point me to some docs on how to submit a code review for that project? Is 
it the same as for Impala?


http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@25
PS3, Line 25: tests.
> We should also add ORC for TPC-H and TPC-DS so that we have some larger dat
Sure, actually we have tested them but I forgot to add them to the patch.




Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 10 Feb 2018 00:34:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-02-02 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 3:

(24 comments)

Thanks, Tim! Your comments are really useful!
If we finally decide to move the ORC library into the native-toolchain project, 
is there any documentation on how to contribute to it? I think the ORC library 
needs to be merged first before I can use it like the other tools.

There are still comments I haven't dealt with. Please forgive me for replying 
to them later.

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@28
PS3, Line 28: class CollectionValueBuilder;
> Not needed?
Yeah, I just added it when I tried to support complex types. Will remove it.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@259
PS3, Line 259: ProcessFileTail
> It might be helpful to define what the "footer", "file tail" and "postscrip
I was confused as well at first :)
They're concepts in ORC. Here are their definitions: 
https://orc.apache.org/docs/file-tail.html
You can also find them in be/src/orc/orc_proto.proto


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1
PS3, Line 1: // Copyright 2012 Cloudera Inc.
> Don't need cloudera copyrights!
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@24
PS3, Line 24: #include "common/object-pool.h"
> Many of these headers look unused.
Yes, will remove them


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@53
PS3, Line 53: using boost::algorithm::split;
> Some of these boost "using" declarations don't seem to be needed.
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@59
PS3, Line 59: DEFINE_double(orc_min_filter_reject_ratio, 0.1, "(Advanced) If 
the percentage of "
> I don't know why we made this flag option parquet-specific. Having an optio
I agree with you. There is a lot of logic in the Parquet scanner that can be 
shared with the ORC scanner: not only this variable, but also functions like 
IssueInitialRanges and FindFooterSplit.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@73
PS3, Line 73:   for (int i = 0; i < files.size(); ++i) {
> Can we convert this to a range for? We generally prefer that in new code.
Done


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151:   mem_tracker_->CloseAndUnregisterFromParent();
> We only want to use CloseAndUnregisterFromParent() for the query-level MemT
I just added the comment because I made it crash when I used Close at first...
I can remove the comment if it's all clear to you :)


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@157
PS3, Line 157: std::
> Don't need std:: prefix. We generally prefer avoiding it when it isn't need
We need this prefix because this class has a free function (see below) as well.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@162
PS3, Line 162:   if (ImpaladMetrics::MEM_POOL_TOTAL_BYTES != nullptr) {
> You should be able to assume it's non-NULL. A lot of older code checks if m
So the non-NULL checks in mem-pool.cc are redundant too? I learned this from 
the impala::MemPool implementation.

I found this metric useful when I ran test_failpoints.py individually. It 
didn't come back to zero within 2 minutes. That is how I found the bug 
mentioned in IMPALA-6423.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@196
PS3, Line 196: void HdfsOrcScanner::ScanRangeInputStream::read(void* buf, 
uint64_t length,
> It's unfortunate that the ORC code was designed to issue only synchronous r
Yes, quite a pity. So I hope we can include the ORC code and modify the logic 
for getting input.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@223
PS3, Line 223: memset(buf, 0, length);
> Is the memset needed? If so, should document why in a comment.
It just lets the orc-reader throw a parse-error exception if it reads this 
buffer later. It's not actually needed since we throw the following exception 
immediately.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@224
PS3, Line 224: throw std::runtime_error("Cannot read from file: " + 
status.GetDetail());
> The fact we need to use exceptions is unfortunate. I need to think a bit mo
I don't like exceptions either. One solution is to pass the stream context 
into the orc-reader and check for cancellation in the loops inside it.

This requires modifying code in the orc-reader, so I haven't started yet. If 
you guys decide to include the ORC code, I can implement this.
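
For context, the exception-based approach under discussion can be sketched roughly as below: a failed read inside the library is reported by throwing, and the scanner translates the exception back into a status at the call boundary. `SimpleStatus`, `ReadOrThrow` and `CallLibrary` are hypothetical names for illustration; impala::Status and the scanner code have a different shape.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical, minimal status type (impala::Status has a different API).
struct SimpleStatus {
  bool ok;
  std::string detail;
};

// Runs inside the library's call stack: the only way to report an I/O
// failure (or cancellation) to the caller is to throw.
void ReadOrThrow(const SimpleStatus& io_status) {
  if (!io_status.ok) {
    throw std::runtime_error("Cannot read from file: " + io_status.detail);
  }
}

// Boundary between the scanner and the library: any exception raised by the
// library call is converted back into a status value.
SimpleStatus CallLibrary(const std::function<void()>& library_call) {
  try {
    library_call();
  } catch (const std::exception& e) {
    return {false, e.what()};
  }
  return {true, ""};
}
```

Keeping the try/catch in one wrapper means exceptions never escape into code that expects status-based error handling.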



[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-02-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 3:

(44 comments)

This is really impressive - I was able to build the patch, load TPC-H ORC data 
and run a bunch of queries.

I think there are still meta-questions about the approach of importing the ORC 
library that I want input from other community members on, but I spent some 
time understanding your code and commenting on specific things.

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@15
PS3, Line 15: Instead of linking the orc-reader as a third party library, it's
Which version or commit hash of ORC did you import?

I think we need to think about this carefully. It might be best to use it as a 
third-party library at least to start off with so that we can pick up any 
improvements made to the Apache ORC implementation by upgrading our library 
version. Otherwise it's a lot more code for the Impala project to maintain.

We can also contribute improvements like predicate pushdown support to the ORC 
library for now.

At some point, if we have multiple people committed to maintaining it and we 
want to make larger changes to the ORC code, we could revisit the decision.


http://gerrit.cloudera.org:8080/#/c/9134/3//COMMIT_MSG@25
PS3, Line 25: tests.
We should also add ORC for TPC-H and TPC-DS so that we have some larger data 
sets to test on.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc
File be/src/exec/hdfs-orc-scanner-test.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc@32
PS3, Line 32: uint8_t empty_data_one_col_orc[] = {
Can you comment how this data was generated, so that it could be reproduced if 
needed?


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner-test.cc@292
PS3, Line 292:   // TODO Codec::CreateDecompressor cannot create a LZO 
decompressor, how to support LZO?
We weren't able to include LZO integration in Apache Impala since the original 
implementation was GPL-licensed. There's a Cloudera-developed plugin with a 
different license to read LZO text files.

It looks like ORC actually has an Apache-licensed LZO implementation so that 
might be an alternative.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h
File be/src/exec/hdfs-orc-scanner.h:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@28
PS3, Line 28: class CollectionValueBuilder;
Not needed?


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.h@259
PS3, Line 259: ProcessFileTail
It might be helpful to define what the "footer", "file tail" and "postscript" 
are, since the names are similar.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@1
PS3, Line 1: // Copyright 2012 Cloudera Inc.
Don't need cloudera copyrights!


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@24
PS3, Line 24: #include "common/object-pool.h"
Many of these headers look unused.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@53
PS3, Line 53: using boost::algorithm::split;
Some of these boost "using" declarations don't seem to be needed.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@59
PS3, Line 59: DEFINE_double(orc_min_filter_reject_ratio, 0.1, "(Advanced) If 
the percentage of "
I don't know why we made this flag option parquet-specific. Having an option 
per file format will get out of control.

Can we just define a generic flag like min_filter_reject_ratio, e.g. in 
global-flags.cc and use it in ORC? We could keep the old parquet flag around 
for compatibility.


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@73
PS3, Line 73:   for (int i = 0; i < files.size(); ++i) {
Can we convert this to a range for? We generally prefer that in new code.
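
A small sketch of the conversion being asked for. `FileDesc` and the two functions are hypothetical stand-ins for illustration, not the patch's HdfsFileDesc or loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a file descriptor; not the real HdfsFileDesc.
struct FileDesc {
  int64_t file_length;
};

// Index-based loop, in the style of the reviewed line.
int64_t TotalLengthIndexed(const std::vector<FileDesc*>& files) {
  int64_t total = 0;
  for (size_t i = 0; i < files.size(); ++i) total += files[i]->file_length;
  return total;
}

// Range-based loop: no index variable and no signed/unsigned comparison.
int64_t TotalLengthRanged(const std::vector<FileDesc*>& files) {
  int64_t total = 0;
  for (const FileDesc* file : files) total += file->file_length;
  return total;
}
```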


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@130
PS3, Line 130: ScanRange* HdfsOrcScanner::FindFooterSplit(HdfsFileDesc* file) {
We could move this to HdfsScanner and share the code - the logic is generic.
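
To illustrate why the logic is format-independent, here is a minimal sketch. It assumes the footer split is simply the split that ends at the last byte of the file; that condition, and the `Split` and `FileInfo` types, are assumptions for illustration, not the patch's code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified types; the real ScanRange and HdfsFileDesc differ.
struct Split {
  int64_t offset;
  int64_t len;
};

struct FileInfo {
  int64_t file_length;
  std::vector<Split> splits;
};

// Returns the split that ends exactly at the end of the file, i.e. the one
// that would contain the footer, or nullptr if no split reaches the end.
// Nothing here depends on whether the file is Parquet or ORC.
const Split* FindFooterSplit(const FileInfo& file) {
  for (const Split& split : file.splits) {
    if (split.offset + split.len == file.file_length) return &split;
  }
  return nullptr;
}
```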


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@142
PS3, Line 142:   mem_tracker_.reset(new MemTracker(-1, "OrcReader", 
mem_tracker));
I think we should track it against the HdfsScanNode's MemTracker instead of 
creating a MemTracker per scanner thread.
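
A rough sketch of that arrangement: each scanner's allocator charges one shared node-level counter instead of owning a tracker of its own. `NodeTracker` and `TrackedPool` are hypothetical names for illustration; the real impala::MemTracker and the ORC memory pool interface are richer than this.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical node-level tracker shared by all scanner threads.
class NodeTracker {
 public:
  void Consume(int64_t bytes) { consumed_ += bytes; }
  void Release(int64_t bytes) { consumed_ -= bytes; }
  int64_t consumption() const { return consumed_.load(); }

 private:
  std::atomic<int64_t> consumed_{0};
};

// Hypothetical per-scanner pool. It holds no tracker of its own, so nothing
// has to be registered or unregistered when a scanner thread exits.
class TrackedPool {
 public:
  explicit TrackedPool(NodeTracker* tracker) : tracker_(tracker) {}

  char* Malloc(uint64_t size) {
    char* p = static_cast<char*>(std::malloc(size));
    if (p != nullptr) tracker_->Consume(static_cast<int64_t>(size));
    return p;
  }

  // The caller passes the size back so the same amount is released.
  void Free(char* p, uint64_t size) {
    if (p == nullptr) return;
    std::free(p);
    tracker_->Release(static_cast<int64_t>(size));
  }

 private:
  NodeTracker* tracker_;
};
```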


http://gerrit.cloudera.org:8080/#/c/9134/3/be/src/exec/hdfs-orc-scanner.cc@151
PS3, Line 151:   mem_tracker_->CloseAndUnregisterFromParent();
We only want to use CloseAndUnregisterFromParent() for the query-level 
MemTracker for now (see the comment). I refactored some of that 

[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-02-01 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 3:

I'm still trying to grok the patch. I have a couple of higher-level asks:

* In the planner, we assume in a few places that PARQUET is the only columnar 
file format. E.g. the code below. We should identify the places where "== 
PARQUET" really means "isColumnar()" and update those accordingly so that ORC 
is also counted.

if (table.getMajorityFormat() == HdfsFileFormat.PARQUET) {
  // For the purpose of this estimation, the number of per-host scan ranges for
  // Parquet files are equal to the number of columns read from the file. I.e.
  // excluding partition columns and columns that are populated from file metadata.

* You should add ORC to test_scanners_fuzz.py and run it in a loop for a while. 
That often flushes out bugs in handling invalid data.

  while impala-py.test tests/query_test/test_scanners_fuzz.py -k parquet; do echo yes ; done
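
For the first ask above, the idea of replacing "== PARQUET" comparisons with a single predicate can be sketched as follows. The planner itself is Java; C++ is used here only to match the scanner code under review, and the enum and function are hypothetical.

```cpp
#include <cassert>

// Hypothetical subset of file formats; the real enum lives in the frontend.
enum class FileFormat { TEXT, AVRO, PARQUET, ORC };

// Single place that decides which formats are columnar, so call sites can
// ask IsColumnar(format) instead of comparing against PARQUET directly.
bool IsColumnar(FileFormat format) {
  return format == FileFormat::PARQUET || format == FileFormat::ORC;
}
```

With a helper like this, adding another columnar format later would only touch one function rather than every call site.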



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 01 Feb 2018 23:50:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-01-28 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded a new patch set (#3). ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..

IMPALA-5717: Support for ORC data files

This patch integrates the orc-reader into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch.

Instead of linking the orc-reader as a third party library, it's
integrated at the code level, leaving room for further optimization,
e.g. Predicate Pushdown, Code Generation. Currently, we haven’t changed
any code of the orc-reader. It's in the folder be/src/orc.

Currently, we only support reading primitive types. Writing into ORC
tables is not supported either.

Tests
Most of the end-to-end tests can run on ORC format. Have passed all the
tests.

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner-test.cc
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
A be/src/orc/Adaptor.hh
A be/src/orc/Adaptor.hh.in
A be/src/orc/ByteRLE.cc
A be/src/orc/ByteRLE.hh
A be/src/orc/C09Adapter.cc
A be/src/orc/CMakeLists.txt
A be/src/orc/ColumnPrinter.cc
A be/src/orc/ColumnPrinter.hh
A be/src/orc/ColumnReader.cc
A be/src/orc/ColumnReader.hh
A be/src/orc/Compression.cc
A be/src/orc/Compression.hh
A be/src/orc/Exceptions.cc
A be/src/orc/Exceptions.hh
A be/src/orc/Int128.cc
A be/src/orc/Int128.hh
A be/src/orc/LzoDecompressor.cc
A be/src/orc/LzoDecompressor.hh
A be/src/orc/MemoryPool.cc
A be/src/orc/MemoryPool.hh
A be/src/orc/OrcFile.cc
A be/src/orc/OrcFile.hh
A be/src/orc/RLE.cc
A be/src/orc/RLE.hh
A be/src/orc/RLEv1.cc
A be/src/orc/RLEv1.hh
A be/src/orc/RLEv2.cc
A be/src/orc/RLEv2.hh
A be/src/orc/Reader.cc
A be/src/orc/Reader.hh
A be/src/orc/Timezone.cc
A be/src/orc/Timezone.hh
A be/src/orc/Type.hh
A be/src/orc/TypeImpl.cc
A be/src/orc/TypeImpl.hh
A be/src/orc/Vector.cc
A be/src/orc/Vector.hh
A be/src/orc/orc-config.hh
A be/src/orc/orc-config.hh.in
A be/src/orc/orc_proto.proto
A be/src/orc/wrap/coded-stream-wrapper.h
A be/src/orc/wrap/gmock.h
A be/src/orc/wrap/gtest-wrapper.h
A be/src/orc/wrap/orc-proto-wrapper.cc
A be/src/orc/wrap/orc-proto-wrapper.hh
A be/src/orc/wrap/snappy-wrapper.h
A be/src/orc/wrap/zero-copy-stream-wrapper.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M tests/common/test_dimensions.py
M tests/comparison/cli_options.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
70 files changed, 15,389 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/3

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 3
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 


[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-01-26 Thread Quanlong Huang (Code Review)
Quanlong Huang has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9134 )

Change subject: IMPALA-5717: Support for ORC data files
..


Patch Set 2:

Here is a document about this patch: 
https://docs.google.com/document/d/1Lg-MmZIis-ZbmMf6cD8YJq4x2tM0UXYPyzf0AYqe6Gc



Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang 
Gerrit-Comment-Date: Fri, 26 Jan 2018 11:17:20 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5717: Support for ORC data files

2018-01-26 Thread Quanlong Huang (Code Review)
Quanlong Huang has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9134 )


Change subject: IMPALA-5717: Support for ORC data files
..

IMPALA-5717: Support for ORC data files

This patch integrates the orc-reader into Impala and implements
HdfsOrcScanner as a middle layer between them. The HdfsOrcScanner
supplies the input needed by the orc-reader, tracks memory consumption of
the reader and transfers the reader's output (orc::ColumnVectorBatch)
into impala::RowBatch.

Instead of linking the orc-reader as a third party library, it's
integrated at the code level, leaving room for further optimization,
e.g. Predicate Pushdown, Code Generation. Currently, we haven’t changed
any code of the orc-reader. It's in the folder be/src/exec/orc.

Currently, we only support reading primitive types. Writing into ORC
tables is not supported either.

Tests
Most of the end-to-end tests can run on ORC format. Have passed all the
tests.

Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
---
M be/CMakeLists.txt
M be/src/exec/CMakeLists.txt
A be/src/exec/hdfs-orc-scanner-test.cc
A be/src/exec/hdfs-orc-scanner.cc
A be/src/exec/hdfs-orc-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-mt.cc
A be/src/orc/Adaptor.hh
A be/src/orc/Adaptor.hh.in
A be/src/orc/ByteRLE.cc
A be/src/orc/ByteRLE.hh
A be/src/orc/C09Adapter.cc
A be/src/orc/CMakeLists.txt
A be/src/orc/ColumnPrinter.cc
A be/src/orc/ColumnPrinter.hh
A be/src/orc/ColumnReader.cc
A be/src/orc/ColumnReader.hh
A be/src/orc/Compression.cc
A be/src/orc/Compression.hh
A be/src/orc/Exceptions.cc
A be/src/orc/Exceptions.hh
A be/src/orc/Int128.cc
A be/src/orc/Int128.hh
A be/src/orc/LzoDecompressor.cc
A be/src/orc/LzoDecompressor.hh
A be/src/orc/MemoryPool.cc
A be/src/orc/MemoryPool.hh
A be/src/orc/OrcFile.cc
A be/src/orc/OrcFile.hh
A be/src/orc/RLE.cc
A be/src/orc/RLE.hh
A be/src/orc/RLEv1.cc
A be/src/orc/RLEv1.hh
A be/src/orc/RLEv2.cc
A be/src/orc/RLEv2.hh
A be/src/orc/Reader.cc
A be/src/orc/Reader.hh
A be/src/orc/Timezone.cc
A be/src/orc/Timezone.hh
A be/src/orc/Type.hh
A be/src/orc/TypeImpl.cc
A be/src/orc/TypeImpl.hh
A be/src/orc/Vector.cc
A be/src/orc/Vector.hh
A be/src/orc/orc-config.hh
A be/src/orc/orc-config.hh.in
A be/src/orc/orc_proto.proto
A be/src/orc/wrap/coded-stream-wrapper.h
A be/src/orc/wrap/gmock.h
A be/src/orc/wrap/gtest-wrapper.h
A be/src/orc/wrap/orc-proto-wrapper.cc
A be/src/orc/wrap/orc-proto-wrapper.hh
A be/src/orc/wrap/snappy-wrapper.h
A be/src/orc/wrap/zero-copy-stream-wrapper.h
M common/thrift/CatalogObjects.thrift
M fe/src/main/cup/sql-parser.cup
M fe/src/main/java/org/apache/impala/catalog/HdfsFileFormat.java
M fe/src/main/java/org/apache/impala/catalog/HdfsStorageDescriptor.java
M fe/src/main/jflex/sql-scanner.flex
M testdata/bin/generate-schema-statements.py
M testdata/bin/run-hive-server.sh
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/functional-query_core.csv
M testdata/workloads/functional-query/functional-query_dimensions.csv
M testdata/workloads/functional-query/functional-query_exhaustive.csv
M testdata/workloads/functional-query/functional-query_pairwise.csv
M tests/common/test_dimensions.py
M tests/comparison/cli_options.py
M tests/query_test/test_decimal_queries.py
M tests/query_test/test_scanners.py
70 files changed, 15,389 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/9134/2

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia7b6ae4ce3b9ee8125b21993702faa87537790a4
Gerrit-Change-Number: 9134
Gerrit-PatchSet: 2
Gerrit-Owner: Quanlong Huang 
Gerrit-Reviewer: Quanlong Huang