[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Internal Jenkins (Code Review)
Internal Jenkins has submitted this change and it was merged.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Reviewed-on: http://gerrit.cloudera.org:8080/4863
Reviewed-by: Lars Volker 
Reviewed-by: Tim Armstrong 
Tested-by: Internal Jenkins
---
M be/src/exec/hbase-table-sink.cc
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hbase-table-writer.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-avro-table-writer.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
26 files changed, 457 insertions(+), 161 deletions(-)

Approvals:
  Lars Volker: Looks good to me, but someone else must approve
  Internal Jenkins: Verified
  Tim Armstrong: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Internal Jenkins (Code Review)
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 19: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Internal Jenkins (Code Review)
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 19:

Build failed: 
http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/518/

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Internal Jenkins (Code Review)
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 19:

Build failed: 
http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/515/

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Internal Jenkins (Code Review)
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 19:

Build failed: 
http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/510/

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Internal Jenkins (Code Review)
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 19: Verified-1

Build failed: 
http://sandbox.jenkins.cloudera.com/job/impala-external-gerrit-verify-merge-ASF/509/

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 19: Code-Review+2

Carry the +2

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 19: Code-Review+1

Replaced NULL with nullptr as discussed in previous comments. Carry +1. Now a 
committer required to carry the +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Lars Volker (Code Review)
Hello Marcel Kornacker, Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#19).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hbase-table-sink.cc
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hbase-table-writer.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-avro-table-writer.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
26 files changed, 457 insertions(+), 161 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/19
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 18: Code-Review+1

Rebased and ran the relevant tests locally, carry +1.

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Lars Volker (Code Review)
Hello Marcel Kornacker, Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#18).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hbase-table-sink.cc
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hbase-table-writer.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-avro-table-writer.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
26 files changed, 428 insertions(+), 138 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/18
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 16:

(3 comments)

Thanks for the review. I addressed the comments in PS17 and will rebase the 
change next.

http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 299:   // then the batch contains rows with the same key and can be 
written as a whole.
> then all rows in the batch have the same key
Done


http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 214:   /// Maps all rows in 'batch' to partitions and appends them to 
their temporary Hdfs
> good point, better revert then.
Done


http://gerrit.cloudera.org:8080/#/c/4863/16/be/src/exec/hdfs-table-writer.h
File be/src/exec/hdfs-table-writer.h:

Line 70:   /// rows in the batch are appended.
> i find this confusing, because it comments on the intention of the caller (
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-21 Thread Lars Volker (Code Review)
Hello Marcel Kornacker, Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#17).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.
- It adds tests to the Analyzer test suite.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hbase-table-sink.cc
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hbase-table-writer.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-avro-table-writer.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
26 files changed, 428 insertions(+), 138 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/17
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 17
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-18 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 15:

(5 comments)

Thanks for the comments, please see PS16.

http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

PS15, Line 329: key
> Could do std:move() to avoid copy of string
Done


http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

PS15, Line 168: form
> from
Done


Line 214:   /// Appends all rows in batch to the temporary Hdfs files their 
respective partitions.
> You're missing a verb in here I think
Rephrased.


Line 269:   PartitionPair* current_clustered_partition_;
> "Only set if 'input_is_clustered_' is true"
Done


Line 272:   /// batches.
> "Only set if 'input_is_clustered_' is true"
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-18 Thread Lars Volker (Code Review)
Hello Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#16).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hbase-table-sink.cc
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hbase-table-writer.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-avro-table-writer.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
26 files changed, 429 insertions(+), 138 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/16
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 16
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-17 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 15:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

PS15, Line 329: key
Could do std:move() to avoid copy of string


http://gerrit.cloudera.org:8080/#/c/4863/15/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

PS15, Line 168: form
from


Line 214:   /// Appends all rows in batch to the temporary Hdfs files their 
respective partitions.
You're missing a verb in here I think


Line 269:   PartitionPair* current_clustered_partition_;
"Only set if 'input_is_clustered_' is true"


Line 272:   /// batches.
"Only set if 'input_is_clustered_' is true"


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-17 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
> yes.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-17 Thread Lars Volker (Code Review)
Hello Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#15).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hbase-table-sink.cc
M be/src/exec/hbase-table-writer.cc
M be/src/exec/hbase-table-writer.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-avro-table-writer.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-sequence-table-writer.cc
M be/src/exec/hdfs-sequence-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/hdfs-text-table-writer.cc
M be/src/exec/hdfs-text-table-writer.h
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
26 files changed, 424 insertions(+), 138 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/15
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 15
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-17 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
> This is part of the HdfsTableWriter virtual interface. Should I change it h
yes.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-17 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 13:

(10 comments)

Thanks for the review. Please see PS14.

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count :
> if you break this up, then
This has been formatted by clang-format since I touched these lines. While I 
agree that the formatting could be nicer, I'd like to keep the automatic format 
here for simplicity.


Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, );
> we're standardizing on nullptr now
I changed all occurrences in lines I touched. Once the change has a +2 I will 
rebase the change and then replace all other occurrences in the file to keep it 
consistent (and keep the risk for merge conflicts low until then).


Line 266:   // The rows of this batch may span across multiple files. We 
repeatedly pass the row
> remove "across"
Done


Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
> rename this function, it's not appending the entire row batch (AppendRows?)
This is part of the HdfsTableWriter virtual interface. Should I change it here 
and in all table writer classes?


Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, 
RowBatch* batch) {
> to speed this up, you should:
Done. AppendRowBatch() already appends the whole batch if the row index vector 
is empty, so there was no change necessary.


http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 170:   const HdfsPartitionDescriptor& partition_descriptor, const 
TupleRow* current_row,
> mention that current_row provides the partition key values.
Done


Line 181:   /// GetHashTblKey(). Maps to an OutputPartition, which are owned by 
the object pool and
> "pool,"
Done


Line 189:   void GetHashTblKey(const TupleRow* current_row, const 
std::vector& ctxs,
> current_row -> row (it doesn't matter to this function that the row is 'cur
Done


Line 213:   /// Appends all rows in batch to the temporary Hdfs files 
corresponding to partitions.
> what "corresponding to partitions" refers to is unclear.
Rephrased it.


Line 214:   /// The input must be ordered by the partition key expressions.
> the rows are expected to be ordered...?
Done. Though I think "must be" makes it more clear that it's a  prerequisite to 
be guaranteed by the caller.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-17 Thread Lars Volker (Code Review)
Hello Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#14).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
14 files changed, 391 insertions(+), 98 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/14
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-14 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 13:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 66: tsink.table_sink.hdfs_table_sink.skip_header_line_count :
if you break this up, then
...\n
? ...\n
: 0

is better


Line 212: GetHashTblKey(NULL, dynamic_partition_key_value_ctxs, );
we're standardizing on nullptr now


Line 266:   // The rows of this batch may span across multiple files. We 
repeatedly pass the row
remove "across"


Line 273: RETURN_IF_ERROR(output_partition->writer->AppendRowBatch(
rename this function, it's not appending the entire row batch (AppendRows?)


Line 283: Status HdfsTableSink::WriteClusteredRowBatch(RuntimeState* state, 
RowBatch* batch) {
to speed this up, you should:
- keep a current_clustered_partition_
- also current_clustered_partition_key_
- check the partition key of the last row in batch whether it matches 
current_partition_key_
- if so, skip the for loop and write the entire batch out

also: change WriteRowsToPartition to take the index vector separately; pass a 
nullptr in the case where the entire batch gets written to the same partition

apply that same optimization to writer::AppendRowBatch (or however you rename 
that). no need to spend time checking some vector in that case.


http://gerrit.cloudera.org:8080/#/c/4863/13/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 170:   const HdfsPartitionDescriptor& partition_descriptor, const 
TupleRow* current_row,
mention that current_row provides the partition key values.


Line 181:   /// GetHashTblKey(). Maps to an OutputPartition, which are owned by 
the object pool and
"pool,"


Line 189:   void GetHashTblKey(const TupleRow* current_row, const 
std::vector& ctxs,
current_row -> row (it doesn't matter to this function that the row is 
'current')

same for InitOutputPartition


Line 213:   /// Appends all rows in batch to the temporary Hdfs files 
corresponding to partitions.
what "corresponding to partitions" refers to is unclear.


Line 214:   /// The input must be ordered by the partition key expressions.
the rows are expected to be ordered...?


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-14 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 12:

Rebased and added tests in AnalyzeStmtsTest as suggested by Alex in 
https://gerrit.cloudera.org/#/c/5051/1/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zoltan Ivanfi 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-09 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 12: Code-Review+1

Rebased, carry +1 from Alex.

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-09 Thread Lars Volker (Code Review)
Hello Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#12).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
13 files changed, 341 insertions(+), 95 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/12
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-08 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 11:

Passed private S3 run with only query_tests enabled here.

Relevant lines for S3:
02:22:40.508 [gw2] PASSED 
query_test/test_insert_behaviour.py::TestInsertBehaviour::test_clustered_partition_single_file
 
02:25:23.115 [gw2] PASSED 
query_test/test_insert_behaviour.py::TestInsertBehaviour::test_clustered_partition_multiple_files

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-07 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 11: Code-Review+1

Assuming the S3 run passes.

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-07 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 9:

(3 comments)

Thanks for the reviews. I'm running another private job to test the change on 
S3 and will update the Jira once it's done.

http://gerrit.cloudera.org:8080/#/c/4863/9/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 488: import os
> why?
Sry, leftover from trying to get the tests to work.


Line 491: table_location = DEFAULT_FS + "/" + table_path
> I'm confused. Doesn't get_fs_path() do the right thing? Why do we have to a
It does. I couldn't figure out why FILESYSTEM_PREFIX works the way it does on 
the Jenkins host I ran this on, but not on my local dev machine. Sailesh helped 
me, now I'm convinced it should work. I'm running another private build and 
will update this change again once I have results.


Line 550:   table, table_location)
> indent 4
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-07 Thread Lars Volker (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#11).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/util/filesystem_utils.py
14 files changed, 344 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/11
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-07 Thread Lars Volker (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#10).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/util/filesystem_utils.py
14 files changed, 344 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/10
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-07 Thread Alex Behm (Code Review)
Alex Behm has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 9:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4863/9/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 488: import os
why?


Line 491: table_location = DEFAULT_FS + "/" + table_path
I'm confused. Doesn't get_fs_path() do the right thing? Why do we have to add 
this DEFAULT_FS prefix?


Line 550:   table, table_location)
indent 4


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-06 Thread Lars Volker (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#9).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M bin/impala-config.sh
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
M tests/util/filesystem_utils.py
14 files changed, 345 insertions(+), 99 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/9
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-04 Thread Marcel Kornacker (Code Review)
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 8:

i'll pick this up once it has an overall +1

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Marcel Kornacker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-04 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 8: Code-Review+1

(3 comments)

+1 on the backend part.

http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 530:   DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY);
> My assumption was that InitOutputPartition wouldn't be called if key==ROOT_
Makes sense, thanks.


http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 912: partition (year, month) /*+ clustered,noshuffle */
> Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch(
That's a good point, I misunderstood that aspect of the feature - I was 
thinking of the "SORT BY" clause that isn't implemented yet.


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 554:l_returnflag
> What would be good values here? Should we leave it and see if things break 
I agree with Alex. I think we mainly want to check that we're not losing test 
coverage. So I'd be ok with something really loose like >= 3 and <= 30, since 
that would only fail if something drastically changed.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-04 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 8:

(14 comments)

Thanks for the reviews, please see PS8. I will update again once the private 
builds for S3 and local FS have finished.

http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 530:   DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY);
> Doesn't the above check need to be the same as here?
My assumption was that InitOutputPartition wouldn't be called if 
key==ROOT_PARTITION_KEY, since that partition would exist already. However that 
seems to be wrong. The key isn't passed into InitOutputPartition so we cannot 
repeat the check there.


http://gerrit.cloudera.org:8080/#/c/4863/7/common/thrift/DataSinks.thrift
File common/thrift/DataSinks.thrift:

Line 70:   // partition keys, meaning partitions can be opened, written, and 
closed one by one.
> I think we just use the double // here (and above)
Done. The one above was me some time ago, too :)


Line 71:   4: required bool input_is_clustered
> remove trailing semicolon
Done, and elsewhere in the file.


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
File fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java:

Line 51:   protected final boolean inputIsClustered_;
> inputIsClustered_
Done


http://gerrit.cloudera.org:8080/#/c/4863/7/fe/src/main/java/org/apache/impala/planner/TableSink.java
File fe/src/main/java/org/apache/impala/planner/TableSink.java:

Line 85:   boolean overwrite, boolean ignoreDuplicates, boolean 
inputIsClustered) {
> use Java style: inputIsClustered
Done


http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 912: partition (year, month) /*+ clustered,noshuffle */
> I missed it earlier that we don't have tests for unpartitioned inserts with
Done. I added a test and a DCHECK in HdfsTableSink::WriteClusteredRowBatch() to 
make sure we're performing a partitioned insert.

However I'm not sure now whether we actually should allow specifying /*+ 
clustered */ for non-partitioned tables at all, since it won't have any impact 
on the plan and will just be silently ignored.


http://gerrit.cloudera.org:8080/#/c/4863/7/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 863: insert into table alltypesinsert
> do we have coverage of the clustered with and without shuffling?
So far we only had coverage in the FE. Added a test. Should we run all tests 
with shuffle and noshuffle? I assume shuffle makes more sense since we mostly 
will want to use clustered with large inserts. I added it to the other tests.


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

Line 112:   def test_insert_test(self, vector):
> ?
This makes it possible to test only "insert.test" by specifying "-k 
test_insert_test" to impala-py.test. A lot of our BE tests have this issue of 
not being able to just select the main test by -k. Should I revert this, or add 
a comment? It's explained in the commit message.


http://gerrit.cloudera.org:8080/#/c/4863/7/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 481: table_path = get_fs_path(
> don't we need to add the filesystem prefix for S3 and local FS?
I wasn't aware that the S3 and local FS cases were handled by the Hdfs code. I 
added get_fs_path() calls and started private runs for both S3 and local FS and 
will update this once they're done.


Line 489: 
> add shuffle hint to make sure we shuffle
Done


Line 507: # This test takes about 30 seconds and we are unlikely to break 
it, so only run it in
> give reason. takes too long?
Added a comment.


Line 509: if self.exploration_strategy() != 'exhaustive':
> same comment about the fs prefix
See comment above.


Line 534: 
> does this test make any assumptions about whether the shuffling behavior? b
No, it doesn't make any assumptions. I added the hint.


Line 554:l_returnflag
> it feels like this could become flaky relatively easily (changing compressi
What would be good values here? Should we leave it and see if things break and 
then gradually adapt?


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 

[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-04 Thread Lars Volker (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#8).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
12 files changed, 318 insertions(+), 95 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 530:   DCHECK(current_row != NULL || key == ROOT_PARTITION_KEY);
Doesn't the above check need to be the same as here?


http://gerrit.cloudera.org:8080/#/c/4863/6/testdata/workloads/functional-query/queries/QueryTest/insert.test
File testdata/workloads/functional-query/queries/QueryTest/insert.test:

Line 912: partition (year, month) /*+ clustered */
I missed it earlier that we don't have tests for unpartitioned inserts with the 
clustered hint - we should make sure that it goes down the right code path (I 
think it does).


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-03 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 7:

(1 comment)

Thanks Tim, for the review. I pushed PS7, removing a DCHECK I had added - 
though I'm not quite sure why it gets hit.

Marcel, can you have a look at PS7 for a +2?

http://gerrit.cloudera.org:8080/#/c/4863/6/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 395:   // Build the unique name for this partition from the partition 
keys, e.g. "j=1/f=foo/"
The insert tests are hitting this DCHECK. Without it they pass, though it is 
not entirely clear to me why, yet.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-03 Thread Lars Volker (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/4863

to look at the new patch set (#7).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
12 files changed, 282 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/7
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 6:

LGTM but I think it's a big enough change that we should have another pair of 
eyes on it

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-03 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 6: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-03 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 5:

(4 comments)

Thanks for the review, please see PS6.

http://gerrit.cloudera.org:8080/#/c/4863/5/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 277: if (new_file) {
> Can avoid a level of nesting with if (!new_file) break;
Done


Line 294: current_row_ = batch->GetRow(i);
> Oh man, I didn't notice this 'current_row_' thing earlier. It looks like fo
Done


Line 302: RETURN_IF_ERROR(WriteRowsToPartition(state, batch, 
current_partition_pair));
> I think we should also remove the entry in partition_keys_to_output_partiti
Done


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 476: 
> Maybe a little much given core already takes too long - I think it's useful
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-11-03 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#6).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
12 files changed, 283 insertions(+), 89 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-31 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/4863/5/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 277:   RETURN_IF_ERROR(FinalizePartitionFile(state, output_partition));
Can avoid a level of nesting with if (!new_file) break;


Line 294: GetHashTblKey(dynamic_partition_key_expr_ctxs_, );
Oh man, I didn't notice this 'current_row_' thing earlier. It looks like for 
some reason the original author of the code didn't want to pass an extra 
argument into functions so instead did it implicitly via 'current_row_'

Can you refactor it so that it's passed as an actual argument?


Line 302: RETURN_IF_ERROR(WriteRowsOfPartition(state, batch, 
current_partition_pair));
I think we should also remove the entry in partition_keys_to_output_partitions_ 
here so that we use O(1) memory.


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 476: 
> Done. It takes 30 seconds on my machine. Too much for core?
Maybe a little much given core already takes too long - I think it's useful 
coverage but we're also probably pretty unlikely to break it. Seems fine for 
exhaustive.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-29 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#5).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine execution of
  test.insert took 1778 seconds, compared to 1002 seconds with the just
  default row batch size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
12 files changed, 253 insertions(+), 59 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/5
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-29 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 4:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 268:   // Pass the row batch to the writer. If new_file is returned true 
then the current
> I find this comment a little confusing. Maybe explain more the "why", i.e. 
Done. I had moved this from elsewhere, but tried to make it more clear.


Line 272:   do {
> I feel like this might be clearer if you write it as while(true) and then b
Done


Line 288:   // TODO: Should we adapt hdfs writer to accept start,num pair 
instead of a vector?
> My guess is that it would only help a little bit. It's nice to use the same
Ok


Line 296: PartitionPair* next_partition_pair = NULL;
> It would be more efficient to keep track of the previous key and comparing 
Done


PS4, Line 301: Flush
> I think it might be clearer to just say "Write rows" instead of "flush". Fl
Done


Line 301: // Flush previous partition
> I think the comments in here are a little too low level. Here maybe a singl
Done


Line 310: // Collect row index
> This comment isn't too helpful.
Done


PS4, Line 313: Flush
> "Write rows to" is probably clearer than "flush".
Done


PS4, Line 607: WriteRowsOfPartition
> WriteRowsToPartition?
Done


http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

PS4, Line 183: rows
> Maybe reword to be clearer, took me a while to figure out:
Done


PS4, Line 209: partitions
> partition's (missing apostrophe).
Done


Line 209:   /// Writes all rows of a partition to the partitions writer and 
clears the row index
> This could be clearer that it's only writing the rows with indices in 'part
Done


PS4, Line 215:  is expected to
> This could be more declarative - i.e. "must" instead of "is expected to"
Done


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

PS4, Line 71: text
> parquet, right?
Done


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 476: 
> Hmm, I think we could also do with a test that writes some large partitions
Done. It takes 30 seconds on my machine. Too much for core?


Line 477: 
> Extra blank line
Done


Line 501:   assert len(files) == 1
> Comment why we only expect one file - it might not be that obvious to peopl
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 4:

(17 comments)

Thanks for adding the tests, this is looking pretty good - I think it could be 
refined and polished a bit more but no major concerns here.

http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 268:   // Pass the row batch to the writer. If new_file is returned true 
then the current
I find this comment a little confusing. Maybe explain more the "why", i.e. that 
the rows may be written to multiple files.


Line 272:   do {
I feel like this might be clearer if you write it as while(true) and then break 
out if 'new_file' is false.


Line 288:   // TODO: Should we adapt hdfs writer to accept start,num pair 
instead of a vector?
My guess is that it would only help a little bit. It's nice to use the same 
code path for this and dynamic partitioning at least for now.


Line 296: PartitionPair* next_partition_pair = NULL;
It would be more efficient to keep track of the previous key and comparing 
directly instead of looking up the partition hash table for every row. It's a 
little more complicated but I think would help perf.


PS4, Line 301: Flush
I think it might be clearer to just say "Write rows" instead of "flush". Flush 
makes me think it's writing to the filesystem, but it's really just pushing the 
rows into the HdfsTableWriter, which does its own buffering.


Line 301: // Flush previous partition
I think the comments in here are a little too low level. Here maybe a single 
comment along the lines of "Done with previous partition - write rows and 
close." might be more helpful.


Line 310: // Collect row index
This comment isn't too helpful.


PS4, Line 313: Flush
"Write rows to" is probably clearer than "flush".


PS4, Line 607: WriteRowsOfPartition
WriteRowsToPartition?


http://gerrit.cloudera.org:8080/#/c/4863/4/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

PS4, Line 183: rows
Maybe reword to be clearer, took me a while to figure out:
"a vector of indices of the rows in the current batch to insert into the 
partition"


Line 209:   /// Writes all rows of a partition to the partitions writer and 
clears the row index
This could be clearer that it's only writing the rows with indices in 
'partition_pair', it kind-of sounds like it's writing all the rows in 'batch' 
into the partition.


PS4, Line 209: partitions
partition's (missing apostrophe).


PS4, Line 215:  is expected to
This could be more declarative - i.e. "must" instead of "is expected to"


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert.py
File tests/query_test/test_insert.py:

PS4, Line 71: text
parquet, right?


http://gerrit.cloudera.org:8080/#/c/4863/4/tests/query_test/test_insert_behaviour.py
File tests/query_test/test_insert_behaviour.py:

Line 476: 
Hmm, I think we could also do with a test that writes some large partitions 
split across files. E.g. do a CTAS of lineitem partitioned by some low-NDV 
column. We could also maybe adjust PARQUET_FILE_SIZE to force it to split it 
into more files.

Depending on how long it takes, might make sense to only run it in exhaustive.


Line 477: 
Extra blank line


Line 501:   assert len(files) == 1
Comment why we only expect one file - it might not be that obvious to people 
looking at this code later.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-27 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..


Patch Set 4:

(2 comments)

Thanks Tim for having a look. Please see PS4.

http://gerrit.cloudera.org:8080/#/c/4863/1//COMMIT_MSG
Commit Message:

PS1, Line 7: A-
> - instead of _
Done


Line 12: 
> RE: testing - off the top of my head, I think we need:
Thanks for the feedback. I added tests addressing points 1 to 3 and an 
explanation in the commit message. I also added different row batch sizes to 
the exhaustive run of insert.test.


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-27 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#4).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- clustered insert tests have been added to select from alltypestiny to
  create inserts with 1 and 2 rows per partition respectively.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024). This is limited to uncompressed parquet
  files, to maintain a reasonable runtime. On my machine it took 1778
  seconds, compared to 1002 seconds with the just default row batch
  size.
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
12 files changed, 192 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-27 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#2).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024).
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
12 files changed, 163 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-2523: Make HdfsTableSink aware of clustered input

2016-10-27 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new patch set (#3).

Change subject: IMPALA-2523: Make HdfsTableSink aware of clustered input
..

IMPALA-2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

This change also adds/modifies tests in several ways:

- clustered insert tests switch from selecting all rows from
  alltypessmall to alltypes. Together with varying settings for
  batch_size, this results in a larger number of row batches being
  written.
- clustered insert tests select from alltypes instead of
  functional.alltypes to make sure we also select from various input
  formats.
- exhaustive insert tests now use different values for batch_size: 1,
  16, 0 (meaning default, 1024).
- There is additional testing in test_insert_behaviour.py to make sure
  that insertion over several row batches only creates one file per
  partition.
- It renames the test_insert method to make it unique in the file and
  allow for effective filtering with -k.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
M testdata/workloads/functional-query/queries/QueryTest/insert.test
M tests/query_test/test_insert.py
M tests/query_test/test_insert_behaviour.py
12 files changed, 163 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA 2523: Make HdfsTableSink aware of clustered input

2016-10-27 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA_2523: Make HdfsTableSink aware of clustered input
..


Patch Set 1:

(2 comments)

I didn't do a full pass but had thoughts on testing.

http://gerrit.cloudera.org:8080/#/c/4863/1//COMMIT_MSG
Commit Message:

PS1, Line 7: A_
- instead of _


Line 12: 
RE: testing - off the top of my head, I think we need:

* Very large inserts with partitions spanning many row batches
* Inserts with a small number of rows per partition (e.g. 1, 2, 3).
* Tests that check that the expected # of files are created

It would also be good to add batch_size as a test dimension (if it isn't 
already for the insert tests) and test with some different batch sizes to hit 
more edge cases, e.g. 1, 16, default (1024).


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA 2523: Make HdfsTableSink aware of clustered input

2016-10-26 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change.

Change subject: IMPALA_2523: Make HdfsTableSink aware of clustered input
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4863/1/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 594: WriteClusteredRowBatch(state, batch);
this case is exercised by the tests for IMPALA-2521, however we don't have 
tests (yet) that make sure we actually ever have one partition output file 
open. Any ideas how I could test this easily with our current infrastructure?


-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Lars Volker 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA 2523: Make HdfsTableSink aware of clustered input

2016-10-26 Thread Lars Volker (Code Review)
Lars Volker has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4863

Change subject: IMPALA_2523: Make HdfsTableSink aware of clustered input
..

IMPALA_2523: Make HdfsTableSink aware of clustered input

IMPALA-2521 introduced clustering for insert statements. This change
makes the HdfsTableSink aware of clustered inputs, so that partitions
are opened, written, and closed one by one.

Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.cc
M common/thrift/DataSinks.thrift
M fe/src/main/java/org/apache/impala/analysis/DeleteStmt.java
M fe/src/main/java/org/apache/impala/analysis/InsertStmt.java
M fe/src/main/java/org/apache/impala/analysis/UpdateStmt.java
M fe/src/main/java/org/apache/impala/planner/HdfsTableSink.java
M fe/src/main/java/org/apache/impala/planner/TableSink.java
9 files changed, 105 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/63/4863/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4863
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibeda0bdabbfe44c8ac95bf7c982a75649e1b82d0
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker