[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

2017-07-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
..


IMPALA-5498: Support for partial sorts in Kudu INSERTs

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to the query memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode. (IMPALA-5669)

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

This also adds a new counter 'RunSize' to the runtime profile which
tracks the min, max, and avg size of the generated runs, in tuples.

As a first use case, partial sort is used where a total sort was
used previously for inserts/upserts into Kudu tables only. Future
work can extend this to other table sinks. (IMPALA-5649)

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Reviewed-on: http://gerrit.cloudera.org:8080/7267
Reviewed-by: Thomas Tauber-Marshall 
Tested-by: Impala Public Jenkins
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/runtime-profile-counters.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 459 insertions(+), 70 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Thomas Tauber-Marshall: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

2017-07-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
..


Patch Set 8: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

2017-07-21 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
..


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/908/

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

2017-07-21 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
..


Patch Set 8: Code-Review+2

Rebased

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

2017-07-20 Thread Matthew Jacobs (Code Review)
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
..


Patch Set 7: Code-Review+2

You probably need to rebase

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

2017-07-17 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
..


Patch Set 6:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7267/6//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5498: Support for partial sorts
> lets specify in Kudu for now
Done


PS6, Line 28: A
> nit: This also adds...
Done


Line 32: used previously for inserts into Kudu.
> "for inserts into Kudu"->"for DML statements into Kudu tables only. Future 
This only applies to inserts/upserts. We still haven't done the 
partitioning/sorting for update/delete.


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

PS6, Line 62: derived based
> awkward/confusing
Done


PS6, Line 62: sort_exec_exprs_
> update
Done


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

PS6, Line 1383: RunSize
> can you indicate the units, e.g. #rows -> NumRowsPerRun?
Done


PS6, Line 1387:   if (no_spill_) {
  : // 'min_buffers_required' is 1 since the sorter will only 
have 1 run at a time, so
  : // there won't be any merges.
  : min_buffers_required = 1;
  :   } else {
  : min_buffers_required = MIN_BUFFERS_PER_MERGE;
  :   }
> nit: ternary operator
Done


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS6, Line 45: Specifying 'no_spill' as true is
> For this use case, 'no_spill' should be set to true so the sorter can reduc
Done


PS6, Line 103: no_spill
> A positive argument would be marginally easier to follow. E.g. 'enable_spil
Of course, I called it no_spill to mirror the function names. Do you think we 
should also change AddBatch/AddBatchNoSpill -> AddBatchWithSpill/AddBatch?


PS6, Line 103: bool no_spill = false
> comment on this even though it's mentioned above
Done


Line 117:   /// up, it is sorted and a new unsorted run is created.
> comment that this may not be called if no_spill is true
Done


PS6, Line 127: InputDone
> comment that this may not be called if no_spill is true
So this does have a DCHECK in it, but only if there's more than one run (which 
isn't possible unless AddBatch() was called), so its fine to call with either 
value of enable_spilling.

The other cases would be the various private *Merge* functions. I didn't add 
DCHECKs/documentation to those cause it seems cluttered and the DCHECK here in 
InputDone() ensures that none of them are called, but I can add it in if you 
want.


http://gerrit.cloudera.org:8080/#/c/7267/6/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

PS6, Line 56: TODO: run experiments to
:   // determine the value for this, consider making it 
configurable, enforce it in the BE.
> JIRA?
Done


PS6, Line 124: {
 : return type_ != TSortType.PARTIAL;
 :   }
> nit: oneline?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs

2017-07-17 Thread Thomas Tauber-Marshall (Code Review)
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7267

to look at the new patch set (#7).

Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs
..

IMPALA-5498: Support for partial sorts in Kudu INSERTs

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to the query memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode. (IMPALA-5669)

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

This also adds a new counter 'RunSize' to the runtime profile which
tracks the min, max, and avg size of the generated runs, in tuples.

As a first use case, partial sort is used where a total sort was
used previously for inserts/upserts into Kudu tables only. Future
work can extend this to other table sinks. (IMPALA-5649)

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/runtime-profile-counters.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 459 insertions(+), 70 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/7
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-14 Thread Matthew Jacobs (Code Review)
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 6:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/7267/6//COMMIT_MSG
Commit Message:

Line 7: IMPALA-5498: Support for partial sorts
lets specify in Kudu for now


PS6, Line 28: A
nit: This also adds...


Line 32: used previously for inserts into Kudu.
"for inserts into Kudu"->"for DML statements into Kudu tables only. Future work 
can extend this to other table sinks IMPALA-"


Line 42:   Now: no spills, sort takes 6m19s, for ~18% speedup
comparison vs no partition/sort at all?


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

PS6, Line 62: sort_exec_exprs_
update


PS6, Line 62: derived based
awkward/confusing


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

PS6, Line 1383: RunSize
can you indicate the units, e.g. #rows -> NumRowsPerRun?


PS6, Line 1387:   if (no_spill_) {
  : // 'min_buffers_required' is 1 since the sorter will only 
have 1 run at a time, so
  : // there won't be any merges.
  : min_buffers_required = 1;
  :   } else {
  : min_buffers_required = MIN_BUFFERS_PER_MERGE;
  :   }
nit: ternary operator


http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS6, Line 45: Specifying 'no_spill' as true is
For this use case, 'no_spill' should be set to true so the sorter can reduce...


PS6, Line 103: bool no_spill = false
comment on this even though it's mentioned above


Line 117:   /// up, it is sorted and a new unsorted run is created.
comment that this may not be called if no_spill is true


PS6, Line 127: InputDone
comment that this may not be called if no_spill is true
(are there any other cases I missed?)


http://gerrit.cloudera.org:8080/#/c/7267/6/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

PS6, Line 56: TODO: run experiments to
:   // determine the value for this, consider making it 
configurable, enforce it in the BE.
JIRA?


PS6, Line 124: {
 : return type_ != TSortType.PARTIAL;
 :   }
nit: oneline?


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 6: Code-Review+1

(1 comment)

LGTM aside from the variable name

http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS6, Line 103: no_spill
A positive argument would be marginally easier to follow. E.g. 'enable_spilling'


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-14 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 6:

(1 comment)

> Can you extend the Sort metrics in the query profile to include
 > Avg, Min and Max run size?
 > This will be needed to reflect on the quality of the sorted data?

Done.

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: to the merger and retrieve r
> I don't feel strongly - the change you made looks good to me. Maybe 'is_par
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-14 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has uploaded a new patch set (#6).

Change subject: IMPALA-5498: Support for partial sorts
..

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

Adds a new counter 'RunSize' to the runtime profile which tracks
the min, max, and avg size of the generated runs, in tuples.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M be/src/util/runtime-profile-counters.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 460 insertions(+), 67 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-14 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 4:

Can you extend the Sort metrics in the query profile to include Avg, Min and 
Max run size?
This will be needed to reflect on the quality of the sorted data?

-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: int64_t min_buffers_required
> If you prefer, I think its also reasonable to pass a 'is_partial" flag here
I don't feel strongly - the change you made looks good to me. Maybe 
'is_partial' would be slightly better so that it's clearer what the expected 
usage pattern is, but I really don't feel strongly at all.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-11 Thread Matthew Jacobs (Code Review)
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 527:   if (!insertStmt.hasNoClusteredHint() && 
!ctx_.isSingleNodeExec()) {
> We felt it was better to just do Kudu for now to keep this patch simpler an
Also it's not just a matter of enabling it here. Some work is required in the 
BE to ensure coordination between the sorter (doing a partial sort) and the 
writing of files to make sure that a given file isn't written out of order, 
i.e. that new files get written at the partial sort boundaries.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-11 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 111: num_rows_returned_ += row_batch->num_rows();
> Do we need to update num_rows_returned_ here?
Done


Line 148: }
> We can't have a partial sort in a subplan, right? It's probably best to jus
Done


PS4, Line 149: 
> Let's use nullptr consistently in the new code.
Done


http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: ples in place.
> I feel like this is breaking some of the encapsulation of the sort implemen
If you prefer, I think its also reasonable to pass a 'is_partial" flag here.

Both solutions are kind of weird, but with the flag we could DCHECK in various 
places to ensure that callers are correctly following the intention here of not 
mixing calls to AddBatch() and AddBatchNoSpill() (though I think it would work 
as expected if they did).


http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 527:   if (!insertStmt.hasNoClusteredHint() && 
!ctx_.isSingleNodeExec()) {
> Why is this limited to Kudu?
We felt it was better to just do Kudu for now to keep this patch simpler and 
allow us to exercise it with the stress test before using it more broadly.

I filed: IMPALA-5649 to track applying this to HDFS tables.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-11 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has uploaded a new patch set (#5).

Change subject: IMPALA-5498: Support for partial sorts
..

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.cc
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 470 insertions(+), 76 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-10 Thread Mostafa Mokhtar (Code Review)
Mostafa Mokhtar has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java
File fe/src/main/java/org/apache/impala/planner/Planner.java:

Line 527:   if (!insertStmt.hasNoClusteredHint() && 
!ctx_.isSingleNodeExec()) {
Why is this limited to Kudu?

Inserting into Partitioned HDFS tables is a very common use case.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Mostafa Mokhtar 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 111: *eos = true;
Do we need to update num_rows_returned_ here?


Line 148: 
We can't have a partial sort in a subplan, right? It's probably best to just 
put a DCHECK(false) here since there's no way to test the code.


PS4, Line 149: :Res
Let's use nullptr consistently in the new code.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

Line 96:   Sorter(const TupleRowComparator& compare_less_than,
> This will be included in a follow up patch.
WFM. If you want to wait for https://gerrit.cloudera.org/#/c/5801/ to land 
there will be a way to directly set a maximum on bytes of buffers in the 
planner, which would accomplish the same thing.


http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

PS4, Line 99: RuntimeProfile* profile, Run
I feel like this is breaking some of the encapsulation of the sort 
implementation. This detail leaks out into the planner anyway so I guess it's 
not really encapsulated.

How about we just document more clearly what value is expected. The class 
comment talks about the number of blocks per run, so it could be extended to 
talk about the minimum requirement for spilling and non-spilling sorts.


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS3, Line 349:  highest/lowest
> My problem with that wording is that it could be interpreted as "take the f
WFM


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-10 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has uploaded a new patch set (#4).

Change subject: IMPALA-5498: Support for partial sorts
..

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/exec/sort-node.cc
M be/src/exec/sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
16 files changed, 458 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-10 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 4:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 32: sorter_(NULL),
> This doesn't need to handle offset?
Done


Line 127: 
> Since this only requires a partial sort, you can exit early when the cumula
As noted elsewhere, removing limit-related code.


Line 140: *eos = input_eos_;
> I guess it is not clear to me why allowing a limit without an offset is use
Right, so 'offset' never really makes sense with partial-sort-node, since as 
you say there's not a total ordering and you're not guaranteed that one 
instance of partial-sort-node with an offset can pick up where another instance 
left off with a limit.

A limit is potentially useful just for cutting off the number of rows returned 
in a single partial-sort-node, but with the initial use case for this there's 
no way to end up with a partial-sort-node with a limit, so in the interest of 
keeping things simple and not having untested code paths, I think I'll just 
remove the limit code.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

Line 58:   virtual Status QueryMaintenance(RuntimeState* state);
> Just for general niceness, can you also update be/src/exec/sort-node.h's pr
Done


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

Line 1384:   expr_mem_pool, _tuple_expr_evals_));
> Need to update this calculation for partial sorts.
Done


Line 1434:   if (sorted_runs_.size() == 1) {
> Unnecessary - this is unconditionally dereferenced
Done


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

Line 96:   Sorter(const TupleRowComparator& compare_less_than,
> Wasn't the plan to apply the partition sort memory limit to the sorter itse
This will be included in a follow up patch.


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS3, Line 349: e {
> Should this be "first" since the order is specified elsewhere.
My problem with that wording is that it could be interpreted as "take the first 
N elements of input and then sort them"

Maybe: "Return the first N sorted elements"?


Line 360: struct TSortNode {
> Please ignore my earlier comments about offset handling (although a comment
Added to partial-sort-node.h


http://gerrit.cloudera.org:8080/#/c/7267/3/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

Line 58:   private final long PARTIAL_SORT_MEM_LIMIT = 128 * 1024 * 1024;
> It might be clearer to specify this in bytes, then just make sure that its 
Done


Line 286:   // The memory limit cannot be less than the size of the 
required blocks.
> This looks good to me.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-06 Thread Zach Amsden (Code Review)
Zach Amsden has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 3:

(6 comments)

Looks promising!

http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 32: sorter_(NULL),
This doesn't need to handle offset?


Line 127: input_batch_index_ += num_processed;
Since this only requires a partial sort, you can exit early when the cumulative 
number of rows processed exceeds the limit.  This might end up sorting 
significantly fewer rows (at the cost of less well sorted data when a limit is 
present).


Line 140:   if (ReachedLimit()) {
I guess it is not clear to me why allowing a limit without an offset is useful. 
 If there is a limit, the total ordering of rows is undefined and there is no 
way to continue fetching them without also allowing an offset.  Maybe document 
this limitation?


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.h
File be/src/exec/partial-sort-node.h:

Line 58: 
Just for general niceness, can you also update be/src/exec/sort-node.h's 
private SortInput method to have WARN_UNUSED_RESULT?


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

Line 1434:   DCHECK(unsorted_run_ != NULL);
Unnecessary - this is unconditionally dereferenced


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 360:   // Not used with TSortType::PARTIAL.
Please ignore my earlier comments about offset handling (although a comment 
that offsets are not supported in exec/partion-sode-node.cc is still welcome.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Zach Amsden 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-06 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.cc
File be/src/exec/partial-sort-node.cc:

Line 104:   sorter_->Reset();
> To be clear - what you're saying is that Open() will want to allocate memor
I wouldn't expect incorrect results, but it could result in queries failing 
with OOM when they shouldn't.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

Line 1384:   int min_buffers_required = MIN_BUFFERS_PER_MERGE;
Need to update this calculation for partial sorts.


http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

Line 96:   Sorter(const TupleRowComparator& compare_less_than,
Wasn't the plan to apply the partition sort memory limit to the sorter itself?


http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

PS3, Line 349:  highest/lowest
Should this be "first" since the order is specified elsewhere.

Maybe:
"Return the first N elements in sorted order"?


http://gerrit.cloudera.org:8080/#/c/7267/3/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

Line 58:   private final long PARTIAL_SORT_MEM_LIMIT = 
BackendConfig.INSTANCE.getReadSize() * 100;
It might be clearer to specify this in bytes, then just make sure that its >= 2 
* read_size.

Although we really don't expect people to change read_size.


Line 286:   resourceProfile_ = new ResourceProfile(
This looks good to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-06 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-5498: Support for partial sorts
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/7267/2/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

Line 348: 
> these could use comments, especially since "partial sort" often means what 
Done


http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java:

PS2, Line 296:   boolean useTopN = stmt.hasLimit() && !disableTopN;
> long line
Done


http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SortNode.java
File fe/src/main/java/org/apache/impala/planner/SortNode.java:

PS2, Line 93: 
> Maybe make this one a static factory for consistency. I like them- very rea
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-5498: Support for partial sorts

2017-07-06 Thread Thomas Tauber-Marshall (Code Review)
Thomas Tauber-Marshall has uploaded a new patch set (#3).

Change subject: IMPALA-5498: Support for partial sorts
..

IMPALA-5498: Support for partial sorts

Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to its memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.

Future work will look into setting a more restrictive memory limit
on the PartialSortNode.

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

As a first use case, partial sort is used where a total sort was
used previously for inserts into Kudu.

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
---
M be/src/exec/CMakeLists.txt
M be/src/exec/exec-node.cc
A be/src/exec/partial-sort-node.cc
A be/src/exec/partial-sort-node.h
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M common/thrift/PlanNodes.thrift
M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java
M fe/src/main/java/org/apache/impala/planner/Planner.java
M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java
M fe/src/main/java/org/apache/impala/planner/SortNode.java
M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test
M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test
M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test
14 files changed, 431 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7267
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Thomas Tauber-Marshall 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Matthew Jacobs 
Gerrit-Reviewer: Thomas Tauber-Marshall 
Gerrit-Reviewer: Tim Armstrong