[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Impala Public Jenkins has submitted this change and it was merged. Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs .. IMPALA-5498: Support for partial sorts in Kudu INSERTs Impala currently supports total sorts (the entire set of data is sorted) and top-n sorts (only the highest/lowest n elements are sorted). This patch adds the ability to do partial sorts, where the data is divided up into some number of subsets, each of which is sorted individually. It accomplishes this by adding a new exec node, PartialSortNode. When PartialSortNode::GetNext() is called, it retrieves input up to the query memory limit, uses the existing Sorter class to sort it, and outputs it. This is faster than a total sort with SortNode as it avoids the need to spill if the input is larger than the memory limit. Future work will look into setting a more restrictive memory limit on the PartialSortNode. (IMPALA-5669) In the planner, the SortNode plan node is used, with an enum value indicating if it is a total or partial sort. This also adds a new counter 'RunSize' to the runtime profile which tracks the min, max, and avg size of the generated runs, in tuples. As a first use case, partial sort is used where a total sort was used previously for inserts/upserts into Kudu tables only. Future work can extend this to other table sinks. (IMPALA-5649) Testing: - E2E test with a large INSERT into a Kudu table with a mem limit. Checks that no spills occurred. - Updated planner tests. - Existing E2E tests and stress test verify correctness of INSERT. - Perf tests on the 10 node cluster: inserting tpch_100.lineitem into a Kudu table with mem_limit=3gb: Previously: 5 runs are spilled, sort took 7m33s Now: no spills, sort takes 6m19s, for ~18% speedup Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Reviewed-on: http://gerrit.cloudera.org:8080/7267 Reviewed-by: Thomas Tauber-MarshallTested-by: Impala Public Jenkins --- M be/src/exec/CMakeLists.txt M be/src/exec/exec-node.cc A be/src/exec/partial-sort-node.cc A be/src/exec/partial-sort-node.h M be/src/exec/sort-node.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M be/src/util/runtime-profile-counters.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/Planner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test 16 files changed, 459 insertions(+), 70 deletions(-) Approvals: Impala Public Jenkins: Verified Thomas Tauber-Marshall: Looks good to me, approved -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: merged Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-Marshall Gerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs .. Patch Set 8: Verified+1 -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Impala Public Jenkins has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs .. Patch Set 8: Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/908/ -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs .. Patch Set 8: Code-Review+2 Rebased -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 8 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs .. Patch Set 7: Code-Review+2 You probably need to rebase -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs .. Patch Set 6: (14 comments) http://gerrit.cloudera.org:8080/#/c/7267/6//COMMIT_MSG Commit Message: Line 7: IMPALA-5498: Support for partial sorts > lets specify in Kudu for now Done PS6, Line 28: A > nit: This also adds... Done Line 32: used previously for inserts into Kudu. > "for inserts into Kudu"->"for DML statements into Kudu tables only. Future This only applies to inserts/upserts. We still haven't done the partitioning/sorting for update/delete. http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/exec/partial-sort-node.h File be/src/exec/partial-sort-node.h: PS6, Line 62: derived based > awkward/confusing Done PS6, Line 62: sort_exec_exprs_ > update Done http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: PS6, Line 1383: RunSize > can you indicate the units, e.g. #rows -> NumRowsPerRun? Done PS6, Line 1387: if (no_spill_) { : // 'min_buffers_required' is 1 since the sorter will only have 1 run at a time, so : // there won't be any merges. : min_buffers_required = 1; : } else { : min_buffers_required = MIN_BUFFERS_PER_MERGE; : } > nit: ternary operator Done http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h File be/src/runtime/sorter.h: PS6, Line 45: Specifying 'no_spill' as true is > For this use case, 'no_spill' should be set to true so the sorter can reduc Done PS6, Line 103: no_spill > A positive argument would be marginally easier to follow. E.g. 'enable_spil Of course, I called it no_spill to mirror the function names. Do you think we should also change AddBatch/AddBatchNoSpill -> AddBatchWithSpill/AddBatch? PS6, Line 103: bool no_spill = false > comment on this even though it's mentioned above Done Line 117: /// up, it is sorted and a new unsorted run is created. > comment that this may not be called if no_spill is true Done PS6, Line 127: InputDone > comment that this may not be called if no_spill is true So this does have a DCHECK in it, but only if there's more than one run (which isn't possible unless AddBatch() was called), so its fine to call with either value of enable_spilling. The other cases would be the various private *Merge* functions. I didn't add DCHECKs/documentation to those cause it seems cluttered and the DCHECK here in InputDone() ensures that none of them are called, but I can add it in if you want. http://gerrit.cloudera.org:8080/#/c/7267/6/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: PS6, Line 56: TODO: run experiments to : // determine the value for this, consider making it configurable, enforce it in the BE. > JIRA? Done PS6, Line 124: { : return type_ != TSortType.PARTIAL; : } > nit: oneline? Done -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts in Kudu INSERTs
Hello Tim Armstrong, I'd like you to reexamine a change. Please visit http://gerrit.cloudera.org:8080/7267 to look at the new patch set (#7). Change subject: IMPALA-5498: Support for partial sorts in Kudu INSERTs .. IMPALA-5498: Support for partial sorts in Kudu INSERTs Impala currently supports total sorts (the entire set of data is sorted) and top-n sorts (only the highest/lowest n elements are sorted). This patch adds the ability to do partial sorts, where the data is divided up into some number of subsets, each of which is sorted individually. It accomplishes this by adding a new exec node, PartialSortNode. When PartialSortNode::GetNext() is called, it retrieves input up to the query memory limit, uses the existing Sorter class to sort it, and outputs it. This is faster than a total sort with SortNode as it avoids the need to spill if the input is larger than the memory limit. Future work will look into setting a more restrictive memory limit on the PartialSortNode. (IMPALA-5669) In the planner, the SortNode plan node is used, with an enum value indicating if it is a total or partial sort. This also adds a new counter 'RunSize' to the runtime profile which tracks the min, max, and avg size of the generated runs, in tuples. As a first use case, partial sort is used where a total sort was used previously for inserts/upserts into Kudu tables only. Future work can extend this to other table sinks. (IMPALA-5649) Testing: - E2E test with a large INSERT into a Kudu table with a mem limit. Checks that no spills occurred. - Updated planner tests. - Existing E2E tests and stress test verify correctness of INSERT. - Perf tests on the 10 node cluster: inserting tpch_100.lineitem into a Kudu table with mem_limit=3gb: Previously: 5 runs are spilled, sort took 7m33s Now: no spills, sort takes 6m19s, for ~18% speedup Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 --- M be/src/exec/CMakeLists.txt M be/src/exec/exec-node.cc A be/src/exec/partial-sort-node.cc A be/src/exec/partial-sort-node.h M be/src/exec/sort-node.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M be/src/util/runtime-profile-counters.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/Planner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test 16 files changed, 459 insertions(+), 70 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/7 -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 7 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 6: (14 comments) http://gerrit.cloudera.org:8080/#/c/7267/6//COMMIT_MSG Commit Message: Line 7: IMPALA-5498: Support for partial sorts lets specify in Kudu for now PS6, Line 28: A nit: This also adds... Line 32: used previously for inserts into Kudu. "for inserts into Kudu"->"for DML statements into Kudu tables only. Future work can extend this to other table sinks IMPALA-" Line 42: Now: no spills, sort takes 6m19s, for ~18% speedup comparison vs no partition/sort at all? http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/exec/partial-sort-node.h File be/src/exec/partial-sort-node.h: PS6, Line 62: sort_exec_exprs_ update PS6, Line 62: derived based awkward/confusing http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: PS6, Line 1383: RunSize can you indicate the units, e.g. #rows -> NumRowsPerRun? PS6, Line 1387: if (no_spill_) { : // 'min_buffers_required' is 1 since the sorter will only have 1 run at a time, so : // there won't be any merges. : min_buffers_required = 1; : } else { : min_buffers_required = MIN_BUFFERS_PER_MERGE; : } nit: ternary operator http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h File be/src/runtime/sorter.h: PS6, Line 45: Specifying 'no_spill' as true is For this use case, 'no_spill' should be set to true so the sorter can reduce... PS6, Line 103: bool no_spill = false comment on this even though it's mentioned above Line 117: /// up, it is sorted and a new unsorted run is created. comment that this may not be called if no_spill is true PS6, Line 127: InputDone comment that this may not be called if no_spill is true (are there any other cases I missed?) http://gerrit.cloudera.org:8080/#/c/7267/6/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: PS6, Line 56: TODO: run experiments to : // determine the value for this, consider making it configurable, enforce it in the BE. JIRA? PS6, Line 124: { : return type_ != TSortType.PARTIAL; : } nit: oneline? -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 6: Code-Review+1 (1 comment) LGTM aside from the variable name http://gerrit.cloudera.org:8080/#/c/7267/6/be/src/runtime/sorter.h File be/src/runtime/sorter.h: PS6, Line 103: no_spill A positive argument would be marginally easier to follow. E.g. 'enable_spilling' -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 6: (1 comment) > Can you extend the Sort metrics in the query profile to include > Avg, Min and Max run size? > This will be needed to reflect on the quality of the sorted data? Done. http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h File be/src/runtime/sorter.h: PS4, Line 99: to the merger and retrieve r > I don't feel strongly - the change you made looks good to me. Maybe 'is_par Done -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has uploaded a new patch set (#6). Change subject: IMPALA-5498: Support for partial sorts .. IMPALA-5498: Support for partial sorts Impala currently supports total sorts (the entire set of data is sorted) and top-n sorts (only the highest/lowest n elements are sorted). This patch adds the ability to do partial sorts, where the data is divided up into some number of subsets, each of which is sorted individually. It accomplishes this by adding a new exec node, PartialSortNode. When PartialSortNode::GetNext() is called, it retrieves input up to its memory limit, uses the existing Sorter class to sort it, and outputs it. This is faster than a total sort with SortNode as it avoids the need to spill if the input is larger than the memory limit. Future work will look into setting a more restrictive memory limit on the PartialSortNode. In the planner, the SortNode plan node is used, with an enum value indicating if it is a total or partial sort. Adds a new counter 'RunSize' to the runtime profile which tracks the min, max, and avg size of the generated runs, in tuples. As a first use case, partial sort is used where a total sort was used previously for inserts into Kudu. Testing: - E2E test with a large INSERT into a Kudu table with a mem limit. Checks that no spills occurred. - Updated planner tests. - Existing E2E tests and stress test verify correctness of INSERT. - Perf tests on the 10 node cluster: inserting tpch_100.lineitem into a Kudu table with mem_limit=3gb: Previously: 5 runs are spilled, sort took 7m33s Now: no spills, sort takes 6m19s, for ~18% speedup Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 --- M be/src/exec/CMakeLists.txt M be/src/exec/exec-node.cc A be/src/exec/partial-sort-node.cc A be/src/exec/partial-sort-node.h M be/src/exec/sort-node.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M be/src/util/runtime-profile-counters.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/Planner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test 16 files changed, 460 insertions(+), 67 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/6 -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 6 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Mostafa Mokhtar has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 4: Can you extend the Sort metrics in the query profile to include Avg, Min and Max run size? This will be needed to reflect on the quality of the sorted data? -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h File be/src/runtime/sorter.h: PS4, Line 99: int64_t min_buffers_required > If you prefer, I think its also reasonable to pass a 'is_partial" flag here I don't feel strongly - the change you made looks good to me. Maybe 'is_partial' would be slightly better so that it's clearer what the expected usage pattern is, but I really don't feel strongly at all. -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Matthew Jacobs has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java File fe/src/main/java/org/apache/impala/planner/Planner.java: Line 527: if (!insertStmt.hasNoClusteredHint() && !ctx_.isSingleNodeExec()) { > We felt it was better to just do Kudu for now to keep this patch simpler an Also it's not just a matter of enabling it here. Some work is required in the BE to ensure coordination between the sorter (doing a partial sort) and the writing of files to make sure that a given file isn't written out of order, i.e. that new files get written at the partial sort boundaries. -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 5: (5 comments) http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/exec/partial-sort-node.cc File be/src/exec/partial-sort-node.cc: Line 111: num_rows_returned_ += row_batch->num_rows(); > Do we need to update num_rows_returned_ here? Done Line 148: } > We can't have a partial sort in a subplan, right? It's probably best to jus Done PS4, Line 149: > Let's use nullptr consistently in the new code. Done http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h File be/src/runtime/sorter.h: PS4, Line 99: ples in place. > I feel like this is breaking some of the encapsulation of the sort implemen If you prefer, I think its also reasonable to pass a 'is_partial" flag here. Both solutions are kind of weird, but with the flag we could DCHECK in various places to ensure that callers are correctly following the intention here of not mixing calls to AddBatch() and AddBatchNoSpill() (though I think it would work as expected if they did). http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java File fe/src/main/java/org/apache/impala/planner/Planner.java: Line 527: if (!insertStmt.hasNoClusteredHint() && !ctx_.isSingleNodeExec()) { > Why is this limited to Kudu? We felt it was better to just do Kudu for now to keep this patch simpler and allow us to exercise it with the stress test before using it more broadly. I filed: IMPALA-5649 to track applying this to HDFS tables. -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has uploaded a new patch set (#5). Change subject: IMPALA-5498: Support for partial sorts .. IMPALA-5498: Support for partial sorts Impala currently supports total sorts (the entire set of data is sorted) and top-n sorts (only the highest/lowest n elements are sorted). This patch adds the ability to do partial sorts, where the data is divided up into some number of subsets, each of which is sorted individually. It accomplishes this by adding a new exec node, PartialSortNode. When PartialSortNode::GetNext() is called, it retrieves input up to its memory limit, uses the existing Sorter class to sort it, and outputs it. This is faster than a total sort with SortNode as it avoids the need to spill if the input is larger than the memory limit. Future work will look into setting a more restrictive memory limit on the PartialSortNode. In the planner, the SortNode plan node is used, with an enum value indicating if it is a total or partial sort. As a first use case, partial sort is used where a total sort was used previously for inserts into Kudu. Testing: - E2E test with a large INSERT into a Kudu table with a mem limit. Checks that no spills occurred. - Updated planner tests. - Existing E2E tests and stress test verify correctness of INSERT. - Perf tests on the 10 node cluster: inserting tpch_100.lineitem into a Kudu table with mem_limit=3gb: Previously: 5 runs are spilled, sort took 7m33s Now: no spills, sort takes 6m19s, for ~18% speedup Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 --- M be/src/exec/CMakeLists.txt M be/src/exec/exec-node.cc A be/src/exec/partial-sort-node.cc A be/src/exec/partial-sort-node.h M be/src/exec/sort-node.cc M be/src/exec/sort-node.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/Planner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test 16 files changed, 470 insertions(+), 76 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/5 -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 5 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Mostafa Mokhtar has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 4: (1 comment) http://gerrit.cloudera.org:8080/#/c/7267/4/fe/src/main/java/org/apache/impala/planner/Planner.java File fe/src/main/java/org/apache/impala/planner/Planner.java: Line 527: if (!insertStmt.hasNoClusteredHint() && !ctx_.isSingleNodeExec()) { Why is this limited to Kudu? Inserting into Partitioned HDFS tables is a very common use case. -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Mostafa Mokhtar Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/exec/partial-sort-node.cc File be/src/exec/partial-sort-node.cc: Line 111: *eos = true; Do we need to update num_rows_returned_ here? Line 148: We can't have a partial sort in a subplan, right? It's probably best to just put a DCHECK(false) here since there's no way to test the code. PS4, Line 149: :Res Let's use nullptr consistently in the new code. http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h File be/src/runtime/sorter.h: Line 96: Sorter(const TupleRowComparator& compare_less_than, > This will be included in a follow up patch. WFM. If you want to wait for https://gerrit.cloudera.org/#/c/5801/ to land there will be a way to directly set a maximum on bytes of buffers in the planner, which would accomplish the same thing. http://gerrit.cloudera.org:8080/#/c/7267/4/be/src/runtime/sorter.h File be/src/runtime/sorter.h: PS4, Line 99: RuntimeProfile* profile, Run I feel like this is breaking some of the encapsulation of the sort implementation. This detail leaks out into the planner anyway so I guess it's not really encapsulated. How about we just document more clearly what value is expected. The class comment talks about the number of blocks per run, so it could be extended to talk about the minimum requirement for spilling and non-spilling sorts. http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: PS3, Line 349: highest/lowest > My problem with that wording is that it could be interpreted as "take the f WFM -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has uploaded a new patch set (#4). Change subject: IMPALA-5498: Support for partial sorts .. IMPALA-5498: Support for partial sorts Impala currently supports total sorts (the entire set of data is sorted) and top-n sorts (only the highest/lowest n elements are sorted). This patch adds the ability to do partial sorts, where the data is divided up into some number of subsets, each of which is sorted individually. It accomplishes this by adding a new exec node, PartialSortNode. When PartialSortNode::GetNext() is called, it retrieves input up to its memory limit, uses the existing Sorter class to sort it, and outputs it. This is faster than a total sort with SortNode as it avoids the need to spill if the input is larger than the memory limit. Future work will look into setting a more restrictive memory limit on the PartialSortNode. In the planner, the SortNode plan node is used, with an enum value indicating if it is a total or partial sort. As a first use case, partial sort is used where a total sort was used previously for inserts into Kudu. Testing: - E2E test with a large INSERT into a Kudu table with a mem limit. Checks that no spills occurred. - Updated planner tests. - Existing E2E tests and stress test verify correctness of INSERT. - Perf tests on the 10 node cluster: inserting tpch_100.lineitem into a Kudu table with mem_limit=3gb: Previously: 5 runs are spilled, sort took 7m33s Now: no spills, sort takes 6m19s, for ~18% speedup Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 --- M be/src/exec/CMakeLists.txt M be/src/exec/exec-node.cc A be/src/exec/partial-sort-node.cc A be/src/exec/partial-sort-node.h M be/src/exec/sort-node.cc M be/src/exec/sort-node.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/Planner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test 16 files changed, 458 insertions(+), 72 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/4 -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 4: (11 comments) http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.cc File be/src/exec/partial-sort-node.cc: Line 32: sorter_(NULL), > This doesn't need to handle offset? Done Line 127: > Since this only requires a partial sort, you can exit early when the cumula As noted elsewhere, removing limit-related code. Line 140: *eos = input_eos_; > I guess it is not clear to me why allowing a limit without an offset is use Right, so 'offset' never really makes sense with partial-sort-node, since as you say there's not a total ordering and you're not guaranteed that one instance of partial-sort-node with an offset can pick up where another instance left off with a limit. A limit is potentially useful just for cutting off the number of rows returned in a single partial-sort-node, but with the initial use case for this there's no way to end up with a partial-sort-node with a limit, so in the interest of keeping things simple and not having untested code paths, I think I'll just remove the limit code. http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.h File be/src/exec/partial-sort-node.h: Line 58: virtual Status QueryMaintenance(RuntimeState* state); > Just for general niceness, can you also update be/src/exec/sort-node.h's pr Done http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: Line 1384: expr_mem_pool, _tuple_expr_evals_)); > Need to update this calculation for partial sorts. Done Line 1434: if (sorted_runs_.size() == 1) { > Unnecessary - this is unconditionally dereferenced Done http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h File be/src/runtime/sorter.h: Line 96: Sorter(const TupleRowComparator& compare_less_than, > Wasn't the plan to apply the partition sort memory limit to the sorter itse This will be included in a follow up patch. http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: PS3, Line 349: e { > Should this be "first" since the order is specified elsewhere. My problem with that wording is that it could be interpreted as "take the first N elements of input and then sort them" Maybe: "Return the first N sorted elements"? Line 360: struct TSortNode { > Please ignore my earlier comments about offset handling (although a comment Added to partial-sort-node.h http://gerrit.cloudera.org:8080/#/c/7267/3/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: Line 58: private final long PARTIAL_SORT_MEM_LIMIT = 128 * 1024 * 1024; > It might be clearer to specify this in bytes, then just make sure that its Done Line 286: // The memory limit cannot be less than the size of the required blocks. > This looks good to me. Done -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 4 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Zach Amsden has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 3: (6 comments) Looks promising! http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.cc File be/src/exec/partial-sort-node.cc: Line 32: sorter_(NULL), This doesn't need to handle offset? Line 127: input_batch_index_ += num_processed; Since this only requires a partial sort, you can exit early when the cumulative number of rows processed exceeds the limit. This might end up sorting significantly fewer rows (at the cost of less well sorted data when a limit is present). Line 140: if (ReachedLimit()) { I guess it is not clear to me why allowing a limit without an offset is useful. If there is a limit, the total ordering of rows is undefined and there is no way to continue fetching them without also allowing an offset. Maybe document this limitation? http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/exec/partial-sort-node.h File be/src/exec/partial-sort-node.h: Line 58: Just for general niceness, can you also update be/src/exec/sort-node.h's private SortInput method to have WARN_UNUSED_RESULT? http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: Line 1434: DCHECK(unsorted_run_ != NULL); Unnecessary - this is unconditionally dereferenced http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 360: // Not used with TSortType::PARTIAL. Please ignore my earlier comments about offset handling (although a comment that offsets are not supported in exec/partion-sode-node.cc is still welcome. -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-Reviewer: Zach Amsden Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Tim Armstrong has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/7267/1/be/src/exec/partial-sort-node.cc File be/src/exec/partial-sort-node.cc: Line 104: sorter_->Reset(); > To be clear - what you're saying is that Open() will want to allocate memor I wouldn't expect incorrect results, but it could result in queries failing with OOM when they shouldn't. http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.cc File be/src/runtime/sorter.cc: Line 1384: int min_buffers_required = MIN_BUFFERS_PER_MERGE; Need to update this calculation for partial sorts. http://gerrit.cloudera.org:8080/#/c/7267/3/be/src/runtime/sorter.h File be/src/runtime/sorter.h: Line 96: Sorter(const TupleRowComparator& compare_less_than, Wasn't the plan to apply the partition sort memory limit to the sorter itself? http://gerrit.cloudera.org:8080/#/c/7267/3/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: PS3, Line 349: highest/lowest Should this be "first" since the order is specified elsewhere. Maybe: "Return the first N elements in sorted order"? http://gerrit.cloudera.org:8080/#/c/7267/3/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: Line 58: private final long PARTIAL_SORT_MEM_LIMIT = BackendConfig.INSTANCE.getReadSize() * 100; It might be clearer to specify this in bytes, then just make sure that its >= 2 * read_size. Although we really don't expect people to change read_size. Line 286: resourceProfile_ = new ResourceProfile( This looks good to me. -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has posted comments on this change. Change subject: IMPALA-5498: Support for partial sorts .. Patch Set 3: (3 comments) http://gerrit.cloudera.org:8080/#/c/7267/2/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: Line 348: > these could use comments, especially since "partial sort" often means what Done http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java File fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java: PS2, Line 296: boolean useTopN = stmt.hasLimit() && !disableTopN; > long line Done http://gerrit.cloudera.org:8080/#/c/7267/2/fe/src/main/java/org/apache/impala/planner/SortNode.java File fe/src/main/java/org/apache/impala/planner/SortNode.java: PS2, Line 93: > Maybe make this one a static factory for consistency. I like them- very rea Done -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-5498: Support for partial sorts
Thomas Tauber-Marshall has uploaded a new patch set (#3). Change subject: IMPALA-5498: Support for partial sorts .. IMPALA-5498: Support for partial sorts Impala currently supports total sorts (the entire set of data is sorted) and top-n sorts (only the highest/lowest n elements are sorted). This patch adds the ability to do partial sorts, where the data is divided up into some number of subsets, each of which is sorted individually. It accomplishes this by adding a new exec node, PartialSortNode. When PartialSortNode::GetNext() is called, it retrieves input up to its memory limit, uses the existing Sorter class to sort it, and outputs it. This is faster than a total sort with SortNode as it avoids the need to spill if the input is larger than the memory limit. Future work will look into setting a more restrictive memory limit on the PartialSortNode. In the planner, the SortNode plan node is used, with an enum value indicating if it is a total or partial sort. As a first use case, partial sort is used where a total sort was used previously for inserts into Kudu. Testing: - E2E test with a large INSERT into a Kudu table with a mem limit. Checks that no spills occurred. - Existing E2E tests and stress test verify correctness of INSERT. - Perf tests on the 10 node cluster: inserting tpch_100.lineitem into a Kudu table with mem_limit=3gb: Previously: 5 runs are spilled, sort took 7m33s Now: no spills, sort takes 6m19s, for ~18% speedup Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 --- M be/src/exec/CMakeLists.txt M be/src/exec/exec-node.cc A be/src/exec/partial-sort-node.cc A be/src/exec/partial-sort-node.h M be/src/runtime/sorter.cc M be/src/runtime/sorter.h M common/thrift/PlanNodes.thrift M fe/src/main/java/org/apache/impala/planner/AnalyticPlanner.java M fe/src/main/java/org/apache/impala/planner/Planner.java M fe/src/main/java/org/apache/impala/planner/SingleNodePlanner.java M fe/src/main/java/org/apache/impala/planner/SortNode.java M testdata/workloads/functional-planner/queries/PlannerTest/kudu-upsert.test M testdata/workloads/functional-planner/queries/PlannerTest/kudu.test M testdata/workloads/functional-query/queries/QueryTest/kudu_insert.test 14 files changed, 431 insertions(+), 50 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/67/7267/3 -- To view, visit http://gerrit.cloudera.org:8080/7267 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38 Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Thomas Tauber-MarshallGerrit-Reviewer: Dan Hecht Gerrit-Reviewer: Matthew Jacobs Gerrit-Reviewer: Thomas Tauber-Marshall Gerrit-Reviewer: Tim Armstrong