subject:"\[Impala\-ASF\-CR\] IMPALA\-5706\: Spilling sort optimisations"

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-06-11 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Reviewed-on: http://gerrit.cloudera.org:8080/9943
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 149 insertions(+), 120 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 19
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-06-11 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 18: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 18
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 12 Jun 2018 01:15:43 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-06-11 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 18:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/39/


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 18
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 11 Jun 2018 21:49:37 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-06-11 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 17:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun-tarmstrong/39/


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 11 Jun 2018 21:49:32 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-06-11 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 18: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 18
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 11 Jun 2018 21:49:36 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-06-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 17:

Let's get this one in.


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 11 Jun 2018 21:49:23 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-30 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 17: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 30 May 2018 16:26:34 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-30 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 17:

Sorry for the delay here, got busy with some unplanned things.


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 30 May 2018 16:26:44 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-29 Thread Gabor Kaszab (Code Review)

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 17: Verified+1

I've spent some time (way more than wanted to:) ) to fix 
test_multiple_mem_limit, however no matter what I tried the assert on the 
number of merges kept failing in gerrit-verify-dryrun meanwhile it succeeded 
both locally and in any other jenkins build I tried.
I decided to remove the assert on # of merges from that test as I haven't found 
a way to make it succeed in the verify job. Now it's all green:
https://jenkins.impala.io/job/gerrit-verify-dryrun/2560/


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 29 May 2018 07:17:51 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-28 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#17).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 149 insertions(+), 120 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/17
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-28 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#16).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
A 
testdata/workloads/functional-query/queries/QueryTest/partitions-with-same-location.test
M tests/query_test/test_sort.py
4 files changed, 149 insertions(+), 120 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/16
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-25 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#15).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 153 insertions(+), 120 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/15
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 14: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2522/


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 22 May 2018 15:45:37 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-22 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2522/


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 22 May 2018 12:21:48 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-21 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 14: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 21 May 2018 22:51:37 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-21 Thread Gabor Kaszab (Code Review)

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc@1676
PS8, Line 1676: num_required_buffers -= 
sorted_runs_[j]->HasVarLenPages() ? 2 : 1;
> Thanks for the heads-up! I count for 2 buffers for an intermediate result t
There is an issue with allocation 2 buffers unconditionally for the result run: 
In case there are no var len pages for any of the runs, then the # of available 
buffers will be 3 (as a result of ComputeMinReservation()) but if out of those 
two are taken by the merge result then won't be enough buffers to merge at 
least two runs.
So I have to take it into account that check in Run::Init() when I calculate 
the number of buffers for the merge result (if (has_var_len_slots_)).



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 21 May 2018 16:56:37 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-21 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#14).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 152 insertions(+), 116 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/14
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-19 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 13: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/2512/


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 19 May 2018 10:18:46 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-19 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 13:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/2512/


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 19 May 2018 07:18:38 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-18 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 13: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 18 May 2018 20:48:57 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-18 Thread Gabor Kaszab (Code Review)

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9943/12/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

http://gerrit.cloudera.org:8080/#/c/9943/12/tests/query_test/test_sort.py@135
PS12, Line 135: selected as the pivot (the old method), the sorter tends to 
get stuck selecting the
> I gave this a second look and apparently I mis-measured something the last
As discussed offline, I couldn't reduce the runtime of this test. As agreed, 
I'm dropping it as the majority of the spilling sorts have already been covered 
by the existing tests.



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 18 May 2018 10:20:26 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-18 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#13).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 149 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/13
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-17 Thread Gabor Kaszab (Code Review)

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc@1676
PS8, Line 1676: --num_runs_in_one_merge;
> It's possible to have var-len slots but no var-len pages if all strings are
Thanks for the heads-up! I count for 2 buffers for an intermediate result then.


http://gerrit.cloudera.org:8080/#/c/9943/12/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

http://gerrit.cloudera.org:8080/#/c/9943/12/tests/query_test/test_sort.py@135
PS12, Line 135: doesn't consume all the memory from the second Sort. This 
query takes approx. 25s to
I gave this a second look and apparently I mis-measured something the last 
time. This in fact runs for at leas 25s according to my measurements. I spent a 
few hours to reduce this but according to my observations in order to see this 
behavior manifest in the # of merges we need this scale of data and runtime.
I'm open for suggestions about the next step here. I thought this shouldn't run 
in core tests for sure, so I restricted it for the exhaustive runs. However, 
that would run this test in a number of permutations of test configs and 
multiplying that with this 25s per run I find it too much.
Currently I'd prefer just to drop this test.

What do you think?



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 17 May 2018 15:21:11 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-17 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#12).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 188 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/12
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-17 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#11).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 188 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/11
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-17 Thread Gabor Kaszab (Code Review)

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 10:

FYI, ran an exhaustive suite on this change and some test_insert_overwrite 
tests failed. I guess because https://issues.apache.org/jira/browse/IMPALA-7023.
Other than that, all is green.


--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 17 May 2018 10:42:48 +
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-16 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc@1676
PS8, Line 1676:   num_required_buffers += var_len_pages_found ? 2 : 1;
> As far as I see in Run::Init() it already knows if the run has var_len_slot
It's possible to have var-len slots but no var-len pages if all strings are 
NULL or empty. I've run into bugs with that edge case a few times before.



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 16 May 2018 15:00:26 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-16 Thread Gabor Kaszab (Code Review)

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 10:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.h@193
PS8, Line 193:   /// function calculates the maximum number of runs that can be 
taken care of during the
> this sentence seems garbled. "can be taken care of"?
Done


http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.h@199
PS8, Line 199:   /// round of merging. Returns at most MaxRunsInNextMerge(), so 
the Sorter will have
> Can you mention the invariant that we have enough reservation for this numb
Done


http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc@1676
PS8, Line 1676:   num_required_buffers += var_len_pages_found ? 2 : 1;
> I think we need two buffers regardless if there are var-len slots, since Ru
As far as I see in Run::Init() it already knows if the run has var_len_slots 
(from TupleDescriptor given as constructor param) and add a page for that only 
in case it does.


http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc@1679
PS8, Line 1679:  i -
> nit: i - 1
Done


http://gerrit.cloudera.org:8080/#/c/9943/8/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

http://gerrit.cloudera.org:8080/#/c/9943/8/tests/query_test/test_sort.py@104
PS8, Line 104: orders o2 on (o1.o_orderkey = o2.o_orderkey) order by 
o1.o_orderdate limit 10"""
> Should we also set num_nodes=1 here? Otherwise it might not be robust when
Good point. Done


http://gerrit.cloudera.org:8080/#/c/9943/10/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

http://gerrit.cloudera.org:8080/#/c/9943/10/tests/query_test/test_sort.py@142
PS10, Line 142: exec_option['buffer_pool_limit'] = "22m"
Had to change this as the minimum mem requirement increased for this query. I 
guess that is because I pulled in another change with the rebase that affected 
this. As a result, the total merges done by the first sort decreased.



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 16 May 2018 13:29:18 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-16 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#10).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 191 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/10
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-15 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.h
File be/src/runtime/sorter.h:

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.h@193
PS8, Line 193:   /// function calculates the maximum number of runs that can be 
taken care during the
this sentence seems garbled. "can be taken care of"?


http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.h@199
PS8, Line 199:   /// round of merging.
Can you mention the invariant that we have enough reservation for this number 
of runs. E.g. "Returns at most MaxRunsInNextMerge(), so the sorter will have 
enough reservation to merge this number of runs".


http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc@1676
PS8, Line 1676:   num_required_buffers += var_len_pages_found ? 2 : 1;
I think we need two buffers regardless if there are var-len slots, since 
Run::Init() always allocates a var-len page.


http://gerrit.cloudera.org:8080/#/c/9943/8/be/src/runtime/sorter.cc@1679
PS8, Line 1679:  i-1
nit: i - 1


http://gerrit.cloudera.org:8080/#/c/9943/8/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

http://gerrit.cloudera.org:8080/#/c/9943/8/tests/query_test/test_sort.py@104
PS8, Line 104: assert "TotalMergesPerformed: 1" in 
query_result.runtime_profile
Should we also set num_nodes=1 here? Otherwise it might not be robust when 
running on different clusters.



--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 15 May 2018 23:30:40 +
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-15 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#9).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 189 insertions(+), 114 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/9
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-15 Thread Gabor Kaszab (Code Review)

Gabor Kaszab has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/9943 )

Change subject: IMPALA-5706: Spilling sort optimisations
..


Patch Set 8:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc
File be/src/runtime/sorter.cc:

http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@a1052
PS7, Line 1052:
> Can we preserve this DCHECK but make it about the next page being pinned, o
I think you meant the page under page_index + 2 as the one on page_index + 1 
should be always pinned due to double buffering. (And I DCHECK on this 
condition already.)

I'll add the DCHECK you mention to "if(is_pinned_)", however, it has a bit 
strange aftertaste :)

Update: I'll revert this as double-buffering is removed from the scope of this 
patch.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@880
PS7, Line 880:
> Comment needs updating.
Done


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@881
PS7, Line 881:   // Pins the first var len page if there is any.
> Might be cleaner to rewrite these as loops, e.g.
It's rather min(2, fixed_len_pages_.size()) but get the point :) Was wondering 
whether to have nested if's or a loop. Went for the first one but this in fact 
seems cleaner.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1049
PS7, Line 1049: st nit: idx/index inconsistency. I'm fine with either but the inconsistency in
Sure, will follow how it was before.

Update: not valid since double-buffering is no longer included.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1052
PS7, Line 1052: }
> This DCHECK doesn't seem useful - if we messed up the vector addressing ari
Thx, done.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1551
PS7, Line 1551: RETURN_IF_ERROR(AddBatchNoSpill(batch, cur_batch_index, 
_processed));
> Not a big deal but we could reduce the vertical whitespace in this function
I personally prefer vertical segmentation of the code but as I see in Impala 
the short functions are usually dense, so I'm fine removing the empty lines.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1682
PS7, Line 1682: if (num_required_buffers <= num_available_buffers) 
break;
> Maybe use range-based for loop? I find it more readable personally.
Done


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1706
PS7, Line 1706:
> Couldn't we compute this more accurately by doing a pass over sorted_runs_?
With dropping double-buffering this logic changes a bit. We always have 
fixed_len_pages so 1 buffer per run is needed for that, and in case that run 
has var_len_pages then it needs an additional buffer.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1721
PS7, Line 1721:   }
> I think my above comment about accurately computing the requirement for the
I added a more sophisticated calculation for getting the max runs in the next 
merge. It feels for me that I over-engineered it a little bit, but now it takes 
into account whether each run has var len pages or not.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1727
PS7, Line 1727:
> I had a bit of trouble reasoning through whether this calculation was corre
Nice catch! This in fact was incorrect.


http://gerrit.cloudera.org:8080/#/c/9943/7/be/src/runtime/sorter.cc@1730
PS7, Line 1730:   // its runs.
> If the goal is to minimize the number of "extra merges" per row, then it is
Interesting question. The current solution was meant to balance between 
optimising the distribution of runs between merges and not over-engineer the 
code.
Your solution also seems good, though. I'm wondering if finishing the 
intermediate merges quicker but then putting more stress on GetNext() where the 
final merge is done is actually better or not than the current solution of 
distributing the runs evenly. It's true that the amount of merges done on one 
particular row is lower with yours.
Let me give it a try, thanks for the suggestion.


http://gerrit.cloudera.org:8080/#/c/9943/7/testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test:

PS7:
> IMPALA-4835 probably caused a bunch of conflicts in the planner tests. Hope
I'll revert these changes as double-buffering is removed from this patch set.


http://gerrit.cloudera.org:8080/#/c/9943/7/tests/query_test/test_sort.py
File tests/query_test/test_sort.py:

http://gerrit.cloudera.org:8080/#/c/9943/7/tests/query_test/test_sort.py@45
PS7, Line 45:   def test_multiple_mem_limits(self, vector):
> Are these tests now tuned for a 3-node minicluster? Maybe we should skip va
Good point I haven't thought about that.

One note is that according to the comments, the purpose of this test is to 
trigger sorts with 0, 1

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

2018-05-15 Thread Gabor Kaszab (Code Review)

Hello Tim Armstrong, Csaba Ringhofer,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/9943

to look at the new patch set (#8).

Change subject: IMPALA-5706: Spilling sort optimisations
..

IMPALA-5706: Spilling sort optimisations

This patch covers multiple changes with the purpose of optimizing
spilling sort mechanism:
  - Remove the hard-coded maximum limit of buffers that can be used
for merging the sorted runs. Instead this number is calculated
based on the available memory through buffer pool.
  - The already sorted runs are distributed more optimally between
the last intermediate merge and the final merge to avoid that a
heavy intermediate merge is followed by a light final merge.
  - Right before starting the merging phase Sorter tries to allocate
additional memory through the buffer pool.
  - An output run is not allocated anymore for the final merge.

Note, double-buffering the runs during a merge was also planned with
this patch. However, performance testing showed that except some
exotic queries with unreasonably small amount of buffer pool memory
available double-buffering doesn't add to the overall performance.
It's basically because the half of the available buffers have to be
sacrificed to do double-buffering and as a result the merge tree can
get deeper. In addition the amount of I/O wait time is not reaching
the level where double-buffering could countervail the reduced number
of runs during a particular merge.

Performance measurements were made during manual testing to verify
that this is in fact an optimization:
  - In case doing a sort on top of a join when working with a
restricted amount of memory then the Sort node successfully
allocates additional memory right before the merging phase. This
is feasible because once Join finishes sending new input data and
calls InputDone() then it releases memory that can be picked up
by the Sorter. This results in shallower merging trees (more runs
grabbed for a merge).
  - On a multi-node cluster I verified that in cases when at least one
merging step is done then this change reduces the execution time
for sorts.
  - The more merging steps are done the bigger the performance gain is
compared to the baseline.

Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
---
M be/src/runtime/sorter.cc
M be/src/runtime/sorter.h
M tests/query_test/test_sort.py
3 files changed, 190 insertions(+), 115 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/43/9943/8
--
To view, visit http://gerrit.cloudera.org:8080/9943
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I74857c1694802e81f1cfc765d2b4e8bc644387f9
Gerrit-Change-Number: 9943
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab 
Gerrit-Reviewer: Csaba Ringhofer 
Gerrit-Reviewer: Gabor Kaszab 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

[Impala-ASF-CR] IMPALA-5706: Spilling sort optimisations

33 matches

Site Navigation

Mail list logo

Footer information