[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Internal Jenkins (Code Review)
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 22: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 22
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Internal Jenkins (Code Review)
Internal Jenkins has submitted this change and it was merged.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Reviewed-on: http://gerrit.cloudera.org:8080/3873
Reviewed-by: Tim Armstrong 
Reviewed-by: Dan Hecht 
Tested-by: Internal Jenkins
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,264 insertions(+), 1,490 deletions(-)

Approvals:
  Internal Jenkins: Verified
  Dan Hecht: Looks good to me, approved
  Tim Armstrong: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 23
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 22: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 22
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 22: Code-Review+2

Rebase, carry +2

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 22
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3873/20/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 606: }
> Done
Oops, added wrong DCHECK. Fixed now


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Alex Behm, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3873

to look at the new patch set (#22).

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..

IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,264 insertions(+), 1,490 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/3873/22
-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 22
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Tim Armstrong (Code Review)
Hello Michael Ho, Alex Behm, Dan Hecht,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3873

to look at the new patch set (#21).

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..

IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,264 insertions(+), 1,490 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/3873/21
-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 21
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 20:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3873/20/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 606: }
> DCHECK(build_->null_aware_partition() == NULL);
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-28 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 20: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3873/20/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 606: }
DCHECK(build_->null_aware_partition() == NULL);


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 20
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-26 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 19:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/blocking-join-node.cc
File be/src/exec/blocking-join-node.cc:

PS14, Line 297: SCOPED_TIMER(build_timer_);
> It's not clear to me that the original timers were well thought out. It did
I am mostly concerned about preserving "BuildPartitionTime" and "BuildTime" 
which exist in the profile now. The former seems to measure the total time to 
partition all input rows into their partitions while the latter measures the 
total time to insert the non-spilled partitions into their corresponding hash 
tables.


http://gerrit.cloudera.org:8080/#/c/3873/19/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

PS19, Line 377: is
if


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

PS14, Line 87: 
> They are redundant currently, but the idea is that PhjBuilder and PhjNode w
OK. To be able to probe the same hash table in parallel, it's necessary to have 
the thread-private build_expr_ctx_.


http://gerrit.cloudera.org:8080/#/c/3873/19/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 77:   builder_.reset(
It helps to add a comment like:

TODO: For now, the build_side_expr_ may be duplicated for now but the builder_ 
need to have the build_side_expr_ once the builder is separated out into a 
different fragment.


PS19, Line 479: if (!output_build_partitions_.empty()) {
  :   DCHECK(NeedToProcessUnmatchedBuildRows());
  : 
  :   // Flush the remaining unmatched build rows of any 
partitions we are done
  :   // processing before moving onto the next partition.
  :   OutputUnmatchedBuild(out_batch);
  :   if (!output_build_partitions_.empty()) break;
It seems unfortunate that we cannot remove this given the change at line 588. I 
suppose this is needed if multiple output batches are needed full to hold all 
the unmatched build rows.

The duplicated code block at line 487 almost looks like we should use a goto 
statement.

Do you think there is a good way to refactor the code a bit ?


Line 554: if (state_ == PARTITIONING_PROBE) {
DCHECK(input_partition_ == NULL);


PS19, Line 599: if (builder_->null_aware_partition() == NULL) {
  :   DCHECK(null_aware_probe_partition_ == NULL);
Not your change but I find this duplicated check with line 501 a bit confusing 
here. Essentially, this is a slightly different check as it covers cases for 
all joins while the one at line 501 seems to cover the case in which the null 
aware partition has been processed completely if I understand it correctly.

I wonder if you can move this check into the if statement at line 599 and make 
it if (build_->null_aware_partition() != NULL) continue;

And move *eos = true; to the common path shared by all join types.


PS19, Line 724: false 
I suppose it's good to avoid small buffers on all probe partitions for 
consistency but does this increase the memory usage of NAAJ ?


Line 740:   if (probe_rows != NULL) {
Is there an error path in which probe_rows == NULL ? Looks like line 725 
implies probe_rows != NULL.


PS19, Line 751: false
Same question as above.


PS19, Line 968: new
nit: newly created


http://gerrit.cloudera.org:8080/#/c/3873/19/be/src/util/runtime-profile.h
File be/src/util/runtime-profile.h:

PS19, Line 129: AddChildFirst
Feel free to ignore but PrependChild() seems more appropriate.


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-21 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 19: Code-Review+1

carry +1

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 19
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-19 Thread Tim Armstrong (Code Review)
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/3873

to look at the new patch set (#18).

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..

IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

The main outcome of this patch is to split out PhjBuilder from
PartitionedHashJoinNode, which manages the build partitions for the
join and implements the DataSink interface.

A lot of this work is fairly mechanical refactoring: dividing the state
between the two classes and duplicating it where appropriate. One major
change is that probe partitions need to be separated from build
partitions.

This required some significant algorithmic changes to memory management
for probe partition buffers: memory management for probe partitions was
entangled with the build partitions: small buffers were allocated for
the probe partitions even for non-spilling joins and the join could
spill additional partitions during probing if the probe partitions needed
to switch from small to I/O buffers. The changes made were:
- Probe partitions are only initialized after the build is partitioned, and
  only for spilled build partitions.
- Probe partitions never use small buffers: once the initial write
  buffer is allocated, appending to the probe partition never fails.
- All probe partition allocation is done after partitioning the build
  and before processing the probe input during the same phase as hash
  table building. (Aside from NAAJ partitions which are allocated
  upfront).

The probe partition changes necessitated a change in
BufferedTupleStream: allocation of write blocks is now explicit via the
PrepareForWrite() API.

Testing:
Ran exhaustive build and local stress test.

Memory Usage:
Ran stress test binary search locally for TPC-DS SF-1 and TPC-H SF-20.
No regressions on TPC-DS. TPC-H either stayed the same or improved in
min memory requirement without spilling, but the min memory requirement
with spilling regressed in some cases. I investigated each of the
significant regressions on TPC-H and determined that they were all due
to exec nodes racing for spillable or non-spillable memory. None of them
were cases where exec nodes got their minimum reservation and failed to
execute the spilling algorithm correctly.

Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
---
M .clang-format
M be/src/codegen/gen_ir_descriptions.py
M be/src/codegen/impala-ir.cc
M be/src/exec/CMakeLists.txt
M be/src/exec/analytic-eval-node.cc
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/data-sink.cc
M be/src/exec/hash-join-node.cc
M be/src/exec/nested-loop-join-builder.cc
M be/src/exec/nested-loop-join-builder.h
M be/src/exec/nested-loop-join-node.cc
M be/src/exec/partitioned-aggregation-node.cc
A be/src/exec/partitioned-hash-join-builder-ir.cc
A be/src/exec/partitioned-hash-join-builder.cc
A be/src/exec/partitioned-hash-join-builder.h
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M be/src/exec/partitioned-hash-join-node.inline.h
M be/src/runtime/buffered-block-mgr.cc
M be/src/runtime/buffered-tuple-stream-test.cc
M be/src/runtime/buffered-tuple-stream.cc
M be/src/runtime/buffered-tuple-stream.h
M be/src/util/runtime-profile.cc
M be/src/util/runtime-profile.h
M tests/stress/concurrent_select.py
27 files changed, 2,242 insertions(+), 1,491 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/73/3873/18
-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 18
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-15 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 17:

(10 comments)

These are the comments from re-reviewing the addressed comments.  Will finish 
off the rest of builder.h and then the .cc files next.

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

PS14, Line 90: 
> It passes from the class to the caller. I guess I don't find this ambiguous
I guess one term used elsewhere is "release", right? (e.g. C++ unique_ptr).  I 
think it's better than "acquire" given the direction of resource transfer, so 
that's one option.  The reworded comment helps too.


PS14, Line 95:  streams are empty but prepared for
 :   /// writing with a write buffer allocated.
 :   std::vector 
AcquireProbeStreams();
> The hash join node uses this to determine whether it's already cleaned up t
Okay for now as an intermediate step, but let's definitely clean it up as work 
continues.


PS14, Line 195:  is destroyed i
> Overall these bool arguments are really confusing, and the correctness of t
Yeah, agree this bool was very confusing. Will take a look at the new stuff.


PS14, Line 234: 
> Yes, since we share the reservation and hand off the probe buffers.
Thanks, the new comment clarifies.  From reading it I couldn't tell if there 
was a "* 2" missing because of the comment.


http://gerrit.cloudera.org:8080/#/c/3873/15/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

Line 372:   { RETURN_IF_ERROR(build_partition->BuildHashTable()); }
extraneous braces


Line 395: }
same


http://gerrit.cloudera.org:8080/#/c/3873/15/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

PS15, Line 81: build
probe


Line 103: ///
Thanks, I think this will make the code much more approachable.


PS15, Line 405:  
nit: extra space


http://gerrit.cloudera.org:8080/#/c/3873/15/be/src/util/runtime-profile.cc
File be/src/util/runtime-profile.cc:

Line 391:   }
alternatively: initialize insert_pos to begin and then DCHECK_NE(insert_pos, 
children_.begin());  Okay to ignore.


-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 17
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-15 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 14:

> (2 comments)
 > 
 > Before I responded to individual comments I wanted to point out
 > that the coupling is now one way - the builder never refers to
 > PhjNode aside from metadata provided via the constructor (you can
 > verify there are no references to partitioned-hash-join-node.h or
 > PartitionedHashJoin node in the builder code).

Yup, that's good.  I thought there was a back reference on my first pass 
through, but I was thinking of the reference from partition back to the 
builder/sink.  Since it is the sink, it seems a little weird (once the builder 
is used by the join node in the other fragment), but okay for this first step.  
Splitting out the sink and builder is something we should revisit later.

 > 
 > At this point it should only require minor changes to this hash
 > join code to create the builder in a separate fragment *before* the
 > join node, provided you have a 1:1 mapping from builder to node.
 > 1:n broadcast joins with no spilling would require some more
 > changes, but is a lot closer.
 > 

Yup, agree.

 > I chose this stopping point because there was a clear physical
 > separation between the classes, even if the spilling algorithm has
 > some logical coupling and assumes a 1:1 mapping from node to
 > builder.
 > 
 > Reducing the logical coupling would require significant changes to
 > the spilling algorithm that is driven from PartitionedHashJoin::GetNext(),
 > which I think should be deferred until there is some consensus
 > about what spilling for 1:n broadcast joins with multithreading
 > might look like.

I'm okay with this point more or less as an incremental first step, as well. I 
just wanted to understand better where we are going longer term and what the 
immediate requirements are.

As mentioned, looking at whether the sink should be just a sink, and then the 
"builder" is something that's driven by both the sink and also the join node.  
But I'm fine with waiting on that since this is a good step in teasing things 
apart.

-- 
To view, visit http://gerrit.cloudera.org:8080/3873
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I1e02ea9c7a7d1a0f373b11aa06c3237e1c7bd4cb
Gerrit-PatchSet: 14
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tim Armstrong 
Gerrit-Reviewer: Alex Behm 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Michael Ho
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-15 Thread Dan Hecht (Code Review)
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 14:

(19 comments)

Flushing out some more comments, and will move onto the later patch set.

One thing that's not clear to me yet is what the right separation between node 
and sink is -- they are currently very tightly coupled which means they don't 
really benefit from being separated into separate classes.  Is the goal of this 
change solely to be able to expose the DataSink interface, or is it to provide 
some separation as well? Do we see them becoming less coupled over time?

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

PS14, Line 45: build row partitions
is this the same as the "hash partitions" referred to in the next paragraph. 
let's be consistent in terminology.


Line 86: 
is it expected that the renaming public functions would only be used by PHJN?  
If so, would be good to state that.


PS14, Line 90: Acquire ownership
Doesn't the ownership pass from this class to the exec node?  In which case 
this method wouldn't acquire ownership.  So let's come up with better 
terminology and naming.


PS14, Line 95: Called after probing of the partitions
 :   /// is done. The partitions are not closed or destroyed, since 
they may be spilled or
 :   /// may contain unmatched build rows for certain join modes 
(e.g. right outer join).
The fact that we need some much explanation kinda makes this seem like it might 
not be the right abstraction.  Could we instead just clear the vector when 
creating the next set of partitions (it doesn't look like we delete partitions 
until either Close() or Reset() anyway), so why do we need to proactively 
clearly the hash_partitions_ vector before starting he next iteration?


Line 106:   }
nit: missing blank line


PS14, Line 109: When this call returns
Is this stuff by this method or the caller?  i.e. if this method closes 
input_partition, then say that.  If it doesn't then either say "after" or don't 
say it at all (why is it relevant to this method documentation?).


PS14, Line 112: so that the probe phase has enough buffers preallocated
  :   /// execute successfully.
slightly garbled. seems like a word is missing or something.


PS14, Line 117: /// Iterates over all the partitions in 'hash_partitions_' and 
returns the largest
  :   /// number of build rows.
rather than say how and referring to private member, and leaving it to the 
reader to assume that "largest number of build rows" is talking about a 
per-partition thing, let's say something like:

Returns the maximum number of build rows of a hash partition 

(or whatever terminology you choose in the class comment).


Line 123:   bool HashTableStoresNulls() const;
why is this here rather than the join node? especially given that the join node 
owns the HashTableCtx.  on a related note, I'm not really sure how the 
HashTable context is going to work out once this is driven like a real sync.  
Will we still have a parent_ link to the exec node living in a different 
fragment?


Line 128:   inline const std::vector& is_not_distinct_from() const {
and then should this be here? is it to be closer to being able to break the 
parent_ link at some point and make the builder able to build the hash table 
without any info from parent_?


PS14, Line 182: partition side of this partition
not sure what this is saying. maybe just "for this build partition"?


Line 188: /// false.
maybe clarify that an error is not returned in that case.  or clarify what 
cases are errors. etc.


PS14, Line 195: unpin_all_build
why "build"?  shouldn't it be unpin_all_blocks?  or even better would be to 
name it according to what the algorithm cares about. which I think is whether 
or not we might have more build rows to append, right?


PS14, Line 234: build or probe
does this care about the probe side partitioning?


Line 237:   /// 'null_aware_probe_partition_' and 'null_probe_rows_'.
Also explain what this is computing, rather than just what the computation is.


PS14, Line 244: all hash partitions for partitioning level
this is kinda misleading. maybe qualify with ... for the current input, or the 
"current set of hash partitions for partition level...". i.e. this isn't all 
the hash partitions that are created for a particular level.


PS14, Line 252: in
into


PS14, Line 256: in memory
what does this mean? is it trying to say it fits in the current (pinned) block?


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-node.h
File be/src/exec/partitioned-hash-join-node.h:

PS14, Line 410:  we then iterate over all the probe rows
> Clarified that it's the partition's. Most of the text was just copied verba
Thanks. From the perspective of the reviewer, the comment is 

[Impala-ASF-CR] IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder

2016-09-14 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3567 Part 2, IMPALA-3899: factor out PHJ builder
..


Patch Set 14:

(53 comments)

http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/analytic-eval-node.cc
File be/src/exec/analytic-eval-node.cc:

Line 207: state, Substitute(PREPARE_FAILED_ERROR_MSG, "write"));
> old code and agg uses state_->block_mgr()->MemLimitTooLowError().  Any reas
Historical reasons it looks like:  the PrepareForRead() version below diverged 
from the PAGG/PHJ versions.

Switched to the block_mgr version for consistency.


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/blocking-join-node.cc
File be/src/exec/blocking-join-node.cc:

PS14, Line 297: SCOPED_TIMER(build_timer_);
> We used to have two different timers in PHJ for measuring the time to hash 
It's not clear to me that the original timers were well thought out. It didn't 
look like there was a timer for how long it took to build the hash table - that 
was lumped into 'build_timer_'.

I added two new timers with clearer meanings that track the total hash table 
build timer and time spent partitioning rows. Also moved the remaining 
build-related timers into the Builder.

Now 'built_timer_' is the initial partitioning and hash table build, and 
'repartition_timer_' is the time spent repartitioning'.


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

Line 179: largest_fraction = max(largest_fraction,
> DCHECK_GT(num_build_rows, 0);
I don't think that's valid, it's casting to double to avoid crashing on the 
divide-by-zero. I changed it to avoid the calculation if num_build_rows = 0 to 
avoid that problem.


PS14, Line 318: a hash table
> hash tables
Done. Both alternatives seem grammatically correct to me though.


Line 341:   // building hash tables to avoid building hash tables for 
partitions we'll spill anyway.
> That's an interesting comment. If I understand it correctly, it means that 
Done with some slight tweaks.

This is a lot simpler than the previous behaviour, where spilling can happen 
even during probing.

I didn't measure how often, but it's not at all improbable, since the probe 
buffers are 8MB each and we could allocate up to 16 of them.


Line 401:   RETURN_IF_ERROR(SpillPartition());
> May help to document that failure to find any partition to spill (e.g. all 
I added to the comment up the top of the loop to make the termination more 
explicit.


PS14, Line 529: PhjBuilder::HashTableStoresNulls()
> This seems to belong to PartitionedHashJoinNode conceptually.
I feel like it makes sense here though since the builder owns the hash tables.

Plumbing-wise it's also easier since PhjBuilder needs this info and starts 
executing before PartitionedHashJoinNode.


Line 646:   do {
> Please see comments in BlockingJoinNode,  it would be great to retain timer
Done


PS14, Line 782: process_build_batch_fn_level0
> 'process_build_batch_fn_level0_'
I just copied this verbatim - it seems to be referring to the local variable 
though.


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-builder.h
File be/src/exec/partitioned-hash-join-builder.h:

Line 425: 
> nit: unnecessary blank line ?
Done


http://gerrit.cloudera.org:8080/#/c/3873/14/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

PS14, Line 151: 
  : 
  : 
> This may be important information to retain.
Not sure why I removed it. May have had a good reason but I can't recall it.


PS14, Line 87: Expr::CreateExprTree(pool_, eq_join_conjunct.right
> Does this overlap with CreateExprTree() in PhjBuilder::Init() ? Same questi
They are redundant currently, but the idea is that PhjBuilder and PhjNode will 
exist more independently in different plan fragments, so I wanted to 
encapsulate the state where possible.

The expr contexts need to be independent to eventually support many probe sides 
: 1 build side for broadcast joins.

I think documenting this explicitly may be confusing because it is sort-of 
explaining the current state of things relative to the previous way of doing 
things. Whereas I think with the separate build side, the default assumption is 
that builder data structures are private and not shared with the exec node.

It may make sense to share Expr trees globally between fragment instances as 
part of the multithreading changes, but I don't think it's worth trying to 
share them here until we have the right infrastructure.


Line 213:   for (unique_ptr& partition : spilled_partitions_)
> missing { }
Ah, missed that clang-format wrapped it.


PS14, Line 494: to output
> outputting
Done


PS14, Line 585: hash_partitions_
> 'hash_partitions_'
Done


Line 594:   continue;
> Is there a reason why we cannot call OutputUnmatchedBuild() directly at thi
I