[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-30 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#13).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator-ir.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/grouping-aggregator.h
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
12 files changed, 812 insertions(+), 86 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/13
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 13
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-20 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 12:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5832/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 20 Apr 2020 19:13:36 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-20 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 12:

New change in Patch Set 12:
- Instead of measuring fill ratio as percentage, count the number of filled 
bucket instead (counter FilledBuckets).
- Add InsertTravel counter to measure how much travel involved in insertion 
path. Consequently, number of travel in lookup-only can be computed as (Travel 
- InsertTravel).
- Try to speedup bucket rebalancing process by displacing buckets altogether 
instead of swapping one-by-one.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 20 Apr 2020 18:36:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-20 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#12).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator-ir.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/grouping-aggregator.h
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
12 files changed, 832 insertions(+), 92 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/12
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 12
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-06 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 11:

Patch Set 11 add improvement for insertion into Robin-Hood hash table.
If an insertion target a bucket that is an empty, do insertion in-place, 
bypassing temporary bucket and swap routine.
This reduce unnecessary number of bucket swaps.

Based on Patch Set 11, I have run initial performance measurement against 
TPC-DS with scale 60 GB over 3 nodes.
The baseline is TPC-DS under quadratic probing hash table.
Among 91 queries, 34 queries perform better in Robin-Hood than Quadratic, 57 
others show degradation.

The breakdown are the following:
- 1 query improve by 18.69%
- 3 queries improve between 11.29% to 14.21%
- 8 queries improve between 5.90% to 9.93%
- 22 queries improve between 0.02% to 4.14%
- 42 queries degrade 0% to -4.86%
- 8 queries degrade between -5.51% to -8.50%
- 6 queries degrade between -10.49% to -14.89%
- 1 queries degrade between -23.50%

The improvement and degradations mostly still within tens and hundreds of 
miliseconds, so I suspect larger test scale will be more representative.
Meanwhile, I will study the query profile from my runs and look if we can find 
some explanation about this result.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 11
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 06 Apr 2020 17:31:09 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 11:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5708/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 11
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 06 Apr 2020 17:25:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-06 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#11).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
10 files changed, 793 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/11
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 11
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5691/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 10
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 02 Apr 2020 15:27:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-02 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 10:

(5 comments)

Patch Set 10 adds several things:
- Fix bug where we forgot to swap distance (PSL) variable as we swap bucket.
- Create a separate ProbeRobinHood function for robin-hood algorithm.
- Add some new profile counters to measure the cost of rebalancing.

Below, I also add some of my own comment indicating several of DCHECK hacks 
that I'm not fully understand, but works. Patch set 10 pass core tests with 
this hacks, but we should definitely revisit them later to fully verify their 
correctness.

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92
PS9, Line 92: (LIKELY(step < nu
> In case of FindBuildRowBucket, it is quite tricky, because the caller of Fi
I went ahead by creating separate ProbeRobinHood function to solve this.


http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@132
PS9, Line 132:  if we got
> You are correct, thanks for pointing this bug! Will fix this in my next sub
Done


http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-builder.cc
File be/src/exec/partitioned-hash-join-builder.cc:

http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-builder.cc@1316
PS10, Line 1316:   DCHECK_EQ(replaced_constants.robin_hood, 0);
I'm not sure if the replacement count here should be 0. But looking at priors 
DCHECKs, it is probably right.


http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-builder.cc@1393
PS10, Line 1393:   DCHECK_REPLACE_COUNT(replaced, 2);
In Patch Set 10, I create a second Probe function special for Robin-Hood 
algorithm. This additions seems to add a second call-site to function "Equals". 
Changing this DCHECK and other related checks to 2 pass me from errors.


http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-node.cc
File be/src/exec/partitioned-hash-join-node.cc:

http://gerrit.cloudera.org:8080/#/c/15511/10/be/src/exec/partitioned-hash-join-node.cc@1543
PS10, Line 1543:   DCHECK(replaced == 1 || replaced == 2 || replaced == 3 || 
replaced == 4) << replaced;
Similarly, since I'm adding a new Probe function, I think I should change this 
DCHECK, but I'm not sure what is the right check.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 10
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 02 Apr 2020 15:17:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-02 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#10).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
10 files changed, 777 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/10
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 10
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-01 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92
PS9, Line 92: return bucket_idx
> Agree that it becomes too much overloaded just to fit in robin-hood case. M
In case of FindBuildRowBucket, it is quite tricky, because the caller of 
FindBuildRowBucket is the one who decide what to do next given the returned 
value "found". If found equals true, the caller might continue to call GetTuple 
or UpdateTuple. If found equals false, the caller might call SetTuple next, 
which essentially an insert operation.

In linear/quadratic mode, the iterator should point to an empty bucket before 
SetTuple is called. But in robin-hood, I hack it such that it point to the 
non-empty bucket that should be evicted if SetTuple is called. I can see that 
this is where the contract of return value of Probe becomes inconsistent, 
depending of what is the probe algorithm being used.

Maybe the cleanest way is to create separate Probe function, say 
ProbeRobinHood, and use it when robin-hood algorithm is used. The downside is 
we will have a little bit code duplication between the two probe functions.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 02 Apr 2020 04:13:49 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-01 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92
PS9, Line 92: return bucket_idx
> I understand this is the short-cut case for insert, and returning the index
Agree that it becomes too much overloaded just to fit in robin-hood case. Maybe 
it is better to return the swap target bucket as an additional return parameter 
next to parameter found?


http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@132
PS9, Line 132: my_distance
> Should my_distance be set to the BucketDistance of the new bucket in the ca
You are correct, thanks for pointing this bug! Will fix this in my next 
submission.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 02 Apr 2020 02:23:42 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-04-01 Thread David Rorke (Code Review)
David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@92
PS9, Line 92: return bucket_idx
I understand this is the short-cut case for insert, and returning the index of 
a non-empty bucket might work here when called during insert, but it may not 
follow the contract described for the Probe() return value (see the comment in 
hash-table.h).  Calls to Probe during lookup won't necessarily check the value 
of "found" and might not expect a return that's not one of (a) a match, (b) an 
empty bucket, or (c) BUCKET_NOT FOUND.  For example not clear whether call from 
FindBuildRowBucket would expect a non-empty bucket as an indication of probe 
failure.

I'm not sure what the best solution is given the need to return a swap target 
bucket somehow in the insert case, but it seems like we should do something to 
make the contract for the return value clearer and less overloaded.


http://gerrit.cloudera.org:8080/#/c/15511/9/be/src/exec/hash-table.inline.h@132
PS9, Line 132: my_distance
Should my_distance be set to the BucketDistance of the new bucket in the case 
where we swap?



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 02 Apr 2020 01:41:53 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5665/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 31 Mar 2020 18:45:38 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-31 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 9:

(2 comments)

Hi David, thank you for your feedback.

Below is summary of Patch Set 6 to 9.

Patch Set 6:
- Create FLAGS_enable_robin_hood to enable Robin Hood hash table.
- Create backend tests for Robin Hood hash table.

Patch Set 7:
- Fix clang-tidy warnings about tmp_bucket_ struct initialization.

Patch Set 8:
- Add MaxTravel profile counter

Patch Set 9:
- Add improvement to update probe stats just once prior to Probe return.

http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG@8
PS7, Line 8:
> I suggest we add an additional profile counter to capture the max travel di
Done


http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h@97
PS8, Line 97: }
> Minor optimization suggestion - you should be able to avoid recalculating t
Done



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 31 Mar 2020 18:20:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-31 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#9).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
10 files changed, 691 insertions(+), 82 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/9
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 9
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-31 Thread David Rorke (Code Review)
David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/8/be/src/exec/hash-table.inline.h@97
PS8, Line 97: ht_ctx->max_travel_length_ = std::max(step, 
ht_ctx->max_travel_length_);
Minor optimization suggestion - you should be able to avoid recalculating the 
max on every iteration of the loop and update it just once per probe call prior 
to the returns.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 31 Mar 2020 17:16:41 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-31 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 8:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5657/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 31 Mar 2020 17:12:29 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-31 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#8).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
10 files changed, 675 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/8
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 8
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-30 Thread David Rorke (Code Review)
David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/7//COMMIT_MSG@8
PS7, Line 8:
I suggest we add an additional profile counter to capture the max travel 
distance for any single call to probe.  This will help verify that robin-hood 
is in fact reducing the probe length variance as expected.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Mar 2020 23:52:50 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5647/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Mar 2020 21:08:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-30 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#7).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
10 files changed, 665 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/7
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 7
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 6:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/5643/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Mon, 30 Mar 2020 16:33:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-30 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#6).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Add backend test for Robin Hood hash table in hash-table-test.cc
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hash-table-test.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
10 files changed, 664 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/6
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 6
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 5:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/5632/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 28 Mar 2020 05:18:39 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-27 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 5:

(1 comment)

Patch Set 5 submitted.

What is different from Patch Set 4 is that now we can do single pass insert 
both in HashTable::Insert and HashTable::Iterator::SetTuple.
Right now, this patch is hijacking the FLAGS_enable_quadratic_probing to enable 
robin hood mode rather than the quadratic probing mode.
I plan to create separate flag for robin hood mode in next iteration, so that 
we can do performance comparison between linear, quadratic, and robin hood 
linear probe.

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14
PS3, Line 14: If a hash table is configured as a robin hood hash table, the new
> I'm still thinking how to do single pass insert. From what I understand, we
This is now addressed in Patch Set 5.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Sat, 28 Mar 2020 04:48:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-27 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#5).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is the first step towards
full robin hood hash table implementation by doing bucket rebalancing
after every insert.

If a hash table is configured as a robin hood hash table, the new
element insertion will be buffered in a temporary bucket. This
temporary bucket will then matched against existing bucket elements,
swapped with a "rich" bucket, and continue doing so until it swap with
an empty bucket. The PSL (probe sequence length) invariant is
maintained in robin hood hash table. This allow us to add
short-circuit in the Probe function to immediately returns when it
finds out that the PSL of currently visited bucket is smaller/richer
than the key that is being looked up, indicating that the key does not
exist in the table. Instead of continue probing until next empty
bucket is found, Probe can immediately the return the index of
recently visited richer bucket to caller along with the not found
flag. In turn, the caller use this returned index to specify in which
index the new element should be inserted to.

Testing:
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/grouping-aggregator.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exec/partitioned-hash-join-builder.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exprs/scalar-expr.h
9 files changed, 560 insertions(+), 20 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/5
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 5
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-26 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12
PS3, Line 12: robin hood algorithm to keep probe function of hash-table intact.
> I've been looking at this. It looks like short-circuiting lookup only possi
Agree that supporting quadratic probing + robin hood probably doesn't make 
sense. I think if we get linear probing + robin hood working well we  could 
consider just switching to that (since quadratic probing exists mainly to solve 
some issues with linear probing and clusters of values).

The other alternative would be to support two code paths, but it's preferable 
not to have both ways if it's not needed.

IMPALA-2470 has context on the problem and how to reproduce it.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 26 Mar 2020 20:59:35 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-25 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 4:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/5601/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 26 Mar 2020 01:37:30 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-25 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 4:

(6 comments)

Patch set 4 submitted.

This patch address some of comments from patch set 3, and add initial 
microbenchmarking code.
After some exploration, I come to conclusion that quadratic probing does not 
fit well with robin hood hashing. We might be better off by dropping quadratic 
probing entirely and commit on robin hood linear probing only.

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12
PS3, Line 12: bucket rebalancing after every insert.
> You told me before about short-circuiting the lookup but I didn't quite get
I've been looking at this. It looks like short-circuiting lookup only possible 
in linear probing mode because it exploit that linear probing behavior. 
Therefore, we need to drop quadratic probing mode first before we can fully 
implement all improvement that is possible in Robin Hood, linear probing.


http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14
PS3, Line 14: Instead of proactively swapping elements during insertion, the
> Also I think when probing for lookup we might consider trying the "start in
I'm still thinking how to do single pass insert. From what I understand, we 
should tweak the insertion pattern that is using Iterator. For example, 
FindProbeRow(), FindBuildRowBucket() should commit early whether they indeed 
plan to do insert or just testing existence.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115
PS3, Line 115: new_distance <= target_d
> Ah, I see what you mean. Somehow I was under impression that element that i
Done


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138
PS3, Line 138: // swapped back into temporary location at index 
curr_idx.
> Sorry, this example is bad, because the final balanced bucked should be [-1
This check, unfortunately, need to stay because it is possible to swap with 
empty bucket in the case of quadratic probing. We should iron this out if we 
decide to drop quadratic probing.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162
PS3, Line 162:   } else {
> Wonder if we should consider storing the bucket distance (calculated during
Patch Set 4 address O(1) distance measurement for linear probing.

For quadratic probing, because the probe steps follow Triangular number 
pattern, we supposedly can do ~O(1)  by doing its inverse function
https://en.wikipedia.org/wiki/Triangular_number#Triangular_roots_and_tests_for_triangular_numbers

However, I'm leaning towards dropping quadratic probing entirely, so I keep it 
as loop for quadratic probing.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374
PS3, Line 374: DCHECK(node_ != NULL);
> > And hash-table suppose to be thread-safe for read access.
Add rebalancing at the end of Iterator::SetTuple and it pass core tests.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 26 Mar 2020 01:25:58 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-25 Thread Riza Suminto (Code Review)
Hello David Rorke, Tim Armstrong, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/15511

to look at the new patch set (#4).

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This patch is intended to be the
first step towards full robin hood hash table implementation by doing
bucket rebalancing after every insert.

Instead of proactively swapping elements during insertion, the
algorithm does the rebalancing by first letting the new element probe
an empty bucket and claim it through PrepareBucketForInsert(). After
the bucket at index curr_idx claimed, the rebalancing algorithm in
function Rebalance() walk through the probe sequence again to find
richer or empty bucket to swap with bucket[curr_idx]. The search and
swapping continue until it swap with empty bucket or does not find any
richer bucket along the probe sequence. It effectively treat
bucket[curr_idx] as temporary variable during the rebalancing process.
Rebalance() return the new index of the recently claimed bucket after
rebalancing process done.

The decision to not swap elements proactively is because the probe
function is used for both insertion and lookup-only. Keeping probing
routine decoupled from insertion routine allow us to invoke rebalance
only after bucket claim, not on lookup-only. This patch also maintain
compatibility with quadratic probing. However, it block us from
leveraging some robin hood technique that only possible with linear
probing such as lookup short-circuit (fast lookup on nonexistent
keys).

Testing:
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/benchmarks/CMakeLists.txt
A be/src/benchmarks/hash-table-benchmark.cc
M be/src/exec/hash-table.cc
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
M be/src/exprs/scalar-expr.h
6 files changed, 610 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/4
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 4
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-24 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138
PS3, Line 138:   if (curr_bucket->filled) {
> I just remember that it is possible to swap with an empty bucket, especiall
Sorry, this example is bad, because the final balanced bucked should be [-1, 1, 
1, 2]
A better example maybe is table with size 8, quadratic probe, with incoming 
element 1, 1, 1, 1, 7, 7.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Mar 2020 01:50:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-24 Thread David Rorke (Code Review)
David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14
PS3, Line 14: Instead of proactively swapping elements during insertion, the
> Yes, my intuition is also that doing the additional work to create a separa
Also I think when probing for lookup we might consider trying the "start in the 
middle / smart search" approach described here:  
https://programming.guide/robin-hood-hashing.html
The tradeoff is it might be less cache friendly.  But it might be best to 
explore this as a follow on change later just to keep the scope of the initial 
change manageable.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Mar 2020 01:43:19 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-24 Thread David Rorke (Code Review)
David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14
PS3, Line 14: Instead of proactively swapping elements during insertion, the
> Having separate Probe() function that is special for insert might remove th
Yes, my intuition is also that doing the additional work to create a separate 
single pass probe for insert and also making sure we maintain the distance 
invariant in all cases are both worthwhile.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162
PS3, Line 162: inline int64_t HashTable::BucketDistance(
> I'll see if I can put O(1) distance measurement, at least for the linear pr
Wonder if we should consider storing the bucket distance (calculated during 
insert/rebalance) in the bucket itself.  This might work against the goals of 
IMPALA-7635 but with a carefully packed memory layout maybe we could still 
squeeze in a byte sized value for the distance.  Seems intuitively like a byte 
should be enough to store most distances, and could fall back to calculating 
the distance via probe in rare cases where the distance won't fit in a byte.

Of course if you can figure out a good enough O(1) approach that doesn't 
require any storage that might be simpler.

Regarding linear vs quadratic I'd also hope that robin-hood doesn't require 
quadratic.  Maybe we should just do some benchmarking and see if there's a 
clear winner between different probing algorithms and if so go with that.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Mar 2020 01:36:37 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-24 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138
PS3, Line 138:   if (curr_bucket->filled) {
> Good point, I must have miss this while I refine my code. Will change this 
I just remember that it is possible to swap with an empty bucket, especially in 
quadratic_probing mode.
Consider hash-table with buckets size 4, quadratic probing, and incoming 
elements 1, 1, 2 in that order.
Let say -1 represent empty bucket and hash of an element is equal to that 
element itself. Initially the buckets look like this:

[-1, -1, -1, -1]

After 1 and 1 inserted to the table, buckets will look like this

[-1, 1, 1, -1]

Now, when we want to insert 2 to table, this algorithm will put 2 at the last 
empty bucket in the probe sequence, that is index 3.

[-1, 1, 1, 2]

Then, we rebalance by swapping 2 at index 3 with 1 at index 2

[-1, 1, 2, 1]

Now, the element 1 at index 3 is misplaced, because if we follow 
quadratic_probing, the appropriate probe sequence for 1 should be 1, 2, 0 (that 
is (2 + 2) mod 4), 3, and so on.
So we need to swap 1 at index 3 with the empty bucket at index 0.

[1, 1, 2, -1]

At this point, bucket index 3 is an empty bucket and the rebalancing finish.
This is kind of counter intuitive, because most literature show that robin-hood 
should pair with linear probing. And in linear probing, this case will not 
happen.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 25 Mar 2020 01:28:31 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-24 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374
PS3, Line 374:   table_->PrepareBucketForInsert(bucket_idx_, hash);
> And hash-table suppose to be thread-safe for read access.

Yeah, it definitely needs to be thread-safe if muliontiple threads are reading 
from it, but mutations (like SetTuple()) do not need to be thread-safe. We can 
also document SetTuple() as invalidating other iterators, cause we don't depend 
on that either.

SetTuple() is only used by the hash aggregator - see 
be/src/exec/grouping-aggregator-ir.cc. The algorithm is basically this:

  it = ht->FindBucket(...);
  if (found in hash table) {
// Merge into the existing intermediate tuple
UpdateTuple(it, input_row)
  } else {
// Try to construct a new intermediate tuple
new_tuple = TryConstructTuple()
if (tuple construction failed due to OOM) {
  SpillRow(input_row)
} else {
  it.SetTuple(new_tuple)
  UpdateTuple(it, input_row)
}
  }

Instead of FindBucket()/SetTuple() you could do Find()/Insert() but that would 
probe the hash table twice.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Mar 2020 21:42:28 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-24 Thread Riza Suminto (Code Review)
Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(6 comments)

Hi Tim, thanks for your valuable feedbacks!

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12
PS3, Line 12: robin hood algorithm to keep probe function of hash-table intact.
> It's worth noting that this isn't the full robin-hood hashing approach, bec
You told me before about short-circuiting the lookup but I didn't quite get it 
then. Now that you put it that way, I understand better what you mean.

Yes, we can do the short-circuit lookup if we maintain the distance invariant.  
Right now, the invariant is not maintained because insert using iterator does 
not trigger rebalance. I'll check the code again and see if I can do something 
about it.


http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14
PS3, Line 14: Instead of proactively swapping elements during insertion, the
> I guess we can revisit this to see what the impact is, it seems like it pro
Having separate Probe() function that is special for insert might remove the 
redundant second pass. I'll see if this is possible.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115
PS3, Line 115: target_distance == 0 ||
> Why do we need this condition? My understanding was that you always want to
Ah, I see what you mean. Somehow I was under impression that element that is 
already perfectly placed should not be evicted. Will remove this condition.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138
PS3, Line 138:   if (curr_bucket->filled) {
> Is it possible for this to be false? Why would there be an empty bucket *be
Good point, I must have miss this while I refine my code. Will change this with 
DCHECK.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162
PS3, Line 162: inline int64_t HashTable::BucketDistance(
> I guess calculating this for linear probing is relatively straightforward,
I'll see if I can put O(1) distance measurement, at least for the linear 
probing. I did tried that before, and it failed PlannerTests. But that probably 
my implementation that still incorrect.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374
PS3, Line 374:   table_->PrepareBucketForInsert(bucket_idx_, hash);
> You could also rebalance there, I think, to benefit the agg. Or combine the
If we rebalance here, it will reorder elements in array buckets_, which in turn 
breaking the order of iterator.
And hash-table suppose to be thread-safe for read access.
I'm worried if there are two iterator going where one do read only and the 
other do write through this SetTuple() function, the later iterator will break 
the former as well.
Please correct me if my assumption is wrong, or if there is some guarantee that 
such case is impossible. I think I haven't fully understand the use case of 
this function.



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Mar 2020 19:03:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-24 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12
PS3, Line 12: robin hood algorithm to keep probe function of hash-table intact.
It's worth noting that this isn't the full robin-hood hashing approach, because 
we're not short-circuiting the lookup based on the invariant (i.e. once the max 
distance of a hash value you'd seen on the probe exceeds your current distance, 
you know the key cannot be in the hash table).

I think this may be a problem if you do benchmarks where most of the lookups 
are not present in the hash table. This might be significant for some workloads.

But I guess for hash joins, this will give most of the benefit, if we assume 
that missing values are mostly filtered out by bloom filters in the scan. So if 
we see a big impact, we can look at optimising it further.


http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14
PS3, Line 14: Instead of proactively swapping elements during insertion, the
I guess we can revisit this to see what the impact is, it seems like it 
probably means that insert is somewhat more expensive because of the two passes.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h
File be/src/exec/hash-table.inline.h:

http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115
PS3, Line 115: target_distance == 0 ||
Why do we need this condition? My understanding was that you always want to 
maintain the invariant that the richer key gets bumped.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138
PS3, Line 138:   if (curr_bucket->filled) {
Is it possible for this to be false? Why would there be an empty bucket 
*before* the current position of the key (given that we don't support deletion 
from the hash table). I.e. Maybe this should be a DCHECK to enforce the 
invariant.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162
PS3, Line 162: inline int64_t HashTable::BucketDistance(
I guess calculating this for linear probing is relatively straightforward, but 
not sure if there's a good way to calculate it for quadratic probing (quadratic 
probing + robin hood hashing might not be a great combination).

We did the quadratic probing because of a tendency for the CRC-based hash to 
cluster. I don't know if robin-hood hashing is enough to counter-act that 
tendency.


http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374
PS3, Line 374:   table_->PrepareBucketForInsert(bucket_idx_, hash);
You could also rebalance there, I think, to benefit the agg. Or combine the 
rebalancing with PrepareBucketForInsert



--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 24 Mar 2020 16:23:27 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-22 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/15511 )

Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5557/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Riza Suminto 
Gerrit-Comment-Date: Sun, 22 Mar 2020 18:13:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] WIP IMPALA-9434: Implement Robin Hood Hash Table.

2020-03-22 Thread Riza Suminto (Code Review)
Riza Suminto has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/15511


Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table.
..

WIP IMPALA-9434: Implement Robin Hood Hash Table.

Robin hood hashing reduces the variances of probe lengths by
continually rebalancing elements. This commit implement robin hood
hashing behavior into exec/hash-table.h with small modification to the
robin hood algorithm to keep probe function of hash-table intact.

Instead of proactively swapping elements during insertion, the
algorithm does the rebalancing by first letting the new element probe
an empty bucket and claim it through PrepareBucketForInsert(). After
the bucket at index curr_idx claimed, the rebalancing algorithm in
function Rebalance() walk through the probe sequence again to find
richer or empty bucket to swap with bucket[curr_idx]. The search and
swapping continue until it swap with empty bucket or does not find any
richer bucket along the probe sequence. It effectively treat
bucket[curr_idx] as temporary variable during the rebalancing process.
Rebalance() return the new index of the recently claimed bucket after
rebalancing process done.

The decision to not swap elements proactively is because the probe
function is used for both insertion and lookup-only. Keeping probing
routine decoupled from insertion routine allow us to invoke rebalance
only after bucket claim, not on lookup-only. Insert into an already
filed bucket (duplicates), or insert using hash-table iterator does
not trigger rebalance.

Testing:
- Pass core tests

Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
---
M be/src/exec/hash-table.h
M be/src/exec/hash-table.inline.h
2 files changed, 94 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/15511/3
--
To view, visit http://gerrit.cloudera.org:8080/15511
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986
Gerrit-Change-Number: 15511
Gerrit-PatchSet: 3
Gerrit-Owner: Riza Suminto 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Riza Suminto