[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-27 Thread Teddy Choi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700254#comment-16700254
 ] 

Teddy Choi commented on HIVE-20873:
---

Pushed to master. Thanks, [~bslim] and [~gopalv]!

> Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision
> 
>
> Key: HIVE-20873
> URL: https://issues.apache.org/jira/browse/HIVE-20873
> Project: Hive
>  Issue Type: Improvement
>Reporter: Teddy Choi
>Assignee: Teddy Choi
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-20873.1.patch, HIVE-20873.2.patch, 
> HIVE-20873.3.patch
>
>
> VectorHashKeyWrapperTwoLong is implemented with a few bit-shift and XOR 
> operators, which keeps computation time short but causes more hash 
> collisions. As a result, group-by operations become very slow on large data 
> sets. It needs Murmur hash or another better hash function to reduce collisions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-25 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698475#comment-16698475
 ] 

Gopal V commented on HIVE-20873:


[~teddy.choi]: this is good to go into Apache - has been tested and found to be 
good.



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-08 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680922#comment-16680922
 ] 

Hive QA commented on HIVE-20873:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12947335/HIVE-20873.3.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 15531 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[timestamptz_2] (batchId=85)
org.apache.hive.jdbc.TestJdbcDriver2.testSelectExecAsync2 (batchId=259)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/14825/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14825/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14825/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12947335 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-08 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680894#comment-16680894
 ] 

Hive QA commented on HIVE-20873:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 29s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  9m 27s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 27s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 55s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 20s{color} | {color:blue} storage-api in master has 48 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 31s{color} | {color:blue} common in master has 65 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 51s{color} | {color:blue} ql in master has 2315 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 13s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m  9s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 59s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 52s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m 15s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 12s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 29m 42s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-14825/dev-support/hive-personality.sh |
| git revision | master / 5aac805 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.1 |
| modules | C: storage-api common ql U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-14825/yetus.txt |
| Powered by | Apache Yetus  http://yetus.apache.org |


This message was automatically generated.





[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-08 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678881#comment-16678881
 ] 

Hive QA commented on HIVE-20873:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12947198/HIVE-20873.2.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 15528 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[schemeAuthority2] (batchId=192)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/14797/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/14797/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-14797/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.YetusPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12947198 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-08 Thread slim bouguerra (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16680050#comment-16680050
 ] 

slim bouguerra commented on HIVE-20873:
---

It is still unclear to me why we are using Murmur; there are a dozen other hash 
algorithms, including xxHash, that are much faster and of good quality: 
https://cyan4973.github.io/xxHash/
Anyway, I will try to benchmark this; I have created a sub-task. FYI, xxHash 
is widely used by a lot of MPP-style engines.



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-07 Thread slim bouguerra (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678844#comment-16678844
 ] 

slim bouguerra commented on HIVE-20873:
---

[~teddy.choi] Thanks. I am not trying by any means to waste your time, but it 
would be nice if you could share what improvement you see and how you are 
measuring it, and maybe also investigate whether this will be a regression for 
other queries.
This will help me and others learn from your experiments.

  



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-07 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678852#comment-16678852
 ] 

Gopal V commented on HIVE-20873:


[~bslim]: Teddy & I have a UDF for the hash function, which we use to calculate 
skews.

I've merged Teddy's changes into it:

https://github.com/t3rmin4t0r/long-hash-udf

{code}
select long2hash(i_item_sk, 1) & 255, count(1)
from item
group by long2hash(i_item_sk, 1) & 255
order by count(1) desc;

0   65536
2   65536
3   65536
1   65535
5   37857
{code}

So there's a bit-skew in the old hash function: instead of spreading keys over 
256 distinct low-bit patterns, it skews the low bits according to the 2nd arg 
to long2hash.

{code}
select long2murmur(i_item_sk, 1) & 255, count(1)
from item
group by long2murmur(i_item_sk, 1) & 255
order by count(1) desc;

170 1274
37  1264
220 1254
110 1253
152 1241
5   1235
56  1232
179 1231
231 1228
168 1228
149 1228
84  1222
...
156 1082
Time taken: 1.727 seconds, Fetched: 256 row(s)
{code} 
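
To illustrate the kind of low-bit skew measured above, here is a minimal, self-contained Java sketch. It is not the long2hash/long2murmur UDF and not Hive's HashCodeUtil: shiftXorHash is a hypothetical weak combiner made up for this example, murmurMixHash applies the MurmurHash3 64-bit finalizer (fmix64), and the stride-256 key set is an assumption chosen to make the skew visible. Like the queries above, it masks each hash with 255 and counts keys per bucket.

```java
/** Illustrative only: compares the low-bit bucket spread of two hash functions. */
final class HashSkewSketch {

    /** A hash over a pair of longs. */
    interface TwoLongHash {
        int hash(long a, long b);
    }

    /** Hypothetical weak combiner (NOT Hive's code): one shift and XOR per long. */
    static int shiftXorHash(long a, long b) {
        int ha = (int) (a ^ (a >>> 32));
        int hb = (int) (b ^ (b >>> 32));
        return 31 * ha + hb;
    }

    /** MurmurHash3 64-bit finalizer (fmix64): mixes high bits into low bits. */
    static long fmix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    /** Assumed way of combining two longs through the finalizer. */
    static int murmurMixHash(long a, long b) {
        return (int) fmix64(fmix64(a) ^ b);
    }

    /** Hashes keys (i * stride, 1) and counts how many land in each of 256 buckets. */
    static int[] bucketCounts(TwoLongHash fn, int keys, long stride) {
        int[] counts = new int[256];
        for (int i = 0; i < keys; i++) {
            counts[fn.hash(i * stride, 1L) & 255]++;
        }
        return counts;
    }

    /** Number of buckets that received at least one key. */
    static int nonEmpty(int[] counts) {
        int n = 0;
        for (int c : counts) {
            if (c > 0) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[] weak = bucketCounts(HashSkewSketch::shiftXorHash, 65536, 256L);
        int[] mixed = bucketCounts(HashSkewSketch::murmurMixHash, 65536, 256L);
        System.out.println("shift/XOR buckets used:  " + nonEmpty(weak));
        System.out.println("murmur-mix buckets used: " + nonEmpty(mixed));
    }
}
```

With these assumed keys the weak combiner sends every key to a single bucket, because the low 8 bits of the inputs never vary and nothing mixes the higher bits down, while the finalizer should spread the same keys over the buckets.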



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-07 Thread Gopal V (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678877#comment-16678877
 ] 

Gopal V commented on HIVE-20873:


LGTM - +1 tests pending.

TestHashCodeUtil.java needs ASF license.



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-07 Thread Hive QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678817#comment-16678817
 ] 

Hive QA commented on HIVE-20873:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 33s{color} | {color:blue} Maven dependency ordering for branch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  7m 18s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 19s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 48s{color} | {color:green} master passed {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m 30s{color} | {color:blue} common in master has 65 extant Findbugs warnings. {color} |
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  3m 37s{color} | {color:blue} ql in master has 2315 extant Findbugs warnings. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  8s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:blue}0{color} | {color:blue} mvndep {color} | {color:blue}  0m 10s{color} | {color:blue} Maven dependency ordering for patch {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 35s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 13s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 13s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  0m 11s{color} | {color:red} common: The patch generated 4 new + 6 unchanged - 0 fixed = 10 total (was 6) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  4m 28s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  1m  7s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 13s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 25m 28s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Optional Tests |  asflicense  javac  javadoc  findbugs  checkstyle  compile  |
| uname | Linux hiveptest-server-upstream 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /data/hiveptest/working/yetus_PreCommit-HIVE-Build-14797/dev-support/hive-personality.sh |
| git revision | master / 6d713b6 |
| Default Java | 1.8.0_111 |
| findbugs | v3.0.0 |
| checkstyle | http://104.198.109.242/logs//PreCommit-HIVE-Build-14797/yetus/diff-checkstyle-common.txt |
| asflicense | http://104.198.109.242/logs//PreCommit-HIVE-Build-14797/yetus/patch-asflicense-problems.txt |
| modules | C: common ql U: . |
| Console output | http://104.198.109.242/logs//PreCommit-HIVE-Build-14797/yetus.txt |
| Powered by | Apache Yetus  http://yetus.apache.org |


This message was automatically generated.





[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-07 Thread Teddy Choi (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677804#comment-16677804
 ] 

Teddy Choi commented on HIVE-20873:
---

In my case, TPC-H query 21 and TPC-DS query 16 seem to be related to it. TPC-H 
query 21 uses a map join, and TPC-DS query 16 uses a group by. Both of them use 
VectorHashKeyWrapperBatch, which uses VectorHashKeyWrapperSingleLong, which 
uses HashCodeUtil.calculateLongHashCode.

Also there are other hash algorithms, but Murmur3 is already used in Hadoop and 
Hive. See org.apache.hive.common.util.Murmur3 and 
org.apache.hadoop.util.hash.MurmurHash. So I think it would be safe to use 
Murmur3 instead of benchmarking other hash algorithms.
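
As a rough illustration of what a Murmur3-hashed two-long key could look like, here is a minimal Java sketch. It is not the committed patch and not Hive's VectorHashKeyWrapperTwoLong: the class name TwoLongKeySketch, the seed of 0 and the choice to cache the hash are all assumptions, and the 32-bit Murmur3 mixing is inlined only to keep the example self-contained (Hive itself already ships org.apache.hive.common.util.Murmur3, as noted above).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical two-long hash key; illustrative only, not Hive's class. */
final class TwoLongKeySketch {
    private final long first;
    private final long second;
    private final int hash; // cached, since the key is immutable in this sketch

    TwoLongKeySketch(long first, long second) {
        this.first = first;
        this.second = second;
        this.hash = murmur3TwoLongs(first, second, 0);
    }

    /** One Murmur3 (32-bit variant) body round for a 4-byte block. */
    static int mixBlock(int h, int k) {
        k *= 0xcc9e2d51;
        k = Integer.rotateLeft(k, 15);
        k *= 0x1b873593;
        h ^= k;
        h = Integer.rotateLeft(h, 13);
        return h * 5 + 0xe6546b64;
    }

    /** Murmur3 32-bit over the 16 bytes of two longs, low word first. */
    static int murmur3TwoLongs(long a, long b, int seed) {
        int h = seed;
        h = mixBlock(h, (int) a);
        h = mixBlock(h, (int) (a >>> 32));
        h = mixBlock(h, (int) b);
        h = mixBlock(h, (int) (b >>> 32));
        h ^= 16; // input length in bytes
        // Final avalanche (fmix32).
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TwoLongKeySketch)) {
            return false;
        }
        TwoLongKeySketch other = (TwoLongKeySketch) o;
        return first == other.first && second == other.second;
    }

    public static void main(String[] args) {
        // Group-by style counting keyed on a pair of longs.
        Map<TwoLongKeySketch, Integer> counts = new HashMap<>();
        long[][] rows = {{1L, 2L}, {1L, 2L}, {2L, 1L}};
        for (long[] row : rows) {
            counts.merge(new TwoLongKeySketch(row[0], row[1]), 1, Integer::sum);
        }
        System.out.println("distinct keys: " + counts.size());
    }
}
```

The sketch spends a few multiplies and rotates per key so that every input bit influences the low bits a hash table actually indexes on; whether that cost pays off for a given query is what the benchmarking discussed in this thread would have to show.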



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-06 Thread slim bouguerra (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677129#comment-16677129
 ] 

slim bouguerra commented on HIVE-20873:
---

[~teddy.choi] I am wondering whether you got a chance to run any benchmarks to 
see if this actually helps.
Also, did you consider other hashing algorithms that are less expensive than 
this one?
Thanks



[jira] [Commented] (HIVE-20873) Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash collision

2018-11-06 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-20873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677097#comment-16677097
 ] 

ASF GitHub Bot commented on HIVE-20873:
---

GitHub user pudidic opened a pull request:

https://github.com/apache/hive/pull/485

HIVE-20873: Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce…

… hash collision (Teddy Choi)

Change-Id: Ie3ae307acb331c48bc5e1cb9c417cd5d1d792f50

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/pudidic/hive HIVE-20873

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/hive/pull/485.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #485


commit b658957051c9f75861cd75383f5239a76dfb9f0e
Author: Teddy Choi 
Date:   2018-11-06T18:02:26Z

HIVE-20873: Use Murmur hash for VectorHashKeyWrapperTwoLong to reduce hash 
collision (Teddy Choi)

Change-Id: Ie3ae307acb331c48bc5e1cb9c417cd5d1d792f50



