[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2017-07-03 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16071981#comment-16071981
 ] 

Hive QA commented on HIVE-14797:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12833869/HIVE-14797.4.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 10 failed/errored test(s), 10830 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[create_merge_compressed] (batchId=237)
org.apache.hadoop.hive.cli.TestBeeLineDriver.testCliDriver[insert_overwrite_local_directory_1] (batchId=237)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[llap_smb] (batchId=143)
org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[orc_ppd_basic] (batchId=140)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=145)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=232)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=232)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionRegistrationWithCustomSchema (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testPartitionSpecRegistrationWithCustomSchema (batchId=177)
org.apache.hive.hcatalog.api.TestHCatClient.testTableSchemaPropagation (batchId=177)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/5868/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/5868/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-5868/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 10 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12833869 - PreCommit-HIVE-Build

> reducer number estimating may lead to data skew
> ---
>
> Key: HIVE-14797
> URL: https://issues.apache.org/jira/browse/HIVE-14797
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Reporter: roncenzhao
> Assignee: roncenzhao
> Labels: breaking_change
> Attachments: HIVE-14797.2.patch, HIVE-14797.3.patch, HIVE-14797.4.patch, HIVE-14797.patch
>
>
> HiveKey's hash code is generated by multiplying by 31 field by field, as
> implemented in the method `ObjectInspectorUtils.getBucketHashCode()`:
> for (int i = 0; i < bucketFields.length; i++) {
>   int fieldHash = ObjectInspectorUtils.hashCode(bucketFields[i], bucketFieldInspectors[i]);
>   hashCode = 31 * hashCode + fieldHash;
> }
> The following example leads to data skew:
> I have two tables, tbl1 and tbl2, with the same columns: a int, b string.
> The values of column 'a' are not skewed in either table, but the values of
> column 'b' are skewed in both.
> When my SQL is "select * from tbl1 join tbl2 on tbl1.a=tbl2.a and
> tbl1.b=tbl2.b" and the estimated reducer number is 31, the job suffers from
> data skew.
> As we know, the HiveKey's hash code is generated by `hash(a)*31 + hash(b)`.
> When the reducer number is 31, the reducer No. of each row is `hash(b)%31`,
> because the `hash(a)*31` term is a multiple of 31. As a result, the job is
> skewed.
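
The arithmetic in the description can be checked with a small, self-contained
sketch. It is not Hive code: the class and method names are hypothetical, the
partitioning rule `(hash & Integer.MAX_VALUE) % numReducers` is an assumption,
and the field hashes are kept small and non-negative so that the int
arithmetic does not overflow.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of Hive.
final class ReducerSkewDemo {
    // Combines two field hashes the way the loop in the description does,
    // starting from hashCode = 0.
    static int combinedHash(int hashA, int hashB) {
        int hashCode = 0;
        hashCode = 31 * hashCode + hashA;
        hashCode = 31 * hashCode + hashB;
        return hashCode;
    }

    // Assumed partitioning rule: non-negative hash modulo the reducer count.
    static int reducerFor(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    // Counts rows per reducer for 1000 rows whose column 'a' hashes are
    // 0..999 (not skewed) and whose column 'b' hash is always 7 (skewed).
    static int[] distribution(int numReducers) {
        int[] counts = new int[numReducers];
        for (int hashA = 0; hashA < 1000; hashA++) {
            int hashB = 7;
            counts[reducerFor(combinedHash(hashA, hashB), numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // With 31 reducers every row lands on reducer 7.
        System.out.println(Arrays.toString(distribution(31)));
        // With 32 reducers the rows spread over all reducers.
        System.out.println(Arrays.toString(distribution(32)));
    }
}
```

With 31 reducers the `hash(a)*31` term vanishes modulo 31, so only `hash(b)`
decides the reducer. With 32 reducers the term does not vanish, and the varied
values of 'a' spread the rows.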



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-21 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15597040#comment-15597040
 ] 

Rui Li commented on HIVE-14797:
---

[~roncenzhao], would you mind updating the patch as Xuefu suggested? Or let us 
know if you have a better idea.



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15592184#comment-15592184
 ] 

Xuefu Zhang commented on HIVE-14797:


That sounds good enough to me.



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-20 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15591149#comment-15591149
 ] 

Rui Li commented on HIVE-14797:
---

[~xuefuz] - I see. Thanks for the information. In that case, I think we should 
use the initial solution, which avoids automatically setting the number of 
reducers to 31?
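
A minimal sketch of what "avoid automatically setting the number of reducers
to 31" could look like follows. It is not the code from any of the attached
patches; the class and method names are hypothetical. Treating every multiple
of 31 as a collision goes beyond what the thread discusses and is an
assumption here: with 62 reducers, for example, `31*hash(a)` modulo 62 can
only be 0 or 31, so each value of 'b' still reaches at most two reducers.

```java
// Hypothetical helper, not part of Hive.
final class ReducerEstimateAdjuster {
    // Multiplier used when combining field hashes (from the description).
    static final int HASH_SEED = 31;

    // Bumps an automatically estimated reducer count by one when it is a
    // positive multiple of the hash seed; other values pass through.
    // A count that the user set explicitly would not go through this method.
    static int avoidSeedMultiple(int estimatedReducers) {
        if (estimatedReducers > 0 && estimatedReducers % HASH_SEED == 0) {
            return estimatedReducers + 1;
        }
        return estimatedReducers;
    }

    public static void main(String[] args) {
        System.out.println(avoidSeedMultiple(31));
        System.out.println(avoidSeedMultiple(30));
    }
}
```

Because only the estimate is adjusted, the hash function itself stays the same
on every calling path.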



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-19 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15590047#comment-15590047
 ] 

Xuefu Zhang commented on HIVE-14797:


I guess my point is that the chance for a user to hit this problem (however 
valid) is slim. As I understand it, this happens only if the user specifically 
picks 31 as the parallelism for whatever reason. BTW, Hive recommends 2^n for 
the number of buckets, so 31 buckets is even more of an anti-pattern.

My concern is whether it's worth the effort or risk to fix. However, please do 
feel free to tackle it completely.



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-18 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587451#comment-15587451
 ] 

Rui Li commented on HIVE-14797:
---

Hi [~xuefuz], for the example in the description, B is skewed but (A, B) 
ideally shouldn't be. Besides shuffling, bucketed tables should also suffer 
from this if the number of buckets happens to be 31, and we can't adjust the 
number of reducers in that case.
I think the problem is valid, but more research is needed to find out how the 
hash code is used, and whether the solution here is correct.



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-18 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587229#comment-15587229
 ] 

Xuefu Zhang commented on HIVE-14797:


[~lirui] Choosing a different seed for determining the bucket number seems a 
little risky for FS if it's assumed that a certain key always lands in a 
certain bucket, such as in the case of transactions. However, I'm not sure at 
all.

Looking again at the problem, I'm not sure we need to deal with the data skew 
problem in the way proposed by this patch. The original data is already skewed. 
I'm wondering if it's actually better to adjust the reducer number. I 
understand that this was the original approach. If the user happens to specify 
31 for reducers, then let it be. I'd think it's a rare case, and I don't think 
solving this case justifies the need for a new seed, which seems a little more 
risky.

Thoughts?



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584787#comment-15584787
 ] 

Hive QA commented on HIVE-14797:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12833869/HIVE-14797.4.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 9 failed/errored test(s), 10592 tests executed
*Failed tests:*
{noformat}
TestBeelineWithHS2ConnectionFile - did not produce a TEST-*.xml file (likely timed out) (batchId=197)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit] (batchId=27)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[order_null] (batchId=18)
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats] (batchId=46)
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk] (batchId=89)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0] (batchId=155)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0] (batchId=155)
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1] (batchId=155)
org.apache.hive.jdbc.authorization.TestJdbcWithSQLAuthorization.testBlackListedUdfUsage (batchId=204)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1613/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1613/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1613/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 9 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12833869 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-17 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584307#comment-15584307
 ] 

Rui Li commented on HIVE-14797:
---

Thanks for the update [~roncenzhao]. I have one more question.
{{ObjectInspectorUtils.getBucketHashCode}} is also used in several places other 
than RS, e.g. in FS. Now if the # of reducers is 31, RS will compute the hash 
code differently from the other places. Wondering if we need to keep some kind 
of consistency among these calling paths. [~xuefuz] do you have any ideas?



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-17 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584262#comment-15584262
 ] 

roncenzhao commented on HIVE-14797:
---

Hi [~lirui], I have resolved this problem in the new patch.
Please check it. Thanks~



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-12 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15568152#comment-15568152
 ] 

Rui Li commented on HIVE-14797:
---

It seems that for MR we need to get #reducers from hconf, but for Spark/Tez we 
need to get it from ReduceSinkDesc::getNumReducers. Therefore we have to check 
both of them to determine whether #reducers is the same as our hash seed.
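
The two-source check described above can be sketched without any Hive or
Hadoop dependency. Everything in this block is hypothetical: the class, the
method names, and the choice to prefer the plan value over the job
configuration value are assumptions for illustration. The two int parameters
stand in for whatever the job configuration and `ReduceSinkDesc` actually
return, with a non-positive value meaning "not set".

```java
// Hypothetical helper, not part of Hive.
final class ReducerCountLookup {
    // Multiplier used when combining field hashes (from the description).
    static final int HASH_SEED = 31;

    // Returns the reducer count from the plan when it is set, otherwise the
    // one from the job configuration, otherwise -1 (unknown).
    static int resolveNumReducers(int jobConfReducers, int planReducers) {
        if (planReducers > 0) {
            return planReducers;
        }
        if (jobConfReducers > 0) {
            return jobConfReducers;
        }
        return -1;
    }

    // True when the resolved reducer count equals the hash seed.
    static boolean collidesWithSeed(int jobConfReducers, int planReducers) {
        return resolveNumReducers(jobConfReducers, planReducers) == HASH_SEED;
    }

    public static void main(String[] args) {
        // Job configuration unset (-1), plan says 31 reducers.
        System.out.println(collidesWithSeed(-1, 31));
    }
}
```

Which source should win when both are set is exactly the open question in this
comment, so the precedence shown here should be read as a placeholder.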



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-11 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565084#comment-15565084
 ] 

Rui Li commented on HIVE-14797:
---

I did some tests locally. It turns out 
{{hconf.getInt(JobContext.NUM_REDUCES, -1)}} may not give us the number of 
reducers, i.e. it may just return -1. In my test, #reducers was automatically 
determined and stored in ReduceSinkDesc.
We need to find out which way is more reliable to get #reducers.



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-09 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561154#comment-15561154
 ] 

Xuefu Zhang commented on HIVE-14797:


+1



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-10-08 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15559171#comment-15559171
 ] 

roncenzhao commented on HIVE-14797:
---

Is there anyone who can review this patch?
thanks~



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15512290#comment-15512290
 ] 

Hive QA commented on HIVE-14797:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12829740/HIVE-14797.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10554 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ctas]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.metastore.TestMetaStoreMetrics.testMetaDataCounts
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testAddJarConstructorUnCaching
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1266/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1266/console
Test logs: http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-Build-1266/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12829740 - PreCommit-HIVE-Build

> reducer number estimating may lead to data skew
> ---
>
> Key: HIVE-14797
> URL: https://issues.apache.org/jira/browse/HIVE-14797
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Processor
>Reporter: roncenzhao
>Assignee: roncenzhao
> Attachments: HIVE-14797.2.patch, HIVE-14797.3.patch, HIVE-14797.patch
>
>
> HiveKey's hash code is generated by multiplying by 31, key by key, as 
> implemented in the method `ObjectInspectorUtils.getBucketHashCode()`:
> for (int i = 0; i < bucketFields.length; i++) {
>   int fieldHash = ObjectInspectorUtils.hashCode(bucketFields[i], 
> bucketFieldInspectors[i]);
>   hashCode = 31 * hashCode + fieldHash;
> }
> The following example leads to data skew:
> I have two tables, tbl1 and tbl2, with the same columns: a int, b string. 
> The values of column 'a' are not skewed in either table, but the values of 
> column 'b' are skewed in both.
> When my SQL is "select * from tbl1 join tbl2 on tbl1.a=tbl2.a and 
> tbl1.b=tbl2.b" and the estimated reducer number is 31, the result is data 
> skew.
> As we know, the HiveKey's hash code is generated by `hash(a)*31 + hash(b)`. 
> When the reducer number is 31, the reducer No. of each row is `hash(b)%31`, 
> because `hash(a)*31` is a multiple of 31. As a result, the job will be skewed.
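
For illustration, here is a minimal Java sketch of the effect described above. It is not Hive code: the class `SkewDemo`, the fixed `hashB` value, and the assumption that the hash of an int column equals its value are all hypothetical, and the partitioning rule is modeled on Hadoop's `HashPartitioner` (non-negative hash modulo the reducer count). The values are kept small so that int overflow does not come into play.

```java
final class SkewDemo {
    // Combines per-field hashes the way the quoted loop does (the seed is 31 there).
    static int combine(int seed, int... fieldHashes) {
        int hashCode = 0;
        for (int fieldHash : fieldHashes) {
            hashCode = seed * hashCode + fieldHash;
        }
        return hashCode;
    }

    // Assumption: a row goes to reducer (non-negative hash) mod numReducers.
    static int reducerFor(int hashCode, int numReducers) {
        return (hashCode & Integer.MAX_VALUE) % numReducers;
    }

    // Counts the reducers that receive rows when 'a' varies and hash(b) is fixed.
    static int busyReducers(int numReducers) {
        int hashB = 7; // hypothetical hash of the single hot value in column b
        boolean[] used = new boolean[numReducers];
        int busy = 0;
        for (int a = 0; a < 1000; a++) { // hash(a) assumed to equal a
            int r = reducerFor(combine(31, a, hashB), numReducers);
            if (!used[r]) {
                used[r] = true;
                busy++;
            }
        }
        return busy;
    }

    public static void main(String[] args) {
        System.out.println("31 reducers, busy: " + busyReducers(31));
        System.out.println("32 reducers, busy: " + busyReducers(32));
    }
}
```

With these inputs every row lands on a single reducer when there are 31 reducers, while all 32 reducers receive rows when there are 32, even though column 'a' is evenly distributed in both cases.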



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-21 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511843#comment-15511843
 ] 

roncenzhao commented on HIVE-14797:
---

I think they are not related to my patch. The failing test cases ran 
successfully on my own machine.



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-21 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511708#comment-15511708
 ] 

Rui Li commented on HIVE-14797:
---

I see some failures "did not produce a TEST-*.xml file". Are they related?



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-21 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510378#comment-15510378
 ] 

Xuefu Zhang commented on HIVE-14797:


The new change seems good. Minor nit: can we change the implementation of 
getBucketHashCode() to call the new method with a seed of 31? This would avoid 
some code duplication.
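
A minimal sketch of the delegation suggested here, assuming the new method takes the seed as an extra parameter. The class name `BucketHashSketch` and the simplified `int[]` signature are hypothetical: the loop quoted in the issue works on `bucketFields` and `bucketFieldInspectors`, whereas this sketch takes precomputed field hashes to stay self-contained.

```java
final class BucketHashSketch {
    // Existing entry point: keeps its behavior by delegating with a seed of 31.
    static int getBucketHashCode(int[] fieldHashes) {
        return getBucketHashCode(fieldHashes, 31);
    }

    // Hypothetical seeded variant holding the only copy of the loop.
    static int getBucketHashCode(int[] fieldHashes, int seed) {
        int hashCode = 0;
        for (int fieldHash : fieldHashes) {
            hashCode = seed * hashCode + fieldHash;
        }
        return hashCode;
    }

    public static void main(String[] args) {
        int[] hashes = {2, 3};
        System.out.println(getBucketHashCode(hashes));      // seed 31
        System.out.println(getBucketHashCode(hashes, 131)); // alternate seed
    }
}
```

Callers of the one-argument form see the same values as before, so only call sites that need a different seed have to change.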



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509958#comment-15509958
 ] 

Hive QA commented on HIVE-14797:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12829560/HIVE-14797.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 10526 tests 
executed
*Failed tests:*
{noformat}
TestMiniLlapCliDriver-auto_sortmerge_join_13.q-tez_dynpart_hashjoin_1.q-schema_evol_orc_acidvec_table_update.q-and-27-more
 - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ctas]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.metastore.TestMetaStoreMetrics.testMetaDataCounts
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testAddJarConstructorUnCaching
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1252/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1252/console
Test logs: 
http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-Build-1252/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12829560 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-20 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508829#comment-15508829
 ] 

Hive QA commented on HIVE-14797:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12829373/HIVE-14797.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 10556 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[ctas]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.metastore.TestMetaStoreMetrics.testMetaDataCounts
org.apache.hive.jdbc.TestJdbcWithMiniHS2.testAddJarConstructorUnCaching
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1248/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1248/console
Test logs: 
http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-Build-1248/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12829373 - PreCommit-HIVE-Build



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-20 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508805#comment-15508805
 ] 

Rui Li commented on HIVE-14797:
---

[~roncenzhao] your solution also seems OK, and simpler.
I would like to know [~xuefuz]'s opinion.



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-20 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508789#comment-15508789
 ] 

Rui Li commented on HIVE-14797:
---

Hmm, a random prime won't work because we need to make sure the same rows 
always have the same hash code. I can think of another way:
1. If we have only one field, we can just return the field's hash code.
2. If we have multiple fields, we can compute the hash code as 
P1*hash(F1)+...+Pn*hash(Fn), where hash(Fn) is the hash code of the nth field 
and {P1,...,Pn} is a deterministic series of prime numbers, e.g. {17,19,...}. 
It seems {{BigInteger::nextProbablePrime()}} can help generate the series.
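
The two cases above could be sketched in Java roughly as follows. This is only an illustration under assumptions: the class `PrimeSeriesHash` and its method names are hypothetical, the series is assumed to start at 17 as in the example, and field hashes are passed in as plain ints. `nextProbablePrime()` is probabilistic by contract, but it returns the same value for the same input, which is what a deterministic series needs.

```java
import java.math.BigInteger;

final class PrimeSeriesHash {
    // Builds the first n multipliers of the series, starting at 17.
    static int[] primeSeries(int n) {
        int[] primes = new int[n];
        BigInteger p = BigInteger.valueOf(17);
        for (int i = 0; i < n; i++) {
            primes[i] = p.intValueExact();
            p = p.nextProbablePrime(); // first probable prime greater than p
        }
        return primes;
    }

    // Case 1: a single field returns its own hash.
    // Case 2: several fields return P1*hash(F1) + ... + Pn*hash(Fn).
    static int hash(int... fieldHashes) {
        if (fieldHashes.length == 1) {
            return fieldHashes[0];
        }
        int[] primes = primeSeries(fieldHashes.length);
        int hashCode = 0;
        for (int i = 0; i < fieldHashes.length; i++) {
            hashCode += primes[i] * fieldHashes[i]; // int overflow wraps, as usual for hash codes
        }
        return hashCode;
    }

    public static void main(String[] args) {
        System.out.println(hash(5));    // single field
        System.out.println(hash(2, 3)); // 17*2 + 19*3
    }
}
```

A real implementation would presumably cache the series rather than rebuild it per row; the sketch recomputes it only to stay short.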



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-20 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508756#comment-15508756
 ] 

roncenzhao commented on HIVE-14797:
---

Or we can do the following:
Let the seed have two options: 31 and 131. In `ReduceSinkOperator` we can get 
the reducer number, named `reduceNum`, and then choose the other value if 
`reduceNum` is equal to 31 or 131.
Is it OK?
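
A rough Java sketch of this seed selection, for illustration only. The class `SeedChooser`, its method names, and the fixed `hashB` value are hypothetical, the hash of an int column is assumed to equal its value, and the partitioning rule (non-negative hash modulo `reduceNum`) is an assumption modeled on Hadoop's `HashPartitioner`.

```java
final class SeedChooser {
    static final int DEFAULT_SEED = 31;
    static final int ALTERNATE_SEED = 131;

    // Uses 31 unless the reducer count is 31, in which case 131 is used.
    // A reducer count of 131 therefore also ends up with the other seed, 31.
    static int chooseSeed(int reduceNum) {
        return reduceNum == DEFAULT_SEED ? ALTERNATE_SEED : DEFAULT_SEED;
    }

    // Two-field hash in the style of the quoted loop: seed * hash(a) + hash(b).
    static int hash(int seed, int hashA, int hashB) {
        return seed * hashA + hashB;
    }

    // Counts reducers that receive rows when 'a' varies and hash(b) is fixed.
    static int busyReducers(int reduceNum) {
        int seed = chooseSeed(reduceNum);
        int hashB = 7; // hypothetical hash of the hot value in column b
        boolean[] used = new boolean[reduceNum];
        int busy = 0;
        for (int a = 0; a < 1000; a++) {
            int r = (hash(seed, a, hashB) & Integer.MAX_VALUE) % reduceNum;
            if (!used[r]) {
                used[r] = true;
                busy++;
            }
        }
        return busy;
    }

    public static void main(String[] args) {
        System.out.println("reduceNum=31, busy: " + busyReducers(31));
        System.out.println("reduceNum=131, busy: " + busyReducers(131));
    }
}
```

This sketch only covers reducer counts exactly equal to one of the two seeds; a count that is a larger multiple of the chosen seed, such as 62 with seed 31, is not handled by it.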



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-20 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508643#comment-15508643
 ] 

Rui Li commented on HIVE-14797:
---

If the user specifies #reducers to be 31, we shouldn't change it. Could we use 
random prime numbers to compute the hash code?



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-20 Thread roncenzhao (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15508405#comment-15508405
 ] 

roncenzhao commented on HIVE-14797:
---

Yes, we can avoid hard-coding the number (31). But we cannot know which number 
to set before the end of the job. 
So I think we can solve it easily as follows:
In the method "Utilities.estimateReducers(xxx)", when the `reducers` value is 
divisible by 31, we add 1 to it.
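
As a sketch of that adjustment, assuming it is applied to the estimate just before it is returned. The helper below is hypothetical and is not the actual body of `Utilities.estimateReducers`; the class name, method name, and the `seed` parameter are made up for illustration.

```java
final class ReducerCountAdjust {
    // Bumps the estimate by one when it is a positive multiple of the hash seed.
    static int avoidMultipleOf(int reducers, int seed) {
        if (reducers > 0 && reducers % seed == 0) {
            return reducers + 1;
        }
        return reducers;
    }

    public static void main(String[] args) {
        System.out.println(avoidMultipleOf(31, 31)); // adjusted to 32
        System.out.println(avoidMultipleOf(30, 31)); // left at 30
    }
}
```

This only changes estimated values; a reducer count set explicitly by the user would bypass it.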



[jira] [Commented] (HIVE-14797) reducer number estimating may lead to data skew

2016-09-20 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506967#comment-15506967
 ] 

Xuefu Zhang commented on HIVE-14797:


This seems to make sense, but can we avoid hard-coding the number (31)?
