[jira] [Updated] (HIVE-17114) HoS: Possible skew in shuffling when data is not really skewed

2017-07-23 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17114:
--
   Resolution: Fixed
Fix Version/s: 3.0.0
   Status: Resolved  (was: Patch Available)

Pushed to master. Thanks Chao for the review.

> HoS: Possible skew in shuffling when data is not really skewed
> --
>
> Key: HIVE-17114
> URL: https://issues.apache.org/jira/browse/HIVE-17114
> Project: Hive
> Issue Type: Bug
> Reporter: Rui Li
> Assignee: Rui Li
> Priority: Minor
> Fix For: 3.0.0
>
> Attachments: HIVE-17114.1.patch, HIVE-17114.2.patch, HIVE-17114.3.patch
>
>
> Observed in HoS and may apply to other engines as well.
> When we join 2 tables on a single int key, we use the key itself as the hash
> code in {{ObjectInspectorUtils.hashCode}}:
> {code}
>   case INT:
>     return ((IntObjectInspector) poi).get(o);
> {code}
> Suppose the keys are distinct but are all multiples of 10. If we then choose
> 10 as the number of reducers, every hash code is congruent to 0 modulo 10, so
> all keys go to the same reducer and the shuffle is skewed.
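
The skew described above can be reproduced with a small standalone sketch. It is not Hive code: the class name, the helper methods, and the partitioning rule (non-negative hash modulo the number of reducers) are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of Hive.
class IdentityHashSkewDemo {

    // Mimics the INT case quoted above: the key itself is the hash code.
    static int identityHash(int key) {
        return key;
    }

    // Assumed partitioning rule: non-negative hash modulo the reducer count.
    static int partition(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many of the keys 0, step, 2*step, ... land on each reducer.
    static int[] distribute(int numKeys, int step, int numReducers) {
        int[] counts = new int[numReducers];
        for (int i = 0; i < numKeys; i++) {
            counts[partition(identityHash(i * step), numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // 1000 distinct keys, all multiples of 10, shuffled to 10 reducers.
        // prints [1000, 0, 0, 0, 0, 0, 0, 0, 0, 0]
        System.out.println(Arrays.toString(distribute(1000, 10, 10)));
    }
}
```

Although the 1000 keys are all distinct, every one of them is assigned to reducer 0, which is the skew reported in this issue.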



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (HIVE-17114) HoS: Possible skew in shuffling when data is not really skewed

2017-07-20 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17114:
--
Attachment: HIVE-17114.3.patch






[jira] [Updated] (HIVE-17114) HoS: Possible skew in shuffling when data is not really skewed

2017-07-20 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17114:
--
Attachment: HIVE-17114.2.patch

Patch v2 updates some golden files. Most of the changes occur because, with
uniform hashing, different vectorized RS (ReduceSink) operators are used,
which is expected.






[jira] [Updated] (HIVE-17114) HoS: Possible skew in shuffling when data is not really skewed

2017-07-18 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17114:
--
Status: Patch Available  (was: Open)






[jira] [Updated] (HIVE-17114) HoS: Possible skew in shuffling when data is not really skewed

2017-07-18 Thread Rui Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/HIVE-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Li updated HIVE-17114:
--
Attachment: HIVE-17114.1.patch

I think we can set the UNIFORM trait on the RS (ReduceSink) so that MurmurHash
is used to compute the hash code, which should solve the issue here.
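
To illustrate why a mixing hash helps, here is a minimal sketch that is not taken from the patch. It uses the 32-bit MurmurHash3 finalizer as a stand-in for the Murmur hashing Hive applies to the key, and it assumes the partition is the non-negative hash modulo the number of reducers. The class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical demo class; not part of Hive or of the attached patch.
class MixedHashSpreadDemo {

    // MurmurHash3 32-bit finalizer (fmix32), used here only as a stand-in
    // for a uniform hash. Hive's actual Murmur-based hashing may differ.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Assumed partitioning rule: non-negative hash modulo the reducer count.
    static int partition(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many of the keys 0, step, 2*step, ... land on each reducer
    // when the key is mixed before partitioning.
    static int[] distribute(int numKeys, int step, int numReducers) {
        int[] counts = new int[numReducers];
        for (int i = 0; i < numKeys; i++) {
            counts[partition(mix(i * step), numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Same workload as in the issue: 1000 keys, all multiples of 10,
        // shuffled to 10 reducers.
        System.out.println(Arrays.toString(distribute(1000, 10, 10)));
    }
}
```

With the mixing step, the per-reducer counts should come out roughly even rather than concentrated on a single reducer, because the low-order structure of the keys no longer survives into the hash code.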



