[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-26 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523928#comment-16523928
 ] 

Xuefu Zhang commented on HIVE-19671:


Yeah. I think it makes sense. Thank.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-24 Thread Rui Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521790#comment-16521790
 ] 

Rui Li commented on HIVE-19671:
---

We can check all RS and look for non-deterministic UDF in partition keys -- 
{{FunctionRegistry::isDeterministic}} can be used.
I noted Hive itself may also use non-deterministic partitioning, e.g. to handle 
skewed GBY, we first shuffle randomly to do partial aggregation. Do you think 
it makes sense to print a warning for that?

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-21 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519830#comment-16519830
 ] 

Xuefu Zhang commented on HIVE-19671:


Printing a warning is good, but we may not know if a partitioning is 
non-deterministic. Let me know your idea. Thanks.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-21 Thread Rui Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519357#comment-16519357
 ] 

Rui Li commented on HIVE-19671:
---

[~xuefuz], I agree it's not trivial to solve this on Hive side. Maybe we can at 
least print some warning if the query has nondeterministic partitioning?
And another potential solution is to retry all downstream tasks when any 
upstream task fails, which needs help from the execution engine.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518745#comment-16518745
 ] 

Xuefu Zhang commented on HIVE-19671:


Based on your analysis, it seems that random(see) depends on a deterministic 
order of the data. Reading HDFS may guarantee the order, but probably not every 
data source has such guaranty. Also, a map or reduce logic may also generate 
nondeterministic order. Having said that, it appears to me that any 
partitioning that depends on a deterministic ordering of the data is doomed, 
include rand() and rand(seed). This is rather a user problem, for which I'm not 
sure if Hive needs to do anything. We may document this as a general bad 
practice, but blocking solves rand() problemI  but doesn't help other similar 
problems. I suggest we leave it to user to solve the problem. Thoughts?

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-06-20 Thread Rui Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517846#comment-16517846
 ] 

Rui Li commented on HIVE-19671:
---

[~xuefuz], thanks for your input. I think rand(seed) may not work if the 
mapper's input is not in deterministic order. As an example, suppose a mapper 
needs to process key {{1, 2, 3, 4, 5}}. The partition in 1st attempt is as 
below:
{noformat}
key   rand(seed)
 1  ->   1
 2  ->   2
 3  ->   3
 4  ->   4
 5  ->   5
{noformat}
So there'll be 5 reducers to fetch data from this mapper. Suppose the first 4 
reducers have finished. And when the 5th reducer starts, the node hosting the 
mapper's output is lost, so the mapper is rerun. And the 2nd attempt has the 
following partition:
{noformat}
key   rand(seed)
 1  ->   1
 3  ->   2
 5  ->   3
 2  ->   4
 4  ->   5
{noformat}
Then the 5th reducer is rerun and fetches key 4, which means key 4 is 
duplicated and key 5 is lost.

To avoid the issue, we need to make sure record reader can guarantee an order 
when reading data from HDFS, and we don't use shuffling that doesn't order the 
keys, e.g. groupByKey of Spark. What do you think?

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-05-29 Thread Xuefu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493796#comment-16493796
 ] 

Xuefu Zhang commented on HIVE-19671:


[~lirui] I think #1 is better. Nondeterministic partitioning including using 
random would be a problem in many aspects. #2 is a little harsh, as those are 
usually service level attributes. Thanks.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-05-28 Thread Rui Li (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493065#comment-16493065
 ] 

Rui Li commented on HIVE-19671:
---

Verified the issue only happens when there're task retries. I can think of two 
possible solutions:
 # Use rand(seed) instead of rand(). Rand(seed) is supposed to generate 
deterministic sequences so the retried task will have the same partition as 
original one. The prerequisite is the sequence of method calls is the same, and 
the task input is in deterministic order.
 # Disable task retry if the shuffle partition key is non-deterministic, using 
configs like {{mapreduce.map.maxattempts}}, 
{{tez.am.task.max.failed.attempts}}, {{spark.task.maxFailures}}.

[~gopalv], [~xuefuz], [~ashutoshc], do you have any suggestions? Thanks.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Assignee: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency

2018-05-23 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487148#comment-16487148
 ] 

Rui Li commented on HIVE-19671:
---

I haven't verified it but my guess is the issue happens with task failover. 
Suppose mappers of {{distribute by}} finish successfully. Then reducers start 
but fail to fetch shuffle data because some nodes hosting mapper output are 
lost. Then those mappers are retried. But since partition keys are randomly 
generated, the retried tasks can produce different partitions than the previous 
attempt, which leads to the inconsistency.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
>  Issue Type: Bug
>Reporter: Rui Li
>Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand());
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)