[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523928#comment-16523928 ]

Xuefu Zhang commented on HIVE-19671:

Yeah, I think it makes sense. Thanks.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
> Issue Type: Bug
> Reporter: Rui Li
> Assignee: Rui Li
> Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand()) a;
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521790#comment-16521790 ]

Rui Li commented on HIVE-19671:

We can check every RS (ReduceSink) operator and look for non-deterministic UDFs in its partition keys; {{FunctionRegistry::isDeterministic}} can be used for that. I noted that Hive itself may also use non-deterministic partitioning, e.g. to handle a skewed GBY (group by), we first shuffle randomly to do partial aggregation. Do you think it makes sense to print a warning for that as well?
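The check described above can be sketched roughly as follows. This is not Hive code: the {{Expr}} class, the {{NON_DETERMINISTIC_FUNCTIONS}} set, and both function names are hypothetical stand-ins, and a real implementation would walk the actual ReduceSink partition-key expressions and consult {{FunctionRegistry::isDeterministic}} instead of a hard-coded set.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical stand-in for a function-registry lookup; only rand is listed.
NON_DETERMINISTIC_FUNCTIONS = {"rand"}


@dataclass
class Expr:
    """Toy expression node: a function call or a leaf (column / constant)."""
    name: str
    children: List["Expr"] = field(default_factory=list)


def is_deterministic(expr: Expr) -> bool:
    """An expression is deterministic only if every node in its tree is."""
    if expr.name in NON_DETERMINISTIC_FUNCTIONS:
        return False
    return all(is_deterministic(child) for child in expr.children)


def partition_key_warnings(partition_keys: List[Expr]) -> List[str]:
    """Return one warning per partition key containing a non-deterministic call."""
    return [
        f"WARNING: partition key '{key.name}' is non-deterministic; "
        "task retries may produce inconsistent results"
        for key in partition_keys
        if not is_deterministic(key)
    ]


if __name__ == "__main__":
    keys = [
        Expr("user_id"),
        # floor(rand() * 10): the rand() call is nested, so the walk must recurse.
        Expr("floor", [Expr("multiply", [Expr("rand"), Expr("10")])]),
    ]
    for line in partition_key_warnings(keys):
        print(line)
```

The recursion matters because the non-deterministic call can be buried inside an otherwise ordinary expression; checking only the top-level function of each key would miss it.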
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519830#comment-16519830 ]

Xuefu Zhang commented on HIVE-19671:

Printing a warning is good, but we may not know whether a partitioning is non-deterministic. Let me know your idea. Thanks.
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16519357#comment-16519357 ]

Rui Li commented on HIVE-19671:

[~xuefuz], I agree it's not trivial to solve this on the Hive side. Maybe we can at least print a warning if the query has nondeterministic partitioning? Another potential solution is to retry all downstream tasks when any upstream task fails, which needs help from the execution engine.
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16518745#comment-16518745 ]

Xuefu Zhang commented on HIVE-19671:

Based on your analysis, it seems that rand(seed) depends on a deterministic order of the data. Reading from HDFS may guarantee the order, but probably not every data source has such a guarantee. Map or reduce logic may also produce a nondeterministic order.

Having said that, it appears to me that any partitioning that depends on a deterministic ordering of the data is doomed, including rand() and rand(seed). This is rather a user problem, and I'm not sure Hive needs to do anything about it. We may document this as a general bad practice, but blocking it solves the rand() problem without helping other similar problems. I suggest we leave it to the user to solve the problem. Thoughts?
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517846#comment-16517846 ]

Rui Li commented on HIVE-19671:

[~xuefuz], thanks for your input. I think rand(seed) may not work if the mapper's input is not in deterministic order. As an example, suppose a mapper needs to process keys {{1, 2, 3, 4, 5}}. The partitioning in the 1st attempt is as below:
{noformat}
key    rand(seed)
1  ->  1
2  ->  2
3  ->  3
4  ->  4
5  ->  5
{noformat}
So there'll be 5 reducers to fetch data from this mapper. Suppose the first 4 reducers have finished. When the 5th reducer starts, the node hosting the mapper's output is lost, so the mapper is rerun. The 2nd attempt has the following partitioning:
{noformat}
key    rand(seed)
1  ->  1
3  ->  2
5  ->  3
2  ->  4
4  ->  5
{noformat}
Then the 5th reducer is rerun and fetches key 4, which means key 4 is duplicated and key 5 is lost.

To avoid the issue, we need to make sure the record reader can guarantee an order when reading data from HDFS, and that we don't use shuffling that doesn't order the keys, e.g. groupByKey of Spark. What do you think?
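The example above can be replayed in a few lines. This is only a sketch of the scenario as described in the comment: the list {{seeded_draws}} stands in for the partition numbers that rand(seed) would yield in call order (the idealized 1..5 from the tables, not real generator output), and the function names are made up for illustration.

```python
def assign(keys_in_read_order, seeded_draws):
    """Pair the i-th record read with the i-th value of the seeded sequence."""
    return dict(zip(keys_in_read_order, seeded_draws))


def keys_for(assignment, reducer):
    """Keys that a given reducer would fetch from one mapper attempt."""
    return sorted(k for k, part in assignment.items() if part == reducer)


def replay():
    # Stand-in for rand(seed): the same sequence in both attempts.
    seeded_draws = [1, 2, 3, 4, 5]
    attempt1 = assign([1, 2, 3, 4, 5], seeded_draws)  # original read order
    attempt2 = assign([1, 3, 5, 2, 4], seeded_draws)  # rerun, different order

    received = []
    # Reducers 1-4 finished before the mapper output was lost.
    for reducer in (1, 2, 3, 4):
        received += keys_for(attempt1, reducer)
    # Reducer 5 fetches from the rerun mapper.
    received += keys_for(attempt2, 5)
    return received


if __name__ == "__main__":
    print(replay())  # [1, 2, 3, 4, 4]: key 4 duplicated, key 5 lost
```

The seed makes the sequence of draws repeatable, but the draw-to-record pairing still follows read order, so the total row count stays at 5 while the contents are wrong.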
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493796#comment-16493796 ]

Xuefu Zhang commented on HIVE-19671:

[~lirui] I think #1 is better. Nondeterministic partitioning, including the use of random values, would be a problem in many respects. #2 is a little harsh, as those are usually service-level attributes. Thanks.
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16493065#comment-16493065 ]

Rui Li commented on HIVE-19671:

Verified that the issue only happens when there are task retries. I can think of two possible solutions:
# Use rand(seed) instead of rand(). Rand(seed) is supposed to generate deterministic sequences, so the retried task will produce the same partitioning as the original one. The prerequisites are that the sequence of method calls is the same and that the task input is in deterministic order.
# Disable task retry if the shuffle partition key is non-deterministic, using configs like {{mapreduce.map.maxattempts}}, {{tez.am.task.max.failed.attempts}}, {{spark.task.maxFailures}}.

[~gopalv], [~xuefuz], [~ashutoshc], do you have any suggestions? Thanks.
[jira] [Commented] (HIVE-19671) Distribute by rand() can lead to data inconsistency
[ https://issues.apache.org/jira/browse/HIVE-19671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16487148#comment-16487148 ]

Rui Li commented on HIVE-19671:

I haven't verified it, but my guess is that the issue happens with task failover. Suppose the mappers of {{distribute by}} finish successfully. Then reducers start but fail to fetch shuffle data because some nodes hosting mapper output are lost. Then those mappers are retried. But since partition keys are randomly generated, the retried tasks can produce different partitions than the previous attempt, which leads to the inconsistency.

> Distribute by rand() can lead to data inconsistency
> ---
>
> Key: HIVE-19671
> URL: https://issues.apache.org/jira/browse/HIVE-19671
> Project: Hive
> Issue Type: Bug
> Reporter: Rui Li
> Priority: Major
>
> Noticed the following queries can give different results:
> {code}
> select count(*) from tbl;
> select count(*) from (select * from tbl distribute by rand());
> {code}
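The failover hypothesis above can be illustrated with a small simulation. It is a sketch under simplifying assumptions, not a model of any real execution engine: there is a single mapper, unseeded rand() is modelled by giving each attempt its own seed, and all names ({{partition_rows}}, {{run_with_mapper_retry}}, the seed parameters) are invented for the example.

```python
import random


def partition_rows(rows, num_reducers, rng):
    """Send each row to a randomly drawn reducer, like DISTRIBUTE BY rand()."""
    partitions = {r: [] for r in range(num_reducers)}
    for row in rows:
        partitions[rng.randrange(num_reducers)].append(row)
    return partitions


def run_with_mapper_retry(rows, num_reducers, finished_before_loss,
                          seed_first, seed_retry):
    """Reducers in finished_before_loss read the first mapper attempt;
    the remaining reducers read the retried attempt."""
    first = partition_rows(rows, num_reducers, random.Random(seed_first))
    retry = partition_rows(rows, num_reducers, random.Random(seed_retry))
    received = []
    for reducer in range(num_reducers):
        source = first if reducer in finished_before_loss else retry
        received.extend(source[reducer])
    return received


if __name__ == "__main__":
    rows = list(range(1000))
    # Different draws on retry: some rows arrive twice, others never arrive.
    mixed = run_with_mapper_retry(rows, 4, {0, 1}, seed_first=7, seed_retry=8)
    print(len(mixed), len(set(mixed)))
    # Identical draws on retry: every row arrives exactly once.
    same = run_with_mapper_retry(rows, 4, {0, 1}, seed_first=7, seed_retry=7)
    print(sorted(same) == rows)
```

A row is lost when the first attempt sent it to a reducer that later reads the retry, and the retry sends it to a reducer that already finished; it is duplicated in the opposite case. That is why count(*) over the shuffled data can differ from count(*) over the table.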