[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
    [ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466961#comment-17466961 ]

Stu commented on SPARK-23599:
-----------------------------

We have encountered this problem with Spark 3.1.2. It produced duplicate values after a Spark executor died. As suggested in the description, the error was hard to track down and difficult to reproduce.

> The UUID() expression is too non-deterministic
> ----------------------------------------------
>
>                 Key: SPARK-23599
>                 URL: https://issues.apache.org/jira/browse/SPARK-23599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hövell
>            Assignee: L. C. Hsieh
>            Priority: Critical
>             Fix For: 2.3.1, 2.4.0
>
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing
> model, and it will lead to very hard to trace bugs, like non-deterministic
> shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single
> JVM-wide lock, which can lead to lock contention and other performance
> problems.
> We should move to something that is deterministic between retries. This can
> be done by using seeded PRNGs for which we set the seed during planning. It
> is important here to use a PRNG that provides enough entropy for creating a
> proper UUID.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
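
The issue description proposes seeded PRNGs whose seed is fixed during planning, so that a retried task regenerates the same UUIDs. The following is only a minimal sketch of that idea in plain Java, not the code from the pull requests on this issue. The class name, the helper `uuidsForPartition`, and the choice to derive the per-task seed as `planSeed + partitionIndex` are all assumptions made for illustration. `java.util.SplittableRandom` is used for brevity; it carries only 64 bits of seed, so a real implementation would probably want a generator with more state, as the description notes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;
import java.util.UUID;

public class SeededUuidSketch {

    /** Hypothetical generator: same (planSeed, partitionIndex) gives the same UUID sequence. */
    static final class SeededUuidGenerator {
        private final SplittableRandom random;

        SeededUuidGenerator(long planSeed, int partitionIndex) {
            // Assumption: the seed is chosen once at planning time and offset per partition.
            this.random = new SplittableRandom(planSeed + partitionIndex);
        }

        UUID next() {
            long msb = random.nextLong();
            long lsb = random.nextLong();
            // Force the version field to 4 (randomly generated UUID).
            msb = (msb & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
            // Force the variant field to the IETF (RFC 4122) variant, binary 10.
            lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
            return new UUID(msb, lsb);
        }
    }

    /** Simulates one task attempt producing `count` UUIDs for a partition. */
    static List<UUID> uuidsForPartition(long planSeed, int partitionIndex, int count) {
        SeededUuidGenerator generator = new SeededUuidGenerator(planSeed, partitionIndex);
        List<UUID> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add(generator.next());
        }
        return result;
    }

    public static void main(String[] args) {
        long planSeed = 42L; // stands in for a seed picked during planning
        List<UUID> firstAttempt = uuidsForPartition(planSeed, 3, 5);
        List<UUID> retry = uuidsForPartition(planSeed, 3, 5);
        System.out.println("retry reproduces first attempt: " + firstAttempt.equals(retry));
    }
}
```

Because each generator owns its own PRNG instance, this sketch also avoids the shared `SecureRandom` lock mentioned in the description, although it gives up cryptographic strength in exchange.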
[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
    [ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413620#comment-16413620 ]

Apache Spark commented on SPARK-23599:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20903

> The UUID() expression is too non-deterministic
> ----------------------------------------------
>
>                 Key: SPARK-23599
>                 URL: https://issues.apache.org/jira/browse/SPARK-23599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Liang-Chi Hsieh
>            Priority: Critical
>             Fix For: 2.4.0
[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
    [ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405764#comment-16405764 ]

Apache Spark commented on SPARK-23599:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20861

> The UUID() expression is too non-deterministic
> ----------------------------------------------
>
>                 Key: SPARK-23599
>                 URL: https://issues.apache.org/jira/browse/SPARK-23599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Liang-Chi Hsieh
>            Priority: Critical
[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
    [ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404511#comment-16404511 ]

Herman van Hovell commented on SPARK-23599:
-------------------------------------------

PR 1 out of 2 has been merged.

> The UUID() expression is too non-deterministic
> ----------------------------------------------
>
>                 Key: SPARK-23599
>                 URL: https://issues.apache.org/jira/browse/SPARK-23599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Assignee: Liang-Chi Hsieh
>            Priority: Critical
[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic
    [ https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398121#comment-16398121 ]

Apache Spark commented on SPARK-23599:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20817

> The UUID() expression is too non-deterministic
> ----------------------------------------------
>
>                 Key: SPARK-23599
>                 URL: https://issues.apache.org/jira/browse/SPARK-23599
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Herman van Hovell
>            Priority: Critical