[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic

2021-12-30 Thread Stu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466961#comment-17466961
 ] 

Stu commented on SPARK-23599:
-

We have encountered this problem with Spark 3.1.2, resulting in duplicate 
values in a situation where a spark executor died. As suggested in the 
description, this error was hard to track down and difficult to replicate. 

> The UUID() expression is too non-deterministic
> --
>
> Key: SPARK-23599
> URL: https://issues.apache.org/jira/browse/SPARK-23599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hövell
>Assignee: L. C. Hsieh
>Priority: Critical
> Fix For: 2.3.1, 2.4.0
>
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing 
> model, and this will to very hard to trace bugs, like non-deterministic 
> shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single JVM 
> wide lock, and this can lead to lock contention and other performance 
> problems.
> We should move to something that is deterministic between retries. This can 
> be done by using seeded PRNGs for which we set the seed during planning. It 
> is important here to use a PRNG that provides enough entropy for creating a 
> proper UUID.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic

2018-03-26 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16413620#comment-16413620
 ] 

Apache Spark commented on SPARK-23599:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20903

> The UUID() expression is too non-deterministic
> --
>
> Key: SPARK-23599
> URL: https://issues.apache.org/jira/browse/SPARK-23599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 2.4.0
>
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing 
> model, and this will to very hard to trace bugs, like non-deterministic 
> shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single JVM 
> wide lock, and this can lead to lock contention and other performance 
> problems.
> We should move to something that is deterministic between retries. This can 
> be done by using seeded PRNGs for which we set the seed during planning. It 
> is important here to use a PRNG that provides enough entropy for creating a 
> proper UUID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic

2018-03-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405764#comment-16405764
 ] 

Apache Spark commented on SPARK-23599:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20861

> The UUID() expression is too non-deterministic
> --
>
> Key: SPARK-23599
> URL: https://issues.apache.org/jira/browse/SPARK-23599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing 
> model, and this will to very hard to trace bugs, like non-deterministic 
> shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single JVM 
> wide lock, and this can lead to lock contention and other performance 
> problems.
> We should move to something that is deterministic between retries. This can 
> be done by using seeded PRNGs for which we set the seed during planning. It 
> is important here to use a PRNG that provides enough entropy for creating a 
> proper UUID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic

2018-03-19 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404511#comment-16404511
 ] 

Herman van Hovell commented on SPARK-23599:
---

PR 1 out of 2 has been merged.

> The UUID() expression is too non-deterministic
> --
>
> Key: SPARK-23599
> URL: https://issues.apache.org/jira/browse/SPARK-23599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing 
> model, and this will to very hard to trace bugs, like non-deterministic 
> shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single JVM 
> wide lock, and this can lead to lock contention and other performance 
> problems.
> We should move to something that is deterministic between retries. This can 
> be done by using seeded PRNGs for which we set the seed during planning. It 
> is important here to use a PRNG that provides enough entropy for creating a 
> proper UUID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23599) The UUID() expression is too non-deterministic

2018-03-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16398121#comment-16398121
 ] 

Apache Spark commented on SPARK-23599:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20817

> The UUID() expression is too non-deterministic
> --
>
> Key: SPARK-23599
> URL: https://issues.apache.org/jira/browse/SPARK-23599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Priority: Critical
>
> The current {{Uuid()}} expression uses {{java.util.UUID.randomUUID}} for UUID 
> generation. There are a couple of major problems with this:
> - It is non-deterministic across task retries. This breaks Spark's processing 
> model, and this will to very hard to trace bugs, like non-deterministic 
> shuffles, duplicates and missing rows.
> - It uses a single secure random for UUID generation. This uses a single JVM 
> wide lock, and this can lead to lock contention and other performance 
> problems.
> We should move to something that is deterministic between retries. This can 
> be done by using seeded PRNGs for which we set the seed during planning. It 
> is important here to use a PRNG that provides enough entropy for creating a 
> proper UUID.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org