[jira] [Commented] (SPARK-23693) SQL function uuid()

2021-08-17 Thread Jean Georges Perrin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400411#comment-17400411
 ] 

Jean Georges Perrin commented on SPARK-23693:
-

[~rxin] - You could require a parameter to the function; this should make it 
deterministic.
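
One way to read this suggestion, as a minimal sketch only: derive the ID from a caller-supplied value, so the same input always yields the same UUID. The object ParamUuid and the method uuidFor are hypothetical names, not part of Spark or of the proposed PR. The sketch relies on java.util.UUID.nameUUIDFromBytes, which produces a name-based (version 3) UUID.

```scala
import java.nio.charset.StandardCharsets
import java.util.UUID

// Hypothetical helper, not a Spark API: the "parameter" is a key chosen by the caller.
object ParamUuid {
  // The same key always maps to the same name-based (version 3) UUID,
  // so recomputing a row after a fault reproduces its ID.
  def uuidFor(key: String): String =
    UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString
}
```

The trade-off in this sketch is that uniqueness now depends on the key: two rows that pass the same key get the same ID.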

> SQL function uuid()
> ---
>
> Key: SPARK-23693
> URL: https://issues.apache.org/jira/browse/SPARK-23693
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.1, 2.3.0
> Reporter: Arseniy Tashoyan
> Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns a 
> [Universally Unique 
> ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
>  * monotonically_increasing_id() function
>  * row_number() function over some window
>  * convert the DataFrame to RDD and zipWithIndex()
> None of these approaches works when appending this DataFrame to another 
> DataFrame (union). Collisions may occur: two rows in different DataFrames 
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an 
> option, because data in some other system may already refer to the old IDs.
> The proposed solution is to add a new function:
> {code:scala}
> def uuid(): Column
> {code}
> that returns the String representation of a UUID.
> A UUID is a 128-bit number (two long numbers). Scala and Java have no 
> primitive 128-bit integer type. In addition, some storage systems do not 
> support 128-bit numbers (Parquet's largest numeric type is INT96). This is 
> the reason for the uuid() function to return String.
> I already have a simple implementation based on 
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I 
> can share it as a PR.
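
To illustrate the "two long numbers" point in the description above, here is a small sketch using only java.util.UUID. The object UuidParts and its method names are hypothetical and chosen for illustration; they do not come from the issue or from the referenced java-uuid-generator library.

```scala
import java.util.UUID

// Hypothetical helper showing how a 128-bit UUID maps onto two 64-bit longs.
object UuidParts {
  // Split a UUID into its most and least significant 64 bits.
  def toLongs(id: UUID): (Long, Long) =
    (id.getMostSignificantBits, id.getLeastSignificantBits)

  // Rebuild the UUID from the two halves.
  def fromLongs(msb: Long, lsb: Long): UUID = new UUID(msb, lsb)

  // Canonical text form: 32 hex digits plus 4 hyphens, 36 characters in total.
  def asString(msb: Long, lsb: Long): String = fromLongs(msb, lsb).toString
}
```

Returning the string form, as the description proposes, avoids needing a 128-bit column type at the cost of 36 characters per value.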



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23693) SQL function uuid()

2018-12-23 Thread Reynold Xin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728078#comment-16728078
 ] 

Reynold Xin commented on SPARK-23693:
-

[~tashoyan] the issue with calling uuid directly is that it is 
non-deterministic: when a recompute happens due to a fault, the IDs are not 
stable. We'd need a different way to generate UUIDs that is deterministic 
based on some seed.
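
A minimal sketch of what seed-based generation could look like, assuming a plain java.util.Random as the source of bits. The object SeededUuid and its methods are hypothetical and are not claimed to match any Spark implementation. The sketch also assumes that, in a distributed job, the seed would be combined with something stable such as a partition index; that part is not shown.

```scala
import java.util.{Random, UUID}

// Hypothetical helper, not a Spark API.
object SeededUuid {
  // Build a UUID with the version 4 layout from two longs drawn from rng.
  def next(rng: Random): UUID = {
    // Clear bits 12-15 of the high half, then set the version field to 4.
    val msb = (rng.nextLong() & ~0xF000L) | 0x4000L
    // Clear the top two bits of the low half, then set the variant to binary 10.
    val lsb = (rng.nextLong() & 0x3FFFFFFFFFFFFFFFL) | Long.MinValue
    new UUID(msb, lsb)
  }

  // The same seed always produces the same sequence of IDs, so a recompute
  // that replays the rows in the same order reproduces the same values.
  def sequence(seed: Long, n: Int): Seq[String] = {
    val rng = new Random(seed)
    Seq.fill(n)(next(rng).toString)
  }
}
```

Stability in this sketch depends on rows being visited in the same order on recompute; if the order can change, the IDs would shift between rows.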




[jira] [Commented] (SPARK-23693) SQL function uuid()

2018-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435869#comment-16435869
 ] 

Apache Spark commented on SPARK-23693:
--

User 'tashoyan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21055
