[jira] [Commented] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17400411#comment-17400411 ]

Jean Georges Perrin commented on SPARK-23693:
---------------------------------------------

[~rxin] - You could require a parameter for the function; this should make it deterministic.

> SQL function uuid()
> -------------------
>
>                 Key: SPARK-23693
>                 URL: https://issues.apache.org/jira/browse/SPARK-23693
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.1, 2.3.0
>            Reporter: Arseniy Tashoyan
>            Priority: Minor
>
> Add function uuid() to org.apache.spark.sql.functions that returns a
> [Universally Unique ID|https://en.wikipedia.org/wiki/Universally_unique_identifier].
> Sometimes it is necessary to uniquely identify each row in a DataFrame.
> Currently the following ways are available:
> * monotonically_increasing_id() function
> * row_number() function over some window
> * convert the DataFrame to RDD and zipWithIndex()
> None of these approaches works when appending this DataFrame to another
> DataFrame (union). Collisions may occur: two rows in different DataFrames
> may have the same ID. Re-generating IDs on the resulting DataFrame is not an
> option, because some data in some other system may already refer to the old IDs.
> The proposed solution is to add a new function:
> {code:scala}
> def uuid(): Column
> {code}
> that returns the String representation of a UUID.
> A UUID is represented as a 128-bit number (two long numbers). Neither Scala
> nor Java has a primitive 128-bit numeric type. In addition, some storage
> systems do not support 128-bit numbers (Parquet's largest numeric type is
> INT96). This is the reason for the uuid() function to return String.
> I already have a simple implementation based on
> [java-uuid-generator|https://github.com/cowtowncoder/java-uuid-generator]. I
> can share it as a PR.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
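The issue description in the message above says that a UUID is a 128-bit value held as two long numbers and proposes returning it as a String. As a rough illustration of that representation only (not the proposed Spark function), the sketch below uses the JDK's java.util.UUID. The object name UuidParts and its helpers toLongs and fromLongs are hypothetical names chosen for this example.

```scala
import java.util.UUID

// Hypothetical helper object for illustration; not part of Spark.
object UuidParts {
  // Split a UUID into its two 64-bit halves.
  def toLongs(id: UUID): (Long, Long) =
    (id.getMostSignificantBits, id.getLeastSignificantBits)

  // Rebuild a UUID from the two halves.
  def fromLongs(high: Long, low: Long): UUID = new UUID(high, low)

  def main(args: Array[String]): Unit = {
    val id = UUID.randomUUID()
    val (high, low) = toLongs(id)
    // Canonical text form: 32 hex digits plus 4 hyphens.
    val text = id.toString
    println(s"high=$high low=$low text=$text")
    // Both round trips recover the original value.
    println(fromLongs(high, low) == id)
    println(UUID.fromString(text) == id)
  }
}
```

The String form is the one that can be stored in a system without a 128-bit numeric type, which is the reason the description gives for the String return type.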
[jira] [Commented] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16728078#comment-16728078 ]

Reynold Xin commented on SPARK-23693:
-------------------------------------

[~tashoyan] the issue with calling uuid directly is that it is non-deterministic: when a recompute happens due to a fault, the ids are not stable. We'd need a different way to generate uuids that is deterministic based on some seed.
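One possible reading of "deterministic based on some seed" is a name-based UUID computed from a seed plus some stable per-row value, so that a recomputed row gets the same id. The sketch below is only an assumption about how that could look, not what Spark implements. It uses the JDK's UUID.nameUUIDFromBytes (a version 3, MD5-based UUID); the object SeededUuid, the function seededUuid, and the rowKey parameter are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.util.UUID

// Hypothetical helper for illustration; not part of Spark.
object SeededUuid {
  // Derive a UUID string from a seed and a stable row key.
  // The same (seed, rowKey) pair always yields the same UUID,
  // so a recomputation after a fault reproduces the same id.
  def seededUuid(seed: Long, rowKey: String): String = {
    val name = s"$seed:$rowKey".getBytes(StandardCharsets.UTF_8)
    UUID.nameUUIDFromBytes(name).toString
  }

  def main(args: Array[String]): Unit = {
    val first = seededUuid(42L, "order-1001")
    val again = seededUuid(42L, "order-1001")
    println(first == again) // same inputs, same id
    println(first)
  }
}
```

This approach assumes each row has a value that is stable across recomputation and distinct within the data; rows that share the same seed and key would receive the same id.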
[jira] [Commented] (SPARK-23693) SQL function uuid()
[ https://issues.apache.org/jira/browse/SPARK-23693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435869#comment-16435869 ]

Apache Spark commented on SPARK-23693:
--------------------------------------

User 'tashoyan' has created a pull request for this issue:
https://github.com/apache/spark/pull/21055