Hi all,

We need to use the rand(<seed>) function in Spark SQL (Scala API) in our application, but we discovered that its behavior is not deterministic: different invocations with the same <seed> can return different values. This is documented in some bug reports, for example https://issues.apache.org/jira/browse/SPARK-13333, and it has to do with partitioning.
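To make the partitioning dependence concrete, here is a minimal sketch in plain Scala, with no Spark involved. It assumes, as I understand the behavior described in the bug report, that each partition gets its own generator seeded with the seed plus the partition index. The names randColumn, layoutA and layoutB are hypothetical, and scala.util.Random is only a stand-in for whatever generator Spark uses internally.

```scala
import scala.util.Random

object SeededRandSketch {
  // Hypothetical model, not Spark code: every partition gets its own generator,
  // seeded with (seed + partitionIndex), and rows draw values in partition order.
  def randColumn(partitions: Seq[Seq[String]], seed: Long): Map[String, Double] =
    partitions.zipWithIndex.flatMap { case (rows, partitionIndex) =>
      val rng = new Random(seed + partitionIndex)
      rows.map(row => row -> rng.nextDouble())
    }.toMap

  def main(args: Array[String]): Unit = {
    val seed = 42L
    // The same four rows, distributed across two partitions in two different ways.
    val layoutA = Seq(Seq("a", "b"), Seq("c", "d"))
    val layoutB = Seq(Seq("a", "c"), Seq("b", "d"))

    val first  = randColumn(layoutA, seed)
    val second = randColumn(layoutA, seed)
    val other  = randColumn(layoutB, seed)

    // Same seed and same layout: every row gets the same value again.
    println(s"same layout reproduces values: ${first == second}")
    // Same seed but a different layout: rows "b" and "c" draw from other positions.
    println(s"different layout reproduces values: ${first == other}")
  }
}
```

In this model the seed alone does not pin a value to a row. The value also depends on which partition the row lands in and on its position inside that partition, so even a single partition would only reproduce values if the rows arrive in the same order every time.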
So we refactored it by moving the rand() call out of a query that uses Parquet files on S3 as its data source and into another query that we run against MySQL (still using the Spark SQL Scala API), on the assumption that MySQL queries do not get parallelized.

Can we indeed safely assume that rand(<seed>) will now be deterministic, or does the source of the non-deterministic behavior lie in the Spark SQL engine rather than in the specific data source?

Gabriele