Xinyong Tian created SPARK-26166:
------------------------------------

             Summary: CrossValidator.fit() bug,training and validation dataset may overlap
                 Key: SPARK-26166
                 URL: https://issues.apache.org/jira/browse/SPARK-26166
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Xinyong Tian

In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

    df = dataset.select("*", rand(seed).alias(randCol))

the code should call df.cache().

If df is not cached, it is re-evaluated each time a training or validation DataFrame needs to be created. The order of rows in df, on which rand(seed) depends, is not deterministic. As a result, the random column value for a specific row can differ between evaluations even with a fixed seed, so the same row may land in both the training and the validation set.

This may be a particular problem when the input 'dataset' DataFrame is the result of a query that includes a 'where' clause. See:

https://dzone.com/articles/non-deterministic-order-for-select-with-limit
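
The failure mode described above can be illustrated without Spark. The following is only a simplified sketch in plain Python, not the actual CrossValidator code: it assumes the random value a row receives depends on the row's position in scan order, and it stands in for a nondeterministic scan by reversing the rows. The helper names add_rand_column, training_rows and validation_rows are hypothetical and exist only for this illustration.

```python
import random


def add_rand_column(rows, seed):
    # Hypothetical stand-in for dataset.select("*", rand(seed)):
    # the i-th row in scan order receives the i-th draw of a seeded generator.
    rng = random.Random(seed)
    return [(row, rng.random()) for row in rows]


def validation_rows(rows_with_rand, lower, upper):
    # Rows whose random value falls inside the fold's range.
    return {row for row, r in rows_with_rand if lower <= r < upper}


def training_rows(rows_with_rand, lower, upper):
    # Rows whose random value falls outside the fold's range.
    return {row for row, r in rows_with_rand if not (lower <= r < upper)}


rows = list(range(100))
seed = 42
lower, upper = 0.0, 1.0 / 3  # first of three folds

# Not cached: the random column is recomputed for each derived dataset,
# and the second evaluation happens to scan the rows in a different order.
first_eval = add_rand_column(rows, seed)
second_eval = add_rand_column(list(reversed(rows)), seed)
train_uncached = training_rows(first_eval, lower, upper)
valid_uncached = validation_rows(second_eval, lower, upper)
overlap_uncached = train_uncached & valid_uncached

# Cached: the random column is computed once and reused for both datasets.
cached = add_rand_column(rows, seed)
train_cached = training_rows(cached, lower, upper)
valid_cached = validation_rows(cached, lower, upper)
overlap_cached = train_cached & valid_cached

print("rows in both train and validation (not cached):", len(overlap_uncached))
print("rows in both train and validation (cached):", len(overlap_cached))
```

With the column materialized once, the two sets are complementary by construction, which is the effect df.cache() is meant to have in the proposed fix. Whether caching alone is a sufficient guarantee in Spark (for example if cached partitions are evicted and recomputed) is not covered by this sketch.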