Xinyong Tian created SPARK-26166:
------------------------------------

             Summary: CrossValidator.fit() bug, training and validation dataset may overlap
                 Key: SPARK-26166
                 URL: https://issues.apache.org/jira/browse/SPARK-26166
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Xinyong Tian


In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

df = dataset.select("*", rand(seed).alias(randCol))

the resulting DataFrame should be cached:

df.cache()

If df is not cached, the select is re-evaluated each time a training or 
validation DataFrame is created. The order of rows in df, which the values of 
rand(seed) depend on, is not deterministic. A given row can therefore receive 
a different random value on each evaluation, even with a fixed seed, so the 
same row can end up in both the training and the validation set of a fold.
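
The effect can be illustrated without Spark. The following is a minimal 
pure-Python sketch, not the CrossValidator code: with_rand_column(), 
train_ids() and validation_ids() are hypothetical helpers, and the assumption 
that the i-th scanned row simply gets the i-th seeded draw is a simplification 
of how rand(seed) behaves. Reversing the row order stands in for a 
re-evaluation that happens to return rows in a different order.

```python
import random


def with_rand_column(rows, seed):
    """Hypothetical stand-in for dataset.select("*", rand(seed)).

    Assumption: the i-th row scanned receives the i-th draw of a seeded
    generator, so the value a row gets depends on its position.
    """
    rng = random.Random(seed)
    return [(row, rng.random()) for row in rows]


def validation_ids(df, lower, upper):
    """Rows whose random value falls inside the fold's [lower, upper) range."""
    return {row for row, r in df if lower <= r < upper}


def train_ids(df, lower, upper):
    """Rows whose random value falls outside the fold's [lower, upper) range."""
    return {row for row, r in df if not (lower <= r < upper)}


rows = list(range(100))
seed = 42
lower, upper = 0.0, 0.5  # first fold of a 2-fold split

# Not cached: each split re-evaluates the select, and the second
# evaluation sees the rows in a different order (simulated by reversing).
train_uncached = train_ids(with_rand_column(rows, seed), lower, upper)
validation_uncached = validation_ids(with_rand_column(rows[::-1], seed), lower, upper)
overlap_uncached = train_uncached & validation_uncached

# Cached: the random column is materialized once and both splits reuse it.
df = with_rand_column(rows, seed)
train_cached = train_ids(df, lower, upper)
validation_cached = validation_ids(df, lower, upper)
overlap_cached = train_cached & validation_cached

print("overlapping rows without caching:", len(overlap_uncached))
print("overlapping rows with caching:", len(overlap_cached))
```

In this sketch the cached splits are always disjoint and together cover every 
row, while the re-evaluated splits can share rows, which is the overlap the 
summary describes.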

This is especially likely to be a problem when the input 'dataset' DataFrame 
is the result of a query that includes a 'where' clause. See:

https://dzone.com/articles/non-deterministic-order-for-select-with-limit

 



