[jira] [Updated] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap
[ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinyong Tian updated SPARK-26166:
---------------------------------
    Description:
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

    df = dataset.select("*", rand(seed).alias(randCol))

the code should add

    df.checkpoint()

If df is not checkpointed, it is recomputed each time a train or validation dataframe needs to be created. The order of rows in df, which rand(seed) depends on, is not deterministic. Thus the random column value for a specific row can differ between computations, even with a seed. Note that checkpoint() cannot be replaced with cache(): when a node fails, a cached table needs to be recomputed, so the random numbers could be different.

This might especially be a problem when the input 'dataset' dataframe results from a query that includes a 'where' clause. See below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

  was:
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

    df = dataset.select("*", rand(seed).alias(randCol))

the code should add

    df.checkpoint()

If df is not checkpointed, it is recomputed each time a train or validation dataframe needs to be created. The order of rows in df, which rand(seed) depends on, is not deterministic. Thus the random column value for a specific row can differ between computations, even with a seed. Note that checkpoint() cannot be replaced with cache(): when a node fails, a cached table might still be recomputed, so the random numbers could be different.

This might especially be a problem when the input 'dataset' dataframe results from a query that includes a 'where' clause. See below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

> CrossValidator.fit() bug,training and validation dataset may overlap
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap
[ https://issues.apache.org/jira/browse/SPARK-26166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xinyong Tian updated SPARK-26166:
---------------------------------
    Description:
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

    df = dataset.select("*", rand(seed).alias(randCol))

the code should add

    df.checkpoint()

If df is not checkpointed, it is recomputed each time a train or validation dataframe needs to be created. The order of rows in df, which rand(seed) depends on, is not deterministic. Thus the random column value for a specific row can differ between computations, even with a seed. Note that checkpoint() cannot be replaced with cache(): when a node fails, a cached table might still be recomputed, so the random numbers could be different.

This might especially be a problem when the input 'dataset' dataframe results from a query that includes a 'where' clause. See below.

[https://dzone.com/articles/non-deterministic-order-for-select-with-limit]

  was:
In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

    df = dataset.select("*", rand(seed).alias(randCol))

the code should add

    df.cache()

If df is not cached, it is reselected each time a train or validation dataframe needs to be created. The order of rows in df, which rand(seed) depends on, is not deterministic. Thus the random column value for a specific row can differ between computations, even with a seed.

This might especially be a problem when the input 'dataset' dataframe results from a query that includes a 'where' clause. See below.

https://dzone.com/articles/non-deterministic-order-for-select-with-limit

> CrossValidator.fit() bug,training and validation dataset may overlap
> ---------------------------------------------------------------------
>
>                 Key: SPARK-26166
>                 URL: https://issues.apache.org/jira/browse/SPARK-26166
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xinyong Tian
>            Priority: Major
[jira] [Created] (SPARK-26166) CrossValidator.fit() bug,training and validation dataset may overlap
Xinyong Tian created SPARK-26166:
------------------------------------

             Summary: CrossValidator.fit() bug,training and validation dataset may overlap
                 Key: SPARK-26166
                 URL: https://issues.apache.org/jira/browse/SPARK-26166
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Xinyong Tian

In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

    df = dataset.select("*", rand(seed).alias(randCol))

the code should add

    df.cache()

If df is not cached, it is reselected each time a train or validation dataframe needs to be created. The order of rows in df, which rand(seed) depends on, is not deterministic. Thus the random column value for a specific row can differ between computations, even with a seed.

This might especially be a problem when the input 'dataset' dataframe results from a query that includes a 'where' clause. See below.

https://dzone.com/articles/non-deterministic-order-for-select-with-limit
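The mechanism described in this report can be illustrated without Spark. The following is only a sketch in plain Python, under the assumption that a seeded generator hands out values by row position, so a changed row order changes which value each row receives. The helper fold_column and the three-fold bounds are hypothetical names chosen for the illustration; they are not Spark APIs and do not reproduce Spark's actual per-partition seeding.

```python
import random

def fold_column(rows, seed):
    """Give each row a pseudo-random value based on its position in `rows`.

    Hypothetical stand-in for a seeded random column: the same seed yields
    the same sequence, but which row gets which value depends on row order.
    """
    rng = random.Random(seed)
    return {row: rng.random() for row in rows}

rows = list(range(1000))
n_folds = 3
lower, upper = 0.0, 1.0 / n_folds  # bounds of the first validation fold

# First evaluation of the "dataframe": pick the validation rows.
first = fold_column(rows, seed=42)
validation = {r for r, v in first.items() if lower <= v < upper}

# Same order on recomputation: train is the exact complement of validation.
same_order = fold_column(rows, seed=42)
train_same = {r for r, v in same_order.items() if not (lower <= v < upper)}
overlap_same = validation & train_same

# Different order on recomputation (same seed): rows receive other values,
# so some validation rows also land in the training set.
reordered = fold_column(rows[::-1], seed=42)
train_reordered = {r for r, v in reordered.items() if not (lower <= v < upper)}
overlap_reordered = validation & train_reordered

print(len(overlap_same), len(overlap_reordered))
```

With a stable row order the two sets are disjoint; once the order changes between computations, rows leak from the validation fold into the training set, which is the overlap the report warns about.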
[jira] [Created] (SPARK-25441) calculate term frequency in CountVectorizer()
Xinyong Tian created SPARK-25441:
------------------------------------

             Summary: calculate term frequency in CountVectorizer()
                 Key: SPARK-25441
                 URL: https://issues.apache.org/jira/browse/SPARK-25441
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.3.1
            Reporter: Xinyong Tian

Currently CountVectorizer() cannot output TF (term frequency). I hope such an option will be added. TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf

Example:

    >>> from pyspark.ml.feature import CountVectorizer
    >>> df = spark.createDataFrame(
    ...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
    ...     ["label", "raw"])
    >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
    >>> model = cv.fit(df)
    >>> model.transform(df).limit(1).show(truncate=False)

    label  raw        vectors
    0      [a, b, c]  (3,[0,1,2],[1.0,1.0,1.0])

Instead I want

    0      [a, b, c]  (3,[0,1,2],[0.33,0.33,0.33])

That is, each vector is divided by its sum (here 3), so that the new vector sums to 1 for every row (document).
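The requested normalization can be sketched in plain Python, independent of Spark. This is an illustration of the arithmetic only: the function term_frequencies and the fixed vocabulary list are hypothetical names for this example and are not part of the CountVectorizer API.

```python
from collections import Counter

def term_frequencies(tokens, vocabulary):
    """Return the count vector for `tokens` divided by its own sum.

    Hypothetical helper: counts only terms in `vocabulary`, then scales the
    vector so that its entries add up to 1 (or stay 0 for an empty vector).
    """
    counts = Counter(tokens)
    vector = [counts[term] for term in vocabulary]
    total = sum(vector)
    if total == 0:
        return [0.0 for _ in vector]
    return [c / total for c in vector]

vocabulary = ["a", "b", "c"]
tf_short = term_frequencies(["a", "b", "c"], vocabulary)
tf_long = term_frequencies(["a", "b", "b", "c", "a"], vocabulary)

print(tf_short)  # three equal entries, each one third
print(tf_long)   # a and b appear twice out of five tokens, c once
```

Both documents from the example above then produce vectors that sum to 1, regardless of document length.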
[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504202#comment-16504202 ]

Xinyong Tian commented on SPARK-24431:
--------------------------------------

I also feel it is reasonable to set the first point to (0, p). In fact, as long as it is not (0, 1), areaUnderPR will be small enough for a model that predicts the same p for all examples, so cross-validation will not select such a model.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator
> ---------------------------------------------------------------
>
>                 Key: SPARK-24431
>                 URL: https://issues.apache.org/jira/browse/SPARK-24431
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Xinyong Tian
>            Priority: Major
[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502794#comment-16502794 ]

Xinyong Tian commented on SPARK-24431:
--------------------------------------

I read more about the first point of the PR curve:

https://classeval.wordpress.com/introduction/introduction-to-the-precision-recall-plot/

In the above example, when the predicted probability for every row is set to 0.01, only one point on the PR curve is defined: recall = 1, precision = 0.01. According to the website, the first point of the PR curve should be obtained by drawing a horizontal line from the second point (here the only point, (1, 0.01)), which gives (0, 0.01). In this way, the no-model areaUnderPR is 0.01 instead of about 0.5.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator
> ---------------------------------------------------------------
>
>                 Key: SPARK-24431
>                 URL: https://issues.apache.org/jira/browse/SPARK-24431
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Xinyong Tian
>            Priority: Major
[jira] [Commented] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
[ https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502761#comment-16502761 ]

Xinyong Tian commented on SPARK-24431:
--------------------------------------

Your understanding of event rate is what I meant. I understand that the maximum areaUnderPR can be 1. What I meant is that 0.5 is the maximum areaUnderPR for the grid I searched.

For example, say there is a dataset with event rate 0.01 and the best model's areaUnderPR is 0.30. Without any model, we can set the predicted probability for every row to 0.01. This is the situation when there is too much regularization. The problem is that in this situation BinaryClassificationEvaluator calculates areaUnderPR as 0.50 (for the reason, see the original description), which is better than the best model. This is not what we want.

> wrong areaUnderPR calculation in BinaryClassificationEvaluator
> ---------------------------------------------------------------
>
>                 Key: SPARK-24431
>                 URL: https://issues.apache.org/jira/browse/SPARK-24431
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Xinyong Tian
>            Priority: Major
[jira] [Created] (SPARK-24431) wrong areaUnderPR calculation in BinaryClassificationEvaluator
Xinyong Tian created SPARK-24431:
------------------------------------

             Summary: wrong areaUnderPR calculation in BinaryClassificationEvaluator
                 Key: SPARK-24431
                 URL: https://issues.apache.org/jira/browse/SPARK-24431
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Xinyong Tian

My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select the best model. When the regParam in logistic regression is very high, no variable is selected (no model), i.e. every row's prediction is the same, e.g. equal to the event rate (baseline frequency). But at this point, BinaryClassificationEvaluator gives the highest areaUnderPR. As a result, the best model selected is a no-model.

The reason is as follows. With no model, the precision-recall curve has only two points: at recall = 0, precision should be set to zero, while the software sets it to 1; at recall = 1, precision is the event rate. As a result, areaUnderPR will be close to 0.5 (my event rate is very low), which is the maximum.

The solution is to set precision = 0 when recall = 0.
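The numbers discussed in this thread can be checked with a small trapezoidal-rule calculation. The sketch below is plain Python and does not call BinaryClassificationEvaluator; the function area_under_curve is a hypothetical helper, and the event rate of 0.01 is the example value used in the comments. It compares three choices for the curve's first point at recall = 0: precision 1, precision 0, and precision equal to the event rate.

```python
def area_under_curve(points):
    """Trapezoidal area under a piecewise-linear curve.

    `points` is a list of (recall, precision) pairs sorted by recall.
    """
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

event_rate = 0.01  # example value from the thread
last_point = (1.0, event_rate)  # the only defined point for a no-model

# First point fixed at precision 1: area is about 0.5 despite no real model.
area_first_at_one = area_under_curve([(0.0, 1.0), last_point])

# First point at precision 0, as proposed in the original description.
area_first_at_zero = area_under_curve([(0.0, 0.0), last_point])

# First point at the event rate (horizontal line), as proposed in a comment.
area_first_flat = area_under_curve([(0.0, event_rate), last_point])

print(area_first_at_one, area_first_at_zero, area_first_flat)
```

With either alternative first point the no-model area drops far below the 0.30 of the example's best model, so a grid search would no longer prefer the over-regularized fit.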