[jira] [Commented] (SPARK-17987) ML Evaluator fails to handle null values in the dataset

2016-10-18 Thread bo song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15587530#comment-15587530
 ] 

bo song commented on SPARK-17987:
-

A common case arises in cross validation. Suppose f1 is a categorical 
predictor with categories (0,1,2,3,4). Cross validation splits the data 
randomly into training and testing sets. If the training data contains only 
(0,1,2) for f1, then when the model scores test rows where f1 is 3 or 4, 
almost all algorithms will produce null predictions. 
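
To make the case concrete, here is a minimal sketch in plain Python (no Spark 
involved). The per-category mean "model", the helper names and the data are 
all made up for illustration; None stands in for a null prediction.

```python
def fit_category_means(rows):
    """Toy model: mean label for each category seen in the training rows."""
    sums = {}
    for category, label in rows:
        total, count = sums.get(category, (0.0, 0))
        sums[category] = (total + label, count + 1)
    return {c: total / count for c, (total, count) in sums.items()}


def predict(model, category):
    # A category never seen in training has no estimate, so the toy model
    # returns None (the analogue of a null prediction).
    return model.get(category)


# f1 takes values 0..4, but this training fold happens to contain only 0, 1, 2.
train = [(c, 10.0 * c) for c in (0, 1, 2) for _ in range(3)]
test = [(c, 10.0 * c) for c in (1, 3, 4)]

model = fit_category_means(train)
predictions = [predict(model, c) for c, _ in test]
print(predictions)  # [10.0, None, None]
```

Only the test row with f1 = 1 gets a prediction; the rows with f1 = 3 and 
f1 = 4 come back as None, and those are the rows an evaluator would then 
have to deal with.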

I would like to introduce an option in Spark. Its default behavior would 
still be to throw an exception on null or missing values, but the caller 
could change it to exclude missing values explicitly, accepting the risks in 
order to get a result instead of an exception.
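
A rough sketch of how such an option could behave, again in plain Python 
rather than Spark code. The parameter name handle_invalid and its values 
"error" and "skip" are hypothetical; they are not an existing parameter of 
the Spark evaluators.

```python
import math


def is_missing(value):
    """True for None (null) and for a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def rmse(rows, handle_invalid="error"):
    """Root mean squared error over (label, prediction) pairs.

    handle_invalid is a hypothetical option:
      "error" (default) raises on any null/NaN value,
      "skip" drops such rows before computing the metric.
    """
    if handle_invalid not in ("error", "skip"):
        raise ValueError(f"unknown handle_invalid: {handle_invalid!r}")
    valid = []
    for label, prediction in rows:
        if is_missing(label) or is_missing(prediction):
            if handle_invalid == "error":
                raise ValueError("null or NaN value in evaluation data")
            continue  # "skip": exclude the row from the metric
        valid.append((label, prediction))
    if not valid:
        raise ValueError("no valid rows left to evaluate")
    return math.sqrt(sum((l - p) ** 2 for l, p in valid) / len(valid))


rows = [(1.0, 1.0), (2.0, 4.0), (3.0, None), (4.0, float("nan"))]
skipped_rmse = rmse(rows, handle_invalid="skip")  # uses the first two rows only
```

With the default, rmse(rows) raises as before; only a caller who passes 
"skip" explicitly gets a metric computed on the remaining rows.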


> ML Evaluator fails to handle null values in the dataset
> ---
>
> Key: SPARK-17987
> URL: https://issues.apache.org/jira/browse/SPARK-17987
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.2, 2.0.1
>Reporter: bo song
>
> Take the RegressionEvaluator as an example: when the predictionCol is null 
> in a row, an exception "scala.MatchError" is thrown. Null predictions are a 
> common case. For example, when a predictor is missing or its value is out of 
> bounds, most machine learning models cannot produce a correct prediction, 
> so a null prediction is returned. Evaluators should handle null values 
> instead of throwing an exception; the common way to handle missing values 
> is to ignore them. Besides null values, NaN values need to be handled 
> correctly too. 
> The three evaluators RegressionEvaluator, BinaryClassificationEvaluator and 
> MulticlassClassificationEvaluator all have the same problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17987) ML Evaluator fails to handle null values in the dataset

2016-10-18 Thread bo song (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bo song updated SPARK-17987:

Comment: was deleted

(was: Thanks for your comments. When evaluation is part of cross validation, 
there is no way for the caller to impute missing values: the input data set 
for cross validation has no missing values, but the predictions can. How 
should this case be handled? 

A simple way to handle a missing value is to ignore it before computing the 
metric, that is, the evaluator excludes any row that contains a missing value.)







[jira] [Commented] (SPARK-17987) ML Evaluator fails to handle null values in the dataset

2016-10-18 Thread bo song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585016#comment-15585016
 ] 

bo song commented on SPARK-17987:
-

Thanks for your comments. When evaluation is part of cross validation, there 
is no way for the caller to impute missing values: the input data set for 
cross validation has no missing values, but the predictions can. How should 
this case be handled? 

A simple way to handle a missing value is to ignore it before computing the 
metric, that is, the evaluator excludes any row that contains a missing value.
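
As an illustration of "exclude any row that contains a missing value", the 
following plain-Python sketch splits the rows before computing a metric and 
also reports how many rows were kept. The helper split_valid, the choice of 
mean absolute error and the sample rows are assumptions made for the example.

```python
import math


def split_valid(rows):
    """Separate rows with no None/NaN field from rows that have one."""
    kept, dropped = [], []
    for row in rows:
        has_missing = any(
            v is None or (isinstance(v, float) and math.isnan(v)) for v in row
        )
        (dropped if has_missing else kept).append(row)
    return kept, dropped


# (label, prediction) pairs; one null prediction and one NaN label.
rows = [(0.0, 0.5), (1.0, None), (2.0, 2.5), (float("nan"), 3.0)]
kept, dropped = split_valid(rows)

# Mean absolute error over the rows that survived the filter.
mae = sum(abs(label - prediction) for label, prediction in kept) / len(kept)
# Fraction of rows the metric is actually based on.
coverage = len(kept) / len(rows)
```

Reporting the coverage next to the metric lets the caller see how much of 
the test fold was silently excluded.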







[jira] [Commented] (SPARK-17987) ML Evaluator fails to handle null values in the dataset

2016-10-18 Thread bo song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585019#comment-15585019
 ] 

bo song commented on SPARK-17987:
-

Thanks for your comments. When evaluation is part of cross validation, there 
is no way for the caller to impute missing values: the input data set for 
cross validation has no missing values, but the predictions can. How should 
this case be handled?

A simple way to handle a missing value is to ignore it before computing the 
metric, that is, the evaluator excludes any row that contains a missing value.








[jira] [Created] (SPARK-17987) ML Evaluator fails to handle null values in the dataset

2016-10-18 Thread bo song (JIRA)
bo song created SPARK-17987:
---

 Summary: ML Evaluator fails to handle null values in the dataset
 Key: SPARK-17987
 URL: https://issues.apache.org/jira/browse/SPARK-17987
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.0.1, 1.6.2
Reporter: bo song


Take the RegressionEvaluator as an example: when the predictionCol is null in 
a row, an exception "scala.MatchError" is thrown. Null predictions are a 
common case. For example, when a predictor is missing or its value is out of 
bounds, most machine learning models cannot produce a correct prediction, so 
a null prediction is returned. Evaluators should handle null values instead 
of throwing an exception; the common way to handle missing values is to 
ignore them. Besides null values, NaN values need to be handled correctly 
too. 

The three evaluators RegressionEvaluator, BinaryClassificationEvaluator and 
MulticlassClassificationEvaluator all have the same problem.


