[ https://issues.apache.org/jira/browse/SPARK-38584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
zhengruifeng reassigned SPARK-38584:
------------------------------------

    Assignee: zhengruifeng

> Unify the data validation
> -------------------------
>
>                 Key: SPARK-38584
>                 URL: https://issues.apache.org/jira/browse/SPARK-38584
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.4.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> 1. Input vector validation is missing in most algorithms. When the input
> dataset contains invalid values (NaN/Infinity):
> * training may run successfully and return a model with invalid coefficients,
> as in LinearSVC
> * training may fail with an irrelevant error message, as in KMeans
>
> {code:java}
> import org.apache.spark.ml.feature._
> import org.apache.spark.ml.linalg._
> import org.apache.spark.ml.classification._
> import org.apache.spark.ml.clustering._
>
> // Two rows whose feature vectors contain NaN and Infinity
> val df = sc.parallelize(Seq(
>   LabeledPoint(1.0, Vectors.dense(1.0, Double.NaN)),
>   LabeledPoint(0.0, Vectors.dense(Double.PositiveInfinity, 2.0)))).toDF()
>
> // LinearSVC: training succeeds, but the model is all NaN
> val svc = new LinearSVC()
> val model = svc.fit(df)
>
> scala> model.intercept
> res0: Double = NaN
>
> scala> model.coefficients
> res1: org.apache.spark.ml.linalg.Vector = [NaN,NaN]
>
> // KMeans: training fails with a message unrelated to the invalid input
> val km = new KMeans().setK(2)
>
> scala> km.fit(df)
> 22/03/17 14:29:10 ERROR Executor: Exception in task 11.0 in stage 10.0 (TID 113)
> java.lang.IllegalArgumentException: requirement failed: Both norms should be greater or equal to 0.0, found norm1=NaN, norm2=Infinity
>   at scala.Predef$.require(Predef.scala:281)
>   at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:543)
> {code}
>
> 2. Related methods for validating the input dataset (such as labels/weights)
> exist in {{org.apache.spark.ml.functions}}, org.apache.spark.ml.util.DatasetUtils,
> org.apache.spark.ml.util.MetadataUtils, etc.
>
> I think it is time to unify the related methods into one source file.
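
For illustration only, the missing check described in point 1 could look roughly like the sketch below. It is plain Scala with no Spark dependency; the object name `VectorValidation` and both method names are hypothetical and are not part of Spark or of any patch attached to this issue. In Spark, such a check would presumably be applied to the array obtained from a feature vector.

```scala
// Hypothetical helper for illustration; not an existing Spark API.
object VectorValidation {

  /** True when every entry is finite, i.e. neither NaN nor +/-Infinity. */
  def allFinite(values: Array[Double]): Boolean =
    values.forall(v => !v.isNaN && !v.isInfinity)

  /** Fails fast with a message that names the first offending index. */
  def requireAllFinite(values: Array[Double]): Unit = {
    val idx = values.indexWhere(v => v.isNaN || v.isInfinity)
    // The message is passed by name, so it is only built when the check fails.
    require(idx < 0, s"Vector contains invalid value ${values(idx)} at index $idx")
  }
}
```

With a check of this kind run before training, the two vectors from the example above would be rejected with a message that points at the invalid entry, instead of producing a NaN model or an unrelated norm error.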