zhengruifeng created SPARK-21690:
------------------------------------

             Summary: one-pass imputer
                 Key: SPARK-21690
                 URL: https://issues.apache.org/jira/browse/SPARK-21690
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 2.2.1
            Reporter: zhengruifeng

{code}
val surrogates = $(inputCols).map { inputCol =>
  val ic = col(inputCol)
  val filtered = dataset.select(ic.cast(DoubleType))
    .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
  if (filtered.take(1).length == 0) {
    throw new SparkException(s"surrogate cannot be computed. " +
      s"All the values in $inputCol are Null, Nan or missingValue(${$(missingValue)})")
  }
  val surrogate = $(strategy) match {
    case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
    case Imputer.median => filtered.stat.approxQuantile(inputCol, Array(0.5), 0.001).head
  }
  surrogate
}
{code}

The current implementation of {{Imputer}} processes the input columns one after another: each column is filtered, checked for emptiness, and aggregated on its own. We should instead compute the surrogates for all columns together, ideally in a single pass over the data.
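
As a rough illustration of the one-pass idea for the mean strategy only, the per-column filter can be folded into a conditional aggregate so that all columns are handled by a single {{select}}. This is a minimal sketch, not the actual patch: the object {{OnePassImputerSketch}} and the helper {{meanSurrogates}} are hypothetical names, and {{missingValue}} is passed in as a plain argument instead of being read from the ML param.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{avg, col, when}
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper, not part of Spark: mean surrogates for all columns at once.
object OnePassImputerSketch {
  def meanSurrogates(
      dataset: DataFrame,
      inputCols: Array[String],
      missingValue: Double): Array[Double] = {
    // One conditional average per input column. when(...) without otherwise(...)
    // yields null for excluded rows, and avg skips nulls.
    val aggs: Array[Column] = inputCols.map { inputCol =>
      val ic = col(inputCol).cast(DoubleType)
      avg(when(ic.isNotNull && ic =!= missingValue && !ic.isNaN, ic))
    }

    // A single global aggregation replaces the per-column take(1) and first() calls.
    val row = dataset.select(aggs: _*).head()

    inputCols.indices.map { i =>
      // A null average means no valid value was left in that column.
      if (row.isNullAt(i)) {
        throw new SparkException(s"surrogate cannot be computed. " +
          s"All the values in ${inputCols(i)} are Null, Nan or missingValue($missingValue)")
      }
      row.getDouble(i)
    }.toArray
  }
}
```

The emptiness check comes for free here, since a column with no valid values produces a null average. The median strategy is not covered by this sketch; it would need a multi-column quantile computation that still excludes {{missingValue}} per column.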