[GitHub] [spark] huaxingao commented on a change in pull request #26480: [SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols
huaxingao commented on a change in pull request #26480: [SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols
URL: https://github.com/apache/spark/pull/26480#discussion_r345356303

## File path: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ##

@@ -22,23 +22,29 @@ import java.util.Locale
 import org.apache.spark.annotation.Since
 import org.apache.spark.ml.Transformer
 import org.apache.spark.ml.param._
-import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.param.shared.{HasInputCol, HasInputCols, HasOutputCol, HasOutputCols}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
 import org.apache.spark.sql.functions.{col, udf}
-import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
+import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

 /**
  * A feature transformer that filters out stop words from input.
  *
+ * Since 3.0.0,

Review comment:
Sorry, I accidentally broke the line, but I prefer to keep it. When other features added multi-column support, ```since xxx``` was added to the doc. I am just trying to be consistent with the others.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #26480: [SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols
huaxingao commented on a change in pull request #26480: [SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols
URL: https://github.com/apache/spark/pull/26480#discussion_r345356363

## File path: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ##

@@ -142,16 +165,40 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
         terms.filter(s => !lowerStopWords.contains(toLower(s)))
       }
     }
-    val metadata = outputSchema($(outputCol)).metadata
-    dataset.select(col("*"), t(col($(inputCol))).as($(outputCol), metadata))
+
+    val (inputColNames, outputColNames) = getInOutCols()
+    val ouputCols = inputColNames.map { inputColName =>

Review comment:
Tried this. It doesn't work :(
[GitHub] [spark] huaxingao commented on a change in pull request #26480: [SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols
huaxingao commented on a change in pull request #26480: [SPARK-29808][ML][PYTHON] StopWordsRemover should support multi-cols
URL: https://github.com/apache/spark/pull/26480#discussion_r345028011

## File path: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ##

@@ -51,6 +57,14 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String
   @Since("1.5.0")
   def setOutputCol(value: String): this.type = set(outputCol, value)

+  /** @group setParam */
+  @Since("3.0.0")
+  def setInputCols(value: Array[String]): this.type = set(inputCols, value)
+
+  /** @group setParam */
+  @Since("3.0.0")
+  def setOutputCols(value: Array[String]): this.type = set(outputCols, value)
+

Review comment:
I am debating whether I should add ```stopWordsArray/caseSensitiveArray/localArray```. It seems to me that users will use the same set of ```stopWords``` for all columns, so there is no need to add those.
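
For readers following the thread, the design choice discussed here (one shared `stopWords` set applied to every input column, rather than one set per column) can be sketched in plain Scala. This is only an illustration under stated assumptions, not the code in the pull request: `SharedStopWordsSketch`, `Row`, and `removeStopWords` are hypothetical names, a row is modeled as a simple `Map` from column name to tokens, and no Spark classes are used.

```scala
import java.util.Locale

// Hypothetical, Spark-free sketch: a "row" is just column name -> tokens.
object SharedStopWordsSketch {
  type Row = Map[String, Seq[String]]

  // Applies ONE shared stop-word set (plus one caseSensitive flag and one
  // locale) to every input column, writing each result to the paired
  // output column. Parameter names mirror the params mentioned in the
  // review comment but are assumptions of this sketch.
  def removeStopWords(
      row: Row,
      inputCols: Seq[String],
      outputCols: Seq[String],
      stopWords: Set[String],
      caseSensitive: Boolean = false,
      locale: Locale = Locale.ROOT): Row = {
    require(inputCols.length == outputCols.length,
      "inputCols and outputCols must have the same length")

    // Build the predicate once, then reuse it for all columns.
    val keep: String => Boolean =
      if (caseSensitive) {
        s => !stopWords.contains(s)
      } else {
        val lowerStopWords = stopWords.map(_.toLowerCase(locale))
        s => !lowerStopWords.contains(s.toLowerCase(locale))
      }

    // Read from the original row so output columns never feed later inputs.
    inputCols.zip(outputCols).foldLeft(row) { case (acc, (in, out)) =>
      acc + (out -> row(in).filter(keep))
    }
  }
}
```

Calling `SharedStopWordsSketch.removeStopWords(row, Seq("title", "body"), Seq("titleFiltered", "bodyFiltered"), Set("the", "a"))` filters both columns with the same set. Supporting per-column sets would mean passing a sequence of sets aligned with `inputCols`, which is the extra surface the comment argues is unnecessary.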