[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21501 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194626092 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) --- End diff -- And also don't forget to mention default value is JVM default locale. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194625679 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) --- End diff -- I'm not sure if users are familiar with available locale setting values here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194623958 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) --- End diff -- Thank you for the comment but... is that necessary? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194606928 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) @keyword_only -def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False): +def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, + locale=None): """ -__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false) +__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false, +locale=None) --- End diff -- Please add \ to the end of L2592 and use the right indentation here. Unfortunately, we need this to make the doc correctly displayed. See https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3112. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194607278 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) @keyword_only -def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False): +def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, + locale=None): """ -__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false) +__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false, +locale=None) """ super(StopWordsRemover, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover", self.uid) self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"), - caseSensitive=False) + caseSensitive=False, locale=StopWordsRemover.defaultLocale()) --- End diff -- You already have the `_java_obj`, call `_java_object.getLocale()` would give you the default locale. And then Python users only need `stopWordsRemover.getLocale()` to get the default value. In the param doc, we should make it clear that the default would be the JVM default locale. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194606981 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) @keyword_only -def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False): +def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, + locale=None): """ -__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false) +__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false, +locale=None) """ super(StopWordsRemover, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover", self.uid) self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"), - caseSensitive=False) + caseSensitive=False, locale=StopWordsRemover.defaultLocale()) kwargs = self._input_kwargs self.setParams(**kwargs) @keyword_only @since("1.6.0") -def setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False): +def setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, + locale=None): """ -setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false) +setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false, +locale=None) --- End diff -- ditto --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194606418 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString + * @see `StopWordsRemover.getDefaultLocale()` --- End diff -- I feel it is unnecessary to expose it as a public API. This is the same as `Locale.getDefault.toString` or `stopWordsRemover.getLocale` when nothing is set. See my comments on the Python API. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194346755 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala --- @@ -65,6 +65,56 @@ class StopWordsRemoverSuite extends MLTest with DefaultReadWriteTest { testStopWordsRemover(remover, dataSet) } + test("StopWordsRemover with localed input (case insensitive)") { +val stopWords = Array("milk", "cookie") +val remover = new StopWordsRemover() + .setInputCol("raw") + .setOutputCol("filtered") + .setStopWords(stopWords) + .setLocale("tr") // Turkish alphabet: has no Q, W, X but has dotted and dotless 'I's. +val dataSet = Seq( + // scalastyle:off + (Seq("mİlk", "and", "nuts"), Seq("and", "nuts")), + // scalastyle:on + (Seq("cookIe", "and", "nuts"), Seq("cookIe", "and", "nuts")), + (Seq(null), Seq(null)), + (Seq(), Seq()) +).toDF("raw", "expected") + +testStopWordsRemover(remover, dataSet) + } + + test("StopWordsRemover with localed input (case sensitive)") { +val stopWords = Array("milk", "cookie") +val remover = new StopWordsRemover() + .setInputCol("raw") + .setOutputCol("filtered") + .setStopWords(stopWords) + .setCaseSensitive(true) + .setLocale("tr") // Turkish alphabet: has no Q, W, X but has dotted and dotless 'I's. +val dataSet = Seq( + // scalastyle:off + (Seq("mİlk", "and", "nuts"), Seq("mİlk", "and", "nuts")), + // scalastyle:on + (Seq("cookIe", "and", "nuts"), Seq("cookIe", "and", "nuts")), + (Seq(null), Seq(null)), + (Seq(), Seq()) +).toDF("raw", "expected") + +testStopWordsRemover(remover, dataSet) + } + + test("StopWordsRemover with invalid locale") { +intercept[IllegalArgumentException] { + val stopWords = Array("test", "a", "an", "the") + val _ = new StopWordsRemover() --- End diff -- Do we need `val _`? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194346431 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala --- @@ -65,6 +65,56 @@ class StopWordsRemoverSuite extends MLTest with DefaultReadWriteTest { testStopWordsRemover(remover, dataSet) } + test("StopWordsRemover with localed input (case insensitive)") { +val stopWords = Array("milk", "cookie") +val remover = new StopWordsRemover() + .setInputCol("raw") + .setOutputCol("filtered") + .setStopWords(stopWords) + .setLocale("tr") // Turkish alphabet: has no Q, W, X but has dotted and dotless 'I's. --- End diff -- Lets explicitly call `.setCaseSensitive(false)` here. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194343283 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) --- End diff -- Refer to Java `Locale.getAvailableLocales` for available values. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194342986 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString + * @see `StopWordsRemover.loadDefaultStopWords()` --- End diff -- ... OMG; please wait. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194341627 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString + * @see `StopWordsRemover.loadDefaultStopWords()` --- End diff -- You meant `getDefaultLocale` here? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194298446 --- Diff: python/pyspark/ml/feature.py --- @@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive Sets params for this StopWordRemover. """ kwargs = self._input_kwargs +if locale is None: +sc = SparkContext._active_spark_context +kwargs['locale'] = sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale() --- End diff -- I was meant that if `sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale()` returns the same every time, shall we cache it and reuse it? But after re-thinking it, maybe we won't call this method with `locale is None` frequently. So it maybe fine. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194296251 --- Diff: python/pyspark/ml/feature.py --- @@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive Sets params for this StopWordRemover. """ kwargs = self._input_kwargs +if locale is None: +sc = SparkContext._active_spark_context +kwargs['locale'] = sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale() --- End diff -- @viirya You mean... `locale=SparkContext._active_spark_context.(...)`over `locale=None` with ugly if statement, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194293510 --- Diff: python/pyspark/ml/feature.py --- @@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive Sets params for this StopWordRemover. """ kwargs = self._input_kwargs +if locale is None: +sc = SparkContext._active_spark_context +kwargs['locale'] = sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale() --- End diff -- We can keep this default local, and use it many times instead of call to JVM form Python every time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194244802 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) @keyword_only -def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False): +def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, + locale=None): """ -__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false) +__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false, +locale=None) """ super(StopWordsRemover, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover", self.uid) self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"), - caseSensitive=False) + caseSensitive=False, locale=None) --- End diff -- I think we just need to explain how we handle the default value in the doc. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194244784 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[String] = new Param[String](this, "locale", +"Locale of the input for case insensitive matching. Ignored when caseSensitive is true.", + ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString))) + + /** @group setParam */ + @Since("2.4.0") + def setLocale(value: String): this.type = set(locale, value) --- End diff -- Not sure if it is necessary. If users already have a `Locale` instance, they can use `toString` to provide the value. It is simpler if setter/getter have consistent types. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194244754 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString --- End diff -- Locale is usually configured at system level. Setting the default locale to `en` is a breaking change to users who set system locale to `tr` and have the following code: ~~~ val remover = new StopWordsRemover() .setStopWords(loadDefaultStopWords("turkish") ~~~ --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194222407 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[String] = new Param[String](this, "locale", +"Locale of the input for case insensitive matching. Ignored when caseSensitive is true.", + ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString))) + + /** @group setParam */ + @Since("2.4.0") + def setLocale(value: String): this.type = set(locale, value) --- End diff -- @mengxr How do you think? I think supporting two setters is reasonable, but it can introduce some asymmetrical signatures. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user dongjinleekr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194222392 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString --- End diff -- As far as I am understanding, the Locale makes difference how the lowercase string would be, not related to the stopwords. @mengxr It is the reason why you recommended me to change the default locale, right? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194098947 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString --- End diff -- I understand that Locale.getDefault is the current behavior, yet since we're using english as default stopwords, I'm leaning towards using English as default locale. Also using Locale.getDefault means that the same code will behave differently on different nodes, which adds extra complexity for trouble-shooting. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r194099298 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * Locale of the input for case insensitive matching. Ignored when [[caseSensitive]] + * is true. + * Default: Locale.getDefault.toString + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[String] = new Param[String](this, "locale", +"Locale of the input for case insensitive matching. Ignored when caseSensitive is true.", + ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString))) + + /** @group setParam */ + @Since("2.4.0") + def setLocale(value: String): this.type = set(locale, value) --- End diff -- how about adding another setter that takes a Locale parameter? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193963482 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive " + + "is true", typeConverter=TypeConverters.toString) @keyword_only -def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False): +def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, + locale=None): """ -__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false) +__init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=false, +locale=None) """ super(StopWordsRemover, self).__init__() self._java_obj = self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover", self.uid) self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"), - caseSensitive=False) + caseSensitive=False, locale=None) --- End diff -- If we set default as None, will `self.getOrDefault(self.locale)` below just return `None` instead of the default value in Scala side? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193954052 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -95,8 +121,7 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String terms.filter(s => !stopWordsSet.contains(s)) } } else { - // TODO: support user locale (SPARK-15064) - val toLower = (s: String) => if (s != null) s.toLowerCase else s + val toLower = (s: String) => if (s != null) s.toLowerCase(new Locale($(locale))) else s --- End diff -- Maybe move `new Locale($(locale))` out of `toLower`, so we don't need to create it every time. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193781216 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive is false", + typeConverter=TypeConverters.toString) @keyword_only -def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False): +def __init__(self, inputCol=None, outputCol=None, stopWords=None, caseSensitive=False, locale="en"): --- End diff -- We should keep the default consistent. I'm not sure if python's [`locale.getdefaultlocale`](https://docs.python.org/2/library/locale.html#locale.getdefaultlocale) is compatible with Java's. But we can always make the default `None` in Python and leave it to JVM to set the default. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193777474 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,31 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input for case insensitive + * matching. Ignored when [[caseSensitive]] is true. + * Default: Locale.getDefault.toString + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[String] = new Param[String](this, "locale", +"Locale of the input for case insensitive matching. Ignored when caseSensitive is false.", --- End diff -- `false` -> `true` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193779131 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive is false", --- End diff -- `false` -> `true`. (Copy the param doc from Scala) --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193635361 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. + * Default: Locale.English + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[Locale] = new Param[Locale](this, "locale", --- End diff -- +1 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193604620 --- Diff: python/pyspark/ml/feature.py --- @@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, HasInputCol, HasOutputCol, JavaMLReadabl typeConverter=TypeConverters.toListString) caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do a case sensitive " + "comparison over the stop words", typeConverter=TypeConverters.toBoolean) +locale = Param(Params._dummy(), "locale", "locale of the input. ignored when case sensitive is false", + typeConverter=TypeConverters.toString) --- End diff -- nit: indentation --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193494337 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala --- @@ -65,6 +67,56 @@ class StopWordsRemoverSuite extends MLTest with DefaultReadWriteTest { testStopWordsRemover(remover, dataSet) } + test("StopWordsRemover with localed input (case insensitive)") { +val stopWords = Array("milk", "cookie") +val remover = new StopWordsRemover() + .setInputCol("raw") + .setOutputCol("filtered") + .setStopWords(stopWords) + .setLocale(new Locale("tr")) --- End diff -- Could you leave an inline comment explaining which part is special to tr locale? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193493040 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. --- End diff -- Locale of the input for case insensitive matching. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193492568 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. + * Default: Locale.English --- End diff -- The default should be the same as `toLowerCase` default local, which is system default (Locale.getDefault). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193493174 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. + * Default: Locale.English + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[Locale] = new Param[Locale](this, "locale", --- End diff -- Please use `Param[String]` so we can add Python/R APIs easily. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193493663 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. + * Default: Locale.English + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[Locale] = new Param[Locale](this, "locale", +"Locale of the input. Ignored when caseSensitive is false.") + + /** + * @group setParam + * @throws IllegalArgumentException if `value` is not an available locale. + */ + @Since("2.4.0") + def setLocale(value: Locale): this.type = { +if (Locale.getAvailableLocales contains value) { --- End diff -- Could you add the check as a param validator? See https://github.com/dongjinleekr/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L167. This is because there are more than one way to set params besides this setter. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193493884 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. + * Default: Locale.English + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.4.0") + val locale: Param[Locale] = new Param[Locale](this, "locale", +"Locale of the input. Ignored when caseSensitive is false.") + + /** + * @group setParam + * @throws IllegalArgumentException if `value` is not an available locale. + */ + @Since("2.4.0") + def setLocale(value: Locale): this.type = { +if (Locale.getAvailableLocales contains value) { + set(locale, value) +} else { + throw new IllegalArgumentException(s"Unsupported locale: $value") +} + } + + /** @group getParam */ + @Since("2.4.0") + def getLocale: Locale = $(locale) + + setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), +caseSensitive -> false, locale -> Locale.ENGLISH) --- End diff -- `Locale.getDefault` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193492802 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. --- End diff -- Ignored when caseSensitive is true? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21501#discussion_r193332595 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala --- @@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") override val uid: String @Since("1.5.0") def getCaseSensitive: Boolean = $(caseSensitive) - setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false) + /** + * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html Locale]] of the input. + * Ignored when [[caseSensitive]] is false. + * Default: Locale.English + * @see `StopWordsRemover.loadDefaultStopWords()` + * @group param + */ + @Since("2.5.0") --- End diff -- It should be 2.4.0 :-). --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...
GitHub user dongjinleekr opened a pull request: https://github.com/apache/spark/pull/21501 [SPARK-15064][ML] Locale support in StopWordsRemover ## What changes were proposed in this pull request? Add locale support for `StopWordsRemover`. ## How was this patch tested? [Scala|Python] unit tests. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dongjinleekr/spark feature/SPARK-15064 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21501.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21501 commit 8379d88360651e0571698412e4aecc305ae7fe60 Author: Lee Dongjin Date: 2018-03-21T15:03:17Z Implement SPARK-15064 commit 682df4afe6859119d01de1fa953f95ebaacb4889 Author: Lee Dongjin Date: 2018-06-05T11:00:17Z Add test cases for StopWordsRemoverSuite commit 67ae6ebaebbb537f24db51be1f96759d5c6463ff Author: Lee Dongjin Date: 2018-06-06T07:43:45Z Implement pyspark parity --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org