[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21501


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194626092
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
--- End diff --

And also don't forget to mention default value is JVM default locale.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194625679
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
--- End diff --

I'm not sure if users are familiar with available locale setting values 
here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194623958
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
--- End diff --

Thank you for the comment but... is that necessary?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194606928
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
 
 @keyword_only
-def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False):
+def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False,
+ locale=None):
 """
-__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false)
+__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false,
+locale=None)
--- End diff --

Please add \ to the end of L2592 and use the right indentation here. 
Unfortunately, we need this to make the doc correctly displayed. See 
https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3112.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194607278
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
 
 @keyword_only
-def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False):
+def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False,
+ locale=None):
 """
-__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false)
+__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false,
+locale=None)
 """
 super(StopWordsRemover, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover",
 self.uid)
 
self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"),
- caseSensitive=False)
+ caseSensitive=False, 
locale=StopWordsRemover.defaultLocale())
--- End diff --

You already have the `_java_obj`, call `_java_object.getLocale()` would 
give you the default locale. And then Python users only need 
`stopWordsRemover.getLocale()` to get the default value. In the param doc, we 
should make it clear that the default would be the JVM default locale.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194606981
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
 
 @keyword_only
-def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False):
+def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False,
+ locale=None):
 """
-__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false)
+__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false,
+locale=None)
 """
 super(StopWordsRemover, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover",
 self.uid)
 
self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"),
- caseSensitive=False)
+ caseSensitive=False, 
locale=StopWordsRemover.defaultLocale())
 kwargs = self._input_kwargs
 self.setParams(**kwargs)
 
 @keyword_only
 @since("1.6.0")
-def setParams(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False):
+def setParams(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False,
+  locale=None):
 """
-setParams(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false)
+setParams(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false,
+locale=None)
--- End diff --

ditto


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194606418
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.getDefaultLocale()`
--- End diff --

I feel it is unnecessary to expose it as a public API. This is the same as 
`Locale.getDefault.toString` or `stopWordsRemover.getLocale` when nothing is 
set. See my comments on the Python API.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194346755
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala ---
@@ -65,6 +65,56 @@ class StopWordsRemoverSuite extends MLTest with 
DefaultReadWriteTest {
 testStopWordsRemover(remover, dataSet)
   }
 
+  test("StopWordsRemover with localed input (case insensitive)") {
+val stopWords = Array("milk", "cookie")
+val remover = new StopWordsRemover()
+  .setInputCol("raw")
+  .setOutputCol("filtered")
+  .setStopWords(stopWords)
+  .setLocale("tr")  // Turkish alphabet: has no Q, W, X but has dotted 
and dotless 'I's.
+val dataSet = Seq(
+  // scalastyle:off
+  (Seq("mİlk", "and", "nuts"), Seq("and", "nuts")),
+  // scalastyle:on
+  (Seq("cookIe", "and", "nuts"), Seq("cookIe", "and", "nuts")),
+  (Seq(null), Seq(null)),
+  (Seq(), Seq())
+).toDF("raw", "expected")
+
+testStopWordsRemover(remover, dataSet)
+  }
+
+  test("StopWordsRemover with localed input (case sensitive)") {
+val stopWords = Array("milk", "cookie")
+val remover = new StopWordsRemover()
+  .setInputCol("raw")
+  .setOutputCol("filtered")
+  .setStopWords(stopWords)
+  .setCaseSensitive(true)
+  .setLocale("tr")  // Turkish alphabet: has no Q, W, X but has dotted 
and dotless 'I's.
+val dataSet = Seq(
+  // scalastyle:off
+  (Seq("mİlk", "and", "nuts"), Seq("mİlk", "and", "nuts")),
+  // scalastyle:on
+  (Seq("cookIe", "and", "nuts"), Seq("cookIe", "and", "nuts")),
+  (Seq(null), Seq(null)),
+  (Seq(), Seq())
+).toDF("raw", "expected")
+
+testStopWordsRemover(remover, dataSet)
+  }
+
+  test("StopWordsRemover with invalid locale") {
+intercept[IllegalArgumentException] {
+  val stopWords = Array("test", "a", "an", "the")
+  val _ = new StopWordsRemover()
--- End diff --

Do we need `val _`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194346431
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala ---
@@ -65,6 +65,56 @@ class StopWordsRemoverSuite extends MLTest with 
DefaultReadWriteTest {
 testStopWordsRemover(remover, dataSet)
   }
 
+  test("StopWordsRemover with localed input (case insensitive)") {
+val stopWords = Array("milk", "cookie")
+val remover = new StopWordsRemover()
+  .setInputCol("raw")
+  .setOutputCol("filtered")
+  .setStopWords(stopWords)
+  .setLocale("tr")  // Turkish alphabet: has no Q, W, X but has dotted 
and dotless 'I's.
--- End diff --

Lets explicitly call  `.setCaseSensitive(false)` here.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194343283
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
--- End diff --

Refer to Java `Locale.getAvailableLocales` for available values.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194342986
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
--- End diff --

... OMG; please wait.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-11 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194341627
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
--- End diff --

You meant `getDefaultLocale` here?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194298446
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, 
stopWords=None, caseSensitive
 Sets params for this StopWordRemover.
 """
 kwargs = self._input_kwargs
+if locale is None:
+sc = SparkContext._active_spark_context
+kwargs['locale'] = 
sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale()
--- End diff --

I was meant that if 
`sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale()` returns the 
same every time, shall we cache it and reuse it? But after re-thinking it, 
maybe we won't call this method with `locale is None` frequently. So it maybe 
fine.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-10 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194296251
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, 
stopWords=None, caseSensitive
 Sets params for this StopWordRemover.
 """
 kwargs = self._input_kwargs
+if locale is None:
+sc = SparkContext._active_spark_context
+kwargs['locale'] = 
sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale()
--- End diff --

@viirya You mean... `locale=SparkContext._active_spark_context.(...)`over 
`locale=None` with ugly if statement, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-10 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194293510
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2610,6 +2610,9 @@ def setParams(self, inputCol=None, outputCol=None, 
stopWords=None, caseSensitive
 Sets params for this StopWordRemover.
 """
 kwargs = self._input_kwargs
+if locale is None:
+sc = SparkContext._active_spark_context
+kwargs['locale'] = 
sc._gateway.jvm.org.spark.ml.util.LocaleUtils.getDefaultLocale()
--- End diff --

We can keep this default local, and use it many times instead of call to 
JVM form Python every time.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194244802
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
 
 @keyword_only
-def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False):
+def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False,
+ locale=None):
 """
-__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false)
+__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false,
+locale=None)
 """
 super(StopWordsRemover, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover",
 self.uid)
 
self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"),
- caseSensitive=False)
+ caseSensitive=False, locale=None)
--- End diff --

I think we just need to explain how we handle the default value in the doc.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194244784
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[String] = new Param[String](this, "locale",
+"Locale of the input for case insensitive matching. Ignored when 
caseSensitive is true.",
+
ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString)))
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setLocale(value: String): this.type = set(locale, value)
--- End diff --

Not sure if it is necessary. If users already have a `Locale` instance, 
they can use `toString` to provide the value. It is simpler if setter/getter 
have consistent types.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-09 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194244754
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
--- End diff --

Locale is usually configured at system level. Setting the default locale to 
`en` is a breaking change to users who set system locale to `tr` and have the 
following code:

~~~
val remover = new StopWordsRemover()
  .setStopWords(loadDefaultStopWords("turkish")
~~~


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-09 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194222407
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[String] = new Param[String](this, "locale",
+"Locale of the input for case insensitive matching. Ignored when 
caseSensitive is true.",
+
ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString)))
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setLocale(value: String): this.type = set(locale, value)
--- End diff --

@mengxr How do you think? I think supporting two setters is reasonable, but 
it can introduce some asymmetrical signatures.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-09 Thread dongjinleekr
Github user dongjinleekr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194222392
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
--- End diff --

As far as I am understanding, the Locale makes difference how the lowercase 
string would be, not related to the stopwords. @mengxr It is the reason why you 
recommended me to change the default locale, right?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-08 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194098947
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
--- End diff --

I understand that Locale.getDefault is the current behavior, yet since 
we're using english as default stopwords, I'm leaning towards using English as 
default locale.
Also using Locale.getDefault means that the same code will behave 
differently on different nodes, which adds extra complexity for 
trouble-shooting. 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-08 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r194099298
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,28 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * Locale of the input for case insensitive matching. Ignored when 
[[caseSensitive]]
+   * is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[String] = new Param[String](this, "locale",
+"Locale of the input for case insensitive matching. Ignored when 
caseSensitive is true.",
+
ParamValidators.inArray[String](Locale.getAvailableLocales.map(_.toString)))
+
+  /** @group setParam */
+  @Since("2.4.0")
+  def setLocale(value: String): this.type = set(locale, value)
--- End diff --

how about adding another setter that takes a Locale parameter?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193963482
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,31 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive " +
+   "is true", typeConverter=TypeConverters.toString)
 
 @keyword_only
-def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False):
+def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False,
+ locale=None):
 """
-__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false)
+__init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=false,
+locale=None)
 """
 super(StopWordsRemover, self).__init__()
 self._java_obj = 
self._new_java_obj("org.apache.spark.ml.feature.StopWordsRemover",
 self.uid)
 
self._setDefault(stopWords=StopWordsRemover.loadDefaultStopWords("english"),
- caseSensitive=False)
+ caseSensitive=False, locale=None)
--- End diff --

If we set default as None, will `self.getOrDefault(self.locale)` below just 
return `None` instead of the default value in Scala side?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-07 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193954052
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -95,8 +121,7 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
 terms.filter(s => !stopWordsSet.contains(s))
   }
 } else {
-  // TODO: support user locale (SPARK-15064)
-  val toLower = (s: String) => if (s != null) s.toLowerCase else s
+  val toLower = (s: String) => if (s != null) s.toLowerCase(new 
Locale($(locale))) else s
--- End diff --

Maybe move `new Locale($(locale))` out of `toLower`, so we don't need to 
create it every time.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-07 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193781216
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive is false",
+   typeConverter=TypeConverters.toString)
 
 @keyword_only
-def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False):
+def __init__(self, inputCol=None, outputCol=None, stopWords=None, 
caseSensitive=False, locale="en"):
--- End diff --

We should keep the default consistent. I'm not sure if python's 
[`locale.getdefaultlocale`](https://docs.python.org/2/library/locale.html#locale.getdefaultlocale)
 is compatible with Java's. But we can always make the default `None` in Python 
and leave it to JVM to set the default.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-07 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193777474
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,31 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input for case insensitive
+   * matching. Ignored when [[caseSensitive]] is true.
+   * Default: Locale.getDefault.toString
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[String] = new Param[String](this, "locale",
+"Locale of the input for case insensitive matching. Ignored when 
caseSensitive is false.",
--- End diff --

`false` -> `true`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-07 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193779131
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive is false",
--- End diff --

`false` -> `true`. (Copy the param doc from Scala)


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193635361
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
+   * Default: Locale.English
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[Locale] = new Param[Locale](this, "locale",
--- End diff --

+1


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193604620
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2582,25 +2582,27 @@ class StopWordsRemover(JavaTransformer, 
HasInputCol, HasOutputCol, JavaMLReadabl
   typeConverter=TypeConverters.toListString)
 caseSensitive = Param(Params._dummy(), "caseSensitive", "whether to do 
a case sensitive " +
   "comparison over the stop words", 
typeConverter=TypeConverters.toBoolean)
+locale = Param(Params._dummy(), "locale", "locale of the input. 
ignored when case sensitive is false",
+  typeConverter=TypeConverters.toString)
--- End diff --

nit: indentation


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193494337
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala ---
@@ -65,6 +67,56 @@ class StopWordsRemoverSuite extends MLTest with 
DefaultReadWriteTest {
 testStopWordsRemover(remover, dataSet)
   }
 
+  test("StopWordsRemover with localed input (case insensitive)") {
+val stopWords = Array("milk", "cookie")
+val remover = new StopWordsRemover()
+  .setInputCol("raw")
+  .setOutputCol("filtered")
+  .setStopWords(stopWords)
+  .setLocale(new Locale("tr"))
--- End diff --

Could you leave an inline comment explaining which part is special to tr 
locale?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193493040
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
--- End diff --

Locale of the input for case insensitive matching.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193492568
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
+   * Default: Locale.English
--- End diff --

The default should be the same as `toLowerCase` default local, which is 
system default (Locale.getDefault).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193493174
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
+   * Default: Locale.English
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[Locale] = new Param[Locale](this, "locale",
--- End diff --

Please use `Param[String]` so we can add Python/R APIs easily.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193493663
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
+   * Default: Locale.English
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[Locale] = new Param[Locale](this, "locale",
+"Locale of the input. Ignored when caseSensitive is false.")
+
+  /**
+   * @group setParam
+   * @throws IllegalArgumentException if `value` is not an available 
locale.
+   */
+  @Since("2.4.0")
+  def setLocale(value: Locale): this.type = {
+if (Locale.getAvailableLocales contains value) {
--- End diff --

Could you add the check as a param validator? See 
https://github.com/dongjinleekr/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L167.
 This is because there are more than one way to set params besides this setter.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193493884
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
+   * Default: Locale.English
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.4.0")
+  val locale: Param[Locale] = new Param[Locale](this, "locale",
+"Locale of the input. Ignored when caseSensitive is false.")
+
+  /**
+   * @group setParam
+   * @throws IllegalArgumentException if `value` is not an available 
locale.
+   */
+  @Since("2.4.0")
+  def setLocale(value: Locale): this.type = {
+if (Locale.getAvailableLocales contains value) {
+  set(locale, value)
+} else {
+  throw new IllegalArgumentException(s"Unsupported locale: $value")
+}
+  }
+
+  /** @group getParam */
+  @Since("2.4.0")
+  def getLocale: Locale = $(locale)
+
+  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"),
+caseSensitive -> false, locale -> Locale.ENGLISH)
--- End diff --

`Locale.getDefault`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193492802
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
--- End diff --

Ignored when caseSensitive is true?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21501#discussion_r193332595
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala ---
@@ -84,7 +86,36 @@ class StopWordsRemover @Since("1.5.0") (@Since("1.5.0") 
override val uid: String
   @Since("1.5.0")
   def getCaseSensitive: Boolean = $(caseSensitive)
 
-  setDefault(stopWords -> 
StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
+  /**
+   * [[https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html 
Locale]] of the input.
+   * Ignored when [[caseSensitive]] is false.
+   * Default: Locale.English
+   * @see `StopWordsRemover.loadDefaultStopWords()`
+   * @group param
+   */
+  @Since("2.5.0")
--- End diff --

It should be 2.4.0 :-).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21501: [SPARK-15064][ML] Locale support in StopWordsRemo...

2018-06-06 Thread dongjinleekr
GitHub user dongjinleekr opened a pull request:

https://github.com/apache/spark/pull/21501

[SPARK-15064][ML] Locale support in StopWordsRemover

## What changes were proposed in this pull request?

Add locale support for `StopWordsRemover`.

## How was this patch tested?

[Scala|Python] unit tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dongjinleekr/spark feature/SPARK-15064

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21501.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21501


commit 8379d88360651e0571698412e4aecc305ae7fe60
Author: Lee Dongjin 
Date:   2018-03-21T15:03:17Z

Implement SPARK-15064

commit 682df4afe6859119d01de1fa953f95ebaacb4889
Author: Lee Dongjin 
Date:   2018-06-05T11:00:17Z

Add test cases for StopWordsRemoverSuite

commit 67ae6ebaebbb537f24db51be1f96759d5c6463ff
Author: Lee Dongjin 
Date:   2018-06-06T07:43:45Z

Implement pyspark parity




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org