[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95858814
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -365,7 +365,7 @@ class LogisticRegression @Since("1.2.0") (
   case None => histogram.length
 }
 
-val isMultinomial = $(family) match {
+val isMultinomial = $(family).toLowerCase match {
--- End diff --

@yanboliang is correct that there are other entry points for setting and 
getting Params.  I agree it would be nice to consolidate them, but that would 
be quite a bit of work and is lower priority than the other tech debt we 
currently have, IMO.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-12 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95848454
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -91,8 +91,8 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   @Since("2.1.0")
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the label distribution 
to be used in the " +
-  s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames))
+  s"model (case-insensitive). Supported options: 
${supportedFamilyNames.mkString(", ")}.",
--- End diff --

Maybe we could then add a separate validators class for string params to the 
same params.scala file in the ml folder?  There should be a generic function 
for this, and params.scala seems to be the right place for it.





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-12 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95846429
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -91,8 +91,8 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   @Since("2.1.0")
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the label distribution 
to be used in the " +
-  s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames))
+  s"model (case-insensitive). Supported options: 
${supportedFamilyNames.mkString(", ")}.",
--- End diff --

Searching through the code base, these are the places where we use 
Param[String]:

spark-mllib_2.11
org.apache.spark.ml.classification
LogisticRegression.scala
  final val family: Param[String] = new Param(this, "family",
MultilayerPerceptronClassifier.scala
  final val solver: Param[String] = new Param[String](this, "solver",
NaiveBayes.scala
  final val modelType: Param[String] = new Param[String](this, "modelType", "The model type " +
org.apache.spark.ml.clustering
KMeans.scala
  final val initMode = new Param[String](this, "initMode", "The initialization algorithm. " +
LDA.scala
  final val optimizer = new Param[String](this, "optimizer", "Optimizer or inference" +
  final val topicDistributionCol = new Param[String](this, "topicDistributionCol", "Output column" +
org.apache.spark.ml.evaluation
BinaryClassificationEvaluator.scala
  val metricName: Param[String] = {
MulticlassClassificationEvaluator.scala
  val metricName: Param[String] = {
RegressionEvaluator.scala
  val metricName: Param[String] = {
org.apache.spark.ml.feature
Bucketizer.scala
  val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
ChiSqSelector.scala
  final val selectorType = new Param[String](this, "selectorType",
QuantileDiscretizer.scala
  val handleInvalid: Param[String] = new Param[String](this, "handleInvalid", "how to handle " +
RFormula.scala
  val formula: Param[String] = new Param(this, "formula", "R model formula")
SQLTransformer.scala
  final val statement: Param[String] = new Param[String](this, "statement", "SQL statement")
Tokenizer.scala
  val pattern: Param[String] = new Param(this, "pattern", "regex pattern used for tokenizing")
org.apache.spark.ml.param
ParamsSuite.scala
  val param = new Param[String](dummy, "name", "doc")
org.apache.spark.ml.param.shared
sharedParams.scala
  final val featuresCol: Param[String] = new Param[String](this, "featuresCol", "features column name")
  final val labelCol: Param[String] = new Param[String](this, "labelCol", "label column name")
  final val predictionCol: Param[String] = new Param[String](this, "predictionCol", "prediction column name")
  final val rawPredictionCol: Param[String] = new Param[String](this, "rawPredictionCol", "raw prediction (a.k.a. confidence) column name")
... P...
  final val varianceCol: Param[String] = new Param[String](this, "varianceCol", "Column name for the biased sample variance of prediction")
  final val inputCol: Param[String] = new Param[String](this, "inputCol", "input column name")
  final val outputCol: Param[String] = new Param[String](this, "outputCol", "output column name")
... P...
  final 

[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-12 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95846332
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -91,8 +91,8 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   @Since("2.1.0")
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the label distribution 
to be used in the " +
-  s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames))
+  s"model (case-insensitive). Supported options: 
${supportedFamilyNames.mkString(", ")}.",
--- End diff --

You're right. I searched through the code base, and case sensitivity matters 
when:
1.) we are specifying a column name as a parameter
2.) the R model formula (from RFormula.scala)
3.) the Tokenizer.scala regex pattern
In all other cases it doesn't seem like it should matter.





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-12 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95745585
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -91,8 +91,8 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   @Since("2.1.0")
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the label distribution 
to be used in the " +
-  s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames))
+  s"model (case-insensitive). Supported options: 
${supportedFamilyNames.mkString(", ")}.",
--- End diff --

@imatiach-msft I think we should not change the behavior of 
```ParamValidators.inArray[String]```, since some other string params that use 
the original check may be case-sensitive.
Adding a new method sounds reasonable, but I'm a bit worried about whether 
we should add such a concrete method to the common validation object 
```ParamValidators```, which uses generic types. I'm still open on this topic 
and would like to hear more thoughts. Thanks.
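
One way to picture the trade-off described here is to leave the generic, 
case-sensitive check untouched and put the String-specific check in a separate 
helper. The sketch below is standalone Scala, not Spark code: 
`ValidatorSketch`, its `inArray` stand-in, and `inArrayIgnoreCase` are 
hypothetical names used only for illustration.

```scala
import java.util.Locale

// Standalone sketch; these are NOT Spark's ParamValidators.
object ValidatorSketch {
  // Generic, case-sensitive membership check (stand-in for the existing behavior).
  def inArray[T](allowed: Array[T]): T => Boolean =
    (value: T) => allowed.contains(value)

  // Hypothetical String-only variant: trims and lowercases both sides
  // before comparing, so the generic check above is left unchanged.
  def inArrayIgnoreCase(allowed: Array[String]): String => Boolean = {
    val normalized = allowed.map(_.trim.toLowerCase(Locale.ROOT)).toSet
    (value: String) => normalized.contains(value.trim.toLowerCase(Locale.ROOT))
  }
}
```

With this split, params that must stay case-sensitive (column names, for 
example) keep using the generic check, and only params that opt in get the 
relaxed comparison.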





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-11 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95681819
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -91,8 +91,8 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   @Since("2.1.0")
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the label distribution 
to be used in the " +
-  s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames))
+  s"model (case-insensitive). Supported options: 
${supportedFamilyNames.mkString(", ")}.",
--- End diff --

Maybe you could add a ParamValidators.inStringArray(supportedFamilyNames) 
method that would both normalize to lowercase and trim whitespace (?)





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-11 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95679245
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -91,8 +91,8 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
   @Since("2.1.0")
   final val family: Param[String] = new Param(this, "family",
 "The name of family which is a description of the label distribution 
to be used in the " +
-  s"model. Supported options: ${supportedFamilyNames.mkString(", ")}.",
-ParamValidators.inArray[String](supportedFamilyNames))
+  s"model (case-insensitive). Supported options: 
${supportedFamilyNames.mkString(", ")}.",
--- End diff --

Is it possible to change the ParamValidators.inArray[String] method to 
verify the given string in a case-insensitive way? Then you wouldn't need to 
make as many changes (e.g., this change could be reverted).





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-11 Thread imatiach-msft
Github user imatiach-msft commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95679101
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -365,7 +365,7 @@ class LogisticRegression @Since("1.2.0") (
   case None => histogram.length
 }
 
-val isMultinomial = $(family) match {
+val isMultinomial = $(family).toLowerCase match {
--- End diff --

Maybe we need a different accessor, used consistently on the 
transformer/estimator side internally, to:
1.) change the value to lowercase
2.) trim any whitespace
Changing the setter might cause issues, because when users try to validate 
that their parameters are set correctly they will see that the values have 
been modified, which is unexpected.  The case-insensitive comparison should be 
done as in this PR, but instead of calling toLowerCase explicitly everywhere, 
we should access the value through some other method that normalizes the 
parameter internally.
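
A minimal sketch of that idea, in standalone Scala rather than against the 
real Params API: the setter and public getter leave the user's value alone, 
and a single internal accessor does the normalization. `AccessorSketch`, 
`FamilyHolder`, and `normalizedFamily` are hypothetical names.

```scala
import java.util.Locale

// Standalone sketch; `FamilyHolder` and `normalizedFamily` are hypothetical.
object AccessorSketch {
  final class FamilyHolder(private var family: String) {
    // The setter stores the value exactly as the user supplied it.
    def setFamily(value: String): this.type = { family = value; this }

    // The public getter returns the unmodified value.
    def getFamily: String = family

    // Internal accessor: normalize in one place instead of calling
    // toLowerCase at every use site.
    def normalizedFamily: String = family.trim.toLowerCase(Locale.ROOT)

    def isMultinomial: Boolean = normalizedFamily == "multinomial"
  }
}
```

Users who read the parameter back see what they set, while internal 
comparisons go through `normalizedFamily`.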





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-11 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95592357
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -365,7 +365,7 @@ class LogisticRegression @Since("1.2.0") (
   case None => histogram.length
 }
 
-val isMultinomial = $(family) match {
+val isMultinomial = $(family).toLowerCase match {
--- End diff --

I don't think we can do that in the ```setXXX``` methods, since they are not 
the only entry point for setting params. We can also use the following API to 
set param values:
```
  def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): M = {
    // Values arrive through a ParamMap here, not through any setXXX method.
    val map = new ParamMap()
      .put(firstParamPair)
      .put(otherParamPairs: _*)
    fit(dataset, map)
  }
```

cc @jkbradley @sethah 
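
To illustrate the concern, here is a toy model in standalone Scala (no Spark 
dependency). `EntryPointSketch`, `ToyEstimator`, and `copyValues` are 
hypothetical and only mimic the shape of the API quoted above: one setter that 
lowercases, and one bulk path that writes values directly.

```scala
import java.util.Locale
import scala.collection.mutable

// Standalone sketch; `ToyEstimator` only mimics the shape of the API above.
object EntryPointSketch {
  final class ToyEstimator {
    private val params = mutable.Map[String, String]()

    // Entry point 1: a concrete setter that lowercases before storing.
    def setFamily(value: String): this.type = {
      params("family") = value.toLowerCase(Locale.ROOT)
      this
    }

    // Entry point 2: bulk assignment from (name, value) pairs, similar in
    // spirit to passing param pairs to fit. Nothing here lowercases "family".
    def copyValues(pairs: (String, String)*): this.type = {
      pairs.foreach { case (k, v) => params(k) = v }
      this
    }

    def family: String = params("family")
  }
}
```

A value set through `setFamily` is stored lowercased, but the same value passed 
through `copyValues` keeps its original casing, so normalizing only in setters 
would not cover every path.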





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-11 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95542552
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -365,7 +365,7 @@ class LogisticRegression @Since("1.2.0") (
   case None => histogram.length
 }
 
-val isMultinomial = $(family) match {
+val isMultinomial = $(family).toLowerCase match {
--- End diff --

It can, but I think it would need to be done in the concrete `setXXX` 
method each time.





[GitHub] spark pull request #16516: [SPARK-19155][ML] Make some string params of ML a...

2017-01-10 Thread felixcheung
Github user felixcheung commented on a diff in the pull request:

https://github.com/apache/spark/pull/16516#discussion_r95512181
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -365,7 +365,7 @@ class LogisticRegression @Since("1.2.0") (
   case None => histogram.length
 }
 
-val isMultinomial = $(family) match {
+val isMultinomial = $(family).toLowerCase match {
--- End diff --

Is there a way to store the param as the lowercase version, instead of 
converting it to lowercase when accessed? It might be less error-prone that way.


