[GitHub] spark pull request #20968: [SPARK-23828][ML][PYTHON]PySpark StringIndexerMod...

2018-04-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20968


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20968: [SPARK-23828][ML][PYTHON]PySpark StringIndexerMod...

2018-04-06 Thread huaxingao
Github user huaxingao commented on a diff in the pull request:

https://github.com/apache/spark/pull/20968#discussion_r179791957
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2342,8 +2342,38 @@ def mean(self):
         return self._call_java("mean")
 
 
+class _StringIndexerParams(JavaParams, HasInputCol, HasOutputCol):
+    """
+    Params for :py:attr:`StringIndexer` and :py:attr:`StringIndexerModel`.
+    """
+
+    stringOrderType = Param(Params._dummy(), "stringOrderType",
+                            "How to order labels of string column. The first label after " +
+                            "ordering is assigned an index of 0. Supported options: " +
+                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
+                            typeConverter=TypeConverters.toString)
+
+    handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid data (unseen " +
+                          "or NULL values) in features and label column of string type. " +
+                          "Options are 'skip' (filter out rows with invalid data), " +
+                          "error (throw an error), or 'keep' (put invalid data " +
+                          "in a special additional bucket, at index numLabels).",
+                          typeConverter=TypeConverters.toString)
+
+    def __init__(self, *args):
+        super(_StringIndexerParams, self).__init__(*args)
+        self._setDefault(handleInvalid="error", stringOrderType="frequencyDesc")
+
+    @since("2.3.0")
+    def getStringOrderType(self):
+        """
+        Gets the value of :py:attr:`stringOrderType` or its default value 'frequencyDesc'.
+        """
+        return self.getOrDefault(self.stringOrderType)
+
+
 @inherit_doc
-class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,
+class StringIndexer(JavaEstimator, _StringIndexerParams, HasHandleInvalid, JavaMLReadable,
--- End diff --

@BryanCutler Thanks a lot for your comments. I will change this. 
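
For readers following the diff, the four `stringOrderType` options it documents can be illustrated with a small pure-Python sketch. This is not the Spark implementation: `order_labels` is a hypothetical helper, and breaking frequency ties alphabetically is an assumption made here only to keep the result deterministic.

```python
from collections import Counter


def order_labels(values, string_order_type="frequencyDesc"):
    """Return the distinct labels in the order named by string_order_type.

    Hypothetical helper for illustration; the label at position 0 would be
    assigned index 0, as described in the stringOrderType param doc.
    """
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        # Most frequent first; alphabetical tie-break is an assumption of this sketch.
        return sorted(counts, key=lambda label: (-counts[label], label))
    if string_order_type == "frequencyAsc":
        # Least frequent first; same assumed tie-break.
        return sorted(counts, key=lambda label: (counts[label], label))
    if string_order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    if string_order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError("unsupported stringOrderType: %r" % (string_order_type,))


data = ["a", "b", "c", "a", "a", "c"]  # a appears 3 times, c twice, b once
by_freq_desc = order_labels(data)
by_alpha_asc = order_labels(data, "alphabetAsc")
```

With this data, `by_freq_desc` puts `"a"` first, so `"a"` would receive index 0 under the default ordering.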





[GitHub] spark pull request #20968: [SPARK-23828][ML][PYTHON]PySpark StringIndexerMod...

2018-04-05 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/20968#discussion_r179623749
  
--- Diff: python/pyspark/ml/feature.py ---
@@ -2342,8 +2342,38 @@ def mean(self):
         return self._call_java("mean")
 
 
+class _StringIndexerParams(JavaParams, HasInputCol, HasOutputCol):
+    """
+    Params for :py:attr:`StringIndexer` and :py:attr:`StringIndexerModel`.
+    """
+
+    stringOrderType = Param(Params._dummy(), "stringOrderType",
+                            "How to order labels of string column. The first label after " +
+                            "ordering is assigned an index of 0. Supported options: " +
+                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
+                            typeConverter=TypeConverters.toString)
+
+    handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid data (unseen " +
+                          "or NULL values) in features and label column of string type. " +
+                          "Options are 'skip' (filter out rows with invalid data), " +
+                          "error (throw an error), or 'keep' (put invalid data " +
+                          "in a special additional bucket, at index numLabels).",
+                          typeConverter=TypeConverters.toString)
+
+    def __init__(self, *args):
+        super(_StringIndexerParams, self).__init__(*args)
+        self._setDefault(handleInvalid="error", stringOrderType="frequencyDesc")
+
+    @since("2.3.0")
+    def getStringOrderType(self):
+        """
+        Gets the value of :py:attr:`stringOrderType` or its default value 'frequencyDesc'.
+        """
+        return self.getOrDefault(self.stringOrderType)
+
+
 @inherit_doc
-class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,
+class StringIndexer(JavaEstimator, _StringIndexerParams, HasHandleInvalid, JavaMLReadable,
--- End diff --

You should move `HasHandleInvalid` to be a mixin (base class) of `_StringIndexerParams`.





[GitHub] spark pull request #20968: [SPARK-23828][ML][PYTHON]PySpark StringIndexerMod...

2018-04-03 Thread huaxingao
GitHub user huaxingao opened a pull request:

https://github.com/apache/spark/pull/20968

[SPARK-23828][ML][PYTHON]PySpark StringIndexerModel should have constructor from labels

## What changes were proposed in this pull request?

The Scala `StringIndexerModel` has an alternate constructor that creates the model from an array of label strings. This PR adds the corresponding Python API:

model = StringIndexerModel.from_labels(["a", "b", "c"])
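
To make the intent concrete without a Spark session, here is a minimal pure-Python sketch of what a model built from labels does. It is not the PySpark implementation: `LabelIndexerSketch` and its `transform` method are hypothetical stand-ins, and the `handle_invalid` behaviour follows the `handleInvalid` param description quoted in the review diff ('skip' drops the row, 'error' raises, 'keep' uses the extra index `numLabels`).

```python
class LabelIndexerSketch(object):
    """Hypothetical stand-in: maps each label to its position in the list."""

    def __init__(self, labels, handle_invalid="error"):
        if handle_invalid not in ("error", "skip", "keep"):
            raise ValueError("unsupported handle_invalid: %r" % (handle_invalid,))
        self.labels = list(labels)
        self.handle_invalid = handle_invalid
        # The order of the given labels fixes the indices: first label -> 0.
        self._index = {label: i for i, label in enumerate(self.labels)}

    @classmethod
    def from_labels(cls, labels, handle_invalid="error"):
        """Build the indexer directly from labels, with no fitting step."""
        return cls(labels, handle_invalid)

    def transform(self, values):
        """Return (value, index) pairs, applying the invalid-data policy."""
        out = []
        for value in values:
            if value in self._index:
                out.append((value, self._index[value]))
            elif self.handle_invalid == "keep":
                # Unseen or None values share one extra bucket at index numLabels.
                out.append((value, len(self.labels)))
            elif self.handle_invalid == "skip":
                continue  # drop the row
            else:
                raise ValueError("unseen label: %r" % (value,))
        return out


model = LabelIndexerSketch.from_labels(["a", "b", "c"])
indexed = model.transform(["c", "a", "b"])
```

Because the indices come from the order of the supplied labels, no training data is needed, which is the convenience the alternate constructor provides.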

## How was this patch tested?

Added a doctest and a unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/huaxingao/spark spark-23828

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20968.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20968


commit 021538e8300fb33dc7462c102c784c0ac20c120a
Author: Huaxin Gao 
Date:   2018-04-03T17:40:23Z

[SPARK-23828][ML][PYTHON]PySpark StringIndexerModel should have constructor from labels



