[GitHub] spark pull request #20968: [SPARK-23828][ML][PYTHON]PySpark StringIndexerMod...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20968

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user huaxingao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20968#discussion_r179791957

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2342,8 +2342,38 @@ def mean(self):
             return self._call_java("mean")
     
     
    +class _StringIndexerParams(JavaParams, HasInputCol, HasOutputCol):
    +    """
    +    Params for :py:attr:`StringIndexer` and :py:attr:`StringIndexerModel`.
    +    """
    +
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
    +                            typeConverter=TypeConverters.toString)
    +
    +    handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid data (unseen " +
    +                          "or NULL values) in features and label column of string type. " +
    +                          "Options are 'skip' (filter out rows with invalid data), " +
    +                          "error (throw an error), or 'keep' (put invalid data " +
    +                          "in a special additional bucket, at index numLabels).",
    +                          typeConverter=TypeConverters.toString)
    +
    +    def __init__(self, *args):
    +        super(_StringIndexerParams, self).__init__(*args)
    +        self._setDefault(handleInvalid="error", stringOrderType="frequencyDesc")
    +
    +    @since("2.3.0")
    +    def getStringOrderType(self):
    +        """
    +        Gets the value of :py:attr:`stringOrderType` or its default value 'frequencyDesc'.
    +        """
    +        return self.getOrDefault(self.stringOrderType)
    +
    +
     @inherit_doc
    -class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,
    +class StringIndexer(JavaEstimator, _StringIndexerParams, HasHandleInvalid, JavaMLReadable,
    --- End diff --

    @BryanCutler Thanks a lot for your comments. I will change this.
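The `stringOrderType` doc string in the diff above names four orderings and says the first label after ordering gets index 0. The following is a minimal pure-Python sketch of that rule, not Spark's implementation. The function name `order_labels` is hypothetical, and breaking frequency ties alphabetically is an assumption made only so the output is deterministic.

```python
from collections import Counter


def order_labels(values, string_order_type="frequencyDesc"):
    """Return the distinct labels in the order implied by string_order_type.

    The label at position i would receive index i. Tie-breaking between
    equally frequent labels (alphabetical here) is an assumption of this
    sketch, not something the diff specifies.
    """
    counts = Counter(values)
    if string_order_type == "frequencyDesc":
        return sorted(counts, key=lambda label: (-counts[label], label))
    if string_order_type == "frequencyAsc":
        return sorted(counts, key=lambda label: (counts[label], label))
    if string_order_type == "alphabetDesc":
        return sorted(counts, reverse=True)
    if string_order_type == "alphabetAsc":
        return sorted(counts)
    raise ValueError("unsupported stringOrderType: %s" % string_order_type)


column = ["a", "b", "c", "a", "a", "c"]
print(order_labels(column))                  # ['a', 'c', 'b']
print(order_labels(column, "frequencyAsc"))  # ['b', 'c', 'a']
print(order_labels(column, "alphabetDesc"))  # ['c', 'b', 'a']
```

With the default `frequencyDesc`, the most frequent label `a` lands at index 0, which matches the default set in `_setDefault` in the diff.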
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20968#discussion_r179623749

    --- Diff: python/pyspark/ml/feature.py ---
    @@ -2342,8 +2342,38 @@ def mean(self):
             return self._call_java("mean")
     
     
    +class _StringIndexerParams(JavaParams, HasInputCol, HasOutputCol):
    +    """
    +    Params for :py:attr:`StringIndexer` and :py:attr:`StringIndexerModel`.
    +    """
    +
    +    stringOrderType = Param(Params._dummy(), "stringOrderType",
    +                            "How to order labels of string column. The first label after " +
    +                            "ordering is assigned an index of 0. Supported options: " +
    +                            "frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc.",
    +                            typeConverter=TypeConverters.toString)
    +
    +    handleInvalid = Param(Params._dummy(), "handleInvalid", "how to handle invalid data (unseen " +
    +                          "or NULL values) in features and label column of string type. " +
    +                          "Options are 'skip' (filter out rows with invalid data), " +
    +                          "error (throw an error), or 'keep' (put invalid data " +
    +                          "in a special additional bucket, at index numLabels).",
    +                          typeConverter=TypeConverters.toString)
    +
    +    def __init__(self, *args):
    +        super(_StringIndexerParams, self).__init__(*args)
    +        self._setDefault(handleInvalid="error", stringOrderType="frequencyDesc")
    +
    +    @since("2.3.0")
    +    def getStringOrderType(self):
    +        """
    +        Gets the value of :py:attr:`stringOrderType` or its default value 'frequencyDesc'.
    +        """
    +        return self.getOrDefault(self.stringOrderType)
    +
    +
     @inherit_doc
    -class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol, HasHandleInvalid, JavaMLReadable,
    +class StringIndexer(JavaEstimator, _StringIndexerParams, HasHandleInvalid, JavaMLReadable,
    --- End diff --

    You should move `HasHandleInvalid` out of `StringIndexer`'s base list and make it a parent class (mixin) of `_StringIndexerParams`, so that both `StringIndexer` and `StringIndexerModel` inherit it from one place.
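The review suggestion above can be illustrated with a small class hierarchy. This is a hedged sketch using plain Python stand-ins: the classes only borrow their names from the diff, none of the real `Param` or `JavaParams` machinery is modeled, and the class attributes stand in for param defaults.

```python
class HasHandleInvalid(object):
    """Stand-in for the shared mixin that declares handleInvalid."""
    handleInvalid = "error"  # default value taken from the diff

    def getHandleInvalid(self):
        return self.handleInvalid


class _StringIndexerParams(HasHandleInvalid):
    """Stand-in for the params shared by the estimator and the model."""
    stringOrderType = "frequencyDesc"  # default value taken from the diff

    def getStringOrderType(self):
        return self.stringOrderType


class StringIndexer(_StringIndexerParams):
    """Estimator stand-in: HasHandleInvalid is no longer listed here."""


class StringIndexerModel(_StringIndexerParams):
    """Model stand-in: inherits handleInvalid through the same base."""


for cls in (StringIndexer, StringIndexerModel):
    instance = cls()
    print(cls.__name__, instance.getHandleInvalid(), instance.getStringOrderType())
```

Both stand-ins report `error` and `frequencyDesc`, and `HasHandleInvalid` appears once in each class's method resolution order. Listing the mixin in a single place is the point of the suggestion.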
GitHub user huaxingao opened a pull request:

    https://github.com/apache/spark/pull/20968

    [SPARK-23828][ML][PYTHON]PySpark StringIndexerModel should have constructor from labels

    ## What changes were proposed in this pull request?

    The Scala StringIndexerModel has an alternate constructor that creates the model from an array of label strings. This PR adds the corresponding Python API:

        model = StringIndexerModel.from_labels(["a", "b", "c"])

    ## How was this patch tested?

    Added a doctest and a unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/huaxingao/spark spark-23828

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20968.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20968

----
commit 021538e8300fb33dc7462c102c784c0ac20c120a
Author: Huaxin Gao
Date:   2018-04-03T17:40:23Z

    [SPARK-23828][ML][PYTHON]PySpark StringIndexerModel should have constructor from labels

----
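To make concrete what a model built from labels does, here is a rough pure-Python sketch. It is not the PySpark API: the class `LabelIndexModel` and its `transform` method over a plain list are hypothetical. It assumes that the position of each label in the supplied list becomes its index, and that unseen values follow the `handleInvalid` options quoted in the review diff ('error', 'skip', 'keep' at index numLabels).

```python
class LabelIndexModel(object):
    """Hypothetical stand-in for a model constructed from a fixed label list."""

    def __init__(self, labels, handle_invalid="error"):
        self.labels = list(labels)
        self.handle_invalid = handle_invalid
        # The position in the supplied list is the index; no fitting step is needed.
        self._index = {label: i for i, label in enumerate(self.labels)}

    @classmethod
    def from_labels(cls, labels, handle_invalid="error"):
        """Build the model directly from labels instead of fitting on data."""
        return cls(labels, handle_invalid)

    def transform(self, values):
        """Map each value to its index, applying the handle_invalid policy."""
        indexed = []
        for value in values:
            if value in self._index:
                indexed.append(self._index[value])
            elif self.handle_invalid == "keep":
                # Extra bucket at index numLabels for unseen or None values.
                indexed.append(len(self.labels))
            elif self.handle_invalid == "skip":
                continue  # drop the row
            else:
                raise ValueError("Unseen label: %r" % (value,))
        return indexed


model = LabelIndexModel.from_labels(["a", "b", "c"])
print(model.transform(["c", "a", "b"]))  # [2, 0, 1]

keeper = LabelIndexModel.from_labels(["a", "b", "c"], handle_invalid="keep")
print(keeper.transform(["a", "d", None]))  # [0, 3, 3]
```

Because the labels are supplied up front, the index assignment is fixed by the caller and does not depend on label frequencies in any dataset.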