[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user facaiy closed the pull request at: https://github.com/apache/spark/pull/18736 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r133127624 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -74,26 +74,41 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) setDefault(numFeatures -> (1 << 18), binary -> false) + private[this] var hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) + /** @group getParam */ @Since("1.2.0") def getNumFeatures: Int = $(numFeatures) /** @group setParam */ @Since("1.2.0") - def setNumFeatures(value: Int): this.type = set(numFeatures, value) + def setNumFeatures(value: Int): this.type = { +val t = set(numFeatures, value) +hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) +t + } /** @group getParam */ @Since("2.0.0") def getBinary: Boolean = $(binary) /** @group setParam */ @Since("2.0.0") - def setBinary(value: Boolean): this.type = set(binary, value) + def setBinary(value: Boolean): this.type = { +val t = set(binary, value) +hashingTF.setBinary($(binary)) --- End diff -- I think we can't do other things except for set params in ```set***``` function, this is because PySpark ```set***``` doesn't call corresponding Scala function. In PySpark, we collect all params together and then pass them to Scala side, then call the following function: ``` def fit(dataset: Dataset[_], paramMap: ParamMap): M = { copy(paramMap).fit(dataset) } ``` which will skip the corresponding Scala ```set***``` functions. So it will make your change to old ```hashingTF``` doesn't take effect. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r133080543 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) /** @group setParam */ @Since("1.2.0") - def setNumFeatures(value: Int): this.type = set(numFeatures, value) + def setNumFeatures(value: Int): this.type = { +hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) --- End diff -- I think `mllib` only need fix bug, no need add new method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r133080201 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala --- @@ -69,6 +69,20 @@ class HashingTFSuite extends SparkFunSuite with MLlibTestSparkContext with Defau assert(features ~== expected absTol 1e-14) } + test("indexOf method") { +val df = Seq((0, "a a b b c d".split(" ").toSeq)).toDF("id", "words") +val n = 100 +val hashingTF = new HashingTF() + .setInputCol("words") + .setOutputCol("features") + .setNumFeatures(n) +hashingTF.transform(df) +assert(hashingTF.indexOf("a") === 70) +assert(hashingTF.indexOf("b") === 61) +assert(hashingTF.indexOf("c") === 22) +assert(hashingTF.indexOf("d") === 94) --- End diff -- I prefer the testcase compare the result between `ml.feature.HashingTF` and `mllib.feature.HashingTF`, avoid hardcoding the computing result here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132618802 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) /** @group setParam */ @Since("1.2.0") - def setNumFeatures(value: Int): this.type = set(numFeatures, value) + def setNumFeatures(value: Int): this.type = { +hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary)) --- End diff -- @WeichenXu123 How about adding `setNumFeatures` method to mllib? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132253487 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") def setBinary(value: Boolean): this.type = set(binary, value) + /** + * Returns the index of the input term. + */ + @Since("2.3.0") + def indexOf(term: Any): Int = { +if (hashingTF != null) { + hashingTF.indexOf(term) --- End diff -- Another problem: When the parameter changed (numFeatures, binary), you should immediately create a new hashTF object to replace the old one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user facaiy commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132131171 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") def setBinary(value: Boolean): this.type = set(binary, value) + /** + * Returns the index of the input term. + */ + @Since("2.3.0") + def indexOf(term: Any): Int = { +if (hashingTF != null) { + hashingTF.indexOf(term) --- End diff -- Thanks for you suggestion, I'll like to modify the code later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
Github user WeichenXu123 commented on a diff in the pull request: https://github.com/apache/spark/pull/18736#discussion_r132058692 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala --- @@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") override val uid: String) @Since("2.0.0") def setBinary(value: Boolean): this.type = set(binary, value) + /** + * Returns the index of the input term. + */ + @Since("2.3.0") + def indexOf(term: Any): Int = { +if (hashingTF != null) { + hashingTF.indexOf(term) --- End diff -- You can move the `feature.HashingTF` initialization into constructor. Otherwise the `indexOf` will throw error when `transform` not called, it is weird to user. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...
GitHub user facaiy opened a pull request: https://github.com/apache/spark/pull/18736 [SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF ## What changes were proposed in this pull request? Add indexOf method for ml.feature.HashingTF. The PR is a hotfix by storing hashingTF. I believe it is better to migrate native implement from mllib to ml. I can work on it, but I don't know whether to open an new JIRA for migrating or not. ## How was this patch tested? + [x] add a test case. Please review http://spark.apache.org/contributing.html before opening a pull request. You can merge this pull request into a Git repository by running: $ git pull https://github.com/facaiy/spark ENH/add_indexOf_for_ml Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/18736.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #18736 commit f08861d64390f0eb4d7e34de13d78db67e046457 Author: Yan Facai (é¢åæ) Date: 2017-07-26T05:45:37Z hotfix: store hashingTF to reuse indexOf commit 174a2d57ebeccba10b8f1435970cde69f5829c79 Author: Yan Facai (é¢åæ) Date: 2017-07-26T06:14:37Z TST: add test case --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org