[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2018-03-03 Thread facaiy
Github user facaiy closed the pull request at:

https://github.com/apache/spark/pull/18736


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-15 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r133127624
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -74,26 +74,41 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
 
   setDefault(numFeatures -> (1 << 18), binary -> false)
 
+  private[this] var hashingTF = new 
feature.HashingTF($(numFeatures)).setBinary($(binary))
+
   /** @group getParam */
   @Since("1.2.0")
   def getNumFeatures: Int = $(numFeatures)
 
   /** @group setParam */
   @Since("1.2.0")
-  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
+  def setNumFeatures(value: Int): this.type = {
+val t = set(numFeatures, value)
+hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
+t
+  }
 
   /** @group getParam */
   @Since("2.0.0")
   def getBinary: Boolean = $(binary)
 
   /** @group setParam */
   @Since("2.0.0")
-  def setBinary(value: Boolean): this.type = set(binary, value)
+  def setBinary(value: Boolean): this.type = {
+val t = set(binary, value)
+hashingTF.setBinary($(binary))
--- End diff --

I think we can't do other things except for set params in ```set***``` 
function, this is because PySpark ```set***``` doesn't call corresponding Scala 
function. In PySpark, we collect all params together and then pass them to 
Scala side, then call the following function:
```
def fit(dataset: Dataset[_], paramMap: ParamMap): M = {
copy(paramMap).fit(dataset)
  }
```
which will skip the corresponding Scala ```set***``` functions. So it will 
make your change to old ```hashingTF``` doesn't take effect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r133080543
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
 
   /** @group setParam */
   @Since("1.2.0")
-  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
+  def setNumFeatures(value: Int): this.type = {
+hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
--- End diff --

I think `mllib` only need fix bug, no need add new method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-14 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r133080201
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala ---
@@ -69,6 +69,20 @@ class HashingTFSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 assert(features ~== expected absTol 1e-14)
   }
 
+  test("indexOf method") {
+val df = Seq((0, "a a b b c d".split(" ").toSeq)).toDF("id", "words")
+val n = 100
+val hashingTF = new HashingTF()
+  .setInputCol("words")
+  .setOutputCol("features")
+  .setNumFeatures(n)
+hashingTF.transform(df)
+assert(hashingTF.indexOf("a") === 70)
+assert(hashingTF.indexOf("b") === 61)
+assert(hashingTF.indexOf("c") === 22)
+assert(hashingTF.indexOf("d") === 94)
--- End diff --

I prefer the testcase compare the result between `ml.feature.HashingTF` and 
`mllib.feature.HashingTF`, avoid hardcoding the computing result here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-10 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r132618802
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -80,20 +82,31 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
 
   /** @group setParam */
   @Since("1.2.0")
-  def setNumFeatures(value: Int): this.type = set(numFeatures, value)
+  def setNumFeatures(value: Int): this.type = {
+hashingTF = new feature.HashingTF($(numFeatures)).setBinary($(binary))
--- End diff --

@WeichenXu123  How about adding `setNumFeatures` method to mllib? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-09 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r132253487
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   def setBinary(value: Boolean): this.type = set(binary, value)
 
+  /**
+   * Returns the index of the input term.
+   */
+  @Since("2.3.0")
+  def indexOf(term: Any): Int = {
+if (hashingTF != null) {
+  hashingTF.indexOf(term)
--- End diff --

Another problem:

When the parameter changed (numFeatures, binary), you should immediately 
create a new hashTF object  to replace the old one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-09 Thread facaiy
Github user facaiy commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r132131171
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   def setBinary(value: Boolean): this.type = set(binary, value)
 
+  /**
+   * Returns the index of the input term.
+   */
+  @Since("2.3.0")
+  def indexOf(term: Any): Int = {
+if (hashingTF != null) {
+  hashingTF.indexOf(term)
--- End diff --

Thanks for you suggestion, I'll like to modify the code later.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-08-08 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18736#discussion_r132058692
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/HashingTF.scala 
---
@@ -90,10 +92,22 @@ class HashingTF @Since("1.4.0") (@Since("1.4.0") 
override val uid: String)
   @Since("2.0.0")
   def setBinary(value: Boolean): this.type = set(binary, value)
 
+  /**
+   * Returns the index of the input term.
+   */
+  @Since("2.3.0")
+  def indexOf(term: Any): Int = {
+if (hashingTF != null) {
+  hashingTF.indexOf(term)
--- End diff --

You can move the `feature.HashingTF` initialization into constructor.
Otherwise the `indexOf` will throw error when `transform` not called, it is 
weird to user.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18736: [SPARK-21481][ML] Add indexOf method for ml.featu...

2017-07-25 Thread facaiy
GitHub user facaiy opened a pull request:

https://github.com/apache/spark/pull/18736

[SPARK-21481][ML] Add indexOf method for ml.feature.HashingTF

## What changes were proposed in this pull request?

Add indexOf method for ml.feature.HashingTF.

The PR is a hotfix by storing hashingTF.
I believe it is better to migrate native implement from mllib to ml. I can 
work on it, but I don't know whether to open an new JIRA for migrating or not.

## How was this patch tested?

+ [x] add a test case.

Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/facaiy/spark ENH/add_indexOf_for_ml

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18736.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18736


commit f08861d64390f0eb4d7e34de13d78db67e046457
Author: Yan Facai (颜发才) 
Date:   2017-07-26T05:45:37Z

hotfix: store hashingTF to reuse indexOf

commit 174a2d57ebeccba10b8f1435970cde69f5829c79
Author: Yan Facai (颜发才) 
Date:   2017-07-26T06:14:37Z

TST: add test case




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org