from:"mpjlu"

[GitHub] spark issue #14449: [SPARK-16843][MLLIB] add the percentage ChiSquareSelecto...

2016-08-02 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14449
  
Hi @srowen, thanks for your comment. 
I agree for your comment, user can get the number of features without 
percentage method. For the user experience, sometimes the percentage method 
seems better. 
In scikit-learn, there are both  SelectKBest and SelectPercentile APIs: 
http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14449: [SPARK-16843][MLLIB] add the percentage ChiSquare...

2016-08-01 Thread mpjlu

GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/14449

[SPARK-16843][MLLIB] add the percentage ChiSquareSelector feature

## What changes were proposed in this pull request?

add the percentage ChiSquareSelector feature


## How was this patch tested?

add scala ut




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark chisquare2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14449.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14449


commit fb3c9a93e4b1b20f6738a3b56d8fb0604fbbb59e
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-01T05:00:16Z

add the percentage ChiSquareSelector feature




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-11 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74404348
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

The only problem here is how to the reuse the result of  
"Statistics.chiSqTest(data)", so why not add a var to save the result.  we 
don't need to change the model related code.  Just need to add two lines of 
code to check which selection method is used. No API needs to change.  We need 
to add API to set parameters about other selection methods, like select by FPR. 
How do you think about it. @srowen  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-11 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74412725
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

Two or more models based on the same chi-squared test is reasonable, 
because the chi-squared test results contain "pValue,  degreesOfFreedom,  
statistic,,,". numTopFeatures selection method uses statistic, and FPR 
selection method uses pValue. 

Expose the p-value to the caller maybe useful, but we can expose 
ChiSqTestResult, not only p-value.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-11 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74400235
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

Yes, great proposal. We can do that.  But the chiSquareSelector should be 
changed a lot to implement your proposal, I think it is necessary. I will 
submit a new PR based on my understanding of your proposal, how do you think 
about it? Thanks.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-11 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74416260
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

Do you mean we can put some selection logic or more selection results not 
only indices to the model (ChiSqSelectorModel) ? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14449: [SPARK-16843][MLLIB] add the percentage ChiSquareSelecto...

2016-08-04 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14449
  
Hi @srowen , I also plan to submit some PR about feature selection methods 
based on univariate statistical test, like the methods in scikit-learn:  
SelectFpr (using false positive rate),  SelectFdr ( using false discovery 
rate), and SelectFwe (family wise error ).
http://scikit-learn.org/dev/modules/feature_selection.html
Now, the ChiSqSelector in Spark only can select the topNumFeatures.  Do you 
think Spark should support other feature selection methods?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB] add a chiSquare Selector based on F...

2016-08-14 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi  @avulanov .  In general, FPR feature selection should not modify the 
code of existing ChiSqSelector, as we have implemented in this PR.  But if we 
need to reuse the ChiSqTestResult (Statistics.chiSqTest(data)),  it is better 
to modify the code of ChiSqSelector.  

In Scikit-learn, for each SelectKBest, SelectFpr, SelectPercentile and so 
on, create an object for it, as we implemented in this PR. The good point of 
this method is it is consistent across the LIB, all use the same 
Estimator/Model style.  The disadvantage is it cannot reuse the results of 
score function. @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-14 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74721128
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

Hi @srowen ,  if we configure the model instance with different parameter  
to perform different types of selection, will that be inconsistent with the 
MLlib Estimator/Model style, and cause the user confused?  
If it is not a problem, I will submit a new PR to configure the model to 
perform different selection.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-15 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74740698
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

Thanks, this makes sense.  It is better to add the parameter controlling 
what type of feature selection is done to ChiSqSelector.  But my idea is the 
the selection is done in "fit" function to be consistent with other feature 
selection methods, we don't need to pass what type of selection to the model. 
If we consider the user can get the p-value, maybe return the p-value from 
ChiSqSelector is ok?If so , we don't need to make any change to the model 
class. How do you think about it?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-15 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74742525
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

ok,  I will pass the p values to the model and expose to the caller, and 
let the model to sort the indices internally.  I will add the parameter 
controlling what type of feature selection is done in "fit" function of 
ChiSqSelector. Is my understanding right?  Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-15 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74743026
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

ok, thanks. I will submit a new PR based on our discussion. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: Fpr chi square

2016-08-11 Thread mpjlu

GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/14597

Fpr chi square

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on 
univariate statistical tests. False Positive Rate (FPR) is a popular univariate 
statistical test for feature selection. We add a chiSquare Selector based on 
False Positive Rate (FPR) test in this PR, like it is implemented in 
scikit-learn. 

http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection


## How was this patch tested?

Add Scala ut



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark fprChiSquare

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14597


commit 2adebe8de3881509e510fc518c562d1141ccd0ef
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-10T05:40:18Z

add a chiSquare Selector based on False Positive Rate (FPR) test

commit 04053ca207ef4aa955eddc3e65d09a4e03db6292
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-11T07:10:43Z

Merge remote-tracking branch 'origin/master' into fprChiSquare




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-11 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74396249
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
--- End diff --

Hi @srowen , 
http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/, 
this is a link to introduce p-value for feature selection. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-11 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r74397115
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") (
 new ChiSqSelectorModel(indices)
   }
 }
+
+/**
+ * Creates a ChiSquared feature selector by False Positive Rate (FPR) test.
+ * @param alpha the highest p-value for features to be kept
+ */
+@Since("2.1.0")
+class ChiSqSelectorByFpr @Since("2.1.0") (
+  @Since("2.1.0") val alpha: Double) extends Serializable {
+
+  /**
+   * Returns a ChiSquared feature selector by FPR.
+   *
+   * @param data an `RDD[LabeledPoint]` containing the labeled dataset 
with categorical features.
+   * Real-valued features will be treated as categorical for 
each distinct value.
+   * Apply feature discretizer before using this function.
+   */
+  @Since("2.1.0")
+  def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
+val indices = Statistics.chiSqTest(data)
+  .zipWithIndex.filter { case (res, _) => res.pValue < alpha }
--- End diff --

Here we don't need the features sorted by importance, because all the 
features are filtered by pvalue < alpha. But the ChiSqSelectorModel needs the 
features sorted by index. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB] add a chiSquare Selector based on F...

2016-08-11 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi, @srowen , I can modify the implementation in .ml to accommodate the new 
params. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-16 Thread mpjlu

Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/14597


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75314112
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
+}
+  }
+
+  @Since("2.1.0")
+  def selectKBest(value: Int): ChiSqSelectorModel = {
--- End diff --

Thanks. I am suggesting how one selector can generate multi models by only 
fit one time, because we want to reuse the ChiSqTestResult, this result is 
generated in fit(DataFrame). So we want to only fit one time and can generate 
multi models.   
According to your proposal, suppose we want to generate KBest with defaults 
numTopFeateatures. we can do like this:
val selector = new ChiSqSelector()  (1)
val model = selector.fit(dataframe) (2)
then the user want to generate KBest model with only top 20 features.  The 
user can do like this:
selector.setNumToFeatures(20) //here is the same selector as (1) , and 
the model (2) now is with Top 20 features, because it can use the parameters of 
selector. This is my understanding of your proposal, is my understanding right? 
  Thanks very much. 




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-21 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75610472
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
+}
+  }
+
+  @Since("2.1.0")
+  def selectKBest(value: Int): ChiSqSelectorModel = {
--- End diff --

Yes, I also think so.  I will update this PR. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-21 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75613576
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
+}
+  }
+
+  @Since("2.1.0")
+  def selectKBest(value: Int): ChiSqSelectorModel = {
--- End diff --

Hi @srowen , I have updated this PR. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-22 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75634659
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -54,6 +54,29 @@ private[feature] trait ChiSqSelectorParams extends Params
 
   /** @group getParam */
   def getNumTopFeatures: Int = $(numTopFeatures)
+
+  final val percentile = new DoubleParam(this, "percentile",
+"Percentile of features that selector will select, ordered by 
statistics value descending.",
+ParamValidators.gtEq(0))
+  setDefault(percentile -> 10)
--- End diff --

ok, I will update it. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-22 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75636510
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +227,20 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+var indices = selectorType match {
+  case ChiSqSelectorType.KBest => Statistics.chiSqTest(data)
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
+.take(numTopFeatures)
+.map { case (_, indices) => indices }
+  case ChiSqSelectorType.Percentile => Statistics.chiSqTest(data)
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
+.take((data.count() * percentile / 100).toInt)
--- End diff --

Yes, I will update the code. thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75419052
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
+}
+  }
+
+  @Since("2.1.0")
+  def selectKBest(value: Int): ChiSqSelectorModel = {
--- End diff --

Hi @srowen , maybe I misunderstood you. I think make the model instance be 
parameterized is the same as current solution (make selector instance be 
parameterized). This only difference of your suggestion is moving the logic 
from selector to model.  Both of the solution is to avoid computing 
ChiSqTestResult many times when one wanted to perform different chi squared 
selection.

If we don't consider to avoid computing ChiSqTestResult many times,  the 
problem will be very easy, and can be consistent exactly with other methods.   
Can we don't consider reuse ChiSqTestResult in this PR.  Thanks
   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-22 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75661562
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +227,20 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+var indices = selectorType match {
+  case ChiSqSelectorType.KBest => Statistics.chiSqTest(data)
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
+.take(numTopFeatures)
+.map { case (_, indices) => indices }
+  case ChiSqSelectorType.Percentile => Statistics.chiSqTest(data)
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
+.take((data.count() * percentile / 100).toInt)
+.map { case (_, indices) => indices }
+  case ChiSqSelectorType.Fpr => Statistics.chiSqTest(data)
+.zipWithIndex.filter{ case (res, _) => res.pValue < alpha }
+.map { case (_, indices) => indices }
+  case _ => throw new IllegalStateException("Unknown ChiSqSelector 
Type")
--- End diff --

Hi @srowen , I have updated the code per all you comments.  Thanks very 
much. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in T...

2016-08-24 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14785#discussion_r76035442
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala ---
@@ -154,7 +154,11 @@ object TestingUtils {
  */
 def absTol(eps: Double): CompareVectorRightSide = 
CompareVectorRightSide(
   (x: Vector, y: Vector, eps: Double) => {
-x.toArray.zip(y.toArray).forall(x => x._1 ~= x._2 absTol eps)
+if (x.size != y.size) {
--- End diff --

Yes, thanks. 
How about check matrix length by
x.numRows == y.numRows && x.numCols == y.numCols && ... ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in T...

2016-08-24 Thread mpjlu

GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/14785

[SPARK-17207][MLLIB]fix comparing Vector bug in TestingUtils

## What changes were proposed in this pull request?

fix comparing Vector bug in TestingUtils.
There is the same bug for Matrix comparing. How to check the length of 
Matrix should be discussed first.  


## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark testUtils

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14785.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14785


commit 6c8f87f730ab4e93bbdec02d8b89d82b82042d73
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-24T10:36:10Z

fix comparing Vector bug in TestingUtils




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in T...

2016-08-24 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14785#discussion_r76037277
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala ---
@@ -154,7 +154,11 @@ object TestingUtils {
  */
 def absTol(eps: Double): CompareVectorRightSide = 
CompareVectorRightSide(
   (x: Vector, y: Vector, eps: Double) => {
-x.toArray.zip(y.toArray).forall(x => x._1 ~= x._2 absTol eps)
+if (x.size != y.size) {
--- End diff --

code is updated. Thanks @srowen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-24 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r76041373
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+val chiSqTestResult = Statistics.chiSqTest(data)
+val features = selectorType match {
+  case ChiSqSelectorType.KBest => chiSqTestResult
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
--- End diff --

Hi @srowen , I have updated the code, do I miss any of your comments in 
this pr?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...

2016-08-24 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14785
  
Sure, I will fix it, and add test cases. thanks. @dbtsai , 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...

2016-08-25 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14785
  
Hi @dbtsai , PR 2294 added Matrix comparing in TestingUtils, but did not 
add any test cases in TestingUtilsSuite. I did not add test cases for Matrix 
comparing in the PR either.
If Matrix comparing cases are required. I can add all test cases them, not 
only test this PR, like Vector comparing test cases.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-09-05 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r77536408
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -305,7 +350,12 @@ def fit(self, data):
  treated as categorical for each distinct value.
  Apply feature discretizer before using this function.
 """
-jmodel = callMLlibFunc("fitChiSqSelector", self.numTopFeatures, 
data)
+if self.selectorType == ChiSqSelectorType.KBest:
+jmodel = callMLlibFunc("fitChiSqSelectorKBest", 
self.numTopFeatures, data)
+elif self.selectorType == ChiSqSelectorType.Percentile:
+jmodel = callMLlibFunc("fitChiSqSelectorPercentile", 
self.percentile, data)
+elif self.selectorType == ChiSqSelectorType.FPR:
--- End diff --

Thanks  @srowen , the type checking is added. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-09-05 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r77532997
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -271,29 +271,74 @@ def transform(self, vector):
 """
 return JavaVectorTransformer.transform(self, vector)
 
+class ChiSqSelectorType:
+"""
+This class defines the selector types of Chi Square Selector.
+"""
+KBest, Percentile, FPR = range(3)
--- End diff --

Hi @srowen , there are many ways to express enums in Python 2, this is the 
most frequently used and simplest way. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-09-06 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi @yanboliang , could you please kindly review the python code of this PR. 
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-09-01 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r77137991
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -276,24 +276,64 @@ class ChiSqSelector(object):
 """
 Creates a ChiSquared feature selector.
 
-:param numTopFeatures: number of features that selector will select.
-
 >>> data = [
 ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
 ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
 ... LabeledPoint(1.0, [0.0, 9.0, 8.0]),
 ... LabeledPoint(2.0, [8.0, 9.0, 5.0])
 ... ]
->>> model = ChiSqSelector(1).fit(sc.parallelize(data))
+>>> model = 
ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data))
+>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
+SparseVector(1, {0: 6.0})
+>>> model.transform(DenseVector([8.0, 9.0, 5.0]))
+DenseVector([5.0])
+>>> model = 
ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data))
 >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
 SparseVector(1, {0: 6.0})
 >>> model.transform(DenseVector([8.0, 9.0, 5.0]))
 DenseVector([5.0])
+>>> data = [
+... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})),
+... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})),
+... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]),
+... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0])
+... ]
+>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data))
+>>> model.transform(DenseVector([1.0,2.0,3.0,4.0]))
+DenseVector([4.0])
 
 .. versionadded:: 1.4.0
 """
-def __init__(self, numTopFeatures):
-self.numTopFeatures = int(numTopFeatures)
+def __init__(self):
+self.param = 50
--- End diff --

Hi @srowen , use different fields for different value is not a problem, 
just need another selectionType field, and in the fit function, the code will 
be:
if(selectionType == KBest)
   callMLlibFunc("fitChiSqSelectorKBest", self.numTopFeatures, data)
elseif(selectionType == Percentile)
   callMLlibFunc("fitChiSqSelectorKPercentile", self.percentile, data)
elseif(selecitonType == FPR)
   callMLlibFunc("fitChiSqSelectorFPR", self.alpha, data)
Is that ok? Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-09-05 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r77473261
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -276,24 +276,64 @@ class ChiSqSelector(object):
 """
 Creates a ChiSquared feature selector.
 
-:param numTopFeatures: number of features that selector will select.
-
 >>> data = [
 ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
 ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
 ... LabeledPoint(1.0, [0.0, 9.0, 8.0]),
 ... LabeledPoint(2.0, [8.0, 9.0, 5.0])
 ... ]
->>> model = ChiSqSelector(1).fit(sc.parallelize(data))
+>>> model = 
ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data))
+>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
+SparseVector(1, {0: 6.0})
+>>> model.transform(DenseVector([8.0, 9.0, 5.0]))
+DenseVector([5.0])
+>>> model = 
ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data))
 >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
 SparseVector(1, {0: 6.0})
 >>> model.transform(DenseVector([8.0, 9.0, 5.0]))
 DenseVector([5.0])
+>>> data = [
+... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})),
+... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})),
+... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]),
+... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0])
+... ]
+>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data))
+>>> model.transform(DenseVector([1.0,2.0,3.0,4.0]))
+DenseVector([4.0])
 
 .. versionadded:: 1.4.0
 """
-def __init__(self, numTopFeatures):
-self.numTopFeatures = int(numTopFeatures)
+def __init__(self):
+self.param = 50
--- End diff --

Hi @srowen , I have updated the code. Could you please review it again? 
Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...

2016-09-12 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15058#discussion_r78355383
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ---
@@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame
 @Since("1.0.0")
 class BinaryClassificationMetrics @Since("1.3.0") (
 @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+@Since("1.3.0") var numBins: Int) extends Logging {
--- End diff --

It can cause the count/confusion RDD to recompute. 
yes, you can set numBins when construct BinaryClassificationMetrics.  But 
when you want to change the numBins, if there is no setBins function, you have 
to construct another BinaryClassificationMetrics. 



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...

2016-09-12 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15058#discussion_r78355993
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ---
@@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame
 @Since("1.0.0")
 class BinaryClassificationMetrics @Since("1.3.0") (
 @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+@Since("1.3.0") var numBins: Int) extends Logging {
--- End diff --

I add this method is because, in the code, there is "// TODO: Allow the 
user to vary the number of bins using a setBins method in 
BinaryClassificationMetrics. " but I find there is no setBins method. So I add 
it. I we not plan to add this method, maybe we can remove this comments. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...

2016-09-12 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15058#discussion_r78358229
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ---
@@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame
 @Since("1.0.0")
 class BinaryClassificationMetrics @Since("1.3.0") (
 @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+@Since("1.3.0") var numBins: Int) extends Logging {
--- End diff --

Thanks.  I agree with you, the method to add setBins is complexity. I just 
can not find simple method to add setBins.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...

2016-09-12 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15058#discussion_r78362121
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ---
@@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame
 @Since("1.0.0")
 class BinaryClassificationMetrics @Since("1.3.0") (
 @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+@Since("1.3.0") var numBins: Int) extends Logging {
--- End diff --

Ok, got it, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...

2016-09-12 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15058#discussion_r78356355
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ---
@@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame
 @Since("1.0.0")
 class BinaryClassificationMetrics @Since("1.3.0") (
 @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+@Since("1.3.0") var numBins: Int) extends Logging {
--- End diff --

Yes, we can create another metrics. But why not add a setBins method.  I 
think add a setBins method is reasonable.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-09-12 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
hi @srowen , thanks.  
This is my first PR. learn much from you. thanks very much. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [MLLIB]Add setBins for BinaryClassificationMetric...

2016-09-12 Thread mpjlu

GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/15058

[MLLIB]Add setBins for BinaryClassificationMetrics

## What changes were proposed in this pull request?

Add a setBins method for BinaryClassificationMetrics.
BinaryClassificationMetrics is a class in mllib/Evaluation.  numBins is a 
key attribute of it.   If numBins greater than 0, then the curves (ROC curve, 
PR curve) computed internally will be down-sampled to this many "bins". 
It is useful to let the user set the numBins. 


## How was this patch tested?
ut cases were changed to test this PR.

(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark addSetNumBins

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15058.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15058


commit b28ac7499b62282ac3777d7be21a1ed0dc16a78b
Author: Peng <peng.m...@intel.com>
Date:   2016-09-12T08:02:31Z

add setBins

commit 1f4258e262314cd6781a6d6feca1940c4ea05457
Author: Peng <peng.m...@intel.com>
Date:   2016-09-12T09:41:59Z

Merge remote-tracking branch 'origin/master' into addSetNumBins

commit a70bf454dfd3838780a8d040858e53d3ad870966
Author: Peng <peng.m...@intel.com>
Date:   2016-09-12T10:10:30Z

change the comments style

commit c17bf4e768be614c806bbd913096538f89b380e5
Author: Peng <peng.m...@intel.com>
Date:   2016-09-12T10:15:51Z

minor change

commit d06b2b1d3f0d214a37da8d456e8e585b06fe5d9b
Author: Peng <peng.m...@intel.com>
Date:   2016-09-12T10:20:50Z

change the comments




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...

2016-09-12 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15058#discussion_r78360606
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala
 ---
@@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame
 @Since("1.0.0")
 class BinaryClassificationMetrics @Since("1.3.0") (
 @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)],
-@Since("1.3.0") val numBins: Int) extends Logging {
+@Since("1.3.0") var numBins: Int) extends Logging {
--- End diff --

hi @srowen , can we use a simple way to add the setBins method.  Seems add 
a setBins method is useful. 
Just as the testSuite in this PR shows. If the user wants to test different 
numBins. 
Create one object, just call setBins, he can easily get different results. 
How about change a way to implement setBins? 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-09-13 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
hi @srowen , we have moved isSorted in ChiSqSelectorModel.  There is error 
message "*** method isSorted(Array[Int])Boolean in class 
org.apache.spark.mllib.feature.ChiSqSelectorModel does not have a correspondent 
in current version**". Should we revert it?
And we also need to add a new method numTopFeatures() in ChiSqSelector for 
the following error message: " *** method numTopFeatures()Int in class 
org.apache.spark.mllib.feature.ChiSqSelector does not have a correspondent in 
current version**" 
How do you think about it? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-09-13 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi @srowen , Python style fail is updated. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-09-13 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
hi @srowen , I have updated the code for some MiMa test error. Could you 
please review it again. Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...

2016-09-24 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15214
  
Hi @yanboliang , got it. Thanks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15277: [SPARK-17704][ML][MLlib] ChiSqSelector performanc...

2016-09-28 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15277#discussion_r80961052
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -57,22 +69,21 @@ class ChiSqSelectorModel @Since("1.3.0") (
* Preserves the order of filtered features the same as their indices 
are stored.
* Might be moved to Vector as .slice
* @param features vector
-   * @param filterIndices indices of features to filter
+   * @param filterIndices indices of features to filter, must be ordered 
asc
*/
   private def compress(features: Vector, filterIndices: Array[Int]): 
Vector = {
-val orderedIndices = filterIndices.sorted
--- End diff --

I just found sort in transform will sort the number of sample times. sorry 
@srowen



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15299: [SPARK-17704][ML][MLlib] ChiSqSelector performance impro...

2016-09-29 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15299
  
hi @srowen , is @transient needed for val selectedFeatures or val 
filterIndices, one of them?
is it good to define filterIndices lazy?  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...

2016-09-15 Thread mpjlu

Github user mpjlu closed the pull request at:

https://github.com/apache/spark/pull/15058


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassification...

2016-09-15 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15058
  
sure, I close it now


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-09-15 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Thanks very much, I am in holiday now, will update the code this Sunday.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-09-19 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r79405860
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -54,11 +55,44 @@ private[feature] trait ChiSqSelectorParams extends 
Params
 
   /** @group getParam */
   def getNumTopFeatures: Int = $(numTopFeatures)
+
+  final val percentile = new DoubleParam(this, "percentile",
+"Percentile of features that selector will select, ordered by 
statistics value descending.",
+ParamValidators.inRange(0, 1))
+  setDefault(percentile -> 0.1)
+
+  /** @group getParam */
+  def getPercentile: Double = $(percentile)
+
+  final val alpha = new DoubleParam(this, "alpha",
+"The highest p-value for features to be kept.",
+ParamValidators.inRange(0, 1))
+  setDefault(alpha -> 0.05)
+
+  /** @group getParam */
+  def getAlpha: Double = $(alpha)
+
+  /**
+   * The ChiSqSelector supports KBest, Percentile, FPR selection,
+   * which is the same as ChiSqSelectorType defined in MLLIB.
+   * when call setNumTopFeatures, the selectorType is set to KBest
+   * when call setPercentile, the selectorType is set to Percentile
+   * when call setAlpha, the selectorType is set to FPR
+   */
+  final val selectorType = new Param[String](this, "selectorType",
+"ChiSqSelector Type: KBest, Percentile, FPR")
+  setDefault(selectorType -> ChiSqSelectorType.KBest.toString)
+
+  /** @group getParam */
+  def getChiSqSelectorType: String = $(selectorType)
 }
 
 /**
  * Chi-Squared feature selection, which selects categorical features to 
use for predicting a
  * categorical label.
+ * The selector supports three selection methods: KBest, Percentile and 
FPR.
--- End diff --

 Thanks @srowen, I have updated the javadoc. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15212: [MLLIB][ML]add feature selector method based on: ...

2016-09-23 Thread mpjlu

GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/15212

[MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) 
and Family wise error rate (FWE)

## What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on 
univariate statistical tests. 
FDR and FWE are a popular univariate statistical test for feature 
selection.  
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of 
the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg 
procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. 
In statistics, FWE is the probability of making one or more false 
discoveries, or type I errors, among all the hypotheses when performing 
multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add  FDR and FWE methods for ChiSqSelector in this PR, like it is 
implemented in scikit-learn. 

http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

## How was this patch tested?

ut will be added soon

(Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)


(If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)




You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark fdr_fwe

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15212.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15212


commit 2c071791b2c6fd7d388343ac95783c32ffdae529
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-09-23T07:27:19Z

add feature selector method: FDR and FWE




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...

2016-09-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15214#discussion_r80277277
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -143,13 +149,13 @@ final class ChiSqSelector @Since("1.6.0") 
(@Since("1.6.0") override val uid: Str
 case Row(label: Double, features: Vector) =>
   OldLabeledPoint(label, OldVectors.fromML(features))
   }
-var selector = new feature.ChiSqSelector()
-ChiSqSelectorType.withName($(selectorType)) match {
-  case ChiSqSelectorType.KBest =>
+val selector = new feature.ChiSqSelector()
+$(selectorType) match {
+  case OldChiSqSelector.KBest =>
 selector.setNumTopFeatures($(numTopFeatures))
--- End diff --

Do you need to set SelectorType here?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...

2016-09-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15214#discussion_r80278819
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala ---
@@ -76,7 +76,7 @@ class ChiSqSelectorSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 LabeledPoint(1.0, Vectors.dense(Array(4.0))),
 LabeledPoint(1.0, Vectors.dense(Array(4.0))),
 LabeledPoint(2.0, Vectors.dense(Array(9.0
-val model = new ChiSqSelector().setAlpha(0.1).fit(labeledDiscreteData)
+val model = new 
ChiSqSelector().setSelectorType("fpr").setAlpha(0.1).fit(labeledDiscreteData)
--- End diff --

you should also do the same thing for 
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...

2016-09-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15214#discussion_r80277785
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -160,6 +166,12 @@ final class ChiSqSelector @Since("1.6.0") 
(@Since("1.6.0") override val uid: Str
 
   @Since("1.6.0")
   override def transformSchema(schema: StructType): StructType = {
+val otherPairs = 
OldChiSqSelector.supportedTypeAndParamPairs.filter(_._1 == $(selectorType))
--- End diff --

== or != ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...

2016-09-23 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15214
  
Hi @srowen and @yanboliang ; Thanks for your following up PR. 
I partly agree with your comments on 17017. 
**1. "if users both set numTopFeatures and percentile, it will train kbest 
or percentile model based on the order of setting (the latter setting one will 
be trained). This make users confused, and we should allow users to set 
selector type explicitly."**
For the user confused you mentioned here, I think the main reason is 
function name.  I have changed the function name of setAlpha to setFPR in 
SPARK-17645.  setNumTopFeature should be setKBest.
By this change, it can be much clear. 
For example, setKBest(100), setPercentile(0.1), setFPR(0.05). The selection 
type and parameters is very clear by one function.  
But for your method, user have to strike 
"setSelectorType("KBest").setNumTopFeatures(100)" to do the same thing as 
"setKBest(100)"
**2. "if there are more than one parameter except alpha can be set for fpr 
model, we can not handle it elegantly in the existing framework. And similar 
issues for kbest and percentile model. "**
I cannot think out any other parameters for fpr, kbest, percentile now. But 
if there is, I think it is just the same thing as your method. for example, 
setKBest(100).setOther(),, 
I agree with you for other change. Thanks very much.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...

2016-09-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15214#discussion_r80277470
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -143,13 +149,13 @@ final class ChiSqSelector @Since("1.6.0") 
(@Since("1.6.0") override val uid: Str
 case Row(label: Double, features: Vector) =>
   OldLabeledPoint(label, OldVectors.fromML(features))
   }
-var selector = new feature.ChiSqSelector()
-ChiSqSelectorType.withName($(selectorType)) match {
-  case ChiSqSelectorType.KBest =>
+val selector = new feature.ChiSqSelector()
+$(selectorType) match {
+  case OldChiSqSelector.KBest =>
 selector.setNumTopFeatures($(numTopFeatures))
-  case ChiSqSelectorType.Percentile =>
+  case OldChiSqSelector.Percentile =>
 selector.setPercentile($(percentile))
-  case ChiSqSelectorType.FPR =>
+  case OldChiSqSelector.FPR =>
 selector.setAlpha($(alpha))
--- End diff --

ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-09-20 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
No problem. thanks very much @srowen .


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB] add a chiSquare Selector based on F...

2016-08-17 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi @srowen, I have added the parameter to control the feature selection 
type.
The usage is like this: 
**var selector = new ChiSqSelector()
var model = selector.fit(df) // by default, the selector is selection 
numTopFeatures (50)
var newModel = selector.selectKBest(10), or var newModel = 
selector.selectPercentile(5), or,,**
You can fit the DataFrame one time, and generate the model multi times. 

And the indices is sort in the model internally as we have discussed. 

For pass the p-value to the model function, this update does not include 
it. Because for the KBest and Percentile selection, the fit function uses 
ChiSqTestResult.statics to generate the model. For Fpr, the fit function uses 
ChiSqTestResult.p-value.  So it maybe better to pass ChiSqTestResult to the 
model and expose to the caller. And I think it is better to submit another PR 
for  "pass value to model and expose to the caller" problem. Because much codes 
will be changed for this problem, includes which data should be passed to the 
model, how  to save the model, how to test the model.   




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...

2016-08-17 Thread mpjlu

GitHub user mpjlu reopened a pull request:

https://github.com/apache/spark/pull/14597

[SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate 
(FPR) test

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on 
univariate statistical tests. False Positive Rate (FPR) is a popular univariate 
statistical test for feature selection. We add a chiSquare Selector based on 
False Positive Rate (FPR) test in this PR, like it is implemented in 
scikit-learn. 

http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection


## How was this patch tested?

Add Scala ut



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark fprChiSquare

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14597


commit 2adebe8de3881509e510fc518c562d1141ccd0ef
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-10T05:40:18Z

add a chiSquare Selector based on False Positive Rate (FPR) test

commit 04053ca207ef4aa955eddc3e65d09a4e03db6292
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-11T07:10:43Z

Merge remote-tracking branch 'origin/master' into fprChiSquare

commit 7623563884355a04867ce5271baa286f65180e62
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-16T13:36:11Z

Configure the ChiSqSelector to reuse ChiSqTestResult by numTopFeatures, 
Percentile, and Fpr selector

commit 3d6aecb8441503c9c3d62a2d8a3d48824b9d6637
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-17T02:34:59Z

Config the ChiSqSelector to reuse the ChiSqTestResult by KBest, Percentile 
and FPR selector

commit 026ac85dfa190707891b694f40e737f22f9b4bd5
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-17T02:43:45Z

Merge branch 'master' into fprChiSquare2

commit 5305709c9d4029186318b99fa9c7c483897aa653
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-17T09:59:16Z

add Since annotation




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75286500
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -171,14 +177,47 @@ object ChiSqSelectorModel extends 
Loader[ChiSqSelectorModel] {
 
 /**
  * Creates a ChiSquared feature selector.
- * @param numTopFeatures number of features that selector will select
- *   (ordered by statistic value descending)
- *   Note that if the number of features is less than 
numTopFeatures,
- *   then this will select all features.
  */
-@Since("1.3.0")
-class ChiSqSelector @Since("1.3.0") (
-  @Since("1.3.0") val numTopFeatures: Int) extends Serializable {
+@Since("2.1.0")
+class ChiSqSelector @Since("2.1.0") () extends Serializable {
+  private var numTopFeatures: Int = 50
+  private var percentile: Double = 10.0
+  private var alpha: Double = 0.05
+  private var selectorType = ChiSqSelectorType.KBest
+  private var chiSqTestResult: Array[ChiSqTestResult] = _
+
+  @Since("1.3.0")
+  def this(numTopFeatures: Int) {
+this()
+this.numTopFeatures = numTopFeatures
--- End diff --

This is not necessary, because the default selectorType is KBest


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75294150
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
+}
+  }
+
+  @Since("2.1.0")
+  def selectKBest(value: Int): ChiSqSelectorModel = {
--- End diff --

The usage scenario here is:
model = selector.fit(df)
result = model.transform(features).
I try to understand your proposal:
Suppose the user generate the model1 use KBest:
model1 = selector.fit(df). 
you mean the user can config selector, like 
selector.setPercentile(0.1)
then, the model1 will be a different model, like it is generated by 
Percentile method, because it can use the parameters of selector.  Does my 
understanding right?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75286355
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
+}
+  }
+
+  @Since("2.1.0")
+  def selectKBest(value: Int): ChiSqSelectorModel = {
--- End diff --

The user can call selectKBest directly, because, use can fit one time, and 
generate models by selectKBest multi times. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75286399
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
--- End diff --

Thanks, I will update


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75289389
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+chiSqTestResult = Statistics.chiSqTest(data)
+selectorType match {
+  case ChiSqSelectorType.KBest => selectKBest(numTopFeatures)
+  case ChiSqSelectorType.Percentile => selectPercentile(percentile)
+  case ChiSqSelectorType.Fpr => selectFpr(alpha)
+  case _ => throw new Exception("Unknown ChiSqSelector Type")
+}
+  }
+
+  @Since("2.1.0")
+  def selectKBest(value: Int): ChiSqSelectorModel = {
--- End diff --

Now, the user can use fit and transform only for consistent. 
If the user wants to reuse ChiSqTestResult, call selectKBest, 
SelectPercentile will help them. If we don't consider reuse ChiSqTestResult, we 
can not expose this methods. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-18 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75286558
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -54,6 +54,29 @@ private[feature] trait ChiSqSelectorParams extends Params
 
   /** @group getParam */
   def getNumTopFeatures: Int = $(numTopFeatures)
+
+  final val percentile = new DoubleParam(this, "percentile",
+"Percentile of features that selector will select, ordered by 
statistics value descending.",
+ParamValidators.gtEq(0))
+  setDefault(percentile -> 10)
--- End diff --

I will update, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-08-18 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi @srowen ,  I will update the Python API to match this changes. Now, the 
current Python API is not conflict with the changes.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14824: [ML][MLLIB]The require condition and message does...

2016-08-26 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14824#discussion_r76421555
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -454,10 +454,15 @@ class SparseMatrix @Since("2.0.0") (
 
   require(values.length == rowIndices.length, "The number of row indices 
and values don't match! " +
 s"values.length: ${values.length}, rowIndices.length: 
${rowIndices.length}")
-  // The Or statement is for the case when the matrix is transposed
-  require(colPtrs.length == numCols + 1 || colPtrs.length == numRows + 1, 
"The length of the " +
-"column indices should be the number of columns + 1. Currently, 
colPointers.length: " +
-s"${colPtrs.length}, numCols: $numCols")
+  if (isTransposed) {
+require(colPtrs.length == numRows + 1, "The length of the column 
indices should be " +
--- End diff --

Updated, thanks @srowen  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14824: [ML][MLLIB]The require condition and message does...

2016-08-26 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14824#discussion_r76417535
  
--- Diff: 
mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala ---
@@ -455,9 +455,11 @@ class SparseMatrix @Since("2.0.0") (
   require(values.length == rowIndices.length, "The number of row indices 
and values don't match! " +
 s"values.length: ${values.length}, rowIndices.length: 
${rowIndices.length}")
   // The Or statement is for the case when the matrix is transposed
-  require(colPtrs.length == numCols + 1 || colPtrs.length == numRows + 1, 
"The length of the " +
-"column indices should be the number of columns + 1. Currently, 
colPointers.length: " +
-s"${colPtrs.length}, numCols: $numCols")
+  require(!isTransposed && colPtrs.length == numCols + 1 ||
--- End diff --

Thanks @srowen , code is updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14824: [ML][MLLIB]The require condition and message does...

2016-08-26 Thread mpjlu

GitHub user mpjlu opened a pull request:

https://github.com/apache/spark/pull/14824

[ML][MLLIB]The require condition and message doesn't match in SparseMatrix. 
 

## What changes were proposed in this pull request?
The require condition and message doesn't match, and the condition also 
should be optimized. 
Small change.  Please kindly let me know if JIRA required. 


## How was this patch tested?
No additional test required.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mpjlu/spark smallChangeForMatrixRequire

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14824.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14824


commit a51c80614006956c13e3a8d5b6a424cc68c35392
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-26T04:52:05Z

Change require condition for Matrix contructor

commit e5af18f83ae628104989ed71688a0ed42318a46b
Author: Peng, Meng <peng.m...@intel.com>
Date:   2016-08-26T09:46:08Z

use interpolation in a string




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-22 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75802754
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -171,14 +180,47 @@ object ChiSqSelectorModel extends 
Loader[ChiSqSelectorModel] {
 
 /**
  * Creates a ChiSquared feature selector.
- * @param numTopFeatures number of features that selector will select
- *   (ordered by statistic value descending)
- *   Note that if the number of features is less than 
numTopFeatures,
- *   then this will select all features.
  */
-@Since("1.3.0")
-class ChiSqSelector @Since("1.3.0") (
-  @Since("1.3.0") val numTopFeatures: Int) extends Serializable {
+@Since("2.1.0")
+class ChiSqSelector @Since("2.1.0") () extends Serializable {
+  private var numTopFeatures: Int = 50
+  private var percentile: Double = 0.1
+  private var alpha: Double = 0.05
+  private var selectorType = ChiSqSelectorType.KBest
+
+  @Since("1.3.0")
+  def this(numTopFeatures: Int) {
+this()
+this.numTopFeatures = numTopFeatures
+  }
+
+  @Since("2.1.0")
+  def setNumTopFeatures(value: Int): this.type = {
+numTopFeatures = value
+selectorType = ChiSqSelectorType.KBest
+this
+  }
+
+  @Since("2.1.0")
+  def setPercentile(value: Double): this.type = {
+require(value <= 1 && value >= 0, "Percentile should be larger than 0 
and less than 100")
--- End diff --

Thanks, the code is updated


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-22 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75802771
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -171,14 +180,47 @@ object ChiSqSelectorModel extends 
Loader[ChiSqSelectorModel] {
 
 /**
  * Creates a ChiSquared feature selector.
- * @param numTopFeatures number of features that selector will select
- *   (ordered by statistic value descending)
- *   Note that if the number of features is less than 
numTopFeatures,
- *   then this will select all features.
  */
-@Since("1.3.0")
-class ChiSqSelector @Since("1.3.0") (
-  @Since("1.3.0") val numTopFeatures: Int) extends Serializable {
+@Since("2.1.0")
+class ChiSqSelector @Since("2.1.0") () extends Serializable {
+  private var numTopFeatures: Int = 50
+  private var percentile: Double = 0.1
+  private var alpha: Double = 0.05
+  private var selectorType = ChiSqSelectorType.KBest
+
+  @Since("1.3.0")
+  def this(numTopFeatures: Int) {
+this()
+this.numTopFeatures = numTopFeatures
+  }
+
+  @Since("2.1.0")
+  def setNumTopFeatures(value: Int): this.type = {
+numTopFeatures = value
+selectorType = ChiSqSelectorType.KBest
+this
+  }
+
+  @Since("2.1.0")
+  def setPercentile(value: Double): this.type = {
+require(value <= 1 && value >= 0, "Percentile should be larger than 0 
and less than 100")
+percentile = value
+selectorType = ChiSqSelectorType.Percentile
+this
+  }
+
+  @Since("2.1.0")
+  def setAlpha(value: Double): this.type = {
--- End diff --

require is added, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-08-29 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Hi @srowen , I have added Python API and test cases for ChiSqSelector. 
Could you kindly review it again. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-29 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r76624379
  
--- Diff: python/pyspark/mllib/feature.py ---
@@ -276,24 +276,64 @@ class ChiSqSelector(object):
 """
 Creates a ChiSquared feature selector.
 
-:param numTopFeatures: number of features that selector will select.
-
 >>> data = [
 ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
 ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
 ... LabeledPoint(1.0, [0.0, 9.0, 8.0]),
 ... LabeledPoint(2.0, [8.0, 9.0, 5.0])
 ... ]
->>> model = ChiSqSelector(1).fit(sc.parallelize(data))
+>>> model = 
ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data))
+>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
+SparseVector(1, {0: 6.0})
+>>> model.transform(DenseVector([8.0, 9.0, 5.0]))
+DenseVector([5.0])
+>>> model = 
ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data))
 >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
 SparseVector(1, {0: 6.0})
 >>> model.transform(DenseVector([8.0, 9.0, 5.0]))
 DenseVector([5.0])
+>>> data = [
+... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})),
+... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})),
+... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]),
+... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0])
+... ]
+>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data))
+>>> model.transform(DenseVector([1.0,2.0,3.0,4.0]))
+DenseVector([4.0])
 
 .. versionadded:: 1.4.0
 """
-def __init__(self, numTopFeatures):
-self.numTopFeatures = int(numTopFeatures)
+def __init__(self):
+self.param = 50
--- End diff --

Use three variables are clear. It needs another selectionType variable also.
In the fit function, according to the selection type, call different 
functions. If you prefer that, I will change the code. I am ok for both 
methods.  Thanks. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75851527
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -171,14 +180,48 @@ object ChiSqSelectorModel extends 
Loader[ChiSqSelectorModel] {
 
 /**
  * Creates a ChiSquared feature selector.
- * @param numTopFeatures number of features that selector will select
- *   (ordered by statistic value descending)
- *   Note that if the number of features is less than 
numTopFeatures,
- *   then this will select all features.
  */
-@Since("1.3.0")
-class ChiSqSelector @Since("1.3.0") (
-  @Since("1.3.0") val numTopFeatures: Int) extends Serializable {
+@Since("2.1.0")
+class ChiSqSelector @Since("2.1.0") () extends Serializable {
+  private var numTopFeatures: Int = 50
+  private var percentile: Double = 0.1
+  private var alpha: Double = 0.05
+  private var selectorType = ChiSqSelectorType.KBest
+
+  @Since("1.3.0")
+  def this(numTopFeatures: Int) {
+this()
+this.numTopFeatures = numTopFeatures
+  }
+
+  @Since("2.1.0")
+  def setNumTopFeatures(value: Int): this.type = {
+numTopFeatures = value
+selectorType = ChiSqSelectorType.KBest
+this
+  }
+
+  @Since("2.1.0")
+  def setPercentile(value: Double): this.type = {
+require(value <= 1 && value >= 0, "Percentile should be larger than 0 
and less than 1")
--- End diff --

thanks, will update


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75851763
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+val chiSqTestResult = Statistics.chiSqTest(data)
+val features = selectorType match {
+  case ChiSqSelectorType.KBest => chiSqTestResult
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
--- End diff --

Fpr doesn't need sort here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75852138
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -91,8 +137,17 @@ final class ChiSqSelector @Since("1.6.0") 
(@Since("1.6.0") override val uid: Str
 case Row(label: Double, features: Vector) =>
   OldLabeledPoint(label, OldVectors.fromML(features))
   }
-val chiSqSelector = new 
feature.ChiSqSelector($(numTopFeatures)).fit(input)
-copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this))
+var model = ChiSqSelectorType.withName($(selectorType)) match {
--- End diff --

If avoid duplicating  the "new feature.ChiSqSelector()" and "fit(input)" 
here, the usage will be not consistent with other method, as we did before


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75851296
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -69,21 +73,26 @@ class ChiSqSelectorModel @Since("1.3.0") (
* Preserves the order of filtered features the same as their indices 
are stored.
* Might be moved to Vector as .slice
* @param features vector
-   * @param filterIndices indices of features to filter, must be ordered 
asc
+   * @param filterIndices indices of features to filter
*/
   private def compress(features: Vector, filterIndices: Array[Int]): 
Vector = {
+val orderedIndices = if (isSorted(filterIndices)) {
--- End diff --

ok, I agree with you. thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75856793
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -91,8 +137,17 @@ final class ChiSqSelector @Since("1.6.0") 
(@Since("1.6.0") override val uid: Str
 case Row(label: Double, features: Vector) =>
   OldLabeledPoint(label, OldVectors.fromML(features))
   }
-val chiSqSelector = new 
feature.ChiSqSelector($(numTopFeatures)).fit(input)
-copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this))
+var model = ChiSqSelectorType.withName($(selectorType)) match {
--- End diff --

Sorry, I misunderstand your last comment.  Just think about reuse selector. 
This will not be inconsistent with other methods. I will update the code, 
thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-23 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r75858450
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -91,8 +137,17 @@ final class ChiSqSelector @Since("1.6.0") 
(@Since("1.6.0") override val uid: Str
 case Row(label: Double, features: Vector) =>
   OldLabeledPoint(label, OldVectors.fromML(features))
   }
-val chiSqSelector = new 
feature.ChiSqSelector($(numTopFeatures)).fit(input)
-copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this))
+var model = ChiSqSelectorType.withName($(selectorType)) match {
--- End diff --

Hi @srowen , I have updated the code. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-24 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r76065118
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+val chiSqTestResult = Statistics.chiSqTest(data)
+val features = selectorType match {
+  case ChiSqSelectorType.KBest => chiSqTestResult
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
--- End diff --

Thanks, got you. Will update the code


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-24 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r76068163
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+val chiSqTestResult = Statistics.chiSqTest(data)
+val features = selectorType match {
+  case ChiSqSelectorType.KBest => chiSqTestResult
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
--- End diff --

Hi @srowen , code is updated. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...

2016-08-24 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/14597
  
Sure, I can update the Python API. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-24 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r76059098
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+val chiSqTestResult = Statistics.chiSqTest(data)
+val features = selectorType match {
+  case ChiSqSelectorType.KBest => chiSqTestResult
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
--- End diff --

I will update Fpr to FPR, thanks very much. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...

2016-08-24 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/14597#discussion_r76059026
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") (
*/
   @Since("1.3.0")
   def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = {
-val indices = Statistics.chiSqTest(data)
-  .zipWithIndex.sortBy { case (res, _) => -res.statistic }
-  .take(numTopFeatures)
-  .map { case (_, indices) => indices }
-  .sorted
+val chiSqTestResult = Statistics.chiSqTest(data)
+val features = selectorType match {
+  case ChiSqSelectorType.KBest => chiSqTestResult
+.zipWithIndex.sortBy { case (res, _) => -res.statistic }
--- End diff --

Hi @srowen , do you mean we should move the sort from the model to here?
There are two sort here. 
One is sort the indices, which we have been moved from here to the model. 
The second is sort the features, because the KBest, and Percentile is 
selection by the order of  res.statistic, so we sort the features here.  FPR is 
selection by res.p-value > alpha, so don't need sort here.  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector method...

2016-09-27 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15212
  
hi @srowen @yanboliang , I have updated this PR. Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15277: [SPARK-17704][ML][MLlib] ChiSqSelector performance impro...

2016-09-28 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15277
  
sort in the transform will cause sort too many times.
so this looks good to me.  thangks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...

2016-09-25 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15214
  
hi @srowen .
My understand of yanbo's comments here is,
if user use chSqSelector like this:
model1 = new ChiSqSelector().setFPR(0.05).setKBest(100).fit(data)
model2 = new ChiSqSelector().setKBest(100).setFPR(0.05).fit(data)
model1 will be different with model2. so the model is dependent on the 
order of users setting params.
Actually, user should not use ChiSqSelector like this. One just need to set 
one SelectorType/Parameter is ok.  But if one don't know ChiSqSelector, he may 
do like this. So yanbo think this is a problem. 

In this PR, setFPR(0.05) is split to two functions: 
setSelectorType("fpr").setAlpha(0.05). This maybe clear to the user.  
By the principle of software development: one function do one thing, I am 
ok with this change.  
But from user experience,  I like the spark-17017 method.  



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...

2016-09-25 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15214
  
Thanks, this looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...

2016-09-25 Thread mpjlu

Github user mpjlu commented on the issue:

https://github.com/apache/spark/pull/15214
  
Hi @srowen , sorry for forgetting update the doc and python/ml/feature.py 
in last PR.
This pr has added ml/feature.py. It looks good to me.
Thanks
  


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups

2016-10-27 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15647#discussion_r85311619
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -44,67 +44,78 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   /**
* Number of features that selector will select (ordered by statistic 
value descending). If the
* number of features is less than numTopFeatures, then this will select 
all features.
-   * Only applicable when selectorType = "kbest".
+   * Only applicable when selectorType = "numTopFeatures".
* The default value of numTopFeatures is 50.
*
* @group param
*/
+  @Since("1.6.0")
   final val numTopFeatures = new IntParam(this, "numTopFeatures",
 "Number of features that selector will select, ordered by statistics 
value descending. If the" +
--- End diff --

ordered by pValue ascending.  I missed this change in my PR. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups

2016-10-27 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15647#discussion_r85311677
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -44,67 +44,78 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   /**
* Number of features that selector will select (ordered by statistic 
value descending). If the
* number of features is less than numTopFeatures, then this will select 
all features.
-   * Only applicable when selectorType = "kbest".
+   * Only applicable when selectorType = "numTopFeatures".
* The default value of numTopFeatures is 50.
*
* @group param
*/
+  @Since("1.6.0")
   final val numTopFeatures = new IntParam(this, "numTopFeatures",
 "Number of features that selector will select, ordered by statistics 
value descending. If the" +
   " number of features is < numTopFeatures, then this will select all 
features.",
 ParamValidators.gtEq(1))
   setDefault(numTopFeatures -> 50)
 
   /** @group getParam */
+  @Since("1.6.0")
   def getNumTopFeatures: Int = $(numTopFeatures)
 
   /**
* Percentile of features that selector will select, ordered by 
statistics value descending.
* Only applicable when selectorType = "percentile".
* Default value is 0.1.
+   * @group param
*/
+  @Since("2.1.0")
   final val percentile = new DoubleParam(this, "percentile",
 "Percentile of features that selector will select, ordered by 
statistics value descending.",
--- End diff --

ordered by pValue ascending. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups

2016-10-27 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15647#discussion_r85310898
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -227,22 +227,19 @@ both speed and statistical learning behavior.
 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 implements
 Chi-Squared feature selection. It operates on labeled data with 
categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `KBest`, 
`Percentile` and `FPR`:
+features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
 
-* `KBest` chooses the `k` top features according to a chi-squared test. 
This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all 
features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some 
threshold.
+* `numTopFeatures` chooses the `k` top features according to a chi-squared 
test. This is akin to yielding the features with the most predictive power.
+* `percentile` is similar to `numTopFeatures` but chooses a fraction of 
all features instead of a fixed number.
+* `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
 
-By default, the selection method is `KBest`, the default number of top 
features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different 
selection methods.
+By default, the selection method is `numTopFeatures`, with the default 
number of top features set to 50. User can use
+`setNumTopFeatures`, `setPercentile`, `fpr` to set different selection 
methods.
--- End diff --

ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups

2016-10-27 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15647#discussion_r85310797
  
--- Diff: docs/ml-features.md ---
@@ -1333,14 +1333,14 @@ for more details on the API.
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on 
labeled data with
 categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `KBest`, 
`Percentile` and `FPR`:
+features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
 
-* `KBest` chooses the `k` top features according to a chi-squared test. 
This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all 
features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some 
threshold.
+* `numTopFeatures` chooses the `k` top features according to a chi-squared 
test. This is akin to yielding the features with the most predictive power.
+* `percentile` is similar to `numTopFeatures` but chooses a fraction of 
all features instead of a fixed number.
+* `fpr` chooses all features whose p-value is below a threshold, thus 
controlling the false positive rate of selection.
 
-By default, the selection method is `KBest`, the default number of top 
features is 50. User can use
-`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different 
selection methods.
+By default, the selection method is `numTopFeatures`, with the default 
number of top features set to 50. User can use
+`setNumTopFeatures`, `setPercentile` and `setFpr` to set different 
selection methods.
--- End diff --

setNumTopFeatures, setPercentile, setFpr just set parameter, 
setSelectorType is used to set selection methods.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups

2016-10-27 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15647#discussion_r85310862
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -227,22 +227,19 @@ both speed and statistical learning behavior.
 
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 implements
 Chi-Squared feature selection. It operates on labeled data with 
categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `KBest`, 
`Percentile` and `FPR`:
+features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
 
-* `KBest` chooses the `k` top features according to a chi-squared test. 
This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all 
features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some 
threshold.
+* `numTopFeatures` chooses the `k` top features according to a chi-squared 
test. This is akin to yielding the features with the most predictive power.
--- End diff --

ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups

2016-10-27 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15647#discussion_r85311930
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala ---
@@ -171,18 +171,19 @@ object ChiSqSelectorModel extends 
Loader[ChiSqSelectorModel] {
 
 /**
  * Creates a ChiSquared feature selector.
- * The selector supports three selection methods: `kbest`, `percentile` 
and `fpr`.
- * `kbest` chooses the `k` top features according to a chi-squared test.
- * `percentile` is similar but chooses a fraction of all features instead 
of a fixed number.
- * `fpr` chooses all features whose false positive rate meets some 
threshold.
- * By default, the selection method is `kbest`, the default number of top 
features is 50.
+ * The selector supports different selection methods: `numTopFeatures`, 
`percentile`, `fpr`.
+ *  - `numTopFeatures` chooses the `k` top features according to a 
chi-squared test.
--- End diff --

ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups

2016-10-27 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15647#discussion_r85310501
  
--- Diff: docs/ml-features.md ---
@@ -1333,14 +1333,14 @@ for more details on the API.
 `ChiSqSelector` stands for Chi-Squared feature selection. It operates on 
labeled data with
 categorical features. ChiSqSelector uses the
 [Chi-Squared test of 
independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
-features to choose. It supports three selection methods: `KBest`, 
`Percentile` and `FPR`:
+features to choose. It supports three selection methods: `numTopFeatures`, 
`percentile`, `fpr`:
 
-* `KBest` chooses the `k` top features according to a chi-squared test. 
This is akin to yielding the features with the most predictive power.
-* `Percentile` is similar to `KBest` but chooses a fraction of all 
features instead of a fixed number.
-* `FPR` chooses all features whose false positive rate meets some 
threshold.
+* `numTopFeatures` chooses the `k` top features according to a chi-squared 
test. This is akin to yielding the features with the most predictive power.
--- End diff --

Should use k here since KBest is changed to numTopFeatures?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...

2016-10-19 Thread mpjlu

Github user mpjlu commented on a diff in the pull request:

https://github.com/apache/spark/pull/15212#discussion_r84049805
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala ---
@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends 
Params
   def getPercentile: Double = $(percentile)
 
   /**
-   * The highest p-value for features to be kept.
-   * Only applicable when selectorType = "fpr".
+   * alpha means the highest p-value for features to be kept when select 
type is "fpr".
+   * alpha means the highest uncorrected p-value for features to be kept 
when select type
--- End diff --

updated, thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

1 2 3 >

1 - 100 of 270 matches

Mail list logo