[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75638966
  
Oops, did not realize that a test was still running (glad it passed)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75614718
  
  [Test build #27857 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull)
 for   PR 4709 at commit 
[`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `ChiSqSelector stands for Chi-Squared feature selection. It operates on 
the labeled data. ChiSqSelector orders categorical features based on their 
values of Chi-Squared test on independence from class and filters (selects) top 
given features.  `






[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75599168
  
  [Test build #27857 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull)
 for   PR 4709 at commit 
[`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4709#discussion_r25188678
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,55 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows 
selecting the most relevant features for use in model construction. The number 
of features to select can be determined using the validation set. Feature 
selection is usually applied on sparse data, for example in text 
classification. Feature selection reduces the size of the vector space and, in 
turn, the complexity of any subsequent operation with vectors. 
+
+### ChiSqSelector
+ChiSqSelector stands for Chi-Squared feature selection. It operates on the 
labeled data. ChiSqSelector orders categorical features based on their values 
of Chi-Squared test on independence from class and filters (selects) top given 
features.  
+
+ Model Fitting
+

+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 has the
+following parameters in the constructor:
+
+* `numTopFeatures` number of top features that selector will select 
(filter).
+
+We provide a 
[`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) 
method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with 
categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced 
feature space.
+
+This model implements 
[`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce 
a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs actual feature filtering can be 
instantiated independently with array of feature indices that has to be sorted 
ascending.
+
+ Example
+
+The following example shows the basic use of ChiSqSelector.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+
+// load some data in libsvm format, each point is in the range 0..255
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// discretize data in 16 equal bins
+val discretizedData = data.map { lp =>
+  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
+}
+// create ChiSqSelector that will select 50 features
+val selector = new ChiSqSelector(50)
+// create ChiSqSelector model
+val transformer = selector.fit(disctetizedData)
+// filter top 50 features
+val filteredData = transformer.transform(discretizedData)
--- End diff --

Since transform() takes an RDD[Vector], you'll need to map the data to 
features, and then zip the transformed features with the labels.
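
The suggestion above could look roughly like the following. This is a minimal sketch, not the code that was eventually merged. It assumes a `spark-shell` session with `sc` in scope and the Spark 1.3 MLlib API; the toy dataset and its values are hypothetical and stand in for the libsvm file used in the diff.

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical toy data: three categorical features per labeled point.
val discretizedData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 1.0, 0.0))
))

// fit() takes RDD[LabeledPoint] and returns a model.
val selector = new ChiSqSelector(2)
val transformer = selector.fit(discretizedData)

// transform() works on RDD[Vector], so split labels and features first.
val labels = discretizedData.map(_.label)
val features = discretizedData.map(_.features)

// Zip the reduced feature vectors back with their labels.
val filteredData = labels.zip(transformer.transform(features)).map {
  case (label, reduced) => LabeledPoint(label, reduced)
}
```

Zipping is safe here because `labels` and the transformed features are both element-wise maps of the same parent RDD, so they keep the same partitioning and element counts.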





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75603592
  
I think that last issue is the only one--thanks!





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread avulanov
Github user avulanov commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75610561
  
Sorry for this, still sleeping...





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75611280
  
  [Test build #27860 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull)
 for   PR 4709 at commit 
[`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75614737
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27857/
Test PASSed.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75621063
  
LGTM  Thanks for the updates!





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75621857
  
Merged into master and branch-1.3





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75626645
  
  [Test build #27860 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull)
 for   PR 4709 at commit 
[`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `ChiSqSelector stands for Chi-Squared feature selection. It operates on 
the labeled data. ChiSqSelector orders categorical features based on their 
values of Chi-Squared test on independence from class and filters (selects) top 
given features.  `






[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/4709





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75626660
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27860/
Test PASSed.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-22 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75456675
  
@avulanov Thanks for the updates!  Except for those 2 issues, I think this 
should be ready to go.  (I'm testing doc compilation now.)





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4709#discussion_r25136231
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,52 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+(Feature selection)[http://en.wikipedia.org/wiki/Feature_selection] allows 
selecting the most relevant features for use in model construction. The number 
of features to select can be determined using the validation set. Feature 
selection is usually applied on sparse data, for example in text 
classification. Feature selection reduces the size of the vector space and, in 
turn, the complexity of any subsequent operation with vectors. 
+
+### ChiSqSelector
+ChiSqSelector stands for Chi-Squared feature selection. It operates on the 
labeled data. ChiSqSelector orders categorical features based on their values 
of Chi-Squared test on independence from class and filters (selects) top given 
features.  
+
+ Model Fitting
+

+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 has the
+following parameters in the constructor:
+
+* `numTopFeatures` number of top features that selector will select 
(filter).
+
+We provide a 
[`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) 
method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with 
categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced 
feature space.
+
+This model implements 
[`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce 
a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs actual feature filtering can be 
instantiated independently with array of feature indices that has to be sorted 
ascending.
+
+ Example
+
+The following example shows the basic use of ChiSqSelector.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLUtils
+
+// load some data in libsvm format, each point is in the range 0..255
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// discretize data in 16 equal bins
+val discretizedData = data.map { lp =>
+  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
+}
+// create ChiSqSelector that will select 50 features
+val selector = new ChiSqSelector(50)
+// filter top 50 features
+val filteredData = selector.fit(disctetizedData)
--- End diff --

typo here too: disctetizedData

Also, selector.fit really returns a model, not the data.  Would you mind 
changing filteredData to be labeled as a model and then using the model to do 
something (like print the selected feature indices)?
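
One way to act on this request is sketched below. It is an illustration only, assuming a `spark-shell` session with `sc` in scope and the Spark 1.3 MLlib API, in which the fitted `ChiSqSelectorModel` exposes its `selectedFeatures` array. The toy dataset is hypothetical.

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical toy data: three categorical features per labeled point.
val discretizedData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 1.0, 0.0))
))

// fit() learns the chi-squared statistics and returns a model, not data.
val model = new ChiSqSelector(2).fit(discretizedData)

// The model holds the indices of the selected features, sorted ascending.
println(model.selectedFeatures.mkString(", "))
```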





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-22 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4709#discussion_r25136229
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,52 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+(Feature selection)[http://en.wikipedia.org/wiki/Feature_selection] allows 
selecting the most relevant features for use in model construction. The number 
of features to select can be determined using the validation set. Feature 
selection is usually applied on sparse data, for example in text 
classification. Feature selection reduces the size of the vector space and, in 
turn, the complexity of any subsequent operation with vectors. 
--- End diff --

Syntax for links: ```[link text](actual link)```






[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-22 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75462021
  
The generated doc seems Ok except for the comments above.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread avulanov
GitHub user avulanov opened a pull request:

https://github.com/apache/spark/pull/4709

[MLLIB] SPARK-5912 Programming guide for feature selection

Added description of ChiSqSelector and few words about feature selection in 
general. I could add a code example, however it would not look reasonable in 
the absence of feature discretizer or a dataset in the `data` folder that has 
redundant features.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/avulanov/spark SPARK-5912

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4709.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4709


commit c845350afd91ec5e5e329989fc770da23d0c459d
Author: Alexander Ulanov na...@yandex.ru
Date:   2015-02-20T23:36:52Z

ChiSqSelector docs







[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75341961
  
I think it's better to have an example, even if it doesn't really do 
anything useful on the toy datasets which ship with Spark.  We could add a 
hand-constructed dataset now or later on to improve the example.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75348988
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27799/
Test PASSed.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75340531
  
  [Test build #27796 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27796/consoleFull)
 for   PR 4709 at commit 
[`c845350`](https://github.com/apache/spark/commit/c845350afd91ec5e5e329989fc770da23d0c459d).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75343738
  
  [Test build #27799 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27799/consoleFull)
 for   PR 4709 at commit 
[`eb6b9fe`](https://github.com/apache/spark/commit/eb6b9fe61126f3b75d4741bc2a978cd51fcc5ba9).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4709#discussion_r25113936
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,28 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+Feature selection allows selecting relevant features for use in model 
construction leaving out the redundant ones. The number of features to select 
can be determined using the validation set. Feature selection is usually 
applied on sparse data, for example in text classification. Feature selection 
reduces the size of the vector space and, in turn, the complexity of any 
subsequent operation with vectors. 
--- End diff --

Would you mind adding a link to Wikipedia? 
[http://en.wikipedia.org/wiki/Feature_selection]





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4709#discussion_r25113939
  
--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,28 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+Feature selection allows selecting relevant features for use in model 
construction leaving out the redundant ones. The number of features to select 
can be determined using the validation set. Feature selection is usually 
applied on sparse data, for example in text classification. Feature selection 
reduces the size of the vector space and, in turn, the complexity of any 
subsequent operation with vectors. 
+
+### ChiSqSelector
+ChiSqSelector stands for Chi-Squared feature selection. It operates on the 
labeled data. ChiSqSelector orders categorical features based on their values 
of Chi-Squared test on independence from class and filters (selects) top given 
features.  
+
+ Model Fitting
+

+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
 has the
+following parameters in the constructor:
+
+* `numTopFeatures` number of top features that selector will select 
(filter).
+
+We provide a 
[`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) 
method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with 
categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced 
feature space.
+
+This model implements 
[`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce 
a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs actual feature filtering can be 
instantiated independently with array of feature indices that has to be sorted 
ascending.
+</div>
--- End diff --

Extraneous div tags





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75346574
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27796/
Test PASSed.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75346567
  
  [Test build #27796 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27796/consoleFull)
 for   PR 4709 at commit 
[`c845350`](https://github.com/apache/spark/commit/c845350afd91ec5e5e329989fc770da23d0c459d).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...

2015-02-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4709#issuecomment-75348983
  
  [Test build #27799 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27799/consoleFull)
 for   PR 4709 at commit 
[`eb6b9fe`](https://github.com/apache/spark/commit/eb6b9fe61126f3b75d4741bc2a978cd51fcc5ba9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `ChiSqSelector stands for Chi-Squared feature selection. It operates on 
the labeled data. ChiSqSelector orders categorical features based on their 
values of Chi-Squared test on independence from class and filters (selects) top 
given features.  `


