[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75638966 Oops, did not realize that a test was still running (glad it passed) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75614718

[Test build #27857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull) for PR 4709 at commit [`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features. `
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75599168

[Test build #27857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27857/consoleFull) for PR 4709 at commit [`58d9e4d`](https://github.com/apache/spark/commit/58d9e4d0dd4c03399cafd487f6391b1c560e82d8).
* This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4709#discussion_r25188678

--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,55 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+[Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors.
+
+### ChiSqSelector
+ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.
+
+ Model Fitting
+
+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
+following parameters in the constructor:
+
+* `numTopFeatures` number of top features that selector will select (filter).
+
+We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced feature space.
+
+This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
+
+ Example
+
+The following example shows the basic use of ChiSqSelector.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.util.MLUtils
+
+// load some data in libsvm format, each point is in the range 0..255
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// discretize data in 16 equal bins
+val discretizedData = data.map { lp =>
+  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
+}
+// create ChiSqSelector that will select 50 features
+val selector = new ChiSqSelector(50)
+// create ChiSqSelector model
+val transformer = selector.fit(disctetizedData)
+// filter top 50 features
+val filteredData = transformer.transform(discretizedData)
--- End diff --

Since transform() takes an RDD[Vector], you'll need to map the data to features, and then zip the transformed features with the labels.
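The suggestion above can be sketched roughly as follows. This is a minimal sketch, not the code that was merged: `selectTopFeatures` is a hypothetical helper name, and it assumes `data` already holds categorical (discretized) features. The zip is only safe here because both RDDs are derived from the same parent with `map`, so they keep the same partitioning and element counts.

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of MLlib): fit a ChiSqSelector and return the
// input points with their feature vectors reduced to the selected features.
def selectTopFeatures(data: RDD[LabeledPoint], numTopFeatures: Int): RDD[LabeledPoint] = {
  val model = new ChiSqSelector(numTopFeatures).fit(data)
  // The RDD version of transform() takes RDD[Vector], so split labels and features.
  val labels: RDD[Double] = data.map(_.label)
  val features: RDD[Vector] = data.map(_.features)
  // Both RDDs come from the same parent via map, so zip pairs them element by element.
  labels.zip(model.transform(features)).map { case (label, reduced) =>
    LabeledPoint(label, reduced)
  }
}
```

With the names from the diff, the last step of the example would then read `val filteredData = selectTopFeatures(discretizedData, 50)`.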
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75603592 I think that last issue is the only one--thanks!
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user avulanov commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75610561 Sorry for this, still sleeping...
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75611280

[Test build #27860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull) for PR 4709 at commit [`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).
* This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75614737 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27857/ Test PASSed.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75621063 LGTM Thanks for the updates!
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75621857 Merged into master and branch-1.3
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75626645

[Test build #27860 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27860/consoleFull) for PR 4709 at commit [`19a8a4e`](https://github.com/apache/spark/commit/19a8a4e9b8c3b5607c87fb1eae19810f90b9ad6a).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features. `
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/4709
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75626660 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27860/ Test PASSed.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75456675 @avulanov Thanks for the updates! Except for those 2 issues, I think this should be ready to go. (I'm testing doc compilation now.)
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4709#discussion_r25136231

--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,52 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+(Feature selection)[http://en.wikipedia.org/wiki/Feature_selection] allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors.
+
+### ChiSqSelector
+ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.
+
+ Model Fitting
+
+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
+following parameters in the constructor:
+
+* `numTopFeatures` number of top features that selector will select (filter).
+
+We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced feature space.
+
+This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
+
+ Example
+
+The following example shows the basic use of ChiSqSelector.
+
+<div class="codetabs">
+<div data-lang="scala">
+{% highlight scala %}
+import org.apache.spark.SparkContext._
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.util.MLUtils
+
+// load some data in libsvm format, each point is in the range 0..255
+val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
+// discretize data in 16 equal bins
+val discretizedData = data.map { lp =>
+  LabeledPoint(lp.label, Vectors.dense(lp.features.toArray.map { x => x / 16 } ) )
+}
+// create ChiSqSelector that will select 50 features
+val selector = new ChiSqSelector(50)
+// filter top 50 features
+val filteredData = selector.fit(disctetizedData)
--- End diff --

typo here too: disctetizedData

Also, selector.fit really returns a model, not the data. Would you mind changing filteredData to be labeled as a model and then using the model to do something (like print the selected feature indices)?
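One way to act on this request is sketched below. It is only an illustration under stated assumptions: `fitAndReport` is a hypothetical helper name, and the input is assumed to hold categorical (discretized) features. It names the result of `fit` as a model and uses the model's `selectedFeatures` array to print which feature indices were kept.

```scala
import org.apache.spark.mllib.feature.{ChiSqSelector, ChiSqSelectorModel}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of MLlib): fit the selector and report
// which feature indices the resulting model keeps.
def fitAndReport(data: RDD[LabeledPoint], numTopFeatures: Int): ChiSqSelectorModel = {
  // fit() returns a model, not transformed data.
  val model: ChiSqSelectorModel = new ChiSqSelector(numTopFeatures).fit(data)
  println("Selected feature indices: " + model.selectedFeatures.mkString(", "))
  model
}
```

The returned model can then be applied to individual vectors or to an `RDD[Vector]` through `transform`.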
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4709#discussion_r25136229

--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,52 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+(Feature selection)[http://en.wikipedia.org/wiki/Feature_selection] allows selecting the most relevant features for use in model construction. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors.
--- End diff --

Syntax for links: ```[link text](actual link)```
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75462021 The generated doc seems Ok except for the comments above.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
GitHub user avulanov opened a pull request: https://github.com/apache/spark/pull/4709

[MLLIB] SPARK-5912 Programming guide for feature selection

Added a description of ChiSqSelector and a few words about feature selection in general. I could add a code example; however, it would not look reasonable in the absence of a feature discretizer or a dataset in the `data` folder that has redundant features.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/avulanov/spark SPARK-5912

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4709.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4709

commit c845350afd91ec5e5e329989fc770da23d0c459d
Author: Alexander Ulanov na...@yandex.ru
Date: 2015-02-20T23:36:52Z

    ChiSqSelector docs
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75341961 I think it's better to have an example, even if it doesn't really do anything useful on the toy datasets which ship with Spark. We could add a hand-constructed dataset now or later on to improve the example.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75348988 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27799/ Test PASSed.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75340531

[Test build #27796 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27796/consoleFull) for PR 4709 at commit [`c845350`](https://github.com/apache/spark/commit/c845350afd91ec5e5e329989fc770da23d0c459d).
* This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75343738

[Test build #27799 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27799/consoleFull) for PR 4709 at commit [`eb6b9fe`](https://github.com/apache/spark/commit/eb6b9fe61126f3b75d4741bc2a978cd51fcc5ba9).
* This patch merges cleanly.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4709#discussion_r25113936

--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,28 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+Feature selection allows selecting relevant features for use in model construction leaving out the redundant ones. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors.
--- End diff --

Would you mind adding a link to Wikipedia? [http://en.wikipedia.org/wiki/Feature_selection]
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/4709#discussion_r25113939

--- Diff: docs/mllib-feature-extraction.md ---
@@ -375,3 +375,28 @@ data2 = labels.zip(normalizer2.transform(features))
 {% endhighlight %}
 </div>
 </div>
+
+## Feature selection
+Feature selection allows selecting relevant features for use in model construction leaving out the redundant ones. The number of features to select can be determined using the validation set. Feature selection is usually applied on sparse data, for example in text classification. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors.
+
+### ChiSqSelector
+ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features.
+
+ Model Fitting
+
+[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the
+following parameters in the constructor:
+
+* `numTopFeatures` number of top features that selector will select (filter).
+
+We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in
+`ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then
+return a model which can transform the input dataset into the reduced feature space.
+
+This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on
+an `RDD[Vector]` to produce a reduced `RDD[Vector]`.
+
+Note that the model that performs actual feature filtering can be instantiated independently with array of feature indices that has to be sorted ascending.
+</div>
--- End diff --

Extraneous div tags
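The last note in the quoted diff, about instantiating the filtering model on its own, could be illustrated along these lines. This is a hedged sketch rather than text from the pull request: it assumes the model class is `ChiSqSelectorModel` with a public constructor taking the index array, and the indices and the sample vector are made up for illustration.

```scala
import org.apache.spark.mllib.feature.ChiSqSelectorModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Build the model directly from feature indices (illustrative values),
// given in ascending order as the note requires.
val model = new ChiSqSelectorModel(Array(0, 2))

// Applying the model to a single vector keeps only the values at indices 0 and 2.
val reduced: Vector = model.transform(Vectors.dense(1.0, 2.0, 3.0))
println(reduced)
```

No `fit` call or training data is involved here, which is the point of the note: the model is just a filter over a fixed set of indices.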
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75346574 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27796/ Test PASSed.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75346567

[Test build #27796 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27796/consoleFull) for PR 4709 at commit [`c845350`](https://github.com/apache/spark/commit/c845350afd91ec5e5e329989fc770da23d0c459d).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [MLLIB] SPARK-5912 Programming guide for featu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/4709#issuecomment-75348983

[Test build #27799 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27799/consoleFull) for PR 4709 at commit [`eb6b9fe`](https://github.com/apache/spark/commit/eb6b9fe61126f3b75d4741bc2a978cd51fcc5ba9).
* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `ChiSqSelector stands for Chi-Squared feature selection. It operates on the labeled data. ChiSqSelector orders categorical features based on their values of Chi-Squared test on independence from class and filters (selects) top given features. `