[GitHub] spark issue #14449: [SPARK-16843][MLLIB] add the percentage ChiSquareSelecto...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14449 Hi @srowen, thanks for your comment. I agree for your comment, user can get the number of features without percentage method. For the user experience, sometimes the percentage method seems better. In scikit-learn, there are both SelectKBest and SelectPercentile APIs: http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14449: [SPARK-16843][MLLIB] add the percentage ChiSquare...
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/14449 [SPARK-16843][MLLIB] add the percentage ChiSquareSelector feature ## What changes were proposed in this pull request? add the percentage ChiSquareSelector feature ## How was this patch tested? add scala ut You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark chisquare2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14449.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14449 commit fb3c9a93e4b1b20f6738a3b56d8fb0604fbbb59e Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-01T05:00:16Z add the percentage ChiSquareSelector feature --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74404348 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- The only problem here is how to the reuse the result of "Statistics.chiSqTest(data)", so why not add a var to save the result. we don't need to change the model related code. Just need to add two lines of code to check which selection method is used. No API needs to change. We need to add API to set parameters about other selection methods, like select by FPR. How do you think about it. @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74412725 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- Two or more models based on the same chi-squared test is reasonable, because the chi-squared test results contain "pValue, degreesOfFreedom, statistic,,,". numTopFeatures selection method uses statistic, and FPR selection method uses pValue. Expose the p-value to the caller maybe useful, but we can expose ChiSqTestResult, not only p-value. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74400235 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- Yes, great proposal. We can do that. But the chiSquareSelector should be changed a lot to implement your proposal, I think it is necessary. I will submit a new PR based on my understanding of your proposal, how do you think about it? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74416260 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- Do you mean we can put some selection logic or more selection results not only indices to the model (ChiSqSelectorModel) ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14449: [SPARK-16843][MLLIB] add the percentage ChiSquareSelecto...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14449 Hi @srowen , I also plan to submit some PR about feature selection methods based on univariate statistical test, like the methods in scikit-learn: SelectFpr (using false positive rate), SelectFdr ( using false discovery rate), and SelectFwe (family wise error ). http://scikit-learn.org/dev/modules/feature_selection.html Now, the ChiSqSelector in Spark only can select the topNumFeatures. Do you think Spark should support other feature selection methods? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB] add a chiSquare Selector based on F...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Hi @avulanov . In general, FPR feature selection should not modify the code of existing ChiSqSelector, as we have implemented in this PR. But if we need to reuse the ChiSqTestResult (Statistics.chiSqTest(data)), it is better to modify the code of ChiSqSelector. In Scikit-learn, for each SelectKBest, SelectFpr, SelectPercentile and so on, create an object for it, as we implemented in this PR. The good point of this method is it is consistent across the LIB, all use the same Estimator/Model style. The disadvantage is it cannot reuse the results of score function. @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74721128 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- Hi @srowen , if we configure the model instance with different parameter to perform different types of selection, will that be inconsistent with the MLlib Estimator/Model style, and cause the user confused? If it is not a problem, I will submit a new PR to configure the model to perform different selection. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74740698 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- Thanks, this makes sense. It is better to add the parameter controlling what type of feature selection is done to ChiSqSelector. But my idea is the the selection is done in "fit" function to be consistent with other feature selection methods, we don't need to pass what type of selection to the model. If we consider the user can get the p-value, maybe return the p-value from ChiSqSelector is ok?If so , we don't need to make any change to the model class. How do you think about it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74742525 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- ok, I will pass the p values to the model and expose to the caller, and let the model to sort the indices internally. I will add the parameter controlling what type of feature selection is done in "fit" function of ChiSqSelector. Is my understanding right? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74743026 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- ok, thanks. I will submit a new PR based on our discussion. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: Fpr chi square
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/14597 Fpr chi square ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? Add Scala ut You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark fprChiSquare Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14597.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14597 commit 2adebe8de3881509e510fc518c562d1141ccd0ef Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-10T05:40:18Z add a chiSquare Selector based on False Positive Rate (FPR) test commit 04053ca207ef4aa955eddc3e65d09a4e03db6292 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-11T07:10:43Z Merge remote-tracking branch 'origin/master' into fprChiSquare --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74396249 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. --- End diff -- Hi @srowen , http://blog.datadive.net/selecting-good-features-part-i-univariate-selection/, this is a link to introduce p-value for feature selection. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r74397115 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -197,3 +197,28 @@ class ChiSqSelector @Since("1.3.0") ( new ChiSqSelectorModel(indices) } } + +/** + * Creates a ChiSquared feature selector by False Positive Rate (FPR) test. + * @param alpha the highest p-value for features to be kept + */ +@Since("2.1.0") +class ChiSqSelectorByFpr @Since("2.1.0") ( + @Since("2.1.0") val alpha: Double) extends Serializable { + + /** + * Returns a ChiSquared feature selector by FPR. + * + * @param data an `RDD[LabeledPoint]` containing the labeled dataset with categorical features. + * Real-valued features will be treated as categorical for each distinct value. + * Apply feature discretizer before using this function. + */ + @Since("2.1.0") + def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { +val indices = Statistics.chiSqTest(data) + .zipWithIndex.filter { case (res, _) => res.pValue < alpha } --- End diff -- Here we don't need the features sorted by importance, because all the features are filtered by pvalue < alpha. But the ChiSqSelectorModel needs the features sorted by index. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB] add a chiSquare Selector based on F...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Hi, @srowen , I can modify the implementation in .ml to accommodate the new params. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
Github user mpjlu closed the pull request at: https://github.com/apache/spark/pull/14597 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75314112 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") +} + } + + @Since("2.1.0") + def selectKBest(value: Int): ChiSqSelectorModel = { --- End diff -- Thanks. I am suggesting how one selector can generate multi models by only fit one time, because we want to reuse the ChiSqTestResult, this result is generated in fit(DataFrame). So we want to only fit one time and can generate multi models. According to your proposal, suppose we want to generate KBest with defaults numTopFeateatures. we can do like this: val selector = new ChiSqSelector() (1) val model = selector.fit(dataframe) (2) then the user want to generate KBest model with only top 20 features. The user can do like this: selector.setNumToFeatures(20) //here is the same selector as (1) , and the model (2) now is with Top 20 features, because it can use the parameters of selector. This is my understanding of your proposal, is my understanding right? Thanks very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75610472 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") +} + } + + @Since("2.1.0") + def selectKBest(value: Int): ChiSqSelectorModel = { --- End diff -- Yes, I also think so. I will update this PR. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75613576 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") +} + } + + @Since("2.1.0") + def selectKBest(value: Int): ChiSqSelectorModel = { --- End diff -- Hi @srowen , I have updated this PR. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75634659 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -54,6 +54,29 @@ private[feature] trait ChiSqSelectorParams extends Params /** @group getParam */ def getNumTopFeatures: Int = $(numTopFeatures) + + final val percentile = new DoubleParam(this, "percentile", +"Percentile of features that selector will select, ordered by statistics value descending.", +ParamValidators.gtEq(0)) + setDefault(percentile -> 10) --- End diff -- ok, I will update it. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75636510 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +227,20 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +var indices = selectorType match { + case ChiSqSelectorType.KBest => Statistics.chiSqTest(data) +.zipWithIndex.sortBy { case (res, _) => -res.statistic } +.take(numTopFeatures) +.map { case (_, indices) => indices } + case ChiSqSelectorType.Percentile => Statistics.chiSqTest(data) +.zipWithIndex.sortBy { case (res, _) => -res.statistic } +.take((data.count() * percentile / 100).toInt) --- End diff -- Yes, I will update the code. thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75419052 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") +} + } + + @Since("2.1.0") + def selectKBest(value: Int): ChiSqSelectorModel = { --- End diff -- Hi @srowen , maybe I misunderstood you. I think make the model instance be parameterized is the same as current solution (make selector instance be parameterized). This only difference of your suggestion is moving the logic from selector to model. Both of the solution is to avoid computing ChiSqTestResult many times when one wanted to perform different chi squared selection. If we don't consider to avoid computing ChiSqTestResult many times, the problem will be very easy, and can be consistent exactly with other methods. Can we don't consider reuse ChiSqTestResult in this PR. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75661562 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +227,20 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +var indices = selectorType match { + case ChiSqSelectorType.KBest => Statistics.chiSqTest(data) +.zipWithIndex.sortBy { case (res, _) => -res.statistic } +.take(numTopFeatures) +.map { case (_, indices) => indices } + case ChiSqSelectorType.Percentile => Statistics.chiSqTest(data) +.zipWithIndex.sortBy { case (res, _) => -res.statistic } +.take((data.count() * percentile / 100).toInt) +.map { case (_, indices) => indices } + case ChiSqSelectorType.Fpr => Statistics.chiSqTest(data) +.zipWithIndex.filter{ case (res, _) => res.pValue < alpha } +.map { case (_, indices) => indices } + case _ => throw new IllegalStateException("Unknown ChiSqSelector Type") --- End diff -- Hi @srowen , I have updated the code per all you comments. Thanks very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in T...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14785#discussion_r76035442 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala --- @@ -154,7 +154,11 @@ object TestingUtils { */ def absTol(eps: Double): CompareVectorRightSide = CompareVectorRightSide( (x: Vector, y: Vector, eps: Double) => { -x.toArray.zip(y.toArray).forall(x => x._1 ~= x._2 absTol eps) +if (x.size != y.size) { --- End diff -- Yes, thanks. How about check matrix length by x.numRows == y.numRows && x.numCols == y.numCols && ... ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in T...
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/14785 [SPARK-17207][MLLIB]fix comparing Vector bug in TestingUtils ## What changes were proposed in this pull request? fix comparing Vector bug in TestingUtils. There is the same bug for Matrix comparing. How to check the length of Matrix should be discussed first. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark testUtils Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14785.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14785 commit 6c8f87f730ab4e93bbdec02d8b89d82b82042d73 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-24T10:36:10Z fix comparing Vector bug in TestingUtils --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in T...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14785#discussion_r76037277 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/util/TestingUtils.scala --- @@ -154,7 +154,11 @@ object TestingUtils { */ def absTol(eps: Double): CompareVectorRightSide = CompareVectorRightSide( (x: Vector, y: Vector, eps: Double) => { -x.toArray.zip(y.toArray).forall(x => x._1 ~= x._2 absTol eps) +if (x.size != y.size) { --- End diff -- code is updated. Thanks @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r76041373 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +val chiSqTestResult = Statistics.chiSqTest(data) +val features = selectorType match { + case ChiSqSelectorType.KBest => chiSqTestResult +.zipWithIndex.sortBy { case (res, _) => -res.statistic } --- End diff -- Hi @srowen , I have updated the code, do I miss any of your comments in this pr? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14785 Sure, I will fix it, and add test cases. thanks. @dbtsai , --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14785: [SPARK-17207][MLLIB]fix comparing Vector bug in TestingU...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14785 Hi @dbtsai , PR 2294 added Matrix comparing in TestingUtils, but did not add any test cases in TestingUtilsSuite. I did not add test cases for Matrix comparing in the PR either. If Matrix comparing cases are required. I can add all test cases them, not only test this PR, like Vector comparing test cases. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r77536408 --- Diff: python/pyspark/mllib/feature.py --- @@ -305,7 +350,12 @@ def fit(self, data): treated as categorical for each distinct value. Apply feature discretizer before using this function. """ -jmodel = callMLlibFunc("fitChiSqSelector", self.numTopFeatures, data) +if self.selectorType == ChiSqSelectorType.KBest: +jmodel = callMLlibFunc("fitChiSqSelectorKBest", self.numTopFeatures, data) +elif self.selectorType == ChiSqSelectorType.Percentile: +jmodel = callMLlibFunc("fitChiSqSelectorPercentile", self.percentile, data) +elif self.selectorType == ChiSqSelectorType.FPR: --- End diff -- Thanks @srowen , the type checking is added. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r77532997 --- Diff: python/pyspark/mllib/feature.py --- @@ -271,29 +271,74 @@ def transform(self, vector): """ return JavaVectorTransformer.transform(self, vector) +class ChiSqSelectorType: +""" +This class defines the selector types of Chi Square Selector. +""" +KBest, Percentile, FPR = range(3) --- End diff -- Hi @srowen , there are many ways to express enums in Python 2, this is the most frequently used and simplest way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Hi @yanboliang , could you please kindly review the python code of this PR. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r77137991 --- Diff: python/pyspark/mllib/feature.py --- @@ -276,24 +276,64 @@ class ChiSqSelector(object): """ Creates a ChiSquared feature selector. -:param numTopFeatures: number of features that selector will select. - >>> data = [ ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})), ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})), ... LabeledPoint(1.0, [0.0, 9.0, 8.0]), ... LabeledPoint(2.0, [8.0, 9.0, 5.0]) ... ] ->>> model = ChiSqSelector(1).fit(sc.parallelize(data)) +>>> model = ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data)) +>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0})) +SparseVector(1, {0: 6.0}) +>>> model.transform(DenseVector([8.0, 9.0, 5.0])) +DenseVector([5.0]) +>>> model = ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data)) >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0})) SparseVector(1, {0: 6.0}) >>> model.transform(DenseVector([8.0, 9.0, 5.0])) DenseVector([5.0]) +>>> data = [ +... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})), +... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})), +... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]), +... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0]) +... ] +>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data)) +>>> model.transform(DenseVector([1.0,2.0,3.0,4.0])) +DenseVector([4.0]) .. versionadded:: 1.4.0 """ -def __init__(self, numTopFeatures): -self.numTopFeatures = int(numTopFeatures) +def __init__(self): +self.param = 50 --- End diff -- Hi @srowen , use different fields for different value is not a problem, just need another selectionType field, and in the fit function, the code will be: if(selectionType == KBest) callMLlibFunc("fitChiSqSelectorKBest", self.numTopFeatures, data) elseif(selectionType == Percentile) callMLlibFunc("fitChiSqSelectorKPercentile", self.percentile, data) elseif(selecitonType == FPR) callMLlibFunc("fitChiSqSelectorFPR", self.alpha, data) Is that ok? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r77473261 --- Diff: python/pyspark/mllib/feature.py --- @@ -276,24 +276,64 @@ class ChiSqSelector(object): """ Creates a ChiSquared feature selector. -:param numTopFeatures: number of features that selector will select. - >>> data = [ ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})), ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})), ... LabeledPoint(1.0, [0.0, 9.0, 8.0]), ... LabeledPoint(2.0, [8.0, 9.0, 5.0]) ... ] ->>> model = ChiSqSelector(1).fit(sc.parallelize(data)) +>>> model = ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data)) +>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0})) +SparseVector(1, {0: 6.0}) +>>> model.transform(DenseVector([8.0, 9.0, 5.0])) +DenseVector([5.0]) +>>> model = ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data)) >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0})) SparseVector(1, {0: 6.0}) >>> model.transform(DenseVector([8.0, 9.0, 5.0])) DenseVector([5.0]) +>>> data = [ +... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})), +... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})), +... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]), +... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0]) +... ] +>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data)) +>>> model.transform(DenseVector([1.0,2.0,3.0,4.0])) +DenseVector([4.0]) .. versionadded:: 1.4.0 """ -def __init__(self, numTopFeatures): -self.numTopFeatures = int(numTopFeatures) +def __init__(self): +self.param = 50 --- End diff -- Hi @srowen , I have updated the code. Could you please review it again? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15058#discussion_r78355383 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame @Since("1.0.0") class BinaryClassificationMetrics @Since("1.3.0") ( @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)], -@Since("1.3.0") val numBins: Int) extends Logging { +@Since("1.3.0") var numBins: Int) extends Logging { --- End diff -- It can cause the count/confusion RDD to recompute. yes, you can set numBins when construct BinaryClassificationMetrics. But when you want to change the numBins, if there is no setBins function, you have to construct another BinaryClassificationMetrics. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15058#discussion_r78355993 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame @Since("1.0.0") class BinaryClassificationMetrics @Since("1.3.0") ( @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)], -@Since("1.3.0") val numBins: Int) extends Logging { +@Since("1.3.0") var numBins: Int) extends Logging { --- End diff -- I add this method is because, in the code, there is "// TODO: Allow the user to vary the number of bins using a setBins method in BinaryClassificationMetrics. " but I find there is no setBins method. So I add it. I we not plan to add this method, maybe we can remove this comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15058#discussion_r78358229 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame @Since("1.0.0") class BinaryClassificationMetrics @Since("1.3.0") ( @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)], -@Since("1.3.0") val numBins: Int) extends Logging { +@Since("1.3.0") var numBins: Int) extends Logging { --- End diff -- Thanks. I agree with you, the method to add setBins is complexity. I just can not find simple method to add setBins. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15058#discussion_r78362121 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame @Since("1.0.0") class BinaryClassificationMetrics @Since("1.3.0") ( @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)], -@Since("1.3.0") val numBins: Int) extends Logging { +@Since("1.3.0") var numBins: Int) extends Logging { --- End diff -- Ok, got it, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15058#discussion_r78356355 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame @Since("1.0.0") class BinaryClassificationMetrics @Since("1.3.0") ( @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)], -@Since("1.3.0") val numBins: Int) extends Logging { +@Since("1.3.0") var numBins: Int) extends Logging { --- End diff -- Yes, we can create another metrics. But why not add a setBins method. I think add a setBins method is reasonable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 hi @srowen , thanks. This is my first PR. learn much from you. thanks very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [MLLIB]Add setBins for BinaryClassificationMetric...
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/15058 [MLLIB]Add setBins for BinaryClassificationMetrics ## What changes were proposed in this pull request? Add a setBins method for BinaryClassificationMetrics. BinaryClassificationMetrics is a class in mllib/Evaluation. numBins is a key attribute of it. If numBins greater than 0, then the curves (ROC curve, PR curve) computed internally will be down-sampled to this many "bins". It is useful to let the user set the numBins. ## How was this patch tested? ut cases were changed to test this PR. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark addSetNumBins Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15058.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15058 commit b28ac7499b62282ac3777d7be21a1ed0dc16a78b Author: Peng <peng.m...@intel.com> Date: 2016-09-12T08:02:31Z add setBins commit 1f4258e262314cd6781a6d6feca1940c4ea05457 Author: Peng <peng.m...@intel.com> Date: 2016-09-12T09:41:59Z Merge remote-tracking branch 'origin/master' into addSetNumBins commit a70bf454dfd3838780a8d040858e53d3ad870966 Author: Peng <peng.m...@intel.com> Date: 2016-09-12T10:10:30Z change the comments style commit c17bf4e768be614c806bbd913096538f89b380e5 Author: Peng <peng.m...@intel.com> Date: 2016-09-12T10:15:51Z minor change commit d06b2b1d3f0d214a37da8d456e8e585b06fe5d9b Author: Peng <peng.m...@intel.com> Date: 2016-09-12T10:20:50Z change the comments --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15058#discussion_r78360606 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala --- @@ -43,10 +43,13 @@ import org.apache.spark.sql.DataFrame @Since("1.0.0") class BinaryClassificationMetrics @Since("1.3.0") ( @Since("1.3.0") val scoreAndLabels: RDD[(Double, Double)], -@Since("1.3.0") val numBins: Int) extends Logging { +@Since("1.3.0") var numBins: Int) extends Logging { --- End diff -- hi @srowen , can we use a simple way to add the setBins method. Seems add a setBins method is useful. Just as the testSuite in this PR shows. If the user wants to test different numBins. Create one object, just call setBins, he can easily get different results. How about change a way to implement setBins? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 hi @srowen , we have moved isSorted in ChiSqSelectorModel. There is error message "*** method isSorted(Array[Int])Boolean in class org.apache.spark.mllib.feature.ChiSqSelectorModel does not have a correspondent in current version**". Should we revert it? And we also need to add a new method numTopFeatures() in ChiSqSelector for the following error message: " *** method numTopFeatures()Int in class org.apache.spark.mllib.feature.ChiSqSelector does not have a correspondent in current version**" How do you think about it? Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Hi @srowen , Python style fail is updated. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 hi @srowen , I have updated the code for some MiMa test error. Could you please review it again. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15214 Hi @yanboliang , got it. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15277: [SPARK-17704][ML][MLlib] ChiSqSelector performanc...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15277#discussion_r80961052 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -57,22 +69,21 @@ class ChiSqSelectorModel @Since("1.3.0") ( * Preserves the order of filtered features the same as their indices are stored. * Might be moved to Vector as .slice * @param features vector - * @param filterIndices indices of features to filter + * @param filterIndices indices of features to filter, must be ordered asc */ private def compress(features: Vector, filterIndices: Array[Int]): Vector = { -val orderedIndices = filterIndices.sorted --- End diff -- I just found sort in transform will sort the number of sample times. sorry @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15299: [SPARK-17704][ML][MLlib] ChiSqSelector performance impro...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15299 hi @srowen , is @transient needed for val selectedFeatures or val filterIndices, one of them? is it good to define filterIndices lazy? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassif...
Github user mpjlu closed the pull request at: https://github.com/apache/spark/pull/15058 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15058: [SPARK-17505][MLLIB]Add setBins for BinaryClassification...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15058 sure, I close it now --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Thanks very much, I am in holiday now, will update the code this Sunday. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r79405860 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -54,11 +55,44 @@ private[feature] trait ChiSqSelectorParams extends Params /** @group getParam */ def getNumTopFeatures: Int = $(numTopFeatures) + + final val percentile = new DoubleParam(this, "percentile", +"Percentile of features that selector will select, ordered by statistics value descending.", +ParamValidators.inRange(0, 1)) + setDefault(percentile -> 0.1) + + /** @group getParam */ + def getPercentile: Double = $(percentile) + + final val alpha = new DoubleParam(this, "alpha", +"The highest p-value for features to be kept.", +ParamValidators.inRange(0, 1)) + setDefault(alpha -> 0.05) + + /** @group getParam */ + def getAlpha: Double = $(alpha) + + /** + * The ChiSqSelector supports KBest, Percentile, FPR selection, + * which is the same as ChiSqSelectorType defined in MLLIB. + * when call setNumTopFeatures, the selectorType is set to KBest + * when call setPercentile, the selectorType is set to Percentile + * when call setAlpha, the selectorType is set to FPR + */ + final val selectorType = new Param[String](this, "selectorType", +"ChiSqSelector Type: KBest, Percentile, FPR") + setDefault(selectorType -> ChiSqSelectorType.KBest.toString) + + /** @group getParam */ + def getChiSqSelectorType: String = $(selectorType) } /** * Chi-Squared feature selection, which selects categorical features to use for predicting a * categorical label. + * The selector supports three selection methods: KBest, Percentile and FPR. --- End diff -- Thanks @srowen, I have updated the javadoc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [MLLIB][ML]add feature selector method based on: ...
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/15212 [MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are a popular univariate statistical test for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests. https://en.wikipedia.org/wiki/Family-wise_error_rate We add FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? ut will be added soon (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark fdr_fwe Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15212.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15212 commit 2c071791b2c6fd7d388343ac95783c32ffdae529 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-09-23T07:27:19Z add feature selector method: FDR and FWE --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15214#discussion_r80277277 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -143,13 +149,13 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } -var selector = new feature.ChiSqSelector() -ChiSqSelectorType.withName($(selectorType)) match { - case ChiSqSelectorType.KBest => +val selector = new feature.ChiSqSelector() +$(selectorType) match { + case OldChiSqSelector.KBest => selector.setNumTopFeatures($(numTopFeatures)) --- End diff -- Do you need to set SelectorType here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15214#discussion_r80278819 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala --- @@ -76,7 +76,7 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext { LabeledPoint(1.0, Vectors.dense(Array(4.0))), LabeledPoint(1.0, Vectors.dense(Array(4.0))), LabeledPoint(2.0, Vectors.dense(Array(9.0 -val model = new ChiSqSelector().setAlpha(0.1).fit(labeledDiscreteData) +val model = new ChiSqSelector().setSelectorType("fpr").setAlpha(0.1).fit(labeledDiscreteData) --- End diff -- you should also do the same thing for https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15214#discussion_r80277785 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -160,6 +166,12 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str @Since("1.6.0") override def transformSchema(schema: StructType): StructType = { +val otherPairs = OldChiSqSelector.supportedTypeAndParamPairs.filter(_._1 == $(selectorType)) --- End diff -- == or != ? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15214 Hi @srowen and @yanboliang ; Thanks for your following up PR. I partly agree with your comments on 17017. **1. "if users both set numTopFeatures and percentile, it will train kbest or percentile model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly."** For the user confused you mentioned here, I think the main reason is function name. I have changed the function name of setAlpha to setFPR in SPARK-17645. setNumTopFeature should be setKBest. By this change, it can be much clear. For example, setKBest(100), setPercentile(0.1), setFPR(0.05). The selection type and parameters is very clear by one function. But for your method, user have to strike "setSelectorType("KBest").setNumTopFeatures(100)" to do the same thing as "setKBest(100)" **2. "if there are more than one parameter except alpha can be set for fpr model, we can not handle it elegantly in the existing framework. And similar issues for kbest and percentile model. "** I cannot think out any other parameters for fpr, kbest, percentile now. But if there is, I think it is just the same thing as your method. for example, setKBest(100).setOther(),, I agree with you for other change. Thanks very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSel...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15214#discussion_r80277470 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -143,13 +149,13 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } -var selector = new feature.ChiSqSelector() -ChiSqSelectorType.withName($(selectorType)) match { - case ChiSqSelectorType.KBest => +val selector = new feature.ChiSqSelector() +$(selectorType) match { + case OldChiSqSelector.KBest => selector.setNumTopFeatures($(numTopFeatures)) - case ChiSqSelectorType.Percentile => + case OldChiSqSelector.Percentile => selector.setPercentile($(percentile)) - case ChiSqSelectorType.FPR => + case OldChiSqSelector.FPR => selector.setAlpha($(alpha)) --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 No problem. thanks very much @srowen . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB] add a chiSquare Selector based on F...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Hi @srowen, I have added the parameter to control the feature selection type. The usage is like this: **var selector = new ChiSqSelector() var model = selector.fit(df) // by default, the selector is selection numTopFeatures (50) var newModel = selector.selectKBest(10), or var newModel = selector.selectPercentile(5), or,,** You can fit the DataFrame one time, and generate the model multi times. And the indices is sort in the model internally as we have discussed. For pass the p-value to the model function, this update does not include it. Because for the KBest and Percentile selection, the fit function uses ChiSqTestResult.statics to generate the model. For Fpr, the fit function uses ChiSqTestResult.p-value. So it maybe better to pass ChiSqTestResult to the model and expose to the caller. And I think it is better to submit another PR for "pass value to model and expose to the caller" problem. Because much codes will be changed for this problem, includes which data should be passed to the model, how to save the model, how to test the model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB] add a chiSquare Selector bas...
GitHub user mpjlu reopened a pull request: https://github.com/apache/spark/pull/14597 [SPARK-17017][MLLIB] add a chiSquare Selector based on False Positive Rate (FPR) test ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? Add Scala ut You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark fprChiSquare Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14597.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14597 commit 2adebe8de3881509e510fc518c562d1141ccd0ef Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-10T05:40:18Z add a chiSquare Selector based on False Positive Rate (FPR) test commit 04053ca207ef4aa955eddc3e65d09a4e03db6292 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-11T07:10:43Z Merge remote-tracking branch 'origin/master' into fprChiSquare commit 7623563884355a04867ce5271baa286f65180e62 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-16T13:36:11Z Configure the ChiSqSelector to reuse ChiSqTestResult by numTopFeatures, Percentile, and Fpr selector commit 3d6aecb8441503c9c3d62a2d8a3d48824b9d6637 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-17T02:34:59Z Config the ChiSqSelector to reuse the ChiSqTestResult by KBest, Percentile and FPR selector commit 026ac85dfa190707891b694f40e737f22f9b4bd5 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-17T02:43:45Z Merge branch 'master' into fprChiSquare2 commit 5305709c9d4029186318b99fa9c7c483897aa653 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-17T09:59:16Z add Since annotation --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75286500 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -171,14 +177,47 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { /** * Creates a ChiSquared feature selector. - * @param numTopFeatures number of features that selector will select - * (ordered by statistic value descending) - * Note that if the number of features is less than numTopFeatures, - * then this will select all features. */ -@Since("1.3.0") -class ChiSqSelector @Since("1.3.0") ( - @Since("1.3.0") val numTopFeatures: Int) extends Serializable { +@Since("2.1.0") +class ChiSqSelector @Since("2.1.0") () extends Serializable { + private var numTopFeatures: Int = 50 + private var percentile: Double = 10.0 + private var alpha: Double = 0.05 + private var selectorType = ChiSqSelectorType.KBest + private var chiSqTestResult: Array[ChiSqTestResult] = _ + + @Since("1.3.0") + def this(numTopFeatures: Int) { +this() +this.numTopFeatures = numTopFeatures --- End diff -- This is not necessary, because the default selectorType is KBest --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75294150 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") +} + } + + @Since("2.1.0") + def selectKBest(value: Int): ChiSqSelectorModel = { --- End diff -- The usage scenario here is: model = selector.fit(df) result = model.transform(features). I try to understand your proposal: Suppose the user generate the model1 use KBest: model1 = selector.fit(df). you mean the user can config selector, like selector.setPercentile(0.1) then, the model1 will be a different model, like it is generated by Percentile method, because it can use the parameters of selector. Does my understanding right? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75286355 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") +} + } + + @Since("2.1.0") + def selectKBest(value: Int): ChiSqSelectorModel = { --- End diff -- The user can call selectKBest directly, because, use can fit one time, and generate models by selectKBest multi times. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75286399 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") --- End diff -- Thanks, I will update --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75289389 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +228,35 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +chiSqTestResult = Statistics.chiSqTest(data) +selectorType match { + case ChiSqSelectorType.KBest => selectKBest(numTopFeatures) + case ChiSqSelectorType.Percentile => selectPercentile(percentile) + case ChiSqSelectorType.Fpr => selectFpr(alpha) + case _ => throw new Exception("Unknown ChiSqSelector Type") +} + } + + @Since("2.1.0") + def selectKBest(value: Int): ChiSqSelectorModel = { --- End diff -- Now, the user can use fit and transform only for consistent. If the user wants to reuse ChiSqTestResult, call selectKBest, SelectPercentile will help them. If we don't consider reuse ChiSqTestResult, we can not expose this methods. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75286558 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -54,6 +54,29 @@ private[feature] trait ChiSqSelectorParams extends Params /** @group getParam */ def getNumTopFeatures: Int = $(numTopFeatures) + + final val percentile = new DoubleParam(this, "percentile", +"Percentile of features that selector will select, ordered by statistics value descending.", +ParamValidators.gtEq(0)) + setDefault(percentile -> 10) --- End diff -- I will update, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Hi @srowen , I will update the Python API to match this changes. Now, the current Python API is not conflict with the changes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14824: [ML][MLLIB]The require condition and message does...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14824#discussion_r76421555 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -454,10 +454,15 @@ class SparseMatrix @Since("2.0.0") ( require(values.length == rowIndices.length, "The number of row indices and values don't match! " + s"values.length: ${values.length}, rowIndices.length: ${rowIndices.length}") - // The Or statement is for the case when the matrix is transposed - require(colPtrs.length == numCols + 1 || colPtrs.length == numRows + 1, "The length of the " + -"column indices should be the number of columns + 1. Currently, colPointers.length: " + -s"${colPtrs.length}, numCols: $numCols") + if (isTransposed) { +require(colPtrs.length == numRows + 1, "The length of the column indices should be " + --- End diff -- Updated, thanks @srowen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14824: [ML][MLLIB]The require condition and message does...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14824#discussion_r76417535 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/Matrices.scala --- @@ -455,9 +455,11 @@ class SparseMatrix @Since("2.0.0") ( require(values.length == rowIndices.length, "The number of row indices and values don't match! " + s"values.length: ${values.length}, rowIndices.length: ${rowIndices.length}") // The Or statement is for the case when the matrix is transposed - require(colPtrs.length == numCols + 1 || colPtrs.length == numRows + 1, "The length of the " + -"column indices should be the number of columns + 1. Currently, colPointers.length: " + -s"${colPtrs.length}, numCols: $numCols") + require(!isTransposed && colPtrs.length == numCols + 1 || --- End diff -- Thanks @srowen , code is updated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14824: [ML][MLLIB]The require condition and message does...
GitHub user mpjlu opened a pull request: https://github.com/apache/spark/pull/14824 [ML][MLLIB]The require condition and message doesn't match in SparseMatrix. ## What changes were proposed in this pull request? The require condition and message doesn't match, and the condition also should be optimized. Small change. Please kindly let me know if JIRA required. ## How was this patch tested? No additional test required. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mpjlu/spark smallChangeForMatrixRequire Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14824.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14824 commit a51c80614006956c13e3a8d5b6a424cc68c35392 Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-26T04:52:05Z Change require condition for Matrix contructor commit e5af18f83ae628104989ed71688a0ed42318a46b Author: Peng, Meng <peng.m...@intel.com> Date: 2016-08-26T09:46:08Z use interpolation in a string --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75802754 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -171,14 +180,47 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { /** * Creates a ChiSquared feature selector. - * @param numTopFeatures number of features that selector will select - * (ordered by statistic value descending) - * Note that if the number of features is less than numTopFeatures, - * then this will select all features. */ -@Since("1.3.0") -class ChiSqSelector @Since("1.3.0") ( - @Since("1.3.0") val numTopFeatures: Int) extends Serializable { +@Since("2.1.0") +class ChiSqSelector @Since("2.1.0") () extends Serializable { + private var numTopFeatures: Int = 50 + private var percentile: Double = 0.1 + private var alpha: Double = 0.05 + private var selectorType = ChiSqSelectorType.KBest + + @Since("1.3.0") + def this(numTopFeatures: Int) { +this() +this.numTopFeatures = numTopFeatures + } + + @Since("2.1.0") + def setNumTopFeatures(value: Int): this.type = { +numTopFeatures = value +selectorType = ChiSqSelectorType.KBest +this + } + + @Since("2.1.0") + def setPercentile(value: Double): this.type = { +require(value <= 1 && value >= 0, "Percentile should be larger than 0 and less than 100") --- End diff -- Thanks, the code is updated --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75802771 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -171,14 +180,47 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { /** * Creates a ChiSquared feature selector. - * @param numTopFeatures number of features that selector will select - * (ordered by statistic value descending) - * Note that if the number of features is less than numTopFeatures, - * then this will select all features. */ -@Since("1.3.0") -class ChiSqSelector @Since("1.3.0") ( - @Since("1.3.0") val numTopFeatures: Int) extends Serializable { +@Since("2.1.0") +class ChiSqSelector @Since("2.1.0") () extends Serializable { + private var numTopFeatures: Int = 50 + private var percentile: Double = 0.1 + private var alpha: Double = 0.05 + private var selectorType = ChiSqSelectorType.KBest + + @Since("1.3.0") + def this(numTopFeatures: Int) { +this() +this.numTopFeatures = numTopFeatures + } + + @Since("2.1.0") + def setNumTopFeatures(value: Int): this.type = { +numTopFeatures = value +selectorType = ChiSqSelectorType.KBest +this + } + + @Since("2.1.0") + def setPercentile(value: Double): this.type = { +require(value <= 1 && value >= 0, "Percentile should be larger than 0 and less than 100") +percentile = value +selectorType = ChiSqSelectorType.Percentile +this + } + + @Since("2.1.0") + def setAlpha(value: Double): this.type = { --- End diff -- require is added, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Hi @srowen , I have added Python API and test cases for ChiSqSelector. Could you kindly review it again. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r76624379 --- Diff: python/pyspark/mllib/feature.py --- @@ -276,24 +276,64 @@ class ChiSqSelector(object): """ Creates a ChiSquared feature selector. -:param numTopFeatures: number of features that selector will select. - >>> data = [ ... LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})), ... LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})), ... LabeledPoint(1.0, [0.0, 9.0, 8.0]), ... LabeledPoint(2.0, [8.0, 9.0, 5.0]) ... ] ->>> model = ChiSqSelector(1).fit(sc.parallelize(data)) +>>> model = ChiSqSelector().setNumTopFeatures(1).fit(sc.parallelize(data)) +>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0})) +SparseVector(1, {0: 6.0}) +>>> model.transform(DenseVector([8.0, 9.0, 5.0])) +DenseVector([5.0]) +>>> model = ChiSqSelector().setPercentile(0.34).fit(sc.parallelize(data)) >>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0})) SparseVector(1, {0: 6.0}) >>> model.transform(DenseVector([8.0, 9.0, 5.0])) DenseVector([5.0]) +>>> data = [ +... LabeledPoint(0.0, SparseVector(4, {0: 8.0, 1: 7.0})), +... LabeledPoint(1.0, SparseVector(4, {1: 9.0, 2: 6.0, 3: 4.0})), +... LabeledPoint(1.0, [0.0, 9.0, 8.0, 4.0]), +... LabeledPoint(2.0, [8.0, 9.0, 5.0, 9.0]) +... ] +>>> model = ChiSqSelector().setAlpha(0.1).fit(sc.parallelize(data)) +>>> model.transform(DenseVector([1.0,2.0,3.0,4.0])) +DenseVector([4.0]) .. versionadded:: 1.4.0 """ -def __init__(self, numTopFeatures): -self.numTopFeatures = int(numTopFeatures) +def __init__(self): +self.param = 50 --- End diff -- Use three variables are clear. It needs another selectionType variable also. In the fit function, according to the selection type, call different functions. If you prefer that, I will change the code. I am ok for both methods. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75851527 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -171,14 +180,48 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { /** * Creates a ChiSquared feature selector. - * @param numTopFeatures number of features that selector will select - * (ordered by statistic value descending) - * Note that if the number of features is less than numTopFeatures, - * then this will select all features. */ -@Since("1.3.0") -class ChiSqSelector @Since("1.3.0") ( - @Since("1.3.0") val numTopFeatures: Int) extends Serializable { +@Since("2.1.0") +class ChiSqSelector @Since("2.1.0") () extends Serializable { + private var numTopFeatures: Int = 50 + private var percentile: Double = 0.1 + private var alpha: Double = 0.05 + private var selectorType = ChiSqSelectorType.KBest + + @Since("1.3.0") + def this(numTopFeatures: Int) { +this() +this.numTopFeatures = numTopFeatures + } + + @Since("2.1.0") + def setNumTopFeatures(value: Int): this.type = { +numTopFeatures = value +selectorType = ChiSqSelectorType.KBest +this + } + + @Since("2.1.0") + def setPercentile(value: Double): this.type = { +require(value <= 1 && value >= 0, "Percentile should be larger than 0 and less than 1") --- End diff -- thanks, will update --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75851763 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +val chiSqTestResult = Statistics.chiSqTest(data) +val features = selectorType match { + case ChiSqSelectorType.KBest => chiSqTestResult +.zipWithIndex.sortBy { case (res, _) => -res.statistic } --- End diff -- Fpr doesn't need sort here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75852138 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -91,8 +137,17 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } -val chiSqSelector = new feature.ChiSqSelector($(numTopFeatures)).fit(input) -copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this)) +var model = ChiSqSelectorType.withName($(selectorType)) match { --- End diff -- If avoid duplicating the "new feature.ChiSqSelector()" and "fit(input)" here, the usage will be not consistent with other method, as we did before --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75851296 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -69,21 +73,26 @@ class ChiSqSelectorModel @Since("1.3.0") ( * Preserves the order of filtered features the same as their indices are stored. * Might be moved to Vector as .slice * @param features vector - * @param filterIndices indices of features to filter, must be ordered asc + * @param filterIndices indices of features to filter */ private def compress(features: Vector, filterIndices: Array[Int]): Vector = { +val orderedIndices = if (isSorted(filterIndices)) { --- End diff -- ok, I agree with you. thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75856793 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -91,8 +137,17 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } -val chiSqSelector = new feature.ChiSqSelector($(numTopFeatures)).fit(input) -copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this)) +var model = ChiSqSelectorType.withName($(selectorType)) match { --- End diff -- Sorry, I misunderstand your last comment. Just think about reuse selector. This will not be inconsistent with other methods. I will update the code, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r75858450 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -91,8 +137,17 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str case Row(label: Double, features: Vector) => OldLabeledPoint(label, OldVectors.fromML(features)) } -val chiSqSelector = new feature.ChiSqSelector($(numTopFeatures)).fit(input) -copyValues(new ChiSqSelectorModel(uid, chiSqSelector).setParent(this)) +var model = ChiSqSelectorType.withName($(selectorType)) match { --- End diff -- Hi @srowen , I have updated the code. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r76065118 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +val chiSqTestResult = Statistics.chiSqTest(data) +val features = selectorType match { + case ChiSqSelectorType.KBest => chiSqTestResult +.zipWithIndex.sortBy { case (res, _) => -res.statistic } --- End diff -- Thanks, got you. Will update the code --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r76068163 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +val chiSqTestResult = Statistics.chiSqTest(data) +val features = selectorType match { + case ChiSqSelectorType.KBest => chiSqTestResult +.zipWithIndex.sortBy { case (res, _) => -res.statistic } --- End diff -- Hi @srowen , code is updated. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector based ...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/14597 Sure, I can update the Python API. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r76059098 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +val chiSqTestResult = Statistics.chiSqTest(data) +val features = selectorType match { + case ChiSqSelectorType.KBest => chiSqTestResult +.zipWithIndex.sortBy { case (res, _) => -res.statistic } --- End diff -- I will update Fpr to FPR, thanks very much. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #14597: [SPARK-17017][MLLIB][ML] add a chiSquare Selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/14597#discussion_r76059026 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -189,11 +232,21 @@ class ChiSqSelector @Since("1.3.0") ( */ @Since("1.3.0") def fit(data: RDD[LabeledPoint]): ChiSqSelectorModel = { -val indices = Statistics.chiSqTest(data) - .zipWithIndex.sortBy { case (res, _) => -res.statistic } - .take(numTopFeatures) - .map { case (_, indices) => indices } - .sorted +val chiSqTestResult = Statistics.chiSqTest(data) +val features = selectorType match { + case ChiSqSelectorType.KBest => chiSqTestResult +.zipWithIndex.sortBy { case (res, _) => -res.statistic } --- End diff -- Hi @srowen , do you mean we should move the sort from the model to here? There are two sort here. One is sort the indices, which we have been moved from here to the model. The second is sort the features, because the KBest, and Percentile is selection by the order of res.statistic, so we sort the features here. FPR is selection by res.p-value > alpha, so don't need sort here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector method...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15212 hi @srowen @yanboliang , I have updated this PR. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15277: [SPARK-17704][ML][MLlib] ChiSqSelector performance impro...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15277 sort in the transform will cause sort too many times. so this looks good to me. thangks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15214 hi @srowen . My understand of yanbo's comments here is, if user use chSqSelector like this: model1 = new ChiSqSelector().setFPR(0.05).setKBest(100).fit(data) model2 = new ChiSqSelector().setKBest(100).setFPR(0.05).fit(data) model1 will be different with model2. so the model is dependent on the order of users setting params. Actually, user should not use ChiSqSelector like this. One just need to set one SelectorType/Parameter is ok. But if one don't know ChiSqSelector, he may do like this. So yanbo think this is a problem. In this PR, setFPR(0.05) is split to two functions: setSelectorType("fpr").setAlpha(0.05). This maybe clear to the user. By the principle of software development: one function do one thing, I am ok with this change. But from user experience, I like the spark-17017 method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15214 Thanks, this looks good to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15214: [SPARK-17017][Follow-up][ML] Refactor of ChiSqSelector a...
Github user mpjlu commented on the issue: https://github.com/apache/spark/pull/15214 Hi @srowen , sorry for forgetting update the doc and python/ml/feature.py in last PR. This pr has added ml/feature.py. It looks good to me. Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15647#discussion_r85311619 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -44,67 +44,78 @@ private[feature] trait ChiSqSelectorParams extends Params /** * Number of features that selector will select (ordered by statistic value descending). If the * number of features is less than numTopFeatures, then this will select all features. - * Only applicable when selectorType = "kbest". + * Only applicable when selectorType = "numTopFeatures". * The default value of numTopFeatures is 50. * * @group param */ + @Since("1.6.0") final val numTopFeatures = new IntParam(this, "numTopFeatures", "Number of features that selector will select, ordered by statistics value descending. If the" + --- End diff -- ordered by pValue ascending. I missed this change in my PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15647#discussion_r85311677 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -44,67 +44,78 @@ private[feature] trait ChiSqSelectorParams extends Params /** * Number of features that selector will select (ordered by statistic value descending). If the * number of features is less than numTopFeatures, then this will select all features. - * Only applicable when selectorType = "kbest". + * Only applicable when selectorType = "numTopFeatures". * The default value of numTopFeatures is 50. * * @group param */ + @Since("1.6.0") final val numTopFeatures = new IntParam(this, "numTopFeatures", "Number of features that selector will select, ordered by statistics value descending. If the" + " number of features is < numTopFeatures, then this will select all features.", ParamValidators.gtEq(1)) setDefault(numTopFeatures -> 50) /** @group getParam */ + @Since("1.6.0") def getNumTopFeatures: Int = $(numTopFeatures) /** * Percentile of features that selector will select, ordered by statistics value descending. * Only applicable when selectorType = "percentile". * Default value is 0.1. + * @group param */ + @Since("2.1.0") final val percentile = new DoubleParam(this, "percentile", "Percentile of features that selector will select, ordered by statistics value descending.", --- End diff -- ordered by pValue ascending. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15647#discussion_r85310898 --- Diff: docs/mllib-feature-extraction.md --- @@ -227,22 +227,19 @@ both speed and statistical learning behavior. [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`: +features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: -* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. -* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number. -* `FPR` chooses all features whose false positive rate meets some threshold. +* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. -By default, the selection method is `KBest`, the default number of top features is 50. User can use -`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods. +By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use +`setNumTopFeatures`, `setPercentile`, `fpr` to set different selection methods. --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15647#discussion_r85310797 --- Diff: docs/ml-features.md --- @@ -1333,14 +1333,14 @@ for more details on the API. `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`: +features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: -* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. -* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number. -* `FPR` chooses all features whose false positive rate meets some threshold. +* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. +* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number. +* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection. -By default, the selection method is `KBest`, the default number of top features is 50. User can use -`setNumTopFeatures`, `setPercentile` and `setAlpha` to set different selection methods. +By default, the selection method is `numTopFeatures`, with the default number of top features set to 50. User can use +`setNumTopFeatures`, `setPercentile` and `setFpr` to set different selection methods. --- End diff -- setNumTopFeatures, setPercentile, setFpr just set parameter, setSelectorType is used to set selection methods. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15647#discussion_r85310862 --- Diff: docs/mllib-feature-extraction.md --- @@ -227,22 +227,19 @@ both speed and statistical learning behavior. [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`: +features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: -* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. -* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number. -* `FPR` chooses all features whose false positive rate meets some threshold. +* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15647#discussion_r85311930 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala --- @@ -171,18 +171,19 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { /** * Creates a ChiSquared feature selector. - * The selector supports three selection methods: `kbest`, `percentile` and `fpr`. - * `kbest` chooses the `k` top features according to a chi-squared test. - * `percentile` is similar but chooses a fraction of all features instead of a fixed number. - * `fpr` chooses all features whose false positive rate meets some threshold. - * By default, the selection method is `kbest`, the default number of top features is 50. + * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. + * - `numTopFeatures` chooses the `k` top features according to a chi-squared test. --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15647: [SPARK-18088][ML] Various ChiSqSelector cleanups
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15647#discussion_r85310501 --- Diff: docs/ml-features.md --- @@ -1333,14 +1333,14 @@ for more details on the API. `ChiSqSelector` stands for Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which -features to choose. It supports three selection methods: `KBest`, `Percentile` and `FPR`: +features to choose. It supports three selection methods: `numTopFeatures`, `percentile`, `fpr`: -* `KBest` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. -* `Percentile` is similar to `KBest` but chooses a fraction of all features instead of a fixed number. -* `FPR` chooses all features whose false positive rate meets some threshold. +* `numTopFeatures` chooses the `k` top features according to a chi-squared test. This is akin to yielding the features with the most predictive power. --- End diff -- Should use k here since KBest is changed to numTopFeatures? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15212: [SPARK-17645][MLLIB][ML][WIP]add feature selector...
Github user mpjlu commented on a diff in the pull request: https://github.com/apache/spark/pull/15212#discussion_r84049805 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala --- @@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params def getPercentile: Double = $(percentile) /** - * The highest p-value for features to be kept. - * Only applicable when selectorType = "fpr". + * alpha means the highest p-value for features to be kept when select type is "fpr". + * alpha means the highest uncorrected p-value for features to be kept when select type --- End diff -- updated, thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org