[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218997600 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala --- @@ -71,6 +71,23 @@ object Gini extends Impurity { @Since("1.1.0") def instance: this.type = this + /** + * :: DeveloperApi :: + * p-values for test-statistic measures, unsupported for [[Gini]] + */ + @Since("2.2.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- @srowen @willb I cached the design of the metrics back in. In general, `Impurity` already uses methods that are only defined on certain impurity sub-classes, and so this new method does not change that situation. My take on the "problem" is that the existing measures are all based on a localized concept of "purity" (or impurity) that can be calculated using _only_ the data at a single node. Splitting based on statistical tests (p-values) breaks that model, since it is making use of a more generalized concept of split quality that requires the sample populations of both children from a candidate split. A maximally general signature would probably involve the parent and both children. Another kink in the current design is that `ImpurityCalculator` is essentially parallel to `Impurity`, and in fact `ImpurityCalculator#calculate()` is how impurity measures are currently requested. `Impurity` seems somewhat redundant, and might be factored out in favor of `ImpurityCalculator`. The current signature `calculate()` might be generalized into a more inclusive concept of split quality that expects to make use of {parent,left,right}. Calls to `calculate()` are not very wide-spread but threading that change through is outside the scope of this particular PR. If people are interested in that kind of refactoring I could look into it in the near future but probably not in the next couple weeks. That kind of change would also be API breaking and so a good target for 3.0 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218252461 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -52,6 +52,49 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.2.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.2.0") + @DeveloperApi + def isTestStatistic: Boolean --- End diff -- Can this be customized or extended externally to spark? I'm wondering why it is public. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218252156 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala --- @@ -71,6 +71,23 @@ object Gini extends Impurity { @Since("1.1.0") def instance: this.type = this + /** + * :: DeveloperApi :: + * p-values for test-statistic measures, unsupported for [[Gini]] + */ + @Since("2.2.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- I suspect that the generalization is closer to my newer signature `val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)` where you have all the context from the left and right nodes. The existing gain-based calculation should fit into this framework, just doing its current weighted average of purity gain. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218246415 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala --- @@ -71,6 +71,23 @@ object Gini extends Impurity { @Since("1.1.0") def instance: this.type = this + /** + * :: DeveloperApi :: + * p-values for test-statistic measures, unsupported for [[Gini]] + */ + @Since("2.2.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- I'll consider if there's a unifying idea here. pval-based metrics require integrating information across the new split children, which I believe was not the case for existing methods. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218245862 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating Chi Squared as a split quality metric during binary classification. + */ +@Since("2.2.0") --- End diff -- I'll update those. 3.0 might be a good target, especially if I can't do this without new `isTestStatistic` --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218245670 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging { val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0 val rightImpurity = rightImpurityCalculator.calculate() -val leftWeight = leftCount / totalCount.toDouble -val rightWeight = rightCount / totalCount.toDouble +val gain = metadata.impurity match { + case imp if (imp.isTestStatistic) => +// For split quality measures based on a test-statistic, run the test on the +// left and right sub-populations to get a p-value for the null hypothesis +val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator) +// Transform the test statistic p-val into a larger-is-better gain value +Impurity.pValToGain(pval) + + case _ => +// Default purity-gain logic: +// measure the weighted decrease in impurity from parent to the left and right +val leftWeight = leftCount / totalCount.toDouble +val rightWeight = rightCount / totalCount.toDouble + +impurity - leftWeight * leftImpurity - rightWeight * rightImpurity +} -val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity +// If the impurity being used is a test statistic p-val, apply a standard transform into +// a larger-is-better gain value for the minimum-gain threshold +val minGain = + if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain) + else metadata.minInfoGain --- End diff -- The main issue I recall was that all of the existing metrics assume some kind of "larger is better" gain, and p-values are "smaller is better." I'll take another pass over it and see if I can push that distinction down so it doesn't require exposing new methods. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218208245 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating Chi Squared as a split quality metric during binary classification. + */ +@Since("2.2.0") --- End diff -- This will have to be 2.5.0 for the moment --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218209825 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala --- @@ -71,6 +71,23 @@ object Gini extends Impurity { @Since("1.1.0") def instance: this.type = this + /** + * :: DeveloperApi :: + * p-values for test-statistic measures, unsupported for [[Gini]] + */ + @Since("2.2.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- It looks like this new method doesn't make sense to implement for existing implementations, only the new one. That kind of suggests to me it isn't part of the generic API for an impurity. Is this really something that belongs inside the logic of the implementations? maybe there's a more general method that needs to be exposed, that can then be specialized for all implementations. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218209381 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -52,6 +52,49 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.2.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.2.0") + @DeveloperApi + def isTestStatistic: Boolean +} + +/** + * :: DeveloperApi :: + * Utility functions for Impurity measures + */ +@Since("2.0.0") +@DeveloperApi +object Impurity { + /** + * :: DeveloperApi :: + * Convert a test-statistic p-value into a "larger-is-better" gain value. + * @param pval The test statistic p-value + * @return The negative logarithm of the p-value. Any p-values smaller than 10^-20 are clipped + * to 10^-20 to prevent arithmetic errors + */ + @Since("2.0.0") + @DeveloperApi + def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval)) --- End diff -- private to spark? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218208383 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating Chi Squared as a split quality metric during binary classification. + */ +@Since("2.2.0") +@Experimental +object ChiSquared extends Impurity { + private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest() + + /** + * Get this impurity instance. + * This is useful for passing impurity parameters to a Strategy in Java. + */ + @Since("1.1.0") --- End diff -- I think I'd label all these as Since 2.5.0 even if they override a method that existed earlier. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218209453 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -52,6 +52,49 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.2.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.2.0") + @DeveloperApi + def isTestStatistic: Boolean --- End diff -- Adding methods to a public trait is technically an API breaking change. This might be considered a Developer API even though it's not labeled that way. Still if we can avoid adding to the API here, it'd be better. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r218208123 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala --- @@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging { val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0 val rightImpurity = rightImpurityCalculator.calculate() -val leftWeight = leftCount / totalCount.toDouble -val rightWeight = rightCount / totalCount.toDouble +val gain = metadata.impurity match { + case imp if (imp.isTestStatistic) => +// For split quality measures based on a test-statistic, run the test on the +// left and right sub-populations to get a p-value for the null hypothesis +val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator) +// Transform the test statistic p-val into a larger-is-better gain value +Impurity.pValToGain(pval) + + case _ => +// Default purity-gain logic: +// measure the weighted decrease in impurity from parent to the left and right +val leftWeight = leftCount / totalCount.toDouble +val rightWeight = rightCount / totalCount.toDouble + +impurity - leftWeight * leftImpurity - rightWeight * rightImpurity +} -val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity +// If the impurity being used is a test statistic p-val, apply a standard transform into +// a larger-is-better gain value for the minimum-gain threshold +val minGain = + if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain) + else metadata.minInfoGain --- End diff -- Kind of a design question here... right now the caller has to switch logic based on what's inside metadata. Can methods like metadata.minInfoGain just implement different logic when the impurity is a test statistic, and so on? push this down towards the impurity implementation? I wonder if `isTestStatistic` can go away with the right API, but I am not familiar with the details of what that requires. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r112719021 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]] --- End diff -- @HyukjinKwon thank you! I didn't notice that was in the full console output --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r112596837 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]] --- End diff -- The documentation generation is being failed due to this line - ``` [error] /home/jenkins/workspace/SparkPullRequestBuilder/mllib/target/java/org/apache/spark/mllib/tree/impurity/ChiSquared.java:4: error: unexpected text [error] * Class for calculating {@link https://en.wikipedia.org/wiki/Chi-squared_test chi-squared} [error] ``` Probably, remove the link or wrap it `href` as I did before - https://github.com/apache/spark/pull/16013 The other errors seem spurious. Please refer my observation - https://github.com/apache/spark/pull/17389#issuecomment-288438704 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104822632 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], numClasses) } + test("split quality using chi-squared and minimum gain") { +// Generate a data set where the 1st feature is useful and the others are noise +val features = Vector.fill(200) { + Array.fill(3) { scala.util.Random.nextInt(2).toDouble } +} +val labels = features.map { fv => + LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv)) +} +val rdd = sc.parallelize(labels) + +// two-class learning problem +val numClasses = 2 +// all binary features +val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 2) } : _*) + +// Chi-squared split quality with a p-value threshold of 0.01 should allow +// only the first feature to be used since the others are uncorrelated noise +val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, numClasses) +val dt = new DecisionTreeClassifier() + .setImpurity("chisquared") + .setMaxDepth(5) + .setMinInfoGain(0.01) +val treeModel = dt.fit(train) + +// The tree should use exactly one of the 3 features: featue(0) --- End diff -- ð --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104822458 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- IIRC, there were other impurity measures using this 'unsupported' idiom. However, I'm fine with making it abstract, if that's the direction you are thinking. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104822183 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = +throw new UnsupportedOperationException("Impurity.calculate") + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def isTestStatistic: Boolean = false +} + +/** + * :: DeveloperApi :: + * Utility functions for Impurity measures + */ +@Since("2.0.0") +@DeveloperApi --- End diff -- I don't think so. I don't recall any specific motivation to keep it private, but historically Spark seems to default things to "minimum visibility." The only method currently defined here is an implementation detail for hacking p-values into the existing 'gain' system, where larger is assumed to be better. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104821445 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]] + * during binary classification. + */ +@Since("2.0.0") +@Experimental +object ChiSquared extends Impurity { + private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest() --- End diff -- IIRC it was to "allocate on the stack" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104813864 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala --- @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.impurity + +import org.apache.spark.annotation.{DeveloperApi, Experimental, Since} + +/** + * :: Experimental :: + * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test chi-squared]] + * during binary classification. + */ +@Since("2.0.0") +@Experimental +object ChiSquared extends Impurity { + private object CSTest extends org.apache.commons.math3.stat.inference.ChiSquareTest() --- End diff -- why not a private `val`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104813302 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala --- @@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], numClasses) } + test("split quality using chi-squared and minimum gain") { +// Generate a data set where the 1st feature is useful and the others are noise +val features = Vector.fill(200) { + Array.fill(3) { scala.util.Random.nextInt(2).toDouble } +} +val labels = features.map { fv => + LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv)) +} +val rdd = sc.parallelize(labels) + +// two-class learning problem +val numClasses = 2 +// all binary features +val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 2) } : _*) + +// Chi-squared split quality with a p-value threshold of 0.01 should allow +// only the first feature to be used since the others are uncorrelated noise +val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, numClasses) +val dt = new DecisionTreeClassifier() + .setImpurity("chisquared") + .setMaxDepth(5) + .setMinInfoGain(0.01) +val treeModel = dt.fit(train) + +// The tree should use exactly one of the 3 features: featue(0) --- End diff -- nit: feature --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812803 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- you cannot guarantee compatibility with existing code here, since you would break the bytecode either way. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812596 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = +throw new UnsupportedOperationException("Impurity.calculate") + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def isTestStatistic: Boolean = false +} + +/** + * :: DeveloperApi :: + * Utility functions for Impurity measures + */ +@Since("2.0.0") +@DeveloperApi --- End diff -- there is no need for this object to be publicly exposed? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812468 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = --- End diff -- scala: do not add a default implementation, it causes issues with java compatibility --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/13440#discussion_r104812484 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala --- @@ -50,6 +50,50 @@ trait Impurity extends Serializable { @Since("1.0.0") @DeveloperApi def calculate(count: Double, sum: Double, sumSquares: Double): Double + + /** + * :: DeveloperApi :: + * Compute a test-statistic p-value quality measure from left and right split populations + * @param calcL impurity calculator for the left split population + * @param calcR impurity calculator for the right split population + * @return The p-value for the null hypothesis; that left and right split populations + * represent the same distribution + * @note Unless overridden this method will fail with an exception, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double = +throw new UnsupportedOperationException("Impurity.calculate") + + /** + * :: DeveloperApi :: + * Determine if this impurity measure is a test-statistic measure + * @return True if this is a split quality measure based on a test statistic (i.e. returns a + * p-value) or false otherwise. + * @note Unless overridden this method returns false by default, for backward compatability + */ + @Since("2.0.0") + @DeveloperApi + def isTestStatistic: Boolean = false --- End diff -- scala: do not add a default implementation, it causes issues with java compatibility --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/13440 [SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality ## What changes were proposed in this pull request? Using test statistics as a measure of decision tree split quality is a useful split halting measure that can yield improved model quality. I am proposing to add the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles. https://issues.apache.org/jira/browse/SPARK-15699 http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/ ## How was this patch tested? I added unit testing to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training. You can merge this pull request into a Git repository by running: $ git pull https://github.com/erikerlandson/spark chisquared_split_quality Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13440.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13440 commit 04c131647f9cbc8208b755c6d08f16a1056f171c Author: Erik ErlandsonDate: 2016-05-31T21:44:22Z Implement a Chi-Squared test statistic option for measuring split quality when training decision trees --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org