[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-19 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218997600
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
@@ -71,6 +71,23 @@ object Gini extends Impurity {
   @Since("1.1.0")
   def instance: this.type = this
 
+  /**
+   * :: DeveloperApi ::
+   * p-values for test-statistic measures, unsupported for [[Gini]]
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

@srowen @willb 
I cached the design of the metrics back in. In general, `Impurity` already 
uses methods that are only defined on certain impurity sub-classes, and so this 
new method does not change that situation.

My take on the "problem" is that the existing measures are all based on a 
localized concept of "purity" (or impurity) that can be calculated using _only_ 
the data at a single node. Splitting based on statistical tests (p-values) 
breaks that model, since it is making use of a more generalized concept of 
split quality that requires the sample populations of both children from a 
candidate split. A maximally general signature would probably involve the 
parent and both children.

Another kink in the current design is that `ImpurityCalculator` is 
essentially parallel to `Impurity`, and in fact 
`ImpurityCalculator#calculate()` is how impurity measures are currently 
requested. `Impurity` seems somewhat redundant, and might be factored out in 
favor of `ImpurityCalculator`. The current signature `calculate()` might be 
generalized into a more inclusive concept of split quality that expects to make 
use of {parent,left,right}.

Calls to `calculate()` are not very wide-spread but threading that change 
through is outside the scope of this particular PR. If people are interested in 
that kind of refactoring I could look into it in the near future but probably 
not in the next couple weeks.

That kind of change would also be API breaking and so a good target for 3.0


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218252461
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -52,6 +52,49 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean
--- End diff --

Can this be customized or extended externally to spark?  I'm wondering why 
it is public.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218252156
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
@@ -71,6 +71,23 @@ object Gini extends Impurity {
   @Since("1.1.0")
   def instance: this.type = this
 
+  /**
+   * :: DeveloperApi ::
+   * p-values for test-statistic measures, unsupported for [[Gini]]
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

I suspect that the generalization is closer to my newer signature 
`val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)`
where you have all the context from the left and right nodes. The existing 
gain-based calculation should fit into this framework, just doing its current 
weighted average of purity gain.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218246415
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
@@ -71,6 +71,23 @@ object Gini extends Impurity {
   @Since("1.1.0")
   def instance: this.type = this
 
+  /**
+   * :: DeveloperApi ::
+   * p-values for test-statistic measures, unsupported for [[Gini]]
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

I'll consider if there's a unifying idea here. pval-based metrics require 
integrating information across the new split children, which I believe was not 
the case for existing methods.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218245862
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating Chi Squared as a split quality metric during 
binary classification.
+ */
+@Since("2.2.0")
--- End diff --

I'll update those.  3.0 might be a good target, especially if I can't do 
this without new `isTestStatistic`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218245670
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging {
 val leftImpurity = leftImpurityCalculator.calculate() // Note: This 
equals 0 if count = 0
 val rightImpurity = rightImpurityCalculator.calculate()
 
-val leftWeight = leftCount / totalCount.toDouble
-val rightWeight = rightCount / totalCount.toDouble
+val gain = metadata.impurity match {
+  case imp if (imp.isTestStatistic) =>
+// For split quality measures based on a test-statistic, run the 
test on the
+// left and right sub-populations to get a p-value for the null 
hypothesis
+val pval = imp.calculate(leftImpurityCalculator, 
rightImpurityCalculator)
+// Transform the test statistic p-val into a larger-is-better gain 
value
+Impurity.pValToGain(pval)
+
+  case _ =>
+// Default purity-gain logic:
+// measure the weighted decrease in impurity from parent to the 
left and right
+val leftWeight = leftCount / totalCount.toDouble
+val rightWeight = rightCount / totalCount.toDouble
+
+impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
+}
 
-val gain = impurity - leftWeight * leftImpurity - rightWeight * 
rightImpurity
+// If the impurity being used is a test statistic p-val, apply a 
standard transform into
+// a larger-is-better gain value for the minimum-gain threshold
+val minGain =
+  if (metadata.impurity.isTestStatistic) 
Impurity.pValToGain(metadata.minInfoGain)
+  else metadata.minInfoGain
--- End diff --

The main issue I recall was that all of the existing metrics assume some 
kind of "larger is better" gain, and p-values are "smaller is better."  I'll 
take another pass over it and see if I can push that distinction down so it 
doesn't require exposing new methods.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218208245
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating Chi Squared as a split quality metric during 
binary classification.
+ */
+@Since("2.2.0")
--- End diff --

This will have to be 2.5.0 for the moment


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218209825
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
@@ -71,6 +71,23 @@ object Gini extends Impurity {
   @Since("1.1.0")
   def instance: this.type = this
 
+  /**
+   * :: DeveloperApi ::
+   * p-values for test-statistic measures, unsupported for [[Gini]]
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

It looks like this new method doesn't make sense to implement for existing 
implementations, only the new one. That kind of suggests to me it isn't part of 
the generic API for an impurity. Is this really something that belongs inside 
the logic of the implementations? maybe there's a more general method that 
needs to be exposed, that can then be specialized for all implementations.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218209381
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -52,6 +52,49 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean
+}
+
+/**
+ * :: DeveloperApi ::
+ * Utility functions for Impurity measures
+ */
+@Since("2.0.0")
+@DeveloperApi
+object Impurity {
+  /**
+   * :: DeveloperApi ::
+   * Convert a test-statistic p-value into a "larger-is-better" gain value.
+   * @param pval The test statistic p-value
+   * @return The negative logarithm of the p-value.  Any p-values smaller 
than 10^-20 are clipped
+   * to 10^-20 to prevent arithmetic errors
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))
--- End diff --

private to spark?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218208383
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating Chi Squared as a split quality metric during 
binary classification.
+ */
+@Since("2.2.0")
+@Experimental
+object ChiSquared extends Impurity {
+  private object CSTest extends 
org.apache.commons.math3.stat.inference.ChiSquareTest()
+
+  /**
+   * Get this impurity instance.
+   * This is useful for passing impurity parameters to a Strategy in Java.
+   */
+  @Since("1.1.0")
--- End diff --

I think I'd label all these as Since 2.5.0 even if they override a method 
that existed earlier. 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218209453
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -52,6 +52,49 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.2.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean
--- End diff --

Adding methods to a public trait is technically an API breaking change. 
This might be considered a Developer API even though it's not labeled that way. 
Still if we can avoid adding to the API here, it'd be better.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2018-09-17 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r218208123
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging {
 val leftImpurity = leftImpurityCalculator.calculate() // Note: This 
equals 0 if count = 0
 val rightImpurity = rightImpurityCalculator.calculate()
 
-val leftWeight = leftCount / totalCount.toDouble
-val rightWeight = rightCount / totalCount.toDouble
+val gain = metadata.impurity match {
+  case imp if (imp.isTestStatistic) =>
+// For split quality measures based on a test-statistic, run the 
test on the
+// left and right sub-populations to get a p-value for the null 
hypothesis
+val pval = imp.calculate(leftImpurityCalculator, 
rightImpurityCalculator)
+// Transform the test statistic p-val into a larger-is-better gain 
value
+Impurity.pValToGain(pval)
+
+  case _ =>
+// Default purity-gain logic:
+// measure the weighted decrease in impurity from parent to the 
left and right
+val leftWeight = leftCount / totalCount.toDouble
+val rightWeight = rightCount / totalCount.toDouble
+
+impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
+}
 
-val gain = impurity - leftWeight * leftImpurity - rightWeight * 
rightImpurity
+// If the impurity being used is a test statistic p-val, apply a 
standard transform into
+// a larger-is-better gain value for the minimum-gain threshold
+val minGain =
+  if (metadata.impurity.isTestStatistic) 
Impurity.pValToGain(metadata.minInfoGain)
+  else metadata.minInfoGain
--- End diff --

Kind of a design question here... right now the caller has to switch logic 
based on what's inside metadata. Can methods like metadata.minInfoGain just 
implement different logic when the impurity is a test statistic, and so on? 
push this down towards the impurity implementation? I wonder if 
`isTestStatistic` can go away with the right API, but I am not familiar with 
the details of what that requires.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-04-21 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r112719021
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test 
chi-squared]]
--- End diff --

@HyukjinKwon thank you!   I didn't notice that was in the full console 
output


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-04-20 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r112596837
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,163 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test 
chi-squared]]
--- End diff --

The documentation generation is being failed due to this line - 

```
[error] 
/home/jenkins/workspace/SparkPullRequestBuilder/mllib/target/java/org/apache/spark/mllib/tree/impurity/ChiSquared.java:4:
 error: unexpected text
[error]  * Class for calculating {@link 
https://en.wikipedia.org/wiki/Chi-squared_test chi-squared}
[error]
```

Probably, remove the link or wrap it `href` as I did before - 
https://github.com/apache/spark/pull/16013

The other errors seem spurious. Please refer my observation - 
https://github.com/apache/spark/pull/17389#issuecomment-288438704


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104822632
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite
 compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], 
numClasses)
   }
 
+  test("split quality using chi-squared and minimum gain") {
+// Generate a data set where the 1st feature is useful and the others 
are noise
+val features = Vector.fill(200) {
+  Array.fill(3) { scala.util.Random.nextInt(2).toDouble }
+}
+val labels = features.map { fv =>
+  LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv))
+}
+val rdd = sc.parallelize(labels)
+
+// two-class learning problem
+val numClasses = 2
+// all binary features
+val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 
2) } : _*)
+
+// Chi-squared split quality with a p-value threshold of 0.01 should 
allow
+// only the first feature to be used since the others are uncorrelated 
noise
+val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, 
numClasses)
+val dt = new DecisionTreeClassifier()
+  .setImpurity("chisquared")
+  .setMaxDepth(5)
+  .setMinInfoGain(0.01)
+val treeModel = dt.fit(train)
+
+// The tree should use exactly one of the 3 features: featue(0)
--- End diff --

👍 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104822458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

IIRC, there were other impurity measures using this 'unsupported' idiom.   
However, I'm fine with making it abstract, if that's the direction you are 
thinking.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104822183
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
+throw new UnsupportedOperationException("Impurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean = false
+}
+
+/**
+ * :: DeveloperApi ::
+ * Utility functions for Impurity measures
+ */
+@Since("2.0.0")
+@DeveloperApi
--- End diff --

I don't think so.  I don't recall any specific motivation to keep it 
private, but historically Spark seems to default things to "minimum 
visibility."   The only method currently defined here is an implementation 
detail for hacking p-values into the existing 'gain' system, where larger is 
assumed to be better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104821445
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test 
chi-squared]]
+ * during binary classification.
+ */
+@Since("2.0.0")
+@Experimental
+object ChiSquared extends Impurity {
+  private object CSTest extends 
org.apache.commons.math3.stat.inference.ChiSquareTest()
--- End diff --

IIRC it was to "allocate on the stack"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104813864
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ChiSquared.scala ---
@@ -0,0 +1,162 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impurity
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental, Since}
+
+/**
+ * :: Experimental ::
+ * Class for calculating [[https://en.wikipedia.org/wiki/Chi-squared_test 
chi-squared]]
+ * during binary classification.
+ */
+@Since("2.0.0")
+@Experimental
+object ChiSquared extends Impurity {
+  private object CSTest extends 
org.apache.commons.math3.stat.inference.ChiSquareTest()
--- End diff --

why not a private `val`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104813302
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/DecisionTreeClassifierSuite.scala
 ---
@@ -237,6 +237,41 @@ class DecisionTreeClassifierSuite
 compareAPIs(rdd, dt, categoricalFeatures = Map.empty[Int, Int], 
numClasses)
   }
 
+  test("split quality using chi-squared and minimum gain") {
+// Generate a data set where the 1st feature is useful and the others 
are noise
+val features = Vector.fill(200) {
+  Array.fill(3) { scala.util.Random.nextInt(2).toDouble }
+}
+val labels = features.map { fv =>
+  LabeledPoint(if (fv(0) == 1.0) 1.0 else 0.0, Vectors.dense(fv))
+}
+val rdd = sc.parallelize(labels)
+
+// two-class learning problem
+val numClasses = 2
+// all binary features
+val catFeatures = Map(Vector.tabulate(features.head.length) { j => (j, 
2) } : _*)
+
+// Chi-squared split quality with a p-value threshold of 0.01 should 
allow
+// only the first feature to be used since the others are uncorrelated 
noise
+val train: DataFrame = TreeTests.setMetadata(rdd, catFeatures, 
numClasses)
+val dt = new DecisionTreeClassifier()
+  .setImpurity("chisquared")
+  .setMaxDepth(5)
+  .setMinInfoGain(0.01)
+val treeModel = dt.fit(train)
+
+// The tree should use exactly one of the 3 features: featue(0)
--- End diff --

nit: feature


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812803
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

you cannot guarantee compatibility with existing code here, since you would 
break the bytecode either way.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812596
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
+throw new UnsupportedOperationException("Impurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean = false
+}
+
+/**
+ * :: DeveloperApi ::
+ * Utility functions for Impurity measures
+ */
+@Since("2.0.0")
+@DeveloperApi
--- End diff --

there is no need for this object to be publicly exposed?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812468
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
--- End diff --

scala: do not add a default implementation, it causes issues with java 
compatibility


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2017-03-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/13440#discussion_r104812484
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Impurity.scala ---
@@ -50,6 +50,50 @@ trait Impurity extends Serializable {
   @Since("1.0.0")
   @DeveloperApi
   def calculate(count: Double, sum: Double, sumSquares: Double): Double
+
+  /**
+   * :: DeveloperApi ::
+   * Compute a test-statistic p-value quality measure from left and right 
split populations
+   * @param calcL impurity calculator for the left split population
+   * @param calcR impurity calculator for the right split population
+   * @return The p-value for the null hypothesis; that left and right 
split populations
+   * represent the same distribution
+   * @note Unless overridden this method will fail with an exception, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
+throw new UnsupportedOperationException("Impurity.calculate")
+
+  /**
+   * :: DeveloperApi ::
+   * Determine if this impurity measure is a test-statistic measure
+   * @return True if this is a split quality measure based on a test 
statistic (i.e. returns a
+   * p-value) or false otherwise.
+   * @note Unless overridden this method returns false by default, for 
backward compatability
+   */
+  @Since("2.0.0")
+  @DeveloperApi
+  def isTestStatistic: Boolean = false
--- End diff --

scala: do not add a default implementation, it causes issues with java 
compatibility


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

2016-06-01 Thread erikerlandson
GitHub user erikerlandson opened a pull request:

https://github.com/apache/spark/pull/13440

[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for 
measuring split quality

## What changes were proposed in this pull request?
Using test statistics as a measure of decision tree split quality is a 
useful split halting measure that can yield improved model quality. I am 
proposing to add the chi-squared test statistic as a new impurity option (in 
addition to "gini" and "entropy") for classification decision trees and 
ensembles.

https://issues.apache.org/jira/browse/SPARK-15699


http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/

## How was this patch tested?
I added unit testing to verify that the chi-squared "impurity" measure 
functions as expected when used for decision tree training.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/erikerlandson/spark chisquared_split_quality

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13440.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13440


commit 04c131647f9cbc8208b755c6d08f16a1056f171c
Author: Erik Erlandson 
Date:   2016-05-31T21:44:22Z

Implement a Chi-Squared test statistic option for measuring split quality 
when training decision trees




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org