[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63774229
  
  [Test build #23663 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23663/consoleFull)
 for   PR 3374 at commit 
[`7097251`](https://github.com/apache/spark/commit/70972515085245957df9601e425141746f268c4b).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63774234
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23663/
Test PASSed.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-20 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63777328
  
@manishamde @jkbradley Thanks! Merged into master and branch-1.2.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/3374





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread mengxr
GitHub user mengxr opened a pull request:

https://github.com/apache/spark/pull/3374

[SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc

There are some inconsistencies in the gradient boosting APIs. The target is 
a general boosting meta-algorithm, but the implementation is attached to trees. 
This was partially due to the delay of SPARK-1856. But for the 1.2 release, we 
should make the APIs consistent.

1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel
1. GradientBoosting -> GradientBoostedTrees
1. Add RandomForestModel and GradientBoostedTreesModel and hide 
CombiningStrategy
1. Slightly refactored TreeEnsembleModel
1. Remove `trainClassifier` and `trainRegressor` from 
`GradientBoostedTrees` because they are the same as `train`
1. Rename class `train` method to `run` because it hides the static methods 
with the same name in Java. Deprecated `DecisionTree.run` class method.
1. Simplify BoostingStrategy and make sure the input strategy is not 
modified. Users should put algo and numClasses in treeStrategy. We create 
ensembleStrategy inside boosting.
1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
1. doc updates
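
As a rough illustration of the `train` to `run` rename described above, the following minimal sketch keeps `train` only on the companion object and lets it delegate to the instance method `run`, so the two names no longer collide when called from Java. `SketchStrategy`, `SketchModel`, and `SketchBoostedTrees` are hypothetical stand-ins, not the real MLlib classes, and the "training" is a placeholder.

```scala
// Hypothetical stand-ins for BoostingStrategy and GradientBoostedTreesModel.
// The strategy is an immutable case class, so training cannot modify it.
final case class SketchStrategy(numIterations: Int, learningRate: Double)
final case class SketchModel(treeWeights: Vector[Double])

// The instance method is named `run`, leaving the name `train` to the
// companion object only.
class SketchBoostedTrees(private val strategy: SketchStrategy) {
  def run(input: Seq[Double]): SketchModel =
    // Placeholder "training": one weight per iteration.
    SketchModel(Vector.fill(strategy.numIterations)(strategy.learningRate))
}

object SketchBoostedTrees {
  // Static-style entry point: build the trainer, then delegate to `run`.
  def train(input: Seq[Double], strategy: SketchStrategy): SketchModel =
    new SketchBoostedTrees(strategy).run(input)
}

object RunVsTrainDemo extends App {
  val model = SketchBoostedTrees.train(Seq(1.0, 2.0), SketchStrategy(3, 0.1))
  println(model.treeWeights.size)
}
```

The same shape appears in the diff under review, where `GradientBoostedTrees.train` calls `new GradientBoostedTrees(boostingStrategy).run(input)`.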

@manishamde @jkbradley

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mengxr/spark SPARK-4486

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3374.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3374


commit 19030a5edf8acc90010d2430fcf5c46d4389d86a
Author: Xiangrui Meng m...@databricks.com
Date:   2014-11-19T21:17:01Z

update boosting public APIs

commit 751da4e16a1fea86398abdb37ecb33b2b8f723a8
Author: Xiangrui Meng m...@databricks.com
Date:   2014-11-19T22:09:11Z

rename class method train -> run

commit ea4c467474ff488d4f4367edeb008cf2c042fc64
Author: Xiangrui Meng m...@databricks.com
Date:   2014-11-19T22:25:51Z

fix unit tests

commit 4aae3b761c5e98d19d6a5bf6b8a425f4bb4d2ebc
Author: Xiangrui Meng m...@databricks.com
Date:   2014-11-19T23:25:16Z

add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy







[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20620079
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala
 ---
@@ -23,104 +23,95 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.configuration.Algo._
 import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, 
Strategy}
 import org.apache.spark.mllib.tree.impurity.Variance
-import org.apache.spark.mllib.tree.loss.{SquaredError, LogLoss}
+import org.apache.spark.mllib.tree.loss.{AbsoluteError, SquaredError, 
LogLoss}
 
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 
 /**
- * Test suite for [[GradientBoosting]].
+ * Test suite for [[GradientBoostedTrees]].
  */
-class GradientBoostingSuite extends FunSuite with MLlibTestSparkContext {
+class GradientBoostedTreesSuite extends FunSuite with 
MLlibTestSparkContext {
 
   test("Regression with continuous features: SquaredError") {
-GradientBoostingSuite.testCombinations.foreach {
+GradientBoostedTreesSuite.testCombinations.foreach {
   case (numIterations, learningRate, subsamplingRate) =>
 val arr = 
EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-val rdd = sc.parallelize(arr)
-val categoricalFeaturesInfo = Map.empty[Int, Int]
+val rdd = sc.parallelize(arr, 2)
 
-val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 
1, x.features))
 val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
-  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo,
-  subsamplingRate = subsamplingRate)
-
-val dt = DecisionTree.train(remappedInput, treeStrategy)
-
-val boostingStrategy = new BoostingStrategy(Regression, 
numIterations, SquaredError,
-  learningRate, 1, treeStrategy)
+  categoricalFeaturesInfo = Map.empty, subsamplingRate = 
subsamplingRate)
+val boostingStrategy =
+  new BoostingStrategy(treeStrategy, SquaredError, numIterations, 
learningRate)
 
-val gbt = GradientBoosting.trainRegressor(rdd, boostingStrategy)
-assert(gbt.weakHypotheses.size === numIterations)
-val gbtTree = gbt.weakHypotheses(0)
+val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
 
+assert(gbt.trees.size === numIterations)
 EnsembleTestHelper.validateRegressor(gbt, arr, 0.03)
 
+val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 
1, x.features))
+val dt = DecisionTree.train(remappedInput, treeStrategy)
+
 // Make sure trees are the same.
-assert(gbtTree.toString == dt.toString)
+assert(gbt.trees.head.toString == dt.toString)
 }
   }
 
   test("Regression with continuous features: Absolute Error") {
-GradientBoostingSuite.testCombinations.foreach {
+GradientBoostedTreesSuite.testCombinations.foreach {
   case (numIterations, learningRate, subsamplingRate) =>
 val arr = 
EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-val rdd = sc.parallelize(arr)
-val categoricalFeaturesInfo = Map.empty[Int, Int]
+val rdd = sc.parallelize(arr, 2)
 
-val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 
1, x.features))
 val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
-  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo,
-  subsamplingRate = subsamplingRate)
-
-val dt = DecisionTree.train(remappedInput, treeStrategy)
+  categoricalFeaturesInfo = Map.empty, subsamplingRate = 
subsamplingRate)
+val boostingStrategy =
+  new BoostingStrategy(treeStrategy, AbsoluteError, numIterations, 
learningRate)
 
-val boostingStrategy = new BoostingStrategy(Regression, 
numIterations, SquaredError,
--- End diff --

It was `SquaredError` before.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63742362
  
  [Test build #23643 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23643/consoleFull)
 for   PR 3374 at commit 
[`4aae3b7`](https://github.com/apache/spark/commit/4aae3b761c5e98d19d6a5bf6b8a425f4bb4d2ebc).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63744101
  
Will we have to rename ```GradientBoostedTrees``` back to 
```GradientBoosting``` when we add generic weak learner support? I think we 
should not modify the name of the algorithm and make it tree-specific to avoid 
renaming it in the future.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63744703
  
@manishamde The current impl is attached to trees. Even if we rename it 
back to `GradientBoosting`, it has to live under `mllib.tree` instead of 
`mllib.ensemble`. When we have a generalized boosting implementation in the 
future, we won't need to rename `GradientBoostedTrees`. Instead, we can add 
`mllib.ensemble.GradientBoosting` and let `tree.GradientBoostedTrees` extend 
that.
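
A minimal sketch of the layout proposed in this comment, assuming a generic boosting base class that a tree-specific subclass extends. Nothing here exists in Spark: the objects `ensemble` and `tree` stand in for the packages, `Leaf` is a hypothetical constant-valued weak learner, and the loop is plain squared-error boosting on residuals.

```scala
// Stand-in for a future mllib.ensemble package (hypothetical).
object ensemble {
  // Generic boosting over any weak learner type L.
  abstract class GradientBoosting[L] {
    protected def fit(points: Seq[(Double, Double)]): L // (feature, target)
    protected def predict(learner: L, feature: Double): Double

    def run(data: Seq[(Double, Double)], numIterations: Int, learningRate: Double): Seq[L] = {
      var current = data.map(_ => 0.0) // running ensemble prediction per point
      val learners = Seq.newBuilder[L]
      for (_ <- 0 until numIterations) {
        // For squared error, the negative gradient is the residual y - F(x).
        val residuals = data.zip(current).map { case ((x, y), f) => (x, y - f) }
        val learner = fit(residuals)
        learners += learner
        current = data.zip(current).map { case ((x, _), f) =>
          f + learningRate * predict(learner, x)
        }
      }
      learners.result()
    }
  }
}

// Stand-in for mllib.tree: the tree-specific class keeps its name and
// extends the generic one.
object tree {
  final case class Leaf(value: Double) // depth-0 "tree" predicting a constant

  class GradientBoostedTrees extends ensemble.GradientBoosting[Leaf] {
    protected def fit(points: Seq[(Double, Double)]): Leaf =
      Leaf(points.map(_._2).sum / points.size)
    protected def predict(learner: Leaf, feature: Double): Double = learner.value
  }
}

object EnsembleSketch extends App {
  val data = Seq((0.0, 1.0), (1.0, 3.0))
  val learners = new tree.GradientBoostedTrees().run(data, numIterations = 2, learningRate = 1.0)
  println(learners.size) // one learner per iteration
}
```

Under this arrangement, code written against `tree.GradientBoostedTrees` would keep compiling after a generic class is introduced, which is the point of not renaming it later.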





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20621257
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
  *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
  *Running with those losses will likely behave reasonably, but lacks 
the same guarantees.
  *
- * @param boostingStrategy Parameters for the gradient boosting algorithm
+ * @param boostingStrategy Parameters for the gradient boosting algorithm.
  */
 @Experimental
-class GradientBoosting (
-private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
-
-  boostingStrategy.weakLearnerParams.algo = Regression
-  boostingStrategy.weakLearnerParams.impurity = impurity.Variance
-
-  // Ensure values for weak learner are the same as what is provided to 
the boosting algorithm.
-  boostingStrategy.weakLearnerParams.numClassesForClassification =
-boostingStrategy.numClassesForClassification
-
-  boostingStrategy.assertValid()
+class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
+  extends Serializable with Logging {
 
   /**
* Method to train a gradient boosting model
* @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
*/
-  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
-val algo = boostingStrategy.algo
+  def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
+val algo = boostingStrategy.treeStrategy.algo
 algo match {
-  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Regression => GradientBoostedTrees.boost(input, 
boostingStrategy)
   case Classification =>
 // Map labels to -1, +1 so binary classification can be treated as 
regression.
 val remappedInput = input.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
-GradientBoosting.boost(remappedInput, boostingStrategy)
+GradientBoostedTrees.boost(remappedInput, boostingStrategy)
   case _ =>
 throw new IllegalArgumentException(s"$algo is not supported by the 
gradient boosting.")
 }
   }
 
+  /**
+   * Java-friendly API for 
[[org.apache.spark.mllib.tree.GradientBoostedTrees!#run]].
+   */
+  def run(input: JavaRDD[LabeledPoint]): GradientBoostedTreesModel = {
+run(input.rdd)
+  }
 }
 
 
-object GradientBoosting extends Logging {
+object GradientBoostedTrees extends Logging {
 
   /**
* Method to train a gradient boosting model.
*
-   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
-   *   is recommended to clearly specify regression.
-   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting$#trainClassifier]]
-   *   is recommended to clearly specify regression.
-   *
* @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
*  For classification, labels should take values {0, 1, 
..., numClasses-1}.
*  For regression, labels are real numbers.
* @param boostingStrategy Configuration options for the boosting 
algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
--- End diff --

Very minor nit: "A gradient boosted trees model that can be used for 
prediction."





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63744889
  
@mengxr The plan to move to mllib.ensemble namespace with a new class 
sounds good to me.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63746922
  
Should the `trainClassifier` and `trainRegressor` methods from the 
`DecisionTree` and `RandomForest` classes also be deprecated?





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20622307
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
  *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
--- End diff --

Should now read something like this: "but tree predictions are not computed 
accurately for LogLoss or AbsoluteError loss functions."





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20622629
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
  *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
  *Running with those losses will likely behave reasonably, but lacks 
the same guarantees.
  *
- * @param boostingStrategy Parameters for the gradient boosting algorithm
+ * @param boostingStrategy Parameters for the gradient boosting algorithm.
  */
 @Experimental
-class GradientBoosting (
-private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
-
-  boostingStrategy.weakLearnerParams.algo = Regression
-  boostingStrategy.weakLearnerParams.impurity = impurity.Variance
-
-  // Ensure values for weak learner are the same as what is provided to 
the boosting algorithm.
-  boostingStrategy.weakLearnerParams.numClassesForClassification =
-boostingStrategy.numClassesForClassification
-
-  boostingStrategy.assertValid()
+class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
+  extends Serializable with Logging {
 
   /**
* Method to train a gradient boosting model
* @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
*/
-  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
-val algo = boostingStrategy.algo
+  def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
+val algo = boostingStrategy.treeStrategy.algo
 algo match {
-  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Regression => GradientBoostedTrees.boost(input, 
boostingStrategy)
   case Classification =>
 // Map labels to -1, +1 so binary classification can be treated as 
regression.
 val remappedInput = input.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
-GradientBoosting.boost(remappedInput, boostingStrategy)
+GradientBoostedTrees.boost(remappedInput, boostingStrategy)
   case _ =>
 throw new IllegalArgumentException(s"$algo is not supported by the 
gradient boosting.")
 }
   }
 
+  /**
+   * Java-friendly API for 
[[org.apache.spark.mllib.tree.GradientBoostedTrees!#run]].
+   */
+  def run(input: JavaRDD[LabeledPoint]): GradientBoostedTreesModel = {
+run(input.rdd)
+  }
 }
 
 
-object GradientBoosting extends Logging {
+object GradientBoostedTrees extends Logging {
 
   /**
* Method to train a gradient boosting model.
*
-   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
-   *   is recommended to clearly specify regression.
-   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting$#trainClassifier]]
-   *   is recommended to clearly specify regression.
-   *
* @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
*  For classification, labels should take values {0, 1, 
..., numClasses-1}.
*  For regression, labels are real numbers.
* @param boostingStrategy Configuration options for the boosting 
algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
*/
   def train(
   input: RDD[LabeledPoint],
-  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-new GradientBoosting(boostingStrategy).train(input)
+  boostingStrategy: BoostingStrategy): GradientBoostedTreesModel = {
+new GradientBoostedTrees(boostingStrategy).run(input)
   }
 
   /**
-   * Method to train a gradient boosting classification model.
-   *
-   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
-   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
-   *  For regression, labels are real numbers.
-   * @param boostingStrategy Configuration options for the boosting 
algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
-   */
-  def trainClassifier(
-  input: RDD[LabeledPoint],
-  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-val algo = boostingStrategy.algo
-require(algo == Classification, s"Only Classification algo supported. 
Provided 

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20622816
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/TreeEnsembleModel.scala 
---
@@ -0,0 +1,182 @@
+/*
--- End diff --

Shouldn't the class name start with lower case according to Scala 
conventions? Also ```treeEnsembleModels.scala``` might be more appropriate.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20623463
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/TreeEnsembleModel.scala 
---
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.api.java.JavaRDD
+
+import scala.collection.mutable
+
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Represents a random forest model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or 
Regression
+ * @param trees tree ensembles
+ */
+@Experimental
+class RandomForestModel(override val algo: Algo, override val trees: 
Array[DecisionTreeModel])
+  extends TreeEnsembleModel(algo, trees, Array.fill(trees.size)(1.0),
+combiningStrategy = if (algo == Classification) Vote else Average) {
+
+  require(trees.forall(_.algo == algo))
+}
+
+/**
+ * :: Experimental ::
+ * Represents a gradient boosted trees model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or 
Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ */
+@Experimental
+class GradientBoostedTreesModel(
+override val algo: Algo,
+override val trees: Array[DecisionTreeModel],
+override val treeWeights: Array[Double])
+  extends TreeEnsembleModel(algo, trees, treeWeights, combiningStrategy = 
Sum) {
+
+  require(trees.size == treeWeights.size)
+}
+
+/**
+ * Represents a tree ensemble model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or 
Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ * @param combiningStrategy strategy for combining the predictions, not 
used for regression.
+ */
+private[tree] sealed class TreeEnsembleModel(
+protected val algo: Algo,
+protected val trees: Array[DecisionTreeModel],
+protected val treeWeights: Array[Double],
+protected val combiningStrategy: EnsembleCombiningStrategy) extends 
Serializable {
+
+  require(numTrees > 0, "TreeEnsembleModel cannot be created without 
trees.")
+
+  private val sumWeights = math.max(treeWeights.sum, 1e-15)
+
+  /**
+   * Predicts for a single data point using the weighted sum of ensemble 
predictions.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+val treePredictions = trees.map(learner => learner.predict(features))
+blas.ddot(numTrees, treePredictions, 1, treeWeights, 1)
+  }
+
+  /**
+   * Classifies a single data point based on (weighted) majority votes.
+   */
+  private def predictByVoting(features: Vector): Double = {
+val votes = mutable.Map.empty[Int, Double]
+trees.view.zip(treeWeights).foreach { case (tree, weight) =>
+  val prediction = tree.predict(features).toInt
+  votes(prediction) = votes.getOrElse(prediction, 0.0) + weight
+}
+votes.maxBy(_._2)._1
+  }
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  def predict(features: Vector): Double = {
+(algo, combiningStrategy) match {
+  case (Regression, Sum) =>
+predictBySumming(features)
+  case (Regression, 

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63750651
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23643/
Test PASSed.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63750642
  
  [Test build #23643 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23643/consoleFull)
 for   PR 3374 at commit 
[`4aae3b7`](https://github.com/apache/spark/commit/4aae3b761c5e98d19d6a5bf6b8a425f4bb4d2ebc).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20623750
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala
 ---
@@ -23,104 +23,95 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.configuration.Algo._
 import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, 
Strategy}
 import org.apache.spark.mllib.tree.impurity.Variance
-import org.apache.spark.mllib.tree.loss.{SquaredError, LogLoss}
+import org.apache.spark.mllib.tree.loss.{AbsoluteError, SquaredError, 
LogLoss}
 
 import org.apache.spark.mllib.util.MLlibTestSparkContext
 
 /**
- * Test suite for [[GradientBoosting]].
+ * Test suite for [[GradientBoostedTrees]].
  */
-class GradientBoostingSuite extends FunSuite with MLlibTestSparkContext {
+class GradientBoostedTreesSuite extends FunSuite with 
MLlibTestSparkContext {
 
   test("Regression with continuous features: SquaredError") {
-GradientBoostingSuite.testCombinations.foreach {
+GradientBoostedTreesSuite.testCombinations.foreach {
   case (numIterations, learningRate, subsamplingRate) =>
 val arr = 
EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-val rdd = sc.parallelize(arr)
-val categoricalFeaturesInfo = Map.empty[Int, Int]
+val rdd = sc.parallelize(arr, 2)
 
-val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 
1, x.features))
 val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
-  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo,
-  subsamplingRate = subsamplingRate)
-
-val dt = DecisionTree.train(remappedInput, treeStrategy)
-
-val boostingStrategy = new BoostingStrategy(Regression, 
numIterations, SquaredError,
-  learningRate, 1, treeStrategy)
+  categoricalFeaturesInfo = Map.empty, subsamplingRate = 
subsamplingRate)
+val boostingStrategy =
+  new BoostingStrategy(treeStrategy, SquaredError, numIterations, 
learningRate)
 
-val gbt = GradientBoosting.trainRegressor(rdd, boostingStrategy)
-assert(gbt.weakHypotheses.size === numIterations)
-val gbtTree = gbt.weakHypotheses(0)
+val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
 
+assert(gbt.trees.size === numIterations)
 EnsembleTestHelper.validateRegressor(gbt, arr, 0.03)
 
+val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 
1, x.features))
+val dt = DecisionTree.train(remappedInput, treeStrategy)
+
 // Make sure trees are the same.
-assert(gbtTree.toString == dt.toString)
+assert(gbt.trees.head.toString == dt.toString)
 }
   }
 
   test("Regression with continuous features: Absolute Error") {
-GradientBoostingSuite.testCombinations.foreach {
+GradientBoostedTreesSuite.testCombinations.foreach {
   case (numIterations, learningRate, subsamplingRate) =>
 val arr = 
EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-val rdd = sc.parallelize(arr)
-val categoricalFeaturesInfo = Map.empty[Int, Int]
+val rdd = sc.parallelize(arr, 2)
 
-val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 
1, x.features))
 val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
-  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo,
-  subsamplingRate = subsamplingRate)
-
-val dt = DecisionTree.train(remappedInput, treeStrategy)
+  categoricalFeaturesInfo = Map.empty, subsamplingRate = 
subsamplingRate)
+val boostingStrategy =
+  new BoostingStrategy(treeStrategy, AbsoluteError, numIterations, 
learningRate)
 
-val boostingStrategy = new BoostingStrategy(Regression, 
numIterations, SquaredError,
--- End diff --

Thanks for fixing this. I am taking a look at it.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20624623
  

Here are my findings. I added two more test cases with numIterations = 100.

```
numIterations = 10, learningRate = 1.0, subsamplingRate = 1.0
metric = 0.8405
numIterations = 100, learningRate = 1.0, subsamplingRate = 1.0
metric = 0.5344090056285183
numIterations = 10, learningRate = 0.1, subsamplingRate = 1.0
metric = 0.08384
numIterations = 10, learningRate = 1.0, subsamplingRate = 0.75
metric = 0.8102205882352937
numIterations = 100, learningRate = 1.0, subsamplingRate = 0.75
metric = 0.565608647936787
numIterations = 10, learningRate = 0.1, subsamplingRate 

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20627362
  

@manishamde Thanks for checking this test! Let's fix it in a separate PR. 
We are going to cut a release candidate, and I hope we can update the API before 
that. Let me know when you finish a pass; I will update the PR following your 
suggestions.



[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63763093
  
Completed my pass. LGTM! :+1: 





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20628796
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
  *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
--- End diff --

@manishamde The current explanation is correct for the original Gradient 
Boosting algorithm, which uses weak hypothesis weights and is oblivious to the 
weak learner being used.  Your suggested explanation is really for TreeBoost, 
Friedman's improvement to the original algorithm which is specialized for trees 
(which we should add at some point but isn't what we're claiming to have now, 
I'd say).
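
To make the "weak hypothesis weights" point concrete, here is a minimal sketch in plain Scala. It is not the MLlib code, and the object and method names are made up for illustration. It assumes squared error, where the line search for the weight of a fitted weak hypothesis has a closed form and does not care what kind of learner produced the hypothesis.

```scala
// Hypothetical illustration only: none of these names exist in Spark MLlib.
object WeakHypothesisWeight {

  /** Negative gradient of 0.5 * (y - f)^2 with respect to f, i.e. the pseudo-residuals. */
  def pseudoResiduals(labels: Array[Double], preds: Array[Double]): Array[Double] =
    labels.zip(preds).map { case (y, f) => y - f }

  /**
   * Weight w minimizing sum_i (r_i - w * h_i)^2.
   * For squared error this is the line-search step of the generic algorithm.
   */
  def optimalWeight(residuals: Array[Double], h: Array[Double]): Double = {
    val num = residuals.zip(h).map { case (r, hi) => r * hi }.sum
    val den = h.map(hi => hi * hi).sum
    if (den == 0.0) 0.0 else num / den
  }

  def main(args: Array[String]): Unit = {
    val labels = Array(1.0, 2.0, 3.0, 4.0)
    val preds = Array(0.0, 0.0, 0.0, 0.0) // current ensemble prediction F(x)
    val h = Array(0.5, 1.0, 1.5, 2.0)     // output of some weak hypothesis, any learner
    val w = optimalWeight(pseudoResiduals(labels, preds), h)
    println(s"weight = $w")               // prints "weight = 2.0"
  }
}
```

Only the outputs of the weak hypothesis enter the computation, which is the sense in which the original algorithm is oblivious to the weak learner.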





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63765215
  
  [Test build #23662 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23662/consoleFull)
 for   PR 3374 at commit 
[`98dea09`](https://github.com/apache/spark/commit/98dea097f226578762caca41ad708efb85f00d64).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20628996
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
  *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
--- End diff --

@jkbradley Agreed. That said, I am not sure whether the predictions of other 
weak learners, such as LR, change with the loss function. Let's refine this 
later.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629011
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel
  * Notes:
  *  - This currently can be run with several loss functions.  However, 
only SquaredError is
  *fully supported.  Specifically, the loss function should be used to 
compute the gradient
- *(to re-label training instances on each iteration) and to weight 
weak hypotheses.
+ *(to re-label training instances on each iteration) and to weight 
tree ensembles.
  *Currently, gradients are computed correctly for the available loss 
functions,
- *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
- *Running with those losses will likely behave reasonably, but lacks 
the same guarantees.
+ *but tree predictions are not computed correctly for LogLoss or 
AbsoluteError since they
--- End diff --

(copying comment here since it was on an outdated diff)
The original explanation is correct for the original Gradient Boosting 
algorithm, which uses weak hypothesis weights and is oblivious to the weak 
learner being used. This updated explanation is really for TreeBoost, 
Friedman's improvement to the original algorithm which is specialized for trees 
(which we should add at some point but isn't what we're claiming to have now, 
I'd say).  So I think the original explanation is more accurate since we do not 
claim to implement TreeBoost.
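
For contrast, a small sketch of the TreeBoost-style view, again in plain Scala with hypothetical names and made-up numbers. It assumes that the values falling into one leaf are the residuals to be fitted, and it compares the constant that a variance-based tree would predict (the mean) with the constant that minimizes absolute error (the median).

```scala
// Hypothetical illustration only: not part of Spark MLlib.
object LeafValueSketch {

  def mean(xs: Array[Double]): Double = xs.sum / xs.length

  def median(xs: Array[Double]): Double = {
    val s = xs.sorted
    val n = s.length
    if (n % 2 == 1) s(n / 2) else (s(n / 2 - 1) + s(n / 2)) / 2.0
  }

  /** Total absolute error when every value in the leaf is predicted as c. */
  def absoluteLoss(xs: Array[Double], c: Double): Double =
    xs.map(x => math.abs(x - c)).sum

  def main(args: Array[String]): Unit = {
    // Residuals that fall into one leaf; the outlier pulls the mean away.
    val leafResiduals = Array(1.0, 2.0, 3.0, 4.0, 100.0)
    val m = mean(leafResiduals)
    val med = median(leafResiduals)
    println(s"mean = $m, absolute loss = ${absoluteLoss(leafResiduals, m)}")
    println(s"median = $med, absolute loss = ${absoluteLoss(leafResiduals, med)}")
  }
}
```

On this sample the median gives a lower absolute loss than the mean, which is the per-leaf correction a TreeBoost implementation would apply for AbsoluteError.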





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629031
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel
  * Notes:
  *  - This currently can be run with several loss functions.  However, 
only SquaredError is
  *fully supported.  Specifically, the loss function should be used to 
compute the gradient
- *(to re-label training instances on each iteration) and to weight 
weak hypotheses.
+ *(to re-label training instances on each iteration) and to weight 
tree ensembles.
  *Currently, gradients are computed correctly for the available loss 
functions,
- *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
- *Running with those losses will likely behave reasonably, but lacks 
the same guarantees.
+ *but tree predictions are not computed correctly for LogLoss or 
AbsoluteError since they
--- End diff --

Agree.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629126
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---
@@ -387,7 +386,7 @@ object RandomForest extends Serializable with Logging {
   impurity: String,
   maxDepth: Int,
   maxBins: Int,
-  seed: Int): WeightedEnsembleModel = {
+  seed: Int): TreeEnsembleModel = {
--- End diff --

Should be RandomForestModel.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629452
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala 
---
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import scala.collection.mutable
+
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Represents a random forest model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or 
Regression
+ * @param trees tree ensembles
+ */
+@Experimental
+class RandomForestModel(override val algo: Algo, override val trees: 
Array[DecisionTreeModel])
+  extends TreeEnsembleModel(algo, trees, Array.fill(trees.size)(1.0),
+combiningStrategy = if (algo == Classification) Vote else Average) {
+
+  require(trees.forall(_.algo == algo))
+}
+
+/**
+ * :: Experimental ::
+ * Represents a gradient boosted trees model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or 
Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ */
+@Experimental
+class GradientBoostedTreesModel(
+override val algo: Algo,
+override val trees: Array[DecisionTreeModel],
+override val treeWeights: Array[Double])
+  extends TreeEnsembleModel(algo, trees, treeWeights, combiningStrategy = 
Sum) {
+
+  require(trees.size == treeWeights.size)
+}
+
+/**
+ * Represents a tree ensemble model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or 
Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ * @param combiningStrategy strategy for combining the predictions, not 
used for regression.
+ */
+private[tree] sealed class TreeEnsembleModel(
+protected val algo: Algo,
+protected val trees: Array[DecisionTreeModel],
+protected val treeWeights: Array[Double],
+protected val combiningStrategy: EnsembleCombiningStrategy) extends 
Serializable {
+
+  require(numTrees > 0, "TreeEnsembleModel cannot be created without 
trees.")
+
+  private val sumWeights = math.max(treeWeights.sum, 1e-15)
+
+  /**
+   * Predicts for a single data point using the weighted sum of ensemble 
predictions.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+val treePredictions = trees.map(learner => learner.predict(features))
--- End diff --

Could use _.predict(features)
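
A quick sketch of the two spellings side by side, in case it helps. `StubTree` is a hypothetical stand-in and not the MLlib `DecisionTreeModel`; the point is only that the placeholder form is equivalent to the named-parameter lambda.

```scala
// Hypothetical illustration only: StubTree is not an MLlib class.
object PlaceholderSyntax {

  final case class StubTree(offset: Double) {
    def predict(features: Array[Double]): Double = features.sum + offset
  }

  def main(args: Array[String]): Unit = {
    val trees = Array(StubTree(0.0), StubTree(1.0))
    val features = Array(1.0, 2.0)

    // Named parameter, as in the diff.
    val explicit = trees.map(learner => learner.predict(features))
    // Placeholder syntax, as suggested.
    val short = trees.map(_.predict(features))

    assert(explicit.sameElements(short))
    println(short.mkString(", "))
  }
}
```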





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629451
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel
  * Notes:
  *  - This currently can be run with several loss functions.  However, 
only SquaredError is
  *fully supported.  Specifically, the loss function should be used to 
compute the gradient
- *(to re-label training instances on each iteration) and to weight 
weak hypotheses.
+ *(to re-label training instances on each iteration) and to weight 
tree ensembles.
  *Currently, gradients are computed correctly for the available loss 
functions,
- *but weak hypothesis weights are not computed correctly for LogLoss 
or AbsoluteError.
- *Running with those losses will likely behave reasonably, but lacks 
the same guarantees.
+ *but tree predictions are not computed correctly for LogLoss or 
AbsoluteError since they
+ *use the mean of the samples at each leaf node.  Running with those 
losses will likely behave
+ *reasonably, but lacks the same guarantees.
  *
- * @param boostingStrategy Parameters for the gradient boosting algorithm
+ * @param boostingStrategy Parameters for the gradient boosting algorithm.
  */
 @Experimental
-class GradientBoosting (
-private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
-
-  boostingStrategy.weakLearnerParams.algo = Regression
-  boostingStrategy.weakLearnerParams.impurity = impurity.Variance
-
-  // Ensure values for weak learner are the same as what is provided to 
the boosting algorithm.
-  boostingStrategy.weakLearnerParams.numClassesForClassification =
-boostingStrategy.numClassesForClassification
-
-  boostingStrategy.assertValid()
+class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
+  extends Serializable with Logging {
 
   /**
* Method to train a gradient boosting model
* @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
*/
-  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
-val algo = boostingStrategy.algo
+  def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
+val algo = boostingStrategy.treeStrategy.algo
 algo match {
-  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Regression => GradientBoostedTrees.boost(input, 
boostingStrategy)
   case Classification =>
 // Map labels to -1, +1 so binary classification can be treated as 
regression.
 val remappedInput = input.map(x => new LabeledPoint((x.label * 2) 
- 1, x.features))
-GradientBoosting.boost(remappedInput, boostingStrategy)
+GradientBoostedTrees.boost(remappedInput, boostingStrategy)
   case _ =>
 throw new IllegalArgumentException(s"$algo is not supported by the 
gradient boosting.")
 }
   }
 
+  /**
+   * Java-friendly API for 
[[org.apache.spark.mllib.tree.GradientBoostedTrees!#run]].
+   */
+  def run(input: JavaRDD[LabeledPoint]): GradientBoostedTreesModel = {
+run(input.rdd)
+  }
 }
 
 
-object GradientBoosting extends Logging {
+object GradientBoostedTrees extends Logging {
 
   /**
* Method to train a gradient boosting model.
*
-   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
-   *   is recommended to clearly specify regression.
-   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting$#trainClassifier]]
-   *   is recommended to clearly specify regression.
-   *
* @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
*  For classification, labels should take values {0, 1, 
..., numClasses-1}.
*  For regression, labels are real numbers.
* @param boostingStrategy Configuration options for the boosting 
algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
*/
   def train(
   input: RDD[LabeledPoint],
-  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-new GradientBoosting(boostingStrategy).train(input)
+  boostingStrategy: BoostingStrategy): GradientBoostedTreesModel = {
+new GradientBoostedTrees(boostingStrategy).run(input)
   }
 
   /**
-   * Method to train a 

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63766642
  
@mengxr Thanks for the updates!  Just added a few small comments.  Other 
than those, LGTM





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629858
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---
@@ -387,7 +386,7 @@ object RandomForest extends Serializable with Logging {
   impurity: String,
   maxDepth: Int,
   maxBins: Int,
-  seed: Int): WeightedEnsembleModel = {
+  seed: Int): TreeEnsembleModel = {
--- End diff --

thanks for catching this!





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629860
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import scala.collection.mutable
+
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Represents a random forest model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or Regression
+ * @param trees tree ensembles
+ */
+@Experimental
+class RandomForestModel(override val algo: Algo, override val trees: Array[DecisionTreeModel])
+  extends TreeEnsembleModel(algo, trees, Array.fill(trees.size)(1.0),
+combiningStrategy = if (algo == Classification) Vote else Average) {
+
+  require(trees.forall(_.algo == algo))
+}
+
+/**
+ * :: Experimental ::
+ * Represents a gradient boosted trees model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ */
+@Experimental
+class GradientBoostedTreesModel(
+override val algo: Algo,
+override val trees: Array[DecisionTreeModel],
+override val treeWeights: Array[Double])
+  extends TreeEnsembleModel(algo, trees, treeWeights, combiningStrategy = Sum) {
+
+  require(trees.size == treeWeights.size)
+}
+
+/**
+ * Represents a tree ensemble model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ * @param combiningStrategy strategy for combining the predictions, not used for regression.
+ */
+private[tree] sealed class TreeEnsembleModel(
+protected val algo: Algo,
+protected val trees: Array[DecisionTreeModel],
+protected val treeWeights: Array[Double],
+protected val combiningStrategy: EnsembleCombiningStrategy) extends Serializable {
+
+  require(numTrees > 0, "TreeEnsembleModel cannot be created without trees.")
+
+  private val sumWeights = math.max(treeWeights.sum, 1e-15)
+
+  /**
+   * Predicts for a single data point using the weighted sum of ensemble predictions.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+val treePredictions = trees.map(learner => learner.predict(features))
--- End diff --

done
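
The `predictBySumming` method and the `sumWeights` field quoted in the diff above suggest how the ensemble combines per-tree outputs. The following is only a minimal sketch of that idea, not the code from the pull request: `EnsembleSketch`, the `Tree` type alias, and `predictByAveraging` are hypothetical names introduced here, and a tree is modeled as a plain function rather than a `DecisionTreeModel`.

```scala
// Illustrative sketch only; names below are assumptions, not Spark APIs.
object EnsembleSketch {
  // Hypothetical stand-in for a trained tree: features in, prediction out.
  type Tree = Array[Double] => Double

  // Sum strategy: weighted sum of the per-tree predictions.
  def predictBySumming(
      trees: Array[Tree],
      weights: Array[Double],
      features: Array[Double]): Double = {
    require(trees.length == weights.length, "each tree needs exactly one weight")
    trees.zip(weights).map { case (tree, w) => w * tree(features) }.sum
  }

  // Average strategy (assumed): the weighted sum divided by the total weight,
  // with the same 1e-15 floor as sumWeights in the diff to avoid dividing by zero.
  def predictByAveraging(
      trees: Array[Tree],
      weights: Array[Double],
      features: Array[Double]): Double = {
    val sumWeights = math.max(weights.sum, 1e-15)
    predictBySumming(trees, weights, features) / sumWeights
  }
}
```

With every weight set to 1.0, as `RandomForestModel` does via `Array.fill(trees.size)(1.0)`, the averaging variant reduces to the plain mean of the tree predictions.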





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/3374#discussion_r20629859
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel
  * Notes:
  *  - This currently can be run with several loss functions.  However, only SquaredError is
  *    fully supported.  Specifically, the loss function should be used to compute the gradient
- *    (to re-label training instances on each iteration) and to weight weak hypotheses.
+ *    (to re-label training instances on each iteration) and to weight tree ensembles.
  *    Currently, gradients are computed correctly for the available loss functions,
- *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
- *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
+ *    but tree predictions are not computed correctly for LogLoss or AbsoluteError since they
+ *    use the mean of the samples at each leaf node.  Running with those losses will likely behave
+ *    reasonably, but lacks the same guarantees.
  *
- * @param boostingStrategy Parameters for the gradient boosting algorithm
+ * @param boostingStrategy Parameters for the gradient boosting algorithm.
  */
 @Experimental
-class GradientBoosting (
-private val boostingStrategy: BoostingStrategy) extends Serializable with Logging {
-
-  boostingStrategy.weakLearnerParams.algo = Regression
-  boostingStrategy.weakLearnerParams.impurity = impurity.Variance
-
-  // Ensure values for weak learner are the same as what is provided to the boosting algorithm.
-  boostingStrategy.weakLearnerParams.numClassesForClassification =
-boostingStrategy.numClassesForClassification
-
-  boostingStrategy.assertValid()
+class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
+  extends Serializable with Logging {
 
   /**
* Method to train a gradient boosting model
* @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
*/
-  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
-val algo = boostingStrategy.algo
+  def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
+val algo = boostingStrategy.treeStrategy.algo
 algo match {
-  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Regression => GradientBoostedTrees.boost(input, boostingStrategy)
   case Classification =>
 // Map labels to -1, +1 so binary classification can be treated as regression.
 val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
-GradientBoosting.boost(remappedInput, boostingStrategy)
+GradientBoostedTrees.boost(remappedInput, boostingStrategy)
   case _ =>
 throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
 }
   }
 
+  /**
+   * Java-friendly API for [[org.apache.spark.mllib.tree.GradientBoostedTrees!#run]].
+   */
+  def run(input: JavaRDD[LabeledPoint]): GradientBoostedTreesModel = {
+run(input.rdd)
+  }
 }
 
 
-object GradientBoosting extends Logging {
+object GradientBoostedTrees extends Logging {
 
   /**
* Method to train a gradient boosting model.
*
-   * Note: Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
-   *   is recommended to clearly specify regression.
-   *   Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainClassifier]]
-   *   is recommended to clearly specify regression.
-   *
* @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
*  For classification, labels should take values {0, 1, ..., numClasses-1}.
*  For regression, labels are real numbers.
* @param boostingStrategy Configuration options for the boosting algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
*/
   def train(
   input: RDD[LabeledPoint],
-  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-new GradientBoosting(boostingStrategy).train(input)
+  boostingStrategy: BoostingStrategy): GradientBoostedTreesModel = {
+new GradientBoostedTrees(boostingStrategy).run(input)
   }
 
   /**
-   * Method to train a 
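
The Classification branch of `run` in the diff above remaps binary labels with `(x.label * 2) - 1` so that boosting can treat the problem as regression. A minimal sketch of that remapping follows, without any Spark dependency. `LabelRemapSketch`, `toSigned`, and `toBinary` are hypothetical names, and thresholding the score at zero to recover a {0, 1} label is an assumption about prediction time that the quoted diff does not show.

```scala
// Illustrative sketch only; names below are assumptions, not Spark APIs.
object LabelRemapSketch {
  // Map a binary label in {0, 1} to {-1, +1}, mirroring (x.label * 2) - 1.
  def toSigned(label: Double): Double = (label * 2) - 1

  // Assumed inverse at prediction time: a positive ensemble score maps to
  // class 1, anything else to class 0.
  def toBinary(score: Double): Double = if (score > 0) 1.0 else 0.0
}
```

For example, `LabelRemapSketch.toSigned(0.0)` yields `-1.0` and `LabelRemapSketch.toSigned(1.0)` yields `1.0`.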

[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63767980
  
  [Test build #23663 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23663/consoleFull)
 for   PR 3374 at commit 
[`7097251`](https://github.com/apache/spark/commit/70972515085245957df9601e425141746f268c4b).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63771201
  
  [Test build #23662 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23662/consoleFull)
 for   PR 3374 at commit 
[`98dea09`](https://github.com/apache/spark/commit/98dea097f226578762caca41ad708efb85f00d64).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...

2014-11-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3374#issuecomment-63771207
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23662/
Test PASSed.

