[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63774229 [Test build #23663 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23663/consoleFull) for PR 3374 at commit [`7097251`](https://github.com/apache/spark/commit/70972515085245957df9601e425141746f268c4b). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63774234 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23663/
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63777328 @manishamde @jkbradley Thanks! Merged into master and branch-1.2.
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3374
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/3374

[SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc

There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.

1. `WeightedEnsembleModel` -> `private[tree] TreeEnsembleModel`
2. `GradientBoosting` -> `GradientBoostedTrees`
3. Add `RandomForestModel` and `GradientBoostedTreesModel` and hide `CombiningStrategy`.
4. Slightly refactored `TreeEnsembleModel`.
5. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`.
6. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.run` class method.
7. Simplify `BoostingStrategy` and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
8. Fix a bug in `GradientBoostedTreesSuite` with `AbsoluteError`.
9. doc updates

@manishamde @jkbradley

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-4486

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/3374.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3374

commit 19030a5edf8acc90010d2430fcf5c46d4389d86a
Author: Xiangrui Meng m...@databricks.com
Date: 2014-11-19T21:17:01Z
update boosting public APIs

commit 751da4e16a1fea86398abdb37ecb33b2b8f723a8
Author: Xiangrui Meng m...@databricks.com
Date: 2014-11-19T22:09:11Z
rename class method train -> run

commit ea4c467474ff488d4f4367edeb008cf2c042fc64
Author: Xiangrui Meng m...@databricks.com
Date: 2014-11-19T22:25:51Z
fix unit tests

commit 4aae3b761c5e98d19d6a5bf6b8a425f4bb4d2ebc
Author: Xiangrui Meng m...@databricks.com
Date: 2014-11-19T23:25:16Z
add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20620079

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala ---

```diff
@@ -23,104 +23,95 @@
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.configuration.Algo._
 import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, Strategy}
 import org.apache.spark.mllib.tree.impurity.Variance
-import org.apache.spark.mllib.tree.loss.{SquaredError, LogLoss}
+import org.apache.spark.mllib.tree.loss.{AbsoluteError, SquaredError, LogLoss}
 import org.apache.spark.mllib.util.MLlibTestSparkContext

 /**
- * Test suite for [[GradientBoosting]].
+ * Test suite for [[GradientBoostedTrees]].
  */
-class GradientBoostingSuite extends FunSuite with MLlibTestSparkContext {
+class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {

   test("Regression with continuous features: SquaredError") {
-    GradientBoostingSuite.testCombinations.foreach {
+    GradientBoostedTreesSuite.testCombinations.foreach {
       case (numIterations, learningRate, subsamplingRate) =>
         val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-        val rdd = sc.parallelize(arr)
-        val categoricalFeaturesInfo = Map.empty[Int, Int]
+        val rdd = sc.parallelize(arr, 2)

-        val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
         val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
-          numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
-          subsamplingRate = subsamplingRate)
-
-        val dt = DecisionTree.train(remappedInput, treeStrategy)
-
-        val boostingStrategy = new BoostingStrategy(Regression, numIterations, SquaredError,
-          learningRate, 1, treeStrategy)
+          categoricalFeaturesInfo = Map.empty, subsamplingRate = subsamplingRate)
+        val boostingStrategy =
+          new BoostingStrategy(treeStrategy, SquaredError, numIterations, learningRate)

-        val gbt = GradientBoosting.trainRegressor(rdd, boostingStrategy)
-        assert(gbt.weakHypotheses.size === numIterations)
-        val gbtTree = gbt.weakHypotheses(0)
+        val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
+        assert(gbt.trees.size === numIterations)

         EnsembleTestHelper.validateRegressor(gbt, arr, 0.03)

+        val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+        val dt = DecisionTree.train(remappedInput, treeStrategy)
+
         // Make sure trees are the same.
-        assert(gbtTree.toString == dt.toString)
+        assert(gbt.trees.head.toString == dt.toString)
     }
   }

   test("Regression with continuous features: Absolute Error") {
-    GradientBoostingSuite.testCombinations.foreach {
+    GradientBoostedTreesSuite.testCombinations.foreach {
       case (numIterations, learningRate, subsamplingRate) =>
         val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-        val rdd = sc.parallelize(arr)
-        val categoricalFeaturesInfo = Map.empty[Int, Int]
+        val rdd = sc.parallelize(arr, 2)

-        val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
         val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
-          numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
-          subsamplingRate = subsamplingRate)
-
-        val dt = DecisionTree.train(remappedInput, treeStrategy)
+          categoricalFeaturesInfo = Map.empty, subsamplingRate = subsamplingRate)
+        val boostingStrategy =
+          new BoostingStrategy(treeStrategy, AbsoluteError, numIterations, learningRate)

-        val boostingStrategy = new BoostingStrategy(Regression, numIterations, SquaredError,
```

--- End diff --

It was `SquaredError` before.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63742362 [Test build #23643 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23643/consoleFull) for PR 3374 at commit [`4aae3b7`](https://github.com/apache/spark/commit/4aae3b761c5e98d19d6a5bf6b8a425f4bb4d2ebc). * This patch merges cleanly.
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63744101 Will we have to rename ```GradientBoostedTrees``` back to ```GradientBoosting``` when we add generic weak learner support? I think we should not modify the name of the algorithm and make it tree-specific to avoid renaming it in the future.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63744703 @manishamde The current impl is attached to trees. Even if we rename it back to `GradientBoosting`, it has to live under `mllib.tree` instead of `mllib.ensemble`. When we have a generalized boosting implementation in the future, we don't need to rename `GradientBoostedTrees`. Instead, we can add `mllib.ensemble.GradientBoosting`, and let `tree.GradientBoostedTrees` extend that.
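The layering proposed here could be sketched as follows. This is purely illustrative Scala: the `WeakLearner` trait, the toy `Stump` learner, and these constructors are hypothetical stand-ins, not actual MLlib API.

```scala
// Hypothetical sketch of the layering discussed above: a generic boosting
// meta-algorithm in one class, with the tree-specific entry point extending it.
// None of these names are real MLlib API.

// A weak learner produces a model M from (label, feature) pairs (toy 1-D data).
trait WeakLearner[M] {
  def fit(data: Seq[(Double, Double)]): M
}

// The generic meta-algorithm: repeatedly fit the weak learner.
// (Real gradient boosting would refit on residuals; elided for brevity.)
class GradientBoosting[M](learner: WeakLearner[M], numIterations: Int) {
  def run(data: Seq[(Double, Double)]): Seq[M] =
    Seq.fill(numIterations)(learner.fit(data))
}

// A toy weak learner: predicts the mean feature value.
case class Stump(threshold: Double)
object StumpLearner extends WeakLearner[Stump] {
  def fit(data: Seq[(Double, Double)]): Stump =
    Stump(data.map(_._2).sum / data.size)
}

// The tree-specific class keeps its name and simply fixes the weak learner,
// so no rename is needed when the generic layer arrives.
class GradientBoostedTrees(numIterations: Int)
  extends GradientBoosting[Stump](StumpLearner, numIterations)
```

Under this arrangement only the generic superclass is new; `GradientBoostedTrees` keeps its public name and behavior.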
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20621257

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---

```diff
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
  *       but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
  *       Running with those losses will likely behave reasonably, but lacks the same guarantees.
  *
- * @param boostingStrategy Parameters for the gradient boosting algorithm
+ * @param boostingStrategy Parameters for the gradient boosting algorithm.
  */
 @Experimental
-class GradientBoosting (
-    private val boostingStrategy: BoostingStrategy) extends Serializable with Logging {
-
-  boostingStrategy.weakLearnerParams.algo = Regression
-  boostingStrategy.weakLearnerParams.impurity = impurity.Variance
-
-  // Ensure values for weak learner are the same as what is provided to the boosting algorithm.
-  boostingStrategy.weakLearnerParams.numClassesForClassification =
-    boostingStrategy.numClassesForClassification
-
-  boostingStrategy.assertValid()
+class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
+  extends Serializable with Logging {

   /**
    * Method to train a gradient boosting model
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
    */
-  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
-    val algo = boostingStrategy.algo
+  def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
+    val algo = boostingStrategy.treeStrategy.algo
     algo match {
-      case Regression => GradientBoosting.boost(input, boostingStrategy)
+      case Regression => GradientBoostedTrees.boost(input, boostingStrategy)
       case Classification =>
         // Map labels to -1, +1 so binary classification can be treated as regression.
         val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
-        GradientBoosting.boost(remappedInput, boostingStrategy)
+        GradientBoostedTrees.boost(remappedInput, boostingStrategy)
       case _ =>
         throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
     }
   }

+  /**
+   * Java-friendly API for [[org.apache.spark.mllib.tree.GradientBoostedTrees!#run]].
+   */
+  def run(input: JavaRDD[LabeledPoint]): GradientBoostedTreesModel = {
+    run(input.rdd)
+  }
 }

-object GradientBoosting extends Logging {
+object GradientBoostedTrees extends Logging {

   /**
    * Method to train a gradient boosting model.
    *
-   * Note: Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
-   *       is recommended to clearly specify regression.
-   *       Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainClassifier]]
-   *       is recommended to clearly specify regression.
-   *
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    *              For regression, labels are real numbers.
    * @param boostingStrategy Configuration options for the boosting algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
```

--- End diff --

Very minor nit: "A gradient boosted trees model that can be used for prediction."
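The label remapping in the `Classification` branch above deserves a note: `(label * 2) - 1` sends {0, 1} labels to {-1, +1}, so binary classification can be fed through the regression machinery. Shown on plain values (the sign-thresholding inverse at the end is an illustrative addition, not the MLlib prediction code):

```scala
// The remapping used in the Classification branch: binary {0, 1} labels
// become {-1, +1} regression targets.
val labels = Seq(0.0, 1.0, 1.0, 0.0)
val remapped = labels.map(label => (label * 2) - 1)

// Illustrative inverse: recover a {0, 1} prediction from a signed score.
val recovered = remapped.map(score => if (score > 0.0) 1.0 else 0.0)
```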
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63744889 @mengxr The plan to move to mllib.ensemble namespace with a new class sounds good to me.
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63746922 Should the ```trainClassifier``` and ```trainRegressor``` methods from the ```DecisionTree``` and ```RandomForest``` classes also be deprecated?
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20622307

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---

```diff
@@ -45,146 +43,92 @@ import org.apache.spark.storage.StorageLevel
  *       but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
```

--- End diff --

Should now read something like: "but tree predictions are not computed accurately for LogLoss or AbsoluteError loss functions."
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20622629

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala --- (quoting the same hunk as discussion_r20621257 above, continuing into the removed `trainClassifier`)

```diff
   def train(
       input: RDD[LabeledPoint],
-      boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-    new GradientBoosting(boostingStrategy).train(input)
+      boostingStrategy: BoostingStrategy): GradientBoostedTreesModel = {
+    new GradientBoostedTrees(boostingStrategy).run(input)
   }

   /**
-   * Method to train a gradient boosting classification model.
-   *
-   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
-   *              For regression, labels are real numbers.
-   * @param boostingStrategy Configuration options for the boosting algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
-   */
-  def trainClassifier(
-      input: RDD[LabeledPoint],
-      boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-    val algo = boostingStrategy.algo
-    require(algo == Classification, s"Only Classification algo supported. Provided
```
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20622816 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/TreeEnsembleModel.scala --- @@ -0,0 +1,182 @@ +/* --- End diff -- Shouldn't the file name start with lower case according to Scala conventions? Also ```treeEnsembleModels.scala``` might be more appropriate.
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20623463

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/TreeEnsembleModel.scala ---

```diff
@@ -0,0 +1,182 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.api.java.JavaRDD
+
+import scala.collection.mutable
+
+import com.github.fommil.netlib.BLAS.{getInstance => blas}
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+/**
+ * :: Experimental ::
+ * Represents a random forest model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or Regression
+ * @param trees tree ensembles
+ */
+@Experimental
+class RandomForestModel(override val algo: Algo, override val trees: Array[DecisionTreeModel])
+  extends TreeEnsembleModel(algo, trees, Array.fill(trees.size)(1.0),
+    combiningStrategy = if (algo == Classification) Vote else Average) {
+
+  require(trees.forall(_.algo == algo))
+}
+
+/**
+ * :: Experimental ::
+ * Represents a gradient boosted trees model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ */
+@Experimental
+class GradientBoostedTreesModel(
+    override val algo: Algo,
+    override val trees: Array[DecisionTreeModel],
+    override val treeWeights: Array[Double])
+  extends TreeEnsembleModel(algo, trees, treeWeights, combiningStrategy = Sum) {
+
+  require(trees.size == treeWeights.size)
+}
+
+/**
+ * Represents a tree ensemble model.
+ *
+ * @param algo algorithm for the ensemble model, either Classification or Regression
+ * @param trees tree ensembles
+ * @param treeWeights tree ensemble weights
+ * @param combiningStrategy strategy for combining the predictions, not used for regression.
+ */
+private[tree] sealed class TreeEnsembleModel(
+    protected val algo: Algo,
+    protected val trees: Array[DecisionTreeModel],
+    protected val treeWeights: Array[Double],
+    protected val combiningStrategy: EnsembleCombiningStrategy) extends Serializable {
+
+  require(numTrees > 0, "TreeEnsembleModel cannot be created without trees.")
+
+  private val sumWeights = math.max(treeWeights.sum, 1e-15)
+
+  /**
+   * Predicts for a single data point using the weighted sum of ensemble predictions.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+    val treePredictions = trees.map(learner => learner.predict(features))
+    blas.ddot(numTrees, treePredictions, 1, treeWeights, 1)
+  }
+
+  /**
+   * Classifies a single data point based on (weighted) majority votes.
+   */
+  private def predictByVoting(features: Vector): Double = {
+    val votes = mutable.Map.empty[Int, Double]
+    trees.view.zip(treeWeights).foreach { case (tree, weight) =>
+      val prediction = tree.predict(features).toInt
+      votes(prediction) = votes.getOrElse(prediction, 0.0) + weight
+    }
+    votes.maxBy(_._2)._1
+  }
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  def predict(features: Vector): Double = {
+    (algo, combiningStrategy) match {
+      case (Regression, Sum) =>
+        predictBySumming(features)
+      case (Regression,
```
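The two combiners in this file, `Sum` and `Vote`, can be illustrated without Spark or netlib BLAS. The following is a simplified standalone re-implementation for exposition only (the real code dispatches on `(algo, combiningStrategy)` and uses `blas.ddot` for the sum):

```scala
// Simplified versions of the ensemble combiners from the diff above.

// Sum: weighted sum of per-tree predictions
// (the blas.ddot call in the real code).
def predictBySumming(treePredictions: Array[Double], treeWeights: Array[Double]): Double =
  treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum

// Vote: weighted majority vote over discrete class predictions.
def predictByVoting(treePredictions: Array[Int], treeWeights: Array[Double]): Int = {
  val votes = scala.collection.mutable.Map.empty[Int, Double]
  treePredictions.zip(treeWeights).foreach { case (cls, weight) =>
    votes(cls) = votes.getOrElse(cls, 0.0) + weight
  }
  votes.maxBy(_._2)._1  // class with the largest total weight wins
}
```

This also shows why `RandomForestModel` picks `Vote` for classification and `Average` for regression, while `GradientBoostedTreesModel` always uses `Sum`: boosting's trees are stage-wise corrections whose weighted sum is the prediction, whereas a forest's trees are independent voters.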
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63750651 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23643/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63750642 [Test build #23643 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23643/consoleFull) for PR 3374 at commit [`4aae3b7`](https://github.com/apache/spark/commit/4aae3b761c5e98d19d6a5bf6b8a425f4bb4d2ebc). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20623750

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala ---

```diff
@@ -23,104 +23,95 @@
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.configuration.Algo._
 import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, Strategy}
 import org.apache.spark.mllib.tree.impurity.Variance
-import org.apache.spark.mllib.tree.loss.{SquaredError, LogLoss}
+import org.apache.spark.mllib.tree.loss.{AbsoluteError, SquaredError, LogLoss}
 import org.apache.spark.mllib.util.MLlibTestSparkContext

 /**
- * Test suite for [[GradientBoosting]].
+ * Test suite for [[GradientBoostedTrees]].
  */
-class GradientBoostingSuite extends FunSuite with MLlibTestSparkContext {
+class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {

   test("Regression with continuous features: SquaredError") {
-    GradientBoostingSuite.testCombinations.foreach {
+    GradientBoostedTreesSuite.testCombinations.foreach {
       case (numIterations, learningRate, subsamplingRate) =>
         val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-        val rdd = sc.parallelize(arr)
-        val categoricalFeaturesInfo = Map.empty[Int, Int]
+        val rdd = sc.parallelize(arr, 2)

-        val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
         val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
-          numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
-          subsamplingRate = subsamplingRate)
-
-        val dt = DecisionTree.train(remappedInput, treeStrategy)
-
-        val boostingStrategy = new BoostingStrategy(Regression, numIterations, SquaredError,
-          learningRate, 1, treeStrategy)
+          categoricalFeaturesInfo = Map.empty, subsamplingRate = subsamplingRate)
+        val boostingStrategy =
+          new BoostingStrategy(treeStrategy, SquaredError, numIterations, learningRate)

-        val gbt = GradientBoosting.trainRegressor(rdd, boostingStrategy)
-        assert(gbt.weakHypotheses.size === numIterations)
-        val gbtTree = gbt.weakHypotheses(0)
+        val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
+        assert(gbt.trees.size === numIterations)

         EnsembleTestHelper.validateRegressor(gbt, arr, 0.03)

+        val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+        val dt = DecisionTree.train(remappedInput, treeStrategy)
+
         // Make sure trees are the same.
-        assert(gbtTree.toString == dt.toString)
+        assert(gbt.trees.head.toString == dt.toString)
     }
   }

   test("Regression with continuous features: Absolute Error") {
-    GradientBoostingSuite.testCombinations.foreach {
+    GradientBoostedTreesSuite.testCombinations.foreach {
       case (numIterations, learningRate, subsamplingRate) =>
         val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-        val rdd = sc.parallelize(arr)
-        val categoricalFeaturesInfo = Map.empty[Int, Int]
+        val rdd = sc.parallelize(arr, 2)

-        val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
         val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
-          numClassesForClassification = 2, categoricalFeaturesInfo = categoricalFeaturesInfo,
-          subsamplingRate = subsamplingRate)
-
-        val dt = DecisionTree.train(remappedInput, treeStrategy)
+          categoricalFeaturesInfo = Map.empty, subsamplingRate = subsamplingRate)
+        val boostingStrategy =
+          new BoostingStrategy(treeStrategy, AbsoluteError, numIterations, learningRate)

-        val boostingStrategy = new BoostingStrategy(Regression, numIterations, SquaredError,
```

--- End diff --

Thanks for fixing this. I am taking a look at it.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20624623 (quoting the same `GradientBoostedTreesSuite.scala` hunk as above)

--- End diff --

Here are my findings. I added two more test cases with numIterations = 100.
```
numIterations = 10,  learningRate = 1.0, subsamplingRate = 1.0   metric = 0.8405
numIterations = 100, learningRate = 1.0, subsamplingRate = 1.0   metric = 0.5344090056285183
numIterations = 10,  learningRate = 0.1, subsamplingRate = 1.0   metric = 0.08384
numIterations = 10,  learningRate = 1.0, subsamplingRate = 0.75  metric = 0.8102205882352937
numIterations = 100, learningRate = 1.0, subsamplingRate = 0.75  metric = 0.565608647936787
numIterations = 10,  learningRate = 0.1, subsamplingRate
```
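The pattern in these numbers (lowering `learningRate` and adding iterations both reduce the error metric) matches standard gradient-boosting behavior. As a minimal, self-contained sketch of that behavior, here is squared-error boosting over regression stumps; this is an illustration of the algorithm only, not the MLlib implementation, and all names are hypothetical:

```python
def fit_stump(xs, residuals):
    # Depth-1 regression stump: choose the split threshold that minimizes
    # squared error, predicting the mean residual on each side.
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, thr, lm, rm)
    _, thr, lm, rm = best
    return lambda x: lm if x <= thr else rm

def boost(xs, ys, num_iterations, learning_rate):
    preds = [0.0] * len(xs)
    for _ in range(num_iterations):
        # For squared error the negative gradient is just the residual,
        # so each iteration re-labels the data with residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]
```

With this toy setup, many iterations at a small learning rate drive training MSE well below what a single full-step stump achieves, mirroring the `numIterations`/`learningRate` trade-off in the table.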
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20627362 (quoting the same `GradientBoostedTreesSuite.scala` hunk as above)

--- End diff --

@manishamde Thanks for checking this test! Let's fix it in a separate PR. We are going to cut a release candidate and I hope we can update the API before that. Let me know when you finish a pass, I will update the PR following your suggestions.
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63763093 Completed my pass. LGTM! :+1:
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20628796

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
```diff
@@ -45,146 +43,92 @@
 import org.apache.spark.storage.StorageLevel
 *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
```
--- End diff --

@manishamde The current explanation is correct for the original Gradient Boosting algorithm, which uses weak hypothesis weights and is oblivious to the weak learner being used. Your suggested explanation is really for TreeBoost, Friedman's improvement to the original algorithm which is specialized for trees (which we should add at some point but isn't what we're claiming to have now, I'd say).
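The distinction discussed here can be made concrete with a toy sketch (hypothetical code, not MLlib): in the original algorithm the model is a weighted sum of black-box weak hypotheses, while TreeBoost drops the per-tree scalar weight and instead re-optimizes each tree's leaf constants for the loss:

```python
# Original gradient boosting: F(x) = sum_m w_m * h_m(x), where each h_m is an
# opaque weak hypothesis and w_m is a single scalar weight per iteration.
def predict_generic(x, hypotheses, weights):
    return sum(w * h(x) for h, w in zip(hypotheses, weights))

# TreeBoost (Friedman): no global per-tree weight; each leaf carries its own
# constant, chosen to minimize the loss over the points falling in that leaf.
# A "tree" here is just a depth-1 stump: (threshold, left_value, right_value).
def predict_treeboost(x, trees):
    return sum(left if x <= thr else right for thr, left, right in trees)
```

For squared error the two formulations coincide (the optimal leaf constant equals the mean residual scaled by the weight), which is consistent with the PR's note that only SquaredError is fully supported.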
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63765215 [Test build #23662 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23662/consoleFull) for PR 3374 at commit [`98dea09`](https://github.com/apache/spark/commit/98dea097f226578762caca41ad708efb85f00d64). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20628996 (on the same `GradientBoostedTrees.scala` doc line as above)

--- End diff --

@jkbradley Agree. Having said that, I am not sure whether the algorithm predictions are changed or not based upon the loss function in other weak learners such as LR. Let's refine this later.
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629011

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
```diff
@@ -40,151 +39,98 @@
 import org.apache.spark.storage.StorageLevel
 * Notes:
 *  - This currently can be run with several loss functions. However, only SquaredError is
 *    fully supported. Specifically, the loss function should be used to compute the gradient
-*    (to re-label training instances on each iteration) and to weight weak hypotheses.
+*    (to re-label training instances on each iteration) and to weight tree ensembles.
 *    Currently, gradients are computed correctly for the available loss functions,
-*    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
-*    Running with those losses will likely behave reasonably, but lacks the same guarantees.
+*    but tree predictions are not computed correctly for LogLoss or AbsoluteError since they
```
--- End diff --

(copying comment here since it was on an outdated diff) The original explanation is correct for the original Gradient Boosting algorithm, which uses weak hypothesis weights and is oblivious to the weak learner being used. This updated explanation is really for TreeBoost, Friedman's improvement to the original algorithm which is specialized for trees (which we should add at some point but isn't what we're claiming to have now, I'd say). So I think the original explanation is more accurate since we do not claim to implement TreeBoost.
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user manishamde commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629031 (on the same `GradientBoostedTrees.scala` doc lines as above)

--- End diff --

Agree.
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629126

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala ---
```diff
@@ -387,7 +386,7 @@
     impurity: String,
     maxDepth: Int,
     maxBins: Int,
-    seed: Int): WeightedEnsembleModel = {
+    seed: Int): TreeEnsembleModel = {
```
--- End diff --

RandomForestModel
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629452

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala ---

```scala
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.mllib.tree.model

import scala.collection.mutable

import com.github.fommil.netlib.BLAS.{getInstance => blas}

import org.apache.spark.annotation.Experimental
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.configuration.Algo._
import org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
import org.apache.spark.rdd.RDD

/**
 * :: Experimental ::
 * Represents a random forest model.
 *
 * @param algo algorithm for the ensemble model, either Classification or Regression
 * @param trees tree ensembles
 */
@Experimental
class RandomForestModel(override val algo: Algo, override val trees: Array[DecisionTreeModel])
  extends TreeEnsembleModel(algo, trees, Array.fill(trees.size)(1.0),
    combiningStrategy = if (algo == Classification) Vote else Average) {

  require(trees.forall(_.algo == algo))
}

/**
 * :: Experimental ::
 * Represents a gradient boosted trees model.
 *
 * @param algo algorithm for the ensemble model, either Classification or Regression
 * @param trees tree ensembles
 * @param treeWeights tree ensemble weights
 */
@Experimental
class GradientBoostedTreesModel(
    override val algo: Algo,
    override val trees: Array[DecisionTreeModel],
    override val treeWeights: Array[Double])
  extends TreeEnsembleModel(algo, trees, treeWeights, combiningStrategy = Sum) {

  require(trees.size == treeWeights.size)
}

/**
 * Represents a tree ensemble model.
 *
 * @param algo algorithm for the ensemble model, either Classification or Regression
 * @param trees tree ensembles
 * @param treeWeights tree ensemble weights
 * @param combiningStrategy strategy for combining the predictions, not used for regression.
 */
private[tree] sealed class TreeEnsembleModel(
    protected val algo: Algo,
    protected val trees: Array[DecisionTreeModel],
    protected val treeWeights: Array[Double],
    protected val combiningStrategy: EnsembleCombiningStrategy) extends Serializable {

  require(numTrees > 0, "TreeEnsembleModel cannot be created without trees.")

  private val sumWeights = math.max(treeWeights.sum, 1e-15)

  /**
   * Predicts for a single data point using the weighted sum of ensemble predictions.
   *
   * @param features array representing a single data point
   * @return predicted category from the trained model
   */
  private def predictBySumming(features: Vector): Double = {
    val treePredictions = trees.map(learner => learner.predict(features))
```

--- End diff --

Could use _.predict(features)
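The Sum/Average/Vote combining strategies that `TreeEnsembleModel` dispatches on can be sketched in a few lines. This is an illustrative, language-agnostic rendition of the idea, not the MLlib code itself; the function and strategy names are hypothetical:

```python
from collections import Counter

def combine(tree_preds, weights, strategy):
    """Combine per-tree predictions for one point, mirroring the three
    EnsembleCombiningStrategy cases: Sum (gradient boosting), Average
    (random forest regression), Vote (random forest classification)."""
    if strategy == "sum":
        return sum(w * p for w, p in zip(weights, tree_preds))
    if strategy == "average":
        # Guard against a zero weight sum, as the Scala code does with 1e-15.
        return sum(w * p for w, p in zip(weights, tree_preds)) / max(sum(weights), 1e-15)
    if strategy == "vote":
        # Weighted majority vote over predicted classes.
        votes = Counter()
        for w, p in zip(weights, tree_preds):
            votes[p] += w
        return votes.most_common(1)[0][0]
    raise ValueError(f"unknown strategy: {strategy}")
```

Note that a random forest uses unit weights (`Array.fill(trees.size)(1.0)` above), so "average" reduces to a plain mean, while boosting keeps per-tree weights and sums.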
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629451

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---

```diff
@@ -40,151 +39,98 @@
 import org.apache.spark.storage.StorageLevel
 * Notes:
 *  - This currently can be run with several loss functions. However, only SquaredError is
 *    fully supported. Specifically, the loss function should be used to compute the gradient
-*    (to re-label training instances on each iteration) and to weight weak hypotheses.
+*    (to re-label training instances on each iteration) and to weight tree ensembles.
 *    Currently, gradients are computed correctly for the available loss functions,
-*    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
-*    Running with those losses will likely behave reasonably, but lacks the same guarantees.
+*    but tree predictions are not computed correctly for LogLoss or AbsoluteError since they
+*    use the mean of the samples at each leaf node. Running with those losses will likely behave
+*    reasonably, but lacks the same guarantees.
 *
- * @param boostingStrategy Parameters for the gradient boosting algorithm
+ * @param boostingStrategy Parameters for the gradient boosting algorithm.
 */
 @Experimental
-class GradientBoosting (
-    private val boostingStrategy: BoostingStrategy) extends Serializable with Logging {
-
-  boostingStrategy.weakLearnerParams.algo = Regression
-  boostingStrategy.weakLearnerParams.impurity = impurity.Variance
-
-  // Ensure values for weak learner are the same as what is provided to the boosting algorithm.
-  boostingStrategy.weakLearnerParams.numClassesForClassification =
-    boostingStrategy.numClassesForClassification
-
-  boostingStrategy.assertValid()
+class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
+  extends Serializable with Logging {

   /**
    * Method to train a gradient boosting model
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
    */
-  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
-    val algo = boostingStrategy.algo
+  def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
+    val algo = boostingStrategy.treeStrategy.algo
     algo match {
-      case Regression => GradientBoosting.boost(input, boostingStrategy)
+      case Regression => GradientBoostedTrees.boost(input, boostingStrategy)
       case Classification =>
         // Map labels to -1, +1 so binary classification can be treated as regression.
         val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
-        GradientBoosting.boost(remappedInput, boostingStrategy)
+        GradientBoostedTrees.boost(remappedInput, boostingStrategy)
       case _ =>
         throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
     }
   }

+  /**
+   * Java-friendly API for [[org.apache.spark.mllib.tree.GradientBoostedTrees!#run]].
+   */
+  def run(input: JavaRDD[LabeledPoint]): GradientBoostedTreesModel = {
+    run(input.rdd)
+  }
 }

-object GradientBoosting extends Logging {
+object GradientBoostedTrees extends Logging {

   /**
    * Method to train a gradient boosting model.
    *
-   * Note: Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
-   *       is recommended to clearly specify regression.
-   *       Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainClassifier]]
-   *       is recommended to clearly specify regression.
-   *
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
    *              For classification, labels should take values {0, 1, ..., numClasses-1}.
    *              For regression, labels are real numbers.
    * @param boostingStrategy Configuration options for the boosting algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
    */
   def train(
       input: RDD[LabeledPoint],
-      boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-    new GradientBoosting(boostingStrategy).train(input)
+      boostingStrategy: BoostingStrategy): GradientBoostedTreesModel = {
+    new GradientBoostedTrees(boostingStrategy).run(input)
   }

   /**
-   * Method to train a
```
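The `Classification` branch in the diff above reduces binary classification to regression by remapping labels to {-1, +1}. A tiny sketch of that reduction and its inverse at prediction time (illustrative Python, not the Spark API):

```python
def remap_label(label):
    # Binary labels {0, 1} -> regression targets {-1, +1}, so the
    # squared-error boosting machinery can be reused for classification.
    return label * 2 - 1

def to_class(score):
    # Invert the mapping at prediction time: a positive ensemble score
    # corresponds to class 1, a non-positive score to class 0.
    return 1 if score > 0 else 0
```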
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63766642 @mengxr Thanks for the updates! Just added a few small comments. Other than those, LGTM
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629858 (on the same `RandomForest.scala` line as above)

--- End diff --

thanks for catching this!
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629860 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala --- @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the License); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an AS IS BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.tree.model + +import scala.collection.mutable + +import com.github.fommil.netlib.BLAS.{getInstance = blas} + +import org.apache.spark.annotation.Experimental +import org.apache.spark.api.java.JavaRDD +import org.apache.spark.mllib.linalg.Vector +import org.apache.spark.mllib.tree.configuration.Algo._ +import org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._ +import org.apache.spark.rdd.RDD + +/** + * :: Experimental :: + * Represents a random forest model. 
+ * + * @param algo algorithm for the ensemble model, either Classification or Regression + * @param trees tree ensembles + */ +@Experimental +class RandomForestModel(override val algo: Algo, override val trees: Array[DecisionTreeModel]) + extends TreeEnsembleModel(algo, trees, Array.fill(trees.size)(1.0), +combiningStrategy = if (algo == Classification) Vote else Average) { + + require(trees.forall(_.algo == algo)) +} + +/** + * :: Experimental :: + * Represents a gradient boosted trees model. + * + * @param algo algorithm for the ensemble model, either Classification or Regression + * @param trees tree ensembles + * @param treeWeights tree ensemble weights + */ +@Experimental +class GradientBoostedTreesModel( +override val algo: Algo, +override val trees: Array[DecisionTreeModel], +override val treeWeights: Array[Double]) + extends TreeEnsembleModel(algo, trees, treeWeights, combiningStrategy = Sum) { + + require(trees.size == treeWeights.size) +} + +/** + * Represents a tree ensemble model. + * + * @param algo algorithm for the ensemble model, either Classification or Regression + * @param trees tree ensembles + * @param treeWeights tree ensemble weights + * @param combiningStrategy strategy for combining the predictions, not used for regression. + */ +private[tree] sealed class TreeEnsembleModel( +protected val algo: Algo, +protected val trees: Array[DecisionTreeModel], +protected val treeWeights: Array[Double], +protected val combiningStrategy: EnsembleCombiningStrategy) extends Serializable { + + require(numTrees 0, TreeEnsembleModel cannot be created without trees.) + + private val sumWeights = math.max(treeWeights.sum, 1e-15) + + /** + * Predicts for a single data point using the weighted sum of ensemble predictions. 
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+    val treePredictions = trees.map(learner => learner.predict(features))

--- End diff --

done

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
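For readers skimming the diff: `TreeEnsembleModel` combines per-tree predictions according to its `EnsembleCombiningStrategy` — `Sum` for gradient boosting, `Average` for regression forests, and `Vote` for classification forests (with `sumWeights` floored at `1e-15` to guard the division). A minimal Python sketch of those three rules, purely illustrative (the actual implementation is the Scala above, which uses netlib BLAS for the weighted sum; these function names are not Spark API):

```python
def predict_by_summing(tree_predictions, tree_weights):
    """Weighted sum of per-tree predictions (combiningStrategy = Sum)."""
    return sum(p * w for p, w in zip(tree_predictions, tree_weights))

def predict_by_averaging(tree_predictions, tree_weights):
    """Weighted average (combiningStrategy = Average)."""
    sum_weights = max(sum(tree_weights), 1e-15)  # avoid division by zero
    return predict_by_summing(tree_predictions, tree_weights) / sum_weights

def predict_by_voting(tree_predictions, tree_weights):
    """Weighted majority vote over predicted classes (combiningStrategy = Vote)."""
    votes = {}
    for p, w in zip(tree_predictions, tree_weights):
        votes[p] = votes.get(p, 0.0) + w
    return max(votes, key=votes.get)
```

Since `RandomForestModel` passes `Array.fill(trees.size)(1.0)` as the weights, `Vote` reduces to a plain majority vote and `Average` to the plain mean for forests.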
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3374#discussion_r20629859

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala ---
@@ -40,151 +39,98 @@ import org.apache.spark.storage.StorageLevel
  * Notes:
  *  - This currently can be run with several loss functions.  However, only SquaredError is
  *    fully supported.  Specifically, the loss function should be used to compute the gradient
- *    (to re-label training instances on each iteration) and to weight weak hypotheses.
+ *    (to re-label training instances on each iteration) and to weight tree ensembles.
  *    Currently, gradients are computed correctly for the available loss functions,
- *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
- *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
+ *    but tree predictions are not computed correctly for LogLoss or AbsoluteError since they
+ *    use the mean of the samples at each leaf node.  Running with those losses will likely
+ *    behave reasonably, but lacks the same guarantees.
  *
- * @param boostingStrategy Parameters for the gradient boosting algorithm
+ * @param boostingStrategy Parameters for the gradient boosting algorithm.
  */
 @Experimental
-class GradientBoosting (
-    private val boostingStrategy: BoostingStrategy) extends Serializable with Logging {
-
-  boostingStrategy.weakLearnerParams.algo = Regression
-  boostingStrategy.weakLearnerParams.impurity = impurity.Variance
-
-  // Ensure values for weak learner are the same as what is provided to the boosting algorithm.
-  boostingStrategy.weakLearnerParams.numClassesForClassification =
-    boostingStrategy.numClassesForClassification
-
-  boostingStrategy.assertValid()
+class GradientBoostedTrees(private val boostingStrategy: BoostingStrategy)
+  extends Serializable with Logging {

   /**
    * Method to train a gradient boosting model
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
    */
-  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
-    val algo = boostingStrategy.algo
+  def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
+    val algo = boostingStrategy.treeStrategy.algo
     algo match {
-      case Regression => GradientBoosting.boost(input, boostingStrategy)
+      case Regression => GradientBoostedTrees.boost(input, boostingStrategy)
       case Classification =>
         // Map labels to -1, +1 so binary classification can be treated as regression.
         val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
-        GradientBoosting.boost(remappedInput, boostingStrategy)
+        GradientBoostedTrees.boost(remappedInput, boostingStrategy)
       case _ =>
         throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
     }
   }

+  /**
+   * Java-friendly API for [[org.apache.spark.mllib.tree.GradientBoostedTrees!#run]].
+   */
+  def run(input: JavaRDD[LabeledPoint]): GradientBoostedTreesModel = {
+    run(input.rdd)
+  }
 }

-object GradientBoosting extends Logging {
+object GradientBoostedTrees extends Logging {

   /**
    * Method to train a gradient boosting model.
    *
-   * Note: Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
-   *       is recommended to clearly specify regression.
-   *       Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainClassifier]]
-   *       is recommended to clearly specify regression.
-   *
    * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
   *              For regression, labels are real numbers.
   * @param boostingStrategy Configuration options for the boosting algorithm.
-   * @return WeightedEnsembleModel that can be used for prediction
+   * @return a gradient boosted trees model that can be used for prediction
   */
  def train(
      input: RDD[LabeledPoint],
-      boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
-    new GradientBoosting(boostingStrategy).train(input)
+      boostingStrategy: BoostingStrategy): GradientBoostedTreesModel = {
+    new GradientBoostedTrees(boostingStrategy).run(input)
  }

  /**
-   * Method to train a
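The remapping in `run` — `(x.label * 2) - 1` — is what lets binary classification reuse the regression boosting path: {0, 1} labels become {-1, +1}, the trees fit real-valued scores, and the sign of the summed ensemble score recovers the class. A minimal Python sketch of that idea (illustrative only; the actual decision logic lives in the Scala model classes, and these helper names are not Spark API):

```python
def remap_label(label):
    """Map a {0, 1} binary label to {-1, +1}, as GradientBoostedTrees.run does
    before handing classification data to the regression boosting path."""
    return label * 2 - 1

def to_class(ensemble_score):
    """Invert the remapping at prediction time: a positive summed score
    corresponds to class 1, a non-positive score to class 0."""
    return 1.0 if ensemble_score > 0 else 0.0
```

With this convention, each boosting iteration can keep re-labeling instances with gradients of a regression loss (e.g. SquaredError residuals) regardless of whether the original problem was classification or regression.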
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63767980 [Test build #23663 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23663/consoleFull) for PR 3374 at commit [`7097251`](https://github.com/apache/spark/commit/70972515085245957df9601e425141746f268c4b). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63771201 [Test build #23662 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23662/consoleFull) for PR 3374 at commit [`98dea09`](https://github.com/apache/spark/commit/98dea097f226578762caca41ad708efb85f00d64). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4486][MLLIB] Improve GradientBoosting A...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3374#issuecomment-63771207 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23662/ Test PASSed.