[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61067173
  
  [Test build #22531 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22531/consoleFull) for PR 2607 at commit [`035a2ed`](https://github.com/apache/spark/commit/035a2ed6bb09910a3e8a6593b3276b742cf7b7d2).
  * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61068006
  
  [Test build #22532 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22532/consoleFull) for PR 2607 at commit [`e33ab61`](https://github.com/apache/spark/commit/e33ab61bfc6e11de6f8a72368de660b90aea1345).
  * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61068865
  
  [Test build #22533 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22533/consoleFull) for PR 2607 at commit [`1c40c33`](https://github.com/apache/spark/commit/1c40c33e73d68edaa14b5573c5cdef2b591c6419).
  * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61069431
  
  [Test build #22534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22534/consoleFull) for PR 2607 at commit [`0183cb9`](https://github.com/apache/spark/commit/0183cb994641292020b1a28e890d5419105ae204).
  * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61069535
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22534/





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61069532
  
  [Test build #22534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22534/consoleFull) for PR 2607 at commit [`0183cb9`](https://github.com/apache/spark/commit/0183cb994641292020b1a28e890d5419105ae204).
  * This patch **fails Scala style tests**.
  * This patch merges cleanly.
  * This patch adds no public classes.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-30 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61069736
  
@jkbradley @codedeft I think I have implemented all the suggestions on the 
PR except for 1) the public API and 2) a warning when using non-SquaredError loss 
functions. I will work on them next.

In the latest version, I have simplified the algorithm implementation and 
removed the need for checkpointing, since the dataset for each iteration is 
calculated from the cached input training dataset and the partial GBT model 
up to that iteration. I think this implementation will avoid a lot of 
checkpointing/caching overhead and keep the code simpler. Could you 
please take a look at the logic of the ```boost``` method?
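As a rough illustration of the scheme described above, here is a minimal, self-contained sketch of a checkpoint-free boosting loop in which each iteration's pseudo-residuals are recomputed from the cached input plus the partial ensemble. All names here (`BoostSketch`, `Point`, the mean-valued stump) are hypothetical stand-ins, not the PR's actual `boost` implementation:

```scala
// Hypothetical sketch of the checkpoint-free boosting loop described above.
// `BoostSketch`, `Point`, and the mean-valued "stump" are illustrative
// stand-ins, not the PR's actual `boost` method or tree learner.
object BoostSketch {
  type Model = Double => Double                 // a fitted weak learner
  final case class Point(label: Double, features: Double)

  def boost(
      input: Seq[Point],                        // cached training set (an RDD in the PR)
      numIterations: Int,
      learningRate: Double,
      gradient: (Double, Double) => Double      // d(loss)/d(prediction)
  ): Seq[(Double, Model)] = {
    var ensemble = Vector.empty[(Double, Model)]
    for (_ <- 0 until numIterations) {
      // Pseudo-residuals are recomputed from the cached input plus the
      // partial ensemble, so no intermediate dataset needs checkpointing.
      val residuals = input.map { p =>
        val pred = ensemble.map { case (w, m) => w * m(p.features) }.sum
        p.copy(label = -gradient(pred, p.label))
      }
      // Trivial weak learner: predict the mean residual (a depth-0 "tree").
      val mean = residuals.map(_.label).sum / residuals.size
      val stump: Model = _ => mean
      ensemble = ensemble :+ ((learningRate, stump))
    }
    ensemble
  }
}
```

For squared error the gradient is `(p, l) => p - l`, so each stage fits the current residuals and is shrunk by the learning rate before being appended to the ensemble.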





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19563516
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+    private val boostingStrategy: BoostingStrategy) extends Serializable with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
+    val algo = boostingStrategy.algo
+    algo match {
+      case Regression => GradientBoosting.boost(input, boostingStrategy)
+      case Classification =>
+        val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+        GradientBoosting.boost(remappedInput, boostingStrategy)
+      case _ =>
+        throw new IllegalArgumentException(s"$algo is not supported by gradient boosting.")
+    }
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using [[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *       is recommended to clearly specify regression.
+   *       Using [[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *       is recommended to clearly specify classification.
+   *
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
+   *              For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+      input: RDD[LabeledPoint],
+      boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
+    new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   *              For classification, labels should take values {0, 1, ..., numClasses-1}.
+   *              For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In other words,
+   *                      the number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient boosting.
+   * @param maxDepth Maximum depth of the tree.
+   *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
+   * @param learningRate Learning rate for shrinking the 

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19563689
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala ---
@@ -26,7 +26,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{RandomForest, DecisionTree, impurity}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
-import org.apache.spark.mllib.tree.model.{RandomForestModel, DecisionTreeModel}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, DecisionTreeModel}
--- End diff --

I personally think that the boosted model can be a separate one from 
RandomForestModel. E.g., it's not inconceivable for boosted models to use 
RandomForestModel as their base learner.

And if this were a truly generic weighted ensemble model, then it could 
probably live outside of the tree.model namespace, since boosting, at least in 
theory, doesn't care whether the base learners are trees or not.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19564695
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
(quoted GradientBoosting.scala diff omitted; identical to the excerpt quoted earlier in this thread)

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19564926
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala ---
@@ -26,7 +26,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{RandomForest, DecisionTree, impurity}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
-import org.apache.spark.mllib.tree.model.{RandomForestModel, DecisionTreeModel}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, DecisionTreeModel}
--- End diff --

These generalizations will rely on the new ML API (for which there will be 
a PR any day now); it makes sense to keep it in the tree namespace since there 
is no generic Estimator concept currently.  But once we can, I agree it will 
be important to generalize meta-algorithms.

With respect to the models, I don't see how the model concepts are 
different.  The learning algorithms are different, but that will not prevent a 
meta-algorithm from using another meta-algorithm as a weak learner (once the new 
API is available).  (I think it's good to separate the concepts of Estimator 
(learning algorithm) and Transformer (learned model) here.)  What do you think?
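A minimal sketch of that Estimator/Transformer split; the trait names follow the discussion informally and are not the shipped spark.ml API:

```scala
// Hedged sketch of the Estimator (learning algorithm) / Transformer (learned
// model) separation discussed above. Names are illustrative only.
trait Transformer { def predict(features: Double): Double }
trait Estimator { def fit(data: Seq[(Double, Double)]): Transformer }

// A meta-algorithm only needs an Estimator; it does not care whether the
// weak learner is itself a meta-algorithm.
object MeanEstimator extends Estimator {
  def fit(data: Seq[(Double, Double)]): Transformer = {
    // data is (features, label) pairs; learn the mean label.
    val mean = data.map(_._2).sum / data.size
    new Transformer { def predict(features: Double): Double = mean }
  }
}
```

Under this split, a boosting meta-algorithm would be an Estimator parameterized by another Estimator, and would produce an ensemble Transformer.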





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19567364
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala ---
@@ -26,7 +26,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{RandomForest, DecisionTree, impurity}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
-import org.apache.spark.mllib.tree.model.{RandomForestModel, DecisionTreeModel}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, DecisionTreeModel}
--- End diff --

Yea, I guess from the design perspective, it's tempting to unify these 
under the same umbrella.

IMO, RandomForest is *mostly* a specific instance of a generic ensemble 
model, so this makes sense.

However, I think that boosted models have some specific things about them 
due to their sequential nature (as opposed to the parallel nature of RandomForest). 
E.g., if you have 1000 models, you can potentially predict based on the *first* 
100 models, whereas with RandomForest you can pick any 100. You also have to do 
overfitting/underfitting analyses on boosted models sequentially, etc.
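The prefix-prediction property described here can be sketched as follows; `WeightedEnsemble` is a hypothetical simplification for illustration, not the PR's actual `WeightedEnsembleModel` API:

```scala
// Illustrative only: shows why a boosted ensemble supports "first k stages"
// prediction while a forest's trees are order-free. Not the PR's
// WeightedEnsembleModel API.
final case class WeightedEnsemble(learners: Seq[(Double, Double => Double)]) {
  // Full prediction: weighted sum over all base learners.
  def predict(x: Double): Double =
    learners.map { case (w, f) => w * f(x) }.sum

  // Boosting-specific: use only the first k stages, e.g. to choose the best
  // iteration count on a validation set without retraining.
  def predictWithFirst(k: Int, x: Double): Double =
    learners.take(k).map { case (w, f) => w * f(x) }.sum
}
```

Because boosting fits each stage to the residuals of its predecessors, only such ordered prefixes are meaningful sub-ensembles; for a random forest, any subset of trees would do.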





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19567808
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
(quoted GradientBoosting.scala diff omitted; identical to the excerpt quoted earlier in this thread)

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19569087
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
(quoted GradientBoosting.scala diff omitted; identical to the excerpt quoted earlier in this thread)

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19569497
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
+val algo = boostingStrategy.algo
+algo match {
+  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Classification =>
+val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+GradientBoosting.boost(remappedInput, boostingStrategy)
+  case _ =>
+throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
+}
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *   is recommended to clearly specify regression.
+   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *   is recommended to clearly specify classification.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting 
algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+  input: RDD[LabeledPoint],
+  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
+new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In 
other words,
+   *  number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient 
boosting.
+   * @param maxDepth Maximum depth of the tree.
+   * E.g., depth 0 means 1 leaf node; depth 1 means 1 
internal node + 2 leaf nodes.
+   * @param learningRate Learning rate for shrinking the 
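A minimal, Spark-free sketch of the {0, 1} -> {-1, +1} label remapping performed in the Classification branch of the quoted train method (the `Point` case class here is a hypothetical stand-in for `LabeledPoint`, and a plain `Seq` stands in for the RDD):

```scala
// Stand-in for org.apache.spark.mllib.regression.LabeledPoint.
case class Point(label: Double, features: Array[Double])

val input = Seq(Point(0.0, Array(1.0)), Point(1.0, Array(2.0)))

// {0, 1} -> {-1, +1}: the transform applied before boosting a classifier,
// so LogLoss-style gradients can be computed on a signed label.
val remapped = input.map(x => Point((x.label * 2) - 1, x.features))

println(remapped.map(_.label).mkString(", "))  // prints "-1.0, 1.0"
```

The same expression appears again in the PR's test suite, so the remapping is easy to verify in isolation.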

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19569553
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -26,7 +26,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{RandomForest, DecisionTree, impurity}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
-import org.apache.spark.mllib.tree.model.{RandomForestModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
--- End diff --

@codedeft I started with a separate model for boosting but @jkbradley 
(quite correctly IMO) convinced me otherwise. :-)

I agree that methods like boosting require support such as early stopping, 
sequential selection of models, etc., but maybe we can handle that as part of 
the model configuration. In their combining operation, AdaBoost and RF are in 
some ways more similar than AdaBoost and GBT. It might be better to capture 
all these nuances in one place. Of course, we can always split them later if we 
end up writing a lot of custom logic for each algorithm. Thoughts?
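One way to see why a single ensemble model class can serve both RF and GBT: the two differ mainly in how weighted base-learner predictions are combined (averaged vs. summed). The sketch below uses hypothetical names; the PR's actual class is WeightedEnsembleModel with an EnsembleCombiningStrategy:

```scala
// Toy sketch: one ensemble class, two combining strategies.
// Names are illustrative, not the PR's real API.
sealed trait Combine
case object Sum extends Combine      // gradient boosting: weighted sum
case object Average extends Combine  // random forest: (weighted) average

case class Ensemble(predictors: Seq[Double => Double],
                    weights: Seq[Double],
                    combine: Combine) {
  def predict(x: Double): Double = {
    val weighted = predictors.zip(weights).map { case (f, w) => w * f(x) }
    combine match {
      case Sum     => weighted.sum
      case Average => weighted.sum / weights.sum
    }
  }
}

// Two constant "trees" standing in for trained base learners.
val trees: Seq[Double => Double] = Seq(_ => 1.0, _ => 3.0)
val gbt = Ensemble(trees, Seq(1.0, 0.5), Sum)      // 1.0*1.0 + 0.5*3.0 = 2.5
val rf  = Ensemble(trees, Seq(1.0, 1.0), Average)  // (1.0 + 3.0) / 2 = 2.0
println(s"${gbt.predict(0.0)} ${rf.predict(0.0)}")  // prints "2.5 2.0"
```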


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread codedeft
Github user codedeft commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19570062
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -26,7 +26,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{RandomForest, DecisionTree, impurity}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
-import org.apache.spark.mllib.tree.model.{RandomForestModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
--- End diff --

@manishamde Sounds good.

Just a side note: because RF models tend to be much bigger than boosted 
ensembles, we've encountered situations where the model was *too* big to fit in 
a single machine's memory. A RandomForest model lends itself to 
embarrassingly parallel prediction, so a model could potentially reside in a 
distributed fashion.

But we haven't yet decided whether we really want to do this (i.e., are 
humongous models really useful in practice, and do we really expect crazy 
scenarios of gigantic models surpassing dozens of GBs?).


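The "dozens of GBs" question is easy to sanity-check with a back-of-envelope estimate: a full binary tree of depth d has 2^(d+1) - 1 nodes. The per-node byte count below is an assumed ballpark (object header plus split, prediction, and impurity stats), not a measured figure, and real trees are rarely full, so treat this as an upper-bound sketch:

```scala
// Rough upper-bound memory estimate for a forest of full binary trees.
// bytesPerNode is an assumed ballpark, not measured from Spark's node class.
def forestBytes(numTrees: Int, depth: Int, bytesPerNode: Long = 64L): Long = {
  val nodesPerTree = (1L << (depth + 1)) - 1  // full binary tree of given depth
  numTrees * nodesPerTree * bytesPerNode
}

// 500 full trees of depth 20 already land in the "dozens of GBs" range.
val gib = forestBytes(500, 20).toDouble / (1L << 30)
println(f"$gib%.1f GiB")  // prints "62.5 GiB"
```

Under these assumptions, gigantic RF models are plausible, which is the scenario the distributed-storage idea targets.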



[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19570259
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61000703
  
It's a good point about the sequential nature of boosting models being 
important when doing approximate predictions (using only some of the weak 
hypotheses); I could imagine that being useful.  Perhaps the generic 
WeightedEnsembleModel could be subclassed in order to support that kind of 
extended functionality in the future.

Distributed models sound useful to me, though I suspect applying a 
sparsifying step (like running Lasso on the outputs of the many trees to choose 
a subset of trees) might be faster and almost as accurate in many cases.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19570610
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRunner.scala
 ---
@@ -26,7 +26,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{RandomForest, DecisionTree, impurity}
 import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
 import org.apache.spark.mllib.tree.configuration.Algo._
-import org.apache.spark.mllib.tree.model.{RandomForestModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
--- End diff --

@codedeft Agree about the distributed storage, though I never bothered to 
check the size of big, deep trees in memory! :-) In fact, such storage might 
be a good option for the [Partial Forest 
implementation](https://issues.apache.org/jira/browse/SPARK-1548).





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19572600
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61024859
  
@jkbradley I originally used checkpointing instead of simply caching in 
memory. There are trade-offs going with one versus the other. I will study what 
@codedeft implemented in PR [#2868](https://github.com/apache/spark/pull/2868) 
and see what we can re-use here.






[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-29 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-61026909
  
Studying the trade-offs sounds great.  I think it's OK if checkpointing is 
added later as an option.  Thanks!





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495219
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.RDD
+
+/**
+ * Trait for adding pluggable loss functions for the gradient boosting 
algorithm.
+ */
+trait Loss extends Serializable {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation.
+   * @param model Model of the weak learner.
+   * @param point Instance of the training dataset.
+   * @param learningRate Learning rate parameter for regularization.
+   * @return Loss gradient.
+   */
+  @DeveloperApi
+  def lossGradient(
--- End diff --

Could this please be renamed to gradient so it is less repetitive to call 
loss.lossGradient?
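As a rough illustration of the pluggable-loss idea in the quoted trait, here is a simplified sketch: it uses plain Doubles rather than the real signature (which takes a weak-learner model and a LabeledPoint), and adopts the shorter `gradient` name suggested in the comment. Names here are illustrative, not the PR's API:

```scala
// Simplified stand-in for the Loss trait: gradient of the loss with
// respect to the current prediction F, given the true label y.
trait SimpleLoss {
  def gradient(prediction: Double, label: Double): Double
}

// Squared error L(y, F) = (y - F)^2 / 2, so dL/dF = -(y - F).
object SimpleSquaredError extends SimpleLoss {
  def gradient(prediction: Double, label: Double): Double = -(label - prediction)
}

// The negative gradient -(dL/dF) = (y - F) is the residual the next
// weak learner is fit to in gradient boosting.
println(SimpleSquaredError.gradient(prediction = 2.0, label = 5.0))  // prints "-3.0"
```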





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495241
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostingSuite.scala ---
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, 
Strategy}
+import org.apache.spark.mllib.tree.impurity.{Variance, Gini}
+import org.apache.spark.mllib.tree.loss.{SquaredError, LogLoss}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+
+import org.apache.spark.mllib.util.LocalSparkContext
+
+/**
+ * Test suite for [[GradientBoosting]].
+ */
+class GradientBoostingSuite extends FunSuite with LocalSparkContext {
+
+  test("Binary classification with continuous features: " +
+"comparing DecisionTree vs. GradientBoosting (numEstimators = 1)") {
+
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures 
= 50, 1000)
+val rdd = sc.parallelize(arr)
+val categoricalFeaturesInfo = Map.empty[Int, Int]
+val numEstimators = 1
+
+val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val dt = DecisionTree.train(remappedInput, treeStrategy)
+
+val boostingStrategy = new BoostingStrategy(algo = Classification,
+  numEstimators = numEstimators, loss = LogLoss, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val gbt = GradientBoosting.trainClassifier(rdd, boostingStrategy)
+assert(gbt.baseLearners.size === 1)
+val gbtTree = gbt.baseLearners(0)
+
+
+EnsembleTestHelper.validateClassifier(gbt, arr, 0.9)
+
+// Make sure trees are the same.
+assert(gbtTree.toString == dt.toString)
+  }
+
+  test("Binary classification with continuous features: " +
+"comparing DecisionTree vs. GradientBoosting (numEstimators = 10)") {
+
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures 
= 50, 1000)
+val rdd = sc.parallelize(arr)
+val categoricalFeaturesInfo = Map.empty[Int, Int]
+val numEstimators = 10
+
+val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val dt = DecisionTree.train(remappedInput, treeStrategy)
+
+val boostingStrategy = new BoostingStrategy(algo = Classification,
+  numEstimators = numEstimators, loss = LogLoss, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val gbt = GradientBoosting.trainClassifier(rdd, boostingStrategy)
+assert(gbt.baseLearners.size === 10)
+val gbtTree = gbt.baseLearners(0)
+
+
+EnsembleTestHelper.validateClassifier(gbt, arr, 0.9)
+
+// Make sure trees are the same.
+assert(gbtTree.toString == dt.toString)
+  }
+
+  test("Binary classification with continuous features: " +
+"Stochastic GradientBoosting (numEstimators = 10, learning rate = 0.9, subsample = 0.75)") {
+
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures 
= 50, 1000)
+val rdd = sc.parallelize(arr)
+val categoricalFeaturesInfo = Map.empty[Int, Int]
+val numEstimators = 10
+
+val boostingStrategy = new BoostingStrategy(algo = 

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495244
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/impl/BaggedPointSuite.scala ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.tree.EnsembleTestHelper
+import org.apache.spark.mllib.util.LocalSparkContext
+
+/**
+ * Test suite for [[BaggedPoint]].
+ */
+class BaggedPointSuite extends FunSuite with LocalSparkContext  {
+
+  test("BaggedPoint RDD: without subsampling") {
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 1, false)
+baggedRDD.collect().foreach { baggedPoint =>
+  assert(baggedPoint.subsampleWeights.size == 1 && baggedPoint.subsampleWeights(0) == 1)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 1.0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample * 
(1 - subsample)))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  
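The expectedMean/expectedStddev pairs in the quoted tests follow directly from the sampling distributions: subsampling with replacement at fraction q gives each point a Poisson(q) count (mean q, stddev sqrt(q)), while without replacement the count is Bernoulli(q) (mean q, stddev sqrt(q * (1 - q))). A small Spark-free simulation using Knuth's Poisson sampler reproduces the with-replacement, fraction = 0.5 case:

```scala
import scala.util.Random

// Knuth's algorithm for drawing a Poisson(lambda) count (lambda > 0).
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = -1
  var p = 1.0
  while (p > limit) { k += 1; p *= rng.nextDouble() }
  k
}

val rng = new Random(12345)
val q = 0.5          // subsampling fraction
val n = 200000       // number of simulated points
val counts = Array.fill(n)(poisson(q, rng).toDouble)
val mean = counts.sum / n
val stddev = math.sqrt(counts.map(c => (c - mean) * (c - mean)).sum / n)

// Poisson(0.5): mean ~ 0.5, stddev ~ sqrt(0.5) ~ 0.707
println(f"mean=$mean%.3f stddev=$stddev%.3f")
```

With n this large, the sample mean and stddev land within ~0.01 of the theoretical values, which is exactly the epsilon the BaggedPointSuite tests use.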

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495254
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -70,7 +71,8 @@ class Strategy (
 val categoricalFeaturesInfo: Map[Int, Int] = Map[Int, Int](),
 val minInstancesPerNode: Int = 1,
 val minInfoGain: Double = 0.0,
-val maxMemoryInMB: Int = 256) extends Serializable {
+val maxMemoryInMB: Int = 256,
+val subsample: Double = 1) extends Serializable {
--- End diff --

Rename: subsample -> subsamplingRate


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495228
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.RDD
+
+/**
+ * Trait for adding pluggable loss functions for the gradient boosting 
algorithm.
+ */
+trait Loss extends Serializable {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation.
+   * @param model Model of the weak learner.
+   * @param point Instance of the training dataset.
+   * @param learningRate Learning rate parameter for regularization.
+   * @return Loss gradient.
+   */
+  @DeveloperApi
+  def lossGradient(
+  model: DecisionTreeModel,
+  point: LabeledPoint,
+  learningRate: Double): Double
+
+  /**
+   * Method to calculate error of the base learner for the gradient 
boosting calculation.
+   * Note: This method is not used by the gradient boosting algorithm but 
is useful for debugging
+   * purposes.
+   * @param model Model of the weak learner.
+   * @param data Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return
+   */
+  @DeveloperApi
+  def computeError(model: DecisionTreeModel, data: RDD[LabeledPoint]): 
Double
--- End diff --

Rename to "compute" or "loss"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495261
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/SquaredError.scala ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.RDD
+
+/**
+ * Class for least squares error loss calculation.
+ *
+ * The label y is predicted from the features x using the function F.
+ * For each instance:
+ * Loss: (y - F)**2/2
+ * Negative gradient: y - F
+ */
+object SquaredError extends Loss {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation for least
+   * squares error calculation.
+   * @param model Model of the weak learner
+   * @param point Instance of the training dataset
+   * @param learningRate Learning rate parameter for regularization
+   * @return Loss gradient
+   */
+  @DeveloperApi
--- End diff --

You can put the DeveloperApi annotation on the object, rather than on each 
method.  (Please change elsewhere too)
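The squared-error formulas quoted in this diff (loss `(y - F)**2/2`, negative gradient `y - F`) are small enough to sketch outside Scala. The following is a minimal Python illustration, not Spark's API; the function names and the `predict` stand-in for a trained weak hypothesis are made up for the example:

```python
def squared_error_loss(y, f):
    """Per-instance least squares loss from the doc comment: (y - F)**2 / 2."""
    return 0.5 * (y - f) ** 2

def negative_gradient(y, f):
    """Pseudo-residual the next weak hypothesis is fit to: y - F."""
    return y - f

def compute_error(predict, data):
    """Mean loss over a dataset of (features, label) pairs.

    Analogous to the debugging-only computeError discussed in the thread.
    """
    return sum(squared_error_loss(y, predict(x)) for x, y in data) / len(data)

# Example: a constant predictor F = 2.0 on labels 1.0 and 3.0
# has loss 0.5 on each instance, so the mean error is 0.5.
```

A gradient-boosting iteration would fit the next tree to the `negative_gradient` values (the pseudo-residuals) rather than to the raw labels.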





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495234
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/WeightedEnsembleModel.scala
 ---
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+import scala.collection.mutable
+
+@Experimental
+class WeightedEnsembleModel(
+val baseLearners: Array[DecisionTreeModel],
+val baseLearnerWeights: Array[Double],
+val algo: Algo,
+val combiningStrategy: EnsembleCombiningStrategy) extends Serializable 
{
+
+  require(numTrees > 0, s"WeightedEnsembleModel cannot be created without base learners. Number " +
+    s"of baselearners = $baseLearners")
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictRaw(features: Vector): Double = {
+val treePredictions = baseLearners.map(learner => learner.predict(features))
+if (numTrees == 1) {
+  treePredictions(0)
+} else {
+  var prediction = treePredictions(0)
+  var index = 1
+  while (index < numTrees) {
+prediction += baseLearnerWeights(index) * treePredictions(index)
+index += 1
+  }
+  prediction
+}
+  }
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+val treePredictions = baseLearners.map(learner => learner.predict(features))
+val rawPrediction = {
--- End diff --

Remember to remove (since you moved it to predictRaw)
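For reference, the two combining modes this class juggles reduce to a few lines each. This is a hedged Python sketch, not the PR's Scala; in particular it applies an explicit weight to every tree, whereas the quoted diff leaves the first tree's weight implicit:

```python
def predict_weighted_sum(tree_predictions, weights):
    """GBT-style combination: F(x) = sum over m of w_m * h_m(x)."""
    return sum(w * p for w, p in zip(weights, tree_predictions))

def predict_majority_vote(tree_predictions):
    """RF-classification-style combination: most frequent predicted class."""
    counts = {}
    for p in tree_predictions:
        counts[p] = counts.get(p, 0) + 1
    return max(counts, key=counts.get)
```

For example, trees predicting `[1.0, -1.0, 1.0]` with weights `[1.0, 0.5, 0.25]` combine to `1.0 - 0.5 + 0.25 = 0.75` under the weighted sum, while a vote over class labels `[1, 0, 1]` returns `1`.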





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495252
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/EnsembleCombiningStrategy.scala
 ---
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.configuration
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: Experimental ::
+ * Enum to select ensemble combining strategy for base learners
+ */
+@DeveloperApi
+object EnsembleCombiningStrategy extends Enumeration {
--- End diff --

I think this strategy option would be useful at some point, but not yet.
* sum and average are essentially the same thing
* Eventually, when we support options such as median, this could be nice to add

Remove for now?





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495258
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/BaggedPoint.scala ---
@@ -46,20 +47,63 @@ private[tree] object BaggedPoint {
* Convert an input dataset into its BaggedPoint representation,
* choosing subsample counts for each instance.
* Each subsample has the same number of instances as the original 
dataset,
-   * and is created by subsampling with replacement.
-   * @param input Input dataset.
-   * @param numSubsamples  Number of subsamples of this RDD to take.
-   * @param seed   Random seed.
-   * @return  BaggedPoint dataset representation
+   * and is created by subsampling without replacement.
+   * @param input Input dataset.
+   * @param subsample Fraction of the training data used for learning 
decision tree.
+   * @param numSubsamples Number of subsamples of this RDD to take.
+   * @param withReplacement Sampling with/without replacement.
+   * @param seed Random seed.
+   * @return BaggedPoint dataset representation.
*/
-  def convertToBaggedRDD[Datum](
+  def convertToBaggedRDD[Datum] (
   input: RDD[Datum],
+  subsample: Double,
--- End diff --

Rename: subsample -> subsamplingRate





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495249
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/impl/BaggedPointSuite.scala ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.tree.EnsembleTestHelper
+import org.apache.spark.mllib.util.LocalSparkContext
+
+/**
+ * Test suite for [[BaggedPoint]].
+ */
+class BaggedPointSuite extends FunSuite with LocalSparkContext  {
+
+  test("BaggedPoint RDD: without subsampling") {
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 1, false)
+baggedRDD.collect().foreach { baggedPoint =>
+  assert(baggedPoint.subsampleWeights.size == 1 && baggedPoint.subsampleWeights(0) == 1)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 1.0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample * 
(1 - subsample)))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  
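The expected means and standard deviations asserted in this suite follow from the sampling distributions: with replacement, an instance's multiplicity per subsample is approximately Poisson(rate), so mean = rate and stddev = sqrt(rate); without replacement it is Bernoulli(rate), so mean = rate and stddev = sqrt(rate * (1 - rate)). A small Python simulation (illustrative names, not Spark's API) reproduces the numbers the tests check:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for small lambda."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return float(k)
        k += 1

def subsample_weights(n_points, rate, n_subsamples, with_replacement, seed):
    """Simulate per-instance subsample weights, BaggedPoint-style.

    With replacement, a point's multiplicity in one subsample is roughly
    Poisson(rate); without replacement it is Bernoulli(rate).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_points):
        if with_replacement:
            rows.append([poisson(rate, rng) for _ in range(n_subsamples)])
        else:
            rows.append([1.0 if rng.random() < rate else 0.0
                         for _ in range(n_subsamples)])
    return rows

def mean_stddev(rows):
    """Mean and population standard deviation over all weights."""
    flat = [w for row in rows for w in row]
    m = sum(flat) / len(flat)
    var = sum((w - m) ** 2 for w in flat) / len(flat)
    return m, math.sqrt(var)
```

With rate = 0.5 this matches the suite's expectations: mean close to 0.5 in both modes, stddev close to sqrt(0.5) with replacement and to sqrt(0.5 * 0.5) = 0.5 without.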

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495231
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/WeightedEnsembleModel.scala
 ---
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+import scala.collection.mutable
+
+@Experimental
+class WeightedEnsembleModel(
+val baseLearners: Array[DecisionTreeModel],
--- End diff --

Please switch from "learner" to "hypothesis" since these are hypotheses produced by the learner.  I'd recommend:
* baseLearners -> weakHypotheses
* baseLearnerWeights -> weakHypothesisWeights





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60812919
  
@manishamde  Added comments based on a quick pass looking mainly at the 
API.  My main concern is the same as in my comment above about the verbosity of 
(a) the many GradientBoosting.train* methods and (b) BoostingStrategy.  Could 
you please respond to the comment above about making these more modular for 
easier construction?  Thanks!  In the meantime, I'll make a more detailed pass 
over the internals.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19495237
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/WeightedEnsembleModel.scala
 ---
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+import scala.collection.mutable
+
+@Experimental
+class WeightedEnsembleModel(
+val baseLearners: Array[DecisionTreeModel],
+val baseLearnerWeights: Array[Double],
+val algo: Algo,
+val combiningStrategy: EnsembleCombiningStrategy) extends Serializable 
{
+
+  require(numTrees > 0, s"WeightedEnsembleModel cannot be created without base learners. Number " +
+    s"of baselearners = $baseLearners")
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictRaw(features: Vector): Double = {
+val treePredictions = baseLearners.map(learner => learner.predict(features))
+if (numTrees == 1) {
+  treePredictions(0)
+} else {
+  var prediction = treePredictions(0)
+  var index = 1
+  while (index < numTrees) {
+prediction += baseLearnerWeights(index) * treePredictions(index)
+index += 1
+  }
+  prediction
+}
+  }
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+val treePredictions = baseLearners.map(learner => learner.predict(features))
+val rawPrediction = {
+  if (numTrees == 1) {
+treePredictions(0)
+  } else {
+var prediction = treePredictions(0)
+var index = 1
+while (index < numTrees) {
+  prediction += baseLearnerWeights(index) * treePredictions(index)
+  index += 1
+}
+prediction
+  }
+}
+algo match {
+  case Regression => predictRaw(features)
+  case Classification => {
+// TODO: predicted labels are +1 or -1 for GBT. Need a better way to store this info.
+if (predictRaw(features) > 0) 1.0 else 0.0
+  }
+  case _ => throw new IllegalArgumentException(
+s"WeightedEnsembleModel given unknown algo parameter: $algo.")
+}
+  }
+
+  /**
+   * Predict values for a single data point.
+   *
+   * @param features array representing a single data point
+   * @return Double prediction from the trained model
+   */
+  def predictByAveraging(features: Vector): Double = {
+algo match {
+  case Classification =>
+val predictionToCount = new mutable.HashMap[Int, Int]()
+baseLearners.foreach { learner =>
+  val prediction = learner.predict(features).toInt
+  predictionToCount(prediction) = predictionToCount.getOrElse(prediction, 0) + 1
+}
+predictionToCount.maxBy(_._2)._1
+  case Regression =>
+baseLearners.map(_.predict(features)).sum / baseLearners.size
+}
+  }
+
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  def predict(features: Vector): Double = {
+combiningStrategy match {

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60814290
  
@jkbradley Your API suggestions sound reasonable. Let me work on 
simplifying the API. I had originally started with something similar to what 
you suggested so I will revert to that. I will let you know once the API has 
been updated.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496069
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/SquaredError.scala ---
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.SparkContext._
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.RDD
+
+/**
+ * Class for least squares error loss calculation.
+ *
+ * The label y is predicted from the features x using the function F.
+ * For each instance:
+ * Loss: (y - F)**2/2
+ * Negative gradient: y - F
+ */
+object SquaredError extends Loss {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation for least
+   * squares error calculation.
+   * @param model Model of the weak learner
+   * @param point Instance of the training dataset
+   * @param learningRate Learning rate parameter for regularization
+   * @return Loss gradient
+   */
+  @DeveloperApi
--- End diff --

Will do.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496095
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/BaggedPoint.scala ---
@@ -46,20 +47,63 @@ private[tree] object BaggedPoint {
* Convert an input dataset into its BaggedPoint representation,
* choosing subsample counts for each instance.
* Each subsample has the same number of instances as the original 
dataset,
-   * and is created by subsampling with replacement.
-   * @param input Input dataset.
-   * @param numSubsamples  Number of subsamples of this RDD to take.
-   * @param seed   Random seed.
-   * @return  BaggedPoint dataset representation
+   * and is created by subsampling without replacement.
+   * @param input Input dataset.
+   * @param subsample Fraction of the training data used for learning 
decision tree.
+   * @param numSubsamples Number of subsamples of this RDD to take.
+   * @param withReplacement Sampling with/without replacement.
+   * @param seed Random seed.
+   * @return BaggedPoint dataset representation.
*/
-  def convertToBaggedRDD[Datum](
+  def convertToBaggedRDD[Datum] (
   input: RDD[Datum],
+  subsample: Double,
--- End diff --

Agree. Will do.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496113
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala 
---
@@ -70,7 +71,8 @@ class Strategy (
 val categoricalFeaturesInfo: Map[Int, Int] = Map[Int, Int](),
 val minInstancesPerNode: Int = 1,
 val minInfoGain: Double = 0.0,
-val maxMemoryInMB: Int = 256) extends Serializable {
+val maxMemoryInMB: Int = 256,
+val subsample: Double = 1) extends Serializable {
--- End diff --

Agree. Will do.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496210
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/EnsembleCombiningStrategy.scala
 ---
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.configuration
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: Experimental ::
+ * Enum to select ensemble combining strategy for base learners
+ */
+@DeveloperApi
+object EnsembleCombiningStrategy extends Enumeration {
--- End diff --

Why are sum and average the same? In RF you average predictions and in GBT 
you add predictions.
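To make the distinction concrete, here is a toy sketch of the two combining strategies being discussed (plain Python for illustration only; the function names are made up and this is not the Spark MLlib implementation):

```python
# Toy sketch of the "Sum" vs. "Average" ensemble combining strategies.
# Illustrative only -- not the Spark MLlib code under review.

def combine_sum(predictions, weights):
    # GBT-style "Sum": weighted sum of base-learner outputs.
    return sum(w * p for w, p in zip(weights, predictions))

def combine_average(predictions, weights):
    # RF-style "Average": weighted sum normalized by total weight.
    return combine_sum(predictions, weights) / sum(weights)

preds = [0.5, -1.0, 2.0]
weights = [1.0, 1.0, 1.0]
print(combine_sum(preds, weights))      # 1.5
print(combine_average(preds, weights))  # 0.5
```

With unit weights, "Sum" and "Average" differ by a constant factor (the number of trees), which is why they cannot be treated as the same strategy.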





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496224
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/impl/BaggedPointSuite.scala ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.tree.EnsembleTestHelper
+import org.apache.spark.mllib.util.LocalSparkContext
+
+/**
+ * Test suite for [[BaggedPoint]].
+ */
+class BaggedPointSuite extends FunSuite with LocalSparkContext  {
+
+  test("BaggedPoint RDD: without subsampling") {
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 1, false)
+baggedRDD.collect().foreach { baggedPoint =>
+  assert(baggedPoint.subsampleWeights.size == 1 && 
baggedPoint.subsampleWeights(0) == 1)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 1.0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample * 
(1 - subsample)))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  
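The expectedMean/expectedStddev constants in these tests follow from the per-point sampling distribution: with replacement the count per point is (approximately) Poisson(subsample), with mean p and stddev sqrt(p); without replacement it is Bernoulli(subsample), with mean p and stddev sqrt(p(1-p)). A quick sanity check of those constants (plain Python, illustration only, not the code under review):

```python
import math

def expected_moments(p, with_replacement):
    # Count per point: ~Poisson(p) with replacement (mean p, var p);
    # Bernoulli(p) without replacement (mean p, var p*(1-p)).
    if with_replacement:
        return p, math.sqrt(p)
    return p, math.sqrt(p * (1 - p))

print(expected_moments(1.0, True))   # (1.0, 1.0)
print(expected_moments(0.5, True))   # (0.5, 0.7071...)
print(expected_moments(1.0, False))  # (1.0, 0.0)
print(expected_moments(0.5, False))  # (0.5, 0.5)
```

These match the (expectedMean, expectedStddev) pairs used in the four tests above.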

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496241
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/impl/BaggedPointSuite.scala ---
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.impl
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.tree.EnsembleTestHelper
+import org.apache.spark.mllib.util.LocalSparkContext
+
+/**
+ * Test suite for [[BaggedPoint]].
+ */
+class BaggedPointSuite extends FunSuite with LocalSparkContext  {
+
+  test("BaggedPoint RDD: without subsampling") {
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 1, false)
+baggedRDD.collect().foreach { baggedPoint =>
+  assert(baggedPoint.subsampleWeights.size == 1 && 
baggedPoint.subsampleWeights(0) == 1)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 1.0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling with replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, true)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 1.0)") {
+val numSubsamples = 100
+val (expectedMean, expectedStddev) = (1.0, 0)
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, 1.0, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  }
+
+  test("BaggedPoint RDD: with subsampling without replacement (fraction = 0.5)") {
+val numSubsamples = 100
+val subsample = 0.5
+val (expectedMean, expectedStddev) = (subsample, math.sqrt(subsample * 
(1 - subsample)))
+
+val seeds = Array(123, 5354, 230, 349867, 23987)
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(1, 1000)
+val rdd = sc.parallelize(arr)
+seeds.foreach { seed =>
+  val baggedRDD = BaggedPoint.convertToBaggedRDD(rdd, subsample, 
numSubsamples, false)
+  val subsampleCounts: Array[Array[Double]] = 
baggedRDD.map(_.subsampleWeights).collect()
+  EnsembleTestHelper.testRandomArrays(subsampleCounts, numSubsamples, 
expectedMean,
+expectedStddev, epsilon = 0.01)
+}
+  

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496253
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostingSuite.scala ---
@@ -0,0 +1,208 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import org.scalatest.FunSuite
+
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, 
Strategy}
+import org.apache.spark.mllib.tree.impurity.{Variance, Gini}
+import org.apache.spark.mllib.tree.loss.{SquaredError, LogLoss}
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+
+import org.apache.spark.mllib.util.LocalSparkContext
+
+/**
+ * Test suite for [[GradientBoosting]].
+ */
+class GradientBoostingSuite extends FunSuite with LocalSparkContext {
+
+  test("Binary classification with continuous features: " +
+ "comparing DecisionTree vs. GradientBoosting (numEstimators = 1)") {
+
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures 
= 50, 1000)
+val rdd = sc.parallelize(arr)
+val categoricalFeaturesInfo = Map.empty[Int, Int]
+val numEstimators = 1
+
+val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, 
x.features))
+val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val dt = DecisionTree.train(remappedInput, treeStrategy)
+
+val boostingStrategy = new BoostingStrategy(algo = Classification,
+  numEstimators = numEstimators, loss = LogLoss, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val gbt = GradientBoosting.trainClassifier(rdd, boostingStrategy)
+assert(gbt.baseLearners.size === 1)
+val gbtTree = gbt.baseLearners(0)
+
+
+EnsembleTestHelper.validateClassifier(gbt, arr, 0.9)
+
+// Make sure trees are the same.
+assert(gbtTree.toString == dt.toString)
+  }
+
+  test("Binary classification with continuous features: " +
+ "comparing DecisionTree vs. GradientBoosting (numEstimators = 10)") {
+
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures 
= 50, 1000)
+val rdd = sc.parallelize(arr)
+val categoricalFeaturesInfo = Map.empty[Int, Int]
+val numEstimators = 10
+
+val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, 
x.features))
+val treeStrategy = new Strategy(algo = Regression, impurity = 
Variance, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val dt = DecisionTree.train(remappedInput, treeStrategy)
+
+val boostingStrategy = new BoostingStrategy(algo = Classification,
+  numEstimators = numEstimators, loss = LogLoss, maxDepth = 2,
+  numClassesForClassification = 2, categoricalFeaturesInfo = 
categoricalFeaturesInfo)
+
+val gbt = GradientBoosting.trainClassifier(rdd, boostingStrategy)
+assert(gbt.baseLearners.size === 10)
+val gbtTree = gbt.baseLearners(0)
+
+
+EnsembleTestHelper.validateClassifier(gbt, arr, 0.9)
+
+// Make sure trees are the same.
+assert(gbtTree.toString == dt.toString)
+  }
+
+  test("Binary classification with continuous features: " +
+ "Stochastic GradientBoosting (numEstimators = 10, learning rate = 0.9, subsample = 0.75)") {
+
+val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures 
= 50, 1000)
+val rdd = sc.parallelize(arr)
+val categoricalFeaturesInfo = Map.empty[Int, Int]
+val numEstimators = 10
+
+val boostingStrategy = new BoostingStrategy(algo = 

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496266
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/WeightedEnsembleModel.scala
 ---
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+import scala.collection.mutable
+
+@Experimental
+class WeightedEnsembleModel(
+val baseLearners: Array[DecisionTreeModel],
+val baseLearnerWeights: Array[Double],
+val algo: Algo,
+val combiningStrategy: EnsembleCombiningStrategy) extends Serializable 
{
+
+  require(numTrees > 0, s"WeightedEnsembleModel cannot be created without base learners. Number " +
+s"of baselearners = $baseLearners")
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictRaw(features: Vector): Double = {
+val treePredictions = baseLearners.map(learner =>
learner.predict(features))
+if (numTrees == 1){
+  treePredictions(0)
+} else {
+  var prediction = treePredictions(0)
+  var index = 1
+  while (index < numTrees) {
+prediction += baseLearnerWeights(index) * treePredictions(index)
+index += 1
+  }
+  prediction
+}
+  }
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  private def predictBySumming(features: Vector): Double = {
+val treePredictions = baseLearners.map(learner =>
learner.predict(features))
+val rawPrediction = {
+  if (numTrees == 1) {
+treePredictions(0)
+  } else {
+var prediction = treePredictions(0)
+var index = 1
+while (index < numTrees) {
+  prediction += baseLearnerWeights(index) * treePredictions(index)
+  index += 1
+}
+prediction
+  }
+}
+algo match {
+  case Regression => predictRaw(features)
+  case Classification => {
+// TODO: predicted labels are +1 or -1 for GBT. Need a better way 
to store this info.
+if (predictRaw(features) > 0) 1.0 else 0.0
+  }
+  case _ => throw new IllegalArgumentException(
+s"WeightedEnsembleModel given unknown algo parameter: $algo.")
+}
+  }
+
+  /**
+   * Predict values for a single data point.
+   *
+   * @param features array representing a single data point
+   * @return Double prediction from the trained model
+   */
+  def predictByAveraging(features: Vector): Double = {
+algo match {
+  case Classification =>
+val predictionToCount = new mutable.HashMap[Int, Int]()
+baseLearners.foreach { learner =>
+  val prediction = learner.predict(features).toInt
+  predictionToCount(prediction) = 
predictionToCount.getOrElse(prediction, 0) + 1
+}
+predictionToCount.maxBy(_._2)._1
+  case Regression =>
+baseLearners.map(_.predict(features)).sum / baseLearners.size
+}
+  }
+
+
+  /**
+   * Predict values for a single data point using the model trained.
+   *
+   * @param features array representing a single data point
+   * @return predicted category from the trained model
+   */
+  def predict(features: Vector): Double = {
+combiningStrategy match {

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496447
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/model/WeightedEnsembleModel.scala
 ---
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.model
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.configuration.Algo._
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy._
+import org.apache.spark.rdd.RDD
+
+import scala.collection.mutable
+
+@Experimental
+class WeightedEnsembleModel(
+val baseLearners: Array[DecisionTreeModel],
--- End diff --

Sure. Will do.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19496561
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.RDD
+
+/**
+ * Trait for adding pluggable loss functions for the gradient boosting 
algorithm.
+ */
+trait Loss extends Serializable {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation.
+   * @param model Model of the weak learner.
+   * @param point Instance of the training dataset.
+   * @param learningRate Learning rate parameter for regularization.
+   * @return Loss gradient.
+   */
+  @DeveloperApi
+  def lossGradient(
--- End diff --

Technically negative of the gradient. I can rename it to gradient but it 
might be confusing.
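A concrete illustration of that point (plain Python, illustration only, not the MLlib API): for squared error L(y, F) = (y - F)^2 / 2, the quantity the next boosting stage fits is the negative gradient -dL/dF, which is just the residual, so "lossGradient" here really returns the negative of the mathematical gradient:

```python
def neg_gradient_squared_error(label, prediction):
    # For L = (label - F)**2 / 2, dL/dF = -(label - F),
    # so the negative gradient is simply the residual.
    return label - prediction

print(neg_gradient_squared_error(1.0, 0.25))  # 0.75
```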





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19497878
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
--- End diff --

GradientBoostingModel --> WeightedEnsembleModel





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19497871
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
--- End diff --

regression --> binary classification and regression





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19497886
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
+val algo = boostingStrategy.algo
+algo match {
+  case Regression => GradientBoosting.boost(input, boostingStrategy)
+  case Classification =
+val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+GradientBoosting.boost(remappedInput, boostingStrategy)
+  case _ =>
+throw new IllegalArgumentException(s"$algo is not supported by gradient boosting.")
+}
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *   is recommended to clearly specify regression.
+   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *   is recommended to clearly specify classification.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting 
algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+  input: RDD[LabeledPoint],
+  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
+new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In 
other words,
+   *  number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient 
boosting.
+   * @param maxDepth Maximum depth of the tree.
+   * E.g., depth 0 means 1 leaf node; depth 1 means 1 
internal node + 2 leaf nodes.
+   * @param learningRate Learning rate for shrinking the 
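
The {0, 1} to {-1, +1} label remapping that train() applies for classification can be sketched in isolation (the object and method names here are illustrative, not the PR's API; only the arithmetic comes from the diff above):

```scala
// Minimal sketch of the label remapping used in train() for classification:
// binary labels in {0, 1} become signed labels in {-1, +1}, so boosting can
// treat the problem as regression on signed labels.
object LabelRemapSketch {
  def remap(label: Double): Double = label * 2 - 1

  def main(args: Array[String]): Unit = {
    assert(remap(0.0) == -1.0)
    assert(remap(1.0) == 1.0)
  }
}
```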

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19497891
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
+val algo = boostingStrategy.algo
+algo match {
+  case Regression = GradientBoosting.boost(input, boostingStrategy)
+  case Classification =
+val remappedInput = input.map(x = new LabeledPoint((x.label * 2) 
- 1, x.features))
+GradientBoosting.boost(remappedInput, boostingStrategy)
+  case _ =
+throw new IllegalArgumentException(s$algo is not supported by the 
gradient boosting.)
+}
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
--- End diff --

Use GradientBoosting$#trainRegressor instead of GradientBoosting#trainRegressor 
so the doc processor handles the object correctly (it uses Java reflection 
syntax). Please update this elsewhere too.
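
Applied to the note under review, the suggested link form would read roughly (a sketch of the Scaladoc comment, not the final wording):

```scala
/**
 * Note: Using [[org.apache.spark.mllib.tree.GradientBoosting$#trainRegressor]]
 *       is recommended to clearly specify regression.
 */
```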


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19497888
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the License); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an AS IS BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, 
DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import 
org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+private val boostingStrategy: BoostingStrategy) extends Serializable 
with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
+val algo = boostingStrategy.algo
+algo match {
+  case Regression = GradientBoosting.boost(input, boostingStrategy)
+  case Classification =
+val remappedInput = input.map(x = new LabeledPoint((x.label * 2) 
- 1, x.features))
+GradientBoosting.boost(remappedInput, boostingStrategy)
+  case _ =
+throw new IllegalArgumentException(s$algo is not supported by the 
gradient boosting.)
+}
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *   is recommended to clearly specify regression.
+   *   Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *   is recommended to clearly specify regression.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting 
algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+  input: RDD[LabeledPoint],
+  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
+new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, 
..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In 
other words,
+   *  number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient 
boosting.
+   * @param maxDepth Maximum depth of the tree.
+   * E.g., depth 0 means 1 leaf node; depth 1 means 1 
internal node + 2 leaf nodes.
+   * @param learningRate Learning rate for shrinking the 

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19498319
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.loss
+
+import org.apache.spark.annotation.DeveloperApi
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.RDD
+
+/**
+ * Trait for adding pluggable loss functions for the gradient boosting 
algorithm.
+ */
+trait Loss extends Serializable {
+
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation.
+   * @param model Model of the weak learner.
+   * @param point Instance of the training dataset.
+   * @param learningRate Learning rate parameter for regularization.
+   * @return Loss gradient.
--- End diff --

Since this is called "gradient", could it please return the gradient instead 
of the negated gradient?  (And boosting can be updated accordingly.)
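
For squared error the distinction is concrete; a small sketch (illustrative names, assuming the loss L(y, F) = (y - F)^2) contrasts the true gradient with its negation, which is the residual direction each boosting stage fits:

```scala
// Sketch: for L(y, F) = (y - F)^2, the gradient w.r.t. the prediction F is
// dL/dF = -2 * (y - F). The negated gradient, 2 * (y - F), points along the
// residual, which is what the next weak learner is trained on.
object SquaredErrorGradientSketch {
  def gradient(label: Double, prediction: Double): Double =
    -2.0 * (label - prediction)

  def main(args: Array[String]): Unit = {
    val g = gradient(3.0, 1.0)
    assert(g == -4.0)   // true gradient is negative when under-predicting
    assert(-g == 4.0)   // negated gradient points toward the label
  }
}
```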





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60820308
  
@manishamde  Thanks in advance for the API simplification!

Also, I'm realizing that this code should be correct for SquaredError but 
might not be quite right for the other losses.  Looking at Friedman's paper, I 
believe that the weak hypothesis weight needs to be adjusted according to the 
loss.  That calculation is simple for squared error, but it could get 
complicated for absolute error and logistic loss (requiring median calculations 
and general convex optimization, respectively, I'd guess).  I'm OK with leaving 
those other losses as long as they are marked with warnings.  I believe the 
code will still do something reasonable, although not quite ideal.
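
For squared error, the weight adjustment mentioned above has a closed form; a hedged sketch (illustrative names, not the PR's code) of the line search minimizing sum_i (residual_i - gamma * h_i)^2 over the weak-hypothesis weight gamma:

```scala
// Sketch: under squared error the optimal weight gamma for a new weak
// hypothesis h solves min_gamma sum_i (residual_i - gamma * h_i)^2, giving
// gamma = sum_i h_i * residual_i / sum_i h_i^2. Absolute error would need a
// weighted median and logistic loss a convex solve, as noted above.
object WeakHypothesisWeightSketch {
  def squaredErrorWeight(h: Array[Double], residual: Array[Double]): Double = {
    val num = h.zip(residual).map { case (hi, ri) => hi * ri }.sum
    val den = h.map(hi => hi * hi).sum
    num / den
  }
}
```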





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19498899
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/EnsembleCombiningStrategy.scala
 ---
@@ -0,0 +1,30 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree.configuration
+
+import org.apache.spark.annotation.DeveloperApi
+
+/**
+ * :: Experimental ::
+ * Enum to select ensemble combining strategy for base learners
+ */
+@DeveloperApi
+object EnsembleCombiningStrategy extends Enumeration {
--- End diff --

You're right that they are a bit different; I was thinking in terms of 
thresholding for classification, but it would be important to sum, not average, 
for regression. I also retract what I said about supporting things like the 
median. This is making me vote for removing EnsembleCombiningStrategy and only 
supporting sum. Do you have a use case for average?  (Sorry for the confusion!)
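
The difference between the two combining strategies can be sketched as follows (illustrative names, not the WeightedEnsembleModel API):

```scala
// Sketch: Sum suits boosting (an additive model of weighted stages), while
// Average suits random forest regression (mean of the tree predictions).
object CombineSketch {
  def weightedSum(preds: Seq[Double], weights: Seq[Double]): Double =
    preds.zip(weights).map { case (p, w) => p * w }.sum

  def weightedAverage(preds: Seq[Double], weights: Seq[Double]): Double =
    weightedSum(preds, weights) / weights.sum
}
```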





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19498943
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,54 @@
[... quoted diff trimmed; full text appears in an earlier message ...]
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation.
+   * @param model Model of the weak learner.
+   * @param point Instance of the training dataset.
+   * @param learningRate Learning rate parameter for regularization.
+   * @return Loss gradient.
--- End diff --

(just saw your other note; I vote for renaming to gradient, and making it 
actually return the gradient)





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60821283
  
@jkbradley Your understanding is correct. Sorry for not mentioning it 
explicitly on the JIRA/PR earlier. 

Yes, calculating the median, etc. for terminal-region predictions in trees will 
be hard for the distributed decision tree. I can add a warning noting how our 
implementation differs from Friedman's algorithm. I was planning to point this 
out in the documentation.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60821510
  
Great, that sounds reasonable.  I believe we could do it eventually: since 
the trees won't be too deep in many cases, the sufficient stats to pass around 
might be manageable.  Future work!





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19499284
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/EnsembleCombiningStrategy.scala
 ---
@@ -0,0 +1,30 @@
[... quoted diff trimmed; full text appears in an earlier message ...]
+/**
+ * :: Experimental ::
+ * Enum to select ensemble combining strategy for base learners
+ */
+@DeveloperApi
+object EnsembleCombiningStrategy extends Enumeration {
--- End diff --

I think we use average or majority for the random forest ensemble 
calculations. I moved the RF code to also return a WeightedEnsembleModel.

I am okay with removing EnsembleCombiningStrategy class but we still need 
to support both sum and averaging operations for combining ensemble predictions.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19499327
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/loss/Loss.scala 
---
@@ -0,0 +1,54 @@
[... quoted diff trimmed; full text appears in an earlier message ...]
+  /**
+   * Method to calculate the loss gradients for the gradient boosting 
calculation.
+   * @param model Model of the weak learner.
+   * @param point Instance of the training dataset.
+   * @param learningRate Learning rate parameter for regularization.
+   * @return Loss gradient.
--- End diff --

Sounds good. Agree.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19499682
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/EnsembleCombiningStrategy.scala
 ---
@@ -0,0 +1,30 @@
[... quoted diff trimmed; full text appears in an earlier message ...]
+/**
+ * :: Experimental ::
+ * Enum to select ensemble combining strategy for base learners
+ */
+@DeveloperApi
+object EnsembleCombiningStrategy extends Enumeration {
--- End diff --

OK, I agree; keeping it sounds good.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19499795
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
[... quoted diff trimmed; full text appears in an earlier message ...]
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using 
[[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
--- End diff --

Will do.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60824501
  
@manishamde Thinking more about the losses, I'm really not sure if absolute 
error and logistic loss will behave reasonably.  Could we make those losses 
private[tree] and mark them with TODO?  That way, your code is not thrown out, 
but we won't expose unsafe options to users.
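
The suggested visibility change is ordinary Scala package-private scoping; schematically (a hypothetical loss object, not the PR's actual file):

```scala
package org.apache.spark.mllib.tree.loss

// Visible only within org.apache.spark.mllib.tree and its subpackages, so
// users cannot select this loss until it has been validated.
// TODO: verify weak-hypothesis weight handling before exposing publicly.
private[tree] object AbsoluteErrorSketch {
  // Gradient of L(y, F) = |y - F| w.r.t. the prediction F: -sign(y - F).
  def gradient(label: Double, prediction: Double): Double =
    if (label - prediction < 0) 1.0 else -1.0
}
```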





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60824862
  
@jkbradley Should we even support classification then?





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60824790
  
@jkbradley I agree. This needs more testing since it's a non-standard 
option.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60837931
  
I think it's OK to leave classification support but make a note in the doc 
for SquaredError that it is meant for Regression.  What do you think?
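The point about SquaredError being meant for regression can be made concrete: its pseudo-residual is the plain residual, and classification on top of it amounts to thresholding the real-valued ensemble score at zero (given the {-1, +1} label remapping). An illustrative sketch, not the PR's code:

```python
def squared_error_negative_gradient(y, pred):
    """Negative gradient of 0.5 * (y - pred)^2: the ordinary residual."""
    return y - pred

def classify(score):
    """Threshold a real-valued ensemble score back to a {-1, +1} label."""
    return 1 if score >= 0 else -1

# With remapped {-1, +1} labels, the boosted regressor's score is
# interpreted as a margin and its sign gives the predicted class.
```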





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19508317
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---
@@ -0,0 +1,433 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.tree
+
+import scala.collection.JavaConverters._
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.tree.configuration.BoostingStrategy
+import org.apache.spark.Logging
+import org.apache.spark.mllib.tree.impl.TimeTracker
+import org.apache.spark.mllib.tree.loss.Losses
+import org.apache.spark.rdd.RDD
+import org.apache.spark.mllib.regression.LabeledPoint
+import org.apache.spark.mllib.tree.model.{WeightedEnsembleModel, DecisionTreeModel}
+import org.apache.spark.mllib.tree.configuration.Algo._
+import org.apache.spark.storage.StorageLevel
+import org.apache.spark.mllib.tree.configuration.EnsembleCombiningStrategy.Sum
+
+/**
+ * :: Experimental ::
+ * A class that implements gradient boosting for regression problems.
+ * @param boostingStrategy Parameters for the gradient boosting algorithm
+ */
+@Experimental
+class GradientBoosting (
+    private val boostingStrategy: BoostingStrategy) extends Serializable with Logging {
+
+  /**
+   * Method to train a gradient boosting model
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(input: RDD[LabeledPoint]): WeightedEnsembleModel = {
+    val algo = boostingStrategy.algo
+    algo match {
+      case Regression => GradientBoosting.boost(input, boostingStrategy)
+      case Classification =>
+        val remappedInput = input.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+        GradientBoosting.boost(remappedInput, boostingStrategy)
+      case _ =>
+        throw new IllegalArgumentException(s"$algo is not supported by the gradient boosting.")
+    }
+  }
+
+}
+
+
+object GradientBoosting extends Logging {
+
+  /**
+   * Method to train a gradient boosting model.
+   *
+   * Note: Using [[org.apache.spark.mllib.tree.GradientBoosting#trainRegressor]]
+   *   is recommended to clearly specify regression.
+   *   Using [[org.apache.spark.mllib.tree.GradientBoosting#trainClassifier]]
+   *   is recommended to clearly specify classification.
+   *
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, ..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param boostingStrategy Configuration options for the boosting algorithm.
+   * @return GradientBoostingModel that can be used for prediction
+   */
+  def train(
+  input: RDD[LabeledPoint],
+  boostingStrategy: BoostingStrategy): WeightedEnsembleModel = {
+new GradientBoosting(boostingStrategy).train(input)
+  }
+
+  /**
+   * Method to train a gradient boosting regression model.
+   *
+   * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
+   *  For classification, labels should take values {0, 1, ..., numClasses-1}.
+   *  For regression, labels are real numbers.
+   * @param numEstimators Number of estimators used in boosting stages. In other words,
+   *  number of boosting iterations performed.
+   * @param loss Loss function used for minimization during gradient boosting.
+   * @param maxDepth Maximum depth of the tree.
+   * E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
+   * @param learningRate Learning rate for shrinking the 
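The truncated `learningRate` doc comment above refers to shrinkage: each stage's tree contribution is scaled down before being added to the ensemble, so the prediction evolves as F_m(x) = F_{m-1}(x) + learningRate * h_m(x). A hedged Python sketch of that generic update (the convention that the first tree is left unshrunk is an assumption here, not taken from the quoted diff):

```python
def ensemble_predict(tree_predictions, learning_rate):
    """Combine per-stage tree predictions for a single point.

    tree_predictions: the prediction of each boosting stage's tree.
    The first tree (the initial fit) is assumed unshrunk; every later
    tree's contribution is scaled by the learning rate.
    """
    total = tree_predictions[0]
    for h in tree_predictions[1:]:
        total += learning_rate * h
    return total

# e.g. stages predicting [1.0, 0.5, 0.25] with learning rate 0.1
# contribute 1.0 + 0.05 + 0.025
```

A smaller learning rate shrinks each stage more, typically requiring more boosting iterations (`numEstimators`) but often generalizing better.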

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread manishamde
Github user manishamde commented on a diff in the pull request:

https://github.com/apache/spark/pull/2607#discussion_r19510078
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoosting.scala ---

[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-28 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60863725
  
By the way, checkpointing is not quite the right term; currently, the code 
persists but does not checkpoint the RDDs.  I hope that the logic which 
@codedeft implemented in another PR [https://github.com/apache/spark/pull/2868] 
can be abstracted away to be reused for boosting.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60561667
  
  [Test build #22285 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22285/consoleFull)
 for   PR 2607 at commit 
[`eff21fe`](https://github.com/apache/spark/commit/eff21fea01393a44c7876542832e752c26cbcd86).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60568456
  
  [Test build #22285 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22285/consoleFull)
 for   PR 2607 at commit 
[`eff21fe`](https://github.com/apache/spark/commit/eff21fea01393a44c7876542832e752c26cbcd86).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60568462
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22285/
Test PASSed.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60686705
  
@manishamde   I'll make a pass now; thanks for the updates!  A patch 
(SPARK-4022) was just merged which causes a few small conflicts.  Could you 
please fix those?  Then I'll do some tests.  Thanks!





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread manishamde
Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60690690
  
@jkbradley I fixed the merge conflicts. Thanks.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60690948
  
  [Test build #22313 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22313/consoleFull)
 for   PR 2607 at commit 
[`49ba107`](https://github.com/apache/spark/commit/49ba107065e0c53beee316e7108f7900a49b47e1).
 * This patch merges cleanly.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60696746
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22313/
Test PASSed.





[GitHub] spark pull request: [MLLIB] SPARK-1547: Adding Gradient Boosting t...

2014-10-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2607#issuecomment-60696743
  
  [Test build #22313 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22313/consoleFull)
 for   PR 2607 at commit 
[`49ba107`](https://github.com/apache/spark/commit/49ba107065e0c53beee316e7108f7900a49b47e1).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.

