[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-10-04 Thread epahomov
Github user epahomov closed the pull request at:

https://github.com/apache/spark/pull/2394


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-28 Thread epahomov
Github user epahomov commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-57077001
  
Sorry for such a messy pull request; I didn't review my student's code closely 
enough. I'll try my best next time. We'll fix everything by the middle of the 
week.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-28 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-57106342
  
@epahomov  Hi, just making sure you saw the [comment in the 
JIRA](https://issues.apache.org/jira/browse/SPARK-3525) about overlapping JIRAs 
and PRs in preparation for gradient boosting.  It would be great to get your 
student's input on [the other GBT 
JIRA](https://issues.apache.org/jira/browse/SPARK-1547) and the linked design 
doc---thank you both!





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-5782
  
Jenkins, add to whitelist.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-57000202
  
@jkbradley @manishamde Could you help review this PR? Thanks!





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-57000221
  
this is ok to test





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-57001565
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20864/





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18105434
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
--- End diff --

First import org.apache.spark.SparkContext._, then you can use 
input.map(l => l.label).mean()
directly.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18105459
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
--- End diff --

The DoubleRDDFunctions import is not needed here.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18105497
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
--- End diff --

How about TreeCount? 





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18106160
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
+  val gradient = input.map(l => l.label - boostingModel.computeValue(l.features))
+
+  val newInput: RDD[LabeledPoint] = input
+.zip(gradient)
+.map{case(inputVal, gradientVal) => new LabeledPoint(gradientVal, inputVal.features)}
+
+  val randomSample = newInput.sample(
+false,
+(samplingSizeRatio * featureDimension).asInstanceOf[Int],
+Random.nextInt()
+  )
+
+  val model = DecisionTree.train(randomSample, strategy)
+  boostingModel.addTree(model)
+}
+boostingModel
+  }
+}
+
+/**
+ * Model that can be used for prediction.
+ *
+ * @param countOfTrees Number of trees.
+ * @param initValue Initialize model with this value.
+ * @param learningRate Learning rate.
+ */
+class StochasticGradientBoostingModel (
+private val countOfTrees: Int,
+private var initValue: Double,
+private val learningRate: Double) extends Serializable with 
RegressionModel {
+
+  val trees: Array[DecisionTreeModel] = new 
Array[DecisionTreeModel](countOfTrees)
+  var index: Int = 0
+
+  def this(countOfTrees:Int, learning_rate: Double) = {
+this(countOfTrees, 0, learning_rate)
+  }
+
+  def computeValue(feature_x: Vector): Double = {
+var re_res = initValue
+
+if (index == 0) {
+  return re_res
+}
+for (i <- 0 to index - 1) {
+  re_res += learningRate * trees(i).predict(feature_x)
+}
+re_res
+  }
+
+  def addTree(tree : DecisionTreeModel) = {
+trees(index) = tree
+index += 1
+  }
+
+  def setInitValue (value : Double) = {
+initValue = value
+  }
--- End diff --

Return this from setInitValue.
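
As a rough illustration of the suggestion above, a setter can return `this.type` so that calls chain. `BoostingModelSketch` and `currentInitValue` are hypothetical names for a trimmed-down stand-in, not the class from the pull request.

```scala
// Hypothetical, trimmed-down stand-in for the model class in the diff.
class BoostingModelSketch(private var initValue: Double) {

  // Returning this.type (rather than Unit) lets callers chain setters.
  def setInitValue(value: Double): this.type = {
    initValue = value
    this
  }

  // Accessor added only so the sketch can be inspected.
  def currentInitValue: Double = initValue
}
```

With this shape, `new BoostingModelSketch(0.0).setInitValue(2.5)` evaluates to the model itself instead of `Unit`.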





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18106192
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
+  val gradient = input.map(l => l.label - boostingModel.computeValue(l.features))
+
+  val newInput: RDD[LabeledPoint] = input
+.zip(gradient)
+.map{case(inputVal, gradientVal) => new LabeledPoint(gradientVal, inputVal.features)}
+
+  val randomSample = newInput.sample(
+false,
+(samplingSizeRatio * featureDimension).asInstanceOf[Int],
+Random.nextInt()
+  )
+
+  val model = DecisionTree.train(randomSample, strategy)
+  boostingModel.addTree(model)
+}
+boostingModel
+  }
+}
+
+/**
+ * Model that can be used for prediction.
+ *
+ * @param countOfTrees Number of trees.
+ * @param initValue Initialize model with this value.
+ * @param learningRate Learning rate.
+ */
+class StochasticGradientBoostingModel (
+private val countOfTrees: Int,
+private var initValue: Double,
+private val learningRate: Double) extends Serializable with 
RegressionModel {
+
+  val trees: Array[DecisionTreeModel] = new 
Array[DecisionTreeModel](countOfTrees)
+  var index: Int = 0
+
+  def this(countOfTrees:Int, learning_rate: Double) = {
+this(countOfTrees, 0, learning_rate)
+  }
+
+  def computeValue(feature_x: Vector): Double = {
+var re_res = initValue
+
+if (index == 0) {
+  return re_res
+}
+for (i <- 0 to index - 1) {
+  re_res += learningRate * trees(i).predict(feature_x)
+}
+re_res
+  }
+
+  def addTree(tree : DecisionTreeModel) = {
--- End diff --

Check whether index is out of bounds.
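
One possible shape for such a check, sketched with a hypothetical `TreeBuffer` class in which the type parameter `T` stands in for `DecisionTreeModel`. Using `require` is an assumption about the preferred failure mode; the thread does not say how the error should be reported.

```scala
// Hypothetical stand-in: T plays the role of DecisionTreeModel.
class TreeBuffer[T](capacity: Int) {
  private val trees = new Array[Any](capacity)
  private var index = 0

  // Fails fast with IllegalArgumentException instead of an
  // ArrayIndexOutOfBoundsException from the array write.
  def addTree(tree: T): this.type = {
    require(index < capacity, s"Cannot add more than $capacity trees")
    trees(index) = tree
    index += 1
    this
  }

  // Number of trees added so far.
  def size: Int = index
}
```

A growable collection such as `ArrayBuffer` would avoid the fixed capacity altogether; that is a design alternative, not something proposed in the review.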





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18106266
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
--- End diff --

Use while instead of for; while loops are faster in Scala.
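
A minimal sketch of the while-loop form, applied to an accumulation like the one in `computeValue`. `WhileLoopSketch` and `weightedSum` are hypothetical names, and the `predictions` array stands in for the per-tree predictions.

```scala
object WhileLoopSketch {

  // Adds rate * predictions(i) to init for i in [0, count), using an
  // explicit index instead of a for comprehension over a Range.
  def weightedSum(
      init: Double,
      rate: Double,
      predictions: Array[Double],
      count: Int): Double = {
    var result = init
    var i = 0
    while (i < count) {
      result += rate * predictions(i)
      i += 1
    }
    result
  }
}
```

The speed difference matters mainly in hot inner loops such as per-point prediction; for the outer training loop, where each iteration launches Spark jobs, it is likely negligible.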





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18106416
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
+  val gradient = input.map(l => l.label - boostingModel.computeValue(l.features))
--- End diff --

@mengxr Would it be better to cache input explicitly, since it is used many 
times inside this function?





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18106472
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
+  val gradient = input.map(l => l.label - boostingModel.computeValue(l.features))
+
+  val newInput: RDD[LabeledPoint] = input
+.zip(gradient)
+.map{case(inputVal, gradientVal) => new LabeledPoint(gradientVal, inputVal.features)}
+
+  val randomSample = newInput.sample(
+false,
+(samplingSizeRatio * featureDimension).asInstanceOf[Int],
--- End diff --

Change asInstanceOf[Int] to toInt.
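
For illustration, the numeric conversion with `toInt` might look like the sketch below. `ToIntSketch` and `sampleSize` are hypothetical names; `ratio` and `count` mirror `samplingSizeRatio` and the result of `input.count()`.

```scala
object ToIntSketch {

  // toInt states the intent (numeric truncation) directly, whereas
  // asInstanceOf reads like a type cast.
  def sampleSize(ratio: Double, count: Long): Int =
    (ratio * count).toInt
}
```

As a separate observation that is not part of the review comment: the second parameter of `RDD.sample` is a `fraction: Double`, so the ratio could probably be passed as is, without converting to an absolute size at all.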





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18106585
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
--- End diff --

Maybe put the run method under object StochasticGradientBoosting; the 
StochasticGradientBoosting class does not hold any state.
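
The pattern being suggested, a stateless entry point on an `object`, can be sketched as follows. `MeanModel` and `MeanTrainer` are hypothetical and deliberately tiny: the "model" is just the label mean, which is only the initial value of the boosting model in the diff.

```scala
// Hypothetical, simplified model: it only stores the mean label.
final case class MeanModel(mean: Double)

// No instance state is needed, so run lives on an object and callers
// write MeanTrainer.run(...) instead of new MeanTrainer().run(...).
object MeanTrainer {
  def run(labels: Seq[Double]): MeanModel = {
    require(labels.nonEmpty, "labels must not be empty")
    MeanModel(labels.sum / labels.size)
  }
}
```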





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18106962
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
--- End diff --

This is not the feature dimension; input.count() is the number of 
LabeledPoints in your RDD. If you would like to compute the feature dimension, 
please use
input.first().features.size
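
To make the distinction concrete, here is a small sketch in plain Scala. `Point` is a hypothetical stand-in for `LabeledPoint`, and a `Seq` stands in for the RDD; on a real RDD the analogous calls would be `count()` for the first quantity and reading the feature vector of the first element for the second.

```scala
// Hypothetical stand-in for org.apache.spark.mllib.regression.LabeledPoint.
final case class Point(label: Double, features: Vector[Double])

object Dimensions {
  // Number of training instances (what count() reports on an RDD).
  def numInstances(data: Seq[Point]): Long = data.size.toLong

  // Number of features per instance, read from the first element.
  def numFeatures(data: Seq[Point]): Int = {
    require(data.nonEmpty, "data must not be empty")
    data.head.features.size
  }
}

object DimensionsDemo extends App {
  val data = Seq(
    Point(1.0, Vector(0.1, 0.2)),
    Point(0.0, Vector(0.3, 0.4)),
    Point(1.0, Vector(0.5, 0.6)))
  println(s"instances = ${Dimensions.numInstances(data)}")
  println(s"features  = ${Dimensions.numFeatures(data)}")
}
```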





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18107064
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
+  val gradient = input.map(l => l.label - boostingModel.computeValue(l.features))
+
+  val newInput: RDD[LabeledPoint] = input
+.zip(gradient)
+.map{case(inputVal, gradientVal) => new LabeledPoint(gradientVal, inputVal.features)}
+
+  val randomSample = newInput.sample(
+false,
+(samplingSizeRatio * featureDimension).asInstanceOf[Int],
--- End diff --

featureDimension is the number of instances? We probably need a better name 
for it.
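
A related point, offered as a sketch rather than a definitive fix: as far as I know, `RDD.sample(withReplacement, fraction, seed)` expects a fraction in [0, 1], so the ratio could be passed as-is and the instance count would not be needed for sampling at all. The plain-Scala code below illustrates fraction-based sampling without replacement; `FractionSampling` and its `sample` method are hypothetical names, and a `Seq` stands in for the RDD.

```scala
import scala.util.Random

object FractionSampling {
  // Keep each element independently with probability `fraction`
  // (sampling without replacement). The element count is never consulted.
  def sample[A](data: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    require(fraction >= 0.0 && fraction <= 1.0, "fraction must be in [0, 1]")
    val rng = new Random(seed)
    data.filter(_ => rng.nextDouble() < fraction)
  }
}

object FractionSamplingDemo extends App {
  val data = (1 to 100).toList
  // The sample size varies around fraction * data.size; it is not exact.
  println(FractionSampling.sample(data, 0.3, seed = 42L).size)
}
```

With this kind of sampling the result size is only approximately `fraction * size`, which is usually acceptable for stochastic boosting.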





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18107119
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala
 ---
@@ -0,0 +1,44 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.tree.configuration.Algo
+import org.apache.spark.mllib.tree.impurity.Variance
+import org.apache.spark.mllib.util.{LinearDataGenerator, LocalSparkContext}
+import org.apache.spark.rdd.{RDD, DoubleRDDFunctions}
--- End diff --

DoubleRDDFunctions is not needed.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18107150
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala
 ---
@@ -0,0 +1,44 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.tree.configuration.Algo
+import org.apache.spark.mllib.tree.impurity.Variance
+import org.apache.spark.mllib.util.{LinearDataGenerator, LocalSparkContext}
+import org.apache.spark.rdd.{RDD, DoubleRDDFunctions}
+import org.apache.spark.util.Utils
+import org.scalatest.FunSuite
+
+class StochasticGradientBoostingSuite extends FunSuite with 
LocalSparkContext {
+
+  /**
+   * Test if we can correctly learn on random data
+   */
+  test("stochastic gradient boosting") {
+val parsedData = randomLabeledPoints()
+val model = StochasticGradientBoosting.train(parsedData, Algo.Regression, Variance, 3)
+checkModel(parsedData, model)
+  }
+
+  test("test serialization") {
+val parsedData = randomLabeledPoints()
+val model = StochasticGradientBoosting.train(parsedData, Algo.Regression, Variance, 3)
+checkModel(parsedData, Utils.deserialize[StochasticGradientBoostingModel](Utils.serialize(model)))
+  }
+
+  def checkModel(parsedData: RDD[LabeledPoint], model: RegressionModel) {
+val valuesAndPredictions = parsedData.map { point =>
+  val prediction = model.predict(point.features)
+  (point.label, prediction)
+}
+val actualValues = parsedData.map(l => l.label)
+val mean = new DoubleRDDFunctions(actualValues).mean()
--- End diff --

Use mean() directly on actualValues; wrapping it in DoubleRDDFunctions is 
unnecessary.
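
The mechanism behind this is an implicit conversion: the compiler wraps the collection for you when you call a method it does not define itself. The sketch below reproduces the pattern in plain Scala; `DoubleSeqSyntax` and `DoubleSeqFunctions` are hypothetical names chosen to mirror `DoubleRDDFunctions`, and a `Seq[Double]` stands in for the RDD. In Spark versions of that era I believe the conversion had to be brought into scope with `import org.apache.spark.SparkContext._`.

```scala
object DoubleSeqSyntax {
  // Adds mean() to any Seq[Double] through an implicit wrapper class.
  implicit class DoubleSeqFunctions(self: Seq[Double]) {
    def mean(): Double = {
      require(self.nonEmpty, "mean of an empty sequence is undefined")
      self.sum / self.size
    }
  }
}

object MeanDemo extends App {
  import DoubleSeqSyntax._

  val labels = Seq(1.0, 2.0, 3.0)
  // Explicit wrapper, analogous to new DoubleRDDFunctions(rdd).mean().
  val explicitMean = new DoubleSeqFunctions(labels).mean()
  // Direct call; the compiler inserts the wrapper.
  val directMean = labels.mean()
  println(explicitMean == directMean)
}
```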





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18107249
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/regression/StochasticGradientBoostingSuite.scala
 ---
@@ -0,0 +1,44 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.tree.configuration.Algo
+import org.apache.spark.mllib.tree.impurity.Variance
+import org.apache.spark.mllib.util.{LinearDataGenerator, LocalSparkContext}
+import org.apache.spark.rdd.{RDD, DoubleRDDFunctions}
+import org.apache.spark.util.Utils
+import org.scalatest.FunSuite
+
+class StochasticGradientBoostingSuite extends FunSuite with 
LocalSparkContext {
+
+  /**
+   * Test if we can correctly learn on random data
+   */
+  test("stochastic gradient boosting") {
+val parsedData = randomLabeledPoints()
+val model = StochasticGradientBoosting.train(parsedData, Algo.Regression, Variance, 3)
+checkModel(parsedData, model)
+  }
+
+  test("test serialization") {
+val parsedData = randomLabeledPoints()
+val model = StochasticGradientBoosting.train(parsedData, Algo.Regression, Variance, 3)
+checkModel(parsedData, Utils.deserialize[StochasticGradientBoostingModel](Utils.serialize(model)))
+  }
+
+  def checkModel(parsedData: RDD[LabeledPoint], model: RegressionModel) {
+val valuesAndPredictions = parsedData.map { point =>
+  val prediction = model.predict(point.features)
+  (point.label, prediction)
+}
+val actualValues = parsedData.map(l => l.label)
+val mean = new DoubleRDDFunctions(actualValues).mean()
+val meanError = new DoubleRDDFunctions(actualValues.map(i => math.pow(i - mean, 2))).mean()
+val MSE = valuesAndPredictions.map { case (v, p) => math.pow(v - p, 2)}
+val error = new DoubleRDDFunctions(MSE).mean()
--- End diff --

Same as above: use mean() directly.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18107318
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
+  val gradient = input.map(l => l.label - boostingModel.computeValue(l.features))
+
+  val newInput: RDD[LabeledPoint] = input
+.zip(gradient)
+.map{case(inputVal, gradientVal) => new LabeledPoint(gradientVal, inputVal.features)}
+
+  val randomSample = newInput.sample(
+false,
+(samplingSizeRatio * featureDimension).asInstanceOf[Int],
+Random.nextInt()
+  )
+
+  val model = DecisionTree.train(randomSample, strategy)
+  boostingModel.addTree(model)
+}
+boostingModel
+  }
+}
+
+/**
+ * Model that can be used for prediction.
+ *
+ * @param countOfTrees Number of trees.
+ * @param initValue Initialize model with this value.
+ * @param learningRate Learning rate.
+ */
+class StochasticGradientBoostingModel (
+private val countOfTrees: Int,
+private var initValue: Double,
+private val learningRate: Double) extends Serializable with 
RegressionModel {
+
+  val trees: Array[DecisionTreeModel] = new 
Array[DecisionTreeModel](countOfTrees)
+  var index: Int = 0
--- End diff --

This should be private.
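
A rough sketch of what hiding this state could look like, in plain Scala. `BoostedModel` is a hypothetical simplification: trees are plain `Double => Double` functions instead of `DecisionTreeModel`, and the prediction rule (initial value plus the learning rate times the sum of tree outputs) is my assumption about what `computeValue` does, since its body is not shown in this diff.

```scala
// Hypothetical model: trees are plain functions here, not DecisionTreeModel.
class BoostedModel(countOfTrees: Int, initValue: Double, learningRate: Double) {
  // Private: callers cannot reorder trees or reset the write position.
  private val trees = new Array[Double => Double](countOfTrees)
  private var index = 0

  // The only way to mutate the model; guards against overflowing the array.
  def addTree(tree: Double => Double): Unit = {
    require(index < countOfTrees, "model is already full")
    trees(index) = tree
    index += 1
  }

  // Assumed rule: initial value plus the shrunken sum of the trees added so far.
  def predict(x: Double): Double =
    initValue + learningRate * trees.take(index).map(_(x)).sum

  // Read-only view of how many trees have been added.
  def numTrees: Int = index
}

object BoostedModelDemo extends App {
  val model = new BoostedModel(2, 1.0, 0.5)
  model.addTree(x => x)
  model.addTree(x => 2 * x)
  println(model.predict(2.0))
}
```

Keeping `trees` and `index` private means the invariant "the first `index` slots are filled" can only be broken from inside the class.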





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on a diff in the pull request:

https://github.com/apache/spark/pull/2394#discussion_r18107306
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/regression/StochasticGradientBoosting.scala
 ---
@@ -0,0 +1,173 @@
+package org.apache.spark.mllib.regression
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.tree.DecisionTree
+import org.apache.spark.mllib.tree.configuration.Algo.Algo
+import org.apache.spark.mllib.tree.configuration.Strategy
+import org.apache.spark.mllib.tree.impurity.Impurity
+import org.apache.spark.mllib.tree.model.DecisionTreeModel
+import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}
+import scala.util.Random
+
+/**
+ *
+ * Read about the algorithm Gradient boosting here:
+ * 
http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2007/GWD07/geurts-icml2007.pdf
+ *
+ * Libraries that implement the algorithm Gradient boosting similar way
+ * https://code.google.com/p/jforests/
+ * https://code.google.com/p/jsgbm/
+ *
+ */
+class StochasticGradientBoosting {
+
+  /**
+   * Train a Gradient Boosting model given an RDD of (label, features) 
pairs.
+   *
+   * @param input Training dataset: RDD of 
[[org.apache.spark.mllib.regression.LabeledPoint]].
+   * @param leaningRate Learning rate
+   * @param countOfTrees Number of trees.
+   * @param samplingSizeRatio Size of random sample, percent of ${input} 
size.
+   * @param strategy The configuration parameters for the tree algorithm 
which specify the type
+   * of algorithm (classification, regression, etc.), 
feature type (continuous,
+   * categorical), depth of the tree, quantile calculation 
strategy, etc.
+   * @return StochasticGradientBoostingModel that can be used for 
prediction
+   */
+  def run(
+   input : RDD[LabeledPoint],
+   leaningRate : Double,
+   countOfTrees : Int,
+   samplingSizeRatio : Double,
+   strategy: Strategy): StochasticGradientBoostingModel = {
+
+val featureDimension = input.count()
+val mean = new DoubleRDDFunctions(input.map(l => l.label)).mean()
+val boostingModel = new StochasticGradientBoostingModel(countOfTrees, mean, leaningRate)
+
+for (i <- 0 to countOfTrees - 1) {
+  val gradient = input.map(l => l.label - boostingModel.computeValue(l.features))
+
+  val newInput: RDD[LabeledPoint] = input
+.zip(gradient)
+.map{case(inputVal, gradientVal) => new LabeledPoint(gradientVal, inputVal.features)}
+
+  val randomSample = newInput.sample(
+false,
+(samplingSizeRatio * featureDimension).asInstanceOf[Int],
+Random.nextInt()
+  )
+
+  val model = DecisionTree.train(randomSample, strategy)
+  boostingModel.addTree(model)
+}
+boostingModel
+  }
+}
+
+/**
+ * Model that can be used for prediction.
+ *
+ * @param countOfTrees Number of trees.
+ * @param initValue Initialize model with this value.
+ * @param learningRate Learning rate.
+ */
+class StochasticGradientBoostingModel (
+private val countOfTrees: Int,
+private var initValue: Double,
+private val learningRate: Double) extends Serializable with 
RegressionModel {
+
+  val trees: Array[DecisionTreeModel] = new 
Array[DecisionTreeModel](countOfTrees)
--- End diff --

This should be private as well.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-26 Thread Ishiihara
Github user Ishiihara commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-57006487
  
@mengxr @epahomov I added some comments after quickly going through the code. 
I will take a deeper look at the algorithm later.





[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-15 Thread epahomov
GitHub user epahomov opened a pull request:

https://github.com/apache/spark/pull/2394

[Spark-3525] Adding gradient boosting



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/epahomov/spark SPARK-3525

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2394.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2394


commit d0dfb7b632715c60ef78964ea4d20aaa7712d2e2
Author: olgaoskina olgaosk...@yandex-team.ru
Date:   2014-09-04T06:51:45Z

Added stochastic gradient boosting algorithm

commit 11c247a72e1681661cef4314fec5d1b4283b087f
Author: olgaoskina olgaosk...@yandex-team.ru
Date:   2014-09-04T06:52:05Z

Added stochastic gradient boosting algorithm

commit fdfc88e046a29202058b8f45168d624ed91f6d16
Author: olgaoskina olgaosk...@yandex-team.ru
Date:   2014-09-05T12:25:41Z

Code refactor

commit b91b372c951db8bd1be6bd4d2308bc509bc1b44f
Author: olgaoskina olgaosk...@yandex-team.ru
Date:   2014-09-06T09:02:51Z

Added test 'StochasticGradientBoostingSuite'

commit 223f0907b6accaa0bf08c7948b2e6c1d728dab18
Author: olgaoskina olgaosk...@yandex-team.ru
Date:   2014-09-10T08:08:30Z

Added new test

commit da13706bd8101ec8a2b648ce6ddc9777516e121f
Author: olgaoskina olgaosk...@yandex-team.ru
Date:   2014-09-14T15:33:52Z

Refactor tests

commit eafa0b75785b2ac570ddbc26a80b08b328f7b29c
Author: Egor Pakhomov pahomov.e...@gmail.com
Date:   2014-09-15T07:42:53Z

Merge branch 'gradient_boosting' of https://github.com/olgaoskina/spark 
into olgaoskina-gradient_boosting

commit 3c56f4ef65fb0df80804b0f4b9436f0623582be7
Author: Egor Pakhomov pahomov.e...@gmail.com
Date:   2014-09-15T08:46:43Z

Merge branch 'olgaoskina-gradient_boosting' into SPARK-3525

commit ce1934a329783629a12f615cbeac3d7e1a05a791
Author: Egor Pakhomov pahomov.e...@gmail.com
Date:   2014-09-15T08:32:48Z

[SPARK-3525] Fixing GradientBoostingSuite







[GitHub] spark pull request: [Spark-3525] Adding gradient boosting

2014-09-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2394#issuecomment-55565637
  
Can one of the admins verify this patch?

