[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21129


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-17 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r189052646
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -460,18 +461,37 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {
 
-  /* TODO: Add this doc when we add this param.  SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
--- End diff --

This doc is not quite accurate.  Can you please update it to:
```
Threshold for stopping early when fit with validation is used.
(This parameter is ignored when fit without validation is used.)
The decision to stop early is decided based on this logic:
If the current loss on the validation set is greater than 0.01, the diff
of validation error is compared to relative tolerance which is
validationTol * (current loss on the validation set).
If the current loss on the validation set is less than or equal to 0.01,
the diff of validation error is compared to absolute tolerance which is
validationTol * 0.01.
```
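The suggested wording can be sketched as a small self-contained function. The function name and the assumption that "diff of validation error" means the decrease in loss from one iteration to the next are illustrative only, not taken from the Spark source:

```scala
// Hedged sketch of the tolerance rule quoted above.
// If the current validation loss is above 0.01, the improvement is compared
// to a relative tolerance; otherwise, to an absolute tolerance.
def shouldStopEarly(previousLoss: Double, currentLoss: Double, validationTol: Double): Boolean = {
  val diff = previousLoss - currentLoss
  val tolerance =
    if (currentLoss > 0.01) validationTol * currentLoss // relative tolerance
    else validationTol * 0.01                           // absolute tolerance
  diff < tolerance
}
```

For example, with `validationTol = 0.01`, a drop from 0.5 to 0.4995 is below the relative tolerance (0.004995) and would stop training, while a drop from 0.5 to 0.3 would not.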


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r187114172
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {
 
-  /* TODO: Add this doc when we add this param.  SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
--- End diff --

Let's add the more precise description from the old BoostingStrategy for this Param.


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r187113953
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {
 
-  /* TODO: Add this doc when we add this param.  SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
-   * then learning will stop early (before [[numIterations]]).
-   * This parameter is ignored when run is used.
+   * then learning will stop early (before [[maxIter]]).
+   * This parameter is ignored when fit without validation is used.
    * (default = 1e-5)
    * @group param
--- End diff --

Let's add a `@see` validationIndicatorCol


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-09 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r187112582
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {
 
-  /* TODO: Add this doc when we add this param.  SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
-   * then learning will stop early (before [[numIterations]]).
-   * This parameter is ignored when run is used.
+   * then learning will stop early (before [[maxIter]]).
+   * This parameter is ignored when fit without validation is used.
    * (default = 1e-5)
--- End diff --

I forget why we chose 1e-5 (which is different from spark.mllib). What do you think about using 0.01 to match the sklearn docs here? http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html (I also checked xgboost, but they use a different approach based on x number of steps without improvement. We may want to add that at some point since it sounds more robust.)


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r186572477
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---
@@ -497,6 +498,9 @@ private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasS
   @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0")
   def setStepSize(value: Double): this.type = set(stepSize, value)
 
+  /** @group setParam */
+  def setValidationIndicatorCol(value: String): this.type = set(validationIndicatorCol, value)
--- End diff --

Since we need this setter to be in the concrete classes for Java compatibility, don't bother putting it here.


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r186572374
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }
 
+  test("runWithValidation stops early and performs better on a validation dataset") {
+    val validationIndicatorCol = "validationIndicator"
+    val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false))
+    val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true))
+
+    val numIter = 20
+    for (lossType <- GBTClassifier.supportedLossTypes) {
+      val gbt = new GBTClassifier()
+        .setSeed(123)
+        .setMaxDepth(2)
+        .setLossType(lossType)
+        .setMaxIter(numIter)
+      val modelWithoutValidation = gbt.fit(trainDF)
+
+      gbt.setValidationIndicatorCol(validationIndicatorCol)
+      val modelWithValidation = gbt.fit(trainDF.union(validationDF))
+
+      // early stop
+      assert(modelWithValidation.numTrees < numIter)
+
+      val (errorWithoutValidation, errorWithValidation) = {
+        val remappedRdd = validationData.map(x => new LabeledPoint(2 * x.label - 1, x.features))
+        (GradientBoostedTrees.computeError(remappedRdd, modelWithoutValidation.trees,
+          modelWithoutValidation.treeWeights, modelWithoutValidation.getOldLossType),
+          GradientBoostedTrees.computeError(remappedRdd, modelWithValidation.trees,
+            modelWithValidation.treeWeights, modelWithValidation.getOldLossType))
+      }
+      assert(errorWithValidation <= errorWithoutValidation)
--- End diff --

It'd be nice to have this be strictly true. Is it not?


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r186573136
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") (
   @Since("1.4.0")
   def setLossType(value: String): this.type = set(lossType, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  override def setValidationIndicatorCol(value: String): this.type = {
+    set(validationIndicatorCol, value)
+  }
+
   override protected def train(dataset: Dataset[_]): GBTClassificationModel = {
     val categoricalFeatures: Map[Int, Int] =
       MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
+
+    val withValidation = isDefined(validationIndicatorCol)
+
     // We copy and modify this from Classifier.extractLabeledPoints since GBT only supports
     // 2 classes now.  This lets us provide a more precise error message.
-    val oldDataset: RDD[LabeledPoint] =
+    val convert2LabelPoint = (dataset: Dataset[_]) => {
--- End diff --

nit: "LabelPoint" -> "LabeledPoint"


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r186569928
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---
@@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }
 
+  test("runWithValidation stops early and performs better on a validation dataset") {
+    val validationIndicatorCol = "validationIndicator"
+    val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false))
+    val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true))
+
+    val numIter = 20
+    for (lossType <- GBTClassifier.supportedLossTypes) {
+      val gbt = new GBTClassifier()
+        .setSeed(123)
+        .setMaxDepth(2)
+        .setLossType(lossType)
+        .setMaxIter(numIter)
+      val modelWithoutValidation = gbt.fit(trainDF)
+
+      gbt.setValidationIndicatorCol(validationIndicatorCol)
+      val modelWithValidation = gbt.fit(trainDF.union(validationDF))
+
+      // early stop
+      assert(modelWithValidation.numTrees < numIter)
--- End diff --

Let's also assert `modelWithoutValidation.numTrees == numIter`. That's true now, but I could imagine it changing later on if we add a convergence tolerance to the algorithm.


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-05-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r186572853
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---
@@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") (
   @Since("1.4.0")
   def setLossType(value: String): this.type = set(lossType, value)
 
+  /** @group setParam */
+  @Since("2.4.0")
+  override def setValidationIndicatorCol(value: String): this.type = {
+    set(validationIndicatorCol, value)
+  }
+
   override protected def train(dataset: Dataset[_]): GBTClassificationModel = {
     val categoricalFeatures: Map[Int, Int] =
       MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
+
+    val withValidation = isDefined(validationIndicatorCol)
--- End diff --

Let's do: `isDefined(validationIndicatorCol) && $(validationIndicatorCol).nonEmpty` (so we count "" as not defined).
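The intent of that check (an empty column name counts as "not set") can be illustrated with a small self-contained sketch. Plain Scala `Option` is used here in place of the `Params` API, and the function name is hypothetical:

```scala
// Sketch: treat both an unset column name and an empty string as "no validation".
// `Option[String]` stands in for Params.isDefined + $(...); names are illustrative.
def useValidation(validationIndicatorCol: Option[String]): Boolean =
  validationIndicatorCol.exists(_.nonEmpty)
```

With this rule, `None` and `Some("")` both mean "fit without validation", while any non-empty column name enables validation.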


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-04-27 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/21129#discussion_r184768987
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala ---
@@ -95,7 +95,9 @@ private[shared] object SharedParamsCodeGen {
       ParamDesc[String]("distanceMeasure", "The distance measure. Supported options: 'euclidean'" +
         " and 'cosine'", Some("org.apache.spark.mllib.clustering.DistanceMeasure.EUCLIDEAN"),
         isValid = "(value: String) => " +
-          "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)")
+          "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)"),
+      ParamDesc[String]("validationIndicatorCol", "the indicator column name for indicating " +
--- End diff --

How about rephrasing the description: "name of the column that indicates whether each row is for training or for validation. False indicates training; true indicates validation."
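As a minimal illustration of those semantics, rows can be split on such a boolean indicator. This sketch uses plain Scala collections rather than Spark DataFrames, and the case class and field names are hypothetical:

```scala
// Minimal sketch: false => training row, true => validation row.
// Not the Spark API; the case class and field names are illustrative.
case class LabeledRow(label: Double, isValidation: Boolean)

def splitByIndicator(rows: Seq[LabeledRow]): (Seq[LabeledRow], Seq[LabeledRow]) =
  rows.partition(_.isValidation) // returns (validation rows, training rows)

val rows = Seq(LabeledRow(1.0, false), LabeledRow(0.0, false), LabeledRow(1.0, true))
val (validation, training) = splitByIndicator(rows)
```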


---




[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...

2018-04-23 Thread WeichenXu123
GitHub user WeichenXu123 opened a pull request:

https://github.com/apache/spark/pull/21129

[SPARK-7132][ML] Add fit with validation set to spark.ml GBT

## What changes were proposed in this pull request?

Add fit with validation set to spark.ml GBT

## How was this patch tested?

N/A


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/WeichenXu123/spark gbt_fit_validation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21129.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21129


commit 3722fec40eeb738e5c19cb8562ac821c050dbc39
Author: WeichenXu 
Date:   2018-04-23T12:09:39Z

init pr




---
