[GitHub] spark pull request #21129: [SPARK-7132][ML] Add fit with validation set to s...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21129

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r189052646

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---

```
@@ -460,18 +461,37 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {

-  /* TODO: Add this doc when we add this param. SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
```

--- End diff --

This doc is not quite accurate. Can you please update it to:

```
Threshold for stopping early when fit with validation is used.
(This parameter is ignored when fit without validation is used.)
The decision to stop early is decided based on this logic:
If the current loss on the validation set is greater than 0.01, the diff
of validation error is compared to relative tolerance which is
validationTol * (current loss on the validation set).
If the current loss on the validation set is less than or equal to 0.01,
the diff of validation error is compared to absolute tolerance which is
validationTol * 0.01.
```
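The suggested doc can be illustrated with a small, self-contained sketch of the stopping decision. This is not the actual Spark implementation; the object and method names are hypothetical and for illustration only:

```scala
// Hypothetical sketch of the early-stopping check described in the
// suggested doc above. Not the actual Spark internals.
object EarlyStopSketch {
  def shouldStopEarly(
      previousValidationLoss: Double,
      currentValidationLoss: Double,
      validationTol: Double): Boolean = {
    val improvement = previousValidationLoss - currentValidationLoss
    if (currentValidationLoss > 0.01) {
      // Relative tolerance: compare against validationTol * (current loss).
      improvement < validationTol * currentValidationLoss
    } else {
      // Absolute tolerance: compare against validationTol * 0.01.
      improvement < validationTol * 0.01
    }
  }
}
```

With `validationTol = 0.01`, a drop in validation loss from 1.0 to 0.999 is below the relative tolerance (0.00999), so training would stop; a drop from 1.0 to 0.5 is a real improvement and training would continue.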
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187114172

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---

```
@@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {

-  /* TODO: Add this doc when we add this param. SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
```

--- End diff --

Let's add the more precise description from the old BoostingStrategy for this Param.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187113953

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---

```
@@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {

-  /* TODO: Add this doc when we add this param. SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
-   * then learning will stop early (before [[numIterations]]).
-   * This parameter is ignored when run is used.
+   * then learning will stop early (before [[maxIter]]).
+   * This parameter is ignored when fit without validation is used.
    * (default = 1e-5)
    * @group param
```

--- End diff --

Let's add a `@see` validationIndicatorCol
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r187112582

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---

```
@@ -460,18 +461,29 @@ private[ml] trait RandomForestRegressorParams
  *
  * Note: Marked as private and DeveloperApi since this may be made public in the future.
  */
-private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize {
+private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasStepSize
+  with HasValidationIndicatorCol {

-  /* TODO: Add this doc when we add this param. SPARK-7132
-   * Threshold for stopping early when runWithValidation is used.
+  /**
+   * Threshold for stopping early when fit with validation is used.
    * If the error rate on the validation input changes by less than the validationTol,
-   * then learning will stop early (before [[numIterations]]).
-   * This parameter is ignored when run is used.
+   * then learning will stop early (before [[maxIter]]).
+   * This parameter is ignored when fit without validation is used.
    * (default = 1e-5)
```

--- End diff --

I forget why we chose 1e-5 (which is different from spark.mllib). What do you think about using 0.01 to match the sklearn docs here? http://scikit-learn.org/dev/auto_examples/ensemble/plot_gradient_boosting_early_stopping.html

(I also checked xgboost, but they use a different approach based on x number of steps without improvement. We may want to add that at some point since it sounds more robust.)
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572477

--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala ---

```
@@ -497,6 +498,9 @@ private[ml] trait GBTParams extends TreeEnsembleParams with HasMaxIter with HasS
   @deprecated("This method is deprecated and will be removed in 3.0.0.", "2.1.0")
   def setStepSize(value: Double): this.type = set(stepSize, value)

+  /** @group setParam */
+  def setValidationIndicatorCol(value: String): this.type = set(validationIndicatorCol, value)
```

--- End diff --

Since we need this setter to be in the concrete classes for Java compatibility, don't bother putting it here.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572374

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---

```
@@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("runWithValidation stops early and performs better on a validation dataset") {
+    val validationIndicatorCol = "validationIndicator"
+    val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false))
+    val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true))
+
+    val numIter = 20
+    for (lossType <- GBTClassifier.supportedLossTypes) {
+      val gbt = new GBTClassifier()
+        .setSeed(123)
+        .setMaxDepth(2)
+        .setLossType(lossType)
+        .setMaxIter(numIter)
+      val modelWithoutValidation = gbt.fit(trainDF)
+
+      gbt.setValidationIndicatorCol(validationIndicatorCol)
+      val modelWithValidation = gbt.fit(trainDF.union(validationDF))
+
+      // early stop
+      assert(modelWithValidation.numTrees < numIter)
+
+      val (errorWithoutValidation, errorWithValidation) = {
+        val remappedRdd = validationData.map(x => new LabeledPoint(2 * x.label - 1, x.features))
+        (GradientBoostedTrees.computeError(remappedRdd, modelWithoutValidation.trees,
+          modelWithoutValidation.treeWeights, modelWithoutValidation.getOldLossType),
+          GradientBoostedTrees.computeError(remappedRdd, modelWithValidation.trees,
+            modelWithValidation.treeWeights, modelWithValidation.getOldLossType))
+      }
+      assert(errorWithValidation <= errorWithoutValidation)
```

--- End diff --

It'd be nice to have this be strictly true. Is it not?
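The `2 * x.label - 1` remapping in the test converts {0, 1} class labels to the {-1, +1} convention expected by the old spark.mllib loss computation. As a standalone check (the helper name here is hypothetical):

```scala
// Remap a {0, 1} binary class label to the {-1, +1} convention used by
// the old spark.mllib GBT loss functions (done inline in the test above).
def remapLabel(label: Double): Double = 2 * label - 1
```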
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186573136

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---

```
@@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") (
   @Since("1.4.0")
   def setLossType(value: String): this.type = set(lossType, value)

+  /** @group setParam */
+  @Since("2.4.0")
+  override def setValidationIndicatorCol(value: String): this.type = {
+    set(validationIndicatorCol, value)
+  }
+
   override protected def train(dataset: Dataset[_]): GBTClassificationModel = {
     val categoricalFeatures: Map[Int, Int] =
       MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
+
+    val withValidation = isDefined(validationIndicatorCol)
+
     // We copy and modify this from Classifier.extractLabeledPoints since GBT only supports
     // 2 classes now. This lets us provide a more precise error message.
-    val oldDataset: RDD[LabeledPoint] =
+    val convert2LabelPoint = (dataset: Dataset[_]) => {
```

--- End diff --

nit: "LabelPoint" -> "LabeledPoint"
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186569928

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/GBTClassifierSuite.scala ---

```
@@ -365,6 +366,50 @@ class GBTClassifierSuite extends MLTest with DefaultReadWriteTest {
     assert(mostImportantFeature !== mostIF)
   }

+  test("runWithValidation stops early and performs better on a validation dataset") {
+    val validationIndicatorCol = "validationIndicator"
+    val trainDF = trainData.toDF().withColumn(validationIndicatorCol, lit(false))
+    val validationDF = validationData.toDF().withColumn(validationIndicatorCol, lit(true))
+
+    val numIter = 20
+    for (lossType <- GBTClassifier.supportedLossTypes) {
+      val gbt = new GBTClassifier()
+        .setSeed(123)
+        .setMaxDepth(2)
+        .setLossType(lossType)
+        .setMaxIter(numIter)
+      val modelWithoutValidation = gbt.fit(trainDF)
+
+      gbt.setValidationIndicatorCol(validationIndicatorCol)
+      val modelWithValidation = gbt.fit(trainDF.union(validationDF))
+
+      // early stop
+      assert(modelWithValidation.numTrees < numIter)
```

--- End diff --

Let's also assert `modelWithoutValidation.numTrees == numIter`. That's true now, but I could imagine it changing later on if we add a convergence tolerance to the algorithm.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r186572853

--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala ---

```
@@ -146,20 +146,40 @@ class GBTClassifier @Since("1.4.0") (
   @Since("1.4.0")
   def setLossType(value: String): this.type = set(lossType, value)

+  /** @group setParam */
+  @Since("2.4.0")
+  override def setValidationIndicatorCol(value: String): this.type = {
+    set(validationIndicatorCol, value)
+  }
+
   override protected def train(dataset: Dataset[_]): GBTClassificationModel = {
     val categoricalFeatures: Map[Int, Int] =
       MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol)))
+
+    val withValidation = isDefined(validationIndicatorCol)
```

--- End diff --

Let's do: `isDefined(validationIndicatorCol) && $(validationIndicatorCol).nonEmpty` (so we count "" as not defined).
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/21129#discussion_r184768987

--- Diff: mllib/src/main/scala/org/apache/spark/ml/param/shared/SharedParamsCodeGen.scala ---

```
@@ -95,7 +95,9 @@ private[shared] object SharedParamsCodeGen {
     ParamDesc[String]("distanceMeasure", "The distance measure. Supported options: 'euclidean'" +
       " and 'cosine'", Some("org.apache.spark.mllib.clustering.DistanceMeasure.EUCLIDEAN"),
       isValid = "(value: String) => " +
-        "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)")
+        "org.apache.spark.mllib.clustering.DistanceMeasure.validateDistanceMeasure(value)"),
+    ParamDesc[String]("validationIndicatorCol", "the indicator column name for indicating " +
```

--- End diff --

How about rephrasing the description: "name of the column that indicates whether each row is for training or for validation. False indicates training; true indicates validation."
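The rephrased description amounts to a simple split on a boolean column: false rows train the model, true rows validate it. A minimal plain-Scala sketch of those semantics (the `ExampleRow` case class is hypothetical, not Spark's `Row`):

```scala
// Hypothetical illustration of the validationIndicatorCol semantics:
// false marks a training row, true marks a validation row.
case class ExampleRow(label: Double, validationIndicator: Boolean)

object IndicatorSplitSketch {
  def split(rows: Seq[ExampleRow]): (Seq[ExampleRow], Seq[ExampleRow]) = {
    // partition returns (matching, non-matching) = (validation, training)
    val (validation, training) = rows.partition(_.validationIndicator)
    (training, validation)
  }
}
```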
GitHub user WeichenXu123 opened a pull request: https://github.com/apache/spark/pull/21129

[SPARK-7132][ML] Add fit with validation set to spark.ml GBT

## What changes were proposed in this pull request?

Add fit with validation set to spark.ml GBT

## How was this patch tested?

N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/WeichenXu123/spark gbt_fit_validation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21129.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21129

commit 3722fec40eeb738e5c19cb8562ac821c050dbc39
Author: WeichenXu
Date: 2018-04-23T12:09:39Z

    init pr