[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-177376070

LGTM except for minor comments. Thanks.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51354776

Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

@@ -219,33 +219,44 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
     }
     val yMean = ySummarizer.mean(0)
-    val yStd = math.sqrt(ySummarizer.variance(0))
-
-    // If the yStd is zero, then the intercept is yMean with zero coefficient;
-    // as a result, training is not needed.
-    if (yStd == 0.0) {
-      logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-        s"zeros and the intercept will be the mean of the label; as a result, " +
-        s"training is not needed.")
-      if (handlePersistence) instances.unpersist()
-      val coefficients = Vectors.sparse(numFeatures, Seq())
-      val intercept = yMean
-
-      val model = new LinearRegressionModel(uid, coefficients, intercept)
-      // Handle possible missing or invalid prediction columns
-      val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-      val trainingSummary = new LinearRegressionTrainingSummary(
-        summaryModel.transform(dataset),
-        predictionColName,
-        $(labelCol),
-        model,
-        Array(0D),
-        $(featuresCol),
-        Array(0D))
-      return copyValues(model.setSummary(trainingSummary))
+    val rawYStd = math.sqrt(ySummarizer.variance(0))
+    if (rawYStd == 0.0) {
+      if ($(fitIntercept) || yMean==0.0) {
+        // If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+        // zero coefficient; as a result, training is not needed.
+        // Also, if yMean==0 and rawYStd==0, all the coefficients are zero regardless of
+        // the fitIntercept
+        logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +

End diff:

Maybe you want to update the warning message for the second situation as well.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51354767

Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

@@ -219,33 +219,44 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
    [...]
+      } else {
+        require($(regParam) == 0.0, "The standard deviation of the label is zero. " +
+          "Model cannot be regularized.")
+        logWarning(s"The standard deviation of the label is zero. " +
+          "Consider setting fitIntercept=true.")
+      }
     }
+    // if y is constant (rawYStd is zero), then y cannot be scaled. In this case
+    // setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
+    val yStd = if (rawYStd > 0) rawYStd else if (yMean != 0.0) math.abs(yMean) else 1.0

End diff:

`val yStd = if (rawYStd > 0) rawYStd else math.abs(yMean)`, since you already check the condition before.
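Putting the pieces of the final logic together: the following is a minimal Python sketch (illustrative names only, not Spark's actual Scala implementation) of how the constant-label cases discussed in this review are dispatched.

```python
import numpy as np

def choose_y_std(y, fit_intercept, reg_param):
    """Sketch of the constant-label guard: decide whether training is needed
    and which label scale to hand to the optimizer."""
    y_mean = float(np.mean(y))
    raw_y_std = float(np.std(y))
    if raw_y_std == 0.0:
        if fit_intercept or y_mean == 0.0:
            # Training is unnecessary: every coefficient is zero and the
            # intercept (if fitted) is simply the label mean.
            return {"skip_training": True,
                    "intercept": y_mean if fit_intercept else 0.0}
        if reg_param != 0.0:
            # fitIntercept=false with a non-zero constant label cannot be
            # regularized sensibly, so reject it.
            raise ValueError("The standard deviation of the label is zero. "
                             "Model cannot be regularized.")
    # Fall back to |yMean| (or 1.0) so the label is merely normalized,
    # not standardized, inside the optimizer.
    y_std = raw_y_std if raw_y_std > 0 else (abs(y_mean) if y_mean != 0.0 else 1.0)
    return {"skip_training": False, "y_std": y_std}
```

This mirrors the suggested simplification above: once the `fitIntercept || yMean == 0.0` branch has returned, the remaining fallback only needs `abs(y_mean)` when the raw standard deviation is zero.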
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51354784

Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

@@ -558,6 +575,47 @@ class LinearRegressionSuite
     }
   }

+  test("linear regression model with constant label") {
+    /*
+      R code:
+      for (formula in c(b.const ~ . -1, b.const ~ .)) {
+        model <- lm(formula, data=df.const.label, weights=w)
+        print(as.vector(coef(model)))
+      }
+      [1] -9.221298 3.394343
+      [1] 17 0 0
+    */
+    val expected = Seq(
+      Vectors.dense(0.0, -9.221298, 3.394343),
+      Vectors.dense(17.0, 0.0, 0.0))
+
+    Seq("auto", "l-bfgs", "normal").foreach { solver =>
+      var idx = 0
+      for (fitIntercept <- Seq(false, true)) {
+        val model = new LinearRegression()
+          .setFitIntercept(fitIntercept)
+          .setWeightCol("weight")
+          .setSolver(solver)
+          .fit(datasetWithWeightConstantLabel)
+        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+        assert(actual ~== expected(idx) absTol 1e-4)
+        idx += 1
+      }
+    }
+  }
+
+  test("regularized linear regression through origin with constant label") {
+    // The problem is ill-defined if fitIntercept=false, regParam is non-zero and \

End diff:

Seems you forgot to remove the `\` :)
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51355106

Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

@@ -558,6 +575,47 @@ class LinearRegressionSuite
    [...]
+        val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+        assert(actual ~== expected(idx) absTol 1e-4)
+        idx += 1

End diff:

In `LinearRegressionTrainingSummary`, you can get it from `objectiveHistory`.
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10940#issuecomment-176459204

@iyounus Ideally, when `intercept=true`, it would be great to keep the current behavior, in which a constant column doesn't have any predictive power.
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10940#issuecomment-176503318

@iyounus Maybe in this case, `WeightedLeastSquares` should drop those columns so the model can still be trained.
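The suggestion above — drop constant (zero-variance) feature columns before solving so the system stays well-posed — can be illustrated with a small NumPy sketch. This is a hypothetical helper, not what Spark's `WeightedLeastSquares` actually does.

```python
import numpy as np

def drop_constant_columns(X):
    # Keep only the columns whose sample standard deviation is non-zero;
    # constant columns carry no predictive information and can make the
    # normal equations singular.
    keep = np.std(X, axis=0) > 0.0
    return X[:, keep], keep

X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])
X_reduced, kept = drop_constant_columns(X)
# The constant second column is dropped; `kept` records which columns survive
# so the fitted coefficients can be mapped back to the original feature order.
```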
[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10274#discussion_r51059677

Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala

@@ -74,6 +89,35 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext
     }
   }

+  test("WLS against lm when label is constant") {
+    /*
+      R code:
+      # here b is constant
+      df <- as.data.frame(cbind(A, b))
+      for (formula in c(b ~ . -1, b ~ .)) {
+        model <- lm(formula, data=df, weights=w)
+        print(as.vector(coef(model)))
+      }
+
+      [1] -9.221298 3.394343
+      [1] 17 0 0
+    */
+
+    val expected = Seq(
+      Vectors.dense(0.0, -9.221298, 3.394343),
+      Vectors.dense(17.0, 0.0, 0.0))
+
+    var idx = 0
+    for (fitIntercept <- Seq(false, true)) {
+      val wls = new WeightedLeastSquares(
+        fitIntercept, regParam = 0.0, standardizeFeatures = false, standardizeLabel = true)
+        .fit(instancesConstLabel)

End diff:

Sorry for getting back to you so late. The difference is due to the fact that `glmnet` always standardizes labels, even when `standardization == false`; `standardization == false` only turns off the standardization of features. As a result, at least in `glmnet`, when `ystd == 0.0` the training is not valid.
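The expected R values above come from the suite's dataset (`A`, `b`, `w`), which is not shown here. The general phenomenon is easy to reproduce with made-up data, though: once an intercept column is included, a constant label is fit exactly by setting the intercept to the label value and every coefficient to zero. A NumPy sketch under that assumption:

```python
import numpy as np

# Illustrative data, not the test suite's A/b/w.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 2))
w = rng.uniform(1.0, 2.0, size=20)
b = np.full(20, 17.0)                  # constant label

X = np.column_stack([np.ones(20), A])  # prepend an intercept column
sw = np.sqrt(w)                        # weighted least squares via row scaling
coef, *_ = np.linalg.lstsq(X * sw[:, None], b * sw, rcond=None)
# coef recovers [17, 0, 0]: the intercept absorbs the label, features get zero.
```

Without the intercept column (the `b ~ . -1` formula), the fit is still defined but the coefficients are generally non-zero, matching the first R output line.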
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51064724

Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

@@ -558,6 +575,47 @@ class LinearRegressionSuite
    [...]
+  test("regularized linear regression through origin with constant label") {
+    // The problem is ill-defined if fitIntercept=false, regParam is non-zero and \

End diff:

Remove the `\` at the end of the line.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51064615

Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

@@ -558,6 +575,47 @@ class LinearRegressionSuite
    [...]
+  test("regularized linear regression through origin with constant label") {
+    // The problem is ill-defined if fitIntercept=false, regParam is non-zero and \
+    // standardization=true. An exception is thrown in this case.

End diff:

When `standardization=false`, the problem is still ill-defined, since GLMNET always standardizes the labels; that's why you see it in the analytical solution. Let's throw an exception when `fitIntercept=false` and `regParam != 0.0`.
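To make the distinction above concrete: with a constant label, the *unregularized* fit through the origin is perfectly well defined (the no-intercept R fit earlier returns non-zero coefficients), while glmnet-style label standardization would divide by a zero standard deviation. A small NumPy sketch with made-up data:

```python
import numpy as np

# Illustrative data, not the test suite's dataset.
rng = np.random.default_rng(2)
A = rng.normal(size=(20, 2))
b = np.full(20, 17.0)  # constant label

# Unregularized least squares through the origin: well defined, and the
# coefficients are generally non-zero because the features must approximate
# the constant label on their own.
coef, *_ = np.linalg.lstsq(A, b, rcond=None)

# glmnet-style label standardization would compute b / b.std(), but the
# standard deviation of a constant label is exactly zero, so the regularized
# problem has no sensible formulation.
assert b.std() == 0.0
```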
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51069962

Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
    [...]
+      } else {
+        require(!($(regParam) > 0.0 && $(standardization)),

End diff:

Remove `&& $(standardization)`.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51070090

Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
    [...]
+      } else {
+        require(!($(regParam) > 0.0 && $(standardization)),

End diff:

Just `require($(regParam) != 0.0)`.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51070105

Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
    [...]
+      } else {
+        require(!($(regParam) > 0.0 && $(standardization)),

End diff:

Also change the message.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-175901752

@iyounus For `standardizeLabel = false/true` with a non-zero `regParam`, let's throw the exception. I explained the mismatch against the analytic normal equation in the other PR. Thanks.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51070506

Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
    [...]
+      } else {
+        require(!($(regParam) > 0.0 && $(standardization)),
+          "The standard deviation of the label is zero. " +
+            "Model cannot be regularized with standardization=true")
+        logWarning(s"The standard deviation of the label is zero. " +
+          "Consider setting fitIntercept=true.")
+      }
     }
+    // if y is constant (rawYStd is zero), then y cannot be scaled. In this case
+    // setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
+    val yStd = if (rawYStd > 0) rawYStd else 1.0

End diff:

Actually, in the case of `yMean == 0.0 && yStd == 0.0`, the coefficients will all be zero as well, even when `fitIntercept == false`. This is a rare case, so we can let the model training figure it out. But if you want to handle it explicitly, that's great.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51077616

Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala

@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
    [...]
+    // if y is constant (rawYStd is zero), then y cannot be scaled. In this case
+    // setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
+    val yStd = if (rawYStd > 0) rawYStd else 1.0

End diff:

This will help the optimization if `yMean` is very large. In a sense, we are normalizing the label instead of standardizing it.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51070301 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String } val yMean = ySummarizer.mean(0) -val yStd = math.sqrt(ySummarizer.variance(0)) - -// If the yStd is zero, then the intercept is yMean with zero coefficient; -// as a result, training is not needed. -if (yStd == 0.0) { - logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + -s"zeros and the intercept will be the mean of the label; as a result, " + -s"training is not needed.") - if (handlePersistence) instances.unpersist() - val coefficients = Vectors.sparse(numFeatures, Seq()) - val intercept = yMean - - val model = new LinearRegressionModel(uid, coefficients, intercept) - // Handle possible missing or invalid prediction columns - val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol() - - val trainingSummary = new LinearRegressionTrainingSummary( -summaryModel.transform(dataset), -predictionColName, -$(labelCol), -model, -Array(0D), -$(featuresCol), -Array(0D)) - return copyValues(model.setSummary(trainingSummary)) +val rawYStd = math.sqrt(ySummarizer.variance(0)) +if (rawYStd == 0.0) { + if ($(fitIntercept)) { +// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with +// zero coefficient; as a result, training is not needed. 
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + + s"zeros and the intercept will be the mean of the label; as a result, " + + s"training is not needed.") +if (handlePersistence) instances.unpersist() +val coefficients = Vectors.sparse(numFeatures, Seq()) +val intercept = yMean + +val model = new LinearRegressionModel(uid, coefficients, intercept) +// Handle possible missing or invalid prediction columns +val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol() + +val trainingSummary = new LinearRegressionTrainingSummary( + summaryModel.transform(dataset), + predictionColName, + $(labelCol), + model, + Array(0D), + $(featuresCol), + Array(0D)) +return copyValues(model.setSummary(trainingSummary)) + } else { +require(!($(regParam) > 0.0 && $(standardization)), + "The standard deviation of the label is zero. " + +"Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. " + + "Consider setting fitIntercept=true.") + } } +// if y is constant (rawYStd is zero), then y cannot be scaled. In this case +// setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm. +val yStd = if (rawYStd > 0) rawYStd else 1.0 --- End diff -- I will do ```scala val yStd = if (rawYStd > 0) rawYStd else if(yMean != 0.0) math.abs(yMean) else 1.0 ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
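The fallback proposed in the comment above can be sketched in isolation. This is a standalone illustration, not Spark source; `chooseYStd` and `scaleLabel` are hypothetical helper names:

```scala
// Sketch of the label-scaling fallback discussed above. When the label is
// constant (rawYStd == 0), the standard deviation cannot be used as a scale
// factor, so the proposal falls back to |yMean| (keeping the scaled label
// around 1.0) and finally to 1.0 when yMean is also zero.
def chooseYStd(rawYStd: Double, yMean: Double): Double =
  if (rawYStd > 0) rawYStd
  else if (yMean != 0.0) math.abs(yMean)
  else 1.0

// Labels are then scaled as (y - yMean) / yStd before optimization; with the
// fallback in place, a constant label simply scales to all zeros.
def scaleLabel(y: Double, yMean: Double, yStd: Double): Double = (y - yMean) / yStd
```

This mirrors the one-liner `val yStd = if (rawYStd > 0) rawYStd else if (yMean != 0.0) math.abs(yMean) else 1.0` from the comment, with the division guaranteed to be safe in every branch.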
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10940#discussion_r51051112 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -341,11 +341,11 @@ class LogisticRegression @Since("1.2.0") ( regParamL1 } else { // If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get effectively the same objective function when +// to improve the rate of convergence unless the standard deviation is zero; +// as a result, we have to perform this reverse standardization by penalizing +// each component differently to get effectively the same objective function when // the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 +if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1 --- End diff -- The constant `value` can be very large, or very small and negative. The optimizer may not converge well in this case. I haven't proven or tried it yet, but mathematically, with the following change, this should solve an identical problem. ```scala // the training dataset is not standardized. if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1 / featuresMean(index) ```
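The per-component penalty being debated here can be written as a small standalone function. This is a sketch of the idea, not the Spark implementation; the absolute value on the mean is my own assumption (to keep the penalty non-negative), not part of the quoted suggestion:

```scala
// Sketch of the "reverse standardization" L1 penalty under
// standardization == false: training runs on standardized features, so each
// component's penalty is divided by that feature's std to recover the
// unstandardized objective. For a constant feature (std == 0), the comment
// above suggests dividing by the feature's mean instead of dropping the
// penalty to zero.
def effectiveL1(regParamL1: Double, std: Double, mean: Double): Double =
  if (std != 0.0) regParamL1 / std
  else if (mean != 0.0) regParamL1 / math.abs(mean) // hypothetical fallback
  else regParamL1
```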
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10940#discussion_r51051223 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -971,8 +971,10 @@ private class LogisticAggregator( val margin = - { var sum = 0.0 features.foreachActive { (index, value) => - if (featuresStd(index) != 0.0 && value != 0.0) { + if (featuresStd(index) != 0.0) { sum += localCoefficientsArray(index) * (value / featuresStd(index)) + } else { +sum += localCoefficientsArray(index) * value --- End diff -- Change `value` into `1.0`
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10940#discussion_r51051184 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -417,7 +417,7 @@ class LogisticRegression @Since("1.2.0") ( val rawCoefficients = state.x.toArray.clone() var i = 0 while (i < numFeatures) { - rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 } + rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 } --- End diff -- ```scala rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i) } ```
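The unscaling step under discussion maps coefficients from the standardized feature space back to the original one. A self-contained, array-based sketch (not the Spark code; the zero-std fallback mirrors the suggestion in the comment above):

```scala
// A feature trained as x/std needs its coefficient divided by std to be
// valid on raw inputs; for a constant feature the comment above suggests
// dividing by the feature's mean rather than zeroing the coefficient out.
def unscale(raw: Array[Double], std: Array[Double], mean: Array[Double]): Array[Double] =
  raw.indices.map { i =>
    if (std(i) != 0.0) raw(i) / std(i)
    else if (mean(i) != 0.0) raw(i) / mean(i) // hypothetical zero-std fallback
    else raw(i)
  }.toArray
```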
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10940#discussion_r51051267 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -1106,7 +1110,8 @@ private class LogisticCostFun( totalGradientArray(index) += regParamL2 * temp value * temp } else { -0.0 +totalGradientArray(index) += regParamL2 * value +value * value --- End diff -- Change `value` into `1.0`
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10940#discussion_r51051238 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -983,8 +985,10 @@ private class LogisticAggregator( val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label) features.foreachActive { (index, value) => -if (featuresStd(index) != 0.0 && value != 0.0) { +if (featuresStd(index) != 0.0) { localGradientSumArray(index) += multiplier * (value / featuresStd(index)) +} else { + localGradientSumArray(index) += multiplier * value --- End diff -- Change `value` into `1.0`
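The reviewer's "change `value` into `1.0`" suggestions all amount to treating a zero-std column as the literal constant 1, so that its coefficient behaves like an extra intercept. A standalone sketch of the margin computation with that convention (helper name hypothetical, not Spark API):

```scala
// Features with nonzero std contribute coefficient * (value / std); a
// constant (zero-std) feature contributes its coefficient directly, i.e.
// coefficient * 1.0, acting as an intercept-like term.
def margin(coef: Array[Double], x: Array[Double], std: Array[Double]): Double =
  x.indices.map { i =>
    if (std(i) != 0.0) coef(i) * (x(i) / std(i))
    else coef(i) * 1.0 // constant column treated as the constant 1.0
  }.sum
```

The gradient update quoted in the diff follows the same rule: `multiplier * (value / std)` for ordinary features, `multiplier * 1.0` for constant ones.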
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10940#discussion_r51052743 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -341,11 +341,11 @@ class LogisticRegression @Since("1.2.0") ( regParamL1 } else { // If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get effectively the same objective function when +// to improve the rate of convergence unless the standard deviation is zero; +// as a result, we have to perform this reverse standardization by penalizing +// each component differently to get effectively the same objective function when // the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 +if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1 --- End diff -- BTW, this technique is already used here, by dividing by the feature's std. Note that you need to make sure `value != 0`; if `value == 0`, the coefficient of that feature will be zero.
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10940#issuecomment-175868432 Intuitively, this makes sense to me: with `setFitIntercept(false)`, features with `std == 0` act as an intercept, which results in non-zero coefficients. Can you add a couple more tests, as follows? Thanks. First, add two new datasets by zeroing out `binaryDataset` and making one column a non-zero constant. Then match the results against GLMNET, like the rest of the tests, for 1) setFitIntercept(false), setStandardization(false) with/without regularization 2) setFitIntercept(false), setStandardization(true) with/without regularization 3) setFitIntercept(true), setStandardization(false) with/without regularization 4) setFitIntercept(true), setStandardization(true) with/without regularization +cc @iyounus Linear regression may have a similar issue; if you have time, you may want to check it out. Thanks.
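The requested test matrix is a cross product of three switches. As a sketch, the configurations could be enumerated like this (the real Spark suite would fit a model per tuple and compare coefficients against GLMNET; this snippet only builds the grid):

```scala
// Enumerate the 2 x 2 x 2 grid of configurations suggested above:
// fitIntercept x standardization x (no regularization / regularization).
// The regParam values are illustrative placeholders, not from the PR.
val configs: Seq[(Boolean, Boolean, Double)] = for {
  fitIntercept    <- Seq(true, false)
  standardization <- Seq(true, false)
  regParam        <- Seq(0.0, 0.1) // without / with regularization
} yield (fitIntercept, standardization, regParam)
```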
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10940#discussion_r51054476 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -341,11 +341,11 @@ class LogisticRegression @Since("1.2.0") ( regParamL1 } else { // If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get effectively the same objective function when +// to improve the rate of convergence unless the standard deviation is zero; +// as a result, we have to perform this reverse standardization by penalizing +// each component differently to get effectively the same objective function when // the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 +if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1 --- End diff -- @coderxiang You may try it out. When `value` is a constant that is very small, like `1e-6`, compared with the rest of the features, the corresponding coefficient will be very large. The coefficients are then optimized on very different scales, which often causes convergence issues in line search. A similar argument applies when the constant `value` is very large compared with the rest of the features. That's why we still standardize the features even when users request `standardization == false`.
[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10940#issuecomment-175872406 @coderxiang I agree: without regularization, those features become collinear, so the solution will not be unique. However, for the features with `std != 0`, the coefficients should be unique. Can you at least check those?
[GitHub] spark pull request: [SPARK-10780] [ML] Set initialModel in KMeans ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8972#issuecomment-175882882 @yinxusen Thanks. I will check it out soon.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r51077516 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String } val yMean = ySummarizer.mean(0) -val yStd = math.sqrt(ySummarizer.variance(0)) - -// If the yStd is zero, then the intercept is yMean with zero coefficient; -// as a result, training is not needed. -if (yStd == 0.0) { - logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + -s"zeros and the intercept will be the mean of the label; as a result, " + -s"training is not needed.") - if (handlePersistence) instances.unpersist() - val coefficients = Vectors.sparse(numFeatures, Seq()) - val intercept = yMean - - val model = new LinearRegressionModel(uid, coefficients, intercept) - // Handle possible missing or invalid prediction columns - val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol() - - val trainingSummary = new LinearRegressionTrainingSummary( -summaryModel.transform(dataset), -predictionColName, -$(labelCol), -model, -Array(0D), -$(featuresCol), -Array(0D)) - return copyValues(model.setSummary(trainingSummary)) +val rawYStd = math.sqrt(ySummarizer.variance(0)) +if (rawYStd == 0.0) { + if ($(fitIntercept)) { +// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with +// zero coefficient; as a result, training is not needed. 
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + + s"zeros and the intercept will be the mean of the label; as a result, " + + s"training is not needed.") +if (handlePersistence) instances.unpersist() +val coefficients = Vectors.sparse(numFeatures, Seq()) +val intercept = yMean + +val model = new LinearRegressionModel(uid, coefficients, intercept) +// Handle possible missing or invalid prediction columns +val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol() + +val trainingSummary = new LinearRegressionTrainingSummary( + summaryModel.transform(dataset), + predictionColName, + $(labelCol), + model, + Array(0D), + $(featuresCol), + Array(0D)) +return copyValues(model.setSummary(trainingSummary)) + } else { +require(!($(regParam) > 0.0 && $(standardization)), --- End diff -- Your PR for `WeightedLeastSquares` is doing the right behavior, and consistent to `require($(regParam) != 0.0)`. The problem only happens when `standardizationLabel = true`, and in your `WeightedLeastSquares`, you already throw exception in this case. GLMNET will do `standardizationLabel = true` even `standardization == false`, and that's why you see inconsistent solution between GLMNET and your analytical solution. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50931610 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { /* For binary logistic regression, when we initialize the coefficients as zeros, it will converge faster if we initialize the intercept such that it follows the distribution of the labels. {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) + P(0) = 1 / (1 + \exp(b)), and + P(1) = \exp(b) / (1 + \exp(b)) }}}, hence {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} + b = \log{P(1) / P(0)} = \log{count_1 / count_0} --- End diff -- put two spaces back. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50931631 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { /* For binary logistic regression, when we initialize the coefficients as zeros, it will converge faster if we initialize the intercept such that it follows the distribution of the labels. {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) + P(0) = 1 / (1 + \exp(b)), and + P(1) = \exp(b) / (1 + \exp(b)) --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50931598 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { /* For binary logistic regression, when we initialize the coefficients as zeros, it will converge faster if we initialize the intercept such that it follows the distribution of the labels. {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) + P(0) = 1 / (1 + \exp(b)), and + P(1) = \exp(b) / (1 + \exp(b)) }}}, hence {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} + b = \log{P(1) / P(0)} = \log{count_1 / count_0} }}} */ - initialCoefficientsWithIntercept.toArray(numFeatures) = math.log( -histogram(1) / histogram(0)) + initialCoefficientsWithIntercept.toArray(numFeatures) += math.log(histogram(1) / histogram(0)) --- End diff -- revert this change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50931620 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { /* For binary logistic regression, when we initialize the coefficients as zeros, it will converge faster if we initialize the intercept such that it follows the distribution of the labels. {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) + P(0) = 1 / (1 + \exp(b)), and --- End diff -- put two spaces back. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-175341628 Thanks. Merged into master.
[GitHub] spark pull request: [SPARK-10780] [ML] Set initialModel in KMeans ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8972#issuecomment-175342081 @jayantshekhar Are you still working on this?
[GitHub] spark pull request: [SPARK-10780] [ML] Set initialModel in KMeans ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/8972#issuecomment-175342352 We added a private API in LOR to do the same thing, and we would like to generalize it to other algorithms. +cc @holdenk
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-174662125 I'm going through the caching logic now. Will let you know soon. Thanks.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369722 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { + /** + * For binary logistic regression, when we initialize the coefficients as zeros, + * it will converge faster if we initialize the intercept such that + * it follows the distribution of the labels. 
+ + * {{{ + * P(0) = 1 / (1 + \exp(b)), and + * P(1) = \exp(b) / (1 + \exp(b)) + * }}}, hence + * {{{ + * b = \log{P(1) / P(0)} = \log{count_1 / count_0} + * }}} */ - initialCoefficientsWithIntercept.toArray(numFeatures) = math.log( -histogram(1) / histogram(0)) + initialCoefficientsWithIntercept.toArray(numFeatures) + = math.log(histogram(1) / histogram(0)) } val states = optimizer.iterations(new CachedDiffFunction(costFun), initialCoefficientsWithIntercept.toBreeze.toDenseVector) -/* - Note that in Logistic Regression, the objective history (loss + regularization) - is log-likelihood which is invariance under feature standardization. As a result, - the objective history from optimizer is the same as the one in the original space. +/** + * Note that in Logistic Regression, the objective history (loss + regularization) --- End diff -- reverse the style change
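The initialization described in the quoted comment is just the empirical log-odds of the labels. A minimal sketch of the math (plain Python with made-up label counts, not the Spark implementation): with all coefficients zero, setting the intercept to `b = log(count_1 / count_0)` makes the model's predicted P(1) equal the observed label frequency, which is why it speeds up convergence.

```python
import math

# Hypothetical label counts, for illustration only.
count_0, count_1 = 300, 700

# Initialize the intercept at the empirical log-odds.
b = math.log(count_1 / count_0)

# With zero coefficients the model predicts P(1) = sigmoid(b),
# which matches the observed label frequency exactly.
p1 = math.exp(b) / (1.0 + math.exp(b))
assert abs(p1 - count_1 / (count_0 + count_1)) < 1e-12
```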
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369730 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -374,11 +395,11 @@ class LogisticRegression @Since("1.2.0") ( throw new SparkException(msg) } -/* - The coefficients are trained in the scaled space; we're converting them back to - the original space. - Note that the intercept in scaled space and original space is the same; - as a result, no scaling is needed. +/** + * The coefficients are trained in the scaled space; we're converting them back to --- End diff -- ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50370169 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), userSuppliedWeights = false) + } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, userSuppliedWeights = true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +lr.setStandardization(useFeatureScaling) +if (userSuppliedWeights) { + val uid = Identifiable.randomUID("logreg-static") + lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel( +uid, initialWeights, 1.0)) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE --- End diff -- Will this cause double caching? Say the input RDD is already cached: `handlePersistence` will be false here, but the DataFrame built from it still has storage level `NONE`, so ml's LOR code will see `NONE` and persist it again, caching the same data twice.
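The double-caching concern above can be sketched with a toy model (plain Python with hypothetical names, not the Spark API): the outer wrapper skips persisting because the input RDD is already cached, but the inner routine checks the derived DataFrame's own storage level, sees NONE, and persists again, so the same data ends up cached twice.

```python
NONE, MEMORY_AND_DISK = "NONE", "MEMORY_AND_DISK"

class Data:
    """Toy stand-in for an RDD/DataFrame with a storage level."""
    def __init__(self, level=NONE):
        self.level = level
    def persist(self, level):
        self.level = level

def inner_train(df):
    # ml LOR's own check: persist if the dataset is unpersisted.
    persisted_here = df.level == NONE
    if persisted_here:
        df.persist(MEMORY_AND_DISK)
    return persisted_here

def outer_run(rdd):
    df = Data()                 # toDF() starts unpersisted
    if rdd.level == NONE:       # the wrapper's handlePersistence check
        df.persist(MEMORY_AND_DISK)
    return inner_train(df), df

rdd = Data(MEMORY_AND_DISK)     # user already cached the input RDD
inner_persisted, df = outer_run(rdd)

# Both the RDD and the derived DataFrame end up cached: two copies.
assert inner_persisted and rdd.level == MEMORY_AND_DISK and df.level == MEMORY_AND_DISK
```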
[GitHub] spark pull request: [SPARK-12908][ML] Add warning message for Logi...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10862#issuecomment-173494166 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50370273 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ (same hunk as quoted above)
+(same hunk prefix as quoted above, through the `handlePersistence` check) +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} +// Train our model +val mlLogisticRegresionModel = lr.train(df) +// unpersist if we persisted +if (handlePersistence) { + df.unpersist() +} +// convert the model +val weights = mlLogisticRegresionModel.weights match { --- End diff --

```scala
val weights = Vectors.dense(mlLogisticRegresionModel.coefficients.toArray)
```
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50370414 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ (same hunk as quoted above)
+(same hunk prefix as quoted above) +val weights = mlLogisticRegresionModel.weights match { + case x: DenseVector => x + case y: Vector => Vectors.dense(y.toArray) +} +createModel(weights, mlLogisticRegresionModel.intercept) + } + optimizer.getUpdater() match { --- End diff -- when `optimizer.getRegParam() == 0.0`, run the old version.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10788#issuecomment-173493541 LGTM except some styling issues, and a concern about caching twice. Thanks.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50371017 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,82 @@ (same hunk as quoted above)
+(same hunk as quoted above, ending at `optimizer.getUpdater() match {`) --- End diff -- Okay, this will make the test harder to write. I don't care about this one for now.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369078 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ (same hunk as quoted above, at `if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) {`) --- End diff -- How can this compile? Should be `optInitialModel.get.coefficients.size != numFeatures`
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369207 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ (same hunk as quoted above, at `if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) {`) --- End diff -- ditto
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369141 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ (same hunk as quoted above, at `val vec = optInitialModel.get.coefficients`) --- End diff -- `vec` is not used.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369668 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ (same hunk as quoted above) --- End diff -- I think you have to remove all the `*`. I think we decided to do comments like

```scala
/*
 Start the sentence.
 */
```
[GitHub] spark pull request: [SPARK-12908][ML] Add warning message for Logi...
GitHub user dbtsai opened a pull request: https://github.com/apache/spark/pull/10862 [SPARK-12908][ML] Add warning message for LogisticRegression for potential converge issue When all labels are the same, LogisticRegression without an intercept is on dangerous ground for convergence. GLMNET doesn't support this case and will just exit. GLM can train, but emits a warning saying the algorithm didn't converge. You can merge this pull request into a Git repository by running: $ git pull https://github.com/dbtsai/spark add-tests Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10862.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10862 commit e68cc38134ca78d5e8425aad4b1b5fd36c781ccc Author: DB Tsai <d...@netflix.com> Date: 2016-01-21T06:58:36Z add warning message
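Why constant labels are dangerous without an intercept can be seen by running a few steps of gradient descent on the unregularized log-loss (a toy sketch in plain Python, not the Spark implementation): with every label equal to 1 and no intercept, the gradient never vanishes, so the weight grows without bound instead of converging.

```python
import math

# One feature, all labels = 1, no intercept, no regularization.
xs = [1.0, 2.0, 0.5, 1.5]
w = 0.0
norms = []
for step in range(3):
    # Gradient of the log-loss: sum over points of (sigmoid(w*x) - y) * x, with y = 1.
    grad = sum((1.0 / (1.0 + math.exp(-w * x)) - 1.0) * x for x in xs)
    w -= 0.5 * grad
    norms.append(abs(w))

# sigmoid(w*x) < 1 for any finite w, so the gradient is always negative:
# the weight keeps growing and the optimum sits at w -> infinity.
assert norms[0] < norms[1] < norms[2]
```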
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369552 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ (same hunk as quoted above) --- End diff -- remove the extra line
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50369516 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") ( val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) -if ($(fitIntercept)) { - /* - For binary logistic regression, when we initialize the coefficients as zeros, - it will converge faster if we initialize the intercept such that - it follows the distribution of the labels. - - {{{ - P(0) = 1 / (1 + \exp(b)), and - P(1) = \exp(b) / (1 + \exp(b)) - }}}, hence - {{{ - b = \log{P(1) / P(0)} = \log{count_1 / count_0} - }}} +if (optInitialModel.isDefined && optInitialModel.get.coefficients != numFeatures) { + val vec = optInitialModel.get.coefficients + logWarning( +s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}") +} + +if (optInitialModel.isDefined && optInitialModel.get.coefficients == numFeatures) { + val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray + optInitialModel.get.coefficients.foreachActive { case (index, value) => +initialCoefficientsWithInterceptArray(index) = value + } + if ($(fitIntercept)) { +initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept + } +} else if ($(fitIntercept)) { + /** + * For binary logistic regression, when we initialize the coefficients as zeros, + * it will converge faster if we initialize the intercept such that + * it follows the distribution of the labels. 
+ + * {{{ + * P(0) = 1 / (1 + \exp(b)), and + * P(1) = \exp(b) / (1 + \exp(b)) + * }}}, hence + * {{{ + * b = \log{P(1) / P(0)} = \log{count_1 / count_0} + * }}} */ - initialCoefficientsWithIntercept.toArray(numFeatures) = math.log( -histogram(1) / histogram(0)) + initialCoefficientsWithIntercept.toArray(numFeatures) + = math.log(histogram(1) / histogram(0)) --- End diff -- add two spaces.
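As a quick sanity check on the formula quoted above, here is a minimal Python sketch (an illustration, not the Spark Scala code; `histogram` is assumed to hold the binary label counts `[count_0, count_1]`): initializing the intercept to b = log(count_1 / count_0) makes the model's initial P(1) equal the empirical label frequency.

```python
import math

def initial_intercept(histogram):
    """Initialize the intercept from the label distribution:
    b = log(count_1 / count_0), so that sigmoid(b) = P(y = 1)."""
    count0, count1 = histogram
    return math.log(count1 / count0)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# With 10 negative and 30 positive labels, the empirical P(1) is 0.75,
# and sigmoid(initial_intercept(...)) recovers exactly that value.
b = initial_intercept([10.0, 30.0])
```

Starting the optimizer at this intercept (with zero coefficients) tends to converge faster than starting at b = 0, which is the point of the comment in the diff.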
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10702#issuecomment-173395155 @mengxr This PR is also on my radar. I'm working on another PR now; once @iyounus is ready, I will work on this.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172954547 Merged into master. Thanks.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172965625 A follow-up PR has been created: https://issues.apache.org/jira/browse/SPARK-12908
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172954097 Ideally, we would like to see the one without intercept to ensure that nothing gets messed up during the refactoring. I'll merge now, and let's add the test for all same labels without intercept. Thanks.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r50032693 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") ( val numClasses = histogram.length val numFeatures = summarizer.mean.size -if (numInvalid != 0) { - val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + -s"Found $numInvalid invalid labels." - logError(msg) - throw new SparkException(msg) -} - -if (numClasses > 2) { - val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + -s"binary classification. Found $numClasses in the input dataset." - logError(msg) - throw new SparkException(msg) -} +val (coefficients, intercept, objectiveHistory) = { + if (numInvalid != 0) { +val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + + s"Found $numInvalid invalid labels." +logError(msg) +throw new SparkException(msg) + } -val featuresMean = summarizer.mean.toArray -val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (numClasses > 2) { +val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + + s"binary classification. Found $numClasses in the input dataset." 
+logError(msg) +throw new SparkException(msg) + } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be positive infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) + } else if ($(fitIntercept) && numClasses == 1) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be negative infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double]) + } else { +val featuresMean = summarizer.mean.toArray +val featuresStd = summarizer.variance.toArray.map(math.sqrt) -val regParamL1 = $(elasticNetParam) * $(regParam) -val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) +val regParamL1 = $(elasticNetParam) * $(regParam) +val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization), - featuresStd, featuresMean, regParamL2) +val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), + $(standardization), featuresStd, featuresMean, regParamL2) -val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { - new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) -} else { - def regParamL1Fun = (index: Int) => { -// Remove the L1 penalization on the intercept -if (index == numFeatures) { - 0.0 +val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { + new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) } else { - if ($(standardization)) { -regParamL1 - } else { -// If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get 
effectively the same objective function when -// the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 + def regParamL1Fun = (index: Int) => { +// Remove the L1 penalization on the intercept +if (index == numFeatures) { + 0.0 +} else { + if ($(standardization)) { +regParamL1 + } else { +// If `standardization` is false, we still standardize the data +// to improve the rate of convergence; as a result, we h
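The reverse-standardization trick described in the comment above (still standardize internally for convergence, but divide each coefficient's L1 penalty by its feature's standard deviation so the objective matches the unstandardized problem) can be sketched in Python. This is a hedged illustration of `regParamL1Fun`'s logic, not Spark's Scala code; the function and parameter names are my own.

```python
def make_reg_param_l1_fun(reg_param_l1, features_std, num_features,
                          fit_intercept, standardization):
    """Return a per-index L1 penalty function: the intercept is never
    penalized; when standardization is off, the penalty is divided by
    the feature's standard deviation (0 for constant features)."""
    def reg_param_l1_fun(index):
        if fit_intercept and index == num_features:
            return 0.0  # remove the L1 penalization on the intercept
        if standardization:
            return reg_param_l1
        std = features_std[index]
        # reverse standardization: penalize each component differently
        return reg_param_l1 / std if std != 0.0 else 0.0
    return reg_param_l1_fun
```

An OWLQN-style optimizer would then query this function per coordinate, which is exactly how the diff wires `regParamL1Fun` into `BreezeOWLQN`.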
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r50032721 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") ( val numClasses = histogram.length val numFeatures = summarizer.mean.size -if (numInvalid != 0) { - val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + -s"Found $numInvalid invalid labels." - logError(msg) - throw new SparkException(msg) -} - -if (numClasses > 2) { - val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + -s"binary classification. Found $numClasses in the input dataset." - logError(msg) - throw new SparkException(msg) -} +val (coefficients, intercept, objectiveHistory) = { + if (numInvalid != 0) { +val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + + s"Found $numInvalid invalid labels." +logError(msg) +throw new SparkException(msg) + } -val featuresMean = summarizer.mean.toArray -val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (numClasses > 2) { +val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + + s"binary classification. Found $numClasses in the input dataset." 
+logError(msg) +throw new SparkException(msg) + } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be positive infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) + } else if ($(fitIntercept) && numClasses == 1) { +logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be negative infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double]) + } else { +val featuresMean = summarizer.mean.toArray +val featuresStd = summarizer.variance.toArray.map(math.sqrt) -val regParamL1 = $(elasticNetParam) * $(regParam) -val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) +val regParamL1 = $(elasticNetParam) * $(regParam) +val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization), - featuresStd, featuresMean, regParamL2) +val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), + $(standardization), featuresStd, featuresMean, regParamL2) -val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { - new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) -} else { - def regParamL1Fun = (index: Int) => { -// Remove the L1 penalization on the intercept -if (index == numFeatures) { - 0.0 +val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { + new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) } else { - if ($(standardization)) { -regParamL1 - } else { -// If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get 
effectively the same objective function when -// the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 + def regParamL1Fun = (index: Int) => { +// Remove the L1 penalization on the intercept +if (index == numFeatures) { + 0.0 +} else { + if ($(standardization)) { +regParamL1 + } else { +// If `standardization` is false, we still standardize the data +// to improve the rate of convergence; as a result, we h
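The two degenerate branches in the diff above (all labels one, or all labels zero, with `fitIntercept=true`) can be mimicked by a small Python helper: the MLE pushes the intercept to positive or negative infinity with zero coefficients, so training can be skipped. This is an illustrative sketch under those assumptions, not the Spark implementation.

```python
def degenerate_logreg_model(histogram, num_features, fit_intercept):
    """Return (coefficients, intercept) for the degenerate constant-label
    cases, or None when normal training is required.
    histogram = [count_0, count_1]."""
    count0, count1 = histogram
    if fit_intercept and count0 == 0.0:
        # all labels are one: P(1) -> 1 requires intercept -> +infinity
        return [0.0] * num_features, float("inf")
    if fit_intercept and count1 == 0.0:
        # all labels are zero: P(1) -> 0 requires intercept -> -infinity
        return [0.0] * num_features, float("-inf")
    return None
```

Without an intercept the coefficients cannot absorb the constant label, so the shortcut does not apply and the optimizer must run, which matches the review discussion about also testing the no-intercept case.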
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50040773 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +329,25 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures --- End diff -- Is this used?
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50041495 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), userSuppliedWeights = false) + } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, userSuppliedWeights = true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logistic regression only supports binary classification currently. 
+if (numOfLinearPredictor == 1) { + def runWithMlLogisticRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +lr.setStandardization(useFeatureScaling) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { --- End diff -- This is not used anymore.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50041152 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -343,8 +365,8 @@ class LogisticRegression @Since("1.2.0") ( = math.log(histogram(1) / histogram(0)) } -val states = optimizer.iterations(new CachedDiffFunction(costFun), - initialCoefficientsWithIntercept.toBreeze.toDenseVector) + val states = optimizer.iterations(new CachedDiffFunction(costFun), +initialCoefficientsWithIntercept.toBreeze.toDenseVector) --- End diff -- Wrong indentation. Remove two spaces.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50041320 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. --- End diff -- Remove `starting from the initial weights provided.` and add an extra new line here for readability.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r50041364 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), userSuppliedWeights = false) + } + + /** + * Run Logistic Regression with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. --- End diff -- Add an extra new line before `If a known updater is...`
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962408 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. --- End diff -- Replace `algorithm` with `Logistic Regression`, and remove `starting from the initial weights provided`. Add a new line between `of LabeledPoint entries` and `If a known updater is used`. Actually, in the ml version, disabling feature scaling is supported now, so please call the ml implementation in this case.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962416 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) --- End diff -- `run(input, generateInitialWeights(input), userSuppliedWeights = false)`
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49963912 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } -val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures +val initialCoefficientsWithIntercept = optInitialCoefficients.getOrElse( + Vectors.zeros(numFeaturesWithIntercept)) --- End diff -- here, ```scala val initialCoefficientsWithIntercept = Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) { val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray optInitialModel.get.coefficients.foreachActive { case (index, value) => initialCoefficientsWithInterceptArray(index) = value } if ($(fitIntercept)) { initialCoefficientsWithInterceptArray(numFeatures) = optInitialModel.get.intercept } } else if ($(fitIntercept)) { /* For binary logistic regression, when we initialize the coefficients as zeros, it will converge faster if we initialize the intercept such that it follows the distribution of the labels. {{{ P(0) = 1 / (1 + \exp(b)), and P(1) = \exp(b) / (1 + \exp(b)) }}}, hence {{{ b = \log{P(1) / P(0)} = \log{count_1 / count_0} }}} */ initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(histogram(1) / histogram(0)) } ```
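The control flow suggested in the review comment above (use a size-validated user-supplied initial model if present, otherwise fall back to zeros with the label-distribution intercept) can be sketched in Python. The names here are hypothetical and this is only an illustration of the branching, not Spark's code.

```python
import math

def build_initial_vector(num_features, fit_intercept, histogram,
                         initial_coefficients=None, initial_intercept=0.0):
    """Build the initial (coefficients + intercept) vector.

    A user-supplied model of the right size wins; otherwise fall back
    to zeros, with the intercept set from the label distribution,
    b = log(count_1 / count_0), when an intercept is fitted."""
    size = num_features + 1 if fit_intercept else num_features
    vec = [0.0] * size
    if initial_coefficients is not None and len(initial_coefficients) == num_features:
        vec[:num_features] = list(initial_coefficients)
        if fit_intercept:
            vec[num_features] = initial_intercept  # keep the model's intercept
    elif fit_intercept:
        vec[num_features] = math.log(histogram[1] / histogram[0])
    return vec
```

Note the fallback also fires when the supplied coefficients have the wrong size, mirroring the "log a warning and ignore" behavior discussed in the review.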
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962448 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) --- End diff -- `run(input, initialWeights, userSuppliedWeights = true)`
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962512 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD --- End diff -- Replace `algorithm` with `Logistic Regression`, and add a new line.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962541 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logistic regression only supports binary classification currently. +if (numOfLinearPredictor == 1 && useFeatureScaling) { --- End diff -- You can remove `useFeatureScaling` and pass it as `setStandardization` in the ML implementation. 
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49961903 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialCoefficients: Option[Vector] = None + /** @group setParam */ + private[spark] def setInitialWeights(value: Vector): this.type = { +this.optInitialCoefficients = Some(value) +this + } + + /** + * Validate the initial weights, return an Option, if not the expected size return None + * and log a warning. + */ + private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = { +vectorOpt.flatMap(vec => + if (vec.size == numFeatures) { +Some(vec) + } else { +logWarning( + s"""Initial weights provided (${vec})did not match the expected size ${numFeatures}""") --- End diff -- btw, why `s"""`? Also, change `weights` to `coefficients`.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49961912 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialCoefficients: Option[Vector] = None + /** @group setParam */ + private[spark] def setInitialWeights(value: Vector): this.type = { +this.optInitialCoefficients = Some(value) +this + } + + /** + * Validate the initial weights, return an Option, if not the expected size return None + * and log a warning. + */ + private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = { --- End diff -- validateCoefficients
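The two comments above ask for the helper to be renamed to `validateCoefficients` and to use plain `s"..."` interpolation. A minimal standalone sketch of such a helper, assuming plain Scala with `Array[Double]` standing in for Spark's `Vector` and `println` standing in for `logWarning`, might look like:

```scala
// Illustrative sketch only, not the Spark implementation:
// validate optional initial coefficients by size, returning None
// (with a warning) when the size does not match the feature count.
def validateCoefficients(
    vectorOpt: Option[Array[Double]],
    numFeatures: Int): Option[Array[Double]] = {
  vectorOpt.flatMap { vec =>
    if (vec.length == numFeatures) {
      Some(vec)
    } else {
      // In Spark this would be logWarning; println keeps the sketch self-contained.
      println(s"Initial coefficients provided (${vec.mkString(",")}) did not " +
        s"match the expected size $numFeatures")
      None
    }
  }
}
```

Returning `None` rather than throwing lets the caller silently fall back to zero coefficients, which matches the behavior discussed in the diff.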
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49963463 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialCoefficients: Option[Vector] = None + + /** @group setParam */ + private[spark] def setInitialModel(model: LogisticRegressionModel): this.type = { +this.optInitialCoefficients = Some(model.coefficients) --- End diff -- You don't want to lose the intercept information in the model.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49963455 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialCoefficients: Option[Vector] = None --- End diff -- Keep the reference to `InitialModel` here.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49961714 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialWeights: Option[Vector] = None + /** @group setParam */ + private[spark] def setInitialWeights(value: Vector): this.type = { +this.optInitialWeights = Some(value) +this + } --- End diff -- How about we follow https://github.com/apache/spark/pull/8972 , and have the following code? We can create another separate JIRA for making `setInitialModel` public with a sharedParam. ```scala private var initialModel: Option[LogisticRegressionModel] = None private def setInitialModel(model: LogisticRegressionModel): this.type = { ... ... this } ```
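The pattern suggested above — holding the whole initial model (so the intercept is not lost) behind a fluent private setter — can be sketched in isolation. `SimpleModel` and `LogisticEstimatorSketch` below are hypothetical stand-ins for `LogisticRegressionModel` and the estimator, not Spark API:

```scala
// Stand-in for LogisticRegressionModel: coefficients plus intercept together.
final case class SimpleModel(coefficients: Array[Double], intercept: Double)

class LogisticEstimatorSketch {
  // Keep the full model, not just the coefficient vector,
  // so the intercept survives into warm-start training.
  private var initialModel: Option[SimpleModel] = None

  def setInitialModel(model: SimpleModel): this.type = {
    initialModel = Some(model)
    this // returning this.type keeps the concrete type in chained setter calls
  }

  def getInitialModel: Option[SimpleModel] = initialModel
}
```

Returning `this.type` is the standard Spark ML setter convention, which is why the suggested snippet ends with `this`.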
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962964 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) --- End diff -- haha... yes.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962944 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } -val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures +val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept) --- End diff -- You will know # of features by the size of coefficients set by setInitialModel. There is no ambiguity here since it's binary, and intercept has a separate variable.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49963357 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} --- End diff -- that makes sense.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962157 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } -val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures +val userSuppliedCoefficients = validateWeights(optInitialCoefficients, numFeaturesWithIntercept) --- End diff -- let's handle it through setInitialModel, and have another PR to make it public. Thanks.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962716 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} --- End diff -- Why do we need to do it? I thought those checks are already in the ML code.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49962722 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) +} +lr.setFitIntercept(addIntercept) +lr.setMaxIter(optimizer.getNumIterations()) +lr.setTol(optimizer.getConvergenceTol()) +// Convert our input into a DataFrame +val sqlContext = new SQLContext(input.context) +import sqlContext.implicits._ +val df = input.toDF() +// Determine if we should cache the DF +val handlePersistence = input.getStorageLevel == StorageLevel.NONE +if (handlePersistence) { + df.persist(StorageLevel.MEMORY_AND_DISK) +} +// Train our model +val mlLogisticRegresionModel = lr.train(df) +// unpersist if we persisted +if (handlePersistence) { + df.unpersist() +} --- End diff -- ditto?
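The persist/unpersist handling questioned in the last two comments follows a common guard pattern: cache the input only when it is not already cached, and unpersist only what you cached yourself. A toy sketch of that pattern, with `CacheableData` as a hypothetical stand-in for a DataFrame (not Spark API):

```scala
// Stand-in for a cacheable dataset; in Spark the check would be
// input.getStorageLevel == StorageLevel.NONE.
final class CacheableData {
  var isCached: Boolean = false
  def persist(): Unit = { isCached = true }
  def unpersist(): Unit = { isCached = false }
}

// Run `body` with `data` cached, restoring the caller's caching state:
// if the caller already persisted the data, leave it persisted afterwards.
def withScopedCaching[T](data: CacheableData)(body: CacheableData => T): T = {
  val handlePersistence = !data.isCached
  if (handlePersistence) data.persist()
  try body(data)
  finally if (handlePersistence) data.unpersist()
}
```

The point of the guard is ownership: unpersisting a dataset the caller cached would silently destroy their caching decision.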
[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10085#issuecomment-172429593 LGTM as well! Thanks.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49961964 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialCoefficients: Option[Vector] = None + /** @group setParam */ + private[spark] def setInitialWeights(value: Vector): this.type = { +this.optInitialCoefficients = Some(value) +this + } + + /** + * Validate the initial weights, return an Option, if not the expected size return None + * and log a warning. + */ + private def validateWeights(vectorOpt: Option[Vector], numFeatures: Int): Option[Vector] = { +vectorOpt.flatMap(vec => + if (vec.size == numFeatures) { +Some(vec) + } else { +logWarning( + s"""Initial weights provided (${vec})did not match the expected size ${numFeatures}""") +None + }) + } + + override protected[spark] def train(dataset: DataFrame): LogisticRegressionModel = { val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol)) -val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, col($(featuresCol))).map { +val instances = dataset.select(col($(labelCol)), w, col($(featuresCol))).map { --- End diff -- why is this line changed?
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49964184 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") ( new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) } -val initialCoefficientsWithIntercept = - Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures) +val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 else numFeatures +val initialCoefficientsWithIntercept = optInitialCoefficients.getOrElse( + Vectors.zeros(numFeaturesWithIntercept)) --- End diff -- btw, we may want to log here: when `optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures`, let's log it.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49961836 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala --- @@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS new LogisticRegressionModel(weights, intercept, numFeatures, numOfLinearPredictor + 1) } } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * If using ml implementation, uses ml code to generate initial weights. + */ + override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = { +run(input, generateInitialWeights(input), false) + } + + /** + * Run the algorithm with the configured parameters on an input RDD + * of LabeledPoint entries starting from the initial weights provided. + * If a known updater is used calls the ml implementation, to avoid + * applying a regularization penalty to the intercept, otherwise + * defaults to the mllib implementation. If more than two classes + * or feature scaling is disabled, always uses mllib implementation. + * Uses user provided weights. + */ + override def run(input: RDD[LabeledPoint], initialWeights: Vector): LogisticRegressionModel = { +run(input, initialWeights, true) + } + + private def run(input: RDD[LabeledPoint], initialWeights: Vector, userSuppliedWeights: Boolean): + LogisticRegressionModel = { +// ml's Logisitic regression only supports binary classifcation currently. 
+if (numOfLinearPredictor == 1 && useFeatureScaling) { + def runWithMlLogisitcRegression(elasticNetParam: Double) = { +// Prepare the ml LogisticRegression based on our settings +val lr = new org.apache.spark.ml.classification.LogisticRegression() +lr.setRegParam(optimizer.getRegParam()) +lr.setElasticNetParam(elasticNetParam) +if (userSuppliedWeights) { + val initialWeightsWithIntercept = if (addIntercept) { +appendBias(initialWeights) + } else { +initialWeights + } + lr.setInitialWeights(initialWeightsWithIntercept) --- End diff -- Here it will be ```scala lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(uid, initialWeights, 1)) ```
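The diff above packs the intercept into the same dense vector as the coefficients when an intercept is fitted, so the optimizer sees a single vector of length `numFeatures + 1`. A minimal sketch of that layout, using `Array[Double]` and illustrative helper names (not Spark API, where `appendBias` has its own semantics):

```scala
// Pack coefficients and intercept into one dense vector,
// with the intercept occupying the last slot.
def appendIntercept(coefficients: Array[Double], intercept: Double): Array[Double] =
  coefficients :+ intercept

// Recover the (coefficients, intercept) pair from the packed vector.
def splitIntercept(withIntercept: Array[Double]): (Array[Double], Double) =
  (withIntercept.init, withIntercept.last)
```

Keeping the intercept in a fixed slot is what lets the L1 regularizer skip it by index, as the OWLQN `regParamL1Fun` in the surrounding diffs does.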
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49955619 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -883,6 +884,27 @@ class LogisticRegressionSuite assert(model1a0.intercept ~== model1b.intercept absTol 1E-3) } + test("logistic regression with fitIntercept=true and all labels the same") { +val lr = new LogisticRegression() + .setFitIntercept(true) + .setMaxIter(3) --- End diff -- add extra line.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10743#issuecomment-172401612 LGTM except a couple of minor issues. Thanks.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49955940 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -883,6 +884,27 @@ class LogisticRegressionSuite assert(model1a0.intercept ~== model1b.intercept absTol 1E-3) } + test("logistic regression with fitIntercept=true and all labels the same") { +val lr = new LogisticRegression() + .setFitIntercept(true) + .setMaxIter(3) +val sameLabels = dataset + .withColumn("zeroLabel", lit(0.0)) + .withColumn("oneLabel", lit(1.0)) + +val allZeroModel = lr + .setLabelCol("zeroLabel") + .fit(sameLabels) +assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3) +assert(allZeroModel.intercept === Double.NegativeInfinity) + +val allOneModel = lr + .setLabelCol("oneLabel") + .fit(sameLabels) +assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3) +assert(allOneModel.intercept === Double.PositiveInfinity) + } + --- End diff -- Also, check that objectiveHistory has length zero.
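The infinite intercepts this test asserts follow directly from the log-odds: with zero coefficients, the maximum-likelihood intercept is log(p / (1 − p)) for positive-label fraction p, which diverges to ±infinity when every label is the same. A tiny sketch of that closed form:

```scala
// Closed-form intercept of an intercept-only logistic model:
// the log-odds of the positive-label fraction. With all labels one
// (p = 1.0) this is +Infinity; with all labels zero (p = 0.0), -Infinity.
def constantLabelIntercept(positiveFraction: Double): Double =
  math.log(positiveFraction / (1.0 - positiveFraction))
```

This is why the special-cased code paths in the PR can skip training entirely and return the infinite intercept with an empty `objectiveHistory`.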
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49955755 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") ( val numClasses = histogram.length val numFeatures = summarizer.mean.size -if (numInvalid != 0) { - val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + -s"Found $numInvalid invalid labels." - logError(msg) - throw new SparkException(msg) -} - -if (numClasses > 2) { - val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + -s"binary classification. Found $numClasses in the input dataset." - logError(msg) - throw new SparkException(msg) -} +val (coefficients, intercept, objectiveHistory) = { + if (numInvalid != 0) { +val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + + s"Found $numInvalid invalid labels." +logError(msg) +throw new SparkException(msg) + } -val featuresMean = summarizer.mean.toArray -val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (numClasses > 2) { +val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + + s"binary classification. Found $numClasses in the input dataset." +logError(msg) +throw new SparkException(msg) + } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be positive infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) + } else if ($(fitIntercept) && numClasses == 1) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + --- End diff -- "All labels are zero"
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49955925 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") ( val numClasses = histogram.length val numFeatures = summarizer.mean.size -if (numInvalid != 0) { - val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + -s"Found $numInvalid invalid labels." - logError(msg) - throw new SparkException(msg) -} - -if (numClasses > 2) { - val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + -s"binary classification. Found $numClasses in the input dataset." - logError(msg) - throw new SparkException(msg) -} +val (coefficients, intercept, objectiveHistory) = { + if (numInvalid != 0) { +val msg = s"Classification labels should be in {0 to ${numClasses - 1} " + + s"Found $numInvalid invalid labels." +logError(msg) +throw new SparkException(msg) + } -val featuresMean = summarizer.mean.toArray -val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (numClasses > 2) { +val msg = s"Currently, LogisticRegression with ElasticNet in ML package only supports " + + s"binary classification. Found $numClasses in the input dataset." 
+logError(msg) +throw new SparkException(msg) + } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 0.0) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be positive infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double]) + } else if ($(fitIntercept) && numClasses == 1) { +logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " + + s"zeros and the intercept will be negative infinity; as a result, " + + s"training is not needed.") +(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double]) + } else { +val featuresMean = summarizer.mean.toArray +val featuresStd = summarizer.variance.toArray.map(math.sqrt) -val regParamL1 = $(elasticNetParam) * $(regParam) -val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) +val regParamL1 = $(elasticNetParam) * $(regParam) +val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) -val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), $(standardization), - featuresStd, featuresMean, regParamL2) +val costFun = new LogisticCostFun(instances, numClasses, $(fitIntercept), + $(standardization), featuresStd, featuresMean, regParamL2) -val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { - new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) -} else { - def regParamL1Fun = (index: Int) => { -// Remove the L1 penalization on the intercept -if (index == numFeatures) { - 0.0 +val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) { + new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) } else { - if ($(standardization)) { -regParamL1 - } else { -// If `standardization` is false, we still standardize the data -// to improve the rate of convergence; as a result, we have to -// perform this reverse standardization by penalizing each component -// differently to get 
effectively the same objective function when -// the training dataset is not standardized. -if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0 + def regParamL1Fun = (index: Int) => { +// Remove the L1 penalization on the intercept +if (index == numFeatures) { + 0.0 +} else { + if ($(standardization)) { +regParamL1 + } else { +// If `standardization` is false, we still standardize the data +// to improve the rate of convergence; as a result, we h
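The reverse-standardization logic quoted in the diff above — drop the L1 penalty on the intercept, and when `standardization` is false divide each coefficient's penalty by its feature's standard deviation — can be sketched in plain Python. This is an illustrative sketch, not Spark's implementation; the names `reg_param_l1`, `features_std`, and `num_features` mirror the Scala identifiers.

```python
def reg_param_l1_fun(index, num_features, reg_param_l1, features_std, standardization):
    """Per-index L1 penalty, mirroring the regParamL1Fun sketch in the diff.

    The intercept (stored at position num_features) is never penalized.
    When standardization is off, the data is still standardized internally
    to improve convergence, so each coefficient's penalty is divided by its
    feature's std to recover the objective of the unstandardized problem.
    """
    if index == num_features:
        return 0.0  # no L1 penalty on the intercept
    if standardization:
        return reg_param_l1
    # reverse standardization: scale the penalty per feature; a zero-variance
    # feature gets no penalty (its standardized column is all zeros anyway)
    return reg_param_l1 / features_std[index] if features_std[index] != 0.0 else 0.0

# two features (std 2.0 and 0.0) plus an intercept slot at index 2
penalties = [reg_param_l1_fun(i, 2, 0.5, [2.0, 0.0], False) for i in range(3)]
```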
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49955604 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -883,6 +884,27 @@ class LogisticRegressionSuite assert(model1a0.intercept ~== model1b.intercept absTol 1E-3) } + test("logistic regression with fitIntercept=true and all labels the same") { +val lr = new LogisticRegression() + .setFitIntercept(true) + .setMaxIter(3) +val sameLabels = dataset + .withColumn("zeroLabel", lit(0.0)) + .withColumn("oneLabel", lit(1.0)) + +val allZeroModel = lr + .setLabelCol("zeroLabel") + .fit(sameLabels) +assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3) +assert(allZeroModel.intercept === Double.NegativeInfinity) + +val allOneModel = lr + .setLabelCol("oneLabel") + .fit(sameLabels) +assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3) +assert(allOneModel.intercept === Double.PositiveInfinity) + } + --- End diff -- Can you add one test which is all labels the same but `fitIntercept=false` here to avoid the issue in LiR? Thanks.
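The behavior this test pins down follows from the log-odds form of the intercept-only MLE: with all-zero coefficients, the fitted intercept is logit(p) for the empirical positive rate p, which diverges as p reaches 0 or 1. A minimal sketch of that closed form (plain Python, not Spark code):

```python
import math

def intercept_only_mle(labels):
    """Closed-form intercept for a coefficient-free logistic model.

    The MLE of sigmoid(b) is the empirical positive rate p, so
    b = logit(p) = log(p / (1 - p)), which is -inf when p == 0
    (all labels zero) and +inf when p == 1 (all labels one).
    """
    p = sum(labels) / len(labels)
    if p == 0.0:
        return float("-inf")
    if p == 1.0:
        return float("inf")
    return math.log(p / (1.0 - p))
```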
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49959401 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialWeights: Option[Vector] = None + /** @group setParam */ + private[spark] def setInitialWeights(value: Vector): this.type = { +this.optInitialWeights = Some(value) +this + } + + /** Validate the initial weights, return an Option, if not the expected size return None --- End diff -- Move `Validate` onto a new line.
[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10788#discussion_r49959559 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") ( @Since("1.5.0") override def getThresholds: Array[Double] = super.getThresholds - override protected def train(dataset: DataFrame): LogisticRegressionModel = { -// Extract columns from data. If dataset is persisted, do not persist oldDataset. + private var optInitialWeights: Option[Vector] = None --- End diff -- We started to change our terminology from `weights` into `coefficients` after the implementation of weighted LoR. Please change it into `optInitialCoefficients`.
[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...
Github user dbtsai commented on the pull request: https://github.com/apache/spark/pull/10274#issuecomment-171876447 LGTM. Waiting for one extra test. Thanks.
[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10274#discussion_r49822543 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -74,6 +89,35 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext } } + test("WLS against lm when label is constant") { +/* + R code: + # here b is constant + df <- as.data.frame(cbind(A, b)) + for (formula in c(b ~ . -1, b ~ .)) { + model <- lm(formula, data=df, weights=w) + print(as.vector(coef(model))) + } + + [1] -9.221298 3.394343 + [1] 17 0 0 +*/ + +val expected = Seq( + Vectors.dense(0.0, -9.221298, 3.394343), + Vectors.dense(17.0, 0.0, 0.0)) + +var idx = 0 +for (fitIntercept <- Seq(false, true)) { + val wls = new WeightedLeastSquares( +fitIntercept, regParam = 0.0, standardizeFeatures = false, standardizeLabel = true) +.fit(instancesConstLabel) --- End diff -- Awesome! Great that you see the same result! For the normal equation, it would be nice to have the R code as well so we can keep it in the comments consistently. I did implement it once when I implemented the LBFGS version; let me try to find it.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49652270 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -883,6 +884,22 @@ class LogisticRegressionSuite assert(model1a0.intercept ~== model1b.intercept absTol 1E-3) } + test("logistic regression with all labels the same") { +val lr = new LogisticRegression() + .setFitIntercept(true) + .setMaxIter(3) +val sameLabels = dataset + .withColumn("zeroLabel", lit(0.0)) + .withColumn("oneLabel", lit(1.0)) + +val model = lr + .setLabelCol("oneLabel") + .fit(sameLabels) + +assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3) +assert(model.intercept === Double.PositiveInfinity) --- End diff -- BTW, this bug should not happen when all the labels are one, since the histogram should still be of size two.
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49649843 --- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala --- @@ -339,9 +339,11 @@ class LogisticRegression @Since("1.2.0") ( b = \log{P(1) / P(0)} = \log{count_1 / count_0} }}} */ - initialCoefficientsWithIntercept.toArray(numFeatures) -= math.log(histogram(1) / histogram(0)) -} + if (histogram.length >= 2) { // check to make sure indexing into histogram(1) is safe + initialCoefficientsWithIntercept.toArray(numFeatures) = math.log( + histogram(1) / histogram(0)) --- End diff -- In this case, the whole training step can be skipped. Currently, we only support binary LoR, so the max of `histogram.length` will be two. In LiR, when `yStd == 0.0`, the model will be returned immediately without training, see https://github.com/feynmanliang/spark/blob/SPARK-12804/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L226 We can do a similar thing here, like

```scala
if (histogram.length == 2) {
  if (histogram(0) == 0.0) {
    return new LogisticRegressionModel(uid, Vectors.sparse(numFeatures, Seq()),
      Double.PositiveInfinity)
  } else {
    initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
      histogram(1) / histogram(0))
  }
} else if (histogram.length == 1) {
  return new LogisticRegressionModel(uid, Vectors.sparse(numFeatures, Seq()),
    Double.NegativeInfinity)
} else {
  // some exception
}
```
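The control flow suggested above keys entirely off the binary label histogram: skip training with an infinite intercept when one class is absent, otherwise warm-start the intercept at `log(count_1 / count_0)`. A small Python sketch of that same decision (the function name and return convention are illustrative, not Spark's API):

```python
import math

def initial_intercept(histogram):
    """Decide the starting (or final, degenerate) intercept from the binary
    label histogram [count_0, count_1].

    Returns (intercept, skip_training): training can be skipped when one
    class is absent, because the separable MLE pushes the intercept to
    +/- infinity with all-zero coefficients.
    """
    if len(histogram) == 2:
        count0, count1 = histogram
        if count0 == 0.0:
            return float("inf"), True   # all labels are one
        if count1 == 0.0:
            return float("-inf"), True  # all labels are zero
        return math.log(count1 / count0), False  # warm start b = log(c1/c0)
    if len(histogram) == 1:
        return float("-inf"), True      # only label 0 was observed
    raise ValueError("only binary labels are supported")
```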
[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10743#discussion_r49650452 --- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala --- @@ -883,6 +884,22 @@ class LogisticRegressionSuite assert(model1a0.intercept ~== model1b.intercept absTol 1E-3) } + test("logistic regression with all labels the same") { +val lr = new LogisticRegression() + .setFitIntercept(true) + .setMaxIter(3) +val sameLabels = dataset + .withColumn("zeroLabel", lit(0.0)) + .withColumn("oneLabel", lit(1.0)) + +val model = lr + .setLabelCol("oneLabel") + .fit(sameLabels) + +assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3) +assert(model.intercept === Double.PositiveInfinity) --- End diff -- Can you add another test showing that all `zeroLabel` will return intercept with `Double.NegativeInfinity`? Thanks.
[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10274#discussion_r49667844 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares( if (standardizeFeatures) { lambda *= aVar(j - 2) } - if (standardizeLabel) { -// TODO: handle the case when bStd = 0 + if (standardizeLabel && bStd != 0) { --- End diff -- Here is the exercise,

```scala
test("WLS against lm") {
  /*
    R code:

    df <- as.data.frame(cbind(A, b))
    for (formula in c(b ~ . -1, b ~ .)) {
      model <- lm(formula, data=df, weights=w)
      print(as.vector(coef(model)))
    }

    [1] -3.727121  3.009983
    [1] 18.08  6.08 -0.60
  */
  val expected = Seq(
    Vectors.dense(0.0, -3.727121, 3.009983),
    Vectors.dense(18.08, 6.08, -0.60))

  var idx = 0
  for (fitIntercept <- Seq(false, true)) {
    for (standardization <- Seq(false, true)) {
      val wls = new WeightedLeastSquares(
        fitIntercept, regParam = 0.0, standardizeFeatures = standardization,
        standardizeLabel = standardization).fit(instances)
      val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1))
      assert(actual ~== expected(idx) absTol 1e-4)
    }
    idx += 1
  }
}
```
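For reference, the unregularized weighted least squares solution the test compares against R's `lm(..., weights=w)` comes directly from the normal equations (XᵀWX)β = XᵀWy. A NumPy sketch of that solve — not the Spark implementation, and the variable names are illustrative:

```python
import numpy as np

def weighted_least_squares(X, y, w, fit_intercept):
    """Solve argmin_beta sum_i w_i * (y_i - x_i . beta)^2 via the
    normal equations (X^T W X) beta = X^T W y.

    With regParam = 0 this is the same objective R's lm fits with
    weights; the first returned value is the intercept (0.0 when
    fit_intercept is False).
    """
    X = np.asarray(X, dtype=float)
    if fit_intercept:
        X = np.column_stack([np.ones(len(X)), X])
    # fold the weights into one factor of X so X.T @ W @ X == Xw.T @ X
    Xw = X * np.asarray(w, dtype=float)[:, None]
    beta = np.linalg.solve(Xw.T @ X, Xw.T @ np.asarray(y, dtype=float))
    if fit_intercept:
        return beta[0], beta[1:]
    return 0.0, beta

# tiny example: points on y = 1 + 2*x are fit exactly, whatever the weights
intercept, coef = weighted_least_squares(
    [[0.0], [1.0], [2.0]], [1.0, 3.0, 5.0], [1.0, 2.0, 1.0], fit_intercept=True)
```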
[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10274#discussion_r49654189 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares( if (standardizeFeatures) { lambda *= aVar(j - 2) } - if (standardizeLabel) { -// TODO: handle the case when bStd = 0 + if (standardizeLabel && bStd != 0) { --- End diff -- It's interesting to see that when regularization is zero, standardizing the labels or features (or not) will not change the solution of Linear Regression, which you can verify by experiment. As a result, the only case in which the model will be non-interpretable is when `yStd` is zero and `regParam` is non-zero. You can have a `require` there with a proper message. I think logging a warning will probably be very easy for users to ignore. Thanks.
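The claim that standardization does not change the unregularized solution is easy to check numerically: fit OLS on standardized features, map the coefficients back to the original scale, and compare with the raw-feature fit. A NumPy sketch under synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

def ols(X, y):
    # ordinary least squares with an intercept column prepended
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

# fit on the raw features
beta_raw = ols(X, y)

# fit on standardized features, then undo the scaling:
# y ~ a + ((X - mu) / sigma) b  ==>  coef = b / sigma, intercept = a - coef . mu
mu, sigma = X.mean(axis=0), X.std(axis=0)
beta_std = ols((X - mu) / sigma, y)
coef_back = beta_std[1:] / sigma
intercept_back = beta_std[0] - coef_back @ mu
```

Both fits minimize the same objective over the same column space (the intercept column absorbs the centering), so the mapped-back solution matches the raw one; with a nonzero `regParam` this equivalence breaks, which is exactly why the degenerate `yStd == 0` case only matters then.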
[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10274#discussion_r49394419 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares( if (standardizeFeatures) { lambda *= aVar(j - 2) } - if (standardizeLabel) { -// TODO: handle the case when bStd = 0 + if (standardizeLabel && bStd != 0) { --- End diff -- Thanks. As you said, we will expect non-zero coefficients in this case, so we don't have to match glmnet. However, we may want to throw an exception when `standardizeLabel` is true and `yStd` is zero, since the problem is not well defined. Thanks.
[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/10702#discussion_r49399116 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -219,33 +219,41 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String } val yMean = ySummarizer.mean(0) -val yStd = math.sqrt(ySummarizer.variance(0)) +var yStd = math.sqrt(ySummarizer.variance(0)) --- End diff -- Also, when `rawYStd == 0.0`, `standardization == true` and `regParam != 0.0`, the problem will be ill-defined. We may want to throw an exception.