[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-30 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10702#issuecomment-177376070
  
LGTM except minor comments. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-30 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51354776
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,44 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept) || yMean==0.0) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+// Also, if yMean==0 and rawYStd==0, all the coefficients are zero regardless of
+// the fitIntercept
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
--- End diff --

Maybe you want to update the warning message for the second situation as well.





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-30 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51354767
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,44 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept) || yMean==0.0) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+// Also, if yMean==0 and rawYStd==0, all the coefficients are zero regardless of
+// the fitIntercept
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require($(regParam) == 0.0, "The standard deviation of the label is zero. " +
+  "Model cannot be regularized.")
+logWarning(s"The standard deviation of the label is zero. " +
+  "Consider setting fitIntercept=true.")
+  }
+}
+
+// if y is constant (rawYStd is zero), then y cannot be scaled. In this case
+// setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
+val yStd = if (rawYStd > 0) rawYStd else if (yMean != 0.0) math.abs(yMean) else 1.0
--- End diff --

`val yStd = if (rawYStd > 0) rawYStd else math.abs(yMean)` since you already check the condition before.
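The fallback discussed above can be sketched numerically. This is a hedged Python sketch (the hypothetical `safe_y_std` helper is mine, not Spark's) showing why substituting a positive stand-in for `yStd` keeps the scaled labels finite when the label is constant:

```python
import math

def safe_y_std(raw_y_std, y_mean):
    # Mirrors the logic under discussion (a sketch, not the Spark source):
    # use the real std when positive; for a constant label fall back to
    # |yMean| (or 1.0 when the label is identically zero) so that dividing
    # by yStd never produces inf/NaN in the optimizer.
    if raw_y_std > 0:
        return raw_y_std
    return abs(y_mean) if y_mean != 0.0 else 1.0

labels = [17.0, 17.0, 17.0]  # constant label, as in the test data above
y_mean = sum(labels) / len(labels)
raw_y_std = math.sqrt(sum((y - y_mean) ** 2 for y in labels) / len(labels))

y_std = safe_y_std(raw_y_std, y_mean)   # 17.0, not 0.0
scaled = [y / y_std for y in labels]    # finite: [1.0, 1.0, 1.0]
```

With this fallback the reviewer's simplification holds: once `yMean == 0.0 && rawYStd == 0.0` is handled earlier, `math.abs(yMean)` alone is positive in the remaining constant-label case.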





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-30 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51354784
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
@@ -558,6 +575,47 @@ class LinearRegressionSuite
 }
   }
 
+  test("linear regression model with constant label") {
+/*
+   R code:
+   for (formula in c(b.const ~ . -1, b.const ~ .)) {
+ model <- lm(formula, data=df.const.label, weights=w)
+ print(as.vector(coef(model)))
+   }
+  [1] -9.221298  3.394343
+  [1] 17  0  0
+*/
+val expected = Seq(
+  Vectors.dense(0.0, -9.221298, 3.394343),
+  Vectors.dense(17.0, 0.0, 0.0))
+
+Seq("auto", "l-bfgs", "normal").foreach { solver =>
+  var idx = 0
+  for (fitIntercept <- Seq(false, true)) {
+val model = new LinearRegression()
+  .setFitIntercept(fitIntercept)
+  .setWeightCol("weight")
+  .setSolver(solver)
+  .fit(datasetWithWeightConstantLabel)
+val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+assert(actual ~== expected(idx) absTol 1e-4)
+idx += 1
+  }
+}
+  }
+
+  test("regularized linear regression through origin with constant label") {
+// The problem is ill-defined if fitIntercept=false, regParam is non-zero and \
--- End diff --
--- End diff --

It seems you forgot to remove the `\` :)





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-30 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51355106
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
@@ -558,6 +575,47 @@ class LinearRegressionSuite
 }
   }
 
+  test("linear regression model with constant label") {
+/*
+   R code:
+   for (formula in c(b.const ~ . -1, b.const ~ .)) {
+ model <- lm(formula, data=df.const.label, weights=w)
+ print(as.vector(coef(model)))
+   }
+  [1] -9.221298  3.394343
+  [1] 17  0  0
+*/
+val expected = Seq(
+  Vectors.dense(0.0, -9.221298, 3.394343),
+  Vectors.dense(17.0, 0.0, 0.0))
+
+Seq("auto", "l-bfgs", "normal").foreach { solver =>
+  var idx = 0
+  for (fitIntercept <- Seq(false, true)) {
+val model = new LinearRegression()
+  .setFitIntercept(fitIntercept)
+  .setWeightCol("weight")
+  .setSolver(solver)
+  .fit(datasetWithWeightConstantLabel)
+val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+assert(actual ~== expected(idx) absTol 1e-4)
+idx += 1
--- End diff --

In `LinearRegressionTrainingSummary`, you can get it from `objectiveHistory`.





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-28 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10940#issuecomment-176459204
  
@iyounus Ideally, it would be great if, when `intercept=true`, we keep the current behavior: a constant column doesn't have any predictive power.





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-28 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10940#issuecomment-176503318
  
@iyounus Maybe in this case, `WeightedLeastSquares` should drop those columns so the model can still be trained.
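The column-dropping idea suggested above can be sketched in a few lines. This is a hedged Python sketch of the general technique (the `drop_constant_columns` helper is hypothetical, not the `WeightedLeastSquares` API):

```python
def drop_constant_columns(rows):
    # rows: list of equal-length feature vectors.
    # Keep only columns that take more than one value; a constant column
    # contributes nothing once an intercept (or centering) is in play.
    n_cols = len(rows[0])
    kept = [j for j in range(n_cols)
            if any(row[j] != rows[0][j] for row in rows)]
    return [[row[j] for j in kept] for row in rows], kept

rows = [[1.0, 5.0, 2.0],
        [3.0, 5.0, 1.0],
        [4.0, 5.0, 7.0]]
filtered, kept = drop_constant_columns(rows)
# kept == [0, 2]: the constant middle column is removed before solving
```

The solver would then fit on `filtered` and report zero coefficients for the dropped indices.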





[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10274#discussion_r51059677
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala ---
@@ -74,6 +89,35 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext
 }
   }
 
+  test("WLS against lm when label is constant") {
+/*
+   R code:
+   # here b is constant
+   df <- as.data.frame(cbind(A, b))
+   for (formula in c(b ~ . -1, b ~ .)) {
+ model <- lm(formula, data=df, weights=w)
+ print(as.vector(coef(model)))
+   }
+
+  [1] -9.221298  3.394343
+  [1] 17  0  0
+*/
+
+val expected = Seq(
+  Vectors.dense(0.0, -9.221298, 3.394343),
+  Vectors.dense(17.0, 0.0, 0.0))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val wls = new WeightedLeastSquares(
+fitIntercept, regParam = 0.0, standardizeFeatures = false, standardizeLabel = true)
+.fit(instancesConstLabel)
--- End diff --

Sorry for getting back to you so late. The difference is due to the fact that `glmnet` always standardizes labels even when `standardization == false`; `standardization == false` only turns off the standardization of the features. As a result, at least in `glmnet`, when `ystd == 0.0` the training is not valid.
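To see concretely why `ystd == 0.0` breaks label standardization, here is a hedged Python sketch (the `standardize` helper is illustrative, not glmnet's actual implementation):

```python
import math

def standardize(values):
    # Plain population standardization, the kind glmnet applies to the
    # labels regardless of its standardize flag (a sketch, not glmnet).
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# A non-constant label standardizes fine:
z = standardize([1.0, 2.0, 3.0])

# A constant label has ystd == 0.0, so standardization divides by zero
# and the training problem is undefined:
try:
    standardize([17.0, 17.0, 17.0])
    constant_label_ok = True
except ZeroDivisionError:
    constant_label_ok = False
```

This is why a constant label needs the special-case handling discussed in the main PR rather than being fed through the standard path.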





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51064724
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
@@ -558,6 +575,47 @@ class LinearRegressionSuite
 }
   }
 
+  test("linear regression model with constant label") {
+/*
+   R code:
+   for (formula in c(b.const ~ . -1, b.const ~ .)) {
+ model <- lm(formula, data=df.const.label, weights=w)
+ print(as.vector(coef(model)))
+   }
+  [1] -9.221298  3.394343
+  [1] 17  0  0
+*/
+val expected = Seq(
+  Vectors.dense(0.0, -9.221298, 3.394343),
+  Vectors.dense(17.0, 0.0, 0.0))
+
+Seq("auto", "l-bfgs", "normal").foreach { solver =>
+  var idx = 0
+  for (fitIntercept <- Seq(false, true)) {
+val model = new LinearRegression()
+  .setFitIntercept(fitIntercept)
+  .setWeightCol("weight")
+  .setSolver(solver)
+  .fit(datasetWithWeightConstantLabel)
+val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+assert(actual ~== expected(idx) absTol 1e-4)
+idx += 1
+  }
+}
+  }
+
+  test("regularized linear regression through origin with constant label") {
+// The problem is ill-defined if fitIntercept=false, regParam is non-zero and \
--- End diff --

Remove the `\` at the end of the line.





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51064615
  
--- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala ---
@@ -558,6 +575,47 @@ class LinearRegressionSuite
 }
   }
 
+  test("linear regression model with constant label") {
+/*
+   R code:
+   for (formula in c(b.const ~ . -1, b.const ~ .)) {
+ model <- lm(formula, data=df.const.label, weights=w)
+ print(as.vector(coef(model)))
+   }
+  [1] -9.221298  3.394343
+  [1] 17  0  0
+*/
+val expected = Seq(
+  Vectors.dense(0.0, -9.221298, 3.394343),
+  Vectors.dense(17.0, 0.0, 0.0))
+
+Seq("auto", "l-bfgs", "normal").foreach { solver =>
+  var idx = 0
+  for (fitIntercept <- Seq(false, true)) {
+val model = new LinearRegression()
+  .setFitIntercept(fitIntercept)
+  .setWeightCol("weight")
+  .setSolver(solver)
+  .fit(datasetWithWeightConstantLabel)
+val actual = Vectors.dense(model.intercept, model.coefficients(0), model.coefficients(1))
+assert(actual ~== expected(idx) absTol 1e-4)
+idx += 1
+  }
+}
+  }
+
+  test("regularized linear regression through origin with constant label") {
+// The problem is ill-defined if fitIntercept=false, regParam is non-zero and \
+// standardization=true. An exception is thrown in this case.
--- End diff --

When `standardization=false`, the problem is still ill-defined since GLMNET always standardizes the labels. That's why you see it in the analytical solution. Let's throw an exception when `fitIntercept=false` and `regParam != 0.0`.





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51069962
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept)) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require(!($(regParam) > 0.0 && $(standardization)),
--- End diff --

remove `&& $(standardization)`





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51070090
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept)) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require(!($(regParam) > 0.0 && $(standardization)),
--- End diff --

Just `require($(regParam) == 0.0)`





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51070105
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept)) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require(!($(regParam) > 0.0 && $(standardization)),
--- End diff --

also change the message.





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10702#issuecomment-175901752
  
@iyounus With `standardizeLabel = false/true` and a non-zero `regParam`, let's throw the exception. I explained the mismatch against the analytic normal equation in the other PR.

Thanks.





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51070506
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept)) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require(!($(regParam) > 0.0 && $(standardization)),
+  "The standard deviation of the label is zero. " +
+"Model cannot be regularized with standardization=true")
+logWarning(s"The standard deviation of the label is zero. " +
+  "Consider setting fitIntercept=true.")
+  }
+}
+
+// if y is constant (rawYStd is zero), then y cannot be scaled. In this case
+// setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
+val yStd = if (rawYStd > 0) rawYStd else 1.0
--- End diff --

Actually, in the case of `yMean == 0.0 && yStd == 0.0`, the coefficients will be all zeros as well, even when `fitIntercept == false`. This is a rare case, so we can let the model training figure it out. But if you want to handle it explicitly, that's great.
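The claim above (zero label implies zero least-squares coefficients, intercept or not) follows from the normal equations: with `y == 0`, `X^T y` is the zero vector, so the solution is zero. A hedged toy sketch in Python (the `lstsq_2x2` helper is a direct 2-feature normal-equations solve, not the Spark solver):

```python
def lstsq_2x2(rows, ys):
    # Solve the 2-feature normal equations (X^T X) w = X^T y directly.
    a = sum(r[0] * r[0] for r in rows)
    b = sum(r[0] * r[1] for r in rows)
    d = sum(r[1] * r[1] for r in rows)
    p = sum(r[0] * y for r, y in zip(rows, ys))
    q = sum(r[1] * y for r, y in zip(rows, ys))
    det = a * d - b * b
    return ((d * p - b * q) / det, (a * q - b * p) / det)

rows = [[1.0, 2.0], [3.0, 1.0], [2.0, 5.0]]
w = lstsq_2x2(rows, [0.0, 0.0, 0.0])  # label identically zero
# w == (0.0, 0.0): X^T y = 0 forces every coefficient to zero
```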





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51077616
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept)) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require(!($(regParam) > 0.0 && $(standardization)),
+  "The standard deviation of the label is zero. " +
+"Model cannot be regularized with standardization=true")
+logWarning(s"The standard deviation of the label is zero. " +
+  "Consider setting fitIntercept=true.")
+  }
 }
 
+// if y is constant (rawYStd is zero), then y cannot be scaled. In this case
+// setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
+val yStd = if (rawYStd > 0) rawYStd else 1.0
--- End diff --

This will help the optimization if `yMean` is very large. In effect, we are 
normalizing the label instead of standardizing it.
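A toy illustration of that point (assumed numbers, not the Spark code): normalizing a constant label by `|yMean|` puts the quantity the optimizer must recover on a unit scale.

```scala
// Constant label with a large mean: unscaled, the optimizer must drive the
// intercept toward 1e9; after dividing by |yMean|, the target is just 1.0.
val yMean = 1e9
val rawYStd = 0.0
val yScale = if (rawYStd > 0) rawYStd else math.abs(yMean)
val scaledTarget = yMean / yScale  // 1.0
```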





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51070301
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept)) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require(!($(regParam) > 0.0 && $(standardization)),
+  "The standard deviation of the label is zero. " +
+"Model cannot be regularized with standardization=true")
+logWarning(s"The standard deviation of the label is zero. " +
+  "Consider setting fitIntercept=true.")
+  }
 }
 
+// if y is constant (rawYStd is zero), then y cannot be scaled. In this case
+// setting yStd=1.0 ensures that y is not scaled anymore in l-bfgs algorithm.
+val yStd = if (rawYStd > 0) rawYStd else 1.0
--- End diff --

I will do
```scala
val yStd = if (rawYStd > 0) rawYStd else if (yMean != 0.0) math.abs(yMean) else 1.0
```





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10940#discussion_r51051112
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -341,11 +341,11 @@ class LogisticRegression @Since("1.2.0") (
 regParamL1
   } else {
 // If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each component
-// differently to get effectively the same objective function when
+// to improve the rate of convergence unless the standard deviation is zero;
+// as a result, we have to perform this reverse standardization by penalizing
+// each component differently to get effectively the same objective function when
 // the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1
--- End diff --

The constant `value` can be really large, or very small and negative. The 
optimizer may not converge well in this case. I haven't proved or tried it 
yet, but mathematically, with the following change, this should solve an 
identical problem.

```scala
// the training dataset is not standardized.
if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1 / featuresMean(index)
```
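The reverse-standardization idea behind this penalty can be checked numerically (toy numbers; `lambda`, `std`, and `beta` are assumed values): training runs on features divided by their std, so the coefficient the optimizer sees is the original one multiplied by the std, and scaling the L1 penalty by `1 / std` recovers the original flat penalty.

```scala
// Reverse standardization of the L1 penalty: betaScaled = beta * std, and
// penalizing betaScaled with lambda / std equals penalizing beta with lambda.
val lambda = 0.1
val std = 4.0
val beta = 2.5                 // coefficient on the original feature scale
val betaScaled = beta * std    // coefficient seen by the optimizer
val scaledPenalty = (lambda / std) * math.abs(betaScaled)
val originalPenalty = lambda * math.abs(beta)
// the two penalties agree, so the two objectives are equivalent
```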





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10940#discussion_r51051223
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -971,8 +971,10 @@ private class LogisticAggregator(
   val margin = - {
 var sum = 0.0
 features.foreachActive { (index, value) =>
-  if (featuresStd(index) != 0.0 && value != 0.0) {
+  if (featuresStd(index) != 0.0) {
 sum += localCoefficientsArray(index) * (value / featuresStd(index))
+  } else {
+sum += localCoefficientsArray(index) * value
--- End diff --

Change `value` into `1.0`





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10940#discussion_r51051184
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -417,7 +417,7 @@ class LogisticRegression @Since("1.2.0") (
 val rawCoefficients = state.x.toArray.clone()
 var i = 0
 while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
+  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 }
--- End diff --

```scala
rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 1.0 / featuresMean(i) }
```





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10940#discussion_r51051267
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -1106,7 +1110,8 @@ private class LogisticCostFun(
 totalGradientArray(index) += regParamL2 * temp
 value * temp
   } else {
-0.0
+totalGradientArray(index) += regParamL2 * value
+value * value
--- End diff --

Change `value` into `1.0`





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10940#discussion_r51051238
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -983,8 +985,10 @@ private class LogisticAggregator(
   val multiplier = weight * (1.0 / (1.0 + math.exp(margin)) - label)
 
   features.foreachActive { (index, value) =>
-if (featuresStd(index) != 0.0 && value != 0.0) {
+if (featuresStd(index) != 0.0) {
   localGradientSumArray(index) += multiplier * (value / featuresStd(index))
+} else {
+  localGradientSumArray(index) += multiplier * value
--- End diff --

Change `value` into `1.0`





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10940#discussion_r51052743
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -341,11 +341,11 @@ class LogisticRegression @Since("1.2.0") (
 regParamL1
   } else {
 // If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each component
-// differently to get effectively the same objective function when
+// to improve the rate of convergence unless the standard deviation is zero;
+// as a result, we have to perform this reverse standardization by penalizing
+// each component differently to get effectively the same objective function when
 // the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1
--- End diff --

BTW, this technique is already used here by dividing by the std of the 
feature. Note that you need to make sure `value != 0`; if `value == 0`, the 
coefficient of that feature will be zero.





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10940#issuecomment-175868432
  
Intuitively, this makes sense to me: with `setFitIntercept(false)`, 
features with `std == 0` act as a stand-in for the intercept, resulting in 
non-zero coefficients.

Can you add a couple more tests as follows? Thanks.

First, add two new datasets derived from `binaryDataset`: one with a column 
zeroed out, and one with a column set to a non-zero constant.

Match the result against GLMNET, as in the rest of the tests, when
1) setFitIntercept(false), setStandardization(false), with/without regularization
2) setFitIntercept(false), setStandardization(true), with/without regularization
3) setFitIntercept(true), setStandardization(false), with/without regularization
4) setFitIntercept(true), setStandardization(true), with/without regularization
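The requested matrix of cases can be enumerated in one comprehension (a sketch; the regParam value 0.1 is an assumed placeholder):

```scala
// Enumerate the four fitIntercept/standardization combinations, each
// with and without regularization, for GLMNET-comparison tests.
val cases = for {
  fitIntercept <- Seq(true, false)
  standardization <- Seq(true, false)
  regParam <- Seq(0.0, 0.1)  // without / with regularization (assumed value)
} yield (fitIntercept, standardization, regParam)
// 8 settings in total; each would configure a LogisticRegression via
// setFitIntercept / setStandardization / setRegParam and be fit on the
// zeroed-out and constant-column variants of the dataset.
```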

+cc @iyounus Linear Regression may have a similar issue; if you have time, 
you may check it out. Thanks.





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10940#discussion_r51054476
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -341,11 +341,11 @@ class LogisticRegression @Since("1.2.0") (
 regParamL1
   } else {
 // If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each component
-// differently to get effectively the same objective function when
+// to improve the rate of convergence unless the standard deviation is zero;
+// as a result, we have to perform this reverse standardization by penalizing
+// each component differently to get effectively the same objective function when
 // the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else 0.0
+if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) else regParamL1
--- End diff --

@coderxiang You may try it out. When the constant `value` is very small, 
like `1e-6`, compared with the rest of the features, the corresponding 
coefficient will be very large. In this case, the coefficients being 
optimized are on very different scales, which often causes convergence 
issues in line search.

A similar argument can be made when the constant `value` is very large 
compared with the rest of the features.

That's why we still standardize the features even when users ask for 
`standardization == false`.
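The scale disparity can be made concrete with toy numbers (purely illustrative):

```scala
// A tiny constant feature needs a huge coefficient to contribute as much to
// the margin as a unit-scale (standardized) feature does.
val typicalContribution = 1.0   // contribution of a standardized feature
val constFeature = 1e-6         // tiny constant feature value
val neededCoefficient = typicalContribution / constFeature
// neededCoefficient is on the order of 1e6, so the coefficient vector spans
// ~6 orders of magnitude and line search becomes poorly conditioned.
```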





[GitHub] spark pull request: [SPARK-13029][ml] fix a logistic regression is...

2016-01-27 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10940#issuecomment-175872406
  
@coderxiang I agree: without regularization, those features become 
collinear, so the solution will not be unique. However, for the features 
with `std != 0`, the coefficients should be unique. Can you at least check 
those?





[GitHub] spark pull request: [SPARK-10780] [ML] Set initialModel in KMeans ...

2016-01-27 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/8972#issuecomment-175882882
  
@yinxusen Thanks. I will check it out soon.





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-27 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r51077516
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,43 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
-
-// If the yStd is zero, then the intercept is yMean with zero coefficient;
-// as a result, training is not needed.
-if (yStd == 0.0) {
-  logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
-s"zeros and the intercept will be the mean of the label; as a result, " +
-s"training is not needed.")
-  if (handlePersistence) instances.unpersist()
-  val coefficients = Vectors.sparse(numFeatures, Seq())
-  val intercept = yMean
-
-  val model = new LinearRegressionModel(uid, coefficients, intercept)
-  // Handle possible missing or invalid prediction columns
-  val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
-
-  val trainingSummary = new LinearRegressionTrainingSummary(
-summaryModel.transform(dataset),
-predictionColName,
-$(labelCol),
-model,
-Array(0D),
-$(featuresCol),
-Array(0D))
-  return copyValues(model.setSummary(trainingSummary))
+val rawYStd = math.sqrt(ySummarizer.variance(0))
+if (rawYStd == 0.0) {
+  if ($(fitIntercept)) {
+// If the rawYStd is zero and fitIntercept=true, then the intercept is yMean with
+// zero coefficient; as a result, training is not needed.
+logWarning(s"The standard deviation of the label is zero, so the coefficients will be " +
+  s"zeros and the intercept will be the mean of the label; as a result, " +
+  s"training is not needed.")
+if (handlePersistence) instances.unpersist()
+val coefficients = Vectors.sparse(numFeatures, Seq())
+val intercept = yMean
+
+val model = new LinearRegressionModel(uid, coefficients, intercept)
+// Handle possible missing or invalid prediction columns
+val (summaryModel, predictionColName) = model.findSummaryModelAndPredictionCol()
+
+val trainingSummary = new LinearRegressionTrainingSummary(
+  summaryModel.transform(dataset),
+  predictionColName,
+  $(labelCol),
+  model,
+  Array(0D),
+  $(featuresCol),
+  Array(0D))
+return copyValues(model.setSummary(trainingSummary))
+  } else {
+require(!($(regParam) > 0.0 && $(standardization)),
--- End diff --

Your PR for `WeightedLeastSquares` has the right behavior, consistent with 
`require($(regParam) != 0.0)`. The problem only happens when 
`standardizationLabel = true`, and in your `WeightedLeastSquares` you 
already throw an exception in this case.

GLMNET will do `standardizationLabel = true` even when `standardization == 
false`, and that's why you see an inconsistent solution between GLMNET and 
your analytical solution.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50931610
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
 
-if ($(fitIntercept)) {
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
+  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
   /*
  For binary logistic regression, when we initialize the coefficients as zeros,
  it will converge faster if we initialize the intercept such that
  it follows the distribution of the labels.
 
  {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
+ P(0) = 1 / (1 + \exp(b)), and
+ P(1) = \exp(b) / (1 + \exp(b))
  }}}, hence
  {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
+ b = \log{P(1) / P(0)} = \log{count_1 / count_0}
--- End diff --

put two spaces back.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50931631
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
 
-if ($(fitIntercept)) {
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
+  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
   /*
  For binary logistic regression, when we initialize the coefficients as zeros,
  it will converge faster if we initialize the intercept such that
  it follows the distribution of the labels.
 
  {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
+ P(0) = 1 / (1 + \exp(b)), and
+ P(1) = \exp(b) / (1 + \exp(b))
--- End diff --

ditto





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50931598
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
 
-if ($(fitIntercept)) {
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
+  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
   /*
  For binary logistic regression, when we initialize the coefficients as zeros,
  it will converge faster if we initialize the intercept such that
  it follows the distribution of the labels.
 
  {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
+ P(0) = 1 / (1 + \exp(b)), and
+ P(1) = \exp(b) / (1 + \exp(b))
  }}}, hence
  {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
+ b = \log{P(1) / P(0)} = \log{count_1 / count_0}
  }}}
*/
-  initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
-histogram(1) / histogram(0))
+  initialCoefficientsWithIntercept.toArray(numFeatures)
++= math.log(histogram(1) / histogram(0))
--- End diff --

revert this change.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-26 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50931620
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -343,22 +355,36 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
 
-if ($(fitIntercept)) {
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size != numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
+  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
   /*
  For binary logistic regression, when we initialize the coefficients as zeros,
  it will converge faster if we initialize the intercept such that
  it follows the distribution of the labels.
 
  {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
+ P(0) = 1 / (1 + \exp(b)), and
--- End diff --

put two spaces back.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-26 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10788#issuecomment-175341628
  
Thanks. Merged into master.





[GitHub] spark pull request: [SPARK-10780] [ML] Set initialModel in KMeans ...

2016-01-26 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/8972#issuecomment-175342081
  
@jayantshekhar  Still working on it? 





[GitHub] spark pull request: [SPARK-10780] [ML] Set initialModel in KMeans ...

2016-01-26 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/8972#issuecomment-175342352
  
We added a private API in LOR to do the same thing, and would like to 
generalize it for other algorithms. 
+cc @holdenk 





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-25 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10788#issuecomment-174662125
  
I'm going through the caching logic now. Will let you know soon. Thanks.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369722
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else 
numFeatures)
 
-if ($(fitIntercept)) {
-  /*
- For binary logistic regression, when we initialize the 
coefficients as zeros,
- it will converge faster if we initialize the intercept such 
that
- it follows the distribution of the labels.
-
- {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
- }}}, hence
- {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
- }}}
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
!= numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the 
expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
== numFeatures) {
+  val initialCoefficientsWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, 
value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == 
optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
+  /**
+   * For binary logistic regression, when we initialize the 
coefficients as zeros,
+   * it will converge faster if we initialize the intercept such 
that
+   * it follows the distribution of the labels.
+
+   * {{{
+   * P(0) = 1 / (1 + \exp(b)), and
+   * P(1) = \exp(b) / (1 + \exp(b))
+   * }}}, hence
+   * {{{
+   * b = \log{P(1) / P(0)} = \log{count_1 / count_0}
+   * }}}
*/
-  initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
-histogram(1) / histogram(0))
+  initialCoefficientsWithIntercept.toArray(numFeatures)
+  = math.log(histogram(1) / histogram(0))
 }
 
 val states = optimizer.iterations(new CachedDiffFunction(costFun),
   initialCoefficientsWithIntercept.toBreeze.toDenseVector)
 
-/*
-   Note that in Logistic Regression, the objective history (loss + 
regularization)
-   is log-likelihood which is invariance under feature 
standardization. As a result,
-   the objective history from optimizer is the same as the one in 
the original space.
+/**
+ * Note that in Logistic Regression, the objective history (loss + 
regularization)
--- End diff --

reverse the style change





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369730
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -374,11 +395,11 @@ class LogisticRegression @Since("1.2.0") (
   throw new SparkException(msg)
 }
 
-/*
-   The coefficients are trained in the scaled space; we're 
converting them back to
-   the original space.
-   Note that the intercept in scaled space and original space is 
the same;
-   as a result, no scaling is needed.
+/**
+ * The coefficients are trained in the scaled space; we're 
converting them back to
--- End diff --

ditto





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50370169
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, userSuppliedWeights = true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+lr.setStandardization(useFeatureScaling)
+if (userSuppliedWeights) {
+  val uid = Identifiable.randomUID("logreg-static")
+  lr.setInitialModel(new 
org.apache.spark.ml.classification.LogisticRegressionModel(
+uid, initialWeights, 1.0))
+}
+lr.setFitIntercept(addIntercept)
+lr.setMaxIter(optimizer.getNumIterations())
+lr.setTol(optimizer.getConvergenceTol())
+// Convert our input into a DataFrame
+val sqlContext = new SQLContext(input.context)
+import sqlContext.implicits._
+val df = input.toDF()
+// Determine if we should cache the DF
+val handlePersistence = input.getStorageLevel == StorageLevel.NONE
--- End diff --

Will this cause double caching? Let's say the input RDD is already cached, so 
`handlePersistence` will be false and the wrapper won't persist the DataFrame. As a 
result, the storage-level check in ml's LOR code will see `StorageLevel.NONE`, 
persist the DataFrame, and the same data ends up cached twice. 
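The concern can be illustrated with a Spark-free sketch (names are illustrative, not Spark API): two independent "persist only if not already persisted" guards, each looking at a different object, can cache the same data twice.

```scala
// Counts how many cached copies of the training data exist after the
// mllib wrapper (guard #1) and ml's LOR (guard #2) both run their checks.
object DoubleCaching {
  def cachedCopies(inputRddCached: Boolean): Int = {
    var copies = if (inputRddCached) 1 else 0 // the user's cached input RDD
    var dfPersisted = false
    // Guard #1 (mllib wrapper): persist the DataFrame only when the
    // input RDD is not cached.
    if (!inputRddCached) { dfPersisted = true; copies += 1 }
    // Guard #2 (ml's LOR): the freshly converted DataFrame's storage level
    // is NONE whenever guard #1 skipped it, so it persists again.
    if (!dfPersisted) { dfPersisted = true; copies += 1 }
    copies
  }

  def main(args: Array[String]): Unit = {
    assert(cachedCopies(inputRddCached = false) == 1) // only the DataFrame
    assert(cachedCopies(inputRddCached = true) == 2)  // RDD *and* DataFrame
    println("ok")
  }
}
```

The second guard cannot see that the underlying RDD is cached, which is exactly the double-caching scenario raised in the review.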





[GitHub] spark pull request: [SPARK-12908][ML] Add warning message for Logi...

2016-01-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10862#issuecomment-173494166
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50370273
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, userSuppliedWeights = true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+lr.setStandardization(useFeatureScaling)
+if (userSuppliedWeights) {
+  val uid = Identifiable.randomUID("logreg-static")
+  lr.setInitialModel(new 
org.apache.spark.ml.classification.LogisticRegressionModel(
+uid, initialWeights, 1.0))
+}
+lr.setFitIntercept(addIntercept)
+lr.setMaxIter(optimizer.getNumIterations())
+lr.setTol(optimizer.getConvergenceTol())
+// Convert our input into a DataFrame
+val sqlContext = new SQLContext(input.context)
+import sqlContext.implicits._
+val df = input.toDF()
+// Determine if we should cache the DF
+val handlePersistence = input.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  df.persist(StorageLevel.MEMORY_AND_DISK)
+}
+// Train our model
+val mlLogisticRegresionModel = lr.train(df)
+// unpersist if we persisted
+if (handlePersistence) {
+  df.unpersist()
+}
+// convert the model
+val weights = mlLogisticRegresionModel.weights match {
--- End diff --

```scala
val weights = Vectors.dense(mlLogisticRegresionModel.coefficients.toArray)
```





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50370414
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, userSuppliedWeights = true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+lr.setStandardization(useFeatureScaling)
+if (userSuppliedWeights) {
+  val uid = Identifiable.randomUID("logreg-static")
+  lr.setInitialModel(new 
org.apache.spark.ml.classification.LogisticRegressionModel(
+uid, initialWeights, 1.0))
+}
+lr.setFitIntercept(addIntercept)
+lr.setMaxIter(optimizer.getNumIterations())
+lr.setTol(optimizer.getConvergenceTol())
+// Convert our input into a DataFrame
+val sqlContext = new SQLContext(input.context)
+import sqlContext.implicits._
+val df = input.toDF()
+// Determine if we should cache the DF
+val handlePersistence = input.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  df.persist(StorageLevel.MEMORY_AND_DISK)
+}
+// Train our model
+val mlLogisticRegresionModel = lr.train(df)
+// unpersist if we persisted
+if (handlePersistence) {
+  df.unpersist()
+}
+// convert the model
+val weights = mlLogisticRegresionModel.weights match {
+  case x: DenseVector => x
+  case y: Vector => Vectors.dense(y.toArray)
+}
+createModel(weights, mlLogisticRegresionModel.intercept)
+  }
+  optimizer.getUpdater() match {
--- End diff --

when `optimizer.getRegParam() == 0.0`, run the old version.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-21 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10788#issuecomment-173493541
  
LGTM except some styling issues, and concern about caching twice. Thanks.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-21 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50371017
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +384,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   *
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, userSuppliedWeights = true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+lr.setStandardization(useFeatureScaling)
+if (userSuppliedWeights) {
+  val uid = Identifiable.randomUID("logreg-static")
+  lr.setInitialModel(new 
org.apache.spark.ml.classification.LogisticRegressionModel(
+uid, initialWeights, 1.0))
+}
+lr.setFitIntercept(addIntercept)
+lr.setMaxIter(optimizer.getNumIterations())
+lr.setTol(optimizer.getConvergenceTol())
+// Convert our input into a DataFrame
+val sqlContext = new SQLContext(input.context)
+import sqlContext.implicits._
+val df = input.toDF()
+// Determine if we should cache the DF
+val handlePersistence = input.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  df.persist(StorageLevel.MEMORY_AND_DISK)
+}
+// Train our model
+val mlLogisticRegresionModel = lr.train(df)
+// unpersist if we persisted
+if (handlePersistence) {
+  df.unpersist()
+}
+// convert the model
+val weights = mlLogisticRegresionModel.weights match {
+  case x: DenseVector => x
+  case y: Vector => Vectors.dense(y.toArray)
+}
+createModel(weights, mlLogisticRegresionModel.intercept)
+  }
+  optimizer.getUpdater() match {
--- End diff --

Okay, this will make the test harder to write. I don't mind leaving this one for now.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369078
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else 
numFeatures)
 
-if ($(fitIntercept)) {
-  /*
- For binary logistic regression, when we initialize the 
coefficients as zeros,
- it will converge faster if we initialize the intercept such 
that
- it follows the distribution of the labels.
-
- {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
- }}}, hence
- {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
- }}}
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
!= numFeatures) {
--- End diff --

How can this compile? Should be `optInitialModel.get.coefficients.size != 
numFeatures`
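A corrected version of the diff's initialization logic might look like the sketch below. This is a hedged illustration with plain arrays standing in for ml Vectors; it compares the coefficient vector's *size* to `numFeatures` (as the review asks), and uses `=` rather than the diff's `==` typo when setting the intercept, which otherwise silently discards the initial intercept.

```scala
// Hypothetical corrected initialization: copy user-supplied coefficients
// (and intercept) only when the sizes actually match.
object InitialCoefficients {
  def initialWithIntercept(
      initial: Option[(Array[Double], Double)],
      numFeatures: Int,
      fitIntercept: Boolean): Array[Double] = {
    val out = new Array[Double](if (fitIntercept) numFeatures + 1 else numFeatures)
    initial match {
      case Some((coef, intercept)) if coef.length == numFeatures =>
        Array.copy(coef, 0, out, 0, numFeatures)
        if (fitIntercept) out(numFeatures) = intercept // assignment, not `==`
      case Some((coef, _)) =>
        // Mirrors the diff's logWarning: sizes don't match, fall back to zeros.
        println(s"Initial coefficients of size ${coef.length} did not match " +
          s"the expected size $numFeatures")
      case None => ()
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val init = initialWithIntercept(Some((Array(1.0, 2.0), 3.0)), 2, fitIntercept = true)
    assert(init.sameElements(Array(1.0, 2.0, 3.0)))
    // Mismatched size: coefficients stay zero.
    val bad = initialWithIntercept(Some((Array(1.0), 0.5)), 2, fitIntercept = false)
    assert(bad.sameElements(Array(0.0, 0.0)))
    println("ok")
  }
}
```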





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369207
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else 
numFeatures)
 
-if ($(fitIntercept)) {
-  /*
- For binary logistic regression, when we initialize the 
coefficients as zeros,
- it will converge faster if we initialize the intercept such 
that
- it follows the distribution of the labels.
-
- {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
- }}}, hence
- {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
- }}}
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
!= numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the 
expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
== numFeatures) {
--- End diff --

ditto





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369141
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else 
numFeatures)
 
-if ($(fitIntercept)) {
-  /*
- For binary logistic regression, when we initialize the 
coefficients as zeros,
- it will converge faster if we initialize the intercept such 
that
- it follows the distribution of the labels.
-
- {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
- }}}, hence
- {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
- }}}
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
!= numFeatures) {
+  val vec = optInitialModel.get.coefficients
--- End diff --

`vec` is not used.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369668
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else 
numFeatures)
 
-if ($(fitIntercept)) {
-  /*
- For binary logistic regression, when we initialize the 
coefficients as zeros,
- it will converge faster if we initialize the intercept such 
that
- it follows the distribution of the labels.
-
- {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
- }}}, hence
- {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
- }}}
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
!= numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the 
expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
== numFeatures) {
+  val initialCoefficientsWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, 
value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == 
optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
+  /**
+   * For binary logistic regression, when we initialize the 
coefficients as zeros,
+   * it will converge faster if we initialize the intercept such 
that
+   * it follows the distribution of the labels.
+
--- End diff --

I think you have to remove all the `*`. I think we decided to do comments like

```
/*
 Start the sentence.
 */
```




[GitHub] spark pull request: [SPARK-12908][ML] Add warning message for Logi...

2016-01-20 Thread dbtsai
GitHub user dbtsai opened a pull request:

https://github.com/apache/spark/pull/10862

[SPARK-12908][ML] Add warning message for LogisticRegression for potential 
converge issue

When all labels are the same, it's dangerous ground for 
LogisticRegression without an intercept to converge. GLMNET doesn't support this 
case and will just exit. GLM can train, but will emit a warning saying 
the algorithm doesn't converge. 
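The check the PR describes can be sketched as follows. This is a minimal illustration under the stated assumption, not the actual Spark code; `convergenceWarning` is a hypothetical name:

```scala
// Sketch: constant labels with fitIntercept = false risk non-convergence,
// so emit a warning message in that case.
object ConvergenceCheck {
  def convergenceWarning(labels: Seq[Double], fitIntercept: Boolean): Option[String] =
    if (!fitIntercept && labels.nonEmpty && labels.distinct.size == 1)
      Some("All labels are the same value and fitIntercept = false; " +
        "the coefficients may not converge.")
    else None

  def main(args: Array[String]): Unit = {
    assert(convergenceWarning(Seq(1.0, 1.0, 1.0), fitIntercept = false).isDefined)
    assert(convergenceWarning(Seq(1.0, 1.0, 1.0), fitIntercept = true).isEmpty)
    assert(convergenceWarning(Seq(0.0, 1.0), fitIntercept = false).isEmpty)
    println("ok")
  }
}
```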

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/dbtsai/spark add-tests

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10862


commit e68cc38134ca78d5e8425aad4b1b5fd36c781ccc
Author: DB Tsai <d...@netflix.com>
Date:   2016-01-21T06:58:36Z

add warning message







[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369552
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else 
numFeatures)
 
-if ($(fitIntercept)) {
-  /*
- For binary logistic regression, when we initialize the 
coefficients as zeros,
- it will converge faster if we initialize the intercept such 
that
- it follows the distribution of the labels.
-
- {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
- }}}, hence
- {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
- }}}
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
!= numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the 
expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
== numFeatures) {
+  val initialCoefficientsWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, 
value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == 
optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
+  /**
+   * For binary logistic regression, when we initialize the 
coefficients as zeros,
+   * it will converge faster if we initialize the intercept such 
that
+   * it follows the distribution of the labels.
+
--- End diff --

remove the extra line





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-20 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50369516
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -335,31 +342,45 @@ class LogisticRegression @Since("1.2.0") (
 val initialCoefficientsWithIntercept =
   Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else 
numFeatures)
 
-if ($(fitIntercept)) {
-  /*
- For binary logistic regression, when we initialize the 
coefficients as zeros,
- it will converge faster if we initialize the intercept such 
that
- it follows the distribution of the labels.
-
- {{{
-   P(0) = 1 / (1 + \exp(b)), and
-   P(1) = \exp(b) / (1 + \exp(b))
- }}}, hence
- {{{
-   b = \log{P(1) / P(0)} = \log{count_1 / count_0}
- }}}
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
!= numFeatures) {
+  val vec = optInitialModel.get.coefficients
+  logWarning(
+s"Initial coefficients provided ${vec} did not match the 
expected size ${numFeatures}")
+}
+
+if (optInitialModel.isDefined && optInitialModel.get.coefficients 
== numFeatures) {
+  val initialCoefficientsWithInterceptArray = 
initialCoefficientsWithIntercept.toArray
+  optInitialModel.get.coefficients.foreachActive { case (index, 
value) =>
+initialCoefficientsWithInterceptArray(index) = value
+  }
+  if ($(fitIntercept)) {
+initialCoefficientsWithInterceptArray(numFeatures) == 
optInitialModel.get.intercept
+  }
+} else if ($(fitIntercept)) {
+  /**
+   * For binary logistic regression, when we initialize the 
coefficients as zeros,
+   * it will converge faster if we initialize the intercept such 
that
+   * it follows the distribution of the labels.
+
+   * {{{
+   * P(0) = 1 / (1 + \exp(b)), and
+   * P(1) = \exp(b) / (1 + \exp(b))
+   * }}}, hence
+   * {{{
+   * b = \log{P(1) / P(0)} = \log{count_1 / count_0}
+   * }}}
*/
-  initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
-histogram(1) / histogram(0))
+  initialCoefficientsWithIntercept.toArray(numFeatures)
+  = math.log(histogram(1) / histogram(0))
--- End diff --

add two spaces.





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-20 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10702#issuecomment-173395155
  
@mengxr This PR is also on my radar. I'm working on another PR now; once 
@iyounus is ready, I will work on this.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172954547
  
Merged into master. Thanks.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172965625
  
A followup PR is created: https://issues.apache.org/jira/browse/SPARK-12908





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-19 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172954097
  
Ideally, we would like to see the one without intercept to ensure that 
nothing gets messed up during the refactoring. I'll merge now, and let's 
add the test for all-same labels without intercept. Thanks.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50032693
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
h
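The reverse-standardization trick described in the quoted comment can be checked numerically: penalizing component j by regParamL1 / featuresStd(j) in the standardized space gives the same objective as an ordinary L1 penalty in the original space, since w_raw_j = w_std_j / sigma_j. A small Python sketch with hypothetical values:

```python
# Hypothetical check: an L1 penalty of lam on raw-space coefficients equals
# a per-component penalty of lam / sigma_j applied to the standardized-space
# coefficients, because w_raw_j = w_std_j / sigma_j.
lam = 0.5
sigma = [2.0, 4.0]   # per-feature standard deviations
w_std = [1.0, -3.0]  # coefficients of the standardized problem

penalty_raw = lam * sum(abs(w / s) for w, s in zip(w_std, sigma))
penalty_reweighted = sum((lam / s) * abs(w) for w, s in zip(w_std, sigma))
# both penalties are identical, so the two objectives share the same minimizer
```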

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r50032721
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are zero and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
h

[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50040773
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -322,10 +329,25 @@ class LogisticRegression @Since("1.2.0") (
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, 
$(tol))
 }
 
+val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 
else numFeatures
--- End diff --

Is this used?





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50041495
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, userSuppliedWeights = true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logistic regression only supports binary classification currently.
+if (numOfLinearPredictor == 1) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+lr.setStandardization(useFeatureScaling)
+if (userSuppliedWeights) {
+  val initialWeightsWithIntercept = if (addIntercept) {
--- End diff --

This is not used anymore.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50041152
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -343,8 +365,8 @@ class LogisticRegression @Since("1.2.0") (
 = math.log(histogram(1) / histogram(0))
 }
 
-val states = optimizer.iterations(new CachedDiffFunction(costFun),
-  initialCoefficientsWithIntercept.toBreeze.toDenseVector)
+  val states = optimizer.iterations(new CachedDiffFunction(costFun),
+initialCoefficientsWithIntercept.toBreeze.toDenseVector)
--- End diff --

Wrong indentation. Remove two spaces.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50041320
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
--- End diff --

Remove `starting from the initial weights provided.` and add an extra new 
line here for readability.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-18 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r50041364
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +384,85 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), userSuppliedWeights = false)
+  }
+
+  /**
+   * Run Logistic Regression with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
--- End diff --

Add extra new line before `If a known updater is...`





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962408
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
--- End diff --

Replace `algorithm` by `Logistic Regression`, and remove `starting from the 
initial weights provided`.

Add a new line between `of LabeledPoint entries` and `If a known updater is 
used`.

Actually, in the ml version, disabling feature scaling is supported now, so 
please call the ml implementation in this case.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962416
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
--- End diff --

`run(input, generateInitialWeights(input), userSuppliedWeights = false)`





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49963912
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") (
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, 
$(tol))
 }
 
-val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 
else numFeatures
+val initialCoefficientsWithIntercept = 
optInitialCoefficients.getOrElse(
+  Vectors.zeros(numFeaturesWithIntercept))
 
--- End diff --

here, 

```scala
val initialCoefficientsWithIntercept =
  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)

if (optInitialModel.isDefined && optInitialModel.get.coefficients.size == numFeatures) {
  val initialCoefficientsWithInterceptArray = initialCoefficientsWithIntercept.toArray
  optInitialModel.get.coefficients.foreachActive { case (index, value) =>
    initialCoefficientsWithInterceptArray(index) = value
  }
  if ($(fitIntercept)) {
    initialCoefficientsWithInterceptArray(numFeatures) = optInitialModel.get.intercept
  }
} else if ($(fitIntercept)) {
  /*
     For binary logistic regression, when we initialize the coefficients as zeros,
     it will converge faster if we initialize the intercept such that
     it follows the distribution of the labels.
     {{{
     P(0) = 1 / (1 + \exp(b)), and
     P(1) = \exp(b) / (1 + \exp(b))
     }}}, hence
     {{{
     b = \log{P(1) / P(0)} = \log{count_1 / count_0}
     }}}
   */
  initialCoefficientsWithIntercept.toArray(numFeatures)
    = math.log(histogram(1) / histogram(0))
}
```
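The intercept initialization in the block above can be sanity-checked numerically: with zero coefficients, setting b = log(count_1 / count_0) makes the model's predicted P(1) equal the empirical label frequency. A small Python sketch with a hypothetical label histogram:

```python
import math

# Hypothetical label histogram: count_0 zeros and count_1 ones.
count_0, count_1 = 30.0, 70.0
b = math.log(count_1 / count_0)  # the intercept initialization above

# With all coefficients zero, the model predicts P(1) = sigmoid(b),
# which matches the empirical label frequency count_1 / (count_0 + count_1).
p1 = math.exp(b) / (1.0 + math.exp(b))
```

Starting at the empirical frequency rather than P(1) = 0.5 is what makes the optimizer converge faster.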





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962448
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, true)
--- End diff --

`run(input, generateInitialWeights(input), userSuppliedWeights = true)`
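A toy illustration of the naming suggestion above (hypothetical names, not the Spark API): a bare `true` at a call site says nothing, while a named boolean argument documents intent.

```scala
// Hypothetical stand-in for the private run overload being reviewed.
def run(data: Seq[Double], userSuppliedWeights: Boolean): Int =
  if (userSuppliedWeights) data.size else 0

// Reads as "run with user-supplied weights", not just "run with true".
val n = run(Seq(1.0, 2.0), userSuppliedWeights = true)
```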





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962512
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
--- End diff --

Replace `algorithm` with `Logistic Regression`, and add a newline.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962541
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1 && useFeatureScaling) {
--- End diff --

You can remove `useFeatureScaling` and pass it as `setStandardization` in the ML 
implementation.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49961903
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialCoefficients: Option[Vector] = None
+  /** @group setParam */
+  private[spark] def setInitialWeights(value: Vector): this.type = {
+this.optInitialCoefficients = Some(value)
+this
+  }
+
+  /**
+   * Validate the initial weights, return an Option, if not the expected 
size return None
+   * and log a warning.
+   */
+  private def validateWeights(vectorOpt: Option[Vector], numFeatures: 
Int): Option[Vector] = {
+vectorOpt.flatMap(vec =>
+  if (vec.size == numFeatures) {
+Some(vec)
+  } else {
+logWarning(
+  s"""Initial weights provided (${vec})did not match the expected 
size ${numFeatures}""")
--- End diff --

BTW, why `s"""`? Also, change `weights` to `coefficients`.
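A self-contained sketch of the size check under review, with `println` standing in for Spark's `logWarning` and plain arrays standing in for `Vector` (assumed simplifications): flatMap over the `Option`, keep the vector only when its size matches, otherwise warn and fall back to `None`.

```scala
def validateCoefficients(vecOpt: Option[Array[Double]], numFeatures: Int): Option[Array[Double]] =
  vecOpt.flatMap { vec =>
    if (vec.length == numFeatures) {
      Some(vec)
    } else {
      // Spark would call logWarning here; println stands in for the sketch.
      println(s"Initial coefficients provided (size ${vec.length}) " +
        s"did not match the expected size $numFeatures")
      None
    }
  }
```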





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49961912
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialCoefficients: Option[Vector] = None
+  /** @group setParam */
+  private[spark] def setInitialWeights(value: Vector): this.type = {
+this.optInitialCoefficients = Some(value)
+this
+  }
+
+  /**
+   * Validate the initial weights, return an Option, if not the expected 
size return None
+   * and log a warning.
+   */
+  private def validateWeights(vectorOpt: Option[Vector], numFeatures: 
Int): Option[Vector] = {
--- End diff --

`validateCoefficients`





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49963463
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialCoefficients: Option[Vector] = None
+
+  /** @group setParam */
+  private[spark] def setInitialModel(model: LogisticRegressionModel): 
this.type = {
+this.optInitialCoefficients = Some(model.coefficients)
--- End diff --

You don't want to lose the intercept information in the model.
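A minimal sketch of keeping the intercept when seeding from a model, mirroring the `appendBias`-style concatenation used elsewhere in the PR (plain arrays stand in for `Vector`; names assumed): the intercept is appended as the last entry rather than dropped.

```scala
// Append the intercept to the coefficient vector so it is not lost
// when the model is flattened into a single initial-weights vector.
def appendBias(coefficients: Array[Double], intercept: Double): Array[Double] =
  coefficients :+ intercept

val seeded = appendBias(Array(0.1, -0.2), 0.5) // Array(0.1, -0.2, 0.5)
```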





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49963455
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,8 +247,15 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialCoefficients: Option[Vector] = None
--- End diff --

Keep the reference to `InitialModel` here.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49961714
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialWeights: Option[Vector] = None
+  /** @group setParam */
+  private[spark] def setInitialWeights(value: Vector): this.type = {
+this.optInitialWeights = Some(value)
+this
+  }
--- End diff --

How about we follow https://github.com/apache/spark/pull/8972 and have 
the following code? We can create a separate JIRA for making 
`setInitialModel` public with a sharedParam.

```scala
  private var initialModel: Option[LogisticRegressionModel] = None

  private def setInitialModel(model: LogisticRegressionModel): this.type = {
...
...
this
  }
```
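A hypothetical filling-in of the elided body above, with a stand-in `Model` case class rather than the real `LogisticRegressionModel` (all names assumed): store the whole model in the `Option` and return `this.type` so setter calls chain.

```scala
// Stand-in for LogisticRegressionModel: coefficients plus intercept.
final case class Model(coefficients: Array[Double], intercept: Double)

class Trainer {
  private var initialModel: Option[Model] = None

  // Returning this.type keeps the concrete subtype in chained calls.
  def setInitialModel(model: Model): this.type = {
    initialModel = Some(model)
    this
  }

  def hasInitialModel: Boolean = initialModel.isDefined
}

val t = new Trainer().setInitialModel(Model(Array(1.0), 0.0))
```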





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962964
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, true)
--- End diff --

haha... yes.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962944
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") (
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, 
$(tol))
 }
 
-val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 
else numFeatures
+val userSuppliedCoefficients = validateWeights(optInitialCoefficients, 
numFeaturesWithIntercept)
--- End diff --

You will know the number of features from the size of the coefficients set by 
`setInitialModel`. There is no ambiguity here since it's binary, and the 
intercept has a separate variable.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49963357
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1 && useFeatureScaling) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+if (userSuppliedWeights) {
+  val initialWeightsWithIntercept = if (addIntercept) {
+appendBias(initialWeights)
+  } else {
+initialWeights
+  }
+  lr.setInitialWeights(initialWeightsWithIntercept)
+}
+lr.setFitIntercept(addIntercept)
+lr.setMaxIter(optimizer.getNumIterations())
+lr.setTol(optimizer.getConvergenceTol())
+// Convert our input into a DataFrame
+val sqlContext = new SQLContext(input.context)
+import sqlContext.implicits._
+val df = input.toDF()
+// Determine if we should cache the DF
+val handlePersistence = input.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  df.persist(StorageLevel.MEMORY_AND_DISK)
+}
--- End diff --

that makes sense. 
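The persist/unpersist logic in the diff above can be abstracted into a small sketch (no Spark dependency; counters stand in for the `DataFrame` calls, and all names are assumed): cache only when the caller has not already cached the input, and release only what we cached ourselves.

```scala
var persisted = 0
var unpersisted = 0

def withCaching[T](alreadyCached: Boolean)(body: => T): T = {
  // Mirrors: input.getStorageLevel == StorageLevel.NONE
  val handlePersistence = !alreadyCached
  if (handlePersistence) persisted += 1        // df.persist(MEMORY_AND_DISK)
  try body
  finally if (handlePersistence) unpersisted += 1 // df.unpersist()
}

withCaching(alreadyCached = false) { () } // caches and releases
withCaching(alreadyCached = true) { () }  // leaves caching to the caller
```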





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962157
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -322,10 +343,12 @@ class LogisticRegression @Since("1.2.0") (
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, 
$(tol))
 }
 
-val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 
else numFeatures
+val userSuppliedCoefficients = validateWeights(optInitialCoefficients, 
numFeaturesWithIntercept)
--- End diff --

Let's handle it through `setInitialModel` and open another PR to make it 
public. Thanks.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962716
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1 && useFeatureScaling) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+if (userSuppliedWeights) {
+  val initialWeightsWithIntercept = if (addIntercept) {
+appendBias(initialWeights)
+  } else {
+initialWeights
+  }
+  lr.setInitialWeights(initialWeightsWithIntercept)
+}
+lr.setFitIntercept(addIntercept)
+lr.setMaxIter(optimizer.getNumIterations())
+lr.setTol(optimizer.getConvergenceTol())
+// Convert our input into a DataFrame
+val sqlContext = new SQLContext(input.context)
+import sqlContext.implicits._
+val df = input.toDF()
+// Determine if we should cache the DF
+val handlePersistence = input.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  df.persist(StorageLevel.MEMORY_AND_DISK)
+}
--- End diff --

Why do we need to do this? I thought those checks are already in the ML code.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49962722
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * If using ml implementation, uses ml code to generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used calls the ml implementation, to avoid
+   * applying a regularization penalty to the intercept, otherwise
+   * defaults to the mllib implementation. If more than two classes
+   * or feature scaling is disabled, always uses mllib implementation.
+   * Uses user provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's Logisitic regression only supports binary classifcation 
currently.
+if (numOfLinearPredictor == 1 && useFeatureScaling) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+if (userSuppliedWeights) {
+  val initialWeightsWithIntercept = if (addIntercept) {
+appendBias(initialWeights)
+  } else {
+initialWeights
+  }
+  lr.setInitialWeights(initialWeightsWithIntercept)
+}
+lr.setFitIntercept(addIntercept)
+lr.setMaxIter(optimizer.getNumIterations())
+lr.setTol(optimizer.getConvergenceTol())
+// Convert our input into a DataFrame
+val sqlContext = new SQLContext(input.context)
+import sqlContext.implicits._
+val df = input.toDF()
+// Determine if we should cache the DF
+val handlePersistence = input.getStorageLevel == StorageLevel.NONE
+if (handlePersistence) {
+  df.persist(StorageLevel.MEMORY_AND_DISK)
+}
+// Train our model
+val mlLogisticRegresionModel = lr.train(df)
+// unpersist if we persisted
+if (handlePersistence) {
+  df.unpersist()
+}
--- End diff --

ditto?





[GitHub] spark pull request: [SPARK-11922][PYSPARK][ML] Python api for ml.f...

2016-01-17 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10085#issuecomment-172429593
  
LGTM as well! Thanks.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49961964
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,10 +247,31 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialCoefficients: Option[Vector] = None
+  /** @group setParam */
+  private[spark] def setInitialWeights(value: Vector): this.type = {
+this.optInitialCoefficients = Some(value)
+this
+  }
+
+  /**
+   * Validate the initial weights, return an Option, if not the expected 
size return None
+   * and log a warning.
+   */
+  private def validateWeights(vectorOpt: Option[Vector], numFeatures: 
Int): Option[Vector] = {
+vectorOpt.flatMap(vec =>
+  if (vec.size == numFeatures) {
+Some(vec)
+  } else {
+logWarning(
+  s"""Initial weights provided (${vec})did not match the expected 
size ${numFeatures}""")
+None
+  })
+  }
+
+  override protected[spark] def train(dataset: DataFrame): 
LogisticRegressionModel = {
 val w = if ($(weightCol).isEmpty) lit(1.0) else col($(weightCol))
-val instances: RDD[Instance] = dataset.select(col($(labelCol)), w, 
col($(featuresCol))).map {
+val instances = dataset.select(col($(labelCol)), w, 
col($(featuresCol))).map {
--- End diff --

Why is this line changed?





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49964184
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -322,10 +329,11 @@ class LogisticRegression @Since("1.2.0") (
   new BreezeOWLQN[Int, BDV[Double]]($(maxIter), 10, regParamL1Fun, 
$(tol))
 }
 
-val initialCoefficientsWithIntercept =
-  Vectors.zeros(if ($(fitIntercept)) numFeatures + 1 else numFeatures)
+val numFeaturesWithIntercept = if ($(fitIntercept)) numFeatures + 1 
else numFeatures
+val initialCoefficientsWithIntercept = 
optInitialCoefficients.getOrElse(
+  Vectors.zeros(numFeaturesWithIntercept))
 
--- End diff --

BTW, if `optInitialModel.isDefined && 
optInitialModel.get.coefficients.size != numFeatures`, let's log a warning.






[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49961836
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -374,4 +383,82 @@ class LogisticRegressionWithLBFGS
   new LogisticRegressionModel(weights, intercept, numFeatures, 
numOfLinearPredictor + 1)
 }
   }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used, calls the ml implementation to avoid
+   * applying a regularization penalty to the intercept; otherwise
+   * defaults to the mllib implementation. If there are more than two
+   * classes or feature scaling is disabled, always uses the mllib
+   * implementation. If using the ml implementation, uses ml code to
+   * generate initial weights.
+   */
+  override def run(input: RDD[LabeledPoint]): LogisticRegressionModel = {
+run(input, generateInitialWeights(input), false)
+  }
+
+  /**
+   * Run the algorithm with the configured parameters on an input RDD
+   * of LabeledPoint entries starting from the initial weights provided.
+   * If a known updater is used, calls the ml implementation to avoid
+   * applying a regularization penalty to the intercept; otherwise
+   * defaults to the mllib implementation. If there are more than two
+   * classes or feature scaling is disabled, always uses the mllib
+   * implementation. Uses the user-provided weights.
+   */
+  override def run(input: RDD[LabeledPoint], initialWeights: Vector): 
LogisticRegressionModel = {
+run(input, initialWeights, true)
+  }
+
+  private def run(input: RDD[LabeledPoint], initialWeights: Vector, 
userSuppliedWeights: Boolean):
+  LogisticRegressionModel = {
+// ml's logistic regression currently only supports binary classification.
+if (numOfLinearPredictor == 1 && useFeatureScaling) {
+  def runWithMlLogisitcRegression(elasticNetParam: Double) = {
+// Prepare the ml LogisticRegression based on our settings
+val lr = new 
org.apache.spark.ml.classification.LogisticRegression()
+lr.setRegParam(optimizer.getRegParam())
+lr.setElasticNetParam(elasticNetParam)
+if (userSuppliedWeights) {
+  val initialWeightsWithIntercept = if (addIntercept) {
+appendBias(initialWeights)
+  } else {
+initialWeights
+  }
+  lr.setInitialWeights(initialWeightsWithIntercept)
--- End diff --

Here will be

```scala
lr.setInitialModel(new 
org.apache.spark.ml.classification.LogisticRegressionModel(uid, initialWeights, 
1))
```





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955619
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the 
same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
--- End diff --

Add an extra blank line.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10743#issuecomment-172401612
  
LGTM except a couple of minor issues. Thanks.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955940
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the 
same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val allZeroModel = lr
+  .setLabelCol("zeroLabel")
+  .fit(sameLabels)
+assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+val allOneModel = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

Also, check that `objectiveHistory` has length zero.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955755
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
--- End diff --

"All labels are zero"





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955925
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -276,113 +276,123 @@ class LogisticRegression @Since("1.2.0") (
 val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
 
-if (numInvalid != 0) {
-  val msg = s"Classification labels should be in {0 to ${numClasses - 
1} " +
-s"Found $numInvalid invalid labels."
-  logError(msg)
-  throw new SparkException(msg)
-}
-
-if (numClasses > 2) {
-  val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
-s"binary classification. Found $numClasses in the input dataset."
-  logError(msg)
-  throw new SparkException(msg)
-}
+val (coefficients, intercept, objectiveHistory) = {
+  if (numInvalid != 0) {
+val msg = s"Classification labels should be in {0 to ${numClasses 
- 1} " +
+  s"Found $numInvalid invalid labels."
+logError(msg)
+throw new SparkException(msg)
+  }
 
-val featuresMean = summarizer.mean.toArray
-val featuresStd = summarizer.variance.toArray.map(math.sqrt)
+  if (numClasses > 2) {
+val msg = s"Currently, LogisticRegression with ElasticNet in ML 
package only supports " +
+  s"binary classification. Found $numClasses in the input dataset."
+logError(msg)
+throw new SparkException(msg)
+  } else if ($(fitIntercept) && numClasses == 2 && histogram(0) == 
0.0) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be positive infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, 
Array.empty[Double])
+  } else if ($(fitIntercept) && numClasses == 1) {
+logWarning(s"All labels are one and fitIntercept=true, so the 
coefficients will be " +
+  s"zeros and the intercept will be negative infinity; as a 
result, " +
+  s"training is not needed.")
+(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, 
Array.empty[Double])
+  } else {
+val featuresMean = summarizer.mean.toArray
+val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
-val regParamL1 = $(elasticNetParam) * $(regParam)
-val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
+val regParamL1 = $(elasticNetParam) * $(regParam)
+val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 
-val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept), $(standardization),
-  featuresStd, featuresMean, regParamL2)
+val costFun = new LogisticCostFun(instances, numClasses, 
$(fitIntercept),
+  $(standardization), featuresStd, featuresMean, regParamL2)
 
-val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 0.0) {
-  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
-} else {
-  def regParamL1Fun = (index: Int) => {
-// Remove the L1 penalization on the intercept
-if (index == numFeatures) {
-  0.0
+val optimizer = if ($(elasticNetParam) == 0.0 || $(regParam) == 
0.0) {
+  new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 } else {
-  if ($(standardization)) {
-regParamL1
-  } else {
-// If `standardization` is false, we still standardize the data
-// to improve the rate of convergence; as a result, we have to
-// perform this reverse standardization by penalizing each 
component
-// differently to get effectively the same objective function 
when
-// the training dataset is not standardized.
-if (featuresStd(index) != 0.0) regParamL1 / featuresStd(index) 
else 0.0
+  def regParamL1Fun = (index: Int) => {
+// Remove the L1 penalization on the intercept
+if (index == numFeatures) {
+  0.0
+} else {
+  if ($(standardization)) {
+regParamL1
+  } else {
+// If `standardization` is false, we still standardize the 
data
+// to improve the rate of convergence; as a result, we 
h

[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49955604
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,27 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with fitIntercept=true and all labels the 
same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val allZeroModel = lr
+  .setLabelCol("zeroLabel")
+  .fit(sameLabels)
+assert(allZeroModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allZeroModel.intercept === Double.NegativeInfinity)
+
+val allOneModel = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+assert(allOneModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(allOneModel.intercept === Double.PositiveInfinity)
+  }
+
--- End diff --

Can you add one test where all labels are the same but `fitIntercept=false`, 
to avoid the issue we had in LiR? Thanks.





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49959401
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialWeights: Option[Vector] = None
+  /** @group setParam */
+  private[spark] def setInitialWeights(value: Vector): this.type = {
+this.optInitialWeights = Some(value)
+this
+  }
+
+  /** Validate the initial weights, return an Option, if not the expected 
size return None
--- End diff --

Move `Validate ...` onto a new line (start the scaladoc text on the line after `/**`).





[GitHub] spark pull request: [SPARK-7780][MLLIB] intercept in logisticregre...

2016-01-17 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10788#discussion_r49959559
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -247,10 +247,30 @@ class LogisticRegression @Since("1.2.0") (
   @Since("1.5.0")
   override def getThresholds: Array[Double] = super.getThresholds
 
-  override protected def train(dataset: DataFrame): 
LogisticRegressionModel = {
-// Extract columns from data.  If dataset is persisted, do not persist 
oldDataset.
+  private var optInitialWeights: Option[Vector] = None
--- End diff --

We started to change our terminology from `weights` to `coefficients` 
after the implementation of weighted LoR. Please change it to 
`optInitialCoefficients`.





[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

2016-01-14 Thread dbtsai
Github user dbtsai commented on the pull request:

https://github.com/apache/spark/pull/10274#issuecomment-171876447
  
LGTM. Wait for one extra test. Thanks.





[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

2016-01-14 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10274#discussion_r49822543
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala 
---
@@ -74,6 +89,35 @@ class WeightedLeastSquaresSuite extends SparkFunSuite 
with MLlibTestSparkContext
 }
   }
 
+  test("WLS against lm when label is constant") {
+/*
+   R code:
+   # here b is constant
+   df <- as.data.frame(cbind(A, b))
+   for (formula in c(b ~ . -1, b ~ .)) {
+ model <- lm(formula, data=df, weights=w)
+ print(as.vector(coef(model)))
+   }
+
+  [1] -9.221298  3.394343
+  [1] 17  0  0
+*/
+
+val expected = Seq(
+  Vectors.dense(0.0, -9.221298, 3.394343),
+  Vectors.dense(17.0, 0.0, 0.0))
+
+var idx = 0
+for (fitIntercept <- Seq(false, true)) {
+  val wls = new WeightedLeastSquares(
+fitIntercept, regParam = 0.0, standardizeFeatures = false, 
standardizeLabel = true)
+.fit(instancesConstLabel)
--- End diff --

Awesome! Great that you see the same result! For the normal equation, it would 
be nice to have it in R so we can keep the comments consistent. I did implement it 
once when I implemented the LBFGS version; let me try to find it.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49652270
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val model = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+
+assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(model.intercept === Double.PositiveInfinity)
--- End diff --

BTW, this bug should not happen when all the labels are one, since the 
histogram should still have size two.





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49649843
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -339,9 +339,11 @@ class LogisticRegression @Since("1.2.0") (
  b = \log{P(1) / P(0)} = \log{count_1 / count_0}
  }}}
*/
-  initialCoefficientsWithIntercept.toArray(numFeatures)
-= math.log(histogram(1) / histogram(0))
-}
+   if (histogram.length >= 2) { // check to make sure indexing into 
histogram(1) is safe
+ initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
+   histogram(1) / histogram(0))
--- End diff --

In this case, the whole training step can be skipped. Currently, we only 
support binary LoR, so the maximum `histogram.length` is two. In LiR, when 
`yStd == 0.0` the model is returned immediately without training, see 
https://github.com/feynmanliang/spark/blob/SPARK-12804/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L226

We can do a similar thing here, like:

```scala
if (histogram.length == 2) {
  if (histogram(0) == 0.0) {
    return new LogisticRegressionModel(uid, Vectors.sparse(numFeatures, Seq()),
      Double.PositiveInfinity)
  } else {
    initialCoefficientsWithIntercept.toArray(numFeatures) = math.log(
      histogram(1) / histogram(0))
  }
} else if (histogram.length == 1) {
  return new LogisticRegressionModel(uid, Vectors.sparse(numFeatures, Seq()),
    Double.NegativeInfinity)
} else {
  throw new SparkException(s"Unexpected histogram length: ${histogram.length}")
}
```
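
The `math.log(histogram(1) / histogram(0))` initialization above is just the 
log-odds of the positive class. A standalone plain-Scala illustration (no Spark 
dependencies; the object and method names here are mine, not Spark's):

```scala
object LogOddsIntercept {
  /** Log-odds intercept b = log(count1 / count0), as in the snippet above. */
  def initialIntercept(histogram: Array[Double]): Double = {
    require(histogram.length == 2 && histogram(0) > 0.0 && histogram(1) > 0.0,
      "log-odds initialization needs positive counts for both classes")
    math.log(histogram(1) / histogram(0))
  }

  def main(args: Array[String]): Unit = {
    // 70 negative labels, 30 positive labels.
    val b = initialIntercept(Array(70.0, 30.0))
    assert(math.abs(b - math.log(30.0 / 70.0)) < 1e-12)
    // The prior probability P(y = 1) = sigmoid(b) recovers 30 / 100 = 0.3.
    val p = 1.0 / (1.0 + math.exp(-b))
    assert(math.abs(p - 0.3) < 1e-12)
    println("ok")
  }
}
```

This is why the all-ones (or all-zeros) histogram degenerates to an intercept of 
positive (or negative) infinity: the log-odds diverges when one count is zero.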





[GitHub] spark pull request: [SPARK-12804][ML] Fix LogisticRegression with ...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10743#discussion_r49650452
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -883,6 +884,22 @@ class LogisticRegressionSuite
 assert(model1a0.intercept ~== model1b.intercept absTol 1E-3)
   }
 
+  test("logistic regression with all labels the same") {
+val lr = new LogisticRegression()
+  .setFitIntercept(true)
+  .setMaxIter(3)
+val sameLabels = dataset
+  .withColumn("zeroLabel", lit(0.0))
+  .withColumn("oneLabel", lit(1.0))
+
+val model = lr
+  .setLabelCol("oneLabel")
+  .fit(sameLabels)
+
+assert(model.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
+assert(model.intercept === Double.PositiveInfinity)
--- End diff --

Can you add another test showing that an all-`zeroLabel` dataset returns an 
intercept of `Double.NegativeInfinity`?

Thanks.





[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10274#discussion_r49667844
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares(
   if (standardizeFeatures) {
 lambda *= aVar(j - 2)
   }
-  if (standardizeLabel) {
-// TODO: handle the case when bStd = 0
+  if (standardizeLabel && bStd != 0) {
--- End diff --

Here is the exercise,

```scala
  test("WLS against lm") {
/*
   R code:

   df <- as.data.frame(cbind(A, b))
   for (formula in c(b ~ . -1, b ~ .)) {
 model <- lm(formula, data=df, weights=w)
 print(as.vector(coef(model)))
   }

   [1] -3.727121  3.009983
   [1] 18.08  6.08 -0.60
 */

val expected = Seq(
  Vectors.dense(0.0, -3.727121, 3.009983),
  Vectors.dense(18.08, 6.08, -0.60))

var idx = 0
for (fitIntercept <- Seq(false, true)) {
  for (standardization <- Seq(false, true)) {
val wls = new WeightedLeastSquares(
  fitIntercept, regParam = 0.0, standardizeFeatures = 
standardization,
  standardizeLabel = standardization).fit(instances)
val actual = Vectors.dense(wls.intercept, wls.coefficients(0), 
wls.coefficients(1))
assert(actual ~== expected(idx) absTol 1e-4)
  }
  idx += 1
}
  }
```






[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

2016-01-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10274#discussion_r49654189
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares(
   if (standardizeFeatures) {
 lambda *= aVar(j - 2)
   }
-  if (standardizeLabel) {
-// TODO: handle the case when bStd = 0
+  if (standardizeLabel && bStd != 0) {
--- End diff --

It's interesting to see that when regularization is zero, standardization of 
the labels or features does not change the solution of linear regression, which 
you can verify by experiment. 

As a result, the only case where the model will be non-interpretable is when 
`yStd` is zero and `regParam` is non-zero. You can have a `require` there 
with a proper message. I think logging a warning would be very easy for 
users to ignore. Thanks.
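
A minimal sketch of such a `require`, under the assumption that the relevant 
values are named `bStd`, `regParam`, and `standardizeLabel` as in this thread 
(this illustrates the suggestion; it is not the actual patch):

```scala
object LabelStdGuard {
  /** Fails fast when the problem is ill-defined: zero label std with non-zero regularization. */
  def checkLabelStd(bStd: Double, regParam: Double, standardizeLabel: Boolean): Unit =
    require(!(standardizeLabel && bStd == 0.0 && regParam != 0.0),
      "The standard deviation of the label is zero, so with a non-zero regParam the " +
        "standardized objective is ill-defined. Set regParam to 0.0 or check the label column.")

  def main(args: Array[String]): Unit = {
    checkLabelStd(bStd = 2.0, regParam = 0.1, standardizeLabel = true) // fine
    checkLabelStd(bStd = 0.0, regParam = 0.0, standardizeLabel = true) // fine: no regularization
    // The ill-defined combination triggers an IllegalArgumentException.
    val threw =
      try { checkLabelStd(bStd = 0.0, regParam = 0.1, standardizeLabel = true); false }
      catch { case _: IllegalArgumentException => true }
    assert(threw)
    println("ok")
  }
}
```

Unlike a logged warning, the `require` surfaces the problem at fit time with an 
actionable message, which matches the concern that warnings are easy to ignore.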





[GitHub] spark pull request: [SPARK-12230][ML] WeightedLeastSquares.fit() s...

2016-01-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10274#discussion_r49394419
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala ---
@@ -94,8 +110,7 @@ private[ml] class WeightedLeastSquares(
   if (standardizeFeatures) {
 lambda *= aVar(j - 2)
   }
-  if (standardizeLabel) {
-// TODO: handle the case when bStd = 0
+  if (standardizeLabel && bStd != 0) {
--- End diff --

Thanks. As you said, we expect non-zero coefficients in this case, so we don't 
have to match glmnet. 

However, we may want to throw an exception when `standardizeLabel` is true and 
`yStd` is zero, since the problem is not well defined. 

Thanks. 





[GitHub] spark pull request: [Spark-12732][ML] bug fix in linear regression...

2016-01-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/10702#discussion_r49399116
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---
@@ -219,33 +219,41 @@ class LinearRegression @Since("1.3.0") 
(@Since("1.3.0") override val uid: String
 }
 
 val yMean = ySummarizer.mean(0)
-val yStd = math.sqrt(ySummarizer.variance(0))
+var yStd = math.sqrt(ySummarizer.variance(0))
--- End diff --

Also, when `rawYStd == 0.0`, `standardization == true` and `regParam != 
0.0`, the problem will be ill-defined. We may want to throw an exception. 




