[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/14834


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79518247
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,51 @@ object LogisticRegression extends 
DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
+private val isMultinomial: Boolean)
   extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
   with LogisticRegressionParams with MLWritable {
 
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79517559
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,51 @@ object LogisticRegression extends 
DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
--- End diff --

Okay. Let's merge it as is for now. I want to make a change so that we don't train on classes that are unseen. I'll address these together. Thanks.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79515456
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -67,11 +105,15 @@ class LogisticRegressionSuite
 binaryDataset.rdd.map { case Row(label: Double, features: Vector) =>
   label + "," + features.toArray.mkString(",")
 
}.repartition(1).saveAsTextFile("target/tmp/LogisticRegressionSuite/binaryDataset")
+multinomialDataset.rdd.map { case Row(label: Double, features: Vector) =>
+  label + "," + features.toArray.mkString(",")
+}.repartition(1).saveAsTextFile("target/tmp/LogisticRegressionSuite/multinomialDataset")
   }
 
   test("params") {
 ParamsSuite.checkParams(new LogisticRegression)
-val model = new LogisticRegressionModel("logReg", Vectors.dense(0.0), 0.0)
+val model = new LogisticRegressionModel("logReg",
+  new DenseMatrix(1, 1, Array(0.0)), Vectors.dense(0.0), 2, isMultinomial = false)
--- End diff --

Done, here and below.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79515439
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -22,28 +22,49 @@ import scala.language.existentials
 import scala.util.Random
 import scala.util.control.Breaks._
 
-import org.apache.spark.SparkFunSuite
+import org.apache.spark.{SparkException, SparkFunSuite}
+import org.apache.spark.ml.attribute.NominalAttribute
 import org.apache.spark.ml.classification.LogisticRegressionSuite._
-import org.apache.spark.ml.feature.{Instance, LabeledPoint}
-import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.feature.LabeledPoint
+import org.apache.spark.ml.linalg._
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79515424
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1103,6 +1382,7 @@ class BinaryLogisticRegressionSummary 
private[classification] (
  *$$
  * 
  *
+ *
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79515434
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -430,8 +431,9 @@ class LogisticRegressionWithLBFGS
 lr.setStandardization(useFeatureScaling)
 if (userSuppliedWeights) {
   val uid = Identifiable.randomUID("logreg-static")
-  lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
-uid, initialWeights.asML, 1.0))
+  lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(uid,
+new DenseMatrix(1, initialWeights.size, initialWeights.toArray),
+Vectors.dense(1.0).asML, 2, false))
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79515057
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -66,11 +69,37 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
*
* @group setParam
*/
+  // TODO: Implement SPARK-11543?
   def setThreshold(value: Double): this.type = {
 if (isSet(thresholds)) clear(thresholds)
 set(threshold, value)
   }
 
+
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79515040
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,51 @@ object LogisticRegression extends 
DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
--- End diff --

Actually, that won't work under the current edge-case behavior. When the labels are all `0.0`, the coefficient matrix will have only one row regardless of whether the model is multinomial or binomial. We could potentially change this behavior, though, if we always assume there will be at minimum two classes.
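A minimal sketch in plain Scala (stand-in shapes only, not the actual Spark classes) of why the class count cannot be recovered from the matrix shape in this edge case:

```scala
// Hypothetical stand-in for the coefficient matrix shape. When every
// training label is 0.0, fitting produces a single-row matrix for BOTH
// the binomial and the multinomial family, so numRows alone cannot
// recover the true number of distinct labels seen in training.
case class CoefShape(numRows: Int, isMultinomial: Boolean)

def inferredNumClasses(shape: CoefShape): Int =
  if (shape.isMultinomial) shape.numRows else 2

// Both cases below yield a 1-row matrix, yet the inference differs:
val binomialAllZeroLabels    = CoefShape(numRows = 1, isMultinomial = false)
val multinomialAllZeroLabels = CoefShape(numRows = 1, isMultinomial = true)
// inferredNumClasses reports 2 and 1 respectively, even though the
// training data contained only one label in both cases.
```

This is the degenerate case that would need an explicit "assume at least two classes" convention before `numClasses` could be dropped from the constructor.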





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79509186
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -676,39 +941,53 @@ object LogisticRegressionModel extends 
MLReadable[LogisticRegressionModel] {
 private case class Data(
 numClasses: Int,
 numFeatures: Int,
-intercept: Double,
-coefficients: Vector)
+interceptVector: Vector,
+coefficientMatrix: Matrix,
+isMultinomial: Boolean)
 
 override protected def saveImpl(path: String): Unit = {
   // Save metadata and Params
   DefaultParamsWriter.saveMetadata(instance, path, sc)
   // Save model data: numClasses, numFeatures, intercept, coefficients
-  val data = Data(instance.numClasses, instance.numFeatures, instance.intercept,
-instance.coefficients)
+  val data = Data(instance.numClasses, instance.numFeatures, instance.interceptVector,
+instance.coefficientMatrix, instance.isMultinomial)
   val dataPath = new Path(path, "data").toString
   
sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
 }
   }
 
-  private class LogisticRegressionModelReader
-extends MLReader[LogisticRegressionModel] {
+  private class LogisticRegressionModelReader extends 
MLReader[LogisticRegressionModel] {
 
 /** Checked against metadata when loading model */
 private val className = classOf[LogisticRegressionModel].getName
 
 override def load(path: String): LogisticRegressionModel = {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
  val (major, minor) = VersionUtils.majorMinorVersion(metadata.sparkVersion)
 
   val dataPath = new Path(path, "data").toString
   val data = sparkSession.read.format("parquet").load(dataPath)
 
-  // We will need numClasses, numFeatures in the future for multinomial logreg support.
-  // TODO: remove numClasses and numFeatures fields?
-  val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) =
-MLUtils.convertVectorColumnsToML(data, "coefficients")
-  .select("numClasses", "numFeatures", "intercept", "coefficients")
-  .head()
-  val model = new LogisticRegressionModel(metadata.uid, coefficients, intercept)
+  val model = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)) {
+// 2.0 and before
+val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) =
+  MLUtils.convertVectorColumnsToML(data, "coefficients")
+.select("numClasses", "numFeatures", "intercept", "coefficients")
+.head()
+val coefficientMatrix =
+  new DenseMatrix(1, coefficients.size, coefficients.toArray, isTransposed = true)
+val interceptVector = Vectors.dense(intercept)
+new LogisticRegressionModel(metadata.uid, coefficientMatrix,
+  interceptVector, numClasses, isMultinomial = false)
+  } else {
+// 2.1+
+val Row(numClasses: Int, numFeatures: Int, interceptVector: Vector,
+coefficientMatrix: Matrix, isMultinomial: Boolean) = data
+  .select("numClasses", "numFeatures", "interceptVector", "coefficientMatrix",
+"isMultinomial").head()
--- End diff --

If we decide to get rid of `numClasses` to reduce the number of parameters in the constructor, add a `require` here.
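A hedged sketch of the suggested load-time check, in plain Scala (the names and the derivation rule are assumptions, not the actual Spark reader):

```scala
// Hypothetical consistency check at model-load time: if numClasses were
// dropped from the constructor, the persisted value could still be
// verified against the shape of the loaded coefficient matrix.
def checkLoaded(numClasses: Int, coefNumRows: Int, isMultinomial: Boolean): Unit = {
  // Assumed convention: one row per class when multinomial, a single
  // shared row (implying two classes) when binomial.
  val derived = if (isMultinomial) coefNumRows else 2
  require(numClasses == derived,
    s"Persisted numClasses ($numClasses) does not match the value " +
    s"derived from coefficientMatrix ($derived).")
}
```

The point of the `require` is to catch corrupted or hand-edited model files at load time rather than at prediction time.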





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79507627
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,51 @@ object LogisticRegression extends 
DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
+private val isMultinomial: Boolean)
--- End diff --

Actually, `isMultinomial` can be determined from `coefficientMatrix.numRows` as well.
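A one-line sketch of that observation in plain Scala (shapes assumed, not the Spark API):

```scala
// A binomial model stores a single row of coefficients, while a
// multinomial model stores one row per class, so more than one row
// implies the multinomial case. Note the converse fails in the
// degenerate all-one-label case discussed in this thread, which is
// one reason the flag ends up being passed explicitly anyway.
def isMultinomialFromShape(coefNumRows: Int): Boolean = coefNumRows > 1
```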





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79510132
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/OneVsRestSuite.scala ---
@@ -60,7 +60,8 @@ class OneVsRestSuite extends SparkFunSuite with 
MLlibTestSparkContext with Defau
 
   test("params") {
 ParamsSuite.checkParams(new OneVsRest)
-val lrModel = new LogisticRegressionModel("lr", Vectors.dense(0.0), 0.0)
+val lrModel = new LogisticRegressionModel("logReg",
+  new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
--- End diff --

ditto. revert this.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79510068
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala ---
@@ -244,7 +244,8 @@ class CrossValidatorSuite
   test("read/write: CrossValidatorModel") {
 val lr = new LogisticRegression()
   .setThreshold(0.6)
-val lrModel = new LogisticRegressionModel(lr.uid, Vectors.dense(1.0, 2.0), 1.2)
+val lrModel = new LogisticRegressionModel(lr.uid,
+  new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
--- End diff --

ditto. revert this.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79509747
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -67,11 +105,15 @@ class LogisticRegressionSuite
 binaryDataset.rdd.map { case Row(label: Double, features: Vector) =>
   label + "," + features.toArray.mkString(",")
 }.repartition(1).saveAsTextFile("target/tmp/LogisticRegressionSuite/binaryDataset")
+multinomialDataset.rdd.map { case Row(label: Double, features: Vector) =>
+  label + "," + features.toArray.mkString(",")
+}.repartition(1).saveAsTextFile("target/tmp/LogisticRegressionSuite/multinomialDataset")
   }
 
   test("params") {
 ParamsSuite.checkParams(new LogisticRegression)
-val model = new LogisticRegressionModel("logReg", Vectors.dense(0.0), 0.0)
+val model = new LogisticRegressionModel("logReg",
+  new DenseMatrix(1, 1, Array(0.0)), Vectors.dense(0.0), 2, isMultinomial = false)
--- End diff --

If we have old constructor, revert this. 





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79510095
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala 
---
@@ -133,7 +133,8 @@ class TrainValidationSplitSuite
   test("read/write: TrainValidationSplitModel") {
 val lr = new LogisticRegression()
   .setThreshold(0.6)
-val lrModel = new LogisticRegressionModel(lr.uid, Vectors.dense(1.0, 2.0), 1.2)
+val lrModel = new LogisticRegressionModel(lr.uid,
+  new DenseMatrix(1, 1, Array(0.0), isTransposed = true), Vectors.dense(0.0), 2, false)
--- End diff --

ditto. revert this.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79509600
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala
 ---
@@ -22,28 +22,49 @@ import scala.language.existentials
 import scala.util.Random
 import scala.util.control.Breaks._
 
-import org.apache.spark.SparkFunSuite
+import org.apache.spark.{SparkException, SparkFunSuite}
+import org.apache.spark.ml.attribute.NominalAttribute
 import org.apache.spark.ml.classification.LogisticRegressionSuite._
-import org.apache.spark.ml.feature.{Instance, LabeledPoint}
-import org.apache.spark.ml.linalg.{Vector, Vectors}
+import org.apache.spark.ml.feature.LabeledPoint
+import org.apache.spark.ml.linalg._
--- End diff --

Avoid wildcard imports (`import _`) if possible.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79507486
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,51 @@ object LogisticRegression extends 
DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
--- End diff --

How about we make `numClasses` a function determined by `isMultinomial` and `coefficientMatrix.numRows`? That way we can remove one parameter from the constructor.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79508998
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -1103,6 +1382,7 @@ class BinaryLogisticRegressionSummary 
private[classification] (
  *$$
  * 
  *
+ *
--- End diff --

remove extra line





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79506294
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -66,11 +69,37 @@ private[classification] trait LogisticRegressionParams 
extends ProbabilisticClas
*
* @group setParam
*/
+  // TODO: Implement SPARK-11543?
   def setThreshold(value: Double): this.type = {
 if (isSet(thresholds)) clear(thresholds)
 set(threshold, value)
   }
 
+
--- End diff --

remove this extra new line.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79509539
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -430,8 +431,9 @@ class LogisticRegressionWithLBFGS
 lr.setStandardization(useFeatureScaling)
 if (userSuppliedWeights) {
   val uid = Identifiable.randomUID("logreg-static")
-  lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(
-uid, initialWeights.asML, 1.0))
+  lr.setInitialModel(new org.apache.spark.ml.classification.LogisticRegressionModel(uid,
+new DenseMatrix(1, initialWeights.size, initialWeights.toArray),
+Vectors.dense(1.0).asML, 2, false))
--- End diff --

Maybe we could have another constructor taking the original parameters. 
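A sketch of such a convenience constructor in plain Scala, using a hypothetical class that only mirrors the shapes involved (not the actual `LogisticRegressionModel` signature):

```scala
// Hypothetical model class illustrating a secondary constructor that
// accepts the original binary-only parameters (coefficients, intercept)
// and converts them into the unified (matrix, vector) representation.
class Model(val coefficientMatrix: Array[Array[Double]],
            val interceptVector: Array[Double],
            val numClasses: Int,
            val isMultinomial: Boolean) {

  // Secondary constructor: wrap the single coefficient row and scalar
  // intercept, defaulting to a binomial model with two classes.
  def this(coefficients: Array[Double], intercept: Double) =
    this(Array(coefficients), Array(intercept), numClasses = 2, isMultinomial = false)
}
```

Keeping the old parameter list as a secondary constructor would let call sites like the one in `LogisticRegressionWithLBFGS` (and several test suites in this thread) stay unchanged.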





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-19 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r79507775
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,51 @@ object LogisticRegression extends 
DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
+private val isMultinomial: Boolean)
   extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
   with LogisticRegressionParams with MLWritable {
 
--- End diff --

Can we have a `require(coefficientMatrix.numRows == interceptVector.length)` here?
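A minimal sketch of the requested invariant in plain Scala (stand-in parameters rather than the Spark types):

```scala
// Hypothetical shape check for the model constructor: each class row of
// coefficients must pair with exactly one intercept entry, so the row
// count of the coefficient matrix must equal the intercept vector length.
def validateShapes(coefNumRows: Int, interceptLength: Int): Unit =
  require(coefNumRows == interceptLength,
    s"coefficientMatrix.numRows ($coefNumRows) must equal " +
    s"interceptVector.length ($interceptLength)")
```

Failing fast in the constructor turns a malformed (matrix, vector) pair into an immediate error instead of a silent mis-prediction later.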





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-14 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78774078
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -676,39 +936,54 @@ object LogisticRegressionModel extends 
MLReadable[LogisticRegressionModel] {
 private case class Data(
 numClasses: Int,
 numFeatures: Int,
-intercept: Double,
-coefficients: Vector)
+interceptVector: Vector,
+coefficientMatrix: Matrix,
+isMultinomial: Boolean)
 
 override protected def saveImpl(path: String): Unit = {
   // Save metadata and Params
   DefaultParamsWriter.saveMetadata(instance, path, sc)
   // Save model data: numClasses, numFeatures, intercept, coefficients
-  val data = Data(instance.numClasses, instance.numFeatures, instance.intercept,
-instance.coefficients)
+  val data = Data(instance.numClasses, instance.numFeatures, instance.interceptVector,
+instance.coefficientMatrix, instance.isMultinomial)
   val dataPath = new Path(path, "data").toString
   
sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
 }
   }
 
-  private class LogisticRegressionModelReader
-extends MLReader[LogisticRegressionModel] {
-  private class LogisticRegressionModelReader
-extends MLReader[LogisticRegressionModel] {
+  private class LogisticRegressionModelReader extends MLReader[LogisticRegressionModel] {
 
 /** Checked against metadata when loading model */
 private val className = classOf[LogisticRegressionModel].getName
 
 override def load(path: String): LogisticRegressionModel = {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+  val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r
+  val versionRegex(major, minor, _) = metadata.sparkVersion
 
   val dataPath = new Path(path, "data").toString
   val data = sparkSession.read.format("parquet").load(dataPath)
 
-  // We will need numClasses, numFeatures in the future for multinomial logreg support.
-  // TODO: remove numClasses and numFeatures fields?
-  val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) =
-MLUtils.convertVectorColumnsToML(data, "coefficients")
-  .select("numClasses", "numFeatures", "intercept", "coefficients")
-  .head()
-  val model = new LogisticRegressionModel(metadata.uid, coefficients, intercept)
+  val model = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)) {
--- End diff --

Will this patch make it into 2.0.1? If so, we'd need to change this to also 
check the "micro" version number. Otherwise, this check should still be valid.
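A sketch of what a micro-aware version gate could look like, in plain Scala (the parsing and the `newFormatSince` parameter are assumptions for illustration, not the actual `VersionUtils` API):

```scala
// Hypothetical legacy-format check: if the new persistence format also
// shipped in a 2.0.x patch release, the comparison needs all three
// version components, not just (major, minor).
def usesLegacyFormat(sparkVersion: String, newFormatSince: (Int, Int, Int)): Boolean = {
  // Split "2.0.1" or "2.1.0-SNAPSHOT" into numeric components.
  val Array(major, minor, micro) =
    sparkVersion.split("[.-]").take(3).map(_.takeWhile(_.isDigit).toInt)
  val (nMaj, nMin, nMic) = newFormatSince
  // Lexicographic comparison of (major, minor, micro).
  (major, minor, micro) match {
    case (m, _, _) if m < nMaj                          => true
    case (m, n, _) if m == nMaj && n < nMin             => true
    case (m, n, p) if m == nMaj && n == nMin && p < nMic => true
    case _                                              => false
  }
}
```

With the patch landing only in 2.1.0, the simpler `(major, minor)` check in the diff remains sufficient.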





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78688637
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -595,55 +831,104 @@ class LogisticRegressionModel private[spark] (
* Predict label for the given feature vector.
* The behavior of this can be adjusted using [[thresholds]].
*/
-  override protected def predict(features: Vector): Double = {
+  override protected def predict(features: Vector): Double = if (isMultinomial) {
+super.predict(features)
--- End diff --

Would you mind elaborating? This call ends up calling 
`predictRaw(features).argmax`, which equates to `margins(features).argmax`. 
What specialized version are you referring to? Thanks.
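For the record, here is a self-contained sketch (plain Scala, not the Spark code) of why the two argmaxes agree: softmax is strictly increasing in each margin and all classes share one normalizer, so the ordering of the margins is preserved:

```scala
// Sketch: argmax over raw margins agrees with argmax over softmax
// probabilities, because exp is monotone and the normalizer is shared.
def softmax(margins: Array[Double]): Array[Double] = {
  val maxMargin = margins.max // subtract the max for numerical stability
  val exps = margins.map(m => math.exp(m - maxMargin))
  val norm = exps.sum
  exps.map(_ / norm)
}

def argmax(xs: Array[Double]): Int = xs.indexOf(xs.max)

val margins = Array(0.3, 2.1, -1.0)
assert(argmax(margins) == argmax(softmax(margins))) // both pick class 1
```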





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-13 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78688210
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,42 @@ object LogisticRegression extends DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
+private val isMultinomial: Boolean)
   extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
   with LogisticRegressionParams with MLWritable {
 
+  @Since("2.0.0")
+  def coefficients: Vector = if (isMultinomial) {
+throw new SparkException("Multinomial models contain a matrix of coefficients, use " +
+  "coefficientMatrix instead.")
+  } else {
+_coefficients
+  }
+
+  // convert to appropriate vector representation without replicating data
+  private lazy val _coefficients: Vector = coefficientMatrix match {
+case dm: DenseMatrix => Vectors.dense(dm.values)
--- End diff --

In that case, `coefficientMatrix` is a 1 x numFeatures dense matrix, so I 
don't believe it makes any difference whether it's row major or column major.
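A quick standalone illustration (no Spark dependency) of why a 1 x n matrix stores the same flat array either way; the `colMajor` helper here exists only for the demonstration:

```scala
// Sketch: for a 1 x n matrix, row-major and column-major storage
// produce the same flat values array; for anything taller they differ.
def colMajor(rows: Int, cols: Int, rowMajor: Array[Double]): Array[Double] =
  Array.tabulate(rows * cols) { k =>
    val (i, j) = (k % rows, k / rows) // column-major index -> (row, col)
    rowMajor(i * cols + j)            // read from the row-major layout
  }

val values = Array(1.0, 2.0, 3.0, 4.0)
assert(colMajor(1, 4, values).sameElements(values))  // 1 x n: layouts coincide
assert(!colMajor(2, 2, values).sameElements(values)) // 2 x 2: they differ
```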





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78674556
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/ProbabilisticClassifier.scala
 ---
@@ -201,11 +201,24 @@ abstract class ProbabilisticClassificationModel[
   probability.argmax
 } else {
   val thresholds: Array[Double] = getThresholds
-  val scaledProbability: Array[Double] =
-probability.toArray.zip(thresholds).map { case (p, t) =>
-  if (t == 0.0) Double.PositiveInfinity else p / t
+  val probabilities = probability.toArray
+  var argMax = 0
+  var max = Double.NegativeInfinity
+  var i = 0
+  while (i < probability.size) {
--- End diff --

val length = probability.size
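For reference, a standalone sketch of the loop with the length hoisted as suggested; the `predictWithThresholds` name is made up for the illustration, and the t == 0.0 handling mirrors the old `scaledProbability` code:

```scala
// Sketch of the allocation-free thresholded prediction being discussed:
// pick the argmax of p(i) / t(i), treating t == 0.0 as +infinity.
def predictWithThresholds(probabilities: Array[Double], thresholds: Array[Double]): Int = {
  val length = probabilities.length // hoisted out of the loop, as suggested
  var argMax = 0
  var max = Double.NegativeInfinity
  var i = 0
  while (i < length) {
    val scaled =
      if (thresholds(i) == 0.0) Double.PositiveInfinity
      else probabilities(i) / thresholds(i)
    if (scaled > max) { max = scaled; argMax = i }
    i += 1
  }
  argMax
}
```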





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r7867
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -676,39 +936,54 @@ object LogisticRegressionModel extends MLReadable[LogisticRegressionModel] {
 private case class Data(
 numClasses: Int,
 numFeatures: Int,
-intercept: Double,
-coefficients: Vector)
+interceptVector: Vector,
+coefficientMatrix: Matrix,
+isMultinomial: Boolean)
 
 override protected def saveImpl(path: String): Unit = {
   // Save metadata and Params
   DefaultParamsWriter.saveMetadata(instance, path, sc)
   // Save model data: numClasses, numFeatures, intercept, coefficients
-  val data = Data(instance.numClasses, instance.numFeatures, instance.intercept,
-instance.coefficients)
+  val data = Data(instance.numClasses, instance.numFeatures, instance.interceptVector,
+instance.coefficientMatrix, instance.isMultinomial)
   val dataPath = new Path(path, "data").toString
   sparkSession.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
 }
   }
 
-  private class LogisticRegressionModelReader
-extends MLReader[LogisticRegressionModel] {
+  private class LogisticRegressionModelReader extends MLReader[LogisticRegressionModel] {
 
 /** Checked against metadata when loading model */
 private val className = classOf[LogisticRegressionModel].getName
 
 override def load(path: String): LogisticRegressionModel = {
   val metadata = DefaultParamsReader.loadMetadata(path, sc, className)
+  val versionRegex = "([0-9]+)\\.([0-9]+)\\.(.+)".r
+  val versionRegex(major, minor, _) = metadata.sparkVersion
 
   val dataPath = new Path(path, "data").toString
   val data = sparkSession.read.format("parquet").load(dataPath)
 
-  // We will need numClasses, numFeatures in the future for multinomial logreg support.
-  // TODO: remove numClasses and numFeatures fields?
-  val Row(numClasses: Int, numFeatures: Int, intercept: Double, coefficients: Vector) =
-MLUtils.convertVectorColumnsToML(data, "coefficients")
-  .select("numClasses", "numFeatures", "intercept", "coefficients")
-  .head()
-  val model = new LogisticRegressionModel(metadata.uid, coefficients, intercept)
+  val model = if (major.toInt < 2 || (major.toInt == 2 && minor.toInt == 0)) {
--- End diff --

How about `2.0.1`?





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78674092
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -595,55 +831,104 @@ class LogisticRegressionModel private[spark] (
* Predict label for the given feature vector.
* The behavior of this can be adjusted using [[thresholds]].
*/
-  override protected def predict(features: Vector): Double = {
+  override protected def predict(features: Vector): Double = if (isMultinomial) {
+super.predict(features)
--- End diff --

Maybe we want to have a specialized version for when `thresholds` is not 
defined?





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-13 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78673689
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -508,11 +680,42 @@ object LogisticRegression extends DefaultParamsReadable[LogisticRegression] {
 @Since("1.4.0")
 class LogisticRegressionModel private[spark] (
 @Since("1.4.0") override val uid: String,
-@Since("2.0.0") val coefficients: Vector,
-@Since("1.3.0") val intercept: Double)
+@Since("2.1.0") val coefficientMatrix: Matrix,
+@Since("2.1.0") val interceptVector: Vector,
+@Since("1.3.0") override val numClasses: Int,
+private val isMultinomial: Boolean)
   extends ProbabilisticClassificationModel[Vector, LogisticRegressionModel]
   with LogisticRegressionParams with MLWritable {
 
+  @Since("2.0.0")
+  def coefficients: Vector = if (isMultinomial) {
+throw new SparkException("Multinomial models contain a matrix of coefficients, use " +
+  "coefficientMatrix instead.")
+  } else {
+_coefficients
+  }
+
+  // convert to appropriate vector representation without replicating data
+  private lazy val _coefficients: Vector = coefficientMatrix match {
+case dm: DenseMatrix => Vectors.dense(dm.values)
--- End diff --

I think you need to check `coefficientMatrix.isTransposed` even if it's dense 
here.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78483816
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -460,33 +577,74 @@ class LogisticRegression @Since("1.2.0") (
as a result, no scaling is needed.
  */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+When no regularization is applied, the coefficients lack identifiability because
+we do not use a pivot class. We can add any constant value to the coefficients and
+get the same likelihood. So here, we choose the mean centered coefficients for
+reproducibility. This method follows the approach in glmnet, described here:
+
+Friedman, et al. "Regularization Paths for Generalized Linear Models via
+  Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)
 
-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
+} else if (interceptsArray.length == 1) {
+  Vectors.dense(interceptsArray)
 } else {
-  (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+  Vectors.sparse(numCoefficientSets, Seq())
 }
+(coefficientMatrix, interceptVector, arrayBuilder.result())
--- End diff --

This looks good to me as the temporary solution. Thanks.
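As a standalone illustration of the identifiability comment quoted in the diff (plain Scala, no Spark): shifting every margin by the same constant leaves the softmax probabilities unchanged, which is why the unregularized multinomial coefficients are only identified up to a constant and can be mean-centered:

```scala
// Sketch: adding any constant to all class margins leaves the softmax
// probabilities unchanged, so the unregularized solution can be mean-centered
// without changing the model's likelihood.
def softmax(margins: Array[Double]): Array[Double] = {
  val exps = margins.map(math.exp)
  val norm = exps.sum
  exps.map(_ / norm)
}

val margins = Array(1.0, -0.5, 2.0)
val shifted = margins.map(_ + 3.7) // arbitrary constant shift
val (p, q) = (softmax(margins), softmax(shifted))
assert(p.zip(q).forall { case (a, b) => math.abs(a - b) < 1e-12 })
```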





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78464458
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -323,32 +382,33 @@ class LogisticRegression @Since("1.2.0") (
 instr.logNumClasses(numClasses)
 instr.logNumFeatures(numFeatures)
 
-val (coefficients, intercept, objectiveHistory) = {
+val (coefficientMatrix, interceptVector, objectiveHistory) = {
   if (numInvalid != 0) {
 val msg = s"Classification labels should be in [0 to ${numClasses - 1}]. " +
   s"Found $numInvalid invalid labels."
 logError(msg)
 throw new SparkException(msg)
   }
 
-  val isConstantLabel = histogram.count(_ != 0) == 1
+  val isConstantLabel = histogram.count(_ != 0.0) == 1
 
-  if (numClasses > 2) {
-val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-  s"binary classification. Found $numClasses in the input dataset. Consider using " +
-  s"MultinomialLogisticRegression instead."
-logError(msg)
-throw new SparkException(msg)
-  } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-  s"zeros and the intercept will be positive infinity; as a result, " +
-  s"training is not needed.")
-(Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-  } else if ($(fitIntercept) && numClasses == 1) {
-logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-  s"zeros and the intercept will be negative infinity; as a result, " +
-  s"training is not needed.")
-(Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+  if ($(fitIntercept) && isConstantLabel) {
+logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+  s"will be zeros. Training is not needed.")
+val constantLabelIndex = Vectors.dense(histogram).argmax
+val coefMatrix = if (numFeatures < numCoefficientSets) {
--- End diff --

Done.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78464517
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -460,33 +577,74 @@ class LogisticRegression @Since("1.2.0") (
as a result, no scaling is needed.
  */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+When no regularization is applied, the coefficients lack identifiability because
+we do not use a pivot class. We can add any constant value to the coefficients and
+get the same likelihood. So here, we choose the mean centered coefficients for
+reproducibility. This method follows the approach in glmnet, described here:
+
+Friedman, et al. "Regularization Paths for Generalized Linear Models via
+  Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)
 
-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
+} else if (interceptsArray.length == 1) {
+  Vectors.dense(interceptsArray)
 } else {
-  (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+  Vectors.sparse(numCoefficientSets, Seq())
 }
+(coefficientMatrix, interceptVector, arrayBuilder.result())
--- End diff --

I implemented this logic, good suggestion. Let me know if that's what you 
had in mind.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78464438
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val histogram = labelSummarizer.histogram
 val numInvalid = labelSummarizer.countInvalid
-val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
+val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+  case Some(n: Int) =>
+require(n >= histogram.length, s"Specified number of classes $n was " +
+  s"less than the number of unique labels ${histogram.length}.")
+n
+  case None => histogram.length
+}
+
+val isBinaryClassification = numClasses == 1 || numClasses == 2
+val isMultinomial = $(family) match {
+  case "binomial" =>
+require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+s"outcome classes but found $numClasses.")
+false
+  case "multinomial" => true
+  case "auto" => !isBinaryClassification
+  case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+}
--- End diff --

Done.
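For readers following along, the family-resolution rule from the quoted diff can be summarized standalone like this (a sketch mirroring the diff, with `resolveIsMultinomial` as a made-up wrapper name):

```scala
// Sketch of the family-resolution rule: "binomial" requires at most 2 classes,
// "multinomial" always trains a matrix-based model, and "auto" picks
// multinomial only when there are more than 2 classes.
def resolveIsMultinomial(family: String, numClasses: Int): Boolean = {
  val isBinaryClassification = numClasses == 1 || numClasses == 2
  family match {
    case "binomial" =>
      require(isBinaryClassification,
        s"Binomial family only supports 1 or 2 outcome classes but found $numClasses.")
      false
    case "multinomial" => true
    case "auto" => !isBinaryClassification
    case other => throw new IllegalArgumentException(s"Unsupported family: $other")
  }
}
```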





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78420053
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -460,33 +577,74 @@ class LogisticRegression @Since("1.2.0") (
as a result, no scaling is needed.
  */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+When no regularization is applied, the coefficients lack identifiability because
+we do not use a pivot class. We can add any constant value to the coefficients and
+get the same likelihood. So here, we choose the mean centered coefficients for
+reproducibility. This method follows the approach in glmnet, described here:
+
+Friedman, et al. "Regularization Paths for Generalized Linear Models via
+  Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)
 
-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
+} else if (interceptsArray.length == 1) {
+  Vectors.dense(interceptsArray)
 } else {
-  (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+  Vectors.sparse(numCoefficientSets, Seq())
 }
+(coefficientMatrix, interceptVector, arrayBuilder.result())
--- End diff --

One way to unblock this without implementing `compress` on matrices would be: for 
binary classification, create a vector and call `compressed` to get back a dense 
or sparse vector. If it's a sparse vector, you can put it into a CSR matrix 
without changing the internal data structure.
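A rough sketch of the kind of dense-vs-sparse break-even check `compressed` performs; the 1.5x constant below is an assumption for illustration, not necessarily Spark's exact rule:

```scala
// Sketch (constants are an assumption, not Spark's exact rule): store the
// coefficients sparse only when sparsity makes that encoding clearly smaller.
def shouldStoreSparse(values: Array[Double]): Boolean = {
  val nnz = values.count(_ != 0.0)
  // sparse stores an index plus a value per nonzero; dense stores every slot
  1.5 * (nnz + 1.0) < values.length
}

assert(shouldStoreSparse(Array(0.0, 0.0, 0.0, 0.0, 5.0)))  // mostly zeros
assert(!shouldStoreSparse(Array(1.0, 2.0, 3.0, 4.0, 5.0))) // fully dense
```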





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78419383
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val histogram = labelSummarizer.histogram
 val numInvalid = labelSummarizer.countInvalid
-val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
+val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+  case Some(n: Int) =>
+require(n >= histogram.length, s"Specified number of classes $n was " +
+  s"less than the number of unique labels ${histogram.length}.")
+n
+  case None => histogram.length
+}
+
+val isBinaryClassification = numClasses == 1 || numClasses == 2
+val isMultinomial = $(family) match {
+  case "binomial" =>
+require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+s"outcome classes but found $numClasses.")
+false
+  case "multinomial" => true
+  case "auto" => !isBinaryClassification
+  case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+}
--- End diff --

Oh, I originally thought it was a typo. `isBinaryClassification` is not used 
anywhere else, and I think it can lead people to interpret it as 
`isBinomial`. Maybe we should just remove it.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78389388
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
 
 val histogram = labelSummarizer.histogram
 val numInvalid = labelSummarizer.countInvalid
-val numClasses = histogram.length
 val numFeatures = summarizer.mean.size
+val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+  case Some(n: Int) =>
+require(n >= histogram.length, s"Specified number of classes $n was " +
+  s"less than the number of unique labels ${histogram.length}.")
+n
+  case None => histogram.length
+}
+
+val isBinaryClassification = numClasses == 1 || numClasses == 2
+val isMultinomial = $(family) match {
+  case "binomial" =>
+require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+s"outcome classes but found $numClasses.")
+false
+  case "multinomial" => true
+  case "auto" => !isBinaryClassification
+  case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+}
--- End diff --

`isBinaryClassification` is not the same as `isBinomial`, and it's intended 
that it could be true along with `isMultinomial`. I can switch it up to not use 
the `isBinaryClassification` variable, but `isBinomial` is never used so I 
don't see a need to define it.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r7839
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
 ---
@@ -460,33 +577,74 @@ class LogisticRegression @Since("1.2.0") (
as a result, no scaling is needed.
  */
 val rawCoefficients = state.x.toArray.clone()
-var i = 0
-while (i < numFeatures) {
-  rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-  i += 1
+val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+  // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+  val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+  val featureIndex = i % numFeatures
+  if (featuresStd(featureIndex) != 0.0) {
+rawCoefficients(flatIndex) / featuresStd(featureIndex)
+  } else {
+0.0
+  }
+}
+val coefficientMatrix =
+  new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+if ($(regParam) == 0.0 && isMultinomial) {
+  /*
+When no regularization is applied, the coefficients lack identifiability because
+we do not use a pivot class. We can add any constant value to the coefficients and
+get the same likelihood. So here, we choose the mean centered coefficients for
+reproducibility. This method follows the approach in glmnet, described here:
+
+Friedman, et al. "Regularization Paths for Generalized Linear Models via
+  Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+   */
+  val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+  coefficientMatrix.update(_ - coefficientMean)
 }
-bcFeaturesStd.destroy(blocking = false)
 
-if ($(fitIntercept)) {
-  (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-arrayBuilder.result())
+val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+  Array.tabulate(numCoefficientSets) { i =>
+val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+rawCoefficients(coefIndex)
+  }
+} else {
+  Array[Double]()
+}
+/*
+  The intercepts are never regularized, so we always center the mean.
+ */
+val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+  val interceptMean = interceptsArray.sum / numClasses
+  interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+  Vectors.dense(interceptsArray)
+} else if (interceptsArray.length == 1) {
+  Vectors.dense(interceptsArray)
 } else {
-  (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+  Vectors.sparse(numCoefficientSets, Seq())
 }
+(coefficientMatrix, interceptVector, arrayBuilder.result())
--- End diff --

Yeah, that was one of the concerns I had. I think we definitely need to do it before the 2.1 release, but it might be safer to block this PR until it's done.
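The identifiability argument quoted in the diff above (any constant can be added to all of the multinomial coefficients without changing the likelihood) is easy to check numerically. A minimal Python sketch, independent of Spark:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Adding the same constant to every class score leaves the class probabilities
# unchanged, so unregularized multinomial coefficients are only identifiable
# up to a constant shift -- hence the mean-centering in the diff above.
scores = [0.3, -1.2, 2.0]
shifted = [s + 5.0 for s in scores]
p1, p2 = softmax(scores), softmax(shifted)
assert all(abs(a - b) < 1e-12 for a, b in zip(p1, p2))

# Mean-centering picks one canonical representative of that equivalence class.
mean = sum(scores) / len(scores)
centered = [s - mean for s in scores]
assert abs(sum(centered)) < 1e-12
```

Glmnet's mean-centered choice is just one way of fixing a representative; dropping a pivot class would be another.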


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78321887
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -460,33 +577,74 @@ class LogisticRegression @Since("1.2.0") (
        as a result, no scaling is needed.
      */
     val rawCoefficients = state.x.toArray.clone()
-    var i = 0
-    while (i < numFeatures) {
-      rawCoefficients(i) *= { if (featuresStd(i) != 0.0) 1.0 / featuresStd(i) else 0.0 }
-      i += 1
+    val coefficientArray = Array.tabulate(numCoefficientSets * numFeatures) { i =>
+      // flatIndex will loop though rawCoefficients, and skip the intercept terms.
+      val flatIndex = if ($(fitIntercept)) i + i / numFeatures else i
+      val featureIndex = i % numFeatures
+      if (featuresStd(featureIndex) != 0.0) {
+        rawCoefficients(flatIndex) / featuresStd(featureIndex)
+      } else {
+        0.0
+      }
+    }
+    val coefficientMatrix =
+      new DenseMatrix(numCoefficientSets, numFeatures, coefficientArray, isTransposed = true)
+
+    if ($(regParam) == 0.0 && isMultinomial) {
+      /*
+        When no regularization is applied, the coefficients lack identifiability because
+        we do not use a pivot class. We can add any constant value to the coefficients and
+        get the same likelihood. So here, we choose the mean centered coefficients for
+        reproducibility. This method follows the approach in glmnet, described here:
+
+        Friedman, et al. "Regularization Paths for Generalized Linear Models via
+          Coordinate Descent," https://core.ac.uk/download/files/153/6287975.pdf
+       */
+      val coefficientMean = coefficientMatrix.values.sum / coefficientMatrix.values.length
+      coefficientMatrix.update(_ - coefficientMean)
     }
-    bcFeaturesStd.destroy(blocking = false)
 
-    if ($(fitIntercept)) {
-      (Vectors.dense(rawCoefficients.dropRight(1)).compressed, rawCoefficients.last,
-        arrayBuilder.result())
+    val interceptsArray: Array[Double] = if ($(fitIntercept)) {
+      Array.tabulate(numCoefficientSets) { i =>
+        val coefIndex = (i + 1) * numFeaturesPlusIntercept - 1
+        rawCoefficients(coefIndex)
+      }
+    } else {
+      Array[Double]()
+    }
+    /*
+      The intercepts are never regularized, so we always center the mean.
+     */
+    val interceptVector = if (interceptsArray.nonEmpty && isMultinomial) {
+      val interceptMean = interceptsArray.sum / numClasses
+      interceptsArray.indices.foreach { i => interceptsArray(i) -= interceptMean }
+      Vectors.dense(interceptsArray)
+    } else if (interceptsArray.length == 1) {
+      Vectors.dense(interceptsArray)
     } else {
-      (Vectors.dense(rawCoefficients).compressed, 0.0, arrayBuilder.result())
+      Vectors.sparse(numCoefficientSets, Seq())
    }
+    (coefficientMatrix, interceptVector, arrayBuilder.result())
--- End diff --

Should we implement `coefficientMatrix.compressed` before merging this? Otherwise, this could be a regression.
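For context, `Vector.compressed` in Spark returns whichever of the dense or sparse representations needs less storage, and a matrix analogue would do the same. A rough Python sketch of the idea (the cost model here is an assumption for illustration, not Spark's exact byte accounting):

```python
def compressed(values):
    """Return a ('dense', ...) or ('sparse', ...) tuple, whichever is estimated
    to be smaller. Assumed cost model: 8 bytes per dense entry; 12 bytes
    (8-byte value + 4-byte index) per stored nonzero in the sparse form."""
    n = len(values)
    nonzeros = [(i, v) for i, v in enumerate(values) if v != 0.0]
    dense_cost = 8 * n
    sparse_cost = 12 * len(nonzeros)
    if sparse_cost < dense_cost:
        # Sparse wins: store size, indices, and values of the nonzeros only.
        return ("sparse", n, [i for i, _ in nonzeros], [v for _, v in nonzeros])
    return ("dense", list(values))

assert compressed([0.0, 0.0, 3.0, 0.0, 0.0, 0.0])[0] == "sparse"
assert compressed([1.0, 2.0, 3.0])[0] == "dense"
```

Without such a helper, the all-zero constant-label matrices below would be materialized densely, which is the regression being flagged.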


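The `flatIndex` arithmetic in the diff above maps a class-major coefficient index onto the raw solution array, which (when fitting intercepts) stores each class's feature coefficients followed by that class's intercept. A small Python illustration of that mapping (function name is mine, not Spark's):

```python
def flat_index(i, num_features, fit_intercept):
    """Index of coefficient i in the raw solution array, skipping one
    intercept slot after every num_features coefficients when
    fit_intercept is true -- the arithmetic from the diff above."""
    return i + i // num_features if fit_intercept else i

# Two classes, three features, with intercepts: raw layout is
# [w00, w01, w02, b0, w10, w11, w12, b1] -- intercept slots 3 and 7 are skipped.
mapped = [flat_index(i, 3, True) for i in range(6)]
assert mapped == [0, 1, 2, 4, 5, 6]

# Without an intercept the mapping is the identity.
assert [flat_index(i, 3, False) for i in range(6)] == list(range(6))
```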



[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78321247
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -323,32 +382,33 @@ class LogisticRegression @Since("1.2.0") (
     instr.logNumClasses(numClasses)
     instr.logNumFeatures(numFeatures)
 
-    val (coefficients, intercept, objectiveHistory) = {
+    val (coefficientMatrix, interceptVector, objectiveHistory) = {
       if (numInvalid != 0) {
         val msg = s"Classification labels should be in [0 to ${numClasses - 1}]. " +
           s"Found $numInvalid invalid labels."
         logError(msg)
         throw new SparkException(msg)
       }
 
-      val isConstantLabel = histogram.count(_ != 0) == 1
+      val isConstantLabel = histogram.count(_ != 0.0) == 1
 
-      if (numClasses > 2) {
-        val msg = s"LogisticRegression with ElasticNet in ML package only supports " +
-          s"binary classification. Found $numClasses in the input dataset. Consider using " +
-          s"MultinomialLogisticRegression instead."
-        logError(msg)
-        throw new SparkException(msg)
-      } else if ($(fitIntercept) && numClasses == 2 && isConstantLabel) {
-        logWarning(s"All labels are one and fitIntercept=true, so the coefficients will be " +
-          s"zeros and the intercept will be positive infinity; as a result, " +
-          s"training is not needed.")
-        (Vectors.sparse(numFeatures, Seq()), Double.PositiveInfinity, Array.empty[Double])
-      } else if ($(fitIntercept) && numClasses == 1) {
-        logWarning(s"All labels are zero and fitIntercept=true, so the coefficients will be " +
-          s"zeros and the intercept will be negative infinity; as a result, " +
-          s"training is not needed.")
-        (Vectors.sparse(numFeatures, Seq()), Double.NegativeInfinity, Array.empty[Double])
+      if ($(fitIntercept) && isConstantLabel) {
+        logWarning(s"All labels are the same value and fitIntercept=true, so the coefficients " +
+          s"will be zeros. Training is not needed.")
+        val constantLabelIndex = Vectors.dense(histogram).argmax
+        val coefMatrix = if (numFeatures < numCoefficientSets) {
--- End diff --

Add a comment here with a TODO to consolidate the sparse matrix logic.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-12 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78321146
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
 
     val histogram = labelSummarizer.histogram
     val numInvalid = labelSummarizer.countInvalid
-    val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
+    val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+    val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+      case Some(n: Int) =>
+        require(n >= histogram.length, s"Specified number of classes $n was " +
+          s"less than the number of unique labels ${histogram.length}.")
+        n
+      case None => histogram.length
+    }
+
+    val isBinaryClassification = numClasses == 1 || numClasses == 2
+    val isMultinomial = $(family) match {
+      case "binomial" =>
+        require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+          s"outcome classes but found $numClasses.")
+        false
+      case "multinomial" => true
+      case "auto" => !isBinaryClassification
+      case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+    }
--- End diff --

BTW, I think `isBinaryClassification` is not needed, since it's not used anywhere else.





[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78316060
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -311,8 +350,28 @@ class LogisticRegression @Since("1.2.0") (
 
     val histogram = labelSummarizer.histogram
     val numInvalid = labelSummarizer.countInvalid
-    val numClasses = histogram.length
     val numFeatures = summarizer.mean.size
+    val numFeaturesPlusIntercept = if (getFitIntercept) numFeatures + 1 else numFeatures
+
+    val numClasses = MetadataUtils.getNumClasses(dataset.schema($(labelCol))) match {
+      case Some(n: Int) =>
+        require(n >= histogram.length, s"Specified number of classes $n was " +
+          s"less than the number of unique labels ${histogram.length}.")
+        n
+      case None => histogram.length
+    }
+
+    val isBinaryClassification = numClasses == 1 || numClasses == 2
+    val isMultinomial = $(family) match {
+      case "binomial" =>
+        require(isBinaryClassification, s"Binomial family only supports 1 or 2 " +
+          s"outcome classes but found $numClasses.")
+        false
+      case "multinomial" => true
+      case "auto" => !isBinaryClassification
+      case other => throw new IllegalArgumentException(s"Unsupported family: $other")
+    }
--- End diff --

Both `isBinaryClassification` and `isMultinomial` can be true when `numClasses == 2`. I think it's better to write:

```scala
val isMultinomial = $(family) match {
  case "binomial" =>
    require(numClasses == 1 || numClasses == 2, s"Binomial family only supports 1 or 2 " +
      s"outcome classes but found $numClasses.")
    false
  case "multinomial" => true
  case "auto" => numClasses > 2
  case other => throw new IllegalArgumentException(s"Unsupported family: $other")
}
val isBinomial = !isMultinomial
```
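The proposed resolution logic can be exercised quickly outside Spark. A hypothetical Python transcription of the Scala snippet above (a sketch of the suggestion, not Spark's final code):

```python
def is_multinomial(family, num_classes):
    """Resolve the family param to a multinomial/binomial decision,
    mirroring the suggested Scala match expression."""
    if family == "binomial":
        if num_classes not in (1, 2):
            raise ValueError(
                f"Binomial family only supports 1 or 2 outcome classes "
                f"but found {num_classes}.")
        return False
    if family == "multinomial":
        return True
    if family == "auto":
        # "auto" stays binomial for 1 or 2 classes, so the two flags
        # are mutually exclusive, unlike with isBinaryClassification.
        return num_classes > 2
    raise ValueError(f"Unsupported family: {family}")

assert is_multinomial("auto", 2) is False        # binary data stays binomial
assert is_multinomial("auto", 3) is True
assert is_multinomial("multinomial", 2) is True  # multinomial can be forced
```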







[GitHub] spark pull request #14834: [SPARK-17163][ML] Unified LogisticRegression inte...

2016-09-11 Thread dbtsai
Github user dbtsai commented on a diff in the pull request:

https://github.com/apache/spark/pull/14834#discussion_r78315600
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---
@@ -261,6 +299,7 @@ class LogisticRegression @Since("1.2.0") (
* If the dimensions of features or the number of partitions are large,
* this param could be adjusted to a larger size.
* Default is 2.
+ *
--- End diff --

The indentation is off here.

