[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83420249 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +104,191 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarValues = aBar.values +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarValues(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarValues(j) /= aStd(j) + } + j += 1 +} + +val abBarValues = abBar.values +val aStdValues = aStd.values +j = 0 +while (j < numFeatures) { + if (aStdValues(j) == 0.0) { +abBarValues(j) = 0.0 + } else { +abBarValues(j) /= (aStdValues(j) * bStd) + } + j += 1 +} + +val aaBarValues = aaBar.values +j = 0 +var p = 0 +while (j < numFeatures) { + val aStdJ = aStdValues(j) + var i = 0 + while (i <= j) { +val aStdI = aStdValues(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarValues(p) = 0.0 +} else { + aaBarValues(p) /= (aStdI * aStdJ) +} +p += 1 +i += 1 } + j += 1 } -// add regularization to diagonals +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + +// add L2 regularization to diagonals var i = 0 -var j = 2 +j = 2 while (i < triK) { - var lambda = regParam - if (standardizeFeatures) { -lambda *= aVar(j - 2) + var lambda = effectiveL2RegParam + if (!standardizeFeatures) { +val std = aStd(j - 2) +if (std != 0.0) { + lambda /= (std * std) +} else { + lambda = 0.0 +} } - if (standardizeLabel && bStd != 0) { -lambda /= bStd +
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83421445 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +104,191 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarValues = aBar.values +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarValues(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarValues(j) /= aStd(j) + } + j += 1 +} + +val abBarValues = abBar.values +val aStdValues = aStd.values +j = 0 +while (j < numFeatures) { + if (aStdValues(j) == 0.0) { +abBarValues(j) = 0.0 + } else { +abBarValues(j) /= (aStdValues(j) * bStd) + } + j += 1 +} + +val aaBarValues = aaBar.values +j = 0 +var p = 0 +while (j < numFeatures) { + val aStdJ = aStdValues(j) + var i = 0 + while (i <= j) { +val aStdI = aStdValues(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarValues(p) = 0.0 +} else { + aaBarValues(p) /= (aStdI * aStdJ) +} +p += 1 +i += 1 } + j += 1 } -// add regularization to diagonals +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + +// add L2 regularization to diagonals var i = 0 -var j = 2 +j = 2 while (i < triK) { - var lambda = regParam - if (standardizeFeatures) { -lambda *= aVar(j - 2) + var lambda = effectiveL2RegParam + if (!standardizeFeatures) { +val std = aStd(j - 2) +if (std != 0.0) { + lambda /= (std * std) +} else { + lambda = 0.0 +} } - if (standardizeLabel && bStd != 0) { -lambda /= bStd +
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83428359 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -201,11 +476,13 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext for (fitIntercept <- Seq(false, true); regParam <- Seq(0.0, 0.1, 1.0); standardizeFeatures <- Seq(false, true)) { - val wls = new WeightedLeastSquares( -fitIntercept, regParam, standardizeFeatures, standardizeLabel = true) -.fit(instances) - val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) - assert(actual ~== expected(idx) absTol 1e-4) + for (solver <- WeightedLeastSquares.supportedSolvers) { +val wls = new WeightedLeastSquares(fitIntercept, regParam, elasticNetParam = 0.0, + standardizeFeatures, standardizeLabel = true, solverType = solver) + .fit(instances) +val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) +assert(actual ~== expected(idx) absTol 1e-4) + } idx += 1 } } --- End diff -- Yeah, may be we can leave TODO here to consider remove this arguments for WLS. Let's address it in follow up work when we are confidence. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83408577 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[Array[Double]], +val objectiveHistory: Option[Array[Double]]) { + + def coefficients: Array[Double] = { +if (fitIntercept) { + _coefficients.slice(0, _coefficients.length - 1) +} else { + _coefficients +} + } + + def intercept: Double = if (fitIntercept) _coefficients.last else 0.0 +} + +/** + * Interface for classes that solve the normal equations locally. + */ +private[ml] sealed trait NormalEquationSolver { + + /** Solve the normal equations from summary statistics. */ + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution +} + +/** + * A class that solves the normal equations directly, using Cholesky decomposition. + */ +private[ml] class CholeskySolver(val fitIntercept: Boolean) extends NormalEquationSolver { --- End diff -- I'm strongly prefer to move ```fitIntercept``` out of ```NormalEquationSolver``` and ```NormalEquationSolution```. Since the arguments ```aaBar``` and ```abBar``` were constructed to append bias if ```fitIntercept = true``` at ```WeightedLeastSquares``` L277, normal equation should be not aware of intercept. It will also make the solution more clean. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83425844 --- Diff: mllib-local/src/test/scala/org/apache/spark/ml/linalg/BLASSuite.scala --- @@ -422,4 +422,49 @@ class BLASSuite extends SparkMLFunSuite { assert(dATT.multiply(sx) ~== expected absTol 1e-15) assert(sATT.multiply(sx) ~== expected absTol 1e-15) } + + test("spmv") { --- End diff -- dspmv --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83425552 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala --- @@ -244,6 +244,23 @@ private[spark] object BLAS extends Serializable { } /** + * y := alpha*A*x + beta*y + * + * @param A The upper triangular part of A in a [[DenseVector]] (column major). + * @param x The [[DenseVector]] transformed by A. + * @param y The [[DenseVector]] to be modified in place. --- End diff -- Add annotation for ```n```: ```the order of the matrix A```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83422603 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +104,191 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarValues = aBar.values +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarValues(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarValues(j) /= aStd(j) + } + j += 1 +} + +val abBarValues = abBar.values +val aStdValues = aStd.values +j = 0 +while (j < numFeatures) { + if (aStdValues(j) == 0.0) { +abBarValues(j) = 0.0 + } else { +abBarValues(j) /= (aStdValues(j) * bStd) + } + j += 1 +} + +val aaBarValues = aaBar.values +j = 0 +var p = 0 +while (j < numFeatures) { + val aStdJ = aStdValues(j) + var i = 0 + while (i <= j) { +val aStdI = aStdValues(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarValues(p) = 0.0 +} else { + aaBarValues(p) /= (aStdI * aStdJ) +} +p += 1 +i += 1 } + j += 1 } -// add regularization to diagonals +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + +// add L2 regularization to diagonals var i = 0 -var j = 2 +j = 2 while (i < triK) { - var lambda = regParam - if (standardizeFeatures) { -lambda *= aVar(j - 2) + var lambda = effectiveL2RegParam + if (!standardizeFeatures) { +val std = aStd(j - 2) +if (std != 0.0) { + lambda /= (std * std) +} else { + lambda = 0.0 +} } - if (standardizeLabel && bStd != 0) { -lambda /= bStd +
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83413947 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[Array[Double]], +val objectiveHistory: Option[Array[Double]]) { + + def coefficients: Array[Double] = { +if (fitIntercept) { + _coefficients.slice(0, _coefficients.length - 1) +} else { + _coefficients +} + } + + def intercept: Double = if (fitIntercept) _coefficients.last else 0.0 +} + +/** + * Interface for classes that solve the normal equations locally. + */ +private[ml] sealed trait NormalEquationSolver { + + /** Solve the normal equations from summary statistics. */ + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution +} + +/** + * A class that solves the normal equations directly, using Cholesky decomposition. + */ +private[ml] class CholeskySolver(val fitIntercept: Boolean) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val k = abBar.size +val x = CholeskyDecomposition.solve(aaBar.values, abBar.values) +val aaInv = CholeskyDecomposition.inverse(aaBar.values, k) + +new NormalEquationSolution(fitIntercept, x, Some(aaInv), None) + } +} + +/** + * A class for solving the normal equations using Quasi-Newton optimization methods. + */ +private[ml] class QuasiNewtonSolver( +val fitIntercept: Boolean, +maxIter: Int, +tol: Double, +l1RegFunc: Option[(Int) => Double]) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val numFeatures = aBar.size +val numFeaturesPlusIntercept = if (fitIntercept) numFeatures + 1 else numFeatures +val initialCoefficientsWithIntercept = new Array[Double](numFeaturesPlusIntercept) +if (fitIntercept) { + initialCoefficientsWithIntercept(numFeaturesPlusIntercept - 1) = bBar +} + +val costFun = + new NormalEquationCostFun(bBar, bbBar, abBar, aaBar, aBar, fitIntercept, numFeatures) +val optimizer = l1RegFunc.map { func => + new BreezeOWLQN[Int, BDV[Double]](maxIter, 10, func, tol) +}.getOrElse(new BreezeLBFGS[BDV[Double]](maxIter, 10, tol)) + +val states = optimizer.iterations(new CachedDiffFunction(costFun), + new BDV[Double](initialCoefficientsWithIntercept)) + +val arrayBuilder = mutable.ArrayBuilder.make[Double] +var state: optimizer.State = null +while (states.hasNext) { + state = states.next() + arrayBuilder += state.adjustedValue +} +val x = state.x.toArray.clone() +new NormalEquationSolution(fitIntercept, x, None, Some(arrayBuilder.result())) + } + + /** + * NormalEquationCostFun implements Breeze's DiffFunction[T] for the normal equation. + * It returns the loss and gradient with L2 regularization at a particular
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83420840 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +104,191 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarValues = aBar.values --- End diff -- Doc here and below, such as: Scale ```aBar``` to standardized space in-place. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83091619 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. * * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to * match R's `lm`. * Turn on [[standardizeLabel]] to match R's `glmnet`. * * @param fitIntercept whether to fit intercept. If false, z is 0.0. - * @param regParam L2 regularization parameter (lambda) + * @param regParam L2 regularization parameter (lambda). + * @param elasticNetParam the ElasticNet mixing parameter. * @param standardizeFeatures whether to standardize features. If true, sigma_,,j,, is the *population standard deviation of the j-th column of A. Otherwise, *sigma,,j,, is 1.0. * @param standardizeLabel whether to standardize label. If true, delta is the population standard * deviation of the label column b. Otherwise, delta is 1.0. + * @param solverType the type of solver to use for optimization. + * @param maxIter maximum number of iterations when stochastic optimization is used. + * @param tol the convergence tolerance of the iterations when stochastic optimization is used. --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83079711 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => + (sum + w * l, count + w) +} +val bBar = bAgg._1 / bAgg._2 +val wls = new WeightedLeastSquares(true, 10, 1.0, true, true) +val model = wls.fit(instances) +assert(model.intercept ~== bBar relTol 1e-6) + } + + test("diagonal inverse of AtWA") { +/* + A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2) --- End diff -- done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83090088 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[DenseVector], --- End diff -- Seems a bit cleaner so I updated it, thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83090919 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. * * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to * match R's `lm`. * Turn on [[standardizeLabel]] to match R's `glmnet`. * * @param fitIntercept whether to fit intercept. If false, z is 0.0. - * @param regParam L2 regularization parameter (lambda) + * @param regParam L2 regularization parameter (lambda). --- End diff -- I changed it to say "Regularization parameter (lambda)". Since lambda is specified above, it should be clear. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83086536 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train + // in the standardized space now, it will fail + intercept[SingularMatrixException] { +wls.fit(constantFeatures) + } +} + +// should not fail when regularization is added +new WeightedLeastSquares(true, 0.5, 0.0, standardizeFeatures = true, + standardizeLabel = true, solverType = WeightedLeastSquares.Cholesky).fit(constantFeatures) + +/* + for (intercept in c(FALSE, TRUE)) { +for (standardize in c(FALSE, TRUE)) { + for (regParams in list(c(0.0, 0.0), c(0.5, 0.0), c(0.5, 0.5), c(0.5, 1.0))) { +model <- glmnet(A, b, weights=w, intercept=intercept, lambda=regParams[1], + standardize=standardize, alpha=regParams[2], thresh=1E-14) +print(as.vector(coef(model))) + } +} + } + [1] 0.00 0.00 2.253012 + [1] 0.00 0.00 2.250857 + [1] 0.00 0.00 2.249784 + [1] 0.00 0.00 2.248709 + [1] 0.00 0.00
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83091695 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 +while (j < numFeatures) { + val aStdJ = aStd(j) + var i = 0 + while (i <= j) { +val aStdI = aStd(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarStd(kk) = 0.0 +} else { + aaBarStd(kk) = aaBarValues(kk) / (aStdI * aStdJ) +} +kk += 1 +i += 1 } + j += 1 } +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + // add regularization to diagonals var i = 0 -var j = 2 +j = 2 while (i < triK) { - var lambda = regParam - if (standardizeFeatures) { -lambda *= aVar(j - 2) + var lambda = effectiveL2RegParam + if (!standardizeFeatures) { +val std = aStd(j - 2) +if (std != 0.0) { + lambda /= (std * std) +} else { + lambda = 0.0 +} } - if (standardizeLabel && bStd != 0) { -lambda /= bStd + if
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83084385 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => --- End diff -- done, moved them all --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83086860 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -201,11 +476,13 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext for (fitIntercept <- Seq(false, true); regParam <- Seq(0.0, 0.1, 1.0); standardizeFeatures <- Seq(false, true)) { - val wls = new WeightedLeastSquares( -fitIntercept, regParam, standardizeFeatures, standardizeLabel = true) -.fit(instances) - val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) - assert(actual ~== expected(idx) absTol 1e-4) + for (solver <- WeightedLeastSquares.supportedSolvers) { --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83079621 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => --- End diff -- done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83081588 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => + (sum + w * l, count + w) +} +val bBar = bAgg._1 / bAgg._2 +val wls = new WeightedLeastSquares(true, 10, 1.0, true, true) +val model = wls.fit(instances) +assert(model.intercept ~== bBar relTol 1e-6) + } + + test("diagonal inverse of AtWA") { +/* + A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2) + w <- c(1, 2, 3, 4) + W <- Diagonal(length(w), w) + A.intercept <- cbind(A, rep.int(1, length(w))) + AtA.intercept <- t(A.intercept) %*% W %*% A.intercept + inv.intercept <- solve(AtA.intercept) + print(diag(inv.intercept)) + [1] 4.02 0.50 12.02 + + AtA <- t(A) %*% W %*% A + inv <- solve(AtA) + print(diag(inv)) + [1] 0.48336106 0.02079867 + + */ +val expectedWithIntercept = Vectors.dense(4.02, 0.50, 12.02) +val expected = Vectors.dense(0.48336106, 0.02079867) +val wlsWithIntercept = new WeightedLeastSquares(true, 0.0, 0.0, true, true, --- End diff -- Yeah, I thought about this. It makes the code kind of ridiculously long. Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83091585 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. --- End diff -- I added a note. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83078777 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 --- End diff -- Ok, done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83069767 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[DenseVector], +val objectiveHistory: Option[Array[Double]]) { + + def coefficients: DenseVector = { +if (fitIntercept) { + new DenseVector(_coefficients.slice(0, _coefficients.length - 1)) +} else { + new DenseVector(_coefficients) +} + } + + def intercept: Double = if (fitIntercept) _coefficients.last else 0.0 +} + +/** + * Interface for classes that solve the normal equations locally. + */ +private[ml] sealed trait NormalEquationSolver { + + /** Solve the normal equations from summary statistics. */ + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution +} + +/** + * A class that solves the normal equations directly, using Cholesky decomposition. + */ +private[ml] class CholeskySolver(val fitIntercept: Boolean) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val k = abBar.size +val x = CholeskyDecomposition.solve(aaBar.values, abBar.values) +val aaInv = CholeskyDecomposition.inverse(aaBar.values, k) + +new NormalEquationSolution(fitIntercept, x, Some(new DenseVector(aaInv)), None) + } +} + +/** + * A class for solving the normal equations using Quasi-Newton optimization methods. + */ +private[ml] class QuasiNewtonSolver( +val fitIntercept: Boolean, +maxIter: Int, +tol: Double, +l1RegFunc: Option[(Int) => Double]) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val numFeatures = aBar.size +val numFeaturesPlusIntercept = if (fitIntercept) numFeatures + 1 else numFeatures +val initialCoefficientsWithIntercept = new Array[Double](numFeaturesPlusIntercept) +if (fitIntercept) { + initialCoefficientsWithIntercept(numFeaturesPlusIntercept - 1) = bBar +} + +val costFun = + new NormalEquationCostFun(bBar, bbBar, abBar, aaBar, aBar, fitIntercept, numFeatures) +val optimizer = l1RegFunc.map { func => + new BreezeOWLQN[Int, BDV[Double]](maxIter, 10, func, tol) +}.getOrElse(new BreezeLBFGS[BDV[Double]](maxIter, 10, tol)) + +val states = optimizer.iterations(new CachedDiffFunction(costFun), + new BDV[Double](initialCoefficientsWithIntercept)) + +val arrayBuilder = mutable.ArrayBuilder.make[Double] +var state: optimizer.State = null +while (states.hasNext) { + state = states.next() + arrayBuilder += state.adjustedValue +} +val x = state.x.toArray.clone() +new NormalEquationSolution(fitIntercept, x, None, Some(arrayBuilder.result())) + } + + /** + * NormalEquationCostFun implements Breeze's DiffFunction[T] for the normal equation. + * It returns the loss and
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83087184 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -757,8 +761,9 @@ class LinearRegressionSuite assert(model.summary.meanAbsoluteError ~== 0.07961668 relTol 1E-4) assert(model.summary.r2 ~== 0.9998737 relTol 1E-4) - // Normal solver uses "WeightedLeastSquares". This algorithm does not generate - // objective history because it does not run through iterations. + // Normal solver uses "WeightedLeastSquares". When no regularization is applied, --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83041314 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train --- End diff -- Should remove this comment. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83091607 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. * * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to * match R's `lm`. * Turn on [[standardizeLabel]] to match R's `glmnet`. * * @param fitIntercept whether to fit intercept. If false, z is 0.0. - * @param regParam L2 regularization parameter (lambda) + * @param regParam L2 regularization parameter (lambda). + * @param elasticNetParam the ElasticNet mixing parameter. * @param standardizeFeatures whether to standardize features. If true, sigma_,,j,, is the *population standard deviation of the j-th column of A. Otherwise, *sigma,,j,, is 1.0. * @param standardizeLabel whether to standardize label. If true, delta is the population standard * deviation of the label column b. Otherwise, delta is 1.0. + * @param solverType the type of solver to use for optimization. + * @param maxIter maximum number of iterations when stochastic optimization is used. --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83084435 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83072526 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) --- End diff -- Yeah, good thought. Initially, I implemented new methods in the aggregator to do compute these, but since we need to modify bStd in some cases, I thought it was too hacky and just put the logic here. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83086545 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train + // in the standardized space now, it will fail + intercept[SingularMatrixException] { +wls.fit(constantFeatures) + } +} + +// should not fail when regularization is added +new WeightedLeastSquares(true, 0.5, 0.0, standardizeFeatures = true, + standardizeLabel = true, solverType = WeightedLeastSquares.Cholesky).fit(constantFeatures) + +/* + for (intercept in c(FALSE, TRUE)) { +for (standardize in c(FALSE, TRUE)) { + for (regParams in list(c(0.0, 0.0), c(0.5, 0.0), c(0.5, 0.5), c(0.5, 1.0))) { +model <- glmnet(A, b, weights=w, intercept=intercept, lambda=regParams[1], + standardize=standardize, alpha=regParams[2], thresh=1E-14) +print(as.vector(coef(model))) + } +} + } + [1] 0.00 0.00 2.253012 + [1] 0.00 0.00 2.250857 + [1] 0.00 0.00 2.249784 + [1] 0.00 0.00 2.248709 + [1] 0.00 0.00
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83113222 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train + // in the standardized space now, it will fail + intercept[SingularMatrixException] { +wls.fit(constantFeatures) + } +} + +// should not fail when regularization is added +new WeightedLeastSquares(true, 0.5, 0.0, standardizeFeatures = true, + standardizeLabel = true, solverType = WeightedLeastSquares.Cholesky).fit(constantFeatures) --- End diff -- ok I added some checks. Actually, Cholesky now fails if regularization is added, but standardization is false. This is because in this case we have to divide by the variance of the features along the diagonal, and that would cause divide by zero, so we make that element of the diagonal zero. I think this is ok, and not a regression, since it will fail but fall back to QN solver. I added test cases to make this clearer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83086848 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -201,11 +476,13 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext for (fitIntercept <- Seq(false, true); regParam <- Seq(0.0, 0.1, 1.0); standardizeFeatures <- Seq(false, true)) { - val wls = new WeightedLeastSquares( -fitIntercept, regParam, standardizeFeatures, standardizeLabel = true) -.fit(instances) - val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) - assert(actual ~== expected(idx) absTol 1e-4) + for (solver <- WeightedLeastSquares.supportedSolvers) { +val wls = new WeightedLeastSquares(fitIntercept, regParam, elasticNetParam = 0.0, + standardizeFeatures, standardizeLabel = true, solverType = solver) + .fit(instances) +val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) +assert(actual ~== expected(idx) absTol 1e-4) + } idx += 1 } } --- End diff -- hm, ya there weren't any before. I wasn't actually sure why we even provide the option since we never use it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83089207 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala --- @@ -244,6 +244,15 @@ private[spark] object BLAS extends Serializable { } /** + * y += alpha * A * x + * + * @param A The upper triangular part of A in a [[DenseVector]] (column major) + */ + def dspmv(n: Int, alpha: Double, A: DenseVector, x: DenseVector, y: DenseVector): Unit = { --- End diff -- updated it and added some tests --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83079430 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -286,6 +434,24 @@ private[ml] object WeightedLeastSquares { } /** + * Weighted population standard deviation of features. + */ +def aStd: DenseVector = { + val std = Array.ofDim[Double](k) + var i = 0 + var j = 2 + val aaValues = aaSum.values + while (i < triK) { +val l = j - 2 +val aw = aSum(l) / wSum +std(l) = math.sqrt(aaValues(i) / wSum - aw * aw) +i += j +j += 1 + } + new DenseVector(std) +} + +/** * Weighted population variance of features. */ def aVar: DenseVector = { --- End diff -- I think we should keep it. It may be used in the future, but I don't feel strongly about it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83082013 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => + (sum + w * l, count + w) +} +val bBar = bAgg._1 / bAgg._2 +val wls = new WeightedLeastSquares(true, 10, 1.0, true, true) +val model = wls.fit(instances) +assert(model.intercept ~== bBar relTol 1e-6) + } + + test("diagonal inverse of AtWA") { +/* + A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2) + w <- c(1, 2, 3, 4) + W <- Diagonal(length(w), w) + A.intercept <- cbind(A, rep.int(1, length(w))) + AtA.intercept <- t(A.intercept) %*% W %*% A.intercept + inv.intercept <- solve(AtA.intercept) + print(diag(inv.intercept)) + [1] 4.02 0.50 12.02 + + AtA <- t(A) %*% W %*% A + inv <- solve(AtA) + print(diag(inv)) + [1] 0.48336106 0.02079867 + + */ +val expectedWithIntercept = Vectors.dense(4.02, 0.50, 12.02) +val expected = Vectors.dense(0.48336106, 0.02079867) +val wlsWithIntercept = new WeightedLeastSquares(true, 0.0, 0.0, true, true, + solverType = WeightedLeastSquares.Cholesky) +val wlsModelWithIntercept = wlsWithIntercept.fit(instances) +val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, + solverType = WeightedLeastSquares.Cholesky) +val wlsModel = wls.fit(instances) + +assert(expectedWithIntercept ~== wlsModelWithIntercept.diagInvAtWA relTol 1e-4) +assert(expected ~== wlsModel.diagInvAtWA relTol 1e-4) + } + + test("two collinear features") { +/* + A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2) + b <- c(1, 2, 3, 4) + w <- c(1, 1, 1, 1) + */ val singularInstances = sc.parallelize(Seq( --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83078820 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 +while (j < numFeatures) { + val aStdJ = aStd(j) + var i = 0 + while (i <= j) { +val aStdI = aStd(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarStd(kk) = 0.0 +} else { + aaBarStd(kk) = aaBarValues(kk) / (aStdI * aStdJ) +} +kk += 1 +i += 1 } + j += 1 } +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + // add regularization to diagonals --- End diff -- done --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82996998 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => --- End diff -- It's better to move this as a class member variable, and add corresponding R code to illustrate how to reproduce the output. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82979581 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -201,11 +476,13 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext for (fitIntercept <- Seq(false, true); regParam <- Seq(0.0, 0.1, 1.0); standardizeFeatures <- Seq(false, true)) { - val wls = new WeightedLeastSquares( -fitIntercept, regParam, standardizeFeatures, standardizeLabel = true) -.fit(instances) - val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) - assert(actual ~== expected(idx) absTol 1e-4) + for (solver <- WeightedLeastSquares.supportedSolvers) { --- End diff -- Update name of this test cast to ```WLS against glmnet with L2 regularization```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82999864 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i --- End diff -- off-topic: ```w_i``` should be a typo, since we use the ScalaDoc syntax at here. I'm prefer to change the formula to latex format like what we did at [LogisticRegression](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1279). The ```WeightedLeastSquares``` is a private class and will not generate public doc to users. So it's better to use latex format and it's more easy to be understand by developers. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83004676 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 +while (j < numFeatures) { + val aStdJ = aStd(j) + var i = 0 + while (i <= j) { +val aStdI = aStd(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarStd(kk) = 0.0 +} else { + aaBarStd(kk) = aaBarValues(kk) / (aStdI * aStdJ) +} +kk += 1 +i += 1 } + j += 1 } +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + // add regularization to diagonals --- End diff -- ```add L2 regularization to diagonals``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82976378 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => + (sum + w * l, count + w) +} +val bBar = bAgg._1 / bAgg._2 +val wls = new WeightedLeastSquares(true, 10, 1.0, true, true) +val model = wls.fit(instances) +assert(model.intercept ~== bBar relTol 1e-6) + } + + test("diagonal inverse of AtWA") { +/* + A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2) --- End diff -- Add ```library(Matrix)``` can help users or developers to reproduce this example easily. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82978203 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => + (sum + w * l, count + w) +} +val bBar = bAgg._1 / bAgg._2 +val wls = new WeightedLeastSquares(true, 10, 1.0, true, true) +val model = wls.fit(instances) +assert(model.intercept ~== bBar relTol 1e-6) + } + + test("diagonal inverse of AtWA") { +/* + A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2) + w <- c(1, 2, 3, 4) + W <- Diagonal(length(w), w) + A.intercept <- cbind(A, rep.int(1, length(w))) + AtA.intercept <- t(A.intercept) %*% W %*% A.intercept + inv.intercept <- solve(AtA.intercept) + print(diag(inv.intercept)) + [1] 4.02 0.50 12.02 + + AtA <- t(A) %*% W %*% A + inv <- solve(AtA) + print(diag(inv)) + [1] 0.48336106 0.02079867 + + */ +val expectedWithIntercept = Vectors.dense(4.02, 0.50, 12.02) +val expected = Vectors.dense(0.48336106, 0.02079867) +val wlsWithIntercept = new WeightedLeastSquares(true, 0.0, 0.0, true, true, --- End diff -- It's better to use arguments with name (here and other places), such as ```new WeightedLeastSquares(fitIntercept = true, regParam = 0.0, ... ``` which should be better to understand. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83013547 --- Diff: mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala --- @@ -757,8 +761,9 @@ class LinearRegressionSuite assert(model.summary.meanAbsoluteError ~== 0.07961668 relTol 1E-4) assert(model.summary.r2 ~== 0.9998737 relTol 1E-4) - // Normal solver uses "WeightedLeastSquares". This algorithm does not generate - // objective history because it does not run through iterations. + // Normal solver uses "WeightedLeastSquares". When no regularization is applied, --- End diff -- ```When no regularization or L2 regularization is applied``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83021809 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[DenseVector], +val objectiveHistory: Option[Array[Double]]) { + + def coefficients: DenseVector = { +if (fitIntercept) { + new DenseVector(_coefficients.slice(0, _coefficients.length - 1)) +} else { + new DenseVector(_coefficients) +} + } + + def intercept: Double = if (fitIntercept) _coefficients.last else 0.0 +} + +/** + * Interface for classes that solve the normal equations locally. + */ +private[ml] sealed trait NormalEquationSolver { + + /** Solve the normal equations from summary statistics. */ + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution +} + +/** + * A class that solves the normal equations directly, using Cholesky decomposition. + */ +private[ml] class CholeskySolver(val fitIntercept: Boolean) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val k = abBar.size +val x = CholeskyDecomposition.solve(aaBar.values, abBar.values) +val aaInv = CholeskyDecomposition.inverse(aaBar.values, k) + +new NormalEquationSolution(fitIntercept, x, Some(new DenseVector(aaInv)), None) + } +} + +/** + * A class for solving the normal equations using Quasi-Newton optimization methods. + */ +private[ml] class QuasiNewtonSolver( --- End diff -- That's OK. We can change it later if necessary, it's private. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82980129 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train + // in the standardized space now, it will fail + intercept[SingularMatrixException] { +wls.fit(constantFeatures) + } +} + +// should not fail when regularization is added +new WeightedLeastSquares(true, 0.5, 0.0, standardizeFeatures = true, + standardizeLabel = true, solverType = WeightedLeastSquares.Cholesky).fit(constantFeatures) + +/* + for (intercept in c(FALSE, TRUE)) { +for (standardize in c(FALSE, TRUE)) { + for (regParams in list(c(0.0, 0.0), c(0.5, 0.0), c(0.5, 0.5), c(0.5, 1.0))) { +model <- glmnet(A, b, weights=w, intercept=intercept, lambda=regParams[1], + standardize=standardize, alpha=regParams[2], thresh=1E-14) +print(as.vector(coef(model))) + } +} + } + [1] 0.00 0.00 2.253012 + [1] 0.00 0.00 2.250857 + [1] 0.00 0.00 2.249784 + [1] 0.00 0.00 2.248709 + [1] 0.00 0.00
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83001469 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. --- End diff -- Should we document that we always solve the normal equations in the standardized space and convert back to the original space for output? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83027710 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -217,7 +215,7 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String $(featuresCol), summaryModel, model.diagInvAtWA.toArray, -Array(0D)) +model.objectiveHistory) --- End diff -- Sorry for mistake, I agree it should be equivalent. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83016461 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 +while (j < numFeatures) { + val aStdJ = aStd(j) + var i = 0 + while (i <= j) { +val aStdI = aStd(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarStd(kk) = 0.0 +} else { + aaBarStd(kk) = aaBarValues(kk) / (aStdI * aStdJ) +} +kk += 1 +i += 1 } + j += 1 } +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + // add regularization to diagonals var i = 0 -var j = 2 +j = 2 while (i < triK) { - var lambda = regParam - if (standardizeFeatures) { -lambda *= aVar(j - 2) + var lambda = effectiveL2RegParam + if (!standardizeFeatures) { +val std = aStd(j - 2) +if (std != 0.0) { + lambda /= (std * std) +} else { + lambda = 0.0 +} } - if (standardizeLabel && bStd != 0) { -lambda /= bStd + if
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82980460 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( --- End diff -- It's better to move the dataset as a member variable of the class like ```instancesConstLabel```, then other test cases can leverage it later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83006722 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 --- End diff -- I'd like to use ```p``` or ```q``` rather than ```kk```. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82975032 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => --- End diff -- ```count``` -> ```weightSum``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83006938 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 +while (j < numFeatures) { + val aStdJ = aStd(j) + var i = 0 + while (i <= j) { +val aStdI = aStd(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarStd(kk) = 0.0 +} else { + aaBarStd(kk) = aaBarValues(kk) / (aStdI * aStdJ) +} +kk += 1 +i += 1 } + j += 1 } +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + // add regularization to diagonals var i = 0 -var j = 2 +j = 2 while (i < triK) { - var lambda = regParam - if (standardizeFeatures) { -lambda *= aVar(j - 2) + var lambda = effectiveL2RegParam + if (!standardizeFeatures) { +val std = aStd(j - 2) +if (std != 0.0) { + lambda /= (std * std) +} else { + lambda = 0.0 +} } - if (standardizeLabel && bStd != 0) { -lambda /= bStd + if
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83004415 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) --- End diff -- Here we convert ```aBar, abBar, aaBar``` to the standardized space ```aBarStd, abBarStd, aaBarStd```, could we update them in place rather than reallocate new arrays? I found we are no longer using the original data in the following code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83022418 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[DenseVector], +val objectiveHistory: Option[Array[Double]]) { + + def coefficients: DenseVector = { +if (fitIntercept) { + new DenseVector(_coefficients.slice(0, _coefficients.length - 1)) +} else { + new DenseVector(_coefficients) +} + } + + def intercept: Double = if (fitIntercept) _coefficients.last else 0.0 +} + +/** + * Interface for classes that solve the normal equations locally. + */ +private[ml] sealed trait NormalEquationSolver { + + /** Solve the normal equations from summary statistics. */ + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution +} + +/** + * A class that solves the normal equations directly, using Cholesky decomposition. + */ +private[ml] class CholeskySolver(val fitIntercept: Boolean) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val k = abBar.size +val x = CholeskyDecomposition.solve(aaBar.values, abBar.values) +val aaInv = CholeskyDecomposition.inverse(aaBar.values, k) + +new NormalEquationSolution(fitIntercept, x, Some(new DenseVector(aaInv)), None) + } +} + +/** + * A class for solving the normal equations using Quasi-Newton optimization methods. + */ +private[ml] class QuasiNewtonSolver( +val fitIntercept: Boolean, +maxIter: Int, +tol: Double, +l1RegFunc: Option[(Int) => Double]) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val numFeatures = aBar.size +val numFeaturesPlusIntercept = if (fitIntercept) numFeatures + 1 else numFeatures +val initialCoefficientsWithIntercept = new Array[Double](numFeaturesPlusIntercept) +if (fitIntercept) { + initialCoefficientsWithIntercept(numFeaturesPlusIntercept - 1) = bBar +} + +val costFun = + new NormalEquationCostFun(bBar, bbBar, abBar, aaBar, aBar, fitIntercept, numFeatures) +val optimizer = l1RegFunc.map { func => + new BreezeOWLQN[Int, BDV[Double]](maxIter, 10, func, tol) +}.getOrElse(new BreezeLBFGS[BDV[Double]](maxIter, 10, tol)) + +val states = optimizer.iterations(new CachedDiffFunction(costFun), + new BDV[Double](initialCoefficientsWithIntercept)) + +val arrayBuilder = mutable.ArrayBuilder.make[Double] +var state: optimizer.State = null +while (states.hasNext) { + state = states.next() + arrayBuilder += state.adjustedValue +} +val x = state.x.toArray.clone() +new NormalEquationSolution(fitIntercept, x, None, Some(arrayBuilder.result())) + } + + /** + * NormalEquationCostFun implements Breeze's DiffFunction[T] for the normal equation. + * It returns the loss and
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83021228 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala --- @@ -244,6 +244,15 @@ private[spark] object BLAS extends Serializable { } /** + * y += alpha * A * x + * + * @param A The upper triangular part of A in a [[DenseVector]] (column major) + */ + def dspmv(n: Int, alpha: Double, A: DenseVector, x: DenseVector, y: DenseVector): Unit = { --- End diff -- Yeah, I think developers will check linalg.BLAS to find functions satisfy their requirements. The annotation will tell them how these functions can do, so they may overlook this function if they need to calculate ```y := alpha*A*x + beta*y```. So I think it's better to use the complete formula. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83010921 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train + // in the standardized space now, it will fail + intercept[SingularMatrixException] { +wls.fit(constantFeatures) + } +} + +// should not fail when regularization is added +new WeightedLeastSquares(true, 0.5, 0.0, standardizeFeatures = true, + standardizeLabel = true, solverType = WeightedLeastSquares.Cholesky).fit(constantFeatures) --- End diff -- Should we add output check for Cholesky solver? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83025262 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[DenseVector], --- End diff -- I'd use ```Option[Array[Double]]``` to represent ```aaInv```, since we will scale it to original space at ```WeightedLeastSquares``` L263. BTW, should we use also expose ```coefficients``` as ```Array[Double]``` rather than ```DenseVector```? I found we did in-place update to scale back at ```WeightedLeastSquares``` L255. I'm more prefer to get coefficients in array by solver, then scale back and wrap it with ```DenseVector``` at upper layer. You can refer the ```LinearRegression``` solved by ```L-BFGS```. Only suggestion, not big deal. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r83020033 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -201,11 +476,13 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext for (fitIntercept <- Seq(false, true); regParam <- Seq(0.0, 0.1, 1.0); standardizeFeatures <- Seq(false, true)) { - val wls = new WeightedLeastSquares( -fitIntercept, regParam, standardizeFeatures, standardizeLabel = true) -.fit(instances) - val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) - assert(actual ~== expected(idx) absTol 1e-4) + for (solver <- WeightedLeastSquares.supportedSolvers) { +val wls = new WeightedLeastSquares(fitIntercept, regParam, elasticNetParam = 0.0, + standardizeFeatures, standardizeLabel = true, solverType = solver) + .fit(instances) +val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) +assert(actual ~== expected(idx) absTol 1e-4) + } idx += 1 } } --- End diff -- It looks like we have no test case for ```standardizeLabel = false```, is it necessary to add some? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82967639 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -286,6 +434,24 @@ private[ml] object WeightedLeastSquares { } /** + * Weighted population standard deviation of features. + */ +def aStd: DenseVector = { + val std = Array.ofDim[Double](k) + var i = 0 + var j = 2 + val aaValues = aaSum.values + while (i < triK) { +val l = j - 2 +val aw = aSum(l) / wSum +std(l) = math.sqrt(aaValues(i) / wSum - aw * aw) +i += j +j += 1 + } + new DenseVector(std) +} + +/** * Weighted population variance of features. */ def aVar: DenseVector = { --- End diff -- ```aVar``` is no longer used, shall we remove it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82979921 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -132,24 +197,234 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext var idx = 0 for (fitIntercept <- Seq(false, true)) { for (standardization <- Seq(false, true)) { -val wls = new WeightedLeastSquares( - fitIntercept, regParam = 0.0, standardizeFeatures = standardization, - standardizeLabel = standardization).fit(instancesConstLabel) -val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) -assert(actual ~== expected(idx) absTol 1e-4) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(fitIntercept, regParam = 0.0, elasticNetParam = 0.0, +standardizeFeatures = standardization, +standardizeLabel = standardization, solverType = solver).fit(instancesConstLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual ~== expected(idx) absTol 1e-4) +} } idx += 1 } + +// when label is constant zero, and fitIntercept is false, we should not train and get all zeros +val instancesConstZeroLabel = instancesConstLabel.map { case Instance(l, w, f) => + Instance(0.0, w, f) +} +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, solverType = solver) +.fit(instancesConstZeroLabel) + val actual = Vectors.dense(wls.intercept, wls.coefficients(0), wls.coefficients(1)) + assert(actual === Vectors.dense(0.0, 0.0, 0.0)) + assert(wls.objectiveHistory === Array(0.0)) +} } test("WLS with regularization when label is constant") { // if regParam is non-zero and standardization is true, the problem is ill-defined and // an exception is thrown. -val wls = new WeightedLeastSquares( - fitIntercept = false, regParam = 0.1, standardizeFeatures = true, - standardizeLabel = true) -intercept[IllegalArgumentException]{ - wls.fit(instancesConstLabel) +for (solver <- WeightedLeastSquares.supportedSolvers) { + val wls = new WeightedLeastSquares( +fitIntercept = false, regParam = 0.1, elasticNetParam = 0.0, standardizeFeatures = true, +standardizeLabel = true, solverType = solver) + intercept[IllegalArgumentException]{ +wls.fit(instancesConstLabel) + } +} + } + + test("WLS against glmnet with constant features") { +/* + R code: + + A <- matrix(c(1, 1, 1, 1, 5, 7, 11, 13), 4, 2) + b <- c(17, 19, 23, 29) + w <- c(1, 2, 3, 4) + */ +val constantFeatures = sc.parallelize(Seq( + Instance(17.0, 1.0, Vectors.dense(1.0, 5.0)), + Instance(19.0, 2.0, Vectors.dense(1.0, 7.0)), + Instance(23.0, 3.0, Vectors.dense(1.0, 11.0)), + Instance(29.0, 4.0, Vectors.dense(1.0, 13.0)) +), 2) + +// Cholesky solver does not handle singular input with no regularization +for (fitIntercept <- Seq(false, true); + standardization <- Seq(false, true)) { + val wls = new WeightedLeastSquares(fitIntercept, 0.0, 0.0, standardization, standardization, +solverType = WeightedLeastSquares.Cholesky) + // for the case of no intercept, this would not have failed before but since we train + // in the standardized space now, it will fail + intercept[SingularMatrixException] { +wls.fit(constantFeatures) + } +} + +// should not fail when regularization is added +new WeightedLeastSquares(true, 0.5, 0.0, standardizeFeatures = true, + standardizeLabel = true, solverType = WeightedLeastSquares.Cholesky).fit(constantFeatures) + +/* + for (intercept in c(FALSE, TRUE)) { +for (standardize in c(FALSE, TRUE)) { + for (regParams in list(c(0.0, 0.0), c(0.5, 0.0), c(0.5, 0.5), c(0.5, 1.0))) { +model <- glmnet(A, b, weights=w, intercept=intercept, lambda=regParams[1], + standardize=standardize, alpha=regParams[2], thresh=1E-14) +print(as.vector(coef(model))) + } +} + } + [1] 0.00 0.00 2.253012 + [1] 0.00 0.00 2.250857 + [1] 0.00 0.00 2.249784 + [1] 0.00 0.00 2.248709 + [1] 0.00 0.00
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82994352 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -60,24 +60,87 @@ class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext ), 2) } - test("two collinear features result in error with no regularization") { + test("WLS with strong L1 regularization") { +/* + We initialize the coefficients for WLS QN solver to be weighted average of the label. Check + here that with only an intercept the model converges to bBar. + */ +val bAgg = instances.collect().foldLeft((0.0, 0.0)) { case ((sum, count), Instance(l, w, f)) => + (sum + w * l, count + w) +} +val bBar = bAgg._1 / bAgg._2 +val wls = new WeightedLeastSquares(true, 10, 1.0, true, true) +val model = wls.fit(instances) +assert(model.intercept ~== bBar relTol 1e-6) + } + + test("diagonal inverse of AtWA") { +/* + A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2) + w <- c(1, 2, 3, 4) + W <- Diagonal(length(w), w) + A.intercept <- cbind(A, rep.int(1, length(w))) + AtA.intercept <- t(A.intercept) %*% W %*% A.intercept + inv.intercept <- solve(AtA.intercept) + print(diag(inv.intercept)) + [1] 4.02 0.50 12.02 + + AtA <- t(A) %*% W %*% A + inv <- solve(AtA) + print(diag(inv)) + [1] 0.48336106 0.02079867 + + */ +val expectedWithIntercept = Vectors.dense(4.02, 0.50, 12.02) +val expected = Vectors.dense(0.48336106, 0.02079867) +val wlsWithIntercept = new WeightedLeastSquares(true, 0.0, 0.0, true, true, + solverType = WeightedLeastSquares.Cholesky) +val wlsModelWithIntercept = wlsWithIntercept.fit(instances) +val wls = new WeightedLeastSquares(false, 0.0, 0.0, true, true, + solverType = WeightedLeastSquares.Cholesky) +val wlsModel = wls.fit(instances) + +assert(expectedWithIntercept ~== wlsModelWithIntercept.diagInvAtWA relTol 1e-4) +assert(expected ~== wlsModel.diagInvAtWA relTol 1e-4) + } + + test("two collinear features") { +/* + A <- matrix(c(1, 2, 3, 4, 2, 4, 6, 8), 4, 2) + b <- c(1, 2, 3, 4) + w <- c(1, 1, 1, 1) + */ val singularInstances = sc.parallelize(Seq( --- End diff -- Move this as a class member variable. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82648263 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -217,7 +215,7 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String $(featuresCol), summaryModel, model.diagInvAtWA.toArray, -Array(0D)) +model.objectiveHistory) --- End diff -- How is it different? Mathematically it should be the same. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82647560 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala --- @@ -244,6 +244,15 @@ private[spark] object BLAS extends Serializable { } /** + * y += alpha * A * x + * + * @param A The upper triangular part of A in a [[DenseVector]] (column major) + */ + def dspmv(n: Int, alpha: Double, A: DenseVector, x: DenseVector, y: DenseVector): Unit = { --- End diff -- I exposed it in this simplified form for now since it's all I needed. I think it's ok to leave it and add the other functionality when we need it in the future. But I can change it if you think it's best. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82647370 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[DenseVector], +val objectiveHistory: Option[Array[Double]]) { + + def coefficients: DenseVector = { +if (fitIntercept) { + new DenseVector(_coefficients.slice(0, _coefficients.length - 1)) +} else { + new DenseVector(_coefficients) +} + } + + def intercept: Double = if (fitIntercept) _coefficients.last else 0.0 +} + +/** + * Interface for classes that solve the normal equations locally. + */ +private[ml] sealed trait NormalEquationSolver { + + /** Solve the normal equations from summary statistics. */ + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution +} + +/** + * A class that solves the normal equations directly, using Cholesky decomposition. + */ +private[ml] class CholeskySolver(val fitIntercept: Boolean) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val k = abBar.size +val x = CholeskyDecomposition.solve(aaBar.values, abBar.values) +val aaInv = CholeskyDecomposition.inverse(aaBar.values, k) + +new NormalEquationSolution(fitIntercept, x, Some(new DenseVector(aaInv)), None) + } +} + +/** + * A class for solving the normal equations using Quasi-Newton optimization methods. + */ +private[ml] class QuasiNewtonSolver( --- End diff -- I think the fact that it extends `NormalEquationSolver` implies it is local. We can easily add another `QuasiNewtonSolver` that extends from some distributed implementation (or even change the name of this later if needed). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82504017 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/NormalEquationSolver.scala --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.optim + +import breeze.linalg.{DenseVector => BDV} +import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS => BreezeLBFGS, OWLQN => BreezeOWLQN} +import scala.collection.mutable + +import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vectors} +import org.apache.spark.mllib.linalg.CholeskyDecomposition + +private[ml] class NormalEquationSolution( +val fitIntercept: Boolean, +private val _coefficients: Array[Double], +val aaInv: Option[DenseVector], +val objectiveHistory: Option[Array[Double]]) { + + def coefficients: DenseVector = { +if (fitIntercept) { + new DenseVector(_coefficients.slice(0, _coefficients.length - 1)) +} else { + new DenseVector(_coefficients) +} + } + + def intercept: Double = if (fitIntercept) _coefficients.last else 0.0 +} + +/** + * Interface for classes that solve the normal equations locally. + */ +private[ml] sealed trait NormalEquationSolver { + + /** Solve the normal equations from summary statistics. */ + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution +} + +/** + * A class that solves the normal equations directly, using Cholesky decomposition. + */ +private[ml] class CholeskySolver(val fitIntercept: Boolean) extends NormalEquationSolver { + + def solve( + bBar: Double, + bbBar: Double, + abBar: DenseVector, + aaBar: DenseVector, + aBar: DenseVector): NormalEquationSolution = { +val k = abBar.size +val x = CholeskyDecomposition.solve(aaBar.values, abBar.values) +val aaInv = CholeskyDecomposition.inverse(aaBar.values, k) + +new NormalEquationSolution(fitIntercept, x, Some(new DenseVector(aaInv)), None) + } +} + +/** + * A class for solving the normal equations using Quasi-Newton optimization methods. + */ +private[ml] class QuasiNewtonSolver( --- End diff -- should ```LocalQuasiNewtonSolver``` be better? It's easy to confuse with distributed L-BFGS solver after we add optimizer interface at [SPARK-17136](https://issues.apache.org/jira/browse/SPARK-17136). Or other suggestion? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82503897 --- Diff: mllib-local/src/main/scala/org/apache/spark/ml/linalg/BLAS.scala --- @@ -244,6 +244,15 @@ private[spark] object BLAS extends Serializable { } /** + * y += alpha * A * x + * + * @param A The upper triangular part of A in a [[DenseVector]] (column major) + */ + def dspmv(n: Int, alpha: Double, A: DenseVector, x: DenseVector, y: DenseVector): Unit = { --- End diff -- Actually BLAS DSPMV performs matrix-vector operation ```y := alpha*A*x + beta*y```, should we provide the same function as original BLAS? Since it would be used by other MLlib function in the future. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82503779 --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala --- @@ -217,7 +215,7 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String $(featuresCol), summaryModel, model.diagInvAtWA.toArray, -Array(0D)) +model.objectiveHistory) --- End diff -- I think the ```objectiveHistory``` get from normal equation solver is different from distributed L-BFGS solver, so it may confuse users. We should handle ```objectiveHistory``` differently when we use it(I saw only used in linear regression summary), or may be just better documents? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82503417 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. * * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to * match R's `lm`. * Turn on [[standardizeLabel]] to match R's `glmnet`. * * @param fitIntercept whether to fit intercept. If false, z is 0.0. - * @param regParam L2 regularization parameter (lambda) + * @param regParam L2 regularization parameter (lambda). --- End diff -- ```regParam``` applicable for all including L2, L1, elasticNet. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82503493 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. * * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to * match R's `lm`. * Turn on [[standardizeLabel]] to match R's `glmnet`. * * @param fitIntercept whether to fit intercept. If false, z is 0.0. - * @param regParam L2 regularization parameter (lambda) + * @param regParam L2 regularization parameter (lambda). + * @param elasticNetParam the ElasticNet mixing parameter. * @param standardizeFeatures whether to standardize features. If true, sigma_,,j,, is the *population standard deviation of the j-th column of A. Otherwise, *sigma,,j,, is 1.0. * @param standardizeLabel whether to standardize label. If true, delta is the population standard * deviation of the label column b. Otherwise, delta is 1.0. + * @param solverType the type of solver to use for optimization. + * @param maxIter maximum number of iterations when stochastic optimization is used. + * @param tol the convergence tolerance of the iterations when stochastic optimization is used. --- End diff -- Ditto --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82503487 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -45,34 +47,48 @@ private[ml] class WeightedLeastSquaresModel( * formulation: * * min,,x,z,, 1/2 sum,,i,, w,,i,, (a,,i,,^T^ x + z - b,,i,,)^2^ / sum,,i,, w_i - * + 1/2 lambda / delta sum,,j,, (sigma,,j,, x,,j,,)^2^, + * + lambda / delta (1/2 (1 - alpha) sum,,j,, (sigma,,j,, x,,j,,)^2^ + * + alpha sum,,j,, abs(sigma,,j,, x,,j,,)), * - * where lambda is the regularization parameter, and delta and sigma,,j,, are controlled by - * [[standardizeLabel]] and [[standardizeFeatures]], respectively. + * where lambda is the regularization parameter, alpha is the ElasticNet mixing parameter, + * and delta and sigma,,j,, are controlled by [[standardizeLabel]] and [[standardizeFeatures]], + * respectively. * * Set [[regParam]] to 0.0 and turn off both [[standardizeFeatures]] and [[standardizeLabel]] to * match R's `lm`. * Turn on [[standardizeLabel]] to match R's `glmnet`. * * @param fitIntercept whether to fit intercept. If false, z is 0.0. - * @param regParam L2 regularization parameter (lambda) + * @param regParam L2 regularization parameter (lambda). + * @param elasticNetParam the ElasticNet mixing parameter. * @param standardizeFeatures whether to standardize features. If true, sigma_,,j,, is the *population standard deviation of the j-th column of A. Otherwise, *sigma,,j,, is 1.0. * @param standardizeLabel whether to standardize label. If true, delta is the population standard * deviation of the label column b. Otherwise, delta is 1.0. + * @param solverType the type of solver to use for optimization. + * @param maxIter maximum number of iterations when stochastic optimization is used. --- End diff -- Clarify it more clearly? only applicable for ```QuasiNewtonSolver```? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/15394#discussion_r82466110 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -85,73 +101,193 @@ private[ml] class WeightedLeastSquares( val triK = summary.triK val wSum = summary.wSum val bBar = summary.bBar -val bStd = summary.bStd +val bbBar = summary.bbBar val aBar = summary.aBar -val aVar = summary.aVar +val aStd = summary.aStd val abBar = summary.abBar val aaBar = summary.aaBar -val aaValues = aaBar.values - -if (bStd == 0) { - if (fitIntercept) { -logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + - s"zeros and the intercept will be the mean of the label; as a result, " + - s"training is not needed.") -val coefficients = new DenseVector(Array.ofDim(k-1)) +val aaBarValues = aaBar.values +val numFeatures = abBar.size +val rawBStd = summary.bStd +// if b is constant (rawBStd is zero), then b cannot be scaled. In this case +// setting bStd=abs(bBar) ensures that b is not scaled anymore in l-bfgs algorithm. +val bStd = if (rawBStd == 0.0) math.abs(bBar) else rawBStd + +if (rawBStd == 0) { + if (fitIntercept || bBar == 0.0) { +if (bBar == 0.0) { + logWarning(s"Mean and standard deviation of the label are zero, so the coefficients " + +s"and the intercept will all be zero; as a result, training is not needed.") +} else { + logWarning(s"The standard deviation of the label is zero, so the coefficients will be " + +s"zeros and the intercept will be the mean of the label; as a result, " + +s"training is not needed.") +} +val coefficients = new DenseVector(Array.ofDim(numFeatures)) val intercept = bBar val diagInvAtWA = new DenseVector(Array(0D)) -return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA) +return new WeightedLeastSquaresModel(coefficients, intercept, diagInvAtWA, Array(0D)) + } else { +require(!(regParam > 0.0 && standardizeLabel), "The standard deviation of the label is " + + "zero. Model cannot be regularized with standardization=true") +logWarning(s"The standard deviation of the label is zero. Consider setting " + + s"fitIntercept=true.") + } +} + +val aBarStd = new Array[Double](numFeatures) +var j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +aBarStd(j) = 0.0 } else { -require(!(regParam > 0.0 && standardizeLabel), - "The standard deviation of the label is zero. " + -"Model cannot be regularized with standardization=true") -logWarning(s"The standard deviation of the label is zero. " + - "Consider setting fitIntercept=true.") +aBarStd(j) = aBar(j) / aStd(j) + } + j += 1 +} + +val abBarStd = new Array[Double](numFeatures) +j = 0 +while (j < numFeatures) { + if (aStd(j) == 0.0) { +abBarStd(j) = 0.0 + } else { +abBarStd(j) = abBar(j) / (aStd(j) * bStd) + } + j += 1 +} + +val aaBarStd = new Array[Double](triK) +j = 0 +var kk = 0 +while (j < numFeatures) { + val aStdJ = aStd(j) + var i = 0 + while (i <= j) { +val aStdI = aStd(i) +if (aStdJ == 0.0 || aStdI == 0.0) { + aaBarStd(kk) = 0.0 +} else { + aaBarStd(kk) = aaBarValues(kk) / (aStdI * aStdJ) +} +kk += 1 +i += 1 } + j += 1 } +val bBarStd = bBar / bStd +val bbBarStd = bbBar / (bStd * bStd) + +val effectiveRegParam = regParam / bStd +val effectiveL1RegParam = elasticNetParam * effectiveRegParam +val effectiveL2RegParam = (1.0 - elasticNetParam) * effectiveRegParam + // add regularization to diagonals var i = 0 -var j = 2 +j = 2 while (i < triK) { - var lambda = regParam - if (standardizeFeatures) { -lambda *= aVar(j - 2) + var lambda = effectiveL2RegParam + if (!standardizeFeatures) { +val std = aStd(j - 2) +if (std != 0.0) { + lambda /= (std * std) +} else { + lambda = 0.0 +} } - if (standardizeLabel && bStd != 0) { -lambda /= bStd + if
[GitHub] spark pull request #15394: [SPARK-17749][ML] One pass solver for Weighted Le...
GitHub user sethah opened a pull request: https://github.com/apache/spark/pull/15394 [SPARK-17749][ML] One pass solver for Weighted Least Squares with ElasticNet ## What changes were proposed in this pull request? 1. Make a pluggable solver interface for `WeightedLeastSquares` 2. Add a `QuasiNewton` solver to handle elastic net regularization for `WeightedLeastSquares` 3. Add method `BLAS.dspmv` used by QN solver 4. Add mechanism for WLS to handle singular covariance matrices by falling back to QN solver when Cholesky fails. ## How was this patch tested? Unit tests - see below. ## Design choices **Pluggable Normal Solver** Before, the `WeightedLeastSquares` package always used the Cholesky decomposition solver to compute the solution to the normal equations. Now, we specify the solver as a constructor argument to the `WeightedLeastSquares`. We introduce a new trait: scala private[ml] sealed trait NormalEquationSolver { def solve( bBar: Double, bbBar: Double, abBar: DenseVector, aaBar: DenseVector, aBar: DenseVector): NormalEquationSolution } We extend this trait for different variants of normal equation solvers. In the future, we can easily add others (like QR) using this interface. **Always train in the standardized space** The normal solver did not previously standardize the data, but this patch introduces a change such that we always solve the normal equations in the standardized space. We convert back to the original space in the same way that is done for distributed L-BFGS/OWL-QN. We add test cases for zero variance features/labels. **Use L-BFGS locally to solve normal equations for singular matrix** When linear regression with the normal solver is called for a singular matrix, we initially try to solve with Cholesky. We use the output of `lapack.dppsv` to determine if the matrix is singular. If it is, we fall back to using L-BFGS locally to solve the normal equations. We add test cases for this as well. ## Test cases I found it helpful to enumerate some of the test cases and hopefully it makes review easier. **WeightedLeastSquares** 1. Constant columns - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully. 2. Collinear features - Cholesky solver fails with no regularization, Auto solver falls back to QN, and QN trains successfully. 3. Label is constant zero - no training is performed regardless of intercept. Coefficients are zero and intercept is zero. 4. Label is constant - if fitIntercept, then no training is performed and intercept equals label mean. If not fitIntercept, then we train and return an answer that matches R's lm package. 5. Test with L1 - go through various combinations of L1/L2, standardization, fitIntercept and verify that output matches glmnet. 6. Initial intercept - verify that setting the initial intercept to label mean is correct by training model with strong L1 regularization so that all coefficients are zero and intercept converges to label mean. 7. Test diagInvAtWA - since we are standardizing features now during training, we should test that the inverse is computed to match R. **LinearRegression** 1. For all existing L1 test cases, test the "normal" solver too. 2. Check that using the normal solver now handles singular matrices. 3. Check that using the normal solver with L1 produces an objective history in the model summary, but does not produce the inverse of AtA. **BLAS** 1. Test new method `dspmv`. ## Performance Testing This patch will speed up linear regression with L1/elasticnet penalties when the feature size is < 4096. I have not conducted performance tests at scale, only observed by testing locally that there is a speed improvement. We should decide if this PR needs to be blocked before performance testing is conducted. You can merge this pull request into a Git repository by running: $ git pull https://github.com/sethah/spark SPARK-17748 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15394.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15394 commit c562823e96d63773b0a81e0b71e53f8ceca96680 Author: sethahDate: 2016-09-25T02:30:41Z one pass solver for elasticnet --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org