[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021787#comment-15021787 ]

Sean Owen edited comment on SPARK-11918 at 9/20/16 9:54 PM:

[~yanboliang] Yes, this is true in general of ill-conditioned problems. What are you proposing? To propagate the error from LAPACK in a different way? To check the condition number? Roughly speaking, it's the correct behavior, in that there's no real answer here.

EDIT to my old comment: I don't think that's accurate. It's possible to return a 'best' answer in at least some of the cases that trigger this problem, such as two identical features.

> WLS can not resolve some kinds of equation
> ------------------------------------------
>
>                 Key: SPARK-11918
>                 URL: https://issues.apache.org/jira/browse/SPARK-11918
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Yanbo Liang
>            Priority: Minor
>              Labels: starter
>         Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving
> Linear Regression (when #feature < 4096). But if the dataset is very
> ill-conditioned (such as a 0-1 based label used for classification, with an
> underdetermined equation), WLS fails (but "l-bfgs" can train and get the
> model). The failure is caused by the underlying LAPACK library returning an
> error value during the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the
> "normal" solver on the example
> dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
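
For readers following the thread, here is a minimal sketch of why a Cholesky factorization gives up on a singular Gramian. It is plain Scala with no Spark or LAPACK involved; `CholeskyCheck`, `gram`, `cholesky`, and the `tol` threshold are hypothetical names chosen for illustration, not part of Spark's API. The `Left(i)` result only loosely mirrors the positive error code that LAPACK's Cholesky routines return when a leading minor is not positive definite.

```scala
object CholeskyCheck {
  /** Gramian X^T X of a row-major data matrix (one example per row). */
  def gram(x: Array[Array[Double]]): Array[Array[Double]] = {
    val p = x.head.length
    Array.tabulate(p, p)((i, j) => x.map(row => row(i) * row(j)).sum)
  }

  /** Attempts A = L * L^T for a symmetric matrix A.
    * Returns Right(L) on success, or Left(i) with the 1-based index of the
    * first pivot that is not safely positive (assumed tolerance `tol`).
    */
  def cholesky(a: Array[Array[Double]],
               tol: Double = 1e-12): Either[Int, Array[Array[Double]]] = {
    val n = a.length
    val l = Array.ofDim[Double](n, n)
    var j = 0
    while (j < n) {
      // Pivot: a(j)(j) minus the squares already placed in row j of L.
      val d = a(j)(j) - (0 until j).map(k => l(j)(k) * l(j)(k)).sum
      if (d <= tol) return Left(j + 1) // not positive definite: stop here
      l(j)(j) = math.sqrt(d)
      for (i <- j + 1 until n) {
        val s = a(i)(j) - (0 until j).map(k => l(i)(k) * l(j)(k)).sum
        l(i)(j) = s / l(j)(j)
      }
      j += 1
    }
    Right(l)
  }

  def main(args: Array[String]): Unit = {
    // Two identical features: the Gramian has rank 1.
    val duplicated = Array(Array(1.0, 1.0), Array(2.0, 2.0), Array(3.0, 3.0))
    // Two independent features: the Gramian is positive definite.
    val independent = Array(Array(1.0, 0.0), Array(0.0, 1.0), Array(1.0, 1.0))
    println(cholesky(gram(duplicated)).isLeft)
    println(cholesky(gram(independent)).isRight)
  }
}
```

With the duplicated feature, the second pivot cancels to (numerically) zero, so the sketch stops at index 2 instead of producing a factor. A column of zeros fails the same way.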
[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15127437#comment-15127437 ]

Imran Younus edited comment on SPARK-11918 at 2/2/16 2:12 AM:
--------------------------------------------------------------

Several columns in the given dataset contain only zeros. In this case, the data matrix is not full rank. Therefore the Gramian matrix is singular and hence not invertible, and the Cholesky decomposition will fail. This will also happen if the standard deviation of more than one column is zero (even if the values are not zero).

I think we should catch this error in the code and exit with a warning message. Or we can drop columns with zero variance and continue with the algorithm.
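
The second option in the comment above (drop zero-variance columns, then continue) could look roughly like the following. This is a sketch in plain Scala over a small in-memory matrix, not Spark code; `ConstantColumns`, `nonConstantColumns`, `keepColumns`, and the `tol` threshold are hypothetical names introduced here for illustration.

```scala
object ConstantColumns {
  /** Indices of columns whose (population) variance exceeds `tol`.
    * `x` is row-major: one example per row, one feature per column.
    */
  def nonConstantColumns(x: Array[Array[Double]],
                         tol: Double = 1e-12): Seq[Int] = {
    val n = x.length
    val p = x.head.length
    (0 until p).filter { j =>
      val column = x.map(row => row(j))
      val mean = column.sum / n
      val variance = column.map(v => (v - mean) * (v - mean)).sum / n
      variance > tol
    }
  }

  /** Copy of `x` restricted to the columns listed in `keep`. */
  def keepColumns(x: Array[Array[Double]], keep: Seq[Int]): Array[Array[Double]] =
    x.map(row => keep.map(j => row(j)).toArray)

  def main(args: Array[String]): Unit = {
    // Column 0 is all zeros, column 1 is constant but non-zero, column 2 varies.
    val x = Array(Array(0.0, 3.0, 1.0), Array(0.0, 3.0, 2.0), Array(0.0, 3.0, 4.0))
    val keep = nonConstantColumns(x)
    println(keep.mkString(","))
    println(keepColumns(x, keep).map(_.mkString(",")).mkString(" | "))
  }
}
```

Whichever columns are dropped would need to be remembered so that their coefficients can be reported (for example as zero) in the final model. That bookkeeping is left out of the sketch.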
[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729 ]

Yanbo Liang edited comment on SPARK-11918 at 11/23/15 8:31 AM:
---------------------------------------------------------------

Furthermore, I used the Breeze library to train the model locally with the normal equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.functions.col
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext,
  "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()
// 100 examples x 692 features, flattened one example after another.
val features = df.select(col("features"))
  .map { r => r.getAs[Vector](0) }.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r => r.getDouble(0) }.collect()

// Breeze matrices are column-major, so each column of Xt is one example.
val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t
val y = new DenseMatrix[Double](100, 1, labelArray)

// Normal equation: coefs = (X^T X)^{-1} X^T y
val XtXi = inv(Xt * X)
val XtY = Xt * y
val coefs = XtXi * XtY
println(coefs.toString)
{code}
It also throws an exception like:
{code}
breeze.linalg.MatrixSingularException:
  at breeze.linalg.inv$$anon$1.apply(inv.scala:36)
  at breeze.linalg.inv$$anon$1.apply(inv.scala:19)
  at breeze.generic.UFunc$class.apply(UFunc.scala:48)
  at breeze.linalg.inv$.apply(inv.scala:17)
{code}
breeze.linalg.inv also calls the netlib LAPACK package, which is the same library Spark uses. Tracing the Breeze code, we can see that this exception is thrown here (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33), and it is also caused by the underlying LAPACK error.

> WLS can not resolve some kinds of equation
> ------------------------------------------
>
>                 Key: SPARK-11918
>                 URL: https://issues.apache.org/jira/browse/SPARK-11918
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>         Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving
> Linear Regression (when #feature < 4096). But if the dataset is very
> ill-conditioned (such as a 0-1 based label used for classification, with an
> underdetermined equation), WLS fails. The failure is caused by the underlying
> Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the
> "normal" solver on the example
> dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
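
The matrix shapes in the Breeze snippet (692 features, 100 examples) already suggest why `inv` cannot succeed, independent of any particular column being zero. A short rank argument, assuming only those shapes:

```latex
\begin{align*}
X &\in \mathbb{R}^{100 \times 692}, \qquad X^{\top} X \in \mathbb{R}^{692 \times 692}, \\
\operatorname{rank}\!\left(X^{\top} X\right) &= \operatorname{rank}(X) \le \min(100,\, 692) = 100 < 692 .
\end{align*}
```

So $X^{\top} X$ is singular whatever the data values are, and the normal-equation estimate $\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y$ is undefined because the inverse does not exist. This is the "underdetermined" case mentioned in the issue description: there are more unknown coefficients than examples.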
[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729 ]

Yanbo Liang edited comment on SPARK-11918 at 11/23/15 8:44 AM:
---------------------------------------------------------------

Furthermore, I used the Breeze library to train the model locally with the normal equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.functions.col
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext,
  "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()
// 100 examples x 692 features, flattened one example after another.
val features = df.select(col("features"))
  .map { r => r.getAs[Vector](0) }.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r => r.getDouble(0) }.collect()

// Breeze matrices are column-major, so each column of Xt is one example.
val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t
val y = new DenseMatrix[Double](100, 1, labelArray)

// Normal equation: coefs = (X^T X)^{-1} X^T y
val XtXi = inv(Xt * X)
val XtY = Xt * y
val coefs = XtXi * XtY
println(coefs.toString)
{code}
It also throws an exception like:
{code}
breeze.linalg.MatrixSingularException:
  at breeze.linalg.inv$$anon$1.apply(inv.scala:36)
  at breeze.linalg.inv$$anon$1.apply(inv.scala:19)
  at breeze.generic.UFunc$class.apply(UFunc.scala:48)
  at breeze.linalg.inv$.apply(inv.scala:17)
{code}
breeze.linalg.inv also calls the netlib LAPACK library, which is the same one Spark uses. Tracing the Breeze code, we can see that this exception is thrown here (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33), also caused by the underlying LAPACK error.