[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15593 Thanks @dbtsai for your reply - yes that makes things clear. I didn't realise we can use a Breeze matrix in the optimizers. That will definitely help. I agree we should try to do that and simplify & clarify the code. Hopefully you or @sethah has time to do that for 2.1 but if not it'll be fine for 2.1.1 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 Thanks @dbtsai! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/15593 Thanks all for working on this PR. I merged this into master, and I'll create a followup task and PR to handle the abstraction together with handling the smoothing in the initialization of coefficients. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 Thanks for the detailed explanation @dbtsai. +1 for doing this in a separate PR, since I'd imagine we want to run all the performance tests again as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/15593 @MLnick I'm hoping that we could abstract out the the implementation of using column major format as much as possible; as a result, in the future, new developers can understand the code without thinking hard. I agree that we have good comments and documentation in the code now, but having said that, with some effort, we are able to hide the detail using column major matrix. For example, let's say we have coeffs and gradient in column major matrices, we will be able to write L2 part of gradient in the following code without computing `isIntercept` and `featureIndex` using complicated logics. ```scala val coeffMatrix = new DenseMatrix(numClasses, numFeatures, Array(), false) val totalGradientMatrix = new DenseMatrix(numClasses, numFeatures, Array(), false) coeffMatrix.foreachActive { case (classIndex, featureIndex, value) => // We do not apply regularization to the intercepts val isIntercept = fitIntercept && featureIndex == numFeatures if (!isIntercept) { // The following code will compute the loss of the regularization; also // the gradient of the regularization, and add back to totalGradientArray. sum += { if (standardization) { totalGradientMatrix.update(classIndex, featureIndex, totalGradientMatrix(classIndex, featureIndex) + regParamL2 * value) value * value } else { if (featuresStd(featureIndex) != 0.0) { // If `standardization` is false, we still standardize the data // to improve the rate of convergence; as a result, we have to // perform this reverse standardization by penalizing each component // differently to get effectively the same objective function when // the training dataset is not standardized. val temp = value / (featuresStd(featureIndex) * featuresStd(featureIndex)) totalGradientMatrix.update(classIndex, featureIndex, totalGradientMatrix(classIndex, featureIndex) + regParamL2 * temp) value * temp } else { 0.0 } } } } } 0.5 * regParamL2 * sum } ``` For L1 part, I'm thinking to do the following, ```scala new BreezeOWLQN[(Int, Int), BDM[Double]]($(maxIter), 10, regParamL1Fun, $(tol)) ``` then we can implement L1 function as ```scala def regParamL1Fun = (classIndex: Int, featureIndex: Int) => { // Remove the L1 penalization on the intercept val isIntercept = fitIntercept && featureIndex == numFeatures if (isIntercept) { 0.0 } else { if (standardizationParam) { regParamL1 } else { if (featuresStd(featureIndex) != 0.0) { regParamL1 / featuresStd(featureIndex) } else { 0.0 } } } } ``` By having coeffs and gradientSum in matrix, we also open up a opportunity to having both of them in sparse representation. With high dimensional problems, often times, both of them can be very sparse. I'm also aware that we're cutting 2.1 release now, and this is an important performance improvement, so I'm open to deal with those cleaning up work in a separate PR so at least we're able to merge this. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user MLnick commented on the issue: https://github.com/apache/spark/pull/15593 @dbtsai I understand wanting to make things clear & easy to understand for current and future developers. Of course I would like the same. However, it's currently not clear to me how (or if) your proposal above would work to achieve that goal. Concrete example code would help a lot here (I am probably missing something in my interpretation). To be honest, even the current code is difficult to understand at first (or even third!) glance for a developer unfamiliar with the implementation (perhaps we can work to improve that, perhaps it can't be improved that much given we have to work with arrays, low-level performance and the Breeze optimizers). I don't think the col-major layout during training changes this situation significantly over row-major. As long as it is clearly and correctly documented I think it is ok (if the doc comments added here need improvement we can make changes). (By the way, the row- or col-major layout shouldn't make a difference for BLoR, and indeed the BLoR code is not really touched. The indexing may look different but is in practice the same). I've worked through the impl in the PR and to me the changes look good. The tests are pretty comprehensive and are a good check on things so we can be comfortable. This is an important performance fix that I think must be in 2.1. If you can show a way to achieve the same but have more clarity & simplicity, then I'm all for that, but I think that can be done in a separate JIRA (even after 2.1 if need be). Let me know if you disagree. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68420/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #68420 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68420/consoleFull)** for PR 15593 at commit [`c003ee9`](https://github.com/apache/spark/commit/c003ee98ea50f0606046298ae412ca9f0a752566). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #68420 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68420/consoleFull)** for PR 15593 at commit [`c003ee9`](https://github.com/apache/spark/commit/c003ee98ea50f0606046298ae412ca9f0a752566). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68308/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #68308 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68308/consoleFull)** for PR 15593 at commit [`e205670`](https://github.com/apache/spark/commit/e205670c163e7805f143097086d201661c7f104f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #68308 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68308/consoleFull)** for PR 15593 at commit [`e205670`](https://github.com/apache/spark/commit/e205670c163e7805f143097086d201661c7f104f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 @MLnick I updated it with your suggested wording for the comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/68084/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #68084 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68084/consoleFull)** for PR 15593 at commit [`210e1d3`](https://github.com/apache/spark/commit/210e1d307ffd63b797bcae45bb2dc49e0d10c586). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #68084 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68084/consoleFull)** for PR 15593 at commit [`210e1d3`](https://github.com/apache/spark/commit/210e1d307ffd63b797bcae45bb2dc49e0d10c586). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user dbtsai commented on the issue: https://github.com/apache/spark/pull/15593 @sethah I'm recently busy on company work. Will start to work on open source code review soon this week. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 ping @MLnick @jkbradley This should be a nice performance boost for MLOR in ML, hopefully we can get it in for 2.1. If you get some time to review or run tests I'd really appreciate it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67362/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15593 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #67362 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67362/consoleFull)** for PR 15593 at commit [`07fd150`](https://github.com/apache/spark/commit/07fd1504136ad7b1ce37f443e26f407b07345991). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15593 **[Test build #67362 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67362/consoleFull)** for PR 15593 at commit [`07fd150`](https://github.com/apache/spark/commit/07fd1504136ad7b1ce37f443e26f407b07345991). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15593: [SPARK-18060][ML] Avoid unnecessary computation for MLOR
Github user sethah commented on the issue: https://github.com/apache/spark/pull/15593 cc @dbtsai This may improve the problems you mentioned you were having on [SPARK-17134](https://issues.apache.org/jira/browse/SPARK-17134) :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org