[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-58234549 It's better to have `WIP` in title if you still work in progress. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57920855 Again, python tests failed because the python interface is disabled in order to focus on the scala implementation first. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57920707 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21300/Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57920705 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21300/consoleFull) for PR 2491 at commit [`e535d8b`](https://github.com/apache/spark/commit/e535d8b7f08fb848ee5687881cbfc6e4c9e798cd). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `abstract class NaiveBayesModel extends ClassificationModel with Serializable ` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57919346 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21300/consoleFull) for PR 2491 at commit [`e535d8b`](https://github.com/apache/spark/commit/e535d8b7f08fb848ee5687881cbfc6e4c9e798cd). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57919277 @mengxr Ok, updated to address your suggestions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57867834 @mengxr Sorry I misunderstood your comment on that first point. I'll just do the 2nd and 3rd. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57865710 @staple The conditional distribution matrix may not be sparse. That is why we use dense format to store it. Maybe we can do a hard thresholding to make it parse, but this should be in a separate PR. Let's focus on the second and the third in this PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57855256 @mengxr Sorry about that, in the future Iâll follow the best practice youâve outlined. Here are the take-aways from my perspective: - Investigate use of sparse storage for the conditional distribution. I believe the existing implementation in master uses dense conditional distribution matrices, but sparse is obviously possible. - Remove grouping of conditional probabilities, as it adds complexity and you mentioned you arenât sure if it will help performance. - Add support for predictValues with consistent partitioning. Iâll look into all these. Thanks for your feedback! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57839878 @staple Sorry for late response and thank you for working on this JIRA! For the best practice, before you start working on a JIRA, please first ask on the JIRA page and see whether someone else is working or plans to work on the same JIRA, to avoid duplicate effort. Discussing the design before sending out PR is also encouraged. I just assigned you to the JIRA. The algorithm looks good to me. Some general comments: 1. The data to NB is usually sparse. I'm not sure whether grouping the conditional probabilities helps performance. 2. In the implementation of `predict`, the output `RDD[Double]` doesn't have the same partitioner as the input data. Though the ordering doesn't change, it is still hard to inspect the result. I suggested adding `predictValues`, which takes `RDD[(K, Vector)]` and output `RDD[(K, Double)]`, so user can put either id or label in the key, and we can preserve the input partitioner. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-57824188 @mengxr - Iâm sure you have a lot on your plate right now, but I wanted to check in on this PR. Overall, how do you feel about this approach for SPARK-1655? Iâm happy to implement it differently if youâd prefer. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-56425754 Hi - The QA tests failed in python because I disabled the naive bayes python api in order to focus on approval of the scala implementation first. (I mentioned this in a comment above as well.) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-56424824 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20658/consoleFull) for PR 2491 at commit [`4594761`](https://github.com/apache/spark/commit/4594761dd035d2d01b91fb36a9029bda9f34c4a1). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class LabelAggregate(label: Double, numDocuments: Long, sumFeatures: BDV[Double])` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
GitHub user staple opened a pull request: https://github.com/apache/spark/pull/2491 [SPARK-1655][MLLIB] Add option for distributed naive bayes model. Adds an option to store a naive bayes model distributively. The default behavior, in which the whole model is stored on the driver node, remains unchanged. NaiveBayes.trainâs new distMode parameter can be used to request that a model be distributed. When distributed, the model is stored as an RDD of model blocks. Each block contains the labels and prior and conditional probabilities for a set of label classes, allowing fast computation of the maximum a posteriori prediction for each block and straightforward aggregation of these MAP predictions across blocks. You can merge this pull request into a Git repository by running: $ git pull https://github.com/staple/spark SPARK-1655 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2491.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2491 commit 4594761dd035d2d01b91fb36a9029bda9f34c4a1 Author: Aaron Staple Date: 2014-09-22T05:02:28Z [SPARK-1655][MLLIB] Add option for distributed naive bayes model. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-56415406 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20658/consoleFull) for PR 2491 at commit [`4594761`](https://github.com/apache/spark/commit/4594761dd035d2d01b91fb36a9029bda9f34c4a1). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2491#discussion_r17866638 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala --- @@ -68,6 +74,54 @@ class NaiveBayesModel private[mllib] ( } /** + * One block from a distributed model for a naive bayes classifier. The model is divided into + * blocks, each containing the complete model state for a group of labels. + * + * @param labels array of labels + * @param pi log of class priors, with dimension C, the number of labels in this block + * @param theta log of class conditional probabilities, with dimensions C-by-D, + * where D is the number of features + */ +private case class NBModelBlock(labels: Array[Double], pi: BDV[Double], theta: BDM[Double]) + +/** + * Distributed model for a naive bayes classifier. + * + * @param modelBlocks RDD of NBModelBlock, comprising the model + */ +private class DistNaiveBayesModel(val modelBlocks: RDD[NBModelBlock]) extends NaiveBayesModel { + + override def predict(testData: RDD[Vector]): RDD[Double] = { +// Pair each test data point with all model blocks. +val testXModel = testData.map(_.toBreeze).zipWithIndex().cartesian(modelBlocks) --- End diff -- Potentially the test vectors could be batched here to reduce the number of intermediate distributed objects. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2491#discussion_r17866616 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -232,11 +232,11 @@ class PythonMLLibAPI extends Serializable { def trainNaiveBayes( data: JavaRDD[LabeledPoint], lambda: Double): java.util.List[java.lang.Object] = { -val model = NaiveBayes.train(data.rdd, lambda) +// val model = NaiveBayes.train(data.rdd, lambda, "local") --- End diff -- I disabled the python interface for now. Letâs figure out the scala implementation first. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on the pull request: https://github.com/apache/spark/pull/2491#issuecomment-56414916 Hi - Does this seem like a reasonable approach for SPARK-1655? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...
Github user staple commented on a diff in the pull request: https://github.com/apache/spark/pull/2491#discussion_r17866654 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala --- @@ -141,7 +214,41 @@ class NaiveBayes private (private var lambda: Double) extends Serializable with i += 1 } -new NaiveBayesModel(labels, pi, theta) +new LocalNaiveBayesModel(labels, pi, theta) + } + + private def trainDistModel(labelAggregates: RDD[(Double, (Long, BDV[Double]))]) = { +case class LabelAggregate(label: Double, numDocuments: Long, sumFeatures: BDV[Double]) +val aggregated = labelAggregates.map(x => LabelAggregate(x._1, x._2._1, x._2._2)) + +// Compute the model's prior (pi) vector and conditional (theta) matrix for each batch of +// labels. +// NOTE In contrast to the local trainer, the piLogDenom normalization term is omitted here. +// Computing this term requires an additional aggregation on 'aggregated', and because the +// term is an additive constant it does not affect maximum a posteriori model prediction. +val modelBlocks = aggregated.mapPartitions(p => p.grouped(100).map { batch => --- End diff -- The batch size of 100 here is just an arbitrary value right now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org