[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-07 Thread davies
Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-58234549
  
It's better to have `WIP` in title if you still work in progress. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-04 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57920855
  
Again, python tests failed because the python interface is disabled in 
order to focus on the scala implementation first.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57920707
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21300/Test 
FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57920705
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21300/consoleFull)
 for   PR 2491 at commit 
[`e535d8b`](https://github.com/apache/spark/commit/e535d8b7f08fb848ee5687881cbfc6e4c9e798cd).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `abstract class NaiveBayesModel extends ClassificationModel with 
Serializable `



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57919346
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21300/consoleFull)
 for   PR 2491 at commit 
[`e535d8b`](https://github.com/apache/spark/commit/e535d8b7f08fb848ee5687881cbfc6e4c9e798cd).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-04 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57919277
  
@mengxr Ok, updated to address your suggestions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57867834
  
@mengxr Sorry I misunderstood your comment on that first point. I'll just 
do the 2nd and 3rd.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57865710
  
@staple The conditional distribution matrix may not be sparse. That is why 
we use dense format to store it. Maybe we can do a hard thresholding to make it 
parse, but this should be in a separate PR. Let's focus on the second and the 
third in this PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57855256
  
@mengxr Sorry about that, in the future I’ll follow the best practice 
you’ve outlined.

Here are the take-aways from my perspective:
- Investigate use of sparse storage for the conditional distribution. I 
believe the existing implementation in master uses dense conditional 
distribution matrices, but sparse is obviously possible.
- Remove grouping of conditional probabilities, as it adds complexity and 
you mentioned you aren’t sure if it will help performance.
- Add support for predictValues with consistent partitioning.

I’ll look into all these. Thanks for your feedback!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57839878
  
@staple Sorry for late response and thank you for working on this JIRA! For 
the best practice, before you start working on a JIRA, please first ask on the 
JIRA page and see whether someone else is working or plans to work on the same 
JIRA, to avoid duplicate effort. Discussing the design before sending out PR is 
also encouraged. I just assigned you to the JIRA.

The algorithm looks good to me. Some general comments:
1. The data to NB is usually sparse. I'm not sure whether grouping the 
conditional probabilities helps performance.
2. In the implementation of `predict`, the output `RDD[Double]` doesn't 
have the same partitioner as the input data. Though the ordering doesn't 
change, it is still hard to inspect the result. I suggested adding 
`predictValues`, which takes `RDD[(K, Vector)]` and output `RDD[(K, Double)]`, 
so user can put either id or label in the key, and we can preserve the input 
partitioner.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-10-03 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-57824188
  
@mengxr - I’m sure you have a lot on your plate right now, but I wanted 
to check in on this PR. Overall, how do you feel about this approach for 
SPARK-1655? I’m happy to implement it differently if you’d prefer.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-56425754
  
Hi - The QA tests failed in python because I disabled the naive bayes 
python api in order to focus on approval of the scala implementation first. (I 
mentioned this in a comment above as well.)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-56424824
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20658/consoleFull)
 for   PR 2491 at commit 
[`4594761`](https://github.com/apache/spark/commit/4594761dd035d2d01b91fb36a9029bda9f34c4a1).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `case class LabelAggregate(label: Double, numDocuments: Long, 
sumFeatures: BDV[Double])`



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
GitHub user staple opened a pull request:

https://github.com/apache/spark/pull/2491

[SPARK-1655][MLLIB] Add option for distributed naive bayes model.

Adds an option to store a naive bayes model distributively. The default 
behavior, in which the whole model is stored on the driver node, remains 
unchanged. NaiveBayes.train’s new distMode parameter can be used to request 
that a model be distributed.

When distributed, the model is stored as an RDD of model blocks. Each block 
contains the labels and prior and conditional probabilities for a set of label 
classes, allowing fast computation of the maximum a posteriori prediction for 
each block and straightforward aggregation of these MAP predictions across 
blocks.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/staple/spark SPARK-1655

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2491.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2491


commit 4594761dd035d2d01b91fb36a9029bda9f34c4a1
Author: Aaron Staple 
Date:   2014-09-22T05:02:28Z

[SPARK-1655][MLLIB] Add option for distributed naive bayes model.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-56415406
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20658/consoleFull)
 for   PR 2491 at commit 
[`4594761`](https://github.com/apache/spark/commit/4594761dd035d2d01b91fb36a9029bda9f34c4a1).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2491#discussion_r17866638
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -68,6 +74,54 @@ class NaiveBayesModel private[mllib] (
 }
 
 /**
+ * One block from a distributed model for a naive bayes classifier. The 
model is divided into
+ * blocks, each containing the complete model state for a group of labels.
+ *
+ * @param labels array of labels
+ * @param pi log of class priors, with dimension C, the number of labels 
in this block
+ * @param theta log of class conditional probabilities, with dimensions 
C-by-D,
+ *  where D is the number of features
+ */
+private case class NBModelBlock(labels: Array[Double], pi: BDV[Double], 
theta: BDM[Double])
+
+/**
+ * Distributed model for a naive bayes classifier.
+ *
+ * @param modelBlocks RDD of NBModelBlock, comprising the model
+ */
+private class DistNaiveBayesModel(val modelBlocks: RDD[NBModelBlock]) 
extends NaiveBayesModel {
+
+  override def predict(testData: RDD[Vector]): RDD[Double] = {
+// Pair each test data point with all model blocks.
+val testXModel = 
testData.map(_.toBreeze).zipWithIndex().cartesian(modelBlocks)
--- End diff --

Potentially the test vectors could be batched here to reduce the number of 
intermediate distributed objects.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2491#discussion_r17866616
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -232,11 +232,11 @@ class PythonMLLibAPI extends Serializable {
   def trainNaiveBayes(
   data: JavaRDD[LabeledPoint],
   lambda: Double): java.util.List[java.lang.Object] = {
-val model = NaiveBayes.train(data.rdd, lambda)
+// val model = NaiveBayes.train(data.rdd, lambda, "local")
--- End diff --

I disabled the python interface for now. Let’s figure out the scala 
implementation first.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on the pull request:

https://github.com/apache/spark/pull/2491#issuecomment-56414916
  
Hi - Does this seem like a reasonable approach for SPARK-1655?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-1655][MLLIB] Add option for distributed...

2014-09-22 Thread staple
Github user staple commented on a diff in the pull request:

https://github.com/apache/spark/pull/2491#discussion_r17866654
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -141,7 +214,41 @@ class NaiveBayes private (private var lambda: Double) 
extends Serializable with
   i += 1
 }
 
-new NaiveBayesModel(labels, pi, theta)
+new LocalNaiveBayesModel(labels, pi, theta)
+  }
+
+  private def trainDistModel(labelAggregates: RDD[(Double, (Long, 
BDV[Double]))]) = {
+case class LabelAggregate(label: Double, numDocuments: Long, 
sumFeatures: BDV[Double])
+val aggregated = labelAggregates.map(x => LabelAggregate(x._1, 
x._2._1, x._2._2))
+
+// Compute the model's prior (pi) vector and conditional (theta) 
matrix for each batch of
+// labels.
+// NOTE In contrast to the local trainer, the piLogDenom normalization 
term is omitted here.
+// Computing this term requires an additional aggregation on 
'aggregated', and because the
+// term is an additive constant it does not affect maximum a 
posteriori model prediction.
+val modelBlocks = aggregated.mapPartitions(p => p.grouped(100).map { 
batch =>
--- End diff --

The batch size of 100 here is just an arbitrary value right now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org