[GitHub] spark pull request #15402: [SPARK-17835][ML][MLlib] Optimize NaiveBayes mlli...

2016-10-12 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15402


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15402: [SPARK-17835][ML][MLlib] Optimize NaiveBayes mlli...

2016-10-07 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/15402

[SPARK-17835][ML][MLlib] Optimize NaiveBayes mllib wrapper to eliminate 
extra pass on data

## What changes were proposed in this pull request?
[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the 
```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper. 
However, mllib and ml differ in how they handle labels:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input 
labels are in the range [0, numClasses).
* The mllib ```NaiveBayesModel``` exposes ```labels```, but the ml model does 
not, due to the assumption mentioned above.

During the copy in 
[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used ```val 
labels = data.map(_.label).distinct().collect().sorted``` to collect the distinct 
labels first and then fed them to training. This runs an extra Spark job compared 
with the original implementation. Since ```NaiveBayes``` does only one 
aggregation during training, adding a second pass over the data is inefficient. 
We can instead get the labels in the same pass as ```NaiveBayes``` training and 
send them to the MLlib side.
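
The idea can be illustrated with a minimal sketch. This is not the code in this patch: it uses plain Scala collections in place of an RDD, and the names ```SinglePassLabels```, ```LabeledPoint``` (a local stand-in for the Spark class), ```Stats``` and ```aggregateByLabel``` are hypothetical. It only shows that the per-label aggregate already contains the distinct labels as its keys, so no separate ```distinct()``` job is needed.

```scala
object SinglePassLabels {
  // Local stand-in for Spark's LabeledPoint (assumption: dense features only).
  final case class LabeledPoint(label: Double, features: Array[Double])

  // Per-label running totals: (number of points, element-wise feature sums).
  type Stats = (Long, Array[Double])

  // One pass over the data, folding each point into the totals for its label.
  def aggregateByLabel(data: Seq[LabeledPoint]): Map[Double, Stats] =
    data.foldLeft(Map.empty[Double, Stats]) { (acc, p) =>
      val updated: Stats = acc.get(p.label) match {
        case Some((n, sums)) =>
          (n + 1, sums.zip(p.features).map { case (a, b) => a + b })
        case None =>
          (1L, p.features.clone())
      }
      acc.updated(p.label, updated)
    }

  // Labels follow the mllib convention {-1, +1} rather than [0, numClasses).
  val data: Seq[LabeledPoint] = Seq(
    LabeledPoint(-1.0, Array(1.0, 0.0)),
    LabeledPoint(1.0, Array(0.0, 2.0)),
    LabeledPoint(-1.0, Array(3.0, 1.0))
  )

  val aggregated: Map[Double, Stats] = aggregateByLabel(data)

  // The distinct labels are the keys of the aggregate; no second pass is needed.
  val labels: Array[Double] = aggregated.keys.toArray.sorted
}
```

Here ```SinglePassLabels.labels``` holds the sorted distinct labels recovered from the same aggregation that produced the per-label sums, which is the information the mllib wrapper needs to expose.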

## How was this patch tested?
Existing tests.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-17835

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15402.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15402


commit 2fd38fe8b15855f0f64b26472e252e12737e8b1a
Author: Yanbo Liang 
Date:   2016-10-08T06:50:55Z

Optimize NaiveBayes mllib wrapper to eliminate extra pass on data



