GitHub user yanboliang opened a pull request:
https://github.com/apache/spark/pull/15402
[SPARK-17835][ML][MLlib] Optimize NaiveBayes mllib wrapper to eliminate
extra pass on data
## What changes were proposed in this pull request?
[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077) copied the
```NaiveBayes``` implementation from mllib to ml and left mllib as a wrapper.
However, mllib and ml differ in how they handle labels:
* mllib allows input labels such as {-1, +1}, whereas ml assumes the input
labels are in the range [0, numClasses).
* mllib ```NaiveBayesModel``` exposes ```labels```, but ml does not, because of
the assumption mentioned above.
During the copy in
[SPARK-14077](https://issues.apache.org/jira/browse/SPARK-14077), we used
the expression ```val labels = data.map(_.label).distinct().collect().sorted```
to collect the distinct labels first and then fed them to training. This
launches an extra Spark job compared with the original implementation. Since
```NaiveBayes``` performs only one aggregation during training, adding a second
pass over the data is comparatively expensive. We can instead get the labels in
the same pass as ```NaiveBayes``` training and send them to the MLlib side.
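The idea can be illustrated with a minimal sketch. This is not the code in this patch: it uses plain Scala collections in place of an RDD, and `Point`, `SinglePassLabels`, and `aggregate` are hypothetical names chosen for illustration. It assumes the training aggregation is keyed by label, so the distinct sorted labels are simply the keys of the aggregated result and no separate ```distinct()``` job is needed.

```scala
// Hypothetical stand-in for a labeled training point (not Spark's LabeledPoint).
final case class Point(label: Double, features: Array[Double])

object SinglePassLabels {
  // One pass over the data: for each label, accumulate the number of points
  // and the element-wise sum of their feature vectors.
  def aggregate(data: Seq[Point]): Array[(Double, (Long, Array[Double]))] = {
    val acc = scala.collection.mutable.Map.empty[Double, (Long, Array[Double])]
    data.foreach { p =>
      val (n, sums) =
        acc.getOrElse(p.label, (0L, new Array[Double](p.features.length)))
      var i = 0
      while (i < sums.length) { sums(i) += p.features(i); i += 1 }
      acc(p.label) = (n + 1, sums)
    }
    // Sorting by key yields the labels in ascending order.
    acc.toArray.sortBy(_._1)
  }
}

val data = Seq(
  Point(1.0, Array(1.0, 0.0)),
  Point(-1.0, Array(0.0, 2.0)),
  Point(1.0, Array(3.0, 1.0)))

val aggregated = SinglePassLabels.aggregate(data)
// The labels fall out of the aggregation keys; no second pass is required.
val labels = aggregated.map(_._1)
```

With labels {-1, +1} as in the mllib case above, `labels` holds the distinct labels in sorted order, and the per-label statistics needed for training come from the same traversal.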
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/yanboliang/spark spark-17835
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15402.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15402
commit 2fd38fe8b15855f0f64b26472e252e12737e8b1a
Author: Yanbo Liang
Date: 2016-10-08T06:50:55Z
Optimize NaiveBayes mllib wrapper to eliminate extra pass on data