[
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228
]
RJ Nowling edited comment on SPARK-4894 at 1/15/15 4:21 AM:
[~josephkb], after some thought, I've come around and think your idea of a
single NB class with a Factor type parameter may be the more maintainable
choice, as well as offering some novel functionality. But there seems to be a
lot to figure out (we should check the decision tree implementation, for
example), and I don't want to hold up what should be a relatively simple change
to support Bernoulli NB. Can we create a new JIRA to discuss the NB refactoring?
Comments about refactoring:
(1) How often is NB used with continuous values? I see that sklearn supports
Gaussian NB but is this used in practice? My understanding is that NB is
generally used for text classification with counts or binary values, possibly
weighted by TF-IDF. We should probably email the users and dev lists to get
user feedback. If no one is asking for it, we should shelve it and focus on
other things.
(2) After some more reflection, I can see a few more benefits to your
suggestion of feature types (e.g., categorical, discrete counts, continuous,
binary, etc.). If we created corresponding FeatureLikelihood types (e.g.,
Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which
would be easier to test, debug, and maintain than the multiple NB subclasses in
sklearn. Additionally, if the user can define a type for each feature, then
users can mix and match likelihood types as well. Most NB implementations
treat all features the same -- what if we had a model that allowed heterogeneous
features? If it works well in NB, it could be extended to other parts of
MLlib. (There is likely some overlap with decision trees since they support
multiple feature types, so we might want to see if there is anything there we
can reuse.) At the API level, we could provide a basic API which takes
{noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity
isn't compromised and provide a more advanced API for power users.
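To make the composition idea above concrete, here is a minimal Scala sketch. All names (FeatureLikelihood as a trait, BernoulliLikelihood, GaussianLikelihood, ClassModel) are illustrative assumptions for this discussion, not existing MLlib APIs:

```scala
// Hypothetical sketch: one likelihood type per feature, composed into a
// per-class model. Names here are illustrative, not part of MLlib.

trait FeatureLikelihood {
  // Log P(x | class) for one feature value, given this class's parameters.
  def logLikelihood(x: Double): Double
}

// Bernoulli: parameterized by p = P(x = 1 | class).
case class BernoulliLikelihood(p: Double) extends FeatureLikelihood {
  def logLikelihood(x: Double): Double =
    if (x > 0.0) math.log(p) else math.log(1.0 - p)
}

// Gaussian: parameterized by per-class mean and variance for this feature.
case class GaussianLikelihood(mu: Double, sigma2: Double) extends FeatureLikelihood {
  def logLikelihood(x: Double): Double =
    -0.5 * (math.log(2.0 * math.Pi * sigma2) + (x - mu) * (x - mu) / sigma2)
}

// A class model is a log-prior plus one likelihood per feature; the features
// Seq may freely mix types (e.g., binary and continuous in one model).
case class ClassModel(logPrior: Double, features: Seq[FeatureLikelihood]) {
  def logPosterior(x: Seq[Double]): Double =
    logPrior + x.zip(features).map { case (xi, f) => f.logLikelihood(xi) }.sum
}
```

Prediction would then just take the argmax of logPosterior over the class models; the "mix and match" case is simply a features Seq containing both BernoulliLikelihood and GaussianLikelihood instances.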
Add Bernoulli-variant of Naive Bayes
Key: SPARK-4894
URL: https://issues.apache.org/jira/browse/SPARK-4894
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.2.0
Reporter: RJ Nowling
Assignee: RJ Nowling
MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli
version of Naive Bayes is more useful for situations where the features are
binary values.
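As a sketch of why the two variants differ for binary features: in the Bernoulli model every feature contributes a term, with absent (zero) features adding log(1 - p), whereas the multinomial model ignores zero counts entirely. The helper names and parameter arrays below are illustrative, not MLlib code:

```scala
// Hypothetical sketch contrasting the per-feature log-likelihood terms of
// multinomial vs Bernoulli Naive Bayes for one class.

// Multinomial: only non-zero counts contribute; sum_i x_i * log(theta_i),
// where theta_i is the smoothed per-class feature probability.
def multinomialLogLikelihood(x: Array[Double], logTheta: Array[Double]): Double =
  x.zip(logTheta).map { case (xi, lt) => xi * lt }.sum

// Bernoulli: every feature contributes, including absent ones, which add
// log(1 - p_i), where p_i = P(feature i present | class).
def bernoulliLogLikelihood(x: Array[Double], p: Array[Double]): Double =
  x.zip(p).map { case (xi, pi) =>
    if (xi > 0.0) math.log(pi) else math.log(1.0 - pi)
  }.sum
```

The log(1 - p_i) term for absent features is what makes the Bernoulli variant better suited to binary data: a word *not* appearing in a document is itself evidence.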