[
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228
]
RJ Nowling edited comment on SPARK-4894 at 1/15/15 4:21 AM:
[~josephkb], after some thought, I've come around and think your idea of a
single NB class with a Factor type parameter may be the more maintainable
choice, as well as offering some novel functionality. But there seems to be a
lot to figure out (we should check the decision tree implementation, for
example), and I don't want to hold up what should be a relatively simple change
to support Bernoulli NB. Can we create a new JIRA to discuss the NB refactoring?
Comments about refactoring:
(1) How often is NB used with continuous values? I see that sklearn supports
Gaussian NB but is this used in practice? My understanding is that NB is
generally used for text classification with counts or binary values, possibly
weighted by TF-IDF. We should probably email the users and dev lists to get
user feedback. If no one is asking for it, we should shelve it and focus on
other things.
(2) After some more reflection, I can see a few more benefits to your
suggestion of feature types (e.g., categorical, discrete counts, continuous,
binary, etc.). If we created corresponding FeatureLikelihood types (e.g.,
Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which
would be easier to test, debug, and maintain than the multiple NB subclasses in
sklearn. Additionally, if the user can define a type for each feature, then
users can mix and match likelihood types as well. Most NB implementations
treat all features the same -- what if we had a model that allowed heterogeneous
features? If it works well in NB, it could be extended to other parts of
MLlib. (There is likely some overlap with decision trees since they support
multiple feature types, so we might want to see if there is anything there we
can reuse.) At the API level, we could provide a basic API which takes
{noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity
isn't compromised and provide a more advanced API for power users.
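To make the composition idea above concrete, here is a minimal Scala sketch. All names (FeatureLikelihood as a trait, BernoulliLikelihood, GaussianLikelihood, ClassModel) are illustrative assumptions for this discussion, not existing MLlib APIs:

```scala
// Hypothetical sketch: one likelihood type per feature, composed into a
// per-class model. Names here are illustrative, not part of MLlib.

trait FeatureLikelihood {
  // Log P(x | class) for one feature value, given this class's parameters.
  def logLikelihood(x: Double): Double
}

// Bernoulli: parameterized by p = P(x = 1 | class).
case class BernoulliLikelihood(p: Double) extends FeatureLikelihood {
  def logLikelihood(x: Double): Double =
    if (x > 0.0) math.log(p) else math.log(1.0 - p)
}

// Gaussian: parameterized by per-class mean and variance for this feature.
case class GaussianLikelihood(mu: Double, sigma2: Double) extends FeatureLikelihood {
  def logLikelihood(x: Double): Double =
    -0.5 * (math.log(2.0 * math.Pi * sigma2) + (x - mu) * (x - mu) / sigma2)
}

// A class model is a log-prior plus one likelihood per feature; the features
// Seq may freely mix types (e.g., binary and continuous in one model).
case class ClassModel(logPrior: Double, features: Seq[FeatureLikelihood]) {
  def logPosterior(x: Seq[Double]): Double =
    logPrior + x.zip(features).map { case (xi, f) => f.logLikelihood(xi) }.sum
}
```

Prediction would then just take the argmax of logPosterior over the class models; the "mix and match" case is simply a features Seq containing both BernoulliLikelihood and GaussianLikelihood instances.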
Add Bernoulli-variant of Naive Bayes
Key: SPARK-4894
URL: https://issues.apache.org/jira/browse/SPARK-4894
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.2.0
Reporter: RJ Nowling
Assignee: RJ Nowling
MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli
version of Naive Bayes is more useful for situations where the features are
binary values.
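As a sketch of why the two variants differ for binary features: in the Bernoulli model every feature contributes a term, with absent (zero) features adding log(1 - p), whereas the multinomial model ignores zero counts entirely. The helper names and parameter arrays below are illustrative, not MLlib code:

```scala
// Hypothetical sketch contrasting the per-feature log-likelihood terms of
// multinomial vs Bernoulli Naive Bayes for one class.

// Multinomial: only non-zero counts contribute; sum_i x_i * log(theta_i),
// where theta_i is the smoothed per-class feature probability.
def multinomialLogLikelihood(x: Array[Double], logTheta: Array[Double]): Double =
  x.zip(logTheta).map { case (xi, lt) => xi * lt }.sum

// Bernoulli: every feature contributes, including absent ones, which add
// log(1 - p_i), where p_i = P(feature i present | class).
def bernoulliLogLikelihood(x: Array[Double], p: Array[Double]): Double =
  x.zip(p).map { case (xi, pi) =>
    if (xi > 0.0) math.log(pi) else math.log(1.0 - pi)
  }.sum
```

The log(1 - p_i) term for absent features is what makes the Bernoulli variant better suited to binary data: a word *not* appearing in a document is itself evidence.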