[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279235#comment-14279235
 ] 

Joseph K. Bradley edited comment on SPARK-5272 at 1/15/15 8:13 PM:
-------------------------------------------------------------------

My initial thoughts:

(1) Are continuous labels/features important to support?

In terms of when NB *should* be used, I believe they are important.  People use 
Logistic Regression with continuous labels and features, and Naive Bayes is 
really the same type of model (just trained differently).
* E.g.: Ng & Jordan. "On Discriminative vs. Generative classifiers: A 
comparison of logistic regression and naive Bayes."  NIPS 2002.
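** Theoretically, the two types of models serve the same purpose, but they suit 
different regimes: the paper argues that NB approaches its asymptotic error 
with fewer training examples, while logistic regression reaches a lower 
asymptotic error given enough data.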
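
One way to make the "same type of model" point concrete, as a sketch under assumptions not stated above: take a binary label y with class priors \pi_0, \pi_1, and model each feature x_j as Gaussian with class-dependent mean \mu_{cj} and class-independent variance \sigma_j^2. The NB posterior then has exactly the logistic form; only the way the weights are estimated differs.

```latex
\log\frac{P(y=1\mid x)}{P(y=0\mid x)}
  = \log\frac{\pi_1}{\pi_0}
  + \sum_j \left[ \frac{\mu_{1j}-\mu_{0j}}{\sigma_j^2}\,x_j
                - \frac{\mu_{1j}^2-\mu_{0j}^2}{2\sigma_j^2} \right]
  = w^\top x + b,
\qquad
w_j = \frac{\mu_{1j}-\mu_{0j}}{\sigma_j^2},
\quad
b = \log\frac{\pi_1}{\pi_0} - \sum_j \frac{\mu_{1j}^2-\mu_{0j}^2}{2\sigma_j^2}

\Longrightarrow\quad
P(y=1\mid x) = \frac{1}{1+e^{-(w^\top x + b)}}
```

If the variances are allowed to differ per class, the x_j^2 terms no longer cancel and the decision boundary becomes quadratic rather than linear.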

In terms of when NB is actually used by Spark users, I'm not sure.  Hopefully 
some research and discussion here will make that clearer.

(2) What should the API look like?

I believe there should be a NaiveBayesClassifier and a NaiveBayesRegressor that 
share the same underlying implementation.  That implementation should include a 
Factor concept that encodes the type of distribution.

This should be simple to do for Naive Bayes, and it will give some guidance if 
we move to support more general probabilistic graphical models in MLlib.
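
To illustrate the Factor idea, here is a minimal sketch in plain Python (not Spark or MLlib code). Every name in it (Factor, CategoricalFactor, GaussianFactor, factor_makers) is hypothetical and only shows how one classifier could mix discrete and continuous features by delegating to a per-feature distribution object. The regressor side of the proposal is not sketched.

```python
import math
from abc import ABC, abstractmethod
from collections import Counter, defaultdict


class Factor(ABC):
    """Hypothetical interface: one univariate distribution, fit per class."""

    @abstractmethod
    def fit(self, values): ...

    @abstractmethod
    def log_prob(self, value): ...


class CategoricalFactor(Factor):
    """Discrete feature with additive (Laplace) smoothing."""

    def __init__(self, categories, smoothing=1.0):
        self.categories = list(categories)
        self.smoothing = smoothing

    def fit(self, values):
        counts = Counter(values)
        total = len(values) + self.smoothing * len(self.categories)
        self.log_p = {c: math.log((counts[c] + self.smoothing) / total)
                      for c in self.categories}
        return self

    def log_prob(self, value):
        return self.log_p[value]


class GaussianFactor(Factor):
    """Continuous feature modeled as a univariate Gaussian."""

    def __init__(self, min_var=1e-9):
        self.min_var = min_var  # floor to avoid a zero variance

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        var = sum((v - self.mean) ** 2 for v in values) / n
        self.var = max(var, self.min_var)
        return self

    def log_prob(self, value):
        return -0.5 * (math.log(2 * math.pi * self.var)
                       + (value - self.mean) ** 2 / self.var)


class NaiveBayesClassifier:
    """Discrete label; one Factor per (class, feature) pair."""

    def __init__(self, factor_makers):
        # One zero-argument callable per feature, each returning a new Factor.
        self.factor_makers = factor_makers

    def fit(self, rows, labels):
        by_class = defaultdict(list)
        for row, label in zip(rows, labels):
            by_class[label].append(row)
        n = len(labels)
        self.log_prior = {c: math.log(len(rs) / n) for c, rs in by_class.items()}
        self.factors = {
            c: [make().fit([r[j] for r in rs])
                for j, make in enumerate(self.factor_makers)]
            for c, rs in by_class.items()
        }
        return self

    def predict(self, row):
        # Pick the class with the largest log prior plus summed log likelihoods.
        def score(c):
            return self.log_prior[c] + sum(
                f.log_prob(x) for f, x in zip(self.factors[c], row))
        return max(self.log_prior, key=score)


# Usage: feature 0 is categorical, feature 1 is continuous.
rows = [("red", 1.0), ("red", 1.2), ("blue", 3.0), ("blue", 3.3)]
labels = ["a", "a", "b", "b"]
model = NaiveBayesClassifier([
    lambda: CategoricalFactor(["red", "blue"]),
    GaussianFactor,
]).fit(rows, labels)
```

The design choice this is meant to show: the training and scoring loops never inspect the feature type, so adding another distribution would only mean adding another Factor subclass.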


> Refactor NaiveBayes to support discrete and continuous labels,features
> ----------------------------------------------------------------------
>
>                 Key: SPARK-5272
>                 URL: https://issues.apache.org/jira/browse/SPARK-5272
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?


