[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

RJ Nowling (JIRA) Thu, 15 Jan 2015 12:31:07 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279258#comment-14279258
 ]


RJ Nowling commented on SPARK-5272:
-----------------------------------

Hi [~josephkb], 

I can see the benefits of your suggestion to model feature types (e.g., 
categorical, discrete counts, continuous, binary).  If we created corresponding 
FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian), it would 
promote composition, which would be easier to test, debug, and maintain than 
multiple NB subclasses as in sklearn.  Additionally, if the user can define a 
type for each feature, then users can mix and match likelihood types within a 
single model.  Most NB implementations treat all features the same -- what if 
we had a model that allowed heterogeneous features?  If it works well in NB, it 
could be extended to other parts of MLlib.  (There is likely some overlap with 
decision trees since they support multiple feature types, so we might want to 
see if there is anything there we can reuse.)  At the API level, we could 
provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like 
the current API so that simplicity isn't compromised, and provide a more 
advanced API for power users.
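
To make the composition idea concrete, here is a rough sketch in plain Python 
(not Spark code).  The names BernoulliLikelihood, GaussianLikelihood, and 
MixedNaiveBayes are hypothetical and only illustrate how one likelihood object 
per feature could be combined in a single model; nothing here reflects an 
existing MLlib API.

```python
import math
from collections import defaultdict


class BernoulliLikelihood:
    """Hypothetical likelihood for a binary feature (Laplace-smoothed)."""

    def __init__(self):
        self.ones = 0
        self.total = 0

    def update(self, x):
        self.ones += 1 if x else 0
        self.total += 1

    def log_prob(self, x):
        p = (self.ones + 1.0) / (self.total + 2.0)
        return math.log(p if x else 1.0 - p)


class GaussianLikelihood:
    """Hypothetical likelihood for a continuous feature."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's online update for mean and variance.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def log_prob(self, x):
        var = self.m2 / self.n + 1e-9  # small floor avoids division by zero
        return -0.5 * (math.log(2 * math.pi * var) + (x - self.mean) ** 2 / var)


class MixedNaiveBayes:
    """Naive Bayes that composes one likelihood type per feature."""

    def __init__(self, feature_types):
        self.feature_types = feature_types  # one likelihood class per feature
        self.class_counts = defaultdict(int)
        self.likelihoods = {}  # label -> list of per-feature likelihoods

    def fit(self, rows, labels):
        for row, y in zip(rows, labels):
            self.class_counts[y] += 1
            if y not in self.likelihoods:
                self.likelihoods[y] = [t() for t in self.feature_types]
            for lik, x in zip(self.likelihoods[y], row):
                lik.update(x)
        return self

    def predict(self, row):
        total = sum(self.class_counts.values())

        def score(y):
            # Log prior plus the sum of per-feature log likelihoods.
            prior = math.log(self.class_counts[y] / total)
            liks = self.likelihoods[y]
            return prior + sum(l.log_prob(x) for l, x in zip(liks, row))

        return max(self.class_counts, key=score)


# Feature 0 is binary, feature 1 is continuous.
rows = [[1, 5.0], [1, 5.5], [0, 1.0], [0, 1.2]]
labels = ["a", "a", "b", "b"]
model = MixedNaiveBayes([BernoulliLikelihood, GaussianLikelihood]).fit(rows, labels)
```

Each likelihood type can be unit tested on its own, and supporting a new 
feature type would only mean adding another class with the same update and 
log_prob methods.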

Does this sound like I'm understanding you correctly?

Re: Decision trees.  Decision tree models generally support different types of 
features (categorical, binary, discrete, continuous).  Does Spark's decision 
tree implementation support those different types?  How are they handled?  Do 
they abstract the feature type?  I feel there could be common ground here.


> Refactor NaiveBayes to support discrete and continuous labels,features
> ----------------------------------------------------------------------
>
>                 Key: SPARK-5272
>                 URL: https://issues.apache.org/jira/browse/SPARK-5272
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>            Reporter: Joseph K. Bradley
>
> This JIRA is to discuss refactoring NaiveBayes in order to support both 
> discrete and continuous labels and features.
> Currently, NaiveBayes supports only discrete labels and features.
> Proposal: Generalize it to support continuous values as well.
> Some items to discuss are:
> * How commonly are continuous labels/features used in practice?  (Is this 
> necessary?)
> * What should the API look like?
> ** E.g., should NB have multiple classes for each type of label/feature, or 
> should it take a general Factor type parameter?



