[ 
https://issues.apache.org/jira/browse/SPARK-26579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738954#comment-16738954
 ] 

Hyukjin Kwon commented on SPARK-26579:
--------------------------------------

Let's ask questions on the mailing list rather than filing a JIRA here. You 
may get a better answer there.

> SparkML DecisionTree, how does the algorithm identify categorical features?
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-26579
>                 URL: https://issues.apache.org/jira/browse/SPARK-26579
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 2.4.0
>         Environment: os: Centos7
> software: pyspark.
>            Reporter: Xufeng Wang
>            Priority: Major
>
> I am confused about how the decision tree and other tree-based models 
> handle feature types. My current project involves data with both nominal 
> and continuous features. I converted the nominal columns to numeric indices 
> using the StringIndexer transformer from the ml.feature module, then 
> assembled all the feature values into a single vector column named 
> features. As far as I can see, every entry in the feature vector is of 
> double datatype.
> I kept getting the error that maxBins should be larger than the largest 
> number of categories among the categorical features. After correcting 
> maxBins, I still see some features (continuous from the beginning) whose 
> values are larger than my maxBins. Since the pipeline works with a maxBins 
> that is smaller than some continuous values, I conclude that the algorithm 
> automatically determines which features are categorical and which are 
> continuous. But how does it figure out which is which, when all of the 
> features are of double datatype?
> Another question, if anyone can help: what type of tree does the Spark 
> decision tree build? Is it CART or something else?
> Last question: what is the procedure for treating categorical features in 
> tree-based algorithms?
> Thank you in advance.
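
For readers of this thread, a minimal sketch of the behaviour described in 
the question. In Spark ML the tree estimators identify categorical features 
from the metadata on the features column (StringIndexer marks its output as 
nominal, and VectorAssembler carries that into the assembled vector), not 
from the double values themselves. The code below is a plain-Python model of 
that rule, not Spark's implementation; the names check_max_bins, 
feature_kinds and categorical_arity are hypothetical and exist only for 
illustration.

```python
# Illustrative only: a plain-Python model of the rule, not Spark's code.
# `categorical_arity` is a hypothetical stand-in for the column metadata:
# it maps a vector slot index to its number of categories.


def check_max_bins(categorical_arity, max_bins):
    """Raise if any categorical feature has more categories than max_bins."""
    for index, arity in sorted(categorical_arity.items()):
        if arity > max_bins:
            raise ValueError(
                f"maxBins={max_bins} is smaller than the {arity} categories "
                f"of categorical feature {index}"
            )


def feature_kinds(num_features, categorical_arity):
    """Label each vector slot from metadata alone; values are never inspected."""
    return [
        "categorical" if i in categorical_arity else "continuous"
        for i in range(num_features)
    ]


# Slot 0 is assumed to come from StringIndexer with 40 distinct labels.
# Slots 1 and 2 are raw doubles without nominal metadata, so how large
# their values are plays no role in the maxBins check.
arity = {0: 40}
kinds = feature_kinds(3, arity)
check_max_bins(arity, max_bins=64)  # passes: 40 categories fit in 64 bins
```

Under this model, a continuous slot may hold values far above maxBins 
without any error, which matches what the reporter observed; only the number 
of categories of a categorical slot is compared against maxBins.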



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

