[ 
https://issues.apache.org/jira/browse/SPARK-5688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Denovitzer updated SPARK-5688:
-----------------------------------
    Labels: categorical decisiontree  (was: categorical)

> In Decision Trees, choosing a random subset of categories for each split
> ------------------------------------------------------------------------
>
>                 Key: SPARK-5688
>                 URL: https://issues.apache.org/jira/browse/SPARK-5688
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.2.0
>         Environment: Any
>            Reporter: Eric Denovitzer
>              Labels: categorical, decisiontree
>             Fix For: 1.2.0
>
>
> The categories in each subset chosen to build a split on a categorical
> variable are not random. The categories for a subset are chosen based on
> the binary representation of a number from 1 to (2^(number of categories)) -
> 2 (which excludes the empty and full subsets). In the current implementation,
> the integers used for the subsets are 1..numSplits. They should be chosen at
> random instead, to avoid biasing towards the categories with the lower
> indexes.
> Another problem is that if numBins/2 is bigger than the number of possible
> subsets for a given category set, numSplits is still taken to be numBins/2.
> It should be the min of numBins/2 and (2^(number of categories)) - 2
> (otherwise the same subsets might be considered more than once when choosing
> the splits).
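
The behaviour requested above can be sketched as follows. This is a minimal illustration in Python, not the MLlib implementation (which is written in Scala). The function names `subset_from_mask` and `choose_split_masks` and the `rng` parameter are hypothetical; only `numBins`, `numSplits`, and the 1 to (2^(number of categories)) - 2 range come from the issue. It assumes the number of categories is small enough that the range of masks can be sampled directly.

```python
import random


def subset_from_mask(mask, num_categories):
    """Return the category indexes whose bit is set in `mask`."""
    return [c for c in range(num_categories) if (mask >> c) & 1]


def choose_split_masks(num_categories, num_bins, rng=None):
    """Pick distinct, randomly chosen subset masks for a categorical feature.

    Hypothetical helper: masks run from 1 to 2^num_categories - 2, so the
    empty and full subsets are never produced.
    """
    rng = rng or random.Random()
    max_subsets = 2 ** num_categories - 2
    # Cap numSplits so that no subset has to be considered twice.
    num_splits = min(num_bins // 2, max_subsets)
    if num_splits <= 0:
        return []
    # Sampling without replacement avoids the bias of always using
    # 1..numSplits, which favours the categories with the lower indexes.
    return rng.sample(range(1, max_subsets + 1), num_splits)


if __name__ == "__main__":
    masks = choose_split_masks(num_categories=5, num_bins=8,
                               rng=random.Random(42))
    for m in masks:
        print(m, subset_from_mask(m, 5))
```

With 3 categories and numBins = 32, numBins/2 is 16 but only 6 subsets exist, so the sketch returns each of the 6 masks exactly once instead of repeating subsets.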



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

