Eric Denovitzer created SPARK-5688:
--------------------------------------

             Summary: In Decision Trees, choosing a random subset of categories 
for each split
                 Key: SPARK-5688
                 URL: https://issues.apache.org/jira/browse/SPARK-5688
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.2.0
         Environment: Any
            Reporter: Eric Denovitzer
             Fix For: 1.2.0


The subsets of categories used to build splits on a categorical variable are 
not chosen at random. Each subset is derived from the binary representation of 
an integer from 1 to (2^(number of categories)) - 2, a range that excludes the 
empty and the full subset. In the current implementation, the integers used for 
the subsets are 1..numSplits. They should be drawn at random instead, because 
always using the smallest integers biases the splits towards the categories 
with the lower indexes. 
Another problem is that if numBins/2 is bigger than the number of possible 
subsets for a given category set, numSplits is still taken to be numBins/2. It 
should be the min of numBins/2 and (2^(number of categories)) - 2; otherwise 
the same subsets might be considered more than once when choosing the splits.
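
The two changes proposed above can be sketched as follows. This is a minimal 
illustration in Python, not the MLlib implementation: the function names 
num_candidate_splits and random_category_subsets are hypothetical, and the 
sketch assumes a small number of categories so that the range of subset codes 
fits in a machine-sized integer.

```python
import random


def num_candidate_splits(num_categories, num_bins):
    """Cap numSplits at the number of distinct non-empty, non-full subsets."""
    max_subsets = (1 << num_categories) - 2  # 2^k - 2
    return max(0, min(num_bins // 2, max_subsets))


def random_category_subsets(num_categories, num_bins, seed=None):
    """Pick distinct random subsets of categories for candidate splits.

    Hypothetical helper: each subset is encoded as an integer in
    1 .. 2^k - 2, where bit i set means category i belongs to the subset.
    """
    rng = random.Random(seed)
    num_splits = num_candidate_splits(num_categories, num_bins)
    # Sample without replacement so no subset is considered twice,
    # instead of always taking the codes 1..numSplits.
    codes = rng.sample(range(1, (1 << num_categories) - 1), num_splits)
    return [
        [cat for cat in range(num_categories) if (code >> cat) & 1]
        for code in codes
    ]


if __name__ == "__main__":
    # 3 categories allow only 2^3 - 2 = 6 subsets, so numSplits is capped at 6
    # even though numBins/2 would be 16.
    for subset in random_category_subsets(3, 32, seed=42):
        print(subset)
```

Sampling without replacement from the full range of codes removes the bias 
towards low category indexes, and the cap ensures that no more splits are 
requested than there are distinct subsets.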



