Eric Denovitzer created SPARK-5688:
--------------------------------------

             Summary: In Decision Trees, choosing a random subset of categories for each split
                 Key: SPARK-5688
                 URL: https://issues.apache.org/jira/browse/SPARK-5688
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.2.0
         Environment: Any
            Reporter: Eric Denovitzer
             Fix For: 1.2.0

The categories in each subset chosen to build a split on a categorical variable are not random. Each subset is determined by the binary representation of an integer from 1 to (2^(number of categories)) - 2, a range that excludes the empty subset and the full subset. In the current implementation, the integers used for the subsets are 1..numSplits. They should be drawn at random instead, because always taking the lowest integers biases the splits towards the categories with the lower indexes.

Another problem is that if numBins/2 is bigger than the number of possible subsets for a given category set, numSplits is still taken to be numBins/2. It should be the min of numBins/2 and (2^(number of categories)) - 2; otherwise the same subsets might be considered more than once when choosing the splits.
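
The two points in the description can be illustrated with a minimal sketch. This is not the MLlib code: it is written in Python for brevity, all function names are hypothetical, and it assumes that bit i of the integer selects category i and that the number of categories is small enough for 2^(number of categories) to be enumerated as a range.

```python
import random


def max_proper_subsets(num_categories):
    """Count of non-empty, non-full subsets: 2^M - 2 (as stated in the issue)."""
    return 2 ** num_categories - 2


def num_candidate_splits(num_categories, num_bins):
    """Proposed cap: min(numBins / 2, 2^M - 2), never negative."""
    return max(0, min(num_bins // 2, max_proper_subsets(num_categories)))


def decode_subset(code, num_categories):
    """Assumed encoding: category c is in the subset if bit c of `code` is set."""
    return [c for c in range(num_categories) if (code >> c) & 1]


def sequential_subsets(num_categories, num_splits):
    """Behaviour described in the issue: use the integers 1..numSplits."""
    return [decode_subset(code, num_categories)
            for code in range(1, num_splits + 1)]


def random_subsets(num_categories, num_bins, seed=None):
    """Proposed behaviour: draw distinct integers uniformly from 1..2^M - 2."""
    rng = random.Random(seed)
    num_splits = num_candidate_splits(num_categories, num_bins)
    # random.sample draws without replacement, so no subset is repeated.
    codes = rng.sample(range(1, max_proper_subsets(num_categories) + 1),
                       num_splits)
    return [decode_subset(code, num_categories) for code in codes]


if __name__ == "__main__":
    # With 10 categories and numBins = 32, the integers 1..16 only set
    # bits 0 to 4, so categories 5 to 9 never appear in a subset.
    print(sequential_subsets(10, 16))
    print(random_subsets(10, 32, seed=42))
```

Under these assumptions, a feature with 3 categories and numBins = 32 gets 6 candidate subsets rather than 16, so none is considered twice.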