Hi Anca,

Thanks for replying; that corrects my understanding. The method uses only the data passed to it to decide whether to split a node.

I may also have found a problem with the code. Please look at lines 277 to 285 of this file:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/classifier/df/builder/DecisionTreeBuilder.java?av=f
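
Here is a minimal sketch of how I read that loop. It is a simplified stand-in, not the Mahout source: the class name, the method subsetSizes, and the use of plain strings for attribute values are all my own assumptions. The only thing it models is the continue statement that skips values of the categorical attribute that are absent from the data passed to the method. It also assumes that every value in the node data appears in the full value list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the subset-building loop; not the Mahout implementation.
class ComplementedSplitSketch {

  // nodeValues: the attribute value of each row passed to the method.
  // allValues:  every value of the attribute in the entire dataset.
  // Returns, for each value that gets a subset, the number of rows in it.
  static Map<String, Integer> subsetSizes(List<String> nodeValues,
                                          List<String> allValues,
                                          boolean complemented) {
    Set<String> present = new HashSet<String>(nodeValues);

    // Complemented mode iterates over the full value list,
    // otherwise only over the values seen in the node data.
    List<String> candidates = complemented
        ? allValues
        : new ArrayList<String>(new LinkedHashSet<String>(nodeValues));

    Map<String, Integer> sizes = new LinkedHashMap<String, Integer>();
    for (String value : candidates) {
      // The skip under discussion: a value that is missing from the
      // node data never gets a subset, even in complemented mode.
      if (complemented && !present.contains(value)) {
        continue;
      }
      sizes.put(value, Collections.frequency(nodeValues, value));
    }
    return sizes;
  }

  public static void main(String[] args) {
    List<String> allValues = Arrays.asList("red", "green", "blue");
    List<String> nodeValues = Arrays.asList("red", "blue", "red"); // no "green"

    Map<String, Integer> plain = subsetSizes(nodeValues, allValues, false);
    Map<String, Integer> compl = subsetSizes(nodeValues, allValues, true);

    System.out.println("complemented=false: " + plain);
    System.out.println("complemented=true:  " + compl);
    System.out.println("same result: " + plain.equals(compl));
  }
}
```

This covers only the loop up to line 285. The leaves you mention at line 307 are created later and are outside this sketch.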

I agree that when complemented = false, on that node we might only branch on a subset of the possible values of C, namely those contained in the data passed to the method. However, when complemented = true, the split is still based on the same values of C from the data passed to the method. As lines 278 to 280 show, if a value of C occurs in the entire dataset but not in the data passed to the method, the continue statement is executed. So values of C that are absent from the data passed to the method do not affect the method.

In short, whether complemented is true or false, the result of executing lines 267 to 285 is the same.

On Fri, Nov 2, 2012 at 10:47 PM, Anca Leuca <[email protected]> wrote:
> Hi Yang,
>
> I think I understand it better now, as well. So this is what I think it
> does:
>
> First of all, I think it only affects the categorical node splits. It will
> work as following in this scenario:
> Let us consider a dataset D we want to build a decision tree from.
> Let's say the tree has been partially built, and we've reached a
> categorical attribute C that we want to split on.
>
> As I understand it, when parametrized = false, on that node we might only
> branch on a subset of possible values of C.
>
> When parametrized = true, however, we will 'force' branching on all
> possible values of C from the entire dataset, and replace the missing data
> with leaves having a label computed from the parent data (line 307):
>
> if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
>   label = sum / data.size();
> } else {
>   label = data.majorityLabel(rng);
> }
>
> I hope this is correct and helps with understanding it better.
>
> Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>,
> it's the Jira task that introduced the DecisionTreeBuilder, take a
> look at the comments, maybe it'll help you as well.
>
> Anca
