Hi,

> However, when complemented = true, the split is still based on the same
> possible values of C from the data that is passed to the method.

Yes. The split is indeed based on a subset of the data.

> As said by
> the code from line 278 to line 280, if a value of C is contained in the
> entire dataset, but not the data that is passed to the method, the continue
> statement is executed. So those values of C that are not contained in the
> data passed to the method do not affect the method.

Not sure what you mean by 'affect the method'. I think the data points that
refer to values of C not contained in the data passed in are not meant to
change the calculations. Also, *continue* is called twice, in the loop at
lines 277-285 and in the loop at lines 303-317, under the same conditions. So
technically I don't think there's a bug there, although admittedly it's not a
very clean/obvious solution :).

> In a word, whether complemented is true or false, the result after
> executing the code from line 267 to line 285 is the same.

Again, I am not sure what you mean by 'result'. If you mean the variable
*subsets*, yes, that one will have the same value, regardless of
complemented. The interesting stuff, however, happens in lines 302-332,
where the 'complementing' leaves are being built.

That being said, I think the best approach would be to just give the tree
builder a test and see what it spits out, for a simple dataset that you can
eyeball. Or have a look at the unit tests (if any); they should also give a
clue about what was meant.

Anca

> On Fri, Nov 2, 2012 at 10:47 PM, Anca Leuca <[email protected]>
> wrote:
>
> > Hi Yang,
> >
> > I think I understand it better now, as well. So this is what I think it
> > does:
> >
> > First of all, I think it only affects the categorical node splits. It will
> > work as follows in this scenario:
> > Let us consider a dataset D we want to build a decision tree from.
> > Let's say the tree has been partially built, and we've reached a
> > categorical attribute C that we want to split on.
> >
> > As I understand it, when parametrized = false, on that node we might only
> > branch on a subset of possible values of C.
> >
> > When parametrized = true, however, we will 'force' branching on all
> > possible values of C from the entire dataset, and replace the missing data
> > with leaves having a label computed from the parent data (line 307):
> >
> > if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
> >   label = sum / data.size();
> > } else {
> >   label = data.majorityLabel(rng);
> > }
> >
> > I hope this is correct and helps with understanding it better.
> >
> > Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>,
> > it's the Jira task that introduced the DecisionTreeBuilder, take a
> > look at the comments, maybe it'll help you as well.
> >
> > Anca
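
To make the behaviour described in this thread easier to eyeball, here is a
minimal toy sketch of how I read it. It is not Mahout code: the class
ComplementedSplitSketch, the Row type and the branchLabels method are all
made up for illustration, it only handles categorical labels (majority vote,
no mean for numerical labels), and a non-empty subset is turned straight into
a leaf instead of being built recursively.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration only; none of these names come from Mahout. */
class ComplementedSplitSketch {

  /** One training row: a value of the categorical attribute C plus a class label. */
  static final class Row {
    final String c;
    final String label;

    Row(String c, String label) {
      this.c = c;
      this.label = label;
    }
  }

  /** Most frequent label in the rows; on a tie, the label seen first wins. */
  static String majorityLabel(List<Row> rows) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Row r : rows) {
      counts.merge(r.label, 1, Integer::sum);
    }
    String best = null;
    int bestCount = -1;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }

  /**
   * Returns one entry per branch created at this node: value of C -> leaf label.
   * allValuesOfC comes from the entire dataset, nodeData is the data that
   * reached this node.
   */
  static Map<String, String> branchLabels(
      List<String> allValuesOfC, List<Row> nodeData, boolean complemented) {
    Map<String, String> branches = new LinkedHashMap<>();
    for (String value : allValuesOfC) {
      List<Row> subset = new ArrayList<>();
      for (Row r : nodeData) {
        if (r.c.equals(value)) {
          subset.add(r);
        }
      }
      if (!subset.isEmpty()) {
        // Stand-in for the child that a real builder would build recursively.
        branches.put(value, majorityLabel(subset));
      } else if (complemented) {
        // Value of C missing at this node: add a leaf labelled from the parent data.
        branches.put(value, majorityLabel(nodeData));
      }
      // Otherwise the value is skipped and no branch exists for it.
    }
    return branches;
  }

  public static void main(String[] args) {
    List<String> allValues = Arrays.asList("red", "green", "blue");
    List<Row> nodeData = Arrays.asList(
        new Row("red", "yes"), new Row("red", "yes"), new Row("green", "no"));

    System.out.println(branchLabels(allValues, nodeData, false)); // {red=yes, green=no}
    System.out.println(branchLabels(allValues, nodeData, true));  // {red=yes, green=no, blue=yes}
  }
}
```

In this toy run "blue" never reaches the node, so it only gets a branch when
complemented is true, and that branch is a leaf carrying the majority label
of the node's own data.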
