Ah, good that we reached an agreement. No problem, I quite enjoyed it.

On Fri, Nov 2, 2012 at 5:49 PM, Yang Zhou <[email protected]> wrote:
> Hi,
>
> Sorry about the confusing wording.
>
> I did not mean that there is a bug. What I meant, which is the same as what
> you said, is that the value of cnt in line 288 is the same whether
> complemented is true or not. And complemented does affect how the leaves
> are built.
>
> Really appreciate your time!
>
> On Sat, Nov 3, 2012 at 1:07 AM, Anca Leuca <[email protected]> wrote:
>
> > Hi,
> >
> > > However, when complemented = true, the split is still based on the same
> > > possible values of C from the data that is passed to the method.
> >
> > Yes. The split is indeed based on a subset of the data.
> >
> > > As said by the code from line 278 to line 280, if a value of C is
> > > contained in the entire dataset, but not in the data that is passed to
> > > the method, the continue statement is executed. So those values of C
> > > that are not contained in the data passed to the method do not affect
> > > the method.
> >
> > Not sure what you mean by 'affect the method'. I think the datapoints that
> > refer to values of C not contained in the data passed are not meant to
> > change the calculations.
> > Also, *continue* is being called twice: in the loop 277-285 and the loop
> > 303-317, under the same conditions. So technically I don't think there's a
> > bug there, although admittedly it's not a very clean/obvious solution :).
> >
> > > In a word, whether complemented is true or false, the result after
> > > executing the code from line 267 to line 285 is the same.
> >
> > Again, I am not sure what you mean by 'result'. If you mean the variable
> > *subsets*, yes, that one will have the same value, regardless of
> > complemented. The interesting stuff, however, happens in lines 302-332,
> > where the 'complementing' leaves are being built.
> >
> > That being said, I think the best approach would be to just give the tree
> > builder a test and see what it spits out, for a simple dataset that you
> > can eyeball.
> > Or have a look at the unit tests (if any); they should also give a clue
> > on what was meant.
> >
> > Anca
> >
> > > On Fri, Nov 2, 2012 at 10:47 PM, Anca Leuca <[email protected]> wrote:
> > >
> > > > Hi Yang,
> > > >
> > > > I think I understand it better now, as well. So this is what I think
> > > > it does:
> > > >
> > > > First of all, I think it only affects the categorical node splits. It
> > > > works as follows in this scenario:
> > > > Let us consider a dataset D we want to build a decision tree from.
> > > > Let's say the tree has been partially built, and we've reached a
> > > > categorical attribute C that we want to split on.
> > > >
> > > > As I understand it, when complemented = false, on that node we might
> > > > only branch on a subset of the possible values of C.
> > > >
> > > > When complemented = true, however, we will 'force' branching on all
> > > > possible values of C from the entire dataset, and replace the missing
> > > > data with leaves having a label computed from the parent data
> > > > (line 307):
> > > >
> > > >     if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
> > > >       label = sum / data.size();
> > > >     } else {
> > > >       label = data.majorityLabel(rng);
> > > >     }
> > > >
> > > > I hope this is correct and helps with understanding it better.
> > > >
> > > > Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>,
> > > > it's the Jira task that introduced the DecisionTreeBuilder. Take a
> > > > look at the comments, maybe it'll help you as well.
> > > >
> > > > Anca
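
For anyone who wants something small to eyeball, here is a minimal Java sketch of the behaviour described in this thread. It is not Mahout code and was not checked against the Mahout source: the class ComplementedSplitSketch, the Row type and the split and majorityLabel helpers are hypothetical names made up for illustration. It assumes categorical labels only (no mean for numerical labels), and ties in the majority vote go to the label seen first instead of being broken with a random generator.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only; none of these names come from Mahout.
final class ComplementedSplitSketch {

  /** One training row: its value for the categorical attribute C and its class label. */
  static final class Row {
    final String c;
    final String label;

    Row(String c, String label) {
      this.c = c;
      this.label = label;
    }
  }

  /** Most frequent label among the rows; ties go to the label seen first. */
  static String majorityLabel(List<Row> rows) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Row row : rows) {
      counts.merge(row.label, 1, Integer::sum);
    }
    String best = null;
    int bestCount = 0;
    for (Map.Entry<String, Integer> entry : counts.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  /**
   * Maps each branch value of C to a short description of the child node.
   * allValues holds every value of C in the entire dataset; nodeData holds
   * only the rows that reached the node being split.
   */
  static Map<String, String> split(List<String> allValues, List<Row> nodeData,
                                   boolean complemented) {
    // Group the node's rows by their value of C. This step never looks at
    // 'complemented', so the subsets are the same either way.
    Map<String, List<Row>> subsets = new LinkedHashMap<>();
    for (Row row : nodeData) {
      subsets.computeIfAbsent(row.c, key -> new ArrayList<>()).add(row);
    }

    String parentLabel = majorityLabel(nodeData);
    Map<String, String> children = new LinkedHashMap<>();
    for (String value : allValues) {
      List<Row> subset = subsets.get(value);
      if (subset != null) {
        // Value present at this node: a real tree would recurse here.
        children.put(value, "subtree over " + subset.size() + " row(s)");
      } else if (complemented) {
        // Value missing at this node: fill in a leaf labelled from the parent data.
        children.put(value, "leaf(" + parentLabel + ")");
      }
      // Otherwise the missing value gets no branch at this node.
    }
    return children;
  }

  public static void main(String[] args) {
    List<String> allValues = List.of("red", "green", "blue");
    List<Row> nodeData = List.of(
        new Row("red", "yes"), new Row("red", "yes"), new Row("blue", "no"));
    System.out.println(split(allValues, nodeData, false));
    System.out.println(split(allValues, nodeData, true));
  }
}
```

With the toy data in main, the value green never occurs at the node, so it gets no branch when complemented is false and a leaf carrying the parent's majority label (yes) when complemented is true. The branches for red and blue are the same in both runs.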
