Re: The function of the parameter complemented in DecisionTreeBuilder

Anca Leuca Fri, 02 Nov 2012 10:57:29 -0700

Ah, good that we reached an agreement. No problem, I quite enjoyed it.

On Fri, Nov 2, 2012 at 5:49 PM, Yang Zhou <[email protected]> wrote:


> Hi,
>
> Sorry about those confusing words.
>
> I did not mean that there is a bug there. What I meant, which is the same
> as what you said, is that whether complemented is true or not, the value of
> cnt in line 288 is the same. And complemented does affect how the leaves
> are built.
>
> Really appreciate your time!
>
> On Sat, Nov 3, 2012 at 1:07 AM, Anca Leuca <[email protected]>
> wrote:
>
> > Hi,
> >
> > > However, when complemented = true, the split is still based on the same
> > > possible values of C from the data that is passed to the method.
> >
> >
> > Yes. The split is indeed based on a subset of the data.
> >
> >
> > > As the code from line 278 to line 280 shows, if a value of C is contained
> > > in the entire dataset, but not in the data that is passed to the method,
> > > the continue statement is executed. So those values of C that are not
> > > contained in the data passed to the method do not affect the method.
> > >
> >
> > Not sure what you mean by 'affect the method'. I think the datapoints that
> > refer to values of C not contained in the data passed are not meant to
> > change the calculations.
> > Also, *continue* is executed in two places: in the loop at lines 277-285
> > and in the loop at lines 303-317, under the same conditions. So technically
> > I don't think there's a bug there, although admittedly it's not a very
> > clean/obvious solution :).
> >
> >
> > > In a word, whether complemented is true or false, the result after
> > > executing the code from line 267 to line 285 is the same.
> > >
> >
> > Again, I am not sure what you mean by 'result'. If you mean the variable
> > *subsets*, yes, that one will have the same value, regardless of
> > complemented. The interesting stuff, however, happens in lines 302-332,
> > where the 'complementing' leaves are being built.
> >
> > That being said, I think the best approach would be to just give the tree
> > builder a test and see what it spits out, for a simple dataset that you can
> > eyeball. Or have a look at the unit tests (if any), they should also give a
> > clue on what was meant.
> >
> > Anca
> >
> >
> > > On Fri, Nov 2, 2012 at 10:47 PM, Anca Leuca <[email protected]>
> > > wrote:
> > >
> > > > Hi Yang,
> > > >
> > > > I think I understand it better now, as well. So this is what I think it
> > > > does:
> > > >
> > > > First of all, I think it only affects the categorical node splits. It
> > > > will work as follows in this scenario:
> > > > Let us consider a dataset D we want to build a decision tree from.
> > > > Let's say the tree has been partially built, and we've reached a
> > > > categorical attribute C that we want to split on.
> > > >
> > > > As I understand it, when complemented = false, on that node we might
> > > > only branch on a subset of possible values of C.
> > > >
> > > > When complemented = true, however, we will 'force' branching on all
> > > > possible values of C from the entire dataset, and replace the missing
> > > > data with leaves having a label computed from the parent data (line 307):
> > > >
> > > > if (data.getDataset().isNumerical(data.getDataset().getLabelId())) {
> > > >   label = sum / data.size();
> > > > } else {
> > > >   label = data.majorityLabel(rng);
> > > > }
> > > >
> > > >
> > > > I hope this is correct and helps with understanding it better.
> > > >
> > > >
> > > > Also, I found this <https://issues.apache.org/jira/browse/MAHOUT-840>.
> > > > It's the Jira task that introduced the DecisionTreeBuilder. Take a
> > > > look at the comments, maybe they'll help you as well.
> > > >
> > > >
> > > >
> > > > Anca
> > > >
> > >
> >
>
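
For anyone reading this thread later, here is a minimal sketch of the
behaviour described above, in plain Java and independent of Mahout. It is
only a toy model, not the DecisionTreeBuilder code: the class
ComplementedSplitSketch, its Row type, and the split and majorityLabel
helpers are hypothetical names made up for illustration. It covers only the
classification case (majority label), and it breaks ties by first occurrence
instead of using a random generator as data.majorityLabel(rng) does. For a
numerical label, the quoted snippet would use the mean (sum / data.size())
instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a categorical split; NOT Mahout's DecisionTreeBuilder. */
final class ComplementedSplitSketch {

  /** One training row: a value of the categorical attribute C and a class label. */
  static final class Row {
    final String category;
    final String label;

    Row(String category, String label) {
      this.category = category;
      this.label = label;
    }
  }

  /** Most frequent label in rows; ties go to the label seen first (assumption). */
  static String majorityLabel(List<Row> rows) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (Row r : rows) {
      counts.merge(r.label, 1, Integer::sum);
    }
    String best = null;
    int bestCount = 0;
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() > bestCount) {
        best = e.getKey();
        bestCount = e.getValue();
      }
    }
    return best;
  }

  /**
   * Describes the children created when splitting on C.
   * allValues: every value of C in the entire dataset.
   * nodeData:  only the rows that reached the current node.
   */
  static Map<String, String> split(List<String> allValues, List<Row> nodeData,
                                   boolean complemented) {
    // Group the node's rows by their value of C; this grouping does not
    // depend on 'complemented'.
    Map<String, List<Row>> subsets = new LinkedHashMap<>();
    for (Row r : nodeData) {
      subsets.computeIfAbsent(r.category, k -> new ArrayList<>()).add(r);
    }

    Map<String, String> children = new LinkedHashMap<>();
    for (String value : allValues) {
      List<Row> subset = subsets.get(value);
      if (subset != null) {
        // Value present at this node: a real subtree would be built from it.
        children.put(value, "subtree(" + subset.size() + " rows)");
      } else if (complemented) {
        // Value absent at this node: fill in a leaf labelled from the parent data.
        children.put(value, "leaf(" + majorityLabel(nodeData) + ")");
      }
      // Value absent and complemented == false: no branch is created.
    }
    return children;
  }

  public static void main(String[] args) {
    List<Row> node = List.of(
        new Row("red", "yes"), new Row("red", "yes"), new Row("blue", "no"));
    List<String> all = List.of("red", "green", "blue");
    System.out.println(split(all, node, false));
    System.out.println(split(all, node, true));
  }
}
```

With the three rows in main, the first call prints
{red=subtree(2 rows), blue=subtree(1 rows)} and the second prints
{red=subtree(2 rows), green=leaf(yes), blue=subtree(1 rows)}. The grouping
is identical in both runs; only the extra leaf for "green" differs, which
matches the conclusion reached in the thread.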
