Hi there,

I am an environmental studies master’s student trying to get my thesis out the 
door.  I am also a newbie with trees in general, but I like what I see in the 
literature about the random forest algorithm.  I think I get the general gist 
of things, but even after reading up on it I’m unclear about how I could be 
getting the results I’m seeing.  I am obviously missing something about how the 
split points in the final tree are decided.

I’ve been using random forests in image classification by entering split values 
into decision tree classifiers, and that has seemed to work very well.  The map 
output appears legitimate, and withheld data give confusion matrices similar to 
the predictive errors from the random forest.  This leads me to assume that the 
split points are effective.

However, now that I’ve turned to the ecological portion of my analysis, with a 
data set that contains few variable levels and lots of zeros, the splitting 
node information has suddenly stopped making sense.

Here is my situation.  I have a matrix of study plots that each belong to one 
of three elevation classes and which each have percent cover class data for 15 
plant species associated with them.  

plot    elev    sp1     sp2     sp3     …       sp15
1       3       0       2       6       …       5
2       0       0       0       1       …       0
etc.

The species data are ordered factors from 0-9.  When I run the algorithm using 
species cover values to predict elevation class, two species alone come up as 
the best predictors.  That makes ecological sense in this setting, given the 
species ranges in question.

Here’s my difficulty though.  The split point values can’t be interpreted, as 
far as I can tell.  I’m getting split points of, say, 1.5 and 2.5 for a species 
whose cover is either 0 (absent) or 4 and above.  So obviously the split points 
in the final tree are being generated in some way I don’t understand.  
Averaged?  

I’ve tried running the tree using the data as factors, using the data as 
ordered factors, and using the data as numerical variables, just to see if I 
could gain insight into what’s going on, but I’m coming up clueless.  My 
literature hunt reveals repeated instances of folks saying that the final tree 
can’t be interpreted the way other trees are, but I’m not getting a lot on just 
why that might be.  
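
For reference, the three encodings mentioned above could be compared along 
these lines.  This is only a sketch on simulated stand-in data (not the real 
plots), assuming the randomForest package; the data frame `plots`, the helper 
`fit_with`, and the column names are made up for illustration.

```r
library(randomForest)

set.seed(1)
n <- 60

# Simulated stand-in data: cover is either 0 or 4-9, never 1-3
plots <- data.frame(
  elev = factor(sample(1:3, n, replace = TRUE)),
  sp1  = sample(c(0, 4:9), n, replace = TRUE),
  sp2  = sample(c(0, 4:9), n, replace = TRUE)
)

# Hypothetical helper: recode the species columns, fit a forest,
# and return the split table of the first tree in that forest
fit_with <- function(recode) {
  d <- plots
  d[c("sp1", "sp2")] <- lapply(d[c("sp1", "sp2")], recode)
  rf <- randomForest(elev ~ ., data = d, ntree = 50)
  getTree(rf, k = 1, labelVar = TRUE)
}

splits_numeric <- fit_with(as.numeric)
splits_factor  <- fit_with(factor)
splits_ordered <- fit_with(function(x) factor(x, ordered = TRUE))

# Compare the split points reported under each encoding
head(splits_numeric[, c("split var", "split point")])
head(splits_factor[, c("split var", "split point")])
head(splits_ordered[, c("split var", "split point")])
```

Note that `getTree` returns a single tree from the forest (here the first, 
`k = 1`), so the split table it prints describes that one tree only.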

Some folks talk about the final tree being “averaged”; others say that the 
“mode” is employed (which doesn’t make sense to me if I’m getting 1.5 and 2.5 
split values).  If the trees are only good as black box predictors (which is of 
course a very useful thing in itself), should I even be using the node 
information in my image classifications?  

As you can see, I’m missing some rather important point or other here.  Can you 
enlighten me?

Thanks,
A
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
