Hi Julien,

According to Leo Breiman, the father of Random Forests, one solution to this problem is to use weighted classes. Take a look at this:
http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance

On Fri, Apr 27, 2012 at 4:24 PM, Julien Naour <[email protected]> wrote:
> Hello,
>
> I'm currently doing an internship and I have to use Mahout and Hadoop. So
> far I have mostly read documentation and tutorials. I have begun changing
> the Mahout code to try some small modifications. For the moment I work on
> a single-node cluster to try different solutions for data mining on large
> datasets.
>
> My dataset is as follows: one learning sample of ~1 000 000 rows with ~80
> variables; my target is categorical with 2 classes, 1 and 0. One test
> sample of ~700 000 rows. It's a simple random sample.
>
> The real dataset is much larger.
>
> My problem is that Random Forest doesn't work on my dataset. I suppose it's
> mostly due to the target variable. It's a categorical variable with two
> classes, 1 and 0. The problem is that my dataset contains 0.22% of 1s and
> 99.78% of 0s. I thought that Random Forests were pretty good even when the
> target class is rare, but when I use BuildForest and TestForest my whole
> dataset is classified as 0, with a very good score of 99.78% of the dataset
> correctly classified...
>
> In fact all of my trees have few nodes: 300 nodes in total for 100 trees. I
> think that the rare target gives me a very poor forest. I tried another
> learning sample with ~3% of 1s and it works better, with well-formed trees.
> For the moment the objective is to build the forest on this sample without
> aggressive sampling. I know that sampling is a solution, and we have
> already explored it, but we want to try Random Forest on the real dataset.
>
> So my question is: what can I do? I tried modifying some code variables,
> like EPSILON in DecisionTreeBuilder, but it changed nothing. I probably
> have to change the information gain, which uses entropy, and perhaps use a
> Gini-based gain instead. Perhaps there is a known solution for this kind of
> problem other than sampling.
>
> My objective for the moment is to implement something like a score for
> each data instance, in order to find the 200 surest predictions. It could
> also be very interesting to implement variable importance from random
> forest theory.
>
> Thank you for your help and sorry for my bad English.
>
> Regards,
>
> Julien Naour
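
To make the weighted-classes idea concrete on the 0.22% / 99.78% split described in the quoted message, here is a minimal Python sketch. It is not Mahout code, and it is not necessarily how Breiman's implementation applies its weights. The function names balanced_class_weights and weighted_gini are hypothetical, and the inverse-frequency weighting is an assumption (one common choice among several).

```python
from collections import Counter


def balanced_class_weights(labels):
    """Assumed scheme: weight each class by n / (k * count), i.e. inversely
    to its frequency, so every class carries the same total mass."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_gini(labels, weights):
    """Gini impurity where a row of class y counts as weights[y] rows."""
    mass = Counter()
    for y in labels:
        mass[y] += weights[y]
    total = sum(mass.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


# Toy sample with the same class proportions as in the quoted message:
# 22 positives and 9978 negatives out of 10 000 rows.
labels = [1] * 22 + [0] * 9978

weights = balanced_class_weights(labels)
unweighted = weighted_gini(labels, {0: 1.0, 1: 1.0})
weighted = weighted_gini(labels, weights)

print("class weights:", weights)
print("unweighted Gini at the root:", unweighted)
print("weighted Gini at the root:", weighted)
```

Without weights the root node already looks almost pure (impurity around 0.0044), so no split can reduce impurity by much. With the balanced weights the root impurity is 0.5, and a split that isolates the rare class produces a much larger decrease. This may be part of why the unweighted trees stay so small, although the stopping criteria in the tree builder also play a role.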
