Hi Julien,

According to the father of Random Forests, one solution for this problem is
to use weighted classes. Take a look at this:

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
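
To give a rough idea of what class weighting does, here is a minimal Python
sketch. It is not Mahout code and not necessarily the exact scheme described
on that page: the function names and the inverse-frequency rule are my own
assumptions. It derives weights from the class proportions you quote and
compares the Gini impurity of the root node with and without them.

```python
def inverse_frequency_weights(counts):
    """Weight each class by N / (k * n_c) so all classes carry equal total weight.

    This particular rule is an assumption chosen for illustration.
    """
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


def weighted_gini(counts, weights=None):
    """Gini impurity of a node, optionally with per-class weights."""
    if weights is None:
        weights = {label: 1.0 for label in counts}
    # Weighted "mass" of each class in the node.
    mass = {label: counts[label] * weights[label] for label in counts}
    total = sum(mass.values())
    return 1.0 - sum((m / total) ** 2 for m in mass.values())


# Class counts for a 1,000,000-row sample with 0.22% of class 1.
counts = {0: 997_800, 1: 2_200}
weights = inverse_frequency_weights(counts)

print(weights)
print(weighted_gini(counts))           # unweighted impurity
print(weighted_gini(counts, weights))  # impurity after reweighting
```

Without weights the root node already looks almost pure (impurity around
0.0044), so a split has very little to gain. With the weights above, both
classes carry the same mass and the impurity becomes 0.5, which leaves the
tree builder something to reduce.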


On Fri, Apr 27, 2012 at 4:24 PM, Julien Naour <[email protected]> wrote:

> Hello,
>
> I'm currently doing an internship in which I have to use Mahout and Hadoop.
> So far I have mostly read documentation and worked through tutorials. I have
> begun modifying the Mahout code to try some small changes. For the moment I
> work on a single-node cluster to try different solutions for data mining on
> large datasets.
>
> My dataset is as follows: one learning sample of ~1,000,000 rows and ~80
> variables, with a categorical target with 2 classes, 1 and 0, and one test
> sample of ~700,000 rows. It's a simple random sample.
>
> The real dataset is much larger.
>
> My problem is that Random Forest doesn't work on my dataset. I suppose it's
> mostly due to the target variable. It's a categorical variable with two
> classes, 1 and 0. The problem is that my dataset contains 0.22% of 1 and
> 99.78% of 0. I thought that Random Forests were pretty good even when the
> target class is rare, but when I use BuildForest and TestForest, all of my
> dataset is classified as 0, with a very good score of 99.78% of the dataset
> correctly classified ...
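
A small Python sketch of why that 99.78% is misleading: a classifier that
always answers 0 reaches exactly the majority-class rate in accuracy while
finding none of the 1s. The 10,000-row toy sample and the helper names are
assumptions made for illustration; only the 0.22% proportion comes from your
message.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of true positives that the classifier actually found."""
    positives = sum(1 for t in y_true if t == positive)
    if positives == 0:
        return 0.0
    found = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    return found / positives


# Toy sample with the same 0.22% proportion of class 1.
y_true = [1] * 22 + [0] * 9_978
# A degenerate forest that classifies everything as 0.
y_pred = [0] * len(y_true)

print(accuracy(y_true, y_pred))
print(recall(y_true, y_pred))
```

Reporting recall (or a per-class error rate) next to accuracy makes this
failure visible immediately.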
>
> In fact, all of my trees have few nodes: 300 nodes in total for 100 trees. I
> think that the rare target class gives me a very poor forest. I tried
> another learning sample with ~3% of 1, and it works better, with well-formed
> trees. For the moment the objective is to build the forest on this sample
> with no hard sample issue. I know that sampling is a solution; we have
> already explored it, but we want to try Random Forest on the real dataset.
>
> So my question is: what can I do? I tried modifying some variables in the
> code, like EPSILON in DecisionTreeBuilder, but it changed nothing. I probably
> have to change the information gain, which uses entropy, and perhaps use a
> Gini-based gain instead. Perhaps there is a known solution for this kind of
> problem other than sampling.
>
> My objective for the moment is to implement something like a score for
> each data instance to find the 200 surest predictions. It could also be
> very interesting to implement the variable importance from random
> forest theory.
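
One possible way to get such a score, sketched in Python under the assumption
that you can collect each tree's individual prediction for an instance: use
the fraction of trees voting for class 1 and keep the k highest-scoring
instances. The function names and the toy votes are hypothetical, and nothing
here uses Mahout's classes.

```python
import heapq


def vote_fraction(tree_votes, positive=1):
    """Score an instance by the share of trees that voted for the positive class."""
    return sum(1 for v in tree_votes if v == positive) / len(tree_votes)


def top_k_by_score(votes_per_instance, k):
    """Return the k (index, score) pairs with the highest scores, best first."""
    scores = [vote_fraction(votes) for votes in votes_per_instance]
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])


# Toy example: 4 instances, each with the votes of 4 trees.
votes = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

print(top_k_by_score(votes, 2))
```

With k = 200 on the real test sample this would give the 200 instances the
forest is most confident are 1s. Ranking by score also sidesteps the fixed
majority-vote threshold, which matters when class 1 is this rare.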
>
> Thank you for your help, and sorry for my bad English.
>
> Regards,
>
> Julien Naour
>
