Hello,

I'm currently doing an internship and I have to use Mahout and Hadoop. So
far I have mostly read documentation and followed tutorials. I have started
modifying the Mahout code to try some small changes. For the moment I work
on a single-node cluster to try different solutions for data mining on
large datasets.

My dataset is as follows: one learning sample of ~1,000,000 rows and ~80
variables; my target is categorical with two classes, 1 and 0. One test
sample of ~700,000 rows. It's a simple random sample.

The real dataset is much larger.

My problem is that Random Forest doesn't work on my dataset. I suppose it's
mostly due to the target variable. It's a categorical variable with two
classes, 1 and 0. The problem is that my dataset contains 0.22% of 1 and
99.78% of 0. I thought that Random Forests were pretty good even when the
target class is rare, but when I use BuildForest and TestForest my whole
dataset is classified as 0, with a very good score of 99.78% of the dataset
correctly classified ...
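
To make the problem concrete, here is a minimal sketch in plain Java (not
Mahout code; the class and method names are my own, and the 2,200 positives
are just 0.22% of my ~1,000,000 rows). It shows that a model predicting 0
for every row gets the same accuracy while finding none of the 1s:

```java
public class MajorityBaseline {
    /** Accuracy of a classifier that always predicts the majority class 0. */
    static double accuracy(long positives, long negatives) {
        return (double) negatives / (positives + negatives);
    }

    /** Recall on class 1: share of the real 1s that were predicted as 1. */
    static double recall(long truePositives, long falseNegatives) {
        long total = truePositives + falseNegatives;
        return total == 0 ? 0.0 : (double) truePositives / total;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;       // approximate size of the learning sample
        long positives = 2_200L;      // 0.22% of the rows have target 1
        long negatives = rows - positives;

        // Always predicting 0: every 1 becomes a false negative.
        System.out.printf("accuracy = %.4f%n", accuracy(positives, negatives));
        System.out.printf("recall on class 1 = %.4f%n", recall(0, positives));
    }
}
```

So the 99.78% score says nothing about the rare class; recall or precision
on class 1 would be a more useful measure here.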

In fact all of my trees have few nodes: 300 nodes in total for 100 trees. I
think that the rare target class gives me a very poor forest. I tried with
another learning sample with ~3% of 1 and it works better, with well-formed
trees. For the moment the objective is to build the forest on this sample
with no hard sample issue. I know that resampling is a solution, and we
have already explored it, but we want to try Random Forest on the real
dataset.

So my question is: what can I do? I tried modifying some constants in the
code, like EPSILON in DecisionTreeBuilder, but it changes nothing. I
probably have to change the information gain, which uses entropy, and
perhaps use a Gini-based gain instead. Perhaps there is a known solution
for this kind of problem other than sampling.
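
To show what I mean by the two criteria, here is a small sketch in plain
Java. It is not the Mahout implementation: the class, the method names and
the candidate split in main are hypothetical, chosen only to compare
entropy gain and Gini gain on a node with my class proportions. With so few
1s both impurities are already small at the root, so I am not sure that
switching the criterion alone would change much.

```java
public class SplitCriteria {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Binary entropy in bits for a node where a fraction p of rows is class 1. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // a pure node has no impurity
        }
        return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
    }

    /** Gini impurity for two classes. */
    static double gini(double p) {
        return 2.0 * p * (1.0 - p);
    }

    static double impurity(boolean useGini, long pos, long neg) {
        long n = pos + neg;
        if (n == 0) {
            return 0.0;
        }
        double p = (double) pos / n;
        return useGini ? gini(p) : entropy(p);
    }

    /** Parent impurity minus the size-weighted impurity of the two children. */
    static double gain(boolean useGini, long pos, long neg, long leftPos, long leftNeg) {
        long n = pos + neg;
        long nLeft = leftPos + leftNeg;
        long nRight = n - nLeft;
        double parent = impurity(useGini, pos, neg);
        double left = impurity(useGini, leftPos, leftNeg);
        double right = impurity(useGini, pos - leftPos, neg - leftNeg);
        return parent - ((double) nLeft / n) * left - ((double) nRight / n) * right;
    }

    public static void main(String[] args) {
        long pos = 2_200L, neg = 997_800L;        // 0.22% of 1 at the root
        long leftPos = 1_500L, leftNeg = 100_000L; // hypothetical candidate split
        System.out.printf("root entropy = %.6f, root gini = %.6f%n",
                impurity(false, pos, neg), impurity(true, pos, neg));
        System.out.printf("entropy gain = %.6f, gini gain = %.6f%n",
                gain(false, pos, neg, leftPos, leftNeg),
                gain(true, pos, neg, leftPos, leftNeg));
    }
}
```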

My objective for the moment is to implement something like a score for
each data instance, to find the 200 most confident predictions. It could
also be very interesting to implement variable importance from random
forest theory.
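
For the score, the idea I have in mind is roughly the following sketch. It
is plain Java, not Mahout code: I assume I can get, for each instance, the
0/1 prediction of every tree (the votes array below is hypothetical), use
the fraction of trees voting 1 as the score, and keep the k instances with
the highest score (k = 200 in my case).

```java
import java.util.Arrays;

public class TopKByVotes {
    /** votes[i][t] is the 0/1 prediction of tree t for instance i. */
    static double[] scores(int[][] votes) {
        double[] s = new double[votes.length];
        for (int i = 0; i < votes.length; i++) {
            int ones = 0;
            for (int v : votes[i]) {
                ones += v;
            }
            // Score = fraction of trees that predict class 1.
            s[i] = votes[i].length == 0 ? 0.0 : (double) ones / votes[i].length;
        }
        return s;
    }

    /** Indices of the k highest scores, best first; ties keep input order. */
    static int[] topK(double[] scores, int k) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Stable sort on boxed indices, descending by score.
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));
        int n = Math.min(k, idx.length);
        int[] top = new int[n];
        for (int j = 0; j < n; j++) {
            top[j] = idx[j];
        }
        return top;
    }

    public static void main(String[] args) {
        // Toy example: 3 instances, 4 trees.
        int[][] votes = {
            {0, 0, 0, 0},
            {1, 1, 0, 0},
            {1, 1, 1, 0},
        };
        double[] s = scores(votes);
        System.out.println(Arrays.toString(topK(s, 2)));
    }
}
```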

Thank you for your help and sorry for my bad English.

Regards,

Julien Naour
