Hello,

I'm currently doing an internship in which I have to use Mahout and Hadoop. So far I have mostly read documentation and followed tutorials, and I have started modifying the Mahout code to try some small changes. For the moment I work on a single-node cluster to try different solutions for data mining on large datasets.

My dataset is as follows: one learning sample of ~1,000,000 rows with ~80 variables, and one test sample of ~700,000 rows. It is a simple random sample; the real dataset is much larger.

My problem is that Random Forest doesn't work on my dataset. I suppose this is mostly due to the target variable. It is categorical with two classes, 1 and 0, and in my dataset there are 0.22% of 1 and 99.78% of 0. I thought Random Forests were fairly good even when the target class is rare, but when I use BuildForest and TestForest the whole dataset is classified as 0, with a "very good" score of 99.78% of the dataset correctly classified... In fact all of my trees have very few nodes: 300 nodes in total for 100 trees. I think the rare target gives me a very poor forest.

I tried another learning sample with ~3% of 1 and it works better, with well-formed trees. For the moment the objective is to build the forest on this sample without a hard sampling issue. I know that resampling is a solution and we have already explored it, but we want to try Random Forest on the real dataset.

So my question is: what can I do? I tried modifying some values in the code, such as EPSILON in DecisionTreeBuilder, but it changed nothing. I probably have to change the information gain, which uses entropy, and perhaps use a Gini-based gain instead. Perhaps there is a known solution for this kind of problem other than sampling.

My objectives for the moment are to implement something like a score for each instance, in order to find the 200 most confident predictions. It could also be very interesting to implement variable importance from Random Forest theory.

Thank you for your help, and sorry for my bad English.

Regards,
Julien Naour
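
P.S. To make the entropy versus Gini question concrete, here is a minimal sketch of the two impurity measures and the gain of a binary split, for a two-class target. It is not Mahout code: the class ImpuritySketch and all of its method names are hypothetical and exist only for illustration. It only shows how small the impurity of the root node already is at 0.22% of 1; I have not verified how DecisionTreeBuilder actually uses EPSILON.

```java
/** Hypothetical helper, not part of Mahout: entropy and Gini impurity for a 0/1 target. */
final class ImpuritySketch {

    /** Shannon entropy, in bits, of a node holding the given class counts. */
    static double entropy(long positives, long negatives) {
        long total = positives + negatives;
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (long count : new long[] {positives, negatives}) {
            if (count > 0) {
                double p = (double) count / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** Gini impurity of a two-class node: 1 - p^2 - (1 - p)^2 = 2p(1 - p). */
    static double gini(long positives, long negatives) {
        long total = positives + negatives;
        if (total == 0) {
            return 0.0;
        }
        double p = (double) positives / total;
        return 2.0 * p * (1.0 - p);
    }

    /** Entropy gain of a binary split: parent entropy minus size-weighted child entropy. */
    static double entropyGain(long leftPos, long leftNeg, long rightPos, long rightNeg) {
        double left = leftPos + leftNeg;
        double right = rightPos + rightNeg;
        double total = left + right;
        if (total == 0) {
            return 0.0;
        }
        return entropy(leftPos + rightPos, leftNeg + rightNeg)
                - (left / total) * entropy(leftPos, leftNeg)
                - (right / total) * entropy(rightPos, rightNeg);
    }

    /** Gini gain of a binary split: parent Gini minus size-weighted child Gini. */
    static double giniGain(long leftPos, long leftNeg, long rightPos, long rightNeg) {
        double left = leftPos + leftNeg;
        double right = rightPos + rightNeg;
        double total = left + right;
        if (total == 0) {
            return 0.0;
        }
        return gini(leftPos + rightPos, leftNeg + rightNeg)
                - (left / total) * gini(leftPos, leftNeg)
                - (right / total) * gini(rightPos, rightNeg);
    }

    public static void main(String[] args) {
        // Root node of a 1,000,000-row sample with 0.22% of class 1.
        System.out.println("entropy = " + entropy(2200, 997800));
        System.out.println("gini    = " + gini(2200, 997800));
    }
}
```

Since the gain of any split is bounded by the impurity of the parent node, both measures leave very little room at the root of such a sample, so any minimum-gain threshold in the tree builder would be easy to fall under.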

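For the per-instance score, what I have in mind is roughly the fraction of trees that vote for class 1, then keeping the instances with the highest fractions. The sketch below assumes I can obtain the per-tree votes for each instance as an array of 0/1 labels; how to get them out of Mahout is exactly what I still have to find out. VoteScoreSketch, score and topK are hypothetical names, not Mahout classes or methods.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper, not part of Mahout: ranks instances by the share of trees voting 1. */
final class VoteScoreSketch {

    /** Fraction of trees voting for class 1; votes[t] is the label (0 or 1) given by tree t. */
    static double score(int[] votes) {
        if (votes.length == 0) {
            throw new IllegalArgumentException("at least one tree vote is required");
        }
        int ones = 0;
        for (int vote : votes) {
            if (vote == 1) {
                ones++;
            }
        }
        return (double) ones / votes.length;
    }

    /** Indices of the k instances with the highest scores, best first; ties keep index order. */
    static List<Integer> topK(double[] scores, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            indices.add(i);
        }
        // List.sort is stable, so equal scores stay in their original order.
        indices.sort(Comparator.comparingDouble((Integer i) -> scores[i]).reversed());
        int limit = Math.max(0, Math.min(k, indices.size()));
        return new ArrayList<>(indices.subList(0, limit));
    }
}
```

With this, the 200 most confident predictions of class 1 would be topK(scores, 200), where scores[i] is score(votes of instance i). Sorting everything is simple rather than efficient; on the full dataset a bounded heap would avoid sorting all rows.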