If you have enough memory, you could try the in-memory implementation (remove the -p parameter) and see whether the results improve.
How did you split the data into train/test datasets?

On Sat, Feb 25, 2012 at 12:57 PM, tanzek <[email protected]> wrote:
> Hello, Ted.
> I have a big problem when I run random forest to classify my dataset which
> is in the following format:
>
> -2.73887,-2.731803,15,0.00009,3,8,0.002033,0.046203,0.000005,1,0.00009,1,1,3,8,0.002033,0.124838,0.125,0.000005,1,0.298142,0,0,0,0.001425,1,1,11,11,0.001425,0.062832,0.090909,0.00001,0.001425,1,1,11,11,0.001425,0.114017,0.090909,0.000008,5.466667,10,1
> There are 44 numeric features and one label at last.
> First, I split the dataset into two parts. One has 90% which is used to
> train the model and the remain is used to test. So when I run random forest
> to train model with the following parameters:
> -Dmapred.max.split.size=13488881 -oob -sl 5 -p -t 100 -o forest-model
> the 13488881 means 1/10 size of the dataset. After I use the test set to
> predict the values, I get the following result:
>
> 12/02/24 22:12:55 INFO mapreduce.TestForest:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances   : 1999    3.7294%
> Incorrectly Classified Instances : 51602   96.2706%
> Total Classified Instances       : 53601
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a     b     c     d     e      f     g     h     <--Classified as
> 0     51    1     7154  5257   0     0     0      |  12463  a = 1
> 0     183   0     3255  26901  0     0     0      |  30339  b = 2
> 0     152   0     549   3742   0     0     0      |  4443   c = 3
> 0     699   0     320   2280   0     0     0      |  3299   d = 5
> 0     148   1     234   1472   0     0     0      |  1855   e = 4
> 0     14    0     497   332    20    19    0      |  882    f = 0
> 0     14    0     94    206    2     4     0      |  320    g = 6
> 0     0     0     0     0      0     0     0      |  0      h = unknown
> Default Category: unknown: 7
>
> Is this really a bad result? I don't know what this means? I need a help.
> Thank you.
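
On what the quoted numbers mean: one way to read the confusion matrix is to recompute the accuracy from its diagonal and compare it with what you would get by always predicting the most frequent label. The Python sketch below does that. The matrix values are copied from the output above; the variable names are my own, and this is only a sanity check on the report, not part of Mahout.

```python
# Confusion matrix copied from the quoted TestForest output.
# Rows are actual labels, columns are predicted labels, both in the order a..h.
labels = ["1", "2", "3", "5", "4", "0", "6", "unknown"]
matrix = [
    [0, 51, 1, 7154, 5257, 0, 0, 0],    # a = 1
    [0, 183, 0, 3255, 26901, 0, 0, 0],  # b = 2
    [0, 152, 0, 549, 3742, 0, 0, 0],    # c = 3
    [0, 699, 0, 320, 2280, 0, 0, 0],    # d = 5
    [0, 148, 1, 234, 1472, 0, 0, 0],    # e = 4
    [0, 14, 0, 497, 332, 20, 19, 0],    # f = 0
    [0, 14, 0, 94, 206, 2, 4, 0],       # g = 6
    [0, 0, 0, 0, 0, 0, 0, 0],           # h = unknown
]

# Total instances and correctly classified ones (the diagonal).
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# How often each label actually occurs (row sums) and how often
# the forest predicted it (column sums).
actual_counts = [sum(row) for row in matrix]
predicted_counts = [sum(col) for col in zip(*matrix)]

# Accuracy of a trivial classifier that always answers the most frequent label.
majority_baseline = max(actual_counts) / total

print(f"accuracy          : {accuracy:.4%}")
print(f"majority baseline : {majority_baseline:.4%}")
for label, actual, predicted in zip(labels, actual_counts, predicted_counts):
    print(f"label {label:>7}: actual={actual:6d} predicted={predicted:6d}")
```

The recomputed accuracy agrees with the summary (1999 of 53601), and it is far below the roughly 57% you would get by always answering label 2. Nearly all instances are predicted as label 4 or label 5, although those two labels together make up less than 10% of the test set. A result that much worse than the trivial baseline looks more like a problem with how the data or labels reach the model than an ordinary weak classifier, which is why the way the split was done matters; that is a guess from the numbers alone.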
