OK, I will give it a try. Because the dataset is large, I split it manually into two parts using the vi editor: one part has the first 90% of the lines, and the other has the last 10%. I have not completed the full cross-validation process; I only ran the first iteration. Is that a problem? Thank you for your replies.
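For reference, the same 90/10 split could also be scripted instead of done by hand. This is only a minimal sketch in Python, assuming one instance per line; the function name `split_lines` and its parameters are made up for illustration and are not part of Mahout. By default it keeps the file order (first 90% for training, last 10% for testing), which mirrors what I did; setting `shuffle=True` shuffles the lines first.

```python
import random


def split_lines(lines, train_percent=90, shuffle=False, seed=0):
    """Split a sequence of lines into (train, test) lists.

    Hypothetical helper: the first train_percent percent of the lines go
    to train and the rest go to test. With shuffle=True the lines are
    shuffled (reproducibly, via seed) before the cut is made.
    """
    rows = list(lines)
    if shuffle:
        random.Random(seed).shuffle(rows)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


# Small in-memory demonstration with ten dummy instances.
demo = ["instance-%d" % i for i in range(10)]
train, test = split_lines(demo)
print(len(train), len(test))
```

If the rows in the file are grouped by label, a first-90%/last-10% cut can give the two parts different class distributions, which is why the sketch has a shuffle option.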
On February 26, 2012 at 2:17 AM, deneche abdelhakim <[email protected]> wrote:

> If you have enough memory, you could try the in-memory implementation
> (remove -p parameter) and see if the results do improve.
>
> How did you split the data into train/test datasets ?
>
> On Sat, Feb 25, 2012 at 12:57 PM, tanzek <[email protected]> wrote:
>
> > Hello, Ted.
> > I have a big problem when I run random forest to classify my dataset
> > which is in the following format:
> >
> > -2.73887,-2.731803,15,0.00009,3,8,0.002033,0.046203,0.000005,1,0.00009,1,1,3,8,0.002033,0.124838,0.125,0.000005,1,0.298142,0,0,0,0.001425,1,1,11,11,0.001425,0.062832,0.090909,0.00001,0.001425,1,1,11,11,0.001425,0.114017,0.090909,0.000008,5.466667,10,1
> >
> > There are 44 numeric features and one label at last.
> > First, I split the dataset into two parts. One has 90% which is used to
> > train the model and the remain is used to test. So when I run random
> > forest to train model with the following parameters:
> >
> > -Dmapred.max.split.size=13488881 -oob -sl 5 -p -t 100 -o forest-model
> >
> > the 13488881 means 1/10 size of the dataset.
> > After I use the test set to predict the values, I get the following
> > result:
> >
> > 12/02/24 22:12:55 INFO mapreduce.TestForest:
> > =======================================================
> > Summary
> > -------------------------------------------------------
> > Correctly Classified Instances   :   1999     3.7294%
> > Incorrectly Classified Instances :  51602    96.2706%
> > Total Classified Instances       :  53601
> >
> > =======================================================
> > Confusion Matrix
> > -------------------------------------------------------
> > a    b    c    d      e      f    g    h    <--Classified as
> > 0    51   1    7154   5257   0    0    0     |  12463  a = 1
> > 0    183  0    3255   26901  0    0    0     |  30339  b = 2
> > 0    152  0    549    3742   0    0    0     |  4443   c = 3
> > 0    699  0    320    2280   0    0    0     |  3299   d = 5
> > 0    148  1    234    1472   0    0    0     |  1855   e = 4
> > 0    14   0    497    332    20   19   0     |  882    f = 0
> > 0    14   0    94     206    2    4    0     |  320    g = 6
> > 0    0    0    0      0      0    0    0     |  0      h = unknown
> > Default Category: unknown: 7
> >
> > Is this really a bad result? I don't know what this means? I need a help.
> > Thank you.
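
As a sanity check on the summary quoted above, the accuracy can be recomputed from the confusion matrix. This is a minimal sketch in Python with the matrix copied from the TestForest output (rows a to h, the row totals after the "|" left out); the function name `accuracy` is my own and only for illustration. I take the diagonal as the correctly classified instances, which is the usual reading of a confusion matrix.

```python
# One row per line of the confusion matrix printed above (a..h);
# columns follow the same a..h order.
confusion = [
    [0, 51, 1, 7154, 5257, 0, 0, 0],    # a = 1
    [0, 183, 0, 3255, 26901, 0, 0, 0],  # b = 2
    [0, 152, 0, 549, 3742, 0, 0, 0],    # c = 3
    [0, 699, 0, 320, 2280, 0, 0, 0],    # d = 5
    [0, 148, 1, 234, 1472, 0, 0, 0],    # e = 4
    [0, 14, 0, 497, 332, 20, 19, 0],    # f = 0
    [0, 14, 0, 94, 206, 2, 4, 0],       # g = 6
    [0, 0, 0, 0, 0, 0, 0, 0],           # h = unknown
]


def accuracy(matrix):
    """Return (correct, total, fraction correct) for a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct, total, correct / total


correct, total, fraction = accuracy(confusion)
print(correct, total, "%.4f%%" % (100 * fraction))

# Column sums: how many test instances ended up in each column a..h.
column_totals = [sum(row[j] for row in confusion) for j in range(len(confusion))]
print(column_totals)
```

The diagonal sums to 1999 out of 53601 instances, which agrees with the 3.7294% in the summary. The column sums show that almost every instance falls in column d or column e.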
