Hello, Ted.
I have a big problem when I run random forest to classify my dataset. Each
row is in the following format:
-2.73887,-2.731803,15,0.00009,3,8,0.002033,0.046203,0.000005,1,0.00009,1,1,3,8,0.002033,0.124838,0.125,0.000005,1,0.298142,0,0,0,0.001425,1,1,11,11,0.001425,0.062832,0.090909,0.00001,0.001425,1,1,11,11,0.001425,0.114017,0.090909,0.000008,5.466667,10,1
There are 44 numeric features, followed by one label in the last column.
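To double-check the column layout, here is a minimal Python sketch that splits the sample row above into features and label. The helper name parse_row is hypothetical and is not part of Mahout; the sketch assumes the label is always the last comma-separated value.

```python
# Sample row copied verbatim from the dataset excerpt above.
SAMPLE = (
    "-2.73887,-2.731803,15,0.00009,3,8,0.002033,0.046203,0.000005,1,"
    "0.00009,1,1,3,8,0.002033,0.124838,0.125,0.000005,1,0.298142,0,0,0,"
    "0.001425,1,1,11,11,0.001425,0.062832,0.090909,0.00001,0.001425,1,1,"
    "11,11,0.001425,0.114017,0.090909,0.000008,5.466667,10,1"
)


def parse_row(line):
    """Split one CSV row into (features, label).

    Assumption: every value except the last is numeric, and the last
    value is the class label (kept as a string).
    """
    values = line.strip().split(",")
    features = [float(v) for v in values[:-1]]
    label = values[-1]
    return features, label


features, label = parse_row(SAMPLE)
print(len(features), "features, label =", label)
```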
First, I split the dataset into two parts: 90% is used to train the model
and the remaining 10% is used to test it. I then run random forest to train
the model with the following parameters:
-Dmapred.max.split.size=13488881 -oob -sl 5 -p -t 100 -o forest-model
Here 13488881 is one tenth of the size of the dataset. When I then use the
test set to predict the labels, I get the following result:
12/02/24 22:12:55 INFO mapreduce.TestForest:
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 1999 3.7294%
Incorrectly Classified Instances : 51602 96.2706%
Total Classified Instances : 53601
=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     c     d     e      f     g     h     <--Classified as
0     51    1     7154  5257   0     0     0     |  12463  a = 1
0     183   0     3255  26901  0     0     0     |  30339  b = 2
0     152   0     549   3742   0     0     0     |  4443   c = 3
0     699   0     320   2280   0     0     0     |  3299   d = 5
0     148   1     234   1472   0     0     0     |  1855   e = 4
0     14    0     497   332    20    19    0     |  882    f = 0
0     14    0     94    206    2     4     0     |  320    g = 6
0     0     0     0     0      0     0     0     |  0      h = unknown
Default Category: unknown: 7
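For reference, the summary figures can be recomputed from the confusion matrix. The following is a minimal Python sketch, with the counts copied from the log above and assuming rows are the actual labels and columns the predicted labels, both in the log's order a..h. It also computes what always predicting the most frequent class would score (about 56.6% here, since label 2 accounts for 30339 of the 53601 test rows), which gives a baseline to compare against.

```python
# Counts copied from the TestForest log. Assumption: rows are actual
# labels and columns are predicted labels, in the log's order a..h.
labels = ["1", "2", "3", "5", "4", "0", "6", "unknown"]
matrix = [
    [0, 51, 1, 7154, 5257, 0, 0, 0],
    [0, 183, 0, 3255, 26901, 0, 0, 0],
    [0, 152, 0, 549, 3742, 0, 0, 0],
    [0, 699, 0, 320, 2280, 0, 0, 0],
    [0, 148, 1, 234, 1472, 0, 0, 0],
    [0, 14, 0, 497, 332, 20, 19, 0],
    [0, 14, 0, 94, 206, 2, 4, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

# Overall accuracy: diagonal entries are the correctly classified rows.
total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Majority-class baseline: accuracy of always predicting the label
# that occurs most often in the test set.
row_totals = [sum(row) for row in matrix]
majority_index = max(range(len(labels)), key=lambda i: row_totals[i])
baseline = row_totals[majority_index] / total

print(f"correct={correct} total={total} accuracy={accuracy:.4%}")
print(f"majority label={labels[majority_index]} baseline={baseline:.4%}")
```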
Is this really a bad result? I don't understand what it means. I need some help.
Thank you.