If you have enough memory, you could try the in-memory implementation
(remove the -p parameter) and see whether the results improve.

How did you split the data into train/test datasets?
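
The reason for asking: if the file is ordered (by label, for example) and
the first 90% is taken as the training set, the two parts can end up with
very different class distributions. Below is a minimal sketch of a shuffled,
per-label split. It is only an illustration, not what Mahout does
internally. The function name stratified_split and its parameters are made
up for this example, and it assumes each row is a CSV line whose last field
is the label, as in the sample record quoted below.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.1, seed=42):
    """Shuffle rows within each label, then hold out test_fraction of each.

    Hypothetical helper: assumes each row is a CSV line whose last
    field is the class label.
    """
    by_label = defaultdict(list)
    for row in rows:
        label = row.rstrip("\n").rsplit(",", 1)[-1]
        by_label[label].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])

    # Shuffle again so neither part ends up ordered by label.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

With a split like this, every label present in the test set is also present
in the training set in roughly the same proportion, which makes the test
result easier to interpret.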

On Sat, Feb 25, 2012 at 12:57 PM, tanzek <[email protected]> wrote:

> Hello, Ted.
> I have a big problem when I run random forest to classify my dataset which
> is in the following format:
>
> -2.73887,-2.731803,15,0.00009,3,8,0.002033,0.046203,0.000005,1,0.00009,1,1,3,8,0.002033,0.124838,0.125,0.000005,1,0.298142,0,0,0,0.001425,1,1,11,11,0.001425,0.062832,0.090909,0.00001,0.001425,1,1,11,11,0.001425,0.114017,0.090909,0.000008,5.466667,10,1
> There are 44 numeric features, followed by one label in the last column.
> First, I split the dataset into two parts: 90% is used to train the
> model and the remainder is used for testing. I then trained a random forest
> with the following parameters:
>    -Dmapred.max.split.size=13488881 -oob -sl 5 -p -t 100 -o forest-model
> Here 13488881 is 1/10 of the size of the dataset. When I then use the test
> set to predict the values, I get the following result:
>
> 12/02/24 22:12:55 INFO mapreduce.TestForest:
> =======================================================
> Summary
> -------------------------------------------------------
> Correctly Classified Instances          :       1999        3.7294%
> Incorrectly Classified Instances        :      51602       96.2706%
> Total Classified Instances              :      53601
>
> =======================================================
> Confusion Matrix
> -------------------------------------------------------
> a       b       c       d       e       f       g       h       <--Classified as
> 0       51      1       7154    5257    0       0       0        |  12463  a = 1
> 0       183     0       3255    26901   0       0       0        |  30339  b = 2
> 0       152     0       549     3742    0       0       0        |  4443   c = 3
> 0       699     0       320     2280    0       0       0        |  3299   d = 5
> 0       148     1       234     1472    0       0       0        |  1855   e = 4
> 0       14      0       497     332     20      19      0        |  882    f = 0
> 0       14      0       94      206     2       4       0        |  320    g = 6
> 0       0       0       0       0       0       0       0        |  0      h = unknown
> Default Category: unknown: 7
>
> Is this really a bad result? I don't know what it means. I need some help.
> Thank you.
>
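
To put the numbers above in context, here is a small sketch that recomputes
the accuracy from the confusion matrix and compares it with the accuracy of
always predicting the most frequent label. The matrix values are copied from
the TestForest output above; the variable names are only illustrative.

```python
# Rows are actual labels, columns are predicted labels, in the order
# printed by TestForest above: 1, 2, 3, 5, 4, 0, 6, unknown.
labels = ["1", "2", "3", "5", "4", "0", "6", "unknown"]
matrix = [
    [0, 51, 1, 7154, 5257, 0, 0, 0],
    [0, 183, 0, 3255, 26901, 0, 0, 0],
    [0, 152, 0, 549, 3742, 0, 0, 0],
    [0, 699, 0, 320, 2280, 0, 0, 0],
    [0, 148, 1, 234, 1472, 0, 0, 0],
    [0, 14, 0, 497, 332, 20, 19, 0],
    [0, 14, 0, 94, 206, 2, 4, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Accuracy of always predicting the most frequent actual label.
row_totals = [sum(row) for row in matrix]
majority_baseline = max(row_totals) / total

# How often each label was predicted (column sums).
predicted_totals = [sum(col) for col in zip(*matrix)]

print(f"accuracy          = {correct}/{total} = {accuracy:.4%}")
print(f"majority baseline = {max(row_totals)}/{total} = {majority_baseline:.4%}")
for label, actual, predicted in zip(labels, row_totals, predicted_totals):
    print(f"label {label:>7}: actual {actual:>6}, predicted {predicted:>6}")
```

The accuracy comes out at 1999/53601, matching the 3.7294% in the summary,
while always answering "2" would already be right about 56.6% of the time.
The column sums also show that label 1 is never predicted and that label 4
is predicted 40190 times although it has only 1855 actual instances. A
result this far below the baseline suggests that something other than model
quality may be off, which is why the way the data was split matters.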
