Depending on the dataset, the classes may not be evenly distributed between
the train and test datasets; e.g., if a class is completely missing or
underrepresented in the train dataset, you will get poor results. Can you
count the number of instances of each class in the train and test datasets?
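
One way to do that count is sketched below. It is only a minimal example: it assumes each dataset is a plain comma-separated file with the class label in the last column (as in the sample row quoted further down), and the function names are made up for illustration, not part of Mahout.

```python
import csv
from collections import Counter


def count_labels(path):
    """Count how often each class label (last CSV column) occurs in a file."""
    counts = Counter()
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row:  # skip blank lines
                counts[row[-1].strip()] += 1
    return counts


def compare_distributions(train_counts, test_counts):
    """Return (label, train_count, test_count) for every label in either set."""
    labels = sorted(set(train_counts) | set(test_counts))
    return [
        (label, train_counts.get(label, 0), test_counts.get(label, 0))
        for label in labels
    ]
```

Calling `compare_distributions(count_labels("train.csv"), count_labels("test.csv"))` (both file names are placeholders) lists each label with its train and test counts, so a class that is absent from or rare in the train file shows up as a zero or a small number in the second column.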

On Sun, Feb 26, 2012 at 2:29 AM, tanzek <[email protected]> wrote:

> OK, I will give it a try.
> Because the dataset is large, I split it manually into two parts using
> the vi editor: one has the first 90% and the other has the last 10%.
> I did not run the full cross-validation, only the first iteration.
> Is that a problem?
> Thank you for your replies.
>
> On Feb 26, 2012 at 2:17 AM, deneche abdelhakim <[email protected]> wrote:
>
> > If you have enough memory, you could try the in-memory implementation
> > (remove the -p parameter) and see whether the results improve.
> >
> > How did you split the data into train/test datasets?
> >
> > On Sat, Feb 25, 2012 at 12:57 PM, tanzek <[email protected]> wrote:
> >
> > > Hello, Ted.
> > > I have a big problem when I run random forest to classify my dataset,
> > > which is in the following format:
> > >
> > > -2.73887,-2.731803,15,0.00009,3,8,0.002033,0.046203,0.000005,1,0.00009,1,1,3,8,0.002033,0.124838,0.125,0.000005,1,0.298142,0,0,0,0.001425,1,1,11,11,0.001425,0.062832,0.090909,0.00001,0.001425,1,1,11,11,0.001425,0.114017,0.090909,0.000008,5.466667,10,1
> > >
> > > There are 44 numeric features followed by one label at the end.
> > > First, I split the dataset into two parts: 90% is used to train the
> > > model and the remainder is used to test it. I train the random forest
> > > model with the following parameters:
> > >    -Dmapred.max.split.size=13488881 -oob -sl 5 -p -t 100 -o forest-model
> > > where 13488881 is 1/10 of the size of the dataset. After I use the test
> > > set to predict the values, I get the following result:
> > >
> > > 12/02/24 22:12:55 INFO mapreduce.TestForest:
> > > =======================================================
> > > Summary
> > > -------------------------------------------------------
> > > Correctly Classified Instances          :       1999        3.7294%
> > > Incorrectly Classified Instances        :      51602       96.2706%
> > > Total Classified Instances              :      53601
> > >
> > > =======================================================
> > > Confusion Matrix
> > > -------------------------------------------------------
> > > a       b       c       d       e       f       g       h        <--Classified as
> > > 0       51      1       7154    5257    0       0       0        |  12463   a     = 1
> > > 0       183     0       3255    26901   0       0       0        |  30339   b     = 2
> > > 0       152     0       549     3742    0       0       0        |  4443    c     = 3
> > > 0       699     0       320     2280    0       0       0        |  3299    d     = 5
> > > 0       148     1       234     1472    0       0       0        |  1855    e     = 4
> > > 0       14      0       497     332     20      19      0        |  882     f     = 0
> > > 0       14      0       94      206     2       4       0        |  320     g     = 6
> > > 0       0       0       0       0       0       0       0        |  0       h     = unknown
> > > Default Category: unknown: 7
> > >
> > > Is this really a bad result? I don't know what it means. I need some
> > > help.
> > > Thank you.
> > >
> >
>
