Hi all,

I've been using the randomForest package and I'm trying to make the switch
over to party. My problem is that I have an extremely unbalanced outcome
(only 1% of the data has a positive outcome) which makes resampling methods
necessary.

randomForest has a very useful argument, sampsize, which lets me use a
balanced subsample to build each tree in my forest. Let's say the number
of positive cases is 100; my forest would look something like this:

rf <- randomForest(y ~ ., data = train, ntree = 800, replace = TRUE,
                   sampsize = c(100, 100))

So I use 100 cases and 100 controls to build each individual tree. Can I do
the same with cforest? I know I can always upsample, but I'd rather not.

I've tried playing around with the weights argument but I'm either not
getting it right or it's just the wrong thing to use.

Weights are your friend here. Suppose you have 100 observations of the first class and 1000 of the second. Using weights of 1/100 for the class-one observations and 1/1000 for the class-two observations gives you a sample that is balanced in expectation:

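# 100 obs of class 1 followed by 1000 obs of class 2
y <- gl(2, 1)[c(rep(1, 100), rep(2, 1000))]
# inverse class-frequency weight for every observation
w <- 1 / (table(y))[y]
# one weighted resample of size length(y); class counts should be close
tapply(rmultinom(n = 1, size = length(y), prob = w), y, sum)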
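To connect this back to the original question, here is a minimal sketch of how such weights might be passed to cforest. The toy data frame `train`, the single predictor `x`, and the settings `ntree = 50` and `mtry = 1` are assumptions made up for illustration. As far as I understand the party documentation, cforest draws the observations for each tree with probabilities weights / sum(weights), so please check ?cforest for your installed version before relying on this.

```r
set.seed(1)

# Toy data (hypothetical): 100 obs in class 1, 1000 obs in class 2.
y <- factor(c(rep(1, 100), rep(2, 1000)))
train <- data.frame(y = y, x = rnorm(length(y), mean = as.integer(y)))

# Inverse class-frequency weights as a plain numeric vector;
# each class then carries the same total weight.
w <- as.numeric(1 / table(y)[y])
class_weight <- tapply(w, y, sum)

# Average class counts over 200 weighted resamples of size length(y).
draws <- rmultinom(n = 200, size = length(y), prob = w)
mean_counts <- rowMeans(apply(draws, 2, function(d) tapply(d, y, sum)))

# Assumption: the party package is installed; skipped otherwise.
if (requireNamespace("party", quietly = TRUE)) {
  cf <- party::cforest(y ~ x, data = train, weights = w,
                       controls = party::cforest_unbiased(ntree = 50, mtry = 1))
}
```

Unlike sampsize in randomForest, this does not fix the per-tree class counts exactly; it only makes the two classes equally likely to be drawn, so individual trees will see slightly different class proportions.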

Best,

Torsten



Any advice on how to adapt cforests to datasets with imbalanced outcomes is
greatly appreciated...



Thanks!


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

