Re: [R] Random Forest classification
This is explained in the "Details" section of the help page for partialPlot.

Best,
Andy

> -----Original Message-----
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Jesús Para Fernández
> Sent: Tuesday, April 12, 2016 1:17 AM
> To: r-help@r-project.org
> Subject: [R] Random Forest classification
>
> Hi,
>
> To evaluate the partial influence of a factor with a random forest whose
> response is OK/NOK, I'm using partialPlot, with the factor on the x axis;
> the y axis runs between -1 and 1. What do the -1 and 1 mean?
>
> An example:
>
> https://www.dropbox.com/s/4b92lqxi3592r0d/Captura.JPG?dl=0
>
> Thanks for all!

Notice: This e-mail message, together with any attachments, contains information of Merck & Co., Inc. (2000 Galloping Hill Road, Kenilworth, New Jersey, USA 07033), and/or its affiliates (direct contact information for affiliates is available at http://www.merck.com/contact/contacts.html) that may be confidential, proprietary copyrighted and/or legally privileged. It is intended solely for the use of the individual or entity named on this message. If you are not the intended recipient, and have received this message in error, please notify us immediately by reply e-mail and then delete it from your system.

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
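For archive readers: the "Details" section referred to says that for classification, the value plotted is log(p_k(x)) minus the average of the logs of all class probabilities, i.e. a log-odds-like scale rather than a probability. A minimal sketch (using iris, not the poster's data):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris)
# For classification, the y axis is on the scale
#   log(p_k) - mean over classes of log(p_j)
# so 0 means "no evidence either way" and values near -1/1 are
# log-odds-like, not probabilities (see ?partialPlot, "Details").
pd <- partialPlot(rf, iris, Petal.Length, which.class = "setosa",
                  plot = FALSE)
range(pd$y)   # log-scale values, not probabilities
```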
Re: [R] rpart and randomforest results
Hi Sonja,

How did you build the rpart tree (i.e., what settings did you use in rpart.control)? rpart by default uses cross-validation to prune back the tree, whereas RF doesn't need that. There are other, more subtle differences as well. If you want to compare single-tree results, you really want to make sure the settings in the two are as close as possible. Also, how did you compute the pseudo-R2: on a test set, or some other way?

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Schillo, Sonja
Sent: Thursday, April 03, 2014 3:58 PM
To: Mitchell Maltenfort
Cc: r-help@r-project.org
Subject: Re: [R] rpart and randomforest results

Hi,

the random forest should do that, you're totally right. As far as I know it does so by randomly selecting the variables considered for a split (but here we set the option for how many variables to consider at each split to the number of variables available, so that the random forest does not have the chance to select the variables at random). The next thing that randomForest does is bootstrapping. But here again we set the option to the number of cases in the data set, so that no bootstrapping should be done. We tried to take all the randomness out of randomForest. Is that plausible, and does anyone have another idea?

Thanks
Sonja

From: Mitchell Maltenfort [mailto:mmal...@gmail.com]
Sent: Tuesday, April 01, 2014 13:32
To: Schillo, Sonja
Cc: r-help@r-project.org
Subject: Re: [R] rpart and randomforest results

Is it possible that the random forest is somehow adjusting for optimism or overfitting?

On Apr 1, 2014 7:27 AM, Schillo, Sonja <sonja.schi...@uni-due.de> wrote:

Hi all,

I have a question on rpart and randomForest results:

We calculated a single regression tree using rpart and got a pseudo-R2 of roughly 10% (which is not too bad compared to a linear regression on this data). Encouraged by this, we grew a whole regression forest on the same data set using randomForest, but we got pretty bad pseudo-R2 values for it (even negative values for some option settings). We then thought that if we built only one single tree with the randomForest routine, we should get a result similar to that of rpart. So we set the options for randomForest to grow only one single tree, but the resulting pseudo-R2 value was negative as well.

Does anyone have a clue as to why the randomForest results are so bad whereas the rpart result is quite OK? Is our assumption wrong that a single tree grown by randomForest should give results similar to a tree grown by rpart? What am I missing here?

Thanks a lot for your help!
Sonja
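A sketch of the "single tree, randomness switched off" configuration being discussed (settings are my reading of the thread, not the posters' actual code). One thing worth noting: randomForest's pseudo-R2 is computed from out-of-bag predictions, so if the "bootstrap" sample uses every case, there is little or no OOB data to evaluate on, which by itself can make the pseudo-R2 look bad compared with rpart's resubstitution fit:

```r
library(randomForest)
set.seed(1)
# One tree, every predictor tried at every split, no bootstrap resampling:
rf1 <- randomForest(Sepal.Length ~ ., data = iris,
                    ntree = 1,
                    mtry = 4,               # all 4 predictors
                    replace = FALSE,
                    sampsize = nrow(iris))  # every case in-bag
```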
Re: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?
If you are using that code, you're not really using randomForest directly. I don't understand the data structure you have (since you did not show anything), so I can't really tell you much.

In any case, that warning comes from randomForest() when it is run in regression mode but the response has five or fewer distinct values. It may be legitimate regression data, and if so you can safely ignore the warning (that's why it's not an error). It's there to catch the cases where people try to do classification with class labels 1, 2, ..., k and forget to make the response a factor.

Best,
Andy Liaw

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sean Porter
Sent: Thursday, March 20, 2014 3:27 AM
To: r-help@r-project.org
Subject: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?

Hello everyone,

I'm relatively new to R and new to the randomForest package, and have scoured the archives for help with no luck. I am trying to perform a regression on a set of predictors and response variables to determine the most important predictors. I have 100 response variables collected from 14 sites and 8 predictor variables from the same 14 sites. I run the code to perform the randomForest regression given by Pitcher et al. 2011 (http://gradientforest.r-forge.r-project.org/biodiversity-survey.pdf). However, after running the code I get the warning:

In randomForest.default(m, y, ...) : The response has five or fewer unique values. Are you sure you want to do regression?

And it produces a set of 500 regression trees for only 3 species, when the number of species in the response file is 100. I noticed that in the example by Pitcher they get 500 trees for only 90 species even though they input 110 species in the response data.

Why am I getting the warning / how do I solve it, and why is randomForest producing trees for only 3 species when I am looking at 100 species (response variables)?

Many thanks

Sean
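A minimal sketch (made-up data, not the poster's) that reproduces the warning and shows the factor fix Andy describes:

```r
library(randomForest)
set.seed(1)
x <- data.frame(a = rnorm(100), b = rnorm(100))
y <- sample(1:3, 100, replace = TRUE)   # class labels left numeric
rf_reg <- randomForest(x, y)            # regression => emits the warning
rf_cls <- randomForest(x, factor(y))    # classification, no warning
c(rf_reg$type, rf_cls$type)
```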
Re: [R] Variable importance - ANN
You can try something like this: http://pubs.acs.org/doi/abs/10.1021/ci050022a

It is basically the same idea as what is done in random forests: permute one predictor variable at a time and see how much that degrades prediction performance.

Cheers,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Giulia Di Lauro
Sent: Wednesday, December 04, 2013 6:42 AM
To: r-help@r-project.org
Subject: [R] Variable importance - ANN

Hi everybody,

I created a neural network for a regression analysis with package ANN, but now I need to know the significance of each predictor variable in explaining the dependent variable. I thought of analyzing the weights, but I don't know how to do it.

Thanks in advance,
Giulia Di Lauro
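The permute-and-measure idea works for any model with a predict() method, so it applies to a fitted neural network as well. A sketch with a made-up helper (`perm_importance` is not from any package), using lm as a stand-in since the poster's ANN isn't available:

```r
# Model-agnostic permutation importance: increase in MSE when one
# predictor's values are shuffled (breaking its link to the response).
perm_importance <- function(fit, X, y, nrep = 25) {
  base_mse <- mean((predict(fit, X) - y)^2)
  sapply(names(X), function(v) {
    mse_perm <- replicate(nrep, {
      Xp <- X
      Xp[[v]] <- sample(Xp[[v]])   # permute just this variable
      mean((predict(fit, Xp) - y)^2)
    })
    mean(mse_perm) - base_mse
  })
}

set.seed(1)
fit <- lm(mpg ~ wt + qsec, data = mtcars)   # stand-in for the ANN
imp <- perm_importance(fit, mtcars[c("wt", "qsec")], mtcars$mpg)
imp   # larger value = more important predictor
```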
Re: [R] How do I extract Random Forest Terms and Probabilities?
#2 can be done simply with predict(fmi, type = "prob"). See the help page for predict.randomForest().

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of arun
Sent: Tuesday, November 26, 2013 6:57 PM
To: R help
Subject: Re: [R] How do I extract Random Forest Terms and Probabilities?

Hi,

For the first part, you could do:

fmi2 <- fmi
attributes(fmi2$terms) <- NULL
capture.output(fmi2$terms)
#[1] "Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width"

A.k.

On Tuesday, November 26, 2013 3:55 PM, Lopez, Dan <lopez...@llnl.gov> wrote:

Hi R Experts,

I need your help with two questions regarding randomForest.

1. When I run a random forest model, how do I extract the formula I used so that I can store it in a character vector in a dataframe? For example, the dataframe might look like this if I am running models using the iris dataset:

#ModelID, Type, Formula
#001, RF, Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

fmi <- randomForest(Species ~ ., iris, mtry = 3, ntree = 500)

I know the information is in fmi$terms, but I'm not sure how to extract just the formula. Or perhaps there is somewhere else in fmi that I could get this?

2. How do I get the probabilities (probability-like values) from the model that was run? I know for the test set I can use predict. And I know that to extract the classifications from the model I use fmi$predicted. But where are the probabilities?

Dan
Workforce Analyst
HRIM - Workforce Analytics & Metrics
LLNL
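Putting both answers together in one runnable sketch (formula() on the stored terms is one route to #1; the OOB vote fractions answer #2 without any test set):

```r
library(randomForest)
set.seed(1)
fmi <- randomForest(Species ~ ., data = iris, mtry = 3, ntree = 500)

# 1. The formula as a character string, without touching attributes:
form_chr <- paste(deparse(formula(fmi$terms)), collapse = " ")
form_chr

# 2. Class probabilities (OOB vote fractions) for the training data --
#    note: no newdata argument, so these are out-of-bag:
probs <- predict(fmi, type = "prob")
head(probs)
```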
Re: [R] interpretation of MDS plot in random forest
Yes, that's part of the intention anyway. One can also use them to do clustering.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Massimo Bressan
Sent: Monday, December 02, 2013 6:34 AM
To: r-help@r-project.org
Subject: [R] interpretation of MDS plot in random forest

Given this general example:

set.seed(1)
data(iris)
iris.rf <- randomForest(Species ~ ., iris, proximity = TRUE, keep.forest = TRUE)
#varImpPlot(iris.rf)
#varUsed(iris.rf)
MDSplot(iris.rf, iris$Species)

I've been reading the documentation about random forests (to the best of my - poor - knowledge), but I'm in trouble with the correct interpretation of the MDS plot, and I hope someone can give me some clues.

What is meant by "the scaling coordinates of the proximity matrix"? I understand that the objective here is to present the distances among species in a parsimonious and visual way (of lower dimensionality). Is there therefore a parallel to the principal components in a classical PCA? Are the scaling coordinates Dim 1 and Dim 2 the eigenvectors of the proximity matrix? If so, how would you find the eigenvalues for those eigenvectors, and what are the eigenvalues representing? What are these two dimensions in the plot saying about the different iris species - their relative distance in terms of proximity within the space of Dim 1 and Dim 2? How should one choose the k parameter (the number of dimensions for the scaling coordinates)? And finally, how would you explain the plot in simple terms?

Thank you for any feedback.

Best regards
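To make the PCA parallel concrete: MDSplot() is, in essence, classical multidimensional scaling (cmdscale) applied to 1 - proximity, so the eigenvalues the poster asks about are available directly. A sketch:

```r
library(randomForest)
set.seed(1)
iris.rf <- randomForest(Species ~ ., data = iris, proximity = TRUE)

# Classical scaling of the dissimilarity 1 - proximity, keeping eigenvalues:
mds <- cmdscale(1 - iris.rf$proximity, k = 2, eig = TRUE)
dim(mds$points)   # one (Dim 1, Dim 2) pair per observation

# The PCA-style "share of variation" analogue from the eigenvalues:
head(mds$eig) / sum(abs(mds$eig))
```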
Re: [R] Split type in the RandomForest package
Classification trees use the Gini index, whereas the regression trees use the sum of squared errors. These criteria are hard-wired into the C/Fortran code, so they are not easily changeable.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Cheng, Chuan
Sent: Monday, September 30, 2013 6:30 AM
To: R-help@r-project.org
Subject: [R] Split type in the RandomForest package

Hi guys,

I'm new to the random forest package and I'd like to know what type of split is used in the package for classification. Or can I configure the package to use a different split type (like a simple split along a single attribute axis, or a linear split based on several attributes, etc.)?

Thanks a lot!
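While the splitting criterion can't be changed, the splits themselves can at least be inspected, which also shows that every split is axis-aligned (one variable, one split point). A sketch:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 10)
# Each row is a node; "split var" / "split point" show the single
# variable and threshold used -- there are no multi-variable splits:
tree1 <- getTree(rf, k = 1, labelVar = TRUE)
head(tree1)
```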
Re: [R] What is the difference between Mean Decrease Accuracy produced by importance(foo) vs foo$importance in a Random Forest Model?
The difference is importance(..., scale=TRUE). See the help page for details. If you extract the $importance component from a randomForest object, you do not get the scaling.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Lopez, Dan
Sent: Wednesday, November 13, 2013 12:16 PM
To: R help (r-help@r-project.org)
Subject: [R] What is the difference between Mean Decrease Accuracy produced by importance(foo) vs foo$importance in a Random Forest Model?

Hi R Expert Community,

My question: what is the difference between the Mean Decrease Accuracy produced by importance(foo) vs. foo$importance in a random forest model?

I ran a random forest classification model where the classifier is binary. I stored the model in the object FOREST_model. I then ran importance(FOREST_model) and FOREST_model$importance. I usually use the former, but decided to learn more about what is in a randomForest object, so I ran the latter. I expected both to produce identical output. Mean Decrease Gini is the only thing that is identical in both. I looked at ?randomForest and the 'randomForest' package documentation and didn't find any info explaining this difference. I am not including a reproducible example because this is most likely something simple - perhaps one is divided by something (if so, what?) - that I am just not aware of.
importance(FOREST_model):

                        HC          TER  MeanDecreaseAccuracy  MeanDecreaseGini
APPT_TYP_CD_LL  0.16025157 -0.521041660            0.15670297         12.793624
ORG_NAM_LL      0.20886631 -0.952057325            0.20208393        107.137049
NEW_DISCIPLINE  0.20685079 -0.960719435            0.20076762         86.495063

FOREST_model$importance:

                          HC           TER  MeanDecreaseAccuracy  MeanDecreaseGini
APPT_TYP_CD_LL  0.0049473962 -3.727629e-03          0.0045949805         12.793624
ORG_NAM_LL      0.0090715845 -2.401016e-02          0.0077298067        107.137049
NEW_DISCIPLINE  0.0130672572 -2.656671e-02          0.0114583178         86.495063

Dan Lopez
LLNL, HRIM, Workforce Analytics & Metrics
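A runnable sketch of the difference (iris instead of the poster's data): per the help page, scale = TRUE divides the permutation-based columns by their standard errors (stored in $importanceSD), while MeanDecreaseGini is never scaled, which is why only that column matched:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

raw    <- rf$importance[, "MeanDecreaseAccuracy"]   # unscaled, as in $importance
scaled <- importance(rf)[, "MeanDecreaseAccuracy"]  # scale = TRUE is the default

# The Gini column is identical either way:
all.equal(importance(rf)[, "MeanDecreaseGini"],
          rf$importance[, "MeanDecreaseGini"])
```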
Re: [R] FW: Nadaraya-Watson kernel
Use KernSmooth (one of the recommended packages that are included in the R distribution). E.g.,

library(KernSmooth)
# KernSmooth 2.23 loaded
# Copyright M. P. Wand 1997-2009
x <- seq(0, 1, length = 201)
y <- 4 * cos(2 * pi * x) + rnorm(x)
f <- locpoly(x, y, degree = 0, kernel = "epan", bandwidth = .1)
plot(x, y)
lines(f, lwd = 2)

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Ms khulood aljehani
Sent: Tuesday, November 05, 2013 9:49 AM
To: r-h...@stat.math.ethz.ch
Subject: [R] FW: Nadaraya-Watson kernel

From: aljehan...@hotmail.com
To: r-help@r-project.org
Subject: Nadaraya-Watson kernel
Date: Tue, 5 Nov 2013 17:42:13 +0300

Hello,

I want to compute the Nadaraya-Watson kernel estimate with the Epanechnikov kernel. I use the command

ksmooth(x, y, kernel = "normal", bandwidth, ...)

but the kernel argument accepts only the "normal" and "box" kernels. I want to compute the estimate with the Epanechnikov kernel.

Thank you
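Local polynomial regression with degree = 0 is exactly the Nadaraya-Watson estimator, which is why Andy's locpoly call answers the question. For illustration, here is a minimal hand-rolled version with an Epanechnikov kernel (`nw_epan` is a made-up helper, not part of KernSmooth):

```r
# Nadaraya-Watson with the Epanechnikov kernel K(u) = 0.75*(1 - u^2) on |u|<=1:
nw_epan <- function(x, y, xout, h) {
  sapply(xout, function(x0) {
    u <- (x - x0) / h
    w <- ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)
    if (sum(w) == 0) NA_real_ else sum(w * y) / sum(w)
  })
}

x <- seq(0, 1, length = 201)
y <- 4 * cos(2 * pi * x)            # noiseless test curve
fhat <- nw_epan(x, y, xout = 0.5, h = 0.05)
fhat   # close to 4 * cos(pi) = -4
```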
Re: [R] Creating 3d partial dependence plots
It needs to be done by hand, in that partialPlot() does not handle more than one variable at a time. You need to modify its code to do that (and be ready to wait even longer, as it can be slow).

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jerrod Parker
Sent: Sunday, March 03, 2013 7:08 PM
To: r-help@r-project.org
Subject: [R] Creating 3d partial dependence plots

Help,

I've been having a difficult time trying to create 3d partial dependence plots using rgl. It looks like this question has been asked a couple of times, but I'm unable to find a clear answer by googling. I've tried creating x, y, and z variables by extracting them from the partialPlot output, to no avail. I've seen these plots used several times in articles, and I think they would help me a great deal in looking at interactions. Could someone provide a coding example using randomForest and rgl? It would be greatly appreciated.

Thank you,
Jerrod Parker
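A sketch of the "by hand" route: evaluate the model on a grid over two variables, averaging over the training data for everything else, then feed the resulting matrix to persp() or rgl. The variable choices (mtcars, wt, hp) are illustrative assumptions, and it is slow for large data, as Andy warns:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(mpg ~ ., data = mtcars)

# Grids over the two variables of interest:
g1 <- quantile(mtcars$wt, probs = seq(0.1, 0.9, by = 0.2))
g2 <- quantile(mtcars$hp, probs = seq(0.1, 0.9, by = 0.2))

# Two-variable partial dependence: fix (wt, hp), average predictions
# over all training rows for the remaining variables:
pd <- outer(g1, g2, Vectorize(function(a, b) {
  tmp <- mtcars
  tmp$wt <- a
  tmp$hp <- b
  mean(predict(rf, tmp))
}))

# persp(g1, g2, pd) -- or rgl::persp3d(g1, g2, pd) for an interactive plot
```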
Re: [R] How do I make R randomForest model size smaller?
Try the following:

set.seed(100)
rf1 <- randomForest(Species ~ ., data = iris)
set.seed(100)
rf2 <- randomForest(iris[1:4], iris$Species)
object.size(rf1)
object.size(rf2)
str(rf1)
str(rf2)

You can try it on your own data. That should give you some hints about why the formula interface should be avoided with large datasets.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman
Sent: Monday, December 03, 2012 3:43 PM
To: r-help@r-project.org
Subject: [R] How do I make R randomForest model size smaller?

I've been training randomForest models on 7 million rows of data (41 features). Here's an example call:

myModel <- randomForest(RESPONSE ~ ., data = mydata, ntree = 50, maxnodes = 30)

I thought surely with only 50 trees and 30 terminal nodes that the memory footprint of myModel would be small. But it's 65 megs in a dump file. The object seems to be holding all sorts of predicted, actual, and vote data from the training process. What if I just want the forest and that's it? I want a tiny dump file that I can load later to make predictions off of quickly. I feel like the forest by itself shouldn't be all that large...

Anyone know how to strip this sucker down to just something I can make predictions off of going forward?
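Beyond switching to the x/y interface, components that predict() does not use can be dropped from the object before saving. This is a sketch, not a supported API; the safe practice shown here is to verify that predictions from the slimmed object match before trusting it:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species, ntree = 50)

# Drop training-set bookkeeping; prediction only needs $forest and metadata:
slim <- rf
for (comp in c("predicted", "votes", "oob.times", "err.rate",
               "confusion", "y"))
  slim[[comp]] <- NULL

# Verify before saving (vote fractions are deterministic given the forest):
stopifnot(identical(predict(rf,   iris[1:4], type = "prob"),
                    predict(slim, iris[1:4], type = "prob")))
c(full = object.size(rf), slim = object.size(slim))
```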
Re: [R] Different results from random.Forest with test option and using predict function
Without data to reproduce what you saw, we can only guess. One possibility is tie-breaking. There are several places where ties can occur and are broken at random, including at the prediction step. One difference between the two ways of doing prediction is that when it's all done within randomForest(), the test set prediction is performed as each tree is grown. If there is any tie that needs to be broken at any prediction step, it will affect the RNG stream used by the subsequent tree-growing step. You can also inspect/compare the forest components of the randomForest objects to see if they are the same. At least the first tree in both should be identical.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of tdbuskirk
Sent: Monday, December 03, 2012 6:31 PM
To: r-help@r-project.org
Subject: [R] Different results from random.Forest with test option and using predict function

Hello R Gurus,

I am perplexed by the different results I obtained when I ran code like this:

set.seed(100)
test1 <- randomForest(BinaryY ~ ., data = Xvars, ntree = 51, mtry = 5)
predict(test1, newdata = cbind(NewBinaryY, NewXs), type = "response")

and this code:

set.seed(100)
test2 <- randomForest(BinaryY ~ ., data = Xvars, ntree = 51, mtry = 5, xtest = NewXs, ytest = NewBinaryY)

The confusion matrices for the two forests I thought would be the same by virtue of the same seed settings, but they differ, as do the predicted values as well as the votes. At first I thought it was just the way ties were broken, so I changed the number of trees to an odd number so that there are no ties anymore.

Can anyone shed light on what I am hoping is a simple oversight? I just can't figure out why the results of the predictions from these two forests applied to the NewBinaryY and NewX data sets would not be the same.

Thanks for any hints and help.
Sincerely,
Trent Buskirk
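A sketch of the two prediction routes Andy contrasts, on iris (not the poster's data). Even with identical seeds, agreement need not be exact, because in-training test prediction interleaves with tree growing and any random tie-breaking shifts the RNG stream:

```r
library(randomForest)
train <- iris[seq(1, 150, by = 2), ]
test  <- iris[seq(2, 150, by = 2), ]

# Route 1: fit, then predict() on new data
set.seed(100)
rf_a   <- randomForest(Species ~ ., data = train, ntree = 51)
pred_a <- predict(rf_a, test)

# Route 2: pass the test set into randomForest() itself
set.seed(100)
rf_b   <- randomForest(Species ~ ., data = train, ntree = 51,
                       xtest = test[1:4], ytest = test$Species)
pred_b <- rf_b$test$predicted

mean(pred_a == pred_b)   # typically close to, but not guaranteed to be, 1
```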
Re: [R] Partial dependence plot in randomForest package (all flat responses)
Not unless we have more information. Please read the posting guide to see how to make it easier for people to answer your question.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Oritteropus
Sent: Thursday, November 22, 2012 2:02 PM
To: r-help@r-project.org
Subject: [R] Partial dependence plot in randomForest package (all flat responses)

Hi,

I'm trying to make a partial plot with the package randomForest in R. After building my random forest object I type

partialPlot(data.rforest, pred.data = act2, x.var = "centroid", "C")

where data.rforest is my randomForest object, act2 is the original dataset, centroid is one of the predictors, and C is one of the classes in my response variable. Whatever predictor or response class I try, I always get a plot with a straight line (a completely flat response). Similarly, if I set a categorical variable as the predictor, I get a barplot with all bars of the same height. I suppose I'm doing something wrong here, because all other analyses on the same randomForest object seem correct (e.g. varImpPlot or MDSplot). Is it possible it is related to some option set in the random forest object? Can somebody see the problem here?

Thanks for your time
Re: [R] Random Forest for multiple categorical variables
How about taking the combination of the two? E.g.,

gamma <- factor(paste(alpha, beta, sep = ":"))

and use gamma as the response.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Gyanendra Pokharel
Sent: Tuesday, October 16, 2012 10:47 PM
To: R-help@r-project.org
Subject: [R] Random Forest for multiple categorical variables

Dear all,

I have the following data set:

    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 alpha  beta
1    1 11  1 11  1 11  1 11  1  11 alpha beta1
2    2 12  2 12  2 12  2 12  2  12 alpha beta1
3    3 13  3 13  3 13  3 13  3  13 alpha beta1
4    4 14  4 14  4 14  4 14  4  14 alpha beta1
5    5 15  5 15  5 15  5 15  5  15 alpha beta1
6    6 16  6 16  6 16  6 16  6  16 alpha beta2
7    7 17  7 17  7 17  7 17  7  17 alpha beta2
8    8 18  8 18  8 18  8 18  8  18 alpha beta2
9    9 19  9 19  9 19  9 19  9  19 alpha beta2
10  10 20 10 20 10 20 10 20 10  20 alpha beta2

I want to use randomForest for classification. If there is one categorical response variable with different classes, we can use randomForest(resp ~ ., data, ...), but here I need to classify the data with two categorical variables. Any idea will be great.

Thanks
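A runnable sketch of the combined-response trick on made-up data (the column names and class rules are hypothetical, echoing the post). Predictions on the joint factor can be split back into the two original labels with strsplit():

```r
library(randomForest)
set.seed(1)
n <- 200
dat <- data.frame(V1 = rnorm(n), V2 = rnorm(n))
alpha <- factor(ifelse(dat$V1 > 0, "alpha1", "alpha2"))
beta  <- factor(ifelse(dat$V2 > 0, "beta1",  "beta2"))

# One joint response covering both categorical variables:
dat$gamma <- factor(paste(alpha, beta, sep = ":"))
rf <- randomForest(gamma ~ ., data = dat)
levels(dat$gamma)
```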
Re: [R] Random Forest - Extract
1. Not sure what you want - what details are you looking for exactly? If you call predict(FOREST_model) without the newdata argument, you will get the (out-of-bag) prediction of the training set, which is exactly the predicted component of the RF object.

2. If you set type = "vote" and norm.votes = FALSE, you will get the counts instead of proportions.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Lopez, Dan
Sent: Wednesday, September 26, 2012 9:05 PM
To: R help (r-help@r-project.org)
Subject: [R] Random Forest - Extract

Hello,

I have two random forest (RF) related questions.

1. How do I view the classifications for the detail data of my training data (aka trainset) that I used to build the model? I know there is an object called predicted, which I believe is a vector. To view the detail for my testset I bind the columns together as below. I was trying to do something similar for my trainset, but without putting it through the predict function - instead taking it directly from the randomForest object, which I stored in FOREST_model. I really need to get to this information to do some comparison of certain cases.

RF_DTL <- cbind(testset, predict(FOREST_model, testset, type = "response"))

2. The predict function for an RF model in R has three possible type arguments: "response", "vote", or "prob". I noticed "vote" and "prob" are identical for all records in my data set. Is this typical? If so, what is the point of having both? Ease of use?

Dan
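Both answers in one runnable sketch (iris in place of the poster's trainset):

```r
library(randomForest)
set.seed(1)
FOREST_model <- randomForest(Species ~ ., data = iris)

# 1. OOB classifications of the training data, two equivalent views:
oob_a <- predict(FOREST_model)   # no newdata => out-of-bag predictions
oob_b <- FOREST_model$predicted

# 2. Raw vote counts rather than proportions:
cnt <- predict(FOREST_model, iris, type = "vote", norm.votes = FALSE)
head(cnt)
```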
Re: [R] interpret the importance output?
The type = 1 importance measure in RF compares the prediction error of each tree on the OOB data with the prediction error of the same tree on the OOB data with the values of one variable randomly shuffled. If the variable has no predictive power, then the two should be very close, and there's a 50% chance that the difference is negative. If the variable is important, then shuffling its values should significantly degrade the prediction in the form of increased MSE. The importance measure takes the mean of these individual-tree differences in MSE and then divides by the SD of the differences. With that, I hope it's clear that only v2 and v4 in your example are potentially important.

Best,
Andy

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Johnathan Mercer
Sent: Monday, August 27, 2012 11:40 AM
To: r-h...@stat.math.ethz.ch
Subject: [R] interpret the importance output?

> importance(rfor.pdp11_t25.comb1, type = 1)
         %IncMSE
v1 -0.28956401263
v2  1.92865561147
v3  -0.63443929130
v4  1.58949137047
v5  0.03190940065

I wasn't entirely confident with interpreting these results based on the documentation. Could you please interpret?
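A sketch that makes Andy's interpretation concrete with made-up data: one genuinely predictive variable and one pure-noise variable. The informative one gets a large positive %IncMSE, while the noise variable hovers near zero (and can dip negative, as in the poster's v1 and v3):

```r
library(randomForest)
set.seed(1)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- 2 * d$x1 + rnorm(300)   # y depends on x1 only

rf  <- randomForest(y ~ ., data = d, importance = TRUE)
imp <- importance(rf, type = 1)
imp   # x1 large and positive; x2 near (possibly below) zero
```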
Re: [R] Stratified Sampling with randomForest Regression
Yes, you need to modify both the R and the underlying C code. It's in the source package on CRAN (the .tar.gz file). Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Josh Browning Sent: Friday, June 01, 2012 10:48 AM To: r-help@r-project.org Subject: [R] Stratified Sampling with randomForest Regression Hi All, I'm using R's randomForest package (and it's quite awesome!) but I'd really like to do some stratified sampling with a regression problem. However, it appears that the package was designed to only accommodate stratified sampling for classification purposes (see https://stat.ethz.ch/pipermail/r-help/2006-November/117477.html). As Andy suggests in the link just mentioned, I'm trying to modify the source code. However, it appears that I may also need to modify the C code that randomForest is calling, is that correct? If so, how do I access that code? Or, has anyone modified the package to allow for stratified sampling in regression problems? Please let me know if I'm not being clear enough with this question, and thanks for helping me out! Josh [[alternative HTML version deleted]]
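For reference, a hedged sketch of fetching the CRAN source package for local modification (the destination directory and the rebuild commands in the comments are illustrative):

```r
## fetch the source package (.tar.gz) from CRAN; it contains both the
## R code (R/) and the underlying C code (src/)
download.packages("randomForest", destdir = tempdir(), type = "source")
## after editing, rebuild and install from the shell, e.g.:
##   R CMD build randomForest
##   R CMD INSTALL randomForest_x.y-z.tar.gz
```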
Re: [R] Question about random Forest function in R
Hi Kelly, The function has a limitation: it cannot handle any column in your x that is a categorical variable with more than 32 categories. One possibility is to see if you can bin some of the categories together to get below 32 categories. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Kelly Cool Sent: Tuesday, May 29, 2012 10:47 AM To: r-help@r-project.org Subject: [R] Question about random Forest function in R Hello, I am trying to run the randomForest function on a data.frame using the following code:
myrf <- randomForest(y=sample_data_metal, x=Train, importance=TRUE, proximity=TRUE)
However, an error occurs saying "Can not handle categorical predictors with more than 32 categories." My x=Train data.frame is quite large and my y=sample_data_metal is one column. I'm not sure how to go about fixing this error or if there is even a way to get around it. Thanks in advance for any help. [[alternative HTML version deleted]]
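A sketch of the binning Andy suggests, using a hypothetical high-cardinality factor f (data made up for illustration):

```r
## collapse rare levels so a factor has fewer than 32 categories
set.seed(1)
f <- factor(sample(paste0("cat", 1:40), 500, replace = TRUE))  # 40 levels
keep <- names(sort(table(f), decreasing = TRUE))[1:31]  # 31 most frequent
f2 <- factor(ifelse(as.character(f) %in% keep, as.character(f), "other"))
nlevels(f2)  # now at most 32 levels, acceptable to randomForest()
```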
Re: [R] Random Forest Classification_ForestCombination
As long as you can remember that the summaries such as variable importance, OOB predictions, and OOB error rates are not applicable, I think that should be fine. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Nikita Desai Sent: Wednesday, May 23, 2012 1:51 PM To: r-help@R-project.org Subject: [R] Random Forest Classification_ForestCombination Hello, I am aware of the fact that the combine() function in the Random Forest package of R is meant to combine forests built from the same training set, but is there any way to combine trees built on different training sets? Both the training datasets used contain the same variables and classes, but their sizes are different. Thanks [[alternative HTML version deleted]]
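For reference, combine() in its documented use looks like the sketch below (same training data; for different training sets with identical variables and classes the call is mechanically the same, but, per Andy's caveat, the OOB-based summaries of the result are not meaningful):

```r
library(randomForest)
set.seed(1)
rf1 <- randomForest(Species ~ ., data = iris, ntree = 50)
rf2 <- randomForest(Species ~ ., data = iris, ntree = 50)
rf.all <- combine(rf1, rf2)  # one forest holding the trees of both
rf.all$ntree                 # total number of trees in the combined forest
```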
Re: [R] Random forests prediction
I don't think this is so hard to explain. If you evaluate AUC using either OOB prediction or a test set (or something like CV or the bootstrap), that would be what I expect for most data. When you add more variables (that are, say, less informative) to a model, the model has to look harder to find the informative ones, and thus you pay a penalty. One exception is if some of the new variables happen to have very strong interactions with some of the old variables; then you may see improved performance. I've said it several times before, but it seems to be worth repeating: Don't use the training set for evaluating models; that almost never makes sense. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of matt Sent: Friday, May 11, 2012 3:43 PM To: r-help@r-project.org Subject: [R] Random forests prediction Hi all, I have a strange problem when applying RF in R. I have a set of variables with which I obtain an AUC of 0.67. I also have a second set of variables that has an AUC of 0.57. When I merge the first and second sets of variables, the AUC becomes 0.64. I would expect the prediction to become better as I add variables that do have some predictive power. This is even stranger because the AUC on the training set increased when I added more variables (while the AUC on the validation set thus decreased). Has anyone experienced the same and/or does anyone know what could be the reason? Thanks, Matthijs -- View this message in context: http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409.html Sent from the R help mailing list archive at Nabble.com.
Re: [R] No Data in randomForest predict
It doesn't: you just get an error if there are NAs in the data; e.g.,
R> rf1 <- randomForest(iris[1:4], iris[[5]])
R> predict(rf1, newdata=data.frame(Sepal.Length=1, Sepal.Width=2, Petal.Length=3, Petal.Width=NA))
Error in predict.randomForest(rf1, newdata = data.frame(Sepal.Length = 1, : missing values in newdata
Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jennifer Corcoran Sent: Saturday, May 05, 2012 5:17 PM To: r-help@r-project.org Subject: [R] No Data in randomForest predict I would like to ask a general question about the randomForest predict function and how it handles No Data values. I understand that you can omit No Data values while developing the randomForest object, but how does it handle No Data in the prediction phase? I would like the output to be NA if any (not just all) of the input data have an NA value. It is not clear to me whether this is the default or whether I need to add an argument in the predict function. [[alternative HTML version deleted]]
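If NA outputs are wanted instead of an error, one workaround (a sketch; newdat is a made-up data frame, and rf1 is refit as in the iris example) is to predict only on complete rows and leave the rest as NA:

```r
library(randomForest)
rf1 <- randomForest(iris[1:4], iris[[5]])
newdat <- data.frame(Sepal.Length = c(5.1, 1), Sepal.Width = c(3.5, 2),
                     Petal.Length = c(1.4, 3), Petal.Width = c(0.2, NA))
ok <- complete.cases(newdat)        # TRUE only for rows with no NAs
pred <- rep(NA_character_, nrow(newdat))
pred[ok] <- as.character(predict(rf1, newdata = newdat[ok, , drop = FALSE]))
pred  # second element stays NA because one of its inputs was missing
```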
Re: [R] Random forests prediction
That's not how RF works at all. The setting of mtry is irrelevant to this. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of matt Sent: Monday, May 14, 2012 10:22 AM To: r-help@r-project.org Subject: Re: [R] Random forests prediction But shouldn't it be resolved when I set mtry to the maximum number of variables? Then the model explores all the variables for the next step, so it will still be able to find the better ones? And then in the later steps it could use the (less important) variables. Matthijs -- View this message in context: http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409p4629944.html Sent from the R help mailing list archive at Nabble.com.
Re: [R] Partial Dependence and RandomForest
Note that the partialPlot() function also returns the x-y pairs being plotted, so you can work from there if you wish. As to SD, my guess is you want some sort of confidence interval or band around the curve? I do not know of any theory to produce that, but that may well just be my ignorance. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of jmc Sent: Friday, April 13, 2012 11:20 AM To: r-help@r-project.org Subject: Re: [R] Partial Dependence and RandomForest Thank you Andy. I obviously neglected to read into the help file and, frustratingly, could have known this all along. However, I am still interested in knowing the relative maximum value in the partial plots via query instead of visual interpretation (and possibly getting at other statistical measures like standard deviation). Is it possible to do this? I will keep investigating, but would appreciate a hint in the right direction if you have time. -- View this message in context: http://r.789695.n4.nabble.com/Partial-Dependence-and-RandomForest-tp4549705p4555146.html Sent from the R help mailing list archive at Nabble.com.
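A sketch of querying the curve's maximum from the returned values rather than reading it off the plot (iris used as stand-in data):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species)
## plot = FALSE suppresses the plot; the x-y pairs are still returned
pp <- partialPlot(rf, iris[1:4], Petal.Width,
                  which.class = "versicolor", plot = FALSE)
pp$x[which.max(pp$y)]  # predictor value where partial dependence peaks
```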
Re: [R] loess function take
Alternatively, use only a subset to run loess(): either a random sample or something like every k-th (sorted) data value, or the quantiles. It's hard for me to imagine that that many data points are going to improve your model much at all (unless you use a tiny span). Andy From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Uwe Ligges On 12.04.2012 05:49, arunkumar wrote: Hi, the loess function takes a very long time if the dataset is huge. I have around 100 records and used only one independent variable; it still takes a very long time. Any suggestion to reduce the time? Use another method that is computationally less expensive for that many observations. Uwe Ligges - Thanks in Advance Arun -- View this message in context: http://r.789695.n4.nabble.com/loess-function-take-tp4550896p4550896.html Sent from the R help mailing list archive at Nabble.com.
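A sketch of the subsampling suggestion (data simulated for illustration; the sample size and span are arbitrary choices, not from the thread):

```r
set.seed(1)
## simulate a large dataset with one predictor
n <- 2e5
big <- data.frame(x = runif(n))
big$y <- sin(2 * pi * big$x) + rnorm(n, sd = 0.2)
## fit loess on a random subsample instead of all rows
sub <- big[sample(n, 5000), ]
fit <- loess(y ~ x, data = sub, span = 0.3)
## predict back onto the full data if fitted values are needed everywhere
head(predict(fit, newdata = big))
```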
Re: [R] Partial Dependence and RandomForest
Please read the help page for the partialPlot() function and make sure you learn about all its arguments (in particular, which.class). Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of jmc Sent: Wednesday, April 11, 2012 2:44 PM To: r-help@r-project.org Subject: [R] Partial Dependence and RandomForest Hello all~ I am interested in clarifying something more conceptual, so I won't be providing any data or code here. From what I understand, partial dependence plots can help you understand the relative dependence on a variable, and the subsequent values of that variable, after averaging out the effects of the other input variables. This is great, but what I am interested in knowing is how that relates to each predictor class, not just the overall prediction. Is it possible to plot partial dependence per class? Specifically, I'd like to know the important threshold values of my most important variables. Thank you for your time, -- View this message in context: http://r.789695.n4.nabble.com/Partial-Dependence-and-RandomForest-tp4549705p4549705.html Sent from the R help mailing list archive at Nabble.com.
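A sketch of per-class partial dependence using which.class (iris as stand-in data; one curve per class):

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species)
## one partial-dependence curve of Petal.Length per response class
op <- par(mfrow = c(1, 3))
for (cl in levels(iris$Species))
  partialPlot(rf, iris[1:4], Petal.Length, which.class = cl, main = cl)
par(op)
```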
Re: [R] Execution speed in randomForest
Without seeing your code, it's hard to say much more, but do avoid using the formula interface when you have large data. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jason Caroline Shaw Sent: Friday, April 06, 2012 1:20 PM To: jim holtman Cc: r-help@r-project.org Subject: Re: [R] Execution speed in randomForest The CPU time and elapsed time are essentially identical. (That is, the system time is negligible.) Using Rprof, I just ran the code twice. The first time, while randomForest is doing its thing, there are 850 consecutive lines which read: .C randomForest.default randomForest randomForest.formula randomForest. Upon running it a second time, this time taking 285 seconds to complete, there are 14201 such lines, with nothing intervening. There shouldn't be interference from elsewhere on the machine; this is the only memory- and CPU-intensive process. I don't know how to check what kind of paging is going on, but since the machine has 16GB of memory and I am using maybe 3 or 4 at most, I hope paging is not an issue. I'm on a CentOS 5 box running R 2.15.0. On Fri, Apr 6, 2012 at 12:45 PM, jim holtman jholt...@gmail.com wrote: Are you looking at the CPU or the elapsed time? If it is the elapsed time, then also capture the CPU time to see if it is different. Also consider the use of the Rprof function to see where time is being spent. What else is running on the machine? Are you doing any paging? What type of system are you running on? Use some of the system-level profiling tools. If on Windows, then use perfmon. On Fri, Apr 6, 2012 at 11:28 AM, Jason Caroline Shaw los.sh...@gmail.com wrote: I am using the randomForest package. I have found that multiple runs of precisely the same command can generate drastically different run times. Can anyone with knowledge of this package provide some insight as to why this would happen and whether there's anything I can do about it?
Here are some details of what I'm doing: - Data: ~80,000 rows, with 10 columns (one of which is the class label) - I randomly select 90% of the data to use to build 500 trees. And this is what I find: - Execution times of randomForest() using the entire dataset (in seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22 - Execution times of randomForest() using the 90% selection: 17.78, 17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 -- Note the 3rd, 4th, and 7th. - When the speed is slow, it often stutters, with one or a few trees being produced very quickly, followed by a slow build taking 10 or 20 seconds - The OOB results are indistinguishable between the fast and slow runs. I select the 90% of my data by using sample() to generate indices and then subsetting, like: selection <- data[sample, ]. I thought perhaps this subsetting was getting repeated, rather than storing in memory a new copy of all that data, so I tried circumventing this with eval(data[sample, ]). Probably barking up the wrong tree -- it had no effect, and doesn't explain the run-to-run variation (really, I'm just not clear on what eval() is for). I have also tried garbage collecting with gc() between each run, and adding a Sys.sleep() for 5 seconds, but neither of these has helped either. Any ideas? -- Jim Holtman Data Munger Guru What is the problem that you are trying to solve? Tell me what you want to do, not how you want to do it.
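Regarding Andy's note about the formula interface: a sketch of the x/y call that avoids formula overhead on large data (iris as stand-in data; column positions are illustrative):

```r
library(randomForest)
set.seed(1)
## formula interface (slower on big data):
##   rf <- randomForest(Species ~ ., data = iris)
## direct interface: pass the predictor columns and the label separately
rf <- randomForest(x = iris[, 1:4], y = iris[, 5], ntree = 500)
rf$type  # "classification", since y is a factor
```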
Re: [R] Imputing missing values using LSmeans (i.e., population marginal means) - advice in R?
Don't know how you searched, but perhaps this might help: https://stat.ethz.ch/pipermail/r-help/2007-March/128064.html -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jenn Barrett Sent: Tuesday, April 03, 2012 1:23 AM To: r-help@r-project.org Subject: [R] Imputing missing values using LSmeans (i.e., population marginal means) - advice in R? Hi folks, I have a dataset that consists of counts over a ~30 year period at multiple (200) sites. Only one count is conducted at each site in each year; however, not all sites are surveyed in all years. I need to impute the missing values because I need an estimate of the total population size (i.e., sum of counts across all sites) in each year as input to another model.
head(newdat, 40)
   SITE YEAR COUNT
1     1 1975 12620
2     1 1976 13499
3     1 1977 45575
4     1 1978 21919
5     1 1979 33423
...
37    2 1975     4
38    2 1978 40322
39    2 1979     7
40    2 1980 16244
It was suggested to me by a statistician to use LSmeans to do this; however, I do not have SAS, nor do I know anything much about SAS. I have spent DAYS reading about these LSmeans and while (I think) I understand what they are, I have absolutely no idea how to a) calculate them in R and b) how to use them to impute my missing values in R. Again, I've searched the mail lists, internet and literature and have not found any documentation to advise on how to do this - I'm lost. I've looked at popMeans, but have no clue how to use this with predict() - if this is even the route to go. Any advice would be much appreciated. Note that YEAR will be treated as a factor and not a linear variable (i.e., the relationship between COUNT and YEAR is not linear - rather there are highs and lows about every 10 or so years). One thought I did have was to just set up a loop to calculate the least-squares estimates as: Yij = (I*Yi. + J*Y.j - Y..)/[(I-1)(J-1)], where I = number of treatments and J = number of blocks (so I = sites and J = years).
I found this formula in some stats lecture handouts by UC Davis on unbalanced data and LSMeans...but does it yield the same thing as using the LSmeans estimates? Does it make any sense? Thoughts? Many thanks in advance. Jenn
Re: [R] Question about randomForest
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Saruman I don't see how this answered the original question of the poster. He was quite clear: the values of the predictions coming out of RF do not match what comes out of the predict function using the same RF object and the same data. Therefore, what is predict() doing that is different from RF? Yes, RF is making its predictions using OOB, but nowhere does it say what predict() is doing; indeed, it says if newdata is not given, then the results are just the OOB predictions. But if newdata = olddata, then predict(newdata) != OOB predictions. So what is it then? Let me make this as clear as I possibly can: If predict() is called without newdata, all it can do is assume prediction on the training set is desired. In that case it returns the OOB prediction. If newdata is given in predict(), it assumes it is new data and thus makes predictions using all trees. If you just feed the training data as newdata, then yes, you will get overfitted predictions. It almost never makes sense (to me anyway) to make predictions on the training set. This opens another issue, which is: if newdata is close to but not exactly olddata, do you get overfitted results? Possibly, depending on how close the new data are to the training set. This applies to nearly _ALL_ methods, not just RF. Andy -- View this message in context: http://r.789695.n4.nabble.com/Question-about-randomForest-tp4111311p4529770.html Sent from the R help mailing list archive at Nabble.com.
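A sketch illustrating the distinction Andy describes (iris as stand-in data): OOB predictions versus re-predicting the training set through all trees:

```r
library(randomForest)
set.seed(1)
rf <- randomForest(iris[1:4], iris$Species)
oob   <- predict(rf)                        # no newdata: OOB predictions
resub <- predict(rf, newdata = iris[1:4])   # newdata: all trees vote
mean(oob == iris$Species)    # honest accuracy estimate
mean(resub == iris$Species)  # optimistic; typically at or near 1
```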
Re: [R] Memory limits for MDSplot in randomForest package
Sam, As you've probably seen, all the MDSplot() function does is feed 1 - proximity to the cmdscale() function. Some suggestions and clarifications:
1. If all you want is the proximity matrix, you can run randomForest() with keep.forest=FALSE to save memory. You will likely want to run a somewhat large number of trees if you're interested in proximity, and with the large number of data points, the trees are going to be quite large as well.
2. The proximity is n x n, so if you have about 19000 data points, that's a 19000 by 19000 matrix, which takes approx. 2.8GB of memory to store a copy.
3. I tried making up a 19000^2 cross-product matrix, then tried cmdscale(1-xx, k=5). The memory usage seems to peak at around 16.3GB, but I killed it after more than two hours. Thus I suspect it really is the eigen decomposition in cmdscale() on such a large matrix that's taking up the time.
My suggestion is to see if you can find some efficient ways of doing eigen decomposition on such large matrices. You might be able to make the proximity matrix sparse (e.g., by thresholding), and see if there are packages that can do the decomposition in sparse form. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Sam Albers Sent: Friday, March 23, 2012 3:31 PM To: r-help@r-project.org Subject: [R] Memory limits for MDSplot in randomForest package Hello, I am struggling to produce an MDS plot using the randomForest package with a moderately large data set. My data set has one categorical response variable, 7 predictor variables and just under 19000 observations. That means my proximity matrix is approximately 19000 by 19000, which is quite large. To train a random forest on this large a dataset I have to use my institution's high performance computer. Using this setup I was able to train a randomForest with the proximity argument set to TRUE.
At this point I wanted to construct an MDSplot using the following: MDSplot(nech.rf, nech.d$pd.fl, palette=c(1,2,3), pch=as.numeric(nech.d$pd.fl)) where nech.rf is the randomForest object and nech.d$pd.fl is the classification factor. Now with the architecture listed below, I've been waiting for approximately 2 days for this to run. My issue is that I am not sure if this will ever run. Can anyone recommend a way to tweak the MDSplot function to run a little faster? I tried changing the cmdscale arguments (i.e. eigenvalues) within the MDSplot function a little but that didn't seem to have any effect of the overall running time using a much smaller data set. Or even if someone could comment whether I am dreaming that this will actually ever run? This is probably the best computer that I will have access to so I was hoping that somehow I could get this to run. I was just hoping that someone reading the list might have some experience with randomForests and using large datasets and might be able to comment on my situation. Below the architecture information I have constructed a dummy example to illustrate what I am doing but given the nature of the problem, this doesn't completely reflect my situation. Any help would be much appreciated! Thanks! 
Sam
Computer specs and sessionInfo():
OS: Suse Linux
Memory: 64 GB
Processors: Intel Itanium 2, 64 x 1500 MHz
And: sessionInfo() R version 2.6.2 (2008-02-08) ia64-unknown-linux-gnu locale: LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base other attached packages: [1] randomForest_4.6-6 loaded via a namespace (and not attached): [1] rcompgen_0.1-17
###
# Dummy Example
###
require(randomForest)
set.seed(17)
## Number of points
x <- 10
df <- rbind(
  data.frame(var1=runif(x, 10, 50), var2=runif(x, 2, 7), var3=runif(x, 0.2, 0.35),
             var4=runif(x, 1, 2), var5=runif(x, 5, 8), var6=runif(x, 1, 2),
             var7=runif(x, 5, 8), cls=factor("CLASS-2")),
  data.frame(var1=runif(x, 10, 50), var2=runif(x, -3, 3), var3=runif(x, 0.1, 0.25),
             var4=runif(x, 1, 2), var5=runif(x, 5, 8), var6=runif(x, 1, 2),
             var7=runif(x, 5, 8), cls=factor("CLASS-1")))
df.rf <- randomForest(y=df[,8], x=df[,1:7], proximity=TRUE, importance=TRUE)
MDSplot(df.rf, df$cls, k=2, palette=c(1,2,3,4), pch=as.numeric(df$cls))
Re: [R] fitted values with locfit
I believe you are expecting the software to do something it did not claim to be able to do. predict.locfit() does not have a type argument, nor can it take the value "terms". When you specify two variables in the smooth, a bivariate smooth is done, so you get one bivariate smooth function, not the sum of two univariate smooths. If the latter is what you want, use packages that fit additive models. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Soberon Velez, Alexandra Pilar Sent: Monday, March 19, 2012 5:13 AM To: r-help@r-project.org Subject: [R] fitted values with locfit Dear memberships, I'm trying to estimate the following multivariate local regression model using the locfit package: BMI = m1(RCC) + m2(WCC), where m1 and m2 are unknown smooth functions. My problem is that once the regression is done I cannot get the fitted values of each of these smooth functions m1 and m2. What I write is the following:
library(locfit)
data(ais)
fit2 <- locfit.raw(x=lp(ais$RCC, h=0.5, deg=1) + lp(ais$WCC, deg=1, h=0.75), y=ais$BMI, ev=dat(), kt="prod", kern="gauss")
g21 <- predict(fit2, type="terms")
If I run this on the computer, the result g21 is a vector, when I should have a matrix with 2 columns (one for each fitted smooth function). Please, does somebody know how I can get the estimated fitted values of both smooth functions m1 and m2 using a local linear regression with kernel weights as in this example? Thanks a lot in advance; I'm very desperate. Alexandra [[alternative HTML version deleted]]
[R] job opening at Merck Research Labs, NJ USA
The Biometrics Research department at the Merck Research Laboratories has an open position to be located in Rahway, New Jersey, USA: This position will be responsible for imaging and bio-signal biomarkers projects including analysis of preclinical, early clinical, and experimental medicine imaging and EEG data. Responsibilities include all phases of data analysis from processing of raw imaging and EEG data to derivation of endpoints. Part of the responsibilities is development and implementation of novel statistical methods and software for analysis of imaging and bio-signal data. This position will closely collaborate with Imaging and Clinical Pharmacology departments; Experimental Medicine; Early and Late Stage Development Statistics; and Modeling and Simulation. Publication and presentation of the results is highly encouraged, as is collaboration with external experts. Education Minimum Requirement: PhD in Statistics, Applied Mathematics, Physics, Computer Science, Engineering, or related fields. Required Experience and Skills: Education should include Statistics-related courses, or equivalent working experience involving data analysis and statistical modeling for at least 1 year. Excellent computing skills: R and/or SAS, MATLAB in Linux and Windows environments; working knowledge of parallel computing; C, C++, or Fortran programming.
Dissertation or experience in at least one of these areas: statistical image and signal analysis; data mining and machine learning; mathematical modeling in medicine and biology; general statistical research Desired Experience and Skills - education in and/or experience with EEG and Imaging data analysis; stochastic modeling; functional data analysis; familiarity with wavelet analysis and other spectral analysis methods Please apply electronically at: http://www.merck.com/careers/search-and-apply/search-jobs/home.html Click on Experienced Opportunities, and search by Requisition ID: BIO003546 and email CV to: vladimir_svet...@merck.com
Re: [R] Using categorical variables in package randomForest.
The way to represent categorical variables is with factors; see ?factor. randomForest() will handle factors appropriately, as do most modeling functions in R. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of abhishek Sent: Tuesday, March 13, 2012 8:11 AM To: r-help@r-project.org Subject: [R] Using categorical variables in package randomForest.

Hello, I am sorry if there are already posts that answer this question; I tried to find them before making this post, but did not find relevant ones. I am using the randomForest package to build a two-class classifier. There are categorical variables and numerical variables in my data, and the categorical variables have anywhere from 2 to 10 categories. I am not sure how to represent the categorical data. For example, I am using 0 and 1 for variables that have only two categories, but I suspect the program is treating the values as numerical. Do you have any idea how I can use the categorical variables for building a two-class classifier? I am using a factor consisting of 0 and 1 as the classification target. Thank you for your ideas. - abhishek -- View this message in context: http://r.789695.n4.nabble.com/Using-caegorical-variables-in-package-randomForest-tp4468923p4468923.html Sent from the R help mailing list archive at Nabble.com.
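As a minimal sketch of Andy's advice (the data frame and variable names here are made up for illustration): encode each categorical predictor as a factor, and randomForest() treats its levels as categories rather than numbers.

```r
## Sketch: categorical predictors as factors; randomForest() splits on the
## factor levels instead of treating 0/1/2... as numeric values.
library(randomForest)
set.seed(1)
d <- data.frame(
  color = factor(sample(c("red", "green", "blue"), 100, replace = TRUE)),
  size  = runif(100),
  class = factor(sample(0:1, 100, replace = TRUE))  # two-class target
)
rf <- randomForest(class ~ ., data = d)
```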
Re: [R] Help on reshape function
Just using the reshape() function in base R:

df.long <- reshape(df, varying = list(names(df)[4:7]), direction = "long")

This also gives two extra columns (time and id) that can be dropped. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of R. Michael Weylandt Sent: Tuesday, March 06, 2012 8:45 AM To: mails Cc: r-help@r-project.org Subject: Re: [R] Help on reshape function

library(reshape2)
melt(df, id.vars = c("ID1", "ID2", "ID3"))[, -4]
# To drop an extraneous column (but you should take a look and see what it is
# for future reference)

Michael

On Tue, Mar 6, 2012 at 6:17 AM, mails mails00...@gmail.com wrote: Hello, I am trying to reshape a data.frame in wide format into long format. Although the reshape R documentation lists some examples, I am struggling to bring my data.frame into long format and then transform it back into wide format. The data.frame I am looking at is:

df <- data.frame(ID1 = c(1,1,1,1,1,1,1,1,1),
                 ID2 = c("A","A","A","B","B","B","C","C","C"),
                 ID3 = c("E","E","E","E","E","E","E","E","E"),
                 X1 = c(1,4,3,5,2,4,6,4,2),
                 X2 = c(6,8,9,6,7,8,9,6,7),
                 X3 = c(7,6,7,5,6,5,6,7,5),
                 X4 = c(1,2,1,2,3,1,2,1,2))
df
  ID1 ID2 ID3 X1 X2 X3 X4
1   1   A   E  1  6  7  1
2   1   A   E  4  8  6  2
3   1   A   E  3  9  7  1
4   1   B   E  5  6  5  2
5   1   B   E  2  7  6  3
6   1   B   E  4  8  5  1
7   1   C   E  6  9  6  2
8   1   C   E  4  6  7  1
9   1   C   E  2  7  5  2

I want to use the reshape function to get the following result:

df
   ID1 ID2 ID3 X
1    1   A   E 1
2    1   A   E 4
3    1   A   E 3
4    1   B   E 5
5    1   B   E 2
6    1   B   E 4
7    1   C   E 6
8    1   C   E 4
9    1   C   E 2
10   1   A   E 6
11   1   A   E 8
12   1   A   E 9
13   1   B   E 6
14   1   B   E 7
15   1   B   E 8
16   1   C   E 9
17   1   C   E 6
18   1   C   E 7
19   1   A   E 7
20   1   A   E 6
21   1   A   E 7
22   1   B   E 5
23   1   B   E 6
24   1   B   E 5
25   1   C   E 6
26   1   C   E 7
27   1   C   E 5
28   1   A   E 1
29   1   A   E 2
30   1   A   E 1
31   1   B   E 2
32   1   B   E 3
33   1   B   E 1
34   1   C   E 2
35   1   C   E 1
36   1   C   E 2

Can anyone help? Cheers -- View this message in context: http://r.789695.n4.nabble.com/Help-on-reshape-function-tp4449464p4449464.html Sent from the R help mailing list archive at Nabble.com.
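A self-contained sketch of the thread's base-R reshape() approach, including the return trip to wide format that the question asks about. (Assumption: a bare reshape() call on the long data reverses the operation using the attributes reshape() stores on its result.)

```r
## Wide data frame from the thread, written compactly.
df <- data.frame(ID1 = 1,
                 ID2 = rep(c("A", "B", "C"), each = 3),
                 ID3 = "E",
                 X1 = c(1,4,3,5,2,4,6,4,2), X2 = c(6,8,9,6,7,8,9,6,7),
                 X3 = c(7,6,7,5,6,5,6,7,5), X4 = c(1,2,1,2,3,1,2,1,2))
## Wide -> long: stack X1..X4 into one column named X.
df.long <- reshape(df, varying = list(names(df)[4:7]), v.names = "X",
                   direction = "long")
## Long -> wide: reshape() remembers the operation and reverses it.
df.wide <- reshape(df.long)
```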
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
That's why I said you need the book; the details are all in the book.

From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 1:49 PM To: Liaw, Andy Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

Thanks Andy. I am reading the locfit document... but not sure how to do the CV and bandwidth selection... Here is a quote about the function regband: it doesn't seem to be usable? Basically I am looking for a locfit that comes with automatic bandwidth selection, so that I am essentially parameter-free for the local-regression step...

- regband: Bandwidth selectors for local regression. Description: Function to compute local regression bandwidths for local linear regression, implemented as a front end to locfit(). This function is included for comparative purposes only. Plug-in selectors are based on flawed logic, make unreasonable and restrictive assumptions and do not use the full power of the estimates available in Locfit. Any relation between the results produced by this function and desirable estimates are entirely coincidental. Usage: regband(formula, what = c("CP", "GCV", "GKK", "RSW"), deg = 1, ...)

2012/2/23 Liaw, Andy andy_l...@merck.com: If that's the kind of framework you'd like to work in, use locfit, which has a predict() method for evaluating new data. There are several different bandwidth selectors in that package for your choosing. Kernel smoothers don't really fit the framework of creating a model object, followed by predicting new data using that fitted model object, because of their local nature. Think of k-nn classification, which has a similar problem: the model needs to be computed for every data point you want to predict. Andy

From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 10:06 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
Thank you Andy! I went thru the KernSmooth package but I don't see a way to use the fitted function to do the predict part...

data <- data.frame(z = z, x = x)
datanew <- data.frame(z = z, x = x)
lmfit <- lm(z ~ x, data = data)
lmforecast <- predict(lmfit, newdata = datanew)

Am I missing anything here? Thanks!

2012/2/23 Liaw, Andy andy_l...@merck.com: In short, pick your poison... Is there any particular reason why the tools that shipped with R itself (e.g., KernSmooth) are inadequate for you? I like using the locfit package because it has many tools, including the ones that the author didn't think were optimal. You may need the book to get the most mileage out of it, though. Andy

From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 12:25 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

I meant it's very slow when I use cv.aic... On Wed, Feb 22, 2012 at 11:24 PM, Michael comtech@gmail.com wrote: Is np an okay package to use? I am worried about the multi-start thing... and also it's very slow... On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy andy_l...@merck.com wrote: Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e., if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
In short, pick your poison... Is there any particular reason why the tools that shipped with R itself (e.g., KernSmooth) are inadequate for you? I like using the locfit package because it has many tools, including the ones that the author didn't think were optimal. You may need the book to get the most mileage out of it, though. Andy From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 12:25 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? I meant it's very slow when I use cv.aic... On Wed, Feb 22, 2012 at 11:24 PM, Michael comtech@gmail.com wrote: Is np an okay package to use? I am worried about the multi-start thing... and also it's very slow... On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy andy_l...@merck.com wrote: Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e.,
if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please give me some pointers? I did some Google searching, but there are so many packages that do this... I just wanted to find the best/most modern one to use... Thank you!
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
If that's the kind of framework you'd like to work in, use locfit, which has a predict() method for evaluating new data. There are several different bandwidth selectors in that package for your choosing. Kernel smoothers don't really fit the framework of creating a model object, followed by predicting new data using that fitted model object, because of their local nature. Think of k-nn classification, which has a similar problem: the model needs to be computed for every data point you want to predict. Andy From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 10:06 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

Thank you Andy! I went thru the KernSmooth package but I don't see a way to use the fitted function to do the predict part...

data <- data.frame(z = z, x = x)
datanew <- data.frame(z = z, x = x)
lmfit <- lm(z ~ x, data = data)
lmforecast <- predict(lmfit, newdata = datanew)

Am I missing anything here? Thanks!

2012/2/23 Liaw, Andy andy_l...@merck.com: In short, pick your poison... Is there any particular reason why the tools that shipped with R itself (e.g., KernSmooth) are inadequate for you? I like using the locfit package because it has many tools, including the ones that the author didn't think were optimal. You may need the book to get the most mileage out of it, though. Andy From: Michael [mailto:comtech@gmail.com] Sent: Thursday, February 23, 2012 12:25 AM To: Liaw, Andy Cc: Bert Gunter; r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? I meant it's very slow when I use cv.aic... On Wed, Feb 22, 2012 at 11:24 PM, Michael comtech@gmail.com wrote: Is np an okay package to use? I am worried about the multi-start thing... and also it's very slow...
On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy andy_l...@merck.com wrote: Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e., if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please give me some pointers? I did some Google searching, but there are so many packages that do this... I just wanted to find the best/most modern one to use... Thank you!
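KernSmooth indeed has no predict() method, as noted in the thread. A hedged sketch of one common workaround (my own, not from the thread): evaluate the local-polynomial fit on its grid, then interpolate at new points with approx().

```r
## Sketch: "predicting" from a KernSmooth fit by interpolating its grid output.
library(KernSmooth)
set.seed(1)
x <- runif(200)
z <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
h   <- dpill(x, z)                   # plug-in bandwidth selector
fit <- locpoly(x, z, bandwidth = h)  # returns $x (grid) and $y (fitted values)
xnew <- c(0.25, 0.5, 0.75)
zhat <- approx(fit$x, fit$y, xout = xnew)$y  # interpolate at the new points
```

This sidesteps the model-object issue Andy describes, at the cost of interpolation error between grid points.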
Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?
Bert's question aside (I was going to ask about laundry, but that's much harder than taxes...), my understanding of the situation is that "optimal" is in the eye of the beholder. There were at least two schools of thought on which is the better way of automatically selecting bandwidth: plug-in methods or CV-type. The last I checked, the jury was still out. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Wednesday, February 22, 2012 6:03 PM To: Michael Cc: r-help Subject: Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth? Would you like it to do your taxes for you too? :-) Bert Sent from my iPhone -- please excuse typos. On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote: Hi all, I am looking for a good and modern kernel regression package in R with the following features: 1) it has cross-validation; 2) it can automatically choose the optimal bandwidth; 3) it doesn't have a random element - i.e., if I run the function at different times on the same data set, the results should be exactly the same... I am trying np, but I am seeing: Multistart 1 of 1 | Multistart 1 of 1 | ... It looks like in order to do the optimization, it's doing multiple-random-start optimization... am I right? Could you please give me some pointers? I did some Google searching, but there are so many packages that do this... I just wanted to find the best/most modern one to use... Thank you!
Re: [R] Random Forest Package
You should be able to use the Rgui menu to install packages. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Niratha Sent: Wednesday, February 01, 2012 5:16 AM To: r-help@r-project.org Subject: [R] Random Forest Package Hi, I have installed R version 2.14 on Windows 7 and want to use the randomForest package. I installed Rtools and MiKTeX 2.9, but I am not able to read the DESCRIPTION file, and it is not possible to build the package. When I give this command in Windows: R CMD INSTALL --build randomForest it shows the error: 'R CMD' is not recognized as an internal or external command. Thanks Niratha -- View this message in context: http://r.789695.n4.nabble.com/Random-Forest-Package-tp4347424p4347424.html Sent from the R help mailing list archive at Nabble.com.
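The menu route Andy mentions corresponds to install.packages(), which on Windows fetches the pre-built CRAN binary; Rtools and R CMD INSTALL are only needed when compiling a package from source. A minimal sketch:

```r
## Sketch: install the CRAN binary of randomForest and load it.
## No Rtools, MiKTeX, or command-line build step is required for this.
install.packages("randomForest")
library(randomForest)
```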
Re: [R] randomForest: proximity for new objects using an existing rf
There's an alternative, but it may not be any more efficient in time or memory... You can run predict() on the training set once, setting nodes=TRUE. That will give you an n by ntree matrix recording which node of which tree each data point falls in. For any new data, you would run predict() with nodes=TRUE, then compute the proximity by hand by counting how often any given pair landed in the same terminal node of each tree. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Kilian Sent: Wednesday, February 01, 2012 5:39 AM To: r-help@r-project.org Subject: [R] randomForest: proximity for new objects using an existing rf Dear all, using an existing random forest, I would like to calculate the proximity for a new test object, i.e. the similarity between the new object and the old training objects which were used for building the random forest. I do not want to build a new random forest based on both old and new objects. Currently, my workaround is to calculate the proximities of a combined data set consisting of training and new objects like this:

model <- randomForest(Xtrain, Ytrain)  # build random forest
nnew <- nrow(Xnew)                     # number of new objects
Xcombi <- rbind(Xnew, Xtrain)          # combine new and training objects
predcombi <- predict(model, Xcombi, proximity=TRUE)  # calculate proximities
proxcombi <- predcombi$proximity       # proximities of the combined dataset
proxnew <- proxcombi[(1:nnew), -(1:nnew)]  # proximities of new objects only

But this approach wastes a lot of computation time, as I am not interested in the proximities among the training objects themselves, only between the training objects and the new objects. With 1000 training objects and 5 new objects, I have to calculate a 1005x1005 proximity matrix to get the essential 5x1000 matrix for the new objects. Am I doing something wrong? I read through the documentation but could not find another solution. Any advice would be highly appreciated.
Thanks in advance! Kilian
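A hedged sketch of Andy's nodes=TRUE suggestion (my own code, with made-up data; it assumes, per the help page, that predict() attaches the terminal-node matrix as the "nodes" attribute of its result):

```r
## Sketch: proximity between new and training points computed by hand as the
## fraction of trees in which the pair lands in the same terminal node.
library(randomForest)
set.seed(1)
Xtrain <- matrix(rnorm(200), 100, 2)
Ytrain <- rnorm(100)
Xnew   <- matrix(rnorm(10), 5, 2)
model  <- randomForest(Xtrain, Ytrain)

## n x ntree matrices of terminal-node ids (run once per data set).
nodes.train <- attr(predict(model, Xtrain, nodes = TRUE), "nodes")
nodes.new   <- attr(predict(model, Xnew,   nodes = TRUE), "nodes")

## 5 x 100 proximity matrix: no 105 x 105 detour needed.
prox <- matrix(0, nrow(Xnew), nrow(Xtrain))
for (i in seq_len(nrow(Xnew)))
  prox[i, ] <- colMeans(t(nodes.train) == nodes.new[i, ])
```

The comparison `t(nodes.train) == nodes.new[i, ]` recycles the ntree-long node vector of new point i down each training point's column, so colMeans() gives the per-pair match fraction directly.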
Re: [R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)
Hi Ista, When you write a package, you have to anticipate what users will throw at the code. I could insist that users only input matrices where none of the column names are empty, but that's not what I wish to impose on users. I could add the name if it's empty, but as a user I wouldn't want a function to do that, either. That's why I need to look for a workaround. Using which() seems rather clumsy for the purpose, as I need to combine those with the non-empty ones, and preserving the ordering would be a mess. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Ista Zahn Sent: Wednesday, February 01, 2012 5:45 AM To: r-help@r-project.org Subject: Re: [R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X) Hi Andy, On Tuesday, January 31, 2012 08:44:13 AM Liaw, Andy wrote: I'm not exactly sure if this is a problem with indexing by name; i.e., is the following behavior by design? The problem is that names or dimnames that are empty seem to be treated differently, and one can't index by them:

R> junk <- 1:3
R> names(junk) <- c("a", "b", "")
R> junk
   a    b
   1    2    3
R> junk[""]
<NA>
  NA
R> junk <- matrix(1:4, 2, 2)
R> colnames(junk) <- c("a", "")
R> junk[, ""]
Error: subscript out of bounds

You can index them by number, e.g., junk[, 2], and you can use which() to find the positions where the colname is empty: junk[, which(colnames(junk) == "")]. I may need to find a workaround... Going back to the original issue with predict, I don't think you need a workaround. I think you need to give your matrix some colnames. Best, Ista -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Czerminski, Ryszard Sent: Wednesday, January 25, 2012 10:39 AM To: r-help@r-project.org Subject: [R] Error in predict.randomForest ...
subscript out of bounds with NULL name in X. RF trains fine with X, but fails on prediction:

library(randomForest)
chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
temp <- c(88.6,71.6,93.3,84.3,80.6,75.2,69.7,82,69.4,83.3,78.6,82.6,80.6,83.5,76.3)
X <- cbind(1, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)
Error in predict.randomForest(rf, X) : subscript out of bounds

BTW: just found out that apparently predict() does not like a NULL name in X, because this works fine:

one <- rep(1, length(chirps))
X <- cbind(one, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)

Ryszard Czerminski AstraZeneca Pharmaceuticals LP 35 Gatehouse Drive Waltham, MA 02451 USA 781-839-4304 ryszard.czermin...@astrazeneca.com
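A caller-side sketch of Ista's "give your matrix some colnames" fix (the replacement names V1, V2, ... are made up here): fill in any empty column names before fitting and predicting, which sidesteps the empty-string indexing problem entirely.

```r
## Sketch: ensure X has no empty column names before predict().
chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
X <- cbind(1, chirps)                       # first column's name is ""
empty <- colnames(X) == ""
colnames(X)[empty] <- paste0("V", which(empty))  # hypothetical filler names
colnames(X)                                 # now "V1" "chirps"
```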
Re: [R] Bivariate Partial Dependence Plots in Random Forests
The reason it's not implemented is computational cost. Some users have done it on their own using the same idea; it simply takes too much memory for even moderately sized data. It can be done much more efficiently in MART because computational shortcuts are used there. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Lucie Bland Sent: Friday, January 27, 2012 5:01 AM To: r-help@r-project.org Subject: [R] Bivariate Partial Dependence Plots in Random Forests Hello, I was wondering if anyone knew of an R function or R code to plot bivariate (3-dimensional) partial dependence plots for random forests (randomForest package). It is apparently possible using the rgl package (http://esapubs.org/archive/ecol/E088/173/appendix-C.htm), or there may be a more direct function such as pairplot() in MART (multiple additive regression trees)? Many thanks, Lucie My computer: HP Z400 Workstation, 16.0 GB, Windows 7 Professional, Intel(R) Xeon(R) CPU W365, 3.20 GHz, 64-bit. My R version: R version 2.14.1 (2011-12-22), 64-bit.
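A hedged sketch of the brute-force computation Andy alludes to (my own code, with made-up data): at each point of a small (x1, x2) grid, substitute the grid values into the training data and average the forest's predictions. This is exactly the memory- and time-hungry approach he warns about, so the grid is kept coarse.

```r
## Sketch: brute-force bivariate partial dependence for a randomForest fit.
library(randomForest)
set.seed(1)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- d$x1 * d$x2 + rnorm(200, sd = 0.1)
rf <- randomForest(y ~ ., data = d)

g1 <- quantile(d$x1, 1:9 / 10)   # coarse grids over the two predictors
g2 <- quantile(d$x2, 1:9 / 10)
pd <- outer(seq_along(g1), seq_along(g2), Vectorize(function(i, j) {
  dd <- d
  dd$x1 <- g1[i]                 # hold (x1, x2) fixed at the grid point,
  dd$x2 <- g2[j]                 # average predictions over the rest
  mean(predict(rf, dd))
}))
persp(g1, g2, pd, xlab = "x1", ylab = "x2", zlab = "partial dependence")
```

The inner loop calls predict() once per grid cell over the whole training set, which is why this scales poorly compared with MART's shortcuts.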
[R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)
I'm not exactly sure if this is a problem with indexing by name; i.e., is the following behavior by design? The problem is that names or dimnames that are empty seem to be treated differently, and one can't index by them:

R> junk <- 1:3
R> names(junk) <- c("a", "b", "")
R> junk
   a    b
   1    2    3
R> junk[""]
<NA>
  NA
R> junk <- matrix(1:4, 2, 2)
R> colnames(junk) <- c("a", "")
R> junk[, ""]
Error: subscript out of bounds

I may need to find a workaround... -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Czerminski, Ryszard Sent: Wednesday, January 25, 2012 10:39 AM To: r-help@r-project.org Subject: [R] Error in predict.randomForest ... subscript out of bounds with NULL name in X

RF trains fine with X, but fails on prediction:

library(randomForest)
chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
temp <- c(88.6,71.6,93.3,84.3,80.6,75.2,69.7,82,69.4,83.3,78.6,82.6,80.6,83.5,76.3)
X <- cbind(1, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)
Error in predict.randomForest(rf, X) : subscript out of bounds

BTW: just found out that apparently predict() does not like a NULL name in X, because this works fine:

one <- rep(1, length(chirps))
X <- cbind(one, chirps)
rf <- randomForest(X, temp)
yp <- predict(rf, X)

Ryszard Czerminski AstraZeneca Pharmaceuticals LP 35 Gatehouse Drive Waltham, MA 02451 USA 781-839-4304 ryszard.czermin...@astrazeneca.com
Re: [R] Variable selection based on both training and testing data
Variable selection is part of the training process: it chooses the model. By definition, test data is used only for testing (evaluating the chosen model). If you find a package or function that does variable selection on test data, run from it! Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Jin Minming Sent: Monday, January 30, 2012 8:14 AM To: r-help@r-project.org Subject: [R] Variable selection based on both training and testing data Dear all, Variable selection in regression is usually determined from the training data using AIC or an F value, for example with stepAIC. Is there an R package that can consider both the training and test datasets? For example, I have separate training and test data. First, a regression model is obtained using the training data; then this model is tested using the test data. This process continues in order to find possible optimal models in terms of RMSE or R2 for both the training and test data. Thanks, Jim
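A sketch of the discipline Andy describes (my own code, with made-up data): run stepAIC on the training data only, then touch the test set exactly once, to score the single chosen model.

```r
## Sketch: selection on training data, one honest evaluation on test data.
library(MASS)  # for stepAIC
set.seed(1)
n <- 200
X <- data.frame(matrix(rnorm(n * 5), n, 5))   # predictors X1..X5
X$y <- X$X1 - 2 * X$X2 + rnorm(n)             # only X1, X2 matter
train <- X[1:150, ]
test  <- X[151:200, ]

full   <- lm(y ~ ., data = train)
chosen <- stepAIC(full, trace = FALSE)        # selection sees training data only
rmse   <- sqrt(mean((test$y - predict(chosen, test))^2))  # single test score
```

Iterating the select-then-test loop until the test RMSE looks good, as the question proposes, silently turns the test set into a second training set.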
Re: [R] What is the function for smoothing splines with the smoothing parameter selected by generalized maximum likelihood?
See the gss package on CRAN. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of ali_protocol Sent: Monday, January 09, 2012 7:13 AM To: r-help@r-project.org Subject: [R] What is the function for smoothing splines with the smoothing parameter selected by generalized maximum likelihood? Dear all, I am new to R, and I am a biotechnologist. I want to fit a smoothing spline with the smoothing parameter selected by generalized maximum likelihood. I was wondering which function implements this and, if possible, how I can find the fitted value at a certain point (or predict from the fitted spline, if that is the correct language). -- View this message in context: http://r.789695.n4.nabble.com/What-is-the-function-for-smoothing-splines-with-the-smoothing-parameter-selected-by-generalized-maxi-tp4278275p4278275.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables
Tree-based models (such as RF) are invariant to monotonic transformations of the predictor (x) variables, because they use only the ranks of the variables, not their actual values. More specifically, they look for splits at the mid-points of unique values. Thus the resulting trees are basically identical regardless of how you transform the x variables. The only, probably minor, difference is that the mid-points can differ between the original and transformed data. While this doesn't affect the training data, it can affect prediction on test data (although the difference should be slight). Transformation of the response variable is quite another thing: RF needs it just as much as other methods do, if the situation calls for it. Cheers, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo Sent: Monday, December 05, 2011 1:41 PM To: r-help@r-project.org Subject: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables Dear Researchers, sorry for the easy and common question. I am trying to justify the idea that RandomForest doesn't require transformations (e.g. logarithmic) of variables, comparing this non-parametric method with e.g. linear regression. In the literature on my phenomenon a logarithmic transformation is needed to describe the model, but I found RF doesn't require this approach. Could someone suggest texts or a bibliography to study? Thanks in advance, Gianni [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
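A minimal illustration of the invariance Andy describes, using a single rpart tree (rpart ships with R) rather than a full forest; the same rank-based argument applies to every tree in a random forest:

```r
# Fit the same single tree to x and to log(x): because log() is
# monotone, the ordering of x is unchanged and the fitted partitions
# are identical. rpart stands in for the trees inside a random forest.
library(rpart)
set.seed(42)
x <- runif(200, 1, 100)
y <- sin(x / 15) + rnorm(200, sd = 0.1)
fit_raw <- rpart(y ~ x, data = data.frame(x = x, y = y))
fit_log <- rpart(y ~ x, data = data.frame(x = log(x), y = y))
# Fitted values on the training data agree exactly:
all.equal(predict(fit_raw), predict(fit_log))
```

Only the split *locations* differ (each is the mid-point on its own scale); the resulting partitions of the observations, and hence the fitted values, are the same.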
Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables
You should see no differences beyond what you'd get by running RF a second time with a different random number seed. Best, Andy From: gianni lavaredo [mailto:gianni.lavar...@gmail.com] Sent: Monday, December 05, 2011 2:19 PM To: Liaw, Andy Cc: r-help@r-project.org Subject: Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables About "they only use the ranks of the variables": using leave-one-out, in each iteration the predictor variable ranks change slightly every time RF builds the model, especially for the variables with low importance. Is it correct to attribute this to the random splitting? Thanks in advance, Gianni On Mon, Dec 5, 2011 at 7:59 PM, Liaw, Andy andy_l...@merck.com wrote: Tree-based models (such as RF) are invariant to monotonic transformations of the predictor (x) variables, because they use only the ranks of the variables, not their actual values. More specifically, they look for splits at the mid-points of unique values. Thus the resulting trees are basically identical regardless of how you transform the x variables. The only, probably minor, difference is that the mid-points can differ between the original and transformed data. While this doesn't affect the training data, it can affect prediction on test data (although the difference should be slight). Transformation of the response variable is quite another thing: RF needs it just as much as other methods do, if the situation calls for it. Cheers, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo Sent: Monday, December 05, 2011 1:41 PM To: r-help@r-project.org Subject: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables Dear Researchers, sorry for the easy and common question. I am trying to justify the idea that RandomForest doesn't require transformations (e.g. logarithmic) of variables, comparing this non-parametric method with e.g. linear regression. In the literature on my phenomenon a logarithmic transformation is needed to describe the model, but I found RF doesn't require this approach. Could someone suggest texts or a bibliography to study? Thanks in advance, Gianni [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Random Forests in R
The first version of the package was created by re-writing the main program in the original Fortran as C, and calls other Fortran subroutines that were mostly untouched, so dynamic memory allocation can be done. Later versions have most of the Fortran code translated/re-written in C. Currently the only Fortran part is the node splitting in classification trees. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Peter Langfelder Sent: Thursday, December 01, 2011 12:33 AM To: Axel Urbiz Cc: R-help@r-project.org Subject: Re: [R] Random Forests in R On Wed, Nov 30, 2011 at 7:48 PM, Axel Urbiz axel.ur...@gmail.com wrote: I understand the original implementation of Random Forest was done in Fortran code. In the source files of the R implementation there is a note C wrapper for random forests: get input from R and drive the Fortran routines.. I'm far from an expert on this...does that mean that the implementation in R is through calls to C functions only (not Fortran)? So, would knowing C be enough to understand this code, or Fortran is also necessary? I haven't seen the C and Fortran code for Random Forest but I understand the note to say that R code calls some C functions that pre-process (possibly re-format etc) the data, then call the actual Random Forest method that's written in Fortran, then possibly post-process the output and return it to R. It would imply that to understand the actual Random Forest code, you will have to read the Fortran source code. Best, Peter __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Re: [R] Question about randomForest
Not only that, but on the same help page, in the same Value section, it says: predicted: the predicted values of the input data based on out-of-bag samples -- so people really should read the help pages instead of speculating... If the error rates were not based on OOB samples, they would drop to (near) 0 rather quickly, as each tree is intentionally overfitting its training set. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Weidong Gu Sent: Sunday, November 27, 2011 10:56 AM To: Matthew Francis Cc: r-help@r-project.org Subject: Re: [R] Question about randomForest Matthew, Your interpretation of calculating error rates based on the training data is incorrect. In Andy Liaw's help file: err.rate -- (classification only) vector error rates of the prediction on the input data, the i-th element being the (OOB) error rate for all trees up to the i-th. My understanding is that the error rate is calculated by throwing the OOB cases (after a few trees, all cases in the original data will have served as OOB for some trees) at all the trees up to the i-th for which they are OOB, and taking the majority vote. The plot of an rf object shows that the OOB error declines quickly once the ensemble becomes sizable -- increasing variation among trees works! (If the rates were based on the training sets, you wouldn't see such a drop, since each tree is overfitting its training set.) Weidong On Sun, Nov 27, 2011 at 3:21 AM, Matthew Francis mattjamesfran...@gmail.com wrote: Thanks for the help. Let me explain in more detail how I think that randomForest works, so that you (or others) can more easily see the error of my ways. The function first takes a random sample of the data, of the size specified by the sampsize argument. With this it fully grows a tree, resulting in a horribly over-fitted classifier for the random sub-set. It then repeats this again with a different sample to generate the next tree, and so on.
Now, my understanding is that after each tree is constructed, a test prediction for the *whole* training data set is made by combining the results of all trees (so e.g. for classification the majority votes of all individual tree predictions). From this an error rate is determined (applicable to the ensemble applied to the training data) and reported in the err.rate member of the returned randomForest object. If you look at the error rate (or plot it using the default plot method) you see that it starts out very high when only 1 or a few over-fitted trees are contributing, but once the forest gets larger the error rate drops since the ensemble is doing its job. It doesn't make sense to me that this error rate is for a sub-set of the data, since the sub-set in question changes at each step (i.e. at each tree construction)? By doing cross-validation test making 'training' and 'test' sets from the data I have, I do find that I get error rates on the test sets comparable to the error rate that is obtained from the prediction member of the returned randomForest object. So that does seem to be the 'correct' error. By my understanding the error reported for the ith tree is that obtained using all trees up to and including the ith tree to make an ensemble prediction. Therefore the final error reported should be the same as that obtained using the predict.randomForest function on the training set, because by my understanding that should return an identical result to that used to generate the error rate for the final tree constructed?? Sorry that is a bit long winded, but I hope someone can point out where I'm going wrong and set me straight. Thanks! On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu anopheles...@gmail.com wrote: Hi Matthew, The error rate reported by randomForest is the prediction error based on out-of-bag OOB data. 
Therefore, it is different from the prediction error on the original data, since each tree was built from a bootstrap sample (containing roughly 63% of the original cases), and the OOB error rate is likely higher than the prediction error on the original data, as you observed. Weidong On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis mattjamesfran...@gmail.com wrote: I've been using the R package randomForest, but there is an aspect I cannot work out the meaning of. After calling the randomForest function, the returned object contains an element called predicted, which is the prediction obtained using all the trees (at least that's my understanding). I've checked that this prediction set has the error rate as reported by err.rate. However, if I send the training data back into the predict.randomForest function, I find I get a different result to the stored set of predictions. This is true for both classification and regression. I find the predictions obtained this
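The OOB bookkeeping this thread describes can be sketched by hand with bagged rpart trees (an illustrative stand-in for randomForest, not its actual code): each case is voted on only by the trees for which it was out-of-bag, and the resulting error is the OOB error:

```r
# Hand-rolled bagging with rpart (which ships with R), mirroring how
# randomForest computes err.rate: each case is predicted only by the
# trees for which it was out-of-bag, via a majority vote.
library(rpart)
set.seed(1)
n     <- nrow(iris)
ntree <- 50
votes <- matrix(0, n, 3, dimnames = list(NULL, levels(iris$Species)))
for (b in 1:ntree) {
  inbag <- sample(n, n, replace = TRUE)   # bootstrap sample
  oob   <- setdiff(1:n, inbag)            # cases this tree never saw
  tree  <- rpart(Species ~ ., data = iris[inbag, ], method = "class")
  pred  <- as.character(predict(tree, iris[oob, ], type = "class"))
  for (i in seq_along(oob)) {
    votes[oob[i], pred[i]] <- votes[oob[i], pred[i]] + 1
  }
}
oob_pred <- levels(iris$Species)[max.col(votes)]
oob_err  <- mean(oob_pred != iris$Species)  # analogous to the last OOB entry of err.rate
oob_err
```

This also shows why `predict(rf, trainingdata)` differs from `rf$predicted`: the former lets *every* tree vote on every case, including the trees that trained on it, while the stored predictions use only the OOB votes above.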
Re: [R] tuning random forest. An unexpected result
Gianni, You should not tune ntree in cross-validation or other validation methods, and especially should not be using OOB MSE to do so. 1. At ntree=1, you are using only about 36% of the data to assess the performance of a single random tree. This number can vary wildly. I'd say don't bother looking at an OOB measure of anything with ntree < 30. If you want an exercise in probability, compute the number of trees you need to have the desired probability that all n data points are out-of-bag at least k times, and don't look at any smaller ntree. 2. If you just plot the randomForest object using the generic plot() function, you will see that it gives you the vector of MSEs for ntree=1 up to the maximum. That's why you need not use other methods such as cross-validation. 3. As mentioned in the article you cited, RF is insensitive to ntree, and they settled on ntree=250. Also, as we mentioned in the R News article, too many trees do not degrade prediction performance, only computational cost (which is trivial even for a moderately sized data set). 4. It is not wise to optimize the parameters of a model like that. When all of the MSE estimates are within a few percent of each other, you're likely just chasing noise in the evaluation process. Just my $0.02... Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo Sent: Thursday, November 17, 2011 6:29 PM To: r-help@r-project.org Subject: [R] tuning random forest. An unexpected result Dear Researchers, I am using RF (in regression mode) to analyze several metrics extracted from images. I am tuning RF in a loop over different ranges of mtry, ntree and nodesize, choosing the settings with the lowest OOB MSE: mtry from 1 to 5, nodesize from 1 to 10, ntree from 1 to 500, using this paper as a reference: Palmer, D. S., O'Boyle, N. M., Glen, R. C., Mitchell, J. B. O. (2007). Random Forest Models To Predict Aqueous Solubility. Journal of Chemical Information and Modeling, 47, 150-158.
My problem is the following, using data(airquality). The tuning parameters with the lowest value are: print(result.mtry.df[result.mtry.df$RMSE == min(result.mtry.df$RMSE),]) RMSE = 15.44751, MSE = 238.6257, mtry = 3, nodesize = 5, tree = 35. The number of trees is very low, different from what I read in several publications, and the second-lowest set of tuning parameters has tree = 1: print(head(result.mtry.df[with(result.mtry.df, order(MSE)), ]))

          RMSE      MSE mtry nodesize tree
12035 15.44751 238.6257    3        5   35
18001 15.44861 238.6595    4        7    1
 7018 16.02354 256.7539    2        5   18
20031 16.02536 256.8121    5        1   31
11037 16.02862 256.9165    3        3   37
11612 16.05162 257.6544    3        4  112

I am wondering whether I am wrong in the settings, or whether there are some aspects I don't consider. Thanks for your attention and thanks in advance for suggestions and help. Gianni [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
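Andy's 36% figure can be checked directly: the expected out-of-bag fraction of one bootstrap sample is (1 - 1/n)^n, which approaches exp(-1) ≈ 0.368, so at ntree = 1 the "test set" is small and its composition varies wildly. A quick base-R simulation (the sizes below are arbitrary illustrative choices):

```r
# Expected out-of-bag fraction of a single bootstrap sample: about
# exp(-1) ~ 0.368, i.e. "about 36% of the data" at ntree = 1.
set.seed(123)
n <- 1000
oob_frac <- replicate(2000, {
  inbag <- sample(n, n, replace = TRUE)
  1 - length(unique(inbag)) / n
})
mean(oob_frac)  # close to exp(-1)
```
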
Re: [R] gsDesign
Hi Dongli, Questions about usage of specific contributed packages are best directed toward the package maintainer/author first, as they are likely the best sources of information, and they don't necessarily subscribe to or keep up with the daily deluge of R-help messages. (In this particular case, I'm quite sure the package maintainer for gsDesign doesn't keep up with R-help.) Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Dongli Zhou Sent: Monday, November 14, 2011 6:13 PM To: Marc Schwartz Cc: r-help@r-project.org Subject: Re: [R] gsDesign Hi, Marc, Thank you very much for the reply. I'm using the gsDesign function to create an object of type gsDesign, but its inputs do not include the 'ratio' argument. Dongli On Nov 14, 2011, at 5:50 PM, Marc Schwartz marc_schwa...@me.com wrote: On Nov 14, 2011, at 4:11 PM, Dongli Zhou wrote: I'm trying to use gsDesign for a noninferiority trial with a binary endpoint. Does anyone know how to specify the trial with different sample sizes for the two treatment groups? Thanks in advance! Hi, Presuming that you are using the nBinomial() function, see the 'ratio' argument, which defines the desired sample size ratio between the two groups. See ?nBinomial and the examples there, which include one using the 'ratio' argument. HTH, Marc Schwartz __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] randomForest - NaN in %IncMSE
You are not giving anyone much to go on. Please read the posting guide and see how to ask your question in a way that's easier for others to answer. At the _very_ least, show what commands you used, what your data look like, etc. Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Katharine Miller Sent: Tuesday, September 20, 2011 1:43 PM To: r-help@r-project.org Subject: [R] randomForest - NaN in %IncMSE Hi, I am having a problem using varImpPlot in randomForest. I get the error message: Error in plot.window(xlim = xlim, ylim = ylim, log = "") : need finite 'xlim' values When I print $importance, several variables have NaN under %IncMSE. There are no NaNs in the original data. Can someone help me figure out what is happening here? Thanks! [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] class weights with Random Forest
The current classwt option in the randomForest package has been there since the beginning, and is different from how the official Fortran code (version 4 and later) implements class weights. It simply accounts for the class weights in the Gini index calculation when splitting nodes, exactly as a single CART tree does when given class weights. Prof. Breiman came up with the newer class weighting scheme, implemented in the newer version of his Fortran code, after we found that simply using the weights in the Gini index didn't seem to help much in extremely unbalanced data (say 1:100 or worse). If using the weighted Gini helps in your situation, by all means do it. I can only say that in the past it didn't give us the results we were expecting. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of James Long Sent: Tuesday, September 13, 2011 2:10 AM To: r-help@r-project.org Subject: [R] class weights with Random Forest Hi All, I am looking for a reference that explains how the randomForest function in the randomForest package uses the classwt parameter. Here: http://tolstoy.newcastle.edu.au/R/e4/help/08/05/12088.html Andy Liaw suggests not using classwt, and according to: http://r.789695.n4.nabble.com/R-help-with-RandomForest-classwt-option-td817149.html it had not been implemented as of 2007. However, it improved classification performance for a problem I am working on, more than adjusting the sampsize parameter. So I'm wondering if it has been implemented recently (since 2007), or if there is a detailed explanation of what this unimplemented version is doing. Thanks! James [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
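The weighted-Gini idea Andy describes can be shown in miniature; `gini()` below is an illustrative helper written for this sketch, not randomForest's internal code:

```r
# Class weights entering the Gini node impurity: the minority class is
# made to "count for more" when splits are scored.
gini <- function(counts, classwt = rep(1, length(counts))) {
  w <- counts * classwt   # weighted class frequencies in the node
  p <- w / sum(w)
  1 - sum(p^2)
}
gini(c(95, 5))                         # unweighted: node looks nearly pure
gini(c(95, 5), classwt = c(1, 19))     # rare class upweighted: far less pure
```

With the 1:19 weights the 95/5 node scores as impure as a balanced one, so the splitter is pushed to separate out the rare class rather than ignore it.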
Re: [R] randomForest memory footprint
It looks like you are building a regression model. With such a large number of rows, you should try to limit the size of the trees by setting nodesize to something larger than the default (5). The issue, I suspect, is the fact that the largest possible tree has about 2*n/nodesize nodes (n being the number of rows), and each node takes a row in a matrix to store. Multiply that by the number of trees you are trying to build, and you see how the memory can be gobbled up quickly. Boosted trees don't usually run into this problem because one usually boosts very small trees (usually no more than 10 terminal nodes per tree). Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman Sent: Wednesday, September 07, 2011 2:46 PM To: r-help@r-project.org Subject: [R] randomForest memory footprint Hello, I am attempting to train a random forest model using the randomForest package on 500,000 rows and 8 columns (7 predictors, 1 response). The data set is the first block of data from the UCI Machine Learning Repo dataset "Record Linkage Comparison Patterns", with the slight modification that I dropped two columns with lots of NAs and used knn imputation to fill in other gaps. When I load my dataset, R uses no more than 100 MB of RAM. I'm running 64-bit R with ~4 GB of RAM available. When I execute the randomForest() function, however, I get memory complaints. Example: summary(mydata1.clean[,3:10]) cmp_fname_c1 cmp_lname_c1 cmp_sex cmp_bd cmp_bm cmp_by cmp_plz is_match Min. :0. Min. :0. Min. :0. Min. :0. Min. :0. Min. :0. Min. :0.0 FALSE:572820 1st Qu.:0.2857 1st Qu.:0.1000 1st Qu.:1. 1st Qu.:0. 1st Qu.:0. 1st Qu.:0. 1st Qu.:0.0 TRUE : 2093 Median :1. Median :0.1818 Median :1. Median :0. Median :0. Median :0. Median :0.0 Mean :0.7127 Mean :0.3156 Mean :0.9551 Mean :0.2247 Mean :0.4886 Mean :0.2226 Mean :0.00549 3rd Qu.:1. 3rd Qu.:0.4286 3rd Qu.:1. 3rd Qu.:0. 3rd Qu.:1. 3rd Qu.:0. 3rd Qu.:0.0 Max. :1. Max. :1. Max. :1. Max. :1. Max. :1. Max. :1. Max. :1.0

mydata1.rf.model2 <- randomForest(x = mydata1.clean[, 3:9], y = mydata1.clean[, 10], ntree = 100)
Error: cannot allocate vector of size 877.2 Mb
In addition: Warning messages:
1: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)
2: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)
3: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)
4: In dim(data) <- dim : Reached total allocation of 3992Mb: see help(memory.size)

Other techniques such as boosted trees handle the data size just fine. Are there any parameters I can adjust such that I can use a value of 100 or more for ntree? Thanks, John __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
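A back-of-envelope estimate along the lines of Andy's explanation; the ~6-doubles-per-node figure is an assumption for illustration, not the package's exact internal layout:

```r
# Rough memory arithmetic for the forest in this thread.
n        <- 500000
nodesize <- 5
ntree    <- 100
nodes    <- 2 * n / nodesize        # largest possible tree: ~200,000 nodes
bytes    <- nodes * ntree * 6 * 8   # nodes x trees x (assumed) doubles x 8 bytes
bytes / 2^20                        # on the order of the failed 877.2 Mb allocation
# Setting nodesize = 100 shrinks this estimate by a factor of 20.
```
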
Re: [R] randomForest partial dependence plot variable names
See if the following is close to what you're looking for. If not, please give more detail on what you want to do.

data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., airquality, importance = TRUE)
imp <- importance(ozone.rf)  # get the importance measures
impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]  # get the sorted names
op <- par(mfrow = c(2, 3))
for (i in seq_along(impvar)) {
    partialPlot(ozone.rf, airquality, impvar[i], xlab = impvar[i],
                main = paste("Partial Dependence on", impvar[i]),
                ylim = c(30, 70))
}
par(op)

Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Katharine Miller Sent: Thursday, August 04, 2011 4:38 PM To: r-help@r-project.org Subject: [R] randomForest partial dependence plot variable names Hello, I am running randomForest models on a number of species. I would like to be able to automate the printing of dependence plots for the most important variables in each model, but I am unable to figure out how to enter the variable names into my code. I had originally thought to extract them from the $importance matrix after sorting by metric (e.g. %IncMSE), but the importance matrix is n by 2, containing only the data for each metric (%IncMSE and IncNodePurity). It is clearly linked to the variable names, but I am unsure how to extract those names for use in scripting. Any assistance would be greatly appreciated, as I am currently typing the variable names into each partialPlot call for every model I run... and that is taking a LONG time. Thanks! [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] convert a splus randomforest object to R
You really need to follow the suggestions in the posting guide to get the best help from this list. Which versions of randomForest are you using in S-PLUS and R? Which version of R are you using? When you restore the object into R, what does str(object) say? Have you also tried dump()/source() as the R Data Import/Export manual suggests? Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Zhiming Ni Sent: Tuesday, August 02, 2011 8:11 PM To: r-help@r-project.org Subject: [R] convert a splus randomforest object to R Hi, I have a randomForest object cost.rf that was created in S-PLUS 8.0, and now I need to use this trained RF model in R. So in S-PLUS I dump the RF object as below: data.dump(cost.rf, file = "cost.rf.txt", oldStyle = T) Then in R, I restore the dumped file: library(foreign) data.restore("cost.rf.txt") It works fine and is able to restore the cost.rf object. But when I try to pass new data through this randomForest object using the predict() function, it gives me an error message. In R: library(randomForest) set.seed(2211) pred <- predict(cost.rf, InputData[ , ]) Error in object$forest$cutoff : $ operator is invalid for atomic vectors It looks like, after restoring the dump file, the object is not compatible with R. Has anyone successfully converted an S-PLUS randomForest object to R? What would be the appropriate method to do this? Thanks in advance. Jimmy == This communication contains information that is confidential, and solely for the use of the intended recipient. It may contain information that is privileged and exempt from disclosure under applicable law. If you are not the intended recipient of this communication, please be advised that any disclosure, copying, distribution or use of this communication is strictly prohibited. Please also immediately notify SCAN Health Plan at 1-800-247-5091, x5263 and return the communication to the originating address. Thank You.
== [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] squared pie chart - is there such a thing?
Has anyone suggested mosaic displays? That's the closest I can think of to a square pie chart... -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Naomi Robbins Sent: Sunday, July 24, 2011 7:09 AM To: Thomas Levine Cc: r-help@r-project.org Subject: Re: [R] squared pie chart - is there such a thing? I don't usually use stacked bar charts, since it is difficult to compare lengths that don't have a common baseline. Naomi On 7/23/2011 11:14 PM, Thomas Levine wrote: How about just a stacked bar plot? barplot(matrix(c(3,5,3), 3, 1), horiz = TRUE, beside = FALSE) Tom On Fri, Jul 22, 2011 at 7:14 AM, Naomi Robbins nbrgra...@optonline.net wrote: Hello! It's a shot in the dark, but I'll try. If one has a total of 100 (e.g., %) and three components of the total, e.g., mytotal = data.frame(x = 50, y = 30, z = 20), one could build a pie chart with 3 sectors representing x, y, and z according to their proportions in the total. I am wondering if it's possible to build something very similar, but on a square rather than a circle, such that the total area of the square is the sum of the components and the components (x, y, and z) are represented within the square as shapes with right angles (squares, rectangles, L-shapes, etc.). I realize there are many possible positions and shapes, even for 3 components, but I don't really care where the components are located within the square, as long as they are there. Is there a package that can do something like that? Thanks a lot! - I included waffle charts in Creating More Effective Graphs. The reaction was very negative; many readers let me know that they didn't like them. To create them I just drew a table in Word with 10 rows and 10 columns, then shaded the backgrounds of cells; for your example we would shade 50 cells one color, 30 another, and 20 a third color. Naomi - Naomi B.
Robbins
NBR, 11 Christine Court, Wayne, NJ 07470
Phone: (973) 694-6009, na...@nbr-graphs.com
http://www.nbr-graphs.com -- Follow me at http://www.twitter.com/nbrgraphs
Author of "Creating More Effective Graphs" (http://www.nbr-graphs.com/bookframe.html)
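Naomi's Word-table recipe can also be reproduced directly in base R. A minimal sketch for the 50/30/20 example (the grid layout and colors here are my own choices, not from the thread): fill a 10 x 10 grid, one cell per percentage point.

```r
## Waffle chart: a 10 x 10 grid, one shaded cell per percentage point.
vals <- c(x = 50, y = 30, z = 20)
stopifnot(sum(vals) == 100)
cells <- matrix(rep(seq_along(vals), vals), nrow = 10)  # 10 x 10, filled by column
image(1:10, 1:10, cells, col = c("steelblue", "orange", "grey70"),
      axes = FALSE, xlab = "", ylab = "", asp = 1)
```

The `cells` matrix simply repeats each component's index `vals[i]` times, so the shaded areas are exact, unlike pie-slice angles read by eye.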
Re: [R] *not* using attach() *but* in one case ....
From: Prof Brian Ripley

Hmm, load() does have an 'envir' argument. So you could simply use that and with() (which is pretty much what attach() does internally). If people really wanted a lazy approach, with() could be extended to allow file names (as attach() does).

[Andy Liaw:] I'm not sure laziness like this should be encouraged. If I may bring up another black hole: IMHO the formula interface allows so much flexibility (perhaps to allow some laziness?) that beginners, and even non-beginners, fall into its various traps a bit too often, sometimes without even being aware of it. It would be great if there were a way to (optionally?) limit the scope of where a formula looks for variables. Just my $0.02...

Andy

On Thu, 19 May 2011, Martin Maechler wrote:

[modified 'Subject' on purpose; good mail readers will still thread correctly, using the 'References' and 'In-Reply-To' headers -- however, in my limited experience, good mail readers seem to disappear more and more...]

Peter Ehlers <ehl...@ucalgary.ca> on Tue, 17 May 2011 06:08:30 -0700 writes:

On 2011-05-17 02:22, Timothy Bates wrote:

Dear Bryony: the suggestion was not to change the name of the data object, but to explicitly tell glm.nb what dataset it should look in to find the variables you mention in the formula. So the salient difference is:

m1 <- glm.nb(Cells ~ Cryogel*Day, data = side)

instead of

attach(side)
m1 <- glm.nb(Cells ~ Cryogel*Day)

This works for other functions also, but not uniformly as yet. How I wish it did, so that I could say hist(x, data=side) instead of hist(side$x); this inconsistency encourages the need for attach().

[Peter Ehlers:] Only if the user hasn't yet been introduced to the with() function, which is linked to on the ?attach page. Note also this sentence from the ?attach page: "attach can lead to confusion." I can't remember the last time I needed attach().

Peter Ehlers

[Martin Maechler:] Well, then you don't know *THE ONE* case where modern users of R should use attach() ...
as I have been teaching for a while, but seem not to have got enough students listening ;-) ...

--- Use it instead of load() {for save()d R objects} ---

The advantage of attach() over load() there is that the loaded objects (and there may be a bunch!) are put into a separate place in the search path and will not accidentally overwrite objects in the global workspace. Of course, there are still quite a few situations {e.g. in typical BATCH use of R for simulations, or Sweaving, etc.} where load() is good enough and the extras of using attach() are not worth it. But the unconditional "do not use attach()" is not quite right, at least not when you talk to non-beginners.

Martin Maechler, ETH Zurich

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
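Martin's recommendation and Brian Ripley's load()+with() alternative can both be made concrete. A small sketch (file name and values invented here):

```r
## attach() a saved workspace instead of load()ing it: the saved objects go
## onto the search path at position 2 and cannot clobber the global workspace.
f <- file.path(tempdir(), "mydata.RData")
x <- 1:5
save(x, file = f)

x <- "precious global value"   # something load(f) would silently overwrite
attach(f)                      # saved objects appear at search position 2
print(x)                       # global x untouched: "precious global value"
print(get("x", pos = 2))       # the saved x: 1 2 3 4 5
detach(2)

## The load(envir=) + with() equivalent, without touching the search path:
e <- new.env()
load(f, envir = e)
with(e, mean(x))               # 3
```

Either way, the saved `x` never collides with the global `x`, which is the whole point of avoiding a bare load().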
Re: [R] Rotation Forest in R
I don't have access to that article, but just from reading the abstract, it should be quite easy to do by writing a wrapper function that calls randomForest(); I've done so with random projections before. One limitation of methods like these is that they only apply to all-numeric data.

Andy

-----Original Message-----
From: Mario Beolco
Sent: Thursday, April 07, 2011 7:55 PM
To: r-help@r-project.org
Subject: [R] Rotation Forest in R

Dear R users,

I was wondering whether you could tell me if there are any R functions or packages that implement the Rotation Forest (not Random Forests) algorithm: http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2006.211

Thanks in advance,
Mario
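Andy's wrapper idea might look like the sketch below. Note the function name is mine, and this applies a single random rotation to all features (closer to his random-projection variant) rather than the per-group PCA rotations of the actual Rotation Forest paper:

```r
## Hypothetical wrapper: fit a forest on a randomly rotated copy of an
## all-numeric feature matrix (a random projection, not true Rotation Forest).
rotatedRF <- function(x, y, ...) {
  x <- as.matrix(x)
  stopifnot(is.numeric(x))                     # rotation requires numeric data
  p <- ncol(x)
  rot <- qr.Q(qr(matrix(rnorm(p * p), p, p)))  # random orthogonal p x p matrix
  list(fit = randomForest::randomForest(x %*% rot, y, ...), rot = rot)
}
```

New data would have to be multiplied by the stored `rot` before calling predict() on the inner fit.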
Re: [R] Difference in mixture normals and one density
Is something like this what you're looking for?

R> library(nor1mix)
R> nmix2 <- norMix(c(2, 3), sig2 = c(25, 4), w = c(.2, .8))
R> dnorMix(1, nmix2) - dnorm(1, 2, 5)
[1] 0.03422146

Andy

-----Original Message-----
From: Jim Silverton
Sent: Monday, April 04, 2011 10:01 AM
To: r-help@r-project.org
Subject: Re: [R] Difference in mixture normals and one density

Hello,

I am trying to find out if R can do the following: I have a mixture of normals, say f = 0.2*Normal(2, 5) + 0.8*Normal(3, 2). How do I find the difference between the density of f and the density of Normal(2, 5) at any particular point?

--
Thanks,
Jim.
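The same number falls out of the mixture definition directly, without the nor1mix package (taking the second parameter as a standard deviation, as Andy's dnorm() call does):

```r
## f(x) = 0.2 * N(mean 2, sd 5) + 0.8 * N(mean 3, sd 2)
dmix <- function(x) 0.2 * dnorm(x, 2, 5) + 0.8 * dnorm(x, 3, 2)
dmix(1) - dnorm(1, 2, 5)   # [1] 0.03422146, matching the nor1mix answer
```

Since the 0.2*dnorm(x, 2, 5) terms cancel, this difference is just 0.8 * (dnorm(x, 3, 2) - dnorm(x, 2, 5)).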
Re: [R] ok to use glht() when interaction is NOT significant?
Just to add my ever-depreciating $0.02 USD:

Keep in mind that the significance-testing paradigm puts a constraint on the false positive rate and lets the false negative rate float. What you should consider is whether that makes sense in your situation. All too often this is not carefully considered, and sometimes people will do not-very-kosher things to compensate for the conservatism of significance testing.

If you want to stay with the formality of protected tests, you should first check the overall F-test of the entire model and make sure that's significant before you look at the individual terms in the model. It's not sufficient for A1 and A2 to be significantly different at B2 and not at B1 to say that there's a significant interaction; rather, the difference between A1 and A2 at B1 has to be significantly different from that at B2. That's the definition of the interaction in the 2x2 case.

If you have a priori interest in the comparison of A1 vs. A2 at B2, then you can test it as a pre-planned contrast and not worry too much about protection or multiplicity.

HTH,
Andy

-----Original Message-----
From: array chip
Sent: Tuesday, March 08, 2011 1:31 AM
To: Bert Gunter
Cc: r-h...@stat.math.ethz.ch
Subject: Re: [R] ok to use glht() when interaction is NOT significant?

Hi Bert, thank you for your thoughtful and humorous comments :-)

It is scientifically meaningful to do those comparisons, and the results of these comparisons actually make sense for our hypothesis, i.e. one is significant at the B2 level while the other is not at the B1 level. Unfortunately, the overall F test for the interaction is not significant. I understand that formally one should not do these post-hoc comparisons under a non-significant interaction term. But should I really stop comparing in this situation, especially when these comparisons conform to our hypothesis?
I am encouraged to see that you said "For exploratory purposes, such post hoc comparisons might lead to great science." However, my concern is that these results may not get past reviewers when sent out for publication.

BTW, I am a non-US reader, so I did google "never inhaled" :-)

John

From: Bert Gunter <gunter.ber...@gene.com>
Cc: r-h...@stat.math.ethz.ch
Sent: Mon, March 7, 2011 9:20:11 PM
Subject: Re: [R] ok to use glht() when interaction is NOT significant?

Inline below.

[John:] Hi, let's say I have a simple ANOVA model with 2 factors, A (levels A1 and A2) and B (levels B1 and B2), and their interaction:

aov(y ~ A*B, data = dat)

It turns out that the interaction term is not significant (e.g. P value = 0.2), but when I used glht() to compare A1 vs. A2 within each level of B, I found that the comparison is not significant when B=B1 but is very significant (P < 0.01) when B=B2. My question is whether it's legal to do this post-hoc comparison when the interaction is NOT significant. Can I still claim that there is a significant difference between A1 and A2 when B=B2? (I am serious here.)

[Bert:] Don't know what "legal" means. Why do you want to make the claim? When does it **ever** mean anything scientifically meaningful to make it? What is the **scientific** question of interest? Are the data unbalanced? Have you plotted the data to tell you what's going on?

Warning: I come from the school (maybe I'm the only student...) that believes all such formal post hoc comparisons are pointless, silly wastes of effort. Note the word "formal" -- that is, pretending the P values mean anything. For exploratory purposes, which can certainly include producing P values as well as graphs, such post hoc comparisons might lead to great science. It's the formal part that I reject and that you seem to be hung up on.

Note also: if you're a Bayesian and can put priors on everything, you can spit out posteriors and Bayes factors to your heart's content. Really! -- no need to sweat multiplicity, even.
Of course, I speak here only as an observer, having never actually inhaled myself.*

Cheers,
Bert

*Apologies to all non-US and younger readers. This is a smart-aleck reference to an infamous dumb remark by a recent famous, smart former U.S. president. Google "never inhaled" for details.

Thanks,
John

--
Bert Gunter
Genentech Nonclinical Biostatistics
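Andy's point about the 2x2 case can be seen in a small simulation (the data and effect size here are invented): the interaction F-test asks whether the A1-vs-A2 difference at B1 differs from that at B2, not whether one of the two within-level comparisons happens to cross P < 0.05.

```r
set.seed(1)
dat <- expand.grid(A = c("A1", "A2"), B = c("B1", "B2"), rep = 1:10)
dat$y <- rnorm(nrow(dat)) + ifelse(dat$A == "A2" & dat$B == "B2", 0.8, 0)

fit <- aov(y ~ A * B, data = dat)
summary(fit)   # look at the A:B line (the overall interaction F-test) first
```

Following the protected-test logic, only if the A:B line is significant would one go on to the within-level glht()-style comparisons.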
Re: [R] Coefficient of Determination for nonlinear function
As far as I can tell, Uwe is not even fitting a model, but just solving a nonlinear equation, so I don't know why he wants an R^2. I don't see a statistical model here, so I don't see why one would want a statistical measure.

Andy

-----Original Message-----
From: Bert Gunter
Sent: Friday, March 04, 2011 11:21 AM
To: uwe.wolf...@uni-ulm.de; r-help@r-project.org
Subject: Re: [R] Coefficient of Determination for nonlinear function

The coefficient of determination, R^2, is a measure of how well your model fits versus a NULL model, which is that the data are constant. In nonlinear models, as opposed to linear models, such a null model rarely makes sense; therefore the coefficient of determination is generally not meaningful in nonlinear modeling. Yet another way in which linear and nonlinear models fundamentally differ.

-- Bert

On Fri, Mar 4, 2011 at 5:40 AM, Uwe Wolfram <uwe.wolf...@uni-ulm.de> wrote:

Dear Subscribers,

I fit an equation of the form 1 = f(x1, x2, x3) using a minimization scheme. Now I want to compute the coefficient of determination. Normally I would compute it as

r_square = 1 - sserr/sstot

with sserr = sum_i (y_i - f_i)^2 and sstot = sum_i (y_i - mean(y))^2.

sserr is clear to me, but how can I compute sstot when there is no such thing as differing y_i? These are all one, thus mean(y) = 1, and therefore sstot is 0.

Thank you very much for your efforts,
Uwe

--
Uwe Wolfram, Dipl.-Ing. (Ph.D. student)
Institute of Orthopaedic Research and Biomechanics
(Director and Chair: Prof. Dr. Anita Ignatius)
Center of Musculoskeletal Research, Ulm University Hospital, Helmholtzstr.
14, 89081 Ulm, Germany
Phone: +49 731 500-55301, Fax: +49 731 500-55302
http://www.biomechanics.de
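Uwe's formula written as code (function name mine) makes the degenerate case explicit:

```r
## R^2 = 1 - SSerr/SStot. With all y_i identical, SStot = 0 and the ratio
## degenerates (0/0 -> NaN), which is exactly Uwe's problem: there is no
## variation in y for the model to explain.
r_squared <- function(y, f) 1 - sum((y - f)^2) / sum((y - mean(y))^2)

r_squared(c(1, 2, 3), c(1.1, 1.9, 3.2))   # a sensible value near 1
r_squared(rep(1, 3), rep(1, 3))           # NaN: the null model is undefined
```

This is the computational face of Bert's point: R^2 compares against the constant null model, and when the response is constant by construction, that comparison is vacuous.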
Re: [R] lm - log(variable) - skip log(0)
You need to use == instead of = for testing equality. While you're at it, you should check for positive values, not just screen out 0s. This works for me:

R> mydata <- data.frame(x = 0:10, y = runif(11))
R> fm <- lm(y ~ log(x), mydata, subset = x > 0)

Andy

-----Original Message-----
From: agent dunham
Sent: Friday, February 25, 2011 6:24 AM
To: r-help@r-project.org
Subject: [R] lm - log(variable) - skip log(0)

I want to do an lm regression; some of the variables are going to be log-transformed, and I would like to leave out the values which imply taking log(0) for just one variable. I have done the following, but it doesn't work:

lmod1.lm <- lm(log(dat$inaltu) ~ log(dat$indiam), subset = (!(dat$indiam %in% c(0,1))))

and obtain: Error en lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases

lmod1.lm <- lm(log(dat$inaltu) ~ log(dat$indiam), subset = (!(dat$indiam = 0)), na.action = na.exclude)

and obtain: Error en lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf en llamada a una función externa (arg 1)
[the Spanish reads: NA/NaN/Inf in a call to an external function (arg 1)]

Thanks, u...@host.com
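A quick check of Andy's subset approach on invented data (column names borrowed from the question):

```r
set.seed(42)
dat <- data.frame(indiam = 0:10, inaltu = exp(rnorm(11)))

## subset = indiam > 0 drops the row where log(indiam) would be -Inf
fit <- lm(log(inaltu) ~ log(indiam), data = dat, subset = indiam > 0)
length(fitted(fit))   # 10: the indiam == 0 row was excluded
```

Passing `data = dat` and referring to bare column names in both the formula and the subset expression avoids the `dat$` repetition in the original attempt.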
Re: [R] Random Forest Cross Validation
Exactly as Max said. See the rfcv() function in the latest version of randomForest, as well as the reference in the help page for that function.

The OOB estimate is as accurate as a CV estimate _if_ you run straight RF; most other methods do not have this feature. However, if you start adding steps such as feature selection, all bets are off.

Andy

-----Original Message-----
From: mxkuhn
Sent: Tuesday, February 22, 2011 7:17 PM
To: ronzhao
Cc: r-help@r-project.org
Subject: Re: [R] Random Forest Cross Validation

If you want honest estimates of accuracy, you should repeat the feature selection within the resampling (not on the test set). You will get different lists each time, but that's the point: right now you are not capturing that uncertainty, which is why the OOB and test-set results differ so much. The list you get on the original training set is still the "real" list; the resampling results help you understand how much you might be overfitting the *variables*.

Max

On Feb 22, 2011, at 4:39 PM, ronzhao <yzhaoh...@gmail.com> wrote:

Thanks, Max. Yes, I did some feature selection in the training set. Basically, I selected the top 1000 SNPs based on OOB error and grew the forest using the training set, then used the test set to validate the forest grown. But if I do the same thing in the test set, the top SNPs would be different from those in the training set. That may be difficult to interpret.
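The rfcv() call Andy points to might look roughly like this (the toy data are invented; rfcv() by default halves the number of variables at each step and repeats the ranking inside each fold, which is exactly the "selection inside the resampling" Max describes):

```r
library(randomForest)

set.seed(1)
x <- data.frame(matrix(rnorm(100 * 20), 100))        # toy stand-in for SNP data
y <- factor(sample(c("case", "control"), 100, TRUE))

## 5-fold CV with variable selection nested inside each fold
cv <- rfcv(x, y, cv.fold = 5)
cv$error.cv   # CV error rate at each (decreasing) number of variables
```

With pure-noise predictors like these, the error should hover around 0.5 at every step, which is itself a useful sanity check against selection bias.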
Re: [R] tri-cube and gaussian weights in loess
locfit() in the locfit package is a slightly more modern implementation of local regression, and is much more flexible in that it has a lot of options to tweak. One such option is the kernel; there are seven to choose from.

Andy

From: wisdomtooth

From what I understand, loess in R uses the standard tri-cube weight function. SAS/INSIGHT offers loess with Gaussian weights. Is there a function in R that does the same? Also, can anyone offer references comparing the properties of tri-cube and Gaussian weights in LOESS? Thanks.

- André
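In locfit the weight function is selected with the kern argument; "tcub" (tricube) is the default, and "gauss" gives the Gaussian weights the poster wants. A minimal sketch on invented data:

```r
library(locfit)

x <- seq(0, 10, length.out = 101)
y <- sin(x) + rnorm(101, sd = 0.2)

fit.tcub  <- locfit(y ~ lp(x), kern = "tcub")   # loess-style tricube weights
fit.gauss <- locfit(y ~ lp(x), kern = "gauss")  # Gaussian weights
```

The two fits can then be compared with plot() or predict() on a common grid.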
Re: [R] How to measure/rank variable importance when using rpart?
Check out caret::varImp.rpart(). The measure it computes is described in the original CART book.

Andy

From: Tal Galili

Hello all,

When building a CART model (specifically a classification tree) using rpart, it is sometimes interesting to know the importance of the various variables introduced to the model. Thus, my question is: *What common measures exist for ranking/measuring the importance of the variables participating in a CART model, and how can this be computed using R (for example, when using the rpart package)?*

Here is some dummy code, created so you might show your solutions on it. The example is structured so that it is clear that variables x1 and x2 are important, while (in some sense) x1 is more important than x2 (since x1 should apply to more cases, and thus have more influence on the structure of the data, than x2):

set.seed(31431)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rnorm(n)
X  <- data.frame(x1, x2, x3, x4, x5)
y  <- sample(letters[1:4], n, TRUE)
y  <- ifelse(X[,2] < -1, "b", y)
y  <- ifelse(X[,1] > 0, "a", y)

require(rpart)
fit <- rpart(y ~ ., X)
plot(fit); text(fit)
info.gain.rpart(fit)  # your function - telling us for each variable how important it is

(References are always welcome.)

Thanks!
Tal

Contact me: tal.gal...@gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English)
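Beyond the caret method Andy names, recent versions of rpart store the CART importance measure directly on the fitted object as the variable.importance component. A sketch on Tal's dummy data:

```r
library(rpart)

set.seed(31431)
n <- 400
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
                x4 = rnorm(n), x5 = rnorm(n))
y <- sample(letters[1:4], n, TRUE)
y <- ifelse(X$x2 < -1, "b", y)
y <- ifelse(X$x1 > 0, "a", y)

fit <- rpart(y ~ ., X)
sort(fit$variable.importance, decreasing = TRUE)  # x1 should rank above x2
```

The component is a named numeric vector summing each variable's contribution over primary and surrogate splits, so it answers the "ranking" half of the question without any extra packages.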
Re: [R] randomForest: too many elements specified?
I grepped for "matrix(0, n, n)" in all the R code of the package (current version), and the only place that happens is in creating the proximity matrix. Can you run traceback() and see where it happens? You should also seriously consider upgrading R and the packages...

Andy

-----Original Message-----
From: Czerminski, Ryszard
Sent: Thursday, January 20, 2011 1:08 PM
To: r-h...@stat.math.ethz.ch
Subject: [R] randomForest: too many elements specified?

I am getting

Error in matrix(0, n, n) : too many elements specified

while building a randomForest model, which looks like a memory allocation error. Software versions are randomForest 4.5-25, R version 2.7.1. The dataset is big (~90K rows, ~200 columns), but this is on a big machine (~120G RAM), and I call randomForest like this: randomForest(x, y), i.e. in supervised mode and not requesting the proximity matrix. Therefore the answer from Andy Liaw to an email reporting the same problem in 2005 (see below) is probably not directly applicable; still, it looks like this is too big a dataset for this dataset/machine combination. How does memory usage in randomForest scale with dataset size? Is there a way to build a global rf model with a dataset of this size?

Best regards,
Ryszard

Ryszard Czerminski
AstraZeneca Pharmaceuticals LP, 35 Gatehouse Drive, Waltham, MA 02451 USA
781-839-4304, ryszard.czermin...@astrazeneca.com

RE: [R] randomForest: too many elements specified? -- Liaw, Andy, Mon, 17 Jan 2005:

From: luk

When I run randomForest with a 169453 x 5 matrix, I got the following message:

Error in matrix(0, n, n) : matrix: too many elements specified

Can you please advise me how to solve this problem? Thanks, Lu

[Andy:] 1. When asking new questions, please don't reply to other posts. 2. When asking questions like these, please do show the commands you used. My guess is that you asked for the proximity matrix, or are running unsupervised randomForest (by not providing a response vector).
This requires a couple of n-by-n matrices to be created (on top of other things), n being 169453 in this case. To store a 169453 x 169453 matrix in double precision, you need 169453^2 * 8 bytes, or nearly 214 GB of memory. Even if you have that kind of hardware, I doubt you'd be able to make much sense of the result.

Andy
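Andy's back-of-envelope arithmetic, checked in R:

```r
## One n x n proximity matrix of doubles, n = 169453:
n <- 169453
n^2 * 8 / 2^30   # bytes -> GiB: about 213.9, i.e. "nearly 214 GB"
```

The general rule: proximity-related memory grows quadratically in the number of rows, while the rest of the forest grows roughly linearly, which is why skipping the proximity matrix matters so much at this scale.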
Re: [R] Where is a package NEWS.Rd located?
I was communicating with Kevin off-list. The problem seems to be at run time, not install time. news() calls tools:::.build_news_db(), and the second line of that function is:

nfile <- file.path(dir, "inst", "NEWS.Rd")

and that's the problem: an installed package shouldn't have an inst/ subdirectory, right?

Andy

-----Original Message-----
From: Duncan Murdoch
Sent: Thursday, January 06, 2011 2:30 PM
To: Kevin Wright
Cc: R list
Subject: Re: [R] Where is a package NEWS.Rd located?

On 06/01/2011 2:19 PM, Kevin Wright wrote:

Yes, exactly. But the problem is with NEWS.Rd, not NEWS.

[Duncan:] I'm not sure who you are arguing with, but if you do file a bug report, please also put together a simple reproducible example, e.g. a small package containing NEWS.Rd in the inst directory (which is where the docs say it should go) and code that shows why this is bad. Don't just talk about internal functions used for building packages; as far as we can tell so far, tools:::.build_news_db is doing exactly what it should be doing.

Duncan Murdoch

[Kevin:] pkg/inst/NEWS.Rd is moved to pkg/NEWS.Rd at build time, but for installed packages, news() tries to load pkg/inst/NEWS.Rd. I'm going to file a bug report.

Kevin

On Thu, Jan 6, 2011 at 7:29 AM, Kevin Wright <kw.s...@gmail.com> wrote:

If you look at tools:::.build_news_db, the plain-text NEWS file is searched for in pkg/NEWS and pkg/inst/NEWS, but NEWS.Rd is only searched for in pkg/inst/NEWS.Rd. Looks like a bug to me. I *think*.

Thanks,
Kevin

On Thu, Jan 6, 2011 at 7:09 AM, Kevin Wright <kw.s...@gmail.com> wrote:

Hopefully a quick question. My package has a NEWS.Rd file that is not being found by news(). The news() function calls tools:::.build_news_db, which has this line:

nfile <- file.path(dir, "inst", "NEWS.Rd")

So it appears that news() is searching for mypackage/inst/NEWS.Rd.
However, "Writing R Extensions" says "The contents of the inst subdirectory will be copied recursively to the installation directory". During installation, mypackage/inst/NEWS.Rd is copied into the mypackage directory, not mypackage/inst. What am I doing wrong, or is this a bug?

Kevin Wright
Re: [R] randomForest speed improvements
Note that that isn't exactly what I recommended. If you look at the example on the help page for combine(), you'll see that it combines RF objects trained on the *same* data; i.e., instead of having one RF with 500 trees, you can combine five RFs trained on the same data with 100 trees each into one 500-tree RF. The way you are using combine() is basically using sample size to limit tree size, which you can do by playing with the nodesize argument of randomForest(), as I suggested previously. Either way is fine as long as you don't see prediction performance degrading.

Andy

-----Original Message-----
From: apresley
Sent: Tuesday, January 04, 2011 6:30 PM
To: r-help@r-project.org
Subject: Re: [R] randomForest speed improvements

Andy,

Thanks for the reply. I had no idea I could combine them back ... that actually will work pretty well. We can have several worker threads load up the RFs on different machines and/or cores, and then re-assemble them. Rmpi might be an option down the road, but would be a bit of overhead for us now.

Using the combine() method, I was able to drastically reduce the time it takes to build randomForest objects. E.g., using about 25,000 rows (6 columns), it takes maybe 5 minutes on my laptop; using 5 randomForest objects (each with 5K rows) and then combining them takes 1 minute.

-- Anthony
Re: [R] randomForest speed improvements
From: Liaw, Andy

[Following up on my earlier message about combine() vs. nodesize:]

I should also mention that another way to do something similar is to make use of the sampsize argument of randomForest(). For example, if you call randomForest() with sampsize=500, it will randomly draw 500 data points to grow each tree. This way you don't even need to run the RFs separately and combine them.

Andy
Re: [R] randomForest speed improvements
If you have multiple cores, one poor man's solution is to run separate forests in different R sessions, save the RF objects, load them into the same session, and combine() them. You can do this less clumsily with Rmpi or other distributed computing packages. Another option is to increase nodesize, which reduces the sizes of the trees.

The problem with numeric predictors in tree-based algorithms is that the number of computations needed to find the best split point grows with the number of distinct values _at each node_. Some algorithms save on this by considering only certain quantiles as candidate split points; the current RF code doesn't do this.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of apresley
Sent: Monday, January 03, 2011 6:28 PM
To: r-help@r-project.org
Subject: Re: [R] randomForest speed improvements

I haven't tried changing mtry or ntree at all, though I suppose with only 6 variables and tens of thousands of rows, we can probably do with fewer than 500 trees (the default?). Although tossing the forest does speed things up a bit (about 15-20% faster in some cases), I need to keep the forest to do the prediction; otherwise it complains that there is no forest component in the object.

-- Anthony
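The "separate sessions, then combine()" idea can be done in one session with the parallel package (a sketch; seeds and tree counts are illustrative, and mclapply forks only on Unix-alikes, running serially on Windows):

```r
library(parallel)
library(randomForest)

# Grow 4 forests of 125 trees each on separate cores, then pool them
# into a single 500-tree forest.
rfs <- mclapply(1:4, function(i) {
  set.seed(100 + i)  # a different seed per worker
  randomForest(Species ~ ., data = iris, ntree = 125)
}, mc.cores = 4)

rf.all <- do.call(combine, rfs)
```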
Re: [R] randomForest: help with combine() function
combine() is meant to be used on randomForest objects that were built from identical training data.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Dennis Duro
Sent: Friday, December 10, 2010 11:59 PM
To: r-help@r-project.org
Subject: [R] randomForest: help with combine() function

I've built two RF objects (RF1 and RF2) and have tried to combine them, but I get the following error:

  Error in rf$votes + ifelse(is.na(rflist[[i]]$votes), 0, rflist[[i]]$votes) :
    non-conformable arrays
  In addition: Warning message:
  In rf$oob.times + rflist[[i]]$oob.times :
    longer object length is not a multiple of shorter object length

Both RF models use the same variables, although the NAs in the two models likely differ (I use na.roughfix in both). I assume this is part of the reason my arrays are non-conformable. If so, does anyone have suggestions on how to combine in such a situation? How similar do RFs have to be in order to combine?

Cheers
Re: [R] randomForest: How to append ID column along with predictions
The order in the output corresponds to the order of the input. I will patch the code so that it grabs the row names of the input (if they exist). If you specify type="prob", it already labels the rows with the input row names.

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Dennis Duro
Sent: Tuesday, December 07, 2010 11:46 AM
To: r-help@r-project.org
Subject: [R] randomForest: How to append ID column along with predictions

Hi all,

When running a prediction with RF on new data, I get two columns back: row number(?) and predicted class. Is there a way of appending the unique value from an ID column in the data frame to the predictions instead of the row number? I'm assuming the returned results follow the data frame, i.e., the first result returned corresponds to the first row of the data frame.

Instead of a prediction output like this:

  1, ants
  2, ants
  3, bees
  4, ants

I'd like the first column to pull IDs from the data frame associated with each row (row number in parentheses for illustration):

  (1) 1130, ants
  (2) 1130, ants
  (3) 2139, bees
  (4) 1130, ants

This is likely a simple procedure, but I haven't been able to get anything to work. Any help would be appreciated!

Cheers, Dennis
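Since the output order follows the input order, you can also append the ID column yourself (a sketch; the ID column here is invented for illustration):

```r
library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris)

newdata <- iris                                   # stand-in for the data to score
newdata$ID <- sample(1000:9999, nrow(newdata))    # illustrative ID column

# Row i of the predictions corresponds to row i of newdata,
# so binding the ID column alongside is safe.
out <- data.frame(ID = newdata$ID, prediction = predict(fit, newdata))
head(out)
```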
Re: [R] randomForest parameters for image classification
1. Memory issue: You may want to try increasing nodesize (e.g., to 5, 11, or even 21) and see if that degrades performance. If not, you should be able to grow more trees with the larger nodesize. Another option is to use the sampsize argument to have randomForest() do the random subsampling for you (on a per-tree basis, rather than one random subset for the entire forest).

2. predict() giving NA: I have no idea why you are calling predict() that way. The first argument of every predict() method I know of (not just randomForest's) needs to be a model object, followed by the data you want to predict, not the other way around.

Andy

-----Original Message-----
From: Deschamps, Benjamin
Sent: Tuesday, November 16, 2010 11:16 AM
To: r-help@r-project.org
Cc: Liaw, Andy
Subject: RE: [R] randomForest parameters for image classification

I have modified my code since asking my original question. The classifier is now generated correctly (with a good, low error rate, as expected).
However, I am running into two issues: 1) I am getting an error at the prediction stage; I get only NAs when I try to run data down the forest; 2) I run out of memory when generating the forest with more than 200 trees, due to the large block of memory already occupied by the training data.

Here is my code:

  library(raster)
  library(randomForest)

  # Set some user variables
  fn <- "image.pix"
  outraster <- "output.pix"
  training_band <- 2
  validation_band <- 1

  # Get the training data
  myraster <- stack(fn)
  training_class <- subset(myraster, training_band)
  training_class[training_class == 0] <- NA
  training_class <- Which(training_class != 0, cells = TRUE)
  training_data <- extract(myraster, training_class)
  training_response <- as.factor(as.vector(training_data[, training_band]))
  training_predictors <- training_data[, 3:nlayers(myraster)]
  remove(training_data)

  # Create and save the forest
  r_tree <- randomForest(training_predictors, y = training_response,
                         ntree = 200, keep.forest = TRUE)  # runs out of memory with ntree > ~200
  remove(training_predictors, training_response)

  # Classify the whole image
  predictor_data <- subset(myraster, 3:nlayers(myraster))
  layerNames(predictor_data) <- layerNames(myraster)[3:nlayers(myraster)]
  predictions <- predict(predictor_data, r_tree, filename = outraster,
                         format = "PCIDSK", overwrite = TRUE,
                         progress = "text", type = "response")  # all NA!?
  remove(predictor_data)

See also a thread I started at http://stackoverflow.com/questions/4186507/rgdal-efficiently-reading-large-multiband-rasters about improving the efficiency of collecting the training data.

Thanks, Benjamin

-----Original Message-----
From: Liaw, Andy
Sent: November 11, 2010 7:02 AM
To: Deschamps, Benjamin; r-help@r-project.org
Subject: RE: [R] randomForest parameters for image classification

Please show us the code you used to run randomForest, the output, as well as what you get with other algorithms (on the same random subset, for comparison).
I have yet to see a dataset where randomForest does _far_ worse than other methods.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Deschamps, Benjamin
Sent: Tuesday, November 09, 2010 10:52 AM
To: r-help@r-project.org
Subject: [R] randomForest parameters for image classification

I am implementing an image classification algorithm using the randomForest package. The training data consist of 31000+ training cases over 26 variables, plus one factor response variable (the training class). The main issue I am encountering is very low overall classification accuracy (a lot of confusion between classes). However, I know from other classifications (including a regular decision-tree classifier) that the training and validation data are sound and capable of producing good accuracies.

Currently, I am using the default parameters (500 trees, mtry not set (default), nodesize = 1, replace = TRUE). Does anyone have experience using this with large datasets? Currently I need to randomly sample my training data, because giving it the full 31000+ cases returns an out-of-memory error; the same thing happens with large numbers of trees. From what I read in the documentation, perhaps I do not have enough trees to fully capture the training data?

Any suggestions or ideas will be greatly appreciated.

Benjamin
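Andy's two memory-related suggestions from earlier in the thread both amount to extra arguments in the randomForest() call (a sketch on iris; the numbers are illustrative, not recommendations):

```r
library(randomForest)
set.seed(1)

# Larger terminal nodes -> smaller trees -> less memory per tree
rf1 <- randomForest(Species ~ ., data = iris, nodesize = 11)

# Per-tree subsampling: each of the 500 trees is grown on its own
# random draw of 50 rows, instead of one fixed subset for the whole forest
rf2 <- randomForest(Species ~ ., data = iris, ntree = 500, sampsize = 50)
```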
[R] Contract programming position at Merck (NJ, USA)
Job: Scientific programmer at Merck, Biostatistics, Rahway, NJ, USA

[Job Description] This position works closely with statisticians to process and analyze ultrasound, MRI, and radiotelemetry longitudinal studies using a series of programs developed in R and Mathworks/Matlab. The position provides support for the analysis of several pre-clinical and clinical functional MRI studies by preprocessing and processing data using the software FSL.

Qualified candidates must have proficiency and experience with statistical software and technical computing packages, including Matlab, R, SAS, and S-Plus, as well as familiarity with medical imaging concepts (e.g., functional MRI) and an understanding of analysis tools for fMRI (FSL, SPM).

This is a contract position for an ongoing need in Biometrics Research. It is a term contract position (1 year), with the possibility of extension up to 2 years based on continued business need and available budget. If you are interested, please contact: amy_gilles...@merck.com
Re: [R] to determine the variable importance in svm
The caret package has answers to all your questions.

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Neeti
Sent: Tuesday, October 26, 2010 10:42 AM
To: r-help@r-project.org
Subject: [R] to determine the variable importance in svm

Hi everyone!

I have two questions: 1) How can I obtain variable (attribute) importance using e1071::svm (or other svm methods)? 2) How can I validate the results of an svm? Currently I am using the following code to determine the error:

  library(ipred)
  for (i in 1:20)
    error.model1[i] <- errorest(Species ~ ., data = trainset, model = svm)$error
  summary(error.model1)

I am not able to understand the errorest result. If anyone knows a better method to analyse my result, please let me know.

  library(mda)
  cmat <- confusion(pred.1, species_test)
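A minimal sketch of what caret offers on both counts (assumes the caret package plus the kernlab backend for "svmRadial"; the model choice is illustrative):

```r
library(caret)
set.seed(1)

# Question 2: validation via resampling -- 10-fold CV estimates of accuracy
ctrl <- trainControl(method = "cv", number = 10)
m <- train(Species ~ ., data = iris, method = "svmRadial", trControl = ctrl)
m$results          # resampled performance for each tuning value

# Question 1: variable importance -- for SVMs, caret falls back on a
# model-free, filter-based importance measure
varImp(m)
```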
Re: [R] Random Forest AUC
The OOB error estimate in RF is one really nifty feature that alleviates the need for additional cross-validation or resampling. I've done some empirical comparisons between OOB estimates and 10-fold CV estimates, and they are basically the same.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Claudia Beleites
Sent: Saturday, October 23, 2010 3:39 PM
To: r-help@r-project.org
Subject: Re: [R] Random Forest AUC

Dear List,

Just curiosity (disclaimer: until now I have never used random forests for more than a little playing around): Is there no out-of-bag estimate available? I mean, there are already ca. 1/e trees where a (one) given sample is out-of-bag, as Andy explained. If the voting is done only over the oob trees, I should get a classical oob performance measure. Or is the oob estimate internally used up by some kind of optimization (and what would that be, given that the trees are grown till the end)?

Hoping that I do not spoil the pedagogic efforts of the list in teaching Ravishankar to do the reasoning himself...

Claudia

Am 23.10.2010 20:49, schrieb Changbin Du:
I think you should use 10-fold cross validation to judge your performance on the validation parts. What you did will be overfitted for sure; you tested on the same training set used for your model building.

On Sat, Oct 23, 2010 at 6:39 AM, mxkuhn wrote:
I think the issue is that you really can't use the training set to judge this (without resampling). For example, k-nearest neighbors are not known to overfit, but a 1-nn model will always perfectly predict the training data. Max

On Oct 23, 2010, at 9:05 AM, Liaw, Andy wrote:
What Breiman meant is that as the model gets more complex (i.e., as the number of trees tends to infinity) the generalization error (test set error) does not increase.
This does not hold for boosting, for example; i.e., you can't boost forever, which necessitates finding the optimal number of iterations. You don't need that with RF.

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of vioravis
Sent: Saturday, October 23, 2010 12:15 AM
To: r-help@r-project.org
Subject: Re: [R] Random Forest AUC

Thanks Max and Andy. If the Random Forest is always giving an AUC of 1, isn't it overfitting? If not, how do you differentiate this from overfitting? I believe random forests are claimed to never overfit (from the following link):

http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#features

Ravishankar R
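On Claudia's question: the package does exactly what she describes. The $predicted component and the printed error rate of a randomForest object are OOB-based, i.e., each sample is voted on only by the trees in which it was out-of-bag. The "ca. 1/e" figure is easy to check by simulation (a sketch; n and B are illustrative):

```r
set.seed(42)
n <- 1000    # number of training cases
B <- 2000    # number of bootstrap samples ("trees")

# Fraction of bootstrap samples in which case 1 is out-of-bag
oob <- mean(replicate(B, !(1 %in% sample(n, n, replace = TRUE))))

c(simulated = oob, exact = (1 - 1/n)^n, limit = exp(-1))
# all three are approximately 0.37
```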
Re: [R] Random Forest AUC
Let me expand on what Max showed. For the most part, performance on the training set is meaningless. (That's the case for most algorithms, but especially so for RF.) In the default (and recommended) setting, the trees are grown to maximum size, which means that quite likely there's only one data point in most terminal nodes, and the prediction at a terminal node is determined by the majority class in the node, or by the lone data point.

Suppose that is the case all the time, i.e., in all trees all terminal nodes have only one data point. A particular data point would be in-bag in about 64% of the trees in the forest, and every one of those trees predicts that data point correctly. Even if all the trees where that data point is out-of-bag gave the wrong prediction, by majority vote over all trees you would still get the right answer in the end. Thus the essentially perfect prediction on the training set is, for RF, by design. Generally, good training-set prediction is just a self-fulfilling prophecy.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of vioravis
Sent: Friday, October 22, 2010 1:20 AM
To: r-help@r-project.org
Subject: [R] Random Forest AUC

Guys, I used Random Forest on a couple of data sets I had, to predict a binary response. In all the cases, the AUC on the training set comes out to be 1. Is this always the case with random forests? Can someone please clarify? I have given a simple example below, first using logistic regression and then random forests, to illustrate the problem. The AUC of the random forest comes out to be 1.
  data(iris)
  iris <- iris[iris$Species != "setosa", ]
  iris$Species <- factor(iris$Species)

  fit <- glm(Species ~ ., iris, family = binomial)
  train.predict <- predict(fit, newdata = iris, type = "response")

  library(ROCR)
  plot(performance(prediction(train.predict, iris$Species), "tpr", "fpr"), col = "red")
  auc1 <- performance(prediction(train.predict, iris$Species), "auc")@y.values[[1]]
  legend("bottomright",
         legend = paste("Logistic Regression (AUC=",
                        formatC(auc1, digits = 4, format = "f"), ")", sep = ""),
         col = "red", lty = 1)

  library(randomForest)
  fit <- randomForest(Species ~ ., data = iris, ntree = 50)
  train.predict <- predict(fit, iris, type = "prob")[, 2]

  plot(performance(prediction(train.predict, iris$Species), "tpr", "fpr"), col = "red")
  auc1 <- performance(prediction(train.predict, iris$Species), "auc")@y.values[[1]]
  legend("bottomright",
         legend = paste("Random Forests (AUC=",
                        formatC(auc1, digits = 4, format = "f"), ")", sep = ""),
         col = "red", lty = 1)

Thank you.

Regards, Ravishankar R
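Andy's counting argument can be checked in two lines (a sketch): a given case is in-bag in about 63% of the trees, so even if every out-of-bag tree voted wrongly, the in-bag trees alone would carry the majority.

```r
n <- 1000
in_bag <- 1 - (1 - 1/n)^n   # fraction of trees in which a given case is in-bag
in_bag                      # about 0.632 -- Andy's "about 64%"
in_bag > 0.5                # TRUE: the in-bag trees alone form a majority
```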
Re: [R] RandomForest Proximity Matrix
From: Michael Lindgren

Greetings R users! I am posting to inquire about the proximity matrix in the randomForest package. I am having difficulty pushing very large data through the algorithm, and it appears to hang on building the proximity matrix. I have read on Dr. Breiman's website that in the original code a choice can be made between using an N x N matrix or, to increase the ability to handle large datasets, an N x T matrix, where N is the number of samples and T is the number of trees in the forest. It is a single sentence in the FORTRAN documentation and nothing else is stated about it. My question is: does the randomForest package in R allow for this choice of proximity matrix? If so, can someone please point me toward how to use it?

[Andy:] The R package is based on version 3.3 of the Fortran code, with some new features grafted on. Unfortunately, the sparse proximity matrix is one of the features that hasn't been added in the R version. The truth is that I find the way it's done in the Fortran code not terribly satisfying, but I do not know of a better way of doing it.

Andy

Many thanks in advance and best wishes from Alaska!

Michael
--
Michael Lindgren
GIS Technician / Programmer
EWHALE Lab - Institute of Arctic Biology
University of Alaska
419 IRVING I, Fairbanks, AK 99775-7000
Email: malindg...@alaska.edu
Phone: 907 474 7959
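Some rough arithmetic shows why the dense N x N proximity matrix hurts at scale (a sketch; the sample size is illustrative):

```r
n <- 30000          # illustrative number of samples
bytes <- n^2 * 8    # dense double-precision N x N proximity matrix
bytes / 2^30        # roughly 6.7 GiB, before any copies are made
```

By contrast, an N x T representation for a typical 500-tree forest would grow linearly in N, which is why the Fortran code offers it as an option for large datasets.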
Re: [R] Force evaluation of variable when calling partialPlot
The plot titles aren't pretty, but the following works for me:

  R> library(randomForest)
  randomForest 4.5-37
  Type rfNews() to see new features/changes/bug fixes.
  R> set.seed(1004)
  R> iris.rf <- randomForest(iris[-5], iris[[5]], ntree = 1001)
  R> par(mfrow = c(2, 2))
  R> for (i in 1:4) partialPlot(iris.rf, iris, names(iris)[i])

Andy

From: Ben Bond-Lamberty

Dear R Users, I'm using the randomForest package and would like to generate partial dependence plots, one after another, for a variety of variables:

  m <- randomForest(s, ...)
  varnames <- c("var1", "var2", "var3", "var4")   # var1..4 are all in data frame s
  for (v in varnames) {
    partialPlot(x = m, pred.data = s, x.var = v)
  }

...but this doesn't work: partialPlot complains that it can't find the variable v. I think I need to force the evaluation of the loop variable v so that partialPlot sees the correct variable names, but I am stumped after trying eval and similar functions. Any suggestions on how to do this? Googling has not turned up anything very useful.

Thanks, Ben
Re: [R] randomForest - PartialPlot - reg
In a partial dependence plot, only the relative scale, not the absolute scale, of the y-axis is meaningful. I.e., you can compare the ranges of the curves between partial dependence plots of two different variables, but not the actual numbers on the axis. The range is compressed compared to the original data because of the averaging. For classification, the function is computed on the logit scale, so it's not necessarily positive. Higher does mean higher probability for the target class.

Andy

-----Original Message-----
From: r-help-boun...@r-project.org On Behalf Of Vijayan Padmanabhan
Sent: Wednesday, September 22, 2010 11:47 PM
To: r-help
Subject: [R] randomForest - PartialPlot - reg

Dear R Group,

I am not sure if this is the right forum to raise this query, but I would rather give it a try and hope to reach the right person in this group who can help. I have a query on the interpretation of partialPlot in the randomForest package. In my earlier queries on this subject, I probably did not give sufficient detail to elicit the explanations I was after, so I am resending the query with an example and a bit more detail.

In a scenario where a set of continuous variables vs. a class response is being modeled by RF, say the iris example, using the following code, how do I interpret the partial plot that is generated?

  library(randomForest)
  data(iris)
  set.seed(543)
  iris.rf <- randomForest(Species ~ ., iris)
  partialPlot(iris.rf, iris, Sepal.Length, "setosa")

How are the y-axis values to be understood? A straightforward textual interpretation of the output from the experts in this area would help me understand the concept of the marginal effect being plotted for the variable Sepal.Length with which.class = "setosa".

Thanks for your help.

Regards, Vijayan Padmanabhan

"What is expressed without proof can be denied without proof" - Euclide.
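For reference, the logit-scale quantity Andy refers to (as described in the partialPlot help page; here p_k(x) denotes the forest's predicted probability of class k out of the K classes) is the class-k partial dependence function:

```latex
f_k(x) \;=\; \log p_k(x) \;-\; \frac{1}{K}\sum_{j=1}^{K} \log p_j(x)
```

The plotted curve is this quantity averaged over the training data, with the plotted variable held fixed at each x-axis value. That is why the values can be negative (or exceed -1, or any other bound on probabilities) and why the range is compressed relative to raw class frequencies.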
Re: [R] randomForest - partialPlot - Reg
From: Vijayan Padmanabhan Dear R Group, I have observed that in some cases, when I use a randomForest model to create a partialPlot in R using the package randomForest, the y-axis displays values that are more negative than -1! It is a classification problem that I was trying to address. Any insights as to how the y-axis can display such values for some variables? Am I missing something?

Yes: the Details section of the help page for partialPlot, or specifically, what the function is plotting for a classification model. Andy

Thanks, Regards, Vijayan Padmanabhan
Re: [R] Passing a function as a parameter...
One possibility:

R> f <- function(x, f) eval(as.call(list(as.name(f), x)))
R> f(1:10, "mean")
[1] 5.5
R> f(1:10, "max")
[1] 10

Andy

From: Jonathan Greenberg R-helpers: If I want to pass a character name of a function TO a function, and then have that function executed, how would I do this? I want an arbitrary version of the following, where any function can be used (e.g. I don't want the if-then statement here):

apply_some_function <- function(data, function_name) {
  if (function_name == "mean") {
    return(mean(data))
  }
  if (function_name == "min") {
    return(min(data))
  }
}
apply_some_function(1:10, "mean")
apply_some_function(1:10, "min")

Basically, I want the character name of the function used to actually execute that function. Thanks! --j
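A couple of other base-R idioms do the same job more directly. This is a small sketch I have added (not from the thread; the function names are mine) using do.call() and match.fun():

```r
# do.call() accepts a function's character name directly
apply_by_name <- function(x, fname) do.call(fname, list(x))

# match.fun() looks the name up and returns the function itself
apply_by_match <- function(x, fname) match.fun(fname)(x)

apply_by_name(1:10, "mean")   # 5.5
apply_by_match(1:10, "max")   # 10
```

Both avoid building an unevaluated call by hand, and both also accept the function object itself rather than its name.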
Re: [R] OT: Is randomization for targeted cancer therapies ethical?
From: jlu...@ria.buffalo.edu Clearly inferior treatments are unethical. The big question is: what constitutes "clearly"? Who decides, and how, what is "clearly" inferior? I'm sure there are plenty of people who don't understand much statistics and are perfectly willing to say the results on the two cousins show the conventional treatment is clearly inferior. Sure, on these two cousins we can say so, but what about others? Donald Berry at MD Anderson in Houston TX and Jay Kadane at Carnegie Mellon have been working on more ethical designs within the Bayesian framework. In particular, response-adaptive designs reduce the assignment to, and continuation of, patients on inferior treatments. I heard L. J. Wei talk about these kinds of designs (I don't remember if they are Bayesian) more than a dozen years ago. I don't know how commonly they are used. Andy

Bert Gunter gunter.ber...@gene.com Sent by: r-help-boun...@r-project.org 09/20/2010 01:31 PM To r-help@r-project.org Subject [R] OT: Is randomization for targeted cancer therapies ethical?

Hi Folks: **Off Topic** Those interested in clinical trials may find the following of interest: http://www.nytimes.com/2010/09/19/health/research/19trial.html It concerns the ethicality of randomizing those with life-threatening disease to relatively ineffective SOC when new biologically targeted therapies appear to be more effective. While the context may be new, the debate itself is not: Tukey wrote (or maybe it was a talk -- I can't remember for sure) about this about 30 years ago. I'm sure many others have done so as well. Cheers, Bert -- Bert Gunter Genentech Nonclinical Biostatistics
Re: [R] Decision Tree in Python or C++?
For Python, check out the project Orange: http://www.ailab.si/orange/doc/catalog/Classify/ClassificationTree.htm Not sure about C++, but OpenDT is in C: http://opendt.sourceforge.net/ Looks like OpenCV has both Python and C++ interfaces (I didn't see a Python interface to decision trees, though): http://opencv.willowgarage.com/documentation/cpp/decision_trees.html Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Wensui Liu Sent: Saturday, September 04, 2010 5:14 PM To: noclue_ Cc: r-help@r-project.org Subject: Re: [R] Decision Tree in Python or C++?

For Python, please check http://onlamp.com/pub/a/python/2006/02/09/ai_decision_trees.html

On Sat, Sep 4, 2010 at 11:21 AM, noclue_ tim@netzero.net wrote: Has anybody used decision trees in Python or C++ (or written their own decision tree implementation in Python or C++)? My goal is to run a decision tree on 8 million observations as the training set and score 7 million in the test set. I am testing the 'rpart' package in a 64-bit Linux + 64-bit R environment, but it seems that rpart is either not stable or runs out of memory very quickly. (Is it because R passes everything as a copy instead of as an object reference?) Any ideas would be greatly appreciated! Have a nice weekend! -- View this message in context: http://r.789695.n4.nabble.com/Decision-Tree-in-Python-or-C-tp2526810p2526810.html Sent from the R help mailing list archive at Nabble.com.

-- == WenSui Liu wens...@paypal.com statcompute.spaces.live.com ==
[R] Open position at Merck (NJ, USA)
Job description: Computational statistician/biometrician. The Biometrics Research Department at Merck Research Laboratories, Merck & Co., Inc. in Rahway, NJ, is seeking a highly motivated statistician/data analyst to work in its basic research, drug discovery, preclinical, and early clinical development areas. The applicant should have broad expertise in statistical computing. Experience and/or education relevant to signal processing, image processing, pattern recognition, machine learning, or bioinformatics is preferred. The position will involve providing statistical, mathematical, and software development support for one or more of the following areas: medical imaging, biological signal analysis including EEG, MS proteomics, and computational chemistry. We are looking for a Ph.D. with a background and/or experience in at least one of the following fields: Statistics, Electrical/Computer or Biomedical Engineering, Computer Science, Applied Mathematics, or Physics. Advanced computer programming skills (including, but not limited to, R, S-PLUS, Matlab, C/C++) and excellent communication skills are essential. An ability to lead statistical analysis efforts within a multidisciplinary team is required. The position may also involve general statistical consulting and training. Our dedication to delivering quality medicines in innovative ways and our commitment to bringing out the best in our people are just some of the reasons why we're ranked among Fortune magazine's 100 Best Companies to Work for in America. We offer a competitive salary, an outstanding benefits package, and a professional work environment with a company known for scientific excellence. To apply, please forward your CV or resume and cover letter to vladimir_svetnik(at)merck.com ATTENTION: Open Position Vladimir Svetnik, Ph.D. Biometrics Research Dept.
Merck Research Laboratories PO Box 2000, RY33-300 Rahway, NJ 07065-0900 USA
Re: [R] RandomForests Limitations? Work Arounds?
You're not giving us much to go on, so the info I can give is correspondingly vague. I take it you are using RF in unsupervised mode. What RF does in this case is simply generate a second copy of the data that has the same marginal distributions as the data you have, but with the variables independent. It then runs classification, treating your data as one class and the generated data as the other class. The output is the proximity matrix, which you can use as the similarity matrix for clustering. Given that, you know that RF basically has to use twice as much memory just to store the data. That's one place where it can take lots of memory. The second place is the storage of the proximity matrix itself: if you have n rows in your data, the proximity matrix is n by n. For moderate n, this is going to be the part that takes up the most memory. Just in case you haven't seen/heard: avoid the formula interface (i.e., randomForest(~ ., data=mydata, ...)), because that can really soak up memory. Yes, a 64-bit OS and 64-bit R can help, but only if you have the RAM to take advantage of the platform. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Michael Lindgren Sent: Tuesday, September 07, 2010 4:28 PM To: r-help@r-project.org Subject: [R] RandomForests Limitations? Work Arounds?

Greetings, I want to inquire about the memory limitations of the randomForest package. I am attempting to perform clustering analysis using RF but I keep getting the message that RF cannot allocate a vector of a given size. I am currently using the 32-bit version of R to run this analysis; are there fewer memory issues when using the 64-bit version of R? Mainly I want to be able to run RF on a very large dataset, but keep having to take very small sample sizes to do so. Any advice is more than appreciated.
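As a concrete illustration of the unsupervised mode Andy describes, here is a minimal sketch I have added (not from the thread) on the small iris data: the proximity matrix from an unsupervised randomForest() fit is turned into a dissimilarity and fed to hierarchical clustering.

```r
library(randomForest)

set.seed(1)
# Unsupervised mode: supply x only, no response, and ask for proximities.
# Internally RF classifies the real rows against synthetic rows that have
# the same marginal distributions but independent variables.
urf <- randomForest(iris[, 1:4], proximity = TRUE, ntree = 500)

# The proximity matrix is n x n -- this is why memory grows quadratically
# with the number of rows.
dim(urf$proximity)   # 150 x 150 for iris

# Use 1 - proximity as a dissimilarity for clustering
hc <- hclust(as.dist(1 - urf$proximity), method = "average")
groups <- cutree(hc, k = 3)
table(groups, iris$Species)
```

On a dataset with, say, 50,000 rows the proximity matrix alone needs 50000^2 doubles (about 20 GB), which matches the "cannot allocate a vector" symptom in the question.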
Best, Michael
Re: [R] predict.loess and NA/NaN values
From: Philipp Pagel In a current project, I am fitting loess models to subsets of data in order to use the loess predictions for normalization (similar to what is done in many microarray analyses). While working on this I ran into a problem when I tried to predict from the loess models and the data contained NAs or NaNs. I tracked down the problem to the fact that predict.loess will not return a value at all when fed with such values. A toy example:

x <- rnorm(15)
y <- x + rnorm(15)
model.lm <- lm(y ~ x)
model.loess <- loess(y ~ x)
predict(model.lm, data.frame(x=c(0.5, Inf, -Inf, NA, NaN)))
predict(model.loess, data.frame(x=c(0.5, Inf, -Inf, NA, NaN)))

The behaviour of predict.lm meets my expectation: I get a vector of length 5 where the unpredictable ones are NA or NaN. predict.loess, on the other hand, returns only 3 values, quietly skipping the last two. I was unable to find anything in the manual page that explains this behaviour or says how to change it. So I'm asking the community: is there a way to fix this or do I have to code around it?

This is not much help, but I did a bit of digging by using debug(stats:::predict.loess) and then stepping through the function line by line. Apparently the problem happens before the actual prediction is done. The code as.matrix(model.frame(delete.response(terms(object)), newdata)) already omitted the NA and NaN. The problem is that that's the default behavior of model.frame(). Consulting ?model.frame, I see that you can override this by setting the "na.action" attribute of the data frame passed to it. Thus I tried setting

na.dat = data.frame(x=c(0.5, Inf, -Inf, NA, NaN))
attr(na.dat, "na.action") = na.pass

This does make the as.matrix(model.frame()) line retain the NA and NaN, but it bombs in the prediction at the subsequent step. I guess it really doesn't like NA as inputs. What you can do is patch the code to add the NAs back after the prediction step (which many predict() methods do). Cheers, Andy

This is in R 2.11.1 (Linux), by the way.
Thanks in advance Philipp -- Dr. Philipp Pagel Lehrstuhl für Genomorientierte Bioinformatik Technische Universität München Wissenschaftszentrum Weihenstephan Maximus-von-Imhof-Forum 3 85354 Freising, Germany http://webclu.bio.wzw.tum.de/~pagel/
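Andy's suggestion (predict on the clean rows, then patch the NAs back in afterwards) can be sketched as a small wrapper. This is a hypothetical helper I have added, not code from the thread; the function name and the single-predictor assumption are mine:

```r
# Predict from a loess fit, returning NA for rows whose predictor is
# NA/NaN/Inf instead of silently dropping them.
predict_loess_na <- function(model, newdata, var = "x") {
  ok  <- is.finite(newdata[[var]])        # FALSE for NA, NaN, and +/-Inf
  out <- rep(NA_real_, nrow(newdata))     # pre-fill with NA
  out[ok] <- predict(model, newdata[ok, , drop = FALSE])
  out
}

set.seed(1)
x <- rnorm(15)
y <- x + rnorm(15)
model.loess <- loess(y ~ x)
predict_loess_na(model.loess, data.frame(x = c(0.5, Inf, -Inf, NA, NaN)))
```

The result has length 5, matching predict.lm's behaviour: a fitted value for 0.5 and NA for the four unpredictable inputs.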
Re: [R] Learning ANOVA
From: Stephen Liu Hi JesperHybel, Thanks for your advice. "If you're trying to follow the youtube video you have a typing mistake here: InsectSprays.aov <- (test01$count ~ test01$spray) I think this should be: InsectSprays.aov <- aov(test01$count ~ test01$spray)" Your advice works for me. Sorry, I missed aov before (test01$count ~ test01$spray).

I just want to offer another point: if you see any tutorial/document/book advising you to use a model formula as above, e.g., anything like

df$var1 ~ df$var2 + df$var3

just run away from it as fast as you can, and try to wipe it from your memory. That's about the worst way to use a model formula, and it is very likely to give you what may seem to be strange problems down the road. Well-written model fitting functions should be called like this:

modelfn(var1 ~ var2 + var3, data=df, ...)

Andy

InsectSprays.aov <- aov(test01$count ~ test01$spray)
summary(InsectSprays.aov)
             Df Sum Sq Mean Sq F value    Pr(>F)
test01$spray  5 2668.8  533.77  34.702 < 2.2e-16 ***
Residuals    66 1015.2   15.38
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(InsectSprays.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = test01$count ~ test01$spray)

$`test01$spray`
       diff        lwr        upr     p adj
B-A   0.833  -3.866075   5.532742 0.9951810
C-A -12.417 -17.116075  -7.717258 0.000
D-A  -9.583 -14.282742  -4.883925 0.014
E-A -11.000 -15.699409  -6.300591 0.000
F-A   2.167  -2.532742   6.866075 0.7542147
C-B -13.250 -17.949409  -8.550591 0.000
D-B -10.417 -15.116075  -5.717258 0.002
E-B -11.833 -16.532742  -7.133925 0.000
F-B   1.333  -3.366075   6.032742 0.9603075
D-C   2.833  -1.866075   7.532742 0.4920707
E-C   1.417  -3.282742   6.116075 0.9488669
F-C  14.583   9.883925  19.282742 0.000
E-D  -1.417  -6.116075   3.282742 0.9488669
F-D  11.750   7.050591  16.449409 0.000
F-E  13.167   8.467258  17.866075 0.000

I compared my result with example(InsectSprays). They look the same. I also compared plot(InsectSprays.aov).
A further question: how can I plot the 4 graphs simultaneously, instead of viewing them individually? I read ?plot but was unable to resolve this. Also, how do I save InsectSprays.aov? I think I can only save it as InsectSprays.csv; I can't find a write.aov command. Thanks TIA B.R. satimis

- Original Message From: JesperHybel jesperhy...@hotmail.com To: r-help@r-project.org Sent: Sat, August 14, 2010 2:09:48 AM Subject: Re: [R] Learning ANOVA

If you're trying to follow the youtube video you have a typing mistake here: InsectSprays.aov <- (test01$count ~ test01$spray) I think this should be: InsectSprays.aov <- aov(test01$count ~ test01$spray) You're missing the function call aov on the right-hand side of the assignment operator '<-'. The result of applying the function aov() is stored in InsectSprays.aov and is accessed through summary(InsectSprays.aov). Since you missed the function call, you cannot apply TukeyHSD() to InsectSprays.aov, I think. -- View this message in context: http://r.789695.n4.nabble.com/Learning-ANOVA-tp2323660p2324590.html Sent from the R help mailing list archive at Nabble.com.
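For the two follow-up questions (showing the four diagnostic plots at once, and saving the fitted model), here is a minimal base-R sketch I have added, not taken from the thread. Note that a model object is saved with saveRDS()/save(), not write.csv():

```r
data(InsectSprays)
# Fit using the formula + data= form, per Andy's advice in the thread
fit <- aov(count ~ spray, data = InsectSprays)

# Show the four diagnostic plots on one page instead of one at a time
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# write.csv only handles tabular data; save the model object itself instead
rds <- tempfile(fileext = ".rds")
saveRDS(fit, rds)
fit2 <- readRDS(rds)   # restores the aov object for later use
```

par(mfrow = c(2, 2)) splits the device into a 2-by-2 grid, so the four plots produced by plot() on a fitted model land on a single page.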
Re: [R] Learning ANOVA
From: Stephen Liu Hi folks, R on Ubuntu 10.04 64 bit. I performed the following steps in R:

### access the object
data(InsectSprays)
### create a .csv file
write.csv(InsectSprays, "InsectSpraysCopy.csv")

On another terminal:
$ sudo updatedb
$ locate InsectSpraysCopy.csv
/home/userA/InsectSpraysCopy.csv

### Read in some data
test01 <- read.csv(file.choose(), header=TRUE)
Enter file name: /home/userA/InsectSpraysCopy.csv
### Look at the data
test01
  X count spray
1 1    10     A
[snipped]

Note the names of the variables here. They don't match what you tried to use in your boxplot() call below. Where did you get the idea that there are DO and Stream in the test01 data frame? Andy

### Create a side-by-side boxplot of the data
boxplot(test01$DO ~ test01$Stream)
Error in model.frame.default(formula = test01$DO ~ test01$Stream) : invalid type (NULL) for variable 'test01$DO'

I was stuck here. Pls help. TIA B.R. Stephen L

- Original Message From: Stephen Liu sati...@yahoo.com To: r-help@r-project.org Sent: Fri, August 13, 2010 11:34:31 AM Subject: [R] Learning ANOVA

Hi folks, The file to be used is in data(InsectSprays). I can't figure out where to insert it in the following command: test01 <- read.csv(file.choose(), header=TRUE) Please help. TIA B.R.
Re: [R] Error on random forest variable importance estimates
From: Pierre Dubath Hello, I am using the R randomForest package to classify variable stars. I have a training set of 1755 stars described by (too) many variables. Some of these variables are highly correlated. I believe that I understand how randomForest works and how the variable importances are evaluated (through variable permutations). Here are my questions.

1) Variable importance error? Is there any way to estimate the error on the MeanDecreaseAccuracy? In other words, I would like to know how significant the MeanDecreaseAccuracy differences are (and display horizontal error bars in the varImpPlot output).

If you really want to do it, one possibility is a permutation test: permute your response, say, 1000 or 2000 times, run RF on each of these permuted responses, and use the importance measures as samples from the null distribution.

I have noticed that even with a relatively large number of trees, there is variation in the importance values from one run to the next. Could this serve as a measure of the errors/uncertainties?

Yes.

2) How to deal with variable correlation? So far, I am iterating: selecting the most important variable first, removing all other variables that have a high correlation (say higher than 80%) with it, taking the second most important variable left, removing variables with high correlation with any of the first two variables, and so on (also using some astronomical insight as to which variables are the most important!). Is there a better way to deal with correlation in randomForest? (I suppose that using many correlated variables should not be a problem for randomForest, but it is for my understanding of the data and for other algorithms.)

That depends a lot on what you're trying to do. RF can tolerate problematic data, but that doesn't mean it will magically give you good answers. Trying to draw conclusions about effects when there are highly correlated (and worse, important) variables is a tricky business.
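The permutation-test idea Andy suggests can be sketched as follows. This is an illustration I have added (not from the thread), shown on the small iris data with deliberately few permutations and trees so it runs quickly; in practice you would use the 1000-2000 permutations Andy mentions:

```r
library(randomForest)

set.seed(42)
x <- iris[, 1:4]
y <- iris$Species

# Observed MeanDecreaseAccuracy for each variable
obs <- randomForest(x, y, ntree = 100,
                    importance = TRUE)$importance[, "MeanDecreaseAccuracy"]

# Null distribution: refit RF with the response permuted, so any apparent
# importance is due to chance alone
null <- replicate(50, {
  yp <- sample(y)
  randomForest(x, yp, ntree = 100,
               importance = TRUE)$importance[, "MeanDecreaseAccuracy"]
})

# One-sided permutation p-value per variable: fraction of null importances
# at least as large as the observed one
pvals <- rowMeans(null >= obs)
pvals
```

The spread of each row of `null` also gives a rough scale for the error bars Pierre asks about.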
3) How many variables should eventually be used? I have made successive runs, adding one variable at a time from the most to the least important of the (not-too-correlated) variables. I then plot the error rate (err.rate) as a function of the number of variables used. As this number increases, the error first decreases sharply, but relatively soon it reaches a plateau. I assume the inflection point can be used to derive the minimum number of variables to be used. Is that a sensible approach? Is there any other suggestion? A measure of the error on err.rate would also really help here. Is there any idea how to estimate this? From the variation between runs, or with the help of importanceSD somehow?

One approach is described in the following paper (in the Proceedings of MCS 2004): http://www.springerlink.com/content/9n61mquugf9tungl/ Best, Andy

Thanks very much in advance for any help. Pierre Dubath
Re: [R] Collinearity in Moderated Multiple Regression
Seems to me it may be worth stating what may be elementary to some on this list:

- If all relevant variables are included in the model and the true model is indeed linear, then all least-squares estimated coefficients are unbiased. [David Ruppert once said about the three kinds of lies: lies, damn lies, and Y ~ N(Xb, s^2).]

- If some variables with non-zero true coefficients are omitted from the fitted model, the estimated coefficients of the variables remaining in the model may be biased, except when the omitted variables are orthogonal to those in the model (i.e., zero correlations).

- If x1 and x2 are correlated, you'd have a tough enough time separating their effects on y, let alone trying to assess their interaction effect on y.

Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter Sent: Tuesday, August 03, 2010 4:52 PM To: Michael Haenlein Cc: r-help@r-project.org Subject: Re: [R] Collinearity in Moderated Multiple Regression

"Biased regression coefficients" is nonsense. The coefficients are unbiased: their expectation (in the appropriate model) is the true value of the parameters (when estimated by, e.g., least squares). The problem is model selection. I suggest you consult a local statistician, as you seem confused about the basic concepts. Bert Gunter Genentech Nonclinical Biostatistics

On Tue, Aug 3, 2010 at 1:42 PM, Michael Haenlein haenl...@escpeurope.eu wrote: Thanks for all your comments! @Dennis: Are there any thresholds that I can use to evaluate the variance inflation factor? I think I learned at some point that VIF should be less than 10, but probably that is too conservative? You mentioned in your example that a VIF of 13 is not big enough to raise a red flag. So is the cut-off more around 15 or 20? @Bert: The purpose of my regression is inference, that is, to know whether and to what extent x1, x2 and x1*x2 influence y.
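Andy's second bullet, bias from omitting a correlated predictor, is easy to see in a small simulation. This is a sketch I have added (not from the thread):

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # x1 and x2 are correlated
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)  # true coefficients: 2 and 3

coef(lm(y ~ x1 + x2))  # both predictors in: estimates near the true values
coef(lm(y ~ x1))       # x2 omitted: x1's slope absorbs part of x2's effect
```

With x2 omitted, the expected slope on x1 is roughly 2 + 3 * 0.8 = 4.4 rather than 2, which is the omitted-variable bias the bullet describes; with both predictors included, the least-squares estimates are unbiased, as Bert points out below.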
It's less about prediction than about understanding the relative impact of different variables. So, if I get your message correctly, correlation among the predictors is likely to be an issue in my case, as it leads to biased regression coefficients (which is what I feared). Thanks, Michael

-Original Message- From: Bert Gunter [mailto:gunter.ber...@gene.com] Sent: Tuesday, August 03, 2010 22:37 To: Dennis Murphy Cc: haenl...@gmail.com; r-help@r-project.org Subject: Re: [R] Collinearity in Moderated Multiple Regression

Absolutely right. But I think it's also worth adding that when the predictors _are_ correlated, the estimates of their coefficients depend on which are included in the model. This means that one should generally not try to interpret the individual coefficients, e.g., as a way to assess their relative importance. Rather, they should just be viewed as the machinery that produces the prediction surface, and it is that surface one needs to consider to understand the model. In my experience, this elementary fact is not understood by many (most?) nonstatistical practitioners using multiple regression -- and this ignorance gets them into a world of trouble. -- Bert Bert Gunter Genentech Nonclinical Biostatistics

On Tue, Aug 3, 2010 at 12:57 PM, Dennis Murphy djmu...@gmail.com wrote: Hi: On Tue, Aug 3, 2010 at 6:51 AM, haenl...@gmail.com wrote: I'm sorry -- I think I chose a bad example. Let me start over again: I want to estimate a moderated regression model of the following form: y = a*x1 + b*x2 + c*x1*x2 + e No intercept? What's your null model, then? Based on my understanding, including an interaction term (x1*x2) in the regression in addition to x1 and x2 leads to issues of multicollinearity, as x1*x2 is likely to covary to some degree with x1 (and x2). Is it possible you're confusing interaction with multicollinearity?
You've stated that x1 and x2 are weakly correlated; the product term is going to be correlated with each of its constituent covariates, but unless that correlation is above 0.9 (some say 0.95) in magnitude, multicollinearity is not really a substantive issue. As others have suggested, if you're concerned about multicollinearity, then fit the interaction model and use the vif() function from package car or elsewhere to check for it. Multicollinearity has to do with ill-conditioning in the model matrix; interaction means that the response y is influenced by the product of the x1 and x2 covariates as well as the individual covariates. They are not the same thing. Perhaps an example will help. Here's your x1 and x2 with a manufactured response:

df <- data.frame(x1 = rep(1:3, each = 3), x2 = rep(1:3, 3))
df$y <- 0.5 + df$x1 + 1.2 * df$x2 + 2.5 * df$x1 * df$x2 + rnorm(9)
# Response is generated to produce a significant interaction
df
  x1 x2         y
1  1  1  5.968255
2  1  2  7.566212
3  1  3 13.420006
4  2  1  9.025791
5  2  2 16.382381
Re: [R] Problems with normality req. for ANOVA
As a matter of fact, I would say both Bert and I encounter designed experiments a lot more than observational studies, yet we speak from experience that the things Bert mentioned happen on a daily basis. When you talk to experimenters, ask your questions carefully and you'll see these things crop up. Andy

-Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of David Winsemius Sent: Monday, August 02, 2010 3:35 PM To: Bert Gunter Cc: r-help@r-project.org; wwreith Subject: Re: [R] Problems with normality req. for ANOVA

In the general situation of observational studies, your point is undoubtedly true, and apparently you believe it to be true even in the setting of designed experiments. Perhaps I should have confined myself to my first sentence. -- David.

On Aug 2, 2010, at 2:05 PM, Bert Gunter wrote: David et al.: I take issue with this. It is the lack of independence that is the major issue. In particular, clustering, split-plotting, and so forth due to convenience-order experimentation, lack of randomization, and exogenous effects like the systematic effects of measurement method/location have the major effect in inducing bias and distorting inference. Normality and unequal variances typically pale to insignificance compared to this. Obviously, IMHO. Note 1: George Box noted this at least 50 years ago in the early '60s when he and Jenkins developed ARIMA modeling. Note 2: If you can, have a look at Jack Youden's classic paper "Enduring Values", which comments to some extent on these issues, here: http://www.jstor.org/pss/1266913 Cheers, Bert Bert Gunter Genentech Nonclinical Biostatistics

On Mon, Aug 2, 2010 at 10:32 AM, David Winsemius dwinsem...@comcast.net wrote: On Aug 2, 2010, at 9:33 AM, wwreith wrote: I am conducting an experiment with four independent variables, each of which has three or more factor levels. The sample size is quite large, i.e. several thousand.
The dependent variable data does not pass a normality test but visually looks close to normal, so is there a way to compute the effect this would have on the p-value for ANOVA, or is there a way to perform a nonparametric test in R that will handle this many independent variables? Simply saying ANOVA is robust to small departures from normality is not going to be good enough for my client.

The statistical assumption of normality for linear models does not apply to the distribution of the dependent variable, but rather to the residuals after a model is estimated. Furthermore, it is the homoskedasticity assumption that is more commonly violated and also a greater threat to validity. (And if you don't already know both of these points, then you desperately need to review your basic modeling practices.)

I need to compute an error amount for ANOVA or find a nonparametric equivalent.

You might get a better answer if you expressed the first part of that question in unambiguous terminology. What is an "error amount"? For the second part, there is an entire Task View on Robust Statistical Methods. -- David Winsemius, MD West Hartford, CT
Re: [R] Collinearity in Moderated Multiple Regression
If the collinearity you're seeing arose from the addition of a product (interaction) term, I do not think penalization is the best answer. What is the goal of your analysis? If it's prediction, then I wouldn't worry about this type of collinearity. If you're interested in inference, I'd try some transformation to reduce (but not necessarily eliminate) the effect of collinearity. Mean centering is the simplest, but not the only thing you can do. Just my $0.02... Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Michael Haenlein Sent: Tuesday, August 03, 2010 10:44 AM To: 'Nikhil Kaza' Cc: r-help@r-project.org Subject: Re: [R] Collinearity in Moderated Multiple Regression Thanks very much -- it seems that Ridge Regression can do what I'm looking for! Best, Michael -Original Message- From: Nikhil Kaza [mailto:nikhil.l...@gmail.com] Sent: Tuesday, August 03, 2010 16:21 To: haenl...@gmail.com Cc: r-help@r-project.org (r-help@R-project.org) Subject: Re: [R] Collinearity in Moderated Multiple Regression My usual strategy for dealing with multicollinearity is to drop the offending variable or transform one of them. I would also check the vif functions in car and Design. I think you are looking for lm.ridge in the MASS package. Nikhil Kaza Asst. Professor, City and Regional Planning University of North Carolina nikhil.l...@gmail.com On Aug 3, 2010, at 9:51 AM, haenl...@gmail.com wrote: I'm sorry -- I think I chose a bad example. Let me start over again: I want to estimate a moderated regression model of the following form: y = a*x1 + b*x2 + c*x1*x2 + e Based on my understanding, including an interaction term (x1*x2) in the regression in addition to x1 and x2 leads to issues of multicollinearity, as x1*x2 is likely to covary to some degree with x1 (and x2). One recommendation I have seen in this context is to use mean centering, but apparently this does not solve the problem (see: Echambadi, Raj and James D. 
Hess (2007), Mean-centering does not alleviate collinearity problems in moderated multiple regression models, Marketing science, 26 (3), 438 - 45). So my question is: Which R function can I use to estimate this type of model. Sorry for the confusion caused due to my previous message, Michael On Aug 3, 2010 3:42pm, David Winsemius dwinsem...@comcast.net wrote: I think you are attributing to collinearity a problem that is due to your small sample size. You are predicting 9 points with 3 predictor terms, and incorrectly concluding that there is some inconsistency because you get an R^2 that is above some number you deem surprising. (I got values between 0.2 and 0.4 on several runs. Try: x1 x2 x3 y model summary(model) # Multiple R-squared: 0.04269 -- David. On Aug 3, 2010, at 9:10 AM, Michael Haenlein wrote: Dear all, I have one dependent variable y and two independent variables x1 and x2 which I would like to use to explain y. x1 and x2 are design factors in an experiment and are not correlated with each other. For example assume that: x1 x2 cor(x1,x2) The problem is that I do not only want to analyze the effect of x1 and x2 on y but also of their interaction x1*x2. Evidently this interaction term has a substantial correlation with both x1 and x2: x3 cor(x1,x3) cor(x2,x3) I therefore expect that a simple regression of y on x1, x2 and x1*x2 will lead to biased results due to multicollinearity. For example, even when y is completely random and unrelated to x1 and x2, I obtain a substantial R2 for a simple linear model which includes all three variables. This evidently does not make sense: y model summary(model) Is there some function within R or in some separate library that allows me to estimate such a regression without obtaining inconsistent results? 
Thanks for your help in advance, Michael Michael Haenlein Associate Professor of Marketing ESCP Europe Paris, France [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. David Winsemius, MD West Hartford, CT [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide
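The centering point discussed in this thread can be sketched with simulated data (all names here are invented): centering shrinks the correlation between a predictor and its product term, but, as Echambadi and Hess argue, it does not change the model itself; the fitted values are identical.

```r
## Simulated illustration: mean-centering reduces the correlation between
## x1 and the product term x1*x2, yet the fitted model is unchanged.
set.seed(42)
x1 <- runif(200)
x2 <- runif(200)
y  <- x1 + x2 + 0.5 * x1 * x2 + rnorm(200, sd = 0.1)

raw <- lm(y ~ x1 * x2)                     # uncentered fit
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cen <- lm(y ~ x1c * x2c)                   # centered fit

cor(x1,  x1  * x2)                         # sizeable correlation
cor(x1c, x1c * x2c)                        # much smaller after centering
all.equal(fitted(raw), fitted(cen))        # same fitted values
```

So centering helps the numerics and the interpretability of the main-effect coefficients, but it cannot make the collinearity inherent in a product term disappear.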
Re: [R] randomForest outlier return NA
There's a bug in the code. If you add row names to the X matrix before you call randomForest(), you'd get:

R> summary(outlier(mdl.rf))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-1.0580 -0.5957  0.      0.6406  1.2650  9.5200

I'll fix this in the next release. Thanks for reporting. Best, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Pau Carrio Gaspar Sent: Wednesday, July 14, 2010 6:36 AM To: r-help@r-project.org Subject: [R] randomForest outlier return NA Dear R-users, I have a problem with randomForest{outlier}. After running the following code (that produces a silly data set and builds a model with randomForest):

###
library(randomForest)
set.seed(0)
## build data set
X <- rbind(matrix(runif(n = 400, min = -1, max = 1), ncol = 10),
           rep(1, times = 10))
Y <- matrix(nrow = nrow(X), ncol = 1)
for (i in (1:nrow(X))) {
  Y[i, 1] <- sign(sum(X[i, ]))
}
## build model
mdl.rf <- randomForest(x = X, y = as.factor(Y), proximity = TRUE,
                       mtry = 10, ntree = 500)
summary(outlier(mdl.rf))
###

I get the following output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
                                                     41

Can anyone explain why the output of outlier() only returns NA's? Thanks Pau __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
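The row-name workaround Andy describes can be sketched as follows (simulated data in the spirit of Pau's example; this assumes the randomForest package behaves as reported at the time, and the object names are invented):

```r
## Workaround sketch: give X explicit row names before fitting, so that
## outlier() can match the rows of the proximity matrix.
library(randomForest)
set.seed(0)
X <- matrix(runif(400, min = -1, max = 1), ncol = 10)
rownames(X) <- seq_len(nrow(X))   # the workaround
Y <- factor(sign(rowSums(X)))
mdl.rf <- randomForest(x = X, y = Y, proximity = TRUE,
                       mtry = 10, ntree = 500)
summary(outlier(mdl.rf))          # numeric summary, no NA's
```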
Re: [R] anyone know why package RandomForest na.roughfix is so slow??
I'll incorporate some of these ideas into the next release. Thanks! Best, Andy -Original Message- From: h.wick...@gmail.com [mailto:h.wick...@gmail.com] On Behalf Of Hadley Wickham Sent: Thursday, July 01, 2010 8:08 PM To: Mike Williamson Cc: Liaw, Andy; r-help Subject: Re: [R] anyone know why package RandomForest na.roughfix is so slow?? Here's another version that's a bit easier to read:

na.roughfix2 <- function(object, ...) {
  res <- lapply(object, roughfix)
  structure(res, class = "data.frame",
            row.names = seq_len(nrow(object)))
}

roughfix <- function(x) {
  missing <- is.na(x)
  if (!any(missing)) return(x)
  if (is.numeric(x)) {
    x[missing] <- median.default(x[!missing])
  } else if (is.factor(x)) {
    freq <- table(x)
    x[missing] <- names(freq)[which.max(freq)]
  } else {
    stop("na.roughfix only works for numeric or factor")
  }
  x
}

I'm cheating a bit because as.data.frame is so slow. Hadley On Thu, Jul 1, 2010 at 6:44 PM, Mike Williamson this.is@gmail.com wrote: Jim, Andy, Thanks for your suggestions! I found some time today to futz around with it, and I found a home-made script to fill in NA values to be much quicker. For those who are interested, instead of using:

dataSet <- na.roughfix(dataSet)

I used:

origCols <- names(dataSet)
## Fix numeric values...
dataSet <- as.data.frame(lapply(dataSet, FUN = function(x) {
  if (!is.numeric(x)) { x } else { ifelse(is.na(x), median(x, na.rm = TRUE), x) }
}), row.names = row.names(dataSet))
## Fix factors...
dataSet <- as.data.frame(lapply(dataSet, FUN = function(x) {
  if (!is.factor(x)) { x } else { levels(x)[ifelse(!is.na(x), x, which.max(table(x)))] }
}), row.names = row.names(dataSet))
names(dataSet) <- origCols

In one case study that I ran, the na.roughfix() algo took 296 seconds whereas the home-made one above took 16 seconds. 
Regards, Mike Telescopes and bathyscaphes and sonar probes of Scottish lakes, Tacoma Narrows bridge collapse explained with abstract phase-space maps, Some x-ray slides, a music score, Minard's Napoleanic war: The most exciting frontier is charting what's already here. -- xkcd -- Help protect Wikipedia. Donate now: http://wikimediafoundation.org/wiki/Support_Wikipedia/en On Thu, Jul 1, 2010 at 10:05 AM, Liaw, Andy andy_l...@merck.com wrote: You need to isolate the problem further, or give more detail about your data. This is what I get:

R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed
   8.44    0.39    8.85

R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB ram. Andy -- *From:* Mike Williamson [mailto:this.is@gmail.com] *Sent:* Thursday, July 01, 2010 12:48 PM *To:* Liaw, Andy *Cc:* r-help *Subject:* Re: [R] anyone know why package RandomForest na.roughfix is so slow?? Andy, You're right, I didn't supply any code, because my call was very simple and it was the call itself in question. However, here is the associated code I am using:

naFixTime <- system.time({
  if (fltrResponse) {
    ## TRUE: there are no NA's in the response... cleared via earlier steps
    message(paste(iAm, ": Missing values will now be imputed...\n", sep = ""))
    try(dataSet <- rfImpute(dataSet[, !is.element(names(dataSet), response)],
                            dataSet[, response]))
  } else {
    ## In this case, there is no response column in the data set
    message(paste(iAm, ": Missing values will now be filled in with median values or most frequent levels", sep = ""))
    try(dataSet <- na.roughfix(dataSet))
  }
})

As you can see, the na.roughfix call is made as simply as possible: I supply the entire dataSet (only parameters, no responses). I am not doing the prediction here (that is done later, and the prediction itself is not taking very long). 
Here are some calculation times that I experienced:

# rows   # cols   time to run na.roughfix
======   ======   =======================
 2046     2833    ~ 2 minutes
 2066     5626    ~ 6 minutes
 2134    14037    ~ 30 minutes

These numbers are on a Windows server using the 64-bit version of 'R'. Regards
Re: [R] anyone know why package RandomForest na.roughfix is so slow??
You have not shown any code on exactly how you use na.roughfix(), so I can only guess. If you are doing something like:

randomForest(y ~ ., mybigdata, na.action = na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets. Most likely it's caused by the formula interface, not na.roughfix() itself. If that is your case, try doing the imputation beforehand and run randomForest() afterward; e.g.,

myroughfixed <- na.roughfix(mybigdata)
randomForest(myroughfixed[list.of.predictor.columns],
             myroughfixed[[myresponse]], ...)

HTH, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Mike Williamson Sent: Wednesday, June 30, 2010 7:53 PM To: r-help Subject: [R] anyone know why package RandomForest na.roughfix is so slow?? Hi all, I am using the package random forest for random forest predictions. I like the package. However, I have fairly large data sets, and it can often take *hours* just to go through the na.roughfix call, which simply goes through and cleans up any NA values to either the median (numerical data) or the most frequent occurrence (factors). I am going to start doing some comparisons between na.roughfix() and some apply() functions which, it seems, are able to do the same job more quickly. But I hesitate to duplicate a function that is already in the package, since I presume the na.roughfix should be as quick as possible and it should also be well tailored to the requirements of random forest. Has anyone else seen that this is really slow? (I haven't noticed rfImpute to be nearly as slow, but I cannot say for sure: my predict data sets are MUCH larger than my model data sets, so cleaning the prediction data set simply takes much longer.) If so, any ideas how to speed this up? Thanks! 
Mike Telescopes and bathyscaphes and sonar probes of Scottish lakes, Tacoma Narrows bridge collapse explained with abstract phase-space maps, Some x-ray slides, a music score, Minard's Napoleanic war: The most exciting frontier is charting what's already here. -- xkcd -- Help protect Wikipedia. Donate now: http://wikimediafoundation.org/wiki/Support_Wikipedia/en __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
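The median/most-frequent fill that this thread is about can be sketched in a few lines of base R. This is a simplified stand-in for illustration, not the randomForest package's implementation; `rough_fill` and the toy data are invented names.

```r
## Simplified stand-in for na.roughfix: fill NA's with the column median
## for numeric columns and the most frequent level for factors.
rough_fill <- function(df) {
  as.data.frame(lapply(df, function(x) {
    miss <- is.na(x)
    if (!any(miss)) return(x)
    if (is.numeric(x)) {
      x[miss] <- median(x[!miss])                 # median fill
    } else if (is.factor(x)) {
      x[miss] <- names(which.max(table(x)))       # modal level fill
    }
    x
  }))
}

d <- data.frame(a = c(1, NA, 3), b = factor(c("x", "x", NA)))
rough_fill(d)   # NA in a -> 2, NA in b -> "x"
```

Working column-by-column like this (rather than through a formula interface) is exactly why the hand-rolled versions in the thread run so much faster on wide data sets.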
Re: [R] anyone know why package RandomForest na.roughfix is so slow??
You need to isolate the problem further, or give more detail about your data. This is what I get:

R> nr <- 2134
R> nc <- 14037
R> x <- matrix(runif(nr*nc), nr, nc)
R> n.na <- round(nr*nc/10)
R> x[sample(nr*nc, n.na)] <- NA
R> system.time(x.fixed <- na.roughfix(x))
   user  system elapsed
   8.44    0.39    8.85

R 2.11.1, randomForest 4.5-35, Windows XP (32-bit), Thinkpad T61 with 2GB ram. Andy From: Mike Williamson [mailto:this.is@gmail.com] Sent: Thursday, July 01, 2010 12:48 PM To: Liaw, Andy Cc: r-help Subject: Re: [R] anyone know why package RandomForest na.roughfix is so slow?? Andy, You're right, I didn't supply any code, because my call was very simple and it was the call itself in question. However, here is the associated code I am using:

naFixTime <- system.time({
  if (fltrResponse) {
    ## TRUE: there are no NA's in the response... cleared via earlier steps
    message(paste(iAm, ": Missing values will now be imputed...\n", sep = ""))
    try(dataSet <- rfImpute(dataSet[, !is.element(names(dataSet), response)],
                            dataSet[, response]))
  } else {
    ## In this case, there is no response column in the data set
    message(paste(iAm, ": Missing values will now be filled in with median values or most frequent levels", sep = ""))
    try(dataSet <- na.roughfix(dataSet))
  }
})

As you can see, the na.roughfix call is made as simply as possible: I supply the entire dataSet (only parameters, no responses). I am not doing the prediction here (that is done later, and the prediction itself is not taking very long). Here are some calculation times that I experienced:

# rows   # cols   time to run na.roughfix
======   ======   =======================
 2046     2833    ~ 2 minutes
 2066     5626    ~ 6 minutes
 2134    14037    ~ 30 minutes

These numbers are on a Windows server using the 64-bit version of 'R'. Regards, Mike Telescopes and bathyscaphes and sonar probes of Scottish lakes, Tacoma Narrows bridge collapse explained with abstract phase-space maps, Some x-ray slides, a music score, Minard's Napoleanic war: The most exciting frontier is charting what's already here. 
-- xkcd -- Help protect Wikipedia. Donate now: http://wikimediafoundation.org/wiki/Support_Wikipedia/en On Thu, Jul 1, 2010 at 8:58 AM, Liaw, Andy andy_l...@merck.com wrote: You have not shown any code on exactly how you use na.roughfix(), so I can only guess. If you are doing something like:

randomForest(y ~ ., mybigdata, na.action = na.roughfix, ...)

I would not be surprised that it's taking very long on large datasets. Most likely it's caused by the formula interface, not na.roughfix() itself. If that is your case, try doing the imputation beforehand and run randomForest() afterward; e.g.,

myroughfixed <- na.roughfix(mybigdata)
randomForest(myroughfixed[list.of.predictor.columns],
             myroughfixed[[myresponse]], ...)

HTH, Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Mike Williamson Sent: Wednesday, June 30, 2010 7:53 PM To: r-help Subject: [R] anyone know why package RandomForest na.roughfix is so slow?? Hi all, I am using the package random forest for random forest predictions. I like the package. However, I have fairly large data sets, and it can often take *hours* just to go through the na.roughfix call, which simply goes through and cleans up any NA values to either the median (numerical data) or the most frequent occurrence (factors). I am going to start doing some comparisons between na.roughfix() and some apply() functions which, it seems, are able to do the same job more quickly. But I hesitate to duplicate a function that is already in the package, since I presume the na.roughfix should be as quick as possible and it should also be well tailored to the requirements of random forest. Has anyone else seen that this is really slow? (I haven't noticed rfImpute to be nearly as slow, but I cannot say for sure: my predict data sets are MUCH larger than my model data sets, so cleaning the prediction data set simply takes much longer.) If so, any ideas how to speed this up? Thanks
Re: [R] Linear Discriminant Analysis in R
cobbler_squad needs more basic help than doing lda. The data input just doesn't make sense. If vowel_feature is a data frame, then G <- vowel_feature[15] creates another data frame containing the 15th variable in vowel_feature, so G is the name of a data frame, not a variable in a data frame. The lda() call makes even less sense. I wonder if he has tried to go through the examples in the help file to understand how it is used? Andy -Original Message- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Joris Meys Sent: Friday, May 28, 2010 8:50 AM To: cobbler_squad Cc: r-help@r-project.org Subject: Re: [R] Linear Discriminant Analysis in R Could you provide us with data to test the code? Use dput (and limit the size!), e.g.:

dput(vowel_features)
dput(mask_features)

Without this information, it's impossible to say what's going wrong. It looks like you're doing something wrong in the selection. What should vowel_features[15] return? Did you check it's actually what you want? Did you use str(G) to check the type? Cheers Joris On Thu, May 27, 2010 at 5:28 PM, cobbler_squad la.f...@gmail.com wrote: Joris, You are a life saver. Based on the two sample files above, I think the lda call should go something like this:

vowel_features <- read.table(file = "mappings_for_vowels.txt")
mask_features <- data.frame(as.matrix(read.table(file = "3dmaskdump_ICA_37_Combined.txt")))
G <- vowel_features[15]
cvc_lda <- lda(G ~ vowel_features[15], data = mask_features,
               na.action = na.omit, CV = TRUE)

ERROR: Error in model.frame.default(formula = G ~ vowel_features[15], data = mask_features, : invalid type (list) for variable 'G'

I am clearly doing something wrong declaring G (how should I declare grouping in R when I need to use one column from the vowel_features file)? Sorry for stupid questions and thank you for being so helpful! 
- again, the sample files that I am working with: mappings_for_vowels.txt:

  V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25 V26
1  E  0  0  0  0  0  0  0  0   0   0   0   0   1   1   0   0   0   1   0   0   0   0   0   0   0
2  o  0  0  0  0  0  0  0  0   0   0   0   0   1   0   0   1   0   1   0   1   0   1   0   0   0
3  I  0  0  0  0  0  0  0  0   0   0   0   0   1   1   0   0   1   0   0   0   0   0   0   0   0
4  ^  0  0  0  0  0  0  0  0   0   0   0   0   1   0   1   0   0   1   0   0   0   0   0   0   0
5  @  0  0  0  0  0  0  0  0   0   0   0   0   1   0   0   1   0   0   1   0   0   0   0   0   0

and the mask_features file is:

             V42          V43          V44           V45           V46          V47          V48          V49
[1,]  2.890891625  2.881188521  2.88778      -2.882606612  -2.77341      2.879834384  2.886483229  2.883815864
[2,]  2.763404707  2.756198683  2.761863881  -2.756827983  -2.762268531  2.754305072  2.760017050  2.758399799
[3,]  0.556614506  0.556377530  0.556247414  -0.556300910  -0.556098321  0.557495060  0.557383073  0.556867424
[4,]  0.367065248  0.366962036  0.366870087  -0.366794442  -0.366644148  0.366613343  0.366537320  0.366953464
[5,]  0.423692393  0.421835623  0.421741829  -0.421897460  -0.421659824  0.421567705  0.421465738  0.422407838

-- View this message in context: http://r.789695.n4.nabble.com/Linear-Discriminant-Analysis-in-R-tp2231922p223.html Sent from the R help mailing list archive at Nabble.com. -- Joris Meys Statistical Consultant Ghent University Faculty of Bioscience Engineering Department of Applied mathematics, biometrics and process control Coupure Links 653 B-9000 Gent tel : +32 9 264 59 87 joris.m...@ugent.be --- Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
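Andy's indexing point above can be made concrete with toy data (the data frame `d` is invented): single brackets on a data frame return a one-column data frame, while double brackets (or `$`) return the underlying vector, which is what lda()'s grouping argument needs.

```r
## Toy illustration: [15] keeps the data-frame wrapper, [[15]] extracts
## the column as a plain vector.
d <- data.frame(a = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
class(d[1])    # "data.frame"
class(d[[1]])  # "integer"
## So the grouping variable should come from something like
## vowel_features[[15]], not vowel_features[15].
```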