Re: [R] speeding up regressions using ddply

2010-09-22 Thread Ista Zahn
Hi Alison,

On Wed, Sep 22, 2010 at 11:05 AM, Alison Macalady a...@kmhome.org wrote:


 Hi,

 I have a data set that I'd like to run logistic regressions on, using ddply
 to speed up the computation of many models with different combinations of
 variables.

In my experience ddply is not particularly fast. I use it a lot
because it is flexible and has easy-to-understand syntax, not for its
speed.

I would like to run regressions on every unique two-variable
 combination in a portion of my data set, but I can't quite figure out how
 to do it using ddply.

I'm not sure ddply is the tool for this job.

The data set looks like this, with status as the
 binary dependent variable and V1:V8 as potential independent variables in
 the logistic regression:

 m <- matrix(rnorm(288), nrow = 36)
 colnames(m) <- paste('V', 1:8, sep = '')
 x <- data.frame( status = factor(rep(rep(c('D','L'), each = 6), 3)),
               as.data.frame(m))


You can use combn to determine the combinations you want:

Varcombos <- combn(names(x)[-1], 2)
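
As a quick check of what combn returns here, the sketch below uses only the V1 to V8 column names from the example data. The names var_names and var_combos are illustrative and not from the thread. Each column of the result is one pair, so eight variables give 28 columns.

```r
# Illustrative names; the thread's data frame x has columns status, V1..V8.
var_names <- paste("V", 1:8, sep = "")

# combn() returns a matrix with one combination per column.
var_combos <- combn(var_names, 2)

dim(var_combos)        # 2 rows (the pair), 28 columns (one per pair)
var_combos[, 1]        # first pair: "V1" "V2"
choose(8, 2)           # 28, the number of two-variable models to fit
```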

From there you can do a loop, something like

results <- list()
for(i in 1:dim(Varcombos)[2])
{
  log.glm <- glm(as.formula(paste("status ~ ", Varcombos[1,i], " + ",
Varcombos[2,i], sep="")), family=binomial(link="logit"),
na.action=na.omit, data=x)
  glm.summary <- summary(log.glm)
  aic <- extractAIC(log.glm)
  coef <- coef(glm.summary)
  # note: column 2 of coef holds standard errors; column 1 holds the estimates
  results[[i]] <- list(Est1=coef[1,2], Est2=coef[3,2],  AIC=aic[2])
#or whatever other output here
  names(results)[i] <- paste(Varcombos[1,i], Varcombos[2,i], sep="_")
}

I'm sure you could replace the loop with something more elegant, but
I'm not really sure how to go about it.
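
One possible way to replace the loop, offered as a sketch rather than the definitive answer: let combn return a list of pairs and lapply a small helper over it. The helper name fit_pair, the set.seed call, and the choice to report estimates for the two predictors are assumptions added for illustration.

```r
# Recreate the example data from the thread (seed added for reproducibility).
set.seed(1)
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste("V", 1:8, sep = "")
x <- data.frame(status = factor(rep(rep(c("D", "L"), each = 6), 3)),
                as.data.frame(m))

# fit_pair is a hypothetical helper: fit one two-variable logistic model
# and return a one-row data frame of the quantities of interest.
fit_pair <- function(vars, data) {
  fit <- glm(reformulate(vars, response = "status"),
             family = binomial(link = "logit"), data = data)
  est <- coef(summary(fit))            # columns: Estimate, Std. Error, ...
  data.frame(var1 = vars[1], var2 = vars[2],
             est1 = est[2, "Estimate"], est2 = est[3, "Estimate"],
             AIC = AIC(fit))
}

# simplify = FALSE makes combn() return a list with one pair per element.
pairs <- combn(names(x)[-1], 2, simplify = FALSE)
fits <- lapply(pairs, fit_pair, data = x)

# Stack the one-row results and show the pairs with the lowest AIC.
results <- do.call(rbind, fits)
head(results[order(results$AIC), ])
```

This is not faster than the loop; it only collects the output in a data frame that is easier to sort and plot.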

 I used melt to put my data frame into a more workable format
 require(reshape)
 xm <- melt(x, id = 'status')

 Here is the basic shape of the function I'd like to apply to every
 combination of variables in the dataset:

 h <- function(df)
 {
 attach(df)
 # What I can't figure out is how to specify 2 different variables from xm
 # to include in the model (I've put value1 and value2 as placeholders)
 log.glm <- glm(status ~ value1 + value2, family=binomial(link="logit"),
                na.action=na.omit)

 glm.summary <- summary(log.glm)
 aic <- extractAIC(log.glm)
 coef <- coef(glm.summary)
 list(Est1=coef[1,2], Est2=coef[3,2],  AIC=aic[2]) # or whatever other output here
 }

 And then I'd like to use ddply to speed up the computations.

 require(plyr)
 output <- ddply(xm, .(variable), as.data.frame.function(h))
 output


 I can easily do this using ddply when I only want to use 1 variable in the
 model, but can't figure out how to do it with two variables.

I don't think this approach can work. You are asking ddply to split xm by
variable and then expecting to reference different levels of variable
within each split, which is impossible: each split contains only one level.
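
To make the point concrete, here is a small sketch that builds the long format in base R as a stand-in for melt (an assumption, so that the example does not depend on reshape) and splits it by variable. ddply(xm, .(variable), ...) hands the function one such piece at a time.

```r
# Recreate the example data from the thread (seed added for reproducibility).
set.seed(1)
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste("V", 1:8, sep = "")
x <- data.frame(status = factor(rep(rep(c("D", "L"), each = 6), 3)),
                as.data.frame(m))

# Base-R stand-in for melt(x, id = 'status'): one row per (observation, variable).
xm <- data.frame(status = rep(x$status, times = 8),
                 variable = rep(colnames(m), each = nrow(x)),
                 value = as.vector(m))

# Split the long data by variable, one piece per original column.
pieces <- split(xm, xm$variable)

# Each piece holds the values of exactly one variable, so a function applied
# to a single piece never sees a second predictor.
sapply(pieces, function(p) length(unique(p$variable)))   # all 1
names(pieces$V1)                                         # "status" "variable" "value"
```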

Hope this helps,
Ista


 Many thanks for any hints!

 Ali



 
 Alison Macalady
 Ph.D. Candidate
 University of Arizona
 School of Geography and Development
  Laboratory of Tree Ring Research

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org



Re: [R] speeding up regressions using ddply

2010-09-22 Thread Abhijit Dasgupta, PhD
 There has been a recent addition of parallel processing capabilities 
to plyr (I believe v1.2 and later), along with a dataframe iterator 
construct. Both have improved performance of ddply greatly for 
multicore/cluster computing. So we now have the niceness of plyr's 
grammar with pretty good performance. From the plyr NEWS file:


Version 1.2 (2010-09-09)
------------------------

NEW FEATURES

* l*ply, d*ply, a*ply and m*ply all gain a .parallel argument that when TRUE,
  applies functions in parallel using a parallel backend registered with the
  foreach package:

  x <- seq_len(20)
  wait <- function(i) Sys.sleep(0.1)
  system.time(llply(x, wait))
  #  user  system elapsed
  # 0.007   0.005   2.005

  library(doMC)
  registerDoMC(2)
  system.time(llply(x, wait, .parallel = TRUE))
  #  user  system elapsed
  # 0.020   0.011   1.038
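
The NEWS example uses plyr with a foreach backend. As a rough sketch of the same idea applied to the pairwise models, the code below instead uses mclapply from base R's parallel package, which is not mentioned in this thread. The helper name pair_aic, the seed, and the choice of two workers are assumptions.

```r
library(parallel)

# Recreate the example data from the thread (seed added for reproducibility).
set.seed(1)
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste("V", 1:8, sep = "")
x <- data.frame(status = factor(rep(rep(c("D", "L"), each = 6), 3)),
                as.data.frame(m))

pairs <- combn(names(x)[-1], 2, simplify = FALSE)

# pair_aic is a hypothetical helper returning the AIC of one two-variable model.
pair_aic <- function(vars) {
  AIC(glm(reformulate(vars, response = "status"),
          family = binomial(link = "logit"), data = x))
}

# mclapply() forks worker processes; forking is unavailable on Windows,
# where mc.cores must stay at 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
aics <- unlist(mclapply(pairs, pair_aic, mc.cores = n_cores))
names(aics) <- sapply(pairs, paste, collapse = "_")
head(sort(aics))
```

For models this small, the cost of starting the workers may outweigh any gain; the approach is more likely to pay off with larger data or many more pairs.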



--

Abhijit Dasgupta, PhD
Director and Principal Statistician
ARAASTAT
Ph: 301.385.3067
E: adasgu...@araastat.com
W: http://www.araastat.com



Re: [R] speeding up regressions using ddply

2010-09-22 Thread Greg Snow
Why do you want to do this?

If you are interested in just a small part of the logistic regression, 
then there may be a way to compute or approximate it more quickly than 
doing a full glm fit on every pair.  It seems unlikely that you would get 
much meaning out of that many full regressions, but extracting just the 
piece you are looking for could lend itself to further graphing/analysis.
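
One way to read this suggestion, assuming AIC is the only quantity needed (an assumption; the thread does not say which piece is wanted), is to call glm.fit directly on a hand-built design matrix, which may save the overhead of formula handling and summary(). The helper name aic_only and the seed are illustrative.

```r
# Recreate the example data from the thread (seed added for reproducibility).
set.seed(1)
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste("V", 1:8, sep = "")
status <- factor(rep(rep(c("D", "L"), each = 6), 3))

# glm() treats the first factor level ('D') as failure, so code 'L' as 1.
y <- as.integer(status == "L")

# aic_only is a hypothetical helper: fit via glm.fit() on a design matrix
# (intercept column plus two predictors) and return just the AIC.
aic_only <- function(vars) {
  X <- cbind(1, m[, vars])
  glm.fit(X, y, family = binomial())$aic
}

pairs <- combn(colnames(m), 2, simplify = FALSE)
aics <- sapply(pairs, aic_only)
names(aics) <- sapply(pairs, paste, collapse = "_")
head(sort(aics))
```

The values should agree with AIC() on the corresponding glm fits, so the ranking of pairs is unchanged; only the bookkeeping around each fit is reduced.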

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111

