[R] regression on large file

2009-10-28 Thread Georg Ehret
Dear R community,
   I have a fairly large file with variables in rows. Every variable
(there are thousands) needs to be regressed on a reference variable. The file
is too big to load into R (or R gets too slow once it is loaded), so I now
read it in line by line with scan() (see below) and write the results to out.
Although improved, this is still very slow. Can someone please help me and
suggest how I can make this faster?

Thank you and best regards, Georg.
***
Georg Ehret, Johns Hopkins U, Baltimore MD, USA


for (i in 16:nmax) {
  line <- scan(file = paste(file), nlines = 1, skip = (i - 1), what = "integer", sep = ",")
  d <- as.numeric(line[-1])
  name <- line[1]
  modela <- lm(s1 ~ a + a2 + b + s + M + W)
  modelb <- lm(s2 ~ a + a2 + b + s + M + W + d)
  modelc <- lm(s3 ~ a + a2 + b + s + M + W + d + d * s)
  p_main <- anova(modela, modelb)$P[2]
  p_main_i <- anova(modela, modelc)$P[2]
  p_i <- anova(modelb, modelc)$P[2]

  cat(c(name, p_main, p_main_i, p_i), file = paste(out, ".txt", sep = ""), append = TRUE)
  cat("\n", file = paste(out, ".txt", sep = ""), append = TRUE)
}


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] regression on large file

2009-10-28 Thread Barry Rowlingson
On Wed, Oct 28, 2009 at 11:50 AM, Georg Ehret georgeh...@gmail.com wrote:
> Dear R community,
>    I have a fairly large file with variables in rows. Every variable
> (thousands) needs to be regressed on a reference variable. The file is too
> big to load into R (or R gets too slow having done it) and I do now read in
> line by line with scan (see below) and write the results to out. Although
> improved, this is still very slow... Can someone please help me and suggest
> how I can make this faster?
>
> Thank you and best regards, Georg.
> ***
> Georg Ehret, Johns Hopkins U, Baltimore MD, USA
>
>
> for (i in 16:nmax) {
>   line <- scan(file = paste(file), nlines = 1, skip = (i - 1), what = "integer", sep = ",")
>   d <- as.numeric(line[-1])
>   name <- line[1]
>   modela <- lm(s1 ~ a + a2 + b + s + M + W)
>   modelb <- lm(s2 ~ a + a2 + b + s + M + W + d)
>   modelc <- lm(s3 ~ a + a2 + b + s + M + W + d + d * s)
>   p_main <- anova(modela, modelb)$P[2]
>   p_main_i <- anova(modela, modelc)$P[2]
>   p_i <- anova(modelb, modelc)$P[2]
>
>   cat(c(name, p_main, p_main_i, p_i), file = paste(out, ".txt", sep = ""), append = TRUE)
>   cat("\n", file = paste(out, ".txt", sep = ""), append = TRUE)
> }

 Normally you shouldn't try to optimise something until you know where
the time is going. It could be that fitting your three linear models
is taking most time, in which case there's no point optimising the
input/output...

 But I reckon (and this is a guess) that the time is taken by the fact that
scan() is having to skip from the start of the file every time. You can
confirm this by commenting out all the stuff inside the loop except for the
line <- scan(...) line. If this still takes ages then we've found the
bottleneck.
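
A minimal sketch of that check, timing the read step on its own with system.time(). The file here is a made-up stand-in (hypothetical names var1, var2, ...; 50 values per row), not the real data, so only the shape of the test carries over:

```r
# Hypothetical stand-in for the real file: one variable per row,
# name in the first field, comma-separated values after it.
tmp <- tempfile(fileext = ".csv")
n_rows <- 1000
rows <- vapply(seq_len(n_rows), function(i) {
  paste(c(paste0("var", i), sample(0:2, 50, replace = TRUE)), collapse = ",")
}, character(1))
writeLines(rows, tmp)

# Time only the read step, done the same way as in the loop:
# skip = i - 1 makes scan() start again from the top of the file each time.
t_skip <- system.time(
  for (i in seq_len(n_rows)) {
    line <- scan(file = tmp, nlines = 1, skip = i - 1,
                 what = "character", sep = ",", quiet = TRUE)
  }
)
print(t_skip["elapsed"])
```

If the elapsed time of this read-only loop is close to that of the full loop, the input step is the bottleneck and not the lm() fits.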

 So, what you then do to fix that is to get R to read from a
connection. This is an object that you can read from sequentially
without having to skip from the start every time. There are examples in
help(connections) that will get you going.
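
A rough sketch of the connection approach, under some assumptions: the input is comma-separated with the variable name in the first field, and mean(d) is only a placeholder for the model fitting. The file contents and the names con_in, con_out and out_path are invented for illustration.

```r
# Hypothetical input with the layout assumed above
tmp <- tempfile(fileext = ".csv")
writeLines(c("var1,0,1,2,1", "var2,2,2,0,1", "var3,1,0,0,2"), tmp)
out_path <- tempfile(fileext = ".txt")

# Open both files once; each read continues where the previous one stopped,
# and the output file is not reopened on every iteration.
con_in <- file(tmp, open = "r")
con_out <- file(out_path, open = "w")

results <- list()
repeat {
  txt <- readLines(con_in, n = 1)
  if (length(txt) == 0) break                 # end of file reached
  fields <- strsplit(txt, ",", fixed = TRUE)[[1]]
  name <- fields[1]
  d <- as.numeric(fields[-1])
  stat <- mean(d)                             # placeholder: fit the models here
  results[[name]] <- stat
  cat(name, stat, "\n", file = con_out)
}
close(con_in)
close(con_out)
```

To start at row 16 as the original loop does, discard the first 15 lines once with readLines(con_in, n = 15) before entering the loop.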


Barry



Re: [R] regression on large file

2009-10-28 Thread Benilton Carvalho

The bigmemory and biglm packages may be of interest to you.

b

