[R] regression on large file
Dear R community,

I have a fairly large file with variables in rows. Every variable (thousands) needs to be regressed on a reference variable. The file is too big to load into R (or R gets too slow having done it), so I now read it in line by line with scan (see below) and write the results to out. Although improved, this is still very slow... Can someone please help me and suggest how I can make this faster?

Thank you and best regards,
Georg.

***
Georg Ehret, Johns Hopkins U, Baltimore MD, USA

    for (i in 16:nmax){
      line <- scan(file=paste(file), nlines=1, skip=(i-1), what="integer", sep=",")
      d <- as.numeric(line[-1])
      name <- line[1]
      modela <- lm(s1~a+a2+b+s+M+W)
      modelb <- lm(s2~a+a2+b+s+M+W+d)
      modelc <- lm(s3~a+2+b+s+M+W+d+d*s)
      p_main <- anova(modela,modelb)$P[2]
      p_main_i <- anova(modela,modelc)$P[2]
      p_i <- anova(modelb,modelc)$P[2]
      cat(c(name,p_main,p_main_i,p_i), file=paste(out,".txt",sep=""), append=T)
      cat("\n", file=paste(out,".txt",sep=""), append=T)
    }

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] regression on large file
On Wed, Oct 28, 2009 at 11:50 AM, Georg Ehret georgeh...@gmail.com wrote:

> The file is too big to load into R (or R gets too slow having done it) and
> I do now read in line by line with scan (see below) and write the results
> to out. Although improved, this is still very slow... Can someone please
> help me and suggest how I can make this faster?

Normally you shouldn't try to optimise something until you know where the time is going. It could be that fitting your three linear models is taking most of the time, in which case there's no point optimising the input/output...

But I reckon (and this is a guess) the time is taken by the fact that scan() has to skip from the start of the file every time: with skip=(i-1) it re-reads all the earlier lines on each iteration, so the total work grows roughly with the square of the number of lines. You can confirm this by commenting out everything inside the loop except the line <- scan(...) line. If this still takes ages then we've found the bottleneck.

So, what you then do to fix that is to get R to read from a connection. This is an object that you can read from sequentially without having to skip from the start every time. There are examples in help(connections) that will get you going.

Barry
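A minimal sketch of the connection approach, not a drop-in replacement for the original loop. The demo file, the variables y and s, and the simplified two-model comparison are all made up for illustration; only file(), readLines(), strsplit(), lm(), anova() and cat() from base R are used. The input is opened once and read one line at a time, so nothing is re-scanned from the top, and the output is written through a second connection instead of reopening the file with append=T on every row.

```r
# Hypothetical demo data: one variable per row, name first, then its values.
set.seed(1)
n <- 20
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".txt")
rows <- sapply(1:5, function(k)
  paste(c(paste0("var", k), rpois(n, 3)), collapse = ","))
writeLines(rows, infile)

# Made-up reference variable and covariate, standing in for s1, a, b, ...
y <- rnorm(n)
s <- rnorm(n)

# The base model does not depend on the row being read, so fit it once.
base <- lm(y ~ s)

con <- file(infile, open = "r")    # opened once; each read continues where the last stopped
out <- file(outfile, open = "w")   # output is also opened once

# To skip header lines first (the original loop starts at line 16),
# something like readLines(con, n = 15) could be called here.
while (length(line <- readLines(con, n = 1)) == 1) {
  fields <- strsplit(line, ",", fixed = TRUE)[[1]]
  name <- fields[1]
  d <- as.numeric(fields[-1])
  main <- lm(y ~ s + d)
  p_main <- anova(base, main)[2, "Pr(>F)"]
  cat(name, p_main, "\n", file = out)
}

close(con)
close(out)

results <- read.table(outfile, col.names = c("name", "p_main"))
print(results)
```

Wrapping the loop in system.time() before and after the change gives a quick check of whether the file reading really was the bottleneck.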
Re: [R] regression on large file
The bigmemory and biglm packages may be of interest to you.

b

On Oct 28, 2009, at 8:50 AM, Georg Ehret wrote:

> The file is too big to load into R (or R gets too slow having done it) and
> I do now read in line by line with scan (see below) and write the results
> to out. Although improved, this is still very slow... Can someone please
> help me and suggest how I can make this faster?