[R] Help with speed (replacing the loop?)
Dear R-ers,

I have a loop below that goes through the numeric variables in data frame x and through the levels of the factor group, and multiplies (group by group) the values of the numeric variables in x by the corresponding group-specific values from data frame y.

In reality, dim(x) is 300,000 rows by 100 variables, and dim(y) is 120 levels of group by 100 variables. So my huge data frame x takes up a lot of space in memory. This is why I am replacing the values of a and b in x with the newly calculated values, rather than adding them as new columns.

The code does what I need, but it takes forever. Is there a speedier way to achieve what I need?

Thanks a lot!
Dimitri

# Example data:
x <- data.frame(group=c(rep("group1",5),rep("group2",5)), a=1:10, b=seq(10,100,by=10))
x$group <- as.factor(x$group)
y <- data.frame(group=c("group1","group2"), a=c(10,20), b=c(2,3))
y$group <- as.factor(y$group)
(x);(y)

# My code:
myvars <- c("a","b")
for(var in myvars){
  for(group in levels(y$group)){
    temp <- x[x$group %in% group,var]
    temp <- temp * y[y$group %in% group,var]
    x[x$group %in% group,var] <- temp
  }
}
(x)

--
Dimitri Liakhovitski

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Help with speed (replacing the loop?)
Hi,

On Wed, Jan 11, 2012 at 9:57 AM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:
> Dear R-ers,
>
> I have a loop below that loops through my numeric variables in data frame x and through levels of the factor group and multiplies (group by group) the values of numeric variables in x by the corresponding group-specific values from data frame y.
> In reality, my: dim(x) is 300,000 rows by 100 variables, and dim(y) is 120 levels of group by 100 variables.
> So, my huge data frame x takes up a lot of space in memory. This is why I am actually replacing values of a and b in x with newly calculated values, rather than adding them.
> The code does what I need, but it takes forever.
> Is there maybe a more speedy way to achieve what I need?
> Thanks a lot!

Here's an all-middle-steps-included way to do so using data.table. If you use more data.table-centric idioms (the `:=` operator and other ways to `merge`) you can likely eke out lower memory use and higher speed, but I'll leave it like so for pedagogical purposes ;-)

library(data.table)

## your data
xx <- data.table(group=c(rep("group1",5),rep("group2",5)), a=1:10, b=seq(10,100,by=10), key="group")
yy <- data.table(group=c("group1","group2"), a=c(10,20), b=c(2,3), key="group")

## temp data.table to get your ducks in a row
m <- merge(xx, yy, by="group", suffixes=c(".x", ".y"))

## your answers will be in the aa and bb columns
result <- transform(m, aa=a.x * a.y, bb=b.x * b.y)

Truth be told, if you use normal data.frames, the code will look very similar to the above, so you can try that, too.

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
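For reference, a minimal sketch of the plain data.frame variant that Steve alludes to, using only base R `merge` and `transform` on the example data from the original post. The output column names aa and bb follow his data.table version. This only illustrates the shape of the code; it has not been timed on data of the real size, and it still builds the merged table m in memory.

```r
# Example data from the original post (groups kept as plain strings here)
x <- data.frame(group = c(rep("group1", 5), rep("group2", 5)),
                a = 1:10, b = seq(10, 100, by = 10))
y <- data.frame(group = c("group1", "group2"), a = c(10, 20), b = c(2, 3))

# Join the per-group multipliers onto every row of x;
# columns shared by both tables get the .x / .y suffixes
m <- merge(x, y, by = "group", suffixes = c(".x", ".y"))

# Multiply each variable by its group-specific value
result <- transform(m, aa = a.x * a.y, bb = b.x * b.y)
```

The merged table holds both the original and the multiplier columns, which is the memory cost raised later in the thread.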
Re: [R] Help with speed (replacing the loop?)
Thanks a lot, Steve. I have one question (below):

> library(data.table)
>
> ## your data
> xx <- data.table(group=c(rep("group1",5),rep("group2",5)), a=1:10, b=seq(10,100,by=10), key="group")
> yy <- data.table(group=c("group1","group2"), a=c(10,20), b=c(2,3), key="group")
>
> ## temp data.table to get your ducks in a row
> m <- merge(xx, yy, by="group", suffixes=c(".x", ".y"))

Dimitri: The step above (merge) - I was thinking of it but decided against it because my xx already fills up tons of memory. When I merge xx and yy, that doubles the number of variables - I am afraid my memory won't hold that much stuff...

> ## your answers will be in the aa and bb columns
> result <- transform(m, aa=a.x * a.y, bb=b.x * b.y)
>
> Truth be told, if you use normal data.frames, the code will look very similar to above, so you can try that, too.
>
> HTH,
> -steve

--
Dimitri Liakhovitski
marketfusionanalytics.com
Re: [R] Help with speed (replacing the loop?)
Hi,

On Wed, Jan 11, 2012 at 10:50 AM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:
> Thanks a lot, Steve. I have one question (below):
>
>> library(data.table)
>>
>> ## your data
>> xx <- data.table(group=c(rep("group1",5),rep("group2",5)), a=1:10, b=seq(10,100,by=10), key="group")
>> yy <- data.table(group=c("group1","group2"), a=c(10,20), b=c(2,3), key="group")
>>
>> ## temp data.table to get your ducks in a row
>> m <- merge(xx, yy, by="group", suffixes=c(".x", ".y"))
>
> Dimitri: The step above (merge) - I was thinking of it but decided against it because my xx already fills up tons of memory. When I merge xx and yy, that doubles the number of variables - I am afraid my memory won't hold that much stuff...

Fair enough ... how about just using `match`, then, i.e.:

R> aa <- x$a * y$a[match(x$group, y$group)]

Should do the trick, no?

-steve
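To see why `match` does the job, here is a small sketch on the example data. `match(x$group, y$group)` returns, for each row of x, the row of y that holds the same group, so indexing y$a by it stretches the per-group multipliers to the length of x without building a merged table. The name idx is introduced here only for illustration.

```r
# Example data from the original post
x <- data.frame(group = c(rep("group1", 5), rep("group2", 5)),
                a = 1:10, b = seq(10, 100, by = 10))
y <- data.frame(group = c("group1", "group2"), a = c(10, 20), b = c(2, 3))

# For every row of x, the position of its group in y
idx <- match(x$group, y$group)
print(idx)   # 1 1 1 1 1 2 2 2 2 2

# Per-row multiplier looked up through idx, then one vectorized product
aa <- x$a * y$a[idx]
print(aa)    # 10 20 30 40 50 120 140 160 180 200
```

The only temporaries are idx and one vector of multipliers, each as long as a single column of x.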
Re: [R] Help with speed (replacing the loop?)
Performance problems are commonly addressed by using more memory. If your algorithm needs to join those tables and do calculations, then you can either pay the piper in memory (usually the most appropriate answer), or you can reinvent those optimized algorithms in a compiled language and figure out how to thread them together to minimize memory use (a dangerous course of action).

Try working on chunks of xx? Or getting more memory in your computer?

---
Jeff Newmiller
DCN:jdnew...@dcn.davis.ca.us
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---
Sent from my phone. Please excuse my brevity.

Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:
> Thanks a lot, Steve. I have one question (below):
>
>> library(data.table)
>>
>> ## your data
>> xx <- data.table(group=c(rep("group1",5),rep("group2",5)), a=1:10, b=seq(10,100,by=10), key="group")
>> yy <- data.table(group=c("group1","group2"), a=c(10,20), b=c(2,3), key="group")
>>
>> ## temp data.table to get your ducks in a row
>> m <- merge(xx, yy, by="group", suffixes=c(".x", ".y"))
>
> Dimitri: The step above (merge) - I was thinking of it but decided against it because my xx already fills up tons of memory. When I merge xx and yy, that doubles the number of variables - I am afraid my memory won't hold that much stuff...
>
>> ## your answers will be in the aa and bb columns
>> result <- transform(m, aa=a.x * a.y, bb=b.x * b.y)
>>
>> Truth be told, if you use normal data.frames, the code will look very similar to above, so you can try that, too.
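One way to read the "work on chunks" suggestion, sketched under assumptions: process x in blocks of rows so that any temporary objects are only as large as one block. The sketch below pairs chunking with the `match` lookup from Steve's earlier reply rather than with `merge`; the chunk size of 4 is an arbitrary illustrative value, and nothing here has been timed on data of the real size.

```r
# Example data from the original post
x <- data.frame(group = c(rep("group1", 5), rep("group2", 5)),
                a = 1:10, b = seq(10, 100, by = 10))
y <- data.frame(group = c("group1", "group2"), a = c(10, 20), b = c(2, 3))
myvars <- c("a", "b")

chunk_size <- 4  # hypothetical value; choose one that fits in memory

for (start in seq(1, nrow(x), by = chunk_size)) {
  # Row indices of the current block (the last block may be shorter)
  rows <- start:min(start + chunk_size - 1, nrow(x))
  # Matching row of y for each row in the block
  idx <- match(x$group[rows], y$group)
  # Overwrite the block in place with the scaled values
  x[rows, myvars] <- x[rows, myvars] * y[idx, myvars]
}
```

A block may span several groups; the lookup is per row, so that does not matter.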
Re: [R] Help with speed (replacing the loop?)
Thanks a lot, Steve!
match sounds very promising - that means I only need a loop across predictors.
As far as the "get more memory" advice is concerned: I already have more memory :)

On Wed, Jan 11, 2012 at 11:14 AM, Steve Lianoglou mailinglist.honey...@gmail.com wrote:
> Hi,
>
> On Wed, Jan 11, 2012 at 10:50 AM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:
>> Thanks a lot, Steve. I have one question (below):
>>
>>> library(data.table)
>>>
>>> ## your data
>>> xx <- data.table(group=c(rep("group1",5),rep("group2",5)), a=1:10, b=seq(10,100,by=10), key="group")
>>> yy <- data.table(group=c("group1","group2"), a=c(10,20), b=c(2,3), key="group")
>>>
>>> ## temp data.table to get your ducks in a row
>>> m <- merge(xx, yy, by="group", suffixes=c(".x", ".y"))
>>
>> Dimitri: The step above (merge) - I was thinking of it but decided against it because my xx already fills up tons of memory. When I merge xx and yy, that doubles the number of variables - I am afraid my memory won't hold that much stuff...
>
> Fair enough ... how about just using `match`, then, i.e.:
>
> R> aa <- x$a * y$a[match(x$group, y$group)]
>
> Should do the trick, no?
>
> -steve

--
Dimitri Liakhovitski
marketfusionanalytics.com
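A minimal sketch of the "loop across predictors" idea, assuming the example data from the original post: the `match` lookup is computed once and reused for every column, and each column of x is overwritten in place as in the original loop. The names idx and v are illustrative only.

```r
# Example data from the original post
x <- data.frame(group = c(rep("group1", 5), rep("group2", 5)),
                a = 1:10, b = seq(10, 100, by = 10))
y <- data.frame(group = c("group1", "group2"), a = c(10, 20), b = c(2, 3))
myvars <- c("a", "b")

# Row of y for every row of x; computed once, reused for each variable
idx <- match(x$group, y$group)

# One vectorized multiplication per variable, replacing the column in x
for (v in myvars) {
  x[[v]] <- x[[v]] * y[[v]][idx]
}
```

Compared with the original nested loop, the inner loop over the levels of group disappears, so the number of passes drops from (variables x groups) to one per variable.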