[R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values
Hi all, [this is a bit hard to describe, so if my initial description is confusing, please try running my code below] #WHAT I'M TRYING TO DO I'd appreciate any help in trying to speed up some code. I've written a script that converts a matrix of integers (usually between 1-10,000 - these represent allele names) into two new matrices of normally distributed values (representing genetic effects), such that a given number in the integer (allele name) matrix always corresponds to the *same* randomly distributed (genetic) effects. For example, every time my function sees the allele name 3782, it converts that integer into the same two effects (e.g., -.372 1.727), which follow normal distributions (these effects can be correlated; below I've set their correlation to .5). I have an entire matrix of integers, and am converting those into two entire matrices of effects. #HOW I'VE DONE IT SO FAR To get the correlations between the effects, I've used the mvrnorm function from MASS To convert the allele names to genetic effects, I've created a function (make.effect) that resets the set.seed() to the allele name each time its called. To get the matrix of genetic effects, I use sapply. #THE PROBLEM The problem is that I often need to convert matrices that have 500K cells, and do this over several hundred iterations, so it is quite slow. If anyone has ideas to speed this up (e.g., some clever way to convert a matrix of integers to a matrix of effects without using sapply), I would be forever indebted. ##MY CODE library(MASS) ##run this example to see what I'm talking about above make.effects - function(x,mn=0,var=1,cov=.5){ set.seed(x) return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))} (alleles - matrix(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4)) a12 - array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles))) (a1 - a12[1,,]) (a2 - a12[2,,]) #I've set the population correlation = .5; empirical corr=.4635 cor(as.vector(a1),as.vector(a2)) ##run this to see that the code begins to get pretty slow with even a 3000x4 matrix system.time({ alleles - matrix(rep(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),1000),ncol=4) a12 - array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles))) a1 - a12[1,,] a2 - a12[2,,] }) -- Matthew C Keller Postdoctoral Fellow Virginia Institute for Psychiatric and Behavioral Genetics __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values
Considering that the vast majority of your time is spent in the function mvrnorm (on my system 5.7 out of 6.1 seconds). In your example that is 12000 calls to the function. To improve your speed you have to cut down the number of calls to the function. For example, how many unique integers do you have and can to do the calls for those and then substitute matching values. Here is what Rprof showed: total.time total.pct self.time self.pct system.time 6.12 99.7 0.00 0.0 as.vector 6.06 98.7 0.18 2.9 FUN 6.06 98.7 0.12 2.0 array 6.06 98.7 0.10 1.6 lapply 6.06 98.7 0.00 0.0 sapply 6.06 98.7 0.00 0.0 eval6.04 98.4 0.06 1.0 mvrnorm 5.72 93.2 0.34 5.5 eigen 2.58 42.0 0.52 8.5 or another way of looking at it: 0 6.1 root 1.6.1 system.time 2. .6.0 eval 3. . .6.0 eval 4. . . .6.0 array 5. . . . .6.0 as.vector 6. . . . . .6.0 sapply 7. . . . . . .6.0 lapply 8. . . . . . . .6.0 FUN 9. . . . . . . . .5.7 mvrnorm 10. . . . . . . . . .2.6 eigen 11. . . . . . . . . . .1.2 sort.list 12. . . . . . . . . . . .1.0 match.arg 13. . . . . . . . . . . . .0.7 eval On 3/16/07, Matthew Keller [EMAIL PROTECTED] wrote: Hi all, [this is a bit hard to describe, so if my initial description is confusing, please try running my code below] #WHAT I'M TRYING TO DO I'd appreciate any help in trying to speed up some code. I've written a script that converts a matrix of integers (usually between 1-10,000 - these represent allele names) into two new matrices of normally distributed values (representing genetic effects), such that a given number in the integer (allele name) matrix always corresponds to the *same* randomly distributed (genetic) effects. For example, every time my function sees the allele name 3782, it converts that integer into the same two effects (e.g., -.372 1.727), which follow normal distributions (these effects can be correlated; below I've set their correlation to .5). I have an entire matrix of integers, and am converting those into two entire matrices of effects. #HOW I'VE DONE IT SO FAR To get the correlations between the effects, I've used the mvrnorm function from MASS To convert the allele names to genetic effects, I've created a function (make.effect) that resets the set.seed() to the allele name each time its called. To get the matrix of genetic effects, I use sapply. #THE PROBLEM The problem is that I often need to convert matrices that have 500K cells, and do this over several hundred iterations, so it is quite slow. If anyone has ideas to speed this up (e.g., some clever way to convert a matrix of integers to a matrix of effects without using sapply), I would be forever indebted. ##MY CODE library(MASS) ##run this example to see what I'm talking about above make.effects - function(x,mn=0,var=1,cov=.5){ set.seed(x) return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))} (alleles - matrix(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4)) a12 - array(sapply(alleles,make.effects ),dim=c(2,nrow(alleles),ncol(alleles))) (a1 - a12[1,,]) (a2 - a12[2,,]) #I've set the population correlation = .5; empirical corr=.4635 cor(as.vector(a1),as.vector(a2)) ##run this to see that the code begins to get pretty slow with even a 3000x4 matrix system.time({ alleles - matrix(rep(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),1000),ncol=4) a12 - array(sapply(alleles,make.effects ),dim=c(2,nrow(alleles),ncol(alleles))) a1 - a12[1,,] a2 - a12[2,,] }) -- Matthew C Keller Postdoctoral Fellow Virginia Institute for Psychiatric and Behavioral Genetics __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve? [[alternative HTML version deleted]] __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values
Hi, Many thanks to Jim and Martin for their suggestions. Using your ideas, I came up with a solution that indexes rather than uses sapply (and therefore calling up mvrnorm separately for each cell in the matrix). The trick is to create a key matrix once, and then to use the match() function each time I need to take the results from the key matrix and place them in the appropriate spots in an 'effects' matrix. If anyone is interested, here is a solution which speeds up the process by a factor of 200 (!) : unique.allele.seq - 1:1 make.effects - function(allele.seq, seed, mn = 0, var=1, cov=.5) { set.seed(seed) return(mvrnorm(length(allele.seq), mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2), empirical=FALSE))} effects.key - make.effects(unique.allele.seq, 123) (alleles - matrix(c(15,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4)) (indx - match(alleles,key)) (a1 - matrix(effects.key[indx,1],ncol=ncol(alleles))) (a2 - matrix(effects.key[indx,2],ncol=ncol(alleles))) #to check timing system.time({ alleles - matrix(rep(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),1),ncol=4) indx - match(alleles,key) a1 - matrix(effects.key[indx,1],ncol=ncol(alleles)) a2 - matrix(effects.key[indx,2],ncol=ncol(alleles))}) On 3/16/07, jim holtman [EMAIL PROTECTED] wrote: Considering that the vast majority of your time is spent in the function mvrnorm (on my system 5.7 out of 6.1 seconds). In your example that is 12000 calls to the function. To improve your speed you have to cut down the number of calls to the function. For example, how many unique integers do you have and can to do the calls for those and then substitute matching values. Here is what Rprof showed: total.time total.pct self.time self.pct system.time 6.12 99.7 0.00 0.0 as.vector 6.06 98.7 0.18 2.9 FUN 6.06 98.7 0.12 2.0 array 6.06 98.7 0.10 1.6 lapply 6.06 98.7 0.00 0.0 sapply 6.06 98.7 0.00 0.0 eval6.04 98.4 0.06 1.0 mvrnorm 5.72 93.2 0.34 5.5 eigen 2.58 42.0 0.52 8.5 or another way of looking at it: 0 6.1 root 1.6.1 system.time 2. .6.0 eval 3. . .6.0 eval 4. . . .6.0 array 5. . . . .6.0 as.vector 6. . . . . .6.0 sapply 7. . . . . . .6.0 lapply 8. . . . . . . .6.0 FUN 9. . . . . . . . .5.7 mvrnorm 10. . . . . . . . . .2.6 eigen 11. . . . . . . . . . .1.2 sort.list 12. . . . . . . . . . . .1.0 match.arg 13. . . . . . . . . . . . .0.7 eval On 3/16/07, Matthew Keller [EMAIL PROTECTED] wrote: Hi all, [this is a bit hard to describe, so if my initial description is confusing, please try running my code below] #WHAT I'M TRYING TO DO I'd appreciate any help in trying to speed up some code. I've written a script that converts a matrix of integers (usually between 1-10,000 - these represent allele names) into two new matrices of normally distributed values (representing genetic effects), such that a given number in the integer (allele name) matrix always corresponds to the *same* randomly distributed (genetic) effects. For example, every time my function sees the allele name 3782, it converts that integer into the same two effects (e.g., -.372 1.727), which follow normal distributions (these effects can be correlated; below I've set their correlation to .5). I have an entire matrix of integers, and am converting those into two entire matrices of effects. #HOW I'VE DONE IT SO FAR To get the correlations between the effects, I've used the mvrnorm function from MASS To convert the allele names to genetic effects, I've created a function (make.effect) that resets the set.seed() to the allele name each time its called. To get the matrix of genetic effects, I use sapply. #THE PROBLEM The problem is that I often need to convert matrices that have 500K cells, and do this over several hundred iterations, so it is quite slow. If anyone has ideas to speed this up (e.g., some clever way to convert a matrix of integers to a matrix of effects without using sapply), I would be forever indebted. ##MY CODE library(MASS) ##run this example to see what I'm talking about above make.effects - function(x,mn=0,var=1,cov=.5){ set.seed(x) return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))} (alleles - matrix(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4)) a12 - array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles))) (a1 - a12[1,,]) (a2 - a12[2,,]) #I've set the population correlation = .5; empirical corr=.4635 cor(as.vector