[R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values

2007-03-16 Thread Matthew Keller
Hi all,

[this is a bit hard to describe, so if my initial description is
confusing, please try running my code below]

#WHAT I'M TRYING TO DO
I'd appreciate any help in trying to speed up some code. I've written
a script that converts a matrix of integers (usually between 1-10,000
- these represent allele names) into two new matrices of normally
distributed values (representing genetic effects), such that a given
number in the integer (allele name) matrix always corresponds to the
*same* randomly distributed (genetic) effects.

For example, every time my function sees the allele name 3782, it
converts that integer into the same two effects (e.g., -.372  1.727),
which follow normal distributions (these effects can be correlated;
below I've set their correlation to .5). I have an entire matrix of
integers, and am converting those into two entire matrices of effects.


#HOW I'VE DONE IT SO FAR
To get the correlations between the effects, I've used the mvrnorm
function from MASS

To convert the allele names to genetic effects, I've created a
function (make.effect) that resets the set.seed() to the allele name
each time its called.

To get the matrix of genetic effects, I use sapply.


#THE PROBLEM
The problem is that I often need to convert matrices that have 500K
cells, and do this over several hundred iterations, so it is quite
slow. If anyone has ideas to speed this up (e.g., some clever way to
convert a matrix of integers to a matrix of effects without using
sapply), I would be forever indebted.


##MY CODE

library(MASS)

##run this example to see what I'm talking about above

make.effects - function(x,mn=0,var=1,cov=.5){
  set.seed(x)
  
return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))}

(alleles - 
matrix(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4))

a12 - array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
(a1 - a12[1,,])
(a2 - a12[2,,])

#I've set the population correlation = .5; empirical corr=.4635
cor(as.vector(a1),as.vector(a2))

##run this to see that the code begins to get pretty slow with even a
3000x4 matrix

system.time({

alleles - 
matrix(rep(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),1000),ncol=4)

a12 - array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
a1 - a12[1,,]
a2 - a12[2,,]

})




-- 
Matthew C Keller
Postdoctoral Fellow
Virginia Institute for Psychiatric and Behavioral Genetics

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values

2007-03-16 Thread jim holtman
Considering that the vast majority of your time is spent in the function
mvrnorm (on my system 5.7  out of 6.1 seconds).  In your example that is
12000 calls to the function.  To improve your speed you have to cut down the
number of calls to the function.  For example, how many unique integers do
you have and can to do the calls for those and then substitute matching
values.  Here is what Rprof showed:

  total.time total.pct self.time self.pct
system.time 6.12  99.7  0.00  0.0
as.vector   6.06  98.7  0.18  2.9
FUN 6.06  98.7  0.12  2.0
array   6.06  98.7  0.10  1.6
lapply  6.06  98.7  0.00  0.0
sapply  6.06  98.7  0.00  0.0
eval6.04  98.4  0.06  1.0
mvrnorm 5.72  93.2  0.34  5.5
eigen   2.58  42.0  0.52  8.5

or another way of looking at it:

  0   6.1 root
  1.6.1 system.time
  2. .6.0 eval
  3. . .6.0 eval
  4. . . .6.0 array
  5. . . . .6.0 as.vector
  6. . . . . .6.0 sapply
  7. . . . . . .6.0 lapply
  8. . . . . . . .6.0 FUN
  9. . . . . . . . .5.7 mvrnorm
 10. . . . . . . . . .2.6 eigen
 11. . . . . . . . . . .1.2 sort.list
 12. . . . . . . . . . . .1.0 match.arg
 13. . . . . . . . . . . . .0.7 eval




On 3/16/07, Matthew Keller [EMAIL PROTECTED] wrote:

 Hi all,

 [this is a bit hard to describe, so if my initial description is
 confusing, please try running my code below]

 #WHAT I'M TRYING TO DO
 I'd appreciate any help in trying to speed up some code. I've written
 a script that converts a matrix of integers (usually between 1-10,000
 - these represent allele names) into two new matrices of normally
 distributed values (representing genetic effects), such that a given
 number in the integer (allele name) matrix always corresponds to the
 *same* randomly distributed (genetic) effects.

 For example, every time my function sees the allele name 3782, it
 converts that integer into the same two effects (e.g., -.372  1.727),
 which follow normal distributions (these effects can be correlated;
 below I've set their correlation to .5). I have an entire matrix of
 integers, and am converting those into two entire matrices of effects.


 #HOW I'VE DONE IT SO FAR
 To get the correlations between the effects, I've used the mvrnorm
 function from MASS

 To convert the allele names to genetic effects, I've created a
 function (make.effect) that resets the set.seed() to the allele name
 each time its called.

 To get the matrix of genetic effects, I use sapply.


 #THE PROBLEM
 The problem is that I often need to convert matrices that have 500K
 cells, and do this over several hundred iterations, so it is quite
 slow. If anyone has ideas to speed this up (e.g., some clever way to
 convert a matrix of integers to a matrix of effects without using
 sapply), I would be forever indebted.


 ##MY CODE

 library(MASS)

 ##run this example to see what I'm talking about above

 make.effects - function(x,mn=0,var=1,cov=.5){
 set.seed(x)

 return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))}

 (alleles -
 matrix(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4))

 a12 - array(sapply(alleles,make.effects
 ),dim=c(2,nrow(alleles),ncol(alleles)))
 (a1 - a12[1,,])
 (a2 - a12[2,,])

 #I've set the population correlation = .5; empirical corr=.4635
 cor(as.vector(a1),as.vector(a2))

 ##run this to see that the code begins to get pretty slow with even a
 3000x4 matrix

 system.time({

 alleles -
 matrix(rep(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),1000),ncol=4)

 a12 - array(sapply(alleles,make.effects
 ),dim=c(2,nrow(alleles),ncol(alleles)))
 a1 - a12[1,,]
 a2 - a12[2,,]

 })




 --
 Matthew C Keller
 Postdoctoral Fellow
 Virginia Institute for Psychiatric and Behavioral Genetics

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?

[[alternative HTML version deleted]]

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ideas to speed up code: converting a matrix of integers to a matrix of normally distributed values

2007-03-16 Thread Matthew Keller
Hi,

Many thanks to Jim and Martin for their suggestions. Using your ideas,
I came up with a solution that indexes rather than uses sapply (and
therefore calling up mvrnorm separately for each cell in the matrix).
The trick is to create a key matrix once, and then to use the
match() function each time I need to take the results from the key
matrix and place them in the appropriate spots in an 'effects' matrix.
If anyone is interested, here is a solution which speeds up the
process by a factor of 200 (!) :

unique.allele.seq - 1:1

make.effects - function(allele.seq, seed, mn = 0, var=1, cov=.5) {
   set.seed(seed)
   return(mvrnorm(length(allele.seq),
mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),
empirical=FALSE))}

effects.key - make.effects(unique.allele.seq, 123)

(alleles - 
matrix(c(15,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4))
(indx - match(alleles,key))

(a1 - matrix(effects.key[indx,1],ncol=ncol(alleles)))
(a2 - matrix(effects.key[indx,2],ncol=ncol(alleles)))

#to check timing
system.time({
alleles - 
matrix(rep(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),1),ncol=4)
indx - match(alleles,key)

a1 - matrix(effects.key[indx,1],ncol=ncol(alleles))
a2 - matrix(effects.key[indx,2],ncol=ncol(alleles))})




On 3/16/07, jim holtman [EMAIL PROTECTED] wrote:
 Considering that the vast majority of your time is spent in the function
 mvrnorm (on my system 5.7  out of 6.1 seconds).  In your example that is
 12000 calls to the function.  To improve your speed you have to cut down the
 number of calls to the function.  For example, how many unique integers do
 you have and can to do the calls for those and then substitute matching
 values.  Here is what Rprof showed:

   total.time total.pct self.time self.pct
 system.time 6.12  99.7  0.00  0.0
 as.vector   6.06  98.7  0.18  2.9
 FUN 6.06  98.7  0.12  2.0
 array   6.06  98.7  0.10  1.6
 lapply  6.06  98.7  0.00  0.0
 sapply  6.06  98.7  0.00  0.0
 eval6.04  98.4  0.06  1.0
 mvrnorm 5.72  93.2  0.34  5.5
 eigen   2.58  42.0  0.52  8.5

 or another way of looking at it:

   0   6.1 root
   1.6.1 system.time
   2. .6.0 eval
   3. . .6.0 eval
   4. . . .6.0 array
   5. . . . .6.0 as.vector
   6. . . . . .6.0 sapply
   7. . . . . . .6.0 lapply
   8. . . . . . . .6.0 FUN
   9. . . . . . . . .5.7 mvrnorm
  10. . . . . . . . . .2.6 eigen
  11. . . . . . . . . . .1.2 sort.list
  12. . . . . . . . . . . .1.0 match.arg
  13. . . . . . . . . . . . .0.7 eval




 On 3/16/07, Matthew Keller [EMAIL PROTECTED] wrote:
 
  Hi all,
 
  [this is a bit hard to describe, so if my initial description is
  confusing, please try running my code below]
 
  #WHAT I'M TRYING TO DO
  I'd appreciate any help in trying to speed up some code. I've written
  a script that converts a matrix of integers (usually between 1-10,000
  - these represent allele names) into two new matrices of normally
  distributed values (representing genetic effects), such that a given
  number in the integer (allele name) matrix always corresponds to the
  *same* randomly distributed (genetic) effects.
 
  For example, every time my function sees the allele name 3782, it
  converts that integer into the same two effects (e.g., -.372  1.727),
  which follow normal distributions (these effects can be correlated;
  below I've set their correlation to .5). I have an entire matrix of
  integers, and am converting those into two entire matrices of effects.
 
 
  #HOW I'VE DONE IT SO FAR
  To get the correlations between the effects, I've used the mvrnorm
  function from MASS
 
  To convert the allele names to genetic effects, I've created a
  function (make.effect) that resets the set.seed() to the allele name
  each time its called.
 
  To get the matrix of genetic effects, I use sapply.
 
 
  #THE PROBLEM
  The problem is that I often need to convert matrices that have 500K
  cells, and do this over several hundred iterations, so it is quite
  slow. If anyone has ideas to speed this up (e.g., some clever way to
  convert a matrix of integers to a matrix of effects without using
  sapply), I would be forever indebted.
 
 
  ##MY CODE
 
  library(MASS)
 
  ##run this example to see what I'm talking about above
 
  make.effects - function(x,mn=0,var=1,cov=.5){
  set.seed(x)
 
 return(mvrnorm(1,mu=c(mn,mn),Sigma=matrix(c(var,cov,cov,var),nrow=2),empirical=FALSE))}
 
  (alleles -
 matrix(c(5400,3309,2080,1080,2080,,389,9362,6372,3787,2798,1009),ncol=4))
 
  a12 -
 array(sapply(alleles,make.effects),dim=c(2,nrow(alleles),ncol(alleles)))
  (a1 - a12[1,,])
  (a2 - a12[2,,])
 
  #I've set the population correlation = .5; empirical corr=.4635
  cor(as.vector